hharshhsaini/Ai-Docstring-Generator


Docstring Generator

A Python CLI tool that automatically generates and injects docstrings into Python codebases using LLM providers. It uses AST parsing to extract function metadata, leverages LLMs to generate contextually accurate docstrings, and patches them directly back into source files while preserving formatting.
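
The extraction step can be sketched with the standard-library ast module (the tool itself uses libcst, so this is an illustrative simplification, not its actual code):

```python
import ast

def functions_missing_docstrings(source: str) -> list[str]:
    """Return names of functions/methods that lack a docstring."""
    tree = ast.parse(source)
    missing = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            if ast.get_docstring(node) is None:
                missing.append(node.name)
    return missing

code = "def add(a, b):\n    return a + b\n"
print(functions_missing_docstrings(code))  # → ['add']
```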

Features

  • 🤖 Multiple LLM Providers: Support for Anthropic, OpenAI, and Ollama (local)
  • 📝 Multiple Docstring Styles: Google, NumPy, and Sphinx formats
  • 🔍 AST-Based Parsing: Accurate extraction using libcst
  • 🎨 Formatting Preservation: Maintains original code structure and whitespace
  • 🚀 CI/CD Integration: Pre-commit hooks and GitHub Actions support
  • 📊 Coverage Reports: Track documentation completeness
  • 🔄 Dry-Run Mode: Preview changes before applying
  • ⚡ Batch Processing: Process entire codebases with error recovery
  • 🎯 Smart Filtering: Process only missing docstrings or staged files

Table of Contents

  • Installation
  • Quick Start
  • Testing the Tool
  • Configuration
  • CLI Reference
  • Docstring Styles
  • LLM Provider Setup
  • API Usage and Billing
  • Pre-commit Hook Setup
  • GitHub Actions Setup
  • Troubleshooting
  • Contributing
  • License

Installation

From Source (Development)

# Clone the repository
git clone https://github.com/yourusername/docstring-generator.git
cd docstring-generator

# Install in editable mode with dependencies
pip install -e .

From PyPI (When Published)

pip install docstring-generator

Verify Installation

docgen --help

Quick Start

1. Set Up API Key

Choose one LLM provider and set your API key:

Anthropic (Recommended):

export ANTHROPIC_API_KEY="your-anthropic-api-key-here"

OpenAI:

export OPENAI_API_KEY="your-openai-api-key-here"

Ollama (Local, No API Key):

# Install Ollama from https://ollama.ai/
ollama pull llama2
ollama serve

2. Create a Test File

mkdir -p test_project
cat > test_project/calculator.py << 'EOF'
def add(a: int, b: int) -> int:
    return a + b

def subtract(x: int, y: int) -> int:
    return x - y

class Calculator:
    def multiply(self, a: int, b: int) -> int:
        return a * b
EOF

3. Generate Docstrings

Preview changes first (dry-run):

docgen test_project/ --dry-run

Generate docstrings:

docgen test_project/

Check the results:

cat test_project/calculator.py

Example Output

Before:

def add(a: int, b: int) -> int:
    return a + b

After (Google style):

def add(a: int, b: int) -> int:
    """Add two integers.
    
    Args:
        a: First integer to add.
        b: Second integer to add.
    
    Returns:
        Sum of a and b.
    """
    return a + b

Testing the Tool

Basic Test Scenarios

Test 1: Dry Run (Preview Mode)

# Preview changes without modifying files
docgen test_project/ --dry-run

Expected: Shows diffs of proposed changes, no files modified.
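
A dry-run style diff can be produced with the standard-library difflib module; this is a sketch of the idea, not the tool's own implementation:

```python
import difflib

def preview_diff(path: str, original: str, updated: str) -> str:
    """Build a unified diff between original and proposed file contents."""
    lines = difflib.unified_diff(
        original.splitlines(keepends=True),
        updated.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    return "".join(lines)

before = "def add(a, b):\n    return a + b\n"
after = 'def add(a, b):\n    """Add two numbers."""\n    return a + b\n'
print(preview_diff("calculator.py", before, after))
```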

Test 2: Generate Docstrings

# Actually generate and inject docstrings
docgen test_project/

Expected: "Files processed: 1", "Docstrings added: X"

Test 3: Only Missing Docstrings

# Run again - should skip existing docstrings
docgen test_project/ --only-missing

Expected: "Docstrings added: 0" (all functions already documented)

Test 4: Different Styles

# Create a new file
cat > test_project/utils.py << 'EOF'
def format_name(first: str, last: str) -> str:
    return f"{first} {last}"
EOF

# Try NumPy style
docgen test_project/utils.py --style numpy

# Try Sphinx style (remove docstring first)
cat > test_project/utils.py << 'EOF'
def format_name(first: str, last: str) -> str:
    return f"{first} {last}"
EOF
docgen test_project/utils.py --style sphinx

Test 5: Coverage Report

# Check documentation coverage
docgen test_project/ --report

Expected output:

Docstring Coverage Report
=========================
Total functions: 4
Documented functions: 4
Coverage: 100.00%

✓ All functions are documented!

Test 6: Coverage Check with Threshold

# Create file with missing docstrings
cat > test_project/incomplete.py << 'EOF'
def documented():
    """Has docstring."""
    pass

def undocumented():
    pass
EOF

# Check coverage (should fail)
docgen test_project/ --check --min-coverage 100
echo "Exit code: $?"

Expected: Exit code 1, error about coverage below threshold.

Test 7: JSON Output

# Generate JSON report for CI/CD
docgen test_project/ --report --format json

Expected: Valid JSON with coverage statistics.
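
A hypothetical report shape, inferred from the fields the GitHub Actions workflow later in this README reads (total_functions, documented_functions, coverage_percentage, missing_docstrings); the tool's exact schema may differ:

```python
import json

# Sample data only; field names follow the Actions workflow, values invented.
raw = """
{
  "total_functions": 4,
  "documented_functions": 3,
  "coverage_percentage": 75.0,
  "missing_docstrings": [["test_project/incomplete.py", 6]]
}
"""
report = json.loads(raw)

def meets_threshold(report: dict, min_coverage: float) -> bool:
    """Mirror of the --check / --min-coverage gate."""
    return report["coverage_percentage"] >= min_coverage

print(meets_threshold(report, 80.0))  # → False
```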

Test 8: Configuration File

# Create config file
cat > test_project/docgen.toml << 'EOF'
[docgen]
style = "google"
provider = "anthropic"
overwrite_existing = false

[docgen.exclude]
patterns = ["**/test_*.py"]
EOF

# Run with config
cd test_project && docgen . && cd ..

Test 9: Overwrite Existing

# Overwrite existing docstrings
docgen test_project/ --overwrite-existing

Test 10: Error Handling

# Create invalid Python file
cat > test_project/broken.py << 'EOF'
def broken(
    pass
EOF

# Tool should continue processing
docgen test_project/

Expected: Shows error but continues with other files.

Verification Checklist

After testing, verify:

  • ✅ Docstrings generated in correct style
  • ✅ Original formatting preserved (indentation, comments)
  • ✅ Type hints included in docstrings
  • ✅ Parameters documented
  • ✅ Return values documented
  • ✅ Files remain syntactically valid: python -m py_compile test_project/*.py
  • ✅ Coverage reports accurate
  • ✅ Error handling works
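
The syntax check above can be run over a whole directory with a short script (a convenience sketch around py_compile, not part of the tool):

```python
import pathlib
import py_compile

def verify_syntax(directory: str) -> list[str]:
    """Return paths of .py files that fail to byte-compile."""
    failures = []
    for path in pathlib.Path(directory).rglob("*.py"):
        try:
            py_compile.compile(str(path), doraise=True)
        except py_compile.PyCompileError:
            failures.append(str(path))
    return failures
```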

Clean Up Test Files

rm -rf test_project/

Configuration

Configuration File

Create a docgen.toml file in your project root:

[docgen]
# Docstring style: "google", "numpy", or "sphinx"
style = "google"

# LLM provider: "anthropic", "openai", or "ollama"
provider = "anthropic"

# Model name (provider-specific)
model = "claude-3-5-sonnet-20241022"

# Whether to overwrite existing docstrings
overwrite_existing = false

[docgen.exclude]
# Glob patterns for files to exclude
patterns = [
    "**/test_*.py",
    "**/migrations/**",
    "**/__pycache__/**",
]

[docgen.llm]
# Maximum retry attempts for API calls
max_retries = 3

# Request timeout in seconds
timeout = 30

Configuration Precedence

Configuration is loaded with the following precedence (highest to lowest):

  1. CLI flags (e.g., --style google)
  2. Environment variables (e.g., DOCGEN_STYLE=google)
  3. Config file (docgen.toml)
  4. Default values

CLI Reference

Commands and Flags

docgen [PATH] [OPTIONS]

Arguments

  • PATH: Directory or file path to process (required)

Options

Generation Options:

  • --style [google|numpy|sphinx]: Docstring style (default: google)
  • --provider [anthropic|openai|ollama]: LLM provider (default: anthropic)
  • --model TEXT: Model name (provider-specific)
  • --only-missing: Process only functions without docstrings
  • --overwrite-existing: Replace existing docstrings

Preview and Reporting:

  • --dry-run: Preview changes without modifying files
  • --report: Generate coverage report
  • --format [text|json]: Output format (default: text)

CI/CD Integration:

  • --check: Exit with code 1 if any functions lack docstrings
  • --min-coverage FLOAT: Minimum coverage percentage (0-100)
  • --staged: Process only git-staged files

Examples:

# Use NumPy style with OpenAI
docgen ./src --style numpy --provider openai --model gpt-4

# Check coverage and fail if below 80%
docgen ./src --check --min-coverage 80

# Generate JSON report for CI
docgen ./src --report --format json > coverage.json

# Process only staged files (for pre-commit)
docgen ./src --staged --only-missing
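
One plausible way --staged could discover files is via git's staged-diff listing; this is an assumption about the mechanism, not necessarily the tool's implementation:

```python
import subprocess

def only_python(paths: list[str]) -> list[str]:
    """Keep only Python source files from a path listing."""
    return [p for p in paths if p.endswith(".py")]

def staged_python_files() -> list[str]:
    """List git-staged Python files (added/copied/modified)."""
    out = subprocess.run(
        ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM"],
        capture_output=True, text=True, check=True,
    ).stdout
    return only_python(out.splitlines())
```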

Docstring Styles

Google Style

def function(arg1: int, arg2: str) -> bool:
    """Short description.
    
    Longer description if needed.
    
    Args:
        arg1: Description of arg1.
        arg2: Description of arg2.
    
    Returns:
        Description of return value.
    
    Raises:
        ValueError: When invalid input is provided.
    """

NumPy Style

def function(arg1: int, arg2: str) -> bool:
    """Short description.
    
    Longer description if needed.
    
    Parameters
    ----------
    arg1 : int
        Description of arg1.
    arg2 : str
        Description of arg2.
    
    Returns
    -------
    bool
        Description of return value.
    
    Raises
    ------
    ValueError
        When invalid input is provided.
    """

Sphinx Style

def function(arg1: int, arg2: str) -> bool:
    """Short description.
    
    Longer description if needed.
    
    :param arg1: Description of arg1.
    :type arg1: int
    :param arg2: Description of arg2.
    :type arg2: str
    :return: Description of return value.
    :rtype: bool
    :raises ValueError: When invalid input is provided.
    """

LLM Provider Setup

Anthropic (Claude)

  1. Get an API key from Anthropic Console
  2. Set environment variable:
    export ANTHROPIC_API_KEY="your-api-key"
  3. Configure in docgen.toml:
    [docgen]
    provider = "anthropic"
    model = "claude-3-5-sonnet-20241022"

Supported Models:

  • claude-3-5-sonnet-20241022 (recommended)
  • claude-3-opus-20240229
  • claude-3-sonnet-20240229
  • claude-3-haiku-20240307

OpenAI (GPT)

  1. Get an API key from OpenAI Platform
  2. Set environment variable:
    export OPENAI_API_KEY="your-api-key"
  3. Configure in docgen.toml:
    [docgen]
    provider = "openai"
    model = "gpt-4"

Supported Models:

  • gpt-4 (recommended)
  • gpt-4-turbo
  • gpt-3.5-turbo

Ollama (Local)

  1. Install Ollama
  2. Pull a model:
    ollama pull llama2
  3. Start Ollama server:
    ollama serve
  4. Configure in docgen.toml:
    [docgen]
    provider = "ollama"
    model = "llama2"
    
    [docgen.llm]
    base_url = "http://localhost:11434"

Supported Models:

  • llama2
  • codellama
  • mistral
  • Any model available in Ollama

API Usage and Billing

Understanding Token Usage

Token usage depends on:

  • Number of functions to document
  • Function complexity (parameters, type hints, body length)
  • Docstring style (Google/NumPy/Sphinx have similar token counts)
  • Model used (different tokenization methods)

Each function typically requires:

  • Input tokens: Function signature + body preview + prompt template (~200-500 tokens per function)
  • Output tokens: Generated docstring (~100-300 tokens per function)

Rough estimates per function:

  • Simple function (2-3 params): 300-500 total tokens
  • Medium function (4-6 params): 500-800 total tokens
  • Complex function (7+ params, long body): 800-1500 total tokens

Cost Comparison by Provider

Anthropic Claude Models

| Model | Context Window | Best For | Cost (per 1M tokens) | Cost per Function* |
| --- | --- | --- | --- | --- |
| claude-3-5-sonnet-20241022 | 200K | Recommended - Best balance | Input: $3, Output: $15 | $0.0015-0.003 |
| claude-3-opus-20240229 | 200K | Highest quality docstrings | Input: $15, Output: $75 | $0.008-0.015 |
| claude-3-sonnet-20240229 | 200K | Budget-friendly option | Input: $3, Output: $15 | $0.0015-0.003 |
| claude-3-haiku-20240307 | 200K | Fastest, cheapest | Input: $0.25, Output: $1.25 | $0.0001-0.0003 |

OpenAI GPT Models

| Model | Context Window | Best For | Cost (per 1M tokens) | Cost per Function* |
| --- | --- | --- | --- | --- |
| gpt-4 | 8K | High quality docstrings | Input: $30, Output: $60 | $0.015-0.030 |
| gpt-4-turbo | 128K | Large codebases | Input: $10, Output: $30 | $0.005-0.012 |
| gpt-3.5-turbo | 16K | Budget option | Input: $0.50, Output: $1.50 | $0.0003-0.001 |

Ollama (Local)

| Model | Best For | Cost | Notes |
| --- | --- | --- | --- |
| llama2 | General purpose | FREE | Requires local GPU/CPU |
| codellama | Code-focused | FREE | Better for technical docs |
| mistral | Fast generation | FREE | Good balance of speed/quality |

*Estimated cost per function assuming 400 input tokens and 200 output tokens.
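
The footnote's assumption can be turned into a small estimator. This is illustrative only; real token counts and prices vary, so formula results will not match the rounded table figures exactly:

```python
def estimate_cost(n_functions: int, in_tokens: int, out_tokens: int,
                  in_price_per_m: float, out_price_per_m: float) -> float:
    """Dollar cost estimate: tokens times price-per-million, summed."""
    per_function = (in_tokens * in_price_per_m +
                    out_tokens * out_price_per_m) / 1_000_000
    return n_functions * per_function

# Claude 3.5 Sonnet ($3 in / $15 out per 1M), 400 in + 200 out per function:
print(estimate_cost(1, 400, 200, 3.0, 15.0))  # → 0.0042
```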

Real-World Cost Examples

Small Project (50 functions)

| Provider | Model | Total Tokens | Estimated Cost |
| --- | --- | --- | --- |
| Anthropic | Claude 3.5 Sonnet | 30K | $0.08 |
| Anthropic | Claude 3 Haiku | 30K | $0.01 |
| OpenAI | GPT-3.5 Turbo | 30K | $0.03 |
| OpenAI | GPT-4 Turbo | 30K | $0.30 |
| Ollama | Llama2 | 30K | $0.00 |

Medium Project (200 functions)

| Provider | Model | Total Tokens | Estimated Cost |
| --- | --- | --- | --- |
| Anthropic | Claude 3.5 Sonnet | 120K | $0.32 |
| Anthropic | Claude 3 Haiku | 120K | $0.04 |
| OpenAI | GPT-3.5 Turbo | 120K | $0.12 |
| OpenAI | GPT-4 Turbo | 120K | $1.20 |
| Ollama | Llama2 | 120K | $0.00 |

Large Project (1000 functions)

| Provider | Model | Total Tokens | Estimated Cost |
| --- | --- | --- | --- |
| Anthropic | Claude 3.5 Sonnet | 600K | $1.62 |
| Anthropic | Claude 3 Haiku | 600K | $0.19 |
| OpenAI | GPT-3.5 Turbo | 600K | $0.60 |
| OpenAI | GPT-4 Turbo | 600K | $6.00 |
| Ollama | Llama2 | 600K | $0.00 |

Monthly Cost Estimates (CI/CD Usage)

Assuming 20 PRs per month, average 25 functions per PR:

| Configuration | Cost per PR | Monthly Cost |
| --- | --- | --- |
| Claude 3.5 Sonnet | $0.04 | $0.80 |
| Claude 3 Haiku | $0.005 | $0.10 |
| GPT-3.5 Turbo | $0.015 | $0.30 |
| GPT-4 Turbo | $0.15 | $3.00 |
| Ollama (Local) | $0.00 | $0.00 |

Cost Control Strategies

1. Use --only-missing Flag

Only generate docstrings for functions that don't have them:

docgen ./src --only-missing

Savings: 80-90% on subsequent runs

2. Process Specific Files

Target only changed files instead of entire codebase:

docgen ./src/module.py

Savings: 90-99% by avoiding unnecessary processing

3. Use Cheaper Models for Testing

Use Haiku or GPT-3.5 for development, upgrade for production:

# Development
docgen ./src --model claude-3-haiku-20240307

# Production
docgen ./src --model claude-3-5-sonnet-20241022

Savings: 80-90% during development

4. Batch Processing with Dry-Run

Preview changes before committing to API calls:

docgen ./src --dry-run

Savings: Verify tool behavior before spending credits

5. Use Ollama for Free Local Generation

No API costs, runs entirely on your machine:

ollama pull codellama
docgen ./src --provider ollama --model codellama

Savings: 100% (completely free)

6. Configure Exclusion Patterns

Skip test files, migrations, and generated code:

[docgen.exclude]
patterns = [
    "**/test_*.py",
    "**/tests/**",
    "**/migrations/**",
    "**/*_pb2.py",  # Generated protobuf files
]

Savings: 30-50% by excluding non-essential files

7. Use Pre-commit Hooks with --staged

Only process files being committed:

- id: docstring-generator
  entry: docgen
  args: [--staged, --only-missing]

Savings: 95-99% by processing only changed files

Monitoring API Usage

Track Token Usage

Most providers offer usage dashboards for monitoring token consumption.

Set Budget Alerts

Configure spending limits in your provider dashboard:

  • Anthropic: Set monthly budget limits
  • OpenAI: Configure usage notifications

Log Token Counts

Enable verbose logging to see token usage per request:

export DOCGEN_LOG_LEVEL=DEBUG
docgen ./src

Cost Optimization Recommendations

For Individual Developers:

  • ✅ Use Ollama (free) for local development
  • ✅ Use Claude 3 Haiku or GPT-3.5 Turbo for quick iterations
  • ✅ Use Claude 3.5 Sonnet for final documentation

For Small Teams (< 10 developers):

  • ✅ Use Claude 3 Haiku for pre-commit hooks
  • ✅ Use Claude 3.5 Sonnet for CI/CD checks
  • ✅ Expected monthly cost: $5-20

For Large Teams (10+ developers):

  • ✅ Use Ollama for local development
  • ✅ Use Claude 3 Haiku for pre-commit hooks
  • ✅ Use Claude 3.5 Sonnet for PR reviews only
  • ✅ Expected monthly cost: $20-100

For Open Source Projects:

  • ✅ Use Ollama exclusively (free)
  • ✅ Document in README how contributors can set up Ollama
  • ✅ No API costs for maintainers or contributors

Pre-commit Hook Setup

Installation

  1. Install pre-commit:

    pip install pre-commit
  2. Create .pre-commit-config.yaml in your project root:

    repos:
      - repo: local
        hooks:
          - id: docstring-generator
            name: Generate docstrings
            entry: docgen
            args: [--staged, --only-missing, --check]
            language: system
            types: [python]
            pass_filenames: false
  3. Install the hook:

    pre-commit install

Configuration Options

Enforce documentation (fail if missing):

args: [--staged, --only-missing, --check]

Auto-generate without failing:

args: [--staged, --only-missing]

Require minimum coverage:

args: [--staged, --check, --min-coverage, "80"]

Usage

Once installed, the hook runs automatically on git commit:

git add src/mymodule.py
git commit -m "Add new feature"
# Hook runs automatically and generates docstrings

GitHub Actions Setup

Basic Workflow

Create .github/workflows/docstring-check.yml:

name: Docstring Coverage Check

on:
  pull_request:
    paths:
      - '**.py'

jobs:
  check-docstrings:
    runs-on: ubuntu-latest
    
    steps:
      - uses: actions/checkout@v3
      
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      
      - name: Install docstring-generator
        run: pip install docstring-generator
      
      - name: Check docstring coverage
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          docgen . --check --min-coverage 80
      
      - name: Generate coverage report
        if: failure()
        run: |
          docgen . --report

Advanced Workflow with PR Comments

name: Docstring Coverage Report

on:
  pull_request:
    paths:
      - '**.py'

jobs:
  coverage-report:
    runs-on: ubuntu-latest
    
    steps:
      - uses: actions/checkout@v3
      
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      
      - name: Install docstring-generator
        run: pip install docstring-generator
      
      - name: Generate coverage report
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          docgen . --report --format json > coverage.json
          docgen . --report --format text > coverage.txt
      
      - name: Check coverage threshold
        run: |
          docgen . --check --min-coverage 80
      
      - name: Comment on PR
        if: always()
        uses: actions/github-script@v6
        with:
          script: |
            const fs = require('fs');
            const coverage = JSON.parse(fs.readFileSync('coverage.json', 'utf8'));
            const textReport = fs.readFileSync('coverage.txt', 'utf8');
            
            const comment = `## 📊 Docstring Coverage Report
            
            - **Total functions:** ${coverage.total_functions}
            - **Documented:** ${coverage.documented_functions}
            - **Coverage:** ${coverage.coverage_percentage.toFixed(2)}%
            
            ${coverage.missing_docstrings.length > 0 ? '### ⚠️ Missing Docstrings\n\n' + coverage.missing_docstrings.slice(0, 10).map(([file, line]) => `- \`${file}:${line}\``).join('\n') : '✅ All functions are documented!'}
            
            ${coverage.missing_docstrings.length > 10 ? `\n... and ${coverage.missing_docstrings.length - 10} more` : ''}
            `;
            
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: comment
            });
      
      - name: Upload coverage report
        uses: actions/upload-artifact@v3
        with:
          name: coverage-report
          path: coverage.json

Setting Up Secrets

  1. Go to your repository settings
  2. Navigate to Secrets and variables → Actions
  3. Add your API key:
    • Name: ANTHROPIC_API_KEY (or OPENAI_API_KEY)
    • Value: Your API key

Troubleshooting

Common Issues

Issue: "API key not found"

Error: API key required for anthropic. Set ANTHROPIC_API_KEY environment variable.

Solution:

# Set environment variable
export ANTHROPIC_API_KEY="your-api-key"

# Or add to your shell profile (~/.bashrc, ~/.zshrc)
echo 'export ANTHROPIC_API_KEY="your-api-key"' >> ~/.bashrc

Issue: "Rate limit exceeded"

Error: Anthropic rate limit exceeded

Solution:

  • The tool automatically retries with exponential backoff
  • Reduce batch size by processing fewer files at once
  • Wait a few minutes and try again
  • Consider upgrading your API plan
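
The automatic retry described above follows the usual exponential-backoff pattern; a generic sketch of that pattern (not the tool's actual code):

```python
import time

def with_retries(call, max_retries: int = 3, base_delay: float = 1.0):
    """Retry a callable with exponential backoff between attempts."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except Exception:
            if attempt == max_retries:
                raise
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```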

Issue: "Invalid docstring format"

Error: Validation failed: Missing short description

Solution:

  • The LLM occasionally generates invalid docstrings
  • The tool will skip invalid docstrings and continue
  • Check the error summary at the end for details
  • Consider trying a different model or provider

Issue: "File parsing error"

Error: Parse error: invalid syntax

Solution:

  • Ensure your Python files have valid syntax
  • Run python -m py_compile yourfile.py to check
  • The tool will skip unparseable files and continue

Issue: "Pre-commit hook is slow"

Solution:

# Process only staged files
args: [--staged, --only-missing]

# Or set a timeout
- id: docstring-generator
  name: Generate docstrings
  entry: timeout 30 docgen
  args: [--staged, --only-missing]

Issue: "Ollama connection refused"

Error: Connection refused to http://localhost:11434

Solution:

# Start Ollama server
ollama serve

# Or configure custom URL in docgen.toml
[docgen.llm]
base_url = "http://your-ollama-server:11434"

Debug Mode

Enable verbose logging:

# Set log level
export DOCGEN_LOG_LEVEL=DEBUG

# Run with verbose output
docgen ./src --dry-run

Contributing

We welcome contributions! Here's how to get started:

Development Setup

  1. Clone the repository:

    git clone https://github.com/yourusername/docstring-generator.git
    cd docstring-generator
  2. Create a virtual environment:

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install development dependencies:

    pip install -e ".[dev]"
  4. Install pre-commit hooks:

    pre-commit install

Running Tests

# Run all tests
pytest

# Run with coverage
pytest --cov=docgen --cov-report=html

# Run only unit tests
pytest tests/unit/

# Run only property-based tests
pytest tests/property/

# Run only integration tests
pytest tests/integration/

# Run specific test file
pytest tests/unit/test_parser.py

# Run with verbose output
pytest -v

Code Style

We use:

  • black for code formatting
  • isort for import sorting
  • mypy for type checking
  • ruff for linting

# Format code
black src/ tests/

# Sort imports
isort src/ tests/

# Type check
mypy src/

# Lint
ruff check src/ tests/

Submitting Changes

  1. Create a new branch:

    git checkout -b feature/your-feature-name
  2. Make your changes and add tests

  3. Run tests and linting:

    pytest
    black src/ tests/
    mypy src/
  4. Commit your changes:

    git commit -m "Add your feature"
  5. Push and create a pull request:

    git push origin feature/your-feature-name

Guidelines

  • Write tests for new features
  • Update documentation for user-facing changes
  • Follow existing code style
  • Keep commits focused and atomic
  • Write clear commit messages

License

MIT License - see LICENSE file for details.

Acknowledgments

  • Built with libcst for CST manipulation
  • Uses Hypothesis for property-based testing
  • Inspired by the need for better documentation tooling
