src2md: Fitting Codebases into LLM Context Windows

src2md solves a practical problem: you want an LLM to understand your codebase, but the codebase doesn’t fit in the context window.

GPT-4 gives you ~128K tokens. Claude gives you ~200K. A medium-sized project blows past both. Naive truncation loses critical context. Manual curation doesn’t scale. So I built a tool that does it automatically.
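To see how quickly a project blows past those limits, here is a rough back-of-the-envelope estimate using the common ~4-characters-per-token heuristic. This is a standalone sketch for illustration, not part of src2md:

```python
from pathlib import Path

def estimate_tokens(root: str, exts=(".py", ".ts", ".js")) -> int:
    """Rough token count for a source tree (~4 chars per token)."""
    total_chars = sum(
        len(p.read_text(errors="ignore"))
        for p in Path(root).rglob("*")
        if p.is_file() and p.suffix in exts
    )
    return total_chars // 4
```

A few thousand files at a few hundred lines each lands comfortably in the millions of tokens, an order of magnitude over any current window.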

How It Works

src2md reads a source tree, scores files by importance, and compresses them to fit a target token budget. The output is structured Markdown (or JSON, or plain text) ready to paste into an LLM conversation.

pip install src2md

# Basic markdown generation
src2md /path/to/project -o documentation.md

# With context optimization for GPT-4
src2md /path/to/project --gpt4 -o optimized.md

# With intelligent summarization
src2md /path/to/project --summarize --compression-ratio 0.3
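The core idea behind "fit a target token budget" is simple: rank files, then include as many as fit. A minimal sketch of that selection loop (hypothetical code, not src2md's actual internals):

```python
def pack_files(scored_files, budget):
    """Greedily keep the highest-scoring files that fit a token budget.

    scored_files: list of (path, importance_score, token_count) tuples.
    Returns the selected paths, best-first.
    """
    selected, used = [], 0
    for path, _score, tokens in sorted(scored_files, key=lambda f: -f[1]):
        if used + tokens <= budget:
            selected.append(path)
            used += tokens
    return selected
```

In practice src2md compresses files rather than dropping them outright (see the summarization tiers below), but budget-aware selection is the same shape of problem.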

Context Window Targeting

You can target specific LLM context windows:

# Target specific LLM context windows
src2md . --target-tokens 128000  # GPT-4
src2md . --target-tokens 200000  # Claude

# Predefined windows
src2md . --window gpt-4-turbo
src2md . --window claude-3

Multi-Tier Summarization

Not all files are equally important. src2md uses progressive compression: critical files get full source, important files get AST-level summaries, supporting files get docstrings only, and peripheral files get dropped.

from src2md import Converter

converter = Converter(
    target_tokens=100000,
    summarization_levels={
        'critical': 'full',      # Keep full source
        'important': 'ast',       # AST-based summary
        'supporting': 'minimal',  # Docstrings only
        'peripheral': 'exclude'   # Skip entirely
    }
)
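How a file lands in one of those four tiers depends on its importance score. One plausible scheme is threshold-based bucketing; the cutoffs below are illustrative, not src2md's defaults:

```python
def assign_tier(score: float) -> str:
    """Map a normalized importance score (0..1) to a compression tier.

    Thresholds are made up for illustration.
    """
    if score >= 0.8:
        return "critical"    # keep full source
    if score >= 0.5:
        return "important"   # AST-based summary
    if score >= 0.2:
        return "supporting"  # docstrings only
    return "peripheral"      # exclude
```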

File Importance Scoring

The importance scoring considers multiple factors:

  • Centrality: How many other files import this one?
  • Complexity: Cyclomatic complexity, lines of code
  • Recency: Recently modified files matter more
  • Naming: main.py, index.ts get a priority boost
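A weighted combination of those four signals might look like the following. The weights, caps, and decay constant are illustrative assumptions, not src2md's actual formula:

```python
def importance_score(imports_in: int, complexity: float,
                     days_since_modified: int, filename: str) -> float:
    """Combine centrality, complexity, recency, and naming into one score.

    All weights here are made up for the sketch.
    """
    centrality = min(imports_in / 10, 1.0)          # capped fan-in
    recency = 1.0 / (1 + days_since_modified / 30)  # decays over months
    name_boost = 0.2 if filename in ("main.py", "index.ts") else 0.0
    return (0.4 * centrality + 0.25 * min(complexity / 20, 1.0)
            + 0.2 * recency + name_boost)
```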

AST-Based Analysis

For supported languages, src2md parses the AST to extract structure rather than just truncating text:

# From a 500-line Python file, extract:
# - Class/function signatures
# - Docstrings
# - Type hints
# - Key logic patterns

This preserves the information an LLM actually needs to reason about the code.
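With Python's standard ast module, this kind of extraction is straightforward. A minimal version of the idea (not src2md's implementation; requires Python 3.9+ for ast.unparse):

```python
import ast

def outline(source: str) -> list[str]:
    """Return function/class signatures and first docstring lines,
    dropping the bodies."""
    lines = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            lines.append(f"def {node.name}({ast.unparse(node.args)})")
        elif isinstance(node, ast.ClassDef):
            lines.append(f"class {node.name}")
        else:
            continue
        doc = ast.get_docstring(node)
        if doc:
            lines.append(f'    """{doc.splitlines()[0]}"""')
    return lines
```

A 500-line file often collapses to a few dozen lines this way, with no loss of the interface-level information an LLM reasons over.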

Output Formats

src2md . --format markdown    # Default
src2md . --format json        # Structured data
src2md . --format jsonl       # Line-delimited JSON
src2md . --format html        # Web-viewable
src2md . --format text        # Plain text

Python API

from src2md import Repository, ContextWindow

# Basic usage
output = Repository("/path/to/project").analyze().to_markdown()

# Optimize for GPT-4 context window
output = (Repository("/path/to/project")
    .optimize_for(ContextWindow.GPT_4)
    .analyze()
    .to_markdown())

# Full fluent API with all features
result = (Repository("/path/to/project")
    .name("MyProject")
    .include("src/", "lib/")
    .exclude("tests/", "*.log")
    .with_importance_scoring()
    .with_summarization(
        compression_ratio=0.3,
        preserve_important=True,
        use_llm=True
    )
    .optimize_for_tokens(100_000)
    .analyze()
    .to_json(pretty=True))

LLM-Powered Compression

For semantic understanding beyond AST extraction, you can use an LLM to do the summarization itself:

# Use OpenAI for semantic compression
export OPENAI_API_KEY=...
src2md . --llm-compress --provider openai

# Use Anthropic
export ANTHROPIC_API_KEY=...
src2md . --llm-compress --provider anthropic

The LLM produces human-readable summaries rather than mechanical truncation. It knows what matters.
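The request to the model amounts to "summarize this file to a fraction of its size." A sketch of building such a prompt; the wording and function are hypothetical, and the actual provider call is omitted:

```python
def build_compression_prompt(path: str, source: str, ratio: float) -> str:
    """Compose a summarization prompt for an LLM provider.

    The target length is derived from the requested compression ratio.
    """
    target_lines = max(1, int(len(source.splitlines()) * ratio))
    return (
        f"Summarize the file {path} in at most {target_lines} lines.\n"
        "Preserve public signatures, docstrings, and key logic; "
        "drop implementation detail.\n\n"
        f"```\n{source}\n```"
    )
```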

Installation

pip install src2md

# With LLM support
pip install src2md[llm]
