src2md solves a practical problem: you want an LLM to understand your codebase, but the codebase doesn’t fit in the context window.
GPT-4 gives you ~128K tokens. Claude gives you ~200K. A medium-sized project blows past both. Naive truncation loses critical context. Manual curation doesn’t scale. So I built a tool that does it automatically.
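To see whether your project fits at all, you can estimate its token count before reaching for any tool. A minimal sketch (not part of src2md) using the rough rule of thumb of ~4 characters per token:

```python
import os

def estimate_tokens(root: str, exts=(".py", ".ts", ".js", ".md")) -> int:
    """Rough token estimate for a source tree (~4 chars per token)."""
    total_chars = 0
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if name.endswith(exts):
                path = os.path.join(dirpath, name)
                try:
                    with open(path, encoding="utf-8", errors="ignore") as f:
                        total_chars += len(f.read())
                except OSError:
                    continue  # unreadable file; skip it
    return total_chars // 4

# If estimate_tokens(".") comes back well above 128_000,
# the project won't fit in GPT-4's window as-is.
```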
## How It Works
src2md reads a source tree, scores files by importance, and compresses them to fit a target token budget. The output is structured Markdown (or JSON, or plain text) ready to paste into an LLM conversation.
```bash
pip install src2md

# Basic markdown generation
src2md /path/to/project -o documentation.md

# With context optimization for GPT-4
src2md /path/to/project --gpt4 -o optimized.md

# With intelligent summarization
src2md /path/to/project --summarize --compression-ratio 0.3
```
## Context Window Targeting
You can target specific LLM context windows:
```bash
# Target specific LLM context windows
src2md . --target-tokens 128000   # GPT-4
src2md . --target-tokens 200000   # Claude

# Predefined windows
src2md . --window gpt-4-turbo
src2md . --window claude-3
```
## Multi-Tier Summarization
Not all files are equally important. src2md uses progressive compression: critical files get full source, important files get AST-level summaries, supporting files get docstrings only, and peripheral files get dropped.
```python
from src2md import Converter

converter = Converter(
    target_tokens=100000,
    summarization_levels={
        'critical': 'full',        # Keep full source
        'important': 'ast',        # AST-based summary
        'supporting': 'minimal',   # Docstrings only
        'peripheral': 'exclude',   # Skip entirely
    }
)
```
## File Importance Scoring
The importance scoring considers multiple factors:
- Centrality: How many other files import this one?
- Complexity: Cyclomatic complexity, lines of code
- Recency: Recently modified files matter more
- Naming: `main.py`, `index.ts` get a priority boost
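To make the combination concrete, here is a hypothetical weighted score over those factors. The weights, signature, and boost list are illustrative assumptions, not src2md's actual implementation:

```python
import os
import time

def importance(path: str, import_count: int, complexity: int,
               boost_names=("main.py", "index.ts")) -> float:
    """Illustrative importance score: centrality + complexity + recency,
    with a flat multiplier for conventionally important filenames."""
    centrality = import_count                       # files importing this one
    age_days = (time.time() - os.path.getmtime(path)) / 86400
    recency = 1.0 / (1.0 + age_days)                # newer files score higher
    name_boost = 2.0 if os.path.basename(path) in boost_names else 1.0
    return name_boost * (0.5 * centrality + 0.3 * complexity + 0.2 * recency)
```

The exact weights matter less than the shape: centrality dominates, and the name boost only breaks ties in favor of entry points.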
## AST-Based Analysis
For supported languages, src2md parses the AST to extract structure rather than just truncating text:
```python
# From a 500-line Python file, extract:
# - Class/function signatures
# - Docstrings
# - Type hints
# - Key logic patterns
```
This preserves the information an LLM actually needs to reason about the code.
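The core idea can be sketched in a few lines with Python's standard `ast` module: keep signatures and docstrings, drop bodies. This is a simplified illustration, not src2md's internal pass:

```python
import ast

def summarize_source(source: str) -> str:
    """Extract class/function signatures and docstrings, dropping bodies."""
    tree = ast.parse(source)
    lines = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            if isinstance(node, ast.ClassDef):
                lines.append(f"class {node.name}:")
            else:
                args = ", ".join(a.arg for a in node.args.args)
                lines.append(f"def {node.name}({args}):")
            doc = ast.get_docstring(node)
            if doc:
                lines.append(f'    """{doc}"""')
    return "\n".join(lines)
```

Run on a real module, this typically shrinks the text by an order of magnitude while keeping every name an LLM would need to reference.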
## Output Formats
```bash
src2md . --format markdown   # Default
src2md . --format json       # Structured data
src2md . --format jsonl      # Line-delimited JSON
src2md . --format html       # Web-viewable
src2md . --format text       # Plain text
```
## Python API
```python
from src2md import Repository, ContextWindow

# Basic usage
output = Repository("/path/to/project").analyze().to_markdown()

# Optimize for GPT-4 context window
output = (
    Repository("/path/to/project")
    .optimize_for(ContextWindow.GPT_4)
    .analyze()
    .to_markdown()
)

# Full fluent API with all features
result = (
    Repository("/path/to/project")
    .name("MyProject")
    .include("src/", "lib/")
    .exclude("tests/", "*.log")
    .with_importance_scoring()
    .with_summarization(
        compression_ratio=0.3,
        preserve_important=True,
        use_llm=True,
    )
    .optimize_for_tokens(100_000)
    .analyze()
    .to_json(pretty=True)
)
```
## LLM-Powered Compression
For semantic understanding beyond AST extraction, you can use an LLM to do the summarization itself:
```bash
# Use OpenAI for semantic compression
export OPENAI_API_KEY=...
src2md . --llm-compress --provider openai

# Use Anthropic
export ANTHROPIC_API_KEY=...
src2md . --llm-compress --provider anthropic
```
The LLM produces human-readable summaries rather than mechanical truncation. It knows what matters.
## Installation
```bash
pip install src2md

# With LLM support
pip install "src2md[llm]"
```
## Resources
- PyPI: pypi.org/project/src2md/
- GitHub: github.com/queelius/src2md