I keep building the same thing.
Not the same product — the products are different. One indexes a Hugo blog. One indexes AI conversations. One consolidates medical records from three hospitals. One catalogs a hundred git repositories. But underneath, they all have the same skeleton. After the fifth time, I think the skeleton deserves a name.
The pattern
```
Domain files (ground truth)
        ↓ index
SQLite database (read-only cache, FTS5)
        ↓ expose
MCP server (tools + resources → AI assistant)
```
That’s it. Three layers. The domain files are always canonical — the database is a disposable cache you can rebuild from them at any time. SQLite gives you structured queries, full-text search, and JSON extraction over data that was previously trapped in flat files. MCP exposes it to an AI assistant that can write SQL, retrieve content, and (in some cases) create new content.
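The indexing layer can be sketched in a few lines. This is a minimal illustration of the pattern, not code from any of the projects — the `docs` schema and `*.md` glob are placeholders for whatever your domain files look like:

```python
import sqlite3
from pathlib import Path

def rebuild_index(src_dir: str, db_path: str) -> None:
    """Derive a disposable SQLite cache from the domain files.

    The files stay canonical; this function can be re-run from
    scratch at any time. Table names here are illustrative.
    """
    conn = sqlite3.connect(db_path)
    conn.executescript("""
        DROP TABLE IF EXISTS docs;
        CREATE TABLE docs (path TEXT PRIMARY KEY, body TEXT);
        DROP TABLE IF EXISTS doc_fts;
        CREATE VIRTUAL TABLE doc_fts USING fts5(path, body);
    """)
    for p in Path(src_dir).rglob("*.md"):
        body = p.read_text(encoding="utf-8")
        conn.execute("INSERT INTO docs VALUES (?, ?)", (str(p), body))
        conn.execute("INSERT INTO doc_fts VALUES (?, ?)", (str(p), body))
    conn.commit()
```

Because the whole cache is rebuilt by one function, "schema migration" is just running it again.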
Here’s the inventory:
| Project | Domain | Ground Truth | What the MCP Exposes |
|---|---|---|---|
| hugo-memex | Blog content | Markdown files with YAML front matter | 951 pages, FTS5 search, taxonomy queries, JSON front matter extraction |
| memex | AI conversations | ChatGPT/Claude/Gemini exports | Conversation trees, FTS5 message search, tags, enrichments |
| chartfold | Medical records | Epic, MEDITECH, athenahealth exports | Labs, meds, encounters, imaging, pathology, cross-source reconciliation |
| arkiv | Personal archives | JSONL files from various sources | Unified SQL over heterogeneous personal data |
| repoindex | Git repositories | Local git repos + GitHub/PyPI/CRAN metadata | Repository catalog with activity tracking, publication status |
Five projects. Five completely different domains. One architecture.
Why SQLite
SQLite is the most deployed database in history. It’s on every phone, every browser, every Python installation. But that’s not why I use it.
I use it because it solves three problems at once:
Structured queries over unstructured data. Hugo front matter is YAML trapped inside markdown files. Medical records are scattered across three incompatible EHR export formats. AI conversations are JSON trees with branching paths. SQLite turns all of these into tables you can JOIN, GROUP BY, and aggregate. json_extract() handles the long tail of fields that don’t fit a fixed schema.
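A small sketch of the json_extract() approach, assuming your bundled SQLite includes the JSON functions (standard in recent builds); the `pages`/`front_matter` schema is illustrative:

```python
import sqlite3

# Hypothetical mini-schema: one row per page, with the raw front
# matter stored as a JSON string in a single column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (path TEXT, front_matter TEXT)")
conn.execute(
    "INSERT INTO pages VALUES (?, ?)",
    ("posts/hello.md",
     '{"title": "Hello", "tags": ["python", "ai"], "draft": false}'),
)

# json_extract() pulls typed values out of long-tail fields
# without committing them to fixed columns.
row = conn.execute(
    "SELECT json_extract(front_matter, '$.title'),"
    "       json_extract(front_matter, '$.tags[0]') FROM pages"
).fetchone()
print(row)  # ('Hello', 'python')
```

Fields you query constantly get promoted to real columns; everything else stays reachable through JSON paths.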
Full-text search. FTS5 with porter stemming and unicode61 tokenization gives you relevance-ranked search across any text corpus. No Elasticsearch, no external service, no running daemon. Just a virtual table that lives in the same database file.
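Setting that up is one statement. A minimal sketch with illustrative table and column names, assuming FTS5 is compiled into your SQLite (it is in standard Python builds):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One virtual table: porter stemming over unicode61 tokenization.
conn.execute(
    "CREATE VIRTUAL TABLE page_fts USING fts5("
    "  title, body, tokenize='porter unicode61')"
)
conn.executemany(
    "INSERT INTO page_fts VALUES (?, ?)",
    [
        ("Indexing with SQLite", "Building disposable caches from files."),
        ("Why MCP", "Exposing tools and resources to an assistant."),
    ],
)
# 'rank' orders by bm25 relevance; porter stemming lets the query
# 'cache' match the stored plural 'caches'.
hits = conn.execute(
    "SELECT title FROM page_fts WHERE page_fts MATCH 'cache' ORDER BY rank"
).fetchall()
print(hits)  # [('Indexing with SQLite',)]
```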
Read-only enforcement. SQLite’s authorizer callback lets you whitelist specific SQL operations at the statement level. My MCP servers allow SELECT, READ, and FUNCTION — everything else gets SQLITE_DENY. This isn’t PRAGMA query_only (which can be disabled by the caller). It’s engine-level enforcement that cannot be bypassed via SQL.
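In Python this is `sqlite3.Connection.set_authorizer`. A minimal sketch of the allow-list described above — the callback sees every operation at prepare time, so a denied statement fails before it runs:

```python
import sqlite3

# Allow-list: SELECT statements, column reads, and function calls.
ALLOWED = {sqlite3.SQLITE_SELECT, sqlite3.SQLITE_READ, sqlite3.SQLITE_FUNCTION}

def authorizer(action, arg1, arg2, db_name, trigger):
    return sqlite3.SQLITE_OK if action in ALLOWED else sqlite3.SQLITE_DENY

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x)")        # populate before locking down
conn.execute("INSERT INTO t VALUES (1)")
conn.set_authorizer(authorizer)

print(conn.execute("SELECT x FROM t").fetchone())  # (1,)
try:
    conn.execute("DELETE FROM t")          # denied at the engine level
except sqlite3.DatabaseError as e:
    print("denied:", e)
```

Unlike `PRAGMA query_only`, nothing the caller sends as SQL can turn this off; only the host process that installed the callback can remove it.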
And the operational properties are free: WAL mode for concurrent readers, a single file you can back up with cp, zero configuration, zero running processes.
Why MCP
The Model Context Protocol is the thin layer that makes SQLite useful to an AI assistant. An MCP server exposes tools (functions the AI can call) and resources (reference material the AI can read). That’s the whole API surface.
For each project, the MCP layer follows the same shape:
execute_sql — The power tool. Read-only SQL with exemplar queries in the docstring. The docstring is critical: it’s the AI’s primary reference for writing correct SQL. Ten well-chosen example queries teach the model more than a schema diagram.
get_<things> — Bulk retrieval. Instead of one execute_sql call to find IDs followed by N individual fetches, a single call returns full content for a filtered set. This matters when you’re sharing a context window across multiple MCP servers.
<domain>://schema — A resource containing the full DDL, relationship documentation, and query patterns. The AI reads this once, then writes SQL against it for the rest of the session.
The specific tools vary by domain. Hugo-memex has create_page and suggest_tags. Chartfold has get_clinical_summary and get_timeline. Memex has get_conversations with message previews. But the structural pattern is the same.
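A sketch of the execute_sql shape — a plain function standing in for however your MCP SDK registers tools, with an illustrative schema. The point is that the exemplar queries live in the docstring, because the docstring is what the model reads:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (path TEXT, section TEXT, word_count INT)")
conn.executemany(
    "INSERT INTO pages VALUES (?, ?, ?)",
    [("a.md", "posts", 900), ("b.md", "posts", 400), ("c.md", "notes", 200)],
)

def execute_sql(sql: str) -> list[tuple]:
    """Run read-only SQL against the page index.

    Exemplar queries -- the part of this docstring the model actually uses:

        -- Pages and words per section, largest first
        SELECT section, COUNT(*) AS pages, SUM(word_count) AS words
        FROM pages GROUP BY section ORDER BY words DESC;

        -- Find long posts
        SELECT path FROM pages WHERE word_count > 500;
    """
    return conn.execute(sql).fetchall()

print(execute_sql(
    "SELECT section, COUNT(*) FROM pages GROUP BY section ORDER BY 2 DESC"))
# [('posts', 2), ('notes', 1)]
```

A real server would also install the read-only authorizer on `conn` before exposing this.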
The database is a cache
This is the most important architectural decision, and it’s easy to get wrong.
The database is not the source of truth. The files are. The database is a materialized index that can be rebuilt from the files at any time. This means:
No migrations. If the schema changes, drop the database and re-index. For a 951-page Hugo site, full re-indexing takes six seconds. Why maintain migration code for a disposable cache?
No write conflicts. The files are edited by humans (or by AI tools that write to the filesystem). The database is updated by the indexer. There’s exactly one write path.
No backup strategy. You already back up your files. The database is derived from them. Lose the database? Rebuild it.
Incremental sync is an optimization, not a requirement. SHA-256 content hashes + file mtimes make re-indexing fast. But if incremental sync has a bug, force a full rebuild. The cache being disposable means you can always recover.
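The hash-plus-mtime check can be sketched like this; the `files` table and its columns are illustrative, and the indexer is assumed to update the cached row after re-indexing:

```python
import hashlib
import os
import sqlite3
from pathlib import Path

def needs_reindex(conn: sqlite3.Connection, path: str) -> bool:
    """Decide whether a file must be re-indexed.

    Cheap check first: if the cached mtime matches, skip the file
    without reading it. Otherwise hash the content -- a touch that
    didn't change any bytes still skips the expensive parse.
    """
    st = os.stat(path)
    row = conn.execute(
        "SELECT mtime, sha256 FROM files WHERE path = ?", (path,)
    ).fetchone()
    if row and row[0] == st.st_mtime:
        return False
    digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    return not (row and row[1] == digest)
```

If this logic ever misbehaves, the escape hatch is the same as always: drop the database and rebuild.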
The one exception in my inventory is chartfold, where the database is the store for imported medical records (the original exports aren’t kept in a format that round-trips cleanly). Even there, the exports are the moral source of truth — you re-import to rebuild.
What large context changes
With a million-token context window, you might think this pattern is obsolete. Why index into SQLite when you can just load everything into context?
The math says otherwise. My Hugo blog is 951 pages, ~480K words, ~1.9M tokens. It doesn’t fit. And that’s one data source. Add AI conversations (memex), medical records (chartfold), and repository metadata (repoindex), and you’re well past the limit.
But even if it did fit, the pattern would still be useful. Loading 480K words into context to answer “which posts are tagged ‘reinforcement-learning’?” is like loading an entire database into memory to run a SELECT with a WHERE clause. SQLite does it in microseconds. Context loading costs seconds and tokens.
The right model is: MCP for navigation, context for understanding. Use execute_sql to find the five relevant posts, then use get_pages to load their full content into context. One tool call for discovery, one for deep reading. Two calls total instead of flooding the context with everything.
This is how I actually use it. When I asked my AI assistant to analyze my blog, it ran:
```sql
SELECT section, COUNT(*) as pages, SUM(word_count) as words,
       ROUND(AVG(word_count)) as avg_words
FROM pages WHERE draft = 0
GROUP BY section ORDER BY words DESC
```
One tool call. Instant orientation. Then it loaded specific posts for deeper analysis. The total context cost was a fraction of what a full dump would have been, and the assistant had better data because it could aggregate, filter, and join.
The tools that earn their keep
After building five of these, certain tools prove their worth and others don’t.
The tools that matter:
- `execute_sql` with good docstring examples. This is 80% of the value. A well-documented schema + exemplar queries lets the AI answer almost any question.
- Bulk retrieval (`get_pages`, `get_conversations`, `get_clinical_summary`). One call instead of N+1. Critical when you’re sharing context across multiple servers.
- Schema/stats resources. Quick orientation without burning a tool call.
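A bulk-retrieval tool is mostly a WHERE ... IN query plus serialization. A minimal sketch with an illustrative `pages` schema (a real version would also handle the empty-list case and cap the batch size):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (path TEXT PRIMARY KEY, title TEXT, body TEXT)")
conn.executemany("INSERT INTO pages VALUES (?, ?, ?)", [
    ("a.md", "A", "alpha body"),
    ("b.md", "B", "beta body"),
    ("c.md", "C", "gamma body"),
])

def get_pages(paths: list[str]) -> list[dict]:
    """Return full content for a filtered set in ONE call,
    instead of N follow-up fetches after execute_sql found the IDs."""
    marks = ",".join("?" * len(paths))
    rows = conn.execute(
        f"SELECT path, title, body FROM pages WHERE path IN ({marks})", paths)
    return [dict(zip(("path", "title", "body"), r)) for r in rows.fetchall()]

print(get_pages(["a.md", "c.md"]))
```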
The tools that surprised me:
- `suggest_tags` in hugo-memex. Uses FTS5 similarity to find pages like your draft, then returns their most common tags with canonical casing. Solved a real problem: my blog had 40 case-duplicate tag pairs (`Python`/`python`, `AI`/`ai`).
- `get_timeline` in chartfold. Merges encounters, procedures, labs, imaging, and notes into a single chronological stream. The AI can reason about temporal relationships that are invisible when the data is in separate tables.
What I’d skip next time:
- Fine-grained CRUD tools for things that are easier to do via the filesystem or raw SQL. The `update_page` tool in hugo-memex is fine, but I could also just edit the file. The MCP tool is one more surface to maintain.
The Unix connection
This pattern is the Unix philosophy applied to AI tooling:
- Small tools that do one thing well. Each MCP server handles one domain. No mega-server that does everything.
- Text as the universal interface. SQL in, JSON out. The AI speaks text natively.
- Composition over integration. Five independent MCP servers, each ignorable, each replaceable. Claude Code connects to all of them simultaneously without any of them knowing about each other.
- Files as ground truth. The oldest pattern in computing. Files are durable, inspectable, versionable. Databases are derived views.
The difference from classical Unix pipes is the composition layer. Instead of grep | sort | uniq, the AI is the orchestrator. It reads the schema, decides which tool to call, interprets the results, and decides what to do next. The pipes are implicit in the AI’s reasoning, not explicit in a shell command.
What I’d tell you if you’re building one
Start with execute_sql and a schema resource. That’s enough to be useful. The AI can figure out the rest if the schema is well-documented.
Make the database disposable. If you’re writing migration code, you’ve made it too important.
Put the exemplar queries in the tool docstring, not in a separate document. The docstring is the one thing the AI definitely reads.
Use FTS5. The marginal cost is one virtual table and a few INSERT statements during indexing. The marginal benefit is that the AI can search your content by meaning, not just by exact column values.
Enforce read-only at the engine level, not the application level. SQLite’s authorizer callback is the right mechanism. PRAGMA query_only is a suggestion, not a wall.
Build bulk retrieval tools early. The N+1 pattern (find IDs, then fetch one at a time) is the biggest efficiency problem in MCP servers. One tool call that returns full content for a filtered set changes the economics of every conversation.
The projects: hugo-memex (PyPI: hugo-memex), memex (PyPI: py-memex), chartfold, arkiv, repoindex.