Repoindex

Below you will find pages that utilize the taxonomy term “Repoindex”

The MCP Pattern: SQLite as the AI-Queryable Cache

March 20, 2026

I keep building the same thing.

Not the same product — the products are different. One indexes a Hugo blog. One indexes AI conversations. One consolidates medical records from three hospitals. One catalogs a hundred git repositories. But underneath, they all have the same skeleton. After the fifth time, I think the skeleton deserves a name.

The pattern

Domain files (ground truth)
    ↓ index
SQLite database (read-only cache, FTS5)
    ↓ expose
MCP server (tools + resources → AI assistant)

That’s it. Three layers. The domain files are always canonical — the database is a disposable cache you can rebuild from them at any time. SQLite gives you structured queries, full-text search, and JSON extraction over data that was previously trapped in flat files. MCP exposes it to an AI assistant that can write SQL, retrieve content, and (in some cases) create new content.

Here’s the inventory:

Project	Domain	Ground Truth	What the MCP Exposes
hugo-memex	Blog content	Markdown files with YAML front matter	951 pages, FTS5 search, taxonomy queries, JSON front matter extraction
memex	AI conversations	ChatGPT/Claude/Gemini exports	Conversation trees, FTS5 message search, tags, enrichments
chartfold	Medical records	Epic, MEDITECH, athenahealth exports	Labs, meds, encounters, imaging, pathology, cross-source reconciliation
arkiv	Personal archives	JSONL files from various sources	Unified SQL over heterogeneous personal data
repoindex	Git repositories	Local git repos + GitHub/PyPI/CRAN metadata	Repository catalog with activity tracking, publication status

Five projects. Five completely different domains. One architecture.

Why SQLite

SQLite is the most deployed database in history. It’s on every phone, every browser, every Python installation. But that’s not why I use it.

I use it because it solves three problems at once:

Structured queries over unstructured data. Hugo front matter is YAML trapped inside markdown files. Medical records are scattered across three incompatible EHR export formats. AI conversations are JSON trees with branching paths. SQLite turns all of these into tables you can JOIN, GROUP BY, and aggregate. json_extract() handles the long tail of fields that don’t fit a fixed schema.

Full-text search. FTS5 with porter stemming and unicode61 tokenization gives you relevance-ranked search across any text corpus. No Elasticsearch, no external service, no running daemon. Just a virtual table that lives in the same database file.

repoindex: Collection Awareness for Your Git Repos

December 16, 2025

I have around 100 git repos. When I’m working with Claude Code on one of them, the AI has deep knowledge of that repo but zero awareness of the rest. Questions like “which of my repos already has a fuzzy search implementation?” or “what other projects use this pattern?” require me to go dig around manually.

repoindex fixes that.

The Idea

Separation of concerns:

Claude Code (deep work on ONE repo)
         |
         |  "What else do I have?"
         |  "Which repos need X?"
         v
    repoindex (collection awareness)
         |
         +-- repo://...     -> what exists
         +-- tags://...     -> organization
         +-- stats://...    -> aggregations
         +-- events://...   -> what happened

Claude Code works inside repositories. repoindex knows about repositories: metadata, tags, status, relationships. Together they give you full portfolio awareness.

MCP Server Integration

The most useful part is the MCP (Model Context Protocol) server. Add it to your Claude Code configuration and the AI can query your collection directly:

“Which of my Python repos don’t have a LICENSE file?”
“What repos have I updated in the last week?”
“Show me all projects tagged with ml”

The server exposes resources like repo://, tags://, stats://, and events:// that Claude Code reads to understand your portfolio.

Core Features

Tag-Based Organization. Hierarchical tags for categorizing repos. Tags can be explicit (repoindex tag add myproject topic:ml) or implicit (derived automatically from language, directory, features).

Query Language. Filter repos with expressions:

repoindex query "language == 'Python' and 'ml' in tags"
repoindex query "stars > 10 and has:docs"

Event Tracking. What happened across your collection:

repoindex events --since 7d --pretty

New releases, tags, PyPI publishes, all in one view.

JSONL Output. Every command outputs newline-delimited JSON by default, so it plays well with Unix pipelines:

repoindex status | jq 'select(.status.clean == false)'

Installation

Available on PyPI:

pip install repoindex

Configure your repository directories and start indexing:

repoindex config generate
repoindex list --pretty

Why the Rename?

This was previously called ghops. The new name is more honest about what it does: it indexes repositories. The old name implied GitHub-specific operations, but the tool works with any git repo.

Everything is a File: Virtual Filesystems for CLI Data Tools

October 20, 2025

I had a bookmark manager. Then an ebook library manager. Then a chat history manager. Each started with the standard CRUD CLI:

btk add https://example.com --tags python,tutorial
btk list --tag python
btk search "async"
btk delete 1234

ebk import book.pdf --author "Knuth"
ebk list --author Knuth
ebk search "algorithms"

This works fine until you have 10,000+ bookmarks organized with hierarchical tags like programming/python/async, research/ml/transformers, work/clients/acme. Your ebook library has similar structure. Your exported chat conversations from Claude, ChatGPT, and Copilot are piling up.

Traditional CRUD commands become unwieldy:

btk list --tag programming/python/async/io --format json | jq '.[].title'
ebk list --category "Computer Science/Algorithms/Graph Theory" --limit 50
ctk search "machine learning" --source ChatGPT --date-from 2024-01-01

Each command requires precise arguments. Each tool has different flag conventions. You can’t navigate your data. You can only query it. And queries require knowing exactly what you’re looking for.

The insight: everything is a file

When I have thousands of source files organized in directories, I don’t run:

list-files --path /src/components/auth --extension .tsx

I run:

cd src/components/auth
ls *.tsx

The difference matters. With a filesystem, I can navigate incrementally (cd from general to specific), explore (ls to see what’s there), compose (cat file | grep pattern | wc -l), and use familiar tools (find, grep, xargs, pipes, redirection).

What if my bookmarks, ebooks, and chat histories were filesystems?

The pattern

Over the past year, I built six Python tools that all follow the same architecture:

Tool	Domain	VFS Root Structure
btk	Bookmarks	`/bookmarks/`, `/tags/`, `/recent/`, `/domains/`, `/unread/`, `/popular/`
ebk	Ebook library	`/books/`, `/authors/`, `/series/`, `/subjects/`, `/recent/`, `/unread/`
ctk	Chat conversations	`/conversations/`, `/sources/`, `/topics/`, `/starred/`, `/recent/`
ghops	Git repositories	`/repos/`, `/languages/`, `/topics/`, `/stars/`, `/recent/`
infinigram	N-gram models	`/datasets/`, `/models/`, `/corpora/`
AlgoTree	Tree structures	`/nodes/`, `/paths/`, `/subtrees/`

Each tool provides:

A stateless CLI for scripting: btk bookmark add URL, ebk import book.pdf
An interactive shell with a virtual filesystem: btk shell, ebk shell, ctk chat
POSIX-like commands: cd, ls, pwd, cat, mv, cp, rm, find, grep
Unix pipeline support: most commands output JSONL by default for piping

The interesting part is the shell.

Navigating 10,000 bookmarks

Live recording captured with asciinema. You can pause, copy text, and replay. The entire recording is 78KB of text.