<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:media="http://search.yahoo.com/mrss/"><channel><title>metafunctor</title><link>https://metafunctor.com/</link><description>Research Engineer &amp; Computer Scientist - Machine Learning, Statistical Computing, Open Source Development</description><language>en-us</language><managingEditor>lex@metafunctor.com (Alex Towell)</managingEditor><webMaster>lex@metafunctor.com (Alex Towell)</webMaster><lastBuildDate>Fri, 13 Mar 2026 18:29:48 +0000</lastBuildDate><atom:link href="https://metafunctor.com/" rel="self" type="application/rss+xml"/><image><url>https://metafunctor.com/images/og-image.jpg</url><title>metafunctor</title><link>https://metafunctor.com/</link></image><item><title>Reverse-Process Synthetic Data Generation for Math Reasoning</title><link>https://metafunctor.com/post/rpsdg/</link><pubDate>Tue, 25 Jun 2024 00:00:00 +0000</pubDate><dc:creator>Alex Towell</dc:creator><guid isPermaLink="true">https://metafunctor.com/post/rpsdg/</guid><description>Training LLMs on mathematical reasoning by inverting easy-to-solve problems: generate derivatives, reverse them into integration exercises with full step-by-step solutions.</description><content:encoded><![CDATA[<p>Check out the (early) project and source code on <a href="https://github.com/queelius/RPSDG">GitHub</a>.</p>
<h2 id="the-idea">The idea</h2>
<p>Some problems are easy in one direction and hard in the other. Taking a derivative is mechanical. Finding an antiderivative can require genuine creativity. Generating a random expression and verifying a proof is easy. Discovering the proof is hard.</p>
<p>RPSDG (Reverse-Process Synthetic Data Generation) exploits this asymmetry. Run the easy direction with full step-by-step work, then reverse the result to get a hard problem with a known solution. You end up with process-supervised training data: not just the answer, but the entire derivation.</p>
<p>Richard Sutton&rsquo;s &ldquo;The Bitter Lesson&rdquo; argues that methods scaling with compute and data will eventually win. The bottleneck is high-quality data. A lot of the world&rsquo;s data is latent, the processes that generated it are not written down. In math, the way a proof was discovered is usually hidden behind a polished presentation. RPSDG is one way to manufacture that hidden process data.</p>
<h2 id="derivatives-to-integrals">Derivatives to Integrals</h2>
<p>Computing derivatives is mechanical. Integration often is not. That asymmetry gives us a data pipeline.</p>
<ol>
<li>
<p><strong>Start with known functions.</strong> Pick functions \( f(x) \) with closed-form derivatives: polynomials, trig, exponentials, logarithms. Vary complexity.</p>
</li>
<li>
<p><strong>Differentiate with full work shown.</strong> Take the derivative of \( f(x) \) to get \( f'(x) \), recording every step.</p>
</li>
<li>
<p><strong>Reverse the process.</strong> Now \( f'(x) \) is the problem and \( f(x) \) is the solution. The recorded steps, read backward, give you a worked integration example.</p>
</li>
</ol>
<p>By composing functions of varying complexity, you get integration problems of graduating difficulty. The training data comes with step-by-step solutions for free, because you generated it by running the easy direction.</p>
<h2 id="proofs-by-random-walk">Proofs by Random Walk</h2>
<p>The same idea works for theorem proving. Generating proofs is hard. Verifying them is (comparatively) easy.</p>
<ol>
<li>
<p><strong>Random walks in expression space.</strong> Start with a random expression \( e_{\text{start}} \). Apply rewrite rules \( r_1, r_2, \ldots, r_n \) to get a chain of intermediate expressions ending at \( e_{\text{end}} \).</p>
</li>
<li>
<p><strong>Read off the theorem.</strong> The pair \( (e_{\text{start}}, e_{\text{end}}) \) is a theorem. The chain of rewrites is its proof.</p>
</li>
<li>
<p><strong>Reverse when useful.</strong> Running the chain backward works too, especially when a complex step in one direction (integration) becomes simple in the other (differentiation).</p>
</li>
<li>
<p><strong>Scale it.</strong> Random starting points and random rewrite sequences give you a diverse set of theorems and proofs automatically. No human has to come up with the theorem first.</p>
</li>
</ol>
<h2 id="what-this-gets-you">What this gets you</h2>
<p>The training data has process supervision baked in. Every example includes intermediate steps, not just the final answer. That should help LLMs learn multi-step reasoning rather than pattern-matching to answers.</p>
<p>It also gives you something like explainability for free: the model&rsquo;s training data literally consists of step-by-step solutions, so the model has a better chance of producing step-by-step reasoning at inference time.</p>
<h2 id="limitations-and-next-steps">Limitations and next steps</h2>
<p>This is early work. The data generation pipeline exists, but I have not yet run the full fine-tuning experiments and benchmarks. The planned pipeline is:</p>
<ul>
<li>Data generation with graduating difficulty (curriculum learning)</li>
<li>Fine-tuning transformer-based LMs on the generated data</li>
<li>Self-supervised learning experiments</li>
<li>Evaluation against standard math reasoning benchmarks</li>
</ul>
<p>Further out, I want to explore reinforcement learning for rewarding multi-step reasoning even when the solution is not known in advance but can be verified.</p>]]></content:encoded><category>artificial intelligence</category><category>machine learning</category><category>mathematics</category><category>algebra</category><category>calculus</category><category>LLMs</category><category>synthetic data</category><category>data generation</category><category>reasoning</category><category>AI training</category><category>explainable AI</category></item><item><title>SLUUG Talk: Demystifying Large Language Models on Linux</title><link>https://metafunctor.com/post/gave-a-presentation-for-sluug-about-llms/</link><pubDate>Fri, 23 Feb 2024 00:00:00 +0000</pubDate><dc:creator>Alex Towell</dc:creator><guid isPermaLink="true">https://metafunctor.com/post/gave-a-presentation-for-sluug-about-llms/</guid><description>Talk for the St. Louis Unix Users Group about running and understanding Large Language Models on Linux.</description><content:encoded><![CDATA[<p>I gave a talk for the St. Louis Unix Users Group (SLUUG) titled &ldquo;Demystifying Large Language Models (LLMs) on Linux: From Theory to Application.&rdquo; The goal was to walk through how LLMs actually work and how to run them locally on Linux.</p>
<p>I demoed two projects:</p>
<ol>
<li>
<p>A simple Colab notebook using basic Python to generate text with an n-gram model. The point was to illustrate the core idea behind language models and show concretely why n-grams fall short, which motivates the transformer architecture.</p>
</li>
<li>
<p>A project that uses ElasticSearch and LLMs to allow natural language search queries over databases.</p>
</li>
</ol>
<p>The talk went well. Content and code are on <a href="https://github.com/queelius/sluug-talk-llm">GitHub</a>.</p>
]]></content:encoded><category>llm</category><category>linux</category><category>ai</category><category>presentation</category></item><item><title>Master's Project: Reliability Estimation in Series Systems</title><link>https://metafunctor.com/post/masters-stats-siue-proj/</link><pubDate>Mon, 19 Feb 2024 00:00:00 +0000</pubDate><dc:creator>Alex Towell</dc:creator><guid isPermaLink="true">https://metafunctor.com/post/masters-stats-siue-proj/</guid><description>My master's project on maximum likelihood estimation for series systems with right-censored and masked failure data.</description><content:encoded><![CDATA[<p>I presented my master&rsquo;s project in October 2023, finishing up my MS in statistics/mathematics at SIUE. The associated paper is titled &ldquo;Reliability Estimation in Series Systems: Maximum Likelihood Techniques for Right-Censored and Masked Failure Data.&rdquo;</p>
<h2 id="the-problem">The Problem</h2>
<p>In reliability engineering, you often find yourself in an annoying situation: a system fails, but you do not know which component caused the failure. This is called <em>masked failure data</em>. On top of that, some systems are still running when you stop observing them, so you only know they survived at least that long. That is <em>right censoring</em>. Both are common in practice. Identifying the exact failed component is expensive or sometimes impossible.</p>
<p>The project builds a likelihood-based framework that handles both masking and censoring simultaneously, models component lifetimes with Weibull distributions, derives closed-form Fisher information for the exponential special case, and provides bootstrap methods for uncertainty quantification. I implemented it all in an R package so practitioners can actually use it.</p>
<h2 id="related-work">Related Work</h2>
<p>This connects to several other posts and projects:</p>
<ul>
<li><a href="/post/expo-masked-fim/">Closed-Form Results for Masked Exponential Series Systems</a> covers the exponential distribution special case with analytical solutions</li>
<li><a href="/projects/likelihood.model/">likelihood.model R package</a> is the software implementation</li>
</ul>
<p>See the full project page <a href="/projects/reliability-estimation-in-series-systems/">here</a>.</p>
]]></content:encoded><category>series systems</category><category>masked failure data</category><category>censoring</category><category>reliability analysis</category><category>maximum likelihood</category><category>Weibull distribution</category><category>R</category></item><item><title>Building an Immutable, Content-Addressed Filesystem in Python</title><link>https://metafunctor.com/projects/dagshell/posts/01-immutable-content-addressed-filesystem/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><dc:creator>Alex Towell</dc:creator><guid isPermaLink="true">https://metafunctor.com/projects/dagshell/posts/01-immutable-content-addressed-filesystem/</guid><description>How Git-style content addressing creates elegant, functional data structures</description><content:encoded><![CDATA[<p>When you run <code>git commit</code>, something interesting happens: Git doesn&rsquo;t store your files by name. Instead, it computes a SHA-1 hash of each file&rsquo;s content and stores the file under that hash. The filename is just a pointer to the hash. This is <strong>content addressing</strong>—identifying data by what it contains rather than where it lives.</p>
<p>This seemingly simple idea has profound implications. In this post, I&rsquo;ll show how to build a content-addressed virtual filesystem in Python, exploring how immutability and content addressing work together to create elegant, functional data structures.</p>
<h2 id="why-content-addressing">Why Content Addressing?</h2>
<p>Consider a traditional filesystem. When you modify a file, the system overwrites the old content. The file&rsquo;s identity (its path) stays the same, but its content changes. This mutable approach has problems:</p>
<ol>
<li><strong>No automatic history</strong>: Once you overwrite, the old data is gone</li>
<li><strong>No deduplication</strong>: Two identical files take up twice the space</li>
<li><strong>No integrity verification</strong>: Corruption can go undetected</li>
</ol>
<p>Content addressing solves all three. If a file&rsquo;s identity <em>is</em> its content (via a hash), then:</p>
<ol>
<li><strong>History is preserved</strong>: Changing content creates a new hash, so the old version still exists</li>
<li><strong>Deduplication is automatic</strong>: Identical content has identical hashes—stored once</li>
<li><strong>Integrity is built-in</strong>: If the content doesn&rsquo;t match the hash, you know something&rsquo;s wrong</li>
</ol>
<h2 id="the-node-hierarchy">The Node Hierarchy</h2>
<p>Let&rsquo;s build this. First, we define our filesystem nodes using Python&rsquo;s frozen dataclasses:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">dataclasses</span> <span class="kn">import</span> <span class="n">dataclass</span><span class="p">,</span> <span class="n">field</span>
</span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">hashlib</span>
</span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">json</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nd">@dataclass</span><span class="p">(</span><span class="n">frozen</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">Node</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Base class for all filesystem nodes.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="n">mode</span><span class="p">:</span> <span class="nb">int</span>
</span></span><span class="line"><span class="cl">    <span class="n">uid</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">1000</span>
</span></span><span class="line"><span class="cl">    <span class="n">gid</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">1000</span>
</span></span><span class="line"><span class="cl">    <span class="n">mtime</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="n">field</span><span class="p">(</span><span class="n">default_factory</span><span class="o">=</span><span class="n">time</span><span class="o">.</span><span class="n">time</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">compute_hash</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;&#34;&#34;Compute SHA256 hash of this node including all metadata.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">        <span class="n">data</span> <span class="o">=</span> <span class="n">json</span><span class="o">.</span><span class="n">dumps</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">to_dict</span><span class="p">(),</span> <span class="n">sort_keys</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="n">hashlib</span><span class="o">.</span><span class="n">sha256</span><span class="p">(</span><span class="n">data</span><span class="o">.</span><span class="n">encode</span><span class="p">())</span><span class="o">.</span><span class="n">hexdigest</span><span class="p">()</span>
</span></span></code></pre></div><p>The <code>frozen=True</code> parameter is crucial. It makes instances immutable—you cannot modify a Node after creation. Any &ldquo;change&rdquo; requires creating a new Node.</p>
<p>We then specialize for different node types:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="nd">@dataclass</span><span class="p">(</span><span class="n">frozen</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">FileNode</span><span class="p">(</span><span class="n">Node</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Regular file node.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="n">content</span><span class="p">:</span> <span class="nb">bytes</span> <span class="o">=</span> <span class="sa">b</span><span class="s2">&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nd">@dataclass</span><span class="p">(</span><span class="n">frozen</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">DirNode</span><span class="p">(</span><span class="n">Node</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Directory node containing references to child nodes.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="n">children</span><span class="p">:</span> <span class="n">Dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">str</span><span class="p">]</span> <span class="o">=</span> <span class="n">field</span><span class="p">(</span><span class="n">default_factory</span><span class="o">=</span><span class="nb">dict</span><span class="p">)</span>  <span class="c1"># name -&gt; hash</span>
</span></span></code></pre></div><p>Notice that <code>DirNode.children</code> maps names to <em>hashes</em>, not to Node objects directly. This is the key insight: directories don&rsquo;t contain files; they contain <em>references</em> to file hashes. The actual nodes live in a separate store.</p>
<h2 id="the-dag-structure">The DAG Structure</h2>
<p>This reference-based approach creates a Directed Acyclic Graph (DAG):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">FileSystem</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Content-addressable virtual filesystem.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="c1"># The DAG: hash -&gt; Node</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">nodes</span><span class="p">:</span> <span class="n">Dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">Node</span><span class="p">]</span> <span class="o">=</span> <span class="p">{}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="c1"># Path index: absolute path -&gt; hash</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">paths</span><span class="p">:</span> <span class="n">Dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">str</span><span class="p">]</span> <span class="o">=</span> <span class="p">{}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">_add_node</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">node</span><span class="p">:</span> <span class="n">Node</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;&#34;&#34;Add a node to the DAG, returning its hash.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">        <span class="n">node_hash</span> <span class="o">=</span> <span class="n">node</span><span class="o">.</span><span class="n">compute_hash</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">node_hash</span> <span class="ow">not</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">nodes</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="bp">self</span><span class="o">.</span><span class="n">nodes</span><span class="p">[</span><span class="n">node_hash</span><span class="p">]</span> <span class="o">=</span> <span class="n">node</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="n">node_hash</span>
</span></span></code></pre></div><p>When we add a node, we compute its hash and store the mapping <code>hash → node</code>. If an identical node already exists (same hash), we don&rsquo;t duplicate it—we just return the existing hash. <strong>Deduplication is automatic.</strong></p>
<h2 id="immutable-updates">Immutable Updates</h2>
<p>Here&rsquo;s where immutability shines. When we write to a file, we don&rsquo;t modify anything. Instead, we:</p>
<ol>
<li>Create a new FileNode with the new content</li>
<li>Create a new DirNode for the parent, pointing to the new file hash</li>
<li>Update the path index</li>
</ol>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">write</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">path</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">content</span><span class="p">:</span> <span class="nb">bytes</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">bool</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Write content to a file.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="n">parent_path</span><span class="p">,</span> <span class="n">name</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">_get_parent_path</span><span class="p">(</span><span class="n">path</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">parent_hash</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">paths</span><span class="p">[</span><span class="n">parent_path</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">    <span class="n">parent</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">nodes</span><span class="p">[</span><span class="n">parent_hash</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># Create new file node</span>
</span></span><span class="line"><span class="cl">    <span class="n">file_node</span> <span class="o">=</span> <span class="n">FileNode</span><span class="p">(</span><span class="n">content</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">file_hash</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">_add_node</span><span class="p">(</span><span class="n">file_node</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># Create new parent directory with updated child reference</span>
</span></span><span class="line"><span class="cl">    <span class="n">new_children</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">(</span><span class="n">parent</span><span class="o">.</span><span class="n">children</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">new_children</span><span class="p">[</span><span class="n">name</span><span class="p">]</span> <span class="o">=</span> <span class="n">file_hash</span>
</span></span><span class="line"><span class="cl">    <span class="n">new_parent</span> <span class="o">=</span> <span class="n">DirNode</span><span class="p">(</span><span class="n">children</span><span class="o">=</span><span class="n">new_children</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">new_parent_hash</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">_add_node</span><span class="p">(</span><span class="n">new_parent</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># Update path index</span>
</span></span><span class="line"><span class="cl">    <span class="bp">self</span><span class="o">.</span><span class="n">paths</span><span class="p">[</span><span class="n">parent_path</span><span class="p">]</span> <span class="o">=</span> <span class="n">new_parent_hash</span>
</span></span><span class="line"><span class="cl">    <span class="bp">self</span><span class="o">.</span><span class="n">paths</span><span class="p">[</span><span class="n">path</span><span class="p">]</span> <span class="o">=</span> <span class="n">file_hash</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="kc">True</span>
</span></span></code></pre></div><p>The old FileNode still exists in <code>self.nodes</code>. The old DirNode still exists too. We&rsquo;ve just created new versions and updated where the path points. This is <strong>structural sharing</strong>—unchanged parts of the tree are shared between versions.</p>
<h2 id="visualizing-the-dag">Visualizing the DAG</h2>
<p>Let&rsquo;s trace through an example:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">fs</span> <span class="o">=</span> <span class="n">FileSystem</span><span class="p">()</span>
</span></span><span class="line"><span class="cl"><span class="n">fs</span><span class="o">.</span><span class="n">mkdir</span><span class="p">(</span><span class="s2">&#34;/project&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">fs</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="s2">&#34;/project/main.py&#34;</span><span class="p">,</span> <span class="sa">b</span><span class="s2">&#34;print(&#39;hello&#39;)&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">fs</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="s2">&#34;/project/main.py&#34;</span><span class="p">,</span> <span class="sa">b</span><span class="s2">&#34;print(&#39;world&#39;)&#34;</span><span class="p">)</span>
</span></span></code></pre></div><p>After these operations, our DAG contains:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">Hash: a1b2c3... → DirNode(children={})           # original /project
</span></span><span class="line"><span class="cl">Hash: d4e5f6... → FileNode(&#34;print(&#39;hello&#39;)&#34;)     # first version
</span></span><span class="line"><span class="cl">Hash: g7h8i9... → DirNode(children={&#34;main.py&#34;: &#34;d4e5f6...&#34;})
</span></span><span class="line"><span class="cl">Hash: j0k1l2... → FileNode(&#34;print(&#39;world&#39;)&#34;)     # second version
</span></span><span class="line"><span class="cl">Hash: m3n4o5... → DirNode(children={&#34;main.py&#34;: &#34;j0k1l2...&#34;})
</span></span></code></pre></div><p>Both versions of <code>main.py</code> exist. The path <code>/project/main.py</code> points to the latest hash (<code>j0k1l2...</code>), but we could easily restore the old version if we tracked which hashes corresponded to which versions.</p>
<h2 id="benefits-in-practice">Benefits in Practice</h2>
<p>This design enables powerful features almost for free:</p>
<p><strong>Snapshots</strong>: Save the current <code>paths</code> dictionary. Restore it later to go back in time.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">snapshot</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">str</span><span class="p">]:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Create a snapshot of the current filesystem state.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="nb">dict</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">paths</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">restore</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">snapshot</span><span class="p">:</span> <span class="n">Dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">str</span><span class="p">]):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Restore filesystem to a previous snapshot.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="bp">self</span><span class="o">.</span><span class="n">paths</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">(</span><span class="n">snapshot</span><span class="p">)</span>
</span></span></code></pre></div><p><strong>Deduplication</strong>: Multiple paths can point to the same hash.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># These might share the same underlying node if content is identical</span>
</span></span><span class="line"><span class="cl"><span class="n">fs</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="s2">&#34;/file1.txt&#34;</span><span class="p">,</span> <span class="sa">b</span><span class="s2">&#34;hello&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">fs</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="s2">&#34;/file2.txt&#34;</span><span class="p">,</span> <span class="sa">b</span><span class="s2">&#34;hello&#34;</span><span class="p">)</span>  <span class="c1"># Same hash, no new storage</span>
</span></span></code></pre></div><p><strong>Integrity checking</strong>: If someone asks for a file, we can verify it.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">verify</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">path</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">bool</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Verify a file&#39;s integrity.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="n">node_hash</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">paths</span><span class="p">[</span><span class="n">path</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">    <span class="n">node</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">nodes</span><span class="p">[</span><span class="n">node_hash</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">node</span><span class="o">.</span><span class="n">compute_hash</span><span class="p">()</span> <span class="o">==</span> <span class="n">node_hash</span>
</span></span></code></pre></div><h2 id="the-functional-programming-connection">The Functional Programming Connection</h2>
<p>This approach is deeply connected to functional programming. In FP:</p>
<ul>
<li>Data is immutable</li>
<li>&ldquo;Changes&rdquo; create new values</li>
<li>Sharing is safe because nothing mutates</li>
</ul>
<p>Our filesystem follows these principles exactly. Nodes are frozen. &ldquo;Writing&rdquo; creates new nodes. Multiple paths can safely share nodes because nodes never change.</p>
<p>This is why Clojure&rsquo;s persistent data structures, Haskell&rsquo;s pure values, and Git&rsquo;s object store all use similar ideas. <strong>Content addressing + immutability = safe, efficient, verifiable data.</strong></p>
<h2 id="trade-offs">Trade-offs</h2>
<p>Nothing is free. This approach has costs:</p>
<ol>
<li>
<p><strong>Memory</strong>: Old versions accumulate. You need garbage collection to reclaim space from unreachable nodes.</p>
</li>
<li>
<p><strong>Performance</strong>: Creating new nodes for every change can be slower than in-place mutation for write-heavy workloads.</p>
</li>
<li>
<p><strong>Complexity</strong>: Path resolution requires extra indirection through the hash table.</p>
</li>
</ol>
<p>For many use cases—especially those valuing history, integrity, and safe concurrency—these trade-offs are worthwhile.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Content addressing transforms how we think about data. Instead of &ldquo;where is this file?&rdquo; we ask &ldquo;what is this content&rsquo;s identity?&rdquo; Instead of destructive updates, we create new versions while sharing unchanged structure.</p>
<p>This pattern appears everywhere: Git, IPFS, Nix, Docker layers, and many database internals. Understanding it opens doors to building robust, elegant systems.</p>
<p>The full implementation in <a href="https://github.com/queelius/dagshell">DagShell</a> extends these ideas with a complete POSIX-like interface, demonstrating how content addressing can underpin a full virtual filesystem.</p>
<hr>
<p><em>Next in this series: <a href="../02-unix-philosophy-python/">Unix Philosophy in Python</a> — building composable commands with method chaining.</em></p>
]]></content:encoded></item><item><title>Fine-Tuning a Tiny LLM for ElasticSearch DSL</title><link>https://metafunctor.com/post/llm-fine-tuning-es-dsl/</link><pubDate>Mon, 19 Feb 2024 00:00:00 +0000</pubDate><dc:creator>Alex Towell</dc:creator><guid isPermaLink="true">https://metafunctor.com/post/llm-fine-tuning-es-dsl/</guid><description>Fine-tuning a small language model to generate ElasticSearch DSL queries from natural language, as a proof of concept for domain-specific LLM specialization.</description><content:encoded><![CDATA[<p>I am fine-tuning a small LLM to generate ElasticSearch DSL queries from natural language. The project is on <a href="https://github.com/queelius/elasticsearch-lm">GitHub</a>.</p>
<p>The idea: take a task that large models handle well (translating &ldquo;find all orders over $100 from last month&rdquo; into the corresponding ElasticSearch JSON query) and see if a tiny model can learn it from synthetic data.</p>
<p>The data pipeline works like this. I started by generating examples from GPT-4. Then I wrote a script that samples from those outputs and uses them as few-shot examples for Mistral, which generates a much larger synthetic dataset. The next step is reshaping the data into the expected format and fine-tuning.</p>
<p>This is early stage. The synthetic data exists, the fine-tuning has not happened yet. I will update this post with results.</p>
<p>If you are interested in collaborating on this or related projects, email me at <a href="mailto:lex@metafunctor.com">lex@metafunctor.com</a>.</p>
]]></content:encoded><category>large language models</category><category>fine-tuning</category><category>information retrieval</category><category>elastic search</category><category>domain-specific language</category><category>json</category></item><item><title>Unix Philosophy in Python: Composable Commands with Method Chaining</title><link>https://metafunctor.com/projects/dagshell/posts/02-unix-philosophy-python/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><dc:creator>Alex Towell</dc:creator><guid isPermaLink="true">https://metafunctor.com/projects/dagshell/posts/02-unix-philosophy-python/</guid><description>How to build pipeable, chainable interfaces that do one thing well</description><content:encoded><![CDATA[<p>The Unix philosophy, articulated by Doug McIlroy, can be summarized as:</p>
<blockquote>
<p>Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface.</p></blockquote>
<p>This philosophy gave us pipes (<code>|</code>), small focused utilities (<code>cat</code>, <code>grep</code>, <code>sort</code>), and the ability to compose complex operations from simple parts:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">cat access.log <span class="p">|</span> grep <span class="s2">&#34;404&#34;</span> <span class="p">|</span> cut -d<span class="s1">&#39; &#39;</span> -f1 <span class="p">|</span> sort <span class="p">|</span> uniq -c <span class="p">|</span> sort -rn
</span></span></code></pre></div><p>Can we bring this composability to Python? Not just subprocess calls, but native Python objects that pipe and chain as naturally as Unix commands? Let&rsquo;s build it.</p>
<h2 id="the-problem-with-methods">The Problem with Methods</h2>
<p>Standard Python methods have a composition problem. Consider:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># We want to: read a file, filter lines, count words</span>
</span></span><span class="line"><span class="cl"><span class="n">content</span> <span class="o">=</span> <span class="n">read_file</span><span class="p">(</span><span class="s2">&#34;/data/log.txt&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">filtered</span> <span class="o">=</span> <span class="n">grep</span><span class="p">(</span><span class="n">content</span><span class="p">,</span> <span class="s2">&#34;ERROR&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">count</span> <span class="o">=</span> <span class="n">wc</span><span class="p">(</span><span class="n">filtered</span><span class="p">)</span>
</span></span></code></pre></div><p>This works, but it&rsquo;s inside-out compared to how we think about it. We think &ldquo;read, then filter, then count&rdquo;—a left-to-right pipeline. But we write it bottom-up with intermediate variables.</p>
<p>Method chaining helps:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">read_file</span><span class="p">(</span><span class="s2">&#34;/data/log.txt&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">grep</span><span class="p">(</span><span class="s2">&#34;ERROR&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">wc</span><span class="p">()</span>
</span></span></code></pre></div><p>Now it flows left-to-right. But how do we make arbitrary operations chainable?</p>
<h2 id="the-commandresult-pattern">The CommandResult Pattern</h2>
<p>The key insight is wrapping every result in a chainable container:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="nd">@dataclass</span>
</span></span><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">CommandResult</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Wrapper that enables method chaining.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="n">data</span><span class="p">:</span> <span class="n">Any</span>           <span class="c1"># The actual result data</span>
</span></span><span class="line"><span class="cl">    <span class="n">text</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="kc">None</span>    <span class="c1"># Text representation</span>
</span></span><span class="line"><span class="cl">    <span class="n">exit_code</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">0</span>  <span class="c1"># Unix-style exit code</span>
</span></span><span class="line"><span class="cl">    <span class="n">_shell</span><span class="p">:</span> <span class="s1">&#39;DagShell&#39;</span> <span class="o">=</span> <span class="kc">None</span>  <span class="c1"># Reference back to the shell</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="fm">__str__</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="bp">self</span><span class="o">.</span><span class="n">text</span> <span class="ow">is</span> <span class="ow">not</span> <span class="kc">None</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">text</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="nb">str</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">data</span><span class="p">)</span>
</span></span></code></pre></div><p>Every command returns a <code>CommandResult</code>. The result carries the data <em>and</em> a reference to the shell that produced it. This reference enables chaining—we can call more methods through it.</p>
<h2 id="making-commands-chainable">Making Commands Chainable</h2>
<p>Here&rsquo;s a simple command implementation:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">DagShell</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">echo</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="o">*</span><span class="n">args</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">CommandResult</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;&#34;&#34;Echo arguments to output.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">        <span class="n">text</span> <span class="o">=</span> <span class="s1">&#39; &#39;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="n">arg</span><span class="p">)</span> <span class="k">for</span> <span class="n">arg</span> <span class="ow">in</span> <span class="n">args</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="n">CommandResult</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">            <span class="n">data</span><span class="o">=</span><span class="n">text</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="n">text</span><span class="o">=</span><span class="n">text</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="n">exit_code</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="n">_shell</span><span class="o">=</span><span class="bp">self</span>
</span></span><span class="line"><span class="cl">        <span class="p">)</span>
</span></span></code></pre></div><p>The magic happens in <code>CommandResult</code>. We add methods that delegate back to the shell:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="nd">@dataclass</span>
</span></span><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">CommandResult</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># ... fields as before ...</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">grep</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">pattern</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="s1">&#39;CommandResult&#39;</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;&#34;&#34;Filter lines matching pattern.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">_shell</span><span class="o">.</span><span class="n">grep</span><span class="p">(</span><span class="n">pattern</span><span class="p">,</span> <span class="n">input_data</span><span class="o">=</span><span class="bp">self</span><span class="o">.</span><span class="n">data</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">wc</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="o">*</span><span class="n">flags</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="s1">&#39;CommandResult&#39;</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;&#34;&#34;Count lines, words, or characters.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">_shell</span><span class="o">.</span><span class="n">wc</span><span class="p">(</span><span class="o">*</span><span class="n">flags</span><span class="p">,</span> <span class="n">input_data</span><span class="o">=</span><span class="bp">self</span><span class="o">.</span><span class="n">data</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">out</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">path</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="s1">&#39;CommandResult&#39;</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;&#34;&#34;Redirect output to a file.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">_shell</span><span class="o">.</span><span class="n">fs</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="nb">str</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span><span class="o">.</span><span class="n">encode</span><span class="p">())</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="bp">self</span>
</span></span></code></pre></div><p>Now we can chain:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">shell</span><span class="o">.</span><span class="n">echo</span><span class="p">(</span><span class="s2">&#34;hello world&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">wc</span><span class="p">(</span><span class="s2">&#34;-w&#34;</span><span class="p">)</span>  <span class="c1"># Returns: 2</span>
</span></span><span class="line"><span class="cl"><span class="n">shell</span><span class="o">.</span><span class="n">cat</span><span class="p">(</span><span class="s2">&#34;/data/log.txt&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">grep</span><span class="p">(</span><span class="s2">&#34;ERROR&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">wc</span><span class="p">(</span><span class="s2">&#34;-l&#34;</span><span class="p">)</span>  <span class="c1"># Count error lines</span>
</span></span></code></pre></div><h2 id="the-dual-nature-objects-and-files">The Dual Nature: Objects and Files</h2>
<p>Unix commands have a superpower: the same output can go to the screen, a file, or another command. We can replicate this:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># Chain more commands</span>
</span></span><span class="line"><span class="cl"><span class="n">result</span> <span class="o">=</span> <span class="n">shell</span><span class="o">.</span><span class="n">cat</span><span class="p">(</span><span class="s2">&#34;/data/log.txt&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">grep</span><span class="p">(</span><span class="s2">&#34;ERROR&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Get as Python object</span>
</span></span><span class="line"><span class="cl"><span class="n">lines</span> <span class="o">=</span> <span class="n">result</span><span class="o">.</span><span class="n">lines</span><span class="p">()</span>  <span class="c1"># List[str]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Or redirect to a file</span>
</span></span><span class="line"><span class="cl"><span class="n">result</span><span class="o">.</span><span class="n">out</span><span class="p">(</span><span class="s2">&#34;/data/errors.txt&#34;</span><span class="p">)</span>
</span></span></code></pre></div><p>The <code>CommandResult</code> is simultaneously:</p>
<ol>
<li>A Python object you can inspect and manipulate</li>
<li>A text stream you can redirect to files</li>
<li>An input source for the next command in a pipeline</li>
</ol>
<p>This dual nature—structured data <em>and</em> text stream—bridges the gap between Unix philosophy and Python&rsquo;s object orientation.</p>
<h2 id="implementing-piping">Implementing Piping</h2>
<p>True Unix pipes pass data between processes. We simulate this with a &ldquo;last result&rdquo; mechanism:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">DagShell</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">_last_result</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="n">CommandResult</span><span class="p">]</span> <span class="o">=</span> <span class="kc">None</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">_</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">CommandResult</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;&#34;&#34;Return the last command&#39;s result (like $? or $_).&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="bp">self</span><span class="o">.</span><span class="n">_last_result</span> <span class="ow">is</span> <span class="kc">None</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="k">return</span> <span class="n">CommandResult</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="s1">&#39;&#39;</span><span class="p">,</span> <span class="n">text</span><span class="o">=</span><span class="s1">&#39;&#39;</span><span class="p">,</span> <span class="n">exit_code</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">_shell</span><span class="o">=</span><span class="bp">self</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">_last_result</span>
</span></span></code></pre></div><p>Now we can build pipelines:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">shell</span><span class="o">.</span><span class="n">cat</span><span class="p">(</span><span class="s2">&#34;/data/log.txt&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">shell</span><span class="o">.</span><span class="n">_</span><span class="p">()</span><span class="o">.</span><span class="n">grep</span><span class="p">(</span><span class="s2">&#34;ERROR&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">shell</span><span class="o">.</span><span class="n">_</span><span class="p">()</span><span class="o">.</span><span class="n">wc</span><span class="p">(</span><span class="s2">&#34;-l&#34;</span><span class="p">)</span>
</span></span></code></pre></div><p>Or more elegantly, commands can accept piped input:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">grep</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">pattern</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">input_data</span><span class="o">=</span><span class="kc">None</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">CommandResult</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Filter lines matching pattern.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">input_data</span> <span class="ow">is</span> <span class="kc">None</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">input_data</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">_last_result</span><span class="o">.</span><span class="n">data</span> <span class="k">if</span> <span class="bp">self</span><span class="o">.</span><span class="n">_last_result</span> <span class="k">else</span> <span class="s1">&#39;&#39;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">lines</span> <span class="o">=</span> <span class="nb">str</span><span class="p">(</span><span class="n">input_data</span><span class="p">)</span><span class="o">.</span><span class="n">splitlines</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">    <span class="n">matching</span> <span class="o">=</span> <span class="p">[</span><span class="n">line</span> <span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">lines</span> <span class="k">if</span> <span class="n">pattern</span> <span class="ow">in</span> <span class="n">line</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">result</span> <span class="o">=</span> <span class="n">CommandResult</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">        <span class="n">data</span><span class="o">=</span><span class="n">matching</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="n">text</span><span class="o">=</span><span class="s1">&#39;</span><span class="se">\n</span><span class="s1">&#39;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">matching</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">        <span class="n">exit_code</span><span class="o">=</span><span class="mi">0</span> <span class="k">if</span> <span class="n">matching</span> <span class="k">else</span> <span class="mi">1</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="n">_shell</span><span class="o">=</span><span class="bp">self</span>
</span></span><span class="line"><span class="cl">    <span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="bp">self</span><span class="o">.</span><span class="n">_last_result</span> <span class="o">=</span> <span class="n">result</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">result</span>
</span></span></code></pre></div><h2 id="method-chaining-in-action">Method Chaining in Action</h2>
<p>Let&rsquo;s build something real—a log analysis pipeline:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># Create a shell and some test data</span>
</span></span><span class="line"><span class="cl"><span class="n">shell</span> <span class="o">=</span> <span class="n">DagShell</span><span class="p">()</span>
</span></span><span class="line"><span class="cl"><span class="n">shell</span><span class="o">.</span><span class="n">mkdir</span><span class="p">(</span><span class="s2">&#34;/logs&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">shell</span><span class="o">.</span><span class="n">echo</span><span class="p">(</span><span class="s2">&#34;&#34;&#34;
</span></span></span><span class="line"><span class="cl"><span class="s2">2024-01-15 10:30:00 INFO  User login: alice
</span></span></span><span class="line"><span class="cl"><span class="s2">2024-01-15 10:31:00 ERROR Database connection failed
</span></span></span><span class="line"><span class="cl"><span class="s2">2024-01-15 10:32:00 INFO  User login: bob
</span></span></span><span class="line"><span class="cl"><span class="s2">2024-01-15 10:33:00 ERROR Timeout waiting for response
</span></span></span><span class="line"><span class="cl"><span class="s2">2024-01-15 10:34:00 WARN  High memory usage
</span></span></span><span class="line"><span class="cl"><span class="s2">2024-01-15 10:35:00 ERROR Disk space low
</span></span></span><span class="line"><span class="cl"><span class="s2">&#34;&#34;&#34;</span><span class="o">.</span><span class="n">strip</span><span class="p">())</span><span class="o">.</span><span class="n">out</span><span class="p">(</span><span class="s2">&#34;/logs/app.log&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Pipeline: find errors, extract timestamps, save to file</span>
</span></span><span class="line"><span class="cl"><span class="p">(</span><span class="n">shell</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">cat</span><span class="p">(</span><span class="s2">&#34;/logs/app.log&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">grep</span><span class="p">(</span><span class="s2">&#34;ERROR&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">cut</span><span class="p">(</span><span class="n">delimiter</span><span class="o">=</span><span class="s2">&#34; &#34;</span><span class="p">,</span> <span class="n">fields</span><span class="o">=</span><span class="s2">&#34;1,2&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">out</span><span class="p">(</span><span class="s2">&#34;/logs/error_times.txt&#34;</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Read the result</span>
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="n">shell</span><span class="o">.</span><span class="n">cat</span><span class="p">(</span><span class="s2">&#34;/logs/error_times.txt&#34;</span><span class="p">))</span>
</span></span><span class="line"><span class="cl"><span class="c1"># Output:</span>
</span></span><span class="line"><span class="cl"><span class="c1"># 2024-01-15 10:31:00</span>
</span></span><span class="line"><span class="cl"><span class="c1"># 2024-01-15 10:33:00</span>
</span></span><span class="line"><span class="cl"><span class="c1"># 2024-01-15 10:35:00</span>
</span></span></code></pre></div><p>Each step does one thing. The chain composes them into a useful operation. The data flows left-to-right, just like our mental model.</p>
<h2 id="directory-navigation-with-a-stack">Directory Navigation with a Stack</h2>
<p>Unix has <code>cd</code>, but shells also have <code>pushd</code> and <code>popd</code> for directory stacks. We implement this:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">DagShell</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">_cwd</span> <span class="o">=</span> <span class="s1">&#39;/&#39;</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">_dir_stack</span> <span class="o">=</span> <span class="p">[]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">pushd</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">path</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">CommandResult</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;&#34;&#34;Push current directory and change to new one.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">_dir_stack</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">_cwd</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">cd</span><span class="p">(</span><span class="n">path</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">popd</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">CommandResult</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;&#34;&#34;Pop directory from stack and change to it.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="ow">not</span> <span class="bp">self</span><span class="o">.</span><span class="n">_dir_stack</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="k">return</span> <span class="n">CommandResult</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="s1">&#39;&#39;</span><span class="p">,</span> <span class="n">text</span><span class="o">=</span><span class="s1">&#39;popd: directory stack empty&#39;</span><span class="p">,</span> <span class="n">exit_code</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">old_dir</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">_dir_stack</span><span class="o">.</span><span class="n">pop</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">cd</span><span class="p">(</span><span class="n">old_dir</span><span class="p">)</span>
</span></span></code></pre></div><p>Now we can navigate without losing our place:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">shell</span><span class="o">.</span><span class="n">pushd</span><span class="p">(</span><span class="s2">&#34;/project/src&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="c1"># ... work in src ...</span>
</span></span><span class="line"><span class="cl"><span class="n">shell</span><span class="o">.</span><span class="n">pushd</span><span class="p">(</span><span class="s2">&#34;tests&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="c1"># ... work in tests ...</span>
</span></span><span class="line"><span class="cl"><span class="n">shell</span><span class="o">.</span><span class="n">popd</span><span class="p">()</span>  <span class="c1"># back to /project/src</span>
</span></span><span class="line"><span class="cl"><span class="n">shell</span><span class="o">.</span><span class="n">popd</span><span class="p">()</span>  <span class="c1"># back to original directory</span>
</span></span></code></pre></div><h2 id="exit-codes-success-and-failure">Exit Codes: Success and Failure</h2>
<p>Unix commands return exit codes: 0 for success, non-zero for failure. We include this in <code>CommandResult</code>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">grep</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">pattern</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">CommandResult</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># ... filtering logic ...</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">CommandResult</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">        <span class="n">data</span><span class="o">=</span><span class="n">matching</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="n">text</span><span class="o">=</span><span class="s1">&#39;</span><span class="se">\n</span><span class="s1">&#39;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">matching</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">        <span class="n">exit_code</span><span class="o">=</span><span class="mi">0</span> <span class="k">if</span> <span class="n">matching</span> <span class="k">else</span> <span class="mi">1</span><span class="p">,</span>  <span class="c1"># 1 if no matches</span>
</span></span><span class="line"><span class="cl">        <span class="n">_shell</span><span class="o">=</span><span class="bp">self</span>
</span></span><span class="line"><span class="cl">    <span class="p">)</span>
</span></span></code></pre></div><p>This enables conditional logic:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">result</span> <span class="o">=</span> <span class="n">shell</span><span class="o">.</span><span class="n">grep</span><span class="p">(</span><span class="s2">&#34;pattern&#34;</span><span class="p">,</span> <span class="n">file</span><span class="o">=</span><span class="s2">&#34;/data/log.txt&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="k">if</span> <span class="n">result</span><span class="o">.</span><span class="n">exit_code</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;Found </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">result</span><span class="o">.</span><span class="n">lines</span><span class="p">())</span><span class="si">}</span><span class="s2"> matches&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="k">else</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="nb">print</span><span class="p">(</span><span class="s2">&#34;No matches found&#34;</span><span class="p">)</span>
</span></span></code></pre></div><h2 id="the-philosophy-applied">The Philosophy Applied</h2>
<p>Let&rsquo;s revisit McIlroy&rsquo;s principles and see how we&rsquo;ve applied them:</p>
<p><strong>&ldquo;Do one thing well&rdquo;</strong>: Each method (<code>cat</code>, <code>grep</code>, <code>wc</code>, <code>cut</code>) does exactly one thing.</p>
<p><strong>&ldquo;Work together&rdquo;</strong>: <code>CommandResult</code> enables any command to connect to any other.</p>
<p><strong>&ldquo;Text streams as universal interface&rdquo;</strong>: Every result has a text representation via <code>__str__</code>, making it redirectable and pipeable.</p>
<p>We&rsquo;ve also added Python-specific benefits:</p>
<ul>
<li><strong>Type safety</strong>: Results carry structured <code>data</code>, not just text</li>
<li><strong>Introspection</strong>: <code>result.lines()</code>, <code>result.data</code>, etc.</li>
<li><strong>Chaining</strong>: Method chains read left-to-right like pipelines</li>
</ul>
<h2 id="the-fluent-pattern">The Fluent Pattern</h2>
<p>This is an instance of the <strong>Fluent Interface</strong> pattern, where methods return <code>self</code> (or a related object) to enable chaining. Martin Fowler described it in 2005, but the idea is older—Smalltalk embraced it from the beginning.</p>
<p>The key is designing methods that:</p>
<ol>
<li>Perform their action</li>
<li>Return something chainable</li>
<li>Maintain enough context for the next operation</li>
</ol>
<p>When done well, code reads almost like prose:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="p">(</span><span class="n">shell</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">mkdir</span><span class="p">(</span><span class="s2">&#34;/project&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">cd</span><span class="p">(</span><span class="s2">&#34;/project&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">echo</span><span class="p">(</span><span class="s2">&#34;# My Project&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">out</span><span class="p">(</span><span class="s2">&#34;README.md&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">echo</span><span class="p">(</span><span class="s2">&#34;def main(): pass&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">out</span><span class="p">(</span><span class="s2">&#34;main.py&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">ls</span><span class="p">(</span><span class="s2">&#34;-la&#34;</span><span class="p">))</span>
</span></span></code></pre></div><h2 id="trade-offs">Trade-offs</h2>
<p>This approach isn&rsquo;t free:</p>
<ol>
<li><strong>Wrapper overhead</strong>: Every result is wrapped in <code>CommandResult</code></li>
<li><strong>Learning curve</strong>: Users must understand the chaining pattern</li>
<li><strong>Debugging</strong>: Long chains can be hard to debug—where did it fail?</li>
</ol>
<p>For scripting and exploratory work, the expressiveness outweighs these costs. For performance-critical code, you might unwrap to raw operations.</p>
<h2 id="conclusion">Conclusion</h2>
<p>The Unix philosophy isn&rsquo;t about Unix—it&rsquo;s about composability. Small, focused operations that connect through a universal interface create systems greater than the sum of their parts.</p>
<p>In Python, we achieve this with:</p>
<ul>
<li><strong>Wrapper types</strong> that carry data and context</li>
<li><strong>Method chaining</strong> for left-to-right flow</li>
<li><strong>Dual representations</strong> as objects and text streams</li>
<li><strong>Exit codes</strong> for success/failure signaling</li>
</ul>
<p>The result is a fluent interface where complex operations emerge from simple, composable parts—Unix philosophy, realized in Python.</p>
<hr>
<p><em>See the full implementation in <a href="https://github.com/queelius/dagshell">DagShell</a>, which builds a complete virtual filesystem with this fluent pattern.</em></p>
<p><em>Next in this series: <a href="../03-embedding-scheme-interpreter/">Embedding a Scheme Interpreter</a> — adding a DSL for filesystem scripting.</em></p>
]]></content:encoded></item><item><title>Embedding a Scheme Interpreter: Building a DSL for Filesystem Operations</title><link>https://metafunctor.com/projects/dagshell/posts/03-embedding-scheme-interpreter/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><dc:creator>Alex Towell</dc:creator><guid isPermaLink="true">https://metafunctor.com/projects/dagshell/posts/03-embedding-scheme-interpreter/</guid><description>How to implement a minimal Scheme and integrate it with your application</description><content:encoded><![CDATA[<p>Every sufficiently complex application eventually grows a scripting language. Emacs has Elisp. AutoCAD has AutoLISP. Blender has Python. The pattern is clear: users need programmability beyond what a fixed UI provides.</p>
<p>But embedding a full language runtime is heavy. What if we just need something small—variables, functions, conditionals, loops? A Lisp-family language is perfect for this: the syntax is trivial to parse, the semantics are clean, and it&rsquo;s powerful enough for real scripting.</p>
<p>In this post, I&rsquo;ll walk through building a minimal Scheme interpreter in Python and integrating it with a virtual filesystem. We&rsquo;ll cover the complete pipeline: tokenization, parsing, evaluation, and extension with custom primitives.</p>
<h2 id="why-scheme">Why Scheme?</h2>
<p>Scheme is a minimalist Lisp dialect. Its syntax is just parentheses and atoms:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-scheme" data-lang="scheme"><span class="line"><span class="cl"><span class="p">(</span><span class="k">define </span><span class="nv">x</span> <span class="mi">10</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">(</span><span class="nb">+ </span><span class="nv">x</span> <span class="p">(</span><span class="nb">* </span><span class="mi">2</span> <span class="mi">3</span><span class="p">))</span>
</span></span><span class="line"><span class="cl"><span class="p">(</span><span class="k">if </span><span class="p">(</span><span class="nb">&gt; </span><span class="nv">x</span> <span class="mi">5</span><span class="p">)</span> <span class="s">&#34;big&#34;</span> <span class="s">&#34;small&#34;</span><span class="p">)</span>
</span></span></code></pre></div><p>This uniformity makes parsing trivial. There&rsquo;s no operator precedence, no complex grammar—just:</p>
<ol>
<li>Atoms: numbers, strings, symbols</li>
<li>Lists: <code>(thing thing thing ...)</code></li>
</ol>
<p>That&rsquo;s it. The entire parser can be written in under 50 lines.</p>
<h2 id="step-1-tokenization">Step 1: Tokenization</h2>
<p>First, we convert source code into tokens. Scheme&rsquo;s tokenization is simple: split on whitespace, but respect strings and parentheses.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">tokenize</span><span class="p">(</span><span class="n">text</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">List</span><span class="p">[</span><span class="nb">str</span><span class="p">]:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Convert Scheme code into tokens.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># Handle comments (lines starting with ;)</span>
</span></span><span class="line"><span class="cl">    <span class="n">lines</span> <span class="o">=</span> <span class="n">text</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">&#39;</span><span class="se">\n</span><span class="s1">&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">text</span> <span class="o">=</span> <span class="s1">&#39;</span><span class="se">\n</span><span class="s1">&#39;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">line</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">&#39;;&#39;</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span> <span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">lines</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># Add spaces around parens for easy splitting</span>
</span></span><span class="line"><span class="cl">    <span class="n">text</span> <span class="o">=</span> <span class="n">text</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s1">&#39;(&#39;</span><span class="p">,</span> <span class="s1">&#39; ( &#39;</span><span class="p">)</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s1">&#39;)&#39;</span><span class="p">,</span> <span class="s1">&#39; ) &#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># Split, handling string literals</span>
</span></span><span class="line"><span class="cl">    <span class="n">tokens</span> <span class="o">=</span> <span class="p">[]</span>
</span></span><span class="line"><span class="cl">    <span class="n">in_string</span> <span class="o">=</span> <span class="kc">False</span>
</span></span><span class="line"><span class="cl">    <span class="n">current</span> <span class="o">=</span> <span class="p">[]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="n">char</span> <span class="ow">in</span> <span class="n">text</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">char</span> <span class="o">==</span> <span class="s1">&#39;&#34;&#39;</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="n">in_string</span> <span class="o">=</span> <span class="ow">not</span> <span class="n">in_string</span>
</span></span><span class="line"><span class="cl">            <span class="n">current</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">char</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="k">elif</span> <span class="n">in_string</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="n">current</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">char</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="k">elif</span> <span class="n">char</span><span class="o">.</span><span class="n">isspace</span><span class="p">():</span>
</span></span><span class="line"><span class="cl">            <span class="k">if</span> <span class="n">current</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">                <span class="n">tokens</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">current</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">                <span class="n">current</span> <span class="o">=</span> <span class="p">[]</span>
</span></span><span class="line"><span class="cl">        <span class="k">else</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="n">current</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">char</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">current</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">tokens</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">current</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">tokens</span>
</span></span></code></pre></div><p>Input: <code>(define x (+ 1 2))</code>
Output: <code>['(', 'define', 'x', '(', '+', '1', '2', ')', ')']</code></p>
<h2 id="step-2-parsing">Step 2: Parsing</h2>
<p>Parsing converts tokens into an Abstract Syntax Tree (AST). In Scheme, the AST is just nested Python lists:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="nd">@dataclass</span>
</span></span><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">Symbol</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Represents a Scheme symbol.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="n">name</span><span class="p">:</span> <span class="nb">str</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">parse</span><span class="p">(</span><span class="n">tokens</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="nb">str</span><span class="p">])</span> <span class="o">-&gt;</span> <span class="n">Any</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Parse tokens into an AST.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">parse_expr</span><span class="p">(</span><span class="n">index</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Tuple</span><span class="p">[</span><span class="n">Any</span><span class="p">,</span> <span class="nb">int</span><span class="p">]:</span>
</span></span><span class="line"><span class="cl">        <span class="n">token</span> <span class="o">=</span> <span class="n">tokens</span><span class="p">[</span><span class="n">index</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">token</span> <span class="o">==</span> <span class="s1">&#39;(&#39;</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="c1"># Parse a list</span>
</span></span><span class="line"><span class="cl">            <span class="n">lst</span> <span class="o">=</span> <span class="p">[]</span>
</span></span><span class="line"><span class="cl">            <span class="n">index</span> <span class="o">+=</span> <span class="mi">1</span>
</span></span><span class="line"><span class="cl">            <span class="k">while</span> <span class="n">tokens</span><span class="p">[</span><span class="n">index</span><span class="p">]</span> <span class="o">!=</span> <span class="s1">&#39;)&#39;</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">                <span class="n">expr</span><span class="p">,</span> <span class="n">index</span> <span class="o">=</span> <span class="n">parse_expr</span><span class="p">(</span><span class="n">index</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">                <span class="n">lst</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">expr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">            <span class="k">return</span> <span class="n">lst</span><span class="p">,</span> <span class="n">index</span> <span class="o">+</span> <span class="mi">1</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="k">elif</span> <span class="n">token</span> <span class="o">==</span> <span class="s1">&#39;)&#39;</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="k">raise</span> <span class="ne">SyntaxError</span><span class="p">(</span><span class="s2">&#34;Unexpected )&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="k">else</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="c1"># Parse an atom</span>
</span></span><span class="line"><span class="cl">            <span class="k">return</span> <span class="n">parse_atom</span><span class="p">(</span><span class="n">token</span><span class="p">),</span> <span class="n">index</span> <span class="o">+</span> <span class="mi">1</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">expr</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">parse_expr</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">expr</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">parse_atom</span><span class="p">(</span><span class="n">token</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Any</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Parse a single atom.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># Try integer</span>
</span></span><span class="line"><span class="cl">    <span class="k">try</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="nb">int</span><span class="p">(</span><span class="n">token</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="k">except</span> <span class="ne">ValueError</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">pass</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># Try float</span>
</span></span><span class="line"><span class="cl">    <span class="k">try</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="nb">float</span><span class="p">(</span><span class="n">token</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="k">except</span> <span class="ne">ValueError</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">pass</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># String literal</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">token</span><span class="o">.</span><span class="n">startswith</span><span class="p">(</span><span class="s1">&#39;&#34;&#39;</span><span class="p">)</span> <span class="ow">and</span> <span class="n">token</span><span class="o">.</span><span class="n">endswith</span><span class="p">(</span><span class="s1">&#39;&#34;&#39;</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="n">token</span><span class="p">[</span><span class="mi">1</span><span class="p">:</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># Boolean</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">token</span> <span class="o">==</span> <span class="s1">&#39;#t&#39;</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="kc">True</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">token</span> <span class="o">==</span> <span class="s1">&#39;#f&#39;</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="kc">False</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># Symbol</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">Symbol</span><span class="p">(</span><span class="n">token</span><span class="p">)</span>
</span></span></code></pre></div><p>Input tokens: <code>['(', 'define', 'x', '(', '+', '1', '2', ')', ')']</code>
Output AST: <code>[Symbol('define'), Symbol('x'), [Symbol('+'), 1, 2]]</code></p>
<p>The beauty of Lisp: the AST <em>is</em> the syntax. There&rsquo;s no separate tree structure—it&rsquo;s just lists of symbols and values.</p>
<h2 id="step-3-the-environment">Step 3: The Environment</h2>
<p>Before evaluation, we need an environment to track variable bindings. This is where lexical scoping lives:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">Environment</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Lexical environment for variable bindings.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">parent</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="s1">&#39;Environment&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="kc">None</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">bindings</span><span class="p">:</span> <span class="n">Dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">Any</span><span class="p">]</span> <span class="o">=</span> <span class="p">{}</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">parent</span> <span class="o">=</span> <span class="n">parent</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">define</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">name</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">value</span><span class="p">:</span> <span class="n">Any</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;&#34;&#34;Define a new binding in this environment.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">bindings</span><span class="p">[</span><span class="n">name</span><span class="p">]</span> <span class="o">=</span> <span class="n">value</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">get</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">name</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Any</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;&#34;&#34;Look up a binding, checking parent scopes.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">name</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">bindings</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">bindings</span><span class="p">[</span><span class="n">name</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">        <span class="k">elif</span> <span class="bp">self</span><span class="o">.</span><span class="n">parent</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">parent</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">name</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="k">else</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="k">raise</span> <span class="ne">NameError</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;Undefined variable: </span><span class="si">{</span><span class="n">name</span><span class="si">}</span><span class="s2">&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">set</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">name</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">value</span><span class="p">:</span> <span class="n">Any</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;&#34;&#34;Update an existing binding.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">name</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">bindings</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="bp">self</span><span class="o">.</span><span class="n">bindings</span><span class="p">[</span><span class="n">name</span><span class="p">]</span> <span class="o">=</span> <span class="n">value</span>
</span></span><span class="line"><span class="cl">        <span class="k">elif</span> <span class="bp">self</span><span class="o">.</span><span class="n">parent</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="bp">self</span><span class="o">.</span><span class="n">parent</span><span class="o">.</span><span class="n">set</span><span class="p">(</span><span class="n">name</span><span class="p">,</span> <span class="n">value</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="k">else</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="k">raise</span> <span class="ne">NameError</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;Undefined variable: </span><span class="si">{</span><span class="n">name</span><span class="si">}</span><span class="s2">&#34;</span><span class="p">)</span>
</span></span></code></pre></div><p>Each function call creates a new Environment with the current one as its parent. This chain enables closures—inner functions that remember their enclosing scope.</p>
<h2 id="step-4-evaluation">Step 4: Evaluation</h2>
<p>The evaluator is the heart of the interpreter. It walks the AST and computes values:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">evaluate</span><span class="p">(</span><span class="n">expr</span><span class="p">:</span> <span class="n">Any</span><span class="p">,</span> <span class="n">env</span><span class="p">:</span> <span class="n">Environment</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Any</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Evaluate an expression in an environment.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># Self-evaluating: numbers, strings, booleans</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">expr</span><span class="p">,</span> <span class="p">(</span><span class="nb">int</span><span class="p">,</span> <span class="nb">float</span><span class="p">,</span> <span class="nb">str</span><span class="p">,</span> <span class="nb">bool</span><span class="p">,</span> <span class="nb">type</span><span class="p">(</span><span class="kc">None</span><span class="p">))):</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="n">expr</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># Variable lookup</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">expr</span><span class="p">,</span> <span class="n">Symbol</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="n">env</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">expr</span><span class="o">.</span><span class="n">name</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># Must be a list (function call or special form)</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="ow">not</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">expr</span><span class="p">,</span> <span class="nb">list</span><span class="p">)</span> <span class="ow">or</span> <span class="ow">not</span> <span class="n">expr</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="n">expr</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">op</span> <span class="o">=</span> <span class="n">expr</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># Special forms</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">op</span><span class="p">,</span> <span class="n">Symbol</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">op</span><span class="o">.</span><span class="n">name</span> <span class="o">==</span> <span class="s1">&#39;quote&#39;</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="k">return</span> <span class="n">expr</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">op</span><span class="o">.</span><span class="n">name</span> <span class="o">==</span> <span class="s1">&#39;define&#39;</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="n">name</span> <span class="o">=</span> <span class="n">expr</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">            <span class="n">value</span> <span class="o">=</span> <span class="n">evaluate</span><span class="p">(</span><span class="n">expr</span><span class="p">[</span><span class="mi">2</span><span class="p">],</span> <span class="n">env</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">            <span class="n">env</span><span class="o">.</span><span class="n">define</span><span class="p">(</span><span class="n">name</span><span class="o">.</span><span class="n">name</span><span class="p">,</span> <span class="n">value</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">            <span class="k">return</span> <span class="n">value</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">op</span><span class="o">.</span><span class="n">name</span> <span class="o">==</span> <span class="s1">&#39;if&#39;</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="n">condition</span> <span class="o">=</span> <span class="n">evaluate</span><span class="p">(</span><span class="n">expr</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">env</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">            <span class="k">if</span> <span class="n">condition</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">                <span class="k">return</span> <span class="n">evaluate</span><span class="p">(</span><span class="n">expr</span><span class="p">[</span><span class="mi">2</span><span class="p">],</span> <span class="n">env</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">            <span class="k">elif</span> <span class="nb">len</span><span class="p">(</span><span class="n">expr</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">3</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">                <span class="k">return</span> <span class="n">evaluate</span><span class="p">(</span><span class="n">expr</span><span class="p">[</span><span class="mi">3</span><span class="p">],</span> <span class="n">env</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">            <span class="k">return</span> <span class="kc">None</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">op</span><span class="o">.</span><span class="n">name</span> <span class="o">==</span> <span class="s1">&#39;lambda&#39;</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="n">params</span> <span class="o">=</span> <span class="n">expr</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">            <span class="n">body</span> <span class="o">=</span> <span class="n">expr</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">            <span class="k">return</span> <span class="n">Procedure</span><span class="p">(</span><span class="n">params</span><span class="p">,</span> <span class="n">body</span><span class="p">,</span> <span class="n">env</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># Function application</span>
</span></span><span class="line"><span class="cl">    <span class="n">func</span> <span class="o">=</span> <span class="n">evaluate</span><span class="p">(</span><span class="n">op</span><span class="p">,</span> <span class="n">env</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">args</span> <span class="o">=</span> <span class="p">[</span><span class="n">evaluate</span><span class="p">(</span><span class="n">arg</span><span class="p">,</span> <span class="n">env</span><span class="p">)</span> <span class="k">for</span> <span class="n">arg</span> <span class="ow">in</span> <span class="n">expr</span><span class="p">[</span><span class="mi">1</span><span class="p">:]]</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">func</span><span class="p">(</span><span class="o">*</span><span class="n">args</span><span class="p">)</span>
</span></span></code></pre></div><p>Special forms (<code>define</code>, <code>if</code>, <code>lambda</code>, etc.) have custom evaluation rules. Everything else is a function call: evaluate the operator, evaluate the arguments, call the function.</p>
<h2 id="step-5-user-defined-functions">Step 5: User-Defined Functions</h2>
<p>The <code>lambda</code> form creates procedures. A Procedure captures its parameters, body, and defining environment:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="nd">@dataclass</span>
</span></span><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">Procedure</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;A user-defined function.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="n">params</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="n">Symbol</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">    <span class="n">body</span><span class="p">:</span> <span class="n">Any</span>
</span></span><span class="line"><span class="cl">    <span class="n">env</span><span class="p">:</span> <span class="n">Environment</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="fm">__call__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="o">*</span><span class="n">args</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="c1"># Create a new environment for the call</span>
</span></span><span class="line"><span class="cl">        <span class="n">local_env</span> <span class="o">=</span> <span class="n">Environment</span><span class="p">(</span><span class="n">parent</span><span class="o">=</span><span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="c1"># Bind parameters to arguments</span>
</span></span><span class="line"><span class="cl">        <span class="k">for</span> <span class="n">param</span><span class="p">,</span> <span class="n">arg</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">params</span><span class="p">,</span> <span class="n">args</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">            <span class="n">local_env</span><span class="o">.</span><span class="n">define</span><span class="p">(</span><span class="n">param</span><span class="o">.</span><span class="n">name</span><span class="p">,</span> <span class="n">arg</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="c1"># Evaluate the body in this new environment</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="n">evaluate</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">body</span><span class="p">,</span> <span class="n">local_env</span><span class="p">)</span>
</span></span></code></pre></div><p>When called, a Procedure:</p>
<ol>
<li>Creates a new environment with its defining environment as parent (closure!)</li>
<li>Binds parameters to the passed arguments</li>
<li>Evaluates its body in this new environment</li>
</ol>
<p>This simple structure gives us closures, higher-order functions, and lexical scoping.</p>
<h2 id="step-6-built-in-primitives">Step 6: Built-in Primitives</h2>
<p>The global environment provides built-in functions:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">create_global_env</span><span class="p">()</span> <span class="o">-&gt;</span> <span class="n">Environment</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Create environment with built-in primitives.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="n">env</span> <span class="o">=</span> <span class="n">Environment</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># Arithmetic</span>
</span></span><span class="line"><span class="cl">    <span class="n">env</span><span class="o">.</span><span class="n">define</span><span class="p">(</span><span class="s1">&#39;+&#39;</span><span class="p">,</span> <span class="k">lambda</span> <span class="o">*</span><span class="n">args</span><span class="p">:</span> <span class="nb">sum</span><span class="p">(</span><span class="n">args</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">    <span class="n">env</span><span class="o">.</span><span class="n">define</span><span class="p">(</span><span class="s1">&#39;-&#39;</span><span class="p">,</span> <span class="k">lambda</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="o">=</span><span class="kc">None</span><span class="p">:</span> <span class="o">-</span><span class="n">a</span> <span class="k">if</span> <span class="n">b</span> <span class="ow">is</span> <span class="kc">None</span> <span class="k">else</span> <span class="n">a</span> <span class="o">-</span> <span class="n">b</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">env</span><span class="o">.</span><span class="n">define</span><span class="p">(</span><span class="s1">&#39;*&#39;</span><span class="p">,</span> <span class="k">lambda</span> <span class="o">*</span><span class="n">args</span><span class="p">:</span> <span class="n">reduce</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">:</span> <span class="n">x</span> <span class="o">*</span> <span class="n">y</span><span class="p">,</span> <span class="n">args</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">    <span class="n">env</span><span class="o">.</span><span class="n">define</span><span class="p">(</span><span class="s1">&#39;/&#39;</span><span class="p">,</span> <span class="k">lambda</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">:</span> <span class="n">a</span> <span class="o">/</span> <span class="n">b</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># Comparison</span>
</span></span><span class="line"><span class="cl">    <span class="n">env</span><span class="o">.</span><span class="n">define</span><span class="p">(</span><span class="s1">&#39;=&#39;</span><span class="p">,</span> <span class="k">lambda</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">:</span> <span class="n">a</span> <span class="o">==</span> <span class="n">b</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">env</span><span class="o">.</span><span class="n">define</span><span class="p">(</span><span class="s1">&#39;&lt;&#39;</span><span class="p">,</span> <span class="k">lambda</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">:</span> <span class="n">a</span> <span class="o">&lt;</span> <span class="n">b</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">env</span><span class="o">.</span><span class="n">define</span><span class="p">(</span><span class="s1">&#39;&gt;&#39;</span><span class="p">,</span> <span class="k">lambda</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">:</span> <span class="n">a</span> <span class="o">&gt;</span> <span class="n">b</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># List operations</span>
</span></span><span class="line"><span class="cl">    <span class="n">env</span><span class="o">.</span><span class="n">define</span><span class="p">(</span><span class="s1">&#39;car&#39;</span><span class="p">,</span> <span class="k">lambda</span> <span class="n">lst</span><span class="p">:</span> <span class="n">lst</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
</span></span><span class="line"><span class="cl">    <span class="n">env</span><span class="o">.</span><span class="n">define</span><span class="p">(</span><span class="s1">&#39;cdr&#39;</span><span class="p">,</span> <span class="k">lambda</span> <span class="n">lst</span><span class="p">:</span> <span class="n">lst</span><span class="p">[</span><span class="mi">1</span><span class="p">:])</span>
</span></span><span class="line"><span class="cl">    <span class="n">env</span><span class="o">.</span><span class="n">define</span><span class="p">(</span><span class="s1">&#39;cons&#39;</span><span class="p">,</span> <span class="k">lambda</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">:</span> <span class="p">[</span><span class="n">a</span><span class="p">]</span> <span class="o">+</span> <span class="nb">list</span><span class="p">(</span><span class="n">b</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">    <span class="n">env</span><span class="o">.</span><span class="n">define</span><span class="p">(</span><span class="s1">&#39;list&#39;</span><span class="p">,</span> <span class="k">lambda</span> <span class="o">*</span><span class="n">args</span><span class="p">:</span> <span class="nb">list</span><span class="p">(</span><span class="n">args</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">    <span class="n">env</span><span class="o">.</span><span class="n">define</span><span class="p">(</span><span class="s1">&#39;null?&#39;</span><span class="p">,</span> <span class="k">lambda</span> <span class="n">lst</span><span class="p">:</span> <span class="n">lst</span> <span class="o">==</span> <span class="p">[])</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># Higher-order functions</span>
</span></span><span class="line"><span class="cl">    <span class="n">env</span><span class="o">.</span><span class="n">define</span><span class="p">(</span><span class="s1">&#39;map&#39;</span><span class="p">,</span> <span class="k">lambda</span> <span class="n">f</span><span class="p">,</span> <span class="n">lst</span><span class="p">:</span> <span class="p">[</span><span class="n">f</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">lst</span><span class="p">])</span>
</span></span><span class="line"><span class="cl">    <span class="n">env</span><span class="o">.</span><span class="n">define</span><span class="p">(</span><span class="s1">&#39;filter&#39;</span><span class="p">,</span> <span class="k">lambda</span> <span class="n">f</span><span class="p">,</span> <span class="n">lst</span><span class="p">:</span> <span class="p">[</span><span class="n">x</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">lst</span> <span class="k">if</span> <span class="n">f</span><span class="p">(</span><span class="n">x</span><span class="p">)])</span>
</span></span><span class="line"><span class="cl">    <span class="n">env</span><span class="o">.</span><span class="n">define</span><span class="p">(</span><span class="s1">&#39;reduce&#39;</span><span class="p">,</span> <span class="k">lambda</span> <span class="n">f</span><span class="p">,</span> <span class="n">lst</span><span class="p">,</span> <span class="n">init</span><span class="p">:</span> <span class="n">reduce</span><span class="p">(</span><span class="n">f</span><span class="p">,</span> <span class="n">lst</span><span class="p">,</span> <span class="n">init</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">env</span>
</span></span></code></pre></div><p>Each primitive is just a Python function. The interpreter treats them identically to user-defined procedures.</p>
<h2 id="step-7-filesystem-integration">Step 7: Filesystem Integration</h2>
<p>Now the interesting part: extending Scheme with filesystem primitives.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">create_global_env</span><span class="p">(</span><span class="n">shell</span><span class="o">=</span><span class="kc">None</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Environment</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="n">env</span> <span class="o">=</span> <span class="n">Environment</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># ... standard primitives ...</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># Filesystem primitives (if shell provided)</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">shell</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">env</span><span class="o">.</span><span class="n">define</span><span class="p">(</span><span class="s1">&#39;ls&#39;</span><span class="p">,</span> <span class="k">lambda</span> <span class="n">path</span><span class="o">=</span><span class="s1">&#39;/&#39;</span><span class="p">:</span> <span class="n">shell</span><span class="o">.</span><span class="n">ls</span><span class="p">(</span><span class="n">path</span><span class="p">)</span><span class="o">.</span><span class="n">lines</span><span class="p">())</span>
</span></span><span class="line"><span class="cl">        <span class="n">env</span><span class="o">.</span><span class="n">define</span><span class="p">(</span><span class="s1">&#39;cat&#39;</span><span class="p">,</span> <span class="k">lambda</span> <span class="n">path</span><span class="p">:</span> <span class="n">shell</span><span class="o">.</span><span class="n">cat</span><span class="p">(</span><span class="n">path</span><span class="p">)</span><span class="o">.</span><span class="n">text</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">env</span><span class="o">.</span><span class="n">define</span><span class="p">(</span><span class="s1">&#39;mkdir&#39;</span><span class="p">,</span> <span class="k">lambda</span> <span class="n">path</span><span class="p">:</span> <span class="n">shell</span><span class="o">.</span><span class="n">mkdir</span><span class="p">(</span><span class="n">path</span><span class="p">)</span> <span class="ow">and</span> <span class="n">path</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">env</span><span class="o">.</span><span class="n">define</span><span class="p">(</span><span class="s1">&#39;write&#39;</span><span class="p">,</span> <span class="k">lambda</span> <span class="n">path</span><span class="p">,</span> <span class="n">content</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="n">shell</span><span class="o">.</span><span class="n">fs</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="n">content</span><span class="o">.</span><span class="n">encode</span><span class="p">())</span> <span class="ow">and</span> <span class="n">path</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">env</span><span class="o">.</span><span class="n">define</span><span class="p">(</span><span class="s1">&#39;exists?&#39;</span><span class="p">,</span> <span class="k">lambda</span> <span class="n">path</span><span class="p">:</span> <span class="n">shell</span><span class="o">.</span><span class="n">fs</span><span class="o">.</span><span class="n">exists</span><span class="p">(</span><span class="n">path</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">        <span class="n">env</span><span class="o">.</span><span class="n">define</span><span class="p">(</span><span class="s1">&#39;cd&#39;</span><span class="p">,</span> <span class="k">lambda</span> <span class="n">path</span><span class="p">:</span> <span class="n">shell</span><span class="o">.</span><span class="n">cd</span><span class="p">(</span><span class="n">path</span><span class="p">)</span> <span class="ow">and</span> <span class="n">path</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">env</span><span class="o">.</span><span class="n">define</span><span class="p">(</span><span class="s1">&#39;pwd&#39;</span><span class="p">,</span> <span class="k">lambda</span><span class="p">:</span> <span class="n">shell</span><span class="o">.</span><span class="n">_cwd</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">env</span>
</span></span></code></pre></div><p>Now Scheme can script filesystem operations:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-scheme" data-lang="scheme"><span class="line"><span class="cl"><span class="c1">; Create a project structure</span>
</span></span><span class="line"><span class="cl"><span class="p">(</span><span class="nf">mkdir</span> <span class="s">&#34;/project&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">(</span><span class="nf">mkdir</span> <span class="s">&#34;/project/src&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">(</span><span class="nf">mkdir</span> <span class="s">&#34;/project/tests&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="p">(</span><span class="nb">write </span><span class="s">&#34;/project/README.md&#34;</span> <span class="s">&#34;# My Project\n&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">(</span><span class="nb">write </span><span class="s">&#34;/project/src/main.py&#34;</span> <span class="s">&#34;def main(): pass\n&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1">; List files</span>
</span></span><span class="line"><span class="cl"><span class="p">(</span><span class="nf">ls</span> <span class="s">&#34;/project&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="c1">; =&gt; (&#34;README.md&#34; &#34;src&#34; &#34;tests&#34;)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1">; Check existence</span>
</span></span><span class="line"><span class="cl"><span class="p">(</span><span class="k">if </span><span class="p">(</span><span class="nf">exists?</span> <span class="s">&#34;/project/README.md&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="s">&#34;Found it!&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="s">&#34;Not found&#34;</span><span class="p">)</span>
</span></span></code></pre></div><h2 id="step-8-the-repl">Step 8: The REPL</h2>
<p>A Read-Eval-Print Loop ties it together:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">SchemeREPL</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">shell</span><span class="o">=</span><span class="kc">None</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">env</span> <span class="o">=</span> <span class="n">create_global_env</span><span class="p">(</span><span class="n">shell</span><span class="o">=</span><span class="n">shell</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">eval_string</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">code</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Any</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;&#34;&#34;Evaluate a string of Scheme code.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">        <span class="n">tokens</span> <span class="o">=</span> <span class="n">tokenize</span><span class="p">(</span><span class="n">code</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="ow">not</span> <span class="n">tokens</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="k">return</span> <span class="kc">None</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="n">result</span> <span class="o">=</span> <span class="kc">None</span>
</span></span><span class="line"><span class="cl">        <span class="n">idx</span> <span class="o">=</span> <span class="mi">0</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="k">while</span> <span class="n">idx</span> <span class="o">&lt;</span> <span class="nb">len</span><span class="p">(</span><span class="n">tokens</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">            <span class="n">expr</span><span class="p">,</span> <span class="n">idx</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">_parse_one</span><span class="p">(</span><span class="n">tokens</span><span class="p">,</span> <span class="n">idx</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">            <span class="n">result</span> <span class="o">=</span> <span class="n">evaluate</span><span class="p">(</span><span class="n">expr</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="n">result</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">run</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;&#34;&#34;Interactive REPL.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">        <span class="nb">print</span><span class="p">(</span><span class="s2">&#34;Scheme REPL (type &#39;quit&#39; to exit)&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="k">while</span> <span class="kc">True</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="k">try</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">                <span class="n">code</span> <span class="o">=</span> <span class="nb">input</span><span class="p">(</span><span class="s2">&#34;scheme&gt; &#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">                <span class="k">if</span> <span class="n">code</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span> <span class="o">==</span> <span class="s1">&#39;quit&#39;</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">                    <span class="k">break</span>
</span></span><span class="line"><span class="cl">                <span class="n">result</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">eval_string</span><span class="p">(</span><span class="n">code</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">                <span class="k">if</span> <span class="n">result</span> <span class="ow">is</span> <span class="ow">not</span> <span class="kc">None</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">                    <span class="nb">print</span><span class="p">(</span><span class="n">format_value</span><span class="p">(</span><span class="n">result</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">            <span class="k">except</span> <span class="ne">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">                <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;Error: </span><span class="si">{</span><span class="n">e</span><span class="si">}</span><span class="s2">&#34;</span><span class="p">)</span>
</span></span></code></pre></div><h2 id="example-session">Example Session</h2>
<p>Here&rsquo;s what a session looks like:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">scheme&gt; (define double (lambda (x) (* x 2)))
</span></span><span class="line"><span class="cl">scheme&gt; (double 21)
</span></span><span class="line"><span class="cl">42
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">scheme&gt; (map double (list 1 2 3 4 5))
</span></span><span class="line"><span class="cl">(2 4 6 8 10)
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">scheme&gt; (mkdir &#34;/data&#34;)
</span></span><span class="line"><span class="cl">&#34;/data&#34;
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">scheme&gt; (write &#34;/data/numbers.txt&#34; &#34;1\n2\n3\n&#34;)
</span></span><span class="line"><span class="cl">&#34;/data/numbers.txt&#34;
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">scheme&gt; (cat &#34;/data/numbers.txt&#34;)
</span></span><span class="line"><span class="cl">&#34;1\n2\n3\n&#34;
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">scheme&gt; (filter (lambda (x) (&gt; x 1))
</span></span><span class="line"><span class="cl">               (map (lambda (s) (string-&gt;number s))
</span></span><span class="line"><span class="cl">                    (string-split (cat &#34;/data/numbers.txt&#34;) &#34;\n&#34;)))
</span></span><span class="line"><span class="cl">(2 3)
</span></span></code></pre></div><p>We&rsquo;ve built a scriptable filesystem with a dozen lines of primitives!</p>
<h2 id="design-patterns">Design Patterns</h2>
<p>Several patterns make this work:</p>
<p><strong>Homoiconicity</strong>: Code is data. The AST is just lists, so we can generate and manipulate code programmatically.</p>
<p><strong>Closures</strong>: Functions capture their environment, enabling powerful patterns like partial application and callbacks.</p>
<p><strong>Extension via primitives</strong>: Adding features is just adding functions to the global environment. No interpreter changes needed.</p>
<p><strong>Separation of concerns</strong>: Tokenizing, parsing, and evaluating are distinct phases with clean interfaces.</p>
<h2 id="trade-offs">Trade-offs</h2>
<p>This simple interpreter has limitations:</p>
<ol>
<li><strong>No tail-call optimization</strong>: Deep recursion will overflow the Python stack.</li>
<li><strong>No macros</strong>: We can&rsquo;t extend the language&rsquo;s syntax.</li>
<li><strong>Error messages are basic</strong>: Line numbers would require more tracking.</li>
<li><strong>Performance</strong>: Pure interpretation is slow for heavy computation.</li>
</ol>
<p>For a DSL extending an application, these trade-offs are often acceptable. The simplicity enables understanding and modification.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Embedding a language isn&rsquo;t black magic. At its core:</p>
<ol>
<li><strong>Tokenize</strong>: Break text into tokens</li>
<li><strong>Parse</strong>: Build an AST (for Lisp, just nested lists)</li>
<li><strong>Evaluate</strong>: Walk the AST, computing values</li>
<li><strong>Extend</strong>: Add primitives that call into your application</li>
</ol>
<p>Scheme&rsquo;s minimal syntax makes this especially tractable. In a few hundred lines, we get variables, functions, closures, conditionals, recursion, and extensibility.</p>
<p>The result is a scriptable application where users can automate, experiment, and build beyond what the core interface provides—exactly what we want from a DSL.</p>
<hr>
<p><em>The complete implementation is in <a href="https://github.com/queelius/dagshell">DagShell</a>, demonstrating Scheme as a filesystem scripting language alongside Python&rsquo;s fluent API.</em></p>
<p><em>This is the final post in the series. Previous posts: <a href="../01-immutable-content-addressed-filesystem/">Immutable Content-Addressed Filesystems</a> and <a href="../02-unix-philosophy-python/">Unix Philosophy in Python</a>.</em></p>
]]></content:encoded></item><item><title>Narrating a Hugo Blog with Sentence Highlighting</title><link>https://metafunctor.com/post/2026-02-26-narro-hugo-integration/</link><pubDate>Thu, 26 Feb 2026 00:00:00 +0000</pubDate><dc:creator>Alex Towell</dc:creator><guid isPermaLink="true">https://metafunctor.com/post/2026-02-26-narro-hugo-integration/</guid><description>narro can now generate audio narration for Hugo blog posts, with synchronized sentence highlighting in the browser. The alignment problem turned out to be more interesting than expected.</description><content:encoded><![CDATA[<p>I wanted my blog posts to have audio narration. Not a podcast, not a read-aloud button that sends text to a cloud API. Local TTS with <a href="https://github.com/queelius/narro">narro</a>, my 80M parameter CPU model, generating Opus files that live next to the markdown source. One command to narrate an entire Hugo site.</p>
<p>That part was straightforward. The part that got interesting was highlighting: tracking which sentence is being spoken and lighting it up in the browser as the audio plays.</p>
<hr>
<h2 id="the-pipeline">The Pipeline</h2>
<p>Three commands:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">narro hugo install ~/mysite    <span class="c1"># copy player assets (JS, CSS, HTML partial)</span>
</span></span><span class="line"><span class="cl">narro hugo generate ~/mysite   <span class="c1"># narrate all posts with tts: true</span>
</span></span><span class="line"><span class="cl">narro hugo status ~/mysite     <span class="c1"># show what&#39;s been narrated and what hasn&#39;t</span>
</span></span></code></pre></div><p><code>generate</code> walks the content directory, finds posts with <code>tts: true</code> in frontmatter, extracts the prose (stripping code blocks, math, shortcodes, frontmatter), runs it through narro, converts to Opus via ffmpeg, and drops <code>narration.opus</code> and <code>narration.json</code> next to the <code>index.md</code>. The JSON file contains sentence-level timestamps. The HTML partial picks them up and wires everything together.</p>
<p>The player is a vanilla JS widget. No dependencies. Play, pause, spacebar toggle. The active paragraph highlights as the audio plays.</p>
<h2 id="the-alignment-problem">The Alignment Problem</h2>
<p>Timestamps are the interesting part. Narro uses a causal language model (Qwen-based, 80M params) to generate hidden states, then a Vocos decoder converts those to audio. There is no explicit alignment signal in this architecture.</p>
<p>My first attempt tried word-level timestamps using attention weights. The idea: each generated audio token attends to input text tokens, so you can compute a center-of-mass over the attention distribution to estimate when each word is spoken. This works beautifully for encoder-decoder models with cross-attention. It does not work for causal LM self-attention.</p>
<p>Causal self-attention attends to everything that came before. The distributions are diffuse. Every word&rsquo;s center-of-mass lands somewhere in the middle of the sequence, producing overlapping time ranges that are useless for highlighting.</p>
<h2 id="sentence-level-timestamps">Sentence-Level Timestamps</h2>
<p>The fix was to stop trying to be precise about words and use the precision the model actually gives you.</p>
<p>Narro processes each sentence independently. The number of generated tokens per sentence is known exactly. Multiply by the token duration (2048 samples / 32kHz = 64ms per token) and you get exact sentence timing. No approximation, no heuristics. The timestamps are ground truth.</p>
<p>The alignment JSON is one entry per sentence:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="p">[</span>
</span></span><span class="line"><span class="cl">  <span class="p">{</span><span class="nt">&#34;text&#34;</span><span class="p">:</span> <span class="s2">&#34;Hello world.&#34;</span><span class="p">,</span> <span class="nt">&#34;start&#34;</span><span class="p">:</span> <span class="mf">0.0</span><span class="p">,</span> <span class="nt">&#34;end&#34;</span><span class="p">:</span> <span class="mf">0.64</span><span class="p">},</span>
</span></span><span class="line"><span class="cl">  <span class="p">{</span><span class="nt">&#34;text&#34;</span><span class="p">:</span> <span class="s2">&#34;Goodbye moon.&#34;</span><span class="p">,</span> <span class="nt">&#34;start&#34;</span><span class="p">:</span> <span class="mf">0.64</span><span class="p">,</span> <span class="nt">&#34;end&#34;</span><span class="p">:</span> <span class="mf">1.6</span><span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="p">]</span>
</span></span></code></pre></div><p>I initially tried distributing each sentence&rsquo;s duration across its words proportionally by character count. It worked, but the DOM manipulation was fragile. Wrapping every word in a <code>&lt;span&gt;</code>, walking the tree to skip headings and code blocks, matching word indices sequentially to alignment entries. If any word got missed or doubled, every subsequent highlight drifted.</p>
<p>Sentence-level is better. The JS player matches sentences to existing <code>&lt;p&gt;</code> elements by substring containment. The DOM stays untouched. The highlighting is a CSS class toggled on the paragraph. Simpler code, no drift, and the visual result is actually clearer to read.</p>
<h2 id="the-browser-side">The Browser Side</h2>
<p>The JS player runs a binary search on the alignment data at 60fps via <code>requestAnimationFrame</code> and toggles a CSS class on the active paragraph. That is essentially the whole thing. No DOM rewriting, no word wrapping, no skip-tag lists.</p>
<h2 id="what-it-sounds-like">What It Sounds Like</h2>
<p>Every post on this blog with <code>tts: true</code> in frontmatter has a player at the top. Hit play. The active paragraph highlights as it is spoken. The model runs on CPU in about 20x real-time.</p>
<p>The source is at <a href="https://github.com/queelius/narro">github.com/queelius/narro</a>.</p>]]></content:encoded><category>narro</category><category>hugo</category><category>tts</category><category>open-source</category><category>python</category></item><item><title>arkiv</title><link>https://metafunctor.com/projects/arkiv/</link><pubDate>Tue, 24 Feb 2026 23:57:06 +0000</pubDate><dc:creator>Alex Towell</dc:creator><guid isPermaLink="true">https://metafunctor.com/projects/arkiv/</guid><description>Universal personal data format. JSONL in, SQL out, SQL back to JSONL. One format, one database, one query interface.</description><content:encoded><![CDATA[<h1 id="arkiv">arkiv</h1>
<p>Universal personal data format. JSONL in, SQL out, MCP to LLMs.</p>
<h2 id="the-format">The Format</h2>
<p>Every record is a JSON object. All fields optional.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">{&#34;mimetype&#34;: &#34;text/plain&#34;, &#34;content&#34;: &#34;I think the key insight is...&#34;, &#34;uri&#34;: &#34;https://chatgpt.com/c/abc&#34;, &#34;timestamp&#34;: &#34;2023-05-14T10:30:00Z&#34;, &#34;metadata&#34;: {&#34;role&#34;: &#34;user&#34;, &#34;conversation_id&#34;: &#34;abc&#34;}}
</span></span><span class="line"><span class="cl">{&#34;mimetype&#34;: &#34;audio/wav&#34;, &#34;uri&#34;: &#34;file://media/podcast.wav&#34;, &#34;timestamp&#34;: &#34;2024-01-15&#34;, &#34;metadata&#34;: {&#34;transcript&#34;: &#34;Welcome to...&#34;, &#34;duration&#34;: 45.2}}
</span></span><span class="line"><span class="cl">{&#34;mimetype&#34;: &#34;image/jpeg&#34;, &#34;uri&#34;: &#34;file://media/photo.jpg&#34;, &#34;metadata&#34;: {&#34;caption&#34;: &#34;My talk at MIT&#34;}}
</span></span></code></pre></div><h2 id="the-stack">The Stack</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-gdscript3" data-lang="gdscript3"><span class="line"><span class="cl"><span class="n">JSONL</span> <span class="n">files</span> <span class="p">(</span><span class="n">canonical</span><span class="p">,</span> <span class="n">portable</span><span class="p">,</span> <span class="n">human</span><span class="o">-</span><span class="n">readable</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="err">↓</span> <span class="n">arkiv</span> <span class="n">import</span>
</span></span><span class="line"><span class="cl"><span class="n">SQLite</span> <span class="n">database</span> <span class="p">(</span><span class="n">queryable</span><span class="p">,</span> <span class="n">efficient</span><span class="p">,</span> <span class="n">standard</span> <span class="n">SQL</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="err">↓</span> <span class="n">arkiv</span> <span class="n">serve</span>
</span></span><span class="line"><span class="cl"><span class="n">MCP</span> <span class="n">server</span> <span class="p">(</span><span class="mi">3</span> <span class="n">tools</span> <span class="err">→</span> <span class="n">any</span> <span class="n">LLM</span><span class="p">)</span>
</span></span></code></pre></div><h2 id="quick-start">Quick Start</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">pip install arkiv
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Import JSONL to SQLite</span>
</span></span><span class="line"><span class="cl">arkiv import conversations.jsonl --db archive.db
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Query</span>
</span></span><span class="line"><span class="cl">arkiv query archive.db <span class="s2">&#34;SELECT content FROM records WHERE metadata-&gt;&gt;&#39;role&#39; = &#39;user&#39; LIMIT 5&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Serve to LLMs via MCP</span>
</span></span><span class="line"><span class="cl">arkiv serve archive.db
</span></span></code></pre></div><h2 id="mcp-tools">MCP Tools</h2>
<table>
  <thead>
      <tr>
          <th>Tool</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>get_manifest()</code></td>
          <td>What collections exist, their descriptions and schemas</td>
      </tr>
      <tr>
          <td><code>get_schema(collection?)</code></td>
          <td>What metadata keys can be queried</td>
      </tr>
      <tr>
          <td><code>sql_query(query)</code></td>
          <td>Run read-only SQL</td>
      </tr>
  </tbody>
</table>
<h2 id="why">Why</h2>
<ul>
<li>Your data lives in silos (ChatGPT, email, bookmarks, photos, voice memos)</li>
<li>Source toolkits (ctk, mtk, btk, ptk, ebk) export it as JSONL</li>
<li>arkiv gives you one format, one database, one query interface</li>
<li>Any LLM can query it via MCP</li>
<li>JSONL is human-readable and durable. SQLite is the most deployed database in history.</li>
</ul>
<h2 id="spec">Spec</h2>
<p>See <a href="https://github.com/queelius/arkiv/blob/main/SPEC.md">SPEC.md</a> for the full technical specification.</p>
]]></content:encoded></item><item><title>sigmark</title><link>https://metafunctor.com/projects/sigmark/</link><pubDate>Tue, 24 Feb 2026 14:06:32 +0000</pubDate><dc:creator>Alex Towell</dc:creator><guid isPermaLink="true">https://metafunctor.com/projects/sigmark/</guid><description>GPG signing for static site markdown content</description><content:encoded><![CDATA[<h1 id="sigmark">sigmark</h1>
<p>GPG signing for static site markdown content.</p>
<p>Sign Hugo/static-site markdown files with GPG, embedding ASCII-armored signatures directly in YAML front matter. Verify authenticity, strip signatures, or check signing status across your content directory.</p>
<h2 id="install">Install</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">pip install sigmark
</span></span></code></pre></div><p>Requires GPG (<code>gpg</code>) to be installed and available on your <code>PATH</code>.</p>
<h2 id="usage">Usage</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># Sign all markdown files in a directory</span>
</span></span><span class="line"><span class="cl">sigmark sign --key you@example.com content/
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Sign a single file</span>
</span></span><span class="line"><span class="cl">sigmark sign --key you@example.com content/post/hello/index.md
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Verify signatures</span>
</span></span><span class="line"><span class="cl">sigmark verify content/
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Check signing status</span>
</span></span><span class="line"><span class="cl">sigmark status content/
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Remove signatures</span>
</span></span><span class="line"><span class="cl">sigmark strip content/
</span></span></code></pre></div><h2 id="how-it-works">How It Works</h2>
<p>Sigmark signs only the <strong>body</strong> of each markdown file (everything below the closing <code>---</code> front-matter delimiter). The GPG signature is stored as a <code>signature</code> field in the YAML front matter:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nn">---</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">title</span><span class="p">:</span><span class="w"> </span><span class="l">Hello World</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">date</span><span class="p">:</span><span class="w"> </span><span class="ld">2026-01-01</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">signature</span><span class="p">:</span><span class="w"> </span><span class="p">|</span><span class="sd">
</span></span></span><span class="line"><span class="cl"><span class="sd">  -----BEGIN PGP SIGNATURE-----
</span></span></span><span class="line"><span class="cl"><span class="sd">  iQEzBAABCAAdFiEE...
</span></span></span><span class="line"><span class="cl"><span class="sd">  -----END PGP SIGNATURE-----</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nn">---</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="l">Your post body here.</span><span class="w">
</span></span></span></code></pre></div><p>This means front-matter changes (tags, categories, draft status) don&rsquo;t invalidate the signature, while any change to the actual content does.</p>
<h2 id="commands">Commands</h2>
<table>
  <thead>
      <tr>
          <th>Command</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>sign --key &lt;id&gt; PATHS...</code></td>
          <td>Sign markdown files with GPG</td>
      </tr>
      <tr>
          <td><code>verify PATHS...</code></td>
          <td>Verify GPG signatures (exit 1 on failure)</td>
      </tr>
      <tr>
          <td><code>strip PATHS...</code></td>
          <td>Remove signature fields from front matter</td>
      </tr>
      <tr>
          <td><code>status PATHS...</code></td>
          <td>Report unsigned / valid / invalid per file</td>
      </tr>
  </tbody>
</table>
<p>All commands accept files and/or directories. Directories are walked recursively for <code>.md</code> files with YAML front matter. Global flags: <code>--verbose</code>, <code>--dry-run</code>.</p>
<h2 id="license">License</h2>
<p>MIT</p>
]]></content:encoded></item><item><title>netloom</title><link>https://metafunctor.com/projects/netloom/</link><pubDate>Tue, 24 Feb 2026 05:00:27 +0000</pubDate><dc:creator>Alex Towell</dc:creator><guid isPermaLink="true">https://metafunctor.com/projects/netloom/</guid><description>Declarative language for constructing complex networks from structured data</description><content:encoded><![CDATA[<h1 id="netloom">netloom</h1>
<p>Declarative language for constructing complex networks from structured data.</p>
<h2 id="what-is-netloom">What is netloom?</h2>
<p>netloom is a YAML-based DSL that describes how to build a <strong>weighted directed graph</strong> from structured documents. You declare <strong>node types</strong> (the units of your graph) and <strong>link types</strong> (how nodes relate), and netloom constructs a NetworkX <code>DiGraph</code> for analysis.</p>
<p>The output is a directed graph you can analyze with standard tools: community detection, centrality measures, shortest paths, visualization.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">netloom</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">G</span> <span class="o">=</span> <span class="n">netloom</span><span class="o">.</span><span class="n">build</span><span class="p">(</span><span class="s2">&#34;config.yaml&#34;</span><span class="p">)</span>   <span class="c1"># returns nx.DiGraph</span>
</span></span></code></pre></div><p>Or from the command line:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">netloom build config.yaml                            <span class="c1"># -&gt; output.graphml</span>
</span></span><span class="line"><span class="cl">netloom build config.yaml -o graph.json --format json
</span></span><span class="line"><span class="cl">netloom build config.yaml --format gexf
</span></span></code></pre></div><h2 id="why-not-just-use-a-vector-db">Why not just use a vector DB?</h2>
<p>Vector databases are fast nearest-neighbor lookup engines. They answer &ldquo;what&rsquo;s similar to X?&rdquo; netloom answers a different question: &ldquo;what is the <em>structure</em> of similarity across my corpus?&rdquo;</p>
<table>
  <thead>
      <tr>
          <th></th>
          <th>Vector DB</th>
          <th>Custom Python</th>
          <th>netloom</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Computes</strong></td>
          <td>Single embedding, cosine/L2</td>
          <td>Anything you code</td>
          <td>Multi-field similarity with composition</td>
      </tr>
      <tr>
          <td><strong>Scaling</strong></td>
          <td>ANN indexes, millions of docs</td>
          <td>Depends</td>
          <td>O(n^2) pairwise, practical to ~10K docs</td>
      </tr>
      <tr>
          <td><strong>Retrieval</strong></td>
          <td>Fast nearest-neighbor</td>
          <td>Custom</td>
          <td>Graph-aware: communities, hubs, bridges</td>
      </tr>
      <tr>
          <td><strong>Metadata</strong></td>
          <td>Filter only (WHERE clauses)</td>
          <td>Custom</td>
          <td>First-class similarity participant</td>
      </tr>
      <tr>
          <td><strong>Configuration</strong></td>
          <td>Code it</td>
          <td>Code it</td>
          <td>Declarative YAML</td>
      </tr>
  </tbody>
</table>
<p><strong>netloom wins when:</strong></p>
<ul>
<li>Metadata fields (tags, authors, categories) should <em>contribute to similarity</em>, not just filter results</li>
<li>You care about graph structure: which documents are hubs, which bridge two communities</li>
<li>You want declarative control over how similarity is composed from multiple signals</li>
<li>You have multiple node types in the same graph (heterogeneous networks)</li>
<li>You need directed relationships like citations or containment alongside similarity</li>
</ul>
<p><strong>Sweet spot</strong>: Small-to-medium corpora (&lt;10K documents) where you want to understand structure, not just retrieve.</p>
<h2 id="source-formats">Source formats</h2>
<p>netloom ingests structured data from multiple formats. A single config can combine multiple sources:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">source</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="nt">path</span><span class="p">:</span><span class="w"> </span><span class="l">data/conversations.jsonl</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">format</span><span class="p">:</span><span class="w"> </span><span class="l">jsonl</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="nt">path</span><span class="p">:</span><span class="w"> </span><span class="l">data/papers/</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">format</span><span class="p">:</span><span class="w"> </span><span class="l">json</span><span class="w">
</span></span></span></code></pre></div><table>
  <thead>
      <tr>
          <th>Format</th>
          <th>Example</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>JSONL</td>
          <td><code>data/conversations.jsonl</code></td>
      </tr>
      <tr>
          <td>JSON files</td>
          <td><code>data/papers/*.json</code></td>
      </tr>
      <tr>
          <td>YAML</td>
          <td><code>data/config.yaml</code> (single or multi-document)</td>
      </tr>
      <tr>
          <td>Markdown + frontmatter</td>
          <td><code>notes/*.md</code> (YAML frontmatter + body)</td>
      </tr>
      <tr>
          <td>Plain markdown</td>
          <td><code>docs/*.md</code> (headings and sections extracted as structured data)</td>
      </tr>
      <tr>
          <td>Plain text</td>
          <td><code>corpus/*.txt</code> (whole file becomes <code>body</code>)</td>
      </tr>
  </tbody>
</table>
<p>Markdown is treated as structured data: headings become <code>title</code>, <code>##</code> sections become a <code>sections</code> list, and the content becomes <code>body</code>. Every record gets a <code>_meta</code> block with full provenance (source path, timestamps, content hash).</p>
<h2 id="core-abstractions">Core abstractions</h2>
<p><strong>Defaults</strong> reduce repetition across node types:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">defaults</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">embed</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">model</span><span class="p">:</span><span class="w"> </span><span class="l">tfidf</span><span class="w">
</span></span></span></code></pre></div><p><strong>Nodes</strong> define the units of your graph. A single source document can produce multiple node types:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">nodes</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">conversation</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">from</span><span class="p">:</span><span class="w"> </span><span class="l">.</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">fields</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">title</span><span class="p">:</span><span class="w"> </span>{<span class="w"> </span><span class="nt">pluck</span><span class="p">:</span><span class="w"> </span><span class="l">title }</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">tags</span><span class="p">:</span><span class="w"> </span>{<span class="w"> </span><span class="nt">pluck</span><span class="p">:</span><span class="w"> </span><span class="l">tags }</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">embed</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">field: title              # inherits model</span><span class="p">:</span><span class="w"> </span><span class="l">tfidf from defaults</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">user_turn</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">from</span><span class="p">:</span><span class="w"> </span><span class="l">turns</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">where</span><span class="p">:</span><span class="w"> </span>{<span class="w"> </span><span class="nt">role</span><span class="p">:</span><span class="w"> </span><span class="l">user }</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">fields</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">text</span><span class="p">:</span><span class="w"> </span>{<span class="w"> </span><span class="nt">pluck</span><span class="p">:</span><span class="w"> </span><span class="l">text }</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">embed</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">field</span><span class="p">:</span><span class="w"> </span><span class="l">text</span><span class="w">
</span></span></span></code></pre></div><p><strong>Links</strong> define relationships between nodes &ndash; similarity, attribute overlap, structural containment, and foreign-key references:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">links</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">intent_similarity</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">between</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="l">user_turn, user_turn]</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">method</span><span class="p">:</span><span class="w"> </span><span class="l">cosine</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">min</span><span class="p">:</span><span class="w"> </span><span class="m">0.3</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">tag_overlap</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">between</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="l">conversation, conversation]</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">method</span><span class="p">:</span><span class="w"> </span><span class="l">jaccard</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">field</span><span class="p">:</span><span class="w"> </span><span class="l">tags</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">contains_turns</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">between</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="l">conversation, user_turn]</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">method</span><span class="p">:</span><span class="w"> </span><span class="l">parent</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">cites</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">between</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="l">paper, paper]</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">method</span><span class="p">:</span><span class="w"> </span><span class="l">reference</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">field</span><span class="p">:</span><span class="w"> </span><span class="l">references</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">target_field</span><span class="p">:</span><span class="w"> </span><span class="l">paper_id</span><span class="w">
</span></span></span></code></pre></div><p><strong>Network</strong> controls graph construction:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">network</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">min</span><span class="p">:</span><span class="w"> </span><span class="m">0.3</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">communities</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">algorithm</span><span class="p">:</span><span class="w"> </span><span class="l">louvain</span><span class="w">
</span></span></span></code></pre></div><h2 id="link-methods">Link methods</h2>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>cosine</code></td>
          <td>Cosine similarity on embedding vectors</td>
      </tr>
      <tr>
          <td><code>jaccard</code></td>
          <td>Jaccard set similarity on list fields</td>
      </tr>
      <tr>
          <td><code>dice</code></td>
          <td>Dice coefficient on list fields</td>
      </tr>
      <tr>
          <td><code>overlap</code></td>
          <td>Overlap coefficient on list fields</td>
      </tr>
      <tr>
          <td><code>exact</code></td>
          <td>Boolean equality (1.0 or 0.0)</td>
      </tr>
      <tr>
          <td><code>numeric</code></td>
          <td>Gaussian kernel similarity on number fields</td>
      </tr>
      <tr>
          <td><code>parent</code></td>
          <td>Structural containment (directed: parent-&gt;child)</td>
      </tr>
      <tr>
          <td><code>reference</code></td>
          <td>Foreign-key lookup (directed: source-&gt;target)</td>
      </tr>
  </tbody>
</table>
<p>Symmetric methods produce bidirectional edges. Parent and reference produce directed edges.</p>
<h2 id="full-example">Full example</h2>
<p>Given a corpus of conversation JSON documents like this:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;id&#34;</span><span class="p">:</span> <span class="s2">&#34;conv-2024-0142&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;title&#34;</span><span class="p">:</span> <span class="s2">&#34;Debug authentication middleware&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;created_at&#34;</span><span class="p">:</span> <span class="s2">&#34;2024-11-15T09:23:00Z&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;model&#34;</span><span class="p">:</span> <span class="s2">&#34;claude-sonnet-4-20250514&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;turns&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;role&#34;</span><span class="p">:</span> <span class="s2">&#34;user&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;text&#34;</span><span class="p">:</span> <span class="s2">&#34;The auth middleware is rejecting valid tokens after the Redis upgrade&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;timestamp&#34;</span><span class="p">:</span> <span class="s2">&#34;2024-11-15T09:23:00Z&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;role&#34;</span><span class="p">:</span> <span class="s2">&#34;assistant&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;text&#34;</span><span class="p">:</span> <span class="s2">&#34;Let me check the Redis connection config and token validation logic.&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;timestamp&#34;</span><span class="p">:</span> <span class="s2">&#34;2024-11-15T09:23:05Z&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;tool_calls&#34;</span><span class="p">:</span> <span class="p">[</span><span class="s2">&#34;read_file&#34;</span><span class="p">,</span> <span class="s2">&#34;grep&#34;</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;role&#34;</span><span class="p">:</span> <span class="s2">&#34;assistant&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;text&#34;</span><span class="p">:</span> <span class="s2">&#34;Found it -- the Redis key prefix changed from &#39;session:&#39; to &#39;sess:&#39; in v7.&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;timestamp&#34;</span><span class="p">:</span> <span class="s2">&#34;2024-11-15T09:23:15Z&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;role&#34;</span><span class="p">:</span> <span class="s2">&#34;user&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;text&#34;</span><span class="p">:</span> <span class="s2">&#34;Ah that makes sense, we upgraded Redis last week. Can you fix it?&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;timestamp&#34;</span><span class="p">:</span> <span class="s2">&#34;2024-11-15T09:23:30Z&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">  <span class="p">],</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;tags&#34;</span><span class="p">:</span> <span class="p">[</span><span class="s2">&#34;debugging&#34;</span><span class="p">,</span> <span class="s2">&#34;auth&#34;</span><span class="p">,</span> <span class="s2">&#34;redis&#34;</span><span class="p">],</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;project&#34;</span><span class="p">:</span> <span class="s2">&#34;backend-api&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;outcome&#34;</span><span class="p">:</span> <span class="s2">&#34;resolved&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;tools_used&#34;</span><span class="p">:</span> <span class="p">[</span><span class="s2">&#34;read_file&#34;</span><span class="p">,</span> <span class="s2">&#34;grep&#34;</span><span class="p">,</span> <span class="s2">&#34;edit&#34;</span><span class="p">]</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>This netloom config builds a heterogeneous graph where conversations and individual turns are separate node types, connected by semantic similarity, tag overlap, and structural containment:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">defaults</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">embed</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">model</span><span class="p">:</span><span class="w"> </span><span class="l">tfidf</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">source</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">path</span><span class="p">:</span><span class="w"> </span><span class="l">data/conversations/</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">format</span><span class="p">:</span><span class="w"> </span><span class="l">jsonl</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">nodes</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">conversation</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">from</span><span class="p">:</span><span class="w"> </span><span class="l">.</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">fields</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">id</span><span class="p">:</span><span class="w"> </span>{<span class="w"> </span><span class="nt">pluck</span><span class="p">:</span><span class="w"> </span><span class="l">id }</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">title</span><span class="p">:</span><span class="w"> </span>{<span class="w"> </span><span class="nt">pluck</span><span class="p">:</span><span class="w"> </span><span class="l">title }</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">tags</span><span class="p">:</span><span class="w"> </span>{<span class="w"> </span><span class="nt">pluck</span><span class="p">:</span><span class="w"> </span><span class="l">tags }</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">project</span><span class="p">:</span><span class="w"> </span>{<span class="w"> </span><span class="nt">pluck</span><span class="p">:</span><span class="w"> </span><span class="l">project }</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">embed</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">field</span><span class="p">:</span><span class="w"> </span><span class="l">title</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">user_turn</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">from</span><span class="p">:</span><span class="w"> </span><span class="l">turns</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">where</span><span class="p">:</span><span class="w"> </span>{<span class="w"> </span><span class="nt">role</span><span class="p">:</span><span class="w"> </span><span class="l">user }</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">fields</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">text</span><span class="p">:</span><span class="w"> </span>{<span class="w"> </span><span class="nt">pluck</span><span class="p">:</span><span class="w"> </span><span class="l">text }</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">timestamp</span><span class="p">:</span><span class="w"> </span>{<span class="w"> </span><span class="nt">pluck</span><span class="p">:</span><span class="w"> </span><span class="l">timestamp }</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">embed</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">field</span><span class="p">:</span><span class="w"> </span><span class="l">text</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">chunking</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="nt">method</span><span class="p">:</span><span class="w"> </span><span class="l">sentences</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="nt">max_tokens</span><span class="p">:</span><span class="w"> </span><span class="m">256</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">aggregate</span><span class="p">:</span><span class="w"> </span><span class="l">mean</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">assistant_turn</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">from</span><span class="p">:</span><span class="w"> </span><span class="l">turns</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">where</span><span class="p">:</span><span class="w"> </span>{<span class="w"> </span><span class="nt">role</span><span class="p">:</span><span class="w"> </span><span class="l">assistant }</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">fields</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">text</span><span class="p">:</span><span class="w"> </span>{<span class="w"> </span><span class="nt">pluck</span><span class="p">:</span><span class="w"> </span><span class="l">text }</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">tools</span><span class="p">:</span><span class="w"> </span>{<span class="w"> </span><span class="nt">pluck: tool_calls, default</span><span class="p">:</span><span class="w"> </span><span class="p">[]</span><span class="w"> </span>}<span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">embed</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">field</span><span class="p">:</span><span class="w"> </span><span class="l">text</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">links</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">user_intent_similarity</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">between</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="l">user_turn, user_turn]</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">method</span><span class="p">:</span><span class="w"> </span><span class="l">cosine</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">min</span><span class="p">:</span><span class="w"> </span><span class="m">0.3</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">cross_role_similarity</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">between</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="l">user_turn, assistant_turn]</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">method</span><span class="p">:</span><span class="w"> </span><span class="l">cosine</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">min</span><span class="p">:</span><span class="w"> </span><span class="m">0.4</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">tag_overlap</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">between</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="l">conversation, conversation]</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">method</span><span class="p">:</span><span class="w"> </span><span class="l">jaccard</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">field</span><span class="p">:</span><span class="w"> </span><span class="l">tags</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">same_project</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">between</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="l">conversation, conversation]</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">method</span><span class="p">:</span><span class="w"> </span><span class="l">exact</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">field</span><span class="p">:</span><span class="w"> </span><span class="l">project</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">contains_user_turns</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">between</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="l">conversation, user_turn]</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">method</span><span class="p">:</span><span class="w"> </span><span class="l">parent</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">contains_assistant_turns</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">between</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="l">conversation, assistant_turn]</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">method</span><span class="p">:</span><span class="w"> </span><span class="l">parent</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">network</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">min</span><span class="p">:</span><span class="w"> </span><span class="m">0.3</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">communities</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">algorithm</span><span class="p">:</span><span class="w"> </span><span class="l">louvain</span><span class="w">
</span></span></span></code></pre></div><p>Each source document produces one <code>conversation</code> node and multiple <code>turn</code> nodes. The resulting graph has:</p>
<ul>
<li><strong>Semantic edges</strong> between turns with similar content (cosine similarity)</li>
<li><strong>Attribute edges</strong> between conversations sharing tags (Jaccard) or the same project (exact match)</li>
<li><strong>Structural edges</strong> connecting conversations to their constituent turns (parent links)</li>
</ul>
<h2 id="plugin-architecture">Plugin architecture</h2>
<p>netloom uses a registry pattern for all pluggable components. Built-in providers and user-written providers are structurally identical:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">netloom/
</span></span><span class="line"><span class="cl">  embeddings/tfidf.py        # built-in, no heavy deps
</span></span><span class="line"><span class="cl">  embeddings/ollama.py        # local Ollama models
</span></span><span class="line"><span class="cl">  embeddings/openai.py        # OpenAI API
</span></span><span class="line"><span class="cl">  metrics/jaccard.py          # built-in
</span></span><span class="line"><span class="cl">  metrics/cosine.py           # built-in
</span></span><span class="line"><span class="cl">  chunking/sentences.py       # built-in
</span></span></code></pre></div><p>Install optional providers via extras:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">pip install netloom<span class="o">[</span>openai<span class="o">]</span>
</span></span><span class="line"><span class="cl">pip install netloom<span class="o">[</span>ollama<span class="o">]</span>
</span></span></code></pre></div><p>Or write your own:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">netloom</span> <span class="kn">import</span> <span class="n">register_embedding</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nd">@register_embedding</span><span class="p">(</span><span class="s2">&#34;my-model&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">MyEmbedding</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">embed</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">text</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">list</span><span class="p">[</span><span class="nb">float</span><span class="p">]:</span>
</span></span><span class="line"><span class="cl">        <span class="o">...</span>
</span></span></code></pre></div><p>Then reference it in the DSL:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">embed</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">field</span><span class="p">:</span><span class="w"> </span><span class="l">text</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">model</span><span class="p">:</span><span class="w"> </span><span class="l">my-model</span><span class="w">
</span></span></span></code></pre></div><h2 id="use-cases">Use cases</h2>
<ul>
<li><strong>Conversation analysis</strong>: Nodes are conversations and turns. Links are semantic similarity, topic overlap, temporal proximity, structural containment.</li>
<li><strong>Paper citation networks</strong>: Nodes are papers. Links are citation (reference), co-authorship (jaccard), topic similarity (cosine).</li>
<li><strong>Codebase analysis</strong>: Nodes are files, functions, modules. Links are imports, call graphs, semantic similarity.</li>
<li><strong>Multi-modal documents</strong>: Nodes are text chunks, images, tables from the same document. Links are co-occurrence and cross-modal similarity.</li>
<li><strong>E-commerce catalogs</strong>: Nodes are products. Links are semantic description similarity, shared categories, price-range proximity (numeric).</li>
</ul>
<h2 id="status">Status</h2>
<p><strong>Design phase.</strong> The DSL specification is in <a href="https://github.com/queelius/netloom/blob/main/docs/spec.md"><code>docs/spec.md</code></a>. No implementation code yet. We&rsquo;re refining the design before building.</p>
<h2 id="license">License</h2>
<p>MIT</p>
]]></content:encoded></item><item><title>Chartfold: Owning Your Medical Records</title><link>https://metafunctor.com/post/2026-02-24-chartfold-owning-your-medical-records/</link><pubDate>Tue, 24 Feb 2026 00:00:00 +0000</pubDate><dc:creator>Alex Towell</dc:creator><guid isPermaLink="true">https://metafunctor.com/post/2026-02-24-chartfold-owning-your-medical-records/</guid><description>A walkthrough of Chartfold, a Python tool that loads your medical records into SQLite and exposes them to Claude via MCP for structured analysis, visit prep, and ad-hoc queries.</description><content:encoded><![CDATA[<p>I have cancer. My oncologist is at one hospital system (Siteman/BJC), my primary care doctor at another, and my earlier treatment history lives at a third (Anderson, where my first oncologist practiced). Patient portals are fine for browsing, but they don&rsquo;t answer questions. They show you your data one lab result at a time, one note at a time, one visit at a time.</p>
<p>I wanted to run queries against my medical records. Correlate lab trends with treatment changes. Generate structured question lists before oncology visits. Ask &ldquo;what changed since my last appointment&rdquo; and get a real answer. That means getting the data out of the portal and into something programmable.</p>
<p>Chartfold loads EHR exports into SQLite and exposes them to Claude via MCP.</p>
<hr>
<h2 id="the-problem">The Problem</h2>
<p>In the US, patients can export their medical records. HIPAA and the 21st Century Cures Act guarantee this. What you get depends on the system: Epic MyChart gives you CDA XML files, MEDITECH Expanse gives you FHIR JSON mixed with CCDA XML, athenahealth gives you FHIR R4 Bundles. Different formats, same clinical concepts.</p>
<p>If your hospitals use different EHR systems, none of them have the complete picture. Chartfold merges the exports into one database. But even if you&rsquo;re at a single hospital, the export format is not something you can work with directly. A directory of CDA XML files is not a database. You can&rsquo;t query it, chart it, or hand it to an LLM.</p>
<p>The point of Chartfold is to turn whatever your hospital gives you into a SQLite database, then make that database useful.</p>
<h2 id="what-it-does">What It Does</h2>
<p>Chartfold is a Python CLI. You point it at an EHR export directory, it parses the XML/FHIR, normalizes everything into a common data model (16 tables, ISO dates, deduplicated), and loads it into SQLite. Then you can query it directly, export it as a self-contained HTML dashboard, or connect Claude to it via MCP.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># Load data from your hospital exports</span>
</span></span><span class="line"><span class="cl">chartfold load epic ~/exports/epic/
</span></span><span class="line"><span class="cl">chartfold load meditech ~/exports/meditech/
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Query directly</span>
</span></span><span class="line"><span class="cl">chartfold query <span class="s2">&#34;SELECT test_name, value, result_date FROM lab_results
</span></span></span><span class="line"><span class="cl"><span class="s2">                 WHERE test_name LIKE &#39;%CEA%&#39; ORDER BY result_date DESC&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Export a self-contained HTML file</span>
</span></span><span class="line"><span class="cl">chartfold <span class="nb">export</span> html --output chartfold.html
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Start the MCP server for Claude</span>
</span></span><span class="line"><span class="cl">chartfold serve-mcp
</span></span></code></pre></div><hr>
<h2 id="the-claude-integration">The Claude Integration</h2>
<p>This is why Chartfold exists for me.</p>
<p>The MCP server exposes the database to Claude Code. Setup is one file. Drop a <code>.mcp.json</code> in any directory where you run Claude Code:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;mcpServers&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;chartfold&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;command&#34;</span><span class="p">:</span> <span class="s2">&#34;python&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;args&#34;</span><span class="p">:</span> <span class="p">[</span><span class="s2">&#34;-m&#34;</span><span class="p">,</span> <span class="s2">&#34;chartfold&#34;</span><span class="p">,</span> <span class="s2">&#34;serve-mcp&#34;</span><span class="p">,</span> <span class="s2">&#34;--db&#34;</span><span class="p">,</span> <span class="s2">&#34;/path/to/chartfold.db&#34;</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">  <span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>That&rsquo;s it. Claude now has read access to your entire medical history via SQL, plus tools for saving notes and structured analyses. I keep my database in a private directory and my <code>.mcp.json</code> pointing at it. Open Claude Code, and I&rsquo;m talking to my records.</p>
<p>The kinds of things I actually use it for:</p>
<blockquote>
<p>&ldquo;What&rsquo;s changed since my last oncology visit on January 15?&rdquo;</p></blockquote>
<p>Claude writes SQL, reads the results, and gives me a structured diff: new lab results, new imaging, changed medications, new clinical notes.</p>
<blockquote>
<p>&ldquo;Generate a prioritized question list for my appointment with Dr. Tan tomorrow.&rdquo;</p></blockquote>
<p>Claude reads my recent labs, imaging reports, pathology, and genomic results, then produces a tiered document organized by clinical urgency.</p>
<blockquote>
<p>&ldquo;Show me my CEA trend and flag any inflection points.&rdquo;</p></blockquote>
<p>Claude queries the lab_results table, filters by test name, and walks through the time series.</p>
<p>The analyses get saved back to the database (via dedicated MCP tools) and appear in the HTML export as tagged, searchable documents.</p>
<p><img src="analysis-cards.png" alt="Analysis section showing Claude-generated documents: prioritized questions for an oncologist visit, molecular profile summary, and deep clinical analysis"></p>
<p>Here&rsquo;s what one looks like expanded: a structured question list for an oncology appointment, organized by urgency tier, referencing specific test results and treatment options.</p>
<p><img src="analysis.png" alt="Full analysis document showing tiered questions for Dr. Tan, covering Guardant tissue results, liquid biopsy timing, BRCA1 methylation, and treatment planning"></p>
<p>I use this before every oncology visit. When you have 1776 lab results, 53 imaging reports, and 9 pathology reports, you need something to synthesize them. That&rsquo;s what Claude does well, but it needs structured data to work with. Chartfold provides the structured data. Claude provides the reasoning.</p>
<p>The MCP server exposes 25 tools. Here&rsquo;s the full list:</p>
<table>
  <thead>
      <tr>
          <th>Tool</th>
          <th>What it does</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>run_sql</code></td>
          <td>Execute arbitrary read-only SQL against the database</td>
      </tr>
      <tr>
          <td><code>get_schema</code></td>
          <td>Get CREATE TABLE DDL for query planning</td>
      </tr>
      <tr>
          <td><code>get_database_summary</code></td>
          <td>Table counts and load history (start here)</td>
      </tr>
      <tr>
          <td><code>query_labs</code></td>
          <td>Lab results filtered by test name, date, source, LOINC</td>
      </tr>
      <tr>
          <td><code>get_lab_series_tool</code></td>
          <td>Cross-source time series for a specific test</td>
      </tr>
      <tr>
          <td><code>get_available_tests_tool</code></td>
          <td>All lab tests with frequency and date range</td>
      </tr>
      <tr>
          <td><code>get_abnormal_labs_tool</code></td>
          <td>All flagged-abnormal results</td>
      </tr>
      <tr>
          <td><code>get_medications</code></td>
          <td>Medication list, optionally filtered by status</td>
      </tr>
      <tr>
          <td><code>reconcile_medications_tool</code></td>
          <td>Cross-source medication reconciliation</td>
      </tr>
      <tr>
          <td><code>get_timeline</code></td>
          <td>Unified event timeline (encounters, procedures, imaging, labs, pathology)</td>
      </tr>
      <tr>
          <td><code>search_notes</code></td>
          <td>Full-text search across clinical notes</td>
      </tr>
      <tr>
          <td><code>get_pathology_report</code></td>
          <td>Retrieve a pathology report by ID</td>
      </tr>
      <tr>
          <td><code>get_visit_diff</code></td>
          <td>Everything new since a given date</td>
      </tr>
      <tr>
          <td><code>get_visit_prep</code></td>
          <td>Pre-appointment summary bundle</td>
      </tr>
      <tr>
          <td><code>get_surgical_timeline</code></td>
          <td>Procedures linked to pathology, imaging, and meds</td>
      </tr>
      <tr>
          <td><code>match_cross_source_encounters</code></td>
          <td>Same-day encounters across different EHR systems</td>
      </tr>
      <tr>
          <td><code>get_data_quality_report</code></td>
          <td>Duplicate detection and source coverage matrix</td>
      </tr>
      <tr>
          <td><code>get_source_files</code></td>
          <td>Find PDFs and images linked to clinical records</td>
      </tr>
      <tr>
          <td><code>get_asset_summary</code></td>
          <td>Source asset counts by type and source</td>
      </tr>
      <tr>
          <td><code>save_note</code> / <code>get_note</code> / <code>search_notes_personal</code> / <code>delete_note</code></td>
          <td>Personal notes (CRUD)</td>
      </tr>
      <tr>
          <td><code>save_analysis</code> / <code>get_analysis</code> / <code>search_analyses</code> / <code>list_analyses</code> / <code>delete_analysis</code></td>
          <td>Structured analyses (CRUD)</td>
      </tr>
  </tbody>
</table>
<p>Clinical data is read-only (the SQLite connection opens in <code>?mode=ro</code>, enforced at the engine level). Claude can&rsquo;t modify your clinical records, only read them and save its own notes and analyses.</p>
<p>Most of these tools exist so Claude doesn&rsquo;t have to write SQL for common tasks. But <code>run_sql</code> is the escape hatch: anything the specialized tools don&rsquo;t cover, Claude can query directly.</p>
<hr>
<h2 id="the-html-dashboard">The HTML Dashboard</h2>
<p>The HTML export embeds the entire SQLite database using sql.js (SQLite compiled to WebAssembly). Open the file in a browser and you get an interactive dashboard. Everything runs client-side. No server, no cloud, no account. The file never phones home.</p>
<p><img src="overview.png" alt="Dashboard overview showing record counts: 1776 lab results, 93 medications, 51 conditions, 99 encounters, 53 imaging reports, and more"></p>
<h3 id="lab-charts">Lab Charts</h3>
<p>Lab charts show time-series data across sources with reference ranges. Here, creatinine and albumin are tracked over four years across MEDITECH (blue) and Epic (orange).</p>
<p><img src="lab-charts.png" alt="Cross-source lab charts for Creatinine and Albumin, showing data series from two hospital systems with reference range bands"></p>
<p>For patients who do deal with fragmented records across incompatible systems, this is the view that doesn&rsquo;t exist in any single portal. The charts are configurable via a TOML file, which you can auto-generate from your data:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">chartfold init-config
</span></span></code></pre></div><h3 id="conditions">Conditions</h3>
<p>Conditions with ICD-10 codes, onset dates, and source provenance.</p>
<p><img src="conditions.png" alt="Conditions table showing 51 conditions with status, ICD-10 codes, onset dates, and source system"></p>
<h3 id="medications">Medications</h3>
<p>Medications show &ldquo;Multi-source&rdquo; badges when the same drug appears in multiple systems.</p>
<p><img src="medications.png" alt="Active medications list with multi-source badges"></p>
<h3 id="imaging">Imaging</h3>
<p>Imaging reports with the full narrative findings. Useful for visit prep: search for a specific study, read the impression, bring context.</p>
<p><img src="imaging.png" alt="Imaging section showing CT and MRI reports with narrative findings"></p>
<h3 id="source-documents">Source Documents</h3>
<p>Source PDFs and scanned documents grouped by date.</p>
<p><img src="sources.png" alt="Source documents grouped by date, showing PDF assets from EHR exports"></p>
<h3 id="sql-console">SQL Console</h3>
<p>For anything the UI doesn&rsquo;t cover, there&rsquo;s a SQL console. Every table, every column, every index.</p>
<p><img src="sql-console.png" alt="SQL Console showing the database schema"></p>
<h3 id="dark-mode">Dark Mode</h3>
<p><img src="dark-mode.png" alt="Dark mode overview"></p>
<hr>
<h2 id="architecture">Architecture</h2>
<p>Three-stage pipeline, each stage independently testable:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">Raw EHR files (CDA XML, FHIR JSON, CCDA XML)
</span></span><span class="line"><span class="cl">    |
</span></span><span class="line"><span class="cl">    v
</span></span><span class="line"><span class="cl">[Source Parser]  -- format-specific extraction
</span></span><span class="line"><span class="cl">    |
</span></span><span class="line"><span class="cl">    v
</span></span><span class="line"><span class="cl">[Adapter]        -- normalize to UnifiedRecords (16 dataclass types)
</span></span><span class="line"><span class="cl">    |
</span></span><span class="line"><span class="cl">    v
</span></span><span class="line"><span class="cl">[DB Loader]      -- idempotent upsert into SQLite
</span></span></code></pre></div><p>Source parsers handle the XML/FHIR parsing. Adapters normalize dates to ISO 8601, parse numeric values, deduplicate, and map into a common data model. The DB loader uses upsert with natural keys, so re-running a load is safe.</p>
<p>After loading, the CLI prints a stage comparison table: parser count, adapter count, DB count. If the numbers don&rsquo;t match, you know where data was lost.</p>
<p>Currently supports Epic (CDA), MEDITECH (FHIR + CCDA), and athenahealth (FHIR R4). These importers were written against my own exports. I can&rsquo;t guarantee they&rsquo;ll work for yours. EHR exports vary by site, software version, and configuration. The pipeline is designed as a plugin system for exactly this reason: adding a new source means writing a parser, an adapter, and wiring them into the CLI. The CLAUDE.md has a recipe, and Claude can write a new importer from a sample export in about an hour.</p>
<hr>
<h2 id="export-formats">Export Formats</h2>
<ul>
<li><strong>HTML SPA</strong>: self-contained single file with embedded SQLite, Chart.js, and sql.js. No external dependencies. Copy it to a USB drive.</li>
<li><strong>Markdown</strong>: visit-focused summary with configurable lookback, optional PDF via pandoc.</li>
<li><strong>JSON</strong>: full-fidelity round-trip format. Export, then import to a new database with identical record counts.</li>
<li><strong>Hugo site</strong>: static site with detail pages and cross-linked records.</li>
<li><strong>Arkiv</strong>: universal record format (JSONL + manifest) for long-term archival.</li>
</ul>
<p>The HTML export is a single file. No server, no backend, no account. You can host it on a static site (I host mine on GitHub Pages), email it to a family member, or hand it to a doctor on a USB drive. Because it&rsquo;s just a file, you can protect it with <a href="/post/2026-02-13-pagevault/">PageVault</a> to add password-based encryption before sharing. The recipient opens the file, enters the password, and gets the full interactive dashboard. No server involved at any step.</p>
<p>Medical records should not depend on someone else&rsquo;s infrastructure. A single HTML file with an embedded database and WebAssembly runtime is about as durable as digital data gets.</p>
<hr>
<h2 id="getting-started">Getting Started</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">pip install chartfold
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Load your exports</span>
</span></span><span class="line"><span class="cl">chartfold load auto ~/path/to/export/
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Query</span>
</span></span><span class="line"><span class="cl">chartfold query <span class="s2">&#34;SELECT test_name, value, result_date FROM lab_results LIMIT 10&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Export</span>
</span></span><span class="line"><span class="cl">chartfold <span class="nb">export</span> html --output my-records.html
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Connect Claude</span>
</span></span><span class="line"><span class="cl">chartfold serve-mcp
</span></span></code></pre></div><p>The code is on GitHub: <a href="https://github.com/queelius/chartfold">queelius/chartfold</a>. Python 3.11+, depends on lxml and not much else.</p>
<hr>
<p><em>Chartfold started because I wanted to ask questions about my own medical records and couldn&rsquo;t. Now I can.</em></p>]]></content:encoded><category>chartfold</category><category>python</category><category>ehr</category><category>sqlite</category><category>health-data</category><category>open-source</category><category>mcp</category><category>claude</category></item><item><title>Code Without Purpose</title><link>https://metafunctor.com/post/2026-02-25-code-without-purpose/</link><pubDate>Tue, 24 Feb 2026 00:00:00 +0000</pubDate><dc:creator>Alex Towell</dc:creator><guid isPermaLink="true">https://metafunctor.com/post/2026-02-25-code-without-purpose/</guid><description>The problem isn't too much code. It's code without purpose.</description><content:encoded><![CDATA[<p>Time is finite in ways I can&rsquo;t ignore. That changes which questions about code feel important.</p>
<p>I read a post arguing that the most valuable programming skill in 2026 is deleting code. The thesis: AI generates code faster than anyone can review it, so the real value is in curation and subtraction. Code is a liability, not an asset.</p>
<p>I agree with the observation. I disagree with the prescription.</p>
<hr>
<h2 id="the-thesis">The Thesis</h2>
<p>The argument is straightforward. AI tools can produce entire modules in the time it takes to write a spec. Codebases are accumulating features nobody asked for, abstractions nobody needs, and boilerplate that exists because the model defaulted to verbosity. Teams that used to struggle to ship enough code now struggle with too much of it. In this world, the programmer&rsquo;s role shifts from writer to editor. The most valuable activity becomes knowing what to cut.</p>
<p>I&rsquo;ve seen this. Projects accumulate code the way attics accumulate boxes. Nobody remembers why half of it is there. It sits untouched for months, adding to cognitive load, making every change harder to reason about. When you finally clear it out, nothing breaks and the rest becomes legible again. Code rot is real. AI accelerates it. The instinct to subtract is correct.</p>
<p>But subtraction is symptom treatment. The underlying problem isn&rsquo;t volume. It&rsquo;s that most code doesn&rsquo;t know why it exists. It was written (or generated) to solve a local problem, it solved it or half-solved it, and then it sat there, disconnected from any larger purpose. Code without purpose is what bloats. Deletion is the right instinct pointed at the wrong layer. The cause isn&rsquo;t too much code. It&rsquo;s too little intent.</p>
<h2 id="what-i-found-instead">What I Found Instead</h2>
<p>I started building tools two years ago because I needed my personal data to survive even if I wasn&rsquo;t around to maintain it. That&rsquo;s the whole constraint. Whatever I build has to work without me.</p>
<p>The first tool was a conversation archiver. I had years of AI conversations across ChatGPT, Claude, and Copilot trapped in platforms that might not exist next decade. I needed them in formats that degrade gracefully. SQLite for structured queries. JSONL for streaming and interchange. Markdown for human reading. If the tool disappears, the data is still a file you can open with anything. If SQLite disappears, the JSONL is still searchable with grep. If everything disappears, the Markdown is still readable in a text editor. Each layer works without the one above it.</p>
<p>Then I needed the same thing for bookmarks. Then ebooks. Then photos, email, medical records, notes. Each tool exists because the previous one exposed a gap. The medical records needed secure sharing without a server. The whole collection needed a dead man&rsquo;s switch. The archive eventually needed something stranger: a way to answer questions after I couldn&rsquo;t.</p>
<p>None of this was planned as an ecosystem. I built the next thing I needed, and the next, and the next. At some point I looked back and realized they all pointed the same direction. This is a life project. Everything serves one purpose.</p>
<h2 id="the-stack">The Stack</h2>
<p>One purpose: durable personal data that outlasts its creator.</p>
<p><strong>Philosophy</strong></p>
<ul>
<li><a href="https://github.com/queelius/longecho">longecho</a>: The Long Echo specification. Self-describing data, durable formats, graceful degradation.</li>
</ul>
<p><strong>Universal Format</strong></p>
<ul>
<li><a href="https://github.com/queelius/arkiv">arkiv</a>: Universal personal data format. JSONL in, SQL out, SQL back to JSONL. Its MCP server can host any collection intelligently, regardless of domain. One format, one database, one query interface.</li>
</ul>
<p><strong>Source Toolkits</strong></p>
<ul>
<li><a href="https://github.com/queelius/memex">memex</a> / <a href="https://github.com/queelius/ctk">ctk</a>: AI conversations. Import, query, continue in the browser, export durable archives.</li>
<li>A family of domain toolkits for bookmarks, ebooks, photos, email, and notes. Different domains, identical architecture.</li>
<li><a href="https://github.com/queelius/chartfold">chartfold</a>: Medical records from three hospital systems, consolidated into one queryable database.</li>
</ul>
<p><strong>Infrastructure</strong></p>
<ul>
<li><a href="https://github.com/queelius/pagevault">pagevault</a>: Client-side encryption for any HTML file. No server.</li>
<li><a href="https://github.com/queelius/posthumous">posthumous</a>: Federated dead man&rsquo;s switch.</li>
<li><a href="https://github.com/queelius/repoindex">repoindex</a>: Index and query across ~120 git repos.</li>
<li>A collection of Claude Code plugins: MCP servers, agents, and skills that wire everything into my daily workflow.</li>
</ul>
<p><strong>The Endgame</strong></p>
<ul>
<li><a href="https://github.com/queelius/eidola">eidola</a>: A conversable persona assembled from all of the above. Its first form is a Claude Code plugin backed by the combined archive. Not resurrection. An echo.</li>
</ul>
<h2 id="the-pattern">The Pattern</h2>
<p>Every tool follows the same architecture. SQLite for storage. CLI for local use. MCP server for Claude, or the CLI wrapped in a light Claude Code plugin. Export to self-contained HTML you can host anywhere or open from a file. Export to longecho-compliant archives that work without the tool. The data always outlasts the software.</p>
<p>Take memex. Import your AI conversations, query them with SQL, talk to Claude about them via MCP, or export a single HTML file where you can browse and continue conversations in the browser. Download the SQLite from that same page, and you&rsquo;re back to durable local data. The cycle closes. This is how most of the tools work.</p>
<p>Arkiv sits in the middle. The source toolkits produce JSONL. Arkiv imports it to SQLite. Arkiv exports it back to JSONL. Its MCP server can expose any collection to Claude, regardless of what domain it came from. The data flows in a circle, always in formats that describe themselves.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-gdscript3" data-lang="gdscript3"><span class="line"><span class="cl"><span class="n">Source</span> <span class="n">toolkits</span> <span class="err">→</span> <span class="n">arkiv</span> <span class="err">→</span> <span class="n">longecho</span>
</span></span><span class="line"><span class="cl">                              <span class="err">↓</span>
</span></span><span class="line"><span class="cl">                  <span class="n">pagevault</span> <span class="p">(</span><span class="n">encrypt</span><span class="p">)</span> <span class="o">+</span> <span class="n">posthumous</span> <span class="p">(</span><span class="n">deliver</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">                              <span class="err">↓</span>
</span></span><span class="line"><span class="cl">                         <span class="n">eidola</span> <span class="p">(</span><span class="n">echo</span><span class="p">)</span>
</span></span></code></pre></div><p>I have never once needed to delete one of these tools. Not because I&rsquo;m a better programmer than anyone else. Because each one exists for a specific reason that connects to the others. When code has purpose, dead weight doesn&rsquo;t accumulate.</p>
<p>This isn&rsquo;t architectural foresight. It&rsquo;s what happens when you build from a clear constraint. &ldquo;My data has to survive without me&rdquo; is a filter that works at the design stage. Every tool either serves that constraint or it doesn&rsquo;t get built. There is no third category.</p>
<h2 id="the-actual-skill">The Actual Skill</h2>
<p>The most valuable skill isn&rsquo;t deleting code or writing it. It&rsquo;s knowing why you&rsquo;re building.</p>
<p>If you know the purpose, code stays minimal because unnecessary code doesn&rsquo;t serve it. You don&rsquo;t need periodic purges. The purpose does the culling before anything gets written. Deletion is what happens when purpose was absent from the start. It&rsquo;s retrospective correction for a problem that clear intent would have prevented.</p>
<p>I know what I&rsquo;m building toward. The tools will echo after I stop maintaining them. That was the point.</p>]]></content:encoded><category>longecho</category><category>open-source</category><category>software-philosophy</category><category>legacy</category></item><item><title>bernoulli_sets</title><link>https://metafunctor.com/projects/bernoulli_sets/</link><pubDate>Sun, 22 Feb 2026 23:17:09 +0000</pubDate><dc:creator>Alex Towell</dc:creator><guid isPermaLink="true">https://metafunctor.com/projects/bernoulli_sets/</guid><description>Bernoulli sets: a compositional algebra for random approximate sets with quantifiable false positive/negative error rates</description><content:encoded></content:encoded></item><item><title>trapdoor-computing</title><link>https://metafunctor.com/projects/trapdoor-computing/</link><pubDate>Sun, 22 Feb 2026 21:02:30 +0000</pubDate><dc:creator>Alex Towell</dc:creator><guid isPermaLink="true">https://metafunctor.com/projects/trapdoor-computing/</guid><description>Trapdoor computing: a paradigm for computing with values hidden behind one-way trapdoors. Cipher maps, Bernoulli approximations, and composable encrypted computation.</description><content:encoded><![CDATA[<h1 id="trapdoor-computing">Trapdoor Computing</h1>
<p>A paradigm for computing with values whose meaning is hidden behind a one-way trapdoor.</p>
<h2 id="the-idea">The Idea</h2>
<p>A <strong>cipher map</strong> is a total function on bit strings. Every input produces output. The concepts of &ldquo;domain,&rdquo; &ldquo;correct,&rdquo; and &ldquo;incorrect&rdquo; only exist on the <strong>trusted machine</strong> that holds the decoder. The <strong>untrusted machine</strong> sees only opaque bits flowing through opaque lookup tables — it cannot distinguish signal from noise.</p>
<h2 id="foundations">Foundations</h2>
<p>The <code>foundations/</code> directory contains the authentic source documents (2023-2024) that define the framework:</p>
<ul>
<li><strong>bernoulli-model.md</strong> — The Bernoulli Model: latent/observed duality, confusion matrices, HashSet construction achieving <code>-log_2(epsilon)</code> bits per element</li>
<li><strong>noisy-gates.md</strong> — Noisy Turing Machines: composing Bernoulli logic gates, interval arithmetic for error propagation through circuits</li>
<li><strong>trapdoor-boolean-algebra.md</strong> — Boolean Algebra Over Trapdoors: approximate homomorphism from powerset to bit strings via cryptographic hash, marginal uniformity</li>
<li><strong>entropy-maps.md</strong> — Entropy Maps: prefix-free hash codes for function approximation, mu = H(Y) space bound, two-level hash construction</li>
</ul>
<h2 id="four-properties-of-a-cipher-map">Four Properties of a Cipher Map</h2>
<p>A cipher map f-hat implementing a latent function f satisfies:</p>
<ol>
<li><strong>Totality</strong> — f-hat is defined on all inputs (bits in, bits out). Out-of-domain inputs produce random output. With probability epsilon, random output happens to be a valid codeword.</li>
<li><strong>Representation Uniformity</strong> — Each domain value has multiple encodings. The distribution over encodings is delta-close to uniform, preventing frequency analysis.</li>
<li><strong>Correctness</strong> — For in-domain inputs, decode(f-hat(encode(x))) = f(x) with probability at least 1 - eta.</li>
<li><strong>Composability</strong> — Composing cipher maps compounds error predictably: eta_total = 1 - (1 - eta_f)(1 - eta_g).</li>
</ol>
<h2 id="author">Author</h2>
<p>Alexander Towell — <a href="https://metafunctor.com">metafunctor.com</a></p>
]]></content:encoded></item><item><title>deterministic-masking</title><link>https://metafunctor.com/projects/deterministic-masking/</link><pubDate>Sat, 21 Feb 2026 17:04:27 +0000</pubDate><dc:creator>Alex Towell</dc:creator><guid isPermaLink="true">https://metafunctor.com/projects/deterministic-masking/</guid><description>Information Recovery under Deterministic Masking in Exponential Series Systems — short communication</description><content:encoded></content:encoded></item><item><title>claude-anvil</title><link>https://metafunctor.com/projects/claude-anvil/</link><pubDate>Wed, 18 Feb 2026 11:23:54 +0000</pubDate><dc:creator>Alex Towell</dc:creator><guid isPermaLink="true">https://metafunctor.com/projects/claude-anvil/</guid><description>A forge for Claude Code plugins — academic research, fiction worldbuilding, publication workflows, and more</description><content:encoded><![CDATA[<h1 id="claude-anvil">claude-anvil</h1>
<p>A forge for Claude Code plugins.</p>
<p>Claude Anvil is a plugin marketplace containing nine Claude Code plugins for academic research, creative worldbuilding, publication workflows, site management, repository intelligence, personal metadata, encryption, and more. Each plugin provides skills, commands, and agents that extend Claude Code with domain-specific capabilities.</p>
<h2 id="quick-start">Quick Start</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># Install the marketplace</span>
</span></span><span class="line"><span class="cl">/plugin marketplace add queelius/claude-anvil
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Install an individual plugin</span>
</span></span><span class="line"><span class="cl">/plugin install papermill@queelius
</span></span></code></pre></div><h2 id="plugins">Plugins</h2>
<table>
  <thead>
      <tr>
          <th>Plugin</th>
          <th>Description</th>
          <th>Version</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/queelius/claude-anvil/tree/master/papermill/">papermill</a></td>
          <td>Academic paper lifecycle: thesis, lit survey, experiment, review, venue, submission</td>
          <td>0.3.0</td>
      </tr>
      <tr>
          <td><a href="https://github.com/queelius/claude-anvil/tree/master/worldsmith/">worldsmith</a></td>
          <td>Documentation-first fiction worldbuilding (the &ldquo;Silmarillion approach&rdquo;)</td>
          <td>0.2.0</td>
      </tr>
      <tr>
          <td><a href="https://github.com/queelius/claude-anvil/tree/master/pub-pipeline/">pub-pipeline</a></td>
          <td>Publication workflows: R/CRAN/JOSS, Python/PyPI</td>
          <td>0.5.0</td>
      </tr>
      <tr>
          <td><a href="https://github.com/queelius/claude-anvil/tree/master/mf/">mf</a></td>
          <td>Metafunctor site management: blog architecture, content workflows, crier</td>
          <td>1.0.0</td>
      </tr>
      <tr>
          <td><a href="https://github.com/queelius/claude-anvil/tree/master/repoindex/">repoindex</a></td>
          <td>Collection-aware repository intelligence — query, analyze, maintain git repos</td>
          <td>0.10.0</td>
      </tr>
      <tr>
          <td><a href="https://github.com/queelius/claude-anvil/tree/master/deets/">deets</a></td>
          <td>Personal metadata queries — identity, contact, academic, profiles</td>
          <td>1.0.0</td>
      </tr>
      <tr>
          <td><a href="https://github.com/queelius/claude-anvil/tree/master/locksmith/">alex-confidential</a></td>
          <td>Confidentiality toolkit — cryptoid, pagevault, gpg encryption</td>
          <td>0.1.0</td>
      </tr>
      <tr>
          <td><a href="https://github.com/queelius/claude-anvil/tree/master/kdp/">kdp</a></td>
          <td>Amazon KDP book publishing: manuscript audit, listing craft, submission workflow</td>
          <td>0.1.0</td>
      </tr>
      <tr>
          <td><a href="https://github.com/queelius/claude-anvil/tree/master/jot/">jot</a></td>
          <td>Journal-aware sessions — surfaces tasks, ideas, and plans from your jot journal</td>
          <td>0.1.0</td>
      </tr>
  </tbody>
</table>
<h2 id="plugin-anatomy">Plugin Anatomy</h2>
<p>Each plugin follows Claude Code plugin conventions:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">&lt;plugin&gt;/
</span></span><span class="line"><span class="cl">├── .claude-plugin/plugin.json   # Manifest: name, version, description, author
</span></span><span class="line"><span class="cl">├── skills/&lt;name&gt;/SKILL.md       # Interactive skills (the core logic)
</span></span><span class="line"><span class="cl">├── commands/&lt;name&gt;.md           # Slash commands (thin wrappers or rich docs)
</span></span><span class="line"><span class="cl">├── agents/&lt;name&gt;.md             # Autonomous subagents with system prompts
</span></span><span class="line"><span class="cl">└── hooks/hooks.json             # Event handlers (optional)
</span></span></code></pre></div><p><strong>Skills</strong> are the heart of each plugin — they contain the domain knowledge and workflow logic. Commands are thin wrappers that trigger skills via <code>/plugin:command</code> syntax. Agents run autonomously for tasks like literature surveys or code review.</p>
<p>Not every plugin uses every component. Minimal plugins like deets have only a skill and a command. Larger plugins like worldsmith add hooks for ambient project detection and propagation discipline.</p>
<h2 id="license">License</h2>
<p>MIT</p>
]]></content:encoded></item><item><title>claude-plugin-deets</title><link>https://metafunctor.com/projects/claude-plugin-deets/</link><pubDate>Wed, 18 Feb 2026 07:08:17 +0000</pubDate><dc:creator>Alex Towell</dc:creator><guid isPermaLink="true">https://metafunctor.com/projects/claude-plugin-deets/</guid><description>Claude Code plugin for deets — personal metadata queries (identity, profiles, cross-platform disambiguation)</description><content:encoded></content:encoded></item><item><title>Masked Failure Data: Looking Back, Looking Forward</title><link>https://metafunctor.com/post/2026-02-18-mle-looking-back-forward/</link><pubDate>Wed, 18 Feb 2026 00:00:00 +0000</pubDate><dc:creator>Alex Towell</dc:creator><guid isPermaLink="true">https://metafunctor.com/post/2026-02-18-mle-looking-back-forward/</guid><description>A retrospective on three years of building R packages and writing papers for masked series system reliability, and what comes next.</description><content:encoded><![CDATA[<p>I have been working on the same statistical problem since 2020. I am now a PhD student in CS. The problem has not changed, but my understanding of it has, and the tools I have built around it look nothing like what I started with.</p>
<p>The problem: a series system fails when any component fails. You observe system-level failure times. But you often cannot tell which component caused the failure (masking). Some systems are still running when testing ends (censoring). Given this incomplete data, estimate component reliability.</p>
<p>This is not a tutorial. It is a map of where things stand and where they are going.</p>
<hr>
<h2 id="how-it-started">How It Started</h2>
<p>I picked this for my <a href="/post/2024-06-15-reliability-estimation-in-series-systems/">math master&rsquo;s thesis</a> because it combined everything I wanted to learn: survival analysis, mixture models, the EM algorithm, bootstrap inference, simulation studies. I focused on Weibull components with right censoring, but the likelihood derivation was distribution-agnostic. The general framework was there; I just did not treat it as the main contribution.</p>
<p>The implementation was one monolithic R package (<code>wei.series.md.c1.c2.c3</code>). It worked. It was not reusable. The distribution algebra, the MLE infrastructure, the likelihood model, the series system logic, the masking conditions: everything tangled together. Want to change the distribution family? Rewrite half the package.</p>
<p>I defended in October 2023. The thesis was fine. The code was a dead end.</p>
<h2 id="pulling-it-apart">Pulling It Apart</h2>
<p>After the defense, I started asking: what are the actual algebraic structures here?</p>
<p>It took a while, but the answer was several independent layers:</p>
<ol>
<li>
<p><strong>Distributions form an algebra.</strong> Add, scale, compose. That became <a href="/projects/algebraic.dist/">algebraic.dist</a>.</p>
</li>
<li>
<p><strong>MLEs form an algebra.</strong> Delta method, bootstrap, reparameterization. That became <a href="/projects/algebraic.mle/">algebraic.mle</a>.</p>
</li>
<li>
<p><strong>Likelihood models are composable.</strong> Heterogeneous data is a sum of likelihood contributions, one per observation type. That became <a href="/projects/likelihood.model/">likelihood.model</a>.</p>
</li>
<li>
<p><strong>MLE solvers are composable.</strong> Chain them, race them, add restarts. The SICP closure property: composing solvers gives you a solver. That became <a href="/projects/compositional.mle/">compositional.mle</a>.</p>
</li>
<li>
<p><strong>Hazard functions define distributions.</strong> Write \(h(t; \theta)\), get \(S(t)\), \(F(t)\), \(f(t)\), quantiles, sampling, and MLE for free. That became <a href="/projects/flexhaz/">flexhaz</a>. A companion, <a href="/projects/serieshaz/">serieshaz</a>, composes component hazards into series system distributions.</p>
</li>
</ol>
<p>The dependency graph:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">                algebraic.dist
</span></span><span class="line"><span class="cl">                 |          \
</span></span><span class="line"><span class="cl">             algebraic.mle    \
</span></span><span class="line"><span class="cl">              |        \       \
</span></span><span class="line"><span class="cl">compositional.mle   likelihood.model
</span></span><span class="line"><span class="cl">                      |         \
</span></span><span class="line"><span class="cl">                   flexhaz    maskedcauses
</span></span><span class="line"><span class="cl">                     |
</span></span><span class="line"><span class="cl">                  serieshaz
</span></span><span class="line"><span class="cl">                     |
</span></span><span class="line"><span class="cl">                  maskedhaz
</span></span></code></pre></div><p>None of this was planned. It emerged from repeatedly asking &ldquo;why is this code so hard to change?&rdquo; until the natural joints appeared.</p>
<h2 id="ad-and-nabla">AD and nabla</h2>
<p>Preparing the master&rsquo;s defense (which included BCa bootstrap confidence intervals) exposed the places where my understanding was mechanical rather than structural. I knew the formulas. I did not always know why they were the right formulas.</p>
<p>After the defense, I experimented with PyTorch&rsquo;s autograd for MLE. I was already familiar with the relationship between gradients and score functions, but running it in a computational graph solidified it. The score function is the gradient of the log-likelihood. The observed information is the negative Hessian. Same computation, different names.</p>
<p>That led me to build <a href="/projects/nabla/">nabla</a>, which I think is the most general AD package in R. Nested dual numbers, S4 dispatch. <code>D(f)</code> gives gradients. <code>D(D(f))</code> gives Hessians. Works through loops and branches. Arbitrary order, machine precision.</p>
<p>It is too slow for optimization. That is not the point. You find your MLE however you want (analytical, <code>optim</code>, whatever), then use <code>nabla</code> to characterize the solution: observed Fisher information, skewness analysis, higher-order diagnostics. The right tool for final analysis, not for search.</p>
<h2 id="the-foundation-paper">The Foundation Paper</h2>
<p>In 2025, I wrote the paper that should have been separate from the start.</p>
<p>The thesis had the distribution-agnostic likelihood derivation buried in a document mostly about Weibull. The <a href="https://github.com/queelius/masked-causes-in-series-systems">foundation paper</a> extracts that general framework and gives it a proper treatment. It handles all four observation types (exact, right-censored, left-censored, interval-censored) and defines three conditions on the masking mechanism:</p>
<ul>
<li><strong>C1</strong>: The candidate set always contains the true failed component.</li>
<li><strong>C2</strong>: Masking probabilities are symmetric across components.</li>
<li><strong>C3</strong>: Masking probabilities do not depend on the system parameters.</li>
</ul>
<p>Under C1-C2-C3, the likelihood factors cleanly. The paper derives the general likelihood, score equations, and Fisher information, then instantiates them for exponential, Weibull, Pareto, log-normal, and gamma families.</p>
<p>The <a href="https://github.com/queelius/expo-masked-fim">exponential companion</a> pushes the simplest case as far as the math goes. Closed-form MLE. Analytical Fisher information. A proof that information loss from masking is monotone (more masking, less information, strictly). And a result I did not expect: uniform masking maximizes the entropy of the candidate set given the failure cause. It is the worst case for identifiability among all C2-compliant masking models. The closed-form results are not just a convenient special case. They are pessimistic bounds.</p>
<hr>
<h2 id="where-things-are">Where Things Are</h2>
<h3 id="software">Software</h3>
<p>Six packages on CRAN, five more in progress:</p>
<table>
  <thead>
      <tr>
          <th>Package</th>
          <th>What It Does</th>
          <th>Status</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>algebraic.dist</code></td>
          <td>Distribution algebra</td>
          <td>On CRAN</td>
      </tr>
      <tr>
          <td><code>algebraic.mle</code></td>
          <td>MLE algebra</td>
          <td>On CRAN</td>
      </tr>
      <tr>
          <td><code>likelihood.model</code></td>
          <td>Fisherian likelihood framework</td>
          <td>On CRAN</td>
      </tr>
      <tr>
          <td><code>compositional.mle</code></td>
          <td>Composable MLE solvers</td>
          <td>On CRAN</td>
      </tr>
      <tr>
          <td><code>hypothesize</code></td>
          <td>Hypothesis testing framework</td>
          <td>On CRAN</td>
      </tr>
      <tr>
          <td><code>nabla</code></td>
          <td>Automatic differentiation (dual numbers)</td>
          <td>On CRAN</td>
      </tr>
      <tr>
          <td><code>flexhaz</code></td>
          <td>Distributions from hazard functions</td>
          <td>Targeting CRAN + JOSS</td>
      </tr>
      <tr>
          <td><code>serieshaz</code></td>
          <td>Series system distributions</td>
          <td>Targeting CRAN + JOSS</td>
      </tr>
      <tr>
          <td><code>maskedcauses</code></td>
          <td>Analytical MLE for masked series</td>
          <td>Targeting CRAN + JOSS</td>
      </tr>
      <tr>
          <td><code>maskedhaz</code></td>
          <td>Numerical MLE for masked series</td>
          <td>Targeting CRAN + JOSS</td>
      </tr>
      <tr>
          <td><code>mdrelax</code></td>
          <td>Relaxed masking conditions</td>
          <td>Paper first; package if it pans out</td>
      </tr>
  </tbody>
</table>
<p><code>maskedcauses</code> and <code>maskedhaz</code> solve the same problem at different levels of generality. <code>maskedcauses</code> has closed-form solutions for exponential and Weibull. <code>maskedhaz</code> handles arbitrary hazard functions via numerical integration. When both are installed, the test suites cross-validate.</p>
<p><code>mdrelax</code> explores relaxing the C1, C2, and C3 conditions: informative masking, parameter-dependent masking, masking probabilities less than one. Right now it is research code for a paper. If the results hold up, a proper package may come out of it.</p>
<h3 id="papers">Papers</h3>
<table>
  <thead>
      <tr>
          <th>Paper</th>
          <th>Status</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Foundation (C1-C2-C3 framework)</td>
          <td>Draft complete</td>
      </tr>
      <tr>
          <td>Exponential companion (closed-form FIM)</td>
          <td>Draft complete</td>
      </tr>
      <tr>
          <td>Model selection (LRT nesting chain)</td>
          <td>Draft complete, software in <code>maskedcauses</code> vignette</td>
      </tr>
      <tr>
          <td>Relaxed C1/C2/C3 conditions</td>
          <td>Draft in progress</td>
      </tr>
      <tr>
          <td>Master&rsquo;s thesis (original Weibull treatment)</td>
          <td>Published</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="what-i-am-working-on-next">What I Am Working On Next</h2>
<p>I have five companion paper directions. Some are real; some are barely more than a paragraph.</p>
<h3 id="identifiability-active">Identifiability (active)</h3>
<p>The foundation paper proves a necessary and sufficient condition for identifiability from masked data. But it is a yes/no result. It says nothing about how much diagnostic separation you need in practice, which candidate set structures help most, or how identifiability degrades as masking gets heavier.</p>
<p>I want to give a graph-theoretic characterization (confounding graph, connected components as &ldquo;super-components&rdquo;), a linear algebra condition for the exponential case (rank of the candidate-set matrix), and simulation ablation studies.</p>
<p>The exponential companion connects directly: uniform masking at \(w = m-1\) is the most pessimistic identifiable scenario. So the question inverts: what masking design <em>minimizes</em> information loss? That is a D-optimality problem for the masking mechanism. I do not know the answer yet.</p>
<h3 id="observation-scheme-composition-idea">Observation scheme composition (idea)</h3>
<p><code>maskedcauses</code> implements composable observation functors: <code>observe_right_censor()</code>, <code>observe_left_censor()</code>, <code>observe_periodic()</code>, <code>observe_mixture()</code>. The mathematical content is that C1-C2-C3 are preserved under these compositions. If your masking mechanism satisfies the conditions, and you compose it with an independent censoring scheme, the composition still satisfies them.</p>
<p>The proofs are straightforward. The interesting question is whether there is a category-theoretic formulation. Observation schemes as morphisms, composition as functor composition. If so, the closure theorems become instances of a general principle rather than case-by-case arguments. I do not know if this leads anywhere useful or is just abstraction for its own sake.</p>
<h3 id="parsimony-vs-physical-structure-idea">Parsimony vs physical structure (idea)</h3>
<p>If the data are heavily masked and the sample is small, a single Weibull fits the system lifetime data about as well as an \(m\)-component series model. Standard model selection says: use the simpler model. Engineering knowledge says: the components exist.</p>
<p>This is a real tension. If you want to predict system lifetimes, maybe the single Weibull is fine. If you want to estimate component reliability for maintenance planning, the series decomposition matters even if it is over-parameterized relative to the data. I think this is a short paper. Maybe just an essay with simulations.</p>
<h3 id="weibull-companion-planned">Weibull companion (planned)</h3>
<p>The exponential companion has closed-form results because exponential lifetimes are memoryless. Weibull has shape parameters, which means time-varying hazards and numerical integration for some censoring types.</p>
<p>The thesis covers Weibull estimation. The companion paper would redo it properly in the foundation paper&rsquo;s notation: score equations, Fisher information, simulation studies, guidance for choosing homogeneous vs heterogeneous models. The theory exists in pieces across the thesis and <code>maskedcauses</code>. Writing it up as a standalone paper is substantial work.</p>
<h3 id="cran-and-joss-submissions">CRAN and JOSS submissions</h3>
<p>The immediate practical work. <code>flexhaz</code>, <code>serieshaz</code>, <code>maskedcauses</code>, and <code>maskedhaz</code> are all targeting both CRAN and JOSS. I have been through the CRAN gauntlet six times. It is tedious but not mysterious.</p>
<hr>
<h2 id="what-i-have-learned">What I Have Learned</h2>
<p>Three things.</p>
<p><strong>Decompose first.</strong> The monolithic thesis code worked but could not grow. Pulling it into algebraic layers took months. Every piece became independently testable, reusable, publishable. The decomposition is itself a research contribution, not just engineering convenience.</p>
<p><strong>Publish the general theory separately.</strong> The thesis had the distribution-agnostic framework buried in a Weibull-specific document. Writing the foundation paper forced me to separate what is structural (C1-C2-C3 factorization, observation type taxonomy) from what is distribution-specific. That separation made the companion papers possible.</p>
<p><strong>The Fisher information matrix is the right lens.</strong> For two years I treated FIM as &ldquo;the thing you invert to get standard errors.&rdquo; The exponential companion forced me to see it as a measure of how much information the observation scheme actually provides. Once I saw that, the identifiability results, the information loss monotonicity, and the optimal design questions all fell out naturally. FIM connects the statistical theory to the practical question of how to design better diagnostics.</p>
<hr>
<p>The work is not close to done. But the pieces are in place: a clean theoretical framework, a modular software stack, two completed papers, a clear map of what comes next. Most days, that is enough to keep going.</p>]]></content:encoded><category>R</category><category>statistics</category><category>reliability</category><category>series-systems</category><category>masked-data</category><category>maximum-likelihood</category><category>CRAN</category><category>MLE</category><category>reflection</category></item><item><title>papermill</title><link>https://metafunctor.com/projects/papermill/</link><pubDate>Tue, 17 Feb 2026 17:28:32 +0000</pubDate><dc:creator>Alex Towell</dc:creator><guid isPermaLink="true">https://metafunctor.com/projects/papermill/</guid><description>Claude Code plugin for academic research paper lifecycle management — thesis refinement, literature surveys, experimental design, editorial review, and publication venue matching</description><content:encoded><![CDATA[<h1 id="papermill">Papermill</h1>
<p>A <a href="https://claude.ai/code">Claude Code</a> plugin for academic research paper lifecycle management.</p>
<p>Papermill provides interactive skills and autonomous agents that cover the full pipeline from idea to submission: thesis refinement, literature surveys, experiment design, editorial review, and venue matching.</p>
<h2 id="installation">Installation</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">claude plugins add ~/github/papermill
</span></span></code></pre></div><p>Or from any directory containing the plugin:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">claude plugins add .
</span></span></code></pre></div><h2 id="commands">Commands</h2>
<p>Invoke commands with <code>/papermill:&lt;name&gt;</code> in Claude Code. Each command triggers its corresponding skill.</p>
<table>
  <thead>
      <tr>
          <th>Skill</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>init</strong></td>
          <td>Initialize a paper repo &ndash; discovers structure, sets up <code>.papermill.md</code> state file</td>
      </tr>
      <tr>
          <td><strong>status</strong></td>
          <td>Dashboard showing current stage, thesis, experiments, reviews, and next steps</td>
      </tr>
      <tr>
          <td><strong>thesis</strong></td>
          <td>Extract or refine the central claim and novelty (Socratic dialogue or draft extraction)</td>
      </tr>
      <tr>
          <td><strong>prior-art</strong></td>
          <td>Interactive literature survey with keyword generation, screening, classification, and gap analysis</td>
      </tr>
      <tr>
          <td><strong>outline</strong></td>
          <td>Design paper structure with section purposes, key content, and narrative arc</td>
      </tr>
      <tr>
          <td><strong>experiment</strong></td>
          <td>Design experiments with hypotheses, variables, methodology, and success criteria</td>
      </tr>
      <tr>
          <td><strong>simulation</strong></td>
          <td>Monte Carlo simulation design for validating theoretical results</td>
      </tr>
      <tr>
          <td><strong>proof</strong></td>
          <td>Mathematical proof development, verification, and presentation</td>
      </tr>
      <tr>
          <td><strong>review</strong></td>
          <td>Structured editorial review checking argument, correctness, writing, and venue fit</td>
      </tr>
      <tr>
          <td><strong>venue</strong></td>
          <td>Identify and evaluate publication venues with ranked recommendations</td>
      </tr>
      <tr>
          <td><strong>polish</strong></td>
          <td>Pre-submission checklist: formatting, citations, figures, metadata, build verification</td>
      </tr>
  </tbody>
</table>
<h2 id="agents">Agents</h2>
<p>Agents run autonomously and produce structured output files.</p>
<table>
  <thead>
      <tr>
          <th>Agent</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>surveyor</strong></td>
          <td>Deep autonomous literature search with citation network exploration</td>
      </tr>
      <tr>
          <td><strong>reviewer</strong></td>
          <td>Thorough autonomous editorial review of a paper draft</td>
      </tr>
  </tbody>
</table>
<h2 id="state-file">State File</h2>
<p>Papermill uses a <code>.papermill.md</code> file in each paper repository to persist state across sessions. This file has YAML frontmatter (structured metadata) and a markdown body (session notes).</p>
<p>The state file is created by <code>/papermill:init</code> and updated by other skills as you work.</p>
<h2 id="workflow">Workflow</h2>
<p>A typical workflow:</p>
<ol>
<li><strong><code>/papermill:init</code></strong> &ndash; Set up the state file in your paper repo</li>
<li><strong><code>/papermill:thesis</code></strong> &ndash; Crystallize your central claim</li>
<li><strong><code>/papermill:prior-art</code></strong> &ndash; Survey the literature, identify gaps</li>
<li><strong><code>/papermill:outline</code></strong> &ndash; Design the paper structure</li>
<li>Write the paper (papermill helps, but the writing is yours)</li>
<li><strong><code>/papermill:experiment</code></strong> / <strong><code>/papermill:simulation</code></strong> &ndash; Design computational work</li>
<li><strong><code>/papermill:proof</code></strong> &ndash; Develop and verify proofs</li>
<li><strong><code>/papermill:review</code></strong> &ndash; Get editorial feedback</li>
<li><strong><code>/papermill:venue</code></strong> &ndash; Choose where to submit</li>
<li><strong><code>/papermill:polish</code></strong> &ndash; Final pre-submission check</li>
</ol>
<p>Skills can be used in any order and revisited as needed. Use <code>/papermill:status</code> at any time for orientation.</p>
<h2 id="license">License</h2>
<p>MIT</p>
]]></content:encoded></item></channel></rss>