<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:media="http://search.yahoo.com/mrss/"><channel><title>metafunctor</title><link>https://metafunctor.com/</link><description>Research Engineer &amp; Computer Scientist - Machine Learning, Statistical Computing, Open Source Development</description><language>en-us</language><managingEditor>lex@metafunctor.com (Alex Towell)</managingEditor><webMaster>lex@metafunctor.com (Alex Towell)</webMaster><lastBuildDate>Fri, 05 Jun 2026 01:08:32 -0500</lastBuildDate><atom:link href="https://metafunctor.com/" rel="self" type="application/rss+xml"/><image><url>https://metafunctor.com/images/og-image.jpg</url><title>metafunctor</title><link>https://metafunctor.com/</link></image><item><title>Reverse-Process Synthetic Data Generation for Math Reasoning</title><link>https://metafunctor.com/post/rpsdg/</link><pubDate>Tue, 25 Jun 2024 00:00:00 +0000</pubDate><dc:creator>Alex Towell</dc:creator><guid isPermaLink="true">https://metafunctor.com/post/rpsdg/</guid><description>Training LLMs on mathematical reasoning by inverting easy-to-solve problems: generate derivatives, reverse them into integration exercises with full step-by-step solutions.</description><content:encoded><![CDATA[<p>Check out the (early) project and source code on <a href="https://github.com/queelius/RPSDG">GitHub</a>.</p>
<h2 id="the-idea">The idea</h2>
<p>Some problems are easy in one direction and hard in the other. Taking a derivative is mechanical. Finding an antiderivative can require genuine creativity. Generating a random expression and verifying a proof is easy. Discovering the proof is hard.</p>
<p>RPSDG (Reverse-Process Synthetic Data Generation) exploits this asymmetry. Run the easy direction with full step-by-step work, then reverse the result to get a hard problem with a known solution. You end up with process-supervised training data: not just the answer, but the entire derivation.</p>
<p>Richard Sutton&rsquo;s &ldquo;The Bitter Lesson&rdquo; argues that methods scaling with compute and data will eventually win. The bottleneck is high-quality data. A lot of the world&rsquo;s data is latent: the processes that generated it are not written down. In math, the way a proof was discovered is usually hidden behind a polished presentation. RPSDG is one way to manufacture that hidden process data.</p>
<h2 id="derivatives-to-integrals">Derivatives to Integrals</h2>
<p>Computing derivatives is mechanical. Integration often is not. That asymmetry gives us a data pipeline.</p>
<ol>
<li>
<p><strong>Start with known functions.</strong> Pick functions \( f(x) \) with closed-form derivatives: polynomials, trig, exponentials, logarithms. Vary complexity.</p>
</li>
<li>
<p><strong>Differentiate with full work shown.</strong> Take the derivative of \( f(x) \) to get \( f'(x) \), recording every step.</p>
</li>
<li>
<p><strong>Reverse the process.</strong> Now \( f'(x) \) is the problem and \( f(x) \) is the solution. The recorded steps, read backward, give you a worked integration example.</p>
</li>
</ol>
<p>By composing functions of varying complexity, you get integration problems of graduated difficulty. The training data comes with step-by-step solutions for free, because you generated it by running the easy direction.</p>
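<p>The three steps can be sketched for the simplest case, polynomials. This is a minimal illustration and not the RPSDG code: the function names and the <code>{power: coefficient}</code> representation are assumptions made for the example.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">def differentiate_with_steps(coeffs):
    """Differentiate a polynomial given as {power: coefficient}, recording each step."""
    derivative, steps = {}, []
    for power, coeff in sorted(coeffs.items(), reverse=True):
        if power == 0:
            continue  # constants vanish; they come back as "+ C" when integrating
        derivative[power - 1] = coeff * power
        steps.append((coeff, power))  # records d/dx[coeff*x^power]
    return derivative, steps


def to_integration_example(coeffs):
    """Reverse one differentiation run into an integration problem with worked steps."""
    derivative, steps = differentiate_with_steps(coeffs)
    worked = [f"integral of {c * p}*x^{p - 1} dx = {c}*x^{p}" for c, p in steps]
    worked.append("+ C")
    return {"problem": derivative, "solution": coeffs, "steps": worked}


# f(x) = 2x^3 + 5x + 7, so the generated problem is the integral of 6x^2 + 5
example = to_integration_example({3: 2, 1: 5, 0: 7})
</code></pre></div>
<p>Here <code>example["problem"]</code> is <code>{2: 6, 0: 5}</code> and the first worked step reads <code>integral of 6*x^2 dx = 2*x^3</code>. The constant term of \( f(x) \) is lost under differentiation, which is why the worked solution ends with a constant of integration.</p>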
<h2 id="proofs-by-random-walk">Proofs by Random Walk</h2>
<p>The same idea works for theorem proving. Generating proofs is hard. Verifying them is (comparatively) easy.</p>
<ol>
<li>
<p><strong>Random walks in expression space.</strong> Start with a random expression \( e_{\text{start}} \). Apply rewrite rules \( r_1, r_2, \ldots, r_n \) to get a chain of intermediate expressions ending at \( e_{\text{end}} \).</p>
</li>
<li>
<p><strong>Read off the theorem.</strong> The pair \( (e_{\text{start}}, e_{\text{end}}) \) is a theorem. The chain of rewrites is its proof.</p>
</li>
<li>
<p><strong>Reverse when useful.</strong> Running the chain backward works too, especially when a complex step in one direction (integration) becomes simple in the other (differentiation).</p>
</li>
<li>
<p><strong>Scale it.</strong> Random starting points and random rewrite sequences give you a diverse set of theorems and proofs automatically. No human has to come up with the theorem first.</p>
</li>
</ol>
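<p>A toy version of the random walk, assuming expressions are nested tuples and every rewrite rule preserves equality. The three rules and all names here are illustrative assumptions, not the project&rsquo;s rule set.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import random


# Each rule maps an expression to an equal one, or returns None if it does not apply.
def commute(e):
    if isinstance(e, tuple) and e[0] in ("+", "*"):
        return (e[0], e[2], e[1])
    return None


def add_zero(e):
    return ("+", e, 0)


def times_one(e):
    return ("*", e, 1)


RULES = {"commute": commute, "add_zero": add_zero, "times_one": times_one}


def random_walk(start, n_steps, rng):
    """Apply n_steps randomly chosen rules; return the theorem and its proof."""
    chain, proof, current = [start], [], start
    while len(proof) != n_steps:
        name = rng.choice(sorted(RULES))
        result = RULES[name](current)
        if result is None:
            continue  # rule did not apply, draw another one
        proof.append(name)
        chain.append(result)
        current = result
    return {"theorem": (start, current), "proof": proof, "chain": chain}


walk = random_walk(("+", "x", "y"), 4, random.Random(0))
</code></pre></div>
<p><code>walk["theorem"]</code> is the pair of start and end expressions, and <code>walk["proof"]</code> is the list of rule names that connects them. Replaying those rules from the start expression reproduces the end expression, which is the cheap verification step.</p>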
<h2 id="what-this-gets-you">What this gets you</h2>
<p>The training data has process supervision baked in. Every example includes intermediate steps, not just the final answer. That should help LLMs learn multi-step reasoning rather than pattern-matching to answers.</p>
<p>It also gives you something like explainability for free: the model&rsquo;s training data literally consists of step-by-step solutions, so the model has a better chance of producing step-by-step reasoning at inference time.</p>
<h2 id="limitations-and-next-steps">Limitations and next steps</h2>
<p>This is early work. The data generation pipeline exists, but I have not yet run the full fine-tuning experiments and benchmarks. The planned pipeline is:</p>
<ul>
<li>Data generation with graduated difficulty (curriculum learning)</li>
<li>Fine-tuning transformer-based LMs on the generated data</li>
<li>Self-supervised learning experiments</li>
<li>Evaluation against standard math reasoning benchmarks</li>
</ul>
<p>Further out, I want to explore reinforcement learning for rewarding multi-step reasoning even when the solution is not known in advance but can be verified.</p>
<h2 id="related">Related</h2>
<p>The verification asymmetry exploited here shows up at multiple levels.
At inference time, it enables tree search over LLM solutions:
<a href="/post/2026-04-01-i-spent-048-to-find-out-when-mcts-actually-works-for-llm-reasoning/">I Spent $0.48 to Find Out When MCTS Actually Works for LLM Reasoning</a>.
At the level of scientific method, it&rsquo;s the core of
<a href="/post/2025-01-05-science-as-verifiable-search/">Science as Verifiable Search</a>:
cheap testing enables fast iteration through hypothesis space.</p>]]></content:encoded><category>artificial intelligence</category><category>machine learning</category><category>mathematics</category><category>algebra</category><category>calculus</category><category>LLMs</category><category>synthetic data</category><category>data generation</category><category>reasoning</category><category>AI training</category><category>explainable AI</category></item><item><title>SLUUG Talk: Demystifying Large Language Models on Linux</title><link>https://metafunctor.com/post/gave-a-presentation-for-sluug-about-llms/</link><pubDate>Fri, 23 Feb 2024 00:00:00 +0000</pubDate><dc:creator>Alex Towell</dc:creator><guid isPermaLink="true">https://metafunctor.com/post/gave-a-presentation-for-sluug-about-llms/</guid><description>Talk for the St. Louis Unix Users Group about running and understanding Large Language Models on Linux.</description><content:encoded><![CDATA[<p>I gave a talk for the St. Louis Unix Users Group (SLUUG) titled &ldquo;Demystifying Large Language Models (LLMs) on Linux: From Theory to Application.&rdquo; The goal was to walk through how LLMs actually work and how to run them locally on Linux.</p>
<p>I demoed two projects:</p>
<ol>
<li>
<p>A simple Colab notebook using basic Python to generate text with an n-gram model. The point was to illustrate the core idea behind language models and show concretely why n-grams fall short, which motivates the transformer architecture.</p>
</li>
<li>
<p>A project that uses Elasticsearch and LLMs to allow natural language search queries over databases.</p>
</li>
</ol>
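<p>For a feel of the n-gram idea, here is a minimal bigram sketch. It is not the notebook&rsquo;s code; the function names and the tiny corpus are made up for illustration.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import random
from collections import defaultdict


def train_bigrams(text):
    """Map each word to the list of words that followed it in the corpus."""
    words = text.split()
    table = defaultdict(list)
    for prev, nxt in zip(words, words[1:]):
        table[prev].append(nxt)
    return table


def generate(table, start, length, rng):
    """Sample up to `length` words, each conditioned only on the previous word."""
    out = [start]
    for _ in range(length - 1):
        followers = table.get(out[-1])
        if not followers:
            break  # the last word never had a successor in the corpus
        out.append(rng.choice(followers))
    return " ".join(out)


corpus = "the cat sat on the mat and the dog sat on the rug"
table = train_bigrams(corpus)
sample = generate(table, "the", 8, random.Random(0))
</code></pre></div>
<p>Each word is chosen by looking only at the word before it, so the model cannot keep track of anything further back. That short memory is the limitation that motivates attention-based models.</p>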
<p>The talk went well. Content and code are on <a href="https://github.com/queelius/sluug-talk-llm">GitHub</a>.</p>
]]></content:encoded><category>llm</category><category>linux</category><category>ai</category><category>presentation</category></item><item><title>Master's Project: Reliability Estimation in Series Systems</title><link>https://metafunctor.com/post/masters-stats-siue-proj/</link><pubDate>Mon, 19 Feb 2024 00:00:00 +0000</pubDate><dc:creator>Alex Towell</dc:creator><guid isPermaLink="true">https://metafunctor.com/post/masters-stats-siue-proj/</guid><description>My master's project on maximum likelihood estimation for series systems with right-censored and masked failure data.</description><content:encoded><![CDATA[<p>I presented my master&rsquo;s project in October 2023, finishing up my MS in statistics/mathematics at SIUE. The associated paper is titled &ldquo;Reliability Estimation in Series Systems: Maximum Likelihood Techniques for Right-Censored and Masked Failure Data.&rdquo;</p>
<h2 id="the-problem">The Problem</h2>
<p>In reliability engineering, you often find yourself in an annoying situation: a system fails, but you do not know which component caused the failure. This is called <em>masked failure data</em>. On top of that, some systems are still running when you stop observing them, so you only know they survived at least that long. That is <em>right censoring</em>. Both are common in practice. Identifying the exact failed component is expensive or sometimes impossible.</p>
<p>The project builds a likelihood-based framework that handles both masking and censoring simultaneously, models component lifetimes with Weibull distributions, derives closed-form Fisher information for the exponential special case, and provides bootstrap methods for uncertainty quantification. I implemented it all in an R package so practitioners can actually use it.</p>
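<p>As a rough sketch of what handling both at once looks like, in generic notation and under the non-informative masking assumptions that are common in this literature (not necessarily the exact form used in the paper): take an \( m \)-component series system whose components have reliability functions \( R_j \) and hazard functions \( h_j \). An observation with time \( t_i \), candidate set \( C_i \) of components that may have caused the failure, and indicator \( \delta_i \) (1 for an observed failure, 0 for a right-censored system) contributes</p>
<p>\( \displaystyle L_i(\theta) \propto \left[ \prod_{j=1}^{m} R_j(t_i; \theta_j) \right] \left[ \sum_{j \in C_i} h_j(t_i; \theta_j) \right]^{\delta_i} \)</p>
<p>A censored system contributes only its survival probability, while a masked failure also contributes the summed hazards of its candidate components. For Weibull components with shape \( k_j \) and scale \( \lambda_j \), \( R_j(t) = \exp\left(-(t/\lambda_j)^{k_j}\right) \) and \( h_j(t) = (k_j/\lambda_j)(t/\lambda_j)^{k_j - 1} \).</p>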
<h2 id="related-work">Related Work</h2>
<p>This connects to several other posts and projects:</p>
<ul>
<li><a href="/post/expo-masked-fim/">Closed-Form Results for Masked Exponential Series Systems</a> covers the exponential distribution special case with analytical solutions</li>
<li><a href="/projects/likelihood.model/">likelihood.model R package</a> is the software implementation</li>
</ul>
<p>See the full project page <a href="/projects/reliability-estimation-in-series-systems/">here</a>.</p>
]]></content:encoded><category>series systems</category><category>masked failure data</category><category>censoring</category><category>reliability analysis</category><category>maximum likelihood</category><category>Weibull distribution</category><category>R</category></item><item><title>Building an Immutable, Content-Addressed Filesystem in Python</title><link>https://metafunctor.com/projects/dagshell/posts/01-immutable-content-addressed-filesystem/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><dc:creator>Alex Towell</dc:creator><guid isPermaLink="true">https://metafunctor.com/projects/dagshell/posts/01-immutable-content-addressed-filesystem/</guid><description>How Git-style content addressing creates elegant, functional data structures</description><content:encoded><![CDATA[<p>When you run <code>git commit</code>, something interesting happens: Git doesn&rsquo;t store your files by name. Instead, it computes a SHA-1 hash of each file&rsquo;s content and stores the file under that hash. The filename is just a pointer to the hash. This is <strong>content addressing</strong>—identifying data by what it contains rather than where it lives.</p>
<p>This seemingly simple idea has profound implications. In this post, I&rsquo;ll show how to build a content-addressed virtual filesystem in Python, exploring how immutability and content addressing work together to create elegant, functional data structures.</p>
<h2 id="why-content-addressing">Why Content Addressing?</h2>
<p>Consider a traditional filesystem. When you modify a file, the system overwrites the old content. The file&rsquo;s identity (its path) stays the same, but its content changes. This mutable approach has problems:</p>
<ol>
<li><strong>No automatic history</strong>: Once you overwrite, the old data is gone</li>
<li><strong>No deduplication</strong>: Two identical files take up twice the space</li>
<li><strong>No integrity verification</strong>: Corruption can go undetected</li>
</ol>
<p>Content addressing solves all three. If a file&rsquo;s identity <em>is</em> its content (via a hash), then:</p>
<ol>
<li><strong>History is preserved</strong>: Changing content creates a new hash, so the old version still exists</li>
<li><strong>Deduplication is automatic</strong>: Identical content has identical hashes—stored once</li>
<li><strong>Integrity is built-in</strong>: If the content doesn&rsquo;t match the hash, you know something&rsquo;s wrong</li>
</ol>
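<p>All three properties show up in a few lines using only <code>hashlib</code> and a dictionary. This is a minimal sketch; <code>put</code>, <code>get</code>, and <code>store</code> are illustrative names rather than part of the filesystem built below.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import hashlib

store = {}  # hash -> content


def content_address(data):
    return hashlib.sha256(data).hexdigest()


def put(data):
    """Store bytes under their own hash and return that hash."""
    key = content_address(data)
    store.setdefault(key, data)  # identical content is stored only once
    return key


def get(key):
    """Fetch bytes by hash, checking that they still match it."""
    data = store[key]
    if content_address(data) != key:
        raise ValueError("stored content does not match its hash")
    return data


v1 = put(b"hello")
v2 = put(b"hello world")  # a change produces a new key; v1 is still retrievable
dup = put(b"hello")       # same content, same key, no extra storage
</code></pre></div>
<p>After these calls the store holds two entries, <code>dup</code> equals <code>v1</code>, and <code>get</code> raises if an entry has been corrupted.</p>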
<h2 id="the-node-hierarchy">The Node Hierarchy</h2>
<p>Let&rsquo;s build this. First, we define our filesystem nodes using Python&rsquo;s frozen dataclasses:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">dataclasses</span> <span class="kn">import</span> <span class="n">dataclass</span><span class="p">,</span> <span class="n">field</span>
</span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">hashlib</span>
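</span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">json</span>
</span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">time</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">Dict</span>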
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nd">@dataclass</span><span class="p">(</span><span class="n">frozen</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">Node</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Base class for all filesystem nodes.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="n">mode</span><span class="p">:</span> <span class="nb">int</span>
</span></span><span class="line"><span class="cl">    <span class="n">uid</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">1000</span>
</span></span><span class="line"><span class="cl">    <span class="n">gid</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">1000</span>
</span></span><span class="line"><span class="cl">    <span class="n">mtime</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="n">field</span><span class="p">(</span><span class="n">default_factory</span><span class="o">=</span><span class="n">time</span><span class="o">.</span><span class="n">time</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">compute_hash</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;&#34;&#34;Compute SHA256 hash of this node including all metadata.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">        <span class="n">data</span> <span class="o">=</span> <span class="n">json</span><span class="o">.</span><span class="n">dumps</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">to_dict</span><span class="p">(),</span> <span class="n">sort_keys</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="n">hashlib</span><span class="o">.</span><span class="n">sha256</span><span class="p">(</span><span class="n">data</span><span class="o">.</span><span class="n">encode</span><span class="p">())</span><span class="o">.</span><span class="n">hexdigest</span><span class="p">()</span>
</span></span></code></pre></div><p>The <code>frozen=True</code> parameter is crucial. It makes instances immutable—you cannot modify a Node after creation. Any &ldquo;change&rdquo; requires creating a new Node.</p>
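<p>The class above calls <code>to_dict()</code>, which is not shown. The sketch below fills it in with one plausible, hypothetical implementation based on <code>dataclasses.asdict</code>, and shows what a &ldquo;change&rdquo; looks like on a frozen instance. A <code>FileNode</code> with <code>bytes</code> content would also need that field encoded (for example as hex) before <code>json.dumps</code> could serialize it.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import dataclasses
import hashlib
import json
import time
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    mode: int
    uid: int = 1000
    gid: int = 1000
    mtime: float = field(default_factory=time.time)

    def to_dict(self):
        # Hypothetical helper: include the class name so that different
        # node types with equal fields do not collide.
        return {"type": type(self).__name__, **dataclasses.asdict(self)}

    def compute_hash(self):
        data = json.dumps(self.to_dict(), sort_keys=True)
        return hashlib.sha256(data.encode()).hexdigest()


node = Node(mode=0o755, mtime=0.0)
try:
    node.mode = 0o700  # assignment on a frozen dataclass raises
except dataclasses.FrozenInstanceError:
    # Build a modified copy instead; the original is untouched.
    changed = dataclasses.replace(node, mode=0o700)
</code></pre></div>
<p>The original <code>node</code> keeps its mode and its hash, while <code>changed</code> is a distinct node with a different hash.</p>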
<p>We then specialize for different node types:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="nd">@dataclass</span><span class="p">(</span><span class="n">frozen</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">FileNode</span><span class="p">(</span><span class="n">Node</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Regular file node.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="n">content</span><span class="p">:</span> <span class="nb">bytes</span> <span class="o">=</span> <span class="sa">b</span><span class="s2">&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nd">@dataclass</span><span class="p">(</span><span class="n">frozen</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">DirNode</span><span class="p">(</span><span class="n">Node</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Directory node containing references to child nodes.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="n">children</span><span class="p">:</span> <span class="n">Dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">str</span><span class="p">]</span> <span class="o">=</span> <span class="n">field</span><span class="p">(</span><span class="n">default_factory</span><span class="o">=</span><span class="nb">dict</span><span class="p">)</span>  <span class="c1"># name -&gt; hash</span>
</span></span></code></pre></div><p>Notice that <code>DirNode.children</code> maps names to <em>hashes</em>, not to Node objects directly. This is the key insight: directories don&rsquo;t contain their children; they contain <em>references</em> to them, in the form of hashes. The actual nodes live in a separate store.</p>
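<p>To make the indirection concrete, here is a small sketch of resolving a path by following name-to-hash references through a store. Plain dictionaries and made-up hash strings stand in for the node classes, and <code>resolve</code> is an illustrative name.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">def resolve(nodes, root_hash, path):
    """Follow name -> hash references from the root down to the node at `path`."""
    current = root_hash
    for name in [part for part in path.split("/") if part]:
        children = nodes[current]["children"]  # one store lookup per path component
        current = children[name]
    return current


# Placeholder hashes; a real store would use SHA-256 digests as keys.
nodes = {
    "h_file": {"content": b"print('hello')"},
    "h_project": {"children": {"main.py": "h_file"}},
    "h_root": {"children": {"project": "h_project"}},
}

target = resolve(nodes, "h_root", "/project/main.py")
</code></pre></div>
<p>Every step goes through the store, so a directory never holds a child object, only the key needed to look it up.</p>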
<h2 id="the-dag-structure">The DAG Structure</h2>
<p>This reference-based approach creates a Directed Acyclic Graph (DAG):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">FileSystem</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Content-addressable virtual filesystem.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="c1"># The DAG: hash -&gt; Node</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">nodes</span><span class="p">:</span> <span class="n">Dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">Node</span><span class="p">]</span> <span class="o">=</span> <span class="p">{}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="c1"># Path index: absolute path -&gt; hash</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">paths</span><span class="p">:</span> <span class="n">Dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">str</span><span class="p">]</span> <span class="o">=</span> <span class="p">{}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">_add_node</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">node</span><span class="p">:</span> <span class="n">Node</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;&#34;&#34;Add a node to the DAG, returning its hash.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">        <span class="n">node_hash</span> <span class="o">=</span> <span class="n">node</span><span class="o">.</span><span class="n">compute_hash</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">node_hash</span> <span class="ow">not</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">nodes</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="bp">self</span><span class="o">.</span><span class="n">nodes</span><span class="p">[</span><span class="n">node_hash</span><span class="p">]</span> <span class="o">=</span> <span class="n">node</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="n">node_hash</span>
</span></span></code></pre></div><p>When we add a node, we compute its hash and store the mapping <code>hash → node</code>. If an identical node already exists (same hash), we don&rsquo;t duplicate it—we just return the existing hash. <strong>Deduplication is automatic.</strong></p>
<h2 id="immutable-updates">Immutable Updates</h2>
<p>Here&rsquo;s where immutability shines. When we write to a file, we don&rsquo;t modify anything. Instead, we:</p>
<ol>
<li>Create a new FileNode with the new content</li>
<li>Create a new DirNode for the parent, pointing to the new file hash</li>
<li>Update the path index</li>
</ol>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">write</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">path</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">content</span><span class="p">:</span> <span class="nb">bytes</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">bool</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Write content to a file.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="n">parent_path</span><span class="p">,</span> <span class="n">name</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">_get_parent_path</span><span class="p">(</span><span class="n">path</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">parent_hash</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">paths</span><span class="p">[</span><span class="n">parent_path</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">    <span class="n">parent</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">nodes</span><span class="p">[</span><span class="n">parent_hash</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># Create new file node</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># mode is a required field; 0o644 is an assumed default for new files</span>
</span></span><span class="line"><span class="cl">    <span class="n">file_node</span> <span class="o">=</span> <span class="n">FileNode</span><span class="p">(</span><span class="n">mode</span><span class="o">=</span><span class="mo">0o644</span><span class="p">,</span> <span class="n">content</span><span class="o">=</span><span class="n">content</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">file_hash</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">_add_node</span><span class="p">(</span><span class="n">file_node</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># Create new parent directory with updated child reference</span>
</span></span><span class="line"><span class="cl">    <span class="n">new_children</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">(</span><span class="n">parent</span><span class="o">.</span><span class="n">children</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">new_children</span><span class="p">[</span><span class="n">name</span><span class="p">]</span> <span class="o">=</span> <span class="n">file_hash</span>
</span></span><span class="line"><span class="cl">    <span class="n">new_parent</span> <span class="o">=</span> <span class="n">DirNode</span><span class="p">(</span><span class="n">mode</span><span class="o">=</span><span class="n">parent</span><span class="o">.</span><span class="n">mode</span><span class="p">,</span> <span class="n">children</span><span class="o">=</span><span class="n">new_children</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">new_parent_hash</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">_add_node</span><span class="p">(</span><span class="n">new_parent</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># Update path index</span>
</span></span><span class="line"><span class="cl">    <span class="bp">self</span><span class="o">.</span><span class="n">paths</span><span class="p">[</span><span class="n">parent_path</span><span class="p">]</span> <span class="o">=</span> <span class="n">new_parent_hash</span>
</span></span><span class="line"><span class="cl">    <span class="bp">self</span><span class="o">.</span><span class="n">paths</span><span class="p">[</span><span class="n">path</span><span class="p">]</span> <span class="o">=</span> <span class="n">file_hash</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="kc">True</span>
</span></span></code></pre></div><p>The old FileNode still exists in <code>self.nodes</code>. The old DirNode still exists too. We&rsquo;ve just created new versions and updated where the path points. This is <strong>structural sharing</strong>—unchanged parts of the tree are shared between versions.</p>
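<p>The <code>write</code> above rebuilds only the immediate parent. In a Git-style tree, where each directory&rsquo;s hash depends on its children&rsquo;s hashes, a change also produces a new version of every ancestor up to the root. The sketch below shows that path-copying variant. It uses plain dictionaries instead of the node classes, string content so the nodes are JSON-serializable, and the hypothetical helper names <code>put</code> and <code>write_path</code>. It assumes every intermediate directory already exists.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import hashlib
import json


def put(nodes, node):
    """Store a node under the hash of its JSON form and return the hash."""
    key = hashlib.sha256(json.dumps(node, sort_keys=True).encode()).hexdigest()
    nodes.setdefault(key, node)
    return key


def write_path(nodes, dir_hash, parts, content):
    """Return the hash of a new directory: a copy of dir_hash with parts set to content."""
    children = dict(nodes[dir_hash]["children"])  # copy; the old directory is untouched
    name, rest = parts[0], parts[1:]
    if rest:
        # Recurse: the child directory is rebuilt and its new hash recorded here.
        children[name] = write_path(nodes, children[name], rest, content)
    else:
        children[name] = put(nodes, {"content": content})
    return put(nodes, {"children": children})


nodes = {}
readme = put(nodes, {"content": "docs"})
src = put(nodes, {"children": {}})
root_v1 = put(nodes, {"children": {"README": readme, "src": src}})
root_v2 = write_path(nodes, root_v1, ["src", "main.py"], "print('hello')")
</code></pre></div>
<p>Both roots remain in <code>nodes</code>. They share the untouched <code>README</code> node by hash, and only the directories on the path to <code>main.py</code> were rebuilt.</p>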
<h2 id="visualizing-the-dag">Visualizing the DAG</h2>
<p>Let&rsquo;s trace through an example:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">fs</span> <span class="o">=</span> <span class="n">FileSystem</span><span class="p">()</span>
</span></span><span class="line"><span class="cl"><span class="n">fs</span><span class="o">.</span><span class="n">mkdir</span><span class="p">(</span><span class="s2">&#34;/project&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">fs</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="s2">&#34;/project/main.py&#34;</span><span class="p">,</span> <span class="sa">b</span><span class="s2">&#34;print(&#39;hello&#39;)&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">fs</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="s2">&#34;/project/main.py&#34;</span><span class="p">,</span> <span class="sa">b</span><span class="s2">&#34;print(&#39;world&#39;)&#34;</span><span class="p">)</span>
</span></span></code></pre></div><p>After these operations, our DAG contains:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">Hash: a1b2c3... → DirNode(children={})           # original /project
</span></span><span class="line"><span class="cl">Hash: d4e5f6... → FileNode(&#34;print(&#39;hello&#39;)&#34;)     # first version
</span></span><span class="line"><span class="cl">Hash: g7h8i9... → DirNode(children={&#34;main.py&#34;: &#34;d4e5f6...&#34;})
</span></span><span class="line"><span class="cl">Hash: j0k1l2... → FileNode(&#34;print(&#39;world&#39;)&#34;)     # second version
</span></span><span class="line"><span class="cl">Hash: m3n4o5... → DirNode(children={&#34;main.py&#34;: &#34;j0k1l2...&#34;})
</span></span></code></pre></div><p>Both versions of <code>main.py</code> exist. The path <code>/project/main.py</code> points to the latest hash (<code>j0k1l2...</code>), but we could easily restore the old version if we tracked which hashes corresponded to which versions.</p>
<h2 id="benefits-in-practice">Benefits in Practice</h2>
<p>This design enables powerful features almost for free:</p>
<p><strong>Snapshots</strong>: Save the current <code>paths</code> dictionary. Restore it later to go back in time.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">snapshot</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">str</span><span class="p">]:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Create a snapshot of the current filesystem state.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="nb">dict</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">paths</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">restore</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">snapshot</span><span class="p">:</span> <span class="n">Dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">str</span><span class="p">]):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Restore filesystem to a previous snapshot.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="bp">self</span><span class="o">.</span><span class="n">paths</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">(</span><span class="n">snapshot</span><span class="p">)</span>
</span></span></code></pre></div><p><strong>Deduplication</strong>: Multiple paths can point to the same hash.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># These share one node only if the content and the hashed metadata (including mtime) are identical</span>
</span></span><span class="line"><span class="cl"><span class="n">fs</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="s2">&#34;/file1.txt&#34;</span><span class="p">,</span> <span class="sa">b</span><span class="s2">&#34;hello&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">fs</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="s2">&#34;/file2.txt&#34;</span><span class="p">,</span> <span class="sa">b</span><span class="s2">&#34;hello&#34;</span><span class="p">)</span>  <span class="c1"># Same hash, and no new storage, when they match</span>
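</span></span></code></pre></div><p><strong>Integrity checking</strong>: Because a node is stored under the hash of its content, we can recompute that hash whenever a file is read and compare it with the stored key to detect corruption.</p>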
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">verify</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">path</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">bool</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Verify a file&#39;s integrity.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="n">node_hash</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">paths</span><span class="p">[</span><span class="n">path</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">    <span class="n">node</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">nodes</span><span class="p">[</span><span class="n">node_hash</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">node</span><span class="o">.</span><span class="n">compute_hash</span><span class="p">()</span> <span class="o">==</span> <span class="n">node_hash</span>
</span></span></code></pre></div><h2 id="the-functional-programming-connection">The Functional Programming Connection</h2>
<p>This approach is deeply connected to functional programming. In FP:</p>
<ul>
<li>Data is immutable</li>
<li>&ldquo;Changes&rdquo; create new values</li>
<li>Sharing is safe because nothing mutates</li>
</ul>
<p>Our filesystem follows these principles exactly. Nodes are frozen. &ldquo;Writing&rdquo; creates new nodes. Multiple paths can safely share nodes because nodes never change.</p>
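<p>This is why Clojure&rsquo;s persistent data structures, Haskell&rsquo;s pure values, and Git&rsquo;s object store all build on the same foundation of immutable values with shared structure; Git additionally names each object by the hash of its content. <strong>Content addressing + immutability = safe, efficient, verifiable data.</strong></p>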
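<p>A minimal sketch of these principles, assuming a hypothetical <code>Node</code> class, a hypothetical <code>put</code> helper, and SHA-256 as the hash. None of these names are taken from DagShell itself:</p>

```python
import hashlib
from dataclasses import dataclass, FrozenInstanceError
from typing import Dict


@dataclass(frozen=True)
class Node:
    """Hypothetical immutable file node; not DagShell's real class."""
    content: bytes

    def compute_hash(self) -> str:
        return hashlib.sha256(self.content).hexdigest()


nodes: Dict[str, Node] = {}   # hash -> node (the content-addressed store)
paths: Dict[str, str] = {}    # path -> hash


def put(path: str, content: bytes) -> str:
    """'Write' by creating a new node; existing nodes are never modified."""
    node = Node(content)
    key = node.compute_hash()
    nodes.setdefault(key, node)   # identical content reuses the stored node
    paths[path] = key
    return key


put("/a.txt", b"hello")
put("/b.txt", b"hello")        # shares the node created for /a.txt
old = dict(paths)              # snapshot: only the path table is copied
put("/a.txt", b"changed")      # new node; the old one is still intact

old_key = old["/a.txt"]
original = nodes[old_key]
try:
    original.content = b"tampered"
except FrozenInstanceError:
    print("nodes are frozen")
```

<p>Both paths initially map to one stored node, and the snapshot still resolves to the original bytes after the later write, because the write only changed the path table.</p>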
<h2 id="trade-offs">Trade-offs</h2>
<p>Nothing is free. This approach has costs:</p>
<ol>
<li>
<p><strong>Memory</strong>: Old versions accumulate. You need garbage collection to reclaim space from unreachable nodes.</p>
</li>
<li>
<p><strong>Performance</strong>: Creating new nodes for every change can be slower than in-place mutation for write-heavy workloads.</p>
</li>
<li>
<p><strong>Complexity</strong>: Path resolution requires extra indirection through the hash table.</p>
</li>
</ol>
<p>For many use cases—especially those valuing history, integrity, and safe concurrency—these trade-offs are worthwhile.</p>
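<p>The memory cost can be bounded with a reachability sweep. The following is a rough sketch, assuming flat <code>paths</code> (path to hash) and <code>nodes</code> (hash to content) tables and a hypothetical <code>collect_garbage</code> helper. Directory nodes that reference children would need a graph traversal, which is omitted here:</p>

```python
from typing import Dict, Iterable, Set


def collect_garbage(
    nodes: Dict[str, bytes],
    paths: Dict[str, str],
    snapshots: Iterable[Dict[str, str]] = (),
) -> int:
    """Drop nodes that no live path or retained snapshot refers to.

    Returns the number of nodes removed. Assumes flat files only.
    """
    live: Set[str] = set(paths.values())
    for snap in snapshots:
        live.update(snap.values())
    dead = [h for h in nodes if h not in live]
    for h in dead:
        del nodes[h]
    return len(dead)


# Three versions of one file: the current one, one kept by a snapshot,
# and one that nothing refers to any more.
nodes = {"h1": b"v1", "h2": b"v2", "h3": b"v3"}
paths = {"/file.txt": "h3"}
kept = [{"/file.txt": "h2"}]
removed = collect_garbage(nodes, paths, kept)
print(removed, sorted(nodes))
```

<p>Snapshots have to be passed in explicitly: a node that only an old snapshot refers to is still reachable, and collecting it would break <code>restore</code>.</p>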
<h2 id="conclusion">Conclusion</h2>
<p>Content addressing transforms how we think about data. Instead of &ldquo;where is this file?&rdquo; we ask &ldquo;what is this content&rsquo;s identity?&rdquo; Instead of destructive updates, we create new versions while sharing unchanged structure.</p>
<p>This pattern appears everywhere: Git, IPFS, Nix, Docker layers, and many database internals. Understanding it opens doors to building robust, elegant systems.</p>
<p>The full implementation in <a href="https://github.com/queelius/dagshell">DagShell</a> extends these ideas with a complete POSIX-like interface, demonstrating how content addressing can underpin a full virtual filesystem.</p>
<hr>
<p><em>Next in this series: <a href="../02-unix-philosophy-python/">Unix Philosophy in Python</a> — building composable commands with method chaining.</em></p>
]]></content:encoded></item><item><title>Fine-Tuning a Tiny LLM for ElasticSearch DSL</title><link>https://metafunctor.com/post/llm-fine-tuning-es-dsl/</link><pubDate>Mon, 19 Feb 2024 00:00:00 +0000</pubDate><dc:creator>Alex Towell</dc:creator><guid isPermaLink="true">https://metafunctor.com/post/llm-fine-tuning-es-dsl/</guid><description>Fine-tuning a small language model to generate ElasticSearch DSL queries from natural language, as a proof of concept for domain-specific LLM specialization.</description><content:encoded><![CDATA[<p>I am fine-tuning a small LLM to generate ElasticSearch DSL queries from natural language. The project is on <a href="https://github.com/queelius/elasticsearch-lm">GitHub</a>.</p>
<p>The idea: take a task that large models handle well (translating &ldquo;find all orders over $100 from last month&rdquo; into the corresponding ElasticSearch JSON query) and see if a tiny model can learn it from synthetic data.</p>
<p>The data pipeline works like this. I started by generating examples from GPT-4. Then I wrote a script that samples from those outputs and uses them as few-shot examples for Mistral, which generates a much larger synthetic dataset. The next step is reshaping the data into the expected format and fine-tuning.</p>
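<p>The sampling step might look roughly like the sketch below. This is not the project's actual script: <code>SEED_EXAMPLES</code>, <code>build_few_shot_prompt</code>, the prompt wording, and the seed pairs themselves are all illustrative assumptions, and the call to Mistral is left out.</p>

```python
import json
import random
from typing import Dict, List, Optional

# Illustrative seed pairs only; real seeds would come from the GPT-4 outputs
SEED_EXAMPLES: List[Dict[str, object]] = [
    {"request": "orders over $100",
     "query": {"query": {"range": {"total": {"gt": 100}}}}},
    {"request": "orders with status shipped",
     "query": {"query": {"term": {"status": "shipped"}}}},
    {"request": "products mentioning wireless",
     "query": {"query": {"match": {"description": "wireless"}}}},
    {"request": "all documents",
     "query": {"query": {"match_all": {}}}},
]


def build_few_shot_prompt(request: str, k: int = 3,
                          rng: Optional[random.Random] = None) -> str:
    """Sample k seed pairs and format them ahead of the new request."""
    rng = rng or random.Random()
    shots = rng.sample(SEED_EXAMPLES, min(k, len(SEED_EXAMPLES)))
    parts = ["Translate each request into an ElasticSearch query."]
    for shot in shots:
        parts.append(
            f"Request: {shot['request']}\nQuery: {json.dumps(shot['query'])}"
        )
    # Leave the final query open for the model to complete
    parts.append(f"Request: {request}\nQuery:")
    return "\n\n".join(parts)


prompt = build_few_shot_prompt("users created last week", rng=random.Random(0))
print(prompt)
```

<p>Resampling the shots for every generated example varies the context the generator sees, which is one plausible way to get more diverse synthetic pairs out of a small seed set.</p>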
<p>This is early stage. The synthetic data exists, but the fine-tuning has not happened yet. I will update this post with results.</p>
<p>If you are interested in collaborating on this or related projects, email me at <a href="mailto:lex@metafunctor.com">lex@metafunctor.com</a>.</p>
]]></content:encoded><category>large language models</category><category>fine-tuning</category><category>information retrieval</category><category>elastic search</category><category>domain-specific language</category><category>json</category></item><item><title>Unix Philosophy in Python: Composable Commands with Method Chaining</title><link>https://metafunctor.com/projects/dagshell/posts/02-unix-philosophy-python/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><dc:creator>Alex Towell</dc:creator><guid isPermaLink="true">https://metafunctor.com/projects/dagshell/posts/02-unix-philosophy-python/</guid><description>How to build pipeable, chainable interfaces that do one thing well</description><content:encoded><![CDATA[<p>The Unix philosophy, articulated by Doug McIlroy, can be summarized as:</p>
<blockquote>
<p>Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface.</p>
</blockquote>
<p>This philosophy gave us pipes (<code>|</code>), small focused utilities (<code>cat</code>, <code>grep</code>, <code>sort</code>), and the ability to compose complex operations from simple parts:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">cat access.log <span class="p">|</span> grep <span class="s2">&#34;404&#34;</span> <span class="p">|</span> cut -d<span class="s1">&#39; &#39;</span> -f1 <span class="p">|</span> sort <span class="p">|</span> uniq -c <span class="p">|</span> sort -rn
</span></span></code></pre></div><p>Can we bring this composability to Python? Not just subprocess calls, but native Python objects that pipe and chain as naturally as Unix commands? Let&rsquo;s build it.</p>
<h2 id="the-problem-with-methods">The Problem with Methods</h2>
<p>Plain function calls in Python compose awkwardly. Consider:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># We want to: read a file, filter lines, count words</span>
</span></span><span class="line"><span class="cl"><span class="n">content</span> <span class="o">=</span> <span class="n">read_file</span><span class="p">(</span><span class="s2">&#34;/data/log.txt&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">filtered</span> <span class="o">=</span> <span class="n">grep</span><span class="p">(</span><span class="n">content</span><span class="p">,</span> <span class="s2">&#34;ERROR&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">count</span> <span class="o">=</span> <span class="n">wc</span><span class="p">(</span><span class="n">filtered</span><span class="p">)</span>
</span></span></code></pre></div><p>This works, but it needs a named intermediate variable for every step. Nesting the calls instead, as in <code>wc(grep(read_file(&#34;/data/log.txt&#34;), &#34;ERROR&#34;))</code>, removes the variables but reads inside-out. We think &ldquo;read, then filter, then count&rdquo;: a left-to-right pipeline.</p>
<p>Method chaining helps:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">read_file</span><span class="p">(</span><span class="s2">&#34;/data/log.txt&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">grep</span><span class="p">(</span><span class="s2">&#34;ERROR&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">wc</span><span class="p">()</span>
</span></span></code></pre></div><p>Now it flows left-to-right. But how do we make arbitrary operations chainable?</p>
<h2 id="the-commandresult-pattern">The CommandResult Pattern</h2>
<p>The key insight is wrapping every result in a chainable container:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">dataclasses</span> <span class="kn">import</span> <span class="n">dataclass</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">Any</span><span class="p">,</span> <span class="n">Optional</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nd">@dataclass</span>
</span></span><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">CommandResult</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Wrapper that enables method chaining.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="n">data</span><span class="p">:</span> <span class="n">Any</span>           <span class="c1"># The actual result data</span>
</span></span><span class="line"><span class="cl">    <span class="n">text</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span> <span class="o">=</span> <span class="kc">None</span>    <span class="c1"># Text representation; __str__ falls back to str(data)</span>
</span></span><span class="line"><span class="cl">    <span class="n">exit_code</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">0</span>  <span class="c1"># Unix-style exit code</span>
</span></span><span class="line"><span class="cl">    <span class="n">_shell</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="s1">&#39;DagShell&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="kc">None</span>  <span class="c1"># Reference back to the shell</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="fm">__str__</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="bp">self</span><span class="o">.</span><span class="n">text</span> <span class="ow">is</span> <span class="ow">not</span> <span class="kc">None</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">text</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="nb">str</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">data</span><span class="p">)</span>
</span></span></code></pre></div><p>Every command returns a <code>CommandResult</code>. The result carries the data <em>and</em> a reference to the shell that produced it. This reference enables chaining—we can call more methods through it.</p>
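<p>One generic way to get this behavior, shown only as a sketch and not necessarily how DagShell does it, is to forward unknown attribute lookups from the result to the shell with <code>__getattr__</code>. The names <code>MiniShell</code>, <code>Result</code>, and <code>to_lines</code> are hypothetical, and the sketch assumes every command accepts an <code>input_data</code> keyword:</p>

```python
from dataclasses import dataclass
from typing import Any, List, Optional


def to_lines(value: Any) -> List[str]:
    """Accept either a list of lines or a single string."""
    if isinstance(value, list):
        return [str(v) for v in value]
    return str(value).splitlines()


class MiniShell:
    """Hypothetical stand-in for the shell; each command takes input_data."""

    def echo(self, text: str, input_data: Any = None) -> "Result":
        return Result(data=text, shell=self)

    def grep(self, pattern: str, input_data: Any = None) -> "Result":
        lines = to_lines(input_data)
        return Result(data=[ln for ln in lines if pattern in ln], shell=self)

    def count(self, input_data: Any = None) -> "Result":
        return Result(data=len(to_lines(input_data)), shell=self)


@dataclass
class Result:
    data: Any
    shell: Optional[MiniShell] = None

    def __getattr__(self, name: str):
        # Called only when normal attribute lookup fails, e.g. result.grep
        if name.startswith("_") or self.shell is None:
            raise AttributeError(name)
        command = getattr(self.shell, name)

        def chained(*args, **kwargs):
            # Feed this result's data to the next command
            return command(*args, input_data=self.data, **kwargs)

        return chained


shell = MiniShell()
result = shell.echo("ok\nERROR a\nERROR b").grep("ERROR").count()
print(result.data)
```

<p>The trade-off is discoverability: forwarded commands do not show up in autocompletion or type checkers, whereas explicitly written wrapper methods do.</p>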
<h2 id="making-commands-chainable">Making Commands Chainable</h2>
<p>Here&rsquo;s a simple command implementation:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">DagShell</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">echo</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="o">*</span><span class="n">args</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">CommandResult</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;&#34;&#34;Echo arguments to output.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">        <span class="n">text</span> <span class="o">=</span> <span class="s1">&#39; &#39;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="n">arg</span><span class="p">)</span> <span class="k">for</span> <span class="n">arg</span> <span class="ow">in</span> <span class="n">args</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="n">CommandResult</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">            <span class="n">data</span><span class="o">=</span><span class="n">text</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="n">text</span><span class="o">=</span><span class="n">text</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="n">exit_code</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="n">_shell</span><span class="o">=</span><span class="bp">self</span>
</span></span><span class="line"><span class="cl">        <span class="p">)</span>
</span></span></code></pre></div><p>The magic happens in <code>CommandResult</code>. We add methods that delegate back to the shell:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="nd">@dataclass</span>
</span></span><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">CommandResult</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># ... fields as before ...</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">grep</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">pattern</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="s1">&#39;CommandResult&#39;</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;&#34;&#34;Filter lines matching pattern.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">_shell</span><span class="o">.</span><span class="n">grep</span><span class="p">(</span><span class="n">pattern</span><span class="p">,</span> <span class="n">input_data</span><span class="o">=</span><span class="bp">self</span><span class="o">.</span><span class="n">data</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">wc</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="o">*</span><span class="n">flags</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="s1">&#39;CommandResult&#39;</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;&#34;&#34;Count lines, words, or characters.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">_shell</span><span class="o">.</span><span class="n">wc</span><span class="p">(</span><span class="o">*</span><span class="n">flags</span><span class="p">,</span> <span class="n">input_data</span><span class="o">=</span><span class="bp">self</span><span class="o">.</span><span class="n">data</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">out</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">path</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="s1">&#39;CommandResult&#39;</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;&#34;&#34;Redirect output to a file.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">_shell</span><span class="o">.</span><span class="n">fs</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="nb">str</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span><span class="o">.</span><span class="n">encode</span><span class="p">())</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="bp">self</span>
</span></span></code></pre></div><p>Now we can chain:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">shell</span><span class="o">.</span><span class="n">echo</span><span class="p">(</span><span class="s2">&#34;hello world&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">wc</span><span class="p">(</span><span class="s2">&#34;-w&#34;</span><span class="p">)</span>  <span class="c1"># Returns: 2</span>
</span></span><span class="line"><span class="cl"><span class="n">shell</span><span class="o">.</span><span class="n">cat</span><span class="p">(</span><span class="s2">&#34;/data/log.txt&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">grep</span><span class="p">(</span><span class="s2">&#34;ERROR&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">wc</span><span class="p">(</span><span class="s2">&#34;-l&#34;</span><span class="p">)</span>  <span class="c1"># Count error lines</span>
</span></span></code></pre></div><h2 id="the-dual-nature-objects-and-files">The Dual Nature: Objects and Files</h2>
<p>Unix commands have a superpower: the same output can go to the screen, a file, or another command. We can replicate this:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># Chain more commands</span>
</span></span><span class="line"><span class="cl"><span class="n">result</span> <span class="o">=</span> <span class="n">shell</span><span class="o">.</span><span class="n">cat</span><span class="p">(</span><span class="s2">&#34;/data/log.txt&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">grep</span><span class="p">(</span><span class="s2">&#34;ERROR&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Get as Python object</span>
</span></span><span class="line"><span class="cl"><span class="n">lines</span> <span class="o">=</span> <span class="n">result</span><span class="o">.</span><span class="n">lines</span><span class="p">()</span>  <span class="c1"># List[str]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Or redirect to a file</span>
</span></span><span class="line"><span class="cl"><span class="n">result</span><span class="o">.</span><span class="n">out</span><span class="p">(</span><span class="s2">&#34;/data/errors.txt&#34;</span><span class="p">)</span>
</span></span></code></pre></div><p>The <code>CommandResult</code> is simultaneously:</p>
<ol>
<li>A Python object you can inspect and manipulate</li>
<li>A text stream you can redirect to files</li>
<li>An input source for the next command in a pipeline</li>
</ol>
<p>This dual nature—structured data <em>and</em> text stream—bridges the gap between Unix philosophy and Python&rsquo;s object orientation.</p>
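<p>The three roles can be seen side by side in a small self-contained sketch. <code>TextResult</code> is a hypothetical simplification, with a plain dictionary standing in for the virtual filesystem; the <code>lines()</code> body is one possible implementation, not DagShell's:</p>

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TextResult:
    """Hypothetical result type; `files` stands in for the virtual filesystem."""
    data: List[str]
    files: Dict[str, bytes] = field(default_factory=dict)

    def __str__(self) -> str:
        # Text-stream view: one line per element
        return "\n".join(self.data)

    def lines(self) -> List[str]:
        # Object view: a copy, so callers cannot mutate the result
        return list(self.data)

    def out(self, path: str) -> "TextResult":
        # Redirect: store the text view as bytes and keep the chain going
        self.files[path] = str(self).encode()
        return self

    def grep(self, pattern: str) -> "TextResult":
        # Pipeline view: this result is the next command's input
        return TextResult([ln for ln in self.data if pattern in ln], self.files)


files: Dict[str, bytes] = {}
log = TextResult(["INFO start", "ERROR disk", "ERROR net"], files)
errors = log.grep("ERROR").out("/errors.txt")
print(errors.lines())
print(files["/errors.txt"])
```

<p>Returning <code>self</code> from <code>out</code> is what lets a redirect sit in the middle of a chain, much like <code>tee</code> in a shell pipeline.</p>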
<h2 id="implementing-piping">Implementing Piping</h2>
<p>True Unix pipes pass data between processes. We simulate this with a &ldquo;last result&rdquo; mechanism:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">DagShell</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">_last_result</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="n">CommandResult</span><span class="p">]</span> <span class="o">=</span> <span class="kc">None</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">_</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">CommandResult</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;&#34;&#34;Return the last command&#39;s result (like _ in the Python REPL).&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="bp">self</span><span class="o">.</span><span class="n">_last_result</span> <span class="ow">is</span> <span class="kc">None</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="k">return</span> <span class="n">CommandResult</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="s1">&#39;&#39;</span><span class="p">,</span> <span class="n">text</span><span class="o">=</span><span class="s1">&#39;&#39;</span><span class="p">,</span> <span class="n">exit_code</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">_shell</span><span class="o">=</span><span class="bp">self</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">_last_result</span>
</span></span></code></pre></div><p>Now we can build pipelines:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">shell</span><span class="o">.</span><span class="n">cat</span><span class="p">(</span><span class="s2">&#34;/data/log.txt&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">shell</span><span class="o">.</span><span class="n">_</span><span class="p">()</span><span class="o">.</span><span class="n">grep</span><span class="p">(</span><span class="s2">&#34;ERROR&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">shell</span><span class="o">.</span><span class="n">_</span><span class="p">()</span><span class="o">.</span><span class="n">wc</span><span class="p">(</span><span class="s2">&#34;-l&#34;</span><span class="p">)</span>
</span></span></code></pre></div><p>Or more elegantly, commands can accept piped input:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">grep</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">pattern</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">input_data</span><span class="o">=</span><span class="kc">None</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">CommandResult</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Filter lines matching pattern.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">input_data</span> <span class="ow">is</span> <span class="kc">None</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">input_data</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">_last_result</span><span class="o">.</span><span class="n">data</span> <span class="k">if</span> <span class="bp">self</span><span class="o">.</span><span class="n">_last_result</span> <span class="k">else</span> <span class="s1">&#39;&#39;</span>
</span></span><span class="line"><span class="cl">
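</span></span><span class="line"><span class="cl">    <span class="c1"># Chained commands may pass a list of lines; str() on a list would give its repr</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">input_data</span><span class="p">,</span> <span class="nb">list</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="n">lines</span> <span class="o">=</span> <span class="p">[</span><span class="nb">str</span><span class="p">(</span><span class="n">line</span><span class="p">)</span> <span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">input_data</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">    <span class="k">else</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">lines</span> <span class="o">=</span> <span class="nb">str</span><span class="p">(</span><span class="n">input_data</span><span class="p">)</span><span class="o">.</span><span class="n">splitlines</span><span class="p">()</span>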
</span></span><span class="line"><span class="cl">    <span class="n">matching</span> <span class="o">=</span> <span class="p">[</span><span class="n">line</span> <span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">lines</span> <span class="k">if</span> <span class="n">pattern</span> <span class="ow">in</span> <span class="n">line</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">result</span> <span class="o">=</span> <span class="n">CommandResult</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">        <span class="n">data</span><span class="o">=</span><span class="n">matching</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="n">text</span><span class="o">=</span><span class="s1">&#39;</span><span class="se">\n</span><span class="s1">&#39;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">matching</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">        <span class="n">exit_code</span><span class="o">=</span><span class="mi">0</span> <span class="k">if</span> <span class="n">matching</span> <span class="k">else</span> <span class="mi">1</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="n">_shell</span><span class="o">=</span><span class="bp">self</span>
</span></span><span class="line"><span class="cl">    <span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="bp">self</span><span class="o">.</span><span class="n">_last_result</span> <span class="o">=</span> <span class="n">result</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">result</span>
</span></span></code></pre></div><h2 id="method-chaining-in-action">Method Chaining in Action</h2>
<p>Let&rsquo;s build something real—a log analysis pipeline:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># Create a shell and some test data</span>
</span></span><span class="line"><span class="cl"><span class="n">shell</span> <span class="o">=</span> <span class="n">DagShell</span><span class="p">()</span>
</span></span><span class="line"><span class="cl"><span class="n">shell</span><span class="o">.</span><span class="n">mkdir</span><span class="p">(</span><span class="s2">&#34;/logs&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">shell</span><span class="o">.</span><span class="n">echo</span><span class="p">(</span><span class="s2">&#34;&#34;&#34;
</span></span></span><span class="line"><span class="cl"><span class="s2">2024-01-15 10:30:00 INFO  User login: alice
</span></span></span><span class="line"><span class="cl"><span class="s2">2024-01-15 10:31:00 ERROR Database connection failed
</span></span></span><span class="line"><span class="cl"><span class="s2">2024-01-15 10:32:00 INFO  User login: bob
</span></span></span><span class="line"><span class="cl"><span class="s2">2024-01-15 10:33:00 ERROR Timeout waiting for response
</span></span></span><span class="line"><span class="cl"><span class="s2">2024-01-15 10:34:00 WARN  High memory usage
</span></span></span><span class="line"><span class="cl"><span class="s2">2024-01-15 10:35:00 ERROR Disk space low
</span></span></span><span class="line"><span class="cl"><span class="s2">&#34;&#34;&#34;</span><span class="o">.</span><span class="n">strip</span><span class="p">())</span><span class="o">.</span><span class="n">out</span><span class="p">(</span><span class="s2">&#34;/logs/app.log&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Pipeline: find errors, extract timestamps, save to file</span>
</span></span><span class="line"><span class="cl"><span class="p">(</span><span class="n">shell</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">cat</span><span class="p">(</span><span class="s2">&#34;/logs/app.log&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">grep</span><span class="p">(</span><span class="s2">&#34;ERROR&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">cut</span><span class="p">(</span><span class="n">delimiter</span><span class="o">=</span><span class="s2">&#34; &#34;</span><span class="p">,</span> <span class="n">fields</span><span class="o">=</span><span class="s2">&#34;1,2&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">out</span><span class="p">(</span><span class="s2">&#34;/logs/error_times.txt&#34;</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Read the result</span>
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="n">shell</span><span class="o">.</span><span class="n">cat</span><span class="p">(</span><span class="s2">&#34;/logs/error_times.txt&#34;</span><span class="p">))</span>
</span></span><span class="line"><span class="cl"><span class="c1"># Output:</span>
</span></span><span class="line"><span class="cl"><span class="c1"># 2024-01-15 10:31:00</span>
</span></span><span class="line"><span class="cl"><span class="c1"># 2024-01-15 10:33:00</span>
</span></span><span class="line"><span class="cl"><span class="c1"># 2024-01-15 10:35:00</span>
</span></span></code></pre></div><p>Each step does one thing. The chain composes them into a useful operation. The data flows left-to-right, just like our mental model.</p>
<h2 id="directory-navigation-with-a-stack">Directory Navigation with a Stack</h2>
<p>Shells provide <code>cd</code>, and many (Bash, Zsh, csh) also provide <code>pushd</code> and <code>popd</code> to maintain a directory stack. We implement the same idea:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">DagShell</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">_cwd</span> <span class="o">=</span> <span class="s1">&#39;/&#39;</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">_dir_stack</span> <span class="o">=</span> <span class="p">[]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">pushd</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">path</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">CommandResult</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;&#34;&#34;Push current directory and change to new one.&#34;&#34;&#34;</span>
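</span></span><span class="line"><span class="cl">        <span class="n">previous</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">_cwd</span>
</span></span><span class="line"><span class="cl">        <span class="n">result</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">cd</span><span class="p">(</span><span class="n">path</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="c1"># Only record the old directory if the change succeeded</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">result</span><span class="o">.</span><span class="n">exit_code</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="bp">self</span><span class="o">.</span><span class="n">_dir_stack</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">previous</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="n">result</span>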
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">popd</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">CommandResult</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;&#34;&#34;Pop directory from stack and change to it.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="ow">not</span> <span class="bp">self</span><span class="o">.</span><span class="n">_dir_stack</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="k">return</span> <span class="n">CommandResult</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="s1">&#39;&#39;</span><span class="p">,</span> <span class="n">text</span><span class="o">=</span><span class="s1">&#39;popd: directory stack empty&#39;</span><span class="p">,</span> <span class="n">exit_code</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">old_dir</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">_dir_stack</span><span class="o">.</span><span class="n">pop</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">cd</span><span class="p">(</span><span class="n">old_dir</span><span class="p">)</span>
</span></span></code></pre></div><p>Now we can navigate without losing our place:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">shell</span><span class="o">.</span><span class="n">pushd</span><span class="p">(</span><span class="s2">&#34;/project/src&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="c1"># ... work in src ...</span>
</span></span><span class="line"><span class="cl"><span class="n">shell</span><span class="o">.</span><span class="n">pushd</span><span class="p">(</span><span class="s2">&#34;tests&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="c1"># ... work in tests ...</span>
</span></span><span class="line"><span class="cl"><span class="n">shell</span><span class="o">.</span><span class="n">popd</span><span class="p">()</span>  <span class="c1"># back to /project/src</span>
</span></span><span class="line"><span class="cl"><span class="n">shell</span><span class="o">.</span><span class="n">popd</span><span class="p">()</span>  <span class="c1"># back to original directory</span>
</span></span></code></pre></div><h2 id="exit-codes-success-and-failure">Exit Codes: Success and Failure</h2>
<p>Unix commands return exit codes: 0 for success, non-zero for failure. We include this in <code>CommandResult</code>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">grep</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">pattern</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">CommandResult</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># ... filtering logic ...</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">CommandResult</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">        <span class="n">data</span><span class="o">=</span><span class="n">matching</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="n">text</span><span class="o">=</span><span class="s1">&#39;</span><span class="se">\n</span><span class="s1">&#39;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">matching</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">        <span class="n">exit_code</span><span class="o">=</span><span class="mi">0</span> <span class="k">if</span> <span class="n">matching</span> <span class="k">else</span> <span class="mi">1</span><span class="p">,</span>  <span class="c1"># 1 if no matches</span>
</span></span><span class="line"><span class="cl">        <span class="n">_shell</span><span class="o">=</span><span class="bp">self</span>
</span></span><span class="line"><span class="cl">    <span class="p">)</span>
</span></span></code></pre></div><p>This enables conditional logic:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">result</span> <span class="o">=</span> <span class="n">shell</span><span class="o">.</span><span class="n">grep</span><span class="p">(</span><span class="s2">&#34;pattern&#34;</span><span class="p">,</span> <span class="n">file</span><span class="o">=</span><span class="s2">&#34;/data/log.txt&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="k">if</span> <span class="n">result</span><span class="o">.</span><span class="n">exit_code</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;Found </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">result</span><span class="o">.</span><span class="n">lines</span><span class="p">())</span><span class="si">}</span><span class="s2"> matches&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="k">else</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="nb">print</span><span class="p">(</span><span class="s2">&#34;No matches found&#34;</span><span class="p">)</span>
</span></span></code></pre></div><h2 id="the-philosophy-applied">The Philosophy Applied</h2>
<p>Let&rsquo;s revisit McIlroy&rsquo;s principles and see how we&rsquo;ve applied them:</p>
<p><strong>&ldquo;Do one thing well&rdquo;</strong>: Each method (<code>cat</code>, <code>grep</code>, <code>wc</code>, <code>cut</code>) does exactly one thing.</p>
<p><strong>&ldquo;Work together&rdquo;</strong>: <code>CommandResult</code> enables any command to connect to any other.</p>
<p><strong>&ldquo;Text streams as universal interface&rdquo;</strong>: Every result has a text representation via <code>__str__</code>, making it redirectable and pipeable.</p>
<p>We&rsquo;ve also added Python-specific benefits:</p>
<ul>
<li><strong>Type safety</strong>: Results carry structured <code>data</code>, not just text</li>
<li><strong>Introspection</strong>: <code>result.lines()</code>, <code>result.data</code>, etc.</li>
<li><strong>Chaining</strong>: Method chains read left-to-right like pipelines</li>
</ul>
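<p>To make the &ldquo;dual representation&rdquo; concrete, here is a minimal sketch of a result type that is both an object and a text stream. It is a simplified stand-in rather than DagShell&rsquo;s actual class: the field names (<code>data</code>, <code>text</code>, <code>exit_code</code>) follow the snippets above, the <code>_shell</code> back-reference is left out, and the bodies of <code>__str__</code> and <code>lines()</code> are assumptions.</p>

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class CommandResult:
    """Simplified stand-in; the field names follow the snippets in this post."""
    data: Any           # structured payload, e.g. a list of matching lines
    text: str           # the same result rendered as a text stream
    exit_code: int = 0  # 0 for success, non-zero for failure

    def __str__(self) -> str:
        # print(result) and redirection both use the text view
        return self.text

    def lines(self) -> List[str]:
        # Convenience accessor so callers need not split the text themselves
        return self.text.splitlines()


result = CommandResult(data=["alpha", "beta"], text="alpha\nbeta")
print(result)          # text view, one entry per line
print(result.lines())  # ['alpha', 'beta']
print(result.data)     # ['alpha', 'beta']
```

<p>The same value can be printed like a stream or inspected like a list, which is what lets a command behave as both a Unix filter and a Python function.</p>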
<h2 id="the-fluent-pattern">The Fluent Pattern</h2>
<p>This is an instance of the <strong>Fluent Interface</strong> pattern, where methods return <code>self</code> (or a related object) to enable chaining. Martin Fowler and Eric Evans coined the term in 2005, but the idea is older: Smalltalk embraced it from the beginning.</p>
<p>The key is designing methods that:</p>
<ol>
<li>Perform their action</li>
<li>Return something chainable</li>
<li>Maintain enough context for the next operation</li>
</ol>
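<p>The three points above can be sketched in a few lines. This is a toy, not DagShell itself: <code>MiniShell</code> and <code>Result</code> are hypothetical names, the &ldquo;filesystem&rdquo; is a plain dict, and <code>grep</code> is a substring match rather than a regular expression.</p>

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Result:
    """Chainable result; `shell` is the context carried to the next step."""
    lines: List[str]
    shell: "MiniShell"
    exit_code: int = 0

    def grep(self, pattern: str) -> "Result":
        # 1. Perform the action (a substring filter in this sketch)
        kept = [line for line in self.lines if pattern in line]
        # 2. Return something chainable, 3. pass the shell along
        return Result(kept, self.shell, 0 if kept else 1)

    def out(self, path: str) -> "Result":
        # Redirect: store the text in the shell's in-memory file table
        self.shell.files[path] = "\n".join(self.lines)
        return self


@dataclass
class MiniShell:
    """Hypothetical toy shell backed by a dict instead of a real filesystem."""
    files: Dict[str, str] = field(default_factory=dict)

    def cat(self, path: str) -> Result:
        if path not in self.files:
            return Result([], self, exit_code=1)
        return Result(self.files[path].splitlines(), self)


shell = MiniShell({"/logs/app.log": "INFO ok\nERROR disk\nERROR net"})
shell.cat("/logs/app.log").grep("ERROR").out("/logs/errors.txt")
print(shell.files["/logs/errors.txt"])
```

<p>Because every <code>Result</code> keeps a reference to the shell that produced it, <code>out</code> can write a file at the end of the chain without the caller passing the shell around.</p>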
<p>When done well, code reads almost like prose:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="p">(</span><span class="n">shell</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">mkdir</span><span class="p">(</span><span class="s2">&#34;/project&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">cd</span><span class="p">(</span><span class="s2">&#34;/project&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">echo</span><span class="p">(</span><span class="s2">&#34;# My Project&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">out</span><span class="p">(</span><span class="s2">&#34;README.md&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">echo</span><span class="p">(</span><span class="s2">&#34;def main(): pass&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">out</span><span class="p">(</span><span class="s2">&#34;main.py&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">ls</span><span class="p">(</span><span class="s2">&#34;-la&#34;</span><span class="p">))</span>
</span></span></code></pre></div><h2 id="trade-offs">Trade-offs</h2>
<p>This approach isn&rsquo;t free:</p>
<ol>
<li><strong>Wrapper overhead</strong>: Every result is wrapped in <code>CommandResult</code></li>
<li><strong>Learning curve</strong>: Users must understand the chaining pattern</li>
<li><strong>Debugging</strong>: Long chains can be hard to debug—where did it fail?</li>
</ol>
<p>For scripting and exploratory work, the expressiveness outweighs these costs. For performance-critical code, you might unwrap to raw operations.</p>
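<p>One way to soften the debugging cost is to run a chain one named step at a time and stop at the first non-zero exit code. The sketch below is an assumption about how that could look, not part of DagShell: <code>Result</code>, <code>run_steps</code>, <code>grep</code>, and <code>upper</code> are hypothetical helpers.</p>

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Result:
    """Hypothetical minimal result: text plus a Unix-style exit code."""
    text: str
    exit_code: int = 0


Step = Tuple[str, Callable[[Result], Result]]


def run_steps(start: Result, steps: List[Step]) -> Tuple[Result, Optional[str]]:
    """Apply steps in order and stop at the first non-zero exit code.

    Returns the last result and the name of the failing step (or None).
    """
    current = start
    for name, step in steps:
        current = step(current)
        if current.exit_code != 0:
            return current, name
    return current, None


def grep(pattern: str) -> Callable[[Result], Result]:
    def step(r: Result) -> Result:
        kept = [line for line in r.text.splitlines() if pattern in line]
        return Result("\n".join(kept), 0 if kept else 1)
    return step


def upper(r: Result) -> Result:
    return Result(r.text.upper())


result, failed_at = run_steps(
    Result("INFO ok\nINFO done"),
    [("grep ERROR", grep("ERROR")), ("upper", upper)],
)
print(failed_at)  # grep ERROR
```

<p>Naming each step costs some of the fluent style, so this suits a debugging session more than everyday scripts.</p>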
<h2 id="conclusion">Conclusion</h2>
<p>The Unix philosophy isn&rsquo;t about Unix—it&rsquo;s about composability. Small, focused operations that connect through a universal interface create systems greater than the sum of their parts.</p>
<p>In Python, we achieve this with:</p>
<ul>
<li><strong>Wrapper types</strong> that carry data and context</li>
<li><strong>Method chaining</strong> for left-to-right flow</li>
<li><strong>Dual representations</strong> as objects and text streams</li>
<li><strong>Exit codes</strong> for success/failure signaling</li>
</ul>
<p>The result is a fluent interface where complex operations emerge from simple, composable parts—Unix philosophy, realized in Python.</p>
<hr>
<p><em>See the full implementation in <a href="https://github.com/queelius/dagshell">DagShell</a>, which builds a complete virtual filesystem with this fluent pattern.</em></p>
<p><em>Next in this series: <a href="../03-embedding-scheme-interpreter/">Embedding a Scheme Interpreter</a> — adding a DSL for filesystem scripting.</em></p>
]]></content:encoded></item><item><title>Embedding a Scheme Interpreter: Building a DSL for Filesystem Operations</title><link>https://metafunctor.com/projects/dagshell/posts/03-embedding-scheme-interpreter/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><dc:creator>Alex Towell</dc:creator><guid isPermaLink="true">https://metafunctor.com/projects/dagshell/posts/03-embedding-scheme-interpreter/</guid><description>How to implement a minimal Scheme and integrate it with your application</description><content:encoded><![CDATA[<p>Every sufficiently complex application eventually grows a scripting language. Emacs has Elisp. AutoCAD has AutoLISP. Blender has Python. The pattern is clear: users need programmability beyond what a fixed UI provides.</p>
<p>But embedding a full language runtime is heavy. What if we just need something small—variables, functions, conditionals, loops? A Lisp-family language is perfect for this: the syntax is trivial to parse, the semantics are clean, and it&rsquo;s powerful enough for real scripting.</p>
<p>In this post, I&rsquo;ll walk through building a minimal Scheme interpreter in Python and integrating it with a virtual filesystem. We&rsquo;ll cover the complete pipeline: tokenization, parsing, evaluation, and extension with custom primitives.</p>
<h2 id="why-scheme">Why Scheme?</h2>
<p>Scheme is a minimalist Lisp dialect. Its syntax is just parentheses and atoms:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-scheme" data-lang="scheme"><span class="line"><span class="cl"><span class="p">(</span><span class="k">define </span><span class="nv">x</span> <span class="mi">10</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">(</span><span class="nb">+ </span><span class="nv">x</span> <span class="p">(</span><span class="nb">* </span><span class="mi">2</span> <span class="mi">3</span><span class="p">))</span>
</span></span><span class="line"><span class="cl"><span class="p">(</span><span class="k">if </span><span class="p">(</span><span class="nb">&gt; </span><span class="nv">x</span> <span class="mi">5</span><span class="p">)</span> <span class="s">&#34;big&#34;</span> <span class="s">&#34;small&#34;</span><span class="p">)</span>
</span></span></code></pre></div><p>This uniformity makes parsing trivial. There&rsquo;s no operator precedence, no complex grammar—just:</p>
<ol>
<li>Atoms: numbers, strings, symbols</li>
<li>Lists: <code>(thing thing thing ...)</code></li>
</ol>
<p>That&rsquo;s it. The entire parser can be written in under 50 lines.</p>
<h2 id="step-1-tokenization">Step 1: Tokenization</h2>
<p>First, we convert source code into tokens. Scheme&rsquo;s tokenization is simple: split on whitespace, but respect strings and parentheses.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">tokenize</span><span class="p">(</span><span class="n">text</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">List</span><span class="p">[</span><span class="nb">str</span><span class="p">]:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Convert Scheme code into tokens.&#34;&#34;&#34;</span>
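</span></span><span class="line"><span class="cl">    <span class="c1"># Strip comments: drop everything from ; to the end of each line</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># (caveat: this also cuts at a ; inside a string literal)</span>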
</span></span><span class="line"><span class="cl">    <span class="n">lines</span> <span class="o">=</span> <span class="n">text</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">&#39;</span><span class="se">\n</span><span class="s1">&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">text</span> <span class="o">=</span> <span class="s1">&#39;</span><span class="se">\n</span><span class="s1">&#39;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">line</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">&#39;;&#39;</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span> <span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">lines</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># Add spaces around parens for easy splitting</span>
</span></span><span class="line"><span class="cl">    <span class="n">text</span> <span class="o">=</span> <span class="n">text</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s1">&#39;(&#39;</span><span class="p">,</span> <span class="s1">&#39; ( &#39;</span><span class="p">)</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s1">&#39;)&#39;</span><span class="p">,</span> <span class="s1">&#39; ) &#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># Split, handling string literals</span>
</span></span><span class="line"><span class="cl">    <span class="n">tokens</span> <span class="o">=</span> <span class="p">[]</span>
</span></span><span class="line"><span class="cl">    <span class="n">in_string</span> <span class="o">=</span> <span class="kc">False</span>
</span></span><span class="line"><span class="cl">    <span class="n">current</span> <span class="o">=</span> <span class="p">[]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="n">char</span> <span class="ow">in</span> <span class="n">text</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">char</span> <span class="o">==</span> <span class="s1">&#39;&#34;&#39;</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="n">in_string</span> <span class="o">=</span> <span class="ow">not</span> <span class="n">in_string</span>
</span></span><span class="line"><span class="cl">            <span class="n">current</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">char</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="k">elif</span> <span class="n">in_string</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="n">current</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">char</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="k">elif</span> <span class="n">char</span><span class="o">.</span><span class="n">isspace</span><span class="p">():</span>
</span></span><span class="line"><span class="cl">            <span class="k">if</span> <span class="n">current</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">                <span class="n">tokens</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">current</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">                <span class="n">current</span> <span class="o">=</span> <span class="p">[]</span>
</span></span><span class="line"><span class="cl">        <span class="k">else</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="n">current</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">char</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">current</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">tokens</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="s1">&#39;&#39;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">current</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">tokens</span>
</span></span></code></pre></div><p>Input: <code>(define x (+ 1 2))</code>
Output: <code>['(', 'define', 'x', '(', '+', '1', '2', ')', ')']</code></p>
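<p>The character loop above is easy to follow, but it strips comments and pads parentheses before it looks for strings, so a string literal containing <code>;</code> or a parenthesis gets altered. If that matters for your scripts, a regular expression can recognize strings and comments as whole tokens. The following is an alternative sketch, not the tokenizer DagShell uses; <code>TOKEN_RE</code> is a name introduced here, and the handling of backslash escapes is an assumption.</p>

```python
import re
from typing import List

# Alternatives are tried left to right at each position.
TOKEN_RE = re.compile(r'''
    ;[^\n]*                # comment: runs to the end of the line
  | "(?:\\.|[^"\\])*"      # string literal, allowing backslash escapes
  | [()]                   # a single parenthesis
  | [^\s()";]+             # atom: number, symbol, or boolean
''', re.VERBOSE)


def tokenize(text: str) -> List[str]:
    """Convert Scheme code into tokens, keeping string literals intact."""
    # findall returns whole matches because the pattern has no capturing groups.
    # Characters that match no alternative (e.g. an unterminated quote) are skipped.
    return [tok for tok in TOKEN_RE.findall(text) if not tok.startswith(';')]


print(tokenize('(define x (+ 1 2))'))
print(tokenize('(echo "a (b); c") ; trailing comment'))
```

<p>The first call produces the same token list as shown above; the second keeps <code>"a (b); c"</code> as a single token and drops the trailing comment.</p>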
<h2 id="step-2-parsing">Step 2: Parsing</h2>
<p>Parsing converts tokens into an Abstract Syntax Tree (AST). In our interpreter, the AST is just nested Python lists:</p>
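<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">dataclasses</span> <span class="kn">import</span> <span class="n">dataclass</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">Any</span><span class="p">,</span> <span class="n">List</span><span class="p">,</span> <span class="n">Tuple</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nd">@dataclass</span>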
</span></span><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">Symbol</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Represents a Scheme symbol.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="n">name</span><span class="p">:</span> <span class="nb">str</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">parse</span><span class="p">(</span><span class="n">tokens</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="nb">str</span><span class="p">])</span> <span class="o">-&gt;</span> <span class="n">Any</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Parse tokens into an AST.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">parse_expr</span><span class="p">(</span><span class="n">index</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Tuple</span><span class="p">[</span><span class="n">Any</span><span class="p">,</span> <span class="nb">int</span><span class="p">]:</span>
</span></span><span class="line"><span class="cl">        <span class="n">token</span> <span class="o">=</span> <span class="n">tokens</span><span class="p">[</span><span class="n">index</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">token</span> <span class="o">==</span> <span class="s1">&#39;(&#39;</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="c1"># Parse a list</span>
</span></span><span class="line"><span class="cl">            <span class="n">lst</span> <span class="o">=</span> <span class="p">[]</span>
</span></span><span class="line"><span class="cl">            <span class="n">index</span> <span class="o">+=</span> <span class="mi">1</span>
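</span></span><span class="line"><span class="cl">            <span class="k">while</span> <span class="n">index</span> <span class="o">&lt;</span> <span class="nb">len</span><span class="p">(</span><span class="n">tokens</span><span class="p">)</span> <span class="ow">and</span> <span class="n">tokens</span><span class="p">[</span><span class="n">index</span><span class="p">]</span> <span class="o">!=</span> <span class="s1">&#39;)&#39;</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">                <span class="n">expr</span><span class="p">,</span> <span class="n">index</span> <span class="o">=</span> <span class="n">parse_expr</span><span class="p">(</span><span class="n">index</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">                <span class="n">lst</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">expr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">            <span class="k">if</span> <span class="n">index</span> <span class="o">==</span> <span class="nb">len</span><span class="p">(</span><span class="n">tokens</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">                <span class="k">raise</span> <span class="ne">SyntaxError</span><span class="p">(</span><span class="s2">&#34;Missing )&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">            <span class="k">return</span> <span class="n">lst</span><span class="p">,</span> <span class="n">index</span> <span class="o">+</span> <span class="mi">1</span>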
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="k">elif</span> <span class="n">token</span> <span class="o">==</span> <span class="s1">&#39;)&#39;</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="k">raise</span> <span class="ne">SyntaxError</span><span class="p">(</span><span class="s2">&#34;Unexpected )&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="k">else</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="c1"># Parse an atom</span>
</span></span><span class="line"><span class="cl">            <span class="k">return</span> <span class="n">parse_atom</span><span class="p">(</span><span class="n">token</span><span class="p">),</span> <span class="n">index</span> <span class="o">+</span> <span class="mi">1</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">expr</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">parse_expr</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">expr</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">parse_atom</span><span class="p">(</span><span class="n">token</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Any</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Parse a single atom.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># Try integer</span>
</span></span><span class="line"><span class="cl">    <span class="k">try</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="nb">int</span><span class="p">(</span><span class="n">token</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="k">except</span> <span class="ne">ValueError</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">pass</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># Try float</span>
</span></span><span class="line"><span class="cl">    <span class="k">try</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="nb">float</span><span class="p">(</span><span class="n">token</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="k">except</span> <span class="ne">ValueError</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">pass</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># String literal</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">token</span><span class="o">.</span><span class="n">startswith</span><span class="p">(</span><span class="s1">&#39;&#34;&#39;</span><span class="p">)</span> <span class="ow">and</span> <span class="n">token</span><span class="o">.</span><span class="n">endswith</span><span class="p">(</span><span class="s1">&#39;&#34;&#39;</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="n">token</span><span class="p">[</span><span class="mi">1</span><span class="p">:</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># Boolean</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">token</span> <span class="o">==</span> <span class="s1">&#39;#t&#39;</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="kc">True</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">token</span> <span class="o">==</span> <span class="s1">&#39;#f&#39;</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="kc">False</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># Symbol</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">Symbol</span><span class="p">(</span><span class="n">token</span><span class="p">)</span>
</span></span></code></pre></div><p>Input tokens: <code>['(', 'define', 'x', '(', '+', '1', '2', ')', ')']</code>
Output AST: <code>[Symbol('define'), Symbol('x'), [Symbol('+'), 1, 2]]</code></p>
<p>The beauty of Lisp: the AST <em>is</em> the syntax. There&rsquo;s no separate tree structure—it&rsquo;s just lists of symbols and values.</p>
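<p>A quick way to see this is to go the other direction: because the AST is only nested lists, printing it back out as source takes a handful of lines. The <code>to_source</code> helper below is a hypothetical addition for illustration, and it does not re-escape quotes inside strings.</p>

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Symbol:
    """Represents a Scheme symbol."""
    name: str


def to_source(ast: Any) -> str:
    """Render a nested-list AST back into Scheme source text."""
    if isinstance(ast, list):
        return "(" + " ".join(to_source(item) for item in ast) + ")"
    if isinstance(ast, Symbol):
        return ast.name
    if isinstance(ast, bool):
        # Checked before the numeric fallback because bool is a subclass of int
        return "#t" if ast else "#f"
    if isinstance(ast, str):
        return '"' + ast + '"'
    return str(ast)


ast = [Symbol("define"), Symbol("x"), [Symbol("+"), 1, 2]]
print(to_source(ast))  # (define x (+ 1 2))
```

<p>The printer mirrors <code>parse_atom</code> case by case, which is a useful sanity check when you add new atom types.</p>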
<h2 id="step-3-the-environment">Step 3: The Environment</h2>
<p>Before evaluation, we need an environment to track variable bindings. This is where lexical scoping lives:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">Environment</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Lexical environment for variable bindings.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">parent</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="s1">&#39;Environment&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="kc">None</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">bindings</span><span class="p">:</span> <span class="n">Dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">Any</span><span class="p">]</span> <span class="o">=</span> <span class="p">{}</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">parent</span> <span class="o">=</span> <span class="n">parent</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">define</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">name</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">value</span><span class="p">:</span> <span class="n">Any</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;&#34;&#34;Define a new binding in this environment.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">bindings</span><span class="p">[</span><span class="n">name</span><span class="p">]</span> <span class="o">=</span> <span class="n">value</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">get</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">name</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Any</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;&#34;&#34;Look up a binding, checking parent scopes.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">name</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">bindings</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">bindings</span><span class="p">[</span><span class="n">name</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">        <span class="k">elif</span> <span class="bp">self</span><span class="o">.</span><span class="n">parent</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">parent</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">name</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="k">else</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="k">raise</span> <span class="ne">NameError</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;Undefined variable: </span><span class="si">{</span><span class="n">name</span><span class="si">}</span><span class="s2">&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">set</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">name</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">value</span><span class="p">:</span> <span class="n">Any</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;&#34;&#34;Update an existing binding.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">name</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">bindings</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="bp">self</span><span class="o">.</span><span class="n">bindings</span><span class="p">[</span><span class="n">name</span><span class="p">]</span> <span class="o">=</span> <span class="n">value</span>
</span></span><span class="line"><span class="cl">        <span class="k">elif</span> <span class="bp">self</span><span class="o">.</span><span class="n">parent</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="bp">self</span><span class="o">.</span><span class="n">parent</span><span class="o">.</span><span class="n">set</span><span class="p">(</span><span class="n">name</span><span class="p">,</span> <span class="n">value</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="k">else</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="k">raise</span> <span class="ne">NameError</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;Undefined variable: </span><span class="si">{</span><span class="n">name</span><span class="si">}</span><span class="s2">&#34;</span><span class="p">)</span>
</span></span></code></pre></div><p>Each call to a user-defined function creates a new Environment whose parent is the environment the function was defined in, not the caller&rsquo;s. This chain is what enables closures: inner functions that remember their enclosing scope.</p>
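<p>To make the parent chain concrete, here is a small sketch of shadowing and of updates that propagate upward. It repeats a compact copy of the class so it runs on its own; the variable names (<code>global_env</code>, <code>local_env</code>, and so on) are only for illustration.</p>

```python
from typing import Any, Dict, Optional


class Environment:
    """Compact copy of the Environment class above, so this runs standalone."""

    def __init__(self, parent: Optional['Environment'] = None):
        self.bindings: Dict[str, Any] = {}
        self.parent = parent

    def define(self, name: str, value: Any) -> None:
        self.bindings[name] = value

    def get(self, name: str) -> Any:
        if name in self.bindings:
            return self.bindings[name]
        if self.parent is not None:
            return self.parent.get(name)
        raise NameError(f"Undefined variable: {name}")

    def set(self, name: str, value: Any) -> None:
        if name in self.bindings:
            self.bindings[name] = value
        elif self.parent is not None:
            self.parent.set(name, value)
        else:
            raise NameError(f"Undefined variable: {name}")


global_env = Environment()
global_env.define('x', 1)
global_env.define('y', 2)

local_env = Environment(parent=global_env)
local_env.define('x', 10)        # shadows the global x

shadowed = local_env.get('x')    # found in the local scope
inherited = local_env.get('y')   # found by walking up to the parent

local_env.set('y', 20)           # no local y, so the global binding is updated
updated_global = global_env.get('y')
untouched_global = global_env.get('x')
```

<p>Note the asymmetry: <code>define</code> always writes to the innermost scope, while <code>set</code> searches outward for an existing binding.</p>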
<h2 id="step-4-evaluation">Step 4: Evaluation</h2>
<p>The evaluator is the heart of the interpreter. It walks the AST and computes values:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">evaluate</span><span class="p">(</span><span class="n">expr</span><span class="p">:</span> <span class="n">Any</span><span class="p">,</span> <span class="n">env</span><span class="p">:</span> <span class="n">Environment</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Any</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Evaluate an expression in an environment.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># Self-evaluating: numbers, strings, booleans</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">expr</span><span class="p">,</span> <span class="p">(</span><span class="nb">int</span><span class="p">,</span> <span class="nb">float</span><span class="p">,</span> <span class="nb">str</span><span class="p">,</span> <span class="nb">bool</span><span class="p">,</span> <span class="nb">type</span><span class="p">(</span><span class="kc">None</span><span class="p">))):</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="n">expr</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># Variable lookup</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">expr</span><span class="p">,</span> <span class="n">Symbol</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="n">env</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">expr</span><span class="o">.</span><span class="n">name</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># Must be a list (function call or special form)</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="ow">not</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">expr</span><span class="p">,</span> <span class="nb">list</span><span class="p">)</span> <span class="ow">or</span> <span class="ow">not</span> <span class="n">expr</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="n">expr</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">op</span> <span class="o">=</span> <span class="n">expr</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># Special forms</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">op</span><span class="p">,</span> <span class="n">Symbol</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">op</span><span class="o">.</span><span class="n">name</span> <span class="o">==</span> <span class="s1">&#39;quote&#39;</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="k">return</span> <span class="n">expr</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">op</span><span class="o">.</span><span class="n">name</span> <span class="o">==</span> <span class="s1">&#39;define&#39;</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="n">name</span> <span class="o">=</span> <span class="n">expr</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">            <span class="n">value</span> <span class="o">=</span> <span class="n">evaluate</span><span class="p">(</span><span class="n">expr</span><span class="p">[</span><span class="mi">2</span><span class="p">],</span> <span class="n">env</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">            <span class="n">env</span><span class="o">.</span><span class="n">define</span><span class="p">(</span><span class="n">name</span><span class="o">.</span><span class="n">name</span><span class="p">,</span> <span class="n">value</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">            <span class="k">return</span> <span class="n">value</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">op</span><span class="o">.</span><span class="n">name</span> <span class="o">==</span> <span class="s1">&#39;if&#39;</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="n">condition</span> <span class="o">=</span> <span class="n">evaluate</span><span class="p">(</span><span class="n">expr</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">env</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">            <span class="k">if</span> <span class="n">condition</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">                <span class="k">return</span> <span class="n">evaluate</span><span class="p">(</span><span class="n">expr</span><span class="p">[</span><span class="mi">2</span><span class="p">],</span> <span class="n">env</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">            <span class="k">elif</span> <span class="nb">len</span><span class="p">(</span><span class="n">expr</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">3</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">                <span class="k">return</span> <span class="n">evaluate</span><span class="p">(</span><span class="n">expr</span><span class="p">[</span><span class="mi">3</span><span class="p">],</span> <span class="n">env</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">            <span class="k">return</span> <span class="kc">None</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">op</span><span class="o">.</span><span class="n">name</span> <span class="o">==</span> <span class="s1">&#39;lambda&#39;</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="n">params</span> <span class="o">=</span> <span class="n">expr</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">            <span class="n">body</span> <span class="o">=</span> <span class="n">expr</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">            <span class="k">return</span> <span class="n">Procedure</span><span class="p">(</span><span class="n">params</span><span class="p">,</span> <span class="n">body</span><span class="p">,</span> <span class="n">env</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># Function application</span>
</span></span><span class="line"><span class="cl">    <span class="n">func</span> <span class="o">=</span> <span class="n">evaluate</span><span class="p">(</span><span class="n">op</span><span class="p">,</span> <span class="n">env</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">args</span> <span class="o">=</span> <span class="p">[</span><span class="n">evaluate</span><span class="p">(</span><span class="n">arg</span><span class="p">,</span> <span class="n">env</span><span class="p">)</span> <span class="k">for</span> <span class="n">arg</span> <span class="ow">in</span> <span class="n">expr</span><span class="p">[</span><span class="mi">1</span><span class="p">:]]</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">func</span><span class="p">(</span><span class="o">*</span><span class="n">args</span><span class="p">)</span>
</span></span></code></pre></div><p>Special forms (<code>quote</code>, <code>define</code>, <code>if</code>, <code>lambda</code>, etc.) have custom evaluation rules: each one decides which of its operands to evaluate, and when. Everything else is a function call: evaluate the operator, evaluate the arguments, call the function.</p>
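<p>Why can&rsquo;t <code>if</code> simply be another function? Function application evaluates every argument before the call, so both branches would run. The sketch below shows the difference in plain Python; <code>if_as_function</code>, <code>if_as_special_form</code>, and <code>branch</code> are hypothetical names used only for this illustration.</p>

```python
calls = []


def branch(label, value):
    """Record that this branch expression was evaluated, then return its value."""
    calls.append(label)
    return value


def if_as_function(condition, then_value, else_value):
    """Hypothetical: 'if' as an ordinary function. Arguments arrive already evaluated."""
    return then_value if condition else else_value


def if_as_special_form(condition, then_thunk, else_thunk):
    """Hypothetical: branches are passed unevaluated (as thunks); only one is run."""
    return then_thunk() if condition else else_thunk()


# Both branch expressions run before if_as_function is even entered
eager_result = if_as_function(True, branch('then', 1), branch('else', 2))
eager_calls = list(calls)

calls.clear()

# Only the selected branch runs
lazy_result = if_as_special_form(True, lambda: branch('then', 1), lambda: branch('else', 2))
lazy_calls = list(calls)
```

<p>The evaluator gets the second behaviour by calling <code>evaluate</code> only on the chosen sub-expression, which is what lets a recursive function stop at its base case.</p>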
<h2 id="step-5-user-defined-functions">Step 5: User-Defined Functions</h2>
<p>The <code>lambda</code> form creates procedures. A Procedure captures its parameters, body, and defining environment:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="nd">@dataclass</span>
</span></span><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">Procedure</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;A user-defined function.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="n">params</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="n">Symbol</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">    <span class="n">body</span><span class="p">:</span> <span class="n">Any</span>
</span></span><span class="line"><span class="cl">    <span class="n">env</span><span class="p">:</span> <span class="n">Environment</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="fm">__call__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="o">*</span><span class="n">args</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="c1"># Create a new environment for the call</span>
</span></span><span class="line"><span class="cl">        <span class="n">local_env</span> <span class="o">=</span> <span class="n">Environment</span><span class="p">(</span><span class="n">parent</span><span class="o">=</span><span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
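</span></span><span class="line"><span class="cl">        <span class="c1"># Reject arity mismatches, which zip() would otherwise silently hide</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">args</span><span class="p">)</span> <span class="o">!=</span> <span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">params</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">            <span class="k">raise</span> <span class="ne">TypeError</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;Expected </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">params</span><span class="p">)</span><span class="si">}</span><span class="s2"> arguments, got </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">args</span><span class="p">)</span><span class="si">}</span><span class="s2">&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="c1"># Bind parameters to arguments</span>
</span></span><span class="line"><span class="cl">        <span class="k">for</span> <span class="n">param</span><span class="p">,</span> <span class="n">arg</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">params</span><span class="p">,</span> <span class="n">args</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">            <span class="n">local_env</span><span class="o">.</span><span class="n">define</span><span class="p">(</span><span class="n">param</span><span class="o">.</span><span class="n">name</span><span class="p">,</span> <span class="n">arg</span><span class="p">)</span>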
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="c1"># Evaluate the body in this new environment</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="n">evaluate</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">body</span><span class="p">,</span> <span class="n">local_env</span><span class="p">)</span>
</span></span></code></pre></div><p>When called, a Procedure:</p>
<ol>
<li>Creates a new environment with its defining environment as parent (closure!)</li>
<li>Binds parameters to the passed arguments</li>
<li>Evaluates its body in this new environment</li>
</ol>
<p>This simple structure gives us closures, higher-order functions, and lexical scoping.</p>
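<p>The classic <code>make-adder</code> example shows the closure behaviour end to end. This is a trimmed, self-contained sketch rather than the full interpreter: <code>Symbol</code> is a stand-in dataclass, the evaluator handles only <code>define</code>, <code>lambda</code>, and application, and the ASTs are built by hand instead of being parsed.</p>

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class Symbol:
    name: str  # stand-in for the interpreter's Symbol type


class Environment:
    def __init__(self, parent: Optional['Environment'] = None):
        self.bindings: Dict[str, Any] = {}
        self.parent = parent

    def define(self, name: str, value: Any) -> None:
        self.bindings[name] = value

    def get(self, name: str) -> Any:
        if name in self.bindings:
            return self.bindings[name]
        if self.parent is not None:
            return self.parent.get(name)
        raise NameError(f"Undefined variable: {name}")


@dataclass
class Procedure:
    params: List[Symbol]
    body: Any
    env: Environment

    def __call__(self, *args):
        local_env = Environment(parent=self.env)  # parent is the defining env
        for param, arg in zip(self.params, args):
            local_env.define(param.name, arg)
        return evaluate(self.body, local_env)


def evaluate(expr: Any, env: Environment) -> Any:
    """Trimmed evaluator: literals, symbols, define, lambda, application."""
    if isinstance(expr, Symbol):
        return env.get(expr.name)
    if not isinstance(expr, list) or not expr:
        return expr
    op = expr[0]
    if isinstance(op, Symbol) and op.name == 'define':
        value = evaluate(expr[2], env)
        env.define(expr[1].name, value)
        return value
    if isinstance(op, Symbol) and op.name == 'lambda':
        return Procedure(expr[1], expr[2], env)
    func = evaluate(op, env)
    return func(*[evaluate(arg, env) for arg in expr[1:]])


S = Symbol  # shorthand for building ASTs by hand
env = Environment()
env.define('+', lambda *args: sum(args))

# (define make-adder (lambda (n) (lambda (x) (+ x n))))
evaluate([S('define'), S('make-adder'),
          [S('lambda'), [S('n')],
           [S('lambda'), [S('x')], [S('+'), S('x'), S('n')]]]], env)
# (define add5 (make-adder 5))
evaluate([S('define'), S('add5'), [S('make-adder'), 5]], env)
# (add5 10): n is still 5, remembered from the call to make-adder
total = evaluate([S('add5'), 10], env)
```

<p>The inner procedure finds <code>n</code> because its <code>env</code> field points at the environment created by the <code>make-adder</code> call, which outlives that call.</p>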
<h2 id="step-6-built-in-primitives">Step 6: Built-in Primitives</h2>
<p>The global environment provides built-in functions:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">create_global_env</span><span class="p">()</span> <span class="o">-&gt;</span> <span class="n">Environment</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Create environment with built-in primitives.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="n">env</span> <span class="o">=</span> <span class="n">Environment</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># Arithmetic</span>
</span></span><span class="line"><span class="cl">    <span class="n">env</span><span class="o">.</span><span class="n">define</span><span class="p">(</span><span class="s1">&#39;+&#39;</span><span class="p">,</span> <span class="k">lambda</span> <span class="o">*</span><span class="n">args</span><span class="p">:</span> <span class="nb">sum</span><span class="p">(</span><span class="n">args</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">    <span class="n">env</span><span class="o">.</span><span class="n">define</span><span class="p">(</span><span class="s1">&#39;-&#39;</span><span class="p">,</span> <span class="k">lambda</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="o">=</span><span class="kc">None</span><span class="p">:</span> <span class="o">-</span><span class="n">a</span> <span class="k">if</span> <span class="n">b</span> <span class="ow">is</span> <span class="kc">None</span> <span class="k">else</span> <span class="n">a</span> <span class="o">-</span> <span class="n">b</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">env</span><span class="o">.</span><span class="n">define</span><span class="p">(</span><span class="s1">&#39;*&#39;</span><span class="p">,</span> <span class="k">lambda</span> <span class="o">*</span><span class="n">args</span><span class="p">:</span> <span class="n">reduce</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">:</span> <span class="n">x</span> <span class="o">*</span> <span class="n">y</span><span class="p">,</span> <span class="n">args</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">    <span class="n">env</span><span class="o">.</span><span class="n">define</span><span class="p">(</span><span class="s1">&#39;/&#39;</span><span class="p">,</span> <span class="k">lambda</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">:</span> <span class="n">a</span> <span class="o">/</span> <span class="n">b</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># Comparison</span>
</span></span><span class="line"><span class="cl">    <span class="n">env</span><span class="o">.</span><span class="n">define</span><span class="p">(</span><span class="s1">&#39;=&#39;</span><span class="p">,</span> <span class="k">lambda</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">:</span> <span class="n">a</span> <span class="o">==</span> <span class="n">b</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">env</span><span class="o">.</span><span class="n">define</span><span class="p">(</span><span class="s1">&#39;&lt;&#39;</span><span class="p">,</span> <span class="k">lambda</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">:</span> <span class="n">a</span> <span class="o">&lt;</span> <span class="n">b</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">env</span><span class="o">.</span><span class="n">define</span><span class="p">(</span><span class="s1">&#39;&gt;&#39;</span><span class="p">,</span> <span class="k">lambda</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">:</span> <span class="n">a</span> <span class="o">&gt;</span> <span class="n">b</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># List operations</span>
</span></span><span class="line"><span class="cl">    <span class="n">env</span><span class="o">.</span><span class="n">define</span><span class="p">(</span><span class="s1">&#39;car&#39;</span><span class="p">,</span> <span class="k">lambda</span> <span class="n">lst</span><span class="p">:</span> <span class="n">lst</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
</span></span><span class="line"><span class="cl">    <span class="n">env</span><span class="o">.</span><span class="n">define</span><span class="p">(</span><span class="s1">&#39;cdr&#39;</span><span class="p">,</span> <span class="k">lambda</span> <span class="n">lst</span><span class="p">:</span> <span class="n">lst</span><span class="p">[</span><span class="mi">1</span><span class="p">:])</span>
</span></span><span class="line"><span class="cl">    <span class="n">env</span><span class="o">.</span><span class="n">define</span><span class="p">(</span><span class="s1">&#39;cons&#39;</span><span class="p">,</span> <span class="k">lambda</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">:</span> <span class="p">[</span><span class="n">a</span><span class="p">]</span> <span class="o">+</span> <span class="nb">list</span><span class="p">(</span><span class="n">b</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">    <span class="n">env</span><span class="o">.</span><span class="n">define</span><span class="p">(</span><span class="s1">&#39;list&#39;</span><span class="p">,</span> <span class="k">lambda</span> <span class="o">*</span><span class="n">args</span><span class="p">:</span> <span class="nb">list</span><span class="p">(</span><span class="n">args</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">    <span class="n">env</span><span class="o">.</span><span class="n">define</span><span class="p">(</span><span class="s1">&#39;null?&#39;</span><span class="p">,</span> <span class="k">lambda</span> <span class="n">lst</span><span class="p">:</span> <span class="n">lst</span> <span class="o">==</span> <span class="p">[])</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># Higher-order functions</span>
</span></span><span class="line"><span class="cl">    <span class="n">env</span><span class="o">.</span><span class="n">define</span><span class="p">(</span><span class="s1">&#39;map&#39;</span><span class="p">,</span> <span class="k">lambda</span> <span class="n">f</span><span class="p">,</span> <span class="n">lst</span><span class="p">:</span> <span class="p">[</span><span class="n">f</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">lst</span><span class="p">])</span>
</span></span><span class="line"><span class="cl">    <span class="n">env</span><span class="o">.</span><span class="n">define</span><span class="p">(</span><span class="s1">&#39;filter&#39;</span><span class="p">,</span> <span class="k">lambda</span> <span class="n">f</span><span class="p">,</span> <span class="n">lst</span><span class="p">:</span> <span class="p">[</span><span class="n">x</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">lst</span> <span class="k">if</span> <span class="n">f</span><span class="p">(</span><span class="n">x</span><span class="p">)])</span>
</span></span><span class="line"><span class="cl">    <span class="n">env</span><span class="o">.</span><span class="n">define</span><span class="p">(</span><span class="s1">&#39;reduce&#39;</span><span class="p">,</span> <span class="k">lambda</span> <span class="n">f</span><span class="p">,</span> <span class="n">lst</span><span class="p">,</span> <span class="n">init</span><span class="p">:</span> <span class="n">reduce</span><span class="p">(</span><span class="n">f</span><span class="p">,</span> <span class="n">lst</span><span class="p">,</span> <span class="n">init</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">env</span>
</span></span></code></pre></div><p>Each primitive is just a Python function. Because <code>Procedure</code> defines <code>__call__</code>, the evaluator invokes primitives and user-defined procedures the same way: <code>func(*args)</code>.</p>
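<p>One gap worth noting: the <code>-</code> and <code>/</code> primitives above take at most two arguments, whereas Scheme&rsquo;s are variadic, so <code>(- 10 3 2)</code> would raise a Python <code>TypeError</code>. A possible variadic variant is sketched below; <code>subtract</code> and <code>divide</code> are hypothetical helper names, not part of the original code.</p>

```python
from functools import reduce
import operator


def subtract(first, *rest):
    """(- a) negates; (- a b c ...) subtracts left to right."""
    if not rest:
        return -first
    return reduce(operator.sub, rest, first)


def divide(first, *rest):
    """(/ a) is the reciprocal; (/ a b c ...) divides left to right."""
    if not rest:
        return 1 / first
    return reduce(operator.truediv, rest, first)


negated = subtract(5)            # (- 5)
difference = subtract(10, 3, 2)  # (- 10 3 2)
quotient = divide(20, 2, 5)      # (/ 20 2 5)
```

<p>These would be registered the same way as the other primitives, for example <code>env.define('-', subtract)</code>. Division here always yields a Python float, which differs from Scheme&rsquo;s exact rationals.</p>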
<h2 id="step-7-filesystem-integration">Step 7: Filesystem Integration</h2>
<p>Now the interesting part: extending Scheme with filesystem primitives.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">create_global_env</span><span class="p">(</span><span class="n">shell</span><span class="o">=</span><span class="kc">None</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Environment</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="n">env</span> <span class="o">=</span> <span class="n">Environment</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># ... standard primitives ...</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># Filesystem primitives (if shell provided)</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">shell</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">env</span><span class="o">.</span><span class="n">define</span><span class="p">(</span><span class="s1">&#39;ls&#39;</span><span class="p">,</span> <span class="k">lambda</span> <span class="n">path</span><span class="o">=</span><span class="s1">&#39;/&#39;</span><span class="p">:</span> <span class="n">shell</span><span class="o">.</span><span class="n">ls</span><span class="p">(</span><span class="n">path</span><span class="p">)</span><span class="o">.</span><span class="n">lines</span><span class="p">())</span>
</span></span><span class="line"><span class="cl">        <span class="n">env</span><span class="o">.</span><span class="n">define</span><span class="p">(</span><span class="s1">&#39;cat&#39;</span><span class="p">,</span> <span class="k">lambda</span> <span class="n">path</span><span class="p">:</span> <span class="n">shell</span><span class="o">.</span><span class="n">cat</span><span class="p">(</span><span class="n">path</span><span class="p">)</span><span class="o">.</span><span class="n">text</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">env</span><span class="o">.</span><span class="n">define</span><span class="p">(</span><span class="s1">&#39;mkdir&#39;</span><span class="p">,</span> <span class="k">lambda</span> <span class="n">path</span><span class="p">:</span> <span class="n">shell</span><span class="o">.</span><span class="n">mkdir</span><span class="p">(</span><span class="n">path</span><span class="p">)</span> <span class="ow">and</span> <span class="n">path</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">env</span><span class="o">.</span><span class="n">define</span><span class="p">(</span><span class="s1">&#39;write&#39;</span><span class="p">,</span> <span class="k">lambda</span> <span class="n">path</span><span class="p">,</span> <span class="n">content</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="n">shell</span><span class="o">.</span><span class="n">fs</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="n">content</span><span class="o">.</span><span class="n">encode</span><span class="p">())</span> <span class="ow">and</span> <span class="n">path</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">env</span><span class="o">.</span><span class="n">define</span><span class="p">(</span><span class="s1">&#39;exists?&#39;</span><span class="p">,</span> <span class="k">lambda</span> <span class="n">path</span><span class="p">:</span> <span class="n">shell</span><span class="o">.</span><span class="n">fs</span><span class="o">.</span><span class="n">exists</span><span class="p">(</span><span class="n">path</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">        <span class="n">env</span><span class="o">.</span><span class="n">define</span><span class="p">(</span><span class="s1">&#39;cd&#39;</span><span class="p">,</span> <span class="k">lambda</span> <span class="n">path</span><span class="p">:</span> <span class="n">shell</span><span class="o">.</span><span class="n">cd</span><span class="p">(</span><span class="n">path</span><span class="p">)</span> <span class="ow">and</span> <span class="n">path</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">env</span><span class="o">.</span><span class="n">define</span><span class="p">(</span><span class="s1">&#39;pwd&#39;</span><span class="p">,</span> <span class="k">lambda</span><span class="p">:</span> <span class="n">shell</span><span class="o">.</span><span class="n">_cwd</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">env</span>
</span></span></code></pre></div><p>Now Scheme can script filesystem operations:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-scheme" data-lang="scheme"><span class="line"><span class="cl"><span class="c1">; Create a project structure</span>
</span></span><span class="line"><span class="cl"><span class="p">(</span><span class="nf">mkdir</span> <span class="s">&#34;/project&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">(</span><span class="nf">mkdir</span> <span class="s">&#34;/project/src&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">(</span><span class="nf">mkdir</span> <span class="s">&#34;/project/tests&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="p">(</span><span class="nb">write </span><span class="s">&#34;/project/README.md&#34;</span> <span class="s">&#34;# My Project\n&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">(</span><span class="nb">write </span><span class="s">&#34;/project/src/main.py&#34;</span> <span class="s">&#34;def main(): pass\n&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1">; List files</span>
</span></span><span class="line"><span class="cl"><span class="p">(</span><span class="nf">ls</span> <span class="s">&#34;/project&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="c1">; =&gt; (&#34;README.md&#34; &#34;src&#34; &#34;tests&#34;)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1">; Check existence</span>
</span></span><span class="line"><span class="cl"><span class="p">(</span><span class="k">if </span><span class="p">(</span><span class="nf">exists?</span> <span class="s">&#34;/project/README.md&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="s">&#34;Found it!&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="s">&#34;Not found&#34;</span><span class="p">)</span>
</span></span></code></pre></div><h2 id="step-8-the-repl">Step 8: The REPL</h2>
<p>A Read-Eval-Print Loop ties it together:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">SchemeREPL</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">shell</span><span class="o">=</span><span class="kc">None</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">env</span> <span class="o">=</span> <span class="n">create_global_env</span><span class="p">(</span><span class="n">shell</span><span class="o">=</span><span class="n">shell</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">eval_string</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">code</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Any</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;&#34;&#34;Evaluate a string of Scheme code.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">        <span class="n">tokens</span> <span class="o">=</span> <span class="n">tokenize</span><span class="p">(</span><span class="n">code</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="ow">not</span> <span class="n">tokens</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="k">return</span> <span class="kc">None</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="n">result</span> <span class="o">=</span> <span class="kc">None</span>
</span></span><span class="line"><span class="cl">        <span class="n">idx</span> <span class="o">=</span> <span class="mi">0</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="k">while</span> <span class="n">idx</span> <span class="o">&lt;</span> <span class="nb">len</span><span class="p">(</span><span class="n">tokens</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">            <span class="n">expr</span><span class="p">,</span> <span class="n">idx</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">_parse_one</span><span class="p">(</span><span class="n">tokens</span><span class="p">,</span> <span class="n">idx</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">            <span class="n">result</span> <span class="o">=</span> <span class="n">evaluate</span><span class="p">(</span><span class="n">expr</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="n">result</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">run</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;&#34;&#34;Interactive REPL.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">        <span class="nb">print</span><span class="p">(</span><span class="s2">&#34;Scheme REPL (type &#39;quit&#39; to exit)&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="k">while</span> <span class="kc">True</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="k">try</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">                <span class="n">code</span> <span class="o">=</span> <span class="nb">input</span><span class="p">(</span><span class="s2">&#34;scheme&gt; &#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">                <span class="k">if</span> <span class="n">code</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span> <span class="o">==</span> <span class="s1">&#39;quit&#39;</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">                    <span class="k">break</span>
</span></span><span class="line"><span class="cl">                <span class="n">result</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">eval_string</span><span class="p">(</span><span class="n">code</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">                <span class="k">if</span> <span class="n">result</span> <span class="ow">is</span> <span class="ow">not</span> <span class="kc">None</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">                    <span class="nb">print</span><span class="p">(</span><span class="n">format_value</span><span class="p">(</span><span class="n">result</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">            <span class="k">except</span> <span class="ne">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">                <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;Error: </span><span class="si">{</span><span class="n">e</span><span class="si">}</span><span class="s2">&#34;</span><span class="p">)</span>
</span></span></code></pre></div><h2 id="example-session">Example Session</h2>
<p>Here&rsquo;s what a session looks like:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">scheme&gt; (define double (lambda (x) (* x 2)))
</span></span><span class="line"><span class="cl">scheme&gt; (double 21)
</span></span><span class="line"><span class="cl">42
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">scheme&gt; (map double (list 1 2 3 4 5))
</span></span><span class="line"><span class="cl">(2 4 6 8 10)
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">scheme&gt; (mkdir &#34;/data&#34;)
</span></span><span class="line"><span class="cl">&#34;/data&#34;
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">scheme&gt; (write &#34;/data/numbers.txt&#34; &#34;1\n2\n3\n&#34;)
</span></span><span class="line"><span class="cl">&#34;/data/numbers.txt&#34;
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">scheme&gt; (cat &#34;/data/numbers.txt&#34;)
</span></span><span class="line"><span class="cl">&#34;1\n2\n3\n&#34;
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">scheme&gt; (filter (lambda (x) (&gt; x 1))
</span></span><span class="line"><span class="cl">               (map (lambda (s) (string-&gt;number s))
</span></span><span class="line"><span class="cl">                    (string-split (cat &#34;/data/numbers.txt&#34;) &#34;\n&#34;)))
</span></span><span class="line"><span class="cl">(2 3)
</span></span></code></pre></div><p>We&rsquo;ve built a scriptable filesystem with a dozen lines of primitives!</p>
<h2 id="design-patterns">Design Patterns</h2>
<p>Several patterns make this work:</p>
<p><strong>Homoiconicity</strong>: Code is data. The AST is just lists, so we can generate and manipulate code programmatically.</p>
<p><strong>Closures</strong>: Functions capture their environment, enabling powerful patterns like partial application and callbacks.</p>
<p><strong>Extension via primitives</strong>: Adding features is just adding functions to the global environment. No interpreter changes needed.</p>
<p><strong>Separation of concerns</strong>: Tokenizing, parsing, and evaluating are distinct phases with clean interfaces.</p>
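<p>To make the first and last of these concrete, here is a minimal sketch of the tokenize and parse phases. It assumes a far simpler tokenizer than the one built earlier in this series (no string literals, no comments), and the names <code>tokenize</code> and <code>parse_one</code> are illustrative stand-ins, not the DagShell implementations.</p>

```python
def tokenize(code):
    """Split Scheme source into parenthesis and atom tokens (no string literals)."""
    return code.replace("(", " ( ").replace(")", " ) ").split()


def parse_one(tokens, idx=0):
    """Parse one expression starting at tokens[idx]; return (expr, next_idx)."""
    if idx >= len(tokens):
        raise SyntaxError("unexpected end of input")
    tok = tokens[idx]
    if tok == ")":
        raise SyntaxError("unexpected )")
    if tok == "(":
        items, idx = [], idx + 1
        while idx < len(tokens) and tokens[idx] != ")":
            item, idx = parse_one(tokens, idx)
            items.append(item)
        if idx >= len(tokens):
            raise SyntaxError("missing )")
        return items, idx + 1  # skip the closing parenthesis
    try:
        return int(tok), idx + 1  # integer literal
    except ValueError:
        return tok, idx + 1  # anything else is a symbol


ast, next_idx = parse_one(tokenize("(define double (lambda (x) (* x 2)))"))
print(ast)  # plain nested Python lists: code as data
```

<p>Because the result is an ordinary list, a program can build or rewrite it with normal list operations before handing it to the evaluator.</p>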
<h2 id="trade-offs">Trade-offs</h2>
<p>This simple interpreter has limitations:</p>
<ol>
<li><strong>No tail-call optimization</strong>: Deep recursion will exceed Python&rsquo;s recursion limit and raise <code>RecursionError</code>.</li>
<li><strong>No macros</strong>: We can&rsquo;t extend the language&rsquo;s syntax.</li>
<li><strong>Error messages are basic</strong>: Line numbers would require more tracking.</li>
<li><strong>Performance</strong>: Pure interpretation is slow for heavy computation.</li>
</ol>
<p>For a DSL extending an application, these trade-offs are often acceptable. The simplicity enables understanding and modification.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Embedding a language isn&rsquo;t black magic. At its core:</p>
<ol>
<li><strong>Tokenize</strong>: Break text into tokens</li>
<li><strong>Parse</strong>: Build an AST (for Lisp, just nested lists)</li>
<li><strong>Evaluate</strong>: Walk the AST, computing values</li>
<li><strong>Extend</strong>: Add primitives that call into your application</li>
</ol>
<p>Scheme&rsquo;s minimal syntax makes this especially tractable. In a few hundred lines, we get variables, functions, closures, conditionals, recursion, and extensibility.</p>
<p>The result is a scriptable application where users can automate, experiment, and build beyond what the core interface provides, which is exactly what we want from a DSL.</p>
<hr>
<p><em>The complete implementation is in <a href="https://github.com/queelius/dagshell">DagShell</a>, demonstrating Scheme as a filesystem scripting language alongside Python&rsquo;s fluent API.</em></p>
<p><em>This is the final post in the series. Previous posts: <a href="../01-immutable-content-addressed-filesystem/">Immutable Content-Addressed Filesystems</a> and <a href="../02-unix-philosophy-python/">Unix Philosophy in Python</a>.</em></p>
]]></content:encoded></item><item><title>OEIS Contributions</title><link>https://metafunctor.com/research/oeis-contributions/</link><pubDate>Fri, 05 Jun 2026 00:00:00 +0000</pubDate><dc:creator>Alex Towell</dc:creator><guid isPermaLink="true">https://metafunctor.com/research/oeis-contributions/</guid><description>Integer sequences I have authored or extended in the On-Line Encyclopedia of Integer Sequences, across Ramsey theory, extremal graph theory, knot theory, and algebraic graph theory. Computed by SAT solving and exact enumeration, verified against prior art.</description><content:encoded><![CDATA[<p><strong>Why these.</strong> These sequences have nothing in common at the level of
subject. They range over Ramsey theory, extremal graph theory, knot
theory, and the algebra of graphs. What they share is how they were found.
Each is a quantity with no known formula, computed exactly by SAT solving
or exhaustive search, pushed one value past the published frontier, and
checked against all prior art before it was submitted. The spread across
fields is deliberate: it is evidence that the method travels. And the
method has a direction. The computed values are where I go looking for
structure. The Rado numbers began as a column of computed integers and
ended as a proved theorem; the rest are trailheads, some already climbed,
others marked for whoever arrives next. The OEIS is where the results
live, as permanent reference points anyone can build on or check.</p>
<p>Author search:
<a href="https://oeis.org/search?q=alex+towell">oeis.org/search?q=alex+towell</a>.</p>
<h2 id="authored-sequences">Authored sequences</h2>
<p>These sequences were created and first populated by me.</p>
<table>
	<thead>
			<tr>
					<th>Sequence</th>
					<th>Description</th>
					<th>Domain</th>
			</tr>
	</thead>
	<tbody>
			<tr>
					<td><a href="https://oeis.org/A394445">A394445</a></td>
					<td>Distinct-variable 2-color Rado numbers for x+y=nz: the least k such that every 2-coloring of {1,&hellip;,k} has a monochromatic distinct-variable solution</td>
					<td>Ramsey theory</td>
			</tr>
			<tr>
					<td><a href="https://oeis.org/A394661">A394661</a></td>
					<td>Triangle T(n,k): number of prime knots with n crossings and three-genus k</td>
					<td>Knot theory</td>
			</tr>
			<tr>
					<td><a href="https://oeis.org/A395521">A395521</a></td>
					<td>Number of non-isomorphic abelian groups appearing as the sandpile group K(G) over graphs on n vertices</td>
					<td>Algebraic graph theory</td>
			</tr>
			<tr>
					<td><a href="https://oeis.org/A395644">A395644</a></td>
					<td>Number of fibered prime knots with n crossings</td>
					<td>Knot theory</td>
			</tr>
	</tbody>
</table>
<p>A394445 is backed by a closed-form theorem (a proof that the distinct
2-color Rado number for x+y=nz follows an explicit parity-dependent formula
for all n &gt;= 8), with a 500-term b-file. A394661 and A395644 come from a
census of the KnotInfo prime-knot tables; A395521 from an exhaustive
sandpile-group computation.</p>
<h2 id="extended-sequences">Extended sequences</h2>
<p>Classical Zarankiewicz-problem sequences (originally by N. J. A. Sloane)
where I computed and added new terms past the known frontier.</p>
<table>
	<thead>
			<tr>
					<th>Sequence</th>
					<th>Description</th>
					<th>My terms</th>
			</tr>
	</thead>
	<tbody>
			<tr>
					<td><a href="https://oeis.org/A006615">A006615</a></td>
					<td>z(n,n;3,4): least k forcing an all-ones 3x4 submatrix in an n x n 0/1 matrix</td>
					<td>a(10)=67; a(11)=79 (in review)</td>
			</tr>
			<tr>
					<td><a href="https://oeis.org/A006622">A006622</a></td>
					<td>z(n,n+1;3,4): same for n x (n+1) matrices</td>
					<td>a(9)=61; a(10)=73 (in review)</td>
			</tr>
			<tr>
					<td><a href="https://oeis.org/A006625">A006625</a></td>
					<td>z(n,n+2;3,4): same for n x (n+2) matrices</td>
					<td>a(9)=67; a(10)=79 (in review)</td>
			</tr>
	</tbody>
</table>
<p>The Zarankiewicz extensions use SAT: a satisfying assignment exhibits a
dense matrix (a lower bound), and an unsatisfiable instance proves the
matching upper bound. The most recent terms required a double-lex
symmetry-breaking encoding to make the upper-bound proof tractable: the
plain encoding could not settle z(10,11;3,4) in seven days, while the
symmetry-broken version proved it in 27 minutes.</p>
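<p>The lower-bound half is easy to check independently of any solver. A minimal sketch, assuming the witness arrives as a list of 0/1 rows (the function names are hypothetical, not taken from my submission scripts):</p>

```python
from itertools import combinations


def has_all_ones_block(grid, s, t):
    """True if some s rows and t columns of the 0/1 grid cross only at 1s."""
    n_cols = len(grid[0])
    for rows in combinations(grid, s):
        # Count the columns in which every chosen row carries a 1.
        common = sum(all(row[c] for row in rows) for c in range(n_cols))
        if common >= t:
            return True
    return False


def check_witness(grid, s, t, k):
    """A valid witness has at least k ones and no all-ones s-by-t block."""
    ones = sum(sum(row) for row in grid)
    return ones >= k and not has_all_ones_block(grid, s, t)


# Toy example: six 1s in a 3x3 grid with no all-ones 2x2 block.
toy = [[1, 1, 0],
       [0, 1, 1],
       [1, 0, 1]]
print(check_witness(toy, 2, 2, 6))
```

<p>The upper-bound half has no such shortcut: an UNSAT answer is only as trustworthy as the encoding and the solver behind it.</p>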
<h2 id="in-review-june-2026">In review (June 2026)</h2>
<ul>
<li>A006615 a(11)=79, A006622 a(10)=73, A006625 a(10)=79 (the three exact
Zarankiewicz values above), proposed June 5, 2026.</li>
</ul>
<p>The A394445 b-file (500 terms) is already approved and live.</p>
<h2 id="reviewers">Reviewers</h2>
<p>These submissions were reviewed and approved by OEIS editors including
Michel Marcus, Sean A. Irvine, Max Alekseyev, and Jon E. Schoenfield.</p>
<hr>
<p>The OEIS has been a shared ledger of integer sequences since 1964. Adding
a term is a small, permanent contribution to a commons that outlasts any
one project: a number that was unknown is now known, checked, and citable
by anyone who needs it.</p>
]]></content:encoded><category>OEIS</category><category>combinatorics</category><category>Ramsey theory</category><category>extremal graph theory</category><category>knot theory</category><category>SAT solving</category></item><item><title>The Expensive Half of a SAT Proof</title><link>https://metafunctor.com/post/2026-06-05-symmetry-breaking-unsat/</link><pubDate>Fri, 05 Jun 2026 00:00:00 +0000</pubDate><dc:creator>Alex Towell</dc:creator><guid isPermaLink="true">https://metafunctor.com/post/2026-06-05-symmetry-breaking-unsat/</guid><description>A from-first-principles tour of double-lex symmetry breaking for SAT: why the UNSAT side is expensive, how canonical-representative constraints fix it, a worked Zarankiewicz example that settled three new OEIS values, and an honest map of where the technique stops scaling.</description><content:encoded><![CDATA[<h1 id="the-expensive-half-of-a-sat-proof">The Expensive Half of a SAT Proof</h1>
<p>Here is a number that bothered me for a week. I wanted the exact value of a
small combinatorial quantity called z(10,11;3,4). A SAT solver found a good
configuration in about two seconds, which told me the answer was <em>at least</em>
72. To finish, I needed the other direction: a proof that nothing does
better, that the answer is <em>at most</em> 72. I started that second SAT job and
walked away. It ran for more than <strong>seven days</strong> without an answer.</p>
<p>Then I added one idea, an old and standard one, and the same proof finished
in <strong>27 minutes</strong>.</p>
<p>The idea is symmetry breaking. This post builds it up from scratch, watches
it work on the problem above (the payoff was three new values in the
<a href="https://oeis.org">On-Line Encyclopedia of Integer Sequences</a>), and then
watches it fail on a different problem, because knowing when a tool stops
working is most of the skill of using it.</p>
<h2 id="a-problem-you-can-hold-in-your-hand">A problem you can hold in your hand</h2>
<p>The quantity z(m,n;s,t) is the <strong>largest number of 1s you can put in an
m-by-n grid of 0s and 1s without creating an all-ones s-by-t block</strong>. By &ldquo;block&rdquo;
I mean: pick some s rows and some t columns; the s-by-t cells where they cross
must not all be 1.</p>
<p>Picture a 4-by-4 grid. If rows 1 and 3 and columns 2 and 4 all carry 1s at
their four intersections, that is a forbidden 2-by-2 block. Avoid every such
block, pack in as many 1s as you can, and the maximum is the Zarankiewicz
number. These numbers grow in a ragged, formula-resistant way, which is
exactly why people compute them one at a time.</p>
<p>This is a natural fit for a SAT solver. Put one Boolean variable on each
cell, true when the cell is 1. For every choice of s rows and t columns,
add one clause saying &ldquo;at least one of these s times t cells is 0.&rdquo; Add a
counting constraint, &ldquo;at least k cells are 1,&rdquo; and ask the solver:</p>
<ul>
<li>If it answers <strong>satisfiable</strong>, it hands back a grid with at least k ones and no
forbidden block. So z is at least k.</li>
<li>If it answers <strong>unsatisfiable</strong>, no such grid exists, so z is less than k.</li>
</ul>
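<p>Here is a minimal sketch of the block clauses, assuming DIMACS-style integer literals (variable ids start at 1, a negative number means &ldquo;this cell is 0&rdquo;). The counting constraint needs a cardinality encoding that I leave out; in its place a brute-force loop stands in for the solver, which is only feasible on toy grids. The function names are illustrative.</p>

```python
from itertools import combinations, product


def cell_var(r, c, n):
    """DIMACS-style variable id (1-based) for cell (r, c) of a grid with n columns."""
    return r * n + c + 1


def block_clauses(m, n, s, t):
    """One clause per choice of s rows and t columns: some chosen cell is 0."""
    return [[-cell_var(r, c, n) for r in rows for c in cols]
            for rows in combinations(range(m), s)
            for cols in combinations(range(n), t)]


def max_ones_bruteforce(m, n, s, t):
    """Largest number of 1s over all grids satisfying every block clause."""
    clauses = block_clauses(m, n, s, t)
    best = 0
    for bits in product([False, True], repeat=m * n):
        # Every literal is negative, so it holds when its cell is False.
        if all(any(not bits[-lit - 1] for lit in clause) for clause in clauses):
            best = max(best, sum(bits))
    return best


print(len(block_clauses(3, 3, 2, 2)))   # number of forbidden-block clauses
print(max_ones_bruteforce(3, 3, 2, 2))  # z(3,3;2,2) by exhaustive search
```

<p>The clause count grows as a product of two binomial coefficients, which stays modest here; the hard part is the search, not the size of the formula.</p>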
<p>To pin the exact value you need both: a witness at k, and an impossibility
proof at k+1. The witness side is usually easy. The impossibility side, the
UNSAT side, is the expensive half.</p>
<h2 id="why-proving-impossibility-is-hard">Why proving impossibility is hard</h2>
<p>The asymmetry is intuitive once you say it out loud. To find a witness, the
solver needs <strong>one</strong> lucky arrangement. To prove impossibility, it must rule
out <strong>every</strong> arrangement. That is already a harder job. But there is a
second, sneakier cost, and it is the one that matters here.</p>
<p>The solver keeps rediscovering the same grid wearing different clothes.</p>
<p>Take any valid grid. Swap two of its rows. You get a different assignment of
the Boolean variables, but it is plainly the same object: same number of 1s,
same blocks present or absent. Swap two columns: same story. The solver does
not know this. To the solver, the original and the swapped version are two
different points in the search space, each to be explored and refuted from
scratch.</p>
<h2 id="counting-the-copies">Counting the copies</h2>
<p>Name the thing that is hurting us. The <strong>row permutations and the column
permutations</strong> form a group that acts on every grid while preserving both
what we are maximizing (the count of 1s) and what we are forbidding (the
blocks). So the grids come in families, called orbits, and every grid in an
orbit is the same object relabeled.</p>
<p>How big is an orbit? For a 10-by-11 grid, you can reorder the rows in 10!
ways and the columns in 11! ways:</p>
<blockquote>
<p>10! times 11! is about 1.4 times 10^14.</p>
</blockquote>
<p>That is roughly a hundred trillion relabelings of a single grid (fewer
distinct ones when some relabelings leave the grid unchanged). When the
solver is trying to prove that no good grid exists, it is wading through up
to a hundred trillion copies of each candidate it has already dismissed. No
wonder the job ran for a week.</p>
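<p>The arithmetic behind that count, as a quick check using only the standard library:</p>

```python
import math

# Row orderings times column orderings of a 10-by-11 grid.
relabelings = math.factorial(10) * math.factorial(11)
print(f"{relabelings:,}")
print(f"{relabelings:.1e}")
```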
<h2 id="the-fix-keep-one-copy-of-each">The fix: keep one copy of each</h2>
<p>The cure is to refuse all but one member of each orbit. If we can tell the
solver &ldquo;only consider the canonical version of each grid,&rdquo; the search shrinks
by the size of the orbits.</p>
<p>The standard canonical choice is the <strong>lexicographically smallest</strong> relabeling,
and the standard cheap approximation to it is called <strong>double-lex</strong>:</p>
<ul>
<li>require the <strong>rows</strong> to be in non-decreasing lexicographic order, and</li>
<li>require the <strong>columns</strong> to be in non-decreasing lexicographic order.</li>
</ul>
<p>Read each row left to right as a string of 0s and 1s; demand row 1 is no
larger than row 2, which is no larger than row 3, and so on. Do the same down
the columns.</p>
<p>The reason this is allowed, and it is worth saying carefully, is a soundness
fact: <strong>every orbit contains at least one grid that is both row-sorted and
column-sorted.</strong> So when we throw away every non-doubly-lex grid, we never
throw away an entire orbit. If a good grid existed before, a good doubly-lex
grid still exists. Satisfiability is untouched, and so is unsatisfiability.
We have deleted only redundant copies, never an actual solution. That is the
whole game.</p>
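<p>The soundness fact can be checked exhaustively on grids small enough to enumerate. This is a brute-force sketch, not a proof and not part of the SAT pipeline; the helper names are my own for illustration.</p>

```python
from itertools import permutations, product


def is_double_lex(grid):
    """Rows and columns both in non-decreasing lexicographic order."""
    rows = list(grid)
    cols = list(zip(*grid))
    return rows == sorted(rows) and cols == sorted(cols)


def orbit(grid):
    """Every relabeling of the grid under row and column permutations."""
    m, n = len(grid), len(grid[0])
    return {tuple(tuple(grid[r][c] for c in col_perm) for r in row_perm)
            for row_perm in permutations(range(m))
            for col_perm in permutations(range(n))}


def audit(m, n):
    """Return (number of orbits, orbits with no doubly-lex member)."""
    seen, orbits, missing = set(), 0, 0
    for bits in product((0, 1), repeat=m * n):
        grid = tuple(tuple(bits[r * n:(r + 1) * n]) for r in range(m))
        if grid in seen:
            continue
        members = orbit(grid)
        seen |= members
        orbits += 1
        if not any(is_double_lex(g) for g in members):
            missing += 1
    return orbits, missing


print(audit(3, 3))  # the second number should be 0
```

<p>A zero in the second slot says no orbit was lost, which is exactly what the constraint needs in order to preserve both SAT and UNSAT answers.</p>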
<h2 id="saying-this-row-is-not-bigger-in-clauses">Saying &ldquo;this row is not bigger&rdquo; in clauses</h2>
<p>A SAT solver speaks in clauses, so we have to express &ldquo;this 0/1 vector is
lexicographically less than or equal to that one&rdquo; as a pile of ORs. The
construction is a small classic, and building it is the kind of thing this
blog likes to do by hand.</p>
<p>Compare two vectors a and b, position by position from the left. Carry one
extra Boolean per position, an &ldquo;equal so far&rdquo; flag. The rule:</p>
<ul>
<li>while the prefixes are still equal, the current bits must satisfy a-bit
implies b-bit (you are not allowed to be the one that is larger);</li>
<li>update the &ldquo;equal so far&rdquo; flag to stay true only if the current bits match.</li>
</ul>
<p>In clause form, at each position, with <code>eq</code> meaning equal-so-far:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl"># if equal so far, then a &lt;= b at this position
</span></span><span class="line"><span class="cl">(not eq) OR (not a) OR b
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"># eq_next is true exactly when eq held and the bits matched
</span></span><span class="line"><span class="cl"># (a handful of clauses define eq_next = eq AND (a == b))
</span></span></code></pre></div><p>It is linear in the length of the vector, and you can check it against a
truth table to be sure (I did, for all vectors up to length four, before
trusting it on anything larger). Apply it to each adjacent pair of rows and
each adjacent pair of columns, and you have double-lex.</p>
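<p>A minimal sketch of the comparator and the truth-table check, assuming DIMACS-style integer literals and fresh auxiliary variables for the <code>eq</code> flags. The function names and the exact clause set are one reasonable rendering of the rule above, not necessarily the ones in my builder.</p>

```python
from itertools import product


def lex_leq_clauses(a, b, next_var):
    """CNF forcing vector a to be lexicographically at most vector b.

    a, b: equal-length lists of variable ids; next_var: first unused id.
    Returns (clauses, next unused id).
    """
    clauses = []
    eq = next_var
    next_var += 1
    clauses.append([eq])  # the empty prefixes are equal
    for i, (x, y) in enumerate(zip(a, b)):
        clauses.append([-eq, -x, y])  # equal so far implies x implies y
        if i == len(a) - 1:
            break  # no flag is needed after the last position
        eq_next = next_var
        next_var += 1
        clauses += [
            [-eq_next, eq],          # eq_next implies eq
            [-eq_next, -x, y],       # eq_next implies the bits match
            [-eq_next, x, -y],
            [-eq, -x, -y, eq_next],  # eq and both bits 1 imply eq_next
            [-eq, x, y, eq_next],    # eq and both bits 0 imply eq_next
        ]
        eq = eq_next
    return clauses, next_var


def satisfiable(clauses, fixed, free_vars):
    """Brute force: can the free variables be set so that every clause holds?"""
    for bits in product([False, True], repeat=len(free_vars)):
        model = {**fixed, **dict(zip(free_vars, bits))}
        if all(any(model[abs(lit)] == (lit > 0) for lit in c) for c in clauses):
            return True
    return False


def matches_truth_table(max_len=4):
    """Compare the encoding with Python's tuple ordering on all short vectors."""
    for n in range(1, max_len + 1):
        a = list(range(1, n + 1))
        b = list(range(n + 1, 2 * n + 1))
        clauses, nxt = lex_leq_clauses(a, b, 2 * n + 1)
        aux = list(range(2 * n + 1, nxt))
        for va in product([False, True], repeat=n):
            for vb in product([False, True], repeat=n):
                fixed = dict(zip(a + b, va + vb))
                if satisfiable(clauses, fixed, aux) != (va <= vb):
                    return False
    return True


print(matches_truth_table())
```

<p>Each position adds one ordering clause and at most five defining clauses, so the comparator stays linear in the vector length.</p>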
<p>One honest caveat. Double-lex is not a <em>complete</em> symmetry break. Some orbits
still keep more than one doubly-lex member, and we only constrained adjacent
pairs rather than all permutations directly. It is sound and cheap, not
perfect. That is fine. Removing most of the symmetry is enough to turn
intractable into tractable, and chasing a unique representative would cost
more than it returns.</p>
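<p>The incompleteness is easy to exhibit. The two 3-by-3 grids below are an example I picked by hand (they are not from the Zarankiewicz runs): both satisfy double-lex, they are different grids, and a brute-force search confirms one is a relabeling of the other.</p>

```python
from itertools import permutations


def is_double_lex(grid):
    """Rows and columns both in non-decreasing lexicographic order."""
    rows = list(grid)
    cols = list(zip(*grid))
    return rows == sorted(rows) and cols == sorted(cols)


def same_orbit(g, h):
    """True if some row permutation plus column permutation turns g into h."""
    m, n = len(g), len(g[0])
    return any(
        tuple(tuple(g[r][c] for c in col_perm) for r in row_perm) == h
        for row_perm in permutations(range(m))
        for col_perm in permutations(range(n))
    )


A = ((0, 0, 1),
     (0, 1, 0),
     (1, 0, 1))
B = ((0, 0, 1),
     (0, 1, 1),
     (1, 0, 0))

print(is_double_lex(A), is_double_lex(B), A != B, same_orbit(A, B))
```

<p>Four <code>True</code> values mean the solver would still visit both copies, which is the price of constraining only adjacent pairs.</p>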
<h2 id="the-payoff-in-real-seconds">The payoff, in real seconds</h2>
<p>First, sanity. A known value: z(9,9;3,4) is 56. The proof that it is <em>not</em> 57
is a small UNSAT instance. With the plain encoding, the solver did not finish
it inside a 90-second budget. With double-lex, it finished in about <strong>ten
seconds</strong>. The symmetry break is the difference between &ldquo;no answer&rdquo; and &ldquo;ten
seconds&rdquo; on a problem whose answer we already knew, which is exactly the kind
of validation you want before trusting it on a problem whose answer you do
not.</p>
<p>Then the one that started this post:</p>
<table>
	<thead>
			<tr>
					<th>instance</th>
					<th>plain encoding</th>
					<th>double-lex</th>
			</tr>
	</thead>
	<tbody>
			<tr>
					<td>z(9,9;3,4): no grid with 57 ones</td>
					<td>over 90 s, no result</td>
					<td>about 10 s</td>
			</tr>
			<tr>
					<td>z(10,11;3,4): no grid with 73 ones</td>
					<td>over 7 days, no result</td>
					<td>27 min</td>
			</tr>
	</tbody>
</table>
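<p>That 27-minute proof, paired with the two-second witness, settled
z(10,11;3,4) = 72 exactly (the OEIS entries count the least number of 1s that
forces a block, one more than this maximum, so the recorded term is 73). Two
siblings fell the same way. All three went to
the OEIS as new values of classic Zarankiewicz sequences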
(<a href="https://oeis.org/A006615">A006615</a>,
<a href="https://oeis.org/A006622">A006622</a>,
<a href="https://oeis.org/A006625">A006625</a>), and they sit alongside the rest of my
<a href="/research/oeis-contributions/">OEIS contributions</a>, which is where this
kind of computed-then-verified result tends to land.</p>
<h2 id="now-watch-it-fail">Now watch it fail</h2>
<p>If the story stopped there, you might think symmetry-broken SAT is a frontier
machine. It is not, and the failure is more instructive than the success.</p>
<p>Try a different problem: the most edges you can put on n points without ever
forming a 4-cycle (a square). This is also a 0/1 matrix question, an adjacency
matrix this time, and it has the same flavor of symmetry, so the same tool
should apply. I built it and validated it: it reproduces the known values
(OEIS <a href="https://oeis.org/A006855">A006855</a>) exactly up to n = 15.</p>
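<p>For intuition about what is being computed, here is a brute-force sketch for tiny n. It is not the SAT encoding I validated, just exhaustive search over edge sets, and it relies on the fact that a graph contains a 4-cycle exactly when some pair of vertices has two or more common neighbours. The function names are illustrative.</p>

```python
from itertools import combinations


def is_c4_free(n, edges):
    """No two vertices share two or more neighbours, so no 4-cycle exists."""
    adj = [set() for _ in range(n)]
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return all(len(adj[u] & adj[v]) <= 1
               for u, v in combinations(range(n), 2))


def max_c4_free_edges(n):
    """Most edges in any C4-free graph on n vertices, by exhaustive search."""
    all_edges = list(combinations(range(n), 2))
    for k in range(len(all_edges), -1, -1):
        for edges in combinations(all_edges, k):
            if is_c4_free(n, edges):
                return k
    return 0


print([max_c4_free_edges(n) for n in range(1, 6)])
```

<p>This stops being usable almost immediately, since the number of edge sets doubles with every added edge; it is only a reference point for checking a real encoding on small cases.</p>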
<p>Then it dies. The runtime grows by roughly a factor of five for every extra
vertex. By n = 15 a single value already takes about a minute, and the known
frontier sits at n = 41. Five to the twenty-sixth power times a minute is not
a number of minutes anyone will wait. The tool that cracked Zarankiewicz in
27 minutes cannot get near this frontier at all.</p>
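<p>The extrapolation, spelled out. It assumes the factor-of-five growth simply continues from n = 15 to n = 41, which is a rough fit to my timings and not a measured law.</p>

```python
growth_per_vertex = 5        # observed slowdown per extra vertex (rough)
minutes_at_n15 = 1           # about a minute at n = 15
extra_vertices = 41 - 15     # distance to the known frontier

minutes = minutes_at_n15 * growth_per_vertex ** extra_vertices
years = minutes / (60 * 24 * 365)
print(f"{minutes:.2e} minutes, roughly {years:.1e} years")
```

<p>Even if the true growth factor were half that, the conclusion would not change.</p>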
<p>Why the split? It comes down to the shape of the symmetry.</p>
<ul>
<li>In the Zarankiewicz grid, rows and columns are <strong>separate axes</strong>. You can
sort them independently, and the dimensions are modest (10 by 11). Adjacent
lexicographic order removes enough of the redundancy to win.</li>
<li>In a graph, there is <strong>one</strong> symmetry, the relabeling of vertices, and it
permutes the rows and the columns of the adjacency matrix <em>together</em>.
Adjacent-pair lexicographic order barely dents that single large group, and
the leftover symmetry still explodes the search.</li>
</ul>
<p>The people who actually reach the C4-free frontier do not use plain SAT at
all. They use orderly generation, building graphs in a canonical form so that
no two relabelings are ever produced (the lineage of nauty and its relatives).
That is a different and heavier instrument.</p>
<p>The lesson generalizes. <strong>Symmetry-broken SAT scales when the symmetry group
is a product of modest factors you can sort independently, and it stalls when
it is one big group.</strong> Before you reach for it, look at the symmetry you are
facing and ask which of those two worlds you are in.</p>
<h2 id="credit-where-it-is-due">Credit where it is due</h2>
<p>None of the machinery here is mine, and that is worth stating plainly rather
than burying. Double-lex symmetry breaking for matrix models is from Flener,
Frisch, Hnich, Kiziltan, Miguel, Pearson, and Walsh, &ldquo;Breaking row and column
symmetries in matrix models&rdquo; (CP 2002). Attacking Zarankiewicz numbers with
symmetry-aware SAT was done by Jeremy Tan in &ldquo;An attack on Zarankiewicz&rsquo;s
problem through SAT solving&rdquo; (2022). The lexicographic clause encoding is
textbook (Knuth, TAOCP 7.2.2.2; and Codish and collaborators). The only new
things in my version are the particular values that came out the other end.</p>
<p>That is, in a way, the point of the post. This is a standard technique that
lives mostly inside the constraint-programming community and is underused
everywhere else. If you find yourself staring at a SAT job that will not prove
the impossibility you are sure is true, count the symmetries first. There may
be a hundred trillion copies of your problem hiding in plain sight, and a few
lexicographic clauses to make them go away.</p>
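<p>The count behind that figure, assuming it refers to the row and column permutations of the 10 by 11 grid mentioned above:</p>
<p>$$10! \cdot 11! = 3{,}628{,}800 \times 39{,}916{,}800 = 144{,}850{,}083{,}840{,}000 \approx 1.45 \times 10^{14}$$</p>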
<h2 id="takeaways">Takeaways</h2>
<ul>
<li>The UNSAT side of an exact-value SAT proof is the expensive half, and
symmetry is usually why.</li>
<li>Canonical-representative constraints (double-lex) are cheap, sound, and
often decisive: they keep at least one grid per orbit and discard most of the
up to m! times n! relabeled copies.</li>
<li>They are standard in constraint programming and underused elsewhere.</li>
<li>They scale on product-of-modest-groups symmetry (the Zarankiewicz grid) and
stall on one large group (graph vertex relabeling), where canonical
generation is the right tool instead. Match the method to the symmetry.</li>
</ul>
<p><em>The code (a lexicographic-CNF builder, the Zarankiewicz attack, and the
C4-free contrast case, all with tests) lives in my open-problems repository.</em></p>
]]></content:encoded><category>sat-solving</category><category>symmetry-breaking</category><category>zarankiewicz</category><category>oeis</category><category>constraint-programming</category><category>extremal-graph-theory</category><category>unsat</category><category>lexicographic-ordering</category></item><item><title>Synthesis: Codecs as Structure</title><link>https://metafunctor.com/post/2026-05-synthesis-wire-formats/</link><pubDate>Fri, 15 May 2026 00:00:00 +0000</pubDate><dc:creator>Alex Towell</dc:creator><guid isPermaLink="true">https://metafunctor.com/post/2026-05-synthesis-wire-formats/</guid><description>The series closes by restating the codes-as-priors thesis across all twelve instances and connecting the wire-format side to the Stepanov type-algebra side.</description><content:encoded><![CDATA[<p><em>Twelve posts, twelve codes, one thesis that refused to change. This is the closing summary.</em></p>
<h2 id="a-the-twelve-codes-together">A. The Twelve Codes Together</h2>
<p>Every post in this series answered a version of the same question: given a source of positive integers, how do you represent its values compactly as a sequence of bits? The answers differ in shape, in assumptions, and in which distribution each code implicitly expects.</p>
<table>
	<thead>
			<tr>
					<th>Post</th>
					<th>Code</th>
					<th>Implied prior (one phrase)</th>
			</tr>
	</thead>
	<tbody>
			<tr>
					<td>1-2</td>
					<td>Foundations</td>
					<td>Prefix-free codes are possible iff Kraft&rsquo;s inequality holds</td>
			</tr>
			<tr>
					<td>3</td>
					<td>Priors framework</td>
					<td>Any code defines a prior; the best code matches the source</td>
			</tr>
			<tr>
					<td>4</td>
					<td>Unary</td>
					<td>Geometric(1/2): value 1 is twice as likely as value 2, etc.</td>
			</tr>
			<tr>
					<td>5a</td>
					<td>Elias Gamma</td>
					<td>Power-law: probability falls as 1/n^2</td>
			</tr>
			<tr>
					<td>5b</td>
					<td>Elias Delta</td>
					<td>Heavier-tailed power law: slower decay for large values</td>
			</tr>
			<tr>
					<td>5c</td>
					<td>Elias Omega</td>
					<td>Recursive structure: no fixed polynomial decay rate</td>
			</tr>
			<tr>
					<td>6</td>
					<td>Fibonacci</td>
					<td>Near-geometric with Zeckendorf structure; good for Zeckendorf-sparse integers</td>
			</tr>
			<tr>
					<td>7</td>
					<td>Rice / Golomb</td>
					<td>Geometric with known decay rate; optimal when the parameter m is tuned to that rate</td>
			</tr>
			<tr>
					<td>8</td>
					<td>VByte</td>
					<td>Roughly uniform over byte-aligned ranges; engineering favorite</td>
			</tr>
			<tr>
					<td>9</td>
					<td>Huffman</td>
					<td>Source-optimal given the exact symbol distribution</td>
			</tr>
			<tr>
					<td>10</td>
					<td>Arithmetic coding</td>
					<td>Approaches entropy to an arbitrary fraction of a bit</td>
			</tr>
			<tr>
					<td>11</td>
					<td>Succinct bit vectors</td>
					<td>Not a code for integers: a representation that answers rank/select queries</td>
			</tr>
			<tr>
					<td>12</td>
					<td>RoaringBitmap</td>
					<td>Polyalgorithm: picks array, bitset, or run-length per container chunk</td>
			</tr>
	</tbody>
</table>
<p>Posts 1 and 2 (<a href="/post/2020-03-kraft-wire-formats/">Kraft&rsquo;s Inequality</a> and <a href="/post/2020-09-mcmillan-wire-formats/">McMillan&rsquo;s Converse</a>) established why prefix-free codes are the right unit of analysis. Post 3 (<a href="/post/2022-01-priors-wire-formats/">Universal Codes as Priors</a>) named the frame: a code is a hypothesis about the source. Posts 4 through 10 filled in the catalogue. Posts 11 and 12 extended from integer coding to set representation, where the questions shift from &ldquo;how long is this codeword?&rdquo; to &ldquo;how do you store membership?&rdquo; and &ldquo;how do you answer rank/select?&rdquo;</p>
<p>Looking across all twelve, the main lesson is not that one code dominates. It is that the question &ldquo;which code?&rdquo; is always empirically answerable given a sample.</p>
<h2 id="b-the-unifying-frame-restated">B. The Unifying Frame Restated</h2>
<p><a href="/post/2022-01-priors-wire-formats/">Post 3</a> introduced the codes-as-priors thesis with two instances behind it. We now have twelve. The thesis has not changed; it has only become more evidently true.</p>
<p>Here is the six-clause version.</p>
<p><strong>Clause 1: a code is a hypothesis.</strong> Every prefix-free code for positive integers assigns a codeword length $\ell(n)$ to each value $n$. By Kraft&rsquo;s equality for complete codes, the implied probability is $p(n) = 2^{-\ell(n)}$. A code is not neutral: it says something about how likely each value is.</p>
<p><strong>Clause 2: lengths determine a prior.</strong> If you choose Elias gamma, you are implicitly betting that values are power-law distributed with exponent 2. If you choose Unary, you are betting on geometric(1/2). If you choose VByte, you are betting on a near-uniform distribution over byte-aligned ranges. The bet is in the lengths, whether or not you named it as a bet.</p>
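<p>Clauses 1 and 2 can be checked numerically. The sketch below computes codeword lengths for Unary and Elias gamma, converts a length into its implied probability, and accumulates a partial Kraft sum. The function names are hypothetical and are not taken from the series&rsquo; library.</p>

```cpp
#include <cmath>
#include <cstdint>

// Unary: n is written with n bits in total (n - 1 repeated bits and a terminator).
inline int unary_length(std::uint64_t n) { return static_cast<int>(n); }

// floor(log2(n)) for n >= 1, computed by shifting.
inline int floor_log2(std::uint64_t n) {
    int k = 0;
    while (n >>= 1) ++k;
    return k;
}

// Elias gamma: floor(log2 n) zeros followed by the binary form of n.
inline int gamma_length(std::uint64_t n) { return 2 * floor_log2(n) + 1; }

// The probability a codeword of this length implicitly assigns: 2^-length.
inline double implied_probability(int length) { return std::ldexp(1.0, -length); }

// Partial Kraft sum over 1..limit; it approaches 1 for a complete code.
template <typename LengthFn>
double kraft_sum(LengthFn length, std::uint64_t limit) {
    double total = 0.0;
    for (std::uint64_t n = 1; n <= limit; ++n) total += implied_probability(length(n));
    return total;
}
```

<p>Since <code>gamma_length(n)</code> is about $2 \log_2 n + 1$, the implied probability is about $1/(2n^2)$, which is the exponent-2 power law named in Clause 2. Unary gives $2^{-n}$, the geometric(1/2) prior.</p>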
<p><strong>Clause 3: choosing a code is choosing a prior.</strong> This is just clauses 1 and 2 combined, but it bears stating as a unit: the act of picking a codec is a modeling decision, not a neutral engineering choice. Unary and Delta will produce radically different wire sizes on the same data, because they assume radically different sources.</p>
<p><strong>Clause 4: the best code matches the actual source.</strong> The code that minimizes expected description length for a source $p$ is the one whose implied prior most closely matches $p$. Formally, expected length under code $c$ for source $p$ equals entropy $H(p)$ plus the KL divergence $D_\text{KL}(p \,\|\, q_c)$ where $q_c$ is the code&rsquo;s implied prior. Minimum expected length occurs at $D_\text{KL} = 0$, i.e., when the code and source agree.</p>
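<p>The identity follows in one line, assuming the code is complete so that $q_c(n) = 2^{-\ell(n)}$ sums to one and $\ell(n) = -\log_2 q_c(n)$:</p>
<p>$$\mathbb{E}_p[\ell] = \sum_n p(n)\,\ell(n) = -\sum_n p(n) \log_2 q_c(n) = -\sum_n p(n) \log_2 p(n) + \sum_n p(n) \log_2 \frac{p(n)}{q_c(n)} = H(p) + D_\text{KL}(p \,\|\, q_c)$$</p>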
<p><strong>Clause 5: universal codes give bounded redundancy when the source is unknown.</strong> Elias gamma and delta are universal: for any monotonically non-increasing source with finite entropy, their expected length stays within a constant factor of the optimal code for that source. Delta is also asymptotically optimal: it spends $\log_2 n + O(\log \log n)$ bits on $n$, while gamma spends about $2 \log_2 n$. You pay a small overhead to avoid committing to a specific prior. The overhead is worth paying when you cannot measure the source in advance.</p>
<p><strong>Clause 6: polyalgorithms adapt per-chunk rather than committing globally.</strong> RoaringBitmap is the clearest instance of this: it measures each 65536-element chunk and picks the container type (sorted array, bitset, run-length) that minimizes storage for that chunk&rsquo;s observed density. This is not &ldquo;pick the best universal code&rdquo; but &ldquo;pick the best structure from a menu, per block.&rdquo; The principle generalizes: any large dataset may be non-stationary, and a single global codec will be suboptimal for every region where the source differs from the global average.</p>
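<p>A minimal sketch of that per-chunk decision follows. The byte counts are simplified assumptions (two bytes per array entry, a fixed 8192-byte bitset, a count plus start/length pairs for runs) and the names are hypothetical; this is not the Roaring implementation.</p>

```cpp
#include <cstddef>

enum class Container { Array, Bitset, Run };

// Approximate payload sizes in bytes for one chunk of 2^16 possible values.
// These formulas are simplifying assumptions, not Roaring's exact layout.
inline std::size_t array_bytes(std::size_t cardinality) { return 2 * cardinality; }  // one 16-bit entry per value
inline std::size_t bitset_bytes() { return 65536 / 8; }                               // fixed 8192 bytes
inline std::size_t run_bytes(std::size_t runs) { return 2 + 4 * runs; }               // run count + (start, length) pairs

// Pick whichever representation is smallest for the observed chunk.
// Ties go to the earlier candidate: Array, then Bitset, then Run.
inline Container choose_container(std::size_t cardinality, std::size_t runs) {
    std::size_t best = array_bytes(cardinality);
    Container choice = Container::Array;
    if (bitset_bytes() < best) { best = bitset_bytes(); choice = Container::Bitset; }
    if (run_bytes(runs) < best) { choice = Container::Run; }
    return choice;
}
```

<p>Under these assumptions a sparse chunk stays an array, a dense scattered chunk becomes a bitset, and a dense chunk made of a few long runs becomes a run container. The crossover between array and bitset falls at 4096 values, where both cost 8192 bytes.</p>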
<h2 id="c-composition-with-type-algebra">C. Composition with Type Algebra</h2>
<p>Two posts in the companion Stepanov series made claims that interlock with everything above. Working through them changed how I think about codec design.</p>
<p>The post <a href="/post/2026-05-codecs-functors-stepanov/">Bits Follow Types</a> argued that codecs compose along the algebraic structure of types: a codec for <code>Either&lt;A, B&gt;</code> is built from codecs for <code>A</code> and <code>B</code> by prepending a tag bit and dispatching; a codec for <code>Vec&lt;A&gt;</code> is built from a codec for <code>A</code> by iterating. The composition law is structural. It follows the constructors of the algebraic data type, not the distribution of the data.</p>
<p>The post <a href="/post/2026-05-prefix-free-stepanov/">When Lists Become Bits</a> argued that prefix-freeness lifts the free-monoid construction into bit space. Concatenating prefix-free codewords is unambiguous because each codeword is self-delimiting. The monoid of bit strings decomposes along codec boundaries exactly as the monoid of abstract tokens decomposes along type boundaries.</p>
<p>This series adds a third claim: the choice of leaf codec in any composite type is determined by the prior over that leaf&rsquo;s data, while the composition structure is determined by the type. These are two orthogonal axes of design freedom.</p>
<p>Consider <code>Either&lt;uint32_t, std::string&gt;</code>. The composition structure says: write one bit for the variant tag, then write the payload for the selected variant. That part is fixed by the type. What is not fixed is which codec you use for the tag bit, and which codec you use for the <code>uint32_t</code> payload.</p>
<p>If 90% of values are <code>uint32_t</code> and 10% are <code>std::string</code>, the tag carries only about 0.47 bits of information. A Huffman code over the two-symbol alphabet <code>{left, right}</code> cannot exploit that, because it must spend a whole bit on each symbol; an arithmetic coder, or a Huffman code over blocks of tags, gets close to 0.47 bits expected vs. 1 bit for the flat tag. And if the <code>uint32_t</code> values cluster in <code>[0, 127]</code>, VByte costs exactly 8 bits per value. Swap those observations and you swap those choices.</p>
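<p>The 0.47 figure can be checked directly: it is the binary entropy of a 90/10 split, which is the lower bound on the expected tag cost for any coder.</p>
<p>$$H(0.9) = -0.9 \log_2 0.9 - 0.1 \log_2 0.1 \approx 0.9 \times 0.152 + 0.1 \times 3.322 \approx 0.137 + 0.332 = 0.469\ \text{bits}$$</p>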
<p>The type algebra determines the composition; the empirical distribution determines the leaf codecs. Neither axis constrains the other. Attending to both is what makes codec design non-trivial.</p>
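<p>The two axes can be sketched in a few lines. In the toy code below, <code>either_encode</code> fixes the composition (a flat one-bit tag, then the payload) and takes the leaf codecs as parameters. Every name here is hypothetical, <code>std::variant</code> stands in for <code>Either</code>, and the bit sink is a <code>std::vector&lt;bool&gt;</code> rather than a packed buffer.</p>

```cpp
#include <cstdint>
#include <string>
#include <variant>
#include <vector>

using Bits = std::vector<bool>;  // toy bit sink; a real codec would pack bits into bytes

// Leaf codec: Elias gamma for n >= 1 (floor(log2 n) zeros, then n in binary).
inline void gamma_encode(std::uint64_t n, Bits& out) {
    int top = 0;                                            // index of the highest set bit
    for (std::uint64_t m = n; m > 1; m >>= 1) ++top;
    for (int i = 0; i < top; ++i) out.push_back(false);
    for (int i = top; i >= 0; --i) out.push_back(((n >> i) & 1u) != 0);
}

// Leaf codec: gamma-coded (length + 1), then the raw bytes, most significant bit first.
inline void string_encode(const std::string& s, Bits& out) {
    gamma_encode(static_cast<std::uint64_t>(s.size()) + 1, out);
    for (unsigned char ch : s)
        for (int i = 7; i >= 0; --i) out.push_back(((ch >> i) & 1) != 0);
}

// Composition fixed by the type: one tag bit, then the selected leaf codec.
// The leaf codecs are parameters, so swapping them never touches this function.
template <typename A, typename B, typename EncA, typename EncB>
void either_encode(const std::variant<A, B>& v, EncA enc_a, EncB enc_b, Bits& out) {
    out.push_back(v.index() == 1);                          // 0 = left, 1 = right
    if (v.index() == 0) enc_a(std::get<0>(v), out);
    else                enc_b(std::get<1>(v), out);
}
```

<p>Passing a gamma encoder for the left leaf turns the value 5 into six bits: the tag <code>0</code> followed by <code>00101</code>. Passing a VByte encoder instead would change the payload bits and nothing else. A payload that can be zero would need an offset of one before gamma coding, which is left out here.</p>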
<h2 id="d-the-codec-selection-library">D. The Codec-Selection Library</h2>
<p>The <code>synthesis.hpp</code> library distills the selection process into three functions.</p>
<p><code>empirical_distribution</code> counts occurrences of each value in a sample and normalizes by the total count to produce a probability map.</p>
<p><code>redundancy_for</code> computes the redundancy of a named code on a given distribution: expected length minus entropy. By Shannon&rsquo;s source-coding theorem, this is always non-negative. A lower redundancy means the code&rsquo;s implied prior is a closer fit to the actual source.</p>
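<p>As a rough illustration of these two steps, here is a minimal sketch. It is not the code from <code>synthesis.hpp</code>: the <code>_sketch</code> names, the <code>Distribution</code> alias, and the choice to pass the code as a length function rather than by name are all assumptions.</p>

```cpp
#include <cmath>
#include <cstdint>
#include <map>
#include <vector>

using Distribution = std::map<std::uint64_t, double>;  // value -> probability (hypothetical alias)

// Count each value, then normalize by the sample size.
inline Distribution empirical_distribution_sketch(const std::vector<std::uint64_t>& sample) {
    Distribution dist;
    for (std::uint64_t v : sample) dist[v] += 1.0;
    for (auto& [value, p] : dist) p /= static_cast<double>(sample.size());
    return dist;
}

// Shannon entropy in bits.
inline double entropy_bits(const Distribution& dist) {
    double h = 0.0;
    for (const auto& [value, p] : dist) h -= p * std::log2(p);
    return h;
}

// Redundancy = expected codeword length minus entropy, for any length function.
template <typename LengthFn>
double redundancy_sketch(const Distribution& dist, LengthFn length_of) {
    double expected = 0.0;
    for (const auto& [value, p] : dist) expected += p * static_cast<double>(length_of(value));
    return expected - entropy_bits(dist);
}
```

<p>On the sample <code>{1, 1, 2, 2}</code> the entropy is 1 bit and Unary spends 1.5 bits on average, so its redundancy is 0.5 bits.</p>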
<p><code>recommend_code</code> ties them together: it estimates the empirical distribution from a sample, computes the redundancy of all six candidate codes (Unary, Gamma, Delta, Omega, Fibonacci, VByte), and returns the name of the code with the smallest redundancy.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-cpp" data-lang="cpp"><span class="line"><span class="cl"><span class="kr">inline</span> <span class="n">std</span><span class="o">::</span><span class="n">string</span> <span class="n">recommend_code</span><span class="p">(</span><span class="k">const</span> <span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="n">std</span><span class="o">::</span><span class="kt">uint64_t</span><span class="o">&gt;&amp;</span> <span class="n">sample</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="k">auto</span> <span class="n">dist</span> <span class="o">=</span> <span class="n">empirical_distribution</span><span class="p">(</span><span class="n">sample</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="k">constexpr</span> <span class="n">std</span><span class="o">::</span><span class="n">string_view</span> <span class="n">candidates</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="s">&#34;Unary&#34;</span><span class="p">,</span> <span class="s">&#34;Gamma&#34;</span><span class="p">,</span> <span class="s">&#34;Delta&#34;</span><span class="p">,</span> <span class="s">&#34;Omega&#34;</span><span class="p">,</span> <span class="s">&#34;Fibonacci&#34;</span><span class="p">,</span> <span class="s">&#34;VByte&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">};</span>
</span></span><span class="line"><span class="cl">    <span class="kt">double</span> <span class="n">best_redundancy</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">numeric_limits</span><span class="o">&lt;</span><span class="kt">double</span><span class="o">&gt;::</span><span class="n">infinity</span><span class="p">();</span>
</span></span><span class="line"><span class="cl">    <span class="n">std</span><span class="o">::</span><span class="n">string</span> <span class="n">best_code</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">string_view</span> <span class="nl">candidate</span> <span class="p">:</span> <span class="n">candidates</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="kt">double</span> <span class="n">r</span> <span class="o">=</span> <span class="n">redundancy_for</span><span class="p">(</span><span class="n">candidate</span><span class="p">,</span> <span class="n">dist</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="p">(</span><span class="n">r</span> <span class="o">&lt;</span> <span class="n">best_redundancy</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">            <span class="n">best_redundancy</span> <span class="o">=</span> <span class="n">r</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">            <span class="n">best_code</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">string</span><span class="p">(</span><span class="n">candidate</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        <span class="p">}</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">best_code</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>On a strongly geometric sample (value 1 appears much more often than value 2, which appears more often than value 3, and so on), <code>recommend_code</code> returns <code>&quot;Unary&quot;</code> or <code>&quot;Fibonacci&quot;</code>. Both codes are matched to geometric-like priors; their redundancy on geometric data is low. On a sample of values in <code>[500, 1000]</code>, it returns <code>&quot;VByte&quot;</code> or <code>&quot;Delta&quot;</code>: each spends 15 or 16 bits on a value in that range, while Gamma spends 17 to 19 bits and Unary would require hundreds of bits per value.</p>
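<p>The bit counts behind the <code>[500, 1000]</code> comparison can be checked with closed-form length functions. This is a sketch with hypothetical names; the VByte variant assumed here carries 7 payload bits per byte.</p>

```cpp
#include <cstdint>

// floor(log2(n)) for n >= 1.
inline int ilog2(std::uint64_t n) {
    int k = 0;
    while (n >>= 1) ++k;
    return k;
}

// VByte with 7 payload bits per byte (assumed variant): 8 bits per started group of 7.
inline int vbyte_bits(std::uint64_t n) {
    int bytes = 1;
    while (n >>= 7) ++bytes;
    return 8 * bytes;
}

// Elias gamma: 2 * floor(log2 n) + 1 bits.
inline int gamma_bits(std::uint64_t n) { return 2 * ilog2(n) + 1; }

// Elias delta: gamma-code the bit length (floor(log2 n) + 1), then the low bits of n.
inline int delta_bits(std::uint64_t n) {
    int k = ilog2(n);
    return gamma_bits(static_cast<std::uint64_t>(k) + 1) + k;
}
```

<p>For n = 1000 these give 16 bits for VByte, 16 for Delta, and 19 for Gamma, against 1000 bits for Unary.</p>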
<p>The point is not that <code>recommend_code</code> is a production-quality selector. It is that it makes the selection process concrete and mechanical. There is no best code in general. There is a best code given a sample.</p>
<h2 id="e-the-six-principles">E. The Six Principles</h2>
<p>These are the things I now believe about coding, after working through twelve instances.</p>
<p><strong>1. A code is a prior.</strong> Every codeword length implies a probability. Choosing a code is choosing what you believe about the source. The choice is never neutral, even when it is unconscious. When you reach for the first codec you know, you are betting on a prior you may never have examined.</p>
<p><strong>2. Universality is robustness.</strong> A universal code performs well across many priors, not just one. Elias gamma stays within a constant factor of the optimum for any monotonically decaying source, and Elias delta is asymptotically optimal on top of that; VByte is a pragmatic universal for byte-oriented hardware. Use universal codes when you do not know the prior in advance, when the prior may shift over time, or when the engineering cost of measuring the source outweighs the compression gain. Use Huffman or arithmetic coding when you do know the prior and the gain from exploiting it justifies the overhead.</p>
<p><strong>3. Optimality is measurable.</strong> Shannon&rsquo;s source-coding theorem gives the lower bound: no uniquely-decodable code can have expected length below the entropy $H(p)$ of the source. Every code&rsquo;s redundancy is then measurable as expected-length minus entropy. This is not a theoretical abstraction; you can compute it from a sample in a few lines of code, as <code>redundancy_for</code> demonstrates. Pick codes by minimizing redundancy on the actual source, not by intuition about which code sounds right for the problem.</p>
<p><strong>4. Engineering trades dominate at scale.</strong> Theoretical optima (Elias gamma, delta, arithmetic coding) are outperformed in throughput benchmarks by byte-aligned approximations (VByte) when decode speed is the binding constraint. The reason is cache line alignment and SIMD: modern CPUs process data in 16-byte to 64-byte chunks, and a code that crosses byte boundaries pays a penalty that no redundancy saving can overcome at typical data volumes. Figure out where the binding constraint lives (compression ratio, decode throughput, encode throughput, memory, latency) before you decide which axis to optimize.</p>
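<p>To show why byte alignment is cheap to decode, here is a minimal VByte sketch. The convention used (7-bit groups, least significant first, high bit set when more bytes follow) is an assumption; the series&rsquo; own VByte may order or flag bytes differently.</p>

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Append n as 7-bit groups, least significant first.
// The high bit of each byte means "more bytes follow" (assumed convention).
inline void vbyte_encode(std::uint64_t n, std::vector<std::uint8_t>& out) {
    while (n >= 0x80) {
        out.push_back(static_cast<std::uint8_t>((n & 0x7F) | 0x80));
        n >>= 7;
    }
    out.push_back(static_cast<std::uint8_t>(n));
}

// Decode one value starting at pos and advance pos past it.
// Assumes well-formed input; a production decoder would bounds-check.
inline std::uint64_t vbyte_decode(const std::vector<std::uint8_t>& in, std::size_t& pos) {
    std::uint64_t value = 0;
    int shift = 0;
    while (true) {
        std::uint8_t byte = in[pos++];
        value |= static_cast<std::uint64_t>(byte & 0x7F) << shift;
        if ((byte & 0x80) == 0) return value;
        shift += 7;
    }
}
```

<p>Each loop iteration reads one whole byte, masks it, and tests one flag. Nothing straddles a byte boundary, which is the property the bit-granular codes give up.</p>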
<p><strong>5. Polyalgorithms beat single algorithms.</strong> When source characteristics vary across the data, adapt per-chunk rather than committing to a single global code. RoaringBitmap does this for integer sets: it observes each 65536-element chunk and picks the container type that minimizes storage for that chunk&rsquo;s density. Zstd does something similar for byte streams: each block is stored raw, run-length encoded, or compressed, and a compressed block combines LZ77-style matching with Huffman-coded literals and FSE-coded (ANS) sequence symbols, with the coding tables chosen per block. The principle is general. Non-stationarity is the norm in real data, and a globally optimal code is locally suboptimal everywhere the source deviates from the global average.</p>
<p><strong>6. The algebra of composition is orthogonal to the choice of leaf code.</strong> Type structure dictates how codecs compose: <code>Either</code> needs a tag, <code>Vec</code> needs a count or terminator, <code>Product</code> concatenates. These are structural facts about the algebraic data type and they do not change with the data distribution. What does change with the distribution is which codec you assign to each leaf. You can swap leaf codecs without changing the composition law, and you can change the composition structure without touching the leaf codecs. Keeping these two axes separate is what makes a codec library composable rather than a pile of special cases.</p>
<h2 id="f-what-comes-next">F. What Comes Next</h2>
<p>This series covered prefix-free codes from their theoretical foundations through twelve instances, ending at polyalgorithms. I stopped here because this is where the foundations end and the engineering starts getting domain-specific. The frontier extends in three directions I have not covered.</p>
<p><strong>Context mixing.</strong> The PAQ and ZPAQ family of compressors comes close to the entropy limit on general data by running many predictive models in parallel and mixing their probability estimates with weights that are themselves adapted to the data. The key insight is that no single model is best everywhere; a mixture that weights models by their recent prediction accuracy outperforms any single model on non-stationary sources. Context mixing is the logical culmination of the codes-as-priors idea: you do not pick one prior, you maintain a portfolio.</p>
<p><strong>Asymmetric Numeral Systems (ANS).</strong> Developed by Jaroslaw Duda around 2014, ANS achieves the compression ratio of arithmetic coding while decoding at 5 to 10 times the speed. It underlies FSE (Finite State Entropy), the table-based coder that Zstd uses for its sequence symbols, as well as Apple&rsquo;s LZFSE and the JPEG XL image format. ANS works by maintaining a single integer state that encodes the accumulated probability of the symbols seen so far, updated with each new symbol by a table lookup (tANS) or simple integer arithmetic (rANS) rather than interval arithmetic. If you want to understand how modern fast compressors actually work at the symbol level, ANS is the answer.</p>
<p><strong>Rate-distortion theory.</strong> Everything in this series assumed lossless compression: the decoder must recover the exact original bit sequence. Rate-distortion theory introduces a different class of trade, where you bound the allowable reconstruction error and ask how few bits you can use subject to that bound. This is the foundation of audio (MP3, AAC, Opus), video (H.264, H.265, AV1), and image compression (JPEG, WebP). The information-theoretic results here are just as clean as Shannon&rsquo;s source-coding theorem, but the engineering is richer because the distortion measure (perceptual quality for audio and video, pixel error for images) shapes every design decision.</p>
<p>The principles from this series carry forward into all three directions: a code is a prior, universality is robustness, optimality is measurable, engineering trades dominate at scale. The new settings add new structure (context histories, state machines, distortion metrics) but not new foundations.</p>
<h2 id="g-cross-references">G. Cross-references</h2>
<p>Posts in this series (in order):
<a href="/post/2020-03-kraft-wire-formats/">Kraft&rsquo;s Inequality</a>,
<a href="/post/2020-09-mcmillan-wire-formats/">McMillan&rsquo;s Converse</a>,
<a href="/post/2022-01-priors-wire-formats/">Universal Codes as Priors</a>,
<a href="/post/2022-06-elias-gamma-wire-formats/">Unary and Elias Gamma</a>,
<a href="/post/2022-11-elias-delta-omega-wire-formats/">Elias Delta and Omega</a>,
<a href="/post/2023-04-fibonacci-wire-formats/">Fibonacci Coding</a>,
<a href="/post/2023-09-rice-golomb-wire-formats/">Rice / Golomb</a>,
<a href="/post/2024-02-vbyte-wire-formats/">VByte / Varint</a>,
<a href="/post/2024-08-huffman-wire-formats/">Huffman</a>,
<a href="/post/2025-01-arithmetic-coding-wire-formats/">Arithmetic Coding</a>,
<a href="/post/2025-06-succinct-wire-formats/">Succinct Bit Vectors</a>,
<a href="/post/2025-12-roaring-bitmap-wire-formats/">RoaringBitmap</a>.</p>
<p>Cross-series: <a href="/post/2026-05-codecs-functors-stepanov/">Bits Follow Types</a> and
<a href="/post/2026-05-prefix-free-stepanov/">When Lists Become Bits</a>.</p>
<p><strong>PFC footnote:</strong> The production implementation of all 12 codes (and several more)
is at <a href="https://github.com/queelius/wire-formats/tree/master/lib/pfc">github.com/queelius/wire-formats/tree/master/lib/pfc</a>. This series
develops the theory; PFC is the practice.</p>
]]></content:encoded><category>C++</category><category>information-theory</category><category>coding-theory</category><category>prefix-free</category><category>universal-codes</category><category>synthesis</category><category>algebra</category></item><item><title>Weekly Jigsaw</title><link>https://metafunctor.com/arcade/jigsaw/</link><pubDate>Mon, 27 Apr 2026 00:00:00 +0000</pubDate><dc:creator>Alex Towell</dc:creator><guid isPermaLink="true">https://metafunctor.com/arcade/jigsaw/</guid><description>A fresh AI-generated picture every Monday. Each piece you place is a git commit. The puzzle’s solving history is the git log.
Sign in with GitHub to play. The “type a code at github.com/login/device” flow runs entirely in your browser; no server holds your token.</description><content:encoded><![CDATA[<p>A fresh AI-generated picture every Monday. Each piece you place is a git commit. The puzzle&rsquo;s solving history is the <a href="https://github.com/queelius/metafunctor-data/commits/master/jigsaw">git log</a>.</p>
<p>Sign in with GitHub to play. The &ldquo;type a code at github.com/login/device&rdquo; flow runs entirely in your browser; no server holds your token.</p>
]]></content:encoded></item><item><title>I Tried to Voice-Clone Myself in 90 MB</title><link>https://metafunctor.com/post/2026-04-26-behavioral-fine-tuning-without-the-tuning/</link><pubDate>Sun, 26 Apr 2026 00:00:00 +0000</pubDate><dc:creator>Alex Towell</dc:creator><guid isPermaLink="true">https://metafunctor.com/post/2026-04-26-behavioral-fine-tuning-without-the-tuning/</guid><description>A tiny LLM in the browser, mixed at sample time with a token-level n-gram trained on every word I have published. Result is mediocre. Architecture is interesting. Notes on what worked, what didn't, and what would make it work.</description><content:encoded><![CDATA[<p>The experiment: a tiny SmolLM2 running in your browser. A token-level n-gram trained on every word I have published. Mix the two distributions in probability space at every generation step. Sample from the mix.</p>
<p>You can try it at <a href="/ask/">/ask</a>. There is a slider for how strongly the n-gram bleeds in.</p>
<p>I expected a small chatbot that sounds like me. I got a 135M-parameter model that uses my words to produce paranoid lorem-ipsum, and a 1.7B model that mostly behaves like a competent chatbot decorated with my function words. Sometimes a phrase comes out that I might actually write. More often the output is grammatically OK but conceptually empty.</p>
<p>The result is weak. The architecture is interesting anyway, and I want to write up why.</p>
<h2 id="the-pieces">The Pieces</h2>
<p>Three things, all running locally:</p>
<ol>
<li>
<p>SmolLM2-Instruct in 135M, 360M, or 1.7B sizes. Q4_K_M GGUF, served from HuggingFace, run via <a href="https://github.com/ngxson/wllama">Wllama</a>. 90 MB to 1 GB on disk. CPU only.</p>
</li>
<li>
<p>A token-level n-gram over my blog corpus: every post tokenized with SmolLM2&rsquo;s BPE, indexed with a <a href="https://github.com/queelius/infinigram">suffix array</a>. 1.6 MB of source text, 470,000 tokens, 1.9 MB suffix array.</p>
</li>
<li>
<p>A token-by-token sampling loop that mixes the LLM&rsquo;s output distribution with the n-gram&rsquo;s, in probability space.</p>
</li>
</ol>
<p>The third piece is the part worth thinking about.</p>
<h2 id="the-math">The Math</h2>
<p>At each generation step the LLM produces <code>p_llm(t)</code> over its 49,152-token vocabulary, exposed by Wllama&rsquo;s <code>getLogits(-1)</code>. The n-gram, given the longest suffix of the current context that occurs in the corpus, produces a sparse <code>p_ngram(t)</code>: nonzero on tokens it has seen following that context, zero elsewhere.</p>
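<p>The n-gram side can be sketched without a suffix array. The naive C++ version below scans the corpus directly, which is far slower than an indexed lookup but returns the same kind of sparse distribution. The name <code>ngram_next</code> and the rule that a suffix only counts if at least one token follows it in the corpus are assumptions, not the library&rsquo;s documented behavior.</p>

```cpp
#include <cstddef>
#include <map>
#include <vector>

using Token = int;

// Naive stand-in for a suffix-array lookup (illustrative only).
// Finds the longest suffix of `context` that occurs in `corpus` followed by at
// least one token, and returns the empirical distribution over those following
// tokens. An empty map means no suffix matched.
inline std::map<Token, double> ngram_next(const std::vector<Token>& corpus,
                                          const std::vector<Token>& context) {
    for (std::size_t len = context.size(); len >= 1; --len) {
        const std::size_t start = context.size() - len;
        std::map<Token, double> counts;
        double total = 0.0;
        for (std::size_t i = 0; i + len < corpus.size(); ++i) {
            bool match = true;
            for (std::size_t j = 0; j < len && match; ++j)
                match = corpus[i + j] == context[start + j];
            if (match) { counts[corpus[i + len]] += 1.0; total += 1.0; }
        }
        if (total > 0.0) {
            for (auto& [token, c] : counts) c /= total;
            return counts;  // sparse: only tokens seen after this suffix
        }
    }
    return {};
}
```

<p>With corpus <code>{1, 2, 3, 1, 2, 4}</code> and context <code>{9, 1, 2}</code>, the full context never occurs, the suffix <code>{1, 2}</code> occurs twice, and the result puts probability 0.5 on each of tokens 3 and 4.</p>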
<p>Linear combination:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">p_mix(t) = α · p_ngram(t) + (1 − α) · p_llm(t)
</span></span></code></pre></div><p>That is the whole algorithm. The inner loop is small enough to fit on screen:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-js" data-lang="js"><span class="line"><span class="cl"><span class="k">for</span> <span class="p">(</span><span class="kd">let</span> <span class="nx">step</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="nx">step</span> <span class="o">&lt;</span> <span class="nx">N</span><span class="p">;</span> <span class="nx">step</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="kr">const</span> <span class="nx">llm</span> <span class="o">=</span> <span class="kr">await</span> <span class="nx">wllama</span><span class="p">.</span><span class="nx">getLogits</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">  <span class="kr">const</span> <span class="nx">m</span> <span class="o">=</span> <span class="nx">ig</span><span class="p">.</span><span class="nx">longestSuffixMatch</span><span class="p">(</span><span class="nx">context</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">  <span class="kr">const</span> <span class="nx">ngram</span> <span class="o">=</span> <span class="nx">m</span><span class="p">.</span><span class="nx">suffixLen</span> <span class="o">&gt;</span> <span class="mi">0</span>
</span></span><span class="line"><span class="cl">    <span class="o">?</span> <span class="k">new</span> <span class="nx">Map</span><span class="p">(</span><span class="nx">ig</span><span class="p">.</span><span class="nx">continuations</span><span class="p">(</span><span class="nx">m</span><span class="p">.</span><span class="nx">matchedTokens</span><span class="p">).</span><span class="nx">map</span><span class="p">(</span><span class="nx">c</span> <span class="p">=&gt;</span> <span class="p">[</span><span class="nx">c</span><span class="p">.</span><span class="nx">token</span><span class="p">,</span> <span class="nx">c</span><span class="p">.</span><span class="nx">prob</span><span class="p">]))</span>
</span></span><span class="line"><span class="cl">    <span class="o">:</span> <span class="kc">null</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">  <span class="kr">const</span> <span class="nx">mix</span> <span class="o">=</span> <span class="k">new</span> <span class="nx">Map</span><span class="p">();</span>
</span></span><span class="line"><span class="cl">  <span class="k">for</span> <span class="p">(</span><span class="kr">const</span> <span class="p">{</span> <span class="nx">token</span><span class="p">,</span> <span class="nx">p</span> <span class="p">}</span> <span class="k">of</span> <span class="nx">llm</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="kr">const</span> <span class="nx">pn</span> <span class="o">=</span> <span class="nx">ngram</span><span class="o">?</span><span class="p">.</span><span class="nx">get</span><span class="p">(</span><span class="nx">token</span><span class="p">)</span> <span class="o">??</span> <span class="mi">0</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="nx">mix</span><span class="p">.</span><span class="nx">set</span><span class="p">(</span><span class="nx">token</span><span class="p">,</span> <span class="nx">alpha</span> <span class="o">*</span> <span class="nx">pn</span> <span class="o">+</span> <span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="nx">alpha</span><span class="p">)</span> <span class="o">*</span> <span class="nx">p</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">  <span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">  <span class="kr">const</span> <span class="nx">next</span> <span class="o">=</span> <span class="nx">sample</span><span class="p">(</span><span class="nx">mix</span><span class="p">,</span> <span class="nx">temperature</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">  <span class="k">if</span> <span class="p">(</span><span class="kr">await</span> <span class="nx">wllama</span><span class="p">.</span><span class="nx">isTokenEOG</span><span class="p">(</span><span class="nx">next</span><span class="p">))</span> <span class="k">break</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">  <span class="nx">context</span><span class="p">.</span><span class="nx">push</span><span class="p">(</span><span class="nx">next</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">  <span class="kr">await</span> <span class="nx">wllama</span><span class="p">.</span><span class="nx">decode</span><span class="p">([</span><span class="nx">next</span><span class="p">]);</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>Tokens unseen by the n-gram have <code>p_ngram = 0</code> and retain <code>(1 − α) · p_llm</code> in the mixture. They are not zeroed out, just unboosted.</p>
<p>α = 0 is the LLM. α = 1 is the n-gram, which loops as soon as the generated context drifts off-corpus. In between is a model running on LLM grammar with n-gram register.</p>
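<p>To make the arithmetic concrete, here is a minimal sketch of the same linear mix over plain <code>Map</code>s. The helper name <code>mixLinear</code>, the <code>Map</code>-based inputs, and the toy tokens and probabilities are assumptions made for illustration; they are not taken from the demo code.</p>

```javascript
// Hypothetical helper: linear mix of an LLM distribution and an n-gram
// distribution, both given as Map(token -> probability).
function mixLinear(llm, ngram, alpha) {
  const mix = new Map();
  for (const [token, p] of llm) {
    // Tokens the n-gram never saw contribute 0 from the prior.
    const pn = ngram?.get(token) ?? 0;
    mix.set(token, alpha * pn + (1 - alpha) * p);
  }
  return mix;
}

// Toy numbers, chosen only to show the effect.
const toyLlm = new Map([["a", 0.5], ["b", 0.1], ["c", 0.4]]);
const toyNgram = new Map([["b", 1.0]]);

const toyMix = mixLinear(toyLlm, toyNgram, 0.1);
// "a": 0.1 * 0   + 0.9 * 0.5 = 0.45  (unseen by the n-gram, unboosted)
// "b": 0.1 * 1.0 + 0.9 * 0.1 = 0.19  (boosted)
// "c": 0.1 * 0   + 0.9 * 0.4 = 0.36  (unseen by the n-gram, unboosted)
```

<p>In this toy case the mix still sums to 1 because every n-gram token also appears in the LLM list. If the n-gram puts mass on tokens outside that list, the mix needs renormalizing before sampling.</p>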
<h2 id="what-it-actually-produces">What It Actually Produces</h2>
<p>The 135M model with the prompt &ldquo;Hello&rdquo; and α = 0.1:</p>
<blockquote>
<p>Hi! I&rsquo;m working on a project with you! I&rsquo;m trying to hide some of your work. I&rsquo;ll have to wait for your response, but I&rsquo;ll give you a brief summary to let you know what I&rsquo;m working on. Hope you&rsquo;re doing well!</p>
</blockquote>
<p>The vocabulary is mine. I write about hiding data, encrypted search, comparing approaches. The 135M has no idea what those concepts mean. It glues my words into paragraphs that read like a paranoid academic argument with the visitor.</p>
<p>The 1.7B is better, but its failure mode is subtler. With α = 0.1 it produces grammatical paragraphs that drop in occasional Alex-shaped phrases (&ldquo;compositional depth&rdquo;, &ldquo;rules as data&rdquo;, project names from my corpus). At α = 0 it sounds like a generic chatbot. At α = 1 it sounds like me, in fragments, looping.</p>
<p>What mixing-at-sample-time actually does, I think, is pull register without pulling competence. The n-gram makes the surface me-shaped. It does not give the model any of the conceptual machinery that lets me mean something when I use those words. The output is texture, not thought.</p>
<h2 id="why-the-architecture-is-interesting-anyway">Why The Architecture Is Interesting Anyway</h2>
<p>The result is weak, but the structural properties are real:</p>
<p><strong>The corpus stays uncompressed.</strong> Conventional fine-tuning is a lossy gradient encoding of your training distribution. Here, the corpus ships in full. The &ldquo;fine-tune&rdquo; is a static asset you can read.</p>
<p><strong>Adding a document is a 30 ms rebuild.</strong> Not a training run.</p>
<p><strong>Removing a document is the same operation reversed.</strong> No question of whether the model &ldquo;forgot&rdquo; it. There is no gradient that could remember.</p>
<p><strong>Auditing memorization is a binary search.</strong> O(log n) suffix comparisons locate the contiguous suffix-array range that holds every occurrence of a span in the corpus. With gradient-descent fine-tuning, you cannot easily answer whether a given span was memorized.</p>
<p><strong>Tokenizer parity makes model swaps free.</strong> All three SmolLM2 sizes share the BPE, so the same suffix array works against any of them.</p>
<p>These properties are why I think the idea has legs even though the demo doesn&rsquo;t.</p>
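<p>The audit property can be sketched in a few lines. This is a minimal illustration over an array of token ids, not the demo&rsquo;s implementation: the function names are made up, and the suffix array is built with a naive comparison sort that would be far too slow for a real corpus.</p>

```javascript
// Sketch only: a naive suffix array over token ids plus a binary search
// for every occurrence of a span. All names here are illustrative.

// Sort suffix start positions lexicographically by token id.
function buildSuffixArray(corpus) {
  const sa = Array.from(corpus.keys());
  sa.sort((a, b) => {
    while (corpus.length > a && corpus.length > b) {
      if (corpus[a] !== corpus[b]) return corpus[a] - corpus[b];
      a++;
      b++;
    }
    return b - a; // the suffix that ran out first is shorter and sorts first
  });
  return sa;
}

// Compare the suffix starting at `start` against `span`, looking at no
// more than span.length tokens. Returns -1, 0 (span is a prefix), or 1.
function compareSuffix(corpus, start, span) {
  for (let k = 0; span.length > k; k++) {
    if (start + k >= corpus.length) return -1;
    if (corpus[start + k] !== span[k]) {
      return span[k] > corpus[start + k] ? -1 : 1;
    }
  }
  return 0;
}

// Two binary searches bracket the range of suffixes that start with span.
function findOccurrences(corpus, sa, span) {
  const bound = (threshold) => {
    let lo = 0;
    let hi = sa.length;
    while (hi > lo) {
      const mid = (lo + hi) >> 1;
      if (compareSuffix(corpus, sa[mid], span) >= threshold) hi = mid;
      else lo = mid + 1;
    }
    return lo;
  };
  const first = bound(0); // first suffix that starts with span or sorts after it
  const last = bound(1); // first suffix that sorts strictly after span
  return sa.slice(first, last).sort((x, y) => x - y); // corpus positions
}

const toyCorpus = [1, 2, 3, 1, 2, 4, 1, 2];
const toySa = buildSuffixArray(toyCorpus);
const hits = findOccurrences(toyCorpus, toySa, [1, 2]); // positions 0, 3, 6
```

<p>Each binary search does O(log n) suffix comparisons, and each comparison reads at most as many tokens as the span is long. An empty result means the span never appears in the corpus.</p>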
<h2 id="whats-wrong-with-it">What&rsquo;s Wrong With It</h2>
<p>The failure modes I observed, mostly so I know what to fix:</p>
<p><strong>Linear mixing is a soft prior, not a guide.</strong> It lifts tokens. It does not constrain meaning. The model still hallucinates; the prior just makes it hallucinate in my register.</p>
<p><strong>Token-level matching is too local.</strong> The n-gram looks at the longest in-corpus suffix, which is usually a handful of tokens. Enough for register, not enough for thematic coherence. Sequential sentences can pull from completely different parts of the corpus.</p>
<p><strong>The corpus is too small.</strong> 1.6 MB has only so many patterns. With α near 1 the model loops within seconds.</p>
<p><strong>Tiny models cannot be steered into competence.</strong> The 135M under any prior is a syntax engine. With the prior it is a syntax engine with my vocabulary. Useful for a curio, not for a tool.</p>
<p><strong>Temperature interacts unpredictably.</strong> Applying temperature to the mixed distribution is mathematically clean, but small changes to T produce large changes in output character that I cannot tune intuitively.</p>
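<p>For reference, one common way to apply temperature to an already-mixed distribution is to raise each probability to the power 1/T and renormalize. The sketch below assumes that approach. It is a hypothetical stand-in for the <code>sample(mix, temperature)</code> call in the loop above, whose real body is not shown in this post, and the function names are made up.</p>

```javascript
// Hypothetical temperature scaling over Map(token -> probability).
// p^(1/T) is the probability-space equivalent of dividing logits by T.
function temperScale(mix, temperature) {
  const scaled = new Map();
  let total = 0;
  for (const [token, p] of mix) {
    const w = Math.pow(p, 1 / temperature);
    scaled.set(token, w);
    total += w;
  }
  for (const [token, w] of scaled) scaled.set(token, w / total);
  return scaled;
}

// Draw one token; `rand` is injectable so the draw can be made repeatable.
function sampleWithTemperature(mix, temperature, rand = Math.random) {
  const dist = temperScale(mix, temperature);
  let r = rand();
  let lastToken;
  for (const [token, p] of dist) {
    lastToken = token;
    r -= p;
    if (0 >= r) return token;
  }
  return lastToken; // guard against floating-point leftovers
}

const toyDist = new Map([["a", 0.6], ["b", 0.3], ["c", 0.1]]);
const sharp = temperScale(toyDist, 0.5); // "a" rises to about 0.78
const flat = temperScale(toyDist, 2.0); // "a" falls to about 0.47
```

<p>Even on three tokens, moving T from 0.5 to 2 swings the top token from roughly 0.78 to roughly 0.47, which gives some feel for why small changes in T are hard to tune by intuition.</p>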
<h2 id="what-would-make-it-work">What Would Make It Work</h2>
<p>Roughly in order of how much I think each would help:</p>
<p><strong>Passage-level retrieval, also mixed in.</strong> A second prior at the level of full passages, retrieved via embedding search, with passage-level distributions mixed alongside the token-level n-gram. Token-level for register, passage-level for theme.</p>
<p><strong>Logit-space mixing instead of probability-space.</strong> <code>log p_mix = log p_llm + λ · log(p_ngram + ε)</code>, renormalized afterward, actively penalizes tokens with low corpus support, rather than only adding lift to high-support ones. Sharper steering.</p>
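<p>A minimal sketch of that log-space rule, assuming both distributions arrive as <code>Map</code>s of token to probability. The name <code>mixLogSpace</code> and the values of <code>lambda</code> and <code>eps</code> are illustrative guesses, not tuned settings.</p>

```javascript
// Sketch of log-space mixing over Map(token -> probability).
function mixLogSpace(llm, ngram, lambda, eps = 1e-6) {
  const out = new Map();
  let maxScore = -Infinity;
  for (const [token, p] of llm) {
    const pn = ngram?.get(token) ?? 0;
    // Low corpus support makes log(pn + eps) very negative: a real penalty.
    const s = Math.log(p) + lambda * Math.log(pn + eps);
    out.set(token, s);
    maxScore = Math.max(maxScore, s);
  }
  // Softmax back to probabilities; subtracting the max keeps exp() stable.
  let total = 0;
  for (const [token, s] of out) {
    const w = Math.exp(s - maxScore);
    out.set(token, w);
    total += w;
  }
  for (const [token, w] of out) out.set(token, w / total);
  return out;
}

const baseDist = new Map([["a", 0.5], ["b", 0.1], ["c", 0.4]]);
const priorDist = new Map([["b", 1.0]]);
const steered = mixLogSpace(baseDist, priorDist, 0.5);
// "b" now dominates: unseen tokens are scaled down by roughly eps^lambda.
```

<p>When there is no n-gram match at all, every token receives the same penalty, which cancels in the renormalization and leaves the LLM distribution unchanged.</p>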
<p><strong>A different LLM family.</strong> SmolLM2 caps at 1.7B. Qwen 2.5 1.5B or Llama 3.2 3B are of similar size and stronger on instruction-following. Requires retokenizing the corpus; the JS doesn&rsquo;t change.</p>
<p><strong>Context-dependent α.</strong> Right now α is fixed. When the longest match is short (the LLM is off-corpus), trust the LLM more. When the match is long (the LLM is producing in-corpus text), trust the prior more. Probably the change with the most upside of these.</p>
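<p>One possible shape for such a schedule, as a sketch: α grows linearly with the longest-match length and saturates. The function name <code>alphaForMatch</code> and both constants are placeholders, not values I have tested.</p>

```javascript
// Hypothetical schedule: alpha grows with the longest-match length and
// saturates at alphaMax. The constants are placeholders, not tuned values.
function alphaForMatch(suffixLen, alphaMax = 0.3, saturateAt = 8) {
  if (0 >= suffixLen) return 0; // no match: pure LLM
  return alphaMax * Math.min(1, suffixLen / saturateAt);
}

const shortMatch = alphaForMatch(2); // 0.3 * 2/8, mostly LLM
const longMatch = alphaForMatch(12); // saturated at alphaMax
```

<p>In the sampling loop this would replace the fixed value with something like <code>const alpha = alphaForMatch(m.suffixLen)</code>, computed once per step before mixing.</p>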
<h2 id="a-note-on-local-first">A Note On Local-First</h2>
<p>The reason the whole pipeline runs in the browser, even though the result is weak, is that this is the use case fine-tuning has historically been worst at: people who want a personal model from their own writing, without running a GPU cluster, without uploading their writing anywhere, and without ending up with an opaque-weights artifact they cannot inspect.</p>
<p>Sample-time mixing with a public LLM is one possible answer. The model is a public utility, the corpus is a static asset, the mixing is in user space. Anyone curious about what is being applied to the model can read the code.</p>
<p>This particular implementation is mediocre. The shape of the answer is right.</p>
<h2 id="what-this-is">What This Is</h2>
<p>Not a fine-tune. Not a chatbot that sounds like me. A research artifact. A thing that shows an LLM and an n-gram model can be composed at sample time, with all the structural properties that follow, and that the architecture is worth more work even though this particular composition does not yet land.</p>
<p>The toy is at <a href="/ask/">/ask</a> and <a href="/infinigram/">/infinigram</a>. Have at it.</p>
]]></content:encoded><category>llm</category><category>fine-tuning</category><category>infinigram</category><category>n-grams</category><category>local-first</category><category>voice</category><category>sample-time</category></item><item><title>A Formal Theory of Inductive Inference</title><link>https://metafunctor.com/media/a-formal-theory-of-inductive-inference/</link><pubDate>Fri, 24 Apr 2026 00:00:00 +0000</pubDate><dc:creator>Alex Towell</dc:creator><guid isPermaLink="true">https://metafunctor.com/media/a-formal-theory-of-inductive-inference/</guid><description>Notes Foundational paper on algorithmic probability and universal induction. Basis for AIXI.</description><content:encoded>&lt;h2 id="notes">Notes&lt;/h2>
&lt;p>Foundational paper on algorithmic probability and universal induction. Basis for AIXI.&lt;/p>
</content:encoded><category>algorithmic information</category><category>induction</category><category>prediction</category><category>from:language-models</category></item><item><title>A Philosophy of Software Design</title><link>https://metafunctor.com/media/a-philosophy-of-software-design/</link><pubDate>Fri, 24 Apr 2026 00:00:00 +0000</pubDate><dc:creator>Alex Towell</dc:creator><guid isPermaLink="true">https://metafunctor.com/media/a-philosophy-of-software-design/</guid><description>Notes Deep modules, complexity management. Best modern design book.</description><content:encoded>&lt;h2 id="notes">Notes&lt;/h2>
&lt;p>Deep modules, complexity management. Best modern design book.&lt;/p>
</content:encoded><category>complexity</category><category>abstractions</category><category>interfaces</category><category>from:books</category></item><item><title>Against Method</title><link>https://metafunctor.com/media/against-method/</link><pubDate>Fri, 24 Apr 2026 00:00:00 +0000</pubDate><dc:creator>Alex Towell</dc:creator><guid isPermaLink="true">https://metafunctor.com/media/against-method/</guid><description>Notes Epistemological anarchism. The strongest critique of scientific method.</description><content:encoded>&lt;h2 id="notes">Notes&lt;/h2>
&lt;p>Epistemological anarchism. The strongest critique of scientific method.&lt;/p>
</content:encoded><category>epistemology</category><category>methodology</category><category>from:books</category></item><item><title>An Introduction to Statistical Learning (ISLR2)</title><link>https://metafunctor.com/media/an-introduction-to-statistical-learning-islr2/</link><pubDate>Fri, 24 Apr 2026 00:00:00 +0000</pubDate><dc:creator>Alex Towell</dc:creator><guid isPermaLink="true">https://metafunctor.com/media/an-introduction-to-statistical-learning-islr2/</guid><description>Notes Best practical stats-based ML introduction ever written.</description><content:encoded>&lt;h2 id="notes">Notes&lt;/h2>
&lt;p>Best practical stats-based ML introduction ever written.&lt;/p>
</content:encoded><category>regression</category><category>classification</category><category>from:books</category></item><item><title>An Introduction to Thermal Physics</title><link>https://metafunctor.com/media/an-introduction-to-thermal-physics/</link><pubDate>Fri, 24 Apr 2026 00:00:00 +0000</pubDate><dc:creator>Alex Towell</dc:creator><guid isPermaLink="true">https://metafunctor.com/media/an-introduction-to-thermal-physics/</guid><description>Notes Thermodynamics + statistical mechanics at the right level.</description><content:encoded>&lt;h2 id="notes">Notes&lt;/h2>
&lt;p>Thermodynamics + statistical mechanics at the right level.&lt;/p>
</content:encoded><category>thermodynamics</category><category>statistical mechanics</category><category>from:books</category></item><item><title>API Design Patterns</title><link>https://metafunctor.com/media/api-design-patterns/</link><pubDate>Fri, 24 Apr 2026 00:00:00 +0000</pubDate><dc:creator>Alex Towell</dc:creator><guid isPermaLink="true">https://metafunctor.com/media/api-design-patterns/</guid><description>Notes Practical patterns for designing internal and public APIs.</description><content:encoded>&lt;h2 id="notes">Notes&lt;/h2>
&lt;p>Practical patterns for designing internal and public APIs.&lt;/p>
</content:encoded><category>patterns</category><category>resource design</category><category>from:books</category></item><item><title>Artificial Intelligence: A Modern Approach</title><link>https://metafunctor.com/media/artificial-intelligence-a-modern-approach/</link><pubDate>Fri, 24 Apr 2026 00:00:00 +0000</pubDate><dc:creator>Alex Towell</dc:creator><guid isPermaLink="true">https://metafunctor.com/media/artificial-intelligence-a-modern-approach/</guid><description>Notes The standard AI textbook. Search, logic, planning, learning, language.</description><content:encoded>&lt;h2 id="notes">Notes&lt;/h2>
&lt;p>The standard AI textbook. Search, logic, planning, learning, language.&lt;/p>
</content:encoded><category>search</category><category>planning</category><category>logic</category><category>learning</category><category>from:books</category></item><item><title>Attention Is All You Need</title><link>https://metafunctor.com/media/attention-is-all-you-need/</link><pubDate>Fri, 24 Apr 2026 00:00:00 +0000</pubDate><dc:creator>Alex Towell</dc:creator><guid isPermaLink="true">https://metafunctor.com/media/attention-is-all-you-need/</guid><description>Notes Introduced the Transformer architecture. The paper that started everything.</description><content:encoded>&lt;h2 id="notes">Notes&lt;/h2>
&lt;p>Introduced the Transformer architecture. The paper that started everything.&lt;/p>
</content:encoded><category>transformer</category><category>attention</category><category>architecture</category><category>from:language-models</category></item></channel></rss>