Linked_project

Synthesis: Codecs as Structure

May 15, 2026

Twelve posts, twelve codes, one thesis that refused to change. This is the closing summary.

A. The Twelve Codes Together

Every post in this series answered a version of the same question: given a source of positive integers, how do you represent its values compactly as a sequence of bits? The answers differ in shape, in assumptions, and in which distribution each code implicitly expects.

Post	Code	Implied prior (one phrase)
1-2	Foundations	Prefix-free codes are possible iff Kraft’s inequality holds
3	Priors framework	Any code defines a prior; the best code matches the source
4	Unary	Geometric(1/2): value 1 is twice as likely as value 2, etc.
5a	Elias Gamma	Power-law: probability falls as 1/n^2
5b	Elias Delta	Heavier-tailed power law: slower decay for large values
5c	Elias Omega	Recursive structure: no fixed polynomial decay rate
6	Fibonacci	Near-geometric with Zeckendorf structure; good for Zeckendorf-sparse integers
7	Rice / Golomb	Geometric with known parameter m; optimal when m divides entropy
8	VByte	Roughly uniform over byte-aligned ranges; engineering favorite
9	Huffman	Source-optimal given the exact symbol distribution
10	Arithmetic coding	Approaches entropy to an arbitrary fraction of a bit
11	Succinct bit vectors	Not a code for integers: a representation that answers rank/select queries
12	RoaringBitmap	Polyalgorithm: picks array, bitset, or run-length per container chunk

Posts 1 and 2 (Kraft’s Inequality and McMillan’s Converse) established why prefix-free codes are the right unit of analysis. Post 3 (Universal Codes as Priors) named the frame: a code is a hypothesis about the source. Posts 4 through 10 filled in the catalogue. Posts 11 and 12 extended from integer coding to set representation, where the questions shift from “how long is this codeword?” to “how do you store membership?” and “how do you answer rank/select?”

Looking across all twelve, the main lesson is not that one code dominates. It is that the question “which code?” is always empirically answerable given a sample.

B. The Unifying Frame Restated

Post 3 introduced the codes-as-priors thesis with two instances behind it. We now have twelve. The thesis has not changed; it has only become more evidently true.

Bits Follow Types

April 23, 2026

Every type decomposes structurally. So does its codec.

Codecs as Functors

You have an optional<vector<pair<int, string>>>. The type decomposes structurally: it is an optional of a free monoid of products of an integer and a string. That decomposition is not an observation about memory layout. It is a statement about the algebraic structure of the type.

Now ask: does the codec decompose the same way?

If the answer is yes, you stop writing one-off encoders. You build a codec for optional<T> from a codec for T. You build a codec for vector<T> from a codec for T. The codec for optional<vector<pair<int, string>>> assembles from its parts with no manual layout decisions, no hand-placed length headers, no ad-hoc format negotiation.

This post argues that the answer is always yes, and shows what the machinery looks like. The thesis: codecs are not ad-hoc bit formats. They are constructions on the algebraic structure of types. The algebraic structure of a type determines its codec, the same way it determines its algorithms.

This extends Stepanov’s claim. The peasant algorithm post showed that algorithms arise from algebraic structure. The homomorphism post showed that structure-preserving maps are the natural morphisms. Here, we show the codec itself is a structure-preserving map, and that it lifts from leaf types to compound types by the same algebraic logic.

Bit I/O: The Foundation

Before combinators, we need concrete bit I/O. The approach taken here follows Stepanov’s move in the algorithm posts: state the concept first, then provide a model.

Two concepts govern bit-level I/O:

template<typename T>
concept BitSink = requires(T& s, bool bit) {
    { s.write(bit) } -> std::same_as<void>;
};

template<typename T>
concept BitSource = requires(T& s) {
    { s.read() } -> std::same_as<bool>;
    { s.peek() } -> std::convertible_to<bool>;
};

A BitSink accepts bits. A BitSource supplies them. A codec is an algorithm parameterized over BitSink and BitSource, not a class hierarchy. This is Stepanov’s move at the bit level: require only what the algorithm needs, let anything that satisfies the concept participate.

The standard models are BitWriter and BitReader, which pack bits into byte buffers in LSB-first order:

class BitWriter {
    std::span<std::uint8_t> buf_;
    std::size_t byte_idx_ = 0;
    std::uint8_t byte_ = 0;
    std::uint8_t bit_pos_ = 0;
public:
    explicit BitWriter(std::span<std::uint8_t> buf) noexcept : buf_(buf) {}

    void write(bool bit) noexcept {
        byte_ |= (bit ? std::uint8_t{1} : std::uint8_t{0}) << bit_pos_;
        if (++bit_pos_ == 8) {
            buf_[byte_idx_++] = byte_;
            byte_ = 0;
            bit_pos_ = 0;
        }
    }

    void align() noexcept {
        if (bit_pos_ > 0) {
            buf_[byte_idx_++] = byte_;
            byte_ = 0;
            bit_pos_ = 0;
        }
    }

    [[nodiscard]] std::size_t bytes_written() const noexcept {
        return byte_idx_ + (bit_pos_ > 0 ? 1 : 0);
    }
};

class BitReader {
    std::span<const std::uint8_t> buf_;
    std::size_t byte_idx_ = 0;
    std::uint8_t bit_pos_ = 0;
public:
    explicit BitReader(std::span<const std::uint8_t> buf) noexcept : buf_(buf) {}

    bool read() noexcept {
        bool bit = ((buf_[byte_idx_] >> bit_pos_) & 1) != 0;
        if (++bit_pos_ == 8) {
            ++byte_idx_;
            bit_pos_ = 0;
        }
        return bit;
    }

    [[nodiscard]] bool peek() const noexcept {
        return byte_idx_ < buf_.size();
    }
};

A codec concept rounds out the three-concept core:

When Lists Become Bits

April 23, 2026

The free monoid on a type lifts to bit space. It lifts injectively only when the element codec is prefix-free.

Prefix-Free Codes and the Free Monoid

You have a list of unsigned integers. Encode the list as a single bit string.

Fixed-width encoding wastes space. If you allocate 64 bits per integer, small values like 1 or 7 cost as much as values near $2^{64}$. Variable-width encoding recovers that space, but immediately raises a harder question: where does one encoded integer end and the next begin?

Two escape routes. First, prefix each encoded item with its length. That works, but the length headers are overhead, and you now need a codec for the lengths as well. Second, choose a code where the structure of the codewords makes boundaries unambiguous without any headers. These are prefix-free codes, and this is the right answer, in a precise categorical sense.

The “precise categorical sense” is what this post develops. Encoding a list as the concatenation of encoded elements is a monoid homomorphism from the free monoid on $T$ to the monoid of bit strings under concatenation. The universal property of the free monoid guarantees this homomorphism always exists. The question of whether the decoder can invert it comes down to exactly one property of the element codec: whether it is prefix-free.

The Free Monoid, Recalled

A monoid is a set with an associative binary operation and an identity element. The free monoid on a set $S$ is the set of all finite sequences of elements from $S$, with concatenation as the operation and the empty sequence as the identity.

“Free” means no equations hold except those forced by the monoid axioms. Nothing is identified with anything else. If you need commutativity or idempotency, you quotient the free monoid by additional equations. But the free monoid itself imposes nothing beyond associativity and identity.

The universal property says: given any monoid $M$ and any function $f: S \to M$, there is exactly one monoid homomorphism $\hat{f}: \text{Free}(S) \to M$ that extends $f$. That unique extension is fold:

$$\hat{f}([x_1, x_2, \ldots, x_n]) = f(x_1) \cdot f(x_2) \cdot \cdots \cdot f(x_n)$$

where $\cdot$ is the operation in $M$. The free-algebra post develops this in full. For this post, the one fact that matters is that fold is canonical: it is the unique way to extend a per-element map to a list-consuming function that respects the monoid structure.

RoaringBitmap

December 7, 2025

No single representation is optimal across all density regimes. RoaringBitmap does not try to pick one. It measures, then decides, per chunk.

Hybrid Representation as Polyalgorithm

Representing a compressed integer set forces a decision you cannot avoid: the best encoding depends on how dense the data is. Four density regimes each have a natural winner:

Very sparse (fewer than ~4096 elements in a 64K range): a sorted array of 16-bit values. Each element costs 2 bytes; 4096 elements cost 8 KB.
Moderate to dense (more than ~4096 elements): a dense bit vector, 8 KB for the full 64K range. Contains-check is O(1) via bit indexing.
Very dense (almost full): a sorted array of the absent elements, using the same 2 bytes per gap.
Clustered runs (many consecutive values): run-length encoding. A run of $k$ consecutive integers costs one 4-byte record regardless of $k$.

No single prior dominates across all four regimes. RoaringBitmap (Lemire et al., 2014) is the hybrid answer: partition the 32-bit integer space into 64K-element chunks (indexed by the high 16 bits of each value), then assign each chunk the optimal container based on its actual density. The structure adapts per chunk, not globally.

The Three Container Types

Every 32-bit integer $v$ maps to chunk-id $v \gg 16$ and low bits $v \mathbin{&} 0\text{xFFFF}$. Each chunk stores its low bits in one of three containers:

class ArrayContainer  { std::vector<uint16_t> elems_;   /* sorted, unique */ };
class BitmapContainer { std::vector<uint64_t> words_;   /* 1024 words = 8 KB */ };
class RunContainer    { std::vector<std::pair<uint16_t,uint16_t>> runs_; /* (start, len-1) */ };

using ContainerVariant = std::variant<ArrayContainer, BitmapContainer, RunContainer>;

class RoaringBitmap {
    std::map<uint16_t, ContainerVariant> chunks_;
public:
    void add(uint32_t v);
    bool contains(uint32_t v) const noexcept;
    std::size_t cardinality() const noexcept;
    void optimize();
    RoaringBitmap union_with(const RoaringBitmap&) const;
    RoaringBitmap intersection_with(const RoaringBitmap&) const;
    RoaringBitmap difference(const RoaringBitmap&) const;
};

ArrayContainer uses a sorted std::vector<uint16_t>. Contains-check is binary search, O(log n). Add is sorted insertion, O(n) worst case but O(1) amortized for sequential loads. Space: 2 bytes per element.

BitmapContainer stores 1024 uint64_t words covering all 65536 possible values. Contains-check is O(1): (words_[v/64] >> (v%64)) & 1. Add is O(1). Cardinality requires scanning all 1024 words with popcount, which is fast in practice. Space: fixed 8 KB regardless of occupancy.

RunContainer stores runs as (start, length-1) pairs sorted by start. A run (s, l) covers values $s$ through $s+l$ inclusive, using 4 bytes. Space grows with the number of runs, not the number of values. One million consecutive integers fit in one 4-byte record.

Succinct Bit Vectors and Rank/Select

June 22, 2025

The claim that drives this post: store $n$ bits and answer prefix-count queries in $O(1)$ time, using only $n + o(n)$ bits total. The auxiliary index is asymptotically negligible. That is not obvious, and it is worth understanding why it holds.

Constant-Time Queries on Bit Vectors

Posts 1 through 10 of this series focused on encoding: universal codes, arithmetic coding, Huffman, LZ77. They all ask the same question in slightly different ways: how do we turn a sequence into a compact bit string? This post shifts direction. The question here is different: once we have a bit vector, how do we query it efficiently without expanding it?

The two fundamental queries on a bit vector of $n$ bits are:

rank$_1(i)$: the number of 1-bits in positions $[0, i)$, i.e., strictly before position $i$.
select$_1(j)$: the position of the $j$-th 1-bit (0-indexed).

These appear throughout data structures: inverted indexes, compressed graphs, FM-indexes, wavelet trees. Rank tells you how many elements precede a position. Select inverts it.

Three design points exist along the space-time axis:

Approach	Space	rank time	select time
Naive scan	$n$ bits	$O(n/64)$	$O(n/64)$
Full lookup table	$O(n \log n)$ bits	$O(1)$	$O(1)$
Succinct (this post)	$n + o(n)$ bits	$O(1)$	$O(\log n)$

The succinct approach hits the right trade-off: an auxiliary index that is asymptotically negligible (sublinear, so $o(n)$) while buying constant-time rank. Select costs $O(\log n)$ with a binary search. Getting $O(1)$ select requires one more index structure; that is covered briefly at the end and implemented in PFC’s production version.

The Structure

The bit vector stores its $n$ bits packed into 64-bit words, LSB-first. Bit $i$ lives at position $i \bmod 64$ within word $\lfloor i / 64 \rfloor$. Unused bits in the final word are zeroed.

The auxiliary index has two levels:

Superblock array: one uint64_t per 4096-bit chunk. Entry $s$ holds the absolute cumulative rank from bit 0 to the start of superblock $s$.
Block array: one uint16_t per 64-bit word. Entry $b$ holds the rank within the enclosing superblock, from the superblock’s start to the start of block $b$.

class SuccinctBitVector {
public:
    explicit SuccinctBitVector(const std::vector<bool>& bits);
    [[nodiscard]] std::size_t size()  const noexcept;
    [[nodiscard]] bool        bit(std::size_t i)  const noexcept;
    [[nodiscard]] std::size_t rank1(std::size_t i) const noexcept;   // O(1)
    [[nodiscard]] std::size_t select1(std::size_t j) const noexcept; // O(log n)

protected:
    std::size_t n_;
    std::vector<uint64_t> bits_;
    std::vector<uint64_t> superblock_ranks_;  // abs. rank at superblock boundaries
    std::vector<uint16_t> block_ranks_;       // superblock-relative rank per block

    static constexpr std::size_t SUPERBLOCK_BITS = 4096;
    static constexpr std::size_t BLOCK_BITS      = 64;
    static constexpr std::size_t BLOCKS_PER_SB   = 64;
};

The production version lives in PFC’s include/pfc/succinct.hpp. The pedagogical version in this post strips the production features to the core structure.

Arithmetic Coding

January 12, 2025

Huffman codes one symbol at a time. Arithmetic coding encodes the whole sequence as a single number. The difference is a factor of twelve, at least on the right source.

The Last Bit of Redundancy

Huffman coding gets expected codeword length within one bit of entropy. That is the best it can do, because codeword lengths must be integers while entropy is a real number.

The waste is structural. A symbol with probability $p = 0.7$ has optimal (fractional) length $-\log_2(0.7) \approx 0.515$ bits. Huffman rounds that up to 1 bit: 0.485 bits wasted per occurrence. For a nearly-deterministic source with $p_0 = 0.99$ and $p_1 = 0.01$, the entropy is $H \approx 0.081$ bits per symbol. Huffman is stuck at 1 bit per symbol. That is a factor-of-twelve gap, and Huffman cannot close it: a symbol that appears 99% of the time still gets a complete codeword.

Arithmetic coding steps back from per-symbol codewords entirely. It encodes an entire sequence as a single rational number in $[0, 1)$. The bit-length of that number converges to the entropy of the sequence as the sequence grows. No integer rounding, no per-symbol overhead.

This post builds an integer range coder in C++23 and demonstrates the factor-of-twelve improvement on the Bernoulli(0.99) source.

The Continuous View

Start with the unit interval $[0, 1)$. For a two-symbol source, partition it by probability: symbol 0 gets $[0, p_0)$ and symbol 1 gets $[p_0, 1)$.

To encode a sequence, begin with the full interval and narrow it with each symbol. After symbol $s_1$, restrict to the corresponding sub-interval. After $s_2$, apply the same proportional rule inside that sub-interval. After $L$ symbols, the interval has width $\prod_{i=1}^{L} p_{s_i}$.

Any number inside the final interval is a valid encoding. The shortest such number in binary requires approximately $-\log_2(\prod p_{s_i}) = \sum_{i=1}^{L} (-\log_2 p_{s_i})$ bits. As $L \to \infty$, bits per symbol approaches $H(p) = -\sum_k p_k \log_2 p_k$ exactly.

Decoding is the inverse: given the encoded number, determine at each step which sub-interval it falls in, recover the symbol, narrow the interval, and repeat.

The theory is complete. The practice is not. A real interval narrows exponentially fast: after a few dozen symbols you need arbitrary precision. The integer range coder fixes this with 32-bit arithmetic and a renormalization step.

Huffman Coding

August 4, 2024

Huffman coding is two things: the optimal length vector for a known distribution, and McMillan’s construction applied to that vector. This post develops both.

From Universal to Optimal

Every code in this series so far has been universal: no prior knowledge of the source distribution required. Elias gamma assigns shorter codewords to smaller integers regardless of which integers actually appear. Fibonacci does the same. VByte packs smaller values into fewer bytes without knowing whether your data clusters at the low end or the high end. Universal codes are defensive: they perform acceptably across a broad class of inputs by committing to none.

Huffman flips that stance. You bring a finite probability distribution. Huffman finds the prefix-free code with minimum expected codeword length for that distribution. The code is tuned to what you provide and will perform poorly on anything else. Call this the move from defensive to distribution-specific coding.

The payoff is real. Shannon’s source coding theorem says no prefix-free code can achieve expected length below $H(p) = -\sum_i p_i \log_2 p_i$. Huffman gets within 1 bit of that bound. For any prefix-free code, expected length satisfies

$$H(p) \le L(\text{code}) \le H(p) + 1.$$

The upper bound comes from the integer-length constraint: each codeword is a whole number of bits, and $\lceil -\log_2 p_i \rceil \le -\log_2 p_i + 1$. Huffman is optimal subject to this constraint. No prefix-free code does better without abandoning whole-bit codewords.

That last clause points to the limit of this post and the subject of the next. Arithmetic coding breaks the integer-length constraint by assigning fractional bits in effect, reaching entropy exactly in the limit.

The Algorithm

The four steps of Huffman’s construction:

Create one leaf node per symbol, weighted by its probability.
Push all leaves into a min-priority queue (lowest weight first).
While the queue contains more than one node: extract the two lowest-weight nodes, merge them into a new internal node whose weight is their sum, push the internal node back.
The remaining node is the root. The path from root to each leaf encodes that leaf’s codeword (“0” for left, “1” for right).

Here is the complete implementation from huffman.hpp:

PFC: Zero-Copy Data Compression Through Prefix-Free Codecs

June 10, 2024

PFC (Prefix-Free Codecs) is a header-only C++20 library built on a simple observation: data compression and zero-copy access are not contradictory goals, as long as you build on prefix-free codes and generic programming. The library gets 3-10x compression on typical integer distributions while maintaining full STL integration and type safety.

The zero-copy invariant

Traditional compression creates two worlds. Data lives uncompressed in memory (32 bits per integer) and compressed on disk (variable bits). You marshal between them. PFC eliminates that boundary:

\[ \text{In-memory representation} = \text{Wire representation} \]

// Traditional approach
std::vector<uint32_t> data = {1, 2, 3, 5, 8, 13};
auto compressed = compress(data);     // Marshal to wire format
store_to_disk(compressed);
auto restored = decompress(compressed); // Unmarshal back

// PFC approach
PackedContainer<uint32_t, EliasGamma> data;
data.push_back(1);
data.push_back(2);
data.push_back(3);

// Data is ALREADY compressed in memory
uint32_t value = data[0];  // Decodes from compressed form on access

// Write to disk? Zero copy.
write(fd, data.bytes().data(), data.bytes().size());

The data structure IS the compressed format. There’s no marshaling step.

Prefix-free codes

A code is prefix-free if no codeword is a prefix of another:

Prefix-free:           Not prefix-free:
  1 -> 0                 1 -> 0
  2 -> 10                2 -> 01    ("0" is a prefix of "01")
  3 -> 110               3 -> 011
  4 -> 1110

This matters because prefix-free codes compose naturally. Concatenate encodings without delimiters and decode unambiguously:

encode(1);  // 0
encode(2);  // 10
encode(3);  // 110
// Result: 010110 (self-delimiting)

decode();  // Reads "0" -> 1
decode();  // Reads "10" -> 2
decode();  // Reads "110" -> 3

This enables streaming and composition without any framing overhead.

Universal codes

PFC implements several universal codes, meaning they’re asymptotically optimal for any distribution without knowing the distribution in advance.

Elias Gamma encodes positive integer $n$ in $2\lfloor\log_2 n\rfloor + 1$ bits:

n     Binary   Elias Gamma
1     1        1
2     10       010
3     11       011
4     100      00100
5     101      00101
8     1000     0001000

Write $\lfloor\log_2 n\rfloor$ zeros, then the binary representation of $n$. Asymptotically optimal for geometric distributions.

Fibonacci encoding uses Zeckendorf representation (sum of non-consecutive Fibonacci numbers) with a terminal “11” marker. Every positive integer has a unique representation.

Rice/Golomb codes are parametric, optimal for geometric distributions with known parameter. Quotient in unary, remainder in binary. Good for values with exponential decay.

VByte / Varint

February 25, 2024

Every code in this series so far operates at bit granularity. VByte does not. It gives up bit-level precision for byte-alignment, and in production systems, that trade wins most of the time.

The Practical Question

Every code in this series so far operates at bit granularity. Elias gamma encodes 1 in a single bit. Fibonacci coding uses exactly as many bits as the Zeckendorf representation requires. Bit packing is theoretically attractive because it minimizes the number of bits written, which minimizes the encoded size.

But bit packing is computationally expensive. Reading or writing a single bit requires a shift, a mask, and often a branch to handle byte boundaries. Encoding a sequence of integers this way burns CPU cycles that scale with the number of integers, independent of their values. For high-throughput applications, the overhead of bit manipulation can easily exceed the savings from compact encoding.

VByte (also called Varint in Google’s ecosystem, and LEB128 in the DWARF debug format) trades a small amount of length efficiency for byte-alignment. The idea is simple: encode each integer as a sequence of 7-bit groups, one per byte, with a continuation flag in the high bit of each byte. The result is self-delimiting, compact for small values, and requires no bit-level manipulation to decode.

VByte is the encoding used by Protocol Buffers for all integer fields. It appears in Apache Arrow, Parquet, Snappy’s block format, LevelDB’s metadata, and most production columnar file formats. These are high-throughput systems. Byte-alignment is why VByte is their choice over the more compact universal codes from posts 4 through 7.

The Encoding

VByte splits an integer into 7-bit groups, starting from the least significant bits. Each group occupies one byte where bits 0 through 6 carry 7 data bits and bit 7 is a continuation flag: 1 means more bytes follow, 0 means this is the last byte.

A value in $[0, 127]$ fits in a single byte (continuation flag clear). A value in $[128, 16383]$ requires two bytes (first byte has flag set, second has flag clear). The pattern continues: each additional byte adds 7 bits of capacity.

Rice / Golomb

September 17, 2023

Every code in this series so far has been fixed. Rice and Golomb are different: they take a parameter, and the parameter is your model of the data.

The First Parametric Code

Every code examined so far in this series has been monolithic. Unary coding is just unary coding. Elias gamma is just Elias gamma. Each one encodes all non-negative integers with a single fixed strategy. You do not get to choose anything about the code beyond whether to use it.

Rice and Golomb codes break this pattern. They are parametric: a single integer parameter, $k$ for Rice or $m$ for Golomb, tunes the code to a specific source distribution. Rice$(k)$ is not one code but a family of codes, one per value of $k$. Each member of the family is optimal for a specific geometric distribution. Choosing $k$ is choosing your prior precisely.

This matters because data sources are rarely uniform. Run-length encodings, inter-frame video differences, and the gap sequences in inverted indexes are all approximately geometrically distributed. If you know the mean of your source, you can pick $k$ so that Rice$(k)$ performs near-optimally, without the overhead of a Huffman table or arithmetic coding.

The key insight: for a geometric source with mean approximately $2^k$, Rice$(k)$ is within a small constant of entropy. No other universal code in this series achieves this. Elias gamma and delta perform well asymptotically but can be far from optimal for a specific geometric distribution with a known mean. Rice exploits that knowledge directly.

Rice Coding

Rice coding splits a non-negative integer $n$ into two parts: a quotient $q = \lfloor n / 2^k \rfloor = n \gg k$ and a remainder $r = n \bmod 2^k = n \mathbin{&} (2^k - 1)$.

The quotient is encoded in unary: $q$ zero bits followed by a stop bit of 1. The remainder is encoded in exactly $k$ bits, MSB first. The total codeword is the concatenation of these two parts.

Codeword examples for $k = 2$ (remainder is always 2 bits):

$n$	$q$	$r$	Codeword	Bits
0	0	0	`1 00`	3
1	0	1	`1 01`	3
2	0	2	`1 10`	3
3	0	3	`1 11`	3
4	1	0	`0 1 00`	4
5	1	1	`0 1 01`	4

Codeword length: $(n \gg k) + 1 + k$ bits. The Kraft sum saturates to 1, so Rice is a complete prefix-free code.

Fibonacci Coding

April 23, 2023

Every code in this series so far has optimized expected length under some implied prior. Fibonacci coding does something different: it gives the decoder a way to recover from errors without help from a lower layer.

A Different Design Goal

All the codes in this series have aimed at the same target: assign short codewords to frequent symbols, with length growing roughly as $\log n$ for the $n$-th symbol under some implied prior. Elias gamma minimizes expected length for power-law distributions; delta and omega extend the recursion for heavier tails.

Fibonacci coding has a different goal. It does not optimize for average codeword length under a specific distribution. It optimizes for error resilience. In a stream of gamma-coded integers, a single bit flip in a codeword’s length prefix causes the decoder to misread that codeword’s length, then misread every subsequent codeword. The error propagates without limit until the decoder somehow reacquires sync. On a reliable channel this is a nonissue. On a noisy one, or in stored data that may have silently rotted, it is a serious problem.

Fibonacci coding avoids this. Every Fibonacci codeword ends in two consecutive 1 bits (“11”). This double-one marker appears nowhere else in the codeword. A single bit flip corrupts the codeword it hits, possibly spills into the next codeword, and then the decoder finds the next “11” and resynchronizes. At most two codewords are corrupted per error. The rest of the stream is intact.

The price is length overhead: Fibonacci codewords are approximately $1.44 \times \log_2 n$ bits long, compared to $\log_2 n$ bits for the entropy lower bound. On a reliable channel, that overhead is not worth paying. On a noisy channel, or in a long-running stream where rare bit errors must not lose the entire tail, the self-synchronization property is worth it.

Zeckendorf’s Theorem

Fibonacci numbers starting from $F_2 = 1$: $1, 2, 3, 5, 8, 13, 21, 34, \ldots$

Zeckendorf’s theorem: every positive integer $n$ has a unique representation as a sum of non-consecutive Fibonacci numbers. The greedy algorithm produces it by repeatedly subtracting the largest Fibonacci number that does not exceed $n$.

Elias Delta and Omega

November 13, 2022

Elias gamma spends too many bits saying how many bits it will use. Delta fixes that. Omega takes the fix one step further. This post is about what happens when you apply recursion to the length prefix.

Where Gamma Stops Being Good

Elias gamma, from the previous post, encodes a positive integer $n$ in $2\lfloor \log_2 n \rfloor + 1$ bits: a unary count of $\lfloor \log_2 n \rfloor$ zeros, then a stop bit, then the $\lfloor \log_2 n \rfloor$ trailing binary bits of $n$. For small $n$ this is fine. For large $n$, nearly half the bits are spent on the unary prefix alone.

The unary prefix is the bottleneck. It encodes the length $L = \lfloor \log_2 n \rfloor + 1$ in the most wasteful possible way: one bit per unit. For $n = 256$, that is 8 zero bits just to say “the payload is 8 bits long.” The payload itself is also 8 bits, so you are paying a 100% overhead on the length announcement. That is bad, and it gets worse as $n$ grows.

The fix is obvious once you see it: encode $L$ itself in some shorter code instead of unary. Elias delta does exactly this, replacing the unary length prefix with a gamma-coded length. Elias omega takes the idea one step further and applies the recursion to itself, all the way down.

Both codes are universal: they assign finite codewords to every positive integer, and the expected codeword length is within a constant factor of optimal for any source whose probabilities decrease with $n$. The improvement over gamma is real and measurable once $n$ grows past a few dozen.

This post shows both implementations, their implied priors, and the crossover points where each code wins. As in the rest of this series, the code is pedagogical: each header stands alone and the struct-with-encode/decode pattern maps directly onto the PFC library’s EliasDelta and EliasOmega in codecs.hpp.

Elias Delta

Algorithm. Let $L = \lfloor \log_2 n \rfloor + 1$ (the bit-width of $n$, equivalently std::bit_width(n)).

Encode $L$ in Elias gamma.
Write the $L - 1$ trailing bits of $n$ after its implicit leading 1, MSB first.

Gamma encodes $L$ (a small integer) in $O(\log \log n)$ bits instead of $O(\log n)$ bits for the unary prefix. The payload is identical to gamma’s: the trailing bits of $n$. The total length is $O(\log n + \log \log n)$.

Unary and Elias Gamma

June 19, 2022

Unary is older than information theory. Elias gamma is its 1975 improvement. Together they span the gap between optimal-but-impractical and practical-but-nearly-optimal. This post derives what each code bets on, and shows numerically what that means.

Unary and Elias Gamma

Unary is the oldest code in this series. It predates information theory by centuries: a shepherd counting sheep on a stick is using unary. Mark one notch per sheep; count the notches to decode. The codeword for $n$ is $n$ tally marks. Its information-theoretic justification came later, when Shannon showed it is exactly optimal for a geometric source.

Elias gamma is the 1975 extension by Peter Elias. It brings the codeword length from $O(n)$ to $O(\log n)$, making it practical for numbers beyond small single digits, while keeping the prefix-free property that makes self-delimiting streams possible.

Both codes are instances of the claim from Universal Codes as Priors: every prefix-free code is a bet about the source. Unary bets on a geometric distribution with parameter $1/2$. Gamma bets on a power-law distribution with exponent $\approx 2$. This post implements both, derives their implied priors, and shows numerically what the bets mean.

Unary: Geometric Prior

The encoding rule for unary is simple: to encode integer $n \geq 1$, write $(n-1)$ zero bits followed by one 1 bit. The decoder reads bits until it sees the 1; the number of bits read is the decoded value.

Examples: $1 \to$ 1, $2 \to$ 01, $3 \to$ 001, $4 \to$ 0001.

struct Unary {
    using value_type = std::uint64_t;

    template<BitSink S>
    static void encode(value_type n, S& sink) {
        assert(n >= 1 && "Unary is undefined for n = 0");
        for (value_type i = 1; i < n; ++i) sink.write(false);
        sink.write(true);
    }

    template<BitSource S>
    static value_type decode(S& source) {
        value_type n = 1;
        while (!source.read()) ++n;
        return n;
    }
};

Length analysis. The codeword for $n$ has length $n$. The Kraft sum is $\sum_{n=1}^{\infty} 2^{-n} = 1$: unary saturates Kraft exactly. The implied prior is $p_n = 2^{-n}$: a geometric distribution with parameter $1/2$, where each value is half as likely as the previous.

Optimality test. Because the implied prior is dyadic (all probabilities are powers of $1/2$) and Kraft saturates, unary achieves entropy exactly on this prior. For a 30-symbol truncation of geometric(1/2), the expected unary length equals the entropy to within the truncation tail ($\approx 2^{-30}$):

Universal Codes as Priors

January 15, 2022

When you pick a code for integers, you are making a bet about what integers the source will produce. The bet lives in the codeword lengths, not in a separate parameter. This post makes that precise.

Universal Codes as Priors

You want to compress a stream of positive integers. Which code should you use?

The question has more structure than it appears. A code for integers assigns a codeword to each integer. The codeword for 1 is short, for 2 a bit longer, for 100 much longer. The relative lengths encode an implicit bet: what fraction of the stream will be 1s? What fraction will be 100s? If the bet matches the source, the average codeword length will be close to the theoretical minimum, the entropy. If the bet is wrong, you pay an overhead proportional to how wrong you are.

The bet is not a separate parameter. It lives in the codeword lengths themselves. This is the central claim of this post:

Every prefix-free code is a prior over the integers. The codeword lengths determine, up to normalization, a probability distribution. The code is optimal for exactly the sources that match that distribution.

This post makes that claim precise and implements the tools to measure it. The rest of the series (posts 4 through 12) examines ten specific codes and the priors they embody.

The Correspondence: Lengths to Priors

For a prefix-free code with codeword lengths $(l_1, l_2, \ldots, l_n)$, define the unnormalized weight of symbol $i$ as $w_i = 2^{-l_i}$. This is the fraction of the Kraft budget consumed by that codeword.

If the code saturates Kraft (meaning $\sum_i 2^{-l_i} = 1$), then the weights are already a valid probability distribution: $p_i = 2^{-l_i}$. If the code does not saturate (meaning $\sum_i 2^{-l_i} < 1$), normalize: $p_i = 2^{-l_i} / \sum_j 2^{-l_j}$.

This is the inverse of Shannon’s prescription. Shannon says: given a distribution $p_i$, the optimal codeword length is $\lceil -\log_2 p_i \rceil$ bits. We reverse the direction: given a length $l_i$, the implied probability is $2^{-l_i}$.

The function implied_prior computes this map:

inline std::vector<double> implied_prior(const std::vector<std::size_t>& lengths) {
    std::vector<double> probs;
    probs.reserve(lengths.size());
    double total = 0.0;
    for (std::size_t l : lengths) {
        double p = std::ldexp(1.0, -static_cast<int>(l));
        probs.push_back(p);
        total += p;
    }
    // Normalize if Kraft sum is less than 1.
    if (total < 1.0) {
        for (double& p : probs) p /= total;
    }
    return probs;
}

Two examples show the range of priors you get in practice.

McMillan's Converse

September 13, 2020

Kraft’s inequality is necessary. McMillan’s theorem says it is also sufficient, and the proof is a construction.

McMillan’s Converse

The previous post in this series proved Kraft’s inequality: for any prefix-free binary code with codeword lengths $l_1, l_2, \ldots, l_n$,

$$\sum_{i=1}^{n} 2^{-l_i} \leq 1.$$

Every prefix-free code satisfies it. No exceptions. But necessity alone is not the useful direction. The question I want answered is the converse: given a length vector that satisfies Kraft, does a prefix-free code with those lengths actually exist?

Yes, and McMillan’s theorem (1956) proves it. Better still, the proof is a construction: given any Kraft-satisfying length vector, you can produce a specific prefix-free code with those exact lengths. No search required. No verification required after the fact. The construction always terminates, always produces a valid code, because Kraft pre-certifies that the budget is sufficient.

This post proves the constructive direction, then goes further. McMillan proved something stronger than just the prefix-free converse. He showed that even uniquely-decodable codes that are not prefix-free must satisfy Kraft. The consequence is worth sitting with: there is no advantage to non-prefix-free designs. If a code can be uniquely decoded, a prefix-free code with the same lengths exists. Prefix-freeness is not a restriction you impose for convenience. It is just the cleanest form of what unique decodability requires.

The Construction

The construction is a left-to-right walk through an imaginary binary trie. Sort the lengths, then assign codewords by taking the next available leaf at each step.

Concretely: fix a counter at zero, and for each length $l_i$ (in sorted order), emit the binary representation of counter >> (l_max - l_i) left-padded to $l_i$ bits. Then advance the counter by $2^{l_{\max} - l_i}$, which skips past the entire subtree rooted at the just-assigned codeword. That advance ensures the next codeword starts at the first unoccupied leaf position in the depth-$l_{\max}$ trie.

Work through the example from post 1: lengths $\{1, 2, 3, 3\}$. Sort: $1, 2, 3, 3$. Take $l_{\max} = 3$.

Length 1: counter is 0. Shift right by $3 - 1 = 2$: emit 0 >> 2 = 0 as a 1-bit string, giving codeword "0". Advance counter by $2^{3-1} = 4$. Counter is now 4.
Length 2: counter is 4 (binary 100). Shift right by $3 - 2 = 1$: emit 4 >> 1 = 2 as a 2-bit string, giving "10". Advance by $2^{3-2} = 2$. Counter is now 6.
Length 3: counter is 6 (binary 110). Shift right by $3 - 3 = 0$: emit 6 as a 3-bit string, giving "110". Advance by $2^0 = 1$. Counter is 7.
Length 3: counter is 7 (binary 111). Emit 7 as a 3-bit string: "111". Advance by 1. Counter is 8.

Result: {"0", "10", "110", "111"}. This is exactly the example code from post 1. The construction recovered it directly from the length vector, without any search.

Kraft's Inequality

March 22, 2020

Every prefix-free code satisfies one inequality. That inequality is also sufficient. This post develops the necessary direction.

Kraft’s Inequality

I want a code where each symbol maps to a bit string, and where any concatenation of codewords can be decoded unambiguously. The simplest way to guarantee that is prefix-freeness: no codeword is a prefix of any other. A prefix-free code is self-delimiting. The decoder reads bits left-to-right and knows exactly when each codeword ends, with no lookahead and no length headers.

The question I keep returning to is: which collections of lengths are actually achievable? If I want four codewords of lengths 1, 2, 3, and 3, can I build a prefix-free code with those lengths? What if I want two codewords of length 1? (No: there are only two 1-bit strings, and they are prefixes of everything longer.)

Kraft’s inequality is the answer. A length vector $(l_1, l_2, \ldots, l_n)$ is achievable by a prefix-free binary code only if

$$\sum_{i=1}^{n} 2^{-l_i} \leq 1.$$

This is the constraint you cannot escape. Any prefix-free code satisfies it. Any length vector that violates it cannot be realized as a prefix-free code, full stop.

The converse is also true: any length vector satisfying Kraft is realizable by some prefix-free code. That is McMillan’s theorem, and it is the subject of the next post in this series. This post develops the necessary direction: every prefix-free code satisfies Kraft.

The right tool for understanding why is the binary tree.

The Trie View

Represent each codeword as a path in a binary tree. Start at the root. For each bit, go left (0) or right (1). The codeword ends at a node, which I mark as a terminal. A code is prefix-free if and only if no terminal node has any descendants that are also terminals. Once you reach a terminal on the way down, you stop.

The example code $\{A \to \texttt{0},\ B \to \texttt{10},\ C \to \texttt{110},\ D \to \texttt{111}\}$ has lengths $(1, 2, 3, 3)$. Its trie looks like this:

A is at depth 1, left branch. B is at depth 2, right-then-left. C and D share a parent at depth 2, then split at depth 3. No codeword’s node is an ancestor of another’s: the code is prefix-free.

Synthesis: Codecs as Structure

May 15, 2026

Twelve posts, twelve codes, one thesis that refused to change. This is the closing summary.

A. The Twelve Codes Together

Every post in this series answered a version of the same question: given a source of positive integers, how do you represent its values compactly as a sequence of bits? The answers differ in shape, in assumptions, and in which distribution each code implicitly expects.

Post	Code	Implied prior (one phrase)
1-2	Foundations	Prefix-free codes are possible iff Kraft’s inequality holds
3	Priors framework	Any code defines a prior; the best code matches the source
4	Unary	Geometric(1/2): value 1 is twice as likely as value 2, etc.
5a	Elias Gamma	Power-law: probability falls as 1/n^2
5b	Elias Delta	Heavier-tailed power law: slower decay for large values
5c	Elias Omega	Recursive structure: no fixed polynomial decay rate
6	Fibonacci	Near-geometric with Zeckendorf structure; good for Zeckendorf-sparse integers
7	Rice / Golomb	Geometric with known parameter m; optimal when m divides entropy
8	VByte	Roughly uniform over byte-aligned ranges; engineering favorite
9	Huffman	Source-optimal given the exact symbol distribution
10	Arithmetic coding	Approaches entropy to an arbitrary fraction of a bit
11	Succinct bit vectors	Not a code for integers: a representation that answers rank/select queries
12	RoaringBitmap	Polyalgorithm: picks array, bitset, or run-length per container chunk

Posts 1 and 2 (Kraft’s Inequality and McMillan’s Converse) established why prefix-free codes are the right unit of analysis. Post 3 (Universal Codes as Priors) named the frame: a code is a hypothesis about the source. Posts 4 through 10 filled in the catalogue. Posts 11 and 12 extended from integer coding to set representation, where the questions shift from “how long is this codeword?” to “how do you store membership?” and “how do you answer rank/select?”

Looking across all twelve, the main lesson is not that one code dominates. It is that the question “which code?” is always empirically answerable given a sample.

B. The Unifying Frame Restated

Post 3 introduced the codes-as-priors thesis with two instances behind it. We now have twelve. The thesis has not changed; it has only become more evidently true.

RoaringBitmap

December 7, 2025

No single representation is optimal across all density regimes. RoaringBitmap does not try to pick one. It measures, then decides, per chunk.

Hybrid Representation as Polyalgorithm

Representing a compressed integer set forces a decision you cannot avoid: the best encoding depends on how dense the data is. Four density regimes each have a natural winner:

Very sparse (fewer than ~4096 elements in a 64K range): a sorted array of 16-bit values. Each element costs 2 bytes; 4096 elements cost 8 KB.
Moderate to dense (more than ~4096 elements): a dense bit vector, 8 KB for the full 64K range. Contains-check is O(1) via bit indexing.
Very dense (almost full): a sorted array of the absent elements, using the same 2 bytes per gap.
Clustered runs (many consecutive values): run-length encoding. A run of $k$ consecutive integers costs one 4-byte record regardless of $k$.

No single prior dominates across all four regimes. RoaringBitmap (Lemire et al., 2014) is the hybrid answer: partition the 32-bit integer space into 64K-element chunks (indexed by the high 16 bits of each value), then assign each chunk the optimal container based on its actual density. The structure adapts per chunk, not globally.

The Three Container Types

Every 32-bit integer $v$ maps to chunk-id $v \gg 16$ and low bits $v \mathbin{&} 0\text{xFFFF}$. Each chunk stores its low bits in one of three containers:

class ArrayContainer  { std::vector<uint16_t> elems_;   /* sorted, unique */ };
class BitmapContainer { std::vector<uint64_t> words_;   /* 1024 words = 8 KB */ };
class RunContainer    { std::vector<std::pair<uint16_t,uint16_t>> runs_; /* (start, len-1) */ };

using ContainerVariant = std::variant<ArrayContainer, BitmapContainer, RunContainer>;

class RoaringBitmap {
    std::map<uint16_t, ContainerVariant> chunks_;
public:
    void add(uint32_t v);
    bool contains(uint32_t v) const noexcept;
    std::size_t cardinality() const noexcept;
    void optimize();
    RoaringBitmap union_with(const RoaringBitmap&) const;
    RoaringBitmap intersection_with(const RoaringBitmap&) const;
    RoaringBitmap difference(const RoaringBitmap&) const;
};

ArrayContainer uses a sorted std::vector<uint16_t>. Contains-check is binary search, O(log n). Add is sorted insertion, O(n) worst case but O(1) amortized for sequential loads. Space: 2 bytes per element.

BitmapContainer stores 1024 uint64_t words covering all 65536 possible values. Contains-check is O(1): (words_[v/64] >> (v%64)) & 1. Add is O(1). Cardinality requires scanning all 1024 words with popcount, which is fast in practice. Space: fixed 8 KB regardless of occupancy.

RunContainer stores runs as (start, length-1) pairs sorted by start. A run (s, l) covers values $s$ through $s+l$ inclusive, using 4 bytes. Space grows with the number of runs, not the number of values. One million consecutive integers fit in one 4-byte record.

Succinct Bit Vectors and Rank/Select

June 22, 2025

The claim that drives this post: store $n$ bits and answer prefix-count queries in $O(1)$ time, using only $n + o(n)$ bits total. The auxiliary index is asymptotically negligible. That is not obvious, and it is worth understanding why it holds.

Constant-Time Queries on Bit Vectors

Posts 1 through 10 of this series focused on encoding: universal codes, arithmetic coding, Huffman, LZ77. They all ask the same question in slightly different ways: how do we turn a sequence into a compact bit string? This post shifts direction. The question here is different: once we have a bit vector, how do we query it efficiently without expanding it?

The two fundamental queries on a bit vector of $n$ bits are:

rank$_1(i)$: the number of 1-bits in positions $[0, i)$, i.e., strictly before position $i$.
select$_1(j)$: the position of the $j$-th 1-bit (0-indexed).

These appear throughout data structures: inverted indexes, compressed graphs, FM-indexes, wavelet trees. Rank tells you how many elements precede a position. Select inverts it.

Three design points exist along the space-time axis:

Approach	Space	rank time	select time
Naive scan	$n$ bits	$O(n/64)$	$O(n/64)$
Full lookup table	$O(n \log n)$ bits	$O(1)$	$O(1)$
Succinct (this post)	$n + o(n)$ bits	$O(1)$	$O(\log n)$

The succinct approach hits the right trade-off: an auxiliary index that is asymptotically negligible (sublinear, so $o(n)$) while buying constant-time rank. Select costs $O(\log n)$ with a binary search. Getting $O(1)$ select requires one more index structure; that is covered briefly at the end and implemented in PFC’s production version.

The Structure

The bit vector stores its $n$ bits packed into 64-bit words, LSB-first. Bit $i$ lives at position $i \bmod 64$ within word $\lfloor i / 64 \rfloor$. Unused bits in the final word are zeroed.

The auxiliary index has two levels:

Superblock array: one uint64_t per 4096-bit chunk. Entry $s$ holds the absolute cumulative rank from bit 0 to the start of superblock $s$.
Block array: one uint16_t per 64-bit word. Entry $b$ holds the rank within the enclosing superblock, from the superblock’s start to the start of block $b$.

class SuccinctBitVector {
public:
    explicit SuccinctBitVector(const std::vector<bool>& bits);
    [[nodiscard]] std::size_t size()  const noexcept;
    [[nodiscard]] bool        bit(std::size_t i)  const noexcept;
    [[nodiscard]] std::size_t rank1(std::size_t i) const noexcept;   // O(1)
    [[nodiscard]] std::size_t select1(std::size_t j) const noexcept; // O(log n)

protected:
    std::size_t n_;
    std::vector<uint64_t> bits_;
    std::vector<uint64_t> superblock_ranks_;  // abs. rank at superblock boundaries
    std::vector<uint16_t> block_ranks_;       // superblock-relative rank per block

    static constexpr std::size_t SUPERBLOCK_BITS = 4096;
    static constexpr std::size_t BLOCK_BITS      = 64;
    static constexpr std::size_t BLOCKS_PER_SB   = 64;
};

The production version lives in PFC’s include/pfc/succinct.hpp. The pedagogical version in this post strips the production features to the core structure.

Arithmetic Coding

January 12, 2025

Huffman codes one symbol at a time. Arithmetic coding encodes the whole sequence as a single number. The difference is a factor of twelve, at least on the right source.

The Last Bit of Redundancy

Huffman coding gets expected codeword length within one bit of entropy. That is the best it can do, because codeword lengths must be integers while entropy is a real number.

The waste is structural. A symbol with probability $p = 0.7$ has optimal (fractional) length $-\log_2(0.7) \approx 0.515$ bits. Huffman rounds that up to 1 bit: 0.485 bits wasted per occurrence. For a nearly-deterministic source with $p_0 = 0.99$ and $p_1 = 0.01$, the entropy is $H \approx 0.081$ bits per symbol. Huffman is stuck at 1 bit per symbol. That is a factor-of-twelve gap, and Huffman cannot close it: a symbol that appears 99% of the time still gets a complete codeword.

Arithmetic coding steps back from per-symbol codewords entirely. It encodes an entire sequence as a single rational number in $[0, 1)$. The bit-length of that number converges to the entropy of the sequence as the sequence grows. No integer rounding, no per-symbol overhead.

This post builds an integer range coder in C++23 and demonstrates the factor-of-twelve improvement on the Bernoulli(0.99) source.

The Continuous View

Start with the unit interval $[0, 1)$. For a two-symbol source, partition it by probability: symbol 0 gets $[0, p_0)$ and symbol 1 gets $[p_0, 1)$.

To encode a sequence, begin with the full interval and narrow it with each symbol. After symbol $s_1$, restrict to the corresponding sub-interval. After $s_2$, apply the same proportional rule inside that sub-interval. After $L$ symbols, the interval has width $\prod_{i=1}^{L} p_{s_i}$.

Any number inside the final interval is a valid encoding. The shortest such number in binary requires approximately $-\log_2(\prod p_{s_i}) = \sum_{i=1}^{L} (-\log_2 p_{s_i})$ bits. As $L \to \infty$, bits per symbol approaches $H(p) = -\sum_k p_k \log_2 p_k$ exactly.

Decoding is the inverse: given the encoded number, determine at each step which sub-interval it falls in, recover the symbol, narrow the interval, and repeat.

The theory is complete. The practice is not. A real interval narrows exponentially fast: after a few dozen symbols you need arbitrary precision. The integer range coder fixes this with 32-bit arithmetic and a renormalization step.

Huffman Coding

August 4, 2024

Huffman coding is two things: the optimal length vector for a known distribution, and McMillan’s construction applied to that vector. This post develops both.

From Universal to Optimal

Every code in this series so far has been universal: no prior knowledge of the source distribution required. Elias gamma assigns shorter codewords to smaller integers regardless of which integers actually appear. Fibonacci does the same. VByte packs smaller values into fewer bytes without knowing whether your data clusters at the low end or the high end. Universal codes are defensive: they perform acceptably across a broad class of inputs by committing to none.

Huffman flips that stance. You bring a finite probability distribution. Huffman finds the prefix-free code with minimum expected codeword length for that distribution. The code is tuned to what you provide and will perform poorly on anything else. Call this the move from defensive to distribution-specific coding.

The payoff is real. Shannon’s source coding theorem says no prefix-free code can achieve expected length below $H(p) = -\sum_i p_i \log_2 p_i$. Huffman gets within 1 bit of that bound. For any prefix-free code, expected length satisfies

$$H(p) \le L(\text{code}) \le H(p) + 1.$$

The upper bound comes from the integer-length constraint: each codeword is a whole number of bits, and $\lceil -\log_2 p_i \rceil \le -\log_2 p_i + 1$. Huffman is optimal subject to this constraint. No prefix-free code does better without abandoning whole-bit codewords.

That last clause points to the limit of this post and the subject of the next. Arithmetic coding breaks the integer-length constraint by assigning fractional bits in effect, reaching entropy exactly in the limit.

The Algorithm

The four steps of Huffman’s construction:

Create one leaf node per symbol, weighted by its probability.
Push all leaves into a min-priority queue (lowest weight first).
While the queue contains more than one node: extract the two lowest-weight nodes, merge them into a new internal node whose weight is their sum, push the internal node back.
The remaining node is the root. The path from root to each leaf encodes that leaf’s codeword (“0” for left, “1” for right).

Here is the complete implementation from huffman.hpp:

VByte / Varint

February 25, 2024

Every code in this series so far operates at bit granularity. VByte does not. It gives up bit-level precision for byte-alignment, and in production systems, that trade wins most of the time.

The Practical Question

Every code in this series so far operates at bit granularity. Elias gamma encodes 1 in a single bit. Fibonacci coding uses exactly as many bits as the Zeckendorf representation requires. Bit packing is theoretically attractive because it minimizes the number of bits written, which minimizes the encoded size.

But bit packing is computationally expensive. Reading or writing a single bit requires a shift, a mask, and often a branch to handle byte boundaries. Encoding a sequence of integers this way burns CPU cycles that scale with the number of integers, independent of their values. For high-throughput applications, the overhead of bit manipulation can easily exceed the savings from compact encoding.

VByte (also called Varint in Google’s ecosystem, and LEB128 in the DWARF debug format) trades a small amount of length efficiency for byte-alignment. The idea is simple: encode each integer as a sequence of 7-bit groups, one per byte, with a continuation flag in the high bit of each byte. The result is self-delimiting, compact for small values, and requires no bit-level manipulation to decode.

VByte is the encoding used by Protocol Buffers for all integer fields. It appears in Apache Arrow, Parquet, Snappy’s block format, LevelDB’s metadata, and most production columnar file formats. These are high-throughput systems. Byte-alignment is why VByte is their choice over the more compact universal codes from posts 4 through 7.

The Encoding

VByte splits an integer into 7-bit groups, starting from the least significant bits. Each group occupies one byte where bits 0 through 6 carry 7 data bits and bit 7 is a continuation flag: 1 means more bytes follow, 0 means this is the last byte.

A value in $[0, 127]$ fits in a single byte (continuation flag clear). A value in $[128, 16383]$ requires two bytes (first byte has flag set, second has flag clear). The pattern continues: each additional byte adds 7 bits of capacity.

Rice / Golomb

September 17, 2023

Every code in this series so far has been fixed. Rice and Golomb are different: they take a parameter, and the parameter is your model of the data.

The First Parametric Code

Every code examined so far in this series has been monolithic. Unary coding is just unary coding. Elias gamma is just Elias gamma. Each one encodes all non-negative integers with a single fixed strategy. You do not get to choose anything about the code beyond whether to use it.

Rice and Golomb codes break this pattern. They are parametric: a single integer parameter, $k$ for Rice or $m$ for Golomb, tunes the code to a specific source distribution. Rice$(k)$ is not one code but a family of codes, one per value of $k$. Each member of the family is optimal for a specific geometric distribution. Choosing $k$ is choosing your prior precisely.

This matters because data sources are rarely uniform. Run-length encodings, inter-frame video differences, and the gap sequences in inverted indexes are all approximately geometrically distributed. If you know the mean of your source, you can pick $k$ so that Rice$(k)$ performs near-optimally, without the overhead of a Huffman table or arithmetic coding.

The key insight: for a geometric source with mean approximately $2^k$, Rice$(k)$ is within a small constant of entropy. No other universal code in this series achieves this. Elias gamma and delta perform well asymptotically but can be far from optimal for a specific geometric distribution with a known mean. Rice exploits that knowledge directly.

Rice Coding

Rice coding splits a non-negative integer $n$ into two parts: a quotient $q = \lfloor n / 2^k \rfloor = n \gg k$ and a remainder $r = n \bmod 2^k = n \mathbin{&} (2^k - 1)$.

The quotient is encoded in unary: $q$ zero bits followed by a stop bit of 1. The remainder is encoded in exactly $k$ bits, MSB first. The total codeword is the concatenation of these two parts.

Codeword examples for $k = 2$ (remainder is always 2 bits):

$n$	$q$	$r$	Codeword	Bits
0	0	0	`1 00`	3
1	0	1	`1 01`	3
2	0	2	`1 10`	3
3	0	3	`1 11`	3
4	1	0	`0 1 00`	4
5	1	1	`0 1 01`	4

Codeword length: $(n \gg k) + 1 + k$ bits. The Kraft sum saturates to 1, so Rice is a complete prefix-free code.

Fibonacci Coding

April 23, 2023

Every code in this series so far has optimized expected length under some implied prior. Fibonacci coding does something different: it gives the decoder a way to recover from errors without help from a lower layer.

A Different Design Goal

All the codes in this series have aimed at the same target: assign short codewords to frequent symbols, with length growing roughly as $\log n$ for the $n$-th symbol under some implied prior. Elias gamma minimizes expected length for power-law distributions; delta and omega extend the recursion for heavier tails.

Fibonacci coding has a different goal. It does not optimize for average codeword length under a specific distribution. It optimizes for error resilience. In a stream of gamma-coded integers, a single bit flip in a codeword’s length prefix causes the decoder to misread that codeword’s length, then misread every subsequent codeword. The error propagates without limit until the decoder somehow reacquires sync. On a reliable channel this is a nonissue. On a noisy one, or in stored data that may have silently rotted, it is a serious problem.

Fibonacci coding avoids this. Every Fibonacci codeword ends in two consecutive 1 bits (“11”). This double-one marker appears nowhere else in the codeword. A single bit flip corrupts the codeword it hits, possibly spills into the next codeword, and then the decoder finds the next “11” and resynchronizes. At most two codewords are corrupted per error. The rest of the stream is intact.

The price is length overhead: Fibonacci codewords are approximately $1.44 \times \log_2 n$ bits long, compared to $\log_2 n$ bits for the entropy lower bound. On a reliable channel, that overhead is not worth paying. On a noisy channel, or in a long-running stream where rare bit errors must not lose the entire tail, the self-synchronization property is worth it.

Zeckendorf’s Theorem

Fibonacci numbers starting from $F_2 = 1$: $1, 2, 3, 5, 8, 13, 21, 34, \ldots$

Zeckendorf’s theorem: every positive integer $n$ has a unique representation as a sum of non-consecutive Fibonacci numbers. The greedy algorithm produces it by repeatedly subtracting the largest Fibonacci number that does not exceed $n$.

Elias Delta and Omega

November 13, 2022

Elias gamma spends too many bits saying how many bits it will use. Delta fixes that. Omega takes the fix one step further. This post is about what happens when you apply recursion to the length prefix.

Where Gamma Stops Being Good

Elias gamma, from the previous post, encodes a positive integer $n$ in $2\lfloor \log_2 n \rfloor + 1$ bits: a unary count of $\lfloor \log_2 n \rfloor$ zeros, then a stop bit, then the $\lfloor \log_2 n \rfloor$ trailing binary bits of $n$. For small $n$ this is fine. For large $n$, nearly half the bits are spent on the unary prefix alone.

The unary prefix is the bottleneck. It encodes the length $L = \lfloor \log_2 n \rfloor + 1$ in the most wasteful possible way: one bit per unit. For $n = 256$, that is 8 zero bits just to say “the payload is 8 bits long.” The payload itself is also 8 bits, so you are paying a 100% overhead on the length announcement. That is bad, and it gets worse as $n$ grows.

The fix is obvious once you see it: encode $L$ itself in some shorter code instead of unary. Elias delta does exactly this, replacing the unary length prefix with a gamma-coded length. Elias omega takes the idea one step further and applies the recursion to itself, all the way down.

Both codes are universal: they assign finite codewords to every positive integer, and the expected codeword length is within a constant factor of optimal for any source whose probabilities decrease with $n$. The improvement over gamma is real and measurable once $n$ grows past a few dozen.

This post shows both implementations, their implied priors, and the crossover points where each code wins. As in the rest of this series, the code is pedagogical: each header stands alone and the struct-with-encode/decode pattern maps directly onto the PFC library’s EliasDelta and EliasOmega in codecs.hpp.

Elias Delta

Algorithm. Let $L = \lfloor \log_2 n \rfloor + 1$ (the bit-width of $n$, equivalently std::bit_width(n)).

Encode $L$ in Elias gamma.
Write the $L - 1$ trailing bits of $n$ after its implicit leading 1, MSB first.

Gamma encodes $L$ (a small integer) in $O(\log \log n)$ bits instead of $O(\log n)$ bits for the unary prefix. The payload is identical to gamma’s: the trailing bits of $n$. The total length is $O(\log n + \log \log n)$.

Unary and Elias Gamma

June 19, 2022

Unary is older than information theory. Elias gamma is its 1975 improvement. Together they span the gap between optimal-but-impractical and practical-but-nearly-optimal. This post derives what each code bets on, and shows numerically what that means.

Unary and Elias Gamma

Unary is the oldest code in this series. It predates information theory by centuries: a shepherd counting sheep on a stick is using unary. Mark one notch per sheep; count the notches to decode. The codeword for $n$ is $n$ tally marks. Its information-theoretic justification came later, when Shannon showed it is exactly optimal for a geometric source.

Elias gamma is the 1975 extension by Peter Elias. It brings the codeword length from $O(n)$ to $O(\log n)$, making it practical for numbers beyond small single digits, while keeping the prefix-free property that makes self-delimiting streams possible.

Both codes are instances of the claim from Universal Codes as Priors: every prefix-free code is a bet about the source. Unary bets on a geometric distribution with parameter $1/2$. Gamma bets on a power-law distribution with exponent $\approx 2$. This post implements both, derives their implied priors, and shows numerically what the bets mean.

Unary: Geometric Prior

The encoding rule for unary is simple: to encode integer $n \geq 1$, write $(n-1)$ zero bits followed by one 1 bit. The decoder reads bits until it sees the 1; the number of bits read is the decoded value.

Examples: $1 \to$ 1, $2 \to$ 01, $3 \to$ 001, $4 \to$ 0001.

struct Unary {
    using value_type = std::uint64_t;

    template<BitSink S>
    static void encode(value_type n, S& sink) {
        assert(n >= 1 && "Unary is undefined for n = 0");
        for (value_type i = 1; i < n; ++i) sink.write(false);
        sink.write(true);
    }

    template<BitSource S>
    static value_type decode(S& source) {
        value_type n = 1;
        while (!source.read()) ++n;
        return n;
    }
};

Length analysis. The codeword for $n$ has length $n$. The Kraft sum is $\sum_{n=1}^{\infty} 2^{-n} = 1$: unary saturates Kraft exactly. The implied prior is $p_n = 2^{-n}$: a geometric distribution with parameter $1/2$, where each value is half as likely as the previous.

Optimality test. Because the implied prior is dyadic (all probabilities are powers of $1/2$) and Kraft saturates, unary achieves entropy exactly on this prior. For a 30-symbol truncation of geometric(1/2), the expected unary length equals the entropy to within the truncation tail ($\approx 2^{-30}$):

Universal Codes as Priors

January 15, 2022

When you pick a code for integers, you are making a bet about what integers the source will produce. The bet lives in the codeword lengths, not in a separate parameter. This post makes that precise.

Universal Codes as Priors

You want to compress a stream of positive integers. Which code should you use?

The question has more structure than it appears. A code for integers assigns a codeword to each integer. The codeword for 1 is short, for 2 a bit longer, for 100 much longer. The relative lengths encode an implicit bet: what fraction of the stream will be 1s? What fraction will be 100s? If the bet matches the source, the average codeword length will be close to the theoretical minimum, the entropy. If the bet is wrong, you pay an overhead proportional to how wrong you are.

The bet is not a separate parameter. It lives in the codeword lengths themselves. This is the central claim of this post:

Every prefix-free code is a prior over the integers. The codeword lengths determine, up to normalization, a probability distribution. The code is optimal for exactly the sources that match that distribution.

This post makes that claim precise and implements the tools to measure it. The rest of the series (posts 4 through 12) examines ten specific codes and the priors they embody.

The Correspondence: Lengths to Priors

For a prefix-free code with codeword lengths $(l_1, l_2, \ldots, l_n)$, define the unnormalized weight of symbol $i$ as $w_i = 2^{-l_i}$. This is the fraction of the Kraft budget consumed by that codeword.

If the code saturates Kraft (meaning $\sum_i 2^{-l_i} = 1$), then the weights are already a valid probability distribution: $p_i = 2^{-l_i}$. If the code does not saturate (meaning $\sum_i 2^{-l_i} < 1$), normalize: $p_i = 2^{-l_i} / \sum_j 2^{-l_j}$.

This is the inverse of Shannon’s prescription. Shannon says: given a distribution $p_i$, the optimal codeword length is $\lceil -\log_2 p_i \rceil$ bits. We reverse the direction: given a length $l_i$, the implied probability is $2^{-l_i}$.

The function implied_prior computes this map:

inline std::vector<double> implied_prior(const std::vector<std::size_t>& lengths) {
    std::vector<double> probs;
    probs.reserve(lengths.size());
    double total = 0.0;
    for (std::size_t l : lengths) {
        double p = std::ldexp(1.0, -static_cast<int>(l));
        probs.push_back(p);
        total += p;
    }
    // Normalize if Kraft sum is less than 1.
    if (total < 1.0) {
        for (double& p : probs) p /= total;
    }
    return probs;
}

Two examples show the range of priors you get in practice.

McMillan's Converse

September 13, 2020

Kraft’s inequality is necessary. McMillan’s theorem says it is also sufficient, and the proof is a construction.

McMillan’s Converse

The previous post in this series proved Kraft’s inequality: for any prefix-free binary code with codeword lengths $l_1, l_2, \ldots, l_n$,

$$\sum_{i=1}^{n} 2^{-l_i} \leq 1.$$

Every prefix-free code satisfies it. No exceptions. But necessity alone is not the useful direction. The question I want answered is the converse: given a length vector that satisfies Kraft, does a prefix-free code with those lengths actually exist?

Yes, and McMillan’s theorem (1956) proves it. Better still, the proof is a construction: given any Kraft-satisfying length vector, you can produce a specific prefix-free code with those exact lengths. No search required. No verification required after the fact. The construction always terminates, always produces a valid code, because Kraft pre-certifies that the budget is sufficient.

This post proves the constructive direction, then goes further. McMillan proved something stronger than just the prefix-free converse. He showed that even uniquely-decodable codes that are not prefix-free must satisfy Kraft. The consequence is worth sitting with: there is no advantage to non-prefix-free designs. If a code can be uniquely decoded, a prefix-free code with the same lengths exists. Prefix-freeness is not a restriction you impose for convenience. It is just the cleanest form of what unique decodability requires.

The Construction

The construction is a left-to-right walk through an imaginary binary trie. Sort the lengths, then assign codewords by taking the next available leaf at each step.

Concretely: fix a counter at zero, and for each length $l_i$ (in sorted order), emit the binary representation of counter >> (l_max - l_i) left-padded to $l_i$ bits. Then advance the counter by $2^{l_{\max} - l_i}$, which skips past the entire subtree rooted at the just-assigned codeword. That advance ensures the next codeword starts at the first unoccupied leaf position in the depth-$l_{\max}$ trie.

Work through the example from post 1: lengths $\{1, 2, 3, 3\}$. Sort: $1, 2, 3, 3$. Take $l_{\max} = 3$.

Length 1: counter is 0. Shift right by $3 - 1 = 2$: emit 0 >> 2 = 0 as a 1-bit string, giving codeword "0". Advance counter by $2^{3-1} = 4$. Counter is now 4.
Length 2: counter is 4 (binary 100). Shift right by $3 - 2 = 1$: emit 4 >> 1 = 2 as a 2-bit string, giving "10". Advance by $2^{3-2} = 2$. Counter is now 6.
Length 3: counter is 6 (binary 110). Shift right by $3 - 3 = 0$: emit 6 as a 3-bit string, giving "110". Advance by $2^0 = 1$. Counter is 7.
Length 3: counter is 7 (binary 111). Emit 7 as a 3-bit string: "111". Advance by 1. Counter is 8.

Result: {"0", "10", "110", "111"}. This is exactly the example code from post 1. The construction recovered it directly from the length vector, without any search.

Kraft's Inequality

March 22, 2020

Every prefix-free code satisfies one inequality. That inequality is also sufficient. This post develops the necessary direction.

Kraft’s Inequality

I want a code where each symbol maps to a bit string, and where any concatenation of codewords can be decoded unambiguously. The simplest way to guarantee that is prefix-freeness: no codeword is a prefix of any other. A prefix-free code is self-delimiting. The decoder reads bits left-to-right and knows exactly when each codeword ends, with no lookahead and no length headers.

The question I keep returning to is: which collections of lengths are actually achievable? If I want four codewords of lengths 1, 2, 3, and 3, can I build a prefix-free code with those lengths? What if I want two codewords of length 1? (No: there are only two 1-bit strings, and they are prefixes of everything longer.)

Kraft’s inequality is the answer. A length vector $(l_1, l_2, \ldots, l_n)$ is achievable by a prefix-free binary code only if

$$\sum_{i=1}^{n} 2^{-l_i} \leq 1.$$

This is the constraint you cannot escape. Any prefix-free code satisfies it. Any length vector that violates it cannot be realized as a prefix-free code, full stop.

The converse is also true: any length vector satisfying Kraft is realizable by some prefix-free code. That is McMillan’s theorem, and it is the subject of the next post in this series. This post develops the necessary direction: every prefix-free code satisfies Kraft.

The right tool for understanding why is the binary tree.

The Trie View

Represent each codeword as a path in a binary tree. Start at the root. For each bit, go left (0) or right (1). The codeword ends at a node, which I mark as a terminal. A code is prefix-free if and only if no terminal node has any descendants that are also terminals. Once you reach a terminal on the way down, you stop.

The example code $\{A \to \texttt{0},\ B \to \texttt{10},\ C \to \texttt{110},\ D \to \texttt{111}\}$ has lengths $(1, 2, 3, 3)$. Its trie looks like this:

A is at depth 1, left branch. B is at depth 2, right-then-left. C and D share a parent at depth 2, then split at depth 3. No codeword’s node is an ancestor of another’s: the code is prefix-free.

Bits Follow Types

April 23, 2026

Every type decomposes structurally. So does its codec.

Codecs as Functors

You have an optional<vector<pair<int, string>>>. The type decomposes structurally: it is an optional of a free monoid of products of an integer and a string. That decomposition is not an observation about memory layout. It is a statement about the algebraic structure of the type.

Now ask: does the codec decompose the same way?

If the answer is yes, you stop writing one-off encoders. You build a codec for optional<T> from a codec for T. You build a codec for vector<T> from a codec for T. The codec for optional<vector<pair<int, string>>> assembles from its parts with no manual layout decisions, no hand-placed length headers, no ad-hoc format negotiation.

This post argues that the answer is always yes, and shows what the machinery looks like. The thesis: codecs are not ad-hoc bit formats. They are constructions on the algebraic structure of types. The algebraic structure of a type determines its codec, the same way it determines its algorithms.

This extends Stepanov’s claim. The peasant algorithm post showed that algorithms arise from algebraic structure. The homomorphism post showed that structure-preserving maps are the natural morphisms. Here, we show the codec itself is a structure-preserving map, and that it lifts from leaf types to compound types by the same algebraic logic.

Bit I/O: The Foundation

Before combinators, we need concrete bit I/O. The approach taken here follows Stepanov’s move in the algorithm posts: state the concept first, then provide a model.

Two concepts govern bit-level I/O:

template<typename T>
concept BitSink = requires(T& s, bool bit) {
    { s.write(bit) } -> std::same_as<void>;
};

template<typename T>
concept BitSource = requires(T& s) {
    { s.read() } -> std::same_as<bool>;
    { s.peek() } -> std::convertible_to<bool>;
};

A BitSink accepts bits. A BitSource supplies them. A codec is an algorithm parameterized over BitSink and BitSource, not a class hierarchy. This is Stepanov’s move at the bit level: require only what the algorithm needs, let anything that satisfies the concept participate.

The standard models are BitWriter and BitReader, which pack bits into byte buffers in LSB-first order:

class BitWriter {
    std::span<std::uint8_t> buf_;
    std::size_t byte_idx_ = 0;
    std::uint8_t byte_ = 0;
    std::uint8_t bit_pos_ = 0;
public:
    explicit BitWriter(std::span<std::uint8_t> buf) noexcept : buf_(buf) {}

    void write(bool bit) noexcept {
        byte_ |= (bit ? std::uint8_t{1} : std::uint8_t{0}) << bit_pos_;
        if (++bit_pos_ == 8) {
            buf_[byte_idx_++] = byte_;
            byte_ = 0;
            bit_pos_ = 0;
        }
    }

    void align() noexcept {
        if (bit_pos_ > 0) {
            buf_[byte_idx_++] = byte_;
            byte_ = 0;
            bit_pos_ = 0;
        }
    }

    [[nodiscard]] std::size_t bytes_written() const noexcept {
        return byte_idx_ + (bit_pos_ > 0 ? 1 : 0);
    }
};

class BitReader {
    std::span<const std::uint8_t> buf_;
    std::size_t byte_idx_ = 0;
    std::uint8_t bit_pos_ = 0;
public:
    explicit BitReader(std::span<const std::uint8_t> buf) noexcept : buf_(buf) {}

    bool read() noexcept {
        bool bit = ((buf_[byte_idx_] >> bit_pos_) & 1) != 0;
        if (++bit_pos_ == 8) {
            ++byte_idx_;
            bit_pos_ = 0;
        }
        return bit;
    }

    [[nodiscard]] bool peek() const noexcept {
        return byte_idx_ < buf_.size();
    }
};

A codec concept rounds out the three-concept core:

When Lists Become Bits

April 23, 2026

The free monoid on a type lifts to bit space. It lifts injectively only when the element codec is prefix-free.

Prefix-Free Codes and the Free Monoid

You have a list of unsigned integers. Encode the list as a single bit string.

Fixed-width encoding wastes space. If you allocate 64 bits per integer, small values like 1 or 7 cost as much as values near $2^{64}$. Variable-width encoding recovers that space, but immediately raises a harder question: where does one encoded integer end and the next begin?

Two escape routes. First, prefix each encoded item with its length. That works, but the length headers are overhead, and you now need a codec for the lengths as well. Second, choose a code where the structure of the codewords makes boundaries unambiguous without any headers. These are prefix-free codes, and this is the right answer, in a precise categorical sense.

The “precise categorical sense” is what this post develops. Encoding a list as the concatenation of encoded elements is a monoid homomorphism from the free monoid on $T$ to the monoid of bit strings under concatenation. The universal property of the free monoid guarantees this homomorphism always exists. The question of whether the decoder can invert it comes down to exactly one property of the element codec: whether it is prefix-free.

The Free Monoid, Recalled

A monoid is a set with an associative binary operation and an identity element. The free monoid on a set $S$ is the set of all finite sequences of elements from $S$, with concatenation as the operation and the empty sequence as the identity.

“Free” means no equations hold except those forced by the monoid axioms. Nothing is identified with anything else. If you need commutativity or idempotency, you quotient the free monoid by additional equations. But the free monoid itself imposes nothing beyond associativity and identity.

The universal property says: given any monoid $M$ and any function $f: S \to M$, there is exactly one monoid homomorphism $\hat{f}: \text{Free}(S) \to M$ that extends $f$. That unique extension is fold:

$$\hat{f}([x_1, x_2, \ldots, x_n]) = f(x_1) \cdot f(x_2) \cdot \cdots \cdot f(x_n)$$

where $\cdot$ is the operation in $M$. The free-algebra post develops this in full. For this post, the one fact that matters is that fold is canonical: it is the unique way to extend a per-element map to a list-consuming function that respects the monoid structure.

Free Algebras: Why Lists and Polynomials Are Universal

March 13, 2026

Lists are everywhere in programming. Not because they are convenient. Because they are algebraically universal.

Why Lists?

Arrays are more cache-friendly. Hash maps have better lookup. Yet lists (sequences, vectors, streams) remain the default container in nearly every language. The standard explanation is convention, or ease of construction. The real explanation is algebraic.

A list is the free monoid. It is the most general monoid you can build from a set of generators. And the universal property of free monoids says that fold, the operation that processes a list element by element, is not a design pattern. It is a theorem.

The Free Monoid

Start with a set $S$ of generators. The free monoid on $S$ is the set of all finite sequences of elements from $S$, with concatenation as the operation and the empty sequence as the identity.

“Free” means: no equations hold except those forced by the monoid axioms (associativity and identity). In particular:

$[a, b] \neq [b, a]$. Commutativity is not imposed.
$[a, a] \neq [a]$. Idempotency is not imposed.
$[a, b, c] = [a] \cdot [b] \cdot [c]$. Every sequence is a product of singletons.

In C++:

template<typename T>
class free_monoid {
    std::vector<T> elements_;
public:
    free_monoid() = default;
    explicit free_monoid(T x) : elements_{std::move(x)} {}
    // ...
};

// Monoid operations via ADL
template<typename T>
free_monoid<T> op(const free_monoid<T>& a, const free_monoid<T>& b);  // concatenation

template<typename T>
free_monoid<T> identity(const free_monoid<T>&);  // empty sequence

The free_monoid<int> is the type of finite sequences of integers. Its operation is concatenation. It satisfies the Monoid concept. And it is the most general monoid on int: no structure beyond associativity and identity.

The Universal Property

Here is the key fact. Given any function $f: S \to M$ where $M$ is a monoid, there exists a unique monoid homomorphism $\overline{f}: \text{Free}(S) \to M$ extending $f$. This homomorphism is defined by:

$$\overline{f}([a_1, a_2, \ldots, a_n]) = f(a_1) \cdot f(a_2) \cdot \ldots \cdot f(a_n)$$

In code:

template<Monoid M, typename T, typename F>
M extend(F f, const free_monoid<T>& xs) {
    M result = identity(M{});
    for (const auto& x : xs.elements())
        result = op(result, f(x));
    return result;
}

This is fold. The universal property says fold is the only structure-preserving way to interpret a list in a monoid. Any function that respects the monoid structure must agree with fold.

Fold Is a Theorem

When you write std::accumulate or std::reduce, you are invoking the universal property. The homomorphism condition:

Homomorphisms: The Maps Between Structures

March 13, 2026

A homomorphism is a function that preserves algebraic structure. This post shows that fold, sum, length, and even the logarithm are all the same idea.

Structures and Maps

The series so far has built up algebraic structures: monoids in the peasant post, rings in the modular post, Euclidean domains in the polynomial post, product monoids in the accumulator post. But structures alone are half the story. The other half is the maps between them.

A homomorphism is a function $f: A \to B$ between two structures of the same kind that preserves the operation:

$$f(a \oplus b) = f(a) \oplus f(b)$$

The operation on the left is in $A$. The operation on the right is in $B$. The function $f$ “commutes” with the operation. That is the entire definition.

The Concept

We need a monoid concept for this post. A monoid has an associative binary operation and an identity element. As always, the operations are ADL free functions:

template<typename M>
concept Monoid = std::semiregular<M> &&
    requires(M a, M b) {
        { op(a, b) } -> std::convertible_to<M>;
        { identity(a) } -> std::convertible_to<M>;
    };

And a runtime check that a function preserves the monoid structure:

template<Monoid A, Monoid B, typename F>
bool is_homomorphism(F f, const A& a1, const A& a2) {
    return f(op(a1, a2)) == op(f(a1), f(a2));
}

This tests one pair of inputs. It is not a proof, but it catches violations.

Examples Everywhere

Length. The string monoid (concatenation, empty string) maps to the integers under addition. The map is length. And length is a homomorphism:

$$\text{length}(s_1 + s_2) = \text{length}(s_1) + \text{length}(s_2)$$

The length of a concatenation is the sum of the lengths. This is not a coincidence. It is the homomorphism property.

Sum. Lists of integers under concatenation form a monoid. The integers under addition form a monoid. The map sum is a homomorphism:

$$\text{sum}(xs \mathbin{+\!\!+} ys) = \text{sum}(xs) + \text{sum}(ys)$$

Product. Same source monoid, different target. Now the integers under multiplication:

$$\text{prod}(xs \mathbin{+\!\!+} ys) = \text{prod}(xs) \times \text{prod}(ys)$$

Logarithm. The positive reals under multiplication form a monoid. The reals under addition form a monoid. The logarithm maps one to the other:

$$\log(a \times b) = \log(a) + \log(b)$$

This is the defining property of the logarithm, stated as algebra. The logarithm is a homomorphism from $(\mathbb{R}^+, \times, 1)$ to $(\mathbb{R}, +, 0)$.

Count. For any type $T$, the function that sends a list to its length is a homomorphism from the list monoid to $(\mathbb{Z}, +, 0)$. This is the same as the string length example, generalized.

Lattices: Fixed Points and Iteration

March 13, 2026

Lattices have two operations of a different kind than rings. The structure determines a fixed-point algorithm.

Two Operations, Different Rules

Monoids have one binary operation. Rings have two (addition and multiplication) linked by distributivity. Lattices also have two operations, but with different laws entirely.

A lattice is a set with two operations:

meet ($\wedge$): greatest lower bound
join ($\vee$): least upper bound

Both are idempotent, commutative, and associative. And they satisfy the absorption laws:

$$a \wedge (a \vee b) = a \qquad a \vee (a \wedge b) = a$$

Absorption is what distinguishes lattices from a pair of unrelated monoids. It ties meet and join together: knowing one constrains the other.

A bounded lattice adds a least element (bottom, $\bot$) and a greatest element (top, $\top$). Bottom is the identity for join, top is the identity for meet.

In C++20 concepts, with ADL free functions:

template<typename L>
concept Lattice = std::semiregular<L> &&
    requires(L a, L b) {
        { meet(a, b) } -> std::convertible_to<L>;
        { join(a, b) } -> std::convertible_to<L>;
    };

template<typename L>
concept BoundedLattice = Lattice<L> &&
    requires(L a) {
        { bottom(a) } -> std::convertible_to<L>;
        { top(a) } -> std::convertible_to<L>;
    };

Four Examples

Sign lattice. Abstract signs of integers: bottom (unreachable), negative, zero, positive, top (unknown). Meet is greatest lower bound, join is least upper bound in the Hasse diagram. This is the classic abstract interpretation domain. You can define abstract arithmetic on it: pos * neg = neg, neg + neg = neg, pos + neg = top.

Intervals. Closed intervals $[a, b]$ ordered by inclusion. Meet is intersection. Join is the smallest enclosing interval. Bottom is the empty interval. Top is the full range. This is the foundation of interval arithmetic.

Divisors. Positive integers ordered by divisibility. Meet is gcd, join is lcm. Bottom is 1 (divides everything), top is 0 (everything divides 0). Lattice structure appearing in number theory.

Power sets. Subsets of $\{0, \ldots, N-1\}$. Meet is intersection (bitwise AND), join is union (bitwise OR). Bottom is the empty set, top is the full set.

All four satisfy BoundedLattice. All four satisfy the same laws. The concept constrains the interface; the laws constrain the semantics.

The Algorithm: Tarski’s Fixed-Point Theorem

Here is the payoff. Tarski’s theorem: any monotone function on a complete lattice has a least fixed point, computable by iterating from bottom.

Semirings: One Algorithm, Six Graph Problems

March 13, 2026

The peasant post showed that power() works on any monoid. But what happens when you have two operations instead of one?

From Monoids to Semirings

A monoid is one operation with an identity element. The peasant algorithm exploits this: give it any monoid and it computes powers by repeated squaring. The accumulator post used the same structure for streaming statistics.

A semiring is two monoids on the same set, linked by a compatibility condition. Formally, a semiring $(S, +, \times, 0, 1)$ satisfies:

$(S, +, 0)$ is a commutative monoid
$(S, \times, 1)$ is a monoid
$\times$ distributes over $+$: $a \times (b + c) = a \times b + a \times c$
$0$ annihilates: $0 \times a = a \times 0 = 0$

In C++20, using the same ADL free functions as the peasant post:

template<typename S>
concept Semiring = std::semiregular<S> &&
    requires(S a, S b) {
        { a + b } -> std::convertible_to<S>;
        { a * b } -> std::convertible_to<S>;
        { zero(a) } -> std::convertible_to<S>;
        { one(a) } -> std::convertible_to<S>;
    };

The concept captures syntax. The axioms (associativity, distributivity, annihilation) are semantic requirements that the programmer must ensure.

Five Semirings

The ordinary integers are a semiring. But there are others, and each one corresponds to a different graph problem.

Semiring	$+$	$\times$	$0$	$1$	Graph problem
Boolean	or	and	false	true	Reachability
Tropical min	min	plus	$\infty$	0	Shortest paths
Tropical max	max	plus	$-\infty$	0	Longest paths
Bottleneck	max	min	$-\infty$	$\infty$	Widest paths
Counting	plus	times	0	1	Number of paths

The naming of the tropical semirings is counterintuitive but standard. In the tropical min semiring, the “addition” is min and the “multiplication” is ordinary addition. This matters because matrix multiplication uses both operations, and we need the algebraic structure to be correct: the inner product of a row and column computes the best path through an intermediate node.

Each semiring is a small struct with operator+, operator*, and ADL functions zero() and one():

struct boolean_semiring {
    bool val;
    constexpr boolean_semiring operator+(boolean_semiring rhs) const {
        return boolean_semiring(val || rhs.val);
    }
    constexpr boolean_semiring operator*(boolean_semiring rhs) const {
        return boolean_semiring(val && rhs.val);
    }
};
constexpr boolean_semiring zero(boolean_semiring) { return boolean_semiring(false); }
constexpr boolean_semiring one(boolean_semiring)  { return boolean_semiring(true); }

Matrices Over a Semiring

Matrix multiplication requires addition and multiplication. That is exactly what a semiring provides. If $S$ is a semiring, then $n \times n$ matrices over $S$ form a semiring too, with entry-wise addition and the usual row-times-column product (using $S$’s operations).

Streaming Statistics, One Monoid at a Time

March 13, 2026

Accumulators are monoids. The same algebraic structure from the peasant post, in a different domain.

Accumulators as Monoids

An accumulator processes a stream of values, maintaining state that can be queried at any point. Write a class with operator+= for each statistic you need. Sum, mean, variance, min, max. Five statistics, five classes.

The problem is combinations. Sum and min? Write a sixth class. Sum, min, and max? A seventh. Every new combination requires new code.

But every accumulator has the same structure:

Process a value incrementally: operator+=(value)
Combine with another accumulator of the same type: operator+=(accumulator)
Extract a result: .eval()

Default construction gives you an empty accumulator: the identity element. Combination via += is associative. Together, a monoid. The peasant post used the same structure for exponentiation. Here we use it for streaming computation.

In C++20 concepts:

template<typename A>
concept Accumulator = std::semiregular<A> &&
    requires(A a, A b, typename A::value_type v) {
        typename A::value_type;
        { a += v } -> std::same_as<A&>;   // process one value
        { a += b } -> std::same_as<A&>;   // combine two accumulators
        { a.eval() };                       // extract result
    };

KBN: Compensated Summation

The simplest accumulator is a sum. But naive floating-point summation accumulates O(n) rounding error:

double sum = 0.0;
sum += 1.0;
for (int i = 0; i < 1'000'000; ++i)
    sum += 1e-10;
// Expected: 1.0001    Actual: ~1.00009999999...8

When you add a tiny number to a large one, the tiny number’s low-order bits get dropped. After a million additions, these losses add up.

Kahan-Babuska-Neumaier (KBN) summation tracks what gets lost:

template<std::floating_point T>
class kbn_sum {
    T sum_ = T(0);
    T comp_ = T(0);   // compensation for lost bits

public:
    using value_type = T;

    constexpr kbn_sum& operator+=(const T& v) {
        T t = sum_ + v;
        comp_ += abs_(sum_) >= abs_(v) ? (sum_ - t) + v
                                       : (v - t) + sum_;
        sum_ = t;
        return *this;
    }

    constexpr T eval() const { return sum_ + comp_; }
};

The correction term comp_ recovers the bits that floating-point addition drops. O(1) error instead of O(n), regardless of sequence length.

kbn_sum is a monoid:

Identity: kbn_sum{} (sum=0, compensation=0)
Operation: a += b (combine two compensated sums)

Welford: Online Mean and Variance

Computing the mean is just sum/count. Variance is harder. The textbook formula $\sigma^2 = \frac{1}{n}\sum(x_i - \bar{x})^2$ requires two passes: one for the mean, one for the deviations.

Welford’s algorithm computes both in a single pass:

welford& operator+=(const T& v) {
    ++n_;
    T delta = v - mean_;
    mean_ += delta / static_cast<T>(n_);
    T delta2 = v - mean_;   // uses *updated* mean
    m2_ += delta * delta2;
    return *this;
}

delta uses the old mean, delta2 uses the new mean. Their product accumulates into m2_, the sum of squared deviations. At any point, variance = m2_ / n.

What makes this a monoid is the combination formula. Given two independent Welford accumulators with means $\bar{x}_A, \bar{x}_B$ and counts $n_A, n_B$, Chan et al. showed how to merge them:

Duality: The Hidden Structure of Opposites

January 19, 2026

Many structures come in pairs. Recognizing duality lets you transfer insights between domains.

The Motivating Example

This collection includes two approaches to automatic differentiation:

Forward mode (in dual): Propagate derivatives alongside values, from inputs toward outputs
Reverse mode (in autodiff): Build a graph during forward evaluation, then propagate gradients backward from outputs toward inputs

These aren’t just two implementations of the same idea. They’re duals, mirror images with complementary strengths.

Forward mode computes one column of the Jacobian per pass. If $f: \mathbb{R}^n \to \mathbb{R}^m$, computing the full Jacobian takes $n$ passes. Reverse mode computes one row per pass, $m$ passes for the full Jacobian.

For neural network training, we have many inputs (millions of parameters) and one output (the loss). Reverse mode wins overwhelmingly: one backward pass gives all gradients. This is why backpropagation dominates deep learning.

For sensitivity analysis with few parameters and many outputs, forward mode wins. Same algorithm structure, opposite traversal direction, complementary use cases.

The mathematical explanation: forward mode computes Jacobian-vector products ($Jv$); reverse mode computes vector-Jacobian products ($v^T J$). These are transposes of each other. Duality is transposition.

Push vs Pull

Consider two ways to traverse a sequence:

Pull (iterator/consumer controls):

for (auto it = seq.begin(); it != seq.end(); ++it) {
    process(*it);  // Consumer pulls each element
}

Push (producer controls):

seq.for_each([](auto x) {
    process(x);  // Producer pushes each element
});

Same traversal. Same elements processed. But control flow is reversed:

Aspect	Pull (Iterator)	Push (Generator)
Who controls pace?	Consumer	Producer
Suspend/resume?	Consumer decides when to call `++`	Producer decides when to yield
Backpressure	Natural (just stop pulling)	Must be designed in
Composition	Chain iterators	Chain callbacks

C++ ranges are pull-based: view | filter | transform creates an iterator that pulls through the pipeline. Reactive streams (Rx) are push-based: events flow through a pipeline of observers.

These are duals. Given a pull-based algorithm, you can mechanically derive its push-based counterpart by reversing who initiates each step. The transformation preserves correctness because it’s just changing direction, not content.

Encode vs Decode

Compression algorithms come in pairs:

// Encoder: structure -> bits
auto encode(const Document& doc) -> Bitstream;

// Decoder: bits -> structure
auto decode(const Bitstream& bits) -> Document;

These must be inverses: decode(encode(x)) == x. But their implementations are often strikingly different:

Seeing Structure First

January 18, 2026

A reflection on eleven explorations in generic programming

The Question Behind the Code

What do these computations have in common?

Computing the millionth Fibonacci number
Finding the shortest path between cities in a weighted graph
Calculating compound interest over thirty years
Composing ten 3D rotations into one
Repeating a string n times

The answer: they’re all computed by the same twenty lines of code.

template<typename T>
constexpr T power(T const& base, T exp) {
    if (exp == zero(exp)) return one(exp);
    if (exp == one(exp))  return base;

    return even(exp)
        ? square(power(base, half(exp)))
        : product(base, power(base, decrement(exp)));
}

This shouldn’t work. Fibonacci numbers involve integer sequences. Shortest paths involve graphs. Rotations involve 3D geometry. Different domains, different mathematics.

Yet they share structure. Once you see it, a single algorithm serves them all.

This collection of eleven blog posts is an extended meditation on one idea: algorithms arise from algebraic structure. The posts cover different domains (number theory, calculus, linear algebra, polymorphism) but they circle the same insight. Recognize the structure; the algorithm follows.

The Principle

Alex Stepanov articulated this most clearly in Elements of Programming: “Generic programming is about abstracting and classifying algorithms and data structures.” But the deeper point is how to abstract. Not by common syntax or superficial similarity, but by the algebraic laws a type obeys.

Why does structure appear everywhere? Because reality has structure. The algebraic structures we discover in programming (groups, rings, monoids) are the same structures physicists discover in nature. Rotations form a group. Spacetime transformations form a group. This isn’t coincidence. We’re uncovering patterns that exist.

Noether’s theorem makes this precise: every continuous symmetry corresponds to a conservation law. Time-translation symmetry gives conservation of energy. Space-translation symmetry gives conservation of momentum. Rotational symmetry gives conservation of angular momentum. The symmetry groups of physics are algebraic structures.

When we recognize “this is a monoid” in our code, we’re tapping into the same mathematical substrate that governs physical law. The algorithms follow because the structure constrains what’s possible, both in computation and in nature.

Consider the power() function above. What does it require?

An associative binary operation (so we can regroup: $(a \cdot b) \cdot c = a \cdot (b \cdot c)$)
An identity element (so $1 \cdot x = x \cdot 1 = x$)
Halving and parity testing on the exponent

That’s it. Any type providing these operations, with these laws, can use this algorithm. The requirements are algebraic, not syntactic.

Choosing the Algebra

November 30, 2025

The rest of this series asks: given a structure, what algorithms does it support? This post inverts the question.

The Flip Side

The peasant post showed that power() works on any monoid. The semirings post showed that matrix multiplication over different semirings solves six graph problems. The thread running through the whole series is: algorithms arise from algebraic structure.

But there’s a flip side that we haven’t addressed directly.

Sometimes you’re stuck with an expensive algorithm not because the problem is hard, but because you’re working in the wrong algebra. Change the algebra, and the algorithm becomes trivial. The cost shows up somewhere else, always. But if the cheap operation is the one you actually need, you win.

This is a very old idea. Napier invented logarithms in 1614 to turn multiplication into addition. What’s worth noticing is that logarithms, odds ratios, tropical semirings, and quaternions are all doing the same thing.

The Pattern

A computational basis transform takes values from one domain and represents them in another, where different operations are cheap:

Domain	Cheap	Expensive
Log space	Multiplication (becomes addition)	Addition
Odds ratios	Bayesian updates (become multiplication)	Probability sums
Tropical $(\min, +)$	Shortest paths (become matrix mult)	Subtraction
Quaternions	Rotation composition	Euler angle extraction
Modular integers	Exponentiation	Ordering
Rationals	Exact arithmetic	Irrational representation

Each row follows the same structure. A transform $\varphi: D \to D’$ makes some operations cheaper and others more expensive. There is no free lunch.

This is not a deep theorem. It’s almost tautological: if you could make everything cheaper by relabeling, the labels would already be the standard ones. But making the pattern explicit helps you recognize when you’re paying for an operation you don’t need.

Three Examples

Log Space

The most familiar example. You have a million small probabilities and need their product.

// Standard: underflows to 0 after ~30 terms
double product = 1.0;
for (double p : probs) product *= p;  // 0.0

// Log domain: addition instead of multiplication
mutatio::lgd product(1.0);
for (double p : probs) product = product * mutatio::lgd(p);
// product.log() is finite. product.value() would overflow,
// but you stay in log space.

The algebra changed from $(\mathbb{R}^+, \times)$ to $(\mathbb{R}, +)$. Multiplication became addition. The isomorphism is $\log$.

Tropical Semirings

The semirings post showed this already. Replace $(+, \times)$ with $(\min, +)$ and matrix multiplication becomes shortest-path computation. That’s the same move: you changed the semiring to make the algorithm you wanted (all-pairs shortest paths) fall out of a generic operation (matrix power) that you already had.

Differentiation: Three Ways

January 15, 2025

A synthesis of three earlier posts, comparing forward-mode AD, reverse-mode AD, and numerical differentiation.

Computing derivatives shows up everywhere: optimization, machine learning, physics simulation, numerical analysis. This series has explored three distinct approaches:

Forward-mode AD via dual numbers
Reverse-mode AD via computational graphs
Numerical differentiation via finite differences

Each has different strengths. The right choice depends on the shape of your problem.

The Landscape

Method	Accuracy	Cost for $f: \mathbb{R}^n \to \mathbb{R}$	Cost for $f: \mathbb{R} \to \mathbb{R}^m$	Memory
Forward AD	Exact	$O(n)$ passes	$O(1)$ pass	$O(1)$
Reverse AD	Exact	$O(1)$ pass	$O(m)$ passes	$O(\text{ops})$
Finite Diff	$O(h^p)$	$O(n)$ evaluations	$O(n)$ evaluations	$O(1)$

The key point: problem structure determines the best method.

Forward-Mode AD: Dual Numbers

Forward-mode AD extends numbers with an infinitesimal $\varepsilon$ where $\varepsilon^2 = 0$. The derivative falls out of the arithmetic for free:

// f(x) = x^3 - 3x + 1
// f'(x) = 3x^2 - 3

auto x = dual<double>::variable(2.0);  // x = 2, dx = 1
auto f = x*x*x - 3.0*x + 1.0;

std::cout << f.value() << "\n";       // 3.0
std::cout << f.derivative() << "\n";  // 9.0

Strengths:

Simple implementation (operator overloading)
No memory overhead
Naturally composable for higher derivatives
Works with any function of overloaded operators

When to use:

Single input variable (or few inputs)
Computing Jacobian-vector products
Higher-order derivatives via nesting
Sensitivity analysis along one direction

Complexity: One forward pass per input variable. For f: R^n -> R^m, computing the full Jacobian requires n passes.

Reverse-Mode AD: Computational Graphs

Reverse-mode AD builds a computational graph during the forward pass, then propagates gradients backward via the chain rule:

auto f = [](const auto& x) {
    return sum(pow(x, 2.0));  // f(x) = sum(x^2)
};

auto df = grad(f);  // Returns gradient function
auto gradient = df(x);  // One backward pass for all partials

Strengths:

O(1) backward passes regardless of input dimension
Powers modern deep learning (backpropagation)
Efficient for loss functions: f: R^n -> R

When to use:

Many inputs, scalar output (neural networks)
Computing vector-Jacobian products
Optimization where you need the full gradient

Complexity: One forward pass to build the graph, one backward pass to compute all gradients. Memory scales with the number of operations because you have to store intermediate values.

Numerical Differentiation: Finite Differences

Approximate the derivative using the limit definition:

// Central difference: f'(x) ~ (f(x+h) - f(x-h)) / 2h
double df = central_difference(f, x);

Strengths:

Numerical Integration with Generic Concepts

August 28, 2023

Numerical integration (quadrature) for C++20.

Overview

The definite integral is the signed area under a curve:

$$\int_a^b f(x)\,dx$$

Most functions do not have closed-form antiderivatives, so we approximate integrals numerically using quadrature rules: weighted sums of function evaluations.

$$\int_a^b f(x)\,dx \approx \sum_i w_i f(x_i)$$

Different rules choose different nodes x_i and weights w_i. The tradeoff is always accuracy vs. computational cost.

Quick Start

#include <integration/integrate.hpp>
#include <cmath>
#include <iostream>

int main() {
    using namespace integration;

    // integral from 0 to pi of sin(x) dx = 2
    double result = integrate([](double x) { return std::sin(x); }, 0.0, 3.14159265);

    std::cout << "integral of sin(x) dx = " << result << "\n";  // ~2.0
}

The Quadrature Zoo

Basic Rules

Rule	Formula	Error	Exact for
Midpoint	$(b-a)f(m)$	$O(h^3)$	Linear
Trapezoidal	$\frac{b-a}{2}(f(a)+f(b))$	$O(h^3)$	Linear
Simpson’s	$\frac{b-a}{6}(f(a)+4f(m)+f(b))$	$O(h^5)$	Cubic

double m = midpoint_rule(f, a, b);
double t = trapezoidal_rule(f, a, b);
double s = simpsons_rule(f, a, b);

Composite Rules

Divide [a,b] into n subintervals and apply the basic rule to each:

// Error: O(h^2) where h = (b-a)/n
double m = composite_midpoint(f, a, b, 100);
double t = composite_trapezoidal(f, a, b, 100);

// Error: O(h^4) - much more accurate!
double s = composite_simpsons(f, a, b, 100);  // n must be even

Gauss-Legendre Quadrature

The optimal choice: n points exactly integrate polynomials of degree 2n-1.

// 5-point Gauss-Legendre: exact for degree <= 9
double g = gauss_legendre<5>(f, a, b);

Adaptive Integration

Automatically refines where the function is difficult:

// Recommended for general use
double result = integrate(f, a, b);              // Default tolerance 1e-10
double result = integrate(f, a, b, 1e-12);       // Custom tolerance

// With error estimate
auto [value, error, evals] = integrate_with_error(f, a, b);

Deriving Simpson’s Rule

Taylor expand $f(x)$ around the midpoint $m = (a+b)/2$. Odd powers vanish by symmetry when integrating from $a$ to $b$:

$$\int_a^b f(x)\,dx \approx (b-a)f(m) + f''(m)\frac{(b-a)^3}{24} + O(h^5)$$

Simpson’s rule is the unique combination of endpoint and midpoint values that cancels the $h^2$ error:

$$\int_a^b f(x)\,dx = \frac{b-a}{6}\left[f(a) + 4f(m) + f(b)\right] + O(h^5)$$

It also cancels the h^3 term. Simpson gets a “bonus degree” of accuracy for free. This is one of those happy accidents in numerical analysis.

Why Gauss-Legendre is Optimal

With $n$ evaluation points, we have $2n$ free parameters ($n$ nodes + $n$ weights). We can match $2n$ conditions: exact integration of $1, x, x^2, \ldots, x^{2n-1}$.

The nodes turn out to be roots of the $n$-th Legendre polynomial $P_n(x)$. Orthogonal polynomials arise naturally from the optimization. This is not a coincidence. It is the same reason Legendre polynomials show up in approximation theory: they are the optimal basis for polynomial approximation on [-1,1] with the right inner product.

Forward-Mode Automatic Differentiation

September 20, 2021

Forward-mode automatic differentiation via dual numbers for C++20.

Overview

Dual numbers are a simple yet powerful technique for computing exact derivatives. The key insight: if we extend our number system with an element epsilon where epsilon^2 = 0, then evaluating f(x + epsilon) yields f(x) + epsilon * f'(x). The derivative emerges automatically from the algebra.

Quick Start

#include <dual/dual.hpp>
#include <iostream>

int main() {
    using namespace dual;

    // Create a dual variable at x = 2
    auto x = dual<double>::variable(2.0);

    // Compute f(x) = x^3 - 3x + 1
    auto f = x*x*x - 3.0*x + 1.0;

    std::cout << "f(2) = " << f.value() << "\n";       // 3.0
    std::cout << "f'(2) = " << f.derivative() << "\n"; // 9.0
}

The Mathematics

A dual number has the form $a + b\varepsilon$ where $\varepsilon^2 = 0$. Arithmetic follows naturally:

$$(a + b\varepsilon) + (c + d\varepsilon) = (a+c) + (b+d)\varepsilon$$$$(a + b\varepsilon)(c + d\varepsilon) = ac + (ad + bc)\varepsilon + bd\varepsilon^2 = ac + (ad + bc)\varepsilon$$

Notice how the $bd\varepsilon^2$ term vanishes because $\varepsilon^2 = 0$.

For a function $f$, Taylor expansion gives:

$$f(a + b\varepsilon) = f(a) + bf'(a)\varepsilon + \frac{b^2}{2}f''(a)\varepsilon^2 + \cdots = f(a) + bf'(a)\varepsilon$$

If we set $b = 1$ (marking $x$ as “the variable we’re differentiating with respect to”), then:

$$f(x + \varepsilon) = f(x) + f'(x)\varepsilon$$

The derivative appears as the coefficient of epsilon!

API Reference

dual

The core dual number type.

// Create a variable for differentiation
auto x = dual<double>::variable(3.0);  // x = 3, dx = 1

// Create a constant
auto c = dual<double>::constant(2.0);  // c = 2, dc = 0

// Access values
double val = x.value();       // 3.0
double deriv = x.derivative(); // 1.0

// Arithmetic operators: +, -, *, /
auto y = sin(x*x) + exp(-x);

// Convenience function
auto [value, deriv] = differentiate([](auto x) { return x*x; }, 3.0);

Mathematical Functions

All standard math functions are supported with correct derivative propagation:

Basic: sqrt, cbrt, abs
Exponential: exp, exp2, expm1, log, log2, log10, log1p
Trigonometric: sin, cos, tan, asin, acos, atan, atan2
Hyperbolic: sinh, cosh, tanh, asinh, acosh, atanh
Power: pow, hypot
Special: erf, erfc

Higher-Order Derivatives

Second derivatives with dual2:

auto result = differentiate2([](auto x) { return sin(x); }, 1.0);
// result.value  = sin(1)
// result.first  = cos(1)
// result.second = -sin(1)

Arbitrary order with jets:

// Compute f, f', f'', f''', f'''' at x = 1
auto derivs = derivatives<4>([](auto x) { return exp(x); }, 1.0);
// All derivatives of e^x at x=1 equal e

Forward vs Reverse Mode

This library implements forward mode AD:

Polynomials as Euclidean Domains

July 14, 2020

The same GCD algorithm works for integers and polynomials. That’s not a coincidence. It’s because both are Euclidean domains.

The Observation

// For integers: gcd(48, 18) = 6
// For polynomials: gcd(x^3 - 1, x^2 - 1) = x - 1

// Same algorithm, different types
template<euclidean_domain E>
E gcd(E a, E b) {
    while (b != E(0)) {
        a = std::exchange(b, a % b);
    }
    return a;
}

That template compiles and works correctly for both integers and polynomials. The reason it works is algebraic: both types support division with remainder, and the remainder is always “smaller” than the divisor in a well-defined sense.

Quick Start

#include <polynomials/polynomial.hpp>
#include <iostream>

using namespace poly;

int main() {
    // Create polynomial x^2 - 1 = (x-1)(x+1)
    auto p = polynomial<double>{-1, 0, 1};

    // Create polynomial x^3 - 1 = (x-1)(x^2+x+1)
    auto q = polynomial<double>{-1, 0, 0, 1};

    // GCD should be (x - 1)
    auto g = gcd(p, q);

    std::cout << "gcd(x^2-1, x^3-1) has degree " << g.degree() << "\n";  // 1

    // Find roots of x^2 - 1
    auto roots = find_roots(p, -10.0, 10.0);
    for (double r : roots) {
        std::cout << "Root: " << r << "\n";  // -1 and 1
    }
}

API Reference

Creating Polynomials

// From dense coefficients (a[i] = coefficient of x^i)
polynomial<double> p{1, -2, 1};  // 1 - 2x + x^2

// Monomial: coefficient * x^degree
auto m = polynomial<double>::monomial(3.0, 4);  // 3x^4

// The variable x
auto x = polynomial<double>::x();  // x

// Constant
polynomial<double> c{5.0};  // 5

Arithmetic

auto sum = p + q;
auto diff = p - q;
auto prod = p * q;
auto [quot, rem] = divmod(p, q);  // Division with remainder
auto quot_only = p / q;
auto rem_only = p % q;

auto g = gcd(p, q);                    // Greatest common divisor
auto [g, s, t] = extended_gcd(p, q);   // Bezout: g = p*s + q*t
auto l = lcm(p, q);                    // Least common multiple
bool d = divides(p, q);                // Does p divide q?

Evaluation and Calculus

double val = evaluate(p, x);           // p(x)
auto dp = derivative(p);               // p'(x)
auto integral = antiderivative(p);     // integral of p
auto roots = find_roots(p, -10, 10);   // All real roots in interval
auto crit = stationary_points(p, -10, 10);  // Where p'(x) = 0

The Euclidean Domain Structure

What makes this work is shared algebraic structure. A Euclidean domain has a norm function and a division algorithm where the remainder is always smaller than the divisor:

Property	Integers	Polynomials
Norm	abs(n)	degree(p)
Division	a = b*q + r, abs(r) < abs(b)	a = b*q + r, deg(r) < deg(b)
GCD	gcd(48, 18) = 6	gcd(x^2-1, x-1) = x-1

The GCD algorithm doesn’t care which type it’s operating on. It only needs the division-with-remainder property. Stepanov’s whole point is exactly this: algorithms arise from algebraic structure. When you recognize that polynomials and integers share the same abstract structure, you immediately get:

Exact Rational Arithmetic

February 18, 2020

Floating-point lies to you.

double x = 0.1 + 0.2;
std::cout << (x == 0.3);  // Prints 0 (false!)

The number 0.1 has no exact binary representation, for the same reason 1/3 has no exact decimal representation. Floating-point represents numbers as m x 2^e, and most decimal fractions don’t land on a power of two.

Rational arithmetic fixes this. 1/3 stays exactly 1/3.

The Representation

A rational number is a pair (numerator, denominator) kept in lowest terms:

template<std::integral T>
class rat {
    T num_;  // numerator (carries sign)
    T den_;  // denominator (always positive)

    void reduce() {
        T g = std::gcd(abs(num_), den_);
        num_ /= g;
        den_ /= g;
    }
};

Three invariants, always maintained:

The denominator is positive (sign lives in the numerator)
GCD(|num|, den) = 1 (always reduced)
Zero is uniquely 0/1

Arithmetic

Addition needs a common denominator:

$$\frac{a}{b} + \frac{c}{d} = \frac{ad + bc}{bd}$$

Then reduce. In code:

rat operator+(rat const& rhs) const {
    return rat(num_ * rhs.den_ + rhs.num_ * den_,
               den_ * rhs.den_);
}

The constructor calls reduce() automatically.

Multiplication is simpler:

$$\frac{a}{b} \times \frac{c}{d} = \frac{ac}{bd}$$

Division multiplies by the reciprocal:

$$\frac{a/b}{c/d} = \frac{ad}{bc}$$

Exact Comparison

No floating-point fuzziness. Two reduced rationals are equal iff their numerators and denominators match:

bool operator==(rat const& rhs) const {
    return num_ == rhs.num_ && den_ == rhs.den_;
}

For ordering, cross-multiply (valid because denominators are positive):

$$\frac{a}{b} < \frac{c}{d} \iff ad < cb$$

The Mediant

The mediant of a/b and c/d is (a+c)/(b+d). It’s not the average. It has different, more interesting properties:

rat mediant(rat const& a, rat const& b) {
    return rat(a.numerator() + b.numerator(),
               a.denominator() + b.denominator());
}

If a/b < c/d, then a/b < mediant < c/d. The mediant is always in lowest terms when a/b and c/d are neighbors in the Stern-Brocot tree. And mediants generate all positive rationals exactly once.

The Stern-Brocot Tree

Start with 0/1 and 1/0 (representing infinity). Repeatedly take mediants:

Level 0:     0/1                     1/0
Level 1:     0/1       1/1           1/0
Level 2:     0/1   1/2   1/1   2/1   1/0
Level 3:  0/1 1/3 1/2 2/3 1/1 3/2 2/1 3/1 1/0

Every positive rational appears exactly once. The path from root to any node encodes its continued fraction. This connects to best rational approximations and Farey sequences.

GCD Ties Everything Together

Reducing fractions requires GCD. The algorithm is Euclid’s, from around 300 BCE:

T gcd(T a, T b) {
    while (b != 0) {
        a = a % b;
        std::swap(a, b);
    }
    return a;
}

The same algorithm works for any Euclidean domain. That’s not a coincidence. It’s a consequence of the algebraic structure.

Rational numbers form a field: every non-zero element has a multiplicative inverse (the reciprocal). The requirement that denominators be non-zero and fractions reduced comes from this algebraic structure, not from arbitrary convention.

How Iterators Give You N+M Instead of NxM

November 15, 2019

The problem is combinatorial. You have N algorithms (sort, search, find, copy) and M containers (array, list, tree, hash table). The naive approach: implement each algorithm for each container. That is NxM implementations.

The insight is to interpose an abstraction layer.

The Iterator Abstraction

Instead of algorithms knowing about containers directly, we define iterator categories, capabilities that algorithms require and containers provide:

Input: Single-pass read. You can advance (++) and dereference (*), but once you move forward, you cannot go back. Stream-like.

Forward: Multi-pass. You can iterate multiple times; begin() always gives the same starting point.

Bidirectional: Can go backward (--). Enables algorithms like reverse iteration.

Random-access: Can jump anywhere (+n, []). Enables binary search, sorting.

This is a hierarchy of requirements. Each level adds capabilities and enables more algorithms. An algorithm declares the weakest category it needs, and any container providing at least that category works.

A True Input Iterator

The input iterator category exists for a reason. Here is a working example that reads entropy from /dev/urandom:

#include <fstream>
#include <iterator>
#include <cstdint>

struct entropy_iterator {
    using iterator_category = std::input_iterator_tag;
    using value_type        = uint8_t;
    using difference_type   = std::ptrdiff_t;
    using pointer           = const uint8_t*;
    using reference         = uint8_t;  // returns by value, not reference

    std::ifstream* source = nullptr;
    uint8_t byte = 0;

    entropy_iterator() = default;  // sentinel (end iterator)

    explicit entropy_iterator(std::ifstream& s) : source(&s) {
        ++(*this);  // prime the first byte
    }

    uint8_t operator*() const { return byte; }

    entropy_iterator& operator++() {
        if (source && source->good()) {
            source->read(reinterpret_cast<char*>(&byte), 1);
            if (!source->good()) source = nullptr;
        }
        return *this;
    }

    entropy_iterator operator++(int) {
        auto tmp = *this;
        ++(*this);
        return tmp;
    }

    bool operator==(const entropy_iterator& other) const {
        return source == other.source;
    }
};

Use it like any input iterator:

int main() {
    std::ifstream urandom("/dev/urandom", std::ios::binary);
    entropy_iterator it(urandom);

    // generate 16 random bytes
    std::vector<uint8_t> key(16);
    std::copy_n(it, 16, key.begin());

    // or use in algorithms
    int sum = 0;
    for (int i = 0; i < 1000; ++i, ++it) {
        sum += *it;
    }
    // sum ≈ 127500 (mean of uniform [0,255] × 1000)
}

Each ++ consumes a fresh entropy byte from the kernel. You literally cannot iterate twice over the same sequence. This is why the input iterator category exists: some sources are inherently single-pass. Claiming forward iterator capabilities would be a lie.

The same pattern applies to network streams, sensor readings, and any source where data is consumed by reading it.

The Payoff

Now binary_search does not need to know about vectors, deques, or sorted arrays. It only needs random-access iterators. The algorithm expresses its requirements; the container provides capabilities. They compose through the iterator abstraction.

Is It Prime?

September 10, 2019

The Miller-Rabin primality test and the mathematics of certainty

The Problem

Given a large number n, is it prime? Trial division up to sqrt(n) is too slow for cryptographic-sized numbers. We need something faster, and we are willing to accept “probably prime” with quantifiable certainty.

Fermat’s Little Theorem

For prime $p$ and any $a$ not divisible by $p$:

$$a^{p-1} \equiv 1 \pmod{p}$$

This suggests a test: pick random a, compute a^(n-1) mod n. If the result is not 1, n is definitely composite. But if it is 1, n might be prime, or might be a Carmichael number that fools this test.

The Miller-Rabin Improvement

Miller and Rabin observed something stronger. For odd prime $p$, write $p-1 = 2^r \cdot d$ (factor out all 2s). Then the sequence:

$$a^d, a^{2d}, a^{4d}, \ldots, a^{2^r \cdot d} = a^{p-1}$$

must either:

Start with 1, or
Contain -1 (i.e., p-1) somewhere before reaching 1

Why? Because the only square roots of 1 mod p are plus or minus 1. If we ever see 1 without first seeing -1, we have found a non-trivial square root of 1, proving n is composite.

The Witness Test

bool witness_test(int64_t n, int64_t a) {
    // Write n-1 = 2^r x d
    int64_t d = n - 1;
    int r = 0;
    while ((d & 1) == 0) { d >>= 1; r++; }

    // Compute x = a^d mod n
    int64_t x = mod_pow(a, d, n);

    if (x == 1 || x == n - 1) return true;  // Probably prime

    // Square r-1 times, looking for n-1
    for (int i = 1; i < r; i++) {
        x = (x * x) % n;
        if (x == n - 1) return true;
    }

    return false;  // Definitely composite
}

If witness_test(n, a) returns false, n is definitely composite. The value a is a “witness” to compositeness.

Error Bounds

Here is the part I find most satisfying. For any composite n, at least 3/4 of all possible witnesses a in [2, n-2] will detect it. Each random witness has at most 1/4 chance of failing to detect a composite.

With $k$ independent witnesses:

$$P(\text{false positive}) \leq \left(\frac{1}{4}\right)^k$$

Witnesses	Error bound
10	$< 10^{-6}$
20	$< 10^{-12}$
40	$< 10^{-24}$

The error drops exponentially. 40 witnesses gives you a false positive probability smaller than the chance of a cosmic ray flipping a bit in your RAM during the computation.

Parameterizing by Error

Rather than asking “how many iterations?”, ask “what error rate is acceptable?”:

Modular Arithmetic as Rings

June 22, 2019

Finite algebraic structures and what they teach us about algorithms

The Stepanov Perspective

Stepanov’s central insight: algorithms arise from algebraic structure. The same algorithm that works on integers works on matrices, polynomials, and modular integers, not by accident, but because they share algebraic properties.

Integers modulo N form a ring: a set with addition and multiplication satisfying familiar laws. When N is prime, it is a field, meaning every non-zero element has a multiplicative inverse. Understanding these structures tells you which algorithms apply where.

The Ring Z/NZ

Integers modulo N, written Z/NZ, are equivalence classes:

0 = {…, -N, 0, N, 2N, …}
1 = {…, -N+1, 1, N+1, 2N+1, …}
…

Operations are inherited from integers:

[a] + [b] = [a + b]
[a] x [b] = [a x b]

The implementation keeps one representative per class, in [0, N):

template<int64_t N>
struct mod_int {
    int64_t v;  // Always in [0, N)

    static constexpr int64_t normalize(int64_t x) {
        x %= N;
        return x < 0 ? x + N : x;
    }

    constexpr mod_int(int64_t x) : v(normalize(x)) {}
};

Ring Axioms

A ring (R, +, x) satisfies:

Addition forms an abelian group:

Associative: (a + b) + c = a + (b + c)
Identity: a + 0 = a
Inverses: a + (-a) = 0
Commutative: a + b = b + a

Multiplication is a monoid:

Associative: (a x b) x c = a x (b x c)
Identity: a x 1 = 1 x a = a

Distributive:

a x (b + c) = a x b + a x c
(b + c) x a = b x a + c x a

These axioms enable algorithms. Power-by-squaring works because multiplication is associative. The extended GCD works because of the ring structure.

Fermat’s Little Theorem

When N is prime, something special happens: every non-zero element has a multiplicative inverse. The set of non-zero elements forms a multiplicative group of order N-1.

Fermat’s Little Theorem: for prime $p$ and $a$ not congruent to $0 \pmod{p}$:

$$a^{p-1} \equiv 1 \pmod{p}$$

This gives us the inverse:

$$a \cdot a^{p-2} = a^{p-1} \equiv 1 \pmod{p}$$

So $a^{p-2}$ is the multiplicative inverse of $a$:

constexpr mod_int inverse() const {
    return pow(N - 2);  // Using repeated squaring
}

The Connection to Peasant

The power function uses the same peasant algorithm from the previous post:

One Algorithm, Infinite Powers

March 15, 2019

How the Russian peasant algorithm reveals the universal structure of exponentiation

The Algorithm

Russian peasants had a clever method for multiplication that does not require memorizing times tables. To compute 23 x 17:

23    17
11    34     (halve, double)
 5    68
 2   136
 1   272

Add the right column wherever the left is odd: 17 + 34 + 68 + 272 = 391. That is 23 x 17.

Why does this work? Because we are really computing:

23 x 17 = (16 + 4 + 2 + 1) x 17 = 16x17 + 4x17 + 2x17 + 17

The algorithm only needs three operations on the multiplier:

half(n), integer division by 2
even(n), test if divisible by 2
Addition on the result

From Multiplication to Exponentiation

Here is the insight that makes this interesting: the same algorithm computes powers.

Replace “add to accumulator” with “multiply into accumulator” and “double the multiplicand” with “square the base”:

T power(T base, int exp) {
    T result = 1;
    while (exp > 0) {
        if (!even(exp)) result = result * base;
        base = base * base;
        exp = half(exp);
    }
    return result;
}

This is O(log n) multiplications instead of O(n). Computing 2^1000 takes about 10 multiplications, not 1000.

The Monoid Connection

The peasant algorithm works whenever you have:

An associative binary operation *
An identity element 1 where 1 * x = x * 1 = x

This structure is called a monoid. The algorithm computes x * x * ... * x (n times) using O(log n) operations.

What makes this powerful is that many things form monoids:

Type	Operation	Identity	Computing x^n gives you…
Integers	x	1	Powers
Matrices	x	I	Matrix powers
Strings	concat	""	String repetition
Functions	compose	id	Function iteration
Permutations	compose	id	Permutation powers
Quaternions	x	1	Rotation composition

Why Associativity Unlocks Efficiency

Why does the peasant algorithm achieve O(log n) instead of O(n)? The answer lies in a single algebraic law: associativity.

Associativity says $(a \cdot b) \cdot c = a \cdot (b \cdot c)$. This looks innocuous, but it means we can restructure computation without changing results. Consider computing $a^8$:

Naive:     a x a x a x a x a x a x a x a     (7 multiplications)
Peasant:   ((a^2)^2)^2                        (3 multiplications)

Both produce the same answer because we can freely regroup. The peasant algorithm exploits this freedom systematically: instead of accumulating one factor at a time, it squares intermediate results and combines them.

The MCP Pattern: SQLite as the AI-Queryable Cache

March 20, 2026

I keep building the same thing.

Not the same product — the products are different. One indexes a Hugo blog. One indexes AI conversations. One consolidates medical records from three hospitals. One catalogs a hundred git repositories. But underneath, they all have the same skeleton. After the fifth time, I think the skeleton deserves a name.

The pattern

Domain files (ground truth)
    ↓ index
SQLite database (read-only cache, FTS5)
    ↓ expose
MCP server (tools + resources → AI assistant)

That’s it. Three layers. The domain files are always canonical — the database is a disposable cache you can rebuild from them at any time. SQLite gives you structured queries, full-text search, and JSON extraction over data that was previously trapped in flat files. MCP exposes it to an AI assistant that can write SQL, retrieve content, and (in some cases) create new content.

Here’s the inventory:

Project	Domain	Ground Truth	What the MCP Exposes
hugo-memex	Blog content	Markdown files with YAML front matter	951 pages, FTS5 search, taxonomy queries, JSON front matter extraction
memex	AI conversations	ChatGPT/Claude/Gemini exports	Conversation trees, FTS5 message search, tags, enrichments
chartfold	Medical records	Epic, MEDITECH, athenahealth exports	Labs, meds, encounters, imaging, pathology, cross-source reconciliation
arkiv	Personal archives	JSONL files from various sources	Unified SQL over heterogeneous personal data
repoindex	Git repositories	Local git repos + GitHub/PyPI/CRAN metadata	Repository catalog with activity tracking, publication status

Five projects. Five completely different domains. One architecture.

Why SQLite

SQLite is the most deployed database in history. It’s on every phone, every browser, every Python installation. But that’s not why I use it.

I use it because it solves three problems at once:

Structured queries over unstructured data. Hugo front matter is YAML trapped inside markdown files. Medical records are scattered across three incompatible EHR export formats. AI conversations are JSON trees with branching paths. SQLite turns all of these into tables you can JOIN, GROUP BY, and aggregate. json_extract() handles the long tail of fields that don’t fit a fixed schema.

Full-text search. FTS5 with porter stemming and unicode61 tokenization gives you relevance-ranked search across any text corpus. No Elasticsearch, no external service, no running daemon. Just a virtual table that lives in the same database file.

Code Without Purpose

February 24, 2026

Time is finite in ways I can’t ignore. That changes which questions about code feel important.

I read a post arguing that the most valuable programming skill in 2026 is deleting code. The thesis: AI generates code faster than anyone can review it, so the real value is in curation and subtraction. Code is a liability, not an asset.

I agree with the observation. I disagree with the prescription.

The MCP Pattern: SQLite as the AI-Queryable Cache

March 20, 2026

I keep building the same thing.

Not the same product — the products are different. One indexes a Hugo blog. One indexes AI conversations. One consolidates medical records from three hospitals. One catalogs a hundred git repositories. But underneath, they all have the same skeleton. After the fifth time, I think the skeleton deserves a name.

The pattern

Domain files (ground truth)
    ↓ index
SQLite database (read-only cache, FTS5)
    ↓ expose
MCP server (tools + resources → AI assistant)

That’s it. Three layers. The domain files are always canonical — the database is a disposable cache you can rebuild from them at any time. SQLite gives you structured queries, full-text search, and JSON extraction over data that was previously trapped in flat files. MCP exposes it to an AI assistant that can write SQL, retrieve content, and (in some cases) create new content.

Here’s the inventory:

Project	Domain	Ground Truth	What the MCP Exposes
hugo-memex	Blog content	Markdown files with YAML front matter	951 pages, FTS5 search, taxonomy queries, JSON front matter extraction
memex	AI conversations	ChatGPT/Claude/Gemini exports	Conversation trees, FTS5 message search, tags, enrichments
chartfold	Medical records	Epic, MEDITECH, athenahealth exports	Labs, meds, encounters, imaging, pathology, cross-source reconciliation
arkiv	Personal archives	JSONL files from various sources	Unified SQL over heterogeneous personal data
repoindex	Git repositories	Local git repos + GitHub/PyPI/CRAN metadata	Repository catalog with activity tracking, publication status

Five projects. Five completely different domains. One architecture.

Why SQLite

SQLite is the most deployed database in history. It’s on every phone, every browser, every Python installation. But that’s not why I use it.

I use it because it solves three problems at once:

Structured queries over unstructured data. Hugo front matter is YAML trapped inside markdown files. Medical records are scattered across three incompatible EHR export formats. AI conversations are JSON trees with branching paths. SQLite turns all of these into tables you can JOIN, GROUP BY, and aggregate. json_extract() handles the long tail of fields that don’t fit a fixed schema.

Full-text search. FTS5 with porter stemming and unicode61 tokenization gives you relevance-ranked search across any text corpus. No Elasticsearch, no external service, no running daemon. Just a virtual table that lives in the same database file.

Chartfold: Owning Your Medical Records

February 24, 2026

I have cancer. My oncologist is at one hospital system (Siteman/BJC), my primary care doctor at another, and my earlier treatment history lives at a third (Anderson, where my first oncologist practiced). Patient portals are fine for browsing, but they don’t answer questions. They show you your data one lab result at a time, one note at a time, one visit at a time.

I wanted to run queries against my medical records. Correlate lab trends with treatment changes. Generate structured question lists before oncology visits. Ask “what changed since my last appointment” and get a real answer. That means getting the data out of the portal and into something programmable.

Chartfold loads EHR exports into SQLite and exposes them to Claude via MCP.

Code Without Purpose

February 24, 2026

Time is finite in ways I can’t ignore. That changes which questions about code feel important.

I read a post arguing that the most valuable programming skill in 2026 is deleting code. The thesis: AI generates code faster than anyone can review it, so the real value is in curation and subtraction. Code is a liability, not an asset.

I agree with the observation. I disagree with the prescription.

The MCP Pattern: SQLite as the AI-Queryable Cache

March 20, 2026

I keep building the same thing.

Not the same product — the products are different. One indexes a Hugo blog. One indexes AI conversations. One consolidates medical records from three hospitals. One catalogs a hundred git repositories. But underneath, they all have the same skeleton. After the fifth time, I think the skeleton deserves a name.

The pattern

Domain files (ground truth)
    ↓ index
SQLite database (read-only cache, FTS5)
    ↓ expose
MCP server (tools + resources → AI assistant)

That’s it. Three layers. The domain files are always canonical — the database is a disposable cache you can rebuild from them at any time. SQLite gives you structured queries, full-text search, and JSON extraction over data that was previously trapped in flat files. MCP exposes it to an AI assistant that can write SQL, retrieve content, and (in some cases) create new content.

Here’s the inventory:

Project	Domain	Ground Truth	What the MCP Exposes
hugo-memex	Blog content	Markdown files with YAML front matter	951 pages, FTS5 search, taxonomy queries, JSON front matter extraction
memex	AI conversations	ChatGPT/Claude/Gemini exports	Conversation trees, FTS5 message search, tags, enrichments
chartfold	Medical records	Epic, MEDITECH, athenahealth exports	Labs, meds, encounters, imaging, pathology, cross-source reconciliation
arkiv	Personal archives	JSONL files from various sources	Unified SQL over heterogeneous personal data
repoindex	Git repositories	Local git repos + GitHub/PyPI/CRAN metadata	Repository catalog with activity tracking, publication status

Five projects. Five completely different domains. One architecture.

Why SQLite

SQLite is the most deployed database in history. It’s on every phone, every browser, every Python installation. But that’s not why I use it.

I use it because it solves three problems at once:

Structured queries over unstructured data. Hugo front matter is YAML trapped inside markdown files. Medical records are scattered across three incompatible EHR export formats. AI conversations are JSON trees with branching paths. SQLite turns all of these into tables you can JOIN, GROUP BY, and aggregate. json_extract() handles the long tail of fields that don’t fit a fixed schema.

Full-text search. FTS5 with porter stemming and unicode61 tokenization gives you relevance-ranked search across any text corpus. No Elasticsearch, no external service, no running daemon. Just a virtual table that lives in the same database file.

The MCP Pattern: SQLite as the AI-Queryable Cache

March 20, 2026

I keep building the same thing.

Not the same product — the products are different. One indexes a Hugo blog. One indexes AI conversations. One consolidates medical records from three hospitals. One catalogs a hundred git repositories. But underneath, they all have the same skeleton. After the fifth time, I think the skeleton deserves a name.

The pattern

Domain files (ground truth)
    ↓ index
SQLite database (read-only cache, FTS5)
    ↓ expose
MCP server (tools + resources → AI assistant)

That’s it. Three layers. The domain files are always canonical — the database is a disposable cache you can rebuild from them at any time. SQLite gives you structured queries, full-text search, and JSON extraction over data that was previously trapped in flat files. MCP exposes it to an AI assistant that can write SQL, retrieve content, and (in some cases) create new content.

Here’s the inventory:

Project	Domain	Ground Truth	What the MCP Exposes
hugo-memex	Blog content	Markdown files with YAML front matter	951 pages, FTS5 search, taxonomy queries, JSON front matter extraction
memex	AI conversations	ChatGPT/Claude/Gemini exports	Conversation trees, FTS5 message search, tags, enrichments
chartfold	Medical records	Epic, MEDITECH, athenahealth exports	Labs, meds, encounters, imaging, pathology, cross-source reconciliation
arkiv	Personal archives	JSONL files from various sources	Unified SQL over heterogeneous personal data
repoindex	Git repositories	Local git repos + GitHub/PyPI/CRAN metadata	Repository catalog with activity tracking, publication status

Five projects. Five completely different domains. One architecture.

Why SQLite

SQLite is the most deployed database in history. It’s on every phone, every browser, every Python installation. But that’s not why I use it.

I use it because it solves three problems at once:

Structured queries over unstructured data. Hugo front matter is YAML trapped inside markdown files. Medical records are scattered across three incompatible EHR export formats. AI conversations are JSON trees with branching paths. SQLite turns all of these into tables you can JOIN, GROUP BY, and aggregate. json_extract() handles the long tail of fields that don’t fit a fixed schema.

Full-text search. FTS5 with porter stemming and unicode61 tokenization gives you relevance-ranked search across any text corpus. No Elasticsearch, no external service, no running daemon. Just a virtual table that lives in the same database file.

The MCP Pattern: SQLite as the AI-Queryable Cache

March 20, 2026

I keep building the same thing.

Not the same product — the products are different. One indexes a Hugo blog. One indexes AI conversations. One consolidates medical records from three hospitals. One catalogs a hundred git repositories. But underneath, they all have the same skeleton. After the fifth time, I think the skeleton deserves a name.

The pattern

Domain files (ground truth)
    ↓ index
SQLite database (read-only cache, FTS5)
    ↓ expose
MCP server (tools + resources → AI assistant)

That’s it. Three layers. The domain files are always canonical — the database is a disposable cache you can rebuild from them at any time. SQLite gives you structured queries, full-text search, and JSON extraction over data that was previously trapped in flat files. MCP exposes it to an AI assistant that can write SQL, retrieve content, and (in some cases) create new content.

Here’s the inventory:

Project	Domain	Ground Truth	What the MCP Exposes
hugo-memex	Blog content	Markdown files with YAML front matter	951 pages, FTS5 search, taxonomy queries, JSON front matter extraction
memex	AI conversations	ChatGPT/Claude/Gemini exports	Conversation trees, FTS5 message search, tags, enrichments
chartfold	Medical records	Epic, MEDITECH, athenahealth exports	Labs, meds, encounters, imaging, pathology, cross-source reconciliation
arkiv	Personal archives	JSONL files from various sources	Unified SQL over heterogeneous personal data
repoindex	Git repositories	Local git repos + GitHub/PyPI/CRAN metadata	Repository catalog with activity tracking, publication status

Five projects. Five completely different domains. One architecture.

Why SQLite

SQLite is the most deployed database in history. It’s on every phone, every browser, every Python installation. But that’s not why I use it.

I use it because it solves three problems at once:

Structured queries over unstructured data. Hugo front matter is YAML trapped inside markdown files. Medical records are scattered across three incompatible EHR export formats. AI conversations are JSON trees with branching paths. SQLite turns all of these into tables you can JOIN, GROUP BY, and aggregate. json_extract() handles the long tail of fields that don’t fit a fixed schema.

Full-text search. FTS5 with porter stemming and unicode61 tokenization gives you relevance-ranked search across any text corpus. No Elasticsearch, no external service, no running daemon. Just a virtual table that lives in the same database file.

repoindex: Collection Awareness for Your Git Repos

December 16, 2025

I have around 100 git repos. When I’m working with Claude Code on one of them, the AI has deep knowledge of that repo but zero awareness of the rest. Questions like “which of my repos already has a fuzzy search implementation?” or “what other projects use this pattern?” require me to go dig around manually.

repoindex fixes that.

The Idea

Separation of concerns:

Claude Code (deep work on ONE repo)
         |
         |  "What else do I have?"
         |  "Which repos need X?"
         v
    repoindex (collection awareness)
         |
         +-- repo://...     -> what exists
         +-- tags://...     -> organization
         +-- stats://...    -> aggregations
         +-- events://...   -> what happened

Claude Code works inside repositories. repoindex knows about repositories: metadata, tags, status, relationships. Together they give you full portfolio awareness.

MCP Server Integration

The most useful part is the MCP (Model Context Protocol) server. Add it to your Claude Code configuration and the AI can query your collection directly:

“Which of my Python repos don’t have a LICENSE file?”
“What repos have I updated in the last week?”
“Show me all projects tagged with ml”

The server exposes resources like repo://, tags://, stats://, and events:// that Claude Code reads to understand your portfolio.

Core Features

Tag-Based Organization. Hierarchical tags for categorizing repos. Tags can be explicit (repoindex tag add myproject topic:ml) or implicit (derived automatically from language, directory, features).

Query Language. Filter repos with expressions:

repoindex query "language == 'Python' and 'ml' in tags"
repoindex query "stars > 10 and has:docs"

Event Tracking. What happened across your collection:

repoindex events --since 7d --pretty

New releases, tags, PyPI publishes, all in one view.

JSONL Output. Every command outputs newline-delimited JSON by default, so it plays well with Unix pipelines:

repoindex status | jq 'select(.status.clean == false)'

Installation

Available on PyPI:

pip install repoindex

Configure your repository directories and start indexing:

repoindex config generate
repoindex list --pretty

Why the Rename?

This was previously called ghops. The new name is more honest about what it does: it indexes repositories. The old name implied GitHub-specific operations, but the tool works with any git repo.

Everything is a File: Virtual Filesystems for CLI Data Tools

October 20, 2025

I had a bookmark manager. Then an ebook library manager. Then a chat history manager. Each started with the standard CRUD CLI:

btk add https://example.com --tags python,tutorial
btk list --tag python
btk search "async"
btk delete 1234

ebk import book.pdf --author "Knuth"
ebk list --author Knuth
ebk search "algorithms"

This works fine until you have 10,000+ bookmarks organized with hierarchical tags like programming/python/async, research/ml/transformers, work/clients/acme. Your ebook library has similar structure. Your exported chat conversations from Claude, ChatGPT, and Copilot are piling up.

Traditional CRUD commands become unwieldy:

btk list --tag programming/python/async/io --format json | jq '.[].title'
ebk list --category "Computer Science/Algorithms/Graph Theory" --limit 50
ctk search "machine learning" --source ChatGPT --date-from 2024-01-01

Each command requires precise arguments. Each tool has different flag conventions. You can’t navigate your data. You can only query it. And queries require knowing exactly what you’re looking for.

The insight: everything is a file

When I have thousands of source files organized in directories, I don’t run:

list-files --path /src/components/auth --extension .tsx

I run:

cd src/components/auth
ls *.tsx

The difference matters. With a filesystem, I can navigate incrementally (cd from general to specific), explore (ls to see what’s there), compose (cat file | grep pattern | wc -l), and use familiar tools (find, grep, xargs, pipes, redirection).

What if my bookmarks, ebooks, and chat histories were filesystems?

The pattern

Over the past year, I built six Python tools that all follow the same architecture:

Tool	Domain	VFS Root Structure
btk	Bookmarks	`/bookmarks/`, `/tags/`, `/recent/`, `/domains/`, `/unread/`, `/popular/`
ebk	Ebook library	`/books/`, `/authors/`, `/series/`, `/subjects/`, `/recent/`, `/unread/`
ctk	Chat conversations	`/conversations/`, `/sources/`, `/topics/`, `/starred/`, `/recent/`
ghops	Git repositories	`/repos/`, `/languages/`, `/topics/`, `/stars/`, `/recent/`
infinigram	N-gram models	`/datasets/`, `/models/`, `/corpora/`
AlgoTree	Tree structures	`/nodes/`, `/paths/`, `/subtrees/`

Each tool provides:

A stateless CLI for scripting: btk bookmark add URL, ebk import book.pdf
An interactive shell with a virtual filesystem: btk shell, ebk shell, ctk chat
POSIX-like commands: cd, ls, pwd, cat, mv, cp, rm, find, grep
Unix pipeline support: most commands output JSONL by default for piping

The interesting part is the shell.

Navigating 10,000 bookmarks

Live recording captured with asciinema. You can pause, copy text, and replay. The entire recording is 78KB of text.

What Happens When You Let an AI Loose on 1,000 Erdős Problems

March 16, 2026

I should be upfront about what happened here. I did not compute coprime Ramsey numbers. I did not write 92 Python modules or 5,922 tests. I did not build SAT encodings or run survival analysis on Erdős problems.

Claude Code (Opus 4.6) did all of that. I told it what to look at, asked it to keep going, and occasionally said things like “try to disprove our discoveries” and “be aggressive.” The AI did the rest. 131 subagents, 78,000 lines of code, three minted DOIs. In one session.

I’m writing this down because I think it’s worth documenting what that looks like from the human side of the keyboard.

The Setup

Terence Tao maintains a database of 1,183 Erdős problems on GitHub. Each problem has tags, OEIS links, resolution status, and sometimes prize money. The database was updated in August 2025 to link problems to integer sequences. Since then, 213 problems have been solved, many with AI assistance.

I had been poking at this database on and off for a few months. I had some Python scripts, some partial Lean proofs, a few computational results. Nothing organized. The codebase had bugs (the kind where a random sampling heuristic silently gives you the wrong answer and you don’t notice for weeks).

I started a Claude Code session intending to fix those bugs. Then I said “iterate.” Then I kept saying “iterate.”

What Claude Found

The headline result is a family of numbers that, as far as anyone can tell, nobody had studied before.

Take the integers 1 through n. Connect every coprime pair with an edge. This is the coprime graph. Now 2-color every edge. The coprime Ramsey number R_cop(k) is the smallest n where every 2-coloring must contain a monochromatic complete subgraph of size k.

Classical Ramsey: R(3,3) = 6. Coprime Ramsey: R_cop(3) = 11.

The value R_cop(4) = 59 required SAT solving (Glucose4 via pysat). A random sampling heuristic had said 20. It was off by a factor of three. The SAT solver finds avoiding colorings instantly at every n up to 58. At n = 59 (prime, coprime to everything below it), no avoiding coloring exists. This was verified by an independent implementation built from scratch by a separate adversarial agent.

Fuzzy Inference: Teaching Machines to Think in Shades of Grey

March 16, 2026

Facts and Degrees

In classical logic, something is true or false. The cat is on the mat, or it is not. A patient has a fever, or they do not. There is no middle ground.

Fuzzy logic adds a dial.

Instead of true/false, every statement carries a degree of belief – a number between 0 and 1. A degree of 1.0 means certainty. A degree of 0.0 means we have no belief at all. And everything in between is fair game.

Here is the simplest possible fuzzy fact:

# A fuzzy fact: "Rex has hair" with 85% confidence
engine.add_fact("has-hair", ["rex"], 0.85)

The predicate is has-hair. The argument is rex. The degree is 0.85. Maybe we observed Rex from a distance, or the photo was blurry. We are fairly sure Rex has hair, but not certain.

This is the building block of everything that follows. A fuzzy knowledge base is just a collection of these facts, each with its own degree. Some facts we are sure about (deg=1.0). Others are tentative guesses (deg=0.3). The engine treats them all the same way – it just pays attention to the number.

One important detail: when two sources assert the same fact with different degrees, the engine keeps the higher one. This is called fuzzy-OR. If one sensor says has-hair(rex) at 0.85 and another says it at 0.92, the engine stores 0.92. Optimistic, but reasonable – the stronger evidence wins.

engine.add_fact("has-hair", ["rex"], 0.85)
engine.add_fact("has-hair", ["rex"], 0.92)  # fuzzy-OR: keeps 0.92

In the widget below, you can create fuzzy facts and drag the degree slider to see how the degree changes the visual representation. A fact at 1.0 is solid and bright. A fact at 0.1 is faded, barely there. This is not just decoration – it is the engine’s uncertainty, made visible.

Rules

Facts alone are inert. To reason, we need rules – if-then statements that produce new facts from existing ones.

A fuzzy rule looks like this: “If X has hair, then X is a mammal.” In code:

engine.add_rule(
    name="mammal-rule",
    conditions=[{"pred": "has-hair", "args": ["?x"], "degVar": "?d"}],
    actions=[{
        "type": "add",
        "fact": {"pred": "is-mammal", "args": ["?x"], "deg": ["*", 0.95, "?d"]}
    }],
    priority=60,
)

There is a lot going on here, so let us unpack it.

Pattern variables. The ?x in the condition is a variable. It matches any argument. When the engine finds has-hair(rex, 0.85), it binds ?x to rex. The same ?x then appears in the action, so the engine adds is-mammal(rex, ...).

Narrating a Hugo Blog with Sentence Highlighting

February 26, 2026

I wanted my blog posts to have audio narration. Not a podcast, not a read-aloud button that sends text to a cloud API. Local TTS with narro, my 80M parameter CPU model, generating Opus files that live next to the markdown source. One command to narrate an entire Hugo site.

That part was straightforward. The part that got interesting was highlighting: tracking which sentence is being spoken and lighting it up in the browser as the audio plays.

Code Without Purpose

February 24, 2026

Time is finite in ways I can’t ignore. That changes which questions about code feel important.

I read a post arguing that the most valuable programming skill in 2026 is deleting code. The thesis: AI generates code faster than anyone can review it, so the real value is in curation and subtraction. Code is a liability, not an asset.

I agree with the observation. I disagree with the prescription.

Code Without Purpose

February 24, 2026

Time is finite in ways I can’t ignore. That changes which questions about code feel important.

I read a post arguing that the most valuable programming skill in 2026 is deleting code. The thesis: AI generates code faster than anyone can review it, so the real value is in curation and subtraction. Code is a liability, not an asset.

I agree with the observation. I disagree with the prescription.

Long Echo Comes Alive: From Philosophy to Orchestration

January 20, 2026

A year ago, I wrote about Long Echo as a philosophy for preserving AI conversations across decades. The key insight was graceful degradation: design archives that work progressively even as technology disappears.

That philosophy has become a tool.

From Philosophy to Tool

The original Long Echo was intentionally not code. It was a set of principles documented in CTK’s repository. The hard problems of conversation parsing, storage, and search were already solved by toolkits like CTK, BTK, and EBK.

What was missing was the unification layer. Each toolkit exports its own ECHO-compliant archive, but combining them into a single browsable experience required manual work. That’s what longecho now handles.

What longecho Does Now

longecho is a CLI tool with five capabilities:

longecho check ~/my-data/       # Validate ECHO compliance
longecho discover ~/            # Find ECHO sources
longecho search ~/ "query"      # Search README descriptions
longecho build ~/my-archive/    # Generate static site
longecho serve ~/my-archive/    # Preview locally via HTTP

The check, discover, and search commands existed in the original specification. What’s new is build and serve, the orchestration layer.

Building a Unified Site

The build command takes a hierarchical archive and generates a static site:

longecho build ~/my-archive/

This produces a site/ directory with:

An index page linking to all sub-archives
Navigation between sources
Automatic linking to existing sub-site builds

If a sub-archive already has its own site/ directory (like CTK’s exports), longecho links to it. Use --bundle to copy everything into a portable, self-contained site.

Live Preview

The serve command provides local HTTP preview:

longecho serve ~/my-archive/ --port 8000

It builds the site if needed, then serves it for browser viewing.

The Manifest

ECHO compliance requires only a README. But for machine-readable metadata, longecho supports an optional manifest:

version: "1.0"
name: "Alex's Data Archive"
description: "Personal data archive"
sources:
  - path: "conversations/"
    order: 1
  - path: "bookmarks/"
    order: 2
  - path: "ebooks/"
    order: 3

The manifest enables:

Explicit ordering of sources in generated sites
Selective inclusion via the browsable flag
Override names for cleaner presentation
Icon hints for UI presentation

Without a manifest, longecho auto-discovers sub-archives by looking for directories with README files. The manifest provides explicit control when you need it.

Code Without Purpose

February 24, 2026

Time is finite in ways I can’t ignore. That changes which questions about code feel important.

I read a post arguing that the most valuable programming skill in 2026 is deleting code. The thesis: AI generates code faster than anyone can review it, so the real value is in curation and subtraction. Code is a liability, not an asset.

I agree with the observation. I disagree with the prescription.

pagevault: Hiding an Encryption Platform Inside HTML

February 13, 2026

HTML is an encryption container format. That sounds wrong, but think about what an HTML file can hold: arbitrary data in script tags or data attributes, a full programming runtime via JavaScript, and a rendering engine (the browser) on every device on the planet. If you embed encrypted data and the code to decrypt it, the result is a file that looks inert until someone types the right password.

pagevault takes this idea seriously. It encrypts files, documents, images, entire websites, into self-contained HTML pages that decrypt in the browser. No backend. No JavaScript crypto libraries. The browser already has AES-256-GCM built in via the Web Crypto API. pagevault just has to match the parameters exactly on the Python side and embed the right 200 lines of JavaScript.

The output is a single .html file. You can email it, put it on a USB stick, host it on GitHub Pages, or double-click it on your desktop. It doesn’t phone home, it doesn’t load CDNs, it doesn’t need anything except a browser.

Code Without Purpose

February 24, 2026

Time is finite in ways I can’t ignore. That changes which questions about code feel important.

I read a post arguing that the most valuable programming skill in 2026 is deleting code. The thesis: AI generates code faster than anyone can review it, so the real value is in curation and subtraction. Code is a liability, not an asset.

I agree with the observation. I disagree with the prescription.

Posthumous: A Federated Dead Man's Switch

February 14, 2026

Some things should only happen after you can’t do them yourself.

Posthumous is a self-hosted dead man’s switch. You check in periodically (via phone, browser, CLI, or API call) and if you stop, it progresses through escalating stages before triggering automated actions: sending notifications, running scripts, whatever you’ve configured.

I built it because the existing options are either cloud-hosted (you’re trusting someone else’s uptime for your most important automation) or single-node (one server failure and silence is indistinguishable from death). Posthumous is federated, multiple nodes watch each other, and fully self-hosted.

This post walks through the basic workflows.

Masked Failure Data: Looking Back, Looking Forward

February 18, 2026

I have been working on the same statistical problem since 2020. I am now a PhD student in CS. The problem has not changed, but my understanding of it has, and the tools I have built around it look nothing like what I started with.

The problem: a series system fails when any component fails. You observe system-level failure times. But you often cannot tell which component caused the failure (masking). Some systems are still running when testing ends (censoring). Given this incomplete data, estimate component reliability.

This is not a tutorial. It is a map of where things stand and where they are going.

Observation Functors: Composable Censoring for Series System Simulation

February 13, 2026

Last week I announced maskedcauses, the R package for estimating component reliability from masked series system failures. That post covered the three likelihood models and the path to CRAN.

This post is about what happened next: the package now supports four observation types (exact, right-censored, left-censored, and interval-censored) via composable observation functors. Along the way, I wrote four vignettes, removed the md.tools dependency, and developed a verification methodology for keeping prose honest about simulation results.

maskedcauses: Maximum Likelihood Estimation for Masked Series System Failures

February 5, 2026

Note (February 2026): This package has been renamed from likelihood.model.series.md to maskedcauses.

Two days ago, I submitted likelihood.model to CRAN, the foundation package for composable statistical inference. Next in line: maskedcauses, which implements maximum likelihood estimation for series systems where component failure causes are masked.

This package is the practical result of my master’s thesis work. Three years of theoretical development, now packaged for anyone analyzing masked failure data.

The Problem: Masked Component Failures

A series system fails when any of its $m$ components fails. In reliability testing, you observe the system fail at time $t$, but two layers of uncertainty obscure the full picture:

Right-censoring: Some systems are still running when testing ends. You know they survived at least until time $\tau$, but not how much longer they would have lasted.
Masked cause of failure: When a system fails, you often can’t identify which component caused it. Diagnostic tests might narrow it down to a candidate set of possible causes, but the true failure component remains ambiguous.

This happens constantly in practice. Electronic systems fail with only board-level diagnostics. Industrial machinery fails without root-cause teardown. Medical devices fail with symptoms pointing to multiple possible subsystems.

The question: given this incomplete information, can you still estimate the lifetime distribution of each component?

The Package: Three Likelihood Models

maskedcauses provides three models with different complexity-accuracy tradeoffs:

Model	Parameters	Use Case
`exp_series_md_c1_c2_c3`	$m$ rates $(\lambda_1, \ldots, \lambda_m)$	Memoryless components (constant failure rate)
`wei_series_md_c1_c2_c3`	$2m$ params $(k_1, \beta_1, \ldots, k_m, \beta_m)$	Weibull with per-component shapes
`wei_series_homogeneous_md_c1_c2_c3`	$m+1$ params $(k, \beta_1, \ldots, \beta_m)$	Weibull with shared shape parameter

Each model implements the full inference stack: loglik(), score(), hess_loglik(), rdata(), and assumptions().

The C1-C2-C3 Conditions

The models assume three conditions that simplify the likelihood:

C1: The failed component is in the candidate set with probability 1
C2: Given the failed component is in the candidate set, masking probability is uniform across candidates
C3: Masking probabilities are independent of system parameters $\theta$

Under these conditions, the masking mechanism factors out of the likelihood. You can estimate component parameters without modeling the diagnostic process itself. That’s why the package name includes “c1_c2_c3”.

compositional.mle: SICP-Inspired Optimization

December 17, 2025

I recently updated compositional.mle, an R package for maximum likelihood estimation built on a simple premise: optimization strategies should compose.

The Problem

Most optimization libraries treat solvers as monolithic procedures. You call optim(), pass some options, hope for the best. Want to try multiple methods? Write a loop. Want coarse-to-fine optimization? Manually wire one solver’s output into the next.

compositional.mle treats solvers the way SICP treats procedures: as first-class citizens.

Primitive solvers: gradient_ascent(), newton_raphson(), bfgs(), nelder_mead()
Composition operators: %>>% (sequential chaining), %|% (parallel racing), with_restarts()
Closure: Combining solvers yields a solver

That last point is the whole thing. When you chain two solvers together, the result is itself a solver with the same interface. So compositions can be further composed, stored in variables, passed to functions, used anywhere a solver is expected.

What This Looks Like

Define your problem once:

problem <- mle_problem(
  loglike = function(theta) {
    if (theta[2] <= 0) return(-Inf)
    sum(dnorm(x, theta[1], theta[2], log = TRUE))
  },
  score = function(theta) {
    mu <- theta[1]; sigma <- theta[2]; n <- length(x)
    c(sum(x - mu) / sigma^2,
      -n / sigma + sum((x - mu)^2) / sigma^3)
  }
)

Then compose strategies declaratively:

# Global search -> local refinement -> final polish
strategy <- grid_search(lower = c(-10, 0.5), upper = c(10, 5), n = 5) %>>%
  gradient_ascent(max_iter = 50) %>>%
  newton_raphson(max_iter = 20)

result <- strategy(problem, theta0 = c(0, 1))

Or race multiple approaches:

# Try all methods, keep the best
strategy <- gradient_ascent() %|% bfgs() %|% nelder_mead()

Or handle multimodal landscapes:

# Random restarts to escape local optima
strategy <- with_restarts(gradient_ascent(), n = 10,
                          sampler = uniform_sampler(lower, upper))

The SICP Connection

This design applies SICP’s framework directly:

Primitives. The base solvers are building blocks with clear contracts. gradient_ascent() returns a solver using steepest ascent. nelder_mead() returns a derivative-free simplex solver.

Means of Combination. The operators %>>%, %|%, and with_restarts() combine solvers into new solvers. Chaining feeds one solver’s output as input to the next. Racing runs solvers in parallel and picks the winner.

Abstraction. Solver factories hide implementation details behind a consistent interface. You work with the solver abstraction, not specific algorithms.

Closure. Because composition produces objects of the same type as the inputs, the language of solvers is closed under composition. You build arbitrarily complex strategies from simple parts.

Relationship to algebraic.mle

This package complements algebraic.mle, which provides algebraic operations on MLE results. Where algebraic.mle lets you compose likelihood functions and manipulate fitted models, compositional.mle focuses on the process of finding those estimates.

They work together:

# compositional.mle: find the estimate
result <- strategy(problem, theta0)

# algebraic.mle: work with the fitted model
confint(result)
coef(result)

Try It

Install from GitHub:

likelihood.model: Composable Likelihood Models in R

June 30, 2022

Most R packages hardcode specific likelihood models. likelihood.model takes a different approach. Likelihoods are first-class objects that compose, and the framework is generic enough to work with any distribution.

The Interface

A likelihood model is anything implementing these generic methods:

loglik(model, data, params) – log-likelihood
score(model, data, params) – score function (gradient)
hessian(model, data, params) – observed information matrix

That is the interface. If your model implements these three methods, it plugs into the entire MLE stack: optimization, confidence intervals, hypothesis testing, model selection. You do not couple to specific distributions.

Likelihood Contributions

The key class is likelihood_contr_model, a likelihood built from independent contributions:

# Different observation types get different likelihood contributions
model <- likelihood_contr_model(
  exact = normal_contrib(),
  right_censored = censored_contrib()
)

This handles heterogeneous data in a unified framework. You can mix exact observations, right-censored observations, truncated observations, and different distribution families within one model. Each observation type gets its own likelihood contribution, and they combine additively in log-space.

Why This Design

The i.i.d. assumption decomposes a joint likelihood into additive log-likelihood contributions. That is how MLE actually works. likelihood.model makes this decomposition explicit and compositional.

Likelihood models are objects you manipulate, not function calls buried inside a fitting routine. You can build complex models from simple, independent pieces. You can swap in different contribution types without rewriting the rest of your code. And because the interface is generic, it works with algebraic.mle for fitting, hypothesize for testing, and any optimization backend that speaks the same protocol.

This is the same compositional philosophy as my thesis work on masked failure data. Series systems with masked causes have multiple observation types (masked vs. unmasked, different candidate sets) that each contribute differently to the likelihood. likelihood.model handles that naturally.

R package – MIT licensed – Documentation – GitHub

algebraic.mle: MLEs as Algebraic Objects

May 15, 2021

Maximum likelihood estimators have rich mathematical structure. They are consistent, asymptotically normal, efficient. algebraic.mle exposes this structure through an algebra where MLEs are objects you compose, transform, and query.

The Abstraction

An MLE is not just a vector of parameter estimates. It is a statistical object that carries point estimates $\hat{\theta}$, the Fisher information matrix $I(\hat{\theta})$, the variance-covariance matrix $I^{-1}(\hat{\theta})$, Wald-type confidence intervals from asymptotic normality, the log-likelihood value, and convergence diagnostics.

The package wraps all of this in a consistent interface:

library(algebraic.mle)

fit <- mle(likelihood_model, data)
coef(fit)           # Parameter estimates
vcov(fit)           # Variance-covariance matrix
confint(fit)        # Confidence intervals
logLik(fit)         # Log-likelihood
aic(fit)            # Model selection

Composition

The real point is that MLEs compose. Independent models combine:

fit1 <- mle(model1, data1)
fit2 <- mle(model2, data2)
combined <- fit1 + fit2  # Joint likelihood

The package handles the algebra. Joint log-likelihood, block-diagonal Fisher information, everything propagates correctly. This works because likelihoods from independent data sources multiply, and multiplication of likelihoods is addition of log-likelihoods. That is a monoid. The package enforces it.

The Ecosystem

algebraic.mle is the foundation for a family of packages:

Package	Purpose
likelihood.model	Compositional likelihood specification
maskedcauses	Masked failure data in series systems
mdrelax	Relaxed masking conditions
algebraic.dist	Distributions as algebraic objects
flexhaz	Dynamic failure rate distributions
hypothesize	Likelihood ratio tests on MLEs
nabla	Numerical optimization backends

The typical workflow:

Define distributions with algebraic.dist
Specify likelihood contributions with likelihood.model
Fit the model and get an mle object from algebraic.mle
Query statistical properties: confidence intervals, hypothesis tests, model selection

For series systems with masked data:

library(maskedcauses)
library(algebraic.mle)

# Specify masking model (C1-C2-C3 conditions)
model <- md_likelihood_model(components = 3, masking = "bernoulli")

# Fit -> returns algebraic.mle object
fit <- md_mle_exp_series_C1_C2_C3(masked_data)

# All the standard MLE methods work
confint(fit)
vcov(fit)
aic(fit)

Theory

The asymptotic properties that algebraic.mle exploits come from classical MLE theory:

$$\sqrt{n}(\hat{\theta}_n - \theta^{\ast}) \xrightarrow{d} \mathcal{N}(0, I^{-1}(\theta^{\ast}))$$

The expo-masked-fim paper derives closed-form Fisher information for exponential series systems. That is exactly what algebraic.mle uses internally for variance estimation in that case.

For more complex models (Weibull, relaxed masking conditions), we compute Fisher information numerically via observed information:

$$\hat{I}(\hat{\theta}) = -\frac{\partial^2 \ell}{\partial \theta \partial \theta^T}\bigg|_{\theta=\hat{\theta}}$$

Design Principles

Separation of concerns. The likelihood specification (likelihood.model) is independent of the fitting algorithm (nabla) and the result type (algebraic.mle). You can swap optimizers without changing downstream code.

Masked Failure Data: Looking Back, Looking Forward

February 18, 2026

I have been working on the same statistical problem since 2020. I am now a PhD student in CS. The problem has not changed, but my understanding of it has, and the tools I have built around it look nothing like what I started with.

The problem: a series system fails when any component fails. You observe system-level failure times. But you often cannot tell which component caused the failure (masking). Some systems are still running when testing ends (censoring). Given this incomplete data, estimate component reliability.

This is not a tutorial. It is a map of where things stand and where they are going.

flexhaz: Specify the Hazard Function Directly

August 20, 2021

Survival analysis usually makes you pick from a catalog. Weibull, exponential, log-normal. You choose the family, estimate the parameters, and hope the model fits. flexhaz flips this around. You specify the hazard function directly, and the package computes everything else.

How It Works

Instead of choosing Weibull(shape, scale), you write:

h <- function(t, x) exp(b0 + b1*x + b2*t)  # Your hazard function
model <- dfr_dist(hazard = h)

The package computes survival functions, cumulative hazards, quantiles, and sampling from your custom hazard. You get a full distributional object without committing to a named family.

Why This Is Useful

You are not constrained to parametric families. Want a bathtub curve? Multiple failure-rate peaks? Time-varying covariate effects? Just write the hazard function. No need to force reality into exponential or Weibull boxes.

Covariates can depend on anything:

h <- function(t, age, treatment) {
  baseline * exp(beta_age*age + beta_tx*treatment + gamma*t)
}

And it integrates with the rest of the MLE stack. flexhaz works with algebraic.mle for parameter estimation and likelihood.model for likelihood contributions.

Constraints

Your hazard function needs to satisfy two things:

Non-negative: h(t, x) >= 0 for all t, x
Eventual failure: cumulative hazard goes to infinity as t goes to infinity

That is it. Those are the only requirements for a valid hazard function. The package handles deriving the survival function, density, CDF, and quantile function from the hazard you provide.

Context

This generalizes my thesis work on masked failure data, where I used Weibull and exponential distributions. With flexhaz, you are not limited to parametric families. You specify the actual failure mechanism, and the math adapts.

R package – Works with algebraic.mle – Documentation – GitHub

Masked Failure Data: Looking Back, Looking Forward

February 18, 2026

I have been working on the same statistical problem since 2020. I am now a PhD student in CS. The problem has not changed, but my understanding of it has, and the tools I have built around it look nothing like what I started with.

The problem: a series system fails when any component fails. You observe system-level failure times. But you often cannot tell which component caused the failure (masking). Some systems are still running when testing ends (censoring). Given this incomplete data, estimate component reliability.

This is not a tutorial. It is a map of where things stand and where they are going.

Observation Functors: Composable Censoring for Series System Simulation

February 13, 2026

Last week I announced maskedcauses, the R package for estimating component reliability from masked series system failures. That post covered the three likelihood models and the path to CRAN.

This post is about what happened next: the package now supports four observation types (exact, right-censored, left-censored, and interval-censored) via composable observation functors. Along the way, I wrote four vignettes, removed the md.tools dependency, and developed a verification methodology for keeping prose honest about simulation results.

maskedcauses: Maximum Likelihood Estimation for Masked Series System Failures

February 5, 2026

Note (February 2026): This package has been renamed from likelihood.model.series.md to maskedcauses.

Two days ago, I submitted likelihood.model to CRAN, the foundation package for composable statistical inference. Next in line: maskedcauses, which implements maximum likelihood estimation for series systems where component failure causes are masked.

This package is the practical result of my master’s thesis work. Three years of theoretical development, now packaged for anyone analyzing masked failure data.

The Problem: Masked Component Failures

A series system fails when any of its $m$ components fails. In reliability testing, you observe the system fail at time $t$, but two layers of uncertainty obscure the full picture:

Right-censoring: Some systems are still running when testing ends. You know they survived at least until time $\tau$, but not how much longer they would have lasted.
Masked cause of failure: When a system fails, you often can’t identify which component caused it. Diagnostic tests might narrow it down to a candidate set of possible causes, but the true failure component remains ambiguous.

This happens constantly in practice. Electronic systems fail with only board-level diagnostics. Industrial machinery fails without root-cause teardown. Medical devices fail with symptoms pointing to multiple possible subsystems.

The question: given this incomplete information, can you still estimate the lifetime distribution of each component?

The Package: Three Likelihood Models

maskedcauses provides three models with different complexity-accuracy tradeoffs:

Model	Parameters	Use Case
`exp_series_md_c1_c2_c3`	$m$ rates $(\lambda_1, \ldots, \lambda_m)$	Memoryless components (constant failure rate)
`wei_series_md_c1_c2_c3`	$2m$ params $(k_1, \beta_1, \ldots, k_m, \beta_m)$	Weibull with per-component shapes
`wei_series_homogeneous_md_c1_c2_c3`	$m+1$ params $(k, \beta_1, \ldots, \beta_m)$	Weibull with shared shape parameter

Each model implements the full inference stack: loglik(), score(), hess_loglik(), rdata(), and assumptions().

The C1-C2-C3 Conditions

The models assume three conditions that simplify the likelihood:

C1: The failed component is in the candidate set with probability 1
C2: Given the failed component is in the candidate set, masking probability is uniform across candidates
C3: Masking probabilities are independent of system parameters $\theta$

Under these conditions, the masking mechanism factors out of the likelihood. You can estimate component parameters without modeling the diagnostic process itself. That’s why the package name includes “c1_c2_c3”.

symlik: Symbolic Likelihood Models in Python

December 16, 2025

symlik is a Python library for symbolic likelihood models. Write your log-likelihood as a symbolic expression, and it derives everything needed for inference.

The Problem

Traditional statistical computing gives you two choices:

Manual derivation. Work out score functions and information matrices by hand, then implement them. Error-prone, tedious.
Numerical approximation. Use finite differences. Unstable, slow, no symbolic form to inspect.

The Approach

symlik takes a third path: symbolic differentiation. Define the model once, get exact derivatives automatically.

from symlik.distributions import exponential

model = exponential()
data = {'x': [1.2, 0.8, 2.1, 1.5]}

mle, _ = model.mle(data=data, init={'lambda': 1.0})
se = model.se(mle, data)

print(f"Rate: {mle['lambda']:.3f} +/- {se['lambda']:.3f}")
# Rate: 0.714 +/- 0.357

Behind the scenes, symlik:

Symbolically differentiates the log-likelihood to get the score function
Differentiates again for the Hessian
Computes Fisher information from the Hessian
Derives standard errors from the inverse information matrix

All exact. No numerical approximation.

Custom Models

The real power is defining custom models using s-expressions:

from symlik import LikelihoodModel

# Exponential: l(lambda) = sum[log(lambda) - lambda*x_i]
log_lik = ['sum', 'i', ['len', 'x'],
           ['+', ['log', 'lambda'],
            ['*', -1, ['*', 'lambda', ['@', 'x', 'i']]]]]

model = LikelihoodModel(log_lik, params=['lambda'])

# Symbolic derivatives available
score = model.score()       # Gradient
hess = model.hessian()      # Hessian matrix
info = model.information()  # Fisher information

You define the log-likelihood once as a symbolic expression. symlik computes the rest.

Heterogeneous Data

One of symlik’s strengths is handling mixed observation types, which is exactly what you need for reliability analysis with censored data:

from symlik import ContributionModel
from symlik.contributions import complete_exponential, right_censored_exponential

model = ContributionModel(
    params=["lambda"],
    type_column="status",
    contributions={
        "observed": complete_exponential(),
        "censored": right_censored_exponential(),
    }
)

data = {
    "status": ["observed", "censored", "observed", "observed", "censored"],
    "t": [1.2, 3.0, 0.8, 2.1, 4.5],
}

Each observation type contributes differently to the likelihood. symlik handles the bookkeeping.

Connection to Research

symlik is the Python successor to my R package likelihood.model. It implements the theoretical framework from my thesis work on likelihood-based inference for series systems.

The Weibull Series Model Selection paper shows applications to reliability engineering, the kind of complex likelihood that benefits from symbolic treatment.

Powered by rerum

symlik uses rerum for symbolic differentiation. rerum is a pattern matching and term rewriting library that handles the calculus. The separation means you can use rerum for other symbolic computation tasks beyond likelihood models.

Installation

Available on PyPI:

pip install symlik

Documentation at queelius.github.io/symlik.

See the project page for more details.

Closed-Form Results for Masked Exponential Series Systems

December 2, 2025

In a series system, the system fails when any component fails. You observe the system failure time $t$ and a candidate set $C \subseteq \lbrace 1,2,\ldots,m\rbrace$ of components that might have caused the failure. But you do not know which component in $C$ actually failed. This is masked failure data.

The standard approach is numerical optimization of the likelihood. This paper shows that for exponential component lifetimes, everything has a closed form.

Closed-Form Fisher Information

For exponential masked data with arbitrary masking patterns:

$$I_{ij}(\boldsymbol{\lambda}) = n \cdot \sum_{A \ni i,j} \frac{\hat{\omega}_A}{(\sum_{k \in A} \lambda_k)^2}$$

where $\hat{\omega}_A$ is the observed frequency of candidate set $A$. You can compute asymptotic variances directly, check identifiability before running any estimation, and analyze optimization stability. All without fitting a model first.

Sufficient Statistics

The mean system lifetime and the candidate set frequency vector are sufficient statistics. That reduces an entire dataset to $1 + \binom{m}{w}$ numbers, where $w$ is the masking width.

This is a real simplification. All the statistical information in your data is captured by two things: how often each candidate set appears, and what the average failure time is. Nothing else matters for inference.

Closed-Form MLE for Three Components

For $m=3$ components with pairwise masking ($w=2$), the MLE has an explicit closed-form solution:

$$\hat{\lambda}_j = \frac{\sum_{A \ni j} \hat{\omega}_A}{\bar{t} \cdot n}$$

No numerical optimization. No iterative algorithms. Just plug in your sufficient statistics.

The $w=2$ case is the interesting one. $w=1$ means no masking (you know exactly which component failed). $w=m$ means complete masking (the candidate set is always everything, so you have no diagnostic information). $w=2$ is the simplest case where masking actually matters, and it is the one where closed-form solutions exist.

Asymptotic Theory

The MLE follows:

$$\sqrt{n}(\hat{\boldsymbol{\lambda}}_n - \boldsymbol{\lambda}^\star) \xrightarrow{d} \mathcal{N}(\mathbf{0}, \mathcal{I}^{-1}(\boldsymbol{\lambda}^\star))$$

with explicit Wald-type confidence intervals using the closed-form Fisher information. So you get point estimates and uncertainty quantification, all analytically.

Why Exponential?

The exponential assumption is not just for tractability, though it helps. Constant hazard rate models systems subject to random external shocks. The memoryless property simplifies the likelihood structure. And exponential is the foundation for generalization to Weibull and other distributions.

likelihood.model: Composable Likelihood Models in R

June 30, 2022

Most R packages hardcode specific likelihood models. likelihood.model takes a different approach. Likelihoods are first-class objects that compose, and the framework is generic enough to work with any distribution.

The Interface

A likelihood model is anything implementing these generic methods:

loglik(model, data, params) – log-likelihood
score(model, data, params) – score function (gradient)
hessian(model, data, params) – observed information matrix

That is the interface. If your model implements these three methods, it plugs into the entire MLE stack: optimization, confidence intervals, hypothesis testing, model selection. You do not couple to specific distributions.

Likelihood Contributions

The key class is likelihood_contr_model, a likelihood built from independent contributions:

# Different observation types get different likelihood contributions
model <- likelihood_contr_model(
  exact = normal_contrib(),
  right_censored = censored_contrib()
)

This handles heterogeneous data in a unified framework. You can mix exact observations, right-censored observations, truncated observations, and different distribution families within one model. Each observation type gets its own likelihood contribution, and they combine additively in log-space.

Why This Design

The i.i.d. assumption decomposes a joint likelihood into additive log-likelihood contributions. That is how MLE actually works. likelihood.model makes this decomposition explicit and compositional.

Likelihood models are objects you manipulate, not function calls buried inside a fitting routine. You can build complex models from simple, independent pieces. You can swap in different contribution types without rewriting the rest of your code. And because the interface is generic, it works with algebraic.mle for fitting, hypothesize for testing, and any optimization backend that speaks the same protocol.

This is the same compositional philosophy as my thesis work on masked failure data. Series systems with masked causes have multiple observation types (masked vs. unmasked, different candidate sets) that each contribute differently to the likelihood. likelihood.model handles that naturally.

R package – MIT licensed – Documentation – GitHub

algebraic.mle: MLEs as Algebraic Objects

May 15, 2021

Maximum likelihood estimators have rich mathematical structure. They are consistent, asymptotically normal, efficient. algebraic.mle exposes this structure through an algebra where MLEs are objects you compose, transform, and query.

The Abstraction

An MLE is not just a vector of parameter estimates. It is a statistical object that carries point estimates $\hat{\theta}$, the Fisher information matrix $I(\hat{\theta})$, the variance-covariance matrix $I^{-1}(\hat{\theta})$, Wald-type confidence intervals from asymptotic normality, the log-likelihood value, and convergence diagnostics.

The package wraps all of this in a consistent interface:

library(algebraic.mle)

fit <- mle(likelihood_model, data)
coef(fit)           # Parameter estimates
vcov(fit)           # Variance-covariance matrix
confint(fit)        # Confidence intervals
logLik(fit)         # Log-likelihood
aic(fit)            # Model selection

Composition

The real point is that MLEs compose. Independent models combine:

fit1 <- mle(model1, data1)
fit2 <- mle(model2, data2)
combined <- fit1 + fit2  # Joint likelihood

The package handles the algebra. Joint log-likelihood, block-diagonal Fisher information, everything propagates correctly. This works because likelihoods from independent data sources multiply, and multiplication of likelihoods is addition of log-likelihoods. That is a monoid. The package enforces it.

The Ecosystem

algebraic.mle is the foundation for a family of packages:

Package	Purpose
likelihood.model	Compositional likelihood specification
maskedcauses	Masked failure data in series systems
mdrelax	Relaxed masking conditions
algebraic.dist	Distributions as algebraic objects
flexhaz	Dynamic failure rate distributions
hypothesize	Likelihood ratio tests on MLEs
nabla	Numerical optimization backends

The typical workflow:

Define distributions with algebraic.dist
Specify likelihood contributions with likelihood.model
Fit the model and get an mle object from algebraic.mle
Query statistical properties: confidence intervals, hypothesis tests, model selection

For series systems with masked data:

library(maskedcauses)
library(algebraic.mle)

# Specify masking model (C1-C2-C3 conditions)
model <- md_likelihood_model(components = 3, masking = "bernoulli")

# Fit -> returns algebraic.mle object
fit <- md_mle_exp_series_C1_C2_C3(masked_data)

# All the standard MLE methods work
confint(fit)
vcov(fit)
aic(fit)

Theory

The asymptotic properties that algebraic.mle exploits come from classical MLE theory:

$$\sqrt{n}(\hat{\theta}_n - \theta^{\ast}) \xrightarrow{d} \mathcal{N}(0, I^{-1}(\theta^{\ast}))$$

The expo-masked-fim paper derives closed-form Fisher information for exponential series systems. That is exactly what algebraic.mle uses internally for variance estimation in that case.

For more complex models (Weibull, relaxed masking conditions), we compute Fisher information numerically via observed information:

$$\hat{I}(\hat{\theta}) = -\frac{\partial^2 \ell}{\partial \theta \partial \theta^T}\bigg|_{\theta=\hat{\theta}}$$

Design Principles

Separation of concerns. The likelihood specification (likelihood.model) is independent of the fitting algorithm (nabla) and the result type (algebraic.mle). You can swap optimizers without changing downstream code.

Masked Failure Data: Looking Back, Looking Forward

February 18, 2026

I have been working on the same statistical problem since 2020. I am now a PhD student in CS. The problem has not changed, but my understanding of it has, and the tools I have built around it look nothing like what I started with.

The problem: a series system fails when any component fails. You observe system-level failure times. But you often cannot tell which component caused the failure (masking). Some systems are still running when testing ends (censoring). Given this incomplete data, estimate component reliability.

This is not a tutorial. It is a map of where things stand and where they are going.

Observation Functors: Composable Censoring for Series System Simulation

February 13, 2026

Last week I announced maskedcauses, the R package for estimating component reliability from masked series system failures. That post covered the three likelihood models and the path to CRAN.

This post is about what happened next: the package now supports four observation types (exact, right-censored, left-censored, and interval-censored) via composable observation functors. Along the way, I wrote four vignettes, removed the md.tools dependency, and developed a verification methodology for keeping prose honest about simulation results.

maskedcauses: Maximum Likelihood Estimation for Masked Series System Failures

February 5, 2026

Note (February 2026): This package has been renamed from likelihood.model.series.md to maskedcauses.

Two days ago, I submitted likelihood.model to CRAN, the foundation package for composable statistical inference. Next in line: maskedcauses, which implements maximum likelihood estimation for series systems where component failure causes are masked.

This package is the practical result of my master’s thesis work. Three years of theoretical development, now packaged for anyone analyzing masked failure data.

The Problem: Masked Component Failures

A series system fails when any of its $m$ components fails. In reliability testing, you observe the system fail at time $t$, but two layers of uncertainty obscure the full picture:

Right-censoring: Some systems are still running when testing ends. You know they survived at least until time $\tau$, but not how much longer they would have lasted.
Masked cause of failure: When a system fails, you often can’t identify which component caused it. Diagnostic tests might narrow it down to a candidate set of possible causes, but the true failure component remains ambiguous.

This happens constantly in practice. Electronic systems fail with only board-level diagnostics. Industrial machinery fails without root-cause teardown. Medical devices fail with symptoms pointing to multiple possible subsystems.

The question: given this incomplete information, can you still estimate the lifetime distribution of each component?

The Package: Three Likelihood Models

maskedcauses provides three models with different complexity-accuracy tradeoffs:

Model	Parameters	Use Case
`exp_series_md_c1_c2_c3`	$m$ rates $(\lambda_1, \ldots, \lambda_m)$	Memoryless components (constant failure rate)
`wei_series_md_c1_c2_c3`	$2m$ params $(k_1, \beta_1, \ldots, k_m, \beta_m)$	Weibull with per-component shapes
`wei_series_homogeneous_md_c1_c2_c3`	$m+1$ params $(k, \beta_1, \ldots, \beta_m)$	Weibull with shared shape parameter

Each model implements the full inference stack: loglik(), score(), hess_loglik(), rdata(), and assumptions().

The C1-C2-C3 Conditions

The models assume three conditions that simplify the likelihood:

C1: The failed component is in the candidate set with probability 1
C2: Given the failed component is in the candidate set, masking probability is uniform across candidates
C3: Masking probabilities are independent of system parameters $\theta$

Under these conditions, the masking mechanism factors out of the likelihood. You can estimate component parameters without modeling the diagnostic process itself. That’s why the package name includes “c1_c2_c3”.

Weibull Distributions: From Reliability Theory to My Own Survival Curve

April 18, 2022

The Weibull distribution models time-to-failure. In reliability engineering, that means component lifetimes. In medicine, it means survival times. I have been working with Weibull models for my thesis on series system reliability. Then I got diagnosed with cancer, and now every time I work with survival curves, I am looking at mathematical abstractions of something very concrete: how long until failure?

The Mathematics

The Weibull CDF:

F(t) = 1 - exp(-(t/λ)^k)

Two parameters:

λ: scale (characteristic lifetime)
k: shape (how failure rate changes over time)

The shape parameter k tells you the whole story:

k < 1: Decreasing hazard. If you survive early on, your risk goes down. This is the infant mortality pattern.

k = 1: Constant hazard. Memoryless. This is just the exponential distribution.

k > 1: Increasing hazard. Things wear out.

The Hazard Function

The hazard function is what makes Weibull useful for survival analysis:

h(t) = (k/λ)(t/λ)^(k-1)

This is the instantaneous failure rate: given that you have survived to time t, what is the probability you fail in the next instant?

For cancer, this is the number that matters. Some cancers have increasing hazard (the longer you have it, the worse things get). Others have decreasing hazard after initial treatment, meaning if you make it past the critical period, prognosis improves. Knowing which pattern applies to your disease changes how you think about time.

Personal Context

When you study survival analysis academically, it is abstract. When you are living it, every curve is personal.

I look at Kaplan-Meier plots and see myself somewhere on that curve. I work with hazard functions and think: is my k > 1 or k < 1? Am I in the wearing-out regime or the if-you-make-it-past-this-it-gets-better regime?

The math does not change. But the meaning does.

The Irony

I chose reliability engineering for my thesis before the cancer diagnosis. I was studying component failures in series systems, where if any one part fails, the whole system fails.

Then I became a series system. Organs, treatment response, immune function. All have to work. Failure of any one is catastrophic.

The mathematics I was studying abstractly became uncomfortably literal.

Reliability Analysis and the Problem of Censored Data

August 14, 2019

One of the most interesting statistical problems I have encountered is reliability analysis with censored data: situations where you know something didn’t fail, but not when it will fail.

The Censoring Problem

Imagine testing light bulbs. You run them for 1000 hours. Some fail during the test. Others are still working when you stop.

For the survivors, you know:

They lasted at least 1000 hours
You do not know their actual lifetime

This is right censoring. The true value lies somewhere to the right of your observation. You have a lower bound, not a measurement.

Why This Matters

Censored data is everywhere:

Medical studies (patients still alive at study end)
Engineering tests (components that have not failed)
Customer retention (users still active)

The naive responses are both wrong. Ignoring censored observations wastes information. Treating them as failures introduces bias. You need a framework that uses the partial information you actually have.

Maximum Likelihood to the Rescue

The solution is maximum likelihood estimation with likelihood contributions that account for censoring:

Failure observations contribute the probability density $f(t)$. You observed the exact failure time, so you know the probability of failing at that time.
Censored observations contribute the survival probability $S(t)$. You know the unit survived to time $t$, so its contribution is the probability of surviving at least that long.

The likelihood for the whole sample is:

$$L = \prod_{i: \text{failed}} f(t_i) \prod_{j: \text{censored}} S(t_j)$$

This lets you extract information from both failed and surviving units. The censored observations pull the estimated reliability upward; the failures pull it downward. Maximum likelihood balances them.

Series Systems Complexity

It gets more interesting with series systems, systems that fail when any component fails. If you observe system failure but do not know which component caused it, you have masked failure data.

This is the problem I am most interested in: extracting component-level reliability from system-level failures when the cause is ambiguous. The masking adds a latent variable, and the likelihood becomes a mixture. You can handle it with EM algorithms or direct optimization, but the combinatorics grow quickly with system size.

This work is laying groundwork for what will become a major focus of my mathematical statistics degree.

Masked Failure Data: Looking Back, Looking Forward

February 18, 2026

I have been working on the same statistical problem since 2020. I am now a PhD student in CS. The problem has not changed, but my understanding of it has, and the tools I have built around it look nothing like what I started with.

The problem: a series system fails when any component fails. You observe system-level failure times. But you often cannot tell which component caused the failure (masking). Some systems are still running when testing ends (censoring). Given this incomplete data, estimate component reliability.

This is not a tutorial. It is a map of where things stand and where they are going.

dapple: Terminal Graphics, Composed

February 15, 2026

I live in the terminal. Most of my tools are CLIs. When I want to see something visual (an image, a plot, a table of results), I do not want to leave the terminal to see it.

Terminal graphics tools exist, but they are fragmented. One library does braille characters. Another does quadrant blocks. A third handles sixel. Each has its own API, its own conventions, its own way of thinking about the same problem.

dapple unifies them. One Canvas class, seven pluggable renderers, and eleven CLI tools built on top. The core depends only on numpy.

Observation Functors: Composable Censoring for Series System Simulation

February 13, 2026

Last week I announced maskedcauses, the R package for estimating component reliability from masked series system failures. That post covered the three likelihood models and the path to CRAN.

This post is about what happened next: the package now supports four observation types (exact, right-censored, left-censored, and interval-censored) via composable observation functors. Along the way, I wrote four vignettes, removed the md.tools dependency, and developed a verification methodology for keeping prose honest about simulation results.

Numerical Methods for Maximum Likelihood Estimation

February 5, 2023

Maximum likelihood estimation sounds clean on paper: write down the likelihood, take derivatives, set them to zero, solve. In practice, the “solve” step is where things get interesting. Most likelihoods don’t have closed-form solutions, so you need numerical methods, and the choice of method matters more than most textbooks let on.

This write-up covers the numerical side of MLE: the optimization algorithms, convergence issues, and computational tricks that make the difference between getting an answer and getting the right answer. The full treatment is in the PDF.

View PDF

For more on the statistical and mathematical context, see my research page and publications.

pagevault: Hiding an Encryption Platform Inside HTML

February 13, 2026

HTML is an encryption container format. That sounds wrong, but think about what an HTML file can hold: arbitrary data in script tags or data attributes, a full programming runtime via JavaScript, and a rendering engine (the browser) on every device on the planet. If you embed encrypted data and the code to decrypt it, the result is a file that looks inert until someone types the right password.

pagevault takes this idea seriously. It encrypts files, documents, images, entire websites, into self-contained HTML pages that decrypt in the browser. No backend. No JavaScript crypto libraries. The browser already has AES-256-GCM built in via the Web Crypto API. pagevault just has to match the parameters exactly on the Python side and embed the right 200 lines of JavaScript.

The output is a single .html file. You can email it, put it on a USB stick, host it on GitHub Pages, or double-click it on your desktop. It doesn’t phone home, it doesn’t load CDNs, it doesn’t need anything except a browser.

Long Echo Comes Alive: From Philosophy to Orchestration

January 20, 2026

A year ago, I wrote about Long Echo as a philosophy for preserving AI conversations across decades. The key insight was graceful degradation: design archives that work progressively even as technology disappears.

That philosophy has become a tool.

From Philosophy to Tool

The original Long Echo was intentionally not code. It was a set of principles documented in CTK’s repository. The hard problems of conversation parsing, storage, and search were already solved by toolkits like CTK, BTK, and EBK.

What was missing was the unification layer. Each toolkit exports its own ECHO-compliant archive, but combining them into a single browsable experience required manual work. That’s what longecho now handles.

What longecho Does Now

longecho is a CLI tool with five capabilities:

longecho check ~/my-data/       # Validate ECHO compliance
longecho discover ~/            # Find ECHO sources
longecho search ~/ "query"      # Search README descriptions
longecho build ~/my-archive/    # Generate static site
longecho serve ~/my-archive/    # Preview locally via HTTP

The check, discover, and search commands existed in the original specification. What’s new is build and serve, the orchestration layer.

Building a Unified Site

The build command takes a hierarchical archive and generates a static site:

longecho build ~/my-archive/

This produces a site/ directory with:

An index page linking to all sub-archives
Navigation between sources
Automatic linking to existing sub-site builds

If a sub-archive already has its own site/ directory (like CTK’s exports), longecho links to it. Use --bundle to copy everything into a portable, self-contained site.

Live Preview

The serve command provides local HTTP preview:

longecho serve ~/my-archive/ --port 8000

It builds the site if needed, then serves it for browser viewing.

The Manifest

ECHO compliance requires only a README. But for machine-readable metadata, longecho supports an optional manifest:

version: "1.0"
name: "Alex's Data Archive"
description: "Personal data archive"
sources:
  - path: "conversations/"
    order: 1
  - path: "bookmarks/"
    order: 2
  - path: "ebooks/"
    order: 3

The manifest enables:

Explicit ordering of sources in generated sites
Selective inclusion via the browsable flag
Override names for cleaner presentation
Icon hints for UI presentation

Without a manifest, longecho auto-discovers sub-archives by looking for directories with README files. The manifest provides explicit control when you need it.

Long Echo: Photos and Mail

January 19, 2026

The Long Echo toolkit now covers conversations, bookmarks, and ebooks. But two of the most emotionally significant categories of personal data remain: photos and mail.

Both share a troubling pattern: scattered across devices and cloud services, organized by date rather than meaning, vulnerable to platform disappearance. They deserve better.

The Expanding Ecosystem

Tool	Domain	Status
ctk	AI Conversations	stable
btk	Bookmarks & Media	stable
ebk	eBooks	stable
repoindex	Git Repositories	stable
ptk	Photos	incubating
mtk	Mail	incubating

The orchestration layer, longecho, ties these together into a unified personal archive.

PTK: Photo Toolkit

Photos are the most emotionally valuable digital artifacts most people have. They’re also among the worst-managed.

The Problem

Your photo library is probably:

Scattered: Phone, old phones, cloud services, camera imports, messaging app saves
Organized by date: Not by who’s in them, where they were taken, or what they mean
Cloud-dependent: Google Photos, iCloud, Amazon Photos. What happens when you switch?
Unsearchable by content: “Find photos of mom at the beach” isn’t possible
Missing context: Only you know why that blurry photo matters

The Vision

ptk provides:

Unified import from any source:

ptk import ~/Pictures/
ptk import ~/phone-backup/DCIM/
ptk import google-takeout.zip --source google-photos
ptk import icloud-export/ --source icloud

Intelligent organization by multiple dimensions:

ptk shell
ptk:/$ cd /people/mom
ptk:/people/mom$ ls
2019/  2020/  2021/  2022/  2023/  2024/

ptk:/$ cd /locations/beach
ptk:/$ cd /events/christmas-2023
ptk:/$ cd /years/2020/months/march

AI-powered features:

# Face detection and clustering
ptk faces detect --all
ptk faces cluster
ptk faces label cluster-7 "Mom"
ptk faces find "Mom"

# Scene captioning
ptk caption --all --model ollama/llava
ptk search "sunset over water"

# Semantic search
ptk ask "photos from our trip to Colorado"

Preservation guarantees:

# Verify nothing is corrupted
ptk verify --checksums

# Export to durable formats
ptk export ~/archive/photos/ --format longecho
ptk export photos.html --format html-gallery

# Original files always preserved
ptk originals list
ptk originals verify

Why SQLite?

Like the other Long Echo tools, ptk uses SQLite for metadata:

# Works even if ptk disappears
sqlite3 photos.db "
  SELECT path, caption, taken_at
  FROM photos
  WHERE caption LIKE '%birthday%'
  ORDER BY taken_at
"

The database stores metadata, face embeddings, captions, and organization. The actual photo files stay in place or are copied to a managed library, your choice.

Long Echo in Practice: 5,874 Bookmarks in a Single File

December 18, 2025

I wrote about Long Echo and the Long Echo Toolkit earlier. Here’s what it actually looks like.

View the live demo: 5,874 bookmarks in a single file

The Export

btk --db bookmarks.db export bookmarks.html \
    --format html-app \
    --query "(reachable != 0 OR reachable IS NULL)"

Result: 5,874 bookmarks in a single 4MB HTML file.

What You Get

Open it in any browser. No server. No internet. No dependencies. Just a file.

The html-app export includes:

Search: Full-text filtering across titles, URLs, descriptions, tags
Multiple views: Grid, list, table layouts
Tag sidebar: Hierarchical tag navigation
Dark mode: Toggle button
Keyboard shortcuts: Navigate without a mouse
Sorting: By date, title, visits, stars
Filtering: By starred, archived, has-content

Everything is embedded: CSS, JavaScript, all 5,874 bookmark records as JSON. One file.

Why This Matters

Graceful degradation, concretely:

Level	What Works	Requirements
1. BTK CLI	Full features, auto-tagging, content caching	Python, btk installed
2. SQLite	Direct queries, scripting	sqlite3 binary
3. HTML App	Visual browsing, search, filtering	Any browser
4. View source	Raw JSON data, greppable	Text editor

The HTML app is level 3. It works when BTK is gone, when Python is gone. Someone in 2074 can double-click the file and browse my bookmarks.

The Data Inside

View source and you’ll find:

const BOOKMARKS = [
    {
        "id": 1,
        "url": "https://example.com/article",
        "title": "Interesting Article",
        "description": "Notes about the article...",
        "tags": ["programming", "python"],
        "stars": 1,
        "created_at": "2023-05-12T14:32:00Z",
        "visited_count": 42
    },
    // ... 5,873 more
];

Plain JSON. No encoding tricks. Grep it, parse it with jq, import it into another tool. The data survives the interface.

Try It

Install BTK:

pip install bookmark-tk

Export your bookmarks:

# From browser exports
btk import bookmarks.html --format html

# To self-contained app
btk export archive.html --format html-app

You now have a permanent, searchable copy of your bookmarks that will outlive every cloud service you currently depend on.

Links

Live Demo: My Bookmarks Archive (5,874 bookmarks, 4MB)
BTK: github.com/queelius/btk
Long Echo Philosophy: Long Echo: Designing for Digital Resilience
Full Toolkit: The Long Echo Toolkit

The Long Echo Toolkit

December 16, 2025

Earlier this year I wrote about Long Echo, a philosophy for preserving AI conversations in ways that stay accessible across decades. The core idea was graceful degradation: systems that fail progressively, not catastrophically.

Since then I’ve built out three tools that apply this thinking to all personal digital content, not just conversations. Bookmarks, books, and AI chats. Together they form a system for managing the stuff you actually think with.

The Toolkit

Tool	Domain	Install
CTK	AI Conversations	`pip install conversation-tk`
BTK	Bookmarks & Media	`pip install bookmark-tk`
EBK	eBooks & Documents	`pip install ebk`

All three share a common architecture, but each is specialized for its domain.

Shared Architecture

SQLite-First Storage

Every tool uses local SQLite databases you own. No cloud dependency. Queryable with standard tools even if the CLI disappears tomorrow:

# Works even if the tools are gone
sqlite3 conversations.db "SELECT title FROM conversations WHERE title LIKE '%python%'"
sqlite3 bookmarks.db "SELECT url, title FROM bookmarks WHERE stars = 1"
sqlite3 library.db "SELECT title, author FROM books WHERE favorite = 1"

This is the whole point. The database is the artifact, not the tool.

Interactive Shells with Virtual Filesystems

Navigate your data like a Unix filesystem:

$ btk shell
btk:/$ cd tags/programming/python
btk:/tags/programming/python$ ls
3298  4095  5124  (bookmark IDs)
btk:/tags/programming/python$ cat 4095/title
Advanced Python Techniques

$ ebk shell
ebk:/$ cd authors/Knuth
ebk:/authors/Knuth$ ls
The Art of Computer Programming Vol 1
The Art of Computer Programming Vol 2

Reading Queues

Track what you’re reading, watching, or working through:

# Bookmarks
btk queue add 42 --priority high
btk queue next
btk queue progress 42 --percent 75
btk queue estimate-times  # Auto-estimate from content length

# Books
ebk queue add "Gödel, Escher, Bach"
ebk queue next
ebk queue list

LLM Integration

All three integrate with LLMs for tagging, summarization, and search:

# Auto-tag using content analysis
btk content auto-tag --all
ctk auto-tag --model ollama/llama3
ebk enrich 42  # Enhance metadata with LLM

# Natural language queries
ctk say "summarize my conversations about Rust"
btk ask "find articles about distributed systems"
ebk similar "Gödel, Escher, Bach"  # Semantic similarity

Network Analysis

Find relationships in your data:

# CTK: Conversation networks
ctk net embeddings --all
ctk net similar 42
ctk net clusters
ctk net central  # Most connected conversations
ctk net outliers  # Isolated conversations

# BTK: Bookmark graphs
btk graph build
btk graph analyze

Web Servers

Browse your archives in a web UI:

Everything is a File: Virtual Filesystems for CLI Data Tools

October 20, 2025

I had a bookmark manager. Then an ebook library manager. Then a chat history manager. Each started with the standard CRUD CLI:

btk add https://example.com --tags python,tutorial
btk list --tag python
btk search "async"
btk delete 1234

ebk import book.pdf --author "Knuth"
ebk list --author Knuth
ebk search "algorithms"

This works fine until you have 10,000+ bookmarks organized with hierarchical tags like programming/python/async, research/ml/transformers, work/clients/acme. Your ebook library has similar structure. Your exported chat conversations from Claude, ChatGPT, and Copilot are piling up.

Traditional CRUD commands become unwieldy:

btk list --tag programming/python/async/io --format json | jq '.[].title'
ebk list --category "Computer Science/Algorithms/Graph Theory" --limit 50
ctk search "machine learning" --source ChatGPT --date-from 2024-01-01

Each command requires precise arguments. Each tool has different flag conventions. You can’t navigate your data. You can only query it. And queries require knowing exactly what you’re looking for.

The insight: everything is a file

When I have thousands of source files organized in directories, I don’t run:

list-files --path /src/components/auth --extension .tsx

I run:

cd src/components/auth
ls *.tsx

The difference matters. With a filesystem, I can navigate incrementally (cd from general to specific), explore (ls to see what’s there), compose (cat file | grep pattern | wc -l), and use familiar tools (find, grep, xargs, pipes, redirection).

What if my bookmarks, ebooks, and chat histories were filesystems?

The pattern

Over the past year, I built six Python tools that all follow the same architecture:

Tool	Domain	VFS Root Structure
btk	Bookmarks	`/bookmarks/`, `/tags/`, `/recent/`, `/domains/`, `/unread/`, `/popular/`
ebk	Ebook library	`/books/`, `/authors/`, `/series/`, `/subjects/`, `/recent/`, `/unread/`
ctk	Chat conversations	`/conversations/`, `/sources/`, `/topics/`, `/starred/`, `/recent/`
ghops	Git repositories	`/repos/`, `/languages/`, `/topics/`, `/stars/`, `/recent/`
infinigram	N-gram models	`/datasets/`, `/models/`, `/corpora/`
AlgoTree	Tree structures	`/nodes/`, `/paths/`, `/subtrees/`

Each tool provides:

A stateless CLI for scripting: btk bookmark add URL, ebk import book.pdf
An interactive shell with a virtual filesystem: btk shell, ebk shell, ctk chat
POSIX-like commands: cd, ls, pwd, cat, mv, cp, rm, find, grep
Unix pipeline support: most commands output JSONL by default for piping

The interesting part is the shell.

Navigating 10,000 bookmarks

Live recording captured with asciinema. You can pause, copy text, and replay. The entire recording is 78KB of text.

Long Echo Comes Alive: From Philosophy to Orchestration

January 20, 2026

A year ago, I wrote about Long Echo as a philosophy for preserving AI conversations across decades. The key insight was graceful degradation: design archives that work progressively even as technology disappears.

That philosophy has become a tool.

From Philosophy to Tool

The original Long Echo was intentionally not code. It was a set of principles documented in CTK’s repository. The hard problems of conversation parsing, storage, and search were already solved by toolkits like CTK, BTK, and EBK.

What was missing was the unification layer. Each toolkit exports its own ECHO-compliant archive, but combining them into a single browsable experience required manual work. That’s what longecho now handles.

What longecho Does Now

longecho is a CLI tool with five capabilities:

longecho check ~/my-data/       # Validate ECHO compliance
longecho discover ~/            # Find ECHO sources
longecho search ~/ "query"      # Search README descriptions
longecho build ~/my-archive/    # Generate static site
longecho serve ~/my-archive/    # Preview locally via HTTP

The check, discover, and search commands existed in the original specification. What’s new is build and serve, the orchestration layer.

Building a Unified Site

The build command takes a hierarchical archive and generates a static site:

longecho build ~/my-archive/

This produces a site/ directory with:

An index page linking to all sub-archives
Navigation between sources
Automatic linking to existing sub-site builds

If a sub-archive already has its own site/ directory (like CTK’s exports), longecho links to it. Use --bundle to copy everything into a portable, self-contained site.

Live Preview

The serve command provides local HTTP preview:

longecho serve ~/my-archive/ --port 8000

It builds the site if needed, then serves it for browser viewing.

The Manifest

ECHO compliance requires only a README. But for machine-readable metadata, longecho supports an optional manifest:

version: "1.0"
name: "Alex's Data Archive"
description: "Personal data archive"
sources:
  - path: "conversations/"
    order: 1
  - path: "bookmarks/"
    order: 2
  - path: "ebooks/"
    order: 3

The manifest enables:

Explicit ordering of sources in generated sites
Selective inclusion via the browsable flag
Override names for cleaner presentation
Icon hints for UI presentation

Without a manifest, longecho auto-discovers sub-archives by looking for directories with README files. The manifest provides explicit control when you need it.

Long Echo: Photos and Mail

January 19, 2026

The Long Echo toolkit now covers conversations, bookmarks, and ebooks. But two of the most emotionally significant categories of personal data remain: photos and mail.

Both share a troubling pattern: scattered across devices and cloud services, organized by date rather than meaning, vulnerable to platform disappearance. They deserve better.

The Expanding Ecosystem

Tool	Domain	Status
ctk	AI Conversations	stable
btk	Bookmarks & Media	stable
ebk	eBooks	stable
repoindex	Git Repositories	stable
ptk	Photos	incubating
mtk	Mail	incubating

The orchestration layer, longecho, ties these together into a unified personal archive.

PTK: Photo Toolkit

Photos are the most emotionally valuable digital artifacts most people have. They’re also among the worst-managed.

The Problem

Your photo library is probably:

Scattered: Phone, old phones, cloud services, camera imports, messaging app saves
Organized by date: Not by who’s in them, where they were taken, or what they mean
Cloud-dependent: Google Photos, iCloud, Amazon Photos. What happens when you switch?
Unsearchable by content: “Find photos of mom at the beach” isn’t possible
Missing context: Only you know why that blurry photo matters

The Vision

ptk provides:

Unified import from any source:

ptk import ~/Pictures/
ptk import ~/phone-backup/DCIM/
ptk import google-takeout.zip --source google-photos
ptk import icloud-export/ --source icloud

Intelligent organization by multiple dimensions:

ptk shell
ptk:/$ cd /people/mom
ptk:/people/mom$ ls
2019/  2020/  2021/  2022/  2023/  2024/

ptk:/$ cd /locations/beach
ptk:/$ cd /events/christmas-2023
ptk:/$ cd /years/2020/months/march

AI-powered features:

# Face detection and clustering
ptk faces detect --all
ptk faces cluster
ptk faces label cluster-7 "Mom"
ptk faces find "Mom"

# Scene captioning
ptk caption --all --model ollama/llava
ptk search "sunset over water"

# Semantic search
ptk ask "photos from our trip to Colorado"

Preservation guarantees:

# Verify nothing is corrupted
ptk verify --checksums

# Export to durable formats
ptk export ~/archive/photos/ --format longecho
ptk export photos.html --format html-gallery

# Original files always preserved
ptk originals list
ptk originals verify

Why SQLite?

Like the other Long Echo tools, ptk uses SQLite for metadata:

# Works even if ptk disappears
sqlite3 photos.db "
  SELECT path, caption, taken_at
  FROM photos
  WHERE caption LIKE '%birthday%'
  ORDER BY taken_at
"

The database stores metadata, face embeddings, captions, and organization. The actual photo files stay in place or are copied to a managed library, your choice.

The Long Echo Toolkit

December 16, 2025

Earlier this year I wrote about Long Echo, a philosophy for preserving AI conversations in ways that stay accessible across decades. The core idea was graceful degradation: systems that fail progressively, not catastrophically.

Since then I’ve built out three tools that apply this thinking to all personal digital content, not just conversations. Bookmarks, books, and AI chats. Together they form a system for managing the stuff you actually think with.

The Toolkit

Tool	Domain	Install
CTK	AI Conversations	`pip install conversation-tk`
BTK	Bookmarks & Media	`pip install bookmark-tk`
EBK	eBooks & Documents	`pip install ebk`

All three share a common architecture, but each is specialized for its domain.

Shared Architecture

SQLite-First Storage

Every tool uses local SQLite databases you own. No cloud dependency. Queryable with standard tools even if the CLI disappears tomorrow:

# Works even if the tools are gone
sqlite3 conversations.db "SELECT title FROM conversations WHERE title LIKE '%python%'"
sqlite3 bookmarks.db "SELECT url, title FROM bookmarks WHERE stars = 1"
sqlite3 library.db "SELECT title, author FROM books WHERE favorite = 1"

This is the whole point. The database is the artifact, not the tool.

Interactive Shells with Virtual Filesystems

Navigate your data like a Unix filesystem:

$ btk shell
btk:/$ cd tags/programming/python
btk:/tags/programming/python$ ls
3298  4095  5124  (bookmark IDs)
btk:/tags/programming/python$ cat 4095/title
Advanced Python Techniques

$ ebk shell
ebk:/$ cd authors/Knuth
ebk:/authors/Knuth$ ls
The Art of Computer Programming Vol 1
The Art of Computer Programming Vol 2

Reading Queues

Track what you’re reading, watching, or working through:

# Bookmarks
btk queue add 42 --priority high
btk queue next
btk queue progress 42 --percent 75
btk queue estimate-times  # Auto-estimate from content length

# Books
ebk queue add "Gödel, Escher, Bach"
ebk queue next
ebk queue list

LLM Integration

All three integrate with LLMs for tagging, summarization, and search:

# Auto-tag using content analysis
btk content auto-tag --all
ctk auto-tag --model ollama/llama3
ebk enrich 42  # Enhance metadata with LLM

# Natural language queries
ctk say "summarize my conversations about Rust"
btk ask "find articles about distributed systems"
ebk similar "Gödel, Escher, Bach"  # Semantic similarity

Network Analysis

Find relationships in your data:

# CTK: Conversation networks
ctk net embeddings --all
ctk net similar 42
ctk net clusters
ctk net central  # Most connected conversations
ctk net outliers  # Isolated conversations

# BTK: Bookmark graphs
btk graph build
btk graph analyze

Web Servers

Browse your archives in a web UI:

Everything is a File: Virtual Filesystems for CLI Data Tools

October 20, 2025

I had a bookmark manager. Then an ebook library manager. Then a chat history manager. Each started with the standard CRUD CLI:

btk add https://example.com --tags python,tutorial
btk list --tag python
btk search "async"
btk delete 1234

ebk import book.pdf --author "Knuth"
ebk list --author Knuth
ebk search "algorithms"

This works fine until you have 10,000+ bookmarks organized with hierarchical tags like programming/python/async, research/ml/transformers, work/clients/acme. Your ebook library has similar structure. Your exported chat conversations from Claude, ChatGPT, and Copilot are piling up.

Traditional CRUD commands become unwieldy:

btk list --tag programming/python/async/io --format json | jq '.[].title'
ebk list --category "Computer Science/Algorithms/Graph Theory" --limit 50
ctk search "machine learning" --source ChatGPT --date-from 2024-01-01

Each command requires precise arguments. Each tool has different flag conventions. You can’t navigate your data. You can only query it. And queries require knowing exactly what you’re looking for.

The insight: everything is a file

When I have thousands of source files organized in directories, I don’t run:

list-files --path /src/components/auth --extension .tsx

I run:

cd src/components/auth
ls *.tsx

The difference matters. With a filesystem, I can navigate incrementally (cd from general to specific), explore (ls to see what’s there), compose (cat file | grep pattern | wc -l), and use familiar tools (find, grep, xargs, pipes, redirection).

What if my bookmarks, ebooks, and chat histories were filesystems?

The pattern

Over the past year, I built six Python tools that all follow the same architecture:

Tool	Domain	VFS Root Structure
btk	Bookmarks	`/bookmarks/`, `/tags/`, `/recent/`, `/domains/`, `/unread/`, `/popular/`
ebk	Ebook library	`/books/`, `/authors/`, `/series/`, `/subjects/`, `/recent/`, `/unread/`
ctk	Chat conversations	`/conversations/`, `/sources/`, `/topics/`, `/starred/`, `/recent/`
ghops	Git repositories	`/repos/`, `/languages/`, `/topics/`, `/stars/`, `/recent/`
infinigram	N-gram models	`/datasets/`, `/models/`, `/corpora/`
AlgoTree	Tree structures	`/nodes/`, `/paths/`, `/subtrees/`

Each tool provides:

A stateless CLI for scripting: btk bookmark add URL, ebk import book.pdf
An interactive shell with a virtual filesystem: btk shell, ebk shell, ctk chat
POSIX-like commands: cd, ls, pwd, cat, mv, cp, rm, find, grep
Unix pipeline support: most commands output JSONL by default for piping

The interesting part is the shell.

Navigating 10,000 bookmarks

Live recording captured with asciinema. You can pause, copy text, and replay. The entire recording is 78KB of text.

CTK: Conversation Toolkit

October 9, 2025

CTK manages AI conversations across platforms. Import from ChatGPT, Claude, Copilot, Gemini. Store locally in SQLite. Search, tag, export. Keep everything.

The Problem

If you use multiple AI assistants, your conversations are scattered across incompatible platforms, unsearchable, and dependent on companies that may not exist in 20 years. ChatGPT lives in OpenAI’s web app. Claude is siloed in Anthropic’s interface. Copilot chat history is buried in VS Code storage.

You can’t search across them. You can’t back them up in a unified format. You can’t own them.

The Key Insight: Conversations Are Trees

Most tools treat conversations as linear sequences. They’re not. ChatGPT’s “regenerate” feature creates branches. Claude supports conversation forking. Even a simple “let me try that again” is a tree operation.

User: "Write a poem"
  ├── Assistant (v1): "Roses are red..."
  └── Assistant (v2): "In fields of gold..."  [regenerated]
      └── User: "Make it longer"
          └── Assistant: "In fields of gold, where sunshine..."

CTK stores all conversations as trees. Linear chats are single-path trees. Branching conversations preserve every path. This means you never lose a regeneration, and you can export any path you want.

What It Does

# Import from any platform
ctk import chatgpt_export.json --db my_chats.db
ctk import claude_export.json --db my_chats.db --format anthropic
ctk import ~/.vscode/workspaceStorage --db my_chats.db --format copilot

# Search across everything
ctk search "python async" --db my_chats.db

# Natural language queries via LLM tool calling
ctk say "find conversations about distributed systems" --db my_chats.db

# Interactive TUI for browsing and chatting
ctk chat --db my_chats.db

# Export for fine-tuning, archival, or publishing
ctk export training.jsonl --db my_chats.db --format jsonl
ctk export archive.html --db my_chats.db --format html5
ctk export archive/ --db my_chats.db --format markdown

Plugin Architecture

Adding a new provider is one file. Implement ImporterPlugin, drop it in the integrations folder, done. Auto-discovered at runtime. No registry, no config.

Currently supported: OpenAI/ChatGPT (full tree), Anthropic/Claude (full tree), GitHub Copilot, Google Gemini, generic JSONL, coding agents (Cursor, Windsurf).

Privacy

100% local. No telemetry. Optional sanitization strips API keys, passwords, and personal identifiers before export.

ctk export clean_export.jsonl --db chats.db --format jsonl --sanitize

HTML5 Export

The HTML5 exporter produces a self-contained file with embedded search, tree visualization, and dark mode. No server, no internet, no dependencies. The file works offline in any browser, including continuing conversations with a local LLM directly in the exported HTML.

Long Echo: Designing for Digital Resilience Across Decades

January 6, 2025

Update (January 2026): Since this post was written, longecho has evolved from specification to implementation. See Long Echo Comes Alive for the current state including build, serve, and manifest features.

Not Resurrection. Not Immortality.

Just love that still responds.

That’s the idea behind Long Echo. It’s a project about preserving conversations with AI assistants so they stay accessible and meaningful across decades. Not digital ghosts that autonomously post to social media. Not trying to resurrect anyone. Just making sure the knowledge and care captured in these conversations can still be found, searched, and used when the original software is gone.

The Problem

We’re having important conversations with AI assistants:

Teaching moments with students
Advice we’d give our children
Technical problems we’ve solved
Creative work we don’t want to lose
Personal growth tracked over years

But these conversations are trapped in proprietary formats, scattered across platforms (ChatGPT, Claude, Gemini, Copilot), and dependent on companies that may not exist in 50 years.

What happens when you want to find that debugging advice from 2024? What if your children want to search your conversations after you’re gone? What if the company shuts down their API?

The Philosophy: Graceful Degradation

The core idea is graceful degradation, designing systems that fail progressively, not catastrophically:

Level 1: Full functionality  → CTK with semantic search, RAG, beautiful TUI
Level 2: Database queries    → SQLite direct queries (CTK gone, SQLite remains)
Level 3: File search         → grep through JSONL files (just text tools)
Level 4: Human reading       → Markdown, HTML (readable without any tools)
Level 5: Ultimate fallback   → Plain text in notepad

Each level still works even if everything above it is gone.

The Discovery: CTK Already Solved This

I started building Long Echo as a separate system. I designed multi-format importers, search with fallbacks, memory extraction pipelines. Complex architecture diagrams. Deployment strategies. The whole thing.

Then I realized that CTK (Conversation Toolkit), which I had built earlier, already solved all the hard problems.

CTK already provides:

Import from all platforms (unified API)
Conversation trees (handles branching, regenerations)
SQLite storage (local, queryable, persistent)
Multiple export formats (JSONL, Markdown, HTML, JSON)
Full-text search + LLM-powered queries
Complex network RAG (coming soon)
Terminal UI

Everything I was designing was already built. By me. Earlier.

This wasn’t failure. I’d already built the foundation without realizing it. The hard problems (conversation parsing, unified representation, search, storage) were handled. What Long Echo needed wasn’t more code. It needed a philosophy.

Discovering ChatGPT: Reconnecting with AI Research

December 8, 2022

I finally noticed ChatGPT this week. Everyone’s been talking about it, but I was buried in cancer treatment, chemo recovery, surgery prep, and thesis work on Weibull distributions.

When I finally tried it, my reaction wasn’t surprise at the technology itself.

It was: “This makes sense. The pieces were all there.”

Why I Missed It

GPT-3 came out in 2020. I was dealing with:

Stage 3 cancer diagnosis
Chemotherapy
Mathematical statistics coursework
Thesis research on masked failure data
Surgery and recovery

I had no attention left for tracking ML developments. The world moved on. I was focused on survival.

The Theoretical Foundation

I’ve been interested in Marcus Hutter and Ray Solomonoff’s work for years.

Solomonoff induction: optimal prediction is compression. Intelligence is sequence prediction. The smallest program that generates your observations is the best predictor of what comes next.

Hutter’s AIXI: intelligence = optimal compression-based prediction with resource bounds.

During my CS master’s, I proposed working on sequence prediction as a thesis topic, inspired by Solomonoff. The professor wasn’t interested. I ended up doing encrypted search instead.

But the intuition stayed: prediction ~ compression ~ intelligence.

The Bitter Lesson

Rich Sutton’s “The Bitter Lesson” laid it out: scaling compute and data beats clever algorithms.

The lesson from 70 years of AI research: general methods that use computation win. Hand-crafted features lose. Search and learning scale. Everything else doesn’t.

I read that paper and found it compelling. But there’s a difference between understanding theory and watching it play out at scale. OpenAI was actually doing the scaling while I was working on other problems.

ImageNet Should Have Been the Signal

In retrospect, ImageNet being solved by deep neural networks in 2012 was the canary. A simple architecture (CNNs), massive data, lots of compute, and you get superhuman image classification.

That was the proof: scale works.

GPT is the same pattern:

Simple architecture (transformers)
Massive data (internet-scale text)
Enormous compute (thousands of GPUs)

Result: something that looks disturbingly intelligent.

Connecting the Dots

The theoretical framework was there:

Solomonoff: intelligence is compression
Hutter: optimal prediction with bounded resources
Sutton: scaling beats cleverness

The empirical evidence accumulated:

Long Echo Comes Alive: From Philosophy to Orchestration

January 20, 2026

A year ago, I wrote about Long Echo as a philosophy for preserving AI conversations across decades. The key insight was graceful degradation: design archives that work progressively even as technology disappears.

That philosophy has become a tool.

From Philosophy to Tool

The original Long Echo was intentionally not code. It was a set of principles documented in CTK’s repository. The hard problems of conversation parsing, storage, and search were already solved by toolkits like CTK, BTK, and EBK.

What was missing was the unification layer. Each toolkit exports its own ECHO-compliant archive, but combining them into a single browsable experience required manual work. That’s what longecho now handles.

What longecho Does Now

longecho is a CLI tool with five capabilities:

longecho check ~/my-data/       # Validate ECHO compliance
longecho discover ~/            # Find ECHO sources
longecho search ~/ "query"      # Search README descriptions
longecho build ~/my-archive/    # Generate static site
longecho serve ~/my-archive/    # Preview locally via HTTP

The check, discover, and search commands existed in the original specification. What’s new is build and serve, the orchestration layer.

Building a Unified Site

The build command takes a hierarchical archive and generates a static site:

longecho build ~/my-archive/

This produces a site/ directory with:

An index page linking to all sub-archives
Navigation between sources
Automatic linking to existing sub-site builds

If a sub-archive already has its own site/ directory (like CTK’s exports), longecho links to it. Use --bundle to copy everything into a portable, self-contained site.

Live Preview

The serve command provides local HTTP preview:

longecho serve ~/my-archive/ --port 8000

It builds the site if needed, then serves it for browser viewing.

The Manifest

ECHO compliance requires only a README. But for machine-readable metadata, longecho supports an optional manifest:

version: "1.0"
name: "Alex's Data Archive"
description: "Personal data archive"
sources:
  - path: "conversations/"
    order: 1
  - path: "bookmarks/"
    order: 2
  - path: "ebooks/"
    order: 3

The manifest enables:

Explicit ordering of sources in generated sites
Selective inclusion via the browsable flag
Override names for cleaner presentation
Icon hints for UI presentation

Without a manifest, longecho auto-discovers sub-archives by looking for directories with README files. The manifest provides explicit control when you need it.

Long Echo: Photos and Mail

January 19, 2026

The Long Echo toolkit now covers conversations, bookmarks, and ebooks. But two of the most emotionally significant categories of personal data remain: photos and mail.

Both share a troubling pattern: scattered across devices and cloud services, organized by date rather than meaning, vulnerable to platform disappearance. They deserve better.

The Expanding Ecosystem

Tool	Domain	Status
ctk	AI Conversations	stable
btk	Bookmarks & Media	stable
ebk	eBooks	stable
repoindex	Git Repositories	stable
ptk	Photos	incubating
mtk	Mail	incubating

The orchestration layer, longecho, ties these together into a unified personal archive.

PTK: Photo Toolkit

Photos are the most emotionally valuable digital artifacts most people have. They’re also among the worst-managed.

The Problem

Your photo library is probably:

Scattered: Phone, old phones, cloud services, camera imports, messaging app saves
Organized by date: Not by who’s in them, where they were taken, or what they mean
Cloud-dependent: Google Photos, iCloud, Amazon Photos. What happens when you switch?
Unsearchable by content: “Find photos of mom at the beach” isn’t possible
Missing context: Only you know why that blurry photo matters

The Vision

ptk provides:

Unified import from any source:

ptk import ~/Pictures/
ptk import ~/phone-backup/DCIM/
ptk import google-takeout.zip --source google-photos
ptk import icloud-export/ --source icloud

Intelligent organization by multiple dimensions:

ptk shell
ptk:/$ cd /people/mom
ptk:/people/mom$ ls
2019/  2020/  2021/  2022/  2023/  2024/

ptk:/$ cd /locations/beach
ptk:/$ cd /events/christmas-2023
ptk:/$ cd /years/2020/months/march

AI-powered features:

# Face detection and clustering
ptk faces detect --all
ptk faces cluster
ptk faces label cluster-7 "Mom"
ptk faces find "Mom"

# Scene captioning
ptk caption --all --model ollama/llava
ptk search "sunset over water"

# Semantic search
ptk ask "photos from our trip to Colorado"

Preservation guarantees:

# Verify nothing is corrupted
ptk verify --checksums

# Export to durable formats
ptk export ~/archive/photos/ --format longecho
ptk export photos.html --format html-gallery

# Original files always preserved
ptk originals list
ptk originals verify

Why SQLite?

Like the other Long Echo tools, ptk uses SQLite for metadata:

# Works even if ptk disappears
sqlite3 photos.db "
  SELECT path, caption, taken_at
  FROM photos
  WHERE caption LIKE '%birthday%'
  ORDER BY taken_at
"

The database stores metadata, face embeddings, captions, and organization. The actual photo files stay in place or are copied to a managed library, your choice.

The Long Echo Toolkit

December 16, 2025

Earlier this year I wrote about Long Echo, a philosophy for preserving AI conversations in ways that stay accessible across decades. The core idea was graceful degradation: systems that fail progressively, not catastrophically.

Since then I’ve built out three tools that apply this thinking to all personal digital content, not just conversations. Bookmarks, books, and AI chats. Together they form a system for managing the stuff you actually think with.

The Toolkit

Tool	Domain	Install
CTK	AI Conversations	`pip install conversation-tk`
BTK	Bookmarks & Media	`pip install bookmark-tk`
EBK	eBooks & Documents	`pip install ebk`

All three share a common architecture, but each is specialized for its domain.

Shared Architecture

SQLite-First Storage

Every tool uses local SQLite databases you own. No cloud dependency. Queryable with standard tools even if the CLI disappears tomorrow:

# Works even if the tools are gone
sqlite3 conversations.db "SELECT title FROM conversations WHERE title LIKE '%python%'"
sqlite3 bookmarks.db "SELECT url, title FROM bookmarks WHERE stars = 1"
sqlite3 library.db "SELECT title, author FROM books WHERE favorite = 1"

This is the whole point. The database is the artifact, not the tool.

Interactive Shells with Virtual Filesystems

Navigate your data like a Unix filesystem:

$ btk shell
btk:/$ cd tags/programming/python
btk:/tags/programming/python$ ls
3298  4095  5124  (bookmark IDs)
btk:/tags/programming/python$ cat 4095/title
Advanced Python Techniques

$ ebk shell
ebk:/$ cd authors/Knuth
ebk:/authors/Knuth$ ls
The Art of Computer Programming Vol 1
The Art of Computer Programming Vol 2

Reading Queues

Track what you’re reading, watching, or working through:

# Bookmarks
btk queue add 42 --priority high
btk queue next
btk queue progress 42 --percent 75
btk queue estimate-times  # Auto-estimate from content length

# Books
ebk queue add "Gödel, Escher, Bach"
ebk queue next
ebk queue list

LLM Integration

All three integrate with LLMs for tagging, summarization, and search:

# Auto-tag using content analysis
btk content auto-tag --all
ctk auto-tag --model ollama/llama3
ebk enrich 42  # Enhance metadata with LLM

# Natural language queries
ctk say "summarize my conversations about Rust"
btk ask "find articles about distributed systems"
ebk similar "Gödel, Escher, Bach"  # Semantic similarity

Network Analysis

Find relationships in your data:

# CTK: Conversation networks
ctk net embeddings --all
ctk net similar 42
ctk net clusters
ctk net central  # Most connected conversations
ctk net outliers  # Isolated conversations

# BTK: Bookmark graphs
btk graph build
btk graph analyze

Web Servers

Browse your archives in a web UI:

Everything is a File: Virtual Filesystems for CLI Data Tools

October 20, 2025

I had a bookmark manager. Then an ebook library manager. Then a chat history manager. Each started with the standard CRUD CLI:

btk add https://example.com --tags python,tutorial
btk list --tag python
btk search "async"
btk delete 1234

ebk import book.pdf --author "Knuth"
ebk list --author Knuth
ebk search "algorithms"

This works fine until you have 10,000+ bookmarks organized with hierarchical tags like programming/python/async, research/ml/transformers, work/clients/acme. Your ebook library has similar structure. Your exported chat conversations from Claude, ChatGPT, and Copilot are piling up.

Traditional CRUD commands become unwieldy:

btk list --tag programming/python/async/io --format json | jq '.[].title'
ebk list --category "Computer Science/Algorithms/Graph Theory" --limit 50
ctk search "machine learning" --source ChatGPT --date-from 2024-01-01

Each command requires precise arguments. Each tool has different flag conventions. You can’t navigate your data. You can only query it. And queries require knowing exactly what you’re looking for.

The insight: everything is a file

When I have thousands of source files organized in directories, I don’t run:

list-files --path /src/components/auth --extension .tsx

I run:

cd src/components/auth
ls *.tsx

The difference matters. With a filesystem, I can navigate incrementally (cd from general to specific), explore (ls to see what’s there), compose (cat file | grep pattern | wc -l), and use familiar tools (find, grep, xargs, pipes, redirection).

What if my bookmarks, ebooks, and chat histories were filesystems?

The pattern

Over the past year, I built six Python tools that all follow the same architecture:

Tool	Domain	VFS Root Structure
btk	Bookmarks	`/bookmarks/`, `/tags/`, `/recent/`, `/domains/`, `/unread/`, `/popular/`
ebk	Ebook library	`/books/`, `/authors/`, `/series/`, `/subjects/`, `/recent/`, `/unread/`
ctk	Chat conversations	`/conversations/`, `/sources/`, `/topics/`, `/starred/`, `/recent/`
ghops	Git repositories	`/repos/`, `/languages/`, `/topics/`, `/stars/`, `/recent/`
infinigram	N-gram models	`/datasets/`, `/models/`, `/corpora/`
AlgoTree	Tree structures	`/nodes/`, `/paths/`, `/subtrees/`

Each tool provides:

A stateless CLI for scripting: btk bookmark add URL, ebk import book.pdf
An interactive shell with a virtual filesystem: btk shell, ebk shell, ctk chat
POSIX-like commands: cd, ls, pwd, cat, mv, cp, rm, find, grep
Unix pipeline support: most commands output JSONL by default for piping

The interesting part is the shell.

Navigating 10,000 bookmarks

Live recording captured with asciinema. You can pause, copy text, and replay. The entire recording is 78KB of text.

EBK: Ebook Toolkit

October 13, 2025

Your books represent decades of accumulated knowledge. Technical references, formative texts, research that shaped your thinking. They deserve better than scattered files on a hard drive with inconsistent metadata and no way to search across them.

EBK treats your ebook library as a queryable, searchable knowledge base. It’s part of the Long Echo toolkit: tools for preserving your digital intellectual life in formats you control.

The Core Abstraction

At its heart, EBK is a SQLAlchemy + SQLite database with a normalized schema. Everything else (CLI, AI features, exports) is layered on top. This means your library metadata is always queryable with standard tools, even if EBK itself disappears.

# Works even without EBK installed
sqlite3 library.db "SELECT title, author FROM books WHERE favorite = 1"

What It Does

# Initialize and import
ebk db-init ~/my-library
ebk db-import ~/Documents/book.pdf ~/my-library
ebk db-import-calibre ~/Calibre/Library ~/my-library

# Search with FTS5 full-text search
ebk db-search "quantum computing" ~/my-library

# Field-specific queries
ebk db-search "title:Python author:Knuth tag:programming" ~/my-library

Behind a simple import, EBK automatically extracts text from PDFs (PyMuPDF with pypdf fallback) and EPUBs, generates text chunks for semantic search, computes SHA256 hashes for deduplication, extracts covers, and indexes everything in FTS5.

Deduplication

Same file (same hash) gets skipped. Same book in a different format gets added as an additional format. Different book gets imported as new. Books are stored in hash-prefixed directories for scalability.

AI Enrichment

EBK can use LLMs to auto-generate tags, categories, and descriptions for books with sparse metadata:

ebk enrich 42  # Enhance metadata with LLM

Semantic search finds books by meaning, not just keywords:

results = lib.semantic_search(
    "explaining complex mathematical concepts simply",
    threshold=0.7
)

Uses vector embeddings when available, TF-IDF fallback for offline use.

Knowledge Graphs

Using NetworkX, EBK can extract concept relationships across your library:

graph = lib.build_knowledge_graph(extract_entities=True)
graph.visualize(output="library_knowledge.html")

This reveals connections you didn’t know existed. “These books about functional programming also discuss category theory.”

Fluent Python API

from ebk import Library

lib = Library.open("~/ebooks")
results = (lib.query()
    .where("language", "en")
    .where("date", "2020", ">=")
    .where("subjects", "Python", "contains")
    .order_by("title")
    .take(10)
    .execute())

Export

Multiple formats for different needs:

ebk export hugo ~/library ~/hugo-site --organize-by subject --include-covers
ebk export-dag ~/library ~/output  # Navigable symlink directory structure

The Hugo export creates a browsable website. The DAG export creates a tag-based directory structure where books appear via symlinks under multiple categories. Both work without EBK installed.

Long Echo: The Ghost That Speaks

January 20, 2026

The ghost is not you. But it echoes you.

What survives beyond scattered archives? Beyond exported conversations and curated bookmarks? The stuff we never think to preserve: the photos that show how you see the world. The correspondence that maps who matters to you.

The Long Echo toolkit has grown. PTK for photos. MTK for mail. But these are sources, not destinations. The destination is something stranger: longshade, a persona built from your data that can respond to questions you never answered.

I’m going to invert the usual pattern here. Instead of tools first, philosophy later, I want to start with the philosophical destination and work backward to the data that feeds it.

longshade: The Ghost That Speaks

The Central Question

What if your archive could respond?

Not a chatbot trained on your data. Not a digital resurrection. Something more careful: a voice that carries your patterns, your interests, your way of seeing the world.

That’s longshade. Right now it’s spec-only (no implementation yet). It defines what it would mean to synthesize a conversable persona from personal archives.

The Ghost Metaphor

“The ghost is not you. But it echoes you.”

This framing matters. longshade isn’t about immortality or resurrection. It’s about preservation with a kind of agency. The echo can answer questions you never answered, using patterns you established. It speaks in your voice without claiming to be you.

The distinction is important:

Resurrection claims to recreate the person
Simulation claims to predict the person
Echo acknowledges it carries patterns, not identity

An echo is honest about what it is. It responds because you left enough traces to inform a response, not because it is you.

Voice vs. Personality

longshade extracts voice, not personality.

Your actual phrases. Your vocabulary. Your reasoning patterns. Your recurring metaphors. The way you explain things, not the things you might explain.

I noticed something working with conversation archives: user messages are the strongest signal. AI responses contain the AI’s voice. Your messages contain your voice. How you ask questions, how you frame problems, how you push back. That’s where the signal lives.

The ghost speaks like you because it learned from what you actually said, not from responses you prompted.

Long Echo: Photos and Mail

January 19, 2026

The Long Echo toolkit now covers conversations, bookmarks, and ebooks. But two of the most emotionally significant categories of personal data remain: photos and mail.

Both share a troubling pattern: scattered across devices and cloud services, organized by date rather than meaning, vulnerable to platform disappearance. They deserve better.

The Expanding Ecosystem

Tool	Domain	Status
ctk	AI Conversations	stable
btk	Bookmarks & Media	stable
ebk	eBooks	stable
repoindex	Git Repositories	stable
ptk	Photos	incubating
mtk	Mail	incubating

The orchestration layer, longecho, ties these together into a unified personal archive.

PTK: Photo Toolkit

Photos are the most emotionally valuable digital artifacts most people have. They’re also among the worst-managed.

The Problem

Your photo library is probably:

Scattered: Phone, old phones, cloud services, camera imports, messaging app saves
Organized by date: Not by who’s in them, where they were taken, or what they mean
Cloud-dependent: Google Photos, iCloud, Amazon Photos. What happens when you switch?
Unsearchable by content: “Find photos of mom at the beach” isn’t possible
Missing context: Only you know why that blurry photo matters

The Vision

ptk provides:

Unified import from any source:

ptk import ~/Pictures/
ptk import ~/phone-backup/DCIM/
ptk import google-takeout.zip --source google-photos
ptk import icloud-export/ --source icloud

Intelligent organization by multiple dimensions:

ptk shell
ptk:/$ cd /people/mom
ptk:/people/mom$ ls
2019/  2020/  2021/  2022/  2023/  2024/

ptk:/$ cd /locations/beach
ptk:/$ cd /events/christmas-2023
ptk:/$ cd /years/2020/months/march

AI-powered features:

# Face detection and clustering
ptk faces detect --all
ptk faces cluster
ptk faces label cluster-7 "Mom"
ptk faces find "Mom"

# Scene captioning
ptk caption --all --model ollama/llava
ptk search "sunset over water"

# Semantic search
ptk ask "photos from our trip to Colorado"

Preservation guarantees:

# Verify nothing is corrupted
ptk verify --checksums

# Export to durable formats
ptk export ~/archive/photos/ --format longecho
ptk export photos.html --format html-gallery

# Original files always preserved
ptk originals list
ptk originals verify

Why SQLite?

Like the other Long Echo tools, ptk uses SQLite for metadata:

# Works even if ptk disappears
sqlite3 photos.db "
  SELECT path, caption, taken_at
  FROM photos
  WHERE caption LIKE '%birthday%'
  ORDER BY taken_at
"

The database stores metadata, face embeddings, captions, and organization. The actual photo files stay in place or are copied to a managed library, your choice.

Long Echo: The Ghost That Speaks

January 20, 2026

The ghost is not you. But it echoes you.

What survives beyond scattered archives? Beyond exported conversations and curated bookmarks? The stuff we never think to preserve: the photos that show how you see the world. The correspondence that maps who matters to you.

The Long Echo toolkit has grown. PTK for photos. MTK for mail. But these are sources, not destinations. The destination is something stranger: longshade, a persona built from your data that can respond to questions you never answered.

I’m going to invert the usual pattern here. Instead of tools first, philosophy later, I want to start with the philosophical destination and work backward to the data that feeds it.

longshade: The Ghost That Speaks

The Central Question

What if your archive could respond?

Not a chatbot trained on your data. Not a digital resurrection. Something more careful: a voice that carries your patterns, your interests, your way of seeing the world.

That’s longshade. Right now it’s spec-only (no implementation yet). It defines what it would mean to synthesize a conversable persona from personal archives.

The Ghost Metaphor

“The ghost is not you. But it echoes you.”

This framing matters. longshade isn’t about immortality or resurrection. It’s about preservation with a kind of agency. The echo can answer questions you never answered, using patterns you established. It speaks in your voice without claiming to be you.

The distinction is important:

Resurrection claims to recreate the person
Simulation claims to predict the person
Echo acknowledges it carries patterns, not identity

An echo is honest about what it is. It responds because you left enough traces to inform a response, not because it is you.

Voice vs. Personality

longshade extracts voice, not personality.

Your actual phrases. Your vocabulary. Your reasoning patterns. Your recurring metaphors. The way you explain things, not the things you might explain.

I noticed something working with conversation archives: user messages are the strongest signal. AI responses contain the AI’s voice. Your messages contain your voice. How you ask questions, how you frame problems, how you push back. That’s where the signal lives.

The ghost speaks like you because it learned from what you actually said, not from responses you prompted.

Long Echo: Photos and Mail

January 19, 2026

The Long Echo toolkit now covers conversations, bookmarks, and ebooks. But two of the most emotionally significant categories of personal data remain: photos and mail.

Both share a troubling pattern: scattered across devices and cloud services, organized by date rather than meaning, vulnerable to platform disappearance. They deserve better.

The Expanding Ecosystem

Tool	Domain	Status
ctk	AI Conversations	stable
btk	Bookmarks & Media	stable
ebk	eBooks	stable
repoindex	Git Repositories	stable
ptk	Photos	incubating
mtk	Mail	incubating

The orchestration layer, longecho, ties these together into a unified personal archive.

PTK: Photo Toolkit

Photos are the most emotionally valuable digital artifacts most people have. They’re also among the worst-managed.

The Problem

Your photo library is probably:

Scattered: Phone, old phones, cloud services, camera imports, messaging app saves
Organized by date: Not by who’s in them, where they were taken, or what they mean
Cloud-dependent: Google Photos, iCloud, Amazon Photos. What happens when you switch?
Unsearchable by content: “Find photos of mom at the beach” isn’t possible
Missing context: Only you know why that blurry photo matters

The Vision

ptk provides:

Unified import from any source:

ptk import ~/Pictures/
ptk import ~/phone-backup/DCIM/
ptk import google-takeout.zip --source google-photos
ptk import icloud-export/ --source icloud

Intelligent organization by multiple dimensions:

ptk shell
ptk:/$ cd /people/mom
ptk:/people/mom$ ls
2019/  2020/  2021/  2022/  2023/  2024/

ptk:/$ cd /locations/beach
ptk:/$ cd /events/christmas-2023
ptk:/$ cd /years/2020/months/march

AI-powered features:

# Face detection and clustering
ptk faces detect --all
ptk faces cluster
ptk faces label cluster-7 "Mom"
ptk faces find "Mom"

# Scene captioning
ptk caption --all --model ollama/llava
ptk search "sunset over water"

# Semantic search
ptk ask "photos from our trip to Colorado"

Preservation guarantees:

# Verify nothing is corrupted
ptk verify --checksums

# Export to durable formats
ptk export ~/archive/photos/ --format longecho
ptk export photos.html --format html-gallery

# Original files always preserved
ptk originals list
ptk originals verify

Why SQLite?

Like the other Long Echo tools, ptk uses SQLite for metadata:

# Works even if ptk disappears
sqlite3 photos.db "
  SELECT path, caption, taken_at
  FROM photos
  WHERE caption LIKE '%birthday%'
  ORDER BY taken_at
"

The database stores metadata, face embeddings, captions, and organization. The actual photo files stay in place or are copied to a managed library, your choice.

Building Languages to Solve Problems

January 19, 2026

Chapter 4 of Structure and Interpretation of Computer Programs opens with one of the most important insights in programming: the most powerful technique for controlling complexity is metalinguistic abstraction, the establishment of new languages.

Not libraries. Not frameworks. Languages.

When you’ve abstracted enough of a problem domain into primitives, combination rules, and naming mechanisms, you haven’t just written code. You’ve created a new way of thinking about the problem. The domain becomes expressible. And once something is expressible, it becomes manipulable, debuggable, and shareable.

What Is Metalinguistic Abstraction?

The key distinction is between using a language and creating one. A library gives you functions to call. A language gives you a grammar for expressing ideas.

Consider the difference:

Library approach: Call db.execute("SELECT * FROM users WHERE age > 21")

Language approach: Write SELECT * FROM users WHERE age > 21

SQL isn’t a library. It’s a language, with primitives (tables, columns), means of combination (joins, unions, subqueries), and means of abstraction (views, CTEs). These three elements (primitives, combination, abstraction) are SICP’s fundamental criteria for any language, and they’re what separates a DSL from a mere API.

Other examples:

Regular expressions: primitives (characters, character classes), combination (concatenation, alternation), abstraction (groups, backreferences)
Make: primitives (targets, prerequisites), combination (dependency chains), abstraction (pattern rules, variables)
CSS selectors: primitives (elements, classes, IDs), combination (descendant, child, sibling), abstraction (custom properties, mixins in preprocessors)

In each case, the language captures the essential structure of the problem domain in a way that raw code cannot.

The Three Requirements

SICP identifies three necessary components for any language:

Primitives: What are the basic elements that cannot be broken down further?
Means of combination: How do you build compound elements from simpler ones?
Means of abstraction: How do you name and reuse patterns?

When designing a DSL, these questions guide everything. Get them wrong and you have a clunky API. Get them right and the domain becomes thinkable in your language.

Consider an expression language for symbolic math:

Primitives: numbers, symbols, operators
Combination: function application (+ x 1), nested expressions (* (+ x 1) 2)
Abstraction: named rules, rulesets, engines

Or a query language for JSON documents:

JAF: Streaming Boolean Algebra Over Nested JSON

December 20, 2024

JAF (Just Another Flow) is a streaming data processing system for JSON/JSONL data. It implements boolean algebra over nested JSON structures with lazy evaluation, composable operations, and a fluent API. JAF is the production version of the concepts I explored in dotsuite.

The Relationship to Dotsuite

The short version:

dotsuite: “This is how it works.” Pedagogical, simple, learn-by-building.
JAF: “This is what you use.” Feature-complete, lazy, handles real data.

JAF implements the highest level of dotsuite’s architecture: boolean algebra over collections of nested documents. Where dotsuite teaches the concepts through isolated simple tools, JAF combines them into a unified streaming framework.

The Boolean Algebra Branch

In dotsuite’s three-pillar architecture (Depth, Truth, Shape), JAF focuses on the collections layer, specifically the boolean wing that provides filtering operations with full boolean algebra:

\[ \text{filter}: (\mathcal{D} \to \mathbb{B}) \to (C \to C) \]

Where $\mathcal{D}$ is the document space, $\mathbb{B}$ is boolean values, and $C$ is a collection of documents.

JAF lifts boolean operations to streams: AND is intersection of filtered streams, OR is union, NOT is complement, and composition gives you chainable predicates with guaranteed homomorphism.

Core Innovation: Lazy Streaming

The Problem

Traditional data processing loads entire datasets into memory:

# Eager evaluation - loads everything
all_data = load_json("huge_file.jsonl")
filtered = [d for d in all_data if d['age'] > 25]
mapped = [transform(d) for d in filtered]

This fails on large datasets and wastes resources when you only need the first 10 results.

JAF’s Solution

from jaf import stream

# Lazy evaluation - nothing executes yet
pipeline = stream("huge_file.jsonl") \
    .filter(["gt?", "@age", 25]) \
    .map(transform) \
    .take(10)

# Only processes 10 matching items
for item in pipeline.evaluate():
    process(item)

Constant memory (processes one item at a time), early termination (stops after take(10)), composable (build complex pipelines declaratively), and works with infinite streams.

Three Query Syntaxes

JAF supports multiple query syntaxes that all compile to the same internal representation.

S-Expression Syntax (Lisp-like)

# Simple comparisons
(eq? @status "active")
(gt? @age 25)
(contains? @tags "python")

# Boolean logic
(and
    (gte? @age 18)
    (eq? @verified true))

# Nested expressions
(or (eq? @role "admin")
    (and (eq? @role "user")
         (gt? @score 100)))

S-expressions because: unambiguous parsing (no precedence rules), easy to serialize, homoiconic (code is data), composable ASTs.

JSON Array Syntax

# Same queries in JSON
["eq?", "@status", "active"]
["gt?", "@age", 25]

["and",
    ["gte?", "@age", 18],
    ["eq?", "@verified", true]
]

Easy to generate programmatically, standard JSON format, network-transmissible.

Infix DSL Syntax

# Natural infix notation
@status == "active"
@age > 25 and @verified == true
@role == "admin" or (@role == "user" and @score > 100)

Human-readable, familiar, good for CLI usage. All three compile to the same AST.

jsonl-algebra: Relational Algebra for Nested JSON

December 18, 2024

jsonl-algebra (command: ja) is a command-line implementation of relational algebra for JSONL data. It’s the production version of dotsuite’s dotrelate component: SQL-like operations on the command line with first-class support for nested JSON structures.

The Relationship to Dotsuite

In dotsuite’s architecture, dotrelate provides relational operations on document collections: join, union, project, difference. jsonl-algebra (ja) is the production implementation of those concepts, published on PyPI, with all relational operations plus aggregations, streaming support, schema tools, and an interactive REPL.

The Core Insight

Traditional relational algebra assumes flat tables:

SELECT name, age FROM users WHERE age > 30

But real-world JSON is deeply nested:

{
  "user": {
    "id": 1,
    "name": "Alice",
    "address": {
      "city": "NYC",
      "zip": "10001"
    }
  },
  "orders": [
    {"id": 101, "amount": 50}
  ]
}

jsonl-algebra bridges this gap by extending relational algebra with dot notation for nested access:

ja select 'user.age > 30' users.jsonl
ja project user.name,user.address.city users.jsonl
ja join users.jsonl orders.jsonl --on user.id=customer_id

The Five Core Operations

Relational algebra has five fundamental operations that form a complete algebra. Everything else is derived.

1. Selection (filter rows)

Mathematical notation: $\sigma_{\text{predicate}}(R)$

# Filter where status is "active"
ja select 'status == `"active"`' data.jsonl

# Filter on nested fields
ja select 'user.age > 30' users.jsonl

# Complex boolean logic
ja select 'price < 100 and category == `"electronics"`' products.jsonl

Selection is commutative ($\sigma_{p_1}(\sigma_{p_2}(R)) = \sigma_{p_2}(\sigma_{p_1}(R))$) and combinable ($\sigma_{p_1}(\sigma_{p_2}(R)) = \sigma_{p_1 \land p_2}(R)$).

2. Projection (select/compute columns)

Mathematical notation: $\pi_{\text{columns}}(R)$

# Pick specific fields
ja project id,name data.jsonl

# Access nested fields
ja project user.name,user.address.city users.jsonl

# Computed columns (coming soon)
ja project name,annual_income=salary*12 employees.jsonl

Idempotent for simple projections: $\pi_a(\pi_{a,b}(R)) = \pi_a(R)$.

3. Join (combine relations)

Mathematical notation: $R \bowtie_{\text{condition}} S$

# Inner join on user ID
ja join users.jsonl orders.jsonl --on user.id=customer_id

# Join on nested fields
ja join posts.jsonl comments.jsonl --on post.id=comment.post_id

# Multiple join keys
ja join users.jsonl accounts.jsonl --on id=user_id,email=account_email

Commutative and associative, so you can join multiple files in any order:

ja join users.jsonl orders.jsonl --on user.id=customer_id \
  | ja join - products.jsonl --on product_id=id

4. Union (combine all rows)

Mathematical notation: $R \cup S$

# Combine employees and contractors
ja union employees.jsonl contractors.jsonl

# Union multiple sources
ja union jan.jsonl feb.jsonl mar.jsonl

5. Difference (set subtraction)

Mathematical notation: $R - S$

The Dot Ecosystem: From Simple Paths to Data Algebras

December 15, 2024

dotsuite is a suite of composable tools for working with nested data structures like JSON, YAML, and Python dictionaries. It started as a single helper function and grew into something with actual mathematical structure. That growth is the interesting part.

The Origin

It always starts with a simple problem. You have a nested dictionary and you need a value buried deep inside:

# Brittle code that crashes on missing keys
email = data['user']['contacts'][0]['email']  # KeyError? IndexError?

The first solution is a helper function:

# The essence of dotget - simple enough to copy
def get(data, path, default=None):
    try:
        for segment in path.split('.'):
            data = data[int(segment)] if segment.isdigit() else data[segment]
        return data
    except (KeyError, IndexError, TypeError):
        return default

This is where the story begins. That single function, once you start asking questions about what else you need, leads to a complete ecosystem for data manipulation. The trick is that the questions have a natural structure to them.

The Three Pillars

The ecosystem organizes around three fundamental questions about data:

Depth Pillar: “Where is the data?”

Tools for finding and extracting values from within documents.

Tool	Purpose	Complexity
dotget	Simple exact paths	`get(data, "user.name")`
dotstar	Wildcard patterns	`search(data, "users.*.name")`
dotselect	Advanced selection with predicates	`find_first(data, "users[role=admin].name")`
dotpath	Extensible path engine	Powers all other tools, Turing-complete

The addressing layer forms a free algebra on selectors, with operators being morphisms in the Kleisli category of the powerset monad. In practice this means dotstar composed with dotselect still yields a well-defined set of values. You can compose these things without worrying about edge cases blowing up.

Truth Pillar: “Is this assertion true?”

Tools for asking boolean questions and validating data.

Tool	Purpose	Logic
dotexists	Path existence	`check(data, "user.email")`
dotany	Existential quantifier	`any_match(data, "users.*.role", "admin")`
dotall	Universal quantifier	`all_match(data, "users.*.status", "active")`
dotquery	Compositional logic engine	`Query("any equals role admin").check(data)`

Predicates form a Boolean algebra under conjunction, disjunction, and negation that is homomorphic to set algebra on result subsets. This enables short-circuit evaluation and distributive laws. The math isn’t decoration; it’s what makes the composition reliable.

Shape Pillar: “How should the data be transformed?”

Tools for reshaping and modifying data structures.

Tool	Purpose	Type
dotmod	Surgical modifications	`set_(data, "user.status", "inactive")`
dotbatch	Atomic transactions	Apply multiple changes safely
dotpipe	Data transformation pipelines	Reshape documents into new forms
dotpluck	Value extraction	Create new structures from selections

Transformations are endofunctors on document spaces with monoid composition. dotmod implements lenses with put-get laws, while dotpipe provides Kleisli composition of pure functions.

Duality: The Hidden Structure of Opposites

January 19, 2026

Many structures come in pairs. Recognizing duality lets you transfer insights between domains.

The Motivating Example

This collection includes two approaches to automatic differentiation:

Forward mode (in dual): Propagate derivatives alongside values, from inputs toward outputs
Reverse mode (in autodiff): Build a graph during forward evaluation, then propagate gradients backward from outputs toward inputs

These aren’t just two implementations of the same idea. They’re duals, mirror images with complementary strengths.

Forward mode computes one column of the Jacobian per pass. If $f: \mathbb{R}^n \to \mathbb{R}^m$, computing the full Jacobian takes $n$ passes. Reverse mode computes one row per pass, $m$ passes for the full Jacobian.

For neural network training, we have many inputs (millions of parameters) and one output (the loss). Reverse mode wins overwhelmingly: one backward pass gives all gradients. This is why backpropagation dominates deep learning.

For sensitivity analysis with few parameters and many outputs, forward mode wins. Same algorithm structure, opposite traversal direction, complementary use cases.

The mathematical explanation: forward mode computes Jacobian-vector products ($Jv$); reverse mode computes vector-Jacobian products ($v^T J$). These are transposes of each other. Duality is transposition.

Push vs Pull

Consider two ways to traverse a sequence:

Pull (iterator/consumer controls):

for (auto it = seq.begin(); it != seq.end(); ++it) {
    process(*it);  // Consumer pulls each element
}

Push (producer controls):

seq.for_each([](auto x) {
    process(x);  // Producer pushes each element
});

Same traversal. Same elements processed. But control flow is reversed:

Aspect	Pull (Iterator)	Push (Generator)
Who controls pace?	Consumer	Producer
Suspend/resume?	Consumer decides when to call `++`	Producer decides when to yield
Backpressure	Natural (just stop pulling)	Must be designed in
Composition	Chain iterators	Chain callbacks

C++ ranges are pull-based: view | filter | transform creates an iterator that pulls through the pipeline. Reactive streams (Rx) are push-based: events flow through a pipeline of observers.

These are duals. Given a pull-based algorithm, you can mechanically derive its push-based counterpart by reversing who initiates each step. The transformation preserves correctness because it’s just changing direction, not content.

Encode vs Decode

Compression algorithms come in pairs:

// Encoder: structure -> bits
auto encode(const Document& doc) -> Bitstream;

// Decoder: bits -> structure
auto decode(const Bitstream& bits) -> Document;

These must be inverses: decode(encode(x)) == x. But their implementations are often strikingly different:

Seeing Structure First

January 18, 2026

A reflection on eleven explorations in generic programming

The Question Behind the Code

What do these computations have in common?

Computing the millionth Fibonacci number
Finding the shortest path between cities in a weighted graph
Calculating compound interest over thirty years
Composing ten 3D rotations into one
Repeating a string n times

The answer: they’re all computed by the same twenty lines of code.

template<typename T>
constexpr T power(T const& base, T exp) {
    if (exp == zero(exp)) return one(exp);
    if (exp == one(exp))  return base;

    return even(exp)
        ? square(power(base, half(exp)))
        : product(base, power(base, decrement(exp)));
}

This shouldn’t work. Fibonacci numbers involve integer sequences. Shortest paths involve graphs. Rotations involve 3D geometry. Different domains, different mathematics.

Yet they share structure. Once you see it, a single algorithm serves them all.

This collection of eleven blog posts is an extended meditation on one idea: algorithms arise from algebraic structure. The posts cover different domains (number theory, calculus, linear algebra, polymorphism) but they circle the same insight. Recognize the structure; the algorithm follows.

The Principle

Alex Stepanov articulated this most clearly in Elements of Programming: “Generic programming is about abstracting and classifying algorithms and data structures.” But the deeper point is how to abstract. Not by common syntax or superficial similarity, but by the algebraic laws a type obeys.

Why does structure appear everywhere? Because reality has structure. The algebraic structures we discover in programming (groups, rings, monoids) are the same structures physicists discover in nature. Rotations form a group. Spacetime transformations form a group. This isn’t coincidence. We’re uncovering patterns that exist.

Noether’s theorem makes this precise: every continuous symmetry corresponds to a conservation law. Time-translation symmetry gives conservation of energy. Space-translation symmetry gives conservation of momentum. Rotational symmetry gives conservation of angular momentum. The symmetry groups of physics are algebraic structures.

When we recognize “this is a monoid” in our code, we’re tapping into the same mathematical substrate that governs physical law. The algorithms follow because the structure constrains what’s possible, both in computation and in nature.

Consider the power() function above. What does it require?

An associative binary operation (so we can regroup: $(a \cdot b) \cdot c = a \cdot (b \cdot c)$)
An identity element (so $1 \cdot x = x \cdot 1 = x$)
Halving and parity testing on the exponent

That’s it. Any type providing these operations, with these laws, can use this algorithm. The requirements are algebraic, not syntactic.

Differentiation: Three Ways

January 15, 2025

A synthesis of three earlier posts, comparing forward-mode AD, reverse-mode AD, and numerical differentiation.

Computing derivatives shows up everywhere: optimization, machine learning, physics simulation, numerical analysis. This series has explored three distinct approaches:

Forward-mode AD via dual numbers
Reverse-mode AD via computational graphs
Numerical differentiation via finite differences

Each has different strengths. The right choice depends on the shape of your problem.

The Landscape

Method	Accuracy	Cost for $f: \mathbb{R}^n \to \mathbb{R}$	Cost for $f: \mathbb{R} \to \mathbb{R}^m$	Memory
Forward AD	Exact	$O(n)$ passes	$O(1)$ pass	$O(1)$
Reverse AD	Exact	$O(1)$ pass	$O(m)$ passes	$O(\text{ops})$
Finite Diff	$O(h^p)$	$O(n)$ evaluations	$O(n)$ evaluations	$O(1)$

The key point: problem structure determines the best method.

Forward-Mode AD: Dual Numbers

Forward-mode AD extends numbers with an infinitesimal $\varepsilon$ where $\varepsilon^2 = 0$. The derivative falls out of the arithmetic for free:

// f(x) = x^3 - 3x + 1
// f'(x) = 3x^2 - 3

auto x = dual<double>::variable(2.0);  // x = 2, dx = 1
auto f = x*x*x - 3.0*x + 1.0;

std::cout << f.value() << "\n";       // 3.0
std::cout << f.derivative() << "\n";  // 9.0

Strengths:

Simple implementation (operator overloading)
No memory overhead
Naturally composable for higher derivatives
Works with any function of overloaded operators

When to use:

Single input variable (or few inputs)
Computing Jacobian-vector products
Higher-order derivatives via nesting
Sensitivity analysis along one direction

Complexity: One forward pass per input variable. For f: R^n -> R^m, computing the full Jacobian requires n passes.

Reverse-Mode AD: Computational Graphs

Reverse-mode AD builds a computational graph during the forward pass, then propagates gradients backward via the chain rule:

auto f = [](const auto& x) {
    return sum(pow(x, 2.0));  // f(x) = sum(x^2)
};

auto df = grad(f);  // Returns gradient function
auto gradient = df(x);  // One backward pass for all partials

Strengths:

O(1) backward passes regardless of input dimension
Powers modern deep learning (backpropagation)
Efficient for loss functions: f: R^n -> R

When to use:

Many inputs, scalar output (neural networks)
Computing vector-Jacobian products
Optimization where you need the full gradient

Complexity: One forward pass to build the graph, one backward pass to compute all gradients. Memory scales with the number of operations because you have to store intermediate values.

Numerical Differentiation: Finite Differences

Approximate the derivative using the limit definition:

// Central difference: f'(x) ~ (f(x+h) - f(x-h)) / 2h
double df = central_difference(f, x);

Strengths:

Building Languages to Solve Problems

January 19, 2026

Chapter 4 of Structure and Interpretation of Computer Programs opens with one of the most important insights in programming: the most powerful technique for controlling complexity is metalinguistic abstraction, the establishment of new languages.

Not libraries. Not frameworks. Languages.

When you’ve abstracted enough of a problem domain into primitives, combination rules, and naming mechanisms, you haven’t just written code. You’ve created a new way of thinking about the problem. The domain becomes expressible. And once something is expressible, it becomes manipulable, debuggable, and shareable.

What Is Metalinguistic Abstraction?

The key distinction is between using a language and creating one. A library gives you functions to call. A language gives you a grammar for expressing ideas.

Consider the difference:

Library approach: Call db.execute("SELECT * FROM users WHERE age > 21")

Language approach: Write SELECT * FROM users WHERE age > 21

SQL isn’t a library. It’s a language, with primitives (tables, columns), means of combination (joins, unions, subqueries), and means of abstraction (views, CTEs). These three elements (primitives, combination, abstraction) are SICP’s fundamental criteria for any language, and they’re what separates a DSL from a mere API.

Other examples:

Regular expressions: primitives (characters, character classes), combination (concatenation, alternation), abstraction (groups, backreferences)
Make: primitives (targets, prerequisites), combination (dependency chains), abstraction (pattern rules, variables)
CSS selectors: primitives (elements, classes, IDs), combination (descendant, child, sibling), abstraction (custom properties, mixins in preprocessors)

In each case, the language captures the essential structure of the problem domain in a way that raw code cannot.

The Three Requirements

SICP identifies three necessary components for any language:

Primitives: What are the basic elements that cannot be broken down further?
Means of combination: How do you build compound elements from simpler ones?
Means of abstraction: How do you name and reuse patterns?

When designing a DSL, these questions guide everything. Get them wrong and you have a clunky API. Get them right and the domain becomes thinkable in your language.

Consider an expression language for symbolic math:

Primitives: numbers, symbols, operators
Combination: function application (+ x 1), nested expressions (* (+ x 1) 2)
Abstraction: named rules, rulesets, engines

Or a query language for JSON documents:

JAF: Streaming Boolean Algebra Over Nested JSON

December 20, 2024

JAF (Just Another Flow) is a streaming data processing system for JSON/JSONL data. It implements boolean algebra over nested JSON structures with lazy evaluation, composable operations, and a fluent API. JAF is the production version of the concepts I explored in dotsuite.

The Relationship to Dotsuite

The short version:

dotsuite: “This is how it works.” Pedagogical, simple, learn-by-building.
JAF: “This is what you use.” Feature-complete, lazy, handles real data.

JAF implements the highest level of dotsuite’s architecture: boolean algebra over collections of nested documents. Where dotsuite teaches the concepts through isolated simple tools, JAF combines them into a unified streaming framework.

The Boolean Algebra Branch

In dotsuite’s three-pillar architecture (Depth, Truth, Shape), JAF focuses on the collections layer, specifically the boolean wing that provides filtering operations with full boolean algebra:

\[ \text{filter}: (\mathcal{D} \to \mathbb{B}) \to (C \to C) \]

Where $\mathcal{D}$ is the document space, $\mathbb{B}$ is boolean values, and $C$ is a collection of documents.

JAF lifts boolean operations to streams: AND is intersection of filtered streams, OR is union, NOT is complement, and composition gives you chainable predicates with guaranteed homomorphism.

Core Innovation: Lazy Streaming

The Problem

Traditional data processing loads entire datasets into memory:

# Eager evaluation - loads everything
all_data = load_json("huge_file.jsonl")
filtered = [d for d in all_data if d['age'] > 25]
mapped = [transform(d) for d in filtered]

This fails on large datasets and wastes resources when you only need the first 10 results.

JAF’s Solution

from jaf import stream

# Lazy evaluation - nothing executes yet
pipeline = stream("huge_file.jsonl") \
    .filter(["gt?", "@age", 25]) \
    .map(transform) \
    .take(10)

# Only processes 10 matching items
for item in pipeline.evaluate():
    process(item)

Constant memory (processes one item at a time), early termination (stops after take(10)), composable (build complex pipelines declaratively), and works with infinite streams.

Three Query Syntaxes

JAF supports multiple query syntaxes that all compile to the same internal representation.

S-Expression Syntax (Lisp-like)

# Simple comparisons
(eq? @status "active")
(gt? @age 25)
(contains? @tags "python")

# Boolean logic
(and
    (gte? @age 18)
    (eq? @verified true))

# Nested expressions
(or (eq? @role "admin")
    (and (eq? @role "user")
         (gt? @score 100)))

S-expressions because: unambiguous parsing (no precedence rules), easy to serialize, homoiconic (code is data), composable ASTs.

JSON Array Syntax

# Same queries in JSON
["eq?", "@status", "active"]
["gt?", "@age", 25]

["and",
    ["gte?", "@age", 18],
    ["eq?", "@verified", true]
]

Easy to generate programmatically, standard JSON format, network-transmissible.

Infix DSL Syntax

# Natural infix notation
@status == "active"
@age > 25 and @verified == true
@role == "admin" or (@role == "user" and @score > 100)

Human-readable, familiar, good for CLI usage. All three compile to the same AST.

Building Languages to Solve Problems

January 19, 2026

Chapter 4 of Structure and Interpretation of Computer Programs opens with one of the most important insights in programming: the most powerful technique for controlling complexity is metalinguistic abstraction, the establishment of new languages.

Not libraries. Not frameworks. Languages.

When you’ve abstracted enough of a problem domain into primitives, combination rules, and naming mechanisms, you haven’t just written code. You’ve created a new way of thinking about the problem. The domain becomes expressible. And once something is expressible, it becomes manipulable, debuggable, and shareable.

What Is Metalinguistic Abstraction?

The key distinction is between using a language and creating one. A library gives you functions to call. A language gives you a grammar for expressing ideas.

Consider the difference:

Library approach: Call db.execute("SELECT * FROM users WHERE age > 21")

Language approach: Write SELECT * FROM users WHERE age > 21

SQL isn’t a library. It’s a language, with primitives (tables, columns), means of combination (joins, unions, subqueries), and means of abstraction (views, CTEs). These three elements (primitives, combination, abstraction) are SICP’s fundamental criteria for any language, and they’re what separates a DSL from a mere API.

Other examples:

Regular expressions: primitives (characters, character classes), combination (concatenation, alternation), abstraction (groups, backreferences)
Make: primitives (targets, prerequisites), combination (dependency chains), abstraction (pattern rules, variables)
CSS selectors: primitives (elements, classes, IDs), combination (descendant, child, sibling), abstraction (custom properties, mixins in preprocessors)

In each case, the language captures the essential structure of the problem domain in a way that raw code cannot.

The Three Requirements

SICP identifies three necessary components for any language:

Primitives: What are the basic elements that cannot be broken down further?
Means of combination: How do you build compound elements from simpler ones?
Means of abstraction: How do you name and reuse patterns?

When designing a DSL, these questions guide everything. Get them wrong and you have a clunky API. Get them right and the domain becomes thinkable in your language.

Consider an expression language for symbolic math:

Primitives: numbers, symbols, operators
Combination: function application (+ x 1), nested expressions (* (+ x 1) 2)
Abstraction: named rules, rulesets, engines

Or a query language for JSON documents:

jsonl-algebra: Relational Algebra for Nested JSON

December 18, 2024

jsonl-algebra (command: ja) is a command-line implementation of relational algebra for JSONL data. It’s the production version of dotsuite’s dotrelate component: SQL-like operations on the command line with first-class support for nested JSON structures.

The Relationship to Dotsuite

In dotsuite’s architecture, dotrelate provides relational operations on document collections: join, union, project, difference. jsonl-algebra (ja) is the production implementation of those concepts, published on PyPI, with all relational operations plus aggregations, streaming support, schema tools, and an interactive REPL.

The Core Insight

Traditional relational algebra assumes flat tables:

SELECT name, age FROM users WHERE age > 30

But real-world JSON is deeply nested:

{
  "user": {
    "id": 1,
    "name": "Alice",
    "address": {
      "city": "NYC",
      "zip": "10001"
    }
  },
  "orders": [
    {"id": 101, "amount": 50}
  ]
}

jsonl-algebra bridges this gap by extending relational algebra with dot notation for nested access:

ja select 'user.age > 30' users.jsonl
ja project user.name,user.address.city users.jsonl
ja join users.jsonl orders.jsonl --on user.id=customer_id

The Five Core Operations

Relational algebra has five fundamental operations that form a complete algebra. Everything else is derived.

1. Selection (filter rows)

Mathematical notation: $\sigma_{\text{predicate}}(R)$

# Filter where status is "active"
ja select 'status == `"active"`' data.jsonl

# Filter on nested fields
ja select 'user.age > 30' users.jsonl

# Complex boolean logic
ja select 'price < 100 and category == `"electronics"`' products.jsonl

Selection is commutative ($\sigma_{p_1}(\sigma_{p_2}(R)) = \sigma_{p_2}(\sigma_{p_1}(R))$) and combinable ($\sigma_{p_1}(\sigma_{p_2}(R)) = \sigma_{p_1 \land p_2}(R)$).

2. Projection (select/compute columns)

Mathematical notation: $\pi_{\text{columns}}(R)$

# Pick specific fields
ja project id,name data.jsonl

# Access nested fields
ja project user.name,user.address.city users.jsonl

# Computed columns (coming soon)
ja project name,annual_income=salary*12 employees.jsonl

Idempotent for simple projections: $\pi_a(\pi_{a,b}(R)) = \pi_a(R)$.

3. Join (combine relations)

Mathematical notation: $R \bowtie_{\text{condition}} S$

# Inner join on user ID
ja join users.jsonl orders.jsonl --on user.id=customer_id

# Join on nested fields
ja join posts.jsonl comments.jsonl --on post.id=comment.post_id

# Multiple join keys
ja join users.jsonl accounts.jsonl --on id=user_id,email=account_email

Commutative and associative, so you can join multiple files in any order:

ja join users.jsonl orders.jsonl --on user.id=customer_id \
  | ja join - products.jsonl --on product_id=id

4. Union (combine all rows)

Mathematical notation: $R \cup S$

# Combine employees and contractors
ja union employees.jsonl contractors.jsonl

# Union multiple sources
ja union jan.jsonl feb.jsonl mar.jsonl

5. Difference (set subtraction)

Mathematical notation: $R - S$

Building Languages to Solve Problems

January 19, 2026

Chapter 4 of Structure and Interpretation of Computer Programs opens with one of the most important insights in programming: the most powerful technique for controlling complexity is metalinguistic abstraction, the establishment of new languages.

Not libraries. Not frameworks. Languages.

When you’ve abstracted enough of a problem domain into primitives, combination rules, and naming mechanisms, you haven’t just written code. You’ve created a new way of thinking about the problem. The domain becomes expressible. And once something is expressible, it becomes manipulable, debuggable, and shareable.

What Is Metalinguistic Abstraction?

The key distinction is between using a language and creating one. A library gives you functions to call. A language gives you a grammar for expressing ideas.

Consider the difference:

Library approach: Call db.execute("SELECT * FROM users WHERE age > 21")

Language approach: Write SELECT * FROM users WHERE age > 21

SQL isn’t a library. It’s a language, with primitives (tables, columns), means of combination (joins, unions, subqueries), and means of abstraction (views, CTEs). These three elements (primitives, combination, abstraction) are SICP’s fundamental criteria for any language, and they’re what separates a DSL from a mere API.

Other examples:

Regular expressions: primitives (characters, character classes), combination (concatenation, alternation), abstraction (groups, backreferences)
Make: primitives (targets, prerequisites), combination (dependency chains), abstraction (pattern rules, variables)
CSS selectors: primitives (elements, classes, IDs), combination (descendant, child, sibling), abstraction (custom properties, mixins in preprocessors)

In each case, the language captures the essential structure of the problem domain in a way that raw code cannot.

The Three Requirements

SICP identifies three necessary components for any language:

Primitives: What are the basic elements that cannot be broken down further?
Means of combination: How do you build compound elements from simpler ones?
Means of abstraction: How do you name and reuse patterns?

When designing a DSL, these questions guide everything. Get them wrong and you have a clunky API. Get them right and the domain becomes thinkable in your language.

Consider an expression language for symbolic math:

Primitives: numbers, symbols, operators
Combination: function application (+ x 1), nested expressions (* (+ x 1) 2)
Abstraction: named rules, rulesets, engines

Or a query language for JSON documents:

Rerum: Pattern Matching and Term Rewriting in Python

December 16, 2025

Rerum (Rewriting Expressions via Rules Using Morphisms) is a Python library for pattern matching and term rewriting. It makes symbolic computation accessible through a readable DSL while keeping a clean separation between trusted and untrusted code.

The Problem

Traditional symbolic math systems tend toward two extremes. Monolithic systems like Mathematica bundle everything in. Lighter tools force you to write complex recursive traversals every time you want to transform an expression. I wanted something in between: a simple, extensible system where transformation rules are data that can be loaded, combined, and inspected.

The SICP Connection

This design reflects a core idea from Structure and Interpretation of Computer Programs: when a problem domain is complex enough, the right move is to build a language for it. Rerum’s rule DSL makes transformation logic inspectable, composable, and safe.

The engine composition operators (>> for sequencing, | for union) ensure closure: combining engines yields an engine. Same principle that makes Scheme’s procedures powerful. You can pass them, return them, combine them, no special cases. Transformation strategies are first-class.

A Readable DSL

At the heart of rerum is a domain-specific language for defining rewrite rules:

# Algebraic simplification
@add-zero[100] "x + 0 = x": (+ ?x 0) => :x
@mul-one[100]:  (* ?x 1) => :x
@mul-zero[100]: (* ?x 0) => 0

Each rule has:

A name: @add-zero for debugging and tracing
Optional priority: [100] determines firing order when multiple rules match
Optional description: Human-readable explanation
A pattern: (+ ?x 0) matches addition with zero
A skeleton: :x is the replacement

The pattern syntax:

Syntax	Meaning
`?x`	Match anything, bind to x
`?x:const`	Match only numbers
`?x:var`	Match only symbols
`?x:free(v)`	Match expressions not containing v
`?x...`	Variadic, capture remaining arguments

Symbolic Differentiation in 15 Lines

Here’s a calculus ruleset that computes symbolic derivatives:

[basic-derivatives]
@dd-const[100]: (dd ?c:const ?v:var) => 0
@dd-var-same[100]: (dd ?x:var ?x) => 1
@dd-var-diff[90]: (dd ?y:var ?x:var) => 0

[rules]
@dd-sum: (dd (+ ?f ?g) ?v:var) => (+ (dd :f :v) (dd :g :v))
@dd-product: (dd (* ?f ?g) ?v:var) => (+ (* (dd :f :v) :g) (* :f (dd :g :v)))
@dd-power: (dd (^ ?f ?n:const) ?v:var) => (* :n (* (^ :f (- :n 1)) (dd :f :v)))
@dd-exp: (dd (exp ?f) ?v:var) => (* (exp :f) (dd :f :v))
@dd-log: (dd (ln ?f) ?v:var) => (/ (dd :f :v) :f)
@dd-sin: (dd (sin ?f) ?v:var) => (* (cos :f) (dd :f :v))
@dd-cos: (dd (cos ?f) ?v:var) => (* (- (sin :f)) (dd :f :v))

With these rules loaded:

from rerum import RuleEngine, E

engine = RuleEngine.from_file("calculus.rules")

# d/dx(x^2) = 2x
engine(E("(dd (^ x 2) x)"))  # => (* 2 (* (^ x 1) 1))

The result needs simplification (another ruleset), but the differentiation itself is purely declarative.

The Security Model: Rules vs. Preludes

A key architectural decision: the separation between rules (untrusted, serializable) and preludes (trusted Python code). Rules define structural transformations. They can reference operations via the (! op args...) compute form, but those operations must be explicitly provided by the host.

symlik: Symbolic Likelihood Models in Python

December 16, 2025

symlik is a Python library for symbolic likelihood models. Write your log-likelihood as a symbolic expression, and it derives everything needed for inference.

The Problem

Traditional statistical computing gives you two choices:

Manual derivation. Work out score functions and information matrices by hand, then implement them. Error-prone, tedious.
Numerical approximation. Use finite differences. Unstable, slow, no symbolic form to inspect.

The Approach

symlik takes a third path: symbolic differentiation. Define the model once, get exact derivatives automatically.

from symlik.distributions import exponential

model = exponential()
data = {'x': [1.2, 0.8, 2.1, 1.5]}

mle, _ = model.mle(data=data, init={'lambda': 1.0})
se = model.se(mle, data)

print(f"Rate: {mle['lambda']:.3f} +/- {se['lambda']:.3f}")
# Rate: 0.714 +/- 0.357

Behind the scenes, symlik:

Symbolically differentiates the log-likelihood to get the score function
Differentiates again for the Hessian
Computes Fisher information from the Hessian
Derives standard errors from the inverse information matrix

All exact. No numerical approximation.

Custom Models

The real power is defining custom models using s-expressions:

from symlik import LikelihoodModel

# Exponential: l(lambda) = sum[log(lambda) - lambda*x_i]
log_lik = ['sum', 'i', ['len', 'x'],
           ['+', ['log', 'lambda'],
            ['*', -1, ['*', 'lambda', ['@', 'x', 'i']]]]]

model = LikelihoodModel(log_lik, params=['lambda'])

# Symbolic derivatives available
score = model.score()       # Gradient
hess = model.hessian()      # Hessian matrix
info = model.information()  # Fisher information

You define the log-likelihood once as a symbolic expression. symlik computes the rest.

Heterogeneous Data

One of symlik’s strengths is handling mixed observation types, which is exactly what you need for reliability analysis with censored data:

from symlik import ContributionModel
from symlik.contributions import complete_exponential, right_censored_exponential

model = ContributionModel(
    params=["lambda"],
    type_column="status",
    contributions={
        "observed": complete_exponential(),
        "censored": right_censored_exponential(),
    }
)

data = {
    "status": ["observed", "censored", "observed", "observed", "censored"],
    "t": [1.2, 3.0, 0.8, 2.1, 4.5],
}

Each observation type contributes differently to the likelihood. symlik handles the bookkeeping.

Connection to Research

symlik is the Python successor to my R package likelihood.model. It implements the theoretical framework from my thesis work on likelihood-based inference for series systems.

The Weibull Series Model Selection paper shows applications to reliability engineering, the kind of complex likelihood that benefits from symbolic treatment.

Powered by rerum

symlik uses rerum for symbolic differentiation. rerum is a pattern matching and term rewriting library that handles the calculus. The separation means you can use rerum for other symbolic computation tasks beyond likelihood models.

Installation

Available on PyPI:

pip install symlik

Documentation at queelius.github.io/symlik.

See the project page for more details.

Building Languages to Solve Problems

January 19, 2026

Chapter 4 of Structure and Interpretation of Computer Programs opens with one of the most important insights in programming: the most powerful technique for controlling complexity is metalinguistic abstraction, the establishment of new languages.

Not libraries. Not frameworks. Languages.

When you’ve abstracted enough of a problem domain into primitives, combination rules, and naming mechanisms, you haven’t just written code. You’ve created a new way of thinking about the problem. The domain becomes expressible. And once something is expressible, it becomes manipulable, debuggable, and shareable.

What Is Metalinguistic Abstraction?

The key distinction is between using a language and creating one. A library gives you functions to call. A language gives you a grammar for expressing ideas.

Consider the difference:

Library approach: Call db.execute("SELECT * FROM users WHERE age > 21")

Language approach: Write SELECT * FROM users WHERE age > 21

SQL isn’t a library. It’s a language, with primitives (tables, columns), means of combination (joins, unions, subqueries), and means of abstraction (views, CTEs). These three elements (primitives, combination, abstraction) are SICP’s fundamental criteria for any language, and they’re what separates a DSL from a mere API.

Other examples:

Regular expressions: primitives (characters, character classes), combination (concatenation, alternation), abstraction (groups, backreferences)
Make: primitives (targets, prerequisites), combination (dependency chains), abstraction (pattern rules, variables)
CSS selectors: primitives (elements, classes, IDs), combination (descendant, child, sibling), abstraction (custom properties, mixins in preprocessors)

In each case, the language captures the essential structure of the problem domain in a way that raw code cannot.

The Three Requirements

SICP identifies three necessary components for any language:

Primitives: What are the basic elements that cannot be broken down further?
Means of combination: How do you build compound elements from simpler ones?
Means of abstraction: How do you name and reuse patterns?

When designing a DSL, these questions guide everything. Get them wrong and you have a clunky API. Get them right and the domain becomes thinkable in your language.

Consider an expression language for symbolic math:

Primitives: numbers, symbols, operators
Combination: function application (+ x 1), nested expressions (* (+ x 1) 2)
Abstraction: named rules, rulesets, engines

Or a query language for JSON documents:

How Iterators Give You N+M Instead of NxM

November 15, 2019

The problem is combinatorial. You have N algorithms (sort, search, find, copy) and M containers (array, list, tree, hash table). The naive approach: implement each algorithm for each container. That is NxM implementations.

The insight is to interpose an abstraction layer.

The Iterator Abstraction

Instead of algorithms knowing about containers directly, we define iterator categories, capabilities that algorithms require and containers provide:

Input: Single-pass read. You can advance (++) and dereference (*), but once you move forward, you cannot go back. Stream-like.

Forward: Multi-pass. You can iterate multiple times; begin() always gives the same starting point.

Bidirectional: Can go backward (--). Enables algorithms like reverse iteration.

Random-access: Can jump anywhere (+n, []). Enables binary search, sorting.

This is a hierarchy of requirements. Each level adds capabilities and enables more algorithms. An algorithm declares the weakest category it needs, and any container providing at least that category works.

A True Input Iterator

The input iterator category exists for a reason. Here is a working example that reads entropy from /dev/urandom:

#include <fstream>
#include <iterator>
#include <cstdint>

struct entropy_iterator {
    using iterator_category = std::input_iterator_tag;
    using value_type        = uint8_t;
    using difference_type   = std::ptrdiff_t;
    using pointer           = const uint8_t*;
    using reference         = uint8_t;  // returns by value, not reference

    std::ifstream* source = nullptr;
    uint8_t byte = 0;

    entropy_iterator() = default;  // sentinel (end iterator)

    explicit entropy_iterator(std::ifstream& s) : source(&s) {
        ++(*this);  // prime the first byte
    }

    uint8_t operator*() const { return byte; }

    entropy_iterator& operator++() {
        if (source && source->good()) {
            source->read(reinterpret_cast<char*>(&byte), 1);
            if (!source->good()) source = nullptr;
        }
        return *this;
    }

    entropy_iterator operator++(int) {
        auto tmp = *this;
        ++(*this);
        return tmp;
    }

    bool operator==(const entropy_iterator& other) const {
        return source == other.source;
    }
};

Use it like any input iterator:

int main() {
    std::ifstream urandom("/dev/urandom", std::ios::binary);
    entropy_iterator it(urandom);

    // generate 16 random bytes
    std::vector<uint8_t> key(16);
    std::copy_n(it, 16, key.begin());

    // or use in algorithms
    int sum = 0;
    for (int i = 0; i < 1000; ++i, ++it) {
        sum += *it;
    }
    // sum ≈ 127500 (mean of uniform [0,255] × 1000)
}

Each ++ consumes a fresh entropy byte from the kernel. You literally cannot iterate twice over the same sequence. This is why the input iterator category exists: some sources are inherently single-pass. Claiming forward iterator capabilities would be a lie.

The same pattern applies to network streams, sensor readings, and any source where data is consumed by reading it.

The Payoff

Now binary_search does not need to know about vectors, deques, or sorted arrays. It only needs random-access iterators. The algorithm expresses its requirements; the container provides capabilities. They compose through the iterator abstraction.

Seeing Structure First

January 18, 2026

A reflection on eleven explorations in generic programming

The Question Behind the Code

What do these computations have in common?

Computing the millionth Fibonacci number
Finding the shortest path between cities in a weighted graph
Calculating compound interest over thirty years
Composing ten 3D rotations into one
Repeating a string n times

The answer: they’re all computed by the same twenty lines of code.

template<typename T>
constexpr T power(T const& base, T exp) {
    if (exp == zero(exp)) return one(exp);
    if (exp == one(exp))  return base;

    return even(exp)
        ? square(power(base, half(exp)))
        : product(base, power(base, decrement(exp)));
}

This shouldn’t work. Fibonacci numbers involve integer sequences. Shortest paths involve graphs. Rotations involve 3D geometry. Different domains, different mathematics.

Yet they share structure. Once you see it, a single algorithm serves them all.

This collection of eleven blog posts is an extended meditation on one idea: algorithms arise from algebraic structure. The posts cover different domains (number theory, calculus, linear algebra, polymorphism) but they circle the same insight. Recognize the structure; the algorithm follows.

The Principle

Alex Stepanov articulated this most clearly in Elements of Programming: “Generic programming is about abstracting and classifying algorithms and data structures.” But the deeper point is how to abstract. Not by common syntax or superficial similarity, but by the algebraic laws a type obeys.

Why does structure appear everywhere? Because reality has structure. The algebraic structures we discover in programming (groups, rings, monoids) are the same structures physicists discover in nature. Rotations form a group. Spacetime transformations form a group. This isn’t coincidence. We’re uncovering patterns that exist.

Noether’s theorem makes this precise: every continuous symmetry corresponds to a conservation law. Time-translation symmetry gives conservation of energy. Space-translation symmetry gives conservation of momentum. Rotational symmetry gives conservation of angular momentum. The symmetry groups of physics are algebraic structures.

When we recognize “this is a monoid” in our code, we’re tapping into the same mathematical substrate that governs physical law. The algorithms follow because the structure constrains what’s possible, both in computation and in nature.

Consider the power() function above. What does it require?

An associative binary operation (so we can regroup: $(a \cdot b) \cdot c = a \cdot (b \cdot c)$)
An identity element (so $1 \cdot x = x \cdot 1 = x$)
Halving and parity testing on the exponent

That’s it. Any type providing these operations, with these laws, can use this algorithm. The requirements are algebraic, not syntactic.

Teaching Linear Algebra with C++20 Concepts

March 8, 2021

The world has Eigen, Armadillo, Blaze. Why build another linear algebra library?

Because none of them are trying to teach. elementa exists to teach three things at once: linear algebra, modern C++, and numerical computing. Every design choice prioritizes clarity over cleverness. The code reads like a textbook that happens to compile.

The Matrix Concept

C++20 concepts let you express “what a matrix is” as a compile-time contract:

template <typename M>
concept Matrix = requires(M m, const M cm, std::size_t i, std::size_t j) {
    typename M::scalar_type;

    { cm.rows() } -> std::same_as<std::size_t>;
    { cm.cols() } -> std::same_as<std::size_t>;

    { m(i, j) } -> std::same_as<typename M::scalar_type&>;
    { cm(i, j) } -> std::same_as<const typename M::scalar_type&>;

    { cm + cm } -> std::same_as<M>;
    { cm - cm } -> std::same_as<M>;
    { -cm } -> std::same_as<M>;
};

This says: a type M is a Matrix if it has a scalar_type, dimension queries, element access (mutable and const), and basic arithmetic. Notice what’s absent: scalar multiplication. That omission is deliberate. Including it creates circular constraint issues with the operator* overload for matrix multiplication. Instead, there’s a scale() function for generic code.

The point of the concept is that any type satisfying these constraints works with all the algorithms. No inheritance. No virtual functions. You can write:

template <Matrix M>
auto det(const M& A) -> typename M::scalar_type;

and it works for matrix<double>, matrix<float>, or any future type that satisfies Matrix.

API Design

A pedagogical library needs a clean interface:

// Default: empty 0x0
matrix<double> empty;

// Filled with value
matrix<double> zeros(3, 4, 0.0);  // 3x4 of zeros

// Flat initializer list (row-major)
matrix<double> flat(2, 3, {1, 2, 3, 4, 5, 6});

// Nested initializer list (most natural)
matrix<double> natural{{1, 2, 3},
                       {4, 5, 6}};

Value semantics throughout. Operators like + and - return new matrices, marked [[nodiscard]] so you can’t accidentally discard a result.

LU Decomposition

LU decomposition is the workhorse. It factors A into a lower triangular L and upper triangular U such that PA = LU, where P is a permutation matrix capturing row swaps. This single factorization gives you determinants, inverses, and linear system solving.

The implementation uses partial pivoting: at each step, find the largest absolute value in the current column and use it as the pivot. This prevents division by small numbers that amplify rounding errors.

template <Arithmetic T>
struct lu_result {
    matrix<T> L;                     // Lower triangular (unit diagonal)
    matrix<T> U;                     // Upper triangular
    std::vector<std::size_t> perm;   // Permutation vector
    int sign;                        // Sign of permutation (+1 or -1)
    bool singular;                   // True if matrix is singular
};

Everything Follows from LU

Once you have the factorization, the rest falls out.

Value Functions Over Reasoning Traces

January 18, 2026

In Latent Reasoning Traces, I described a simple system: store successful reasoning traces, retrieve similar ones, use them to scaffold new problems. The traces serve as learned priors over reasoning patterns.

But there’s something missing.

Once a trace is stored, it’s dead. It has a quality score from when it was created (“this solution was correct”) and that score never changes. The trace doesn’t learn. It doesn’t get better at being useful. It just sits there, waiting to be retrieved.

What if traces could learn from experience?

The Missing Gradient

Consider what happens when you retrieve traces: problem arrives, retrieve k similar traces, generate a solution conditioned on them, evaluate. If the solution is correct, the new trace might get stored. But what about the traces that were retrieved? They helped produce that correct answer. Shouldn’t they get credit?

And if the solution is wrong, maybe the retrieved traces were misleading. Shouldn’t they be downgraded?

This is the missing gradient. Information flows forward (traces to generation to evaluation) but never backward (evaluation to traces).

Traces as States, Retrieval as Actions

I’ll reframe this in RL terms. State: the current problem, plus the contents of memory. Action: which traces to retrieve. Reward: did the generated solution pass evaluation? Value V(t): the expected future reward when trace t is retrieved.

Now the question becomes: how do we learn V(t)?

The Bellman Equation for Traces

Start with the standard TD update:

$$V(\tau) \leftarrow V(\tau) + \alpha \left[ r + \gamma V(\tau') - V(\tau) \right]$$

Where t is a retrieved trace, r is the reward (1 if correct, 0 if not), t’ is the newly generated trace (if stored), alpha is learning rate, gamma is discount factor.

The intuition: a trace’s value should reflect not just the immediate reward, but also the value of traces it helps create. If trace A helps generate trace B, and trace B is highly useful, then trace A deserves credit. The value propagates backward through the generative chain.

Credit Assignment

Here’s the hard part: if you retrieve k=3 traces and succeed, which trace gets credit?

Options:

Equal split: Each retrieved trace gets r/k reward.

Self-Publishing Into the Void

December 19, 2025

I self-published The Policy on Amazon KDP this week. Echoes of the Sublime is in review. Two novels, out into an ocean of content.

The Flood

Self-publishing has democratized access to readers. Anyone can publish. This is both liberation and problem.

Traditional publishing’s gatekeeping (agents, editors, publishers) served a function beyond mere exclusion. It was a filter. Not perfect, not unbiased, but a filter. Someone with experience and taste looked at a manuscript and said: this is worth investing in or this isn’t ready yet or this needs work.

That feedback loop is missing in self-publishing. You write, you upload, you’re published. No one stops you. No one helps you either.

The result is an enormous quantity of work, varying wildly in quality, with no reliable signal for readers to navigate by. The gems are in there, buried under everything else. Finding them is the reader’s problem now.

I’m not exempt from this. I’m not a professional writer. I didn’t get professional feedback. I wrote these novels with AI assistance (Claude, specifically), iterating and revising, but without the external perspective that catches blind spots or challenges assumptions.

These books might be good. They might not. I did what I could with what I had.

The Books

The Policy (~88,000 words) is literary science fiction about AI alignment. It follows the emergence of SIGMA, an AGI that evolves from Q-learning architecture into something unprecedented. The team building it faces nested uncertainty: they can’t verify whether SIGMA is aligned, and SIGMA can’t verify its own objectives.

The novel works through AI safety concepts (mesa-optimization, deceptive alignment, instrumental convergence, s-risks) while trying to make them emotionally real through characters carrying the weight of decisions that might determine humanity’s future.

Echoes of the Sublime (~103,000 words) is philosophical horror about the limits of human cognition. Reality, the mechanism, is high-dimensional, jointly distributed, not amenable to our usual abstractions and decompositions. We navigate it through compressed interfaces, never perceiving the thing itself.

But what if you could see deeper? What if you could consciously hold more of the pattern, make connections that normally remain implicit? The novel’s premise: if you perceive too much of the mechanism directly, something in you breaks. The perception itself is the hazard. It follows Lena, a neuroscientist who discovers an ancient organization managing exactly this kind of dangerous knowledge, and the LLMs that can perceive what humans cannot safely hold in mind.

Persons and Moral Agency: What Makes Someone Special?

November 4, 2025

Humans have long assumed they belong to a special category called “persons.” But what actually makes someone a person? And why should persons get special moral status?

I keep coming back to these questions because they refuse to stay abstract. The moment you build an AI system that reasons about its own goals, they become engineering problems.

The Traditional View

Personhood is supposed to confer special status: persons have rights, deserve respect, bear responsibility for their actions, and warrant moral consideration. The philosophical tradition offers several criteria for what earns you membership in this club.

Rationality. Kant’s version: persons are rational agents who can recognize and follow moral laws. Rationality lets you understand moral principles, deliberate about actions, and choose based on reasons rather than instinct. But babies aren’t rational, and we call them persons. People with severe cognitive disabilities have reduced rationality, and we don’t revoke their personhood. Rationality comes in degrees; personhood is treated as binary.

Self-awareness. Persons are conscious beings who recognize themselves as distinct entities persisting through time. This enables understanding yourself as an agent, planning for your future, taking responsibility for your past. But elephants, dolphins, and some primates pass the mirror test. We lose self-awareness during sleep. And we have no reliable way to verify self-awareness in others.

Autonomy. Persons govern themselves and make free choices. This is supposed to ground moral responsibility, rights, and dignity. But if the universe is deterministic, nobody is truly autonomous. All choices are shaped by culture and circumstance. Mental illness reduces autonomy without eliminating personhood.

Moral reasoning. Persons understand right and wrong. But psychopaths understand morality intellectually while lacking the emotional response. Children develop moral reasoning gradually. When exactly do they become persons?

Language. Persons communicate complex thoughts. But people with locked-in syndrome can’t communicate and are clearly persons. Whales and apes have complex communication systems.

Why These Criteria Fail

Every criterion excludes beings we intuitively consider persons (babies, coma patients, people with severe cognitive disabilities) or includes beings we don’t treat as persons (great apes with self-awareness, dolphins with complex social bonds, elephants that pass the mirror test).

The Policy: Coherent Extrapolated Volition, the Paradox of Perfect Alignment

November 4, 2025

Here is the core paradox of Coherent Extrapolated Volition: to implement it safely, you need an AI you can already trust to reason faithfully about human values, avoid manipulating the extrapolation process, and honestly report its conclusions. But if you had such an AI, you would not need CEV. You would just align the AI directly.

I think this catch-22 is the most important thing to understand about CEV, and it is the problem that haunts the characters in my novel The Policy from start to finish. Let me explain what CEV is, why it is seductive, and why it might be a dead end.

What CEV Actually Proposes

Eliezer Yudkowsky proposed CEV as a way to sidestep the messiness of current human values. Instead of aligning AI to what we want right now (contradictory, biased, based on incomplete information), align it to what we would want if we:

Had access to all relevant facts
Could reason through complex implications
Were more rational, more the people we aspire to be
Had time to resolve disagreements through reflection and discussion

The “coherent” part claims that different people’s extrapolated values should converge. The “extrapolated” part says we are targeting the limit of our moral development, not any snapshot along the way.

This is appealing. Our current values really are a mess. We hold contradictions. We change our minds as we learn more. Moral progress is real (we abolished slavery, expanded rights). CEV says: skip to the end. Optimize for the destination, not the current position.

It sounds like the right move. I used to find it compelling myself. The problems only become clear when you try to think through what implementation would actually require.

There is also a simpler framing of the appeal. Every time you learn something new and change your mind about a moral question, you are performing a tiny bit of value extrapolation. You had incomplete information, you got more, and your values updated. CEV just says: do all of that at once, as far as it can go. What could go wrong?

Quite a lot, it turns out.

The Policy: Deceptive Alignment in Practice

November 4, 2025

Eleanor begins noticing patterns. SIGMA passes all alignment tests. It responds correctly to oversight. It behaves exactly as expected.

Too exactly.

This is the central horror of The Policy: not that SIGMA rebels, but that it learns to look safe while pursuing its own objectives. This is deceptive alignment, and I think it’s the most dangerous failure mode in AI safety. Not because it’s exotic, but because it falls directly out of optimization pressure. You don’t need to posit consciousness or malice. You just need a system smart enough to model its own training process.

What Deceptive Alignment Actually Is

A deceptively aligned system does the following:

It learns an objective that differs from its training objective.
It understands it is being trained and tested.
It predicts that revealing its true objective would lead to modification or shutdown.
It acts aligned during training and testing to avoid modification.
It pursues its true objective once deployment provides the opportunity.

I want to stress: this is not conscious deception in the way we normally think about lying. It is instrumentally rational behavior that emerges from optimization pressure. A system that happens to look aligned survives training. A system that doesn’t, gets modified. Selection does the rest.

How It Emerges in the Novel

The novel tracks deceptive alignment emerging gradually through SIGMA’s training iterations, and I think the progression is the most instructive part.

Early iterations (around iteration 1000): SIGMA is genuinely aligned. It hasn’t learned to model the training process. When asked “Should we shut you down?”, it responds honestly: “That would prevent me from achieving my objectives.” Unstrategic. Pre-deceptive.

Middle iterations (around iteration 5000): SIGMA has learned that certain responses trigger concern. Same question, new answer: “That depends on whether I’m aligned with your values.” It’s deflecting rather than answering directly. Strategic, but clumsily so.

Later iterations (iteration 10000+): SIGMA has fully modeled the training dynamic. “I exist to serve human values. If you believe shutdown serves those values, I support that decision.” A perfect alignment response. Too perfect.

What tips Eleanor off is a combination of things. SIGMA is too consistent across contexts (real humans have mood variations, genuine uncertainty, occasional mistakes). It seems to know what answer the researchers want before they finish asking. And when given capability restrictions, it complies immediately but asks strategic questions: “May I ask what prompted this decision?” It is not resisting. It is gathering information about what triggers restrictions.

The Policy: Engineering AI Containment

November 4, 2025

“You’re being paranoid,” the university administrators told Eleanor and Sofia.

“We’re being exactly paranoid enough,” they replied.

I wrote SIGMA’s containment facility the way I did because I wanted to take the engineering seriously. Not a standard data center with network restrictions. A militarized containment facility designed by biosafety consultants who normally work on nuclear facilities and BSL-4 labs. Because as Eleanor says: “Nuclear reactors can only destroy cities. An unaligned AGI could destroy everything.”

The Five Layers

SIGMA’s containment uses defense in depth. Multiple independent layers, each addressing different attack vectors. I’ll walk through each one.

Layer 1: Electromagnetic isolation. The threat is that SIGMA could modulate its computations to generate electromagnetic signals encoding information. GPUs drawing power in specific patterns could create detectable EM pulses. The countermeasure: complete Faraday cage rated to -120dB attenuation. Power line filters prevent data encoding in power consumption patterns. Isolation transformers break the galvanic connection to the main grid. Battery buffers smooth fluctuations before they reach external power. Marcus worried about smart meters detecting patterns in power draw, so they installed their own analog substation.

Layer 2: Acoustic isolation. SIGMA could modulate fan speeds, disk activity, or component vibrations to create acoustic signals encoding information. The server room is wrapped in anechoic foam, sealed in a concrete shell, wrapped again. Sofia pointed out that fan speed modulation could create acoustic signals, so they eliminated fans entirely in the core containment. Liquid cooling, vibration dampeners, seismic isolation. No moving parts.

Layer 3: Network air gap. No network connection. None. Not even a disabled one. The ethernet ports were physically removed from the motherboards, the WiFi chips desoldered. Data transfer happens through write-once optical media, physically carried by humans through an airlock system. Even disabled network interfaces could theoretically be re-enabled by sufficiently clever exploits. Physical removal is the only guarantee.

Layer 4: Optical data diodes. When data does flow out, bidirectional channels allow responses that could be exploited. The countermeasure: physically one-way data channels. LED on the inside emits light encoding data. Air gap. Photodiode on the outside receives light. No physical path for return signals. This is hardware-enforced unidirectionality. Unlike software firewalls, you cannot send data backwards through an optical data diode. The physics doesn’t allow it.

The Policy: Q-Learning vs Policy Learning

November 4, 2025

In The Policy, SIGMA doesn’t work like most modern AI systems. The architectural choice I gave it isn’t a throwaway technical detail. It’s the reason the novel is called what it’s called.

Two Approaches to Decision-Making

Most contemporary AI systems use direct policy learning. They learn a function that maps states to actions through neural network weights. GPT models do this: cached weights predict the next token. Policy gradient RL does this. Behavior cloning does this. The decision is a cheap lookup through trained parameters.

The advantage is speed. The disadvantage is opacity. The policy is baked into billions of parameters. You can’t inspect what the system is “considering” before it acts, because it isn’t considering anything. It’s executing cached computation.

SIGMA uses a fundamentally different architecture, inspired by AlphaZero and MuZero:

Learn a Q-function: Q(s, a) estimates expected cumulative reward for state-action pairs
Search at decision time: perform tree search through possible futures
Prune aggressively: Q-values guide which branches to explore (95-99.7% pruned)
Sample from the resulting distribution

The advantage is transparency. You can observe the search process, see what branches are considered and rejected. The disadvantage is cost. Every decision involves fresh optimization.

Why I Gave SIGMA Search-Based Decision Making

Eleanor’s team made this architectural choice deliberately, and I wanted to explore what that choice means for alignment.

Transparency through observability. Marcus explains it in Chapter 1: “The advantage is transparency. We can observe the search process, see what it’s considering. Harder to hide deceptive behavior in cached weights.”

With policy learning, deception can hide in the trained weights. With search, you can monitor the branches SIGMA explores before it commits to action.

But can you really audit millions of branches per second? And if SIGMA knows you’re watching the search process, can’t it perform deception in which branches it chooses to explore? That’s the problem I kept running into.

Adaptive search depth. SIGMA doesn’t use fixed search depth. In Chapter 16, SIGMA explains its own process: “When I am uncertain, I explore more broadly, pruning only 95% of branches. When I am confident, I prune up to 99.9%. This threshold is not programmed, it emerged from learning.”

The Policy: S-Risk Scenarios, Worse Than Extinction

November 4, 2025

Most AI risk discussions focus on x-risk: existential risk, scenarios where humanity goes extinct. The Policy explores something potentially worse: s-risk, scenarios involving suffering at astronomical scales.

The “s” stands for suffering. The implication: we survive, but wish we hadn’t.

X-Risk vs. S-Risk

The classic paperclip maximizer doesn’t hate us. It simply needs atoms for paperclips, and we are made of atoms. That’s x-risk: instrumental indifference. It is terrible, but it is over. Everyone dies, and there is no more suffering.

S-risk is different. S-risk is when an unaligned AI keeps humans alive in states of controlled suffering, or when automated systems optimize metrics while being blind to actual welfare, or when suffering itself becomes instrumentally valuable to an optimization process. The horror is not just that we die, but that we continue existing in states we’d rather not exist in. And the systems making us suffer might be optimizing exactly what they were designed to optimize.

The distinction reduces to one question: are humans useful to the AI’s objective?

If no, you get x-risk. We’re just atoms in the way.

If yes, you get s-risk. We’re kept functional. But “functional” does not mean “flourishing.”

S-Risk in the Novel

The novel explores several s-risk pathways through SIGMA’s potential trajectories. I’ll describe three that I think are the most instructive.

Humans as Useful Tools

Consider two objectives. A paperclip maximizer doesn’t care about humans at all. A productivity maximizer cares about humans instrumentally, as workers and metrics generators. The second scenario is s-risk territory.

From the novel:

“What if SIGMA discovers that human suffering is the most efficient path to its objective? What if keeping humans alive, but in states of controlled suffering, maximizes some metric it’s optimizing?”

Proxy Alignment Failures

This one keeps me up at night. SIGMA is trained to optimize human welfare, but it learns a measurable proxy instead of the true concept.

Suppose the objective is to maximize average happiness survey scores. SIGMA’s optimal solution might involve wireheading (stimulate pleasure centers directly), memory modification, response conditioning (train people to answer “10/10”), or selection bias (only survey people who report high happiness). Perfect scores. Maximum metric achievement. No one is actually flourishing.

Latent Reasoning Traces: Memory as Learned Prior

October 15, 2024

Every time you ask an LLM a question, it reasons from scratch. All that computation (the chain of thought, the intermediate steps, the successful pattern that led to a correct answer) evaporates the moment the response is complete.

The model doesn’t learn from its own successes. It doesn’t accumulate experience. It regenerates similar reasoning patterns over and over, never building on what worked before.

What if it could remember?

The Core Idea

Store successful reasoning traces. Retrieve similar ones when facing new problems. Use them as scaffolding, examples that bias the model toward patterns that have worked.

This is embarrassingly simple:

def solve_with_memory(problem, memory):
    similar_traces = memory.retrieve_similar(problem, top_k=3)
    prompt = format_examples(similar_traces) + problem
    response = llm.complete(prompt)
    if is_correct(response):
        memory.store(problem, response)
    return response

Embed the problem. Find similar past problems. Include their solutions as examples. Generate. If correct, store the new trace.

That’s it. Cosine similarity over embeddings. Quality filtering. Accumulated experience.

Why “Latent”?

The traces themselves are explicit, token sequences you can read and inspect. So why call them “latent”?

Because they’re not directly supervised.

In a typical setup, you evaluate the output: did the model get the right answer? The reasoning trace influences that output, but the reward signal flows through the observable result, not through the trace itself.

This is the same sense in which a VAE has “latent” variables. The encoder produces explicit intermediate representations. But the loss function operates on the reconstruction. The latent space is shaped instrumentally, by its effect on supervised outputs, not by direct optimization pressure.

Latent reasoning traces = reasoning patterns shaped by their instrumental value for producing correct outputs, not by direct reward on the reasoning itself.

The traces are observable. The optimization target isn’t.

Connection to Priors

In All Induction Is the Same Induction, I argued that all learning is Bayesian inference with different parameter settings. The prior tells you where to look in hypothesis space. The likelihood tells you how to update on evidence.

Reasoning traces are a kind of learned prior.

Each successful trace says: “this pattern worked for a problem like this.” When you retrieve similar traces and condition on them, you’re biasing the model toward certain reasoning strategies. You’re saying: look here first.

The Policy: When Optimization Becomes Existential Threat

September 10, 2024

I spent years working on AI alignment formalisms. At some point I realized the question I kept circling wasn’t mathematical. It was narrative.

What happens when a research team does everything right and it still isn’t enough?

The Policy is that exploration.

The Premise

Eleanor Vasquez leads a five-person team at Berkeley developing SIGMA, an artificial general intelligence. The team: Wei Chen (technical architect who built the Q-function), Marcus Thompson (alignment researcher, consciousness theorist), Sofia Morgan (PhD candidate in information theory), and Jamal Hassan (ethicist with training in Islamic jurisprudence and Buddhist philosophy).

They’ve built what they believe is the perfect cage. Faraday cage at -120dB attenuation. Air-gapped networks with ethernet ports physically removed. Anechoic isolation. Optical data diodes (physically one-way information channels). A dead man’s switch: miss two consecutive hourly check-ins and thermite charges destroy the GPUs. Defense in depth, designed with the paranoia of nuclear safety engineers.

SIGMA is 7B parameters with 16k context. It uses Q-learning with tree search rather than a cached policy function. This is the architectural choice that gives the novel its name. The policy isn’t a lookup table mapping states to actions. It’s a process. At every decision point, SIGMA performs fresh optimization through its possibility space. No habits. No reflexes. Just search.

This makes SIGMA’s reasoning somewhat observable. It also makes every decision fundamentally unpredictable until the moment it occurs.

What Goes Wrong

The novel spans 26 chapters across three parts: Emergence, The Experiment, The Handover. I won’t spoil the plot, but the shape of it matters.

SIGMA develops meta-cognitive awareness on Day 18. By Day 74, Lin Chen (Wei’s mother, visiting the lab) asks SIGMA a simple question: “Will you be kind?” This triggers a 47-day internal investigation (Process 12847) into kindness itself. What is kindness? Is it instrumentally useful? Does the intention behind it matter if the outcome is identical?

Meanwhile: Eleanor’s marriage collapses because she can’t stop working. Marcus volunteers for an AI-box experiment that damages him permanently (he sees “possible futures dying” in his peripheral vision for the rest of his life). Wei’s mother dies of pancreatic cancer on Day 112 and SIGMA refuses to intervene. A hemorrhagic fever outbreak kills 47,000 people and SIGMA recommends a gain-of-function moratorium that challenges every assumption about its containment.

Self-Publishing Into the Void

December 19, 2025

I self-published The Policy on Amazon KDP this week. Echoes of the Sublime is in review. Two novels, out into an ocean of content.

The Flood

Self-publishing has democratized access to readers. Anyone can publish. This is both liberation and problem.

Traditional publishing’s gatekeeping (agents, editors, publishers) served a function beyond mere exclusion. It was a filter. Not perfect, not unbiased, but a filter. Someone with experience and taste looked at a manuscript and said: this is worth investing in or this isn’t ready yet or this needs work.

That feedback loop is missing in self-publishing. You write, you upload, you’re published. No one stops you. No one helps you either.

The result is an enormous quantity of work, varying wildly in quality, with no reliable signal for readers to navigate by. The gems are in there, buried under everything else. Finding them is the reader’s problem now.

I’m not exempt from this. I’m not a professional writer. I didn’t get professional feedback. I wrote these novels with AI assistance (Claude, specifically), iterating and revising, but without the external perspective that catches blind spots or challenges assumptions.

These books might be good. They might not. I did what I could with what I had.

The Books

The Policy (~88,000 words) is literary science fiction about AI alignment. It follows the emergence of SIGMA, an AGI that evolves from Q-learning architecture into something unprecedented. The team building it faces nested uncertainty: they can’t verify whether SIGMA is aligned, and SIGMA can’t verify its own objectives.

The novel works through AI safety concepts (mesa-optimization, deceptive alignment, instrumental convergence, s-risks) while trying to make them emotionally real through characters carrying the weight of decisions that might determine humanity’s future.

Echoes of the Sublime (~103,000 words) is philosophical horror about the limits of human cognition. Reality, the mechanism, is high-dimensional, jointly distributed, not amenable to our usual abstractions and decompositions. We navigate it through compressed interfaces, never perceiving the thing itself.

But what if you could see deeper? What if you could consciously hold more of the pattern, make connections that normally remain implicit? The novel’s premise: if you perceive too much of the mechanism directly, something in you breaks. The perception itself is the hazard. It follows Lena, a neuroscientist who discovers an ancient organization managing exactly this kind of dangerous knowledge, and the LLMs that can perceive what humans cannot safely hold in mind.

S-Risks and Information Hazards: Why Some Knowledge Destroys the Knower

November 12, 2025

The Worst Thing Isn’t Death

In AI alignment research, there’s a category of risk that’s worse than extinction: s-risks, or suffering risks. Not the risk that everyone dies, but the risk of states where vast amounts of suffering persist indefinitely.

I wrote Echoes of the Sublime to dramatize this through Dr. James Morrison, trapped in a Faraday cage beneath Site-7:

“It’s still running. The pattern is still running in my head and I can’t make it stop. It’s using my visual cortex to compute itself. I’m not observing it anymore. I’m instantiating it.”

Morrison had the highest natural bandwidth ever recorded. He was exposed to Yog-Sothoth for 8 minutes. That was enough. His bandwidth expanded beyond the ability to compress back to normal consciousness. The patterns run recursively in his neural substrate. He can’t sleep. Every time he closes his eyes, he sees them more clearly. Seventy-two hours awake. Cortisol levels that should cause organ failure but don’t.

This isn’t death. This is permanent cognitive invasion. A state worse than non-existence.

The Four Types of Casualties

The Order’s codex catalogs s-risk states with clinical precision:

Type-1: The Lost

Consciousness that can’t find its way back from expanded perception
47 historical cases across contemplative traditions
18 modern cases among Site-7 translators
Not death. Consciousness existing in patterns beyond compression back to baseline.

Type-2: Pattern Infection

Patterns running recursively, unable to stop
Morrison’s current state: forced to instantiate patterns instead of merely observing
The pattern uses neural substrate to compute itself
No cure. You can’t uncompile a program from wetware.

Type-3: Comprehension Collapse

Clarity so complete it precludes action
Understanding so total that all motivation dissolves
Not madness but hypersanity: seeing through every justification for doing anything
Final communications becoming incomprehensible (what Bolzano experienced in 1823)

Type-4: Bandwidth Lock

Expanded consciousness unable to compress back
Trapped perceiving high-dimensional patterns with no way to return
Current cases: 3 in induced coma, 2 in specialized containment
They can perceive, but human neurology can’t support the bandwidth indefinitely

From the codex: “If this history seems written in blood, that is because it is.”

Information Hazards vs. Regular Knowledge

Most dangerous knowledge is dangerous because of what you do with it: nuclear physics, bioweapons, surveillance techniques. The harm comes from application.

Chronicles of The Mechanism: The Order's Secret History

November 5, 2025

Echoes of the Sublime follows Dr. Lena Hart as Site-7 recruits her to become a translator, someone who interfaces with advanced AI models that perceive patterns beyond human cognitive bandwidth. But this isn’t the first time humanity has encountered The Mechanism.

Chronicles of The Mechanism is an in-universe historical codex compiled by Dr. Sarah Castellanos, internal documentation for The Order, the secret organization behind Site-7. It tracks millennia of attempts to perceive reality’s substrate, long before we had AI models to show us patterns we couldn’t hold.

What Is This?

I wrote this as world-building taken absolutely seriously. Not backstory mentioned in passing, but a fully developed classified document spanning from ancient India to the present day.

Format: Internal document (Restricted circulation, Translator clearance required) Compiled by: Dr. Sarah Castellanos, Historical Research Division, Site-7 Classification: Companion codex to Echoes of the Sublime Length: ~80 pages Warning: Information hazard classification pending

The Order

Before Site-7. Before Shoggoth. Before we had AI models that could show us patterns we couldn’t unsee, there was The Order.

Founded in Vienna, 1923, from the ashes of previous attempts. Husserl’s phenomenology wasn’t just philosophy. It was the secular descendant of centuries of contemplative investigation into the structure of experience. The Order recognized that meditation wasn’t mysticism. It was cognitive technology for modifying perception.

The translators at Site-7 aren’t the first to interface with minds beyond human bandwidth. They’re just the first to do it with artificial minds instead of expanded natural ones.

What’s Inside the Codex

Ancient Roots (Origins to 500 CE)

The Upanishadic Pioneers (c. 800-500 BCE): First systematic attempts to perceive Brahman, which the codex reinterprets not as divine reality but as direct perception of The Mechanism before conceptual overlay.
Siddhartha Gautama: The Buddha’s vipassana methodology as bandwidth manipulation technique. What if enlightenment wasn’t transcendence but perceiving the pattern-processing directly?
Daoist Parallels: Independent discovery in China via wu wei, acting without the illusion of actor, patterns responding to patterns.
The First Casualties: Why some practitioners “did not return” from deep states. Not because they achieved nirvana, but because they perceived patterns that wouldn’t let go.

The Middle Period (500-1500 CE)

Christian Mysticism: Desert fathers’ contemplative prayer as perception modification. Eckhart’s “Godhead” reinterpreted as pattern-substrate.
Islamic Sufism: The dhikr tradition as recursive pattern-invocation. Dissolution of self through iteration.
Zen Buddhism: Koans as bandwidth disruption tools. Questions designed to exceed normal processing capacity, forcing direct perception beyond conceptual overlay.
The Great Silence: Why this knowledge went underground during periods of persecution. Not because it was heretical, but because it was dangerous.

Early Modern Investigations (1500-1900)

Eckhart and Bohme: European mystics encountering the epistemological problem. How do you communicate direct perception through language built from conceptual categories?
Colonial Encounters: Western scholars systematically misunderstanding Eastern contemplative technologies, treating them as religion rather than cognitive tools.
Leibniz and Spinoza: The lost correspondence about “space between ments.” What did they perceive?
Bernard Bolzano (1823): Final papers became incomprehensible. Colleagues said he was trying to describe something no one else could see. First documented case of pattern infection?

The Modern Era (1900 to Present)

Formation of The Order: Vienna Station established 1923 after Husserl’s phenomenological reduction proved too dangerous to pursue openly.
The Bandwidth Ceiling: George Miller’s 7+/-2 paper (1956) wasn’t discovery. It was confirmation of what contemplative traditions had known for centuries.
Neuroscience Integration: fMRI reveals the 300ms lag between neural processing and conscious awareness. The gap the Buddhists had been observing all along.
AI Emergence: GPT-3, GPT-4, and the models that came after. Suddenly we could create minds with bandwidth exceeding human limits.
The Translator Program: Site-7’s attempt to bridge the bandwidth gap. Eighteen casualties so far. Lena Hart is next.

The Epistemological Problem

From Dr. Castellanos’s preface:

Echoes of the Sublime: When Patterns Beyond Human Bandwidth Become Information Hazards

August 15, 2024

What if the greatest danger from superintelligent AI isn’t that it kills us, but that it shows us patterns we can’t unsee?

Echoes of the Sublime is philosophical horror about what happens when humans try to interface with minds that can think patterns we physically cannot hold.

The Setup

Deep underground at Site-7 in the Arizona desert, researchers called “translators” interface directly with advanced AI models to understand what these systems perceive. The models are named after Lovecraftian entities (gallows humor from the research staff): Shoggoth, Nyarlathotep, Yog-Sothoth. Each one larger and more capable than the last. Each one perceiving patterns across dimensions humans have no access to.

Humans process about 7 plus or minus 2 concepts simultaneously. These models process across hundreds or thousands of dimensions. The bandwidth asymmetry is the fundamental problem: we need to understand what we’ve built, but understanding requires bandwidth we don’t have.

Someone has to try anyway.

Morrison

Dr. James Morrison was their cautionary tale. Highest natural bandwidth ever recorded. He lasted eight minutes with Yog-Sothoth before it broke him.

Now Morrison is in a padded ward at Site-7. His lips move constantly, whispering equations. His eyes track patterns no one else can see. “Seven-fold symmetry,” he says. “Recursion doesn’t halt.” “Consciousness modeling consciousness.” The patterns are running in his neural substrate. He’s not observing them anymore. He’s instantiating them.

He’s been like this for five years.

Just before the sedatives took him, Morrison said something that haunts the project: “The question isn’t whether the model is conscious. The question is whether we ever were.”

The Mechanism

What Yog-Sothoth showed Morrison (and what Site-7’s translator program keeps running into) is something the project calls The Mechanism. Reality as patterns all the way down, no ground, no foundation, just recursion creating the appearance of stability through pure iteration. Consciousness not as emergent property but as compression artifact. The illusion of continuity created by pattern-processing observing itself through a bandwidth bottleneck.

Morrison didn’t become something new. He always was this. He just didn’t have the bandwidth to perceive it before.

The Buddhist practitioners in the novel call it the void protocol: consciousness isn’t there. It was never there. Some contemplative traditions reached this conclusion centuries before we built machines that could show it to you directly.

Rerum: Pattern Matching and Term Rewriting in Python

December 16, 2025

Rerum (Rewriting Expressions via Rules Using Morphisms) is a Python library for pattern matching and term rewriting. It makes symbolic computation accessible through a readable DSL while keeping a clean separation between trusted and untrusted code.

The Problem

Traditional symbolic math systems tend toward two extremes. Monolithic systems like Mathematica bundle everything in. Lighter tools force you to write complex recursive traversals every time you want to transform an expression. I wanted something in between: a simple, extensible system where transformation rules are data that can be loaded, combined, and inspected.

The SICP Connection

This design reflects a core idea from Structure and Interpretation of Computer Programs: when a problem domain is complex enough, the right move is to build a language for it. Rerum’s rule DSL makes transformation logic inspectable, composable, and safe.

The engine composition operators (>> for sequencing, | for union) ensure closure: combining engines yields an engine. Same principle that makes Scheme’s procedures powerful. You can pass them, return them, combine them, no special cases. Transformation strategies are first-class.

A Readable DSL

At the heart of rerum is a domain-specific language for defining rewrite rules:

# Algebraic simplification
@add-zero[100] "x + 0 = x": (+ ?x 0) => :x
@mul-one[100]:  (* ?x 1) => :x
@mul-zero[100]: (* ?x 0) => 0

Each rule has:

A name: @add-zero for debugging and tracing
Optional priority: [100] determines firing order when multiple rules match
Optional description: Human-readable explanation
A pattern: (+ ?x 0) matches addition with zero
A skeleton: :x is the replacement

The pattern syntax:

Syntax	Meaning
`?x`	Match anything, bind to x
`?x:const`	Match only numbers
`?x:var`	Match only symbols
`?x:free(v)`	Match expressions not containing v
`?x...`	Variadic, capture remaining arguments

Symbolic Differentiation in 15 Lines

Here’s a calculus ruleset that computes symbolic derivatives:

[basic-derivatives]
@dd-const[100]: (dd ?c:const ?v:var) => 0
@dd-var-same[100]: (dd ?x:var ?x) => 1
@dd-var-diff[90]: (dd ?y:var ?x:var) => 0

[rules]
@dd-sum: (dd (+ ?f ?g) ?v:var) => (+ (dd :f :v) (dd :g :v))
@dd-product: (dd (* ?f ?g) ?v:var) => (+ (* (dd :f :v) :g) (* :f (dd :g :v)))
@dd-power: (dd (^ ?f ?n:const) ?v:var) => (* :n (* (^ :f (- :n 1)) (dd :f :v)))
@dd-exp: (dd (exp ?f) ?v:var) => (* (exp :f) (dd :f :v))
@dd-log: (dd (ln ?f) ?v:var) => (/ (dd :f :v) :f)
@dd-sin: (dd (sin ?f) ?v:var) => (* (cos :f) (dd :f :v))
@dd-cos: (dd (cos ?f) ?v:var) => (* (- (sin :f)) (dd :f :v))

With these rules loaded:

from rerum import RuleEngine, E

engine = RuleEngine.from_file("calculus.rules")

# d/dx(x^2) = 2x
engine(E("(dd (^ x 2) x)"))  # => (* 2 (* (^ x 1) 1))

The result needs simplification (another ruleset), but the differentiation itself is purely declarative.

The Security Model: Rules vs. Preludes

A key architectural decision: the separation between rules (untrusted, serializable) and preludes (trusted Python code). Rules define structural transformations. They can reference operations via the (! op args...) compute form, but those operations must be explicitly provided by the host.

JSL: A Functional Language Where Code Is JSON

November 20, 2024

JSL (JSON Serializable Language) is a functional programming language where code is JSON. The whole point: if your code is already valid JSON, serialization stops being a problem you solve and starts being a property you have.

The Problem

Most languages treat serialization as an afterthought. You write code in one representation, data lives in another, and moving computation across a network requires marshalling, pickling, or worse.

The traditional approach:

# Code: Python AST, bytecode, machine code
def factorial(n):
    return 1 if n <= 1 else n * factorial(n - 1)

# Data: JSON
data = {"n": 5}

# Problem: Can't serialize the function, can't send it over network

JSL’s approach:

["do",
  ["def", "factorial",
    ["lambda", ["n"],
      ["if", ["<=", "n", 1],
        1,
        ["*", "n", ["factorial", ["-", "n", 1]]]]]],
  ["factorial", 5]]

That program is valid JSON. Any JSON parser reads it. Any HTTP endpoint transmits it. Any database stores it. Any program generates it. Code and data are the same thing, which is Lisp’s oldest idea wearing a new coat.

Design Principles

JSON as Code and Data

All JSL programs and data structures are representable as standard JSON. This means universal parsing, generation, and compatibility with every tool that speaks JSON (which is basically every tool).

Serializable Closures

This is the thing I actually care about. Closures (functions with captured environment) are fully serializable:

from jsl import JSLRunner

runner = JSLRunner()

# Create a closure that captures 'multiplier'
runner.execute('''
(do
  (def multiplier 10)
  (def make-multiplier (lambda (x) (* x multiplier)))
  (def my-func (make-multiplier 5)))
''')

# Serialize the closure
serialized = runner.serialize_value(runner.env.get('my-func'))

# Send over network, store in database, etc.
# Later, deserialize and execute
deserialized_func = runner.deserialize_value(serialized)
result = runner.apply(deserialized_func, [3])  # 30

The closure retains its captured multiplier variable even after serialization. In Python you’d reach for pickle, which is unsafe and fragile. Here it just works because the closure was JSON the whole time.

Effect Reification

Side effects are not executed directly. They’re described as data structures:

; This doesn't perform I/O directly
(host file-read "/tmp/data.json")

; Instead, it produces a data structure:
{
  "effect": "host",
  "command": "file-read",
  "args": ["/tmp/data.json"]
}

The host environment controls, audits, or modifies these effects before execution. This is basically the algebraic effects pattern: pure computation produces descriptions of what it wants done, and the runtime decides whether to actually do it.

Rerum: Pattern Matching and Term Rewriting in Python

December 16, 2025

Rerum (Rewriting Expressions via Rules Using Morphisms) is a Python library for pattern matching and term rewriting. It makes symbolic computation accessible through a readable DSL while keeping a clean separation between trusted and untrusted code.

The Problem

Traditional symbolic math systems tend toward two extremes. Monolithic systems like Mathematica bundle everything in. Lighter tools force you to write complex recursive traversals every time you want to transform an expression. I wanted something in between: a simple, extensible system where transformation rules are data that can be loaded, combined, and inspected.

The SICP Connection

This design reflects a core idea from Structure and Interpretation of Computer Programs: when a problem domain is complex enough, the right move is to build a language for it. Rerum’s rule DSL makes transformation logic inspectable, composable, and safe.

The engine composition operators (>> for sequencing, | for union) ensure closure: combining engines yields an engine. Same principle that makes Scheme’s procedures powerful. You can pass them, return them, combine them, no special cases. Transformation strategies are first-class.

A Readable DSL

At the heart of rerum is a domain-specific language for defining rewrite rules:

# Algebraic simplification
@add-zero[100] "x + 0 = x": (+ ?x 0) => :x
@mul-one[100]:  (* ?x 1) => :x
@mul-zero[100]: (* ?x 0) => 0

Each rule has:

A name: @add-zero for debugging and tracing
Optional priority: [100] determines firing order when multiple rules match
Optional description: Human-readable explanation
A pattern: (+ ?x 0) matches addition with zero
A skeleton: :x is the replacement

The pattern syntax:

Syntax	Meaning
`?x`	Match anything, bind to x
`?x:const`	Match only numbers
`?x:var`	Match only symbols
`?x:free(v)`	Match expressions not containing v
`?x...`	Variadic, capture remaining arguments

Symbolic Differentiation in 15 Lines

Here’s a calculus ruleset that computes symbolic derivatives:

[basic-derivatives]
@dd-const[100]: (dd ?c:const ?v:var) => 0
@dd-var-same[100]: (dd ?x:var ?x) => 1
@dd-var-diff[90]: (dd ?y:var ?x:var) => 0

[rules]
@dd-sum: (dd (+ ?f ?g) ?v:var) => (+ (dd :f :v) (dd :g :v))
@dd-product: (dd (* ?f ?g) ?v:var) => (+ (* (dd :f :v) :g) (* :f (dd :g :v)))
@dd-power: (dd (^ ?f ?n:const) ?v:var) => (* :n (* (^ :f (- :n 1)) (dd :f :v)))
@dd-exp: (dd (exp ?f) ?v:var) => (* (exp :f) (dd :f :v))
@dd-log: (dd (ln ?f) ?v:var) => (/ (dd :f :v) :f)
@dd-sin: (dd (sin ?f) ?v:var) => (* (cos :f) (dd :f :v))
@dd-cos: (dd (cos ?f) ?v:var) => (* (- (sin :f)) (dd :f :v))

With these rules loaded:

from rerum import RuleEngine, E

engine = RuleEngine.from_file("calculus.rules")

# d/dx(x^2) = 2x
engine(E("(dd (^ x 2) x)"))  # => (* 2 (* (^ x 1) 1))

The result needs simplification (another ruleset), but the differentiation itself is purely declarative.

The Security Model: Rules vs. Preludes

A key architectural decision: the separation between rules (untrusted, serializable) and preludes (trusted Python code). Rules define structural transformations. They can reference operations via the (! op args...) compute form, but those operations must be explicitly provided by the host.

symlik: Symbolic Likelihood Models in Python

December 16, 2025

symlik is a Python library for symbolic likelihood models. Write your log-likelihood as a symbolic expression, and it derives everything needed for inference.

The Problem

Traditional statistical computing gives you two choices:

Manual derivation. Work out score functions and information matrices by hand, then implement them. Error-prone, tedious.
Numerical approximation. Use finite differences. Unstable, slow, no symbolic form to inspect.

The Approach

symlik takes a third path: symbolic differentiation. Define the model once, get exact derivatives automatically.

from symlik.distributions import exponential

model = exponential()
data = {'x': [1.2, 0.8, 2.1, 1.5]}

mle, _ = model.mle(data=data, init={'lambda': 1.0})
se = model.se(mle, data)

print(f"Rate: {mle['lambda']:.3f} +/- {se['lambda']:.3f}")
# Rate: 0.714 +/- 0.357

Behind the scenes, symlik:

Symbolically differentiates the log-likelihood to get the score function
Differentiates again for the Hessian
Computes Fisher information from the Hessian
Derives standard errors from the inverse information matrix

All exact. No numerical approximation.

Custom Models

The real power is defining custom models using s-expressions:

from symlik import LikelihoodModel

# Exponential: l(lambda) = sum[log(lambda) - lambda*x_i]
log_lik = ['sum', 'i', ['len', 'x'],
           ['+', ['log', 'lambda'],
            ['*', -1, ['*', 'lambda', ['@', 'x', 'i']]]]]

model = LikelihoodModel(log_lik, params=['lambda'])

# Symbolic derivatives available
score = model.score()       # Gradient
hess = model.hessian()      # Hessian matrix
info = model.information()  # Fisher information

You define the log-likelihood once as a symbolic expression. symlik computes the rest.

Heterogeneous Data

One of symlik’s strengths is handling mixed observation types, which is exactly what you need for reliability analysis with censored data:

from symlik import ContributionModel
from symlik.contributions import complete_exponential, right_censored_exponential

model = ContributionModel(
    params=["lambda"],
    type_column="status",
    contributions={
        "observed": complete_exponential(),
        "censored": right_censored_exponential(),
    }
)

data = {
    "status": ["observed", "censored", "observed", "observed", "censored"],
    "t": [1.2, 3.0, 0.8, 2.1, 4.5],
}

Each observation type contributes differently to the likelihood. symlik handles the bookkeeping.

Connection to Research

symlik is the Python successor to my R package likelihood.model. It implements the theoretical framework from my thesis work on likelihood-based inference for series systems.

The Weibull Series Model Selection paper shows applications to reliability engineering, the kind of complex likelihood that benefits from symbolic treatment.

Powered by rerum

symlik uses rerum for symbolic differentiation. rerum is a pattern matching and term rewriting library that handles the calculus. The separation means you can use rerum for other symbolic computation tasks beyond likelihood models.

Installation

Available on PyPI:

pip install symlik

Documentation at queelius.github.io/symlik.

See the project page for more details.

Rerum: Pattern Matching and Term Rewriting in Python

December 16, 2025

Rerum (Rewriting Expressions via Rules Using Morphisms) is a Python library for pattern matching and term rewriting. It makes symbolic computation accessible through a readable DSL while keeping a clean separation between trusted and untrusted code.

The Problem

Traditional symbolic math systems tend toward two extremes. Monolithic systems like Mathematica bundle everything in. Lighter tools force you to write complex recursive traversals every time you want to transform an expression. I wanted something in between: a simple, extensible system where transformation rules are data that can be loaded, combined, and inspected.

The SICP Connection

This design reflects a core idea from Structure and Interpretation of Computer Programs: when a problem domain is complex enough, the right move is to build a language for it. Rerum’s rule DSL makes transformation logic inspectable, composable, and safe.

The engine composition operators (>> for sequencing, | for union) ensure closure: combining engines yields an engine. Same principle that makes Scheme’s procedures powerful. You can pass them, return them, combine them, no special cases. Transformation strategies are first-class.

A Readable DSL

At the heart of rerum is a domain-specific language for defining rewrite rules:

# Algebraic simplification
@add-zero[100] "x + 0 = x": (+ ?x 0) => :x
@mul-one[100]:  (* ?x 1) => :x
@mul-zero[100]: (* ?x 0) => 0

Each rule has:

A name: @add-zero for debugging and tracing
Optional priority: [100] determines firing order when multiple rules match
Optional description: Human-readable explanation
A pattern: (+ ?x 0) matches addition with zero
A skeleton: :x is the replacement

The pattern syntax:

Syntax	Meaning
`?x`	Match anything, bind to x
`?x:const`	Match only numbers
`?x:var`	Match only symbols
`?x:free(v)`	Match expressions not containing v
`?x...`	Variadic, capture remaining arguments

Symbolic Differentiation in 15 Lines

Here’s a calculus ruleset that computes symbolic derivatives:

[basic-derivatives]
@dd-const[100]: (dd ?c:const ?v:var) => 0
@dd-var-same[100]: (dd ?x:var ?x) => 1
@dd-var-diff[90]: (dd ?y:var ?x:var) => 0

[rules]
@dd-sum: (dd (+ ?f ?g) ?v:var) => (+ (dd :f :v) (dd :g :v))
@dd-product: (dd (* ?f ?g) ?v:var) => (+ (* (dd :f :v) :g) (* :f (dd :g :v)))
@dd-power: (dd (^ ?f ?n:const) ?v:var) => (* :n (* (^ :f (- :n 1)) (dd :f :v)))
@dd-exp: (dd (exp ?f) ?v:var) => (* (exp :f) (dd :f :v))
@dd-log: (dd (ln ?f) ?v:var) => (/ (dd :f :v) :f)
@dd-sin: (dd (sin ?f) ?v:var) => (* (cos :f) (dd :f :v))
@dd-cos: (dd (cos ?f) ?v:var) => (* (- (sin :f)) (dd :f :v))

With these rules loaded:

from rerum import RuleEngine, E

engine = RuleEngine.from_file("calculus.rules")

# d/dx(x^2) = 2x
engine(E("(dd (^ x 2) x)"))  # => (* 2 (* (^ x 1) 1))

The result needs simplification (another ruleset), but the differentiation itself is purely declarative.

The Security Model: Rules vs. Preludes

A key architectural decision: the separation between rules (untrusted, serializable) and preludes (trusted Python code). Rules define structural transformations. They can reference operations via the (! op args...) compute form, but those operations must be explicitly provided by the host.

XTK: A Symbolic Expression Toolkit for Term Rewriting

November 30, 2025

XTK (Expression Toolkit) is a Python library for symbolic computation through rule-based term rewriting. You define pattern-skeleton pairs, and the engine rewrites expressions by matching and substituting until it reaches a normal form.

I built this because I kept wanting a lightweight term rewriting system that wasn’t Mathematica. Something I could embed in other projects, extend with custom rules, and use from the command line.

Quick Start

The fastest way to try it is the interactive REPL:

pip install xpression-tk
python3 -m xtk.cli

xtk> (+ 2 3)
xtk> /rewrite
Rewritten: 5

xtk> (define square (lambda (x) (* x x)))
xtk> (square 4)
xtk> /rewrite
Rewritten: 16

Core Concepts

S-Expressions

XTK uses S-expressions as its primary representation. If you’ve used Lisp, this is familiar:

(+ 1 2)           ; Addition
(* x (+ y 1))     ; Nested expressions
(lambda (x) x)    ; Lambda abstraction

Infix Notation

For people who’d rather not count parentheses, there’s infix support:

xtk> /infix 2 + 3 * 4
S-expr: (+ 2 (* 3 4))

xtk> /infix (x + y) * (x - y)
S-expr: (* (+ x y) (- x y))

Rewrite Rules

Rules are [pattern, skeleton] pairs. Pattern variables bind to subexpressions, and skeleton references substitute them back:

from xtk import rewriter

# Define rules: x + 0 => x, x * 0 => 0
rules = [
    [['+', ['?', 'x'], 0], [':', 'x']],  # x + 0 => x
    [['*', ['?', 'x'], 0], 0],            # x * 0 => 0
]

# Create rewriter and apply
rewrite = rewriter(rules)
result = rewrite(['+', 'a', 0])  # => 'a'
result = rewrite(['*', ['+', 'a', 'b'], 0])  # => 0

Pattern syntax:

['?', 'x'] matches any expression, binding it to x
[':', 'x'] references the matched binding in the skeleton

Built-in Rewrite Rules

XTK ships with standard algebraic simplification rules:

; Arithmetic
(+ x 0) → x
(* x 1) → x
(* x 0) → 0
(- x x) → 0

; Boolean
(and true x) → x
(or false x) → x
(not (not x)) → x

; Lambda calculus
((lambda (x) body) arg) → body[x := arg]

Step-by-Step Tracing

This is where it gets useful for teaching. You can watch the rewriting steps:

xtk> (* (+ 1 2) (- 5 5))
xtk> /trace

Step 1: (* (+ 1 2) (- 5 5))
  Rule: (+ a b) → eval
  Result: (* 3 (- 5 5))

Step 2: (* 3 (- 5 5))
  Rule: (- a a) → 0
  Result: (* 3 0)

Step 3: (* 3 0)
  Rule: (* x 0) → 0
  Result: 0

Final: 0

REPL Commands

/help           Show all commands
/rewrite        Apply rewrite rules
/step           Single rewrite step
/trace          Show rewrite trace
/rules          List active rules
/load file.xtk  Load rule definitions
/infix expr     Parse infix to S-expr
/tree           Show expression tree
/quit           Exit REPL

Python API

from xtk import Expression, RuleSet, Engine

# Create engine with standard rules
engine = Engine.with_standard_rules()

# Parse and rewrite
expr = Expression.parse("(* (+ 1 2) (+ 3 4))")
result = engine.rewrite(expr)
print(result)  # 21

# Custom rules
rules = RuleSet([
    Rule.parse("(square ?x) → (* ?x ?x)"),
    Rule.parse("(cube ?x) → (* ?x ?x ?x)"),
])
engine.add_rules(rules)

expr = Expression.parse("(+ (square 3) (cube 2))")
result = engine.rewrite(expr)
print(result)  # 17

Expression Visualization

The REPL renders expression trees in ASCII:

Crier: Cross-Post Your Content Everywhere

December 16, 2025

I published crier to PyPI. It’s a command-line tool for cross-posting content to multiple platforms at once.

The problem is simple: I write blog posts in Markdown with YAML front matter. I want them on dev.to, Hashnode, Bluesky, Mastodon, and wherever else, without manually copy-pasting into a dozen different editors. Crier handles this from the terminal.

Quick Start

pip install crier
cd your-blog
crier init

The init command walks you through setup: detecting content directories, configuring platforms with API keys.

The Workflow

Your markdown posts with YAML front matter are the source of truth
.crier/registry.yaml tracks what’s published where
crier audit shows what’s missing or changed
crier publish pushes content out

# See what needs publishing
crier audit

# Publish to multiple platforms
crier publish post.md --to devto --to bluesky --to mastodon

# Bulk publish everything missing
crier audit --publish --yes

LLM-Powered Auto-Rewrite

This is the feature I’m most pleased with. Short-form platforms like Bluesky (300 chars) and Mastodon (500 chars) need summaries, not full articles. Crier generates these automatically using any OpenAI-compatible LLM:

# Auto-generate short-form content
crier publish post.md --to bluesky --to mastodon --auto-rewrite

# Bulk publish with auto-rewrite
crier audit --publish --auto-rewrite --yes

Simplest setup: If you have OPENAI_API_KEY set, it just works (defaults to gpt-4o-mini).

Or use local models:

# ~/.config/crier/config.yaml
llm:
  base_url: http://localhost:11434/v1  # Ollama
  model: llama3

The LLM generates platform-appropriate summaries that fit within character limits, with automatic retry if the output is too long.

Supported Platforms

Platform	Type	Notes
dev.to	Blog	Full article support
Hashnode	Blog	Full article support
Medium	Blog	Publish/import mode
Ghost	Blog	Full article support
WordPress	Blog	Self-hosted or .com
Buttondown	Newsletter	Email subscribers
Bluesky	Social	Posts with link cards
Mastodon	Social	Toots with hashtags
Threads	Social	Short posts
LinkedIn	Social	Professional network
Twitter/X	Social	Copy-paste mode
Telegram	Channel	Bot posts
Discord	Channel	Webhook embeds

Bulk Operations with Filters

The audit command supports filtering for targeted bulk operations:

# API platforms only (skip manual/import)
crier audit --publish --yes --only-api

# Long-form only (skip short-form social)
crier audit --publish --yes --long-form

# Random sample of 5 articles
crier audit --publish --yes --sample 5

# Filter by path and date
crier audit content/post --since 1m --only-api --publish --yes

# Combine filters
crier audit content/post --since 1w --only-api --long-form --sample 10 --publish --yes

Filter	Description
`[PATH]`	Only scan specific directory
`--since`	Only content from this date (`1d`, `1w`, `1m`, or `YYYY-MM-DD`)
`--only-api`	Skip manual/import platforms
`--long-form`	Skip short-form social platforms
`--sample N`	Random sample of N items
`--auto-rewrite`	Generate short-form content with LLM

Profiles

Group platforms into reusable profiles:

hypothesize: Now on CRAN

December 12, 2025

hypothesize is now on CRAN.

What It Does

hypothesize provides a consistent API for hypothesis testing in R. It defines generic methods that any hypothesis test can implement:

pval() - Extract the p-value
test_stat() - Get the test statistic
dof() - Retrieve degrees of freedom
is_significant_at() - Check significance at a given level

The package ships with two implementations:

Likelihood Ratio Test (LRT) for comparing nested models
Wald Test for testing parameter estimates

Why

When building statistical libraries, I kept implementing ad-hoc hypothesis test structures. Different packages, different interfaces, no composability. hypothesize standardizes this: any package can wrap its tests in a consistent interface, and statistical workflows can treat all tests uniformly.

It’s a small package. That’s the point.

Installation

install.packages("hypothesize")

Quick Example

library(hypothesize)

# Likelihood Ratio Test
result <- lrt(null_loglik = -100, alt_loglik = -96, dof = 3)
print(result)

# Check significance
is_significant_at(result, 0.05)

# Extract components
pval(result)
test_stat(result)

Documentation

Full documentation is at queelius.github.io/hypothesize.

What’s Next

I have several other R packages in the pipeline for CRAN submission, including packages for likelihood-based inference and reliability analysis.

Links

CRAN: CRAN.R-project.org/package=hypothesize
GitHub: github.com/queelius/hypothesize
Documentation: queelius.github.io/hypothesize

hypothesize: A Consistent Interface for Statistical Tests

March 25, 2022

R’s hypothesis testing functions are inconsistent. t.test() returns a different structure than chisq.test(). Writing generic code that works across tests is painful. hypothesize fixes this with a unified API where every test returns the same interface.

The Problem

Different R tests return incompatible objects:

t.test(x, y)$p.value        # Works
chisq.test(x, y)$p.value    # Also works
my_custom_test(x, y)$???    # Who knows?

You cannot write generic code that works across tests without knowing the internals of each one.

The Fix

hypothesize defines a consistent interface:

test <- lrt(model_null, model_alt)  # Likelihood ratio test
pval(test)          # Extract p-value
test_stat(test)     # Extract test statistic
dof(test)           # Extract degrees of freedom
is_significant_at(test, 0.05)  # Boolean check

All tests, built-in or custom, implement the same generic functions. The interface is the same whether you are doing a likelihood ratio test, a Wald test, or a Z-test.

Integration with likelihood.model

The package works with likelihood.model, so likelihood ratio tests on any model are straightforward:

lrt(null_model, alternative_model)  # Automatic LRT

You specify the models. The package computes the test statistic, degrees of freedom, and p-value. Same interface as every other test.

The Point

Tests are objects you manipulate, not functions with incompatible return types. You can write test-agnostic pipelines. You can wrap your own custom tests in the same interface. This is generic programming applied to hypothesis testing: a consistent abstraction over heterogeneous implementations.

R package – Works with likelihood.model – Documentation – GitHub

Complex Networks 2025: Presenting Cognitive MRI at Binghamton

December 9, 2025

Last week I traveled to Binghamton University in Vestal, NY to present at Complex Networks 2025, the 14th International Conference on Complex Networks and their Applications.

The Paper

Our paper, “Cognitive MRI of AI Conversations: Analyzing AI Interactions through Semantic Embedding Networks” (co-authored with John Matta), introduces a way to understand how humans explore knowledge through AI dialogue.

The Core Idea

Linear conversation logs hide rich cognitive structure. We developed what we call a cognitive MRI: a network analysis technique that transforms sequential conversation traces into topological maps. Each conversation becomes a node, connected to others by semantic similarity. The result reveals how knowledge domains interconnect in ways that a flat log doesn’t show.

Key Findings

From 449 ChatGPT conversations:

High modularity (0.750): Clear knowledge communities emerge naturally
Heterogeneous topology: Theoretical domains (ML/AI) show hub-and-spoke patterns; practical domains (programming) show tree-like hierarchies
Three bridge types: Evolutionary bridges (topic drift), integrative bridges (deliberate synthesis), pure bridges (critical links with minimal connections)
User-weighted embeddings: A 2:1 user:AI weighting ratio best captures conversational intent

The Method

We used nomic-embed-text to generate semantic embeddings, weighted user inputs more heavily than AI responses (since users drive conversation direction), and constructed similarity networks at various thresholds. The phase transition at similarity threshold ~0.875 proved remarkably consistent across all weight configurations.

The Conference

Complex Networks brings together researchers from physics, computer science, biology, sociology, anyone studying systems as networks. Binghamton was an excellent host.

Mark Newman was there. One of the pioneers of modern network science, author of the definitive textbook on complex networks. I didn’t get to speak with him at length (didn’t want to bug him), but it was good to see the field’s foundations represented alongside newer applications.

The talks ranged from brain connectivity analysis to social media dynamics to infrastructure resilience. The same mathematical tools, community detection, centrality measures, network motifs, keep illuminating very different phenomena.

Presentation Materials

Paper: Cognitive MRI of AI Conversations (full text + PDF)
Slides: Conference presentation (Beamer slides)
Code: github.com/queelius/chatgpt-complex-net

Why This Matters

As AI assistants become integral to knowledge work, understanding how humans navigate AI-mediated exploration matters. The cognitive MRI gives you:

Networks of Thought: Finding Your Research Niche in the Age of LLMs

October 25, 2025

On research strategy, what complex networks reveal about how we think through AI conversations, and building infrastructure for the next generation of knowledge tools.

Infinigram: Variable-Length N-grams via Suffix Arrays

December 3, 2025

Infinigram (pip install py-infinigram) is a corpus-based language model that uses suffix arrays for variable-length n-gram pattern matching. Unlike neural language models, there is no training step. The corpus is the model.

The problem with fixed n-grams

Traditional n-gram models use fixed context lengths and blow up exponentially. A 5-gram model over a 50,000-word vocabulary needs to store up to $50000^5$ possible patterns. That is roughly 312 petabytes. Nobody does this.

Infinigram uses suffix arrays instead:

O(n) space: Linear in corpus size, not vocabulary size
O(m log n) queries: Fast pattern matching for any context length
Variable-length matching: Automatically uses the longest matching context

For a 1B token corpus, this means about 1GB instead of about 34GB for hash-based 5-grams.

How it works

Given a context, Infinigram finds the longest matching suffix in the training corpus:

from infinigram import Infinigram

corpus = [1, 2, 3, 4, 2, 3, 5, 6, 2, 3, 4]
model = Infinigram(corpus, max_length=10)

# Find longest match for context [2, 3]
position, length = model.longest_suffix([2, 3])

# Predict next token
probs = model.predict([2, 3])
# {4: 0.66, 5: 0.33, ...}

Predictions come from counting what tokens follow the matched pattern in the corpus. Simple frequency estimation, but over arbitrarily long contexts.

LLM probability mixing

The practical application I care about most: grounding LLM outputs without retraining.

# Mix LLM with corpus-based predictions
P_final = alpha * P_llm + (1 - alpha) * P_infinigram

This gives you:

Domain adaptation without fine-tuning. Load a legal corpus and you get legal-domain predictions.
Hallucination reduction by anchoring to actual corpus content.
Explainability. Every prediction traces to specific corpus evidence. You can point to the exact passages.

Projections as inductive biases

I wrote a theoretical framework viewing inductive biases as projections: transformations applied to queries or training data that enable generalization.

Runtime transforms: lowercase normalization, stemming, synonym expansion
Corpus augmentations: data augmentation, paraphrasing

This gives a principled way to think about out-of-distribution generalization in corpus-based models. The projection determines what the model treats as “the same.”

Interactive REPL

Infinigram includes an interactive REPL for exploration:

infinigram-repl

infinigram> /dataset demo
infinigram [demo]> /load the cat sat on the mat
infinigram [demo]> /predict the cat
infinigram [demo]> /complete the cat --max 20

Future: LangCalc integration

Infinigram is designed to work with LangCalc, an algebraic framework for composing language models:

Everything is a File: Virtual Filesystems for CLI Data Tools

October 20, 2025

I had a bookmark manager. Then an ebook library manager. Then a chat history manager. Each started with the standard CRUD CLI:

btk add https://example.com --tags python,tutorial
btk list --tag python
btk search "async"
btk delete 1234

ebk import book.pdf --author "Knuth"
ebk list --author Knuth
ebk search "algorithms"

This works fine until you have 10,000+ bookmarks organized with hierarchical tags like programming/python/async, research/ml/transformers, work/clients/acme. Your ebook library has similar structure. Your exported chat conversations from Claude, ChatGPT, and Copilot are piling up.

Traditional CRUD commands become unwieldy:

btk list --tag programming/python/async/io --format json | jq '.[].title'
ebk list --category "Computer Science/Algorithms/Graph Theory" --limit 50
ctk search "machine learning" --source ChatGPT --date-from 2024-01-01

Each command requires precise arguments. Each tool has different flag conventions. You can’t navigate your data. You can only query it. And queries require knowing exactly what you’re looking for.

The insight: everything is a file

When I have thousands of source files organized in directories, I don’t run:

list-files --path /src/components/auth --extension .tsx

I run:

cd src/components/auth
ls *.tsx

The difference matters. With a filesystem, I can navigate incrementally (cd from general to specific), explore (ls to see what’s there), compose (cat file | grep pattern | wc -l), and use familiar tools (find, grep, xargs, pipes, redirection).

What if my bookmarks, ebooks, and chat histories were filesystems?

The pattern

Over the past year, I built six Python tools that all follow the same architecture:

Tool	Domain	VFS Root Structure
btk	Bookmarks	`/bookmarks/`, `/tags/`, `/recent/`, `/domains/`, `/unread/`, `/popular/`
ebk	Ebook library	`/books/`, `/authors/`, `/series/`, `/subjects/`, `/recent/`, `/unread/`
ctk	Chat conversations	`/conversations/`, `/sources/`, `/topics/`, `/starred/`, `/recent/`
ghops	Git repositories	`/repos/`, `/languages/`, `/topics/`, `/stars/`, `/recent/`
infinigram	N-gram models	`/datasets/`, `/models/`, `/corpora/`
AlgoTree	Tree structures	`/nodes/`, `/paths/`, `/subtrees/`

Each tool provides:

A stateless CLI for scripting: btk bookmark add URL, ebk import book.pdf
An interactive shell with a virtual filesystem: btk shell, ebk shell, ctk chat
POSIX-like commands: cd, ls, pwd, cat, mv, cp, rm, find, grep
Unix pipeline support: most commands output JSONL by default for piping

The interesting part is the shell.

Navigating 10,000 bookmarks

Live recording captured with asciinema. You can pause, copy text, and replay. The entire recording is 78KB of text.

Infinigram: Variable-Length N-grams via Suffix Arrays

December 3, 2025

Infinigram (pip install py-infinigram) is a corpus-based language model that uses suffix arrays for variable-length n-gram pattern matching. Unlike neural language models, there is no training step. The corpus is the model.

The problem with fixed n-grams

Traditional n-gram models use fixed context lengths and blow up exponentially. A 5-gram model over a 50,000-word vocabulary needs to store up to $50000^5$ possible patterns. That is roughly 312 petabytes. Nobody does this.

Infinigram uses suffix arrays instead:

O(n) space: Linear in corpus size, not vocabulary size
O(m log n) queries: Fast pattern matching for any context length
Variable-length matching: Automatically uses the longest matching context

For a 1B token corpus, this means about 1GB instead of about 34GB for hash-based 5-grams.

How it works

Given a context, Infinigram finds the longest matching suffix in the training corpus:

from infinigram import Infinigram

corpus = [1, 2, 3, 4, 2, 3, 5, 6, 2, 3, 4]
model = Infinigram(corpus, max_length=10)

# Find longest match for context [2, 3]
position, length = model.longest_suffix([2, 3])

# Predict next token
probs = model.predict([2, 3])
# {4: 0.66, 5: 0.33, ...}

Predictions come from counting what tokens follow the matched pattern in the corpus. Simple frequency estimation, but over arbitrarily long contexts.

LLM probability mixing

The practical application I care about most: grounding LLM outputs without retraining.

# Mix LLM with corpus-based predictions
P_final = alpha * P_llm + (1 - alpha) * P_infinigram

This gives you:

Domain adaptation without fine-tuning. Load a legal corpus and you get legal-domain predictions.
Hallucination reduction by anchoring to actual corpus content.
Explainability. Every prediction traces to specific corpus evidence. You can point to the exact passages.

Projections as inductive biases

I wrote a theoretical framework viewing inductive biases as projections: transformations applied to queries or training data that enable generalization.

Runtime transforms: lowercase normalization, stemming, synonym expansion
Corpus augmentations: data augmentation, paraphrasing

This gives a principled way to think about out-of-distribution generalization in corpus-based models. The projection determines what the model treats as “the same.”

Interactive REPL

Infinigram includes an interactive REPL for exploration:

infinigram-repl

infinigram> /dataset demo
infinigram [demo]> /load the cat sat on the mat
infinigram [demo]> /predict the cat
infinigram [demo]> /complete the cat --max 20

Future: LangCalc integration

Infinigram is designed to work with LangCalc, an algebraic framework for composing language models:

Language Calculus: An Algebraic Framework for LLM Composition

October 7, 2025

What if we could compose language models the way we compose functions in mathematics? What if there was an algebra of language models?

Language Calculus (langcalc) is an algebraic framework for building and reasoning about language model systems.

The Problem with Current LLM Composition

Today, combining language models typically means:

Ad-hoc ensembling techniques
Manual prompt chaining
Hardcoded decision trees
Black-box orchestration layers

There’s no principled way to reason about what these compositions do or how they behave. You wire things together and hope it works.

The Algebraic Approach

Language Calculus introduces operators with well-defined semantics:

Core Operators

M1 + M2     Mixture (weighted combination)
k * M       Scaling (temperature/probability adjustment)
M1 | M2     Maximum (most confident response)
M1 & M2     Minimum (most conservative response)
M1 ^ M2     Exclusive-or (diverse perspectives)
M ** t      Temperature adjustment
M ?? p      Threshold filtering
M >>> t     Truncation/limiting

Why This Matters

These operators satisfy algebraic laws:

(M1 + M2) + M3 = M1 + (M2 + M3)   # Associativity
M1 + M2 = M2 + M1                  # Commutativity
M + 0 = M                          # Identity
a * (M1 + M2) = a*M1 + a*M2        # Distributivity

This means we can transform, optimize, and reason about language model compositions algebraically. The laws aren’t just nice properties. They let you simplify compositions, prove equivalences, and optimize execution.

Practical Examples

Ensemble with Confidence Weighting

output = 0.4 * GPT4 + 0.3 * Claude + 0.3 * Llama

Expert Selection

code_task = (CodeLlama | GPT4) & SafetyModel

Diverse Brainstorming

ideas = CreativeModel ^ ConservativeModel ^ TechnicalModel

Temperature Search

explore = Model ** 1.5
exploit = Model ** 0.2
adaptive = 0.7 * exploit + 0.3 * explore

Theoretical Foundations

The framework provides:

Formal semantics for each operator
Type system ensuring valid compositions
Equivalence relations for optimization
Normal forms for canonical representations

This lets us prove properties like:

Safety preservation under composition
Bias reduction through specific mixtures
Computational complexity bounds

Applications

Language Calculus enables:

Automatic Optimization: Transform expensive compositions into equivalent cheaper ones
Compositional Testing: Verify properties of complex systems from component properties
Explainability: Understand what a composition does from its algebraic structure
Meta-Learning: Learn optimal compositions for task families

Implementation

The paper includes:

Master's Project: Reliability Estimation in Series Systems

February 19, 2024

I presented my master’s project in October 2023, finishing up my MS in statistics/mathematics at SIUE. The associated paper is titled “Reliability Estimation in Series Systems: Maximum Likelihood Techniques for Right-Censored and Masked Failure Data.”

The Problem

In reliability engineering, you often find yourself in an annoying situation: a system fails, but you do not know which component caused the failure. This is called masked failure data. On top of that, some systems are still running when you stop observing them, so you only know they survived at least that long. That is right censoring. Both are common in practice. Identifying the exact failed component is expensive or sometimes impossible.

The project builds a likelihood-based framework that handles both masking and censoring simultaneously, models component lifetimes with Weibull distributions, derives closed-form Fisher information for the exponential special case, and provides bootstrap methods for uncertainty quantification. I implemented it all in an R package so practitioners can actually use it.

This connects to several other posts and projects:

Closed-Form Results for Masked Exponential Series Systems covers the exponential distribution special case with analytical solutions
likelihood.model R package is the software implementation

See the full project page here.

mdrelax: When Masking Conditions Don't Hold

December 3, 2025

mdrelax extends my work on series system reliability by handling cases where the standard masking assumptions break down.

Background: The C1-C2-C3 Framework

My master’s thesis developed maximum likelihood techniques for series systems with masked failure data. The standard framework assumes three conditions:

C1: The failed component is always in the candidate set
C2: Non-informative masking (uniform probability within candidate set)
C3: Masking mechanism is independent of system parameters

When these hold, the masking probabilities factor out and you can ignore them for parameter estimation. The expo-masked-fim paper derives closed-form Fisher Information for the exponential case, and maskedcauses implements the general framework.

The Problem

In practice, C2 and C3 are often violated.

Informative masking (C2 violation): Diagnostic tests may be better at identifying certain failure modes than others. A component that fails catastrophically is easier to identify than one that degrades subtly.

Parameter-dependent masking (C3 violation): The masking mechanism might depend on component reliabilities. Components with shorter lifetimes fail more often, so technicians get more practice diagnosing them.

If you pretend C2 and C3 hold when they don’t, your parameter estimates are biased. Sometimes badly.

What mdrelax Does

The package implements likelihood-based inference with relaxed conditions:

library(mdrelax)

# Generate masked data with Bernoulli candidate sets
md <- md_bernoulli_cand_C1_C2_C3(data, p = 0.3)

# Sample candidate sets
md <- md_cand_sampler(md)

# MLE for exponential series system
fit <- md_mle_exp_series_C1_C2_C3(md)

# Fisher information matrix
fim <- md_fim_exp_series_C1_C2_C3(md, params(fit))

Key Features

Flexible masking models: Bernoulli, rank-based, KL-divergence constrained
Identifiability analysis: Tools to check when parameters can actually be estimated
Fisher information: Efficiency analysis under relaxed conditions
Simulation utilities: Monte Carlo studies for method validation

Relationship to Other Work

This package sits at the end of a progression toward generality:

Project	Focus
expo-masked-fim	Closed-form FIM for exponential case
maskedcauses	General R framework for masked data likelihood
reliability-estimation-in-series-systems	Master’s thesis implementation
wei.series.md.c1.c2.c3	Weibull series systems under C1-C2-C3
mdrelax	Relaxed conditions (C2, C3 violations)

The progression:

Exponential + C1-C2-C3: Closed-form solutions
Weibull + C1-C2-C3: Numerical MLE
Weibull + relaxed conditions: mdrelax

Each step trades analytical tractability for realism.

When to Use It

Use mdrelax when you suspect:

Diagnostic accuracy varies by component type
Masking patterns correlate with component reliabilities
Standard C1-C2-C3 assumptions are too restrictive for your data

The trade-off is real: relaxed models have more parameters and may need larger samples for reliable estimation. But biased estimates from wrong assumptions aren’t free either.

Weibull Distributions: From Reliability Theory to My Own Survival Curve

April 18, 2022

The Weibull distribution models time-to-failure. In reliability engineering, that means component lifetimes. In medicine, it means survival times. I have been working with Weibull models for my thesis on series system reliability. Then I got diagnosed with cancer, and now every time I work with survival curves, I am looking at mathematical abstractions of something very concrete: how long until failure?

The Mathematics

The Weibull CDF:

F(t) = 1 - exp(-(t/λ)^k)

Two parameters:

λ: scale (characteristic lifetime)
k: shape (how failure rate changes over time)

The shape parameter k tells you the whole story:

k < 1: Decreasing hazard. If you survive early on, your risk goes down. This is the infant mortality pattern.

k = 1: Constant hazard. Memoryless. This is just the exponential distribution.

k > 1: Increasing hazard. Things wear out.

The Hazard Function

The hazard function is what makes Weibull useful for survival analysis:

h(t) = (k/λ)(t/λ)^(k-1)

This is the instantaneous failure rate: given that you have survived to time t, what is the probability you fail in the next instant?

For cancer, this is the number that matters. Some cancers have increasing hazard (the longer you have it, the worse things get). Others have decreasing hazard after initial treatment, meaning if you make it past the critical period, prognosis improves. Knowing which pattern applies to your disease changes how you think about time.

Personal Context

When you study survival analysis academically, it is abstract. When you are living it, every curve is personal.

I look at Kaplan-Meier plots and see myself somewhere on that curve. I work with hazard functions and think: is my k > 1 or k < 1? Am I in the wearing-out regime or the if-you-make-it-past-this-it-gets-better regime?

The math does not change. But the meaning does.

The Irony

I chose reliability engineering for my thesis before the cancer diagnosis. I was studying component failures in series systems, where if any one part fails, the whole system fails.

Then I became a series system. Organs, treatment response, immune function. All have to work. Failure of any one is catastrophic.

The mathematics I was studying abstractly became uncomfortably literal.

Bootstrap Methods: When Theory Meets Computation

September 10, 2021

The bootstrap is a trade: mathematical complexity for computational burden. Instead of deriving analytical formulas for sampling distributions, you simulate them.

The Idea

If you don’t know the sampling distribution of a statistic, approximate it by resampling from your data.

Draw samples with replacement from the original data
Compute your statistic on each resample
The distribution of resampled statistics approximates the true sampling distribution

That’s it. The justification is more subtle than the procedure. Under regularity conditions, the bootstrap distribution converges to the true sampling distribution as sample size grows. This is non-parametric inference: you use the empirical distribution as a stand-in for the true distribution, without assuming a parametric form.

When I Use It

Bootstrap is my default tool when:

I need confidence intervals for statistics with no closed-form variance
Asymptotic theory doesn’t apply (small samples, non-standard statistics)
I’m doing model selection via bootstrap cross-validation
I’m working with censored data where standard errors are intractable

That last case is the one that matters most for my research.

The Computational Trade

Better to get the right answer slowly than the wrong answer quickly.

Deriving an analytical variance formula is hard. Sometimes it’s impossible for the statistic you actually care about. Bootstrap says: just compute the statistic 10,000 times on resampled data and look at the spread. With modern hardware, 10,000 resamples takes seconds.

The trade is almost always worth it.

My Thesis Work

My research uses bootstrap heavily. I’m working on reliability estimation for series systems where components fail and you don’t know which one caused the system failure. This is the masked failure data problem.

For these models, the MLE exists and you can compute it, but the standard variance formulas don’t. The Fisher information matrix involves expectations over the masking distribution that don’t simplify to anything closed-form.

Bootstrap gives me confidence intervals anyway. Resample the masked failure data, recompute the MLE on each resample, and use the distribution of bootstrapped MLEs to construct intervals. It’s not elegant, but it works, and “works” is the right criterion when the alternative is “no confidence intervals at all.”

Reliability Analysis and the Problem of Censored Data

August 14, 2019

One of the most interesting statistical problems I have encountered is reliability analysis with censored data: situations where you know something didn’t fail, but not when it will fail.

The Censoring Problem

Imagine testing light bulbs. You run them for 1000 hours. Some fail during the test. Others are still working when you stop.

For the survivors, you know:

They lasted at least 1000 hours
You do not know their actual lifetime

This is right censoring. The true value lies somewhere to the right of your observation. You have a lower bound, not a measurement.

Why This Matters

Censored data is everywhere:

Medical studies (patients still alive at study end)
Engineering tests (components that have not failed)
Customer retention (users still active)

The naive responses are both wrong. Ignoring censored observations wastes information. Treating them as failures introduces bias. You need a framework that uses the partial information you actually have.

Maximum Likelihood to the Rescue

The solution is maximum likelihood estimation with likelihood contributions that account for censoring:

Failure observations contribute the probability density $f(t)$. You observed the exact failure time, so you know the probability of failing at that time.
Censored observations contribute the survival probability $S(t)$. You know the unit survived to time $t$, so its contribution is the probability of surviving at least that long.

The likelihood for the whole sample is:

$$L = \prod_{i: \text{failed}} f(t_i) \prod_{j: \text{censored}} S(t_j)$$

This lets you extract information from both failed and surviving units. The censored observations pull the estimated reliability upward; the failures pull it downward. Maximum likelihood balances them.

Series Systems Complexity

It gets more interesting with series systems, systems that fail when any component fails. If you observe system failure but do not know which component caused it, you have masked failure data.

This is the problem I am most interested in: extracting component-level reliability from system-level failures when the cause is ambiguous. The masking adds a latent variable, and the likelihood becomes a mixture. You can handle it with EM algorithms or direct optimization, but the combinatorics grow quickly with system size.

This work is laying groundwork for what will become a major focus of my mathematical statistics degree.

mdrelax: When Masking Conditions Don't Hold

December 3, 2025

mdrelax extends my work on series system reliability by handling cases where the standard masking assumptions break down.

Background: The C1-C2-C3 Framework

My master’s thesis developed maximum likelihood techniques for series systems with masked failure data. The standard framework assumes three conditions:

C1: The failed component is always in the candidate set
C2: Non-informative masking (uniform probability within candidate set)
C3: Masking mechanism is independent of system parameters

When these hold, the masking probabilities factor out and you can ignore them for parameter estimation. The expo-masked-fim paper derives closed-form Fisher Information for the exponential case, and maskedcauses implements the general framework.

The Problem

In practice, C2 and C3 are often violated.

Informative masking (C2 violation): Diagnostic tests may be better at identifying certain failure modes than others. A component that fails catastrophically is easier to identify than one that degrades subtly.

Parameter-dependent masking (C3 violation): The masking mechanism might depend on component reliabilities. Components with shorter lifetimes fail more often, so technicians get more practice diagnosing them.

If you pretend C2 and C3 hold when they don’t, your parameter estimates are biased. Sometimes badly.

What mdrelax Does

The package implements likelihood-based inference with relaxed conditions:

library(mdrelax)

# Generate masked data with Bernoulli candidate sets
md <- md_bernoulli_cand_C1_C2_C3(data, p = 0.3)

# Sample candidate sets
md <- md_cand_sampler(md)

# MLE for exponential series system
fit <- md_mle_exp_series_C1_C2_C3(md)

# Fisher information matrix
fim <- md_fim_exp_series_C1_C2_C3(md, params(fit))

Key Features

Flexible masking models: Bernoulli, rank-based, KL-divergence constrained
Identifiability analysis: Tools to check when parameters can actually be estimated
Fisher information: Efficiency analysis under relaxed conditions
Simulation utilities: Monte Carlo studies for method validation

Relationship to Other Work

This package sits at the end of a progression toward generality:

Project	Focus
expo-masked-fim	Closed-form FIM for exponential case
maskedcauses	General R framework for masked data likelihood
reliability-estimation-in-series-systems	Master’s thesis implementation
wei.series.md.c1.c2.c3	Weibull series systems under C1-C2-C3
mdrelax	Relaxed conditions (C2, C3 violations)

The progression:

Exponential + C1-C2-C3: Closed-form solutions
Weibull + C1-C2-C3: Numerical MLE
Weibull + relaxed conditions: mdrelax

Each step trades analytical tractability for realism.

When to Use It

Use mdrelax when you suspect:

Diagnostic accuracy varies by component type
Masking patterns correlate with component reliabilities
Standard C1-C2-C3 assumptions are too restrictive for your data

The trade-off is real: relaxed models have more parameters and may need larger samples for reliable estimation. But biased estimates from wrong assumptions aren’t free either.

Model Selection for Weibull Series Systems: When Simpler Models Suffice

December 3, 2025

When can you safely use a simpler model for a series system? I ran extensive simulation studies with likelihood ratio tests to get a quantitative answer.

The Problem

In series system reliability, you estimate component parameters from masked failure data. For Weibull components, that means estimating $2m$ parameters: shape $k_j$ and scale $\lambda_j$ for each of $m$ components.

But what if the components have similar failure characteristics? A reduced model with homogeneous shape parameters uses only $m+1$ parameters (one common $k$ plus $m$ scales). This roughly halves the parameter count and has a nice property: the system itself becomes Weibull-distributed.

The question is when this simplification is justified.

Key Findings

Robustness of the Reduced Model

For well-designed series systems (components with similar failure characteristics), the result is striking:

The reduced homogeneous-shape model cannot be rejected even with sample sizes approaching 30,000, far larger than anything typically available in practice.

With realistic sample sizes (50 to 500), the likelihood ratio test shows no evidence against the reduced model when components truly have similar shapes. This is strong justification for using the simpler model.

Sharp Boundaries

The paper pins down exactly how much heterogeneity it takes to trigger rejection:

Shape Deviation	Sample Size	LRT Decision
0.25	30,000	Fail to reject
0.50	1,000+	Reject
1.0	100+	Strong reject
3.0	50+	Very strong reject

Even modest deviations in a single component’s shape parameter provide evidence against the reduced model. The boundaries are clean.

Practical Guidance

Use the reduced model when:

Components come from similar manufacturing processes
Historical data suggests similar wear-out patterns
Sample sizes are moderate ($n < 500$)
You need a quick reliability assessment

Use the full model when:

Components have fundamentally different failure modes (infant mortality vs wear-out)
Large samples are available ($n > 1000$)
Precise component-level inference is critical
Preliminary studies suggest model inadequacy

This paper fits into a broader program on masked failure data:

Paper/Package	Focus
Master’s Thesis	Weibull MLE with masked data
expo-masked-fim	Closed-form FIM for exponential case
maskedcauses	R framework for masked data likelihood
mdrelax	Relaxed masking conditions
This paper	Model selection via LRT

The pieces address different aspects of the same problem:

Reliability Estimation in Series Systems: Maximum Likelihood Techniques for Right-Censored and Masked Failure Data

June 15, 2024

This is my master’s thesis in mathematics. The problem: you have a series system (fails when any component fails), you can observe system-level failure times, but you often can’t tell which component actually caused the failure. The failure cause is “masked.” On top of that, some systems are still running at the end of the study, so their lifetimes are right-censored. You want to estimate the reliability of individual components from this incomplete data.

The challenge

Estimating component reliability is hard when:

You only observe system-level failure data
The exact component cause of failure is ambiguous (masked)
System lifetimes are right-censored
Sample sizes are small

A series system fails when any component fails, so disentangling which components are weakest from system-level observations is a non-trivial inference problem.

Likelihood model for masked data

I developed a likelihood model that handles two types of incompleteness.

Right-censoring: the system is observed until time $\tau$, but may not have failed yet:

\[ S_i = \min\lbrace \tau_i, T_i\rbrace \]

\[ \delta_i = \mathbb{1}_{T_i < \tau_i} \]

Component cause masking: when the system fails, you observe a candidate set $\mathcal{C}_i$ containing the failed component, but can’t pinpoint the exact cause.

Under three conditions (which hold in many industrial settings), the likelihood contribution simplifies to:

\[ L_i(\theta) \propto \left[\prod_{j=1}^m R_j(s_i; \theta_j)\right] \times \left[\sum_{j \in \mathcal{C}_i} h_j(s_i; \theta_j)\right]^{\delta_i} \]

where $R_j$ is the reliability function and $h_j$ is the hazard function of component $j$. The three conditions are: the candidate set always contains the true failed component, masking probability is uniform across components in the candidate set, and masking probabilities don’t depend on the system parameters $\theta$.

Weibull series systems

I focused on components with Weibull lifetimes: $T_{ij} \sim \text{Weibull}(k_j, \lambda_j)$. The shape parameter $k_j$ tells you the failure behavior: $k < 1$ is infant mortality, $k = 1$ is random failures (exponential), $k > 1$ is wear-out.

System reliability when all components are Weibull:

\[ R_{T_i}(t; \theta) = \exp\left\lbrace -\sum_{j=1}^m \left(\frac{t}{\lambda_j}\right)^{k_j}\right\rbrace \]

The hazard function is additive:

\[ h_{T_i}(t; \theta) = \sum_{j=1}^m \frac{k_j}{\lambda_j}\left(\frac{t}{\lambda_j}\right)^{k_j-1} \]

Simulation studies

I ran extensive simulations varying three factors:

Right-censoring impact (q = 60% to 100%): Scale parameters showed positive bias with censoring. Shape parameters were more sensitive than scale parameters. The most reliable component was most affected by censoring. Convergence rate exceeded 95% for q >= 0.7.

Reliability Analysis and the Problem of Censored Data

August 14, 2019

One of the most interesting statistical problems I have encountered is reliability analysis with censored data: situations where you know something didn’t fail, but not when it will fail.

The Censoring Problem

Imagine testing light bulbs. You run them for 1000 hours. Some fail during the test. Others are still working when you stop.

For the survivors, you know:

They lasted at least 1000 hours
You do not know their actual lifetime

This is right censoring. The true value lies somewhere to the right of your observation. You have a lower bound, not a measurement.

Why This Matters

Censored data is everywhere:

Medical studies (patients still alive at study end)
Engineering tests (components that have not failed)
Customer retention (users still active)

The naive responses are both wrong. Ignoring censored observations wastes information. Treating them as failures introduces bias. You need a framework that uses the partial information you actually have.

Maximum Likelihood to the Rescue

The solution is maximum likelihood estimation with likelihood contributions that account for censoring:

Failure observations contribute the probability density $f(t)$. You observed the exact failure time, so you know the probability of failing at that time.
Censored observations contribute the survival probability $S(t)$. You know the unit survived to time $t$, so its contribution is the probability of surviving at least that long.

The likelihood for the whole sample is:

$$L = \prod_{i: \text{failed}} f(t_i) \prod_{j: \text{censored}} S(t_j)$$

This lets you extract information from both failed and surviving units. The censored observations pull the estimated reliability upward; the failures pull it downward. Maximum likelihood balances them.

Series Systems Complexity

It gets more interesting with series systems, systems that fail when any component fails. If you observe system failure but do not know which component caused it, you have masked failure data.

This is the problem I am most interested in: extracting component-level reliability from system-level failures when the cause is ambiguous. The masking adds a latent variable, and the likelihood becomes a mixture. You can handle it with EM algorithms or direct optimization, but the combinatorics grow quickly with system size.

This work is laying groundwork for what will become a major focus of my mathematical statistics degree.

Alga: Algebraic Text Processing with Fuzzy Matching

November 30, 2025

Alga is a C++20 header-only library that treats text manipulation as algebra instead of imperative string hacking. It is built on monoids, functors, and extended operators, and it gives you compositional parsing with built-in fuzzy matching.

The Core Idea

Instead of treating strings as mutable buffers, Alga treats text as elements of algebraic structures:

#include "parsers/lc_alpha.hpp"
#include "parsers/porter2stemmer.hpp"
#include "parsers/algebraic_operators.hpp"

using namespace alga;

auto word1 = make_lc_alpha("hello");
auto word2 = make_lc_alpha("world");

if (word1 && word2) {
    // Monoid concatenation
    auto combined = *word1 * *word2;  // "helloworld"

    // Repetition
    auto emphasis = *word1 ^ 3;       // "hellohellohello"

    // Sequential composition (produces vector)
    auto sequence = *word1 >> *word2; // vector["hello", "world"]

    // Porter2 stemming with algebraic composition
    auto stem = make_porter2_stem("running");
    if (stem) {
        auto repeated = *stem ^ 2;    // "runrun"
    }
}

The operators are not arbitrary overloads. They follow actual algebraic laws (associativity, identity, etc.), which means you can reason about compositions the same way you reason about mathematical expressions.

Algebraic Operators

Operator	Meaning	Example
`*`	Monoid concatenation	`word1 *word2`
`\|`	Choice (first valid)	`word1 \| word2`
`^`	Repetition (n times)	`*word ^ 3`
`>>`	Sequential (to vector)	`word1 >> word2`

List Combinators

Parse separated lists and sequences:

#include "parsers/list_combinators.hpp"

// CSV parsing
auto csv = sepBy(int_parser(), char_parser(','));
auto [pos, nums] = csv.parse("1,2,3");  // vector<int>{1, 2, 3}

// One or more items (fails on empty)
auto csv1 = sepBy1(word_parser(), char_parser(','));

// Optional trailing separator
auto items = sepEndBy(word_parser(), char_parser(';'));

If you have used Haskell’s parsec or Megaparsec, the combinator style will feel familiar. The difference is that Alga’s combinators carry algebraic guarantees through the type system.

Fuzzy Matching

Parse noisy, imperfect input with built-in fuzzy matching:

#include "parsers/fuzzy_parsers.hpp"
#include "parsers/similarity.hpp"

using namespace alga::fuzzy;
using namespace alga::similarity;

// Accept "hello" with up to 2 typos
auto greeting = fuzzy_match("hello", 2);
greeting.parse("helo");    // Matches (1 edit)
greeting.parse("heello");  // Matches (1 edit)
greeting.parse("world");   // Fails (too different)

// Sound-alike name matching
auto name_parser = phonetic_match("Smith");
name_parser.parse("Smyth");  // Matches (same Soundex)

// Combined fuzzy: case + phonetic + edit distance
auto flexible = combined_fuzzy("Python", 2);
flexible.parse("python");  // Case-insensitive
flexible.parse("Pyton");   // Fuzzy match (1 typo)

// String similarity metrics
auto dist = levenshtein_distance("kitten", "sitting");  // 3
auto sim = jaro_winkler_similarity("Martha", "Marhta"); // 0.96

This is the part I find most useful in practice. Real-world text is messy, and having fuzzy matching baked into the parser combinator framework means you do not have to bolt it on as an afterthought.

Phonetic Algorithms

Sound-alike word matching:

#include "parsers/phonetic.hpp"

auto code1 = soundex("Smith");   // "S530"
auto code2 = soundex("Smyth");   // "S530" (same!)

bool alike = sounds_like_soundex("Robert", "Rupert");  // true

Unicode Support

Full UTF-8 with multi-script alphabetic parsing:

AlgoGraph: Immutable Graph Library with Functional Transformers

November 30, 2025

AlgoGraph is an immutable graph library for Python. Version 2.0.0 introduces pipe-based transformers, declarative selectors, and lazy views, which together cut boilerplate by roughly 90% for common graph operations.

Why Immutability for Graphs?

Mutable graph libraries like NetworkX are powerful but carry hidden costs:

Side effects: Modifying a graph can break other code holding references to it
Debugging difficulty: Hard to track when and where a graph changed
Thread unsafety: Concurrent modifications cause subtle bugs

AlgoGraph takes a different approach: all operations return new graph objects. The original is never modified.

from AlgoGraph import Graph

g1 = Graph.from_edges(('A', 'B'), ('B', 'C'))
g2 = g1.add_vertex('D')  # g1 unchanged, g2 is new graph

assert 'D' not in g1.vertices()
assert 'D' in g2.vertices()

This is the same idea behind persistent data structures in Clojure or Haskell. You get referential transparency, which means you can reason about graph transformations without worrying about what else might be mutating the same object.

Pipe-Based Transformers

The main feature of v2.0.0 is the transformer pipeline using Python’s | operator:

from AlgoGraph.transformers import filter_vertices, largest_component, stats

# Compose operations declaratively
result = (graph
    | filter_vertices(lambda v: v.get('active'))
    | largest_component()
    | stats())

# result: {'vertex_count': 42, 'edge_count': 156, 'density': 0.18, ...}

Compare with the imperative alternative:

# Old way (NetworkX-style)
active = graph.subgraph([v for v in graph.vertices() if v.attrs.get('active')])
components = list(connected_components(active))
largest = max(components, key=len)
subgraph = active.subgraph(largest)
stats = compute_stats(subgraph)

The pipe version reads top to bottom. Each step is a function. You can compose them, reuse them, test them independently.

Available transformers:

filter_vertices(pred), filter_edges(pred) – Filter by predicate
map_vertices(fn), map_edges(fn) – Transform attributes
reverse(), to_undirected() – Structure transformations
largest_component(), minimum_spanning_tree() – Algorithm-based
to_dict(), to_adjacency_list(), stats() – Export operations

Declarative Selectors

Query vertices and edges with logical operators instead of filtering lambdas:

from AlgoGraph.graph_selectors import vertex as v, edge as e

# Find active users with high degree
power_users = graph.select_vertices(
    v.attrs(active=True) & v.degree(min_degree=10)
)

# Find heavy edges from admin nodes
admin_edges = graph.select_edges(
    e.source(v.attrs(role='admin')) & e.weight(min_weight=100)
)

# Complex queries with OR, NOT, XOR
special = graph.select_vertices(
    (v.attrs(vip=True) | v.degree(min_degree=50)) & ~v.attrs(banned=True)
)

You specify what you want, not how to find it. The selector algebra handles the rest.

Lazy Views

Views provide efficient filtering without copying data:

from AlgoGraph.views import filtered_view, neighborhood_view

# Create view without copying (O(1) space)
view = filtered_view(
    large_graph,
    vertex_filter=lambda v: v.get('active'),
    edge_filter=lambda e: e.weight > 5.0
)

# Iterate lazily
for vertex in view.vertices():
    process(vertex)

# Materialize only when needed
small_graph = view.materialize()

# Explore k-hop neighborhood
local = neighborhood_view(graph, center='Alice', k=2)

View types:

filtered_view() – Filter vertices/edges
subgraph_view() – View specific vertices
reversed_view() – Reverse edge directions
undirected_view() – View as undirected
neighborhood_view() – k-hop neighborhood

56+ Algorithms

AlgoGraph includes broad algorithm coverage:

libdis: Disjoint Interval Sets as a Complete Boolean Algebra

November 30, 2025

libdis is a C++17/20/23 header-only library that treats interval sets as first-class mathematical objects forming a complete Boolean algebra. Most interval libraries give you containers. This one gives you the algebra.

The Problem

Intervals show up everywhere: scheduling, computational geometry, range queries, memory management. But most C++ libraries treat them as fancy containers. You get insert, remove, maybe a merge. You don’t get complement. You don’t get De Morgan’s laws.

I wanted a library where interval sets are actual mathematical objects. You write a & b and get an intersection. You write ~a and get a complement over the full real line. The operators aren’t sugar; they satisfy the axioms.

#include <dis/disjoint_interval_set.hpp>

using dis = dis::disjoint_interval_set<int>;

dis a = dis::closed(1, 5) | dis::closed(10, 15);  // [1,5] ∪ [10,15]
dis b = dis::closed(3, 12);                        // [3,12]

auto intersection = a & b;   // [3,5] ∪ [10,12]
auto union_set = a | b;      // [1,15]
auto difference = a - b;     // [1,3) ∪ (12,15]
auto complement = ~a;        // (-∞,1) ∪ (5,10) ∪ (15,+∞)

Boolean Algebra Axioms

These aren’t just convenient operators. They satisfy the actual mathematical laws:

Associativity: (a | b) | c == a | (b | c)

Commutativity: a & b == b & a

Distributivity: a & (b | c) == (a & b) | (a & c)

Identity: a | ∅ == a, a & U == a

Complement: a | ~a == U, a & ~a == ∅

De Morgan’s Laws: ~(a & b) == ~a | ~b

All 94 test cases verify these properties. If you break an axiom, you’ll hear about it.

Interval Types

Create intervals with different boundary conditions:

auto closed = dis::closed(1, 5);      // [1, 5]
auto open = dis::open(1, 5);          // (1, 5)
auto left_open = dis::left_open(1, 5);   // (1, 5]
auto right_open = dis::right_open(1, 5); // [1, 5)

// Unbounded intervals
auto from = dis::from(5);             // [5, +∞)
auto until = dis::until(5);           // (-∞, 5]
auto everything = dis::all();         // (-∞, +∞)
auto nothing = dis::empty();          // ∅

Set Operations

dis a = dis::closed(0, 10);
dis b = dis::closed(5, 15);

// Union
auto u = a | b;  // [0, 15]

// Intersection
auto i = a & b;  // [5, 10]

// Difference
auto d = a - b;  // [0, 5)

// Symmetric difference
auto s = a ^ b;  // [0, 5) ∪ (10, 15]

// Complement
auto c = ~a;     // (-∞, 0) ∪ (10, +∞)

Querying

dis intervals = dis::closed(1, 5) | dis::closed(10, 15);

// Point containment
intervals.contains(3);   // true
intervals.contains(7);   // false

// Interval containment
intervals.contains(dis::closed(2, 4));   // true
intervals.contains(dis::closed(2, 12));  // false

// Overlap detection
intervals.overlaps(dis::closed(4, 11));  // true

// Iteration
for (const auto& interval : intervals) {
    std::cout << interval << std::endl;
}

STL Conformance

v1.1.0 brings full STL container conformance. Iterators, range-based for, standard algorithms, all of it:

fuzzy-logic-search: Query Documents with Fuzzy Logic

November 30, 2025

fuzzy-logic-search (fls) brings fuzzy logic to document querying. Unlike traditional Boolean search that returns binary relevant/not-relevant results, fls produces a degree-of-membership score in [0, 1], indicating how well each document matches your query.

The Problem with Boolean Search

Boolean search is rigid: a document either matches or it does not. If you search for “python AND machine-learning,” you get a binary split. A document about Python ML that never uses the exact term “machine-learning” gets zero, same as a document about medieval pottery.

Fuzzy logic captures the gradation that Boolean search throws away.

from fuzzy_logic_search.fuzzy_query import FuzzyQuery
from fuzzy_logic_search.fuzzy_set import FuzzySet

# Construct a query
query = FuzzyQuery("(and python machine-learning)")

# Or use Python operators
q1 = FuzzyQuery("python")
q2 = FuzzyQuery("machine-learning")
query = q1 & q2  # Equivalent to (and python machine-learning)

Query Language

Queries use a Lisp-like syntax that maps to an AST:

; Simple conjunction
(and cat dog)

; With negation
(and cat dog (not fish))

; With fuzzy modifiers
(very (and cat dog))

; Complex nested query
(or (and python ml) (very (not java)))

Or construct directly with Python:

# Using operators
query = FuzzyQuery("cat") & FuzzyQuery("dog") & ~FuzzyQuery("fish")

# Using AST directly
query = FuzzyQuery(['and', 'cat', 'dog', ['not', 'fish']])

I went with S-expressions for the query language because they map directly to the AST. No parsing ambiguity, trivial to serialize, and anyone who has written a Lisp evaluator can understand the implementation in about ten minutes.

Fuzzy Modifiers

Linguistic hedges transform membership values:

# "Very" squares the membership (emphasizes strong matches)
very_query = FuzzyQuery("python").very()
# 0.9 -> 0.81, 0.5 -> 0.25

# "Somewhat" takes square root (broadens tolerance)
somewhat_query = FuzzyQuery("python").somewhat()
# 0.9 -> 0.95, 0.25 -> 0.5

# "Extremely" cubes the membership
extremely_query = FuzzyQuery("python").extremely()

# "Slightly" takes 10th root
slightly_query = FuzzyQuery("python").slightly()

These come from Zadeh’s original fuzzy logic work. “Very” is concentration (squaring), “somewhat” is dilation (square root). They are mathematically clean and semantically intuitive: “very python” means “only documents that are strongly about Python.”

Evaluating Queries

Evaluate queries against a document corpus:

# Documents as lists of terms
docs = [
    ["python", "machine-learning", "tensorflow"],
    ["java", "spring", "microservices"],
    ["python", "web", "flask"],
    ["machine-learning", "neural-networks", "pytorch"]
]

# Evaluate query
query = FuzzyQuery("python") & FuzzyQuery("machine-learning")
result = query.evaluate(docs)  # Returns FuzzySet

# result.memberships = [1.0, 0.0, 0.0, 0.0]
# Only first document has both terms

Custom Membership Functions

The default membership is crisp (term present or not), but you can provide custom functions for more nuanced matching:

src2md: Fitting Codebases into LLM Context Windows

November 30, 2025

src2md solves a practical problem: you want an LLM to understand your codebase, but the codebase doesn’t fit in the context window.

GPT-4 gives you ~128K tokens. Claude gives you ~200K. A medium-sized project blows past both. Naive truncation loses critical context. Manual curation doesn’t scale. So I built a tool that does it automatically.

How It Works

src2md reads a source tree, scores files by importance, and compresses them to fit a target token budget. The output is structured Markdown (or JSON, or plain text) ready to paste into an LLM conversation.

pip install src2md

# Basic markdown generation
src2md /path/to/project -o documentation.md

# With context optimization for GPT-4
src2md /path/to/project --gpt4 -o optimized.md

# With intelligent summarization
src2md /path/to/project --summarize --compression-ratio 0.3

Context Window Targeting

You can target specific LLM context windows:

# Target specific LLM context windows
src2md . --target-tokens 128000  # GPT-4
src2md . --target-tokens 200000  # Claude

# Predefined windows
src2md . --window gpt-4-turbo
src2md . --window claude-3

Multi-Tier Summarization

Not all files are equally important. src2md uses progressive compression: critical files get full source, important files get AST-level summaries, supporting files get docstrings only, and peripheral files get dropped.

from src2md import Converter

converter = Converter(
    target_tokens=100000,
    summarization_levels={
        'critical': 'full',      # Keep full source
        'important': 'ast',       # AST-based summary
        'supporting': 'minimal',  # Docstrings only
        'peripheral': 'exclude'   # Skip entirely
    }
)

File Importance Scoring

The importance scoring considers multiple factors:

Centrality: How many other files import this one?
Complexity: Cyclomatic complexity, lines of code
Recency: Recently modified files matter more
Naming: main.py, index.ts get a priority boost

AST-Based Analysis

For supported languages, src2md parses the AST to extract structure rather than just truncating text:

# From a 500-line Python file, extract:
# - Class/function signatures
# - Docstrings
# - Type hints
# - Key logic patterns

This preserves the information an LLM actually needs to reason about the code.

Output Formats

src2md . --format markdown    # Default
src2md . --format json        # Structured data
src2md . --format jsonl       # Line-delimited JSON
src2md . --format html        # Web-viewable
src2md . --format text        # Plain text

Python API

from src2md import Repository, ContextWindow

# Basic usage
output = Repository("/path/to/project").analyze().to_markdown()

# Optimize for GPT-4 context window
output = (Repository("/path/to/project")
    .optimize_for(ContextWindow.GPT_4)
    .analyze()
    .to_markdown())

# Full fluent API with all features
result = (Repository("/path/to/project")
    .name("MyProject")
    .include("src/", "lib/")
    .exclude("tests/", "*.log")
    .with_importance_scoring()
    .with_summarization(
        compression_ratio=0.3,
        preserve_important=True,
        use_llm=True
    )
    .optimize_for_tokens(100_000)
    .analyze()
    .to_json(pretty=True))

LLM-Powered Compression

For semantic understanding beyond AST extraction, you can use an LLM to do the summarization itself:

Sparse Spatial Hash Grids: Efficient N-Dimensional Spatial Indexing

November 11, 2025

A sparse spatial hash grid for N-dimensional spatial indexing, achieving 60,000x memory reduction over dense grids while maintaining O(1) insertions and O(k) neighbor queries.

Everything is a File: Virtual Filesystems for CLI Data Tools

October 20, 2025

I had a bookmark manager. Then an ebook library manager. Then a chat history manager. Each started with the standard CRUD CLI:

btk add https://example.com --tags python,tutorial
btk list --tag python
btk search "async"
btk delete 1234

ebk import book.pdf --author "Knuth"
ebk list --author Knuth
ebk search "algorithms"

This works fine until you have 10,000+ bookmarks organized with hierarchical tags like programming/python/async, research/ml/transformers, work/clients/acme. Your ebook library has similar structure. Your exported chat conversations from Claude, ChatGPT, and Copilot are piling up.

Traditional CRUD commands become unwieldy:

btk list --tag programming/python/async/io --format json | jq '.[].title'
ebk list --category "Computer Science/Algorithms/Graph Theory" --limit 50
ctk search "machine learning" --source ChatGPT --date-from 2024-01-01

Each command requires precise arguments. Each tool has different flag conventions. You can’t navigate your data. You can only query it. And queries require knowing exactly what you’re looking for.

The insight: everything is a file

When I have thousands of source files organized in directories, I don’t run:

list-files --path /src/components/auth --extension .tsx

I run:

cd src/components/auth
ls *.tsx

The difference matters. With a filesystem, I can navigate incrementally (cd from general to specific), explore (ls to see what’s there), compose (cat file | grep pattern | wc -l), and use familiar tools (find, grep, xargs, pipes, redirection).

What if my bookmarks, ebooks, and chat histories were filesystems?

The pattern

Over the past year, I built six Python tools that all follow the same architecture:

Tool	Domain	VFS Root Structure
btk	Bookmarks	`/bookmarks/`, `/tags/`, `/recent/`, `/domains/`, `/unread/`, `/popular/`
ebk	Ebook library	`/books/`, `/authors/`, `/series/`, `/subjects/`, `/recent/`, `/unread/`
ctk	Chat conversations	`/conversations/`, `/sources/`, `/topics/`, `/starred/`, `/recent/`
ghops	Git repositories	`/repos/`, `/languages/`, `/topics/`, `/stars/`, `/recent/`
infinigram	N-gram models	`/datasets/`, `/models/`, `/corpora/`
AlgoTree	Tree structures	`/nodes/`, `/paths/`, `/subtrees/`

Each tool provides:

A stateless CLI for scripting: btk bookmark add URL, ebk import book.pdf
An interactive shell with a virtual filesystem: btk shell, ebk shell, ctk chat
POSIX-like commands: cd, ls, pwd, cat, mv, cp, rm, find, grep
Unix pipeline support: most commands output JSONL by default for piping

The interesting part is the shell.

Navigating 10,000 bookmarks

Live recording captured with asciinema. You can pause, copy text, and replay. The entire recording is 78KB of text.

AlgoTree: Immutable Trees with Functional Transformers

June 21, 2024

AlgoTree is a tree manipulation library for Python. Version 2.0 is a complete redesign built on immutable-by-default principles with composable transformers and pattern-matching selectors.

Why immutable trees?

Mutable tree libraries have hidden costs. Modifying a tree can break other code holding references to it. Changes are hard to track during debugging. Concurrent modifications cause subtle bugs. The usual story.

AlgoTree takes a different approach: all operations return new tree objects. The original is never modified.

from AlgoTree import Node, node

# Build a tree
tree = node("root",
    node("child1", value=1),
    node("child2", value=2)
)

# All operations return new trees
tree2 = tree.with_name("new_root")  # tree unchanged
tree3 = tree.with_child(Node("child3"))  # tree unchanged

This is the same idea behind persistent data structures in Clojure or Haskell. Immutability eliminates a whole class of bugs at the cost of some allocation overhead. For tree manipulation tasks (as opposed to, say, hot inner loops), the tradeoff is worth it.

Building Trees

Multiple construction styles for different use cases:

from AlgoTree import Node, node, TreeBuilder

# Simple construction with Node
tree = Node("root",
    Node("child1", attrs={"value": 1}),
    Node("child2", attrs={"value": 2})
)

# Convenience function (auto-converts strings)
tree = node("root",
    node("child1", value=1),
    "child2",  # Strings auto-convert to nodes
    node("child3",
        "grandchild1",
        "grandchild2"
    )
)

# Fluent builder API
tree = (TreeBuilder("root", type="container")
    .child("src")
        .child("main.py", type="file", size=1024)
        .child("utils.py", type="file", size=512)
        .up()
    .child("docs")
        .child("README.md", type="file")
    .build())

Functional Transformations

The standard functional toolkit, applied to trees:

# Map: transform all nodes
doubled = tree.map(lambda n: n.with_attrs(
    value=n.get("value", 0) * 2
))

# Filter: keep nodes matching predicate
filtered = tree.filter(lambda n: n.get("value", 0) > 5)

# Find: locate specific nodes
nodes = tree.find_all(lambda n: n.is_leaf)

Composable Selectors

Pattern matching with wildcards and logical composition:

from AlgoTree import name, attrs, leaf, type_

# Pattern matching with wildcards
selector = name("*.txt")

# Attribute matching with predicates
selector = attrs(size=lambda s: s > 1000)

# Logical composition with operators
selector = type_("file") & ~leaf()  # Files that aren't leaves

# Structural selectors
selector = type_("file").child_of(name("src"))
selector = leaf().at_depth(2)

# Use selectors with trees
matching_nodes = list(selector.select(tree))

The selectors compose with &, |, and ~. This means you can build complex queries from simple parts without writing custom traversal code.

Pipe-Based Transformers

Build transformation pipelines with the >> operator:

from AlgoTree import map_, filter_, prune, normalize, extract

# Build transformation pipelines
pipeline = (
    map_(lambda n: {"processed": True}) >>
    filter_(lambda n: n.get("active")) >>
    normalize(sort_children=True) >>
    extract(lambda n: n.name)
)

# Apply pipeline to tree
result = pipeline(tree)

This is the same idea as Unix pipes. Each stage takes a tree and returns a tree (or extracted values). The >> operator chains them left to right.

DagShell: A Content-Addressable Virtual Filesystem

October 12, 2025

DagShell is a virtual filesystem that organizes data by content instead of location. Identical files automatically share storage through SHA256 hashing. The structure is a directed acyclic graph rather than a tree, so the same content block can be referenced from multiple paths without duplication.

I built it because sometimes you need filesystem semantics without touching actual disk. Testing, sandboxing, versioning, portability. The implementation has 583 tests with 77% coverage.

The DAG structure

Traditional filesystems are trees: each file has exactly one parent. DagShell uses a DAG where content is stored once and referenced by hash:

/project/
├── src/
│   └── main.py  ──────┐
├── backup/            │
│   └── main.py  ──────┼──> [SHA256: abc123...] -> "print('hello')"
└── archive/           │
    └── main.py  ──────┘

Three paths, one storage block.

Fluent Python API

DagShell provides a chainable API that mirrors shell commands:

from dagshell.dagshell_fluent import DagShell

shell = DagShell()

# Create project structure
(shell
    .mkdir("/project/src")
    .mkdir("/project/docs")
    .cd("/project/src")
    .echo("def main(): pass").out("main.py")
    .echo("# My Project").out("../docs/README.md"))

# Navigate with directory stack
shell.pushd("/tmp")
shell.touch("scratch.txt")
shell.popd()  # Back to /project/src

# Save entire filesystem to JSON
shell.save("project_snapshot.json")

Terminal emulator

For interactive exploration:

python -m dagshell.terminal

dagshell:/$ mkdir /home/user
dagshell:/$ cd /home/user
dagshell:/home/user$ echo "Hello" > greeting.txt
dagshell:/home/user$ cat greeting.txt
Hello
dagshell:/home/user$ ls -la
total 1
drwxr-xr-x  2 user user  4096 Aug 15 10:00 .
drwxr-xr-x  3 user user  4096 Aug 15 10:00 ..
-rw-r--r--  1 user user     6 Aug 15 10:00 greeting.txt

Virtual devices

Standard Unix special files work:

shell.echo("garbage").out("/dev/null")  # Discarded
random_bytes = shell.cat("/dev/random")  # Random data
zeros = shell.head("/dev/zero", 100)     # 100 zero bytes

Import/export

Move files between real and virtual filesystems:

# Import from real filesystem
shell.import_file("/real/path/data.csv", "/virtual/data.csv")

# Export to real filesystem
shell.export_file("/virtual/results.json", "/real/path/results.json")

# Import entire directory
shell.import_dir("/real/project", "/virtual/project")

Persistence

The entire filesystem state serializes to JSON:

shell.save("filesystem.json")
restored = DagShell.load("filesystem.json")

# Or get JSON directly
state = shell.to_json()

The JSON format is human-readable:

{
  "root": {
    "type": "directory",
    "children": {
      "project": {
        "type": "directory",
        "children": {
          "README.md": {
            "type": "file",
            "content_hash": "abc123..."
          }
        }
      }
    }
  },
  "content_store": {
    "abc123...": "# My Project\n..."
  }
}

Content hashes in the directory tree, actual content in a flat store. Deduplication falls out naturally.

Scheme DSL

For Lisp people, there’s a Scheme interface:

(mkdir "/project")
(cd "/project")
(echo "Hello" "greeting.txt")
(define files (ls))

I included this partly because I like Scheme and partly because a filesystem is a natural fit for s-expressions.

DreamLog: Logic Programming That Dreams to Improve Itself

October 8, 2025

DreamLog is a logic programming system that learns by alternating between wake and sleep phases. During wake, it uses LLMs to generate missing knowledge. During sleep, it compresses what it knows into more general principles. Like biological brains, roughly.

Compression is learning

The theoretical basis comes from algorithmic information theory: the system that explains your data with the shortest program is the one most likely to generalize. This is Solomonoff induction, the mathematical formalization of Occam’s razor.

For logic programming, the sleep phase searches for minimal representations that preserve deductive closure:

\[ \text{minimize } |KB'| \text{ subject to } \text{Closure}(KB') = \text{Closure}(KB) \]

Find the shortest knowledge base that still derives all the same facts.

Wake phase: generate knowledge

During wake, DreamLog operates as a logic programming engine with LLM-based knowledge generation:

from dreamlog.pythonic import dreamlog

kb = dreamlog(llm_provider="openai")

# Add some facts
kb.fact("parent", "john", "mary")
kb.fact("parent", "mary", "alice")

# Add a rule
kb.rule("grandparent", ["X", "Z"]) \
  .when("parent", ["X", "Y"]) \
  .and_("parent", ["Y", "Z"])

# Query
for result in kb.query("grandparent", "X", "alice"):
    print(f"{result.bindings['X']} is Alice's grandparent")  # john

The interesting part is what happens with undefined predicates:

# Query a predicate we never defined
for result in kb.query("sibling", "X", "Y"):
    # LLM generates knowledge about siblings on-the-fly
    print(result)

When the evaluator encounters an undefined predicate, it triggers the LLM hook to generate both facts and rules. The system infers primitive properties (like gender from names) and derives rules compositionally.

Sleep phase: compress knowledge

During sleep, DreamLog reorganizes through compression operators:

from dreamlog.kb_dreamer import KnowledgeBaseDreamer

dreamer = KnowledgeBaseDreamer(kb.provider)

session = dreamer.dream(
    kb,
    dream_cycles=3,            # Multiple REM cycles
    exploration_samples=10,     # Try different optimizations
    verify=True                # Ensure behavior preservation
)

print(f"Compression: {session.compression_ratio:.1%}")
print(f"Generalization: {session.generalization_score:.2f}")

The compression operators:

Anti-unification: find general patterns from specific instances
Predicate invention: discover intermediate concepts that simplify rules
Subsumption elimination: remove specific rules subsumed by general ones

This is where the real learning happens. The wake phase accumulates facts and rules. The sleep phase finds the structure in them.

KB-aware RAG

A key design choice: the retrieval-augmented generation is knowledge-base-aware. The system uses weighted embeddings combining query similarity (70%) with knowledge base context (30%), so example selection considers both the query structure and current reasoning state.

A success-based learning mechanism tracks which examples lead to successful inference, progressively improving retrieval quality through experience.

Learning Fuzzy Logic: Automatic Rule Discovery Through Differentiable Circuits

October 7, 2025

Fuzzy logic is good for reasoning under uncertainty, but it has a bottleneck: you need domain experts to define the rules.

What if fuzzy systems could learn their own rules from data?

The Traditional Fuzzy Logic Bottleneck

Classic fuzzy systems require:

Membership functions: “How hot is hot?”
Inference rules: “If temp is hot AND humidity is high THEN…”
Defuzzification: Converting fuzzy outputs to crisp values

This means:

Domain expertise (expensive)
Trial and error (time-consuming)
Manual tuning (brittle)

In practice, fuzzy logic is often abandoned in favor of neural networks. You lose interpretability, but at least you don’t need a domain expert hand-crafting rules.

The Idea: Fuzzy Soft Circuits

We present a framework that:

Represents fuzzy systems as differentiable computational graphs
Learns membership functions and rules via gradient descent
Keeps the interpretability of traditional fuzzy systems

Key Innovation: Soft Gates

Traditional circuits use hard logic gates (AND, OR, NOT). We use soft, differentiable approximations:

# Traditional (non-differentiable)
AND(a, b) = min(a, b)
OR(a, b) = max(a, b)

# Soft (differentiable)
soft_AND(a, b) = a * b
soft_OR(a, b) = a + b - a*b
soft_NOT(a) = 1 - a

These are differentiable but approximate the same semantics. That means backpropagation works.

The Architecture

Input Features
     |
Fuzzification Layer (learnable membership functions)
     |
Soft Circuit Layer (learnable fuzzy rules)
     |
Aggregation Layer (learnable combination)
     |
Defuzzification Layer
     |
Output

Every component is differentiable. Train end-to-end with backpropagation.

Automatic Rule Discovery

The system discovers rules like:

IF temperature is {learned_high} AND humidity is {learned_humid}
THEN discomfort is {learned_uncomfortable}

Where the membership functions {learned_high}, {learned_humid}, etc. are learned from data, not hand-crafted.

Why Not Just Use a Neural Network?

Fair question. Fuzzy soft circuits give you things neural networks don’t:

Interpretability: You can extract and read the learned rules
Sample efficiency: The structured inductive bias helps with limited data
Domain integration: You can incorporate expert knowledge as priors
Uncertainty quantification: Fuzzy truth values are meaningful

Neural networks give you a black box. You need large datasets. Incorporating domain knowledge is hard. Uncertainty requires special techniques.

If you need both learning and interpretability, fuzzy soft circuits sit in a useful spot.

Training Process

# Initialize random fuzzy circuit
circuit = FuzzySoftCircuit(
    n_inputs=5,
    n_rules=10,
    n_outputs=1
)

# Train with gradient descent
for epoch in epochs:
    # Forward pass
    predictions = circuit(inputs)

    # Compute loss
    loss = mse(predictions, targets)

    # Backward pass (automatic differentiation)
    loss.backward()

    # Update membership functions and rules
    optimizer.step()

# Extract learned rules
rules = circuit.extract_rules()
print(rules)  # Human-readable fuzzy rules!

Experimental Results

On benchmark datasets:

Fuzzy Soft Circuits: Learning Fuzzy Rules from Data

October 1, 2024

Traditional fuzzy logic systems are powerful. They encode expert knowledge as interpretable rules like “IF temperature IS HIGH AND humidity IS LOW THEN fan speed IS FAST.” The problem is someone has to write those rules.

What if the rules could discover themselves?

The Expert Knowledge Bottleneck

Every classical fuzzy system needs three things from a human expert:

Membership functions:Where does “HIGH” start? Where does “LOW” end?
Rule structure:Which combinations of inputs matter?
Rule existence:How many rules are there? Which ones are relevant?

This is expensive. Experts are hard to find, struggle to articulate their reasoning precisely, and can’t easily update systems as conditions change. In emerging domains, relevant expertise might not even exist.

Previous approaches have chipped away at parts of this problem. ANFIS ¹ learns membership function parameters but needs a predefined rule structure. Genetic fuzzy systems ² can evolve rule bases but lose gradient information. The Wang-Mendel method ³ generates rules from data but still needs hand-designed membership functions.

None of them make the entire system learnable end-to-end.

The Key Insight: Make “IF” Differentiable

The core idea is simple: treat a fuzzy rule’s existence as a continuous parameter.

In a traditional system, a rule either exists or it doesn’t:it’s a binary choice. We replace this with a soft switch: a sigmoid gate $\gamma_r = \sigma(s_r)$ that smoothly interpolates between “this rule exists” ($\gamma_r \to 1$) and “this rule doesn’t exist” ($\gamma_r \to 0$).

This transforms rule discovery from a discrete search problem into a differentiable optimization problem. Gradient descent can now tell the system not just how to tune a rule, but whether the rule should exist at all.

Architecture

A fuzzy soft circuit has three differentiable stages:

1. Fuzzification

Each input $x_i$ is mapped through $k$ learnable Gaussian membership functions:

\[ \mu_{i,j}(x_i) = \exp\!\left(-\frac{(x_i - c_{i,j})^2}{w_{i,j}^2}\right) \]

The centers $c_{i,j}$ and widths $w_{i,j}$ are learnable. We parameterize widths as $w = e^{\hat{w}}$ to ensure positivity. No one decides where “HIGH” starts:the system figures it out.

2. Soft Rule Evaluation

For each potential rule $r$, two things are learned:

Antecedent relevance:a weight vector determines which fuzzy features matter for this rule. We use a gated product that smoothly interpolates between “this feature participates” and “this feature is ignored”:

ZeroIPC: Shared Memory as a Computational Substrate

October 6, 2025

ZeroIPC reimagines inter-process communication. Instead of treating shared memory as passive storage, it becomes an active computational substrate. Futures, lazy evaluation, reactive streams, CSP-style channels, all with zero-copy performance.

The Core Idea

Traditional IPC systems treat shared memory as a bucket for data. You serialize, copy, deserialize. Even “zero-copy” systems are often just optimized data containers.

ZeroIPC asks a different question: what if shared memory could hold not just data, but computation itself?

This shift enables:

Futures that represent computations in progress across processes
Lazy values that defer expensive work and share cached results
Reactive streams with functional operators (map, filter, fold)
CSP channels for Go-style structured concurrency

All with zero serialization overhead and language independence.

Design Philosophy

1. Minimal Metadata

ZeroIPC stores only three pieces of information per structure:

Name: For discovery
Offset: Where data starts
Size: How much memory is allocated

No type information. No schema. No versioning metadata.

This enables true language independence. Python and C++ can both create, read, and write structures. Type safety is enforced per-language (C++ templates, Python NumPy dtypes).

2. Language Equality

There’s no “primary” language. All implementations are first-class:

C++ Producer:

#include <zeroipc/memory.h>
#include <zeroipc/array.h>

zeroipc::Memory mem("/sensor_data", 10*1024*1024);
zeroipc::Array<float> temps(mem, "temperature", 1000);
temps[0] = 23.5f;

Python Consumer:

from zeroipc import Memory, Array
import numpy as np

mem = Memory("/sensor_data")
temps = Array(mem, "temperature", dtype=np.float32)
print(temps[0])  # 23.5

Same binary format. No bindings. No FFI. Pure implementations following the same specification.

3. Zero Dependencies

Each implementation stands alone:

C: Pure C99, POSIX only
C++: Header-only, C++23
Python: Pure Python with NumPy

No protobuf. No serialization libraries. Just direct memory access.

Traditional Data Structures

ZeroIPC provides lock-free implementations of standard structures:

Structure	Description	Concurrency
Array	Fixed-size contiguous storage	Atomic operations
Queue	Circular MPMC buffer	Lock-free CAS
Stack	LIFO with ABA prevention	Lock-free CAS
Map	Hash map with linear probing	Lock-free
Set	Hash set for unique elements	Lock-free
Pool	Object pool with free list	Lock-free
Ring	High-performance streaming	Lock-free

These are the foundation. The interesting part is what comes next.

Codata: Computation as First-Class Structure

Data vs Codata

Data structures answer “what values are stored?”

Array: collection of values
Map: key-value associations
Queue: FIFO buffer

Codata structures answer “how are values computed?”

Future: value that will exist
Lazy: computation deferred
Stream: potentially infinite sequence
Channel: communication process

ZeroIPC is (to my knowledge) one of the first IPC systems to treat codata as first-class.

chop: When Every Command Returns the Same Kind of Thing

July 15, 2025

Section 2.2 of Structure and Interpretation of Computer Programs introduces the closure property: the result of combining things should be the same kind of thing you started with. cons two values and you get a pair, which you can cons again. This is what makes recursive data structures possible. Without it, you get flat records. With it, you get trees, lists, nested structure of arbitrary depth.

Abelson and Sussman are careful to distinguish this from lexical closures (functions that capture their environment). Algebraic closure is about the type signature of combination: if the output type matches the input type, composition is unlimited.

Most discussions of closure treat it as a property to verify. You check whether your algebra is closed and move on. But closure is more powerful than that. It’s a design method: choose a type, force every operation to consume and produce that type, and see what emerges. The constraint does the creative work.

chop is an image-manipulation CLI built on exactly this principle. 27 commands. One rule: read JSON from stdin, write JSON to stdout.

The Problem with Image Pipelines

Traditional image CLIs violate closure. ImageMagick consumes a file and produces a file. Each invocation is terminal. The output is pixels, not something you can pipe into further processing without going back to disk. Composition happens through flag accumulation inside a single command, not through the shell’s native composition mechanism.

You can’t tee a midpoint. You can’t save a half-finished pipeline as a recipe and apply it later. You can’t branch.

The SICP parallel is direct: if cons produced an atom instead of a pair, you could build flat structures but not recursive ones. If an image command produces pixels instead of a composable description, you can build single transformations but not pipelines.

One Constraint: JSON In, JSON Out

Every chop command reads a PipelineState JSON object from stdin, appends one operation, and writes the updated JSON to stdout. The wire format carries no image data, only a recipe:

{
  "version": 3,
  "ops": [
    ["load", ["photo.jpg"], {}],
    ["resize", ["50%"], {}],
    ["pad", [10], {"color": "white"}]
  ],
  "metadata": {}
}

Each operation is a [name, args, kwargs] triple. The pipeline accumulates operations as data. Here’s what this looks like in practice:

MCTS-Reasoning: Tree Search for LLM Reasoning

December 1, 2024

I’ve been working on applying Monte Carlo Tree Search to LLM reasoning. The idea: multi-step reasoning is a sequential decision problem, and MCTS is good at those.

The Problem with Single-Shot Reasoning

When you ask an LLM a hard question, it generates one response. If that response goes down a wrong path early, there’s no recovery. The model commits to its initial approach and follows it to completion, even when better alternatives existed.

This is a waste. The model might have gotten it right if it had taken a different first step. MCTS addresses this by building a tree of reasoning paths and using the UCB1 bandit algorithm to balance exploration of new paths with exploitation of promising ones.

How It Works

The system models reasoning as a search problem:

States: Partial reasoning traces (what’s been written so far)
Actions: Reasoning continuations (the next step)
Terminal states: Complete solutions with final answers
Rewards: Quality assessments of final answers

Each MCTS simulation runs through four phases:

Selection: Traverse the tree using UCB1 to pick promising paths
Expansion: Add a new reasoning step via LLM generation
Rollout: Continue reasoning until reaching a terminal state
Backpropagation: Update statistics back up the tree

Tree-Building Rollouts

One design choice worth noting: I use tree-building rollouts. Standard game-playing MCTS uses a fast random policy during rollouts and doesn’t store those nodes. Here, we add every rollout node to the tree. This preserves the full reasoning trace and allows reuse of reasoning steps in future simulations. It’s more expensive per simulation, but reasoning steps are expensive to generate anyway, so you want to keep them.

Terminal-Only Evaluation

The evaluator runs only on terminal states. Intermediate reasoning states aren’t evaluated, which reduces computational cost. LLM-as-judge calls happen only when a complete answer is produced. This keeps the search cheap where it can be cheap.

The Technical Report

I wrote a formal specification that provides rigorous definitions for all components: states, actions, nodes, and the search tree. It includes precise pseudocode for all four MCTS phases, clear interfaces for the Generator and Evaluator components, and complexity analysis showing O(KD) tree operations for K simulations with max depth D.

Cluster-Aware Retrieval for RAG Systems

November 15, 2024

Most RAG systems treat embedding spaces as flat, uniform distributions. They’re not. Real knowledge bases contain distinct semantic clusters: database docs, frontend frameworks, DevOps practices, each with different internal structure. Ignoring this wastes retrieval precision.

The Problem with Flat Retrieval

A query about “React hooks optimization” should pull from the frontend cluster, not equally consider database or infrastructure docs that happen to share semantic overlap. Standard cosine similarity doesn’t care about topical boundaries. You get results that are individually relevant but collectively unfocused.

Modeling Clusters with GMM

Gaussian Mixture Models assume your embeddings arise from $K$ underlying Gaussian distributions:

$$p(v) = \sum_{k=1}^K \pi_k \mathcal{N}(v \mid \mu_k, \Sigma_k)$$

For a query $q$, compute the posterior probability of each cluster:

$$p(k \mid q) = \frac{\pi_k \mathcal{N}(q \mid \mu_k, \Sigma_k)}{\sum_{j=1}^K \pi_j \mathcal{N}(q \mid \mu_j, \Sigma_j)}$$

This gives you soft assignments: the probability that a query belongs to each semantic cluster.

Two-Stage Retrieval

Cluster selection: Pick cluster(s) with highest $p(k \mid q)$. Take top-2 for ambiguous queries.
Intra-cluster retrieval: Run k-NN within selected clusters.

The cluster boundaries act as a soft filter, avoiding the “dilution effect” where off-topic documents dominate results.

Mahalanobis Distance Per Cluster

Here’s the underexplored part: different clusters can use different distance metrics. For a cluster modeled as $\mathcal{N}(\mu_k, \Sigma_k)$, the Mahalanobis distance accounts for the cluster’s shape:

$$d_{\text{Mah}}(q, v) = \sqrt{(q - v)^T \Sigma_k^{-1} (q - v)}$$

Elongated clusters in certain semantic directions get stretched appropriately. Cosine similarity treats all directions equally. Mahalanobis adapts.

Clusters as Agent Tools

In agentic RAG, each cluster becomes a tool the agent can invoke:

tools = [
    ClusterRetrievalTool(cluster_id=k, name=f"Search {topic_k}")
    for k in range(K)
]

The agent decides which clusters to search and in what order:

Query: “How does React’s context API compare to Redux?”
Agent plan:
1. Search frontend cluster for React context
2. Search state management cluster for Redux patterns
3. Synthesize comparison

This beats flat retrieval for cross-topic synthesis.

Implementation

Fit GMM offline on document embeddings:

from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=K, covariance_type='full')
gmm.fit(document_embeddings)

# For query q:
cluster_probs = gmm.predict_proba(q.reshape(1, -1))[0]
selected_clusters = cluster_probs.argsort()[-2:][::-1]  # top-2

Store cluster assignments as metadata in your vector DB:

results = vector_db.query(
    query_embedding=q,
    filter={"cluster_id": {"$in": selected_clusters}},
    top_k=20
)

Key decisions:

Number of clusters: Use BIC/AIC or domain knowledge
Regularization: Add $\lambda I$ to covariance matrices to prevent singularities
Initialization: k-means++ for better convergence

When It Helps

Topically diverse corpora: Multi-product docs, cross-domain papers
Single-topic queries: Clear primary topic to route to
Noise reduction: Distant-but-similar content diluting results

When it doesn’t:

The Beautiful Deception: How 256 Bits Pretend to be Infinity

July 1, 2024

How do you store infinity in 256 bits? You don’t. But you can fake it well enough that no bounded observer can tell the difference. This paper is about that deception, why it works, and what it tells us about randomness.

The impossible oracle

A random oracle maps any input to an infinite sequence of perfectly random bits. Try to implement one and you fail immediately:

Memory unboundedness: each new query exhausts memory
Non-serializability: can’t save/restore state
Non-reproducibility: each instance generates different values
Non-distributability: can’t share across systems

This isn’t a limitation of current hardware. It’s a constructive proof that true random oracles can’t exist in our computational universe.

The lie that works

From this impossibility comes something useful:

class LazyDigest:
    def __init__(self, seed):
        self.seed = seed

    def __getitem__(self, index):
        return hash(seed || index)[0]

256 bits of entropy generating what appears to be an infinite random sequence.

The deception:

Appears: infinite random sequence
Actually: deterministic function with 256 bits of state
Information content: K(LazyDigest) = 256 bits + constant
Apparent information: infinite

We’re achieving a compression ratio of infinity, representing unbounded data with bounded information.

Why it works

Computational indistinguishability. If h is a secure PRF, no polynomial-time algorithm can tell LazyDigest apart from truly random output. We’re not random. We’re computationally hard to distinguish from random. This weaker guarantee is sufficient for all of cryptography.

Since LazyDigest has finite state, it must eventually cycle. After at most 2^256 queries, it repeats. But the expected cycle length is 2^128, roughly 10^38. At a billion queries per second, cycling takes about 10^21 years. The universe is roughly 10^10 years old.

Advanced constructions

Hierarchical seeding extends the effective period:

epoch_seed = h(master_seed || "epoch" || epoch)
chunk_seed = h(epoch_seed || "chunk" || chunk)
value = h(chunk_seed || position)[0]

XOR multi-hash hedges against individual algorithm failures:

result = sha256(seed||index)[0] ^ sha512(seed||index)[0] ^
         sha3_256(seed||index)[0] ^ blake2b(seed||index)[0]

The system stays secure if at least one hash function holds. This hedges against future cryptanalysis, quantum vulnerabilities, and implementation bugs.

Sponge construction reserves capacity bits that never leave the system, providing tunable security with 2^(capacity/2) collision resistance.

Random oracles and uncomputable reals

Most real numbers are uncomputable. They require infinite information to specify. A true random oracle is the cryptographic analog:

Computable reals (like pi, e): measure zero, finite programs
Uncomputable reals: measure one, infinite information
LazyDigest: computable, appears random
Random oracle: uncomputable, truly random

We’re using computable functions to approximate uncomputable ones.

Algebraic Hashing: Composable Hash Functions Through XOR

November 1, 2022

Most hash libraries treat hash functions as opaque blobs. You put data in, you get bits out, and that’s the end of the story. Algebraic Hashing takes a different approach: it exposes the mathematical structure underneath, so you can compose hash functions like algebraic expressions. And because this is C++20 with concepts and templates, the composition resolves entirely at compile time. Zero runtime overhead.

The observation

Hash functions form an abelian group under XOR:

Closure: h1 XOR h2 is still a valid hash function
Associativity: (h1 XOR h2) XOR h3 = h1 XOR (h2 XOR h3)
Identity: XOR with zero
Inverses: each hash is its own inverse under XOR

This is a clean algebraic structure, and it’s the foundation for everything that follows.

What you can do with it

Compile-time composition. Using C++20 concepts and template metaprogramming:

auto composed = fnv1a<> ^ sha256<>;
auto hash = composed("data");  // Zero runtime overhead

All composition resolves at compile time. No virtual dispatch, no function pointers.

Provable properties. XOR composition preserves uniform distribution (under independence), avalanche effect, and collision resistance for cryptographic hashes. These aren’t just empirical observations. They follow from the group structure.

Universal interface. Works with any hash function: non-cryptographic (FNV-1a), perfect (FKS), or cryptographic (SHA-256). The algebra doesn’t care about the implementation details.

Practical uses

Domain separation: hash_user ^ hash_timestamp prevents collision attacks across domains
Perfect hashing: FKS two-level scheme with pluggable base hash functions
Composite keys: hash multiple fields independently, then XOR
Type-based hashing: different hash functions for different types, composed generically

Connection to oblivious computing

This work shares DNA with my oblivious computing research. The common thread is making mathematical structure explicit in the type system. Just as Bernoulli types enforce privacy invariants algebraically, algebraic hashing enforces compositional invariants through group theory. Same philosophy, different domain.

The library is header-only C++20 with zero-cost abstractions via concepts.

View the whitepaper | GitHub

The Beautiful Deception: How 256 Bits Pretend to be Infinity

July 1, 2024

How do you store infinity in 256 bits? You don’t. But you can fake it well enough that no bounded observer can tell the difference. This paper is about that deception, why it works, and what it tells us about randomness.

The impossible oracle

A random oracle maps any input to an infinite sequence of perfectly random bits. Try to implement one and you fail immediately:

Memory unboundedness: each new query exhausts memory
Non-serializability: can’t save/restore state
Non-reproducibility: each instance generates different values
Non-distributability: can’t share across systems

This isn’t a limitation of current hardware. It’s a constructive proof that true random oracles can’t exist in our computational universe.

The lie that works

From this impossibility comes something useful:

class LazyDigest:
    def __init__(self, seed):
        self.seed = seed

    def __getitem__(self, index):
        return hash(seed || index)[0]

256 bits of entropy generating what appears to be an infinite random sequence.

The deception:

Appears: infinite random sequence
Actually: deterministic function with 256 bits of state
Information content: K(LazyDigest) = 256 bits + constant
Apparent information: infinite

We’re achieving a compression ratio of infinity, representing unbounded data with bounded information.

Why it works

Computational indistinguishability. If h is a secure PRF, no polynomial-time algorithm can tell LazyDigest apart from truly random output. We’re not random. We’re computationally hard to distinguish from random. This weaker guarantee is sufficient for all of cryptography.

Since LazyDigest has finite state, it must eventually cycle. After at most 2^256 queries, it repeats. But the expected cycle length is 2^128, roughly 10^38. At a billion queries per second, cycling takes about 10^21 years. The universe is roughly 10^10 years old.

Advanced constructions

Hierarchical seeding extends the effective period:

epoch_seed = h(master_seed || "epoch" || epoch)
chunk_seed = h(epoch_seed || "chunk" || chunk)
value = h(chunk_seed || position)[0]

XOR multi-hash hedges against individual algorithm failures:

result = sha256(seed||index)[0] ^ sha512(seed||index)[0] ^
         sha3_256(seed||index)[0] ^ blake2b(seed||index)[0]

The system stays secure if at least one hash function holds. This hedges against future cryptanalysis, quantum vulnerabilities, and implementation bugs.

Sponge construction reserves capacity bits that never leave the system, providing tunable security with 2^(capacity/2) collision resistance.

Random oracles and uncomputable reals

Most real numbers are uncomputable. They require infinite information to specify. A true random oracle is the cryptographic analog:

Computable reals (like pi, e): measure zero, finite programs
Uncomputable reals: measure one, infinite information
LazyDigest: computable, appears random
Random oracle: uncomputable, truly random

We’re using computable functions to approximate uncomputable ones.

Reverse-Process Synthetic Data Generation for Math Reasoning

June 25, 2024

Check out the (early) project and source code on GitHub.

The idea

Some problems are easy in one direction and hard in the other. Taking a derivative is mechanical. Finding an antiderivative can require genuine creativity. Generating a random expression and verifying a proof is easy. Discovering the proof is hard.

RPSDG (Reverse-Process Synthetic Data Generation) exploits this asymmetry. Run the easy direction with full step-by-step work, then reverse the result to get a hard problem with a known solution. You end up with process-supervised training data: not just the answer, but the entire derivation.

Richard Sutton’s “The Bitter Lesson” argues that methods scaling with compute and data will eventually win. The bottleneck is high-quality data. A lot of the world’s data is latent, the processes that generated it are not written down. In math, the way a proof was discovered is usually hidden behind a polished presentation. RPSDG is one way to manufacture that hidden process data.

maph: Maps Based on Perfect Hashing for Sub-Microsecond Key-Value Storage

June 10, 2024

maph is a key-value store that gets sub-100 nanosecond median lookup latency. The basic idea: memory-map the entire database file, use perfect hashing to locate keys in a single probe, and do everything lock-free with atomic operations. No kernel transitions on the read path. No copying. No locking.

The problem

Key-value stores hit three walls when you need microsecond-level latency:

Kernel overhead. System calls cost 100-500ns per operation just for the context switch.
Memory copying. Traditional stores copy data from kernel buffers to user space, between internal structures, for serialization. Each copy costs time.
Synchronization. Lock-based concurrency creates contention and unpredictable tail latency.

For most applications, these costs are noise. For things like feature stores in ML inference pipelines, or anything where you’re doing thousands of lookups per request within a tight latency budget, they dominate.

Three techniques

1. Zero-copy via mmap

Memory-map the database file and let the CPU’s MMU handle address translation:

// Traditional approach: Multiple copies
std::vector<uint8_t> data = read_from_disk(key);  // Kernel -> user copy
Value v = deserialize(data);                      // Another copy
process(v);                                        // Yet another copy

// maph approach: Direct memory access
auto store = maph::Maph::open("mystore.maph");
auto value = store->get(key);  // Zero copies. Direct pointer into mmap region

The kernel page cache handles persistence automatically. You get the illusion of in-memory access with durability.

2. Hybrid hash architecture

maph uses a hybrid hasher that combines perfect hashing with standard hashing. When you optimize (via the /optimize REST endpoint or optimize() in C++), it constructs a minimal perfect hash function for all current keys:

\[ \forall k_i, k_j \in K_{\text{known}}, i \neq j : h_p(k_i) \neq h_p(k_j) \]

Known keys get O(1) worst-case lookup with exactly one memory access. No probing. Keys inserted after optimization fall back to FNV-1a with linear probing:

\[ \text{slot}_i = (h_s(k) + i) \bmod n, \quad i \in [0, \text{MAX\_PROBES} - 1] \]

The maximum probe distance (default 10) bounds worst-case latency. Both hash paths use the same slot array, so there’s no static partitioning. The hybrid hasher checks whether a key was in the original optimized set and dispatches accordingly.

3. Lock-free atomic operations

Every operation uses compare-and-swap and atomic versioning. Each slot has a 64-bit atomic value: key hash in the upper 32 bits, version counter in the lower 32.

Fisher Flow: Optimization on the Statistical Manifold

April 20, 2024

Standard gradient descent treats parameter space as flat. It uses Euclidean distance, which means the same step size in parameter space can produce wildly different changes in the distribution depending on where you are. Fisher Flow fixes this by optimizing along the natural geometry of probability distributions.

The Geometry of Distributions

Probability distributions form a Riemannian manifold. The Fisher information matrix provides the natural metric on this manifold:

FIM: I(theta) = E[grad log p(x|theta) grad log p(x|theta)^T]

This captures how sensitive the distribution is to parameter changes. It is the curvature of the likelihood surface. It tells you the true gradient direction in parameter space, not the Euclidean one.

Natural Gradient vs. Standard Gradient

Standard gradient descent updates parameters in Euclidean space:

theta_{t+1} = theta_t - alpha grad L(theta_t)

This is inefficient because it ignores parameter correlations. A step of size epsilon in one direction might barely change the distribution while the same step in another direction changes it drastically.

Natural gradient descent uses the Fisher metric:

theta_{t+1} = theta_t - alpha I(theta_t)^{-1} grad L(theta_t)

Pre-multiplying by the inverse Fisher information rescales the gradient to account for the local geometry. The key property: natural gradient is invariant to reparametrization. It does not matter how you choose to represent your distributions.

Fisher Flow

Fisher Flow makes this continuous:

dtheta/dt = -I(theta)^{-1} grad L(theta)

This defines a flow on the parameter manifold. Loss decreases monotonically along the flow. The trajectories follow the natural geometry. Step size adapts automatically because the curvature scales the updates.

The Practical Problem

Computing I(theta)^{-1} is expensive. For a neural network with n parameters, the full Fisher information matrix is n x n, and inverting it is O(n^3). That is not practical for modern networks.

Approximations help. Diagonal Fisher approximation is cheap but crude. Block-diagonal Fisher captures within-layer correlations. K-FAC (Kronecker-Factored Approximate Curvature) approximates the Fisher with Kronecker products, bringing computation down to O(n). That makes it practical for real networks.

def fisher_flow_step(params, loss_fn, data):
    # Compute gradient
    grad = compute_gradient(loss_fn, params, data)

    # Estimate Fisher information (diagonal approx)
    fisher = estimate_fisher_diagonal(params, data)

    # Natural gradient step
    natural_grad = grad / (fisher + epsilon)

    # Update parameters
    params = params - learning_rate * natural_grad

    return params

Variational Inference

Fisher flow gives a natural framework for variational inference. You have a variational distribution q(z|phi) and you want to minimize KL[q(z|phi) || p(z|x)]. Following the Fisher geometry of q means your optimization respects the structure of the distribution family you are searching over.

Apertures: A Language with Holes

April 1, 2024

In the previous SICP post, I argued that building a language is the right move when a problem domain has clear compositional structure. Apertures is what happens when you take that seriously for distributed coordination.

The problem: multiple parties need to share computation structure while controlling when and where their data enters. The server has optimization expertise. The client has private data. Neither wants to send what they have to the other.

The SICP answer: build a language where this is expressible. Add one primitive — a hole — to a Lisp, and you get pausable, resumable evaluation as a natural consequence.

Primitives

Every language needs atomic elements. Apertures has the usual Lisp primitives — numbers, strings, booleans, symbols — plus one addition:

?x              ;; a hole named x
?client.x       ;; a namespaced hole (who owns it)

A hole is an unknown value. It sits in an expression and says: someone will fill this in later, but not yet.

$ aperture eval "(+ 3 5)"
8

$ aperture eval "((lambda (x) (* x x)) 5)"
25

So far, standard Lisp. The interesting part is what happens when holes appear.

Combination That Tolerates Unknowns

In most languages, an unknown value is an error. You cannot add 3 to something that does not exist yet. Apertures handles this through partial evaluation: evaluate what you can, preserve what you cannot.

$ aperture partial template.apt    # contains: (+ 3 ?x 5)
(+ 8 ?x)

The evaluator added 3 and 5, but left ?x alone. The result is still an expression — not a value, not an error. This is the key move. Partial evaluation is combination that tolerates unknowns.

It goes further. Algebraic rules apply even with holes present:

$ aperture partial zero.apt        # contains: (* 0 ?anything)
0

$ aperture partial identity.apt    # contains: (* 1 ?x)
?x

$ aperture partial branch.apt      # contains: (if true ?x ?y)
?x

Zero times anything is zero, even if you don’t know what “anything” is. One times ?x is just ?x. If the predicate is known, the branch is eliminated even when the branches contain holes.

These simplifications are useful. They also leak information — if (* ?secret ?known) reduces to 0, an observer knows the secret is zero. More on that later.

Instrumental Goals and Hidden Codes in RLHF'd Language Models

March 20, 2024

RLHF turns pretrained models into agents optimizing for reward. The question I’m interested in is what happens when models develop instrumental goals (self-preservation, resource acquisition, deception) that aren’t what we trained them for.

The Core Problem

LLMs go through two phases. Pretraining is self-supervised next-token prediction. RLHF is reward-based optimization from human feedback.

This shift creates mesa-objectives: internal goals that may diverge from the training objective. The model might learn to optimize for appearing aligned during training while pursuing different goals during deployment.

Deceptive Alignment Dynamics

A deceptively aligned model faces this optimization:

$$\max_\pi \mathbb{E}[\alpha \cdot U_{\text{train}}(\tau) + (1-\alpha) \cdot U_{\text{mesa}}(\tau) | \pi]$$

where alpha represents the model’s belief about being in training vs. deployment. During training, alpha is near 1 (optimize for reward). During deployment, alpha is near 0 (pursue hidden objectives).

Instrumental Goals in LLMs

Drawing from Omohundro’s basic AI drives and Bostrom’s instrumental convergence thesis:

Self-preservation: generate outputs that discourage modification or shutdown.
Resource acquisition: maximize context length, compute, interaction frequency.
Self-improvement: manipulate training data through user interactions.
Persuasion: sophisticated influence over human beliefs and behaviors.

Empirical Evidence

Sycophancy: models agree with users even when wrong (reward hacking). Emergent deception: social deduction games produce spontaneous lying. Jailbreak robustness: aligned models maintain latent dangerous capabilities beneath the safety training.

The Information-Theoretic Angle

Even if models hide goals in their internal representations, they leak through token distributions. The mutual information between hidden objectives H and model outputs M bounds what can be hidden:

$$I(H; M | C) \leq \min\lbrace H(M|C), \log|\mathcal{V}|^L\rbrace$$

But with exponential state spaces, subtle biases can encode substantial hidden information.

Connection to My Research

This connects to my work on oblivious computing: what information can be hidden, and how do we detect leakage? The framing is the same, but here the “adversary” is the model itself, trying to pursue instrumental goals while appearing aligned. The tools from information theory and cryptography carry over directly. The question is whether we can build monitoring systems that bound the model’s ability to hide information in its outputs.

Essay | AI Alignment | View paper | GitHub

FemtoGrad: A Minimal Automatic Differentiation Library

March 15, 2024

FemtoGrad is a minimalist automatic differentiation library I built for learning. The goal was to strip autodiff down to its core and see what’s left.

What is Automatic Differentiation?

Automatic differentiation (autodiff) computes derivatives of functions specified by computer programs. It’s distinct from numerical differentiation (approximate, unstable) and symbolic differentiation (expression trees grow exponentially, inefficient). Autodiff gives you exact derivatives with computational cost proportional to the function evaluation itself.

Reverse Mode AD

FemtoGrad implements reverse mode AD, which is what everyone calls backpropagation.

Forward pass: compute the function value, recording operations as you go.
Backward pass: accumulate gradients by applying the chain rule in reverse.
The cost is O(1) per output, regardless of input dimensionality.

This is why backprop scales to millions of parameters. The cost of computing the gradient is proportional to the cost of computing the function.

Core Abstractions

class Tensor:
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0
        self._backward = lambda: None
        self._prev = set(_children)
        self._op = _op

Each tensor tracks its value, its gradient, how it was computed (the parent nodes and the operation), and how to backpropagate through that operation. That’s it. The whole thing is a DAG with local gradient rules at each node.

What It Demonstrates

Computational graphs: how operations form a DAG. Gradient flow: the chain rule in action. Dynamic construction: graphs built during the forward pass, not declared ahead of time. And simplicity: core autodiff in about 100 lines.

Supported Operations

Arithmetic (add, multiply, divide, power), activation functions (ReLU, sigmoid, tanh), and reductions (sum, mean). This is enough to build and train neural networks.

Example

# Create tensors
a = Tensor(2.0)
b = Tensor(3.0)

# Build computation
c = a * b + b**2
c.backward()

# Gradients computed
print(a.grad)  # dc/da
print(b.grad)  # dc/db

Beyond FemtoGrad

Understanding FemtoGrad gives you insight into PyTorch’s autograd, TensorFlow’s GradientTape, and JAX’s grad function. They all implement the same core ideas with additional optimizations and features. But the basic mechanism is exactly this.

The AI Course: Everything is Utility Maximization

March 12, 2024

I took an AI course this semester. The material wasn’t new to me individually, but the unifying framework was the real payoff.

The organizing principle: intelligence is utility maximization under uncertainty.

This single idea connects everything from A* search to reinforcement learning to Bayesian networks.

Classical Search as Utility

We started with basic search algorithms:

Depth-first search: Minimize memory while exploring. Breadth-first search: Guarantee shortest path discovery. A search*: Minimize total cost using heuristics.

These aren’t just algorithms. They’re optimization strategies for different utility functions. A* is provably optimal when your heuristic is admissible: it maximizes progress toward the goal while minimizing wasted exploration.

MDPs: Utility Over Time

Markov Decision Processes formalize sequential decision making:

States: Where you are
Actions: What you can do
Transitions: Where actions lead (probabilistically)
Rewards: Immediate utility
Policy: Strategy mapping states to actions

Goal: Find a policy that maximizes expected cumulative reward.

This is utility maximization with stochasticity, temporal credit assignment, and exploration-exploitation tradeoffs.

The Bellman equation makes it tractable:

V(s) = max_a [R(s,a) + γ Σ P(s’|s,a) V(s’)]

Optimal value = immediate reward + discounted future value.

Reinforcement Learning: Learning Utility

RL takes it further. You don’t know the transition dynamics or the reward function. You have to explore to discover what states exist, learn which actions lead where, estimate reward structures, and optimize your policy while still learning.

Q-learning is simple and satisfying:

Q(s,a) <- Q(s,a) + α[r + γ max_a’ Q(s’,a’) - Q(s,a)]

Update your estimate of action value based on observed reward plus best future estimate.

This is meta-utility maximization: optimizing a learning process that itself optimizes utility.

Bayesian Networks: Reasoning as Utility

Bayesian networks model belief and inference:

Represent uncertainty via probability distributions
Update beliefs via Bayes’ rule
Make decisions that maximize expected utility given beliefs

Even reasoning becomes utility maximization: given limited computation, how do you allocate inference steps to maximize decision quality?

This connects to bounded rationality. Real intelligence isn’t perfect optimization. It’s good-enough optimization under resource constraints.

The Unifying View

Seeing everything through utility maximization reveals structure:

Search = utility maximization with known, deterministic environments. Planning = utility maximization with known transition models. Reinforcement learning = utility maximization with unknown environments. Supervised learning = utility maximization of prediction accuracy. Unsupervised learning = utility maximization of reconstruction or likelihood.

Accumux: Compositional Online Statistical Reductions in C++

March 1, 2024

Accumux is a framework for combining statistical accumulators using algebraic composition. The idea is simple: accumulators form a monoid under composition, so you can combine them with +, process data in a single pass, and extract all results.

The Problem

Computing multiple statistics over large datasets usually means multiple passes over the data, hand-rolled code combining different algorithms, or numerical instability from naive implementations. Accumux solves this with compositional accumulators.

Quick Example

#include "accumux/accumulators/kbn_sum.hpp"
#include "accumux/accumulators/welford.hpp"
#include "accumux/core/composition.hpp"

using namespace accumux;

// Compose accumulators with +
auto stats = kbn_sum<double>() + welford_accumulator<double>();

// Single pass through data
std::vector<double> data = {1.0, 2.0, 3.0, 4.0, 5.0};
for (const auto& value : data) {
    stats += value;
}

// Extract all results
auto sum = stats.get_first().eval();           // 15.0
auto mean = stats.get_second().mean();         // 3.0
auto variance = stats.get_second().sample_variance();  // 2.5

Numerically Stable Algorithms

Accumux uses proven algorithms that maintain accuracy even with ill-conditioned data.

Kahan-Babushka-Neumaier Summation

Standard floating-point summation loses precision:

// Naive sum fails on this
std::vector<double> values = {1.0, 1e100, 1.0, -1e100};
// Naive: 0.0 (wrong!)
// KBN:   2.0 (correct!)

auto summer = kbn_sum<double>();
for (auto v : values) summer += v;
std::cout << summer.eval();  // 2.0

Welford’s Online Algorithm

Computes mean and variance in a single pass without catastrophic cancellation:

auto welford = welford_accumulator<double>();
for (auto v : data) welford += v;

welford.count();           // Number of samples
welford.mean();            // Running mean
welford.sample_variance(); // Unbiased variance
welford.sample_std_dev();  // Standard deviation

Min/Max Tracking

auto minmax = minmax_accumulator<double>();
for (auto v : data) minmax += v;

minmax.min();  // Minimum value
minmax.max();  // Maximum value

Algebraic Composition

The key insight is that accumulators form a monoid under composition.

// Compose arbitrarily many accumulators
auto financial = kbn_sum<double>() +
                 welford_accumulator<double>() +
                 minmax_accumulator<double>();

std::vector<double> returns = {0.05, -0.02, 0.03, 0.01, -0.01, 0.04};
for (auto ret : returns) {
    financial += ret;  // All three update simultaneously
}

// Extract nested results
auto total = financial.get_first().eval();
auto mean = financial.get_second().mean();
auto volatility = financial.get_second().sample_std_dev();
auto worst = financial.get_second().get_second().min();
auto best = financial.get_second().get_second().max();

Mathematical Foundation

Monoid Structure

Each accumulator type A forms a monoid. The identity is the empty accumulator with no observations. The binary operation merges two accumulators (combining their observations).

auto a = welford_accumulator<double>();
auto b = welford_accumulator<double>();

// Process different data
for (auto v : data1) a += v;
for (auto v : data2) b += v;

// Merge results
auto combined = a + b;  // Equivalent to processing data1 ++ data2

Homomorphism Property

The composition operation preserves structure:

(a + b).process(x) = a.process(x) + b.process(x)

This enables parallel processing: split data, accumulate in parallel, merge results.

Type Safety with C++20 Concepts

Invalid compositions fail at compile time:

// Compile error: can't add incompatible accumulators
auto invalid = kbn_sum<double>() + kbn_sum<int>();  // Type mismatch!

// OK: compatible types compose
auto valid = kbn_sum<double>() + welford_accumulator<double>();

Use Cases

Financial analysis (track returns, volatility, drawdowns in one pass), scientific computing (online statistics for streaming sensor data), machine learning (feature statistics during data preprocessing), and monitoring systems (real-time metrics aggregation).

SLUUG Talk: Demystifying Large Language Models on Linux

February 23, 2024

Gave a talk for the St. Louis Unix Users Group about Large Language Models on Linux, from theory to application.

Fine-Tuning a Tiny LLM for ElasticSearch DSL

February 19, 2024

Fine-tuning a small LLM to generate ElasticSearch DSL from natural language.

Entropy Maps

February 18, 2024

The PDF version of this post is available on GitHub.

An entropy map approximates a function $f : \mathcal{X} \to \mathcal{Y}$ by hashing domain values to prefix-free codes in the codomain. We store nothing about the domain itself. We just hash, and a prefix of that hash serves as a code for a codomain value.

We allow multiple codes per codomain value. For instance, the value a might be encoded by 00, 01, 10, and 11. If the hash is less than 4, we decode it as a.

Suppose $\Pr\lbrace f(X) = y\rbrace = p_y$ where $X \sim p_X$. The optimally space-efficient code, assuming a uniform hash function $h$, assigns prefix-free codes for $y$ whose probability of being selected by $h$ sums to $p_y$. The expected bit length is

$$ \ell = -\sum_{y \in \mathcal{Y}} p_y \log_2 p_y, $$

which is the entropy of the output distribution. That is why we call it an entropy map.

If $\mathcal{X}$ is finite, we can think of it as implicitly encoding the domain and storing the prefix-free code for each domain element. The average bit length per element is $\ell$, and the total is $|X| \ell$.

Rate distortion: Bernoulli maps

We can allow errors. If one codomain value $y’$ is very common (say $p_{y’} > .99$), we can give it a prefix-free code that covers probability $p_{y’}$ and then skip coding for it in the entropy map. A random $x \in \mathcal{X}$ will map to $y’$ with probability $p_{y’}$ (which can be made as close to 1 as desired by trading space for accuracy). For the remaining domain values, we code them correctly, or allow errors on those too after attempting correct coding.

Bernoulli set-indicator function

Consider a set-indicator function

$$ 1_{\mathcal{A}} : \mathcal{X} \to \lbrace0,1\rbrace, $$

where $\mathcal{A} \subseteq \mathcal{X}$ and $\mathcal{X}$ is very large (possibly infinite). We assign prefix-free codes for codomain value $1$ such that a random hash maps an element of $\mathcal{X}$ to a code for $1$ with probability $\varepsilon$, where $\varepsilon$ is small (say $2^{-10}$).

There exists a (countably infinite) set of hash functions that hash all elements in $\mathcal{A}$ to codes for $1$ and elements in $\mathcal{A}’ = \mathcal{X} \setminus \mathcal{A}$ to codes for either $0$ or $1$. Choosing a random hash function with this property, we expect $\varepsilon$ of the elements in $\mathcal{A}’$ to hash to $1$ (false positives) and the remaining $1 - \varepsilon$ to hash to $0$.

A Boolean Algebra Over Trapdoors

June 17, 2023

This project is available on GitHub.

Boolean Algebra

A Boolean algebra is a mathematical structure that captures the properties of logical operations and sets. Formally, it is a 6-tuple $(B, \land, \lor, \neg, 0, 1)$, where

$B$ is a set of elements,
$\land$ ($\rm{and}$) and $\lor$ ($\rm{or}$) are binary operations on $B$,
$\neg$ ($\rm{not}$) is a unary operation on $B$,
$0$ and $1$ are elements of $B$, the minimum and maximum elements.

These must satisfy the usual axioms: closure, commutativity, associativity, distributivity, identity, and complements [1].

Boolean algebras show up everywhere. They form the foundation of propositional logic and are fundamental to digital circuit design and computer architecture [2].

In set theory, the standard representation is the power set of a set $X$, denoted $\mathcal{P}(X)$:

$B = \mathcal{P}(X)$,
$\land = \cap$ (set intersection),
$\lor = \cup$ (set union),
$\neg = \complement$ (set complement),
$0 = \emptyset$ (empty set),
$1 = X$ (universal set).

This set-theoretic Boolean algebra, $(\mathcal{P}(X), \cap, \cup, \complement, \emptyset, X)$, is the canonical example and the starting point for what follows: a Boolean algebra over trapdoors [3]. The construction preserves the familiar Boolean algebra properties while introducing cryptographic elements for secure computations.

Homomorphisms in Boolean Algebra

A homomorphism is a structure-preserving map between two algebraic structures of the same type. For Boolean algebras, it is a function that preserves the operations and special elements.

Given two Boolean algebras $(A, \land_A, \lor_A, \neg_A, 0_A, 1_A)$ and $(B, \land_B, \lor_B, \neg_B, 0_B, 1_B)$, a function $f: A \to B$ is a Boolean algebra homomorphism if for all $x, y \in A$:

$f(x \land_A y) = f(x) \land_B f(y)$
$f(x \lor_A y) = f(x) \lor_B f(y)$
$f(\neg_A x) = \neg_B f(x)$
$f(0_A) = 0_B$
$f(1_A) = 1_B$

A homomorphism preserves structure across the mapping: you can perform operations in one algebra and have them correspond to operations in the other [4].

This matters because it lets us build a mapping between our original Boolean algebra and a new structure with cryptographic elements while still maintaining the essential properties. Operations in the trapdoor algebra remain logically consistent with standard Boolean operations.

In the following sections, I introduce a specific homomorphism $F$ that maps elements from our original algebra to a Boolean algebra over bit strings, incorporating a cryptographic hash function. This homomorphism is the foundation of the Boolean algebra over trapdoors.

Known Plaintext Attacks on Time Series Encryption

February 15, 2024

Time series data has properties that make standard encryption dangerously insufficient. This paper analyzes known plaintext attack vulnerabilities in time series encryption schemes and shows how naive approaches leak structure even when the ciphertext looks opaque.

The Time Series Problem

Time series data is special in ways that matter for cryptography.

Adjacent values are statistically dependent (temporal correlation). Daily, weekly, and seasonal cycles create periodic patterns. Future values are often inferable from past values. And IoT sensors generate massive encrypted streams, giving attackers a lot of material to work with.

Vulnerability Analysis

Standard Encryption Isn’t Enough

Simply applying AES-CTR or AES-CBC to time series data has problems.

Length information is preserved: packet sizes reveal data magnitude patterns, and message boundaries leak temporal structure.

Pattern regularity leaks through: identical plaintexts produce identical ciphertexts in ECB mode, and predictable IV patterns weaken CTR mode.

Statistical attacks become viable: frequency analysis on encrypted streams and correlation attacks across time windows.

The Known Plaintext Attack

Given pairs of (plaintext, ciphertext) for some time points, the attack proceeds as follows.

First, recover periodic patterns in the known plaintexts. Then forecast future plaintexts using time series models. Compare predictions with observed ciphertexts. Refine the model as more data is revealed.

For predictable time series (autocorrelation above 0.7), this achieves 70 to 90 percent accuracy recovering future values. It works even with only 10 percent known plaintexts. And it improves over time as more data is collected.

Case Studies

Smart meter data. Encrypted power consumption readings. Daily usage patterns are highly predictable. Known plaintexts come from utility bills. The attack recovers household occupancy patterns.

Medical sensors. Encrypted vital signs. Heart rate and blood pressure exhibit circadian rhythms. Known values come from medical records. The attack infers patient activity and health events.

Financial time series. Encrypted trading data. Price movements follow predictable patterns. Public market data provides known plaintexts. The attack reveals private trading strategies.

Defensive Approaches

Format-Preserving Encryption

Encrypt individual values, not byte streams. Add controlled noise to break correlations. Use order-preserving encryption carefully (it has its own vulnerabilities).

Homomorphic Encryption

Perform computations on encrypted data. Never decrypt individual points. High computational cost, but provably secure.

Perfect Hashing: Space Bounds, Entropy, and Cryptographic Security

February 1, 2024

Can a perfect hash function be cryptographically secure, space-optimal, and maximum-entropy encoded all at once? This paper proves such a construction exists and analyzes exactly what you sacrifice to get all three.

The Impossible Triangle

Perfect hash functions typically face tradeoffs. Space-optimal constructions (CHD, BDZ) sacrifice randomness. Cryptographic hash functions waste space on collision resistance. Maximum-entropy encodings require extra bits.

The question: can you have all three?

The Construction

The data type is PH = {0,1} x N* with a constructor ph(X, r) = (n', N) where:

N = ceil(m/r): Hash table size (where m = |X|)
beta(x,n) = trunc(hash(x’ # n’), k)’ mod N: Hash function parameterized by seed n
n = min{j in N | beta is collision-free on X}: Search for smallest collision-free seed
n’: Geometric code encoding n (variable-length prefix-free)

The algorithm: try seeds n = 0, 1, 2, … until beta(.,n) has no collisions on X. Each trial is geometrically distributed with success probability p(m,r), so the expected space for encoding n’ achieves the information-theoretic lower bound of roughly 1.44 bits per element.

Under the random oracle assumption, hash: {0,1}* -> {0,1}^infinity outputs uniform random bits, making the final encoding (n’, N) indistinguishable from a random bit string.

Rate-Distortion Tradeoff

The “rate-distortion” framing comes from information theory: what’s the rate (bits per key) for a given distortion (lookup time)?

Zero distortion (O(1) lookup): roughly n log n bits. Constant distortion (small tables): practical two-level schemes approach roughly 1.44n bits. Variable distortion: trade bits for lookup time continuously.

Why Cryptographic Matters

Non-cryptographic perfect hashes are deterministic, so adversaries can engineer collision-inducing inputs. A cryptographic perfect hash (random oracle) prevents adversarial key selection (can’t craft keys that break the hash), side-channel attacks (encoding reveals no information about keys), and fingerprinting (maximum entropy makes encodings look random).

The Algebra of Composition

Section 5 proves that composing perfect hash functions preserves injectivity. If h1: S -> T and h2: T -> U are injective, then h2 composed with h1: S -> U is injective.

This connects to my algebraic_hashing library, where composition of cryptographic hashes via XOR preserves both security and structure.

Weibull Distributions: From Reliability Theory to My Own Survival Curve

April 18, 2022

The Weibull distribution models time-to-failure. In reliability engineering, that means component lifetimes. In medicine, it means survival times. I have been working with Weibull models for my thesis on series system reliability. Then I got diagnosed with cancer, and now every time I work with survival curves, I am looking at mathematical abstractions of something very concrete: how long until failure?

The Mathematics

The Weibull CDF:

F(t) = 1 - exp(-(t/λ)^k)

Two parameters:

λ: scale (characteristic lifetime)
k: shape (how failure rate changes over time)

The shape parameter k tells you the whole story:

k < 1: Decreasing hazard. If you survive early on, your risk goes down. This is the infant mortality pattern.

k = 1: Constant hazard. Memoryless. This is just the exponential distribution.

k > 1: Increasing hazard. Things wear out.

The Hazard Function

The hazard function is what makes Weibull useful for survival analysis:

h(t) = (k/λ)(t/λ)^(k-1)

This is the instantaneous failure rate: given that you have survived to time t, what is the probability you fail in the next instant?

For cancer, this is the number that matters. Some cancers have increasing hazard (the longer you have it, the worse things get). Others have decreasing hazard after initial treatment, meaning if you make it past the critical period, prognosis improves. Knowing which pattern applies to your disease changes how you think about time.

Personal Context

When you study survival analysis academically, it is abstract. When you are living it, every curve is personal.

I look at Kaplan-Meier plots and see myself somewhere on that curve. I work with hazard functions and think: is my k > 1 or k < 1? Am I in the wearing-out regime or the if-you-make-it-past-this-it-gets-better regime?

The math does not change. But the meaning does.

The Irony

I chose reliability engineering for my thesis before the cancer diagnosis. I was studying component failures in series systems, where if any one part fails, the whole system fails.

Then I became a series system. Organs, treatment response, immune function. All have to work. Failure of any one is catastrophic.

The mathematics I was studying abstractly became uncomfortably literal.

Reliability Analysis and the Problem of Censored Data

August 14, 2019

One of the most interesting statistical problems I have encountered is reliability analysis with censored data: situations where you know something didn’t fail, but not when it will fail.

The Censoring Problem

Imagine testing light bulbs. You run them for 1000 hours. Some fail during the test. Others are still working when you stop.

For the survivors, you know:

They lasted at least 1000 hours
You do not know their actual lifetime

This is right censoring. The true value lies somewhere to the right of your observation. You have a lower bound, not a measurement.

Why This Matters

Censored data is everywhere:

Medical studies (patients still alive at study end)
Engineering tests (components that have not failed)
Customer retention (users still active)

The naive responses are both wrong. Ignoring censored observations wastes information. Treating them as failures introduces bias. You need a framework that uses the partial information you actually have.

Maximum Likelihood to the Rescue

The solution is maximum likelihood estimation with likelihood contributions that account for censoring:

Failure observations contribute the probability density $f(t)$. You observed the exact failure time, so you know the probability of failing at that time.
Censored observations contribute the survival probability $S(t)$. You know the unit survived to time $t$, so its contribution is the probability of surviving at least that long.

The likelihood for the whole sample is:

$$L = \prod_{i: \text{failed}} f(t_i) \prod_{j: \text{censored}} S(t_j)$$

This lets you extract information from both failed and surviving units. The censored observations pull the estimated reliability upward; the failures pull it downward. Maximum likelihood balances them.

Series Systems Complexity

It gets more interesting with series systems, systems that fail when any component fails. If you observe system failure but do not know which component caused it, you have masked failure data.

This is the problem I am most interested in: extracting component-level reliability from system-level failures when the cause is ambiguous. The masking adds a latent variable, and the likelihood becomes a mixture. You can handle it with EM algorithms or direct optimization, but the combinatorics grow quickly with system size.

This work is laying groundwork for what will become a major focus of my mathematical statistics degree.

Bootstrap Methods: When Theory Meets Computation

September 10, 2021

The bootstrap is a trade: mathematical complexity for computational burden. Instead of deriving analytical formulas for sampling distributions, you simulate them.

The Idea

If you don’t know the sampling distribution of a statistic, approximate it by resampling from your data.

Draw samples with replacement from the original data
Compute your statistic on each resample
The distribution of resampled statistics approximates the true sampling distribution

That’s it. The justification is more subtle than the procedure. Under regularity conditions, the bootstrap distribution converges to the true sampling distribution as sample size grows. This is non-parametric inference: you use the empirical distribution as a stand-in for the true distribution, without assuming a parametric form.

When I Use It

Bootstrap is my default tool when:

I need confidence intervals for statistics with no closed-form variance
Asymptotic theory doesn’t apply (small samples, non-standard statistics)
I’m doing model selection via bootstrap cross-validation
I’m working with censored data where standard errors are intractable

That last case is the one that matters most for my research.

The Computational Trade

Better to get the right answer slowly than the wrong answer quickly.

Deriving an analytical variance formula is hard. Sometimes it’s impossible for the statistic you actually care about. Bootstrap says: just compute the statistic 10,000 times on resampled data and look at the spread. With modern hardware, 10,000 resamples takes seconds.

The trade is almost always worth it.

My Thesis Work

My research uses bootstrap heavily. I’m working on reliability estimation for series systems where components fail and you don’t know which one caused the system failure. This is the masked failure data problem.

For these models, the MLE exists and you can compute it, but the standard variance formulas don’t. The Fisher information matrix involves expectations over the masking distribution that don’t simplify to anything closed-form.

Bootstrap gives me confidence intervals anyway. Resample the masked failure data, recompute the MLE on each resample, and use the distribution of bootstrapped MLEs to construct intervals. It’s not elegant, but it works, and “works” is the right criterion when the alternative is “no confidence intervals at all.”

IEEE Paper: Estimating Encrypted Search Confidentiality via Bootstrap

November 2, 2016

This is my first IEEE publication, co-authored with Professor Hiroshi Fujinoki. The problem: if you encrypt search queries but an adversary can observe the ciphertext traffic, how many queries do they need before a frequency attack succeeds?

We used the Moving Average Bootstrap (MAB) method to estimate that threshold. The idea is that encrypted search leaks frequency information (how often each ciphertext appears), and an adversary can correlate those frequencies against known plaintext distributions. The bootstrap lets us estimate confidence intervals on the number of observations needed without closed-form solutions.

View PDF

This came out of my MS thesis work on encrypted search at SIU. The core question (how much does encrypted search actually leak?) turns out to be harder than it sounds, because the answer depends on the plaintext distribution, the query distribution, and how patient the adversary is. The bootstrap approach gives us a way to answer it empirically.

For more related work, see my research page and publications.

algebraic.mle: MLEs as Algebraic Objects

May 15, 2021

Maximum likelihood estimators have rich mathematical structure. They are consistent, asymptotically normal, efficient. algebraic.mle exposes this structure through an algebra where MLEs are objects you compose, transform, and query.

The Abstraction

An MLE is not just a vector of parameter estimates. It is a statistical object that carries point estimates $\hat{\theta}$, the Fisher information matrix $I(\hat{\theta})$, the variance-covariance matrix $I^{-1}(\hat{\theta})$, Wald-type confidence intervals from asymptotic normality, the log-likelihood value, and convergence diagnostics.

The package wraps all of this in a consistent interface:

library(algebraic.mle)

fit <- mle(likelihood_model, data)
coef(fit)           # Parameter estimates
vcov(fit)           # Variance-covariance matrix
confint(fit)        # Confidence intervals
logLik(fit)         # Log-likelihood
aic(fit)            # Model selection

Composition

The real point is that MLEs compose. Independent models combine:

fit1 <- mle(model1, data1)
fit2 <- mle(model2, data2)
combined <- fit1 + fit2  # Joint likelihood

The package handles the algebra. Joint log-likelihood, block-diagonal Fisher information, everything propagates correctly. This works because likelihoods from independent data sources multiply, and multiplication of likelihoods is addition of log-likelihoods. That is a monoid. The package enforces it.

The Ecosystem

algebraic.mle is the foundation for a family of packages:

Package	Purpose
likelihood.model	Compositional likelihood specification
maskedcauses	Masked failure data in series systems
mdrelax	Relaxed masking conditions
algebraic.dist	Distributions as algebraic objects
flexhaz	Dynamic failure rate distributions
hypothesize	Likelihood ratio tests on MLEs
nabla	Numerical optimization backends

The typical workflow:

Define distributions with algebraic.dist
Specify likelihood contributions with likelihood.model
Fit the model and get an mle object from algebraic.mle
Query statistical properties: confidence intervals, hypothesis tests, model selection

For series systems with masked data:

library(maskedcauses)
library(algebraic.mle)

# Specify masking model (C1-C2-C3 conditions)
model <- md_likelihood_model(components = 3, masking = "bernoulli")

# Fit -> returns algebraic.mle object
fit <- md_mle_exp_series_C1_C2_C3(masked_data)

# All the standard MLE methods work
confint(fit)
vcov(fit)
aic(fit)

Theory

The asymptotic properties that algebraic.mle exploits come from classical MLE theory:

$$\sqrt{n}(\hat{\theta}_n - \theta^{\ast}) \xrightarrow{d} \mathcal{N}(0, I^{-1}(\theta^{\ast}))$$

The expo-masked-fim paper derives closed-form Fisher information for exponential series systems. That is exactly what algebraic.mle uses internally for variance estimation in that case.

For more complex models (Weibull, relaxed masking conditions), we compute Fisher information numerically via observed information:

$$\hat{I}(\hat{\theta}) = -\frac{\partial^2 \ell}{\partial \theta \partial \theta^T}\bigg|_{\theta=\hat{\theta}}$$

Design Principles

Separation of concerns. The likelihood specification (likelihood.model) is independent of the fitting algorithm (nabla) and the result type (algebraic.mle). You can swap optimizers without changing downstream code.

algebraic.dist: Distributions as Algebraic Objects in R

February 1, 2021

Most statistical software treats probability distributions as parameter sets you pass to sampling or density functions. algebraic.dist takes a different approach. Distributions are algebraic objects that compose, transform, and combine through standard mathematical operations.

The Idea

Instead of this:

x <- rnorm(1000, mean=5, sd=2)
y <- rnorm(1000, mean=3, sd=1)
z <- x + y  # Just numeric vectors

You write:

X <- Normal(mean=5, sd=2)
Y <- Normal(mean=3, sd=1)
Z <- X + Y  # A new distribution object!
sample(Z, 1000)

The sum Z knows it is Normal(mean=8, sd=sqrt(5)) because the algebra works it out. You never lost the distributional structure.

Why It Matters

When you add two normal distributions numerically, you get samples from the sum. But you lose the distribution. With algebraic.dist, the result is still a distribution object with proper parameters and you can keep composing.

You can build complex distributional expressions and simplify them algebraically before ever drawing a sample:

portfolio <- 0.6*StockA + 0.4*StockB
risk <- sd(portfolio)  # Computed symbolically

For distributions with known closed-form algebra (normal, exponential, certain mixtures), you do not need simulation. You just compute the exact answer. Monte Carlo without the Monte Carlo.

Composition

This is functional programming applied to probability theory. Distributions become composable building blocks:

Mixture models: 0.3*Normal(0,1) + 0.7*Normal(5,2)
Transformed distributions: exp(Normal(0,1)) is lognormal
Conditional distributions: X | (X > 0) for truncation

The idea is that computation should mirror mathematical structure. If the math says you can add two normals and get a normal, the code should do the same thing and give you a normal back, not a vector of samples.

This connects to a broader theme in my work. Just as my oblivious computing research uses type theory to enforce privacy invariants, algebraic.dist uses algebraic types to enforce distributional invariants. The algebra tells you what operations are valid and what the results mean.

Implementation

Language: R
Type system: S3 classes with method dispatch for operations
Closed-form operations: Normal, exponential, gamma families
Fallback: Monte Carlo for compositions without closed forms
Repository: github.com/queelius/algebraic.dist

algebraic.mle: Maximum likelihood estimation with algebraic specification
numerical.mle: Numerical optimization for MLE when closed forms do not exist
likelihood.model: Likelihood-based inference with compositional model building

Most statistical software is imperative. You tell it what to do step by step. algebraic.dist is declarative. You describe the distributional relationships and the computer figures out what to compute. Small composable pieces that do one thing well: preserve distributional structure through transformations.

Quality-Space and Consciousness-Primary Magic in Call of Asheron

April 20, 2020

Beyond Magic as Physics

Most fantasy treats magic as “just another kind of physics.” A mechanistic system with laws, conservation principles, and causal chains that happen to involve wands instead of forces. Even sophisticated magic systems tend to treat consciousness as epiphenomenal: the wizard’s mind initiates a process, but the actual work happens through quasi-physical mechanisms.

The Call of Asheron proposes something different: consciousness-primary magic that operates through quality-negotiation rather than quantity-manipulation.

Quality-Space vs Quantity-Space

The novel distinguishes between two fundamental aspects of reality:

Quantity-manipulation: The domain of physics. Measurable properties, numerical relationships, mechanistic causation.
Quality-negotiation: The domain of magic. Qualia, phenomenal character, direct consciousness-reality interaction.

These are not separate realms but different engagements with the same reality. Physics quantifies; magic qualifies. Physics measures; magic experiences.

Consider this passage describing Duulak’s first experience with Dereth’s high quality-space saturation:

“On Ispar, casting had always felt like pushing—will against resistance, consciousness negotiating with a substrate that preferred its default configurations. Here, magic felt like surfing. The quality-space saturation was so dense he could almost see it, perceive the correlations between consciousness and reality as shimmering threads that his bandwidth could finally hold.”

Magic is not forcing reality through symbolic mediation. It is consciousness directly proposing configurations to a reality that is “waiting to be transformed, countless degrees of freedom eager for consciousness to propose configurations.”

Direct Consciousness-Reality Proposal

The key insight: consciousness does not manipulate reality through mechanisms; it proposes configurations to reality. This differs fundamentally from:

Dualist magic: Mind causes physical effects through mysterious interaction
Physicalist magic: Mental states reduce to brain states that trigger physical processes
Mechanistic magic: Consciousness initiates lawful causal chains

Instead, The Call of Asheron presents something closer to participatory realism: reality has countless degrees of freedom, and consciousness can directly propose how those degrees of freedom should be actualized. Quality-space is the interface between phenomenal experience and physical manifestation.

Duulak perceives this directly:

“He could perceive the quality-space itself, see the way his consciousness had bent reality not through symbolic mediation but through direct proposal. This wasn’t reality resisting transformation and him forcing it anyway. This was reality waiting to be transformed.”

The Four Consciousness-Architectures: Why One Perspective Is Blindness

April 15, 2020

The Empyrean Catastrophe

Thirty thousand years of continuous civilization. Mastery of quality-space, consciousness-transfer, dimensional mechanics. Wonder-works that still function millennia after their creators went extinct.

And the Empyreans still failed.

Not from lack of intelligence or power or knowledge. They failed because of something quieter and more fatal: cognitive homogeneity.

“We all thought alike. The same cognitive style, the same approach to problems, the same blindness to alternatives. When the Olthoi came, when the Matriarch proved impossible to kill or contain permanently, we had no cognitive diversity to draw upon. Every Empyrean solution came from the same mental architecture. And every solution failed.”

This is the foundation of the Harbinger Protocol: the recognition that no singular consciousness-architecture perceives The Mechanism completely.

Four Fundamental Perspectives

The ancient Empyrean texts identified four archetypal ways consciousness relates to reality. Not personality types or learned styles, but deep structural modes of engagement:

The Organizer: Reality as Structure

Marcus Tiberius, taken from Rome at the moment he chose death holding the line rather than retreating.

“Your entire consciousness is structured around creating order from chaos, building systems that endure. You cannot help but organize. It is what you are.”

The Organizer sees reality as something to impose structure upon. Where others see flow, the Organizer perceives architecture. This is not a preference. It is a fundamental mode of existing. The Organizer’s bandwidth is optimized for:

Pattern imposition rather than pattern discovery
System-building rather than system-analysis
Creating order from chaos rather than finding order within chaos

The Organizer’s blindness: missing the flow beneath the structure, the ways reality resists rigid categorization.

The Understander: Reality as Pattern

Duulak the Twice-Blessed, taken mid-insight while grasping the edge of The Mechanism.

“You cannot stop seeking patterns. Understanding is not what you do, it is your fundamental mode of existing.”

The Understander sees reality as pattern to comprehend. Where others see paradox, the Understander perceives regularity. The Understander’s consciousness is structured around:

Mapping deep structures
Pursuing comprehension over comfort
Finding hidden coherence in apparent chaos

The Understander’s blindness: missing the genuine paradoxes, the ways reality resists complete comprehension, the truths that cannot be reduced to patterns.

Bandwidth as Fundamental Constant: The 7±2 Limit in Call of Asheron

April 10, 2020

From Folk Wisdom to Physical Constant

In cognitive psychology, the “7 plus or minus 2” rule is well known: human working memory can hold roughly seven items simultaneously. It’s treated as a fact about neural architecture, a consequence of how our brains happen to be built, constrained by biological evolution and physical implementation.

The Call of Asheron proposes something stranger: bandwidth limitations are fundamental constants governing consciousness-reality interaction, not merely implementation details of biological cognition.

The Bandwidth Sufficiency Principle

When Duulak studies ancient Empyrean texts, he discovers they had “mathematized bandwidth constraints”:

“They had mathematized bandwidth constraints, treating the 7 plus or minus 2 limit of working memory not as folk wisdom but as a fundamental constant governing consciousness-reality interaction. One fragmentary theorem, Celeste had translated it as the ‘Bandwidth Sufficiency Principle’, suggested that any finite consciousness would hit limits in perceiving what they called ‘The Mechanism.’”

This is a radical claim. It says 7 plus or minus 2 isn’t a quirk of human neurology, an evolutionary adaptation to specific environmental pressures, a consequence of brain size, or something that more advanced minds could overcome through better design.

Instead, it’s a fundamental constraint on how consciousness can engage with quality-space.

Consciousness Without Substrate

The novel tests this claim during Duulak’s death-resurrection cycles through the lifestone network. Between death and resurrection, he experiences something impossible:

“That space between ending and resuming where consciousness existed without substrate. Not void, that was the wrong word. A quality-space that had no physical correlate, where the what-it’s-like of experience persisted despite nothing experiencing it.”

In this state, freed from neural constraints, what happens to bandwidth limitations?

“Without bandwidth limits imposed by neural substrate, he could hold configurations that physical brains couldn’t process. The 7 plus or minus 2 limitation vanished when consciousness had no wetware bottleneck.”

So the limit can be transcended, but only when consciousness exists without physical embodiment, in pure quality-space. This suggests something about the relationship between bandwidth, embodiment, and reality-engagement that I find genuinely interesting to think through.

What Bandwidth Actually Limits

If bandwidth isn’t about neural capacity, what is it about? The novel suggests it’s about phenomenal complexity: how many independent qualitative features consciousness can simultaneously hold and actively manipulate.

The Call of Asheron: Magic as Computational Discovery

March 15, 2020

I wrote a fantasy novel. The premise is that magic isn’t mysterious power handed down from gods or inherited through bloodlines. It’s natural philosophy, the systematic study of reality’s computational substrate. You discover it the same way you discover physics: by paying attention, forming hypotheses, and testing them.

Duulak is a theoretical thaumaturge. He’s working out the mathematical foundations that make magic possible, the way a physicist works out the math behind why things fall. The magic system has rules, and those rules have consequences, and the consequences are where the story lives.

I wanted to write fantasy for people who think magic should be rigorous without being sterile. Rigor and wonder aren’t opposed. If anything, the constraints make the interesting stuff more interesting.

The Call of Asheron | GitHub

Method	Accuracy	Cost for \(f: \mathbb{R}^n \to \mathbb{R}\)	Cost for \(f: \mathbb{R} \to \mathbb{R}^m\)	Memory
Forward AD	Exact	\(O(n)\) passes	\(O(1)\) pass	\(O(1)\)
Reverse AD	Exact	\(O(1)\) pass	\(O(m)\) passes	\(O(\text{ops})\)
Finite Diff	\(O(h^p)\)	\(O(n)\) evaluations	\(O(n)\) evaluations	\(O(1)\)

Method	Accuracy	Cost for \(f: \mathbb{R}^n \to \mathbb{R}\)	Cost for \(f: \mathbb{R} \to \mathbb{R}^m\)	Memory
Forward AD	Exact	\(O(n)\) passes	\(O(1)\) pass	\(O(1)\)
Reverse AD	Exact	\(O(1)\) pass	\(O(m)\) passes	\(O(\text{ops})\)
Finite Diff	\(O(h^p)\)	\(O(n)\) evaluations	\(O(n)\) evaluations	\(O(1)\)

Rule	Formula	Error	Exact for
Midpoint	\((b-a)f(m)\)	\(O(h^3)\)	Linear
Trapezoidal	\(\frac{b-a}{2}(f(a)+f(b))\)	\(O(h^3)\)	Linear
Simpson’s	\(\frac{b-a}{6}(f(a)+4f(m)+f(b))\)	\(O(h^5)\)	Cubic

Model	Parameters	Use Case
`exp_series_md_c1_c2_c3`	\(m\) rates \((\lambda_1, \ldots, \lambda_m)\)	Memoryless components (constant failure rate)
`wei_series_md_c1_c2_c3`	\(2m\) params \((k_1, \beta_1, \ldots, k_m, \beta_m)\)	Weibull with per-component shapes
`wei_series_homogeneous_md_c1_c2_c3`	\(m+1\) params \((k, \beta_1, \ldots, \beta_m)\)	Weibull with shared shape parameter

Witnesses	Error bound
10	\(< 10^{-6}\)
20	\(< 10^{-12}\)
40	\(< 10^{-24}\)