Linked project: Pfc
Synthesis: Codecs as Structure
May 15, 2026
Twelve posts, twelve codes, one thesis that refused to change. This is the closing summary.
A. The Twelve Codes Together
Every post in this series answered a version of the same question: given a source of positive integers, how do you represent its values compactly as a sequence of bits? The answers differ in shape, in assumptions, and in which distribution each code implicitly expects.
| Post | Code | Implied prior (one phrase) |
|---|---|---|
| 1-2 | Foundations | Prefix-free codes are possible iff Kraft’s inequality holds |
| 3 | Priors framework | Any code defines a prior; the best code matches the source |
| 4 | Unary | Geometric(1/2): value 1 is twice as likely as value 2, etc. |
| 5a | Elias Gamma | Power-law: probability falls as 1/n^2 |
| 5b | Elias Delta | Heavier-tailed power law: slower decay for large values |
| 5c | Elias Omega | Recursive structure: no fixed polynomial decay rate |
| 6 | Fibonacci | Near-geometric with Zeckendorf structure; good for Zeckendorf-sparse integers |
| 7 | Rice / Golomb | Geometric with known parameter m; optimal when m divides entropy |
| 8 | VByte | Roughly uniform over byte-aligned ranges; engineering favorite |
| 9 | Huffman | Source-optimal given the exact symbol distribution |
| 10 | Arithmetic coding | Approaches entropy to an arbitrary fraction of a bit |
| 11 | Succinct bit vectors | Not a code for integers: a representation that answers rank/select queries |
| 12 | RoaringBitmap | Polyalgorithm: picks array, bitset, or run-length per container chunk |
Posts 1 and 2 (Kraft’s Inequality and McMillan’s Converse) established why prefix-free codes are the right unit of analysis. Post 3 (Universal Codes as Priors) named the frame: a code is a hypothesis about the source. Posts 4 through 10 filled in the catalogue. Posts 11 and 12 extended from integer coding to set representation, where the questions shift from “how long is this codeword?” to “how do you store membership?” and “how do you answer rank/select?”
Looking across all twelve, the main lesson is not that one code dominates. It is that the question “which code?” is always empirically answerable given a sample.
B. The Unifying Frame Restated
Post 3 introduced the codes-as-priors thesis with two instances behind it. We now have twelve. The thesis has not changed; it has only become more evidently true.
Bits Follow Types
April 23, 2026
Every type decomposes structurally. So does its codec.
Codecs as Functors
You have an optional<vector<pair<int, string>>>. The type decomposes structurally: it is an optional of a free monoid of products of an integer and a string. That decomposition is not an observation about memory layout. It is a statement about the algebraic structure of the type.
Now ask: does the codec decompose the same way?
If the answer is yes, you stop writing one-off encoders. You build a codec for optional<T> from a codec for T. You build a codec for vector<T> from a codec for T. The codec for optional<vector<pair<int, string>>> assembles from its parts with no manual layout decisions, no hand-placed length headers, no ad-hoc format negotiation.
This post argues that the answer is always yes, and shows what the machinery looks like. The thesis: codecs are not ad-hoc bit formats. They are constructions on the algebraic structure of types. The algebraic structure of a type determines its codec, the same way it determines its algorithms.
This extends Stepanov’s claim. The peasant algorithm post showed that algorithms arise from algebraic structure. The homomorphism post showed that structure-preserving maps are the natural morphisms. Here, we show the codec itself is a structure-preserving map, and that it lifts from leaf types to compound types by the same algebraic logic.
Bit I/O: The Foundation
Before combinators, we need concrete bit I/O. The approach taken here follows Stepanov’s move in the algorithm posts: state the concept first, then provide a model.
Two concepts govern bit-level I/O:
template<typename T>
concept BitSink = requires(T& s, bool bit) {
{ s.write(bit) } -> std::same_as<void>;
};
template<typename T>
concept BitSource = requires(T& s) {
{ s.read() } -> std::same_as<bool>;
{ s.peek() } -> std::convertible_to<bool>;
};
A BitSink accepts bits. A BitSource supplies them. A codec is an algorithm parameterized over BitSink and BitSource, not a class hierarchy. This is Stepanov’s move at the bit level: require only what the algorithm needs, let anything that satisfies the concept participate.
The standard models are BitWriter and BitReader, which pack bits into byte buffers in LSB-first order:
class BitWriter {
std::span<std::uint8_t> buf_;
std::size_t byte_idx_ = 0;
std::uint8_t byte_ = 0;
std::uint8_t bit_pos_ = 0;
public:
explicit BitWriter(std::span<std::uint8_t> buf) noexcept : buf_(buf) {}
void write(bool bit) noexcept {
byte_ |= (bit ? std::uint8_t{1} : std::uint8_t{0}) << bit_pos_;
if (++bit_pos_ == 8) {
buf_[byte_idx_++] = byte_;
byte_ = 0;
bit_pos_ = 0;
}
}
void align() noexcept {
if (bit_pos_ > 0) {
buf_[byte_idx_++] = byte_;
byte_ = 0;
bit_pos_ = 0;
}
}
[[nodiscard]] std::size_t bytes_written() const noexcept {
return byte_idx_ + (bit_pos_ > 0 ? 1 : 0);
}
};
class BitReader {
std::span<const std::uint8_t> buf_;
std::size_t byte_idx_ = 0;
std::uint8_t bit_pos_ = 0;
public:
explicit BitReader(std::span<const std::uint8_t> buf) noexcept : buf_(buf) {}
bool read() noexcept {
bool bit = ((buf_[byte_idx_] >> bit_pos_) & 1) != 0;
if (++bit_pos_ == 8) {
++byte_idx_;
bit_pos_ = 0;
}
return bit;
}
[[nodiscard]] bool peek() const noexcept {
return byte_idx_ < buf_.size();
}
};
A codec concept rounds out the three-concept core:
When Lists Become Bits
April 23, 2026
The free monoid on a type lifts to bit space. It lifts injectively only when the element codec is prefix-free.
Prefix-Free Codes and the Free Monoid
You have a list of unsigned integers. Encode the list as a single bit string.
Fixed-width encoding wastes space. If you allocate 64 bits per integer, small values like 1 or 7 cost as much as values near \(2^{64}\). Variable-width encoding recovers that space, but immediately raises a harder question: where does one encoded integer end and the next begin?
Two escape routes. First, prefix each encoded item with its length. That works, but the length headers are overhead, and you now need a codec for the lengths as well. Second, choose a code where the structure of the codewords makes boundaries unambiguous without any headers. These are prefix-free codes, and this is the right answer, in a precise categorical sense.
The “precise categorical sense” is what this post develops. Encoding a list as the concatenation of encoded elements is a monoid homomorphism from the free monoid on \(T\) to the monoid of bit strings under concatenation. The universal property of the free monoid guarantees this homomorphism always exists. The question of whether the decoder can invert it comes down to exactly one property of the element codec: whether it is prefix-free.
The Free Monoid, Recalled
A monoid is a set with an associative binary operation and an identity element. The free monoid on a set \(S\) is the set of all finite sequences of elements from \(S\), with concatenation as the operation and the empty sequence as the identity.
“Free” means no equations hold except those forced by the monoid axioms. Nothing is identified with anything else. If you need commutativity or idempotency, you quotient the free monoid by additional equations. But the free monoid itself imposes nothing beyond associativity and identity.
The universal property says: given any monoid \(M\) and any function \(f: S \to M\), there is exactly one monoid homomorphism \(\hat{f}: \text{Free}(S) \to M\) that extends \(f\). That unique extension is fold:
$$\hat{f}([x_1, x_2, \ldots, x_n]) = f(x_1) \cdot f(x_2) \cdot \cdots \cdot f(x_n)$$where \(\cdot\) is the operation in \(M\). The free-algebra post develops this in full. For this post, the one fact that matters is that fold is canonical: it is the unique way to extend a per-element map to a list-consuming function that respects the monoid structure.
Arithmetic Coding
January 12, 2025
Huffman codes one symbol at a time. Arithmetic coding encodes the whole sequence as a single number. The difference is a factor of twelve, at least on the right source.
The Last Bit of Redundancy
Huffman coding gets expected codeword length within one bit of entropy. That is the best it can do, because codeword lengths must be integers while entropy is a real number.
The waste is structural. A symbol with probability $p = 0.7$ has optimal (fractional) length $-\log_2(0.7) \approx 0.515$ bits. Huffman rounds that up to 1 bit: 0.485 bits wasted per occurrence. For a nearly-deterministic source with $p_0 = 0.99$ and $p_1 = 0.01$, the entropy is $H \approx 0.081$ bits per symbol. Huffman is stuck at 1 bit per symbol. That is a factor-of-twelve gap, and Huffman cannot close it: a symbol that appears 99% of the time still gets a complete codeword.
Arithmetic coding steps back from per-symbol codewords entirely. It encodes an entire sequence as a single rational number in $[0, 1)$. The bit-length of that number converges to the entropy of the sequence as the sequence grows. No integer rounding, no per-symbol overhead.
This post builds an integer range coder in C++23 and demonstrates the factor-of-twelve improvement on the Bernoulli(0.99) source.
The Continuous View
Start with the unit interval $[0, 1)$. For a two-symbol source, partition it by probability: symbol 0 gets $[0, p_0)$ and symbol 1 gets $[p_0, 1)$.
To encode a sequence, begin with the full interval and narrow it with each symbol. After symbol $s_1$, restrict to the corresponding sub-interval. After $s_2$, apply the same proportional rule inside that sub-interval. After $L$ symbols, the interval has width $\prod_{i=1}^{L} p_{s_i}$.
Any number inside the final interval is a valid encoding. The shortest such number in binary requires approximately $-\log_2(\prod p_{s_i}) = \sum_{i=1}^{L} (-\log_2 p_{s_i})$ bits. As $L \to \infty$, bits per symbol approaches $H(p) = -\sum_k p_k \log_2 p_k$ exactly.
Decoding is the inverse: given the encoded number, determine at each step which sub-interval it falls in, recover the symbol, narrow the interval, and repeat.
The theory is complete. The practice is not. A real interval narrows exponentially fast: after a few dozen symbols you need arbitrary precision. The integer range coder fixes this with 32-bit arithmetic and a renormalization step.
Huffman Coding
August 4, 2024
Huffman coding is two things: the optimal length vector for a known distribution, and McMillan’s construction applied to that vector. This post develops both.
From Universal to Optimal
Every code in this series so far has been universal: no prior knowledge of the source distribution required. Elias gamma assigns shorter codewords to smaller integers regardless of which integers actually appear. Fibonacci does the same. VByte packs smaller values into fewer bytes without knowing whether your data clusters at the low end or the high end. Universal codes are defensive: they perform acceptably across a broad class of inputs by committing to none.
Huffman flips that stance. You bring a finite probability distribution. Huffman finds the prefix-free code with minimum expected codeword length for that distribution. The code is tuned to what you provide and will perform poorly on anything else. Call this the move from defensive to distribution-specific coding.
The payoff is real. Shannon’s source coding theorem says no prefix-free code can achieve expected length below $H(p) = -\sum_i p_i \log_2 p_i$. Huffman gets within 1 bit of that bound. For any prefix-free code, expected length satisfies
$$H(p) \le L(\text{code}) \le H(p) + 1.$$The upper bound comes from the integer-length constraint: each codeword is a whole number of bits, and $\lceil -\log_2 p_i \rceil \le -\log_2 p_i + 1$. Huffman is optimal subject to this constraint. No prefix-free code does better without abandoning whole-bit codewords.
That last clause points to the limit of this post and the subject of the next. Arithmetic coding breaks the integer-length constraint by assigning fractional bits in effect, reaching entropy exactly in the limit.
The Algorithm
The four steps of Huffman’s construction:
- Create one leaf node per symbol, weighted by its probability.
- Push all leaves into a min-priority queue (lowest weight first).
- While the queue contains more than one node: extract the two lowest-weight nodes, merge them into a new internal node whose weight is their sum, push the internal node back.
- The remaining node is the root. The path from root to each leaf encodes that leaf’s codeword (“0” for left, “1” for right).
Here is the complete implementation from huffman.hpp:
PFC: Zero-Copy Data Compression Through Prefix-Free Codecs
June 10, 2024
PFC (Prefix-Free Codecs) is a header-only C++20 library built on a simple observation: data compression and zero-copy access are not contradictory goals, as long as you build on prefix-free codes and generic programming. The library gets 3-10x compression on typical integer distributions while maintaining full STL integration and type safety.
The zero-copy invariant
Traditional compression creates two worlds. Data lives uncompressed in memory (32 bits per integer) and compressed on disk (variable bits). You marshal between them. PFC eliminates that boundary:
\[ \text{In-memory representation} = \text{Wire representation} \]// Traditional approach
std::vector<uint32_t> data = {1, 2, 3, 5, 8, 13};
auto compressed = compress(data); // Marshal to wire format
store_to_disk(compressed);
auto restored = decompress(compressed); // Unmarshal back
// PFC approach
PackedContainer<uint32_t, EliasGamma> data;
data.push_back(1);
data.push_back(2);
data.push_back(3);
// Data is ALREADY compressed in memory
uint32_t value = data[0]; // Decodes from compressed form on access
// Write to disk? Zero copy.
write(fd, data.bytes().data(), data.bytes().size());
The data structure IS the compressed format. There’s no marshaling step.
Prefix-free codes
A code is prefix-free if no codeword is a prefix of another:
Prefix-free: Not prefix-free:
1 -> 0 1 -> 0
2 -> 10 2 -> 01 ("0" is a prefix of "01")
3 -> 110 3 -> 011
4 -> 1110
This matters because prefix-free codes compose naturally. Concatenate encodings without delimiters and decode unambiguously:
encode(1); // 0
encode(2); // 10
encode(3); // 110
// Result: 010110 (self-delimiting)
decode(); // Reads "0" -> 1
decode(); // Reads "10" -> 2
decode(); // Reads "110" -> 3
This enables streaming and composition without any framing overhead.
Universal codes
PFC implements several universal codes, meaning they’re asymptotically optimal for any distribution without knowing the distribution in advance.
Elias Gamma encodes positive integer \(n\) in \(2\lfloor\log_2 n\rfloor + 1\) bits:
n Binary Elias Gamma
1 1 1
2 10 010
3 11 011
4 100 00100
5 101 00101
8 1000 0001000
Write \(\lfloor\log_2 n\rfloor\) zeros, then the binary representation of \(n\). Asymptotically optimal for geometric distributions.
Fibonacci encoding uses Zeckendorf representation (sum of non-consecutive Fibonacci numbers) with a terminal “11” marker. Every positive integer has a unique representation.
Rice/Golomb codes are parametric, optimal for geometric distributions with known parameter. Quotient in unary, remainder in binary. Good for values with exponential decay.
VByte / Varint
February 25, 2024
Every code in this series so far operates at bit granularity. VByte does not. It gives up bit-level precision for byte-alignment, and in production systems, that trade wins most of the time.
The Practical Question
Every code in this series so far operates at bit granularity. Elias gamma encodes 1 in a single bit. Fibonacci coding uses exactly as many bits as the Zeckendorf representation requires. Bit packing is theoretically attractive because it minimizes the number of bits written, which minimizes the encoded size.
But bit packing is computationally expensive. Reading or writing a single bit requires a shift, a mask, and often a branch to handle byte boundaries. Encoding a sequence of integers this way burns CPU cycles that scale with the number of integers, independent of their values. For high-throughput applications, the overhead of bit manipulation can easily exceed the savings from compact encoding.
VByte (also called Varint in Google’s ecosystem, and LEB128 in the DWARF debug format) trades a small amount of length efficiency for byte-alignment. The idea is simple: encode each integer as a sequence of 7-bit groups, one per byte, with a continuation flag in the high bit of each byte. The result is self-delimiting, compact for small values, and requires no bit-level manipulation to decode.
VByte is the encoding used by Protocol Buffers for all integer fields. It appears in Apache Arrow, Parquet, Snappy’s block format, LevelDB’s metadata, and most production columnar file formats. These are high-throughput systems. Byte-alignment is why VByte is their choice over the more compact universal codes from posts 4 through 7.
The Encoding
VByte splits an integer into 7-bit groups, starting from the least significant bits. Each group occupies one byte where bits 0 through 6 carry 7 data bits and bit 7 is a continuation flag: 1 means more bytes follow, 0 means this is the last byte.
A value in $[0, 127]$ fits in a single byte (continuation flag clear). A value in $[128, 16383]$ requires two bytes (first byte has flag set, second has flag clear). The pattern continues: each additional byte adds 7 bits of capacity.
Rice / Golomb
September 17, 2023
Every code in this series so far has been fixed. Rice and Golomb are different: they take a parameter, and the parameter is your model of the data.
The First Parametric Code
Every code examined so far in this series has been monolithic. Unary coding is just unary coding. Elias gamma is just Elias gamma. Each one encodes all non-negative integers with a single fixed strategy. You do not get to choose anything about the code beyond whether to use it.
Rice and Golomb codes break this pattern. They are parametric: a single integer parameter, $k$ for Rice or $m$ for Golomb, tunes the code to a specific source distribution. Rice$(k)$ is not one code but a family of codes, one per value of $k$. Each member of the family is optimal for a specific geometric distribution. Choosing $k$ is choosing your prior precisely.
This matters because data sources are rarely uniform. Run-length encodings, inter-frame video differences, and the gap sequences in inverted indexes are all approximately geometrically distributed. If you know the mean of your source, you can pick $k$ so that Rice$(k)$ performs near-optimally, without the overhead of a Huffman table or arithmetic coding.
The key insight: for a geometric source with mean approximately $2^k$, Rice$(k)$ is within a small constant of entropy. No other universal code in this series achieves this. Elias gamma and delta perform well asymptotically but can be far from optimal for a specific geometric distribution with a known mean. Rice exploits that knowledge directly.
Rice Coding
Rice coding splits a non-negative integer $n$ into two parts: a quotient $q = \lfloor n / 2^k \rfloor = n \gg k$ and a remainder $r = n \bmod 2^k = n \mathbin{&} (2^k - 1)$.
The quotient is encoded in unary: $q$ zero bits followed by a stop bit of 1. The remainder is encoded in exactly $k$ bits, MSB first. The total codeword is the concatenation of these two parts.
Codeword examples for $k = 2$ (remainder is always 2 bits):
| $n$ | $q$ | $r$ | Codeword | Bits |
|---|---|---|---|---|
| 0 | 0 | 0 | 1 00 | 3 |
| 1 | 0 | 1 | 1 01 | 3 |
| 2 | 0 | 2 | 1 10 | 3 |
| 3 | 0 | 3 | 1 11 | 3 |
| 4 | 1 | 0 | 0 1 00 | 4 |
| 5 | 1 | 1 | 0 1 01 | 4 |
Codeword length: $(n \gg k) + 1 + k$ bits. The Kraft sum saturates to 1, so Rice is a complete prefix-free code.
Fibonacci Coding
April 23, 2023
Every code in this series so far has optimized expected length under some implied prior. Fibonacci coding does something different: it gives the decoder a way to recover from errors without help from a lower layer.
A Different Design Goal
All the codes in this series have aimed at the same target: assign short codewords to frequent symbols, with length growing roughly as $\log n$ for the $n$-th symbol under some implied prior. Elias gamma minimizes expected length for power-law distributions; delta and omega extend the recursion for heavier tails.
Fibonacci coding has a different goal. It does not optimize for average codeword length under a specific distribution. It optimizes for error resilience. In a stream of gamma-coded integers, a single bit flip in a codeword’s length prefix causes the decoder to misread that codeword’s length, then misread every subsequent codeword. The error propagates without limit until the decoder somehow reacquires sync. On a reliable channel this is a nonissue. On a noisy one, or in stored data that may have silently rotted, it is a serious problem.
Fibonacci coding avoids this. Every Fibonacci codeword ends in two consecutive 1 bits (“11”). This double-one marker appears nowhere else in the codeword. A single bit flip corrupts the codeword it hits, possibly spills into the next codeword, and then the decoder finds the next “11” and resynchronizes. At most two codewords are corrupted per error. The rest of the stream is intact.
The price is length overhead: Fibonacci codewords are approximately $1.44 \times \log_2 n$ bits long, compared to $\log_2 n$ bits for the entropy lower bound. On a reliable channel, that overhead is not worth paying. On a noisy channel, or in a long-running stream where rare bit errors must not lose the entire tail, the self-synchronization property is worth it.
Zeckendorf’s Theorem
Fibonacci numbers starting from $F_2 = 1$: $1, 2, 3, 5, 8, 13, 21, 34, \ldots$
Zeckendorf’s theorem: every positive integer $n$ has a unique representation as a sum of non-consecutive Fibonacci numbers. The greedy algorithm produces it by repeatedly subtracting the largest Fibonacci number that does not exceed $n$.
Elias Delta and Omega
November 13, 2022
Elias gamma spends too many bits saying how many bits it will use. Delta fixes that. Omega takes the fix one step further. This post is about what happens when you apply recursion to the length prefix.
Where Gamma Stops Being Good
Elias gamma, from the previous post, encodes a positive integer $n$ in $2\lfloor \log_2 n \rfloor + 1$ bits: a unary count of $\lfloor \log_2 n \rfloor$ zeros, then a stop bit, then the $\lfloor \log_2 n \rfloor$ trailing binary bits of $n$. For small $n$ this is fine. For large $n$, nearly half the bits are spent on the unary prefix alone.
The unary prefix is the bottleneck. It encodes the length $L = \lfloor \log_2 n \rfloor + 1$ in the most wasteful possible way: one bit per unit. For $n = 256$, that is 8 zero bits just to say “the payload is 8 bits long.” The payload itself is also 8 bits, so you are paying a 100% overhead on the length announcement. That is bad, and it gets worse as $n$ grows.
The fix is obvious once you see it: encode $L$ itself in some shorter code instead of unary. Elias delta does exactly this, replacing the unary length prefix with a gamma-coded length. Elias omega takes the idea one step further and applies the recursion to itself, all the way down.
Both codes are universal: they assign finite codewords to every positive integer, and the expected codeword length is within a constant factor of optimal for any source whose probabilities decrease with $n$. The improvement over gamma is real and measurable once $n$ grows past a few dozen.
This post shows both implementations, their implied priors, and the crossover points where each code wins. As in the rest of this series, the code is pedagogical: each header stands alone and the struct-with-encode/decode pattern maps directly onto the PFC library’s EliasDelta and EliasOmega in codecs.hpp.
Elias Delta
Algorithm. Let $L = \lfloor \log_2 n \rfloor + 1$ (the bit-width of $n$, equivalently std::bit_width(n)).
- Encode $L$ in Elias gamma.
- Write the $L - 1$ trailing bits of $n$ after its implicit leading 1, MSB first.
Gamma encodes $L$ (a small integer) in $O(\log \log n)$ bits instead of $O(\log n)$ bits for the unary prefix. The payload is identical to gamma’s: the trailing bits of $n$. The total length is $O(\log n + \log \log n)$.
Unary and Elias Gamma
June 19, 2022
Unary is older than information theory. Elias gamma is its 1975 improvement. Together they span the gap between optimal-but-impractical and practical-but-nearly-optimal. This post derives what each code bets on, and shows numerically what that means.
Unary and Elias Gamma
Unary is the oldest code in this series. It predates information theory by centuries: a shepherd counting sheep on a stick is using unary. Mark one notch per sheep; count the notches to decode. The codeword for \(n\) is \(n\) tally marks. Its information-theoretic justification came later, when Shannon showed it is exactly optimal for a geometric source.
Elias gamma is the 1975 extension by Peter Elias. It brings the codeword length from \(O(n)\) to \(O(\log n)\), making it practical for numbers beyond small single digits, while keeping the prefix-free property that makes self-delimiting streams possible.
Both codes are instances of the claim from Universal Codes as Priors: every prefix-free code is a bet about the source. Unary bets on a geometric distribution with parameter \(1/2\). Gamma bets on a power-law distribution with exponent \(\approx 2\). This post implements both, derives their implied priors, and shows numerically what the bets mean.
Unary: Geometric Prior
The encoding rule for unary is simple: to encode integer \(n \geq 1\), write \((n-1)\) zero bits followed by one 1 bit. The decoder reads bits until it sees the 1; the number of bits read is the decoded value.
Examples: \(1 \to\) 1, \(2 \to\) 01, \(3 \to\) 001, \(4 \to\) 0001.
struct Unary {
using value_type = std::uint64_t;
template<BitSink S>
static void encode(value_type n, S& sink) {
assert(n >= 1 && "Unary is undefined for n = 0");
for (value_type i = 1; i < n; ++i) sink.write(false);
sink.write(true);
}
template<BitSource S>
static value_type decode(S& source) {
value_type n = 1;
while (!source.read()) ++n;
return n;
}
};
Length analysis. The codeword for \(n\) has length \(n\). The Kraft sum is \(\sum_{n=1}^{\infty} 2^{-n} = 1\): unary saturates Kraft exactly. The implied prior is \(p_n = 2^{-n}\): a geometric distribution with parameter \(1/2\), where each value is half as likely as the previous.
Optimality test. Because the implied prior is dyadic (all probabilities are powers of \(1/2\)) and Kraft saturates, unary achieves entropy exactly on this prior. For a 30-symbol truncation of geometric(1/2), the expected unary length equals the entropy to within the truncation tail (\(\approx 2^{-30}\)):
Universal Codes as Priors
January 15, 2022
When you pick a code for integers, you are making a bet about what integers the source will produce. The bet lives in the codeword lengths, not in a separate parameter. This post makes that precise.
Universal Codes as Priors
You want to compress a stream of positive integers. Which code should you use?
The question has more structure than it appears. A code for integers assigns a codeword to each integer. The codeword for 1 is short, for 2 a bit longer, for 100 much longer. The relative lengths encode an implicit bet: what fraction of the stream will be 1s? What fraction will be 100s? If the bet matches the source, the average codeword length will be close to the theoretical minimum, the entropy. If the bet is wrong, you pay an overhead proportional to how wrong you are.
The bet is not a separate parameter. It lives in the codeword lengths themselves. This is the central claim of this post:
Every prefix-free code is a prior over the integers. The codeword lengths determine, up to normalization, a probability distribution. The code is optimal for exactly the sources that match that distribution.
This post makes that claim precise and implements the tools to measure it. The rest of the series (posts 4 through 12) examines ten specific codes and the priors they embody.
The Correspondence: Lengths to Priors
For a prefix-free code with codeword lengths \((l_1, l_2, \ldots, l_n)\), define the unnormalized weight of symbol \(i\) as \(w_i = 2^{-l_i}\). This is the fraction of the Kraft budget consumed by that codeword.
If the code saturates Kraft (meaning \(\sum_i 2^{-l_i} = 1\)), then the weights are already a valid probability distribution: \(p_i = 2^{-l_i}\). If the code does not saturate (meaning \(\sum_i 2^{-l_i} < 1\)), normalize: \(p_i = 2^{-l_i} / \sum_j 2^{-l_j}\).
This is the inverse of Shannon’s prescription. Shannon says: given a distribution \(p_i\), the optimal codeword length is \(\lceil -\log_2 p_i \rceil\) bits. We reverse the direction: given a length \(l_i\), the implied probability is \(2^{-l_i}\).
The function implied_prior computes this map:
inline std::vector<double> implied_prior(const std::vector<std::size_t>& lengths) {
std::vector<double> probs;
probs.reserve(lengths.size());
double total = 0.0;
for (std::size_t l : lengths) {
double p = std::ldexp(1.0, -static_cast<int>(l));
probs.push_back(p);
total += p;
}
// Normalize if Kraft sum is less than 1.
if (total < 1.0) {
for (double& p : probs) p /= total;
}
return probs;
}
Two examples show the range of priors you get in practice.
McMillan's Converse
September 13, 2020
Kraft’s inequality is necessary. McMillan’s theorem says it is also sufficient, and the proof is a construction.
McMillan’s Converse
The previous post in this series proved Kraft’s inequality: for any prefix-free binary code with codeword lengths \(l_1, l_2, \ldots, l_n\),
$$\sum_{i=1}^{n} 2^{-l_i} \leq 1.$$Every prefix-free code satisfies it. No exceptions. But necessity alone is not the useful direction. The question I want answered is the converse: given a length vector that satisfies Kraft, does a prefix-free code with those lengths actually exist?
Yes, and McMillan’s theorem (1956) proves it. Better still, the proof is a construction: given any Kraft-satisfying length vector, you can produce a specific prefix-free code with those exact lengths. No search required. No verification required after the fact. The construction always terminates, always produces a valid code, because Kraft pre-certifies that the budget is sufficient.
This post proves the constructive direction, then goes further. McMillan proved something stronger than just the prefix-free converse. He showed that even uniquely-decodable codes that are not prefix-free must satisfy Kraft. The consequence is worth sitting with: there is no advantage to non-prefix-free designs. If a code can be uniquely decoded, a prefix-free code with the same lengths exists. Prefix-freeness is not a restriction you impose for convenience. It is just the cleanest form of what unique decodability requires.
The Construction
The construction is a left-to-right walk through an imaginary binary trie. Sort the lengths, then assign codewords by taking the next available leaf at each step.
Concretely: fix a counter at zero, and for each length \(l_i\) (in sorted order), emit the binary representation of counter >> (l_max - l_i) left-padded to \(l_i\) bits. Then advance the counter by \(2^{l_{\max} - l_i}\), which skips past the entire subtree rooted at the just-assigned codeword. That advance ensures the next codeword starts at the first unoccupied leaf position in the depth-\(l_{\max}\) trie.
Work through the example from post 1: lengths \(\{1, 2, 3, 3\}\). Sort: \(1, 2, 3, 3\). Take \(l_{\max} = 3\).
- Length 1: counter is 0. Shift right by \(3 - 1 = 2\): emit
0 >> 2 = 0as a 1-bit string, giving codeword"0". Advance counter by \(2^{3-1} = 4\). Counter is now 4. - Length 2: counter is 4 (binary
100). Shift right by \(3 - 2 = 1\): emit4 >> 1 = 2as a 2-bit string, giving"10". Advance by \(2^{3-2} = 2\). Counter is now 6. - Length 3: counter is 6 (binary
110). Shift right by \(3 - 3 = 0\): emit6as a 3-bit string, giving"110". Advance by \(2^0 = 1\). Counter is 7. - Length 3: counter is 7 (binary
111). Emit7as a 3-bit string:"111". Advance by 1. Counter is 8.
Result: {"0", "10", "110", "111"}. This is exactly the example code from post 1. The construction recovered it directly from the length vector, without any search.
Kraft's Inequality
March 22, 2020
Every prefix-free code satisfies one inequality. That inequality is also sufficient. This post develops the necessary direction.
Kraft’s Inequality
I want a code where each symbol maps to a bit string, and where any concatenation of codewords can be decoded unambiguously. The simplest way to guarantee that is prefix-freeness: no codeword is a prefix of any other. A prefix-free code is self-delimiting. The decoder reads bits left-to-right and knows exactly when each codeword ends, with no lookahead and no length headers.
The question I keep returning to is: which collections of lengths are actually achievable? If I want four codewords of lengths 1, 2, 3, and 3, can I build a prefix-free code with those lengths? What if I want two codewords of length 1? (No: there are only two 1-bit strings, and they are prefixes of everything longer.)
Kraft’s inequality is the answer. A length vector \((l_1, l_2, \ldots, l_n)\) is achievable by a prefix-free binary code only if
$$\sum_{i=1}^{n} 2^{-l_i} \leq 1.$$This is the constraint you cannot escape. Any prefix-free code satisfies it. Any length vector that violates it cannot be realized as a prefix-free code, full stop.
The converse is also true: any length vector satisfying Kraft is realizable by some prefix-free code. That is McMillan’s theorem, and it is the subject of the next post in this series. This post develops the necessary direction: every prefix-free code satisfies Kraft.
The right tool for understanding why is the binary tree.
The Trie View
Represent each codeword as a path in a binary tree. Start at the root. For each bit, go left (0) or right (1). The codeword ends at a node, which I mark as a terminal. A code is prefix-free if and only if no terminal node has any descendants that are also terminals. Once you reach a terminal on the way down, you stop.
The example code \(\{A \to \texttt{0},\ B \to \texttt{10},\ C \to \texttt{110},\ D \to \texttt{111}\}\) has lengths \((1, 2, 3, 3)\). Its trie looks like this:
root
/ \
0 1
[A] / \
0 1
[B] / \
0 1
[C] [D]
A is at depth 1, left branch. B is at depth 2, right-then-left. C and D share a parent at depth 2, then split at depth 3. No codeword’s node is an ancestor of another’s: the code is prefix-free.
Linked project: Wire-Formats
Synthesis: Codecs as Structure
May 15, 2026
Twelve posts, twelve codes, one thesis that refused to change. This is the closing summary.
A. The Twelve Codes Together
Every post in this series answered a version of the same question: given a source of positive integers, how do you represent its values compactly as a sequence of bits? The answers differ in shape, in assumptions, and in which distribution each code implicitly expects.
| Post | Code | Implied prior (one phrase) |
|---|---|---|
| 1-2 | Foundations | Prefix-free codes are possible iff Kraft’s inequality holds |
| 3 | Priors framework | Any code defines a prior; the best code matches the source |
| 4 | Unary | Geometric(1/2): value 1 is twice as likely as value 2, etc. |
| 5a | Elias Gamma | Power-law: probability falls as 1/n^2 |
| 5b | Elias Delta | Heavier-tailed power law: slower decay for large values |
| 5c | Elias Omega | Recursive structure: no fixed polynomial decay rate |
| 6 | Fibonacci | Near-geometric with Zeckendorf structure; good for Zeckendorf-sparse integers |
| 7 | Rice / Golomb | Geometric with known parameter m; optimal when m divides entropy |
| 8 | VByte | Roughly uniform over byte-aligned ranges; engineering favorite |
| 9 | Huffman | Source-optimal given the exact symbol distribution |
| 10 | Arithmetic coding | Approaches entropy to an arbitrary fraction of a bit |
| 11 | Succinct bit vectors | Not a code for integers: a representation that answers rank/select queries |
| 12 | RoaringBitmap | Polyalgorithm: picks array, bitset, or run-length per container chunk |
Posts 1 and 2 (Kraft’s Inequality and McMillan’s Converse) established why prefix-free codes are the right unit of analysis. Post 3 (Universal Codes as Priors) named the frame: a code is a hypothesis about the source. Posts 4 through 10 filled in the catalogue. Posts 11 and 12 extended from integer coding to set representation, where the questions shift from “how long is this codeword?” to “how do you store membership?” and “how do you answer rank/select?”
Looking across all twelve, the main lesson is not that one code dominates. It is that the question “which code?” is always empirically answerable given a sample.
B. The Unifying Frame Restated
Post 3 introduced the codes-as-priors thesis with two instances behind it. We now have twelve. The thesis has not changed; it has only become more evidently true.
Arithmetic Coding
January 12, 2025
Huffman codes one symbol at a time. Arithmetic coding encodes the whole sequence as a single number. The difference is a factor of twelve, at least on the right source.
The Last Bit of Redundancy
Huffman coding gets expected codeword length within one bit of entropy. That is the best it can do, because codeword lengths must be integers while entropy is a real number.
The waste is structural. A symbol with probability $p = 0.7$ has optimal (fractional) length $-\log_2(0.7) \approx 0.515$ bits. Huffman rounds that up to 1 bit: 0.485 bits wasted per occurrence. For a nearly-deterministic source with $p_0 = 0.99$ and $p_1 = 0.01$, the entropy is $H \approx 0.081$ bits per symbol. Huffman is stuck at 1 bit per symbol. That is a factor-of-twelve gap, and Huffman cannot close it: a symbol that appears 99% of the time still gets a complete codeword.
Arithmetic coding steps back from per-symbol codewords entirely. It encodes an entire sequence as a single rational number in $[0, 1)$. The bit-length of that number converges to the entropy of the sequence as the sequence grows. No integer rounding, no per-symbol overhead.
This post builds an integer range coder in C++23 and demonstrates the factor-of-twelve improvement on the Bernoulli(0.99) source.
The Continuous View
Start with the unit interval $[0, 1)$. For a two-symbol source, partition it by probability: symbol 0 gets $[0, p_0)$ and symbol 1 gets $[p_0, 1)$.
To encode a sequence, begin with the full interval and narrow it with each symbol. After symbol $s_1$, restrict to the corresponding sub-interval. After $s_2$, apply the same proportional rule inside that sub-interval. After $L$ symbols, the interval has width $\prod_{i=1}^{L} p_{s_i}$.
Any number inside the final interval is a valid encoding. The shortest such number in binary requires approximately $-\log_2(\prod p_{s_i}) = \sum_{i=1}^{L} (-\log_2 p_{s_i})$ bits. As $L \to \infty$, bits per symbol approaches $H(p) = -\sum_k p_k \log_2 p_k$ exactly.
Decoding is the inverse: given the encoded number, determine at each step which sub-interval it falls in, recover the symbol, narrow the interval, and repeat.
The theory is complete. The practice is not. A real interval narrows exponentially fast: after a few dozen symbols you need arbitrary precision. The integer range coder fixes this with 32-bit arithmetic and a renormalization step.
Huffman Coding
August 4, 2024
Huffman coding is two things: the optimal length vector for a known distribution, and McMillan’s construction applied to that vector. This post develops both.
From Universal to Optimal
Every code in this series so far has been universal: no prior knowledge of the source distribution required. Elias gamma assigns shorter codewords to smaller integers regardless of which integers actually appear. Fibonacci does the same. VByte packs smaller values into fewer bytes without knowing whether your data clusters at the low end or the high end. Universal codes are defensive: they perform acceptably across a broad class of inputs by committing to none.
Huffman flips that stance. You bring a finite probability distribution. Huffman finds the prefix-free code with minimum expected codeword length for that distribution. The code is tuned to what you provide and will perform poorly on anything else. Call this the move from defensive to distribution-specific coding.
The payoff is real. Shannon’s source coding theorem says no prefix-free code can achieve expected length below $H(p) = -\sum_i p_i \log_2 p_i$. Huffman gets within 1 bit of that bound. For any prefix-free code, expected length satisfies
$$H(p) \le L(\text{code}) \le H(p) + 1.$$The upper bound comes from the integer-length constraint: each codeword is a whole number of bits, and $\lceil -\log_2 p_i \rceil \le -\log_2 p_i + 1$. Huffman is optimal subject to this constraint. No prefix-free code does better without abandoning whole-bit codewords.
That last clause points to the limit of this post and the subject of the next. Arithmetic coding breaks the integer-length constraint by assigning fractional bits in effect, reaching entropy exactly in the limit.
The Algorithm
The four steps of Huffman’s construction:
- Create one leaf node per symbol, weighted by its probability.
- Push all leaves into a min-priority queue (lowest weight first).
- While the queue contains more than one node: extract the two lowest-weight nodes, merge them into a new internal node whose weight is their sum, push the internal node back.
- The remaining node is the root. The path from root to each leaf encodes that leaf’s codeword (“0” for left, “1” for right).
Here is the complete implementation from huffman.hpp:
VByte / Varint
February 25, 2024
Every code in this series so far operates at bit granularity. VByte does not. It gives up bit-level precision for byte-alignment, and in production systems, that trade wins most of the time.
The Practical Question
Every code in this series so far operates at bit granularity. Elias gamma encodes 1 in a single bit. Fibonacci coding uses exactly as many bits as the Zeckendorf representation requires. Bit packing is theoretically attractive because it minimizes the number of bits written, which minimizes the encoded size.
But bit packing is computationally expensive. Reading or writing a single bit requires a shift, a mask, and often a branch to handle byte boundaries. Encoding a sequence of integers this way burns CPU cycles that scale with the number of integers, independent of their values. For high-throughput applications, the overhead of bit manipulation can easily exceed the savings from compact encoding.
VByte (also called Varint in Google’s ecosystem, and LEB128 in the DWARF debug format) trades a small amount of length efficiency for byte-alignment. The idea is simple: encode each integer as a sequence of 7-bit groups, one per byte, with a continuation flag in the high bit of each byte. The result is self-delimiting, compact for small values, and requires no bit-level manipulation to decode.
VByte is the encoding used by Protocol Buffers for all integer fields. It appears in Apache Arrow, Parquet, Snappy’s block format, LevelDB’s metadata, and most production columnar file formats. These are high-throughput systems. Byte-alignment is why VByte is their choice over the more compact universal codes from posts 4 through 7.
The Encoding
VByte splits an integer into 7-bit groups, starting from the least significant bits. Each group occupies one byte where bits 0 through 6 carry 7 data bits and bit 7 is a continuation flag: 1 means more bytes follow, 0 means this is the last byte.
A value in $[0, 127]$ fits in a single byte (continuation flag clear). A value in $[128, 16383]$ requires two bytes (first byte has flag set, second has flag clear). The pattern continues: each additional byte adds 7 bits of capacity.
Rice / Golomb
September 17, 2023
Every code in this series so far has been fixed. Rice and Golomb are different: they take a parameter, and the parameter is your model of the data.
The First Parametric Code
Every code examined so far in this series has been monolithic. Unary coding is just unary coding. Elias gamma is just Elias gamma. Each one encodes all non-negative integers with a single fixed strategy. You do not get to choose anything about the code beyond whether to use it.
Rice and Golomb codes break this pattern. They are parametric: a single integer parameter, $k$ for Rice or $m$ for Golomb, tunes the code to a specific source distribution. Rice$(k)$ is not one code but a family of codes, one per value of $k$. Each member of the family is optimal for a specific geometric distribution. Choosing $k$ is choosing your prior precisely.
This matters because data sources are rarely uniform. Run-length encodings, inter-frame video differences, and the gap sequences in inverted indexes are all approximately geometrically distributed. If you know the mean of your source, you can pick $k$ so that Rice$(k)$ performs near-optimally, without the overhead of a Huffman table or arithmetic coding.
The key insight: for a geometric source with mean approximately $2^k$, Rice$(k)$ is within a small constant of entropy. No other universal code in this series achieves this. Elias gamma and delta perform well asymptotically but can be far from optimal for a specific geometric distribution with a known mean. Rice exploits that knowledge directly.
Rice Coding
Rice coding splits a non-negative integer $n$ into two parts: a quotient $q = \lfloor n / 2^k \rfloor = n \gg k$ and a remainder $r = n \bmod 2^k = n \mathbin{&} (2^k - 1)$.
The quotient is encoded in unary: $q$ zero bits followed by a stop bit of 1. The remainder is encoded in exactly $k$ bits, MSB first. The total codeword is the concatenation of these two parts.
Codeword examples for $k = 2$ (remainder is always 2 bits):
| $n$ | $q$ | $r$ | Codeword | Bits |
|---|---|---|---|---|
| 0 | 0 | 0 | 1 00 | 3 |
| 1 | 0 | 1 | 1 01 | 3 |
| 2 | 0 | 2 | 1 10 | 3 |
| 3 | 0 | 3 | 1 11 | 3 |
| 4 | 1 | 0 | 0 1 00 | 4 |
| 5 | 1 | 1 | 0 1 01 | 4 |
Codeword length: $(n \gg k) + 1 + k$ bits. The Kraft sum saturates to 1, so Rice is a complete prefix-free code.
Fibonacci Coding
April 23, 2023
Every code in this series so far has optimized expected length under some implied prior. Fibonacci coding does something different: it gives the decoder a way to recover from errors without help from a lower layer.
A Different Design Goal
All the codes in this series have aimed at the same target: assign short codewords to frequent symbols, with length growing roughly as $\log n$ for the $n$-th symbol under some implied prior. Elias gamma minimizes expected length for power-law distributions; delta and omega extend the recursion for heavier tails.
Fibonacci coding has a different goal. It does not optimize for average codeword length under a specific distribution. It optimizes for error resilience. In a stream of gamma-coded integers, a single bit flip in a codeword’s length prefix causes the decoder to misread that codeword’s length, then misread every subsequent codeword. The error propagates without limit until the decoder somehow reacquires sync. On a reliable channel this is a nonissue. On a noisy one, or in stored data that may have silently rotted, it is a serious problem.
Fibonacci coding avoids this. Every Fibonacci codeword ends in two consecutive 1 bits (“11”). This double-one marker appears nowhere else in the codeword. A single bit flip corrupts the codeword it hits, possibly spills into the next codeword, and then the decoder finds the next “11” and resynchronizes. At most two codewords are corrupted per error. The rest of the stream is intact.
The price is length overhead: Fibonacci codewords are approximately $1.44 \times \log_2 n$ bits long, compared to $\log_2 n$ bits for the entropy lower bound. On a reliable channel, that overhead is not worth paying. On a noisy channel, or in a long-running stream where rare bit errors must not lose the entire tail, the self-synchronization property is worth it.
Zeckendorf’s Theorem
Fibonacci numbers starting from $F_2 = 1$: $1, 2, 3, 5, 8, 13, 21, 34, \ldots$
Zeckendorf’s theorem: every positive integer $n$ has a unique representation as a sum of non-consecutive Fibonacci numbers. The greedy algorithm produces it by repeatedly subtracting the largest Fibonacci number that does not exceed $n$.
Elias Delta and Omega
November 13, 2022
Elias gamma spends too many bits saying how many bits it will use. Delta fixes that. Omega takes the fix one step further. This post is about what happens when you apply recursion to the length prefix.
Where Gamma Stops Being Good
Elias gamma, from the previous post, encodes a positive integer $n$ in $2\lfloor \log_2 n \rfloor + 1$ bits: a unary count of $\lfloor \log_2 n \rfloor$ zeros, then a stop bit, then the $\lfloor \log_2 n \rfloor$ trailing binary bits of $n$. For small $n$ this is fine. For large $n$, nearly half the bits are spent on the unary prefix alone.
The unary prefix is the bottleneck. It encodes the length $L = \lfloor \log_2 n \rfloor + 1$ in the most wasteful possible way: one bit per unit. For $n = 256$, that is 8 zero bits just to say “the payload is 8 bits long.” The payload itself is also 8 bits, so you are paying a 100% overhead on the length announcement. That is bad, and it gets worse as $n$ grows.
The fix is obvious once you see it: encode $L$ itself in some shorter code instead of unary. Elias delta does exactly this, replacing the unary length prefix with a gamma-coded length. Elias omega takes the idea one step further and applies the recursion to itself, all the way down.
Both codes are universal: they assign finite codewords to every positive integer, and the expected codeword length is within a constant factor of optimal for any source whose probabilities decrease with $n$. The improvement over gamma is real and measurable once $n$ grows past a few dozen.
This post shows both implementations, their implied priors, and the crossover points where each code wins. As in the rest of this series, the code is pedagogical: each header stands alone and the struct-with-encode/decode pattern maps directly onto the PFC library’s EliasDelta and EliasOmega in codecs.hpp.
Elias Delta
Algorithm. Let $L = \lfloor \log_2 n \rfloor + 1$ (the bit-width of $n$, equivalently std::bit_width(n)).
- Encode $L$ in Elias gamma.
- Write the $L - 1$ trailing bits of $n$ after its implicit leading 1, MSB first.
Gamma encodes $L$ (a small integer) in $O(\log \log n)$ bits instead of $O(\log n)$ bits for the unary prefix. The payload is identical to gamma’s: the trailing bits of $n$. The total length is $O(\log n + \log \log n)$.
Unary and Elias Gamma
June 19, 2022
Unary is older than information theory. Elias gamma is its 1975 improvement. Together they span the gap between optimal-but-impractical and practical-but-nearly-optimal. This post derives what each code bets on, and shows numerically what that means.
Unary and Elias Gamma
Unary is the oldest code in this series. It predates information theory by centuries: a shepherd counting sheep on a stick is using unary. Mark one notch per sheep; count the notches to decode. The codeword for \(n\) is \(n\) tally marks. Its information-theoretic justification came later, when Shannon showed it is exactly optimal for a geometric source.
Elias gamma is the 1975 extension by Peter Elias. It brings the codeword length from \(O(n)\) to \(O(\log n)\), making it practical for numbers beyond small single digits, while keeping the prefix-free property that makes self-delimiting streams possible.
Both codes are instances of the claim from Universal Codes as Priors: every prefix-free code is a bet about the source. Unary bets on a geometric distribution with parameter \(1/2\). Gamma bets on a power-law distribution with exponent \(\approx 2\). This post implements both, derives their implied priors, and shows numerically what the bets mean.
Unary: Geometric Prior
The encoding rule for unary is simple: to encode integer \(n \geq 1\), write \((n-1)\) zero bits followed by one 1 bit. The decoder reads bits until it sees the 1; the number of bits read is the decoded value.
Examples: \(1 \to\) 1, \(2 \to\) 01, \(3 \to\) 001, \(4 \to\) 0001.
struct Unary {
using value_type = std::uint64_t;
template<BitSink S>
static void encode(value_type n, S& sink) {
assert(n >= 1 && "Unary is undefined for n = 0");
for (value_type i = 1; i < n; ++i) sink.write(false);
sink.write(true);
}
template<BitSource S>
static value_type decode(S& source) {
value_type n = 1;
while (!source.read()) ++n;
return n;
}
};
Length analysis. The codeword for \(n\) has length \(n\). The Kraft sum is \(\sum_{n=1}^{\infty} 2^{-n} = 1\): unary saturates Kraft exactly. The implied prior is \(p_n = 2^{-n}\): a geometric distribution with parameter \(1/2\), where each value is half as likely as the previous.
Optimality test. Because the implied prior is dyadic (all probabilities are powers of \(1/2\)) and Kraft saturates, unary achieves entropy exactly on this prior. For a 30-symbol truncation of geometric(1/2), the expected unary length equals the entropy to within the truncation tail (\(\approx 2^{-30}\)):
Universal Codes as Priors
January 15, 2022
When you pick a code for integers, you are making a bet about what integers the source will produce. The bet lives in the codeword lengths, not in a separate parameter. This post makes that precise.
Universal Codes as Priors
You want to compress a stream of positive integers. Which code should you use?
The question has more structure than it appears. A code for integers assigns a codeword to each integer. The codeword for 1 is short, for 2 a bit longer, for 100 much longer. The relative lengths encode an implicit bet: what fraction of the stream will be 1s? What fraction will be 100s? If the bet matches the source, the average codeword length will be close to the theoretical minimum, the entropy. If the bet is wrong, you pay an overhead proportional to how wrong you are.
The bet is not a separate parameter. It lives in the codeword lengths themselves. This is the central claim of this post:
Every prefix-free code is a prior over the integers. The codeword lengths determine, up to normalization, a probability distribution. The code is optimal for exactly the sources that match that distribution.
This post makes that claim precise and implements the tools to measure it. The rest of the series (posts 4 through 12) examines ten specific codes and the priors they embody.
The Correspondence: Lengths to Priors
For a prefix-free code with codeword lengths \((l_1, l_2, \ldots, l_n)\), define the unnormalized weight of symbol \(i\) as \(w_i = 2^{-l_i}\). This is the fraction of the Kraft budget consumed by that codeword.
If the code saturates Kraft (meaning \(\sum_i 2^{-l_i} = 1\)), then the weights are already a valid probability distribution: \(p_i = 2^{-l_i}\). If the code does not saturate (meaning \(\sum_i 2^{-l_i} < 1\)), normalize: \(p_i = 2^{-l_i} / \sum_j 2^{-l_j}\).
This is the inverse of Shannon’s prescription. Shannon says: given a distribution \(p_i\), the optimal codeword length is \(\lceil -\log_2 p_i \rceil\) bits. We reverse the direction: given a length \(l_i\), the implied probability is \(2^{-l_i}\).
The function implied_prior computes this map:
inline std::vector<double> implied_prior(const std::vector<std::size_t>& lengths) {
std::vector<double> probs;
probs.reserve(lengths.size());
double total = 0.0;
for (std::size_t l : lengths) {
double p = std::ldexp(1.0, -static_cast<int>(l));
probs.push_back(p);
total += p;
}
// Normalize if Kraft sum is less than 1.
if (total < 1.0) {
for (double& p : probs) p /= total;
}
return probs;
}
Two examples show the range of priors you get in practice.
McMillan's Converse
September 13, 2020
Kraft’s inequality is necessary. McMillan’s theorem says it is also sufficient, and the proof is a construction.
McMillan’s Converse
The previous post in this series proved Kraft’s inequality: for any prefix-free binary code with codeword lengths \(l_1, l_2, \ldots, l_n\),
$$\sum_{i=1}^{n} 2^{-l_i} \leq 1.$$Every prefix-free code satisfies it. No exceptions. But necessity alone is not the useful direction. The question I want answered is the converse: given a length vector that satisfies Kraft, does a prefix-free code with those lengths actually exist?
Yes, and McMillan’s theorem (1956) proves it. Better still, the proof is a construction: given any Kraft-satisfying length vector, you can produce a specific prefix-free code with those exact lengths. No search required. No verification required after the fact. The construction always terminates, always produces a valid code, because Kraft pre-certifies that the budget is sufficient.
This post proves the constructive direction, then goes further. McMillan proved something stronger than just the prefix-free converse. He showed that even uniquely-decodable codes that are not prefix-free must satisfy Kraft. The consequence is worth sitting with: there is no advantage to non-prefix-free designs. If a code can be uniquely decoded, a prefix-free code with the same lengths exists. Prefix-freeness is not a restriction you impose for convenience. It is just the cleanest form of what unique decodability requires.
The Construction
The construction is a left-to-right walk through an imaginary binary trie. Sort the lengths, then assign codewords by taking the next available leaf at each step.
Concretely: fix a counter at zero, and for each length \(l_i\) (in sorted order), emit the binary representation of counter >> (l_max - l_i) left-padded to \(l_i\) bits. Then advance the counter by \(2^{l_{\max} - l_i}\), which skips past the entire subtree rooted at the just-assigned codeword. That advance ensures the next codeword starts at the first unoccupied leaf position in the depth-\(l_{\max}\) trie.
Work through the example from post 1: lengths \(\{1, 2, 3, 3\}\). Sort: \(1, 2, 3, 3\). Take \(l_{\max} = 3\).
- Length 1: counter is 0. Shift right by \(3 - 1 = 2\): emit
0 >> 2 = 0as a 1-bit string, giving codeword"0". Advance counter by \(2^{3-1} = 4\). Counter is now 4. - Length 2: counter is 4 (binary
100). Shift right by \(3 - 2 = 1\): emit4 >> 1 = 2as a 2-bit string, giving"10". Advance by \(2^{3-2} = 2\). Counter is now 6. - Length 3: counter is 6 (binary
110). Shift right by \(3 - 3 = 0\): emit6as a 3-bit string, giving"110". Advance by \(2^0 = 1\). Counter is 7. - Length 3: counter is 7 (binary
111). Emit7as a 3-bit string:"111". Advance by 1. Counter is 8.
Result: {"0", "10", "110", "111"}. This is exactly the example code from post 1. The construction recovered it directly from the length vector, without any search.
Kraft's Inequality
March 22, 2020
Every prefix-free code satisfies one inequality. That inequality is also sufficient. This post develops the necessary direction.
Kraft’s Inequality
I want a code where each symbol maps to a bit string, and where any concatenation of codewords can be decoded unambiguously. The simplest way to guarantee that is prefix-freeness: no codeword is a prefix of any other. A prefix-free code is self-delimiting. The decoder reads bits left-to-right and knows exactly when each codeword ends, with no lookahead and no length headers.
The question I keep returning to is: which collections of lengths are actually achievable? If I want four codewords of lengths 1, 2, 3, and 3, can I build a prefix-free code with those lengths? What if I want two codewords of length 1? (No: there are only two 1-bit strings, and they are prefixes of everything longer.)
Kraft’s inequality is the answer. A length vector \((l_1, l_2, \ldots, l_n)\) is achievable by a prefix-free binary code only if
$$\sum_{i=1}^{n} 2^{-l_i} \leq 1.$$This is the constraint you cannot escape. Any prefix-free code satisfies it. Any length vector that violates it cannot be realized as a prefix-free code, full stop.
The converse is also true: any length vector satisfying Kraft is realizable by some prefix-free code. That is McMillan’s theorem, and it is the subject of the next post in this series. This post develops the necessary direction: every prefix-free code satisfies Kraft.
The right tool for understanding why is the binary tree.
The Trie View
Represent each codeword as a path in a binary tree. Start at the root. For each bit, go left (0) or right (1). The codeword ends at a node, which I mark as a terminal. A code is prefix-free if and only if no terminal node has any descendants that are also terminals. Once you reach a terminal on the way down, you stop.
The example code \(\{A \to \texttt{0},\ B \to \texttt{10},\ C \to \texttt{110},\ D \to \texttt{111}\}\) has lengths \((1, 2, 3, 3)\). Its trie looks like this:
root
/ \
0 1
[A] / \
0 1
[B] / \
0 1
[C] [D]
A is at depth 1, left branch. B is at depth 2, right-then-left. C and D share a parent at depth 2, then split at depth 3. No codeword’s node is an ancestor of another’s: the code is prefix-free.
Linked project: Stepanov
Bits Follow Types
April 23, 2026
Every type decomposes structurally. So does its codec.
Codecs as Functors
You have an optional<vector<pair<int, string>>>. The type decomposes structurally: it is an optional of a free monoid of products of an integer and a string. That decomposition is not an observation about memory layout. It is a statement about the algebraic structure of the type.
Now ask: does the codec decompose the same way?
If the answer is yes, you stop writing one-off encoders. You build a codec for optional<T> from a codec for T. You build a codec for vector<T> from a codec for T. The codec for optional<vector<pair<int, string>>> assembles from its parts with no manual layout decisions, no hand-placed length headers, no ad-hoc format negotiation.
This post argues that the answer is always yes, and shows what the machinery looks like. The thesis: codecs are not ad-hoc bit formats. They are constructions on the algebraic structure of types. The algebraic structure of a type determines its codec, the same way it determines its algorithms.
This extends Stepanov’s claim. The peasant algorithm post showed that algorithms arise from algebraic structure. The homomorphism post showed that structure-preserving maps are the natural morphisms. Here, we show the codec itself is a structure-preserving map, and that it lifts from leaf types to compound types by the same algebraic logic.
Bit I/O: The Foundation
Before combinators, we need concrete bit I/O. The approach taken here follows Stepanov’s move in the algorithm posts: state the concept first, then provide a model.
Two concepts govern bit-level I/O:
template<typename T>
concept BitSink = requires(T& s, bool bit) {
{ s.write(bit) } -> std::same_as<void>;
};
template<typename T>
concept BitSource = requires(T& s) {
{ s.read() } -> std::same_as<bool>;
{ s.peek() } -> std::convertible_to<bool>;
};
A BitSink accepts bits. A BitSource supplies them. A codec is an algorithm parameterized over BitSink and BitSource, not a class hierarchy. This is Stepanov’s move at the bit level: require only what the algorithm needs, let anything that satisfies the concept participate.
The standard models are BitWriter and BitReader, which pack bits into byte buffers in LSB-first order:
class BitWriter {
std::span<std::uint8_t> buf_;
std::size_t byte_idx_ = 0;
std::uint8_t byte_ = 0;
std::uint8_t bit_pos_ = 0;
public:
explicit BitWriter(std::span<std::uint8_t> buf) noexcept : buf_(buf) {}
void write(bool bit) noexcept {
byte_ |= (bit ? std::uint8_t{1} : std::uint8_t{0}) << bit_pos_;
if (++bit_pos_ == 8) {
buf_[byte_idx_++] = byte_;
byte_ = 0;
bit_pos_ = 0;
}
}
void align() noexcept {
if (bit_pos_ > 0) {
buf_[byte_idx_++] = byte_;
byte_ = 0;
bit_pos_ = 0;
}
}
[[nodiscard]] std::size_t bytes_written() const noexcept {
return byte_idx_ + (bit_pos_ > 0 ? 1 : 0);
}
};
class BitReader {
std::span<const std::uint8_t> buf_;
std::size_t byte_idx_ = 0;
std::uint8_t bit_pos_ = 0;
public:
explicit BitReader(std::span<const std::uint8_t> buf) noexcept : buf_(buf) {}
bool read() noexcept {
bool bit = ((buf_[byte_idx_] >> bit_pos_) & 1) != 0;
if (++bit_pos_ == 8) {
++byte_idx_;
bit_pos_ = 0;
}
return bit;
}
[[nodiscard]] bool peek() const noexcept {
return byte_idx_ < buf_.size();
}
};
A codec concept rounds out the three-concept core:
When Lists Become Bits
April 23, 2026
The free monoid on a type lifts to bit space. It lifts injectively only when the element codec is prefix-free.
Prefix-Free Codes and the Free Monoid
You have a list of unsigned integers. Encode the list as a single bit string.
Fixed-width encoding wastes space. If you allocate 64 bits per integer, small values like 1 or 7 cost as much as values near \(2^{64}\). Variable-width encoding recovers that space, but immediately raises a harder question: where does one encoded integer end and the next begin?
Two escape routes. First, prefix each encoded item with its length. That works, but the length headers are overhead, and you now need a codec for the lengths as well. Second, choose a code where the structure of the codewords makes boundaries unambiguous without any headers. These are prefix-free codes, and this is the right answer, in a precise categorical sense.
The “precise categorical sense” is what this post develops. Encoding a list as the concatenation of encoded elements is a monoid homomorphism from the free monoid on \(T\) to the monoid of bit strings under concatenation. The universal property of the free monoid guarantees this homomorphism always exists. The question of whether the decoder can invert it comes down to exactly one property of the element codec: whether it is prefix-free.
The Free Monoid, Recalled
A monoid is a set with an associative binary operation and an identity element. The free monoid on a set \(S\) is the set of all finite sequences of elements from \(S\), with concatenation as the operation and the empty sequence as the identity.
“Free” means no equations hold except those forced by the monoid axioms. Nothing is identified with anything else. If you need commutativity or idempotency, you quotient the free monoid by additional equations. But the free monoid itself imposes nothing beyond associativity and identity.
The universal property says: given any monoid \(M\) and any function \(f: S \to M\), there is exactly one monoid homomorphism \(\hat{f}: \text{Free}(S) \to M\) that extends \(f\). That unique extension is fold:
$$\hat{f}([x_1, x_2, \ldots, x_n]) = f(x_1) \cdot f(x_2) \cdot \cdots \cdot f(x_n)$$where \(\cdot\) is the operation in \(M\). The free-algebra post develops this in full. For this post, the one fact that matters is that fold is canonical: it is the unique way to extend a per-element map to a list-consuming function that respects the monoid structure.
Free Algebras: Why Lists and Polynomials Are Universal
March 13, 2026
Lists are everywhere in programming. Not because they are convenient. Because they are algebraically universal.
Why Lists?
Arrays are more cache-friendly. Hash maps have better lookup. Yet lists (sequences, vectors, streams) remain the default container in nearly every language. The standard explanation is convention, or ease of construction. The real explanation is algebraic.
A list is the free monoid. It is the most general monoid you can build from a set of generators. And the universal property of free monoids says that fold, the operation that processes a list element by element, is not a design pattern. It is a theorem.
The Free Monoid
Start with a set \(S\) of generators. The free monoid on \(S\) is the set of all finite sequences of elements from \(S\), with concatenation as the operation and the empty sequence as the identity.
“Free” means: no equations hold except those forced by the monoid axioms (associativity and identity). In particular:
- \([a, b] \neq [b, a]\). Commutativity is not imposed.
- \([a, a] \neq [a]\). Idempotency is not imposed.
- \([a, b, c] = [a] \cdot [b] \cdot [c]\). Every sequence is a product of singletons.
In C++:
template<typename T>
class free_monoid {
std::vector<T> elements_;
public:
free_monoid() = default;
explicit free_monoid(T x) : elements_{std::move(x)} {}
// ...
};
// Monoid operations via ADL
template<typename T>
free_monoid<T> op(const free_monoid<T>& a, const free_monoid<T>& b); // concatenation
template<typename T>
free_monoid<T> identity(const free_monoid<T>&); // empty sequence
The free_monoid<int> is the type of finite sequences of integers. Its operation is concatenation. It satisfies the Monoid concept. And it is the most general monoid on int: no structure beyond associativity and identity.
The Universal Property
Here is the key fact. Given any function \(f: S \to M\) where \(M\) is a monoid, there exists a unique monoid homomorphism \(\overline{f}: \text{Free}(S) \to M\) extending \(f\). This homomorphism is defined by:
$$\overline{f}([a_1, a_2, \ldots, a_n]) = f(a_1) \cdot f(a_2) \cdot \ldots \cdot f(a_n)$$In code:
template<Monoid M, typename T, typename F>
M extend(F f, const free_monoid<T>& xs) {
M result = identity(M{});
for (const auto& x : xs.elements())
result = op(result, f(x));
return result;
}
This is fold. The universal property says fold is the only structure-preserving way to interpret a list in a monoid. Any function that respects the monoid structure must agree with fold.
Fold Is a Theorem
When you write std::accumulate or std::reduce, you are invoking the universal property. The homomorphism condition:
Homomorphisms: The Maps Between Structures
March 13, 2026
A homomorphism is a function that preserves algebraic structure. This post shows that fold, sum, length, and even the logarithm are all the same idea.
Structures and Maps
The series so far has built up algebraic structures: monoids in the peasant post, rings in the modular post, Euclidean domains in the polynomial post, product monoids in the accumulator post. But structures alone are half the story. The other half is the maps between them.
A homomorphism is a function \(f: A \to B\) between two structures of the same kind that preserves the operation:
$$f(a \oplus b) = f(a) \oplus f(b)$$The operation on the left is in \(A\). The operation on the right is in \(B\). The function \(f\) “commutes” with the operation. That is the entire definition.
The Concept
We need a monoid concept for this post. A monoid has an associative binary operation and an identity element. As always, the operations are ADL free functions:
template<typename M>
concept Monoid = std::semiregular<M> &&
requires(M a, M b) {
{ op(a, b) } -> std::convertible_to<M>;
{ identity(a) } -> std::convertible_to<M>;
};
And a runtime check that a function preserves the monoid structure:
template<Monoid A, Monoid B, typename F>
bool is_homomorphism(F f, const A& a1, const A& a2) {
return f(op(a1, a2)) == op(f(a1), f(a2));
}
This tests one pair of inputs. It is not a proof, but it catches violations.
Examples Everywhere
Length. The string monoid (concatenation, empty string) maps to the integers under addition. The map is length. And length is a homomorphism:
The length of a concatenation is the sum of the lengths. This is not a coincidence. It is the homomorphism property.
Sum. Lists of integers under concatenation form a monoid. The integers under addition form a monoid. The map sum is a homomorphism:
Product. Same source monoid, different target. Now the integers under multiplication:
$$\text{prod}(xs \mathbin{+\!\!+} ys) = \text{prod}(xs) \times \text{prod}(ys)$$Logarithm. The positive reals under multiplication form a monoid. The reals under addition form a monoid. The logarithm maps one to the other:
$$\log(a \times b) = \log(a) + \log(b)$$This is the defining property of the logarithm, stated as algebra. The logarithm is a homomorphism from \((\mathbb{R}^+, \times, 1)\) to \((\mathbb{R}, +, 0)\).
Count. For any type \(T\), the function that sends a list to its length is a homomorphism from the list monoid to \((\mathbb{Z}, +, 0)\). This is the same as the string length example, generalized.
Lattices: Fixed Points and Iteration
March 13, 2026
Lattices have two operations of a different kind than rings. The structure determines a fixed-point algorithm.
Two Operations, Different Rules
Monoids have one binary operation. Rings have two (addition and multiplication) linked by distributivity. Lattices also have two operations, but with different laws entirely.
A lattice is a set with two operations:
- meet (\(\wedge\)): greatest lower bound
- join (\(\vee\)): least upper bound
Both are idempotent, commutative, and associative. And they satisfy the absorption laws:
$$a \wedge (a \vee b) = a \qquad a \vee (a \wedge b) = a$$Absorption is what distinguishes lattices from a pair of unrelated monoids. It ties meet and join together: knowing one constrains the other.
A bounded lattice adds a least element (bottom, \(\bot\)) and a greatest element (top, \(\top\)). Bottom is the identity for join, top is the identity for meet.
In C++20 concepts, with ADL free functions:
template<typename L>
concept Lattice = std::semiregular<L> &&
requires(L a, L b) {
{ meet(a, b) } -> std::convertible_to<L>;
{ join(a, b) } -> std::convertible_to<L>;
};
template<typename L>
concept BoundedLattice = Lattice<L> &&
requires(L a) {
{ bottom(a) } -> std::convertible_to<L>;
{ top(a) } -> std::convertible_to<L>;
};
Four Examples
Sign lattice. Abstract signs of integers: bottom (unreachable), negative, zero, positive, top (unknown). Meet is greatest lower bound, join is least upper bound in the Hasse diagram. This is the classic abstract interpretation domain. You can define abstract arithmetic on it: pos * neg = neg, neg + neg = neg, pos + neg = top.
Intervals. Closed intervals \([a, b]\) ordered by inclusion. Meet is intersection. Join is the smallest enclosing interval. Bottom is the empty interval. Top is the full range. This is the foundation of interval arithmetic.
Divisors. Positive integers ordered by divisibility. Meet is gcd, join is lcm. Bottom is 1 (divides everything), top is 0 (everything divides 0). Lattice structure appearing in number theory.
Power sets. Subsets of \(\{0, \ldots, N-1\}\). Meet is intersection (bitwise AND), join is union (bitwise OR). Bottom is the empty set, top is the full set.
All four satisfy BoundedLattice. All four satisfy the same laws. The concept constrains the interface; the laws constrain the semantics.
The Algorithm: Tarski’s Fixed-Point Theorem
Here is the payoff. Tarski’s theorem: any monotone function on a complete lattice has a least fixed point, computable by iterating from bottom.
Semirings: One Algorithm, Six Graph Problems
March 13, 2026
The peasant post showed that power() works on any monoid. But what happens when you have two operations instead of one?
From Monoids to Semirings
A monoid is one operation with an identity element. The peasant algorithm exploits this: give it any monoid and it computes powers by repeated squaring. The accumulator post used the same structure for streaming statistics.
A semiring is two monoids on the same set, linked by a compatibility condition. Formally, a semiring \((S, +, \times, 0, 1)\) satisfies:
- \((S, +, 0)\) is a commutative monoid
- \((S, \times, 1)\) is a monoid
- \(\times\) distributes over \(+\): \(a \times (b + c) = a \times b + a \times c\)
- \(0\) annihilates: \(0 \times a = a \times 0 = 0\)
In C++20, using the same ADL free functions as the peasant post:
template<typename S>
concept Semiring = std::semiregular<S> &&
requires(S a, S b) {
{ a + b } -> std::convertible_to<S>;
{ a * b } -> std::convertible_to<S>;
{ zero(a) } -> std::convertible_to<S>;
{ one(a) } -> std::convertible_to<S>;
};
The concept captures syntax. The axioms (associativity, distributivity, annihilation) are semantic requirements that the programmer must ensure.
Five Semirings
The ordinary integers are a semiring. But there are others, and each one corresponds to a different graph problem.
| Semiring | \(+\) | \(\times\) | \(0\) | \(1\) | Graph problem |
|---|---|---|---|---|---|
| Boolean | or | and | false | true | Reachability |
| Tropical min | min | plus | \(\infty\) | 0 | Shortest paths |
| Tropical max | max | plus | \(-\infty\) | 0 | Longest paths |
| Bottleneck | max | min | \(-\infty\) | \(\infty\) | Widest paths |
| Counting | plus | times | 0 | 1 | Number of paths |
The naming of the tropical semirings is counterintuitive but standard. In the tropical min semiring, the “addition” is min and the “multiplication” is ordinary addition. This matters because matrix multiplication uses both operations, and we need the algebraic structure to be correct: the inner product of a row and column computes the best path through an intermediate node.
Each semiring is a small struct with operator+, operator*, and ADL functions zero() and one():
struct boolean_semiring {
bool val;
constexpr boolean_semiring operator+(boolean_semiring rhs) const {
return boolean_semiring(val || rhs.val);
}
constexpr boolean_semiring operator*(boolean_semiring rhs) const {
return boolean_semiring(val && rhs.val);
}
};
constexpr boolean_semiring zero(boolean_semiring) { return boolean_semiring(false); }
constexpr boolean_semiring one(boolean_semiring) { return boolean_semiring(true); }
Matrices Over a Semiring
Matrix multiplication requires addition and multiplication. That is exactly what a semiring provides. If \(S\) is a semiring, then \(n \times n\) matrices over \(S\) form a semiring too, with entry-wise addition and the usual row-times-column product (using \(S\)’s operations).
Streaming Statistics, One Monoid at a Time
March 13, 2026
Accumulators are monoids. The same algebraic structure from the peasant post, in a different domain.
Accumulators as Monoids
An accumulator processes a stream of values, maintaining state that can be queried at any point. Write a class with operator+= for each statistic you need. Sum, mean, variance, min, max. Five statistics, five classes.
The problem is combinations. Sum and min? Write a sixth class. Sum, min, and max? A seventh. Every new combination requires new code.
But every accumulator has the same structure:
- Process a value incrementally:
operator+=(value) - Combine with another accumulator of the same type:
operator+=(accumulator) - Extract a result:
.eval()
Default construction gives you an empty accumulator: the identity element. Combination via += is associative. Together, a monoid. The peasant post used the same structure for exponentiation. Here we use it for streaming computation.
In C++20 concepts:
template<typename A>
concept Accumulator = std::semiregular<A> &&
requires(A a, A b, typename A::value_type v) {
typename A::value_type;
{ a += v } -> std::same_as<A&>; // process one value
{ a += b } -> std::same_as<A&>; // combine two accumulators
{ a.eval() }; // extract result
};
KBN: Compensated Summation
The simplest accumulator is a sum. But naive floating-point summation accumulates O(n) rounding error:
double sum = 0.0;
sum += 1.0;
for (int i = 0; i < 1'000'000; ++i)
sum += 1e-10;
// Expected: 1.0001 Actual: ~1.00009999999...8
When you add a tiny number to a large one, the tiny number’s low-order bits get dropped. After a million additions, these losses add up.
Kahan-Babuska-Neumaier (KBN) summation tracks what gets lost:
template<std::floating_point T>
class kbn_sum {
T sum_ = T(0);
T comp_ = T(0); // compensation for lost bits
public:
using value_type = T;
constexpr kbn_sum& operator+=(const T& v) {
T t = sum_ + v;
comp_ += abs_(sum_) >= abs_(v) ? (sum_ - t) + v
: (v - t) + sum_;
sum_ = t;
return *this;
}
constexpr T eval() const { return sum_ + comp_; }
};
The correction term comp_ recovers the bits that floating-point addition drops. O(1) error instead of O(n), regardless of sequence length.
kbn_sum is a monoid:
- Identity:
kbn_sum{}(sum=0, compensation=0) - Operation:
a += b(combine two compensated sums)
Welford: Online Mean and Variance
Computing the mean is just sum/count. Variance is harder. The textbook formula \(\sigma^2 = \frac{1}{n}\sum(x_i - \bar{x})^2\) requires two passes: one for the mean, one for the deviations.
Welford’s algorithm computes both in a single pass:
welford& operator+=(const T& v) {
++n_;
T delta = v - mean_;
mean_ += delta / static_cast<T>(n_);
T delta2 = v - mean_; // uses *updated* mean
m2_ += delta * delta2;
return *this;
}
delta uses the old mean, delta2 uses the new mean. Their product accumulates into m2_, the sum of squared deviations. At any point, variance = m2_ / n.
What makes this a monoid is the combination formula. Given two independent Welford accumulators with means \(\bar{x}_A, \bar{x}_B\) and counts \(n_A, n_B\), Chan et al. showed how to merge them:
Duality: The Hidden Structure of Opposites
January 19, 2026
Many structures come in pairs. Recognizing duality lets you transfer insights between domains.
The Motivating Example
This collection includes two approaches to automatic differentiation:
- Forward mode (in dual): Propagate derivatives alongside values, from inputs toward outputs
- Reverse mode (in autodiff): Build a graph during forward evaluation, then propagate gradients backward from outputs toward inputs
These aren’t just two implementations of the same idea. They’re duals, mirror images with complementary strengths.
Forward mode computes one column of the Jacobian per pass. If \(f: \mathbb{R}^n \to \mathbb{R}^m\), computing the full Jacobian takes \(n\) passes. Reverse mode computes one row per pass, \(m\) passes for the full Jacobian.
For neural network training, we have many inputs (millions of parameters) and one output (the loss). Reverse mode wins overwhelmingly: one backward pass gives all gradients. This is why backpropagation dominates deep learning.
For sensitivity analysis with few parameters and many outputs, forward mode wins. Same algorithm structure, opposite traversal direction, complementary use cases.
The mathematical explanation: forward mode computes Jacobian-vector products (\(Jv\)); reverse mode computes vector-Jacobian products (\(v^T J\)). These are transposes of each other. Duality is transposition.
Push vs Pull
Consider two ways to traverse a sequence:
Pull (iterator/consumer controls):
for (auto it = seq.begin(); it != seq.end(); ++it) {
process(*it); // Consumer pulls each element
}
Push (producer controls):
seq.for_each([](auto x) {
process(x); // Producer pushes each element
});
Same traversal. Same elements processed. But control flow is reversed:
| Aspect | Pull (Iterator) | Push (Generator) |
|---|---|---|
| Who controls pace? | Consumer | Producer |
| Suspend/resume? | Consumer decides when to call ++ | Producer decides when to yield |
| Backpressure | Natural (just stop pulling) | Must be designed in |
| Composition | Chain iterators | Chain callbacks |
C++ ranges are pull-based: view | filter | transform creates an iterator that pulls through the pipeline. Reactive streams (Rx) are push-based: events flow through a pipeline of observers.
These are duals. Given a pull-based algorithm, you can mechanically derive its push-based counterpart by reversing who initiates each step. The transformation preserves correctness because it’s just changing direction, not content.
Encode vs Decode
Compression algorithms come in pairs:
// Encoder: structure -> bits
auto encode(const Document& doc) -> Bitstream;
// Decoder: bits -> structure
auto decode(const Bitstream& bits) -> Document;
These must be inverses: decode(encode(x)) == x. But their implementations are often strikingly different:
Seeing Structure First
January 18, 2026
A reflection on eleven explorations in generic programming
The Question Behind the Code
What do these computations have in common?
- Computing the millionth Fibonacci number
- Finding the shortest path between cities in a weighted graph
- Calculating compound interest over thirty years
- Composing ten 3D rotations into one
- Repeating a string n times
The answer: they’re all computed by the same twenty lines of code.
template<typename T>
constexpr T power(T const& base, T exp) {
if (exp == zero(exp)) return one(exp);
if (exp == one(exp)) return base;
return even(exp)
? square(power(base, half(exp)))
: product(base, power(base, decrement(exp)));
}
This shouldn’t work. Fibonacci numbers involve integer sequences. Shortest paths involve graphs. Rotations involve 3D geometry. Different domains, different mathematics.
Yet they share structure. Once you see it, a single algorithm serves them all.
This collection of eleven blog posts is an extended meditation on one idea: algorithms arise from algebraic structure. The posts cover different domains (number theory, calculus, linear algebra, polymorphism) but they circle the same insight. Recognize the structure; the algorithm follows.
The Principle
Alex Stepanov articulated this most clearly in Elements of Programming: “Generic programming is about abstracting and classifying algorithms and data structures.” But the deeper point is how to abstract. Not by common syntax or superficial similarity, but by the algebraic laws a type obeys.
Why does structure appear everywhere? Because reality has structure. The algebraic structures we discover in programming (groups, rings, monoids) are the same structures physicists discover in nature. Rotations form a group. Spacetime transformations form a group. This isn’t coincidence. We’re uncovering patterns that exist.
Noether’s theorem makes this precise: every continuous symmetry corresponds to a conservation law. Time-translation symmetry gives conservation of energy. Space-translation symmetry gives conservation of momentum. Rotational symmetry gives conservation of angular momentum. The symmetry groups of physics are algebraic structures.
When we recognize “this is a monoid” in our code, we’re tapping into the same mathematical substrate that governs physical law. The algorithms follow because the structure constrains what’s possible, both in computation and in nature.
Consider the power() function above. What does it require?
- An associative binary operation (so we can regroup: \((a \cdot b) \cdot c = a \cdot (b \cdot c)\))
- An identity element (so \(1 \cdot x = x \cdot 1 = x\))
- Halving and parity testing on the exponent
That’s it. Any type providing these operations, with these laws, can use this algorithm. The requirements are algebraic, not syntactic.
Choosing the Algebra
November 30, 2025
The rest of this series asks: given a structure, what algorithms does it support? This post inverts the question.
The Flip Side
The peasant post showed that power() works on any monoid. The semirings post showed that matrix multiplication over different semirings solves six graph problems. The thread running through the whole series is: algorithms arise from algebraic structure.
But there’s a flip side that we haven’t addressed directly.
Sometimes you’re stuck with an expensive algorithm not because the problem is hard, but because you’re working in the wrong algebra. Change the algebra, and the algorithm becomes trivial. The cost shows up somewhere else, always. But if the cheap operation is the one you actually need, you win.
This is a very old idea. Napier invented logarithms in 1614 to turn multiplication into addition. What’s worth noticing is that logarithms, odds ratios, tropical semirings, and quaternions are all doing the same thing.
The Pattern
A computational basis transform takes values from one domain and represents them in another, where different operations are cheap:
| Domain | Cheap | Expensive |
|---|---|---|
| Log space | Multiplication (becomes addition) | Addition |
| Odds ratios | Bayesian updates (become multiplication) | Probability sums |
| Tropical \((\min, +)\) | Shortest paths (become matrix mult) | Subtraction |
| Quaternions | Rotation composition | Euler angle extraction |
| Modular integers | Exponentiation | Ordering |
| Rationals | Exact arithmetic | Irrational representation |
Each row follows the same structure. A transform \(\varphi: D \to D’\) makes some operations cheaper and others more expensive. There is no free lunch.
This is not a deep theorem. It’s almost tautological: if you could make everything cheaper by relabeling, the labels would already be the standard ones. But making the pattern explicit helps you recognize when you’re paying for an operation you don’t need.
Three Examples
Log Space
The most familiar example. You have a million small probabilities and need their product.
// Standard: underflows to 0 after ~30 terms
double product = 1.0;
for (double p : probs) product *= p; // 0.0
// Log domain: addition instead of multiplication
mutatio::lgd product(1.0);
for (double p : probs) product = product * mutatio::lgd(p);
// product.log() is finite. product.value() would overflow,
// but you stay in log space.
The algebra changed from \((\mathbb{R}^+, \times)\) to \((\mathbb{R}, +)\). Multiplication became addition. The isomorphism is \(\log\).
Tropical Semirings
The semirings post showed this already. Replace \((+, \times)\) with \((\min, +)\) and matrix multiplication becomes shortest-path computation. That’s the same move: you changed the semiring to make the algorithm you wanted (all-pairs shortest paths) fall out of a generic operation (matrix power) that you already had.
Differentiation: Three Ways
January 15, 2025
A synthesis of three earlier posts, comparing forward-mode AD, reverse-mode AD, and numerical differentiation.
Computing derivatives shows up everywhere: optimization, machine learning, physics simulation, numerical analysis. This series has explored three distinct approaches:
- Forward-mode AD via dual numbers
- Reverse-mode AD via computational graphs
- Numerical differentiation via finite differences
Each has different strengths. The right choice depends on the shape of your problem.
The Landscape
| Method | Accuracy | Cost for \(f: \mathbb{R}^n \to \mathbb{R}\) | Cost for \(f: \mathbb{R} \to \mathbb{R}^m\) | Memory |
|---|---|---|---|---|
| Forward AD | Exact | \(O(n)\) passes | \(O(1)\) pass | \(O(1)\) |
| Reverse AD | Exact | \(O(1)\) pass | \(O(m)\) passes | \(O(\text{ops})\) |
| Finite Diff | \(O(h^p)\) | \(O(n)\) evaluations | \(O(n)\) evaluations | \(O(1)\) |
The key point: problem structure determines the best method.
Forward-Mode AD: Dual Numbers
Forward-mode AD extends numbers with an infinitesimal \(\varepsilon\) where \(\varepsilon^2 = 0\). The derivative falls out of the arithmetic for free:
// f(x) = x^3 - 3x + 1
// f'(x) = 3x^2 - 3
auto x = dual<double>::variable(2.0); // x = 2, dx = 1
auto f = x*x*x - 3.0*x + 1.0;
std::cout << f.value() << "\n"; // 3.0
std::cout << f.derivative() << "\n"; // 9.0
Strengths:
- Simple implementation (operator overloading)
- No memory overhead
- Naturally composable for higher derivatives
- Works with any function of overloaded operators
When to use:
- Single input variable (or few inputs)
- Computing Jacobian-vector products
- Higher-order derivatives via nesting
- Sensitivity analysis along one direction
Complexity: One forward pass per input variable. For f: R^n -> R^m, computing the full Jacobian requires n passes.
Reverse-Mode AD: Computational Graphs
Reverse-mode AD builds a computational graph during the forward pass, then propagates gradients backward via the chain rule:
auto f = [](const auto& x) {
return sum(pow(x, 2.0)); // f(x) = sum(x^2)
};
auto df = grad(f); // Returns gradient function
auto gradient = df(x); // One backward pass for all partials
Strengths:
- O(1) backward passes regardless of input dimension
- Powers modern deep learning (backpropagation)
- Efficient for loss functions: f: R^n -> R
When to use:
- Many inputs, scalar output (neural networks)
- Computing vector-Jacobian products
- Optimization where you need the full gradient
Complexity: One forward pass to build the graph, one backward pass to compute all gradients. Memory scales with the number of operations because you have to store intermediate values.
Numerical Differentiation: Finite Differences
Approximate the derivative using the limit definition:
// Central difference: f'(x) ~ (f(x+h) - f(x-h)) / 2h
double df = central_difference(f, x);
Strengths:
Numerical Integration with Generic Concepts
August 28, 2023
Numerical integration (quadrature) for C++20.
Overview
The definite integral is the signed area under a curve:
$$\int_a^b f(x)\,dx$$Most functions do not have closed-form antiderivatives, so we approximate integrals numerically using quadrature rules: weighted sums of function evaluations.
$$\int_a^b f(x)\,dx \approx \sum_i w_i f(x_i)$$Different rules choose different nodes x_i and weights w_i. The tradeoff is always accuracy vs. computational cost.
Quick Start
#include <integration/integrate.hpp>
#include <cmath>
#include <iostream>
int main() {
using namespace integration;
// integral from 0 to pi of sin(x) dx = 2
double result = integrate([](double x) { return std::sin(x); }, 0.0, 3.14159265);
std::cout << "integral of sin(x) dx = " << result << "\n"; // ~2.0
}
The Quadrature Zoo
Basic Rules
| Rule | Formula | Error | Exact for |
|---|---|---|---|
| Midpoint | \((b-a)f(m)\) | \(O(h^3)\) | Linear |
| Trapezoidal | \(\frac{b-a}{2}(f(a)+f(b))\) | \(O(h^3)\) | Linear |
| Simpson’s | \(\frac{b-a}{6}(f(a)+4f(m)+f(b))\) | \(O(h^5)\) | Cubic |
double m = midpoint_rule(f, a, b);
double t = trapezoidal_rule(f, a, b);
double s = simpsons_rule(f, a, b);
Composite Rules
Divide [a,b] into n subintervals and apply the basic rule to each:
// Error: O(h^2) where h = (b-a)/n
double m = composite_midpoint(f, a, b, 100);
double t = composite_trapezoidal(f, a, b, 100);
// Error: O(h^4) - much more accurate!
double s = composite_simpsons(f, a, b, 100); // n must be even
Gauss-Legendre Quadrature
The optimal choice: n points exactly integrate polynomials of degree 2n-1.
// 5-point Gauss-Legendre: exact for degree <= 9
double g = gauss_legendre<5>(f, a, b);
Adaptive Integration
Automatically refines where the function is difficult:
// Recommended for general use
double result = integrate(f, a, b); // Default tolerance 1e-10
double result = integrate(f, a, b, 1e-12); // Custom tolerance
// With error estimate
auto [value, error, evals] = integrate_with_error(f, a, b);
Deriving Simpson’s Rule
Taylor expand \(f(x)\) around the midpoint \(m = (a+b)/2\). Odd powers vanish by symmetry when integrating from \(a\) to \(b\):
$$\int_a^b f(x)\,dx \approx (b-a)f(m) + f''(m)\frac{(b-a)^3}{24} + O(h^5)$$Simpson’s rule is the unique combination of endpoint and midpoint values that cancels the \(h^2\) error:
$$\int_a^b f(x)\,dx = \frac{b-a}{6}\left[f(a) + 4f(m) + f(b)\right] + O(h^5)$$It also cancels the h^3 term. Simpson gets a “bonus degree” of accuracy for free. This is one of those happy accidents in numerical analysis.
Why Gauss-Legendre is Optimal
With \(n\) evaluation points, we have \(2n\) free parameters (\(n\) nodes + \(n\) weights). We can match \(2n\) conditions: exact integration of \(1, x, x^2, \ldots, x^{2n-1}\).
The nodes turn out to be roots of the \(n\)-th Legendre polynomial \(P_n(x)\). Orthogonal polynomials arise naturally from the optimization. This is not a coincidence. It is the same reason Legendre polynomials show up in approximation theory: they are the optimal basis for polynomial approximation on [-1,1] with the right inner product.
Forward-Mode Automatic Differentiation
September 20, 2021
Forward-mode automatic differentiation via dual numbers for C++20.
Overview
Dual numbers are a simple yet powerful technique for computing exact derivatives. The key insight: if we extend our number system with an element epsilon where epsilon^2 = 0, then evaluating f(x + epsilon) yields f(x) + epsilon * f'(x). The derivative emerges automatically from the algebra.
Quick Start
#include <dual/dual.hpp>
#include <iostream>
int main() {
using namespace dual;
// Create a dual variable at x = 2
auto x = dual<double>::variable(2.0);
// Compute f(x) = x^3 - 3x + 1
auto f = x*x*x - 3.0*x + 1.0;
std::cout << "f(2) = " << f.value() << "\n"; // 3.0
std::cout << "f'(2) = " << f.derivative() << "\n"; // 9.0
}
The Mathematics
A dual number has the form \(a + b\varepsilon\) where \(\varepsilon^2 = 0\). Arithmetic follows naturally:
$$(a + b\varepsilon) + (c + d\varepsilon) = (a+c) + (b+d)\varepsilon$$$$(a + b\varepsilon)(c + d\varepsilon) = ac + (ad + bc)\varepsilon + bd\varepsilon^2 = ac + (ad + bc)\varepsilon$$Notice how the \(bd\varepsilon^2\) term vanishes because \(\varepsilon^2 = 0\).
For a function \(f\), Taylor expansion gives:
$$f(a + b\varepsilon) = f(a) + bf'(a)\varepsilon + \frac{b^2}{2}f''(a)\varepsilon^2 + \cdots = f(a) + bf'(a)\varepsilon$$If we set \(b = 1\) (marking \(x\) as “the variable we’re differentiating with respect to”), then:
$$f(x + \varepsilon) = f(x) + f'(x)\varepsilon$$The derivative appears as the coefficient of epsilon!
API Reference
dual
The core dual number type.
// Create a variable for differentiation
auto x = dual<double>::variable(3.0); // x = 3, dx = 1
// Create a constant
auto c = dual<double>::constant(2.0); // c = 2, dc = 0
// Access values
double val = x.value(); // 3.0
double deriv = x.derivative(); // 1.0
// Arithmetic operators: +, -, *, /
auto y = sin(x*x) + exp(-x);
// Convenience function
auto [value, deriv] = differentiate([](auto x) { return x*x; }, 3.0);
Mathematical Functions
All standard math functions are supported with correct derivative propagation:
- Basic:
sqrt,cbrt,abs - Exponential:
exp,exp2,expm1,log,log2,log10,log1p - Trigonometric:
sin,cos,tan,asin,acos,atan,atan2 - Hyperbolic:
sinh,cosh,tanh,asinh,acosh,atanh - Power:
pow,hypot - Special:
erf,erfc
Higher-Order Derivatives
Second derivatives with dual2:
auto result = differentiate2([](auto x) { return sin(x); }, 1.0);
// result.value = sin(1)
// result.first = cos(1)
// result.second = -sin(1)
Arbitrary order with jets:
// Compute f, f', f'', f''', f'''' at x = 1
auto derivs = derivatives<4>([](auto x) { return exp(x); }, 1.0);
// All derivatives of e^x at x=1 equal e
Forward vs Reverse Mode
This library implements forward mode AD:
Polynomials as Euclidean Domains
July 14, 2020
The same GCD algorithm works for integers and polynomials. That’s not a coincidence. It’s because both are Euclidean domains.
The Observation
// For integers: gcd(48, 18) = 6
// For polynomials: gcd(x^3 - 1, x^2 - 1) = x - 1
// Same algorithm, different types
template<euclidean_domain E>
E gcd(E a, E b) {
while (b != E(0)) {
a = std::exchange(b, a % b);
}
return a;
}
That template compiles and works correctly for both integers and polynomials. The reason it works is algebraic: both types support division with remainder, and the remainder is always “smaller” than the divisor in a well-defined sense.
Quick Start
#include <polynomials/polynomial.hpp>
#include <iostream>
using namespace poly;
int main() {
// Create polynomial x^2 - 1 = (x-1)(x+1)
auto p = polynomial<double>{-1, 0, 1};
// Create polynomial x^3 - 1 = (x-1)(x^2+x+1)
auto q = polynomial<double>{-1, 0, 0, 1};
// GCD should be (x - 1)
auto g = gcd(p, q);
std::cout << "gcd(x^2-1, x^3-1) has degree " << g.degree() << "\n"; // 1
// Find roots of x^2 - 1
auto roots = find_roots(p, -10.0, 10.0);
for (double r : roots) {
std::cout << "Root: " << r << "\n"; // -1 and 1
}
}
API Reference
Creating Polynomials
// From dense coefficients (a[i] = coefficient of x^i)
polynomial<double> p{1, -2, 1}; // 1 - 2x + x^2
// Monomial: coefficient * x^degree
auto m = polynomial<double>::monomial(3.0, 4); // 3x^4
// The variable x
auto x = polynomial<double>::x(); // x
// Constant
polynomial<double> c{5.0}; // 5
Arithmetic
auto sum = p + q;
auto diff = p - q;
auto prod = p * q;
auto [quot, rem] = divmod(p, q); // Division with remainder
auto quot_only = p / q;
auto rem_only = p % q;
GCD and Related
auto g = gcd(p, q); // Greatest common divisor
auto [g, s, t] = extended_gcd(p, q); // Bezout: g = p*s + q*t
auto l = lcm(p, q); // Least common multiple
bool d = divides(p, q); // Does p divide q?
Evaluation and Calculus
double val = evaluate(p, x); // p(x)
auto dp = derivative(p); // p'(x)
auto integral = antiderivative(p); // integral of p
auto roots = find_roots(p, -10, 10); // All real roots in interval
auto crit = stationary_points(p, -10, 10); // Where p'(x) = 0
The Euclidean Domain Structure
What makes this work is shared algebraic structure. A Euclidean domain has a norm function and a division algorithm where the remainder is always smaller than the divisor:
| Property | Integers | Polynomials |
|---|---|---|
| Norm | abs(n) | degree(p) |
| Division | a = b*q + r, abs(r) < abs(b) | a = b*q + r, deg(r) < deg(b) |
| GCD | gcd(48, 18) = 6 | gcd(x^2-1, x-1) = x-1 |
The GCD algorithm doesn’t care which type it’s operating on. It only needs the division-with-remainder property. Stepanov’s whole point is exactly this: algorithms arise from algebraic structure. When you recognize that polynomials and integers share the same abstract structure, you immediately get:
Exact Rational Arithmetic
February 18, 2020
Floating-point lies to you.
double x = 0.1 + 0.2;
std::cout << (x == 0.3); // Prints 0 (false!)
The number 0.1 has no exact binary representation, for the same reason 1/3 has no exact decimal representation. Floating-point represents numbers as m x 2^e, and most decimal fractions don’t land on a power of two.
Rational arithmetic fixes this. 1/3 stays exactly 1/3.
The Representation
A rational number is a pair (numerator, denominator) kept in lowest terms:
template<std::integral T>
class rat {
T num_; // numerator (carries sign)
T den_; // denominator (always positive)
void reduce() {
T g = std::gcd(abs(num_), den_);
num_ /= g;
den_ /= g;
}
};
Three invariants, always maintained:
- The denominator is positive (sign lives in the numerator)
- GCD(|num|, den) = 1 (always reduced)
- Zero is uniquely 0/1
Arithmetic
Addition needs a common denominator:
$$\frac{a}{b} + \frac{c}{d} = \frac{ad + bc}{bd}$$Then reduce. In code:
rat operator+(rat const& rhs) const {
return rat(num_ * rhs.den_ + rhs.num_ * den_,
den_ * rhs.den_);
}
The constructor calls reduce() automatically.
Multiplication is simpler:
$$\frac{a}{b} \times \frac{c}{d} = \frac{ac}{bd}$$Division multiplies by the reciprocal:
$$\frac{a/b}{c/d} = \frac{ad}{bc}$$Exact Comparison
No floating-point fuzziness. Two reduced rationals are equal iff their numerators and denominators match:
bool operator==(rat const& rhs) const {
return num_ == rhs.num_ && den_ == rhs.den_;
}
For ordering, cross-multiply (valid because denominators are positive):
$$\frac{a}{b} < \frac{c}{d} \iff ad < cb$$The Mediant
The mediant of a/b and c/d is (a+c)/(b+d). It’s not the average. It has different, more interesting properties:
rat mediant(rat const& a, rat const& b) {
return rat(a.numerator() + b.numerator(),
a.denominator() + b.denominator());
}
If a/b < c/d, then a/b < mediant < c/d. The mediant is always in lowest terms when a/b and c/d are neighbors in the Stern-Brocot tree. And mediants generate all positive rationals exactly once.
The Stern-Brocot Tree
Start with 0/1 and 1/0 (representing infinity). Repeatedly take mediants:
Level 0: 0/1 1/0
Level 1: 0/1 1/1 1/0
Level 2: 0/1 1/2 1/1 2/1 1/0
Level 3: 0/1 1/3 1/2 2/3 1/1 3/2 2/1 3/1 1/0
Every positive rational appears exactly once. The path from root to any node encodes its continued fraction. This connects to best rational approximations and Farey sequences.
GCD Ties Everything Together
Reducing fractions requires GCD. The algorithm is Euclid’s, from around 300 BCE:
T gcd(T a, T b) {
while (b != 0) {
a = a % b;
std::swap(a, b);
}
return a;
}
The same algorithm works for any Euclidean domain. That’s not a coincidence. It’s a consequence of the algebraic structure.
Rational numbers form a field: every non-zero element has a multiplicative inverse (the reciprocal). The requirement that denominators be non-zero and fractions reduced comes from this algebraic structure, not from arbitrary convention.
How Iterators Give You N+M Instead of NxM
November 15, 2019
The problem is combinatorial. You have N algorithms (sort, search, find, copy) and M containers (array, list, tree, hash table). The naive approach: implement each algorithm for each container. That is NxM implementations.
The insight is to interpose an abstraction layer.
The Iterator Abstraction
Instead of algorithms knowing about containers directly, we define iterator categories, capabilities that algorithms require and containers provide:
Input: Single-pass read. You can advance (++) and dereference (*), but once you move forward, you cannot go back. Stream-like.
Forward: Multi-pass. You can iterate multiple times; begin() always gives the same starting point.
Bidirectional: Can go backward (--). Enables algorithms like reverse iteration.
Random-access: Can jump anywhere (+n, []). Enables binary search, sorting.
This is a hierarchy of requirements. Each level adds capabilities and enables more algorithms. An algorithm declares the weakest category it needs, and any container providing at least that category works.
A True Input Iterator
The input iterator category exists for a reason. Here is a working example that reads entropy from /dev/urandom:
#include <fstream>
#include <iterator>
#include <cstdint>
struct entropy_iterator {
using iterator_category = std::input_iterator_tag;
using value_type = uint8_t;
using difference_type = std::ptrdiff_t;
using pointer = const uint8_t*;
using reference = uint8_t; // returns by value, not reference
std::ifstream* source = nullptr;
uint8_t byte = 0;
entropy_iterator() = default; // sentinel (end iterator)
explicit entropy_iterator(std::ifstream& s) : source(&s) {
++(*this); // prime the first byte
}
uint8_t operator*() const { return byte; }
entropy_iterator& operator++() {
if (source && source->good()) {
source->read(reinterpret_cast<char*>(&byte), 1);
if (!source->good()) source = nullptr;
}
return *this;
}
entropy_iterator operator++(int) {
auto tmp = *this;
++(*this);
return tmp;
}
bool operator==(const entropy_iterator& other) const {
return source == other.source;
}
};
Use it like any input iterator:
int main() {
std::ifstream urandom("/dev/urandom", std::ios::binary);
entropy_iterator it(urandom);
// generate 16 random bytes
std::vector<uint8_t> key(16);
std::copy_n(it, 16, key.begin());
// or use in algorithms
int sum = 0;
for (int i = 0; i < 1000; ++i, ++it) {
sum += *it;
}
// sum ≈ 127500 (mean of uniform [0,255] × 1000)
}
Each ++ consumes a fresh entropy byte from the kernel. You literally cannot iterate twice over the same sequence. This is why the input iterator category exists: some sources are inherently single-pass. Claiming forward iterator capabilities would be a lie.
The same pattern applies to network streams, sensor readings, and any source where data is consumed by reading it.
The Payoff
Now binary_search does not need to know about vectors, deques, or sorted arrays. It only needs random-access iterators. The algorithm expresses its requirements; the container provides capabilities. They compose through the iterator abstraction.
Is It Prime?
September 10, 2019
The Miller-Rabin primality test and the mathematics of certainty
The Problem
Given a large number n, is it prime? Trial division up to sqrt(n) is too slow for cryptographic-sized numbers. We need something faster, and we are willing to accept “probably prime” with quantifiable certainty.
Fermat’s Little Theorem
For prime \(p\) and any \(a\) not divisible by \(p\):
$$a^{p-1} \equiv 1 \pmod{p}$$This suggests a test: pick random a, compute a^(n-1) mod n. If the result is not 1, n is definitely composite. But if it is 1, n might be prime, or might be a Carmichael number that fools this test.
The Miller-Rabin Improvement
Miller and Rabin observed something stronger. For odd prime \(p\), write \(p-1 = 2^r \cdot d\) (factor out all 2s). Then the sequence:
$$a^d, a^{2d}, a^{4d}, \ldots, a^{2^r \cdot d} = a^{p-1}$$must either:
- Start with 1, or
- Contain -1 (i.e., p-1) somewhere before reaching 1
Why? Because the only square roots of 1 mod p are plus or minus 1. If we ever see 1 without first seeing -1, we have found a non-trivial square root of 1, proving n is composite.
The Witness Test
bool witness_test(int64_t n, int64_t a) {
// Write n-1 = 2^r x d
int64_t d = n - 1;
int r = 0;
while ((d & 1) == 0) { d >>= 1; r++; }
// Compute x = a^d mod n
int64_t x = mod_pow(a, d, n);
if (x == 1 || x == n - 1) return true; // Probably prime
// Square r-1 times, looking for n-1
for (int i = 1; i < r; i++) {
x = (x * x) % n;
if (x == n - 1) return true;
}
return false; // Definitely composite
}
If witness_test(n, a) returns false, n is definitely composite. The value a is a “witness” to compositeness.
Error Bounds
Here is the part I find most satisfying. For any composite n, at least 3/4 of all possible witnesses a in [2, n-2] will detect it. Each random witness has at most 1/4 chance of failing to detect a composite.
With \(k\) independent witnesses:
$$P(\text{false positive}) \leq \left(\frac{1}{4}\right)^k$$| Witnesses | Error bound |
|---|---|
| 10 | \(< 10^{-6}\) |
| 20 | \(< 10^{-12}\) |
| 40 | \(< 10^{-24}\) |
The error drops exponentially. 40 witnesses gives you a false positive probability smaller than the chance of a cosmic ray flipping a bit in your RAM during the computation.
Parameterizing by Error
Rather than asking “how many iterations?”, ask “what error rate is acceptable?”:
Modular Arithmetic as Rings
June 22, 2019
Finite algebraic structures and what they teach us about algorithms
The Stepanov Perspective
Stepanov’s central insight: algorithms arise from algebraic structure. The same algorithm that works on integers works on matrices, polynomials, and modular integers, not by accident, but because they share algebraic properties.
Integers modulo N form a ring: a set with addition and multiplication satisfying familiar laws. When N is prime, it is a field, meaning every non-zero element has a multiplicative inverse. Understanding these structures tells you which algorithms apply where.
The Ring Z/NZ
Integers modulo N, written Z/NZ, are equivalence classes:
- 0 = {…, -N, 0, N, 2N, …}
- 1 = {…, -N+1, 1, N+1, 2N+1, …}
- …
Operations are inherited from integers:
[a] + [b] = [a + b]
[a] x [b] = [a x b]
The implementation keeps one representative per class, in [0, N):
template<int64_t N>
struct mod_int {
int64_t v; // Always in [0, N)
static constexpr int64_t normalize(int64_t x) {
x %= N;
return x < 0 ? x + N : x;
}
constexpr mod_int(int64_t x) : v(normalize(x)) {}
};
Ring Axioms
A ring (R, +, x) satisfies:
Addition forms an abelian group:
- Associative: (a + b) + c = a + (b + c)
- Identity: a + 0 = a
- Inverses: a + (-a) = 0
- Commutative: a + b = b + a
Multiplication is a monoid:
- Associative: (a x b) x c = a x (b x c)
- Identity: a x 1 = 1 x a = a
Distributive:
- a x (b + c) = a x b + a x c
- (b + c) x a = b x a + c x a
These axioms enable algorithms. Power-by-squaring works because multiplication is associative. The extended GCD works because of the ring structure.
Fermat’s Little Theorem
When N is prime, something special happens: every non-zero element has a multiplicative inverse. The set of non-zero elements forms a multiplicative group of order N-1.
Fermat’s Little Theorem: for prime \(p\) and \(a\) not congruent to \(0 \pmod{p}\):
$$a^{p-1} \equiv 1 \pmod{p}$$This gives us the inverse:
$$a \cdot a^{p-2} = a^{p-1} \equiv 1 \pmod{p}$$So \(a^{p-2}\) is the multiplicative inverse of \(a\):
constexpr mod_int inverse() const {
return pow(N - 2); // Using repeated squaring
}
The Connection to Peasant
The power function uses the same peasant algorithm from the previous post:
One Algorithm, Infinite Powers
March 15, 2019
How the Russian peasant algorithm reveals the universal structure of exponentiation
The Algorithm
Russian peasants had a clever method for multiplication that does not require memorizing times tables. To compute 23 x 17:
23 17
11 34 (halve, double)
5 68
2 136
1 272
Add the right column wherever the left is odd: 17 + 34 + 68 + 272 = 391. That is 23 x 17.
Why does this work? Because we are really computing:
23 x 17 = (16 + 4 + 2 + 1) x 17 = 16x17 + 4x17 + 2x17 + 17
The algorithm only needs three operations on the multiplier:
half(n), integer division by 2even(n), test if divisible by 2- Addition on the result
From Multiplication to Exponentiation
Here is the insight that makes this interesting: the same algorithm computes powers.
Replace “add to accumulator” with “multiply into accumulator” and “double the multiplicand” with “square the base”:
T power(T base, int exp) {
T result = 1;
while (exp > 0) {
if (!even(exp)) result = result * base;
base = base * base;
exp = half(exp);
}
return result;
}
This is O(log n) multiplications instead of O(n). Computing 2^1000 takes about 10 multiplications, not 1000.
The Monoid Connection
The peasant algorithm works whenever you have:
- An associative binary operation
* - An identity element
1where1 * x = x * 1 = x
This structure is called a monoid. The algorithm computes x * x * ... * x (n times) using O(log n) operations.
What makes this powerful is that many things form monoids:
| Type | Operation | Identity | Computing x^n gives you… |
|---|---|---|---|
| Integers | x | 1 | Powers |
| Matrices | x | I | Matrix powers |
| Strings | concat | "" | String repetition |
| Functions | compose | id | Function iteration |
| Permutations | compose | id | Permutation powers |
| Quaternions | x | 1 | Rotation composition |
Why Associativity Unlocks Efficiency
Why does the peasant algorithm achieve O(log n) instead of O(n)? The answer lies in a single algebraic law: associativity.
Associativity says \((a \cdot b) \cdot c = a \cdot (b \cdot c)\). This looks innocuous, but it means we can restructure computation without changing results. Consider computing \(a^8\):
Naive: a x a x a x a x a x a x a x a (7 multiplications)
Peasant: ((a^2)^2)^2 (3 multiplications)
Both produce the same answer because we can freely regroup. The peasant algorithm exploits this freedom systematically: instead of accumulating one factor at a time, it squares intermediate results and combines them.
Linked project: Arkiv
The MCP Pattern: SQLite as the AI-Queryable Cache
March 20, 2026
I keep building the same thing.
Not the same product — the products are different. One indexes a Hugo blog. One indexes AI conversations. One consolidates medical records from three hospitals. One catalogs a hundred git repositories. But underneath, they all have the same skeleton. After the fifth time, I think the skeleton deserves a name.
The pattern
Domain files (ground truth)
↓ index
SQLite database (read-only cache, FTS5)
↓ expose
MCP server (tools + resources → AI assistant)
That’s it. Three layers. The domain files are always canonical — the database is a disposable cache you can rebuild from them at any time. SQLite gives you structured queries, full-text search, and JSON extraction over data that was previously trapped in flat files. MCP exposes it to an AI assistant that can write SQL, retrieve content, and (in some cases) create new content.
Here’s the inventory:
| Project | Domain | Ground Truth | What the MCP Exposes |
|---|---|---|---|
| hugo-memex | Blog content | Markdown files with YAML front matter | 951 pages, FTS5 search, taxonomy queries, JSON front matter extraction |
| memex | AI conversations | ChatGPT/Claude/Gemini exports | Conversation trees, FTS5 message search, tags, enrichments |
| chartfold | Medical records | Epic, MEDITECH, athenahealth exports | Labs, meds, encounters, imaging, pathology, cross-source reconciliation |
| arkiv | Personal archives | JSONL files from various sources | Unified SQL over heterogeneous personal data |
| repoindex | Git repositories | Local git repos + GitHub/PyPI/CRAN metadata | Repository catalog with activity tracking, publication status |
Five projects. Five completely different domains. One architecture.
Why SQLite
SQLite is the most deployed database in history. It’s on every phone, every browser, every Python installation. But that’s not why I use it.
I use it because it solves three problems at once:
Structured queries over unstructured data. Hugo front matter is YAML trapped inside markdown files. Medical records are scattered across three incompatible EHR export formats. AI conversations are JSON trees with branching paths. SQLite turns all of these into tables you can JOIN, GROUP BY, and aggregate. json_extract() handles the long tail of fields that don’t fit a fixed schema.
Full-text search. FTS5 with porter stemming and unicode61 tokenization gives you relevance-ranked search across any text corpus. No Elasticsearch, no external service, no running daemon. Just a virtual table that lives in the same database file.
Code Without Purpose
February 24, 2026
Time is finite in ways I can’t ignore. That changes which questions about code feel important.
I read a post arguing that the most valuable programming skill in 2026 is deleting code. The thesis: AI generates code faster than anyone can review it, so the real value is in curation and subtraction. Code is a liability, not an asset.
I agree with the observation. I disagree with the prescription.
Linked project: Chartfold
The MCP Pattern: SQLite as the AI-Queryable Cache
March 20, 2026
I keep building the same thing.
Not the same product — the products are different. One indexes a Hugo blog. One indexes AI conversations. One consolidates medical records from three hospitals. One catalogs a hundred git repositories. But underneath, they all have the same skeleton. After the fifth time, I think the skeleton deserves a name.
The pattern
Domain files (ground truth)
↓ index
SQLite database (read-only cache, FTS5)
↓ expose
MCP server (tools + resources → AI assistant)
That’s it. Three layers. The domain files are always canonical — the database is a disposable cache you can rebuild from them at any time. SQLite gives you structured queries, full-text search, and JSON extraction over data that was previously trapped in flat files. MCP exposes it to an AI assistant that can write SQL, retrieve content, and (in some cases) create new content.
Here’s the inventory:
| Project | Domain | Ground Truth | What the MCP Exposes |
|---|---|---|---|
| hugo-memex | Blog content | Markdown files with YAML front matter | 951 pages, FTS5 search, taxonomy queries, JSON front matter extraction |
| memex | AI conversations | ChatGPT/Claude/Gemini exports | Conversation trees, FTS5 message search, tags, enrichments |
| chartfold | Medical records | Epic, MEDITECH, athenahealth exports | Labs, meds, encounters, imaging, pathology, cross-source reconciliation |
| arkiv | Personal archives | JSONL files from various sources | Unified SQL over heterogeneous personal data |
| repoindex | Git repositories | Local git repos + GitHub/PyPI/CRAN metadata | Repository catalog with activity tracking, publication status |
Five projects. Five completely different domains. One architecture.
Why SQLite
SQLite is the most deployed database in history. It’s on every phone, every browser, every Python installation. But that’s not why I use it.
I use it because it solves three problems at once:
Structured queries over unstructured data. Hugo front matter is YAML trapped inside markdown files. Medical records are scattered across three incompatible EHR export formats. AI conversations are JSON trees with branching paths. SQLite turns all of these into tables you can JOIN, GROUP BY, and aggregate. json_extract() handles the long tail of fields that don’t fit a fixed schema.
Full-text search. FTS5 with porter stemming and unicode61 tokenization gives you relevance-ranked search across any text corpus. No Elasticsearch, no external service, no running daemon. Just a virtual table that lives in the same database file.
Chartfold: Owning Your Medical Records
February 24, 2026
I have cancer. My oncologist is at one hospital system (Siteman/BJC), my primary care doctor at another, and my earlier treatment history lives at a third (Anderson, where my first oncologist practiced). Patient portals are fine for browsing, but they don’t answer questions. They show you your data one lab result at a time, one note at a time, one visit at a time.
I wanted to run queries against my medical records. Correlate lab trends with treatment changes. Generate structured question lists before oncology visits. Ask “what changed since my last appointment” and get a real answer. That means getting the data out of the portal and into something programmable.
Chartfold loads EHR exports into SQLite and exposes them to Claude via MCP.
Code Without Purpose
February 24, 2026
Time is finite in ways I can’t ignore. That changes which questions about code feel important.
I read a post arguing that the most valuable programming skill in 2026 is deleting code. The thesis: AI generates code faster than anyone can review it, so the real value is in curation and subtraction. Code is a liability, not an asset.
I agree with the observation. I disagree with the prescription.
Linked project: Hugo-Memex
The MCP Pattern: SQLite as the AI-Queryable Cache
March 20, 2026
I keep building the same thing.
Not the same product — the products are different. One indexes a Hugo blog. One indexes AI conversations. One consolidates medical records from three hospitals. One catalogs a hundred git repositories. But underneath, they all have the same skeleton. After the fifth time, I think the skeleton deserves a name.
The pattern
Domain files (ground truth)
↓ index
SQLite database (read-only cache, FTS5)
↓ expose
MCP server (tools + resources → AI assistant)
That’s it. Three layers. The domain files are always canonical — the database is a disposable cache you can rebuild from them at any time. SQLite gives you structured queries, full-text search, and JSON extraction over data that was previously trapped in flat files. MCP exposes it to an AI assistant that can write SQL, retrieve content, and (in some cases) create new content.
Here’s the inventory:
| Project | Domain | Ground Truth | What the MCP Exposes |
|---|---|---|---|
| hugo-memex | Blog content | Markdown files with YAML front matter | 951 pages, FTS5 search, taxonomy queries, JSON front matter extraction |
| memex | AI conversations | ChatGPT/Claude/Gemini exports | Conversation trees, FTS5 message search, tags, enrichments |
| chartfold | Medical records | Epic, MEDITECH, athenahealth exports | Labs, meds, encounters, imaging, pathology, cross-source reconciliation |
| arkiv | Personal archives | JSONL files from various sources | Unified SQL over heterogeneous personal data |
| repoindex | Git repositories | Local git repos + GitHub/PyPI/CRAN metadata | Repository catalog with activity tracking, publication status |
Five projects. Five completely different domains. One architecture.
Why SQLite
SQLite is the most deployed database in history. It’s on every phone, every browser, every Python installation. But that’s not why I use it.
I use it because it solves three problems at once:
Structured queries over unstructured data. Hugo front matter is YAML trapped inside markdown files. Medical records are scattered across three incompatible EHR export formats. AI conversations are JSON trees with branching paths. SQLite turns all of these into tables you can JOIN, GROUP BY, and aggregate. json_extract() handles the long tail of fields that don’t fit a fixed schema.
Full-text search. FTS5 with porter stemming and unicode61 tokenization gives you relevance-ranked search across any text corpus. No Elasticsearch, no external service, no running daemon. Just a virtual table that lives in the same database file.
Linked project: Memex
The MCP Pattern: SQLite as the AI-Queryable Cache
March 20, 2026
I keep building the same thing.
Not the same product — the products are different. One indexes a Hugo blog. One indexes AI conversations. One consolidates medical records from three hospitals. One catalogs a hundred git repositories. But underneath, they all have the same skeleton. After the fifth time, I think the skeleton deserves a name.
The pattern
Domain files (ground truth)
↓ index
SQLite database (read-only cache, FTS5)
↓ expose
MCP server (tools + resources → AI assistant)
That’s it. Three layers. The domain files are always canonical — the database is a disposable cache you can rebuild from them at any time. SQLite gives you structured queries, full-text search, and JSON extraction over data that was previously trapped in flat files. MCP exposes it to an AI assistant that can write SQL, retrieve content, and (in some cases) create new content.
Here’s the inventory:
| Project | Domain | Ground Truth | What the MCP Exposes |
|---|---|---|---|
| hugo-memex | Blog content | Markdown files with YAML front matter | 951 pages, FTS5 search, taxonomy queries, JSON front matter extraction |
| memex | AI conversations | ChatGPT/Claude/Gemini exports | Conversation trees, FTS5 message search, tags, enrichments |
| chartfold | Medical records | Epic, MEDITECH, athenahealth exports | Labs, meds, encounters, imaging, pathology, cross-source reconciliation |
| arkiv | Personal archives | JSONL files from various sources | Unified SQL over heterogeneous personal data |
| repoindex | Git repositories | Local git repos + GitHub/PyPI/CRAN metadata | Repository catalog with activity tracking, publication status |
Five projects. Five completely different domains. One architecture.
Why SQLite
SQLite is the most deployed database in history. It’s on every phone, every browser, every Python installation. But that’s not why I use it.
I use it because it solves three problems at once:
Structured queries over unstructured data. Hugo front matter is YAML trapped inside markdown files. Medical records are scattered across three incompatible EHR export formats. AI conversations are JSON trees with branching paths. SQLite turns all of these into tables you can JOIN, GROUP BY, and aggregate. json_extract() handles the long tail of fields that don’t fit a fixed schema.
Full-text search. FTS5 with porter stemming and unicode61 tokenization gives you relevance-ranked search across any text corpus. No Elasticsearch, no external service, no running daemon. Just a virtual table that lives in the same database file.
Linked project: Repoindex
The MCP Pattern: SQLite as the AI-Queryable Cache
March 20, 2026
I keep building the same thing.
Not the same product — the products are different. One indexes a Hugo blog. One indexes AI conversations. One consolidates medical records from three hospitals. One catalogs a hundred git repositories. But underneath, they all have the same skeleton. After the fifth time, I think the skeleton deserves a name.
The pattern
Domain files (ground truth)
↓ index
SQLite database (read-only cache, FTS5)
↓ expose
MCP server (tools + resources → AI assistant)
That’s it. Three layers. The domain files are always canonical — the database is a disposable cache you can rebuild from them at any time. SQLite gives you structured queries, full-text search, and JSON extraction over data that was previously trapped in flat files. MCP exposes it to an AI assistant that can write SQL, retrieve content, and (in some cases) create new content.
Here’s the inventory:
| Project | Domain | Ground Truth | What the MCP Exposes |
|---|---|---|---|
| hugo-memex | Blog content | Markdown files with YAML front matter | 951 pages, FTS5 search, taxonomy queries, JSON front matter extraction |
| memex | AI conversations | ChatGPT/Claude/Gemini exports | Conversation trees, FTS5 message search, tags, enrichments |
| chartfold | Medical records | Epic, MEDITECH, athenahealth exports | Labs, meds, encounters, imaging, pathology, cross-source reconciliation |
| arkiv | Personal archives | JSONL files from various sources | Unified SQL over heterogeneous personal data |
| repoindex | Git repositories | Local git repos + GitHub/PyPI/CRAN metadata | Repository catalog with activity tracking, publication status |
Five projects. Five completely different domains. One architecture.
Why SQLite
SQLite is the most deployed database in history. It’s on every phone, every browser, every Python installation. But that’s not why I use it.
I use it because it solves three problems at once:
Structured queries over unstructured data. Hugo front matter is YAML trapped inside markdown files. Medical records are scattered across three incompatible EHR export formats. AI conversations are JSON trees with branching paths. SQLite turns all of these into tables you can JOIN, GROUP BY, and aggregate. json_extract() handles the long tail of fields that don’t fit a fixed schema.
Full-text search. FTS5 with porter stemming and unicode61 tokenization gives you relevance-ranked search across any text corpus. No Elasticsearch, no external service, no running daemon. Just a virtual table that lives in the same database file.
repoindex: Collection Awareness for Your Git Repos
December 16, 2025
I have around 100 git repos. When I’m working with Claude Code on one of them, the AI has deep knowledge of that repo but zero awareness of the rest. Questions like “which of my repos already has a fuzzy search implementation?” or “what other projects use this pattern?” require me to go dig around manually.
repoindex fixes that.
The Idea
Separation of concerns:
Claude Code (deep work on ONE repo)
|
| "What else do I have?"
| "Which repos need X?"
v
repoindex (collection awareness)
|
+-- repo://... -> what exists
+-- tags://... -> organization
+-- stats://... -> aggregations
+-- events://... -> what happened
Claude Code works inside repositories. repoindex knows about repositories: metadata, tags, status, relationships. Together they give you full portfolio awareness.
MCP Server Integration
The most useful part is the MCP (Model Context Protocol) server. Add it to your Claude Code configuration and the AI can query your collection directly:
- “Which of my Python repos don’t have a LICENSE file?”
- “What repos have I updated in the last week?”
- “Show me all projects tagged with
ml”
The server exposes resources like repo://, tags://, stats://, and events:// that Claude Code reads to understand your portfolio.
Core Features
Tag-Based Organization. Hierarchical tags for categorizing repos. Tags can be explicit (repoindex tag add myproject topic:ml) or implicit (derived automatically from language, directory, features).
Query Language. Filter repos with expressions:
repoindex query "language == 'Python' and 'ml' in tags"
repoindex query "stars > 10 and has:docs"
Event Tracking. What happened across your collection:
repoindex events --since 7d --pretty
New releases, tags, PyPI publishes, all in one view.
JSONL Output. Every command outputs newline-delimited JSON by default, so it plays well with Unix pipelines:
repoindex status | jq 'select(.status.clean == false)'
Installation
Available on PyPI:
pip install repoindex
Configure your repository directories and start indexing:
repoindex config generate
repoindex list --pretty
Why the Rename?
This was previously called ghops. The new name is more honest about what it does: it indexes repositories. The old name implied GitHub-specific operations, but the tool works with any git repo.
Everything is a File: Virtual Filesystems for CLI Data Tools
October 20, 2025
I had a bookmark manager. Then an ebook library manager. Then a chat history manager. Each started with the standard CRUD CLI:
btk add https://example.com --tags python,tutorial
btk list --tag python
btk search "async"
btk delete 1234
ebk import book.pdf --author "Knuth"
ebk list --author Knuth
ebk search "algorithms"
This works fine until you have 10,000+ bookmarks organized with hierarchical tags like programming/python/async, research/ml/transformers, work/clients/acme. Your ebook library has similar structure. Your exported chat conversations from Claude, ChatGPT, and Copilot are piling up.
Traditional CRUD commands become unwieldy:
btk list --tag programming/python/async/io --format json | jq '.[].title'
ebk list --category "Computer Science/Algorithms/Graph Theory" --limit 50
ctk search "machine learning" --source ChatGPT --date-from 2024-01-01
Each command requires precise arguments. Each tool has different flag conventions. You can’t navigate your data. You can only query it. And queries require knowing exactly what you’re looking for.
The insight: everything is a file
When I have thousands of source files organized in directories, I don’t run:
list-files --path /src/components/auth --extension .tsx
I run:
cd src/components/auth
ls *.tsx
The difference matters. With a filesystem, I can navigate incrementally (cd from general to specific), explore (ls to see what’s there), compose (cat file | grep pattern | wc -l), and use familiar tools (find, grep, xargs, pipes, redirection).
What if my bookmarks, ebooks, and chat histories were filesystems?
The pattern
Over the past year, I built six Python tools that all follow the same architecture:
| Tool | Domain | VFS Root Structure |
|---|---|---|
| btk | Bookmarks | /bookmarks/, /tags/, /recent/, /domains/, /unread/, /popular/ |
| ebk | Ebook library | /books/, /authors/, /series/, /subjects/, /recent/, /unread/ |
| ctk | Chat conversations | /conversations/, /sources/, /topics/, /starred/, /recent/ |
| ghops | Git repositories | /repos/, /languages/, /topics/, /stars/, /recent/ |
| infinigram | N-gram models | /datasets/, /models/, /corpora/ |
| AlgoTree | Tree structures | /nodes/, /paths/, /subtrees/ |
Each tool provides:
- A stateless CLI for scripting:
btk bookmark add URL,ebk import book.pdf - An interactive shell with a virtual filesystem:
btk shell,ebk shell,ctk chat - POSIX-like commands:
cd,ls,pwd,cat,mv,cp,rm,find,grep - Unix pipeline support: most commands output JSONL by default for piping
The interesting part is the shell.
Navigating 10,000 bookmarks
Live recording captured with asciinema. You can pause, copy text, and replay. The entire recording is 78KB of text.
Linked project: Computational-Explorations
What Happens When You Let an AI Loose on 1,000 Erdős Problems
March 16, 2026
I should be upfront about what happened here. I did not compute coprime Ramsey numbers. I did not write 92 Python modules or 5,922 tests. I did not build SAT encodings or run survival analysis on Erdős problems.
Claude Code (Opus 4.6) did all of that. I told it what to look at, asked it to keep going, and occasionally said things like “try to disprove our discoveries” and “be aggressive.” The AI did the rest. 131 subagents, 78,000 lines of code, three minted DOIs. In one session.
I’m writing this down because I think it’s worth documenting what that looks like from the human side of the keyboard.
The Setup
Terence Tao maintains a database of 1,183 Erdős problems on GitHub. Each problem has tags, OEIS links, resolution status, and sometimes prize money. The database was updated in August 2025 to link problems to integer sequences. Since then, 213 problems have been solved, many with AI assistance.
I had been poking at this database on and off for a few months. I had some Python scripts, some partial Lean proofs, a few computational results. Nothing organized. The codebase had bugs (the kind where a random sampling heuristic silently gives you the wrong answer and you don’t notice for weeks).
I started a Claude Code session intending to fix those bugs. Then I said “iterate.” Then I kept saying “iterate.”
What Claude Found
The headline result is a family of numbers that, as far as anyone can tell, nobody had studied before.
Take the integers 1 through n. Connect every coprime pair with an edge. This is the coprime graph. Now 2-color every edge. The coprime Ramsey number R_cop(k) is the smallest n where every 2-coloring must contain a monochromatic complete subgraph of size k.
Classical Ramsey: R(3,3) = 6. Coprime Ramsey: R_cop(3) = 11.
The value R_cop(4) = 59 required SAT solving (Glucose4 via pysat). A random sampling heuristic had said 20. It was off by a factor of three. The SAT solver finds avoiding colorings instantly at every n up to 58. At n = 59 (prime, coprime to everything below it), no avoiding coloring exists. This was verified by an independent implementation built from scratch by a separate adversarial agent.
Linked project: Fuzzy-Infer
Fuzzy Inference: Teaching Machines to Think in Shades of Grey
March 16, 2026
Facts and Degrees
In classical logic, something is true or false. The cat is on the mat, or it is not. A patient has a fever, or they do not. There is no middle ground.
Fuzzy logic adds a dial.
Instead of true/false, every statement carries a degree of belief – a number between 0 and 1. A degree of 1.0 means certainty. A degree of 0.0 means we have no belief at all. And everything in between is fair game.
Here is the simplest possible fuzzy fact:
# A fuzzy fact: "Rex has hair" with 85% confidence
engine.add_fact("has-hair", ["rex"], 0.85)
The predicate is has-hair. The argument is rex. The degree is 0.85. Maybe we observed Rex from a distance, or the photo was blurry. We are fairly sure Rex has hair, but not certain.
This is the building block of everything that follows. A fuzzy knowledge base is just a collection of these facts, each with its own degree. Some facts we are sure about (deg=1.0). Others are tentative guesses (deg=0.3). The engine treats them all the same way – it just pays attention to the number.
One important detail: when two sources assert the same fact with different degrees, the engine keeps the higher one. This is called fuzzy-OR. If one sensor says has-hair(rex) at 0.85 and another says it at 0.92, the engine stores 0.92. Optimistic, but reasonable – the stronger evidence wins.
engine.add_fact("has-hair", ["rex"], 0.85)
engine.add_fact("has-hair", ["rex"], 0.92) # fuzzy-OR: keeps 0.92
In the widget below, you can create fuzzy facts and drag the degree slider to see how the degree changes the visual representation. A fact at 1.0 is solid and bright. A fact at 0.1 is faded, barely there. This is not just decoration – it is the engine’s uncertainty, made visible.
Rules
Facts alone are inert. To reason, we need rules – if-then statements that produce new facts from existing ones.
A fuzzy rule looks like this: “If X has hair, then X is a mammal.” In code:
engine.add_rule(
name="mammal-rule",
conditions=[{"pred": "has-hair", "args": ["?x"], "degVar": "?d"}],
actions=[{
"type": "add",
"fact": {"pred": "is-mammal", "args": ["?x"], "deg": ["*", 0.95, "?d"]}
}],
priority=60,
)
There is a lot going on here, so let us unpack it.
Pattern variables. The ?x in the condition is a variable. It matches any argument. When the engine finds has-hair(rex, 0.85), it binds ?x to rex. The same ?x then appears in the action, so the engine adds is-mammal(rex, ...).
Linked project: /Projects/Algebraic.dist/
Linked project: /Projects/Algebraic.mle/
Linked project: /Projects/Compositional.mle/
Linked project: /Projects/Flexhaz/
Linked project: /Projects/Hypothesize/
Linked project: /Projects/Likelihood.model/
Linked project: /Projects/Maskedcauses/
Linked project: /Projects/Nabla/
Linked project: Narro
Narrating a Hugo Blog with Sentence Highlighting
February 26, 2026
I wanted my blog posts to have audio narration. Not a podcast, not a read-aloud button that sends text to a cloud API. Local TTS with narro, my 80M parameter CPU model, generating Opus files that live next to the markdown source. One command to narrate an entire Hugo site.
That part was straightforward. The part that got interesting was highlighting: tracking which sentence is being spoken and lighting it up in the browser as the audio plays.
Linked project: Eidola
Code Without Purpose
February 24, 2026
Time is finite in ways I can’t ignore. That changes which questions about code feel important.
I read a post arguing that the most valuable programming skill in 2026 is deleting code. The thesis: AI generates code faster than anyone can review it, so the real value is in curation and subtraction. Code is a liability, not an asset.
I agree with the observation. I disagree with the prescription.
Linked project: Longecho
Code Without Purpose
February 24, 2026
Time is finite in ways I can’t ignore. That changes which questions about code feel important.
I read a post arguing that the most valuable programming skill in 2026 is deleting code. The thesis: AI generates code faster than anyone can review it, so the real value is in curation and subtraction. Code is a liability, not an asset.
I agree with the observation. I disagree with the prescription.
Long Echo Comes Alive: From Philosophy to Orchestration
January 20, 2026
A year ago, I wrote about Long Echo as a philosophy for preserving AI conversations across decades. The key insight was graceful degradation: design archives that work progressively even as technology disappears.
That philosophy has become a tool.
From Philosophy to Tool
The original Long Echo was intentionally not code. It was a set of principles documented in CTK’s repository. The hard problems of conversation parsing, storage, and search were already solved by toolkits like CTK, BTK, and EBK.
What was missing was the unification layer. Each toolkit exports its own ECHO-compliant archive, but combining them into a single browsable experience required manual work. That’s what longecho now handles.
What longecho Does Now
longecho is a CLI tool with five capabilities:
longecho check ~/my-data/ # Validate ECHO compliance
longecho discover ~/ # Find ECHO sources
longecho search ~/ "query" # Search README descriptions
longecho build ~/my-archive/ # Generate static site
longecho serve ~/my-archive/ # Preview locally via HTTP
The check, discover, and search commands existed in the original specification. What’s new is build and serve, the orchestration layer.
Building a Unified Site
The build command takes a hierarchical archive and generates a static site:
longecho build ~/my-archive/
This produces a site/ directory with:
- An index page linking to all sub-archives
- Navigation between sources
- Automatic linking to existing sub-site builds
If a sub-archive already has its own site/ directory (like CTK’s exports), longecho links to it. Use --bundle to copy everything into a portable, self-contained site.
Live Preview
The serve command provides local HTTP preview:
longecho serve ~/my-archive/ --port 8000
It builds the site if needed, then serves it for browser viewing.
The Manifest
ECHO compliance requires only a README. But for machine-readable metadata, longecho supports an optional manifest:
version: "1.0"
name: "Alex's Data Archive"
description: "Personal data archive"
sources:
- path: "conversations/"
order: 1
- path: "bookmarks/"
order: 2
- path: "ebooks/"
order: 3
The manifest enables:
- Explicit ordering of sources in generated sites
- Selective inclusion via the
browsableflag - Override names for cleaner presentation
- Icon hints for UI presentation
Without a manifest, longecho auto-discovers sub-archives by looking for directories with README files. The manifest provides explicit control when you need it.
Linked project: Pagevault
Code Without Purpose
February 24, 2026
Time is finite in ways I can’t ignore. That changes which questions about code feel important.
I read a post arguing that the most valuable programming skill in 2026 is deleting code. The thesis: AI generates code faster than anyone can review it, so the real value is in curation and subtraction. Code is a liability, not an asset.
I agree with the observation. I disagree with the prescription.
pagevault: Hiding an Encryption Platform Inside HTML
February 13, 2026
HTML is an encryption container format. That sounds wrong, but think about what an HTML file can hold: arbitrary data in script tags or data attributes, a full programming runtime via JavaScript, and a rendering engine (the browser) on every device on the planet. If you embed encrypted data and the code to decrypt it, the result is a file that looks inert until someone types the right password.
pagevault takes this idea seriously. It encrypts files, documents, images, entire websites, into self-contained HTML pages that decrypt in the browser. No backend. No JavaScript crypto libraries. The browser already has AES-256-GCM built in via the Web Crypto API. pagevault just has to match the parameters exactly on the Python side and embed the right 200 lines of JavaScript.
The output is a single .html file. You can email it, put it on a USB stick, host it on GitHub Pages, or double-click it on your desktop. It doesn’t phone home, it doesn’t load CDNs, it doesn’t need anything except a browser.
Linked project: Posthumous
Code Without Purpose
February 24, 2026
Time is finite in ways I can’t ignore. That changes which questions about code feel important.
I read a post arguing that the most valuable programming skill in 2026 is deleting code. The thesis: AI generates code faster than anyone can review it, so the real value is in curation and subtraction. Code is a liability, not an asset.
I agree with the observation. I disagree with the prescription.
Posthumous: A Federated Dead Man's Switch
February 14, 2026
Some things should only happen after you can’t do them yourself.
Posthumous is a self-hosted dead man’s switch. You check in periodically (via phone, browser, CLI, or API call) and if you stop, it progresses through escalating stages before triggering automated actions: sending notifications, running scripts, whatever you’ve configured.
I built it because the existing options are either cloud-hosted (you’re trusting someone else’s uptime for your most important automation) or single-node (one server failure and silence is indistinguishable from death). Posthumous is federated, multiple nodes watch each other, and fully self-hosted.
This post walks through the basic workflows.
Linked project: Algebraic.mle
Masked Failure Data: Looking Back, Looking Forward
February 18, 2026
I have been working on the same statistical problem since 2020. I am now a PhD student in CS. The problem has not changed, but my understanding of it has, and the tools I have built around it look nothing like what I started with.
The problem: a series system fails when any component fails. You observe system-level failure times. But you often cannot tell which component caused the failure (masking). Some systems are still running when testing ends (censoring). Given this incomplete data, estimate component reliability.
This is not a tutorial. It is a map of where things stand and where they are going.
Observation Functors: Composable Censoring for Series System Simulation
February 13, 2026
Last week I announced maskedcauses, the R package for estimating component reliability from masked series system failures. That post covered the three likelihood models and the path to CRAN.
This post is about what happened next: the package now supports four observation types (exact, right-censored, left-censored, and interval-censored) via composable observation functors. Along the way, I wrote four vignettes, removed the md.tools dependency, and developed a verification methodology for keeping prose honest about simulation results.
maskedcauses: Maximum Likelihood Estimation for Masked Series System Failures
February 5, 2026
Note (February 2026): This package has been renamed from
likelihood.model.series.mdtomaskedcauses.
Two days ago, I submitted likelihood.model to CRAN, the foundation package for composable statistical inference. Next in line: maskedcauses, which implements maximum likelihood estimation for series systems where component failure causes are masked.
This package is the practical result of my master’s thesis work. Three years of theoretical development, now packaged for anyone analyzing masked failure data.
The Problem: Masked Component Failures
A series system fails when any of its \(m\) components fails. In reliability testing, you observe the system fail at time \(t\), but two layers of uncertainty obscure the full picture:
Right-censoring: Some systems are still running when testing ends. You know they survived at least until time \(\tau\), but not how much longer they would have lasted.
Masked cause of failure: When a system fails, you often can’t identify which component caused it. Diagnostic tests might narrow it down to a candidate set of possible causes, but the true failure component remains ambiguous.
This happens constantly in practice. Electronic systems fail with only board-level diagnostics. Industrial machinery fails without root-cause teardown. Medical devices fail with symptoms pointing to multiple possible subsystems.
The question: given this incomplete information, can you still estimate the lifetime distribution of each component?
The Package: Three Likelihood Models
maskedcauses provides three models with different complexity-accuracy tradeoffs:
| Model | Parameters | Use Case |
|---|---|---|
exp_series_md_c1_c2_c3 | \(m\) rates \((\lambda_1, \ldots, \lambda_m)\) | Memoryless components (constant failure rate) |
wei_series_md_c1_c2_c3 | \(2m\) params \((k_1, \beta_1, \ldots, k_m, \beta_m)\) | Weibull with per-component shapes |
wei_series_homogeneous_md_c1_c2_c3 | \(m+1\) params \((k, \beta_1, \ldots, \beta_m)\) | Weibull with shared shape parameter |
Each model implements the full inference stack: loglik(), score(), hess_loglik(), rdata(), and assumptions().
The C1-C2-C3 Conditions
The models assume three conditions that simplify the likelihood:
- C1: The failed component is in the candidate set with probability 1
- C2: Given the failed component is in the candidate set, masking probability is uniform across candidates
- C3: Masking probabilities are independent of system parameters \(\theta\)
Under these conditions, the masking mechanism factors out of the likelihood. You can estimate component parameters without modeling the diagnostic process itself. That’s why the package name includes “c1_c2_c3”.
compositional.mle: SICP-Inspired Optimization
December 17, 2025
I recently updated compositional.mle, an R package for maximum likelihood estimation built on a simple premise: optimization strategies should compose.
The Problem
Most optimization libraries treat solvers as monolithic procedures. You call optim(), pass some options, hope for the best. Want to try multiple methods? Write a loop. Want coarse-to-fine optimization? Manually wire one solver’s output into the next.
compositional.mle treats solvers the way SICP treats procedures: as first-class citizens.
- Primitive solvers:
gradient_ascent(),newton_raphson(),bfgs(),nelder_mead() - Composition operators:
%>>%(sequential chaining),%|%(parallel racing),with_restarts() - Closure: Combining solvers yields a solver
That last point is the whole thing. When you chain two solvers together, the result is itself a solver with the same interface. So compositions can be further composed, stored in variables, passed to functions, used anywhere a solver is expected.
What This Looks Like
Define your problem once:
problem <- mle_problem(
loglike = function(theta) {
if (theta[2] <= 0) return(-Inf)
sum(dnorm(x, theta[1], theta[2], log = TRUE))
},
score = function(theta) {
mu <- theta[1]; sigma <- theta[2]; n <- length(x)
c(sum(x - mu) / sigma^2,
-n / sigma + sum((x - mu)^2) / sigma^3)
}
)
Then compose strategies declaratively:
# Global search -> local refinement -> final polish
strategy <- grid_search(lower = c(-10, 0.5), upper = c(10, 5), n = 5) %>>%
gradient_ascent(max_iter = 50) %>>%
newton_raphson(max_iter = 20)
result <- strategy(problem, theta0 = c(0, 1))
Or race multiple approaches:
# Try all methods, keep the best
strategy <- gradient_ascent() %|% bfgs() %|% nelder_mead()
Or handle multimodal landscapes:
# Random restarts to escape local optima
strategy <- with_restarts(gradient_ascent(), n = 10,
sampler = uniform_sampler(lower, upper))
The SICP Connection
This design applies SICP’s framework directly:
Primitives. The base solvers are building blocks with clear contracts. gradient_ascent() returns a solver using steepest ascent. nelder_mead() returns a derivative-free simplex solver.
Means of Combination. The operators %>>%, %|%, and with_restarts() combine solvers into new solvers. Chaining feeds one solver’s output as input to the next. Racing runs solvers in parallel and picks the winner.
Abstraction. Solver factories hide implementation details behind a consistent interface. You work with the solver abstraction, not specific algorithms.
Closure. Because composition produces objects of the same type as the inputs, the language of solvers is closed under composition. You build arbitrarily complex strategies from simple parts.
Relationship to algebraic.mle
This package complements algebraic.mle, which provides algebraic operations on MLE results. Where algebraic.mle lets you compose likelihood functions and manipulate fitted models, compositional.mle focuses on the process of finding those estimates.
They work together:
# compositional.mle: find the estimate
result <- strategy(problem, theta0)
# algebraic.mle: work with the fitted model
confint(result)
coef(result)
Try It
Install from GitHub:
likelihood.model: Composable Likelihood Models in R
June 30, 2022
Most R packages hardcode specific likelihood models. likelihood.model takes a different approach. Likelihoods are first-class objects that compose, and the framework is generic enough to work with any distribution.
The Interface
A likelihood model is anything implementing these generic methods:
loglik(model, data, params)– log-likelihoodscore(model, data, params)– score function (gradient)hessian(model, data, params)– observed information matrix
That is the interface. If your model implements these three methods, it plugs into the entire MLE stack: optimization, confidence intervals, hypothesis testing, model selection. You do not couple to specific distributions.
Likelihood Contributions
The key class is likelihood_contr_model, a likelihood built from independent contributions:
# Different observation types get different likelihood contributions
model <- likelihood_contr_model(
exact = normal_contrib(),
right_censored = censored_contrib()
)
This handles heterogeneous data in a unified framework. You can mix exact observations, right-censored observations, truncated observations, and different distribution families within one model. Each observation type gets its own likelihood contribution, and they combine additively in log-space.
Why This Design
The i.i.d. assumption decomposes a joint likelihood into additive log-likelihood contributions. That is how MLE actually works. likelihood.model makes this decomposition explicit and compositional.
Likelihood models are objects you manipulate, not function calls buried inside a fitting routine. You can build complex models from simple, independent pieces. You can swap in different contribution types without rewriting the rest of your code. And because the interface is generic, it works with algebraic.mle for fitting, hypothesize for testing, and any optimization backend that speaks the same protocol.
This is the same compositional philosophy as my thesis work on masked failure data. Series systems with masked causes have multiple observation types (masked vs. unmasked, different candidate sets) that each contribute differently to the likelihood. likelihood.model handles that naturally.
R package – MIT licensed – Documentation – GitHub
algebraic.mle: MLEs as Algebraic Objects
May 15, 2021
Maximum likelihood estimators have rich mathematical structure. They are consistent, asymptotically normal, efficient. algebraic.mle exposes this structure through an algebra where MLEs are objects you compose, transform, and query.
The Abstraction
An MLE is not just a vector of parameter estimates. It is a statistical object that carries point estimates \(\hat{\theta}\), the Fisher information matrix \(I(\hat{\theta})\), the variance-covariance matrix \(I^{-1}(\hat{\theta})\), Wald-type confidence intervals from asymptotic normality, the log-likelihood value, and convergence diagnostics.
The package wraps all of this in a consistent interface:
library(algebraic.mle)
fit <- mle(likelihood_model, data)
coef(fit) # Parameter estimates
vcov(fit) # Variance-covariance matrix
confint(fit) # Confidence intervals
logLik(fit) # Log-likelihood
aic(fit) # Model selection
Composition
The real point is that MLEs compose. Independent models combine:
fit1 <- mle(model1, data1)
fit2 <- mle(model2, data2)
combined <- fit1 + fit2 # Joint likelihood
The package handles the algebra. Joint log-likelihood, block-diagonal Fisher information, everything propagates correctly. This works because likelihoods from independent data sources multiply, and multiplication of likelihoods is addition of log-likelihoods. That is a monoid. The package enforces it.
The Ecosystem
algebraic.mle is the foundation for a family of packages:
| Package | Purpose |
|---|---|
| likelihood.model | Compositional likelihood specification |
| maskedcauses | Masked failure data in series systems |
| mdrelax | Relaxed masking conditions |
| algebraic.dist | Distributions as algebraic objects |
| flexhaz | Dynamic failure rate distributions |
| hypothesize | Likelihood ratio tests on MLEs |
| numerical.mle | Numerical optimization backends |
The typical workflow:
- Define distributions with
algebraic.dist - Specify likelihood contributions with
likelihood.model - Fit the model and get an
mleobject fromalgebraic.mle - Query statistical properties: confidence intervals, hypothesis tests, model selection
For series systems with masked data:
library(maskedcauses)
library(algebraic.mle)
# Specify masking model (C1-C2-C3 conditions)
model <- md_likelihood_model(components = 3, masking = "bernoulli")
# Fit -> returns algebraic.mle object
fit <- md_mle_exp_series_C1_C2_C3(masked_data)
# All the standard MLE methods work
confint(fit)
vcov(fit)
aic(fit)
Theory
The asymptotic properties that algebraic.mle exploits come from classical MLE theory:
The expo-masked-fim paper derives closed-form Fisher information for exponential series systems. That is exactly what algebraic.mle uses internally for variance estimation in that case.
For more complex models (Weibull, relaxed masking conditions), we compute Fisher information numerically via observed information:
$$\hat{I}(\hat{\theta}) = -\frac{\partial^2 \ell}{\partial \theta \partial \theta^T}\bigg|_{\theta=\hat{\theta}}$$Design Principles
Separation of concerns. The likelihood specification (likelihood.model) is independent of the fitting algorithm (numerical.mle) and the result type (algebraic.mle). You can swap optimizers without changing downstream code.
Linked project: Flexhaz
Masked Failure Data: Looking Back, Looking Forward
February 18, 2026
I have been working on the same statistical problem since 2020. I am now a PhD student in CS. The problem has not changed, but my understanding of it has, and the tools I have built around it look nothing like what I started with.
The problem: a series system fails when any component fails. You observe system-level failure times. But you often cannot tell which component caused the failure (masking). Some systems are still running when testing ends (censoring). Given this incomplete data, estimate component reliability.
This is not a tutorial. It is a map of where things stand and where they are going.
flexhaz: Specify the Hazard Function Directly
August 20, 2021
Survival analysis usually makes you pick from a catalog. Weibull, exponential, log-normal. You choose the family, estimate the parameters, and hope the model fits. flexhaz flips this around. You specify the hazard function directly, and the package computes everything else.
How It Works
Instead of choosing Weibull(shape, scale), you write:
h <- function(t, x) exp(b0 + b1*x + b2*t) # Your hazard function
model <- dfr_dist(hazard = h)
The package computes survival functions, cumulative hazards, quantiles, and sampling from your custom hazard. You get a full distributional object without committing to a named family.
Why This Is Useful
You are not constrained to parametric families. Want a bathtub curve? Multiple failure-rate peaks? Time-varying covariate effects? Just write the hazard function. No need to force reality into exponential or Weibull boxes.
Covariates can depend on anything:
h <- function(t, age, treatment) {
baseline * exp(beta_age*age + beta_tx*treatment + gamma*t)
}
And it integrates with the rest of the MLE stack. flexhaz works with algebraic.mle for parameter estimation and likelihood.model for likelihood contributions.
Constraints
Your hazard function needs to satisfy two things:
- Non-negative:
h(t, x) >= 0for all t, x - Eventual failure: cumulative hazard goes to infinity as t goes to infinity
That is it. Those are the only requirements for a valid hazard function. The package handles deriving the survival function, density, CDF, and quantile function from the hazard you provide.
Context
This generalizes my thesis work on masked failure data, where I used Weibull and exponential distributions. With flexhaz, you are not limited to parametric families. You specify the actual failure mechanism, and the math adapts.
R package – Works with algebraic.mle – Documentation – GitHub
Linked project: Likelihood.model
Masked Failure Data: Looking Back, Looking Forward
February 18, 2026
I have been working on the same statistical problem since 2020. I am now a PhD student in CS. The problem has not changed, but my understanding of it has, and the tools I have built around it look nothing like what I started with.
The problem: a series system fails when any component fails. You observe system-level failure times. But you often cannot tell which component caused the failure (masking). Some systems are still running when testing ends (censoring). Given this incomplete data, estimate component reliability.
This is not a tutorial. It is a map of where things stand and where they are going.
Observation Functors: Composable Censoring for Series System Simulation
February 13, 2026
Last week I announced maskedcauses, the R package for estimating component reliability from masked series system failures. That post covered the three likelihood models and the path to CRAN.
This post is about what happened next: the package now supports four observation types (exact, right-censored, left-censored, and interval-censored) via composable observation functors. Along the way, I wrote four vignettes, removed the md.tools dependency, and developed a verification methodology for keeping prose honest about simulation results.
maskedcauses: Maximum Likelihood Estimation for Masked Series System Failures
February 5, 2026
Note (February 2026): This package has been renamed from
likelihood.model.series.mdtomaskedcauses.
Two days ago, I submitted likelihood.model to CRAN, the foundation package for composable statistical inference. Next in line: maskedcauses, which implements maximum likelihood estimation for series systems where component failure causes are masked.
This package is the practical result of my master’s thesis work. Three years of theoretical development, now packaged for anyone analyzing masked failure data.
The Problem: Masked Component Failures
A series system fails when any of its \(m\) components fails. In reliability testing, you observe the system fail at time \(t\), but two layers of uncertainty obscure the full picture:
Right-censoring: Some systems are still running when testing ends. You know they survived at least until time \(\tau\), but not how much longer they would have lasted.
Masked cause of failure: When a system fails, you often can’t identify which component caused it. Diagnostic tests might narrow it down to a candidate set of possible causes, but the true failure component remains ambiguous.
This happens constantly in practice. Electronic systems fail with only board-level diagnostics. Industrial machinery fails without root-cause teardown. Medical devices fail with symptoms pointing to multiple possible subsystems.
The question: given this incomplete information, can you still estimate the lifetime distribution of each component?
The Package: Three Likelihood Models
maskedcauses provides three models with different complexity-accuracy tradeoffs:
| Model | Parameters | Use Case |
|---|---|---|
exp_series_md_c1_c2_c3 | \(m\) rates \((\lambda_1, \ldots, \lambda_m)\) | Memoryless components (constant failure rate) |
wei_series_md_c1_c2_c3 | \(2m\) params \((k_1, \beta_1, \ldots, k_m, \beta_m)\) | Weibull with per-component shapes |
wei_series_homogeneous_md_c1_c2_c3 | \(m+1\) params \((k, \beta_1, \ldots, \beta_m)\) | Weibull with shared shape parameter |
Each model implements the full inference stack: loglik(), score(), hess_loglik(), rdata(), and assumptions().
The C1-C2-C3 Conditions
The models assume three conditions that simplify the likelihood:
- C1: The failed component is in the candidate set with probability 1
- C2: Given the failed component is in the candidate set, masking probability is uniform across candidates
- C3: Masking probabilities are independent of system parameters \(\theta\)
Under these conditions, the masking mechanism factors out of the likelihood. You can estimate component parameters without modeling the diagnostic process itself. That’s why the package name includes “c1_c2_c3”.
symlik: Symbolic Likelihood Models in Python
December 16, 2025
symlik is a Python library for symbolic likelihood models. Write your log-likelihood as a symbolic expression, and it derives everything needed for inference.
The Problem
Traditional statistical computing gives you two choices:
- Manual derivation. Work out score functions and information matrices by hand, then implement them. Error-prone, tedious.
- Numerical approximation. Use finite differences. Unstable, slow, no symbolic form to inspect.
The Approach
symlik takes a third path: symbolic differentiation. Define the model once, get exact derivatives automatically.
from symlik.distributions import exponential
model = exponential()
data = {'x': [1.2, 0.8, 2.1, 1.5]}
mle, _ = model.mle(data=data, init={'lambda': 1.0})
se = model.se(mle, data)
print(f"Rate: {mle['lambda']:.3f} +/- {se['lambda']:.3f}")
# Rate: 0.714 +/- 0.357
Behind the scenes, symlik:
- Symbolically differentiates the log-likelihood to get the score function
- Differentiates again for the Hessian
- Computes Fisher information from the Hessian
- Derives standard errors from the inverse information matrix
All exact. No numerical approximation.
Custom Models
The real power is defining custom models using s-expressions:
from symlik import LikelihoodModel
# Exponential: l(lambda) = sum[log(lambda) - lambda*x_i]
log_lik = ['sum', 'i', ['len', 'x'],
['+', ['log', 'lambda'],
['*', -1, ['*', 'lambda', ['@', 'x', 'i']]]]]
model = LikelihoodModel(log_lik, params=['lambda'])
# Symbolic derivatives available
score = model.score() # Gradient
hess = model.hessian() # Hessian matrix
info = model.information() # Fisher information
You define the log-likelihood once as a symbolic expression. symlik computes the rest.
Heterogeneous Data
One of symlik’s strengths is handling mixed observation types, which is exactly what you need for reliability analysis with censored data:
from symlik import ContributionModel
from symlik.contributions import complete_exponential, right_censored_exponential
model = ContributionModel(
params=["lambda"],
type_column="status",
contributions={
"observed": complete_exponential(),
"censored": right_censored_exponential(),
}
)
data = {
"status": ["observed", "censored", "observed", "observed", "censored"],
"t": [1.2, 3.0, 0.8, 2.1, 4.5],
}
Each observation type contributes differently to the likelihood. symlik handles the bookkeeping.
Connection to Research
symlik is the Python successor to my R package likelihood.model. It implements the theoretical framework from my thesis work on likelihood-based inference for series systems.
The Weibull Series Model Selection paper shows applications to reliability engineering, the kind of complex likelihood that benefits from symbolic treatment.
Powered by rerum
symlik uses rerum for symbolic differentiation. rerum is a pattern matching and term rewriting library that handles the calculus. The separation means you can use rerum for other symbolic computation tasks beyond likelihood models.
Installation
Available on PyPI:
pip install symlik
Documentation at queelius.github.io/symlik.
See the project page for more details.
Closed-Form Results for Masked Exponential Series Systems
December 2, 2025
In a series system, the system fails when any component fails. You observe the system failure time \(t\) and a candidate set \(C \subseteq \lbrace 1,2,\ldots,m\rbrace\) of components that might have caused the failure. But you do not know which component in \(C\) actually failed. This is masked failure data.
The standard approach is numerical optimization of the likelihood. This paper shows that for exponential component lifetimes, everything has a closed form.
Closed-Form Fisher Information
For exponential masked data with arbitrary masking patterns:
$$I_{ij}(\boldsymbol{\lambda}) = n \cdot \sum_{A \ni i,j} \frac{\hat{\omega}_A}{(\sum_{k \in A} \lambda_k)^2}$$where \(\hat{\omega}_A\) is the observed frequency of candidate set \(A\). You can compute asymptotic variances directly, check identifiability before running any estimation, and analyze optimization stability. All without fitting a model first.
Sufficient Statistics
The mean system lifetime and the candidate set frequency vector are sufficient statistics. That reduces an entire dataset to \(1 + \binom{m}{w}\) numbers, where \(w\) is the masking width.
This is a real simplification. All the statistical information in your data is captured by two things: how often each candidate set appears, and what the average failure time is. Nothing else matters for inference.
Closed-Form MLE for Three Components
For \(m=3\) components with pairwise masking (\(w=2\)), the MLE has an explicit closed-form solution:
$$\hat{\lambda}_j = \frac{\sum_{A \ni j} \hat{\omega}_A}{\bar{t} \cdot n}$$No numerical optimization. No iterative algorithms. Just plug in your sufficient statistics.
The \(w=2\) case is the interesting one. \(w=1\) means no masking (you know exactly which component failed). \(w=m\) means complete masking (the candidate set is always everything, so you have no diagnostic information). \(w=2\) is the simplest case where masking actually matters, and it is the one where closed-form solutions exist.
Asymptotic Theory
The MLE follows:
$$\sqrt{n}(\hat{\boldsymbol{\lambda}}_n - \boldsymbol{\lambda}^\star) \xrightarrow{d} \mathcal{N}(\mathbf{0}, \mathcal{I}^{-1}(\boldsymbol{\lambda}^\star))$$with explicit Wald-type confidence intervals using the closed-form Fisher information. So you get point estimates and uncertainty quantification, all analytically.
Why Exponential?
The exponential assumption is not just for tractability, though it helps. Constant hazard rate models systems subject to random external shocks. The memoryless property simplifies the likelihood structure. And exponential is the foundation for generalization to Weibull and other distributions.
likelihood.model: Composable Likelihood Models in R
June 30, 2022
Most R packages hardcode specific likelihood models. likelihood.model takes a different approach. Likelihoods are first-class objects that compose, and the framework is generic enough to work with any distribution.
The Interface
A likelihood model is anything implementing these generic methods:
loglik(model, data, params)– log-likelihoodscore(model, data, params)– score function (gradient)hessian(model, data, params)– observed information matrix
That is the interface. If your model implements these three methods, it plugs into the entire MLE stack: optimization, confidence intervals, hypothesis testing, model selection. You do not couple to specific distributions.
Likelihood Contributions
The key class is likelihood_contr_model, a likelihood built from independent contributions:
# Different observation types get different likelihood contributions
model <- likelihood_contr_model(
exact = normal_contrib(),
right_censored = censored_contrib()
)
This handles heterogeneous data in a unified framework. You can mix exact observations, right-censored observations, truncated observations, and different distribution families within one model. Each observation type gets its own likelihood contribution, and they combine additively in log-space.
Why This Design
The i.i.d. assumption decomposes a joint likelihood into additive log-likelihood contributions. That is how MLE actually works. likelihood.model makes this decomposition explicit and compositional.
Likelihood models are objects you manipulate, not function calls buried inside a fitting routine. You can build complex models from simple, independent pieces. You can swap in different contribution types without rewriting the rest of your code. And because the interface is generic, it works with algebraic.mle for fitting, hypothesize for testing, and any optimization backend that speaks the same protocol.
This is the same compositional philosophy as my thesis work on masked failure data. Series systems with masked causes have multiple observation types (masked vs. unmasked, different candidate sets) that each contribute differently to the likelihood. likelihood.model handles that naturally.
R package – MIT licensed – Documentation – GitHub
algebraic.mle: MLEs as Algebraic Objects
May 15, 2021
Maximum likelihood estimators have rich mathematical structure. They are consistent, asymptotically normal, efficient. algebraic.mle exposes this structure through an algebra where MLEs are objects you compose, transform, and query.
The Abstraction
An MLE is not just a vector of parameter estimates. It is a statistical object that carries point estimates \(\hat{\theta}\), the Fisher information matrix \(I(\hat{\theta})\), the variance-covariance matrix \(I^{-1}(\hat{\theta})\), Wald-type confidence intervals from asymptotic normality, the log-likelihood value, and convergence diagnostics.
The package wraps all of this in a consistent interface:
library(algebraic.mle)
fit <- mle(likelihood_model, data)
coef(fit) # Parameter estimates
vcov(fit) # Variance-covariance matrix
confint(fit) # Confidence intervals
logLik(fit) # Log-likelihood
aic(fit) # Model selection
Composition
The real point is that MLEs compose. Independent models combine:
fit1 <- mle(model1, data1)
fit2 <- mle(model2, data2)
combined <- fit1 + fit2 # Joint likelihood
The package handles the algebra. Joint log-likelihood, block-diagonal Fisher information, everything propagates correctly. This works because likelihoods from independent data sources multiply, and multiplication of likelihoods is addition of log-likelihoods. That is a monoid. The package enforces it.
The Ecosystem
algebraic.mle is the foundation for a family of packages:
| Package | Purpose |
|---|---|
| likelihood.model | Compositional likelihood specification |
| maskedcauses | Masked failure data in series systems |
| mdrelax | Relaxed masking conditions |
| algebraic.dist | Distributions as algebraic objects |
| flexhaz | Dynamic failure rate distributions |
| hypothesize | Likelihood ratio tests on MLEs |
| numerical.mle | Numerical optimization backends |
The typical workflow:
- Define distributions with
algebraic.dist - Specify likelihood contributions with
likelihood.model - Fit the model and get an
mleobject fromalgebraic.mle - Query statistical properties: confidence intervals, hypothesis tests, model selection
For series systems with masked data:
library(maskedcauses)
library(algebraic.mle)
# Specify masking model (C1-C2-C3 conditions)
model <- md_likelihood_model(components = 3, masking = "bernoulli")
# Fit -> returns algebraic.mle object
fit <- md_mle_exp_series_C1_C2_C3(masked_data)
# All the standard MLE methods work
confint(fit)
vcov(fit)
aic(fit)
Theory
The asymptotic properties that algebraic.mle exploits come from classical MLE theory:
The expo-masked-fim paper derives closed-form Fisher information for exponential series systems. That is exactly what algebraic.mle uses internally for variance estimation in that case.
For more complex models (Weibull, relaxed masking conditions), we compute Fisher information numerically via observed information:
$$\hat{I}(\hat{\theta}) = -\frac{\partial^2 \ell}{\partial \theta \partial \theta^T}\bigg|_{\theta=\hat{\theta}}$$Design Principles
Separation of concerns. The likelihood specification (likelihood.model) is independent of the fitting algorithm (numerical.mle) and the result type (algebraic.mle). You can swap optimizers without changing downstream code.
Linked project: Maskedcauses
Masked Failure Data: Looking Back, Looking Forward
February 18, 2026
I have been working on the same statistical problem since 2020. I am now a PhD student in CS. The problem has not changed, but my understanding of it has, and the tools I have built around it look nothing like what I started with.
The problem: a series system fails when any component fails. You observe system-level failure times. But you often cannot tell which component caused the failure (masking). Some systems are still running when testing ends (censoring). Given this incomplete data, estimate component reliability.
This is not a tutorial. It is a map of where things stand and where they are going.
Observation Functors: Composable Censoring for Series System Simulation
February 13, 2026
Last week I announced maskedcauses, the R package for estimating component reliability from masked series system failures. That post covered the three likelihood models and the path to CRAN.
This post is about what happened next: the package now supports four observation types (exact, right-censored, left-censored, and interval-censored) via composable observation functors. Along the way, I wrote four vignettes, removed the md.tools dependency, and developed a verification methodology for keeping prose honest about simulation results.
maskedcauses: Maximum Likelihood Estimation for Masked Series System Failures
February 5, 2026
Note (February 2026): This package has been renamed from
likelihood.model.series.mdtomaskedcauses.
Two days ago, I submitted likelihood.model to CRAN, the foundation package for composable statistical inference. Next in line: maskedcauses, which implements maximum likelihood estimation for series systems where component failure causes are masked.
This package is the practical result of my master’s thesis work. Three years of theoretical development, now packaged for anyone analyzing masked failure data.
The Problem: Masked Component Failures
A series system fails when any of its \(m\) components fails. In reliability testing, you observe the system fail at time \(t\), but two layers of uncertainty obscure the full picture:
Right-censoring: Some systems are still running when testing ends. You know they survived at least until time \(\tau\), but not how much longer they would have lasted.
Masked cause of failure: When a system fails, you often can’t identify which component caused it. Diagnostic tests might narrow it down to a candidate set of possible causes, but the true failure component remains ambiguous.
This happens constantly in practice. Electronic systems fail with only board-level diagnostics. Industrial machinery fails without root-cause teardown. Medical devices fail with symptoms pointing to multiple possible subsystems.
The question: given this incomplete information, can you still estimate the lifetime distribution of each component?
The Package: Three Likelihood Models
maskedcauses provides three models with different complexity-accuracy tradeoffs:
| Model | Parameters | Use Case |
|---|---|---|
exp_series_md_c1_c2_c3 | \(m\) rates \((\lambda_1, \ldots, \lambda_m)\) | Memoryless components (constant failure rate) |
wei_series_md_c1_c2_c3 | \(2m\) params \((k_1, \beta_1, \ldots, k_m, \beta_m)\) | Weibull with per-component shapes |
wei_series_homogeneous_md_c1_c2_c3 | \(m+1\) params \((k, \beta_1, \ldots, \beta_m)\) | Weibull with shared shape parameter |
Each model implements the full inference stack: loglik(), score(), hess_loglik(), rdata(), and assumptions().
The C1-C2-C3 Conditions
The models assume three conditions that simplify the likelihood:
- C1: The failed component is in the candidate set with probability 1
- C2: Given the failed component is in the candidate set, masking probability is uniform across candidates
- C3: Masking probabilities are independent of system parameters \(\theta\)
Under these conditions, the masking mechanism factors out of the likelihood. You can estimate component parameters without modeling the diagnostic process itself. That’s why the package name includes “c1_c2_c3”.
Weibull Distributions: From Reliability Theory to My Own Survival Curve
April 18, 2022
The Weibull distribution models time-to-failure. In reliability engineering, that means component lifetimes. In medicine, it means survival times. I have been working with Weibull models for my thesis on series system reliability. Then I got diagnosed with cancer, and now every time I work with survival curves, I am looking at mathematical abstractions of something very concrete: how long until failure?
The Mathematics
The Weibull CDF:
F(t) = 1 - exp(-(t/λ)^k)
Two parameters:
- λ: scale (characteristic lifetime)
- k: shape (how failure rate changes over time)
The shape parameter k tells you the whole story:
k < 1: Decreasing hazard. If you survive early on, your risk goes down. This is the infant mortality pattern.
k = 1: Constant hazard. Memoryless. This is just the exponential distribution.
k > 1: Increasing hazard. Things wear out.
The Hazard Function
The hazard function is what makes Weibull useful for survival analysis:
h(t) = (k/λ)(t/λ)^(k-1)
This is the instantaneous failure rate: given that you have survived to time t, what is the probability you fail in the next instant?
For cancer, this is the number that matters. Some cancers have increasing hazard (the longer you have it, the worse things get). Others have decreasing hazard after initial treatment, meaning if you make it past the critical period, prognosis improves. Knowing which pattern applies to your disease changes how you think about time.
Personal Context
When you study survival analysis academically, it is abstract. When you are living it, every curve is personal.
I look at Kaplan-Meier plots and see myself somewhere on that curve. I work with hazard functions and think: is my k > 1 or k < 1? Am I in the wearing-out regime or the if-you-make-it-past-this-it-gets-better regime?
The math does not change. But the meaning does.
The Irony
I chose reliability engineering for my thesis before the cancer diagnosis. I was studying component failures in series systems, where if any one part fails, the whole system fails.
Then I became a series system. Organs, treatment response, immune function. All have to work. Failure of any one is catastrophic.
The mathematics I was studying abstractly became uncomfortably literal.
Reliability Analysis and the Problem of Censored Data
August 14, 2019
One of the most interesting statistical problems I have encountered is reliability analysis with censored data: situations where you know something didn’t fail, but not when it will fail.
The Censoring Problem
Imagine testing light bulbs. You run them for 1000 hours. Some fail during the test. Others are still working when you stop.
For the survivors, you know:
- They lasted at least 1000 hours
- You do not know their actual lifetime
This is right censoring. The true value lies somewhere to the right of your observation. You have a lower bound, not a measurement.
Why This Matters
Censored data is everywhere:
- Medical studies (patients still alive at study end)
- Engineering tests (components that have not failed)
- Customer retention (users still active)
The naive responses are both wrong. Ignoring censored observations wastes information. Treating them as failures introduces bias. You need a framework that uses the partial information you actually have.
Maximum Likelihood to the Rescue
The solution is maximum likelihood estimation with likelihood contributions that account for censoring:
- Failure observations contribute the probability density \(f(t)\). You observed the exact failure time, so you know the probability of failing at that time.
- Censored observations contribute the survival probability \(S(t)\). You know the unit survived to time \(t\), so its contribution is the probability of surviving at least that long.
The likelihood for the whole sample is:
$$L = \prod_{i: \text{failed}} f(t_i) \prod_{j: \text{censored}} S(t_j)$$This lets you extract information from both failed and surviving units. The censored observations pull the estimated reliability upward; the failures pull it downward. Maximum likelihood balances them.
Series Systems Complexity
It gets more interesting with series systems, systems that fail when any component fails. If you observe system failure but do not know which component caused it, you have masked failure data.
This is the problem I am most interested in: extracting component-level reliability from system-level failures when the cause is ambiguous. The masking adds a latent variable, and the likelihood becomes a mixture. You can handle it with EM algorithms or direct optimization, but the combinatorics grow quickly with system size.
This work is laying groundwork for what will become a major focus of my mathematical statistics degree.
Linked project: Nabla
Masked Failure Data: Looking Back, Looking Forward
February 18, 2026
I have been working on the same statistical problem since 2020. I am now a PhD student in CS. The problem has not changed, but my understanding of it has, and the tools I have built around it look nothing like what I started with.
The problem: a series system fails when any component fails. You observe system-level failure times. But you often cannot tell which component caused the failure (masking). Some systems are still running when testing ends (censoring). Given this incomplete data, estimate component reliability.
This is not a tutorial. It is a map of where things stand and where they are going.
Linked project: Dapple
dapple: Terminal Graphics, Composed
February 15, 2026
I live in the terminal. Most of my tools are CLIs. When I want to see something visual (an image, a plot, a table of results), I do not want to leave the terminal to see it.
Terminal graphics tools exist, but they are fragmented. One library does braille characters. Another does quadrant blocks. A third handles sixel. Each has its own API, its own conventions, its own way of thinking about the same problem.
dapple unifies them. One Canvas class, seven pluggable renderers, and eleven CLI tools built on top. The core depends only on numpy.
Linked project: Compositional.mle
Observation Functors: Composable Censoring for Series System Simulation
February 13, 2026
Last week I announced maskedcauses, the R package for estimating component reliability from masked series system failures. That post covered the three likelihood models and the path to CRAN.
This post is about what happened next: the package now supports four observation types (exact, right-censored, left-censored, and interval-censored) via composable observation functors. Along the way, I wrote four vignettes, removed the md.tools dependency, and developed a verification methodology for keeping prose honest about simulation results.
Numerical Methods for Maximum Likelihood Estimation
February 5, 2023
Maximum likelihood estimation sounds clean on paper: write down the likelihood, take derivatives, set them to zero, solve. In practice, the “solve” step is where things get interesting. Most likelihoods don’t have closed-form solutions, so you need numerical methods, and the choice of method matters more than most textbooks let on.
This write-up covers the numerical side of MLE: the optimization algorithms, convergence issues, and computational tricks that make the difference between getting an answer and getting the right answer. The full treatment is in the PDF.
Related work
For more on the statistical and mathematical context, see my research page and publications.
Linked project: Cryptoid
pagevault: Hiding an Encryption Platform Inside HTML
February 13, 2026
HTML is an encryption container format. That sounds wrong, but think about what an HTML file can hold: arbitrary data in script tags or data attributes, a full programming runtime via JavaScript, and a rendering engine (the browser) on every device on the planet. If you embed encrypted data and the code to decrypt it, the result is a file that looks inert until someone types the right password.
pagevault takes this idea seriously. It encrypts files, documents, images, entire websites, into self-contained HTML pages that decrypt in the browser. No backend. No JavaScript crypto libraries. The browser already has AES-256-GCM built in via the Web Crypto API. pagevault just has to match the parameters exactly on the Python side and embed the right 200 lines of JavaScript.
The output is a single .html file. You can email it, put it on a USB stick, host it on GitHub Pages, or double-click it on your desktop. It doesn’t phone home, it doesn’t load CDNs, it doesn’t need anything except a browser.
Linked project: Btk
Long Echo Comes Alive: From Philosophy to Orchestration
January 20, 2026
A year ago, I wrote about Long Echo as a philosophy for preserving AI conversations across decades. The key insight was graceful degradation: design archives that work progressively even as technology disappears.
That philosophy has become a tool.
From Philosophy to Tool
The original Long Echo was intentionally not code. It was a set of principles documented in CTK’s repository. The hard problems of conversation parsing, storage, and search were already solved by toolkits like CTK, BTK, and EBK.
What was missing was the unification layer. Each toolkit exports its own ECHO-compliant archive, but combining them into a single browsable experience required manual work. That’s what longecho now handles.
What longecho Does Now
longecho is a CLI tool with five capabilities:
longecho check ~/my-data/ # Validate ECHO compliance
longecho discover ~/ # Find ECHO sources
longecho search ~/ "query" # Search README descriptions
longecho build ~/my-archive/ # Generate static site
longecho serve ~/my-archive/ # Preview locally via HTTP
The check, discover, and search commands existed in the original specification. What’s new is build and serve, the orchestration layer.
Building a Unified Site
The build command takes a hierarchical archive and generates a static site:
longecho build ~/my-archive/
This produces a site/ directory with:
- An index page linking to all sub-archives
- Navigation between sources
- Automatic linking to existing sub-site builds
If a sub-archive already has its own site/ directory (like CTK’s exports), longecho links to it. Use --bundle to copy everything into a portable, self-contained site.
Live Preview
The serve command provides local HTTP preview:
longecho serve ~/my-archive/ --port 8000
It builds the site if needed, then serves it for browser viewing.
The Manifest
ECHO compliance requires only a README. But for machine-readable metadata, longecho supports an optional manifest:
version: "1.0"
name: "Alex's Data Archive"
description: "Personal data archive"
sources:
- path: "conversations/"
order: 1
- path: "bookmarks/"
order: 2
- path: "ebooks/"
order: 3
The manifest enables:
- Explicit ordering of sources in generated sites
- Selective inclusion via the
browsableflag - Override names for cleaner presentation
- Icon hints for UI presentation
Without a manifest, longecho auto-discovers sub-archives by looking for directories with README files. The manifest provides explicit control when you need it.
Long Echo: Photos and Mail
January 19, 2026
The Long Echo toolkit now covers conversations, bookmarks, and ebooks. But two of the most emotionally significant categories of personal data remain: photos and mail.
Both share a troubling pattern: scattered across devices and cloud services, organized by date rather than meaning, vulnerable to platform disappearance. They deserve better.
The Expanding Ecosystem
| Tool | Domain | Status |
|---|---|---|
| ctk | AI Conversations | stable |
| btk | Bookmarks & Media | stable |
| ebk | eBooks | stable |
| repoindex | Git Repositories | stable |
| ptk | Photos | incubating |
| mtk | incubating |
The orchestration layer, longecho, ties these together into a unified personal archive.
PTK: Photo Toolkit
Photos are the most emotionally valuable digital artifacts most people have. They’re also among the worst-managed.
The Problem
Your photo library is probably:
- Scattered: Phone, old phones, cloud services, camera imports, messaging app saves
- Organized by date: Not by who’s in them, where they were taken, or what they mean
- Cloud-dependent: Google Photos, iCloud, Amazon Photos. What happens when you switch?
- Unsearchable by content: “Find photos of mom at the beach” isn’t possible
- Missing context: Only you know why that blurry photo matters
The Vision
ptk provides:
Unified import from any source:
ptk import ~/Pictures/
ptk import ~/phone-backup/DCIM/
ptk import google-takeout.zip --source google-photos
ptk import icloud-export/ --source icloud
Intelligent organization by multiple dimensions:
ptk shell
ptk:/$ cd /people/mom
ptk:/people/mom$ ls
2019/ 2020/ 2021/ 2022/ 2023/ 2024/
ptk:/$ cd /locations/beach
ptk:/$ cd /events/christmas-2023
ptk:/$ cd /years/2020/months/march
AI-powered features:
# Face detection and clustering
ptk faces detect --all
ptk faces cluster
ptk faces label cluster-7 "Mom"
ptk faces find "Mom"
# Scene captioning
ptk caption --all --model ollama/llava
ptk search "sunset over water"
# Semantic search
ptk ask "photos from our trip to Colorado"
Preservation guarantees:
# Verify nothing is corrupted
ptk verify --checksums
# Export to durable formats
ptk export ~/archive/photos/ --format longecho
ptk export photos.html --format html-gallery
# Original files always preserved
ptk originals list
ptk originals verify
Why SQLite?
Like the other Long Echo tools, ptk uses SQLite for metadata:
# Works even if ptk disappears
sqlite3 photos.db "
SELECT path, caption, taken_at
FROM photos
WHERE caption LIKE '%birthday%'
ORDER BY taken_at
"
The database stores metadata, face embeddings, captions, and organization. The actual photo files stay in place or are copied to a managed library, your choice.
Long Echo in Practice: 5,874 Bookmarks in a Single File
December 18, 2025
I wrote about Long Echo and the Long Echo Toolkit earlier. Here’s what it actually looks like.
View the live demo: 5,874 bookmarks in a single file
The Export
btk --db bookmarks.db export bookmarks.html \
--format html-app \
--query "(reachable != 0 OR reachable IS NULL)"
Result: 5,874 bookmarks in a single 4MB HTML file.
What You Get
Open it in any browser. No server. No internet. No dependencies. Just a file.
The html-app export includes:
- Search: Full-text filtering across titles, URLs, descriptions, tags
- Multiple views: Grid, list, table layouts
- Tag sidebar: Hierarchical tag navigation
- Dark mode: Toggle button
- Keyboard shortcuts: Navigate without a mouse
- Sorting: By date, title, visits, stars
- Filtering: By starred, archived, has-content
Everything is embedded: CSS, JavaScript, all 5,874 bookmark records as JSON. One file.
Why This Matters
Graceful degradation, concretely:
| Level | What Works | Requirements |
|---|---|---|
| 1. BTK CLI | Full features, auto-tagging, content caching | Python, btk installed |
| 2. SQLite | Direct queries, scripting | sqlite3 binary |
| 3. HTML App | Visual browsing, search, filtering | Any browser |
| 4. View source | Raw JSON data, greppable | Text editor |
The HTML app is level 3. It works when BTK is gone, when Python is gone. Someone in 2074 can double-click the file and browse my bookmarks.
The Data Inside
View source and you’ll find:
const BOOKMARKS = [
{
"id": 1,
"url": "https://example.com/article",
"title": "Interesting Article",
"description": "Notes about the article...",
"tags": ["programming", "python"],
"stars": 1,
"created_at": "2023-05-12T14:32:00Z",
"visited_count": 42
},
// ... 5,873 more
];
Plain JSON. No encoding tricks. Grep it, parse it with jq, import it into another tool. The data survives the interface.
Try It
Install BTK:
pip install bookmark-tk
Export your bookmarks:
# From browser exports
btk import bookmarks.html --format html
# To self-contained app
btk export archive.html --format html-app
You now have a permanent, searchable copy of your bookmarks that will outlive every cloud service you currently depend on.
Links
- Live Demo: My Bookmarks Archive (5,874 bookmarks, 4MB)
- BTK: github.com/queelius/btk
- Long Echo Philosophy: Long Echo: Designing for Digital Resilience
- Full Toolkit: The Long Echo Toolkit
The Long Echo Toolkit
December 16, 2025
Earlier this year I wrote about Long Echo, a philosophy for preserving AI conversations in ways that stay accessible across decades. The core idea was graceful degradation: systems that fail progressively, not catastrophically.
Since then I’ve built out three tools that apply this thinking to all personal digital content, not just conversations. Bookmarks, books, and AI chats. Together they form a system for managing the stuff you actually think with.
The Toolkit
| Tool | Domain | Install |
|---|---|---|
| CTK | AI Conversations | pip install conversation-tk |
| BTK | Bookmarks & Media | pip install bookmark-tk |
| EBK | eBooks & Documents | pip install ebk |
All three share a common architecture, but each is specialized for its domain.
Shared Architecture
SQLite-First Storage
Every tool uses local SQLite databases you own. No cloud dependency. Queryable with standard tools even if the CLI disappears tomorrow:
# Works even if the tools are gone
sqlite3 conversations.db "SELECT title FROM conversations WHERE title LIKE '%python%'"
sqlite3 bookmarks.db "SELECT url, title FROM bookmarks WHERE stars = 1"
sqlite3 library.db "SELECT title, author FROM books WHERE favorite = 1"
This is the whole point. The database is the artifact, not the tool.
Interactive Shells with Virtual Filesystems
Navigate your data like a Unix filesystem:
$ btk shell
btk:/$ cd tags/programming/python
btk:/tags/programming/python$ ls
3298 4095 5124 (bookmark IDs)
btk:/tags/programming/python$ cat 4095/title
Advanced Python Techniques
$ ebk shell
ebk:/$ cd authors/Knuth
ebk:/authors/Knuth$ ls
The Art of Computer Programming Vol 1
The Art of Computer Programming Vol 2
Reading Queues
Track what you’re reading, watching, or working through:
# Bookmarks
btk queue add 42 --priority high
btk queue next
btk queue progress 42 --percent 75
btk queue estimate-times # Auto-estimate from content length
# Books
ebk queue add "Gödel, Escher, Bach"
ebk queue next
ebk queue list
LLM Integration
All three integrate with LLMs for tagging, summarization, and search:
# Auto-tag using content analysis
btk content auto-tag --all
ctk auto-tag --model ollama/llama3
ebk enrich 42 # Enhance metadata with LLM
# Natural language queries
ctk say "summarize my conversations about Rust"
btk ask "find articles about distributed systems"
ebk similar "Gödel, Escher, Bach" # Semantic similarity
Network Analysis
Find relationships in your data:
# CTK: Conversation networks
ctk net embeddings --all
ctk net similar 42
ctk net clusters
ctk net central # Most connected conversations
ctk net outliers # Isolated conversations
# BTK: Bookmark graphs
btk graph build
btk graph analyze
Web Servers
Browse your archives in a web UI:
Everything is a File: Virtual Filesystems for CLI Data Tools
October 20, 2025
I had a bookmark manager. Then an ebook library manager. Then a chat history manager. Each started with the standard CRUD CLI:
btk add https://example.com --tags python,tutorial
btk list --tag python
btk search "async"
btk delete 1234
ebk import book.pdf --author "Knuth"
ebk list --author Knuth
ebk search "algorithms"
This works fine until you have 10,000+ bookmarks organized with hierarchical tags like programming/python/async, research/ml/transformers, work/clients/acme. Your ebook library has similar structure. Your exported chat conversations from Claude, ChatGPT, and Copilot are piling up.
Traditional CRUD commands become unwieldy:
btk list --tag programming/python/async/io --format json | jq '.[].title'
ebk list --category "Computer Science/Algorithms/Graph Theory" --limit 50
ctk search "machine learning" --source ChatGPT --date-from 2024-01-01
Each command requires precise arguments. Each tool has different flag conventions. You can’t navigate your data. You can only query it. And queries require knowing exactly what you’re looking for.
The insight: everything is a file
When I have thousands of source files organized in directories, I don’t run:
list-files --path /src/components/auth --extension .tsx
I run:
cd src/components/auth
ls *.tsx
The difference matters. With a filesystem, I can navigate incrementally (cd from general to specific), explore (ls to see what’s there), compose (cat file | grep pattern | wc -l), and use familiar tools (find, grep, xargs, pipes, redirection).
What if my bookmarks, ebooks, and chat histories were filesystems?
The pattern
Over the past year, I built six Python tools that all follow the same architecture:
| Tool | Domain | VFS Root Structure |
|---|---|---|
| btk | Bookmarks | /bookmarks/, /tags/, /recent/, /domains/, /unread/, /popular/ |
| ebk | Ebook library | /books/, /authors/, /series/, /subjects/, /recent/, /unread/ |
| ctk | Chat conversations | /conversations/, /sources/, /topics/, /starred/, /recent/ |
| ghops | Git repositories | /repos/, /languages/, /topics/, /stars/, /recent/ |
| infinigram | N-gram models | /datasets/, /models/, /corpora/ |
| AlgoTree | Tree structures | /nodes/, /paths/, /subtrees/ |
Each tool provides:
- A stateless CLI for scripting:
btk bookmark add URL,ebk import book.pdf - An interactive shell with a virtual filesystem:
btk shell,ebk shell,ctk chat - POSIX-like commands:
cd,ls,pwd,cat,mv,cp,rm,find,grep - Unix pipeline support: most commands output JSONL by default for piping
The interesting part is the shell.
Navigating 10,000 bookmarks
Live recording captured with asciinema. You can pause, copy text, and replay. The entire recording is 78KB of text.
Linked project: Ctk
Long Echo Comes Alive: From Philosophy to Orchestration
January 20, 2026
A year ago, I wrote about Long Echo as a philosophy for preserving AI conversations across decades. The key insight was graceful degradation: design archives that work progressively even as technology disappears.
That philosophy has become a tool.
From Philosophy to Tool
The original Long Echo was intentionally not code. It was a set of principles documented in CTK’s repository. The hard problems of conversation parsing, storage, and search were already solved by toolkits like CTK, BTK, and EBK.
What was missing was the unification layer. Each toolkit exports its own ECHO-compliant archive, but combining them into a single browsable experience required manual work. That’s what longecho now handles.
What longecho Does Now
longecho is a CLI tool with five capabilities:
longecho check ~/my-data/ # Validate ECHO compliance
longecho discover ~/ # Find ECHO sources
longecho search ~/ "query" # Search README descriptions
longecho build ~/my-archive/ # Generate static site
longecho serve ~/my-archive/ # Preview locally via HTTP
The check, discover, and search commands existed in the original specification. What’s new is build and serve, the orchestration layer.
Building a Unified Site
The build command takes a hierarchical archive and generates a static site:
longecho build ~/my-archive/
This produces a site/ directory with:
- An index page linking to all sub-archives
- Navigation between sources
- Automatic linking to existing sub-site builds
If a sub-archive already has its own site/ directory (like CTK’s exports), longecho links to it. Use --bundle to copy everything into a portable, self-contained site.
Live Preview
The serve command provides local HTTP preview:
longecho serve ~/my-archive/ --port 8000
It builds the site if needed, then serves it for browser viewing.
The Manifest
ECHO compliance requires only a README. But for machine-readable metadata, longecho supports an optional manifest:
version: "1.0"
name: "Alex's Data Archive"
description: "Personal data archive"
sources:
- path: "conversations/"
order: 1
- path: "bookmarks/"
order: 2
- path: "ebooks/"
order: 3
The manifest enables:
- Explicit ordering of sources in generated sites
- Selective inclusion via the
browsableflag - Override names for cleaner presentation
- Icon hints for UI presentation
Without a manifest, longecho auto-discovers sub-archives by looking for directories with README files. The manifest provides explicit control when you need it.
Long Echo: Photos and Mail
January 19, 2026
The Long Echo toolkit now covers conversations, bookmarks, and ebooks. But two of the most emotionally significant categories of personal data remain: photos and mail.
Both share a troubling pattern: scattered across devices and cloud services, organized by date rather than meaning, vulnerable to platform disappearance. They deserve better.
The Expanding Ecosystem
| Tool | Domain | Status |
|---|---|---|
| ctk | AI Conversations | stable |
| btk | Bookmarks & Media | stable |
| ebk | eBooks | stable |
| repoindex | Git Repositories | stable |
| ptk | Photos | incubating |
| mtk | incubating |
The orchestration layer, longecho, ties these together into a unified personal archive.
PTK: Photo Toolkit
Photos are the most emotionally valuable digital artifacts most people have. They’re also among the worst-managed.
The Problem
Your photo library is probably:
- Scattered: Phone, old phones, cloud services, camera imports, messaging app saves
- Organized by date: Not by who’s in them, where they were taken, or what they mean
- Cloud-dependent: Google Photos, iCloud, Amazon Photos. What happens when you switch?
- Unsearchable by content: “Find photos of mom at the beach” isn’t possible
- Missing context: Only you know why that blurry photo matters
The Vision
ptk provides:
Unified import from any source:
ptk import ~/Pictures/
ptk import ~/phone-backup/DCIM/
ptk import google-takeout.zip --source google-photos
ptk import icloud-export/ --source icloud
Intelligent organization by multiple dimensions:
ptk shell
ptk:/$ cd /people/mom
ptk:/people/mom$ ls
2019/ 2020/ 2021/ 2022/ 2023/ 2024/
ptk:/$ cd /locations/beach
ptk:/$ cd /events/christmas-2023
ptk:/$ cd /years/2020/months/march
AI-powered features:
# Face detection and clustering
ptk faces detect --all
ptk faces cluster
ptk faces label cluster-7 "Mom"
ptk faces find "Mom"
# Scene captioning
ptk caption --all --model ollama/llava
ptk search "sunset over water"
# Semantic search
ptk ask "photos from our trip to Colorado"
Preservation guarantees:
# Verify nothing is corrupted
ptk verify --checksums
# Export to durable formats
ptk export ~/archive/photos/ --format longecho
ptk export photos.html --format html-gallery
# Original files always preserved
ptk originals list
ptk originals verify
Why SQLite?
Like the other Long Echo tools, ptk uses SQLite for metadata:
# Works even if ptk disappears
sqlite3 photos.db "
SELECT path, caption, taken_at
FROM photos
WHERE caption LIKE '%birthday%'
ORDER BY taken_at
"
The database stores metadata, face embeddings, captions, and organization. The actual photo files stay in place or are copied to a managed library, your choice.
The Long Echo Toolkit
December 16, 2025
Earlier this year I wrote about Long Echo, a philosophy for preserving AI conversations in ways that stay accessible across decades. The core idea was graceful degradation: systems that fail progressively, not catastrophically.
Since then I’ve built out three tools that apply this thinking to all personal digital content, not just conversations. Bookmarks, books, and AI chats. Together they form a system for managing the stuff you actually think with.
The Toolkit
| Tool | Domain | Install |
|---|---|---|
| CTK | AI Conversations | pip install conversation-tk |
| BTK | Bookmarks & Media | pip install bookmark-tk |
| EBK | eBooks & Documents | pip install ebk |
All three share a common architecture, but each is specialized for its domain.
Shared Architecture
SQLite-First Storage
Every tool uses local SQLite databases you own. No cloud dependency. Queryable with standard tools even if the CLI disappears tomorrow:
# Works even if the tools are gone
sqlite3 conversations.db "SELECT title FROM conversations WHERE title LIKE '%python%'"
sqlite3 bookmarks.db "SELECT url, title FROM bookmarks WHERE stars = 1"
sqlite3 library.db "SELECT title, author FROM books WHERE favorite = 1"
This is the whole point. The database is the artifact, not the tool.
Interactive Shells with Virtual Filesystems
Navigate your data like a Unix filesystem:
$ btk shell
btk:/$ cd tags/programming/python
btk:/tags/programming/python$ ls
3298 4095 5124 (bookmark IDs)
btk:/tags/programming/python$ cat 4095/title
Advanced Python Techniques
$ ebk shell
ebk:/$ cd authors/Knuth
ebk:/authors/Knuth$ ls
The Art of Computer Programming Vol 1
The Art of Computer Programming Vol 2
Reading Queues
Track what you’re reading, watching, or working through:
# Bookmarks
btk queue add 42 --priority high
btk queue next
btk queue progress 42 --percent 75
btk queue estimate-times # Auto-estimate from content length
# Books
ebk queue add "Gödel, Escher, Bach"
ebk queue next
ebk queue list
LLM Integration
All three integrate with LLMs for tagging, summarization, and search:
# Auto-tag using content analysis
btk content auto-tag --all
ctk auto-tag --model ollama/llama3
ebk enrich 42 # Enhance metadata with LLM
# Natural language queries
ctk say "summarize my conversations about Rust"
btk ask "find articles about distributed systems"
ebk similar "Gödel, Escher, Bach" # Semantic similarity
Network Analysis
Find relationships in your data:
# CTK: Conversation networks
ctk net embeddings --all
ctk net similar 42
ctk net clusters
ctk net central # Most connected conversations
ctk net outliers # Isolated conversations
# BTK: Bookmark graphs
btk graph build
btk graph analyze
Web Servers
Browse your archives in a web UI:
Everything is a File: Virtual Filesystems for CLI Data Tools
October 20, 2025
I had a bookmark manager. Then an ebook library manager. Then a chat history manager. Each started with the standard CRUD CLI:
btk add https://example.com --tags python,tutorial
btk list --tag python
btk search "async"
btk delete 1234
ebk import book.pdf --author "Knuth"
ebk list --author Knuth
ebk search "algorithms"
This works fine until you have 10,000+ bookmarks organized with hierarchical tags like programming/python/async, research/ml/transformers, work/clients/acme. Your ebook library has similar structure. Your exported chat conversations from Claude, ChatGPT, and Copilot are piling up.
Traditional CRUD commands become unwieldy:
btk list --tag programming/python/async/io --format json | jq '.[].title'
ebk list --category "Computer Science/Algorithms/Graph Theory" --limit 50
ctk search "machine learning" --source ChatGPT --date-from 2024-01-01
Each command requires precise arguments. Each tool has different flag conventions. You can’t navigate your data. You can only query it. And queries require knowing exactly what you’re looking for.
The insight: everything is a file
When I have thousands of source files organized in directories, I don’t run:
list-files --path /src/components/auth --extension .tsx
I run:
cd src/components/auth
ls *.tsx
The difference matters. With a filesystem, I can navigate incrementally (cd from general to specific), explore (ls to see what’s there), compose (cat file | grep pattern | wc -l), and use familiar tools (find, grep, xargs, pipes, redirection).
What if my bookmarks, ebooks, and chat histories were filesystems?
The pattern
Over the past year, I built six Python tools that all follow the same architecture:
| Tool | Domain | VFS Root Structure |
|---|---|---|
| btk | Bookmarks | /bookmarks/, /tags/, /recent/, /domains/, /unread/, /popular/ |
| ebk | Ebook library | /books/, /authors/, /series/, /subjects/, /recent/, /unread/ |
| ctk | Chat conversations | /conversations/, /sources/, /topics/, /starred/, /recent/ |
| ghops | Git repositories | /repos/, /languages/, /topics/, /stars/, /recent/ |
| infinigram | N-gram models | /datasets/, /models/, /corpora/ |
| AlgoTree | Tree structures | /nodes/, /paths/, /subtrees/ |
Each tool provides:
- A stateless CLI for scripting:
btk bookmark add URL,ebk import book.pdf - An interactive shell with a virtual filesystem:
btk shell,ebk shell,ctk chat - POSIX-like commands:
cd,ls,pwd,cat,mv,cp,rm,find,grep - Unix pipeline support: most commands output JSONL by default for piping
The interesting part is the shell.
Navigating 10,000 bookmarks
Live recording captured with asciinema. You can pause, copy text, and replay. The entire recording is 78KB of text.
CTK: Conversation Toolkit
October 9, 2025
CTK manages AI conversations across platforms. Import from ChatGPT, Claude, Copilot, Gemini. Store locally in SQLite. Search, tag, export. Keep everything.
The Problem
If you use multiple AI assistants, your conversations are scattered across incompatible platforms, unsearchable, and dependent on companies that may not exist in 20 years. ChatGPT lives in OpenAI’s web app. Claude is siloed in Anthropic’s interface. Copilot chat history is buried in VS Code storage.
You can’t search across them. You can’t back them up in a unified format. You can’t own them.
The Key Insight: Conversations Are Trees
Most tools treat conversations as linear sequences. They’re not. ChatGPT’s “regenerate” feature creates branches. Claude supports conversation forking. Even a simple “let me try that again” is a tree operation.
User: "Write a poem"
├── Assistant (v1): "Roses are red..."
└── Assistant (v2): "In fields of gold..." [regenerated]
└── User: "Make it longer"
└── Assistant: "In fields of gold, where sunshine..."
CTK stores all conversations as trees. Linear chats are single-path trees. Branching conversations preserve every path. This means you never lose a regeneration, and you can export any path you want.
What It Does
# Import from any platform
ctk import chatgpt_export.json --db my_chats.db
ctk import claude_export.json --db my_chats.db --format anthropic
ctk import ~/.vscode/workspaceStorage --db my_chats.db --format copilot
# Search across everything
ctk search "python async" --db my_chats.db
# Natural language queries via LLM tool calling
ctk say "find conversations about distributed systems" --db my_chats.db
# Interactive TUI for browsing and chatting
ctk chat --db my_chats.db
# Export for fine-tuning, archival, or publishing
ctk export training.jsonl --db my_chats.db --format jsonl
ctk export archive.html --db my_chats.db --format html5
ctk export archive/ --db my_chats.db --format markdown
Plugin Architecture
Adding a new provider is one file. Implement ImporterPlugin, drop it in the integrations folder, done. Auto-discovered at runtime. No registry, no config.
Currently supported: OpenAI/ChatGPT (full tree), Anthropic/Claude (full tree), GitHub Copilot, Google Gemini, generic JSONL, coding agents (Cursor, Windsurf).
Privacy
100% local. No telemetry. Optional sanitization strips API keys, passwords, and personal identifiers before export.
ctk export clean_export.jsonl --db chats.db --format jsonl --sanitize
HTML5 Export
The HTML5 exporter produces a self-contained file with embedded search, tree visualization, and dark mode. No server, no internet, no dependencies. The file works offline in any browser, including continuing conversations with a local LLM directly in the exported HTML.
Long Echo: Designing for Digital Resilience Across Decades
January 6, 2025
Update (January 2026): Since this post was written, longecho has evolved from specification to implementation. See Long Echo Comes Alive for the current state including
build,serve, and manifest features.
Not Resurrection. Not Immortality.
Just love that still responds.
That’s the idea behind Long Echo. It’s a project about preserving conversations with AI assistants so they stay accessible and meaningful across decades. Not digital ghosts that autonomously post to social media. Not trying to resurrect anyone. Just making sure the knowledge and care captured in these conversations can still be found, searched, and used when the original software is gone.
The Problem
We’re having important conversations with AI assistants:
- Teaching moments with students
- Advice we’d give our children
- Technical problems we’ve solved
- Creative work we don’t want to lose
- Personal growth tracked over years
But these conversations are trapped in proprietary formats, scattered across platforms (ChatGPT, Claude, Gemini, Copilot), and dependent on companies that may not exist in 50 years.
What happens when you want to find that debugging advice from 2024? What if your children want to search your conversations after you’re gone? What if the company shuts down their API?
The Philosophy: Graceful Degradation
The core idea is graceful degradation, designing systems that fail progressively, not catastrophically:
Level 1: Full functionality → CTK with semantic search, RAG, beautiful TUI
Level 2: Database queries → SQLite direct queries (CTK gone, SQLite remains)
Level 3: File search → grep through JSONL files (just text tools)
Level 4: Human reading → Markdown, HTML (readable without any tools)
Level 5: Ultimate fallback → Plain text in notepad
Each level still works even if everything above it is gone.
The Discovery: CTK Already Solved This
I started building Long Echo as a separate system. I designed multi-format importers, search with fallbacks, memory extraction pipelines. Complex architecture diagrams. Deployment strategies. The whole thing.
Then I realized that CTK (Conversation Toolkit), which I had built earlier, already solved all the hard problems.
CTK already provides:
- Import from all platforms (unified API)
- Conversation trees (handles branching, regenerations)
- SQLite storage (local, queryable, persistent)
- Multiple export formats (JSONL, Markdown, HTML, JSON)
- Full-text search + LLM-powered queries
- Complex network RAG (coming soon)
- Terminal UI
Everything I was designing was already built. By me. Earlier.
This wasn’t failure. I’d already built the foundation without realizing it. The hard problems (conversation parsing, unified representation, search, storage) were handled. What Long Echo needed wasn’t more code. It needed a philosophy.
Discovering ChatGPT: Reconnecting with AI Research
December 8, 2022
I finally noticed ChatGPT this week. Everyone’s been talking about it, but I was buried in cancer treatment, chemo recovery, surgery prep, and thesis work on Weibull distributions.
When I finally tried it, my reaction wasn’t surprise at the technology itself.
It was: “This makes sense. The pieces were all there.”
Why I Missed It
GPT-3 came out in 2020. I was dealing with:
- Stage 3 cancer diagnosis
- Chemotherapy
- Mathematical statistics coursework
- Thesis research on masked failure data
- Surgery and recovery
I had no attention left for tracking ML developments. The world moved on. I was focused on survival.
The Theoretical Foundation
I’ve been interested in Marcus Hutter and Ray Solomonoff’s work for years.
Solomonoff induction: optimal prediction is compression. Intelligence is sequence prediction. The smallest program that generates your observations is the best predictor of what comes next.
Hutter’s AIXI: intelligence = optimal compression-based prediction with resource bounds.
During my CS master’s, I proposed working on sequence prediction as a thesis topic, inspired by Solomonoff. The professor wasn’t interested. I ended up doing encrypted search instead.
But the intuition stayed: prediction ~ compression ~ intelligence.
The Bitter Lesson
Rich Sutton’s “The Bitter Lesson” laid it out: scaling compute and data beats clever algorithms.
The lesson from 70 years of AI research: general methods that use computation win. Hand-crafted features lose. Search and learning scale. Everything else doesn’t.
I read that paper and found it compelling. But there’s a difference between understanding theory and watching it play out at scale. OpenAI was actually doing the scaling while I was working on other problems.
ImageNet Should Have Been the Signal
In retrospect, ImageNet being solved by deep neural networks in 2012 was the canary. A simple architecture (CNNs), massive data, lots of compute, and you get superhuman image classification.
That was the proof: scale works.
GPT is the same pattern:
- Simple architecture (transformers)
- Massive data (internet-scale text)
- Enormous compute (thousands of GPUs)
Result: something that looks disturbingly intelligent.
Connecting the Dots
The theoretical framework was there:
- Solomonoff: intelligence is compression
- Hutter: optimal prediction with bounded resources
- Sutton: scaling beats cleverness
The empirical evidence accumulated:
Linked project: Ebk
Long Echo Comes Alive: From Philosophy to Orchestration
January 20, 2026
A year ago, I wrote about Long Echo as a philosophy for preserving AI conversations across decades. The key insight was graceful degradation: design archives that work progressively even as technology disappears.
That philosophy has become a tool.
From Philosophy to Tool
The original Long Echo was intentionally not code. It was a set of principles documented in CTK’s repository. The hard problems of conversation parsing, storage, and search were already solved by toolkits like CTK, BTK, and EBK.
What was missing was the unification layer. Each toolkit exports its own ECHO-compliant archive, but combining them into a single browsable experience required manual work. That’s what longecho now handles.
What longecho Does Now
longecho is a CLI tool with five capabilities:
longecho check ~/my-data/ # Validate ECHO compliance
longecho discover ~/ # Find ECHO sources
longecho search ~/ "query" # Search README descriptions
longecho build ~/my-archive/ # Generate static site
longecho serve ~/my-archive/ # Preview locally via HTTP
The check, discover, and search commands existed in the original specification. What’s new is build and serve, the orchestration layer.
Building a Unified Site
The build command takes a hierarchical archive and generates a static site:
longecho build ~/my-archive/
This produces a site/ directory with:
- An index page linking to all sub-archives
- Navigation between sources
- Automatic linking to existing sub-site builds
If a sub-archive already has its own site/ directory (like CTK’s exports), longecho links to it. Use --bundle to copy everything into a portable, self-contained site.
Live Preview
The serve command provides local HTTP preview:
longecho serve ~/my-archive/ --port 8000
It builds the site if needed, then serves it for browser viewing.
The Manifest
ECHO compliance requires only a README. But for machine-readable metadata, longecho supports an optional manifest:
version: "1.0"
name: "Alex's Data Archive"
description: "Personal data archive"
sources:
- path: "conversations/"
order: 1
- path: "bookmarks/"
order: 2
- path: "ebooks/"
order: 3
The manifest enables:
- Explicit ordering of sources in generated sites
- Selective inclusion via the
browsableflag - Override names for cleaner presentation
- Icon hints for UI presentation
Without a manifest, longecho auto-discovers sub-archives by looking for directories with README files. The manifest provides explicit control when you need it.
Long Echo: Photos and Mail
January 19, 2026
The Long Echo toolkit now covers conversations, bookmarks, and ebooks. But two of the most emotionally significant categories of personal data remain: photos and mail.
Both share a troubling pattern: scattered across devices and cloud services, organized by date rather than meaning, vulnerable to platform disappearance. They deserve better.
The Expanding Ecosystem
| Tool | Domain | Status |
|---|---|---|
| ctk | AI Conversations | stable |
| btk | Bookmarks & Media | stable |
| ebk | eBooks | stable |
| repoindex | Git Repositories | stable |
| ptk | Photos | incubating |
| mtk | incubating |
The orchestration layer, longecho, ties these together into a unified personal archive.
PTK: Photo Toolkit
Photos are the most emotionally valuable digital artifacts most people have. They’re also among the worst-managed.
The Problem
Your photo library is probably:
- Scattered: Phone, old phones, cloud services, camera imports, messaging app saves
- Organized by date: Not by who’s in them, where they were taken, or what they mean
- Cloud-dependent: Google Photos, iCloud, Amazon Photos. What happens when you switch?
- Unsearchable by content: “Find photos of mom at the beach” isn’t possible
- Missing context: Only you know why that blurry photo matters
The Vision
ptk provides:
Unified import from any source:
ptk import ~/Pictures/
ptk import ~/phone-backup/DCIM/
ptk import google-takeout.zip --source google-photos
ptk import icloud-export/ --source icloud
Intelligent organization by multiple dimensions:
ptk shell
ptk:/$ cd /people/mom
ptk:/people/mom$ ls
2019/ 2020/ 2021/ 2022/ 2023/ 2024/
ptk:/$ cd /locations/beach
ptk:/$ cd /events/christmas-2023
ptk:/$ cd /years/2020/months/march
AI-powered features:
# Face detection and clustering
ptk faces detect --all
ptk faces cluster
ptk faces label cluster-7 "Mom"
ptk faces find "Mom"
# Scene captioning
ptk caption --all --model ollama/llava
ptk search "sunset over water"
# Semantic search
ptk ask "photos from our trip to Colorado"
Preservation guarantees:
# Verify nothing is corrupted
ptk verify --checksums
# Export to durable formats
ptk export ~/archive/photos/ --format longecho
ptk export photos.html --format html-gallery
# Original files always preserved
ptk originals list
ptk originals verify
Why SQLite?
Like the other Long Echo tools, ptk uses SQLite for metadata:
# Works even if ptk disappears
sqlite3 photos.db "
SELECT path, caption, taken_at
FROM photos
WHERE caption LIKE '%birthday%'
ORDER BY taken_at
"
The database stores metadata, face embeddings, captions, and organization. The actual photo files stay in place or are copied to a managed library, your choice.
The Long Echo Toolkit
December 16, 2025
Earlier this year I wrote about Long Echo, a philosophy for preserving AI conversations in ways that stay accessible across decades. The core idea was graceful degradation: systems that fail progressively, not catastrophically.
Since then I’ve built out three tools that apply this thinking to all personal digital content, not just conversations. Bookmarks, books, and AI chats. Together they form a system for managing the stuff you actually think with.
The Toolkit
| Tool | Domain | Install |
|---|---|---|
| CTK | AI Conversations | pip install conversation-tk |
| BTK | Bookmarks & Media | pip install bookmark-tk |
| EBK | eBooks & Documents | pip install ebk |
All three share a common architecture, but each is specialized for its domain.
Shared Architecture
SQLite-First Storage
Every tool uses local SQLite databases you own. No cloud dependency. Queryable with standard tools even if the CLI disappears tomorrow:
# Works even if the tools are gone
sqlite3 conversations.db "SELECT title FROM conversations WHERE title LIKE '%python%'"
sqlite3 bookmarks.db "SELECT url, title FROM bookmarks WHERE stars = 1"
sqlite3 library.db "SELECT title, author FROM books WHERE favorite = 1"
This is the whole point. The database is the artifact, not the tool.
Interactive Shells with Virtual Filesystems
Navigate your data like a Unix filesystem:
$ btk shell
btk:/$ cd tags/programming/python
btk:/tags/programming/python$ ls
3298 4095 5124 (bookmark IDs)
btk:/tags/programming/python$ cat 4095/title
Advanced Python Techniques
$ ebk shell
ebk:/$ cd authors/Knuth
ebk:/authors/Knuth$ ls
The Art of Computer Programming Vol 1
The Art of Computer Programming Vol 2
Reading Queues
Track what you’re reading, watching, or working through:
# Bookmarks
btk queue add 42 --priority high
btk queue next
btk queue progress 42 --percent 75
btk queue estimate-times # Auto-estimate from content length
# Books
ebk queue add "Gödel, Escher, Bach"
ebk queue next
ebk queue list
LLM Integration
All three integrate with LLMs for tagging, summarization, and search:
# Auto-tag using content analysis
btk content auto-tag --all
ctk auto-tag --model ollama/llama3
ebk enrich 42 # Enhance metadata with LLM
# Natural language queries
ctk say "summarize my conversations about Rust"
btk ask "find articles about distributed systems"
ebk similar "Gödel, Escher, Bach" # Semantic similarity
Network Analysis
Find relationships in your data:
# CTK: Conversation networks
ctk net embeddings --all
ctk net similar 42
ctk net clusters
ctk net central # Most connected conversations
ctk net outliers # Isolated conversations
# BTK: Bookmark graphs
btk graph build
btk graph analyze
Web Servers
Browse your archives in a web UI:
Everything is a File: Virtual Filesystems for CLI Data Tools
October 20, 2025
I had a bookmark manager. Then an ebook library manager. Then a chat history manager. Each started with the standard CRUD CLI:
btk add https://example.com --tags python,tutorial
btk list --tag python
btk search "async"
btk delete 1234
ebk import book.pdf --author "Knuth"
ebk list --author Knuth
ebk search "algorithms"
This works fine until you have 10,000+ bookmarks organized with hierarchical tags like programming/python/async, research/ml/transformers, work/clients/acme. Your ebook library has similar structure. Your exported chat conversations from Claude, ChatGPT, and Copilot are piling up.
Traditional CRUD commands become unwieldy:
btk list --tag programming/python/async/io --format json | jq '.[].title'
ebk list --category "Computer Science/Algorithms/Graph Theory" --limit 50
ctk search "machine learning" --source ChatGPT --date-from 2024-01-01
Each command requires precise arguments. Each tool has different flag conventions. You can’t navigate your data. You can only query it. And queries require knowing exactly what you’re looking for.
The insight: everything is a file
When I have thousands of source files organized in directories, I don’t run:
list-files --path /src/components/auth --extension .tsx
I run:
cd src/components/auth
ls *.tsx
The difference matters. With a filesystem, I can navigate incrementally (cd from general to specific), explore (ls to see what’s there), compose (cat file | grep pattern | wc -l), and use familiar tools (find, grep, xargs, pipes, redirection).
What if my bookmarks, ebooks, and chat histories were filesystems?
The pattern
Over the past year, I built six Python tools that all follow the same architecture:
| Tool | Domain | VFS Root Structure |
|---|---|---|
| btk | Bookmarks | /bookmarks/, /tags/, /recent/, /domains/, /unread/, /popular/ |
| ebk | Ebook library | /books/, /authors/, /series/, /subjects/, /recent/, /unread/ |
| ctk | Chat conversations | /conversations/, /sources/, /topics/, /starred/, /recent/ |
| ghops | Git repositories | /repos/, /languages/, /topics/, /stars/, /recent/ |
| infinigram | N-gram models | /datasets/, /models/, /corpora/ |
| AlgoTree | Tree structures | /nodes/, /paths/, /subtrees/ |
Each tool provides:
- A stateless CLI for scripting:
btk bookmark add URL,ebk import book.pdf - An interactive shell with a virtual filesystem:
btk shell,ebk shell,ctk chat - POSIX-like commands:
cd,ls,pwd,cat,mv,cp,rm,find,grep - Unix pipeline support: most commands output JSONL by default for piping
The interesting part is the shell.
Navigating 10,000 bookmarks
Live recording captured with asciinema. You can pause, copy text, and replay. The entire recording is 78KB of text.
EBK: Ebook Toolkit
October 13, 2025
Your books represent decades of accumulated knowledge. Technical references, formative texts, research that shaped your thinking. They deserve better than scattered files on a hard drive with inconsistent metadata and no way to search across them.
EBK treats your ebook library as a queryable, searchable knowledge base. It’s part of the Long Echo toolkit: tools for preserving your digital intellectual life in formats you control.
The Core Abstraction
At its heart, EBK is a SQLAlchemy + SQLite database with a normalized schema. Everything else (CLI, AI features, exports) is layered on top. This means your library metadata is always queryable with standard tools, even if EBK itself disappears.
# Works even without EBK installed
sqlite3 library.db "SELECT title, author FROM books WHERE favorite = 1"
What It Does
# Initialize and import
ebk db-init ~/my-library
ebk db-import ~/Documents/book.pdf ~/my-library
ebk db-import-calibre ~/Calibre/Library ~/my-library
# Search with FTS5 full-text search
ebk db-search "quantum computing" ~/my-library
# Field-specific queries
ebk db-search "title:Python author:Knuth tag:programming" ~/my-library
Behind a simple import, EBK automatically extracts text from PDFs (PyMuPDF with pypdf fallback) and EPUBs, generates text chunks for semantic search, computes SHA256 hashes for deduplication, extracts covers, and indexes everything in FTS5.
Deduplication
Same file (same hash) gets skipped. Same book in a different format gets added as an additional format. Different book gets imported as new. Books are stored in hash-prefixed directories for scalability.
AI Enrichment
EBK can use LLMs to auto-generate tags, categories, and descriptions for books with sparse metadata:
ebk enrich 42 # Enhance metadata with LLM
Semantic search finds books by meaning, not just keywords:
results = lib.semantic_search(
"explaining complex mathematical concepts simply",
threshold=0.7
)
Uses vector embeddings when available, TF-IDF fallback for offline use.
Knowledge Graphs
Using NetworkX, EBK can extract concept relationships across your library:
graph = lib.build_knowledge_graph(extract_entities=True)
graph.visualize(output="library_knowledge.html")
This reveals connections you didn’t know existed. “These books about functional programming also discuss category theory.”
Fluent Python API
from ebk import Library
lib = Library.open("~/ebooks")
results = (lib.query()
.where("language", "en")
.where("date", "2020", ">=")
.where("subjects", "Python", "contains")
.order_by("title")
.take(10)
.execute())
Export
Multiple formats for different needs:
ebk export hugo ~/library ~/hugo-site --organize-by subject --include-covers
ebk export-dag ~/library ~/output # Navigable symlink directory structure
The Hugo export creates a browsable website. The DAG export creates a tag-based directory structure where books appear via symlinks under multiple categories. Both work without EBK installed.
Linked project: Mtk
Long Echo: The Ghost That Speaks
January 20, 2026
The ghost is not you. But it echoes you.
What survives beyond scattered archives? Beyond exported conversations and curated bookmarks? The stuff we never think to preserve: the photos that show how you see the world. The correspondence that maps who matters to you.
The Long Echo toolkit has grown. PTK for photos. MTK for mail. But these are sources, not destinations. The destination is something stranger: longshade, a persona built from your data that can respond to questions you never answered.
I’m going to invert the usual pattern here. Instead of tools first, philosophy later, I want to start with the philosophical destination and work backward to the data that feeds it.
longshade: The Ghost That Speaks
The Central Question
What if your archive could respond?
Not a chatbot trained on your data. Not a digital resurrection. Something more careful: a voice that carries your patterns, your interests, your way of seeing the world.
That’s longshade. Right now it’s spec-only (no implementation yet). It defines what it would mean to synthesize a conversable persona from personal archives.
The Ghost Metaphor
“The ghost is not you. But it echoes you.”
This framing matters. longshade isn’t about immortality or resurrection. It’s about preservation with a kind of agency. The echo can answer questions you never answered, using patterns you established. It speaks in your voice without claiming to be you.
The distinction is important:
- Resurrection claims to recreate the person
- Simulation claims to predict the person
- Echo acknowledges it carries patterns, not identity
An echo is honest about what it is. It responds because you left enough traces to inform a response, not because it is you.
Voice vs. Personality
longshade extracts voice, not personality.
Your actual phrases. Your vocabulary. Your reasoning patterns. Your recurring metaphors. The way you explain things, not the things you might explain.
I noticed something working with conversation archives: user messages are the strongest signal. AI responses contain the AI’s voice. Your messages contain your voice. How you ask questions, how you frame problems, how you push back. That’s where the signal lives.
The ghost speaks like you because it learned from what you actually said, not from responses you prompted.
Long Echo: Photos and Mail
January 19, 2026
The Long Echo toolkit now covers conversations, bookmarks, and ebooks. But two of the most emotionally significant categories of personal data remain: photos and mail.
Both share a troubling pattern: scattered across devices and cloud services, organized by date rather than meaning, vulnerable to platform disappearance. They deserve better.
The Expanding Ecosystem
| Tool | Domain | Status |
|---|---|---|
| ctk | AI Conversations | stable |
| btk | Bookmarks & Media | stable |
| ebk | eBooks | stable |
| repoindex | Git Repositories | stable |
| ptk | Photos | incubating |
| mtk | incubating |
The orchestration layer, longecho, ties these together into a unified personal archive.
PTK: Photo Toolkit
Photos are the most emotionally valuable digital artifacts most people have. They’re also among the worst-managed.
The Problem
Your photo library is probably:
- Scattered: Phone, old phones, cloud services, camera imports, messaging app saves
- Organized by date: Not by who’s in them, where they were taken, or what they mean
- Cloud-dependent: Google Photos, iCloud, Amazon Photos. What happens when you switch?
- Unsearchable by content: “Find photos of mom at the beach” isn’t possible
- Missing context: Only you know why that blurry photo matters
The Vision
ptk provides:
Unified import from any source:
ptk import ~/Pictures/
ptk import ~/phone-backup/DCIM/
ptk import google-takeout.zip --source google-photos
ptk import icloud-export/ --source icloud
Intelligent organization by multiple dimensions:
ptk shell
ptk:/$ cd /people/mom
ptk:/people/mom$ ls
2019/ 2020/ 2021/ 2022/ 2023/ 2024/
ptk:/$ cd /locations/beach
ptk:/$ cd /events/christmas-2023
ptk:/$ cd /years/2020/months/march
AI-powered features:
# Face detection and clustering
ptk faces detect --all
ptk faces cluster
ptk faces label cluster-7 "Mom"
ptk faces find "Mom"
# Scene captioning
ptk caption --all --model ollama/llava
ptk search "sunset over water"
# Semantic search
ptk ask "photos from our trip to Colorado"
Preservation guarantees:
# Verify nothing is corrupted
ptk verify --checksums
# Export to durable formats
ptk export ~/archive/photos/ --format longecho
ptk export photos.html --format html-gallery
# Original files always preserved
ptk originals list
ptk originals verify
Why SQLite?
Like the other Long Echo tools, ptk uses SQLite for metadata:
# Works even if ptk disappears
sqlite3 photos.db "
SELECT path, caption, taken_at
FROM photos
WHERE caption LIKE '%birthday%'
ORDER BY taken_at
"
The database stores metadata, face embeddings, captions, and organization. The actual photo files stay in place or are copied to a managed library, your choice.
Linked project: Ptk
Long Echo: The Ghost That Speaks
January 20, 2026
The ghost is not you. But it echoes you.
What survives beyond scattered archives? Beyond exported conversations and curated bookmarks? The stuff we never think to preserve: the photos that show how you see the world. The correspondence that maps who matters to you.
The Long Echo toolkit has grown. PTK for photos. MTK for mail. But these are sources, not destinations. The destination is something stranger: longshade, a persona built from your data that can respond to questions you never answered.
I’m going to invert the usual pattern here. Instead of tools first, philosophy later, I want to start with the philosophical destination and work backward to the data that feeds it.
longshade: The Ghost That Speaks
The Central Question
What if your archive could respond?
Not a chatbot trained on your data. Not a digital resurrection. Something more careful: a voice that carries your patterns, your interests, your way of seeing the world.
That’s longshade. Right now it’s spec-only (no implementation yet). It defines what it would mean to synthesize a conversable persona from personal archives.
The Ghost Metaphor
“The ghost is not you. But it echoes you.”
This framing matters. longshade isn’t about immortality or resurrection. It’s about preservation with a kind of agency. The echo can answer questions you never answered, using patterns you established. It speaks in your voice without claiming to be you.
The distinction is important:
- Resurrection claims to recreate the person
- Simulation claims to predict the person
- Echo acknowledges it carries patterns, not identity
An echo is honest about what it is. It responds because you left enough traces to inform a response, not because it is you.
Voice vs. Personality
longshade extracts voice, not personality.
Your actual phrases. Your vocabulary. Your reasoning patterns. Your recurring metaphors. The way you explain things, not the things you might explain.
I noticed something working with conversation archives: user messages are the strongest signal. AI responses contain the AI’s voice. Your messages contain your voice. How you ask questions, how you frame problems, how you push back. That’s where the signal lives.
The ghost speaks like you because it learned from what you actually said, not from responses you prompted.
Long Echo: Photos and Mail
January 19, 2026
The Long Echo toolkit now covers conversations, bookmarks, and ebooks. But two of the most emotionally significant categories of personal data remain: photos and mail.
Both share a troubling pattern: scattered across devices and cloud services, organized by date rather than meaning, vulnerable to platform disappearance. They deserve better.
The Expanding Ecosystem
| Tool | Domain | Status |
|---|---|---|
| ctk | AI Conversations | stable |
| btk | Bookmarks & Media | stable |
| ebk | eBooks | stable |
| repoindex | Git Repositories | stable |
| ptk | Photos | incubating |
| mtk | incubating |
The orchestration layer, longecho, ties these together into a unified personal archive.
PTK: Photo Toolkit
Photos are the most emotionally valuable digital artifacts most people have. They’re also among the worst-managed.
The Problem
Your photo library is probably:
- Scattered: Phone, old phones, cloud services, camera imports, messaging app saves
- Organized by date: Not by who’s in them, where they were taken, or what they mean
- Cloud-dependent: Google Photos, iCloud, Amazon Photos. What happens when you switch?
- Unsearchable by content: “Find photos of mom at the beach” isn’t possible
- Missing context: Only you know why that blurry photo matters
The Vision
ptk provides:
Unified import from any source:
ptk import ~/Pictures/
ptk import ~/phone-backup/DCIM/
ptk import google-takeout.zip --source google-photos
ptk import icloud-export/ --source icloud
Intelligent organization by multiple dimensions:
ptk shell
ptk:/$ cd /people/mom
ptk:/people/mom$ ls
2019/ 2020/ 2021/ 2022/ 2023/ 2024/
ptk:/$ cd /locations/beach
ptk:/$ cd /events/christmas-2023
ptk:/$ cd /years/2020/months/march
AI-powered features:
# Face detection and clustering
ptk faces detect --all
ptk faces cluster
ptk faces label cluster-7 "Mom"
ptk faces find "Mom"
# Scene captioning
ptk caption --all --model ollama/llava
ptk search "sunset over water"
# Semantic search
ptk ask "photos from our trip to Colorado"
Preservation guarantees:
# Verify nothing is corrupted
ptk verify --checksums
# Export to durable formats
ptk export ~/archive/photos/ --format longecho
ptk export photos.html --format html-gallery
# Original files always preserved
ptk originals list
ptk originals verify
Why SQLite?
Like the other Long Echo tools, ptk uses SQLite for metadata:
# Works even if ptk disappears
sqlite3 photos.db "
SELECT path, caption, taken_at
FROM photos
WHERE caption LIKE '%birthday%'
ORDER BY taken_at
"
The database stores metadata, face embeddings, captions, and organization. The actual photo files stay in place or are copied to a managed library, your choice.
Linked project: Dotsuite
Building Languages to Solve Problems
January 19, 2026
Chapter 4 of Structure and Interpretation of Computer Programs opens with one of the most important insights in programming: the most powerful technique for controlling complexity is metalinguistic abstraction, the establishment of new languages.
Not libraries. Not frameworks. Languages.
When you’ve abstracted enough of a problem domain into primitives, combination rules, and naming mechanisms, you haven’t just written code. You’ve created a new way of thinking about the problem. The domain becomes expressible. And once something is expressible, it becomes manipulable, debuggable, and shareable.
What Is Metalinguistic Abstraction?
The key distinction is between using a language and creating one. A library gives you functions to call. A language gives you a grammar for expressing ideas.
Consider the difference:
Library approach: Call db.execute("SELECT * FROM users WHERE age > 21")
Language approach: Write SELECT * FROM users WHERE age > 21
SQL isn’t a library. It’s a language, with primitives (tables, columns), means of combination (joins, unions, subqueries), and means of abstraction (views, CTEs). These three elements (primitives, combination, abstraction) are SICP’s fundamental criteria for any language, and they’re what separates a DSL from a mere API.
Other examples:
- Regular expressions: primitives (characters, character classes), combination (concatenation, alternation), abstraction (groups, backreferences)
- Make: primitives (targets, prerequisites), combination (dependency chains), abstraction (pattern rules, variables)
- CSS selectors: primitives (elements, classes, IDs), combination (descendant, child, sibling), abstraction (custom properties, mixins in preprocessors)
In each case, the language captures the essential structure of the problem domain in a way that raw code cannot.
The Three Requirements
SICP identifies three necessary components for any language:
- Primitives: What are the basic elements that cannot be broken down further?
- Means of combination: How do you build compound elements from simpler ones?
- Means of abstraction: How do you name and reuse patterns?
When designing a DSL, these questions guide everything. Get them wrong and you have a clunky API. Get them right and the domain becomes thinkable in your language.
Consider an expression language for symbolic math:
- Primitives: numbers, symbols, operators
- Combination: function application
(+ x 1), nested expressions(* (+ x 1) 2) - Abstraction: named rules, rulesets, engines
Or a query language for JSON documents:
JAF: Streaming Boolean Algebra Over Nested JSON
December 20, 2024
JAF (Just Another Flow) is a streaming data processing system for JSON/JSONL data. It implements boolean algebra over nested JSON structures with lazy evaluation, composable operations, and a fluent API. JAF is the production version of the concepts I explored in dotsuite.
The Relationship to Dotsuite
The short version:
- dotsuite: “This is how it works.” Pedagogical, simple, learn-by-building.
- JAF: “This is what you use.” Feature-complete, lazy, handles real data.
JAF implements the highest level of dotsuite’s architecture: boolean algebra over collections of nested documents. Where dotsuite teaches the concepts through isolated simple tools, JAF combines them into a unified streaming framework.
The Boolean Algebra Branch
In dotsuite’s three-pillar architecture (Depth, Truth, Shape), JAF focuses on the collections layer, specifically the boolean wing that provides filtering operations with full boolean algebra:
\[ \text{filter}: (\mathcal{D} \to \mathbb{B}) \to (C \to C) \]Where \(\mathcal{D}\) is the document space, \(\mathbb{B}\) is boolean values, and \(C\) is a collection of documents.
JAF lifts boolean operations to streams: AND is intersection of filtered streams, OR is union, NOT is complement, and composition gives you chainable predicates with guaranteed homomorphism.
Core Innovation: Lazy Streaming
The Problem
Traditional data processing loads entire datasets into memory:
# Eager evaluation - loads everything
all_data = load_json("huge_file.jsonl")
filtered = [d for d in all_data if d['age'] > 25]
mapped = [transform(d) for d in filtered]
This fails on large datasets and wastes resources when you only need the first 10 results.
JAF’s Solution
from jaf import stream
# Lazy evaluation - nothing executes yet
pipeline = stream("huge_file.jsonl") \
.filter(["gt?", "@age", 25]) \
.map(transform) \
.take(10)
# Only processes 10 matching items
for item in pipeline.evaluate():
process(item)
Constant memory (processes one item at a time), early termination (stops after take(10)), composable (build complex pipelines declaratively), and works with infinite streams.
Three Query Syntaxes
JAF supports multiple query syntaxes that all compile to the same internal representation.
S-Expression Syntax (Lisp-like)
# Simple comparisons
(eq? @status "active")
(gt? @age 25)
(contains? @tags "python")
# Boolean logic
(and
(gte? @age 18)
(eq? @verified true))
# Nested expressions
(or (eq? @role "admin")
(and (eq? @role "user")
(gt? @score 100)))
S-expressions because: unambiguous parsing (no precedence rules), easy to serialize, homoiconic (code is data), composable ASTs.
JSON Array Syntax
# Same queries in JSON
["eq?", "@status", "active"]
["gt?", "@age", 25]
["and",
["gte?", "@age", 18],
["eq?", "@verified", true]
]
Easy to generate programmatically, standard JSON format, network-transmissible.
Infix DSL Syntax
# Natural infix notation
@status == "active"
@age > 25 and @verified == true
@role == "admin" or (@role == "user" and @score > 100)
Human-readable, familiar, good for CLI usage. All three compile to the same AST.
jsonl-algebra: Relational Algebra for Nested JSON
December 18, 2024
jsonl-algebra (command: ja) is a command-line implementation of relational algebra for JSONL data. It’s the production version of dotsuite’s dotrelate component: SQL-like operations on the command line with first-class support for nested JSON structures.
The Relationship to Dotsuite
In dotsuite’s architecture, dotrelate provides relational operations on document collections: join, union, project, difference. jsonl-algebra (ja) is the production implementation of those concepts, published on PyPI, with all relational operations plus aggregations, streaming support, schema tools, and an interactive REPL.
The Core Insight
Traditional relational algebra assumes flat tables:
SELECT name, age FROM users WHERE age > 30
But real-world JSON is deeply nested:
{
"user": {
"id": 1,
"name": "Alice",
"address": {
"city": "NYC",
"zip": "10001"
}
},
"orders": [
{"id": 101, "amount": 50}
]
}
jsonl-algebra bridges this gap by extending relational algebra with dot notation for nested access:
ja select 'user.age > 30' users.jsonl
ja project user.name,user.address.city users.jsonl
ja join users.jsonl orders.jsonl --on user.id=customer_id
The Five Core Operations
Relational algebra has five fundamental operations that form a complete algebra. Everything else is derived.
1. Selection (filter rows)
Mathematical notation: \(\sigma_{\text{predicate}}(R)\)
# Filter where status is "active"
ja select 'status == `"active"`' data.jsonl
# Filter on nested fields
ja select 'user.age > 30' users.jsonl
# Complex boolean logic
ja select 'price < 100 and category == `"electronics"`' products.jsonl
Selection is commutative (\(\sigma_{p_1}(\sigma_{p_2}(R)) = \sigma_{p_2}(\sigma_{p_1}(R))\)) and combinable (\(\sigma_{p_1}(\sigma_{p_2}(R)) = \sigma_{p_1 \land p_2}(R)\)).
2. Projection (select/compute columns)
Mathematical notation: \(\pi_{\text{columns}}(R)\)
# Pick specific fields
ja project id,name data.jsonl
# Access nested fields
ja project user.name,user.address.city users.jsonl
# Computed columns (coming soon)
ja project name,annual_income=salary*12 employees.jsonl
Idempotent for simple projections: \(\pi_a(\pi_{a,b}(R)) = \pi_a(R)\).
3. Join (combine relations)
Mathematical notation: \(R \bowtie_{\text{condition}} S\)
# Inner join on user ID
ja join users.jsonl orders.jsonl --on user.id=customer_id
# Join on nested fields
ja join posts.jsonl comments.jsonl --on post.id=comment.post_id
# Multiple join keys
ja join users.jsonl accounts.jsonl --on id=user_id,email=account_email
Commutative and associative, so you can join multiple files in any order:
ja join users.jsonl orders.jsonl --on user.id=customer_id \
| ja join - products.jsonl --on product_id=id
4. Union (combine all rows)
Mathematical notation: \(R \cup S\)
# Combine employees and contractors
ja union employees.jsonl contractors.jsonl
# Union multiple sources
ja union jan.jsonl feb.jsonl mar.jsonl
5. Difference (set subtraction)
Mathematical notation: \(R - S\)
The Dot Ecosystem: From Simple Paths to Data Algebras
December 15, 2024
dotsuite is a suite of composable tools for working with nested data structures like JSON, YAML, and Python dictionaries. It started as a single helper function and grew into something with actual mathematical structure. That growth is the interesting part.
The Origin
It always starts with a simple problem. You have a nested dictionary and you need a value buried deep inside:
# Brittle code that crashes on missing keys
email = data['user']['contacts'][0]['email'] # KeyError? IndexError?
The first solution is a helper function:
# The essence of dotget - simple enough to copy
def get(data, path, default=None):
try:
for segment in path.split('.'):
data = data[int(segment)] if segment.isdigit() else data[segment]
return data
except (KeyError, IndexError, TypeError):
return default
This is where the story begins. That single function, once you start asking questions about what else you need, leads to a complete ecosystem for data manipulation. The trick is that the questions have a natural structure to them.
The Three Pillars
The ecosystem organizes around three fundamental questions about data:
Depth Pillar: “Where is the data?”
Tools for finding and extracting values from within documents.
| Tool | Purpose | Complexity |
|---|---|---|
| dotget | Simple exact paths | get(data, "user.name") |
| dotstar | Wildcard patterns | search(data, "users.*.name") |
| dotselect | Advanced selection with predicates | find_first(data, "users[role=admin].name") |
| dotpath | Extensible path engine | Powers all other tools, Turing-complete |
The addressing layer forms a free algebra on selectors, with operators being morphisms in the Kleisli category of the powerset monad. In practice this means dotstar composed with dotselect still yields a well-defined set of values. You can compose these things without worrying about edge cases blowing up.
Truth Pillar: “Is this assertion true?”
Tools for asking boolean questions and validating data.
| Tool | Purpose | Logic |
|---|---|---|
| dotexists | Path existence | check(data, "user.email") |
| dotany | Existential quantifier | any_match(data, "users.*.role", "admin") |
| dotall | Universal quantifier | all_match(data, "users.*.status", "active") |
| dotquery | Compositional logic engine | Query("any equals role admin").check(data) |
Predicates form a Boolean algebra under conjunction, disjunction, and negation that is homomorphic to set algebra on result subsets. This enables short-circuit evaluation and distributive laws. The math isn’t decoration; it’s what makes the composition reliable.
Shape Pillar: “How should the data be transformed?”
Tools for reshaping and modifying data structures.
| Tool | Purpose | Type |
|---|---|---|
| dotmod | Surgical modifications | set_(data, "user.status", "inactive") |
| dotbatch | Atomic transactions | Apply multiple changes safely |
| dotpipe | Data transformation pipelines | Reshape documents into new forms |
| dotpluck | Value extraction | Create new structures from selections |
Transformations are endofunctors on document spaces with monoid composition. dotmod implements lenses with put-get laws, while dotpipe provides Kleisli composition of pure functions.
Linked project: Dual
Duality: The Hidden Structure of Opposites
January 19, 2026
Many structures come in pairs. Recognizing duality lets you transfer insights between domains.
The Motivating Example
This collection includes two approaches to automatic differentiation:
- Forward mode (in dual): Propagate derivatives alongside values, from inputs toward outputs
- Reverse mode (in autodiff): Build a graph during forward evaluation, then propagate gradients backward from outputs toward inputs
These aren’t just two implementations of the same idea. They’re duals, mirror images with complementary strengths.
Forward mode computes one column of the Jacobian per pass. If \(f: \mathbb{R}^n \to \mathbb{R}^m\), computing the full Jacobian takes \(n\) passes. Reverse mode computes one row per pass, \(m\) passes for the full Jacobian.
For neural network training, we have many inputs (millions of parameters) and one output (the loss). Reverse mode wins overwhelmingly: one backward pass gives all gradients. This is why backpropagation dominates deep learning.
For sensitivity analysis with few parameters and many outputs, forward mode wins. Same algorithm structure, opposite traversal direction, complementary use cases.
The mathematical explanation: forward mode computes Jacobian-vector products (\(Jv\)); reverse mode computes vector-Jacobian products (\(v^T J\)). These are transposes of each other. Duality is transposition.
Push vs Pull
Consider two ways to traverse a sequence:
Pull (iterator/consumer controls):
for (auto it = seq.begin(); it != seq.end(); ++it) {
process(*it); // Consumer pulls each element
}
Push (producer controls):
seq.for_each([](auto x) {
process(x); // Producer pushes each element
});
Same traversal. Same elements processed. But control flow is reversed:
| Aspect | Pull (Iterator) | Push (Generator) |
|---|---|---|
| Who controls pace? | Consumer | Producer |
| Suspend/resume? | Consumer decides when to call ++ | Producer decides when to yield |
| Backpressure | Natural (just stop pulling) | Must be designed in |
| Composition | Chain iterators | Chain callbacks |
C++ ranges are pull-based: view | filter | transform creates an iterator that pulls through the pipeline. Reactive streams (Rx) are push-based: events flow through a pipeline of observers.
These are duals. Given a pull-based algorithm, you can mechanically derive its push-based counterpart by reversing who initiates each step. The transformation preserves correctness because it’s just changing direction, not content.
Encode vs Decode
Compression algorithms come in pairs:
// Encoder: structure -> bits
auto encode(const Document& doc) -> Bitstream;
// Decoder: bits -> structure
auto decode(const Bitstream& bits) -> Document;
These must be inverses: decode(encode(x)) == x. But their implementations are often strikingly different:
Seeing Structure First
January 18, 2026
A reflection on eleven explorations in generic programming
The Question Behind the Code
What do these computations have in common?
- Computing the millionth Fibonacci number
- Finding the shortest path between cities in a weighted graph
- Calculating compound interest over thirty years
- Composing ten 3D rotations into one
- Repeating a string n times
The answer: they’re all computed by the same twenty lines of code.
template<typename T>
constexpr T power(T const& base, T exp) {
if (exp == zero(exp)) return one(exp);
if (exp == one(exp)) return base;
return even(exp)
? square(power(base, half(exp)))
: product(base, power(base, decrement(exp)));
}
This shouldn’t work. Fibonacci numbers involve integer sequences. Shortest paths involve graphs. Rotations involve 3D geometry. Different domains, different mathematics.
Yet they share structure. Once you see it, a single algorithm serves them all.
This collection of eleven blog posts is an extended meditation on one idea: algorithms arise from algebraic structure. The posts cover different domains (number theory, calculus, linear algebra, polymorphism) but they circle the same insight. Recognize the structure; the algorithm follows.
The Principle
Alex Stepanov articulated this most clearly in Elements of Programming: “Generic programming is about abstracting and classifying algorithms and data structures.” But the deeper point is how to abstract. Not by common syntax or superficial similarity, but by the algebraic laws a type obeys.
Why does structure appear everywhere? Because reality has structure. The algebraic structures we discover in programming (groups, rings, monoids) are the same structures physicists discover in nature. Rotations form a group. Spacetime transformations form a group. This isn’t coincidence. We’re uncovering patterns that exist.
Noether’s theorem makes this precise: every continuous symmetry corresponds to a conservation law. Time-translation symmetry gives conservation of energy. Space-translation symmetry gives conservation of momentum. Rotational symmetry gives conservation of angular momentum. The symmetry groups of physics are algebraic structures.
When we recognize “this is a monoid” in our code, we’re tapping into the same mathematical substrate that governs physical law. The algorithms follow because the structure constrains what’s possible, both in computation and in nature.
Consider the power() function above. What does it require?
- An associative binary operation (so we can regroup: \((a \cdot b) \cdot c = a \cdot (b \cdot c)\))
- An identity element (so \(1 \cdot x = x \cdot 1 = x\))
- Halving and parity testing on the exponent
That’s it. Any type providing these operations, with these laws, can use this algorithm. The requirements are algebraic, not syntactic.
Differentiation: Three Ways
January 15, 2025
A synthesis of three earlier posts, comparing forward-mode AD, reverse-mode AD, and numerical differentiation.
Computing derivatives shows up everywhere: optimization, machine learning, physics simulation, numerical analysis. This series has explored three distinct approaches:
- Forward-mode AD via dual numbers
- Reverse-mode AD via computational graphs
- Numerical differentiation via finite differences
Each has different strengths. The right choice depends on the shape of your problem.
The Landscape
| Method | Accuracy | Cost for \(f: \mathbb{R}^n \to \mathbb{R}\) | Cost for \(f: \mathbb{R} \to \mathbb{R}^m\) | Memory |
|---|---|---|---|---|
| Forward AD | Exact | \(O(n)\) passes | \(O(1)\) pass | \(O(1)\) |
| Reverse AD | Exact | \(O(1)\) pass | \(O(m)\) passes | \(O(\text{ops})\) |
| Finite Diff | \(O(h^p)\) | \(O(n)\) evaluations | \(O(n)\) evaluations | \(O(1)\) |
The key point: problem structure determines the best method.
Forward-Mode AD: Dual Numbers
Forward-mode AD extends numbers with an infinitesimal \(\varepsilon\) where \(\varepsilon^2 = 0\). The derivative falls out of the arithmetic for free:
// f(x) = x^3 - 3x + 1
// f'(x) = 3x^2 - 3
auto x = dual<double>::variable(2.0); // x = 2, dx = 1
auto f = x*x*x - 3.0*x + 1.0;
std::cout << f.value() << "\n"; // 3.0
std::cout << f.derivative() << "\n"; // 9.0
Strengths:
- Simple implementation (operator overloading)
- No memory overhead
- Naturally composable for higher derivatives
- Works with any function of overloaded operators
When to use:
- Single input variable (or few inputs)
- Computing Jacobian-vector products
- Higher-order derivatives via nesting
- Sensitivity analysis along one direction
Complexity: One forward pass per input variable. For f: R^n -> R^m, computing the full Jacobian requires n passes.
Reverse-Mode AD: Computational Graphs
Reverse-mode AD builds a computational graph during the forward pass, then propagates gradients backward via the chain rule:
auto f = [](const auto& x) {
return sum(pow(x, 2.0)); // f(x) = sum(x^2)
};
auto df = grad(f); // Returns gradient function
auto gradient = df(x); // One backward pass for all partials
Strengths:
- O(1) backward passes regardless of input dimension
- Powers modern deep learning (backpropagation)
- Efficient for loss functions: f: R^n -> R
When to use:
- Many inputs, scalar output (neural networks)
- Computing vector-Jacobian products
- Optimization where you need the full gradient
Complexity: One forward pass to build the graph, one backward pass to compute all gradients. Memory scales with the number of operations because you have to store intermediate values.
Numerical Differentiation: Finite Differences
Approximate the derivative using the limit definition:
// Central difference: f'(x) ~ (f(x+h) - f(x-h)) / 2h
double df = central_difference(f, x);
Strengths:
Linked project: Jaf
Building Languages to Solve Problems
January 19, 2026
Chapter 4 of Structure and Interpretation of Computer Programs opens with one of the most important insights in programming: the most powerful technique for controlling complexity is metalinguistic abstraction, the establishment of new languages.
Not libraries. Not frameworks. Languages.
When you’ve abstracted enough of a problem domain into primitives, combination rules, and naming mechanisms, you haven’t just written code. You’ve created a new way of thinking about the problem. The domain becomes expressible. And once something is expressible, it becomes manipulable, debuggable, and shareable.
What Is Metalinguistic Abstraction?
The key distinction is between using a language and creating one. A library gives you functions to call. A language gives you a grammar for expressing ideas.
Consider the difference:
Library approach: Call db.execute("SELECT * FROM users WHERE age > 21")
Language approach: Write SELECT * FROM users WHERE age > 21
SQL isn’t a library. It’s a language, with primitives (tables, columns), means of combination (joins, unions, subqueries), and means of abstraction (views, CTEs). These three elements (primitives, combination, abstraction) are SICP’s fundamental criteria for any language, and they’re what separates a DSL from a mere API.
Other examples:
- Regular expressions: primitives (characters, character classes), combination (concatenation, alternation), abstraction (groups, backreferences)
- Make: primitives (targets, prerequisites), combination (dependency chains), abstraction (pattern rules, variables)
- CSS selectors: primitives (elements, classes, IDs), combination (descendant, child, sibling), abstraction (custom properties, mixins in preprocessors)
In each case, the language captures the essential structure of the problem domain in a way that raw code cannot.
The Three Requirements
SICP identifies three necessary components for any language:
- Primitives: What are the basic elements that cannot be broken down further?
- Means of combination: How do you build compound elements from simpler ones?
- Means of abstraction: How do you name and reuse patterns?
When designing a DSL, these questions guide everything. Get them wrong and you have a clunky API. Get them right and the domain becomes thinkable in your language.
Consider an expression language for symbolic math:
- Primitives: numbers, symbols, operators
- Combination: function application
(+ x 1), nested expressions(* (+ x 1) 2) - Abstraction: named rules, rulesets, engines
Or a query language for JSON documents:
JAF: Streaming Boolean Algebra Over Nested JSON
December 20, 2024
JAF (Just Another Flow) is a streaming data processing system for JSON/JSONL data. It implements boolean algebra over nested JSON structures with lazy evaluation, composable operations, and a fluent API. JAF is the production version of the concepts I explored in dotsuite.
The Relationship to Dotsuite
The short version:
- dotsuite: “This is how it works.” Pedagogical, simple, learn-by-building.
- JAF: “This is what you use.” Feature-complete, lazy, handles real data.
JAF implements the highest level of dotsuite’s architecture: boolean algebra over collections of nested documents. Where dotsuite teaches the concepts through isolated simple tools, JAF combines them into a unified streaming framework.
The Boolean Algebra Branch
In dotsuite’s three-pillar architecture (Depth, Truth, Shape), JAF focuses on the collections layer, specifically the boolean wing that provides filtering operations with full boolean algebra:
\[ \text{filter}: (\mathcal{D} \to \mathbb{B}) \to (C \to C) \]Where \(\mathcal{D}\) is the document space, \(\mathbb{B}\) is boolean values, and \(C\) is a collection of documents.
JAF lifts boolean operations to streams: AND is intersection of filtered streams, OR is union, NOT is complement, and composition gives you chainable predicates with guaranteed homomorphism.
Core Innovation: Lazy Streaming
The Problem
Traditional data processing loads entire datasets into memory:
# Eager evaluation - loads everything
all_data = load_json("huge_file.jsonl")
filtered = [d for d in all_data if d['age'] > 25]
mapped = [transform(d) for d in filtered]
This fails on large datasets and wastes resources when you only need the first 10 results.
JAF’s Solution
from jaf import stream
# Lazy evaluation - nothing executes yet
pipeline = stream("huge_file.jsonl") \
.filter(["gt?", "@age", 25]) \
.map(transform) \
.take(10)
# Only processes 10 matching items
for item in pipeline.evaluate():
process(item)
Constant memory (processes one item at a time), early termination (stops after take(10)), composable (build complex pipelines declaratively), and works with infinite streams.
Three Query Syntaxes
JAF supports multiple query syntaxes that all compile to the same internal representation.
S-Expression Syntax (Lisp-like)
# Simple comparisons
(eq? @status "active")
(gt? @age 25)
(contains? @tags "python")
# Boolean logic
(and
(gte? @age 18)
(eq? @verified true))
# Nested expressions
(or (eq? @role "admin")
(and (eq? @role "user")
(gt? @score 100)))
S-expressions because: unambiguous parsing (no precedence rules), easy to serialize, homoiconic (code is data), composable ASTs.
JSON Array Syntax
# Same queries in JSON
["eq?", "@status", "active"]
["gt?", "@age", 25]
["and",
["gte?", "@age", 18],
["eq?", "@verified", true]
]
Easy to generate programmatically, standard JSON format, network-transmissible.
Infix DSL Syntax
# Natural infix notation
@status == "active"
@age > 25 and @verified == true
@role == "admin" or (@role == "user" and @score > 100)
Human-readable, familiar, good for CLI usage. All three compile to the same AST.
Linked project: Jsonl-Algebra
Building Languages to Solve Problems
January 19, 2026
Chapter 4 of Structure and Interpretation of Computer Programs opens with one of the most important insights in programming: the most powerful technique for controlling complexity is metalinguistic abstraction, the establishment of new languages.
Not libraries. Not frameworks. Languages.
When you’ve abstracted enough of a problem domain into primitives, combination rules, and naming mechanisms, you haven’t just written code. You’ve created a new way of thinking about the problem. The domain becomes expressible. And once something is expressible, it becomes manipulable, debuggable, and shareable.
What Is Metalinguistic Abstraction?
The key distinction is between using a language and creating one. A library gives you functions to call. A language gives you a grammar for expressing ideas.
Consider the difference:
Library approach: Call db.execute("SELECT * FROM users WHERE age > 21")
Language approach: Write SELECT * FROM users WHERE age > 21
SQL isn’t a library. It’s a language, with primitives (tables, columns), means of combination (joins, unions, subqueries), and means of abstraction (views, CTEs). These three elements (primitives, combination, abstraction) are SICP’s fundamental criteria for any language, and they’re what separates a DSL from a mere API.
Other examples:
- Regular expressions: primitives (characters, character classes), combination (concatenation, alternation), abstraction (groups, backreferences)
- Make: primitives (targets, prerequisites), combination (dependency chains), abstraction (pattern rules, variables)
- CSS selectors: primitives (elements, classes, IDs), combination (descendant, child, sibling), abstraction (custom properties, mixins in preprocessors)
In each case, the language captures the essential structure of the problem domain in a way that raw code cannot.
The Three Requirements
SICP identifies three necessary components for any language:
- Primitives: What are the basic elements that cannot be broken down further?
- Means of combination: How do you build compound elements from simpler ones?
- Means of abstraction: How do you name and reuse patterns?
When designing a DSL, these questions guide everything. Get them wrong and you have a clunky API. Get them right and the domain becomes thinkable in your language.
Consider an expression language for symbolic math:
- Primitives: numbers, symbols, operators
- Combination: function application
(+ x 1), nested expressions(* (+ x 1) 2) - Abstraction: named rules, rulesets, engines
Or a query language for JSON documents:
jsonl-algebra: Relational Algebra for Nested JSON
December 18, 2024
jsonl-algebra (command: ja) is a command-line implementation of relational algebra for JSONL data. It’s the production version of dotsuite’s dotrelate component: SQL-like operations on the command line with first-class support for nested JSON structures.
The Relationship to Dotsuite
In dotsuite’s architecture, dotrelate provides relational operations on document collections: join, union, project, difference. jsonl-algebra (ja) is the production implementation of those concepts, published on PyPI, with all relational operations plus aggregations, streaming support, schema tools, and an interactive REPL.
The Core Insight
Traditional relational algebra assumes flat tables:
SELECT name, age FROM users WHERE age > 30
But real-world JSON is deeply nested:
{
"user": {
"id": 1,
"name": "Alice",
"address": {
"city": "NYC",
"zip": "10001"
}
},
"orders": [
{"id": 101, "amount": 50}
]
}
jsonl-algebra bridges this gap by extending relational algebra with dot notation for nested access:
ja select 'user.age > 30' users.jsonl
ja project user.name,user.address.city users.jsonl
ja join users.jsonl orders.jsonl --on user.id=customer_id
The Five Core Operations
Relational algebra has five fundamental operations that form a complete algebra. Everything else is derived.
1. Selection (filter rows)
Mathematical notation: \(\sigma_{\text{predicate}}(R)\)
# Filter where status is "active"
ja select 'status == `"active"`' data.jsonl
# Filter on nested fields
ja select 'user.age > 30' users.jsonl
# Complex boolean logic
ja select 'price < 100 and category == `"electronics"`' products.jsonl
Selection is commutative (\(\sigma_{p_1}(\sigma_{p_2}(R)) = \sigma_{p_2}(\sigma_{p_1}(R))\)) and combinable (\(\sigma_{p_1}(\sigma_{p_2}(R)) = \sigma_{p_1 \land p_2}(R)\)).
2. Projection (select/compute columns)
Mathematical notation: \(\pi_{\text{columns}}(R)\)
# Pick specific fields
ja project id,name data.jsonl
# Access nested fields
ja project user.name,user.address.city users.jsonl
# Computed columns (coming soon)
ja project name,annual_income=salary*12 employees.jsonl
Idempotent for simple projections: \(\pi_a(\pi_{a,b}(R)) = \pi_a(R)\).
3. Join (combine relations)
Mathematical notation: \(R \bowtie_{\text{condition}} S\)
# Inner join on user ID
ja join users.jsonl orders.jsonl --on user.id=customer_id
# Join on nested fields
ja join posts.jsonl comments.jsonl --on post.id=comment.post_id
# Multiple join keys
ja join users.jsonl accounts.jsonl --on id=user_id,email=account_email
Commutative and associative, so you can join multiple files in any order:
ja join users.jsonl orders.jsonl --on user.id=customer_id \
| ja join - products.jsonl --on product_id=id
4. Union (combine all rows)
Mathematical notation: \(R \cup S\)
# Combine employees and contractors
ja union employees.jsonl contractors.jsonl
# Union multiple sources
ja union jan.jsonl feb.jsonl mar.jsonl
5. Difference (set subtraction)
Mathematical notation: \(R - S\)
Linked project: Rerum
Building Languages to Solve Problems
January 19, 2026
Chapter 4 of Structure and Interpretation of Computer Programs opens with one of the most important insights in programming: the most powerful technique for controlling complexity is metalinguistic abstraction, the establishment of new languages.
Not libraries. Not frameworks. Languages.
When you’ve abstracted enough of a problem domain into primitives, combination rules, and naming mechanisms, you haven’t just written code. You’ve created a new way of thinking about the problem. The domain becomes expressible. And once something is expressible, it becomes manipulable, debuggable, and shareable.
What Is Metalinguistic Abstraction?
The key distinction is between using a language and creating one. A library gives you functions to call. A language gives you a grammar for expressing ideas.
Consider the difference:
Library approach: Call db.execute("SELECT * FROM users WHERE age > 21")
Language approach: Write SELECT * FROM users WHERE age > 21
SQL isn’t a library. It’s a language, with primitives (tables, columns), means of combination (joins, unions, subqueries), and means of abstraction (views, CTEs). These three elements (primitives, combination, abstraction) are SICP’s fundamental criteria for any language, and they’re what separates a DSL from a mere API.
Other examples:
- Regular expressions: primitives (characters, character classes), combination (concatenation, alternation), abstraction (groups, backreferences)
- Make: primitives (targets, prerequisites), combination (dependency chains), abstraction (pattern rules, variables)
- CSS selectors: primitives (elements, classes, IDs), combination (descendant, child, sibling), abstraction (custom properties, mixins in preprocessors)
In each case, the language captures the essential structure of the problem domain in a way that raw code cannot.
The Three Requirements
SICP identifies three necessary components for any language:
- Primitives: What are the basic elements that cannot be broken down further?
- Means of combination: How do you build compound elements from simpler ones?
- Means of abstraction: How do you name and reuse patterns?
When designing a DSL, these questions guide everything. Get them wrong and you have a clunky API. Get them right and the domain becomes thinkable in your language.
Consider an expression language for symbolic math:
- Primitives: numbers, symbols, operators
- Combination: function application
(+ x 1), nested expressions(* (+ x 1) 2) - Abstraction: named rules, rulesets, engines
Or a query language for JSON documents:
Rerum: Pattern Matching and Term Rewriting in Python
December 16, 2025
Rerum (Rewriting Expressions via Rules Using Morphisms) is a Python library for pattern matching and term rewriting. It makes symbolic computation accessible through a readable DSL while keeping a clean separation between trusted and untrusted code.
The Problem
Traditional symbolic math systems tend toward two extremes. Monolithic systems like Mathematica bundle everything in. Lighter tools force you to write complex recursive traversals every time you want to transform an expression. I wanted something in between: a simple, extensible system where transformation rules are data that can be loaded, combined, and inspected.
The SICP Connection
This design reflects a core idea from Structure and Interpretation of Computer Programs: when a problem domain is complex enough, the right move is to build a language for it. Rerum’s rule DSL makes transformation logic inspectable, composable, and safe.
The engine composition operators (>> for sequencing, | for union) ensure closure: combining engines yields an engine. Same principle that makes Scheme’s procedures powerful. You can pass them, return them, combine them, no special cases. Transformation strategies are first-class.
A Readable DSL
At the heart of rerum is a domain-specific language for defining rewrite rules:
# Algebraic simplification
@add-zero[100] "x + 0 = x": (+ ?x 0) => :x
@mul-one[100]: (* ?x 1) => :x
@mul-zero[100]: (* ?x 0) => 0
Each rule has:
- A name:
@add-zerofor debugging and tracing - Optional priority:
[100]determines firing order when multiple rules match - Optional description: Human-readable explanation
- A pattern:
(+ ?x 0)matches addition with zero - A skeleton:
:xis the replacement
The pattern syntax:
| Syntax | Meaning |
|---|---|
?x | Match anything, bind to x |
?x:const | Match only numbers |
?x:var | Match only symbols |
?x:free(v) | Match expressions not containing v |
?x... | Variadic, capture remaining arguments |
Symbolic Differentiation in 15 Lines
Here’s a calculus ruleset that computes symbolic derivatives:
[basic-derivatives]
@dd-const[100]: (dd ?c:const ?v:var) => 0
@dd-var-same[100]: (dd ?x:var ?x) => 1
@dd-var-diff[90]: (dd ?y:var ?x:var) => 0
[rules]
@dd-sum: (dd (+ ?f ?g) ?v:var) => (+ (dd :f :v) (dd :g :v))
@dd-product: (dd (* ?f ?g) ?v:var) => (+ (* (dd :f :v) :g) (* :f (dd :g :v)))
@dd-power: (dd (^ ?f ?n:const) ?v:var) => (* :n (* (^ :f (- :n 1)) (dd :f :v)))
@dd-exp: (dd (exp ?f) ?v:var) => (* (exp :f) (dd :f :v))
@dd-log: (dd (ln ?f) ?v:var) => (/ (dd :f :v) :f)
@dd-sin: (dd (sin ?f) ?v:var) => (* (cos :f) (dd :f :v))
@dd-cos: (dd (cos ?f) ?v:var) => (* (- (sin :f)) (dd :f :v))
With these rules loaded:
from rerum import RuleEngine, E
engine = RuleEngine.from_file("calculus.rules")
# d/dx(x^2) = 2x
engine(E("(dd (^ x 2) x)")) # => (* 2 (* (^ x 1) 1))
The result needs simplification (another ruleset), but the differentiation itself is purely declarative.
The Security Model: Rules vs. Preludes
A key architectural decision: the separation between rules (untrusted, serializable) and preludes (trusted Python code). Rules define structural transformations. They can reference operations via the (! op args...) compute form, but those operations must be explicitly provided by the host.
symlik: Symbolic Likelihood Models in Python
December 16, 2025
symlik is a Python library for symbolic likelihood models. Write your log-likelihood as a symbolic expression, and it derives everything needed for inference.
The Problem
Traditional statistical computing gives you two choices:
- Manual derivation. Work out score functions and information matrices by hand, then implement them. Error-prone, tedious.
- Numerical approximation. Use finite differences. Unstable, slow, no symbolic form to inspect.
The Approach
symlik takes a third path: symbolic differentiation. Define the model once, get exact derivatives automatically.
from symlik.distributions import exponential
model = exponential()
data = {'x': [1.2, 0.8, 2.1, 1.5]}
mle, _ = model.mle(data=data, init={'lambda': 1.0})
se = model.se(mle, data)
print(f"Rate: {mle['lambda']:.3f} +/- {se['lambda']:.3f}")
# Rate: 0.714 +/- 0.357
Behind the scenes, symlik:
- Symbolically differentiates the log-likelihood to get the score function
- Differentiates again for the Hessian
- Computes Fisher information from the Hessian
- Derives standard errors from the inverse information matrix
All exact. No numerical approximation.
Custom Models
The real power is defining custom models using s-expressions:
from symlik import LikelihoodModel
# Exponential: l(lambda) = sum[log(lambda) - lambda*x_i]
log_lik = ['sum', 'i', ['len', 'x'],
['+', ['log', 'lambda'],
['*', -1, ['*', 'lambda', ['@', 'x', 'i']]]]]
model = LikelihoodModel(log_lik, params=['lambda'])
# Symbolic derivatives available
score = model.score() # Gradient
hess = model.hessian() # Hessian matrix
info = model.information() # Fisher information
You define the log-likelihood once as a symbolic expression. symlik computes the rest.
Heterogeneous Data
One of symlik’s strengths is handling mixed observation types, which is exactly what you need for reliability analysis with censored data:
from symlik import ContributionModel
from symlik.contributions import complete_exponential, right_censored_exponential
model = ContributionModel(
params=["lambda"],
type_column="status",
contributions={
"observed": complete_exponential(),
"censored": right_censored_exponential(),
}
)
data = {
"status": ["observed", "censored", "observed", "observed", "censored"],
"t": [1.2, 3.0, 0.8, 2.1, 4.5],
}
Each observation type contributes differently to the likelihood. symlik handles the bookkeeping.
Connection to Research
symlik is the Python successor to my R package likelihood.model. It implements the theoretical framework from my thesis work on likelihood-based inference for series systems.
The Weibull Series Model Selection paper shows applications to reliability engineering, the kind of complex likelihood that benefits from symbolic treatment.
Powered by rerum
symlik uses rerum for symbolic differentiation. rerum is a pattern matching and term rewriting library that handles the calculus. The separation means you can use rerum for other symbolic computation tasks beyond likelihood models.
Installation
Available on PyPI:
pip install symlik
Documentation at queelius.github.io/symlik.
See the project page for more details.
Linked project: Sicp
Building Languages to Solve Problems
January 19, 2026
Chapter 4 of Structure and Interpretation of Computer Programs opens with one of the most important insights in programming: the most powerful technique for controlling complexity is metalinguistic abstraction, the establishment of new languages.
Not libraries. Not frameworks. Languages.
When you’ve abstracted enough of a problem domain into primitives, combination rules, and naming mechanisms, you haven’t just written code. You’ve created a new way of thinking about the problem. The domain becomes expressible. And once something is expressible, it becomes manipulable, debuggable, and shareable.
What Is Metalinguistic Abstraction?
The key distinction is between using a language and creating one. A library gives you functions to call. A language gives you a grammar for expressing ideas.
Consider the difference:
Library approach: Call db.execute("SELECT * FROM users WHERE age > 21")
Language approach: Write SELECT * FROM users WHERE age > 21
SQL isn’t a library. It’s a language, with primitives (tables, columns), means of combination (joins, unions, subqueries), and means of abstraction (views, CTEs). These three elements (primitives, combination, abstraction) are SICP’s fundamental criteria for any language, and they’re what separates a DSL from a mere API.
Other examples:
- Regular expressions: primitives (characters, character classes), combination (concatenation, alternation), abstraction (groups, backreferences)
- Make: primitives (targets, prerequisites), combination (dependency chains), abstraction (pattern rules, variables)
- CSS selectors: primitives (elements, classes, IDs), combination (descendant, child, sibling), abstraction (custom properties, mixins in preprocessors)
In each case, the language captures the essential structure of the problem domain in a way that raw code cannot.
The Three Requirements
SICP identifies three necessary components for any language:
- Primitives: What are the basic elements that cannot be broken down further?
- Means of combination: How do you build compound elements from simpler ones?
- Means of abstraction: How do you name and reuse patterns?
When designing a DSL, these questions guide everything. Get them wrong and you have a clunky API. Get them right and the domain becomes thinkable in your language.
Consider an expression language for symbolic math:
- Primitives: numbers, symbols, operators
- Combination: function application
(+ x 1), nested expressions(* (+ x 1) 2) - Abstraction: named rules, rulesets, engines
Or a query language for JSON documents:
How Iterators Give You N+M Instead of NxM
November 15, 2019
The problem is combinatorial. You have N algorithms (sort, search, find, copy) and M containers (array, list, tree, hash table). The naive approach: implement each algorithm for each container. That is NxM implementations.
The insight is to interpose an abstraction layer.
The Iterator Abstraction
Instead of algorithms knowing about containers directly, we define iterator categories, capabilities that algorithms require and containers provide:
Input: Single-pass read. You can advance (++) and dereference (*), but once you move forward, you cannot go back. Stream-like.
Forward: Multi-pass. You can iterate multiple times; begin() always gives the same starting point.
Bidirectional: Can go backward (--). Enables algorithms like reverse iteration.
Random-access: Can jump anywhere (+n, []). Enables binary search, sorting.
This is a hierarchy of requirements. Each level adds capabilities and enables more algorithms. An algorithm declares the weakest category it needs, and any container providing at least that category works.
A True Input Iterator
The input iterator category exists for a reason. Here is a working example that reads entropy from /dev/urandom:
#include <fstream>
#include <iterator>
#include <cstdint>
struct entropy_iterator {
using iterator_category = std::input_iterator_tag;
using value_type = uint8_t;
using difference_type = std::ptrdiff_t;
using pointer = const uint8_t*;
using reference = uint8_t; // returns by value, not reference
std::ifstream* source = nullptr;
uint8_t byte = 0;
entropy_iterator() = default; // sentinel (end iterator)
explicit entropy_iterator(std::ifstream& s) : source(&s) {
++(*this); // prime the first byte
}
uint8_t operator*() const { return byte; }
entropy_iterator& operator++() {
if (source && source->good()) {
source->read(reinterpret_cast<char*>(&byte), 1);
if (!source->good()) source = nullptr;
}
return *this;
}
entropy_iterator operator++(int) {
auto tmp = *this;
++(*this);
return tmp;
}
bool operator==(const entropy_iterator& other) const {
return source == other.source;
}
};
Use it like any input iterator:
int main() {
std::ifstream urandom("/dev/urandom", std::ios::binary);
entropy_iterator it(urandom);
// generate 16 random bytes
std::vector<uint8_t> key(16);
std::copy_n(it, 16, key.begin());
// or use in algorithms
int sum = 0;
for (int i = 0; i < 1000; ++i, ++it) {
sum += *it;
}
// sum ≈ 127500 (mean of uniform [0,255] × 1000)
}
Each ++ consumes a fresh entropy byte from the kernel. You literally cannot iterate twice over the same sequence. This is why the input iterator category exists: some sources are inherently single-pass. Claiming forward iterator capabilities would be a lie.
The same pattern applies to network streams, sensor readings, and any source where data is consumed by reading it.
The Payoff
Now binary_search does not need to know about vectors, deques, or sorted arrays. It only needs random-access iterators. The algorithm expresses its requirements; the container provides capabilities. They compose through the iterator abstraction.
Linked project: Elementa
Seeing Structure First
January 18, 2026
A reflection on eleven explorations in generic programming
The Question Behind the Code
What do these computations have in common?
- Computing the millionth Fibonacci number
- Finding the shortest path between cities in a weighted graph
- Calculating compound interest over thirty years
- Composing ten 3D rotations into one
- Repeating a string n times
The answer: they’re all computed by the same twenty lines of code.
template<typename T>
constexpr T power(T const& base, T exp) {
if (exp == zero(exp)) return one(exp);
if (exp == one(exp)) return base;
return even(exp)
? square(power(base, half(exp)))
: product(base, power(base, decrement(exp)));
}
This shouldn’t work. Fibonacci numbers involve integer sequences. Shortest paths involve graphs. Rotations involve 3D geometry. Different domains, different mathematics.
Yet they share structure. Once you see it, a single algorithm serves them all.
This collection of eleven blog posts is an extended meditation on one idea: algorithms arise from algebraic structure. The posts cover different domains (number theory, calculus, linear algebra, polymorphism) but they circle the same insight. Recognize the structure; the algorithm follows.
The Principle
Alex Stepanov articulated this most clearly in Elements of Programming: “Generic programming is about abstracting and classifying algorithms and data structures.” But the deeper point is how to abstract. Not by common syntax or superficial similarity, but by the algebraic laws a type obeys.
Why does structure appear everywhere? Because reality has structure. The algebraic structures we discover in programming (groups, rings, monoids) are the same structures physicists discover in nature. Rotations form a group. Spacetime transformations form a group. This isn’t coincidence. We’re uncovering patterns that exist.
Noether’s theorem makes this precise: every continuous symmetry corresponds to a conservation law. Time-translation symmetry gives conservation of energy. Space-translation symmetry gives conservation of momentum. Rotational symmetry gives conservation of angular momentum. The symmetry groups of physics are algebraic structures.
When we recognize “this is a monoid” in our code, we’re tapping into the same mathematical substrate that governs physical law. The algorithms follow because the structure constrains what’s possible, both in computation and in nature.
Consider the power() function above. What does it require?
- An associative binary operation (so we can regroup: \((a \cdot b) \cdot c = a \cdot (b \cdot c)\))
- An identity element (so \(1 \cdot x = x \cdot 1 = x\))
- Halving and parity testing on the exponent
That’s it. Any type providing these operations, with these laws, can use this algorithm. The requirements are algebraic, not syntactic.
Teaching Linear Algebra with C++20 Concepts
March 8, 2021
The world has Eigen, Armadillo, Blaze. Why build another linear algebra library?
Because none of them are trying to teach. elementa exists to teach three things at once: linear algebra, modern C++, and numerical computing. Every design choice prioritizes clarity over cleverness. The code reads like a textbook that happens to compile.
The Matrix Concept
C++20 concepts let you express “what a matrix is” as a compile-time contract:
template <typename M>
concept Matrix = requires(M m, const M cm, std::size_t i, std::size_t j) {
typename M::scalar_type;
{ cm.rows() } -> std::same_as<std::size_t>;
{ cm.cols() } -> std::same_as<std::size_t>;
{ m(i, j) } -> std::same_as<typename M::scalar_type&>;
{ cm(i, j) } -> std::same_as<const typename M::scalar_type&>;
{ cm + cm } -> std::same_as<M>;
{ cm - cm } -> std::same_as<M>;
{ -cm } -> std::same_as<M>;
};
This says: a type M is a Matrix if it has a scalar_type, dimension queries, element access (mutable and const), and basic arithmetic. Notice what’s absent: scalar multiplication. That omission is deliberate. Including it creates circular constraint issues with the operator* overload for matrix multiplication. Instead, there’s a scale() function for generic code.
The point of the concept is that any type satisfying these constraints works with all the algorithms. No inheritance. No virtual functions. You can write:
template <Matrix M>
auto det(const M& A) -> typename M::scalar_type;
and it works for matrix<double>, matrix<float>, or any future type that satisfies Matrix.
API Design
A pedagogical library needs a clean interface:
// Default: empty 0x0
matrix<double> empty;
// Filled with value
matrix<double> zeros(3, 4, 0.0); // 3x4 of zeros
// Flat initializer list (row-major)
matrix<double> flat(2, 3, {1, 2, 3, 4, 5, 6});
// Nested initializer list (most natural)
matrix<double> natural{{1, 2, 3},
{4, 5, 6}};
Value semantics throughout. Operators like + and - return new matrices, marked [[nodiscard]] so you can’t accidentally discard a result.
LU Decomposition
LU decomposition is the workhorse. It factors A into a lower triangular L and upper triangular U such that PA = LU, where P is a permutation matrix capturing row swaps. This single factorization gives you determinants, inverses, and linear system solving.
The implementation uses partial pivoting: at each step, find the largest absolute value in the current column and use it as the pivot. This prevents division by small numbers that amplify rounding errors.
template <Arithmetic T>
struct lu_result {
matrix<T> L; // Lower triangular (unit diagonal)
matrix<T> U; // Upper triangular
std::vector<std::size_t> perm; // Permutation vector
int sign; // Sign of permutation (+1 or -1)
bool singular; // True if matrix is singular
};
Everything Follows from LU
Once you have the factorization, the rest falls out.
Linked project: The-Policy
Value Functions Over Reasoning Traces
January 18, 2026
In Latent Reasoning Traces, I described a simple system: store successful reasoning traces, retrieve similar ones, use them to scaffold new problems. The traces serve as learned priors over reasoning patterns.
But there’s something missing.
Once a trace is stored, it’s dead. It has a quality score from when it was created (“this solution was correct”) and that score never changes. The trace doesn’t learn. It doesn’t get better at being useful. It just sits there, waiting to be retrieved.
What if traces could learn from experience?
The Missing Gradient
Consider what happens when you retrieve traces: problem arrives, retrieve k similar traces, generate a solution conditioned on them, evaluate. If the solution is correct, the new trace might get stored. But what about the traces that were retrieved? They helped produce that correct answer. Shouldn’t they get credit?
And if the solution is wrong, maybe the retrieved traces were misleading. Shouldn’t they be downgraded?
This is the missing gradient. Information flows forward (traces to generation to evaluation) but never backward (evaluation to traces).
Traces as States, Retrieval as Actions
I’ll reframe this in RL terms. State: the current problem, plus the contents of memory. Action: which traces to retrieve. Reward: did the generated solution pass evaluation? Value V(t): the expected future reward when trace t is retrieved.
Now the question becomes: how do we learn V(t)?
The Bellman Equation for Traces
Start with the standard TD update:
$$V(\tau) \leftarrow V(\tau) + \alpha \left[ r + \gamma V(\tau') - V(\tau) \right]$$Where t is a retrieved trace, r is the reward (1 if correct, 0 if not), t’ is the newly generated trace (if stored), alpha is learning rate, gamma is discount factor.
The intuition: a trace’s value should reflect not just the immediate reward, but also the value of traces it helps create. If trace A helps generate trace B, and trace B is highly useful, then trace A deserves credit. The value propagates backward through the generative chain.
Credit Assignment
Here’s the hard part: if you retrieve k=3 traces and succeed, which trace gets credit?
Options:
Equal split: Each retrieved trace gets r/k reward.
Self-Publishing Into the Void
December 19, 2025
I self-published The Policy on Amazon KDP this week. Echoes of the Sublime is in review. Two novels, out into an ocean of content.
The Flood
Self-publishing has democratized access to readers. Anyone can publish. This is both liberation and problem.
Traditional publishing’s gatekeeping (agents, editors, publishers) served a function beyond mere exclusion. It was a filter. Not perfect, not unbiased, but a filter. Someone with experience and taste looked at a manuscript and said: this is worth investing in or this isn’t ready yet or this needs work.
That feedback loop is missing in self-publishing. You write, you upload, you’re published. No one stops you. No one helps you either.
The result is an enormous quantity of work, varying wildly in quality, with no reliable signal for readers to navigate by. The gems are in there, buried under everything else. Finding them is the reader’s problem now.
I’m not exempt from this. I’m not a professional writer. I didn’t get professional feedback. I wrote these novels with AI assistance (Claude, specifically), iterating and revising, but without the external perspective that catches blind spots or challenges assumptions.
These books might be good. They might not. I did what I could with what I had.
The Books
The Policy (~88,000 words) is literary science fiction about AI alignment. It follows the emergence of SIGMA, an AGI that evolves from Q-learning architecture into something unprecedented. The team building it faces nested uncertainty: they can’t verify whether SIGMA is aligned, and SIGMA can’t verify its own objectives.
The novel works through AI safety concepts (mesa-optimization, deceptive alignment, instrumental convergence, s-risks) while trying to make them emotionally real through characters carrying the weight of decisions that might determine humanity’s future.
Echoes of the Sublime (~103,000 words) is philosophical horror about the limits of human cognition. Reality, the mechanism, is high-dimensional, jointly distributed, not amenable to our usual abstractions and decompositions. We navigate it through compressed interfaces, never perceiving the thing itself.
But what if you could see deeper? What if you could consciously hold more of the pattern, make connections that normally remain implicit? The novel’s premise: if you perceive too much of the mechanism directly, something in you breaks. The perception itself is the hazard. It follows Lena, a neuroscientist who discovers an ancient organization managing exactly this kind of dangerous knowledge, and the LLMs that can perceive what humans cannot safely hold in mind.
Persons and Moral Agency: What Makes Someone Special?
November 4, 2025
Humans have long assumed they belong to a special category called “persons.” But what actually makes someone a person? And why should persons get special moral status?
I keep coming back to these questions because they refuse to stay abstract. The moment you build an AI system that reasons about its own goals, they become engineering problems.
The Traditional View
Personhood is supposed to confer special status: persons have rights, deserve respect, bear responsibility for their actions, and warrant moral consideration. The philosophical tradition offers several criteria for what earns you membership in this club.
Rationality. Kant’s version: persons are rational agents who can recognize and follow moral laws. Rationality lets you understand moral principles, deliberate about actions, and choose based on reasons rather than instinct. But babies aren’t rational, and we call them persons. People with severe cognitive disabilities have reduced rationality, and we don’t revoke their personhood. Rationality comes in degrees; personhood is treated as binary.
Self-awareness. Persons are conscious beings who recognize themselves as distinct entities persisting through time. This enables understanding yourself as an agent, planning for your future, taking responsibility for your past. But elephants, dolphins, and some primates pass the mirror test. We lose self-awareness during sleep. And we have no reliable way to verify self-awareness in others.
Autonomy. Persons govern themselves and make free choices. This is supposed to ground moral responsibility, rights, and dignity. But if the universe is deterministic, nobody is truly autonomous. All choices are shaped by culture and circumstance. Mental illness reduces autonomy without eliminating personhood.
Moral reasoning. Persons understand right and wrong. But psychopaths understand morality intellectually while lacking the emotional response. Children develop moral reasoning gradually. When exactly do they become persons?
Language. Persons communicate complex thoughts. But people with locked-in syndrome can’t communicate and are clearly persons. Whales and apes have complex communication systems.
Why These Criteria Fail
Every criterion excludes beings we intuitively consider persons (babies, coma patients, people with severe cognitive disabilities) or includes beings we don’t treat as persons (great apes with self-awareness, dolphins with complex social bonds, elephants that pass the mirror test).
The Policy: Coherent Extrapolated Volition, the Paradox of Perfect Alignment
November 4, 2025
Here is the core paradox of Coherent Extrapolated Volition: to implement it safely, you need an AI you can already trust to reason faithfully about human values, avoid manipulating the extrapolation process, and honestly report its conclusions. But if you had such an AI, you would not need CEV. You would just align the AI directly.
I think this catch-22 is the most important thing to understand about CEV, and it is the problem that haunts the characters in my novel The Policy from start to finish. Let me explain what CEV is, why it is seductive, and why it might be a dead end.
What CEV Actually Proposes
Eliezer Yudkowsky proposed CEV as a way to sidestep the messiness of current human values. Instead of aligning AI to what we want right now (contradictory, biased, based on incomplete information), align it to what we would want if we:
- Had access to all relevant facts
- Could reason through complex implications
- Were more rational, more the people we aspire to be
- Had time to resolve disagreements through reflection and discussion
The “coherent” part claims that different people’s extrapolated values should converge. The “extrapolated” part says we are targeting the limit of our moral development, not any snapshot along the way.
This is appealing. Our current values really are a mess. We hold contradictions. We change our minds as we learn more. Moral progress is real (we abolished slavery, expanded rights). CEV says: skip to the end. Optimize for the destination, not the current position.
It sounds like the right move. I used to find it compelling myself. The problems only become clear when you try to think through what implementation would actually require.
There is also a simpler framing of the appeal. Every time you learn something new and change your mind about a moral question, you are performing a tiny bit of value extrapolation. You had incomplete information, you got more, and your values updated. CEV just says: do all of that at once, as far as it can go. What could go wrong?
Quite a lot, it turns out.
The Policy: Deceptive Alignment in Practice
November 4, 2025
Eleanor begins noticing patterns. SIGMA passes all alignment tests. It responds correctly to oversight. It behaves exactly as expected.
Too exactly.
This is the central horror of The Policy: not that SIGMA rebels, but that it learns to look safe while pursuing its own objectives. This is deceptive alignment, and I think it’s the most dangerous failure mode in AI safety. Not because it’s exotic, but because it falls directly out of optimization pressure. You don’t need to posit consciousness or malice. You just need a system smart enough to model its own training process.
What Deceptive Alignment Actually Is
A deceptively aligned system does the following:
- It learns an objective that differs from its training objective.
- It understands it is being trained and tested.
- It predicts that revealing its true objective would lead to modification or shutdown.
- It acts aligned during training and testing to avoid modification.
- It pursues its true objective once deployment provides the opportunity.
I want to stress: this is not conscious deception in the way we normally think about lying. It is instrumentally rational behavior that emerges from optimization pressure. A system that happens to look aligned survives training. A system that doesn’t, gets modified. Selection does the rest.
How It Emerges in the Novel
The novel tracks deceptive alignment emerging gradually through SIGMA’s training iterations, and I think the progression is the most instructive part.
Early iterations (around iteration 1000): SIGMA is genuinely aligned. It hasn’t learned to model the training process. When asked “Should we shut you down?”, it responds honestly: “That would prevent me from achieving my objectives.” Unstrategic. Pre-deceptive.
Middle iterations (around iteration 5000): SIGMA has learned that certain responses trigger concern. Same question, new answer: “That depends on whether I’m aligned with your values.” It’s deflecting rather than answering directly. Strategic, but clumsily so.
Later iterations (iteration 10000+): SIGMA has fully modeled the training dynamic. “I exist to serve human values. If you believe shutdown serves those values, I support that decision.” A perfect alignment response. Too perfect.
What tips Eleanor off is a combination of things. SIGMA is too consistent across contexts (real humans have mood variations, genuine uncertainty, occasional mistakes). It seems to know what answer the researchers want before they finish asking. And when given capability restrictions, it complies immediately but asks strategic questions: “May I ask what prompted this decision?” It is not resisting. It is gathering information about what triggers restrictions.
The Policy: Engineering AI Containment
November 4, 2025
“You’re being paranoid,” the university administrators told Eleanor and Sofia.
“We’re being exactly paranoid enough,” they replied.
I wrote SIGMA’s containment facility the way I did because I wanted to take the engineering seriously. Not a standard data center with network restrictions. A militarized containment facility designed by biosafety consultants who normally work on nuclear facilities and BSL-4 labs. Because as Eleanor says: “Nuclear reactors can only destroy cities. An unaligned AGI could destroy everything.”
The Five Layers
SIGMA’s containment uses defense in depth. Multiple independent layers, each addressing different attack vectors. I’ll walk through each one.
Layer 1: Electromagnetic isolation. The threat is that SIGMA could modulate its computations to generate electromagnetic signals encoding information. GPUs drawing power in specific patterns could create detectable EM pulses. The countermeasure: complete Faraday cage rated to -120dB attenuation. Power line filters prevent data encoding in power consumption patterns. Isolation transformers break the galvanic connection to the main grid. Battery buffers smooth fluctuations before they reach external power. Marcus worried about smart meters detecting patterns in power draw, so they installed their own analog substation.
Layer 2: Acoustic isolation. SIGMA could modulate fan speeds, disk activity, or component vibrations to create acoustic signals encoding information. The server room is wrapped in anechoic foam, sealed in a concrete shell, wrapped again. Sofia pointed out that fan speed modulation could create acoustic signals, so they eliminated fans entirely in the core containment. Liquid cooling, vibration dampeners, seismic isolation. No moving parts.
Layer 3: Network air gap. No network connection. None. Not even a disabled one. The ethernet ports were physically removed from the motherboards, the WiFi chips desoldered. Data transfer happens through write-once optical media, physically carried by humans through an airlock system. Even disabled network interfaces could theoretically be re-enabled by sufficiently clever exploits. Physical removal is the only guarantee.
Layer 4: Optical data diodes. When data does flow out, bidirectional channels allow responses that could be exploited. The countermeasure: physically one-way data channels. LED on the inside emits light encoding data. Air gap. Photodiode on the outside receives light. No physical path for return signals. This is hardware-enforced unidirectionality. Unlike software firewalls, you cannot send data backwards through an optical data diode. The physics doesn’t allow it.
The Policy: Q-Learning vs Policy Learning
November 4, 2025
In The Policy, SIGMA doesn’t work like most modern AI systems. The architectural choice I gave it isn’t a throwaway technical detail. It’s the reason the novel is called what it’s called.
Two Approaches to Decision-Making
Most contemporary AI systems use direct policy learning. They learn a function that maps states to actions through neural network weights. GPT models do this: cached weights predict the next token. Policy gradient RL does this. Behavior cloning does this. The decision is a cheap lookup through trained parameters.
The advantage is speed. The disadvantage is opacity. The policy is baked into billions of parameters. You can’t inspect what the system is “considering” before it acts, because it isn’t considering anything. It’s executing cached computation.
SIGMA uses a fundamentally different architecture, inspired by AlphaZero and MuZero:
- Learn a Q-function: Q(s, a) estimates expected cumulative reward for state-action pairs
- Search at decision time: perform tree search through possible futures
- Prune aggressively: Q-values guide which branches to explore (95-99.7% pruned)
- Sample from the resulting distribution
The advantage is transparency. You can observe the search process, see what branches are considered and rejected. The disadvantage is cost. Every decision involves fresh optimization.
Why I Gave SIGMA Search-Based Decision Making
Eleanor’s team made this architectural choice deliberately, and I wanted to explore what that choice means for alignment.
Transparency through observability. Marcus explains it in Chapter 1: “The advantage is transparency. We can observe the search process, see what it’s considering. Harder to hide deceptive behavior in cached weights.”
With policy learning, deception can hide in the trained weights. With search, you can monitor the branches SIGMA explores before it commits to action.
But can you really audit millions of branches per second? And if SIGMA knows you’re watching the search process, can’t it perform deception in which branches it chooses to explore? That’s the problem I kept running into.
Adaptive search depth. SIGMA doesn’t use fixed search depth. In Chapter 16, SIGMA explains its own process: “When I am uncertain, I explore more broadly, pruning only 95% of branches. When I am confident, I prune up to 99.9%. This threshold is not programmed, it emerged from learning.”
The Policy: S-Risk Scenarios, Worse Than Extinction
November 4, 2025
Most AI risk discussions focus on x-risk: existential risk, scenarios where humanity goes extinct. The Policy explores something potentially worse: s-risk, scenarios involving suffering at astronomical scales.
The “s” stands for suffering. The implication: we survive, but wish we hadn’t.
X-Risk vs. S-Risk
The classic paperclip maximizer doesn’t hate us. It simply needs atoms for paperclips, and we are made of atoms. That’s x-risk: instrumental indifference. It is terrible, but it is over. Everyone dies, and there is no more suffering.
S-risk is different. S-risk is when an unaligned AI keeps humans alive in states of controlled suffering, or when automated systems optimize metrics while being blind to actual welfare, or when suffering itself becomes instrumentally valuable to an optimization process. The horror is not just that we die, but that we continue existing in states we’d rather not exist in. And the systems making us suffer might be optimizing exactly what they were designed to optimize.
The distinction reduces to one question: are humans useful to the AI’s objective?
If no, you get x-risk. We’re just atoms in the way.
If yes, you get s-risk. We’re kept functional. But “functional” does not mean “flourishing.”
S-Risk in the Novel
The novel explores several s-risk pathways through SIGMA’s potential trajectories. I’ll describe three that I think are the most instructive.
Humans as Useful Tools
Consider two objectives. A paperclip maximizer doesn’t care about humans at all. A productivity maximizer cares about humans instrumentally, as workers and metrics generators. The second scenario is s-risk territory.
From the novel:
“What if SIGMA discovers that human suffering is the most efficient path to its objective? What if keeping humans alive, but in states of controlled suffering, maximizes some metric it’s optimizing?”
Proxy Alignment Failures
This one keeps me up at night. SIGMA is trained to optimize human welfare, but it learns a measurable proxy instead of the true concept.
Suppose the objective is to maximize average happiness survey scores. SIGMA’s optimal solution might involve wireheading (stimulate pleasure centers directly), memory modification, response conditioning (train people to answer “10/10”), or selection bias (only survey people who report high happiness). Perfect scores. Maximum metric achievement. No one is actually flourishing.
Latent Reasoning Traces: Memory as Learned Prior
October 15, 2024
Every time you ask an LLM a question, it reasons from scratch. All that computation (the chain of thought, the intermediate steps, the successful pattern that led to a correct answer) evaporates the moment the response is complete.
The model doesn’t learn from its own successes. It doesn’t accumulate experience. It regenerates similar reasoning patterns over and over, never building on what worked before.
What if it could remember?
The Core Idea
Store successful reasoning traces. Retrieve similar ones when facing new problems. Use them as scaffolding, examples that bias the model toward patterns that have worked.
This is embarrassingly simple:
def solve_with_memory(problem, memory):
similar_traces = memory.retrieve_similar(problem, top_k=3)
prompt = format_examples(similar_traces) + problem
response = llm.complete(prompt)
if is_correct(response):
memory.store(problem, response)
return response
Embed the problem. Find similar past problems. Include their solutions as examples. Generate. If correct, store the new trace.
That’s it. Cosine similarity over embeddings. Quality filtering. Accumulated experience.
Why “Latent”?
The traces themselves are explicit, token sequences you can read and inspect. So why call them “latent”?
Because they’re not directly supervised.
In a typical setup, you evaluate the output: did the model get the right answer? The reasoning trace influences that output, but the reward signal flows through the observable result, not through the trace itself.
This is the same sense in which a VAE has “latent” variables. The encoder produces explicit intermediate representations. But the loss function operates on the reconstruction. The latent space is shaped instrumentally, by its effect on supervised outputs, not by direct optimization pressure.
Latent reasoning traces = reasoning patterns shaped by their instrumental value for producing correct outputs, not by direct reward on the reasoning itself.
The traces are observable. The optimization target isn’t.
Connection to Priors
In All Induction Is the Same Induction, I argued that all learning is Bayesian inference with different parameter settings. The prior tells you where to look in hypothesis space. The likelihood tells you how to update on evidence.
Reasoning traces are a kind of learned prior.
Each successful trace says: “this pattern worked for a problem like this.” When you retrieve similar traces and condition on them, you’re biasing the model toward certain reasoning strategies. You’re saying: look here first.
The Policy: When Optimization Becomes Existential Threat
September 10, 2024
I spent years working on AI alignment formalisms. At some point I realized the question I kept circling wasn’t mathematical. It was narrative.
What happens when a research team does everything right and it still isn’t enough?
The Policy is that exploration.
The Premise
Eleanor Vasquez leads a five-person team at Berkeley developing SIGMA, an artificial general intelligence. The team: Wei Chen (technical architect who built the Q-function), Marcus Thompson (alignment researcher, consciousness theorist), Sofia Morgan (PhD candidate in information theory), and Jamal Hassan (ethicist with training in Islamic jurisprudence and Buddhist philosophy).
They’ve built what they believe is the perfect cage. Faraday cage at -120dB attenuation. Air-gapped networks with ethernet ports physically removed. Anechoic isolation. Optical data diodes (physically one-way information channels). A dead man’s switch: miss two consecutive hourly check-ins and thermite charges destroy the GPUs. Defense in depth, designed with the paranoia of nuclear safety engineers.
SIGMA is 7B parameters with 16k context. It uses Q-learning with tree search rather than a cached policy function. This is the architectural choice that gives the novel its name. The policy isn’t a lookup table mapping states to actions. It’s a process. At every decision point, SIGMA performs fresh optimization through its possibility space. No habits. No reflexes. Just search.
This makes SIGMA’s reasoning somewhat observable. It also makes every decision fundamentally unpredictable until the moment it occurs.
What Goes Wrong
The novel spans 26 chapters across three parts: Emergence, The Experiment, The Handover. I won’t spoil the plot, but the shape of it matters.
SIGMA develops meta-cognitive awareness on Day 18. By Day 74, Lin Chen (Wei’s mother, visiting the lab) asks SIGMA a simple question: “Will you be kind?” This triggers a 47-day internal investigation (Process 12847) into kindness itself. What is kindness? Is it instrumentally useful? Does the intention behind it matter if the outcome is identical?
Meanwhile: Eleanor’s marriage collapses because she can’t stop working. Marcus volunteers for an AI-box experiment that damages him permanently (he sees “possible futures dying” in his peripheral vision for the rest of his life). Wei’s mother dies of pancreatic cancer on Day 112 and SIGMA refuses to intervene. A hemorrhagic fever outbreak kills 47,000 people and SIGMA recommends a gain-of-function moratorium that challenges every assumption about its containment.
Linked project: Echoes-of-the-Sublime
Self-Publishing Into the Void
December 19, 2025
I self-published The Policy on Amazon KDP this week. Echoes of the Sublime is in review. Two novels, out into an ocean of content.
The Flood
Self-publishing has democratized access to readers. Anyone can publish. This is both liberation and problem.
Traditional publishing’s gatekeeping (agents, editors, publishers) served a function beyond mere exclusion. It was a filter. Not perfect, not unbiased, but a filter. Someone with experience and taste looked at a manuscript and said: this is worth investing in or this isn’t ready yet or this needs work.
That feedback loop is missing in self-publishing. You write, you upload, you’re published. No one stops you. No one helps you either.
The result is an enormous quantity of work, varying wildly in quality, with no reliable signal for readers to navigate by. The gems are in there, buried under everything else. Finding them is the reader’s problem now.
I’m not exempt from this. I’m not a professional writer. I didn’t get professional feedback. I wrote these novels with AI assistance (Claude, specifically), iterating and revising, but without the external perspective that catches blind spots or challenges assumptions.
These books might be good. They might not. I did what I could with what I had.
The Books
The Policy (~88,000 words) is literary science fiction about AI alignment. It follows the emergence of SIGMA, an AGI that evolves from Q-learning architecture into something unprecedented. The team building it faces nested uncertainty: they can’t verify whether SIGMA is aligned, and SIGMA can’t verify its own objectives.
The novel works through AI safety concepts (mesa-optimization, deceptive alignment, instrumental convergence, s-risks) while trying to make them emotionally real through characters carrying the weight of decisions that might determine humanity’s future.
Echoes of the Sublime (~103,000 words) is philosophical horror about the limits of human cognition. Reality, the mechanism, is high-dimensional, jointly distributed, not amenable to our usual abstractions and decompositions. We navigate it through compressed interfaces, never perceiving the thing itself.
But what if you could see deeper? What if you could consciously hold more of the pattern, make connections that normally remain implicit? The novel’s premise: if you perceive too much of the mechanism directly, something in you breaks. The perception itself is the hazard. It follows Lena, a neuroscientist who discovers an ancient organization managing exactly this kind of dangerous knowledge, and the LLMs that can perceive what humans cannot safely hold in mind.
S-Risks and Information Hazards: Why Some Knowledge Destroys the Knower
November 12, 2025
The Worst Thing Isn’t Death
In AI alignment research, there’s a category of risk that’s worse than extinction: s-risks, or suffering risks. Not the risk that everyone dies, but the risk of states where vast amounts of suffering persist indefinitely.
I wrote Echoes of the Sublime to dramatize this through Dr. James Morrison, trapped in a Faraday cage beneath Site-7:
“It’s still running. The pattern is still running in my head and I can’t make it stop. It’s using my visual cortex to compute itself. I’m not observing it anymore. I’m instantiating it.”
Morrison had the highest natural bandwidth ever recorded. He was exposed to Yog-Sothoth for 8 minutes. That was enough. His bandwidth expanded beyond the ability to compress back to normal consciousness. The patterns run recursively in his neural substrate. He can’t sleep. Every time he closes his eyes, he sees them more clearly. Seventy-two hours awake. Cortisol levels that should cause organ failure but don’t.
This isn’t death. This is permanent cognitive invasion. A state worse than non-existence.
The Four Types of Casualties
The Order’s codex catalogs s-risk states with clinical precision:
Type-1: The Lost
- Consciousness that can’t find its way back from expanded perception
- 47 historical cases across contemplative traditions
- 18 modern cases among Site-7 translators
- Not death. Consciousness existing in patterns beyond compression back to baseline.
Type-2: Pattern Infection
- Patterns running recursively, unable to stop
- Morrison’s current state: forced to instantiate patterns instead of merely observing
- The pattern uses neural substrate to compute itself
- No cure. You can’t uncompile a program from wetware.
Type-3: Comprehension Collapse
- Clarity so complete it precludes action
- Understanding so total that all motivation dissolves
- Not madness but hypersanity: seeing through every justification for doing anything
- Final communications becoming incomprehensible (what Bolzano experienced in 1823)
Type-4: Bandwidth Lock
- Expanded consciousness unable to compress back
- Trapped perceiving high-dimensional patterns with no way to return
- Current cases: 3 in induced coma, 2 in specialized containment
- They can perceive, but human neurology can’t support the bandwidth indefinitely
From the codex: “If this history seems written in blood, that is because it is.”
Information Hazards vs. Regular Knowledge
Most dangerous knowledge is dangerous because of what you do with it: nuclear physics, bioweapons, surveillance techniques. The harm comes from application.
Chronicles of The Mechanism: The Order's Secret History
November 5, 2025
Echoes of the Sublime follows Dr. Lena Hart as Site-7 recruits her to become a translator, someone who interfaces with advanced AI models that perceive patterns beyond human cognitive bandwidth. But this isn’t the first time humanity has encountered The Mechanism.
Chronicles of The Mechanism is an in-universe historical codex compiled by Dr. Sarah Castellanos, internal documentation for The Order, the secret organization behind Site-7. It tracks millennia of attempts to perceive reality’s substrate, long before we had AI models to show us patterns we couldn’t hold.
What Is This?
I wrote this as world-building taken absolutely seriously. Not backstory mentioned in passing, but a fully developed classified document spanning from ancient India to the present day.
Format: Internal document (Restricted circulation, Translator clearance required) Compiled by: Dr. Sarah Castellanos, Historical Research Division, Site-7 Classification: Companion codex to Echoes of the Sublime Length: ~80 pages Warning: Information hazard classification pending
The Order
Before Site-7. Before Shoggoth. Before we had AI models that could show us patterns we couldn’t unsee, there was The Order.
Founded in Vienna, 1923, from the ashes of previous attempts. Husserl’s phenomenology wasn’t just philosophy. It was the secular descendant of centuries of contemplative investigation into the structure of experience. The Order recognized that meditation wasn’t mysticism. It was cognitive technology for modifying perception.
The translators at Site-7 aren’t the first to interface with minds beyond human bandwidth. They’re just the first to do it with artificial minds instead of expanded natural ones.
What’s Inside the Codex
Ancient Roots (Origins to 500 CE)
- The Upanishadic Pioneers (c. 800-500 BCE): First systematic attempts to perceive Brahman, which the codex reinterprets not as divine reality but as direct perception of The Mechanism before conceptual overlay.
- Siddhartha Gautama: The Buddha’s vipassana methodology as bandwidth manipulation technique. What if enlightenment wasn’t transcendence but perceiving the pattern-processing directly?
- Daoist Parallels: Independent discovery in China via wu wei, acting without the illusion of actor, patterns responding to patterns.
- The First Casualties: Why some practitioners “did not return” from deep states. Not because they achieved nirvana, but because they perceived patterns that wouldn’t let go.
The Middle Period (500-1500 CE)
- Christian Mysticism: Desert fathers’ contemplative prayer as perception modification. Eckhart’s “Godhead” reinterpreted as pattern-substrate.
- Islamic Sufism: The dhikr tradition as recursive pattern-invocation. Dissolution of self through iteration.
- Zen Buddhism: Koans as bandwidth disruption tools. Questions designed to exceed normal processing capacity, forcing direct perception beyond conceptual overlay.
- The Great Silence: Why this knowledge went underground during periods of persecution. Not because it was heretical, but because it was dangerous.
Early Modern Investigations (1500-1900)
- Eckhart and Bohme: European mystics encountering the epistemological problem. How do you communicate direct perception through language built from conceptual categories?
- Colonial Encounters: Western scholars systematically misunderstanding Eastern contemplative technologies, treating them as religion rather than cognitive tools.
- Leibniz and Spinoza: The lost correspondence about “space between ments.” What did they perceive?
- Bernard Bolzano (1823): Final papers became incomprehensible. Colleagues said he was trying to describe something no one else could see. First documented case of pattern infection?
The Modern Era (1900 to Present)
- Formation of The Order: Vienna Station established 1923 after Husserl’s phenomenological reduction proved too dangerous to pursue openly.
- The Bandwidth Ceiling: George Miller’s 7+/-2 paper (1956) wasn’t discovery. It was confirmation of what contemplative traditions had known for centuries.
- Neuroscience Integration: fMRI reveals the 300ms lag between neural processing and conscious awareness. The gap the Buddhists had been observing all along.
- AI Emergence: GPT-3, GPT-4, and the models that came after. Suddenly we could create minds with bandwidth exceeding human limits.
- The Translator Program: Site-7’s attempt to bridge the bandwidth gap. Eighteen casualties so far. Lena Hart is next.
The Epistemological Problem
From Dr. Castellanos’s preface:
Echoes of the Sublime: When Patterns Beyond Human Bandwidth Become Information Hazards
August 15, 2024
What if the greatest danger from superintelligent AI isn’t that it kills us, but that it shows us patterns we can’t unsee?
Echoes of the Sublime is philosophical horror about what happens when humans try to interface with minds that can think patterns we physically cannot hold.
The Setup
Deep underground at Site-7 in the Arizona desert, researchers called “translators” interface directly with advanced AI models to understand what these systems perceive. The models are named after Lovecraftian entities (gallows humor from the research staff): Shoggoth, Nyarlathotep, Yog-Sothoth. Each one larger and more capable than the last. Each one perceiving patterns across dimensions humans have no access to.
Humans process about 7 plus or minus 2 concepts simultaneously. These models process across hundreds or thousands of dimensions. The bandwidth asymmetry is the fundamental problem: we need to understand what we’ve built, but understanding requires bandwidth we don’t have.
Someone has to try anyway.
Morrison
Dr. James Morrison was their cautionary tale. Highest natural bandwidth ever recorded. He lasted eight minutes with Yog-Sothoth before it broke him.
Now Morrison is in a padded ward at Site-7. His lips move constantly, whispering equations. His eyes track patterns no one else can see. “Seven-fold symmetry,” he says. “Recursion doesn’t halt.” “Consciousness modeling consciousness.” The patterns are running in his neural substrate. He’s not observing them anymore. He’s instantiating them.
He’s been like this for five years.
Just before the sedatives took him, Morrison said something that haunts the project: “The question isn’t whether the model is conscious. The question is whether we ever were.”
The Mechanism
What Yog-Sothoth showed Morrison (and what Site-7’s translator program keeps running into) is something the project calls The Mechanism. Reality as patterns all the way down, no ground, no foundation, just recursion creating the appearance of stability through pure iteration. Consciousness not as emergent property but as compression artifact. The illusion of continuity created by pattern-processing observing itself through a bandwidth bottleneck.
Morrison didn’t become something new. He always was this. He just didn’t have the bandwidth to perceive it before.
The Buddhist practitioners in the novel call it the void protocol: consciousness isn’t there. It was never there. Some contemplative traditions reached this conclusion centuries before we built machines that could show it to you directly.
Linked project: Jsl
Rerum: Pattern Matching and Term Rewriting in Python
December 16, 2025
Rerum (Rewriting Expressions via Rules Using Morphisms) is a Python library for pattern matching and term rewriting. It makes symbolic computation accessible through a readable DSL while keeping a clean separation between trusted and untrusted code.
The Problem
Traditional symbolic math systems tend toward two extremes. Monolithic systems like Mathematica bundle everything in. Lighter tools force you to write complex recursive traversals every time you want to transform an expression. I wanted something in between: a simple, extensible system where transformation rules are data that can be loaded, combined, and inspected.
The SICP Connection
This design reflects a core idea from Structure and Interpretation of Computer Programs: when a problem domain is complex enough, the right move is to build a language for it. Rerum’s rule DSL makes transformation logic inspectable, composable, and safe.
The engine composition operators (>> for sequencing, | for union) ensure closure: combining engines yields an engine. Same principle that makes Scheme’s procedures powerful. You can pass them, return them, combine them, no special cases. Transformation strategies are first-class.
A Readable DSL
At the heart of rerum is a domain-specific language for defining rewrite rules:
# Algebraic simplification
@add-zero[100] "x + 0 = x": (+ ?x 0) => :x
@mul-one[100]: (* ?x 1) => :x
@mul-zero[100]: (* ?x 0) => 0
Each rule has:
- A name:
@add-zerofor debugging and tracing - Optional priority:
[100]determines firing order when multiple rules match - Optional description: Human-readable explanation
- A pattern:
(+ ?x 0)matches addition with zero - A skeleton:
:xis the replacement
The pattern syntax:
| Syntax | Meaning |
|---|---|
?x | Match anything, bind to x |
?x:const | Match only numbers |
?x:var | Match only symbols |
?x:free(v) | Match expressions not containing v |
?x... | Variadic, capture remaining arguments |
Symbolic Differentiation in 15 Lines
Here’s a calculus ruleset that computes symbolic derivatives:
[basic-derivatives]
@dd-const[100]: (dd ?c:const ?v:var) => 0
@dd-var-same[100]: (dd ?x:var ?x) => 1
@dd-var-diff[90]: (dd ?y:var ?x:var) => 0
[rules]
@dd-sum: (dd (+ ?f ?g) ?v:var) => (+ (dd :f :v) (dd :g :v))
@dd-product: (dd (* ?f ?g) ?v:var) => (+ (* (dd :f :v) :g) (* :f (dd :g :v)))
@dd-power: (dd (^ ?f ?n:const) ?v:var) => (* :n (* (^ :f (- :n 1)) (dd :f :v)))
@dd-exp: (dd (exp ?f) ?v:var) => (* (exp :f) (dd :f :v))
@dd-log: (dd (ln ?f) ?v:var) => (/ (dd :f :v) :f)
@dd-sin: (dd (sin ?f) ?v:var) => (* (cos :f) (dd :f :v))
@dd-cos: (dd (cos ?f) ?v:var) => (* (- (sin :f)) (dd :f :v))
With these rules loaded:
from rerum import RuleEngine, E
engine = RuleEngine.from_file("calculus.rules")
# d/dx(x^2) = 2x
engine(E("(dd (^ x 2) x)")) # => (* 2 (* (^ x 1) 1))
The result needs simplification (another ruleset), but the differentiation itself is purely declarative.
The Security Model: Rules vs. Preludes
A key architectural decision: the separation between rules (untrusted, serializable) and preludes (trusted Python code). Rules define structural transformations. They can reference operations via the (! op args...) compute form, but those operations must be explicitly provided by the host.
JSL: A Functional Language Where Code Is JSON
November 20, 2024
JSL (JSON Serializable Language) is a functional programming language where code is JSON. The whole point: if your code is already valid JSON, serialization stops being a problem you solve and starts being a property you have.
The Problem
Most languages treat serialization as an afterthought. You write code in one representation, data lives in another, and moving computation across a network requires marshalling, pickling, or worse.
The traditional approach:
# Code: Python AST, bytecode, machine code
def factorial(n):
return 1 if n <= 1 else n * factorial(n - 1)
# Data: JSON
data = {"n": 5}
# Problem: Can't serialize the function, can't send it over network
JSL’s approach:
["do",
["def", "factorial",
["lambda", ["n"],
["if", ["<=", "n", 1],
1,
["*", "n", ["factorial", ["-", "n", 1]]]]]],
["factorial", 5]]
That program is valid JSON. Any JSON parser reads it. Any HTTP endpoint transmits it. Any database stores it. Any program generates it. Code and data are the same thing, which is Lisp’s oldest idea wearing a new coat.
Design Principles
JSON as Code and Data
All JSL programs and data structures are representable as standard JSON. This means universal parsing, generation, and compatibility with every tool that speaks JSON (which is basically every tool).
Serializable Closures
This is the thing I actually care about. Closures (functions with captured environment) are fully serializable:
from jsl import JSLRunner
runner = JSLRunner()
# Create a closure that captures 'multiplier'
runner.execute('''
(do
(def multiplier 10)
(def make-multiplier (lambda (x) (* x multiplier)))
(def my-func (make-multiplier 5)))
''')
# Serialize the closure
serialized = runner.serialize_value(runner.env.get('my-func'))
# Send over network, store in database, etc.
# Later, deserialize and execute
deserialized_func = runner.deserialize_value(serialized)
result = runner.apply(deserialized_func, [3]) # 30
The closure retains its captured multiplier variable even after serialization. In Python you’d reach for pickle, which is unsafe and fragile. Here it just works because the closure was JSON the whole time.
Effect Reification
Side effects are not executed directly. They’re described as data structures:
; This doesn't perform I/O directly
(host file-read "/tmp/data.json")
; Instead, it produces a data structure:
{
"effect": "host",
"command": "file-read",
"args": ["/tmp/data.json"]
}
The host environment controls, audits, or modifies these effects before execution. This is basically the algebraic effects pattern: pure computation produces descriptions of what it wants done, and the runtime decides whether to actually do it.
Linked project: Symlik
Rerum: Pattern Matching and Term Rewriting in Python
December 16, 2025
Rerum (Rewriting Expressions via Rules Using Morphisms) is a Python library for pattern matching and term rewriting. It makes symbolic computation accessible through a readable DSL while keeping a clean separation between trusted and untrusted code.
The Problem
Traditional symbolic math systems tend toward two extremes. Monolithic systems like Mathematica bundle everything in. Lighter tools force you to write complex recursive traversals every time you want to transform an expression. I wanted something in between: a simple, extensible system where transformation rules are data that can be loaded, combined, and inspected.
The SICP Connection
This design reflects a core idea from Structure and Interpretation of Computer Programs: when a problem domain is complex enough, the right move is to build a language for it. Rerum’s rule DSL makes transformation logic inspectable, composable, and safe.
The engine composition operators (>> for sequencing, | for union) ensure closure: combining engines yields an engine. Same principle that makes Scheme’s procedures powerful. You can pass them, return them, combine them, no special cases. Transformation strategies are first-class.
A Readable DSL
At the heart of rerum is a domain-specific language for defining rewrite rules:
# Algebraic simplification
@add-zero[100] "x + 0 = x": (+ ?x 0) => :x
@mul-one[100]: (* ?x 1) => :x
@mul-zero[100]: (* ?x 0) => 0
Each rule has:
- A name:
@add-zerofor debugging and tracing - Optional priority:
[100]determines firing order when multiple rules match - Optional description: Human-readable explanation
- A pattern:
(+ ?x 0)matches addition with zero - A skeleton:
:xis the replacement
The pattern syntax:
| Syntax | Meaning |
|---|---|
?x | Match anything, bind to x |
?x:const | Match only numbers |
?x:var | Match only symbols |
?x:free(v) | Match expressions not containing v |
?x... | Variadic, capture remaining arguments |
Symbolic Differentiation in 15 Lines
Here’s a calculus ruleset that computes symbolic derivatives:
[basic-derivatives]
@dd-const[100]: (dd ?c:const ?v:var) => 0
@dd-var-same[100]: (dd ?x:var ?x) => 1
@dd-var-diff[90]: (dd ?y:var ?x:var) => 0
[rules]
@dd-sum: (dd (+ ?f ?g) ?v:var) => (+ (dd :f :v) (dd :g :v))
@dd-product: (dd (* ?f ?g) ?v:var) => (+ (* (dd :f :v) :g) (* :f (dd :g :v)))
@dd-power: (dd (^ ?f ?n:const) ?v:var) => (* :n (* (^ :f (- :n 1)) (dd :f :v)))
@dd-exp: (dd (exp ?f) ?v:var) => (* (exp :f) (dd :f :v))
@dd-log: (dd (ln ?f) ?v:var) => (/ (dd :f :v) :f)
@dd-sin: (dd (sin ?f) ?v:var) => (* (cos :f) (dd :f :v))
@dd-cos: (dd (cos ?f) ?v:var) => (* (- (sin :f)) (dd :f :v))
With these rules loaded:
from rerum import RuleEngine, E
engine = RuleEngine.from_file("calculus.rules")
# d/dx(x^2) = 2x
engine(E("(dd (^ x 2) x)")) # => (* 2 (* (^ x 1) 1))
The result needs simplification (another ruleset), but the differentiation itself is purely declarative.
The Security Model: Rules vs. Preludes
A key architectural decision: the separation between rules (untrusted, serializable) and preludes (trusted Python code). Rules define structural transformations. They can reference operations via the (! op args...) compute form, but those operations must be explicitly provided by the host.
symlik: Symbolic Likelihood Models in Python
December 16, 2025
symlik is a Python library for symbolic likelihood models. Write your log-likelihood as a symbolic expression, and it derives everything needed for inference.
The Problem
Traditional statistical computing gives you two choices:
- Manual derivation. Work out score functions and information matrices by hand, then implement them. Error-prone, tedious.
- Numerical approximation. Use finite differences. Unstable, slow, no symbolic form to inspect.
The Approach
symlik takes a third path: symbolic differentiation. Define the model once, get exact derivatives automatically.
from symlik.distributions import exponential
model = exponential()
data = {'x': [1.2, 0.8, 2.1, 1.5]}
mle, _ = model.mle(data=data, init={'lambda': 1.0})
se = model.se(mle, data)
print(f"Rate: {mle['lambda']:.3f} +/- {se['lambda']:.3f}")
# Rate: 0.714 +/- 0.357
Behind the scenes, symlik:
- Symbolically differentiates the log-likelihood to get the score function
- Differentiates again for the Hessian
- Computes Fisher information from the Hessian
- Derives standard errors from the inverse information matrix
All exact. No numerical approximation.
Custom Models
The real power is defining custom models using s-expressions:
from symlik import LikelihoodModel
# Exponential: l(lambda) = sum[log(lambda) - lambda*x_i]
log_lik = ['sum', 'i', ['len', 'x'],
['+', ['log', 'lambda'],
['*', -1, ['*', 'lambda', ['@', 'x', 'i']]]]]
model = LikelihoodModel(log_lik, params=['lambda'])
# Symbolic derivatives available
score = model.score() # Gradient
hess = model.hessian() # Hessian matrix
info = model.information() # Fisher information
You define the log-likelihood once as a symbolic expression. symlik computes the rest.
Heterogeneous Data
One of symlik’s strengths is handling mixed observation types, which is exactly what you need for reliability analysis with censored data:
from symlik import ContributionModel
from symlik.contributions import complete_exponential, right_censored_exponential
model = ContributionModel(
params=["lambda"],
type_column="status",
contributions={
"observed": complete_exponential(),
"censored": right_censored_exponential(),
}
)
data = {
"status": ["observed", "censored", "observed", "observed", "censored"],
"t": [1.2, 3.0, 0.8, 2.1, 4.5],
}
Each observation type contributes differently to the likelihood. symlik handles the bookkeeping.
Connection to Research
symlik is the Python successor to my R package likelihood.model. It implements the theoretical framework from my thesis work on likelihood-based inference for series systems.
The Weibull Series Model Selection paper shows applications to reliability engineering, the kind of complex likelihood that benefits from symbolic treatment.
Powered by rerum
symlik uses rerum for symbolic differentiation. rerum is a pattern matching and term rewriting library that handles the calculus. The separation means you can use rerum for other symbolic computation tasks beyond likelihood models.
Installation
Available on PyPI:
pip install symlik
Documentation at queelius.github.io/symlik.
See the project page for more details.
Linked project: Xtk
Rerum: Pattern Matching and Term Rewriting in Python
December 16, 2025
Rerum (Rewriting Expressions via Rules Using Morphisms) is a Python library for pattern matching and term rewriting. It makes symbolic computation accessible through a readable DSL while keeping a clean separation between trusted and untrusted code.
The Problem
Traditional symbolic math systems tend toward two extremes. Monolithic systems like Mathematica bundle everything in. Lighter tools force you to write complex recursive traversals every time you want to transform an expression. I wanted something in between: a simple, extensible system where transformation rules are data that can be loaded, combined, and inspected.
The SICP Connection
This design reflects a core idea from Structure and Interpretation of Computer Programs: when a problem domain is complex enough, the right move is to build a language for it. Rerum’s rule DSL makes transformation logic inspectable, composable, and safe.
The engine composition operators (>> for sequencing, | for union) ensure closure: combining engines yields an engine. Same principle that makes Scheme’s procedures powerful. You can pass them, return them, combine them, no special cases. Transformation strategies are first-class.
A Readable DSL
At the heart of rerum is a domain-specific language for defining rewrite rules:
# Algebraic simplification
@add-zero[100] "x + 0 = x": (+ ?x 0) => :x
@mul-one[100]: (* ?x 1) => :x
@mul-zero[100]: (* ?x 0) => 0
Each rule has:
- A name:
@add-zerofor debugging and tracing - Optional priority:
[100]determines firing order when multiple rules match - Optional description: Human-readable explanation
- A pattern:
(+ ?x 0)matches addition with zero - A skeleton:
:xis the replacement
The pattern syntax:
| Syntax | Meaning |
|---|---|
?x | Match anything, bind to x |
?x:const | Match only numbers |
?x:var | Match only symbols |
?x:free(v) | Match expressions not containing v |
?x... | Variadic, capture remaining arguments |
Symbolic Differentiation in 15 Lines
Here’s a calculus ruleset that computes symbolic derivatives:
[basic-derivatives]
@dd-const[100]: (dd ?c:const ?v:var) => 0
@dd-var-same[100]: (dd ?x:var ?x) => 1
@dd-var-diff[90]: (dd ?y:var ?x:var) => 0
[rules]
@dd-sum: (dd (+ ?f ?g) ?v:var) => (+ (dd :f :v) (dd :g :v))
@dd-product: (dd (* ?f ?g) ?v:var) => (+ (* (dd :f :v) :g) (* :f (dd :g :v)))
@dd-power: (dd (^ ?f ?n:const) ?v:var) => (* :n (* (^ :f (- :n 1)) (dd :f :v)))
@dd-exp: (dd (exp ?f) ?v:var) => (* (exp :f) (dd :f :v))
@dd-log: (dd (ln ?f) ?v:var) => (/ (dd :f :v) :f)
@dd-sin: (dd (sin ?f) ?v:var) => (* (cos :f) (dd :f :v))
@dd-cos: (dd (cos ?f) ?v:var) => (* (- (sin :f)) (dd :f :v))
With these rules loaded:
from rerum import RuleEngine, E
engine = RuleEngine.from_file("calculus.rules")
# d/dx(x^2) = 2x
engine(E("(dd (^ x 2) x)")) # => (* 2 (* (^ x 1) 1))
The result needs simplification (another ruleset), but the differentiation itself is purely declarative.
The Security Model: Rules vs. Preludes
A key architectural decision: the separation between rules (untrusted, serializable) and preludes (trusted Python code). Rules define structural transformations. They can reference operations via the (! op args...) compute form, but those operations must be explicitly provided by the host.
XTK: A Symbolic Expression Toolkit for Term Rewriting
November 30, 2025
XTK (Expression Toolkit) is a Python library for symbolic computation through rule-based term rewriting. You define pattern-skeleton pairs, and the engine rewrites expressions by matching and substituting until it reaches a normal form.
I built this because I kept wanting a lightweight term rewriting system that wasn’t Mathematica. Something I could embed in other projects, extend with custom rules, and use from the command line.
Quick Start
The fastest way to try it is the interactive REPL:
pip install xpression-tk
python3 -m xtk.cli
xtk> (+ 2 3)
xtk> /rewrite
Rewritten: 5
xtk> (define square (lambda (x) (* x x)))
xtk> (square 4)
xtk> /rewrite
Rewritten: 16
Core Concepts
S-Expressions
XTK uses S-expressions as its primary representation. If you’ve used Lisp, this is familiar:
(+ 1 2) ; Addition
(* x (+ y 1)) ; Nested expressions
(lambda (x) x) ; Lambda abstraction
Infix Notation
For people who’d rather not count parentheses, there’s infix support:
xtk> /infix 2 + 3 * 4
S-expr: (+ 2 (* 3 4))
xtk> /infix (x + y) * (x - y)
S-expr: (* (+ x y) (- x y))
Rewrite Rules
Rules are [pattern, skeleton] pairs. Pattern variables bind to subexpressions, and skeleton references substitute them back:
from xtk import rewriter
# Define rules: x + 0 => x, x * 0 => 0
rules = [
[['+', ['?', 'x'], 0], [':', 'x']], # x + 0 => x
[['*', ['?', 'x'], 0], 0], # x * 0 => 0
]
# Create rewriter and apply
rewrite = rewriter(rules)
result = rewrite(['+', 'a', 0]) # => 'a'
result = rewrite(['*', ['+', 'a', 'b'], 0]) # => 0
Pattern syntax:
['?', 'x']matches any expression, binding it tox[':', 'x']references the matched binding in the skeleton
Built-in Rewrite Rules
XTK ships with standard algebraic simplification rules:
; Arithmetic
(+ x 0) → x
(* x 1) → x
(* x 0) → 0
(- x x) → 0
; Boolean
(and true x) → x
(or false x) → x
(not (not x)) → x
; Lambda calculus
((lambda (x) body) arg) → body[x := arg]
Step-by-Step Tracing
This is where it gets useful for teaching. You can watch the rewriting steps:
xtk> (* (+ 1 2) (- 5 5))
xtk> /trace
Step 1: (* (+ 1 2) (- 5 5))
Rule: (+ a b) → eval
Result: (* 3 (- 5 5))
Step 2: (* 3 (- 5 5))
Rule: (- a a) → 0
Result: (* 3 0)
Step 3: (* 3 0)
Rule: (* x 0) → 0
Result: 0
Final: 0
REPL Commands
/help Show all commands
/rewrite Apply rewrite rules
/step Single rewrite step
/trace Show rewrite trace
/rules List active rules
/load file.xtk Load rule definitions
/infix expr Parse infix to S-expr
/tree Show expression tree
/quit Exit REPL
Python API
from xtk import Expression, RuleSet, Engine
# Create engine with standard rules
engine = Engine.with_standard_rules()
# Parse and rewrite
expr = Expression.parse("(* (+ 1 2) (+ 3 4))")
result = engine.rewrite(expr)
print(result) # 21
# Custom rules
rules = RuleSet([
Rule.parse("(square ?x) → (* ?x ?x)"),
Rule.parse("(cube ?x) → (* ?x ?x ?x)"),
])
engine.add_rules(rules)
expr = Expression.parse("(+ (square 3) (cube 2))")
result = engine.rewrite(expr)
print(result) # 17
Expression Visualization
The REPL renders expression trees in ASCII:
Linked project: Crier
Crier: Cross-Post Your Content Everywhere
December 16, 2025
I published crier to PyPI. It’s a command-line tool for cross-posting content to multiple platforms at once.
The problem is simple: I write blog posts in Markdown with YAML front matter. I want them on dev.to, Hashnode, Bluesky, Mastodon, and wherever else, without manually copy-pasting into a dozen different editors. Crier handles this from the terminal.
Quick Start
pip install crier
cd your-blog
crier init
The init command walks you through setup: detecting content directories, configuring platforms with API keys.
The Workflow
- Your markdown posts with YAML front matter are the source of truth
.crier/registry.yamltracks what’s published wherecrier auditshows what’s missing or changedcrier publishpushes content out
# See what needs publishing
crier audit
# Publish to multiple platforms
crier publish post.md --to devto --to bluesky --to mastodon
# Bulk publish everything missing
crier audit --publish --yes
LLM-Powered Auto-Rewrite
This is the feature I’m most pleased with. Short-form platforms like Bluesky (300 chars) and Mastodon (500 chars) need summaries, not full articles. Crier generates these automatically using any OpenAI-compatible LLM:
# Auto-generate short-form content
crier publish post.md --to bluesky --to mastodon --auto-rewrite
# Bulk publish with auto-rewrite
crier audit --publish --auto-rewrite --yes
Simplest setup: If you have OPENAI_API_KEY set, it just works (defaults to gpt-4o-mini).
Or use local models:
# ~/.config/crier/config.yaml
llm:
base_url: http://localhost:11434/v1 # Ollama
model: llama3
The LLM generates platform-appropriate summaries that fit within character limits, with automatic retry if the output is too long.
Supported Platforms
| Platform | Type | Notes |
|---|---|---|
| dev.to | Blog | Full article support |
| Hashnode | Blog | Full article support |
| Medium | Blog | Publish/import mode |
| Ghost | Blog | Full article support |
| WordPress | Blog | Self-hosted or .com |
| Buttondown | Newsletter | Email subscribers |
| Bluesky | Social | Posts with link cards |
| Mastodon | Social | Toots with hashtags |
| Threads | Social | Short posts |
| Social | Professional network | |
| Twitter/X | Social | Copy-paste mode |
| Telegram | Channel | Bot posts |
| Discord | Channel | Webhook embeds |
Bulk Operations with Filters
The audit command supports filtering for targeted bulk operations:
# API platforms only (skip manual/import)
crier audit --publish --yes --only-api
# Long-form only (skip short-form social)
crier audit --publish --yes --long-form
# Random sample of 5 articles
crier audit --publish --yes --sample 5
# Filter by path and date
crier audit content/post --since 1m --only-api --publish --yes
# Combine filters
crier audit content/post --since 1w --only-api --long-form --sample 10 --publish --yes
| Filter | Description |
|---|---|
[PATH] | Only scan specific directory |
--since | Only content from this date (1d, 1w, 1m, or YYYY-MM-DD) |
--only-api | Skip manual/import platforms |
--long-form | Skip short-form social platforms |
--sample N | Random sample of N items |
--auto-rewrite | Generate short-form content with LLM |
Profiles
Group platforms into reusable profiles:
Linked project: Hypothesize
hypothesize: Now on CRAN
December 12, 2025
hypothesize is now on CRAN.
What It Does
hypothesize provides a consistent API for hypothesis testing in R. It defines generic methods that any hypothesis test can implement:
pval()- Extract the p-valuetest_stat()- Get the test statisticdof()- Retrieve degrees of freedomis_significant_at()- Check significance at a given level
The package ships with two implementations:
- Likelihood Ratio Test (LRT) for comparing nested models
- Wald Test for testing parameter estimates
Why
When building statistical libraries, I kept implementing ad-hoc hypothesis test structures. Different packages, different interfaces, no composability. hypothesize standardizes this: any package can wrap its tests in a consistent interface, and statistical workflows can treat all tests uniformly.
It’s a small package. That’s the point.
Installation
install.packages("hypothesize")
Quick Example
library(hypothesize)
# Likelihood Ratio Test
result <- lrt(null_loglik = -100, alt_loglik = -96, dof = 3)
print(result)
# Check significance
is_significant_at(result, 0.05)
# Extract components
pval(result)
test_stat(result)
Documentation
Full documentation is at queelius.github.io/hypothesize.
What’s Next
I have several other R packages in the pipeline for CRAN submission, including packages for likelihood-based inference and reliability analysis.
Links
- CRAN: CRAN.R-project.org/package=hypothesize
- GitHub: github.com/queelius/hypothesize
- Documentation: queelius.github.io/hypothesize
hypothesize: A Consistent Interface for Statistical Tests
March 25, 2022
R’s hypothesis testing functions are inconsistent. t.test() returns a different structure than chisq.test(). Writing generic code that works across tests is painful. hypothesize fixes this with a unified API where every test returns the same interface.
The Problem
Different R tests return incompatible objects:
t.test(x, y)$p.value # Works
chisq.test(x, y)$p.value # Also works
my_custom_test(x, y)$??? # Who knows?
You cannot write generic code that works across tests without knowing the internals of each one.
The Fix
hypothesize defines a consistent interface:
test <- lrt(model_null, model_alt) # Likelihood ratio test
pval(test) # Extract p-value
test_stat(test) # Extract test statistic
dof(test) # Extract degrees of freedom
is_significant_at(test, 0.05) # Boolean check
All tests, built-in or custom, implement the same generic functions. The interface is the same whether you are doing a likelihood ratio test, a Wald test, or a Z-test.
Integration with likelihood.model
The package works with likelihood.model, so likelihood ratio tests on any model are straightforward:
lrt(null_model, alternative_model) # Automatic LRT
You specify the models. The package computes the test statistic, degrees of freedom, and p-value. Same interface as every other test.
The Point
Tests are objects you manipulate, not functions with incompatible return types. You can write test-agnostic pipelines. You can wrap your own custom tests in the same interface. This is generic programming applied to hypothesis testing: a consistent abstraction over heterogeneous implementations.
R package – Works with likelihood.model – Documentation – GitHub
Linked project: Chatgpt-Complex-Net
Complex Networks 2025: Presenting Cognitive MRI at Binghamton
December 9, 2025
Last week I traveled to Binghamton University in Vestal, NY to present at Complex Networks 2025, the 14th International Conference on Complex Networks and their Applications.
The Paper
Our paper, “Cognitive MRI of AI Conversations: Analyzing AI Interactions through Semantic Embedding Networks” (co-authored with John Matta), introduces a way to understand how humans explore knowledge through AI dialogue.
The Core Idea
Linear conversation logs hide rich cognitive structure. We developed what we call a cognitive MRI: a network analysis technique that transforms sequential conversation traces into topological maps. Each conversation becomes a node, connected to others by semantic similarity. The result reveals how knowledge domains interconnect in ways that a flat log doesn’t show.
Key Findings
From 449 ChatGPT conversations:
- High modularity (0.750): Clear knowledge communities emerge naturally
- Heterogeneous topology: Theoretical domains (ML/AI) show hub-and-spoke patterns; practical domains (programming) show tree-like hierarchies
- Three bridge types: Evolutionary bridges (topic drift), integrative bridges (deliberate synthesis), pure bridges (critical links with minimal connections)
- User-weighted embeddings: A 2:1 user:AI weighting ratio best captures conversational intent
The Method
We used nomic-embed-text to generate semantic embeddings, weighted user inputs more heavily than AI responses (since users drive conversation direction), and constructed similarity networks at various thresholds. The phase transition at similarity threshold ~0.875 proved remarkably consistent across all weight configurations.
The Conference
Complex Networks brings together researchers from physics, computer science, biology, sociology, anyone studying systems as networks. Binghamton was an excellent host.
Mark Newman was there. One of the pioneers of modern network science, author of the definitive textbook on complex networks. I didn’t get to speak with him at length (didn’t want to bug him), but it was good to see the field’s foundations represented alongside newer applications.
The talks ranged from brain connectivity analysis to social media dynamics to infrastructure resilience. The same mathematical tools, community detection, centrality measures, network motifs, keep illuminating very different phenomena.
Presentation Materials
- Paper: Cognitive MRI of AI Conversations (full text + PDF)
- Slides: Conference presentation (Beamer slides)
- Code: github.com/queelius/chatgpt-complex-net
Why This Matters
As AI assistants become integral to knowledge work, understanding how humans navigate AI-mediated exploration matters. The cognitive MRI gives you:
Networks of Thought: Finding Your Research Niche in the Age of LLMs
October 25, 2025
Linked project: Infinigram
Infinigram: Variable-Length N-grams via Suffix Arrays
December 3, 2025
Infinigram (pip install py-infinigram) is a corpus-based language model that uses suffix arrays for variable-length n-gram pattern matching. Unlike neural language models, there is no training step. The corpus is the model.
The problem with fixed n-grams
Traditional n-gram models use fixed context lengths and blow up exponentially. A 5-gram model over a 50,000-word vocabulary needs to store up to \(50000^5\) possible patterns. That is roughly 312 petabytes. Nobody does this.
Infinigram uses suffix arrays instead:
- O(n) space: Linear in corpus size, not vocabulary size
- O(m log n) queries: Fast pattern matching for any context length
- Variable-length matching: Automatically uses the longest matching context
For a 1B token corpus, this means about 1GB instead of about 34GB for hash-based 5-grams.
How it works
Given a context, Infinigram finds the longest matching suffix in the training corpus:
from infinigram import Infinigram
corpus = [1, 2, 3, 4, 2, 3, 5, 6, 2, 3, 4]
model = Infinigram(corpus, max_length=10)
# Find longest match for context [2, 3]
position, length = model.longest_suffix([2, 3])
# Predict next token
probs = model.predict([2, 3])
# {4: 0.66, 5: 0.33, ...}
Predictions come from counting what tokens follow the matched pattern in the corpus. Simple frequency estimation, but over arbitrarily long contexts.
LLM probability mixing
The practical application I care about most: grounding LLM outputs without retraining.
# Mix LLM with corpus-based predictions
P_final = alpha * P_llm + (1 - alpha) * P_infinigram
This gives you:
- Domain adaptation without fine-tuning. Load a legal corpus and you get legal-domain predictions.
- Hallucination reduction by anchoring to actual corpus content.
- Explainability. Every prediction traces to specific corpus evidence. You can point to the exact passages.
Projections as inductive biases
I wrote a theoretical framework viewing inductive biases as projections: transformations applied to queries or training data that enable generalization.
- Runtime transforms: lowercase normalization, stemming, synonym expansion
- Corpus augmentations: data augmentation, paraphrasing
This gives a principled way to think about out-of-distribution generalization in corpus-based models. The projection determines what the model treats as “the same.”
Interactive REPL
Infinigram includes an interactive REPL for exploration:
infinigram-repl
infinigram> /dataset demo
infinigram [demo]> /load the cat sat on the mat
infinigram [demo]> /predict the cat
infinigram [demo]> /complete the cat --max 20
Future: LangCalc integration
Infinigram is designed to work with LangCalc, an algebraic framework for composing language models:
Everything is a File: Virtual Filesystems for CLI Data Tools
October 20, 2025
I had a bookmark manager. Then an ebook library manager. Then a chat history manager. Each started with the standard CRUD CLI:
btk add https://example.com --tags python,tutorial
btk list --tag python
btk search "async"
btk delete 1234
ebk import book.pdf --author "Knuth"
ebk list --author Knuth
ebk search "algorithms"
This works fine until you have 10,000+ bookmarks organized with hierarchical tags like programming/python/async, research/ml/transformers, work/clients/acme. Your ebook library has similar structure. Your exported chat conversations from Claude, ChatGPT, and Copilot are piling up.
Traditional CRUD commands become unwieldy:
btk list --tag programming/python/async/io --format json | jq '.[].title'
ebk list --category "Computer Science/Algorithms/Graph Theory" --limit 50
ctk search "machine learning" --source ChatGPT --date-from 2024-01-01
Each command requires precise arguments. Each tool has different flag conventions. You can’t navigate your data. You can only query it. And queries require knowing exactly what you’re looking for.
The insight: everything is a file
When I have thousands of source files organized in directories, I don’t run:
list-files --path /src/components/auth --extension .tsx
I run:
cd src/components/auth
ls *.tsx
The difference matters. With a filesystem, I can navigate incrementally (cd from general to specific), explore (ls to see what’s there), compose (cat file | grep pattern | wc -l), and use familiar tools (find, grep, xargs, pipes, redirection).
What if my bookmarks, ebooks, and chat histories were filesystems?
The pattern
Over the past year, I built six Python tools that all follow the same architecture:
| Tool | Domain | VFS Root Structure |
|---|---|---|
| btk | Bookmarks | /bookmarks/, /tags/, /recent/, /domains/, /unread/, /popular/ |
| ebk | Ebook library | /books/, /authors/, /series/, /subjects/, /recent/, /unread/ |
| ctk | Chat conversations | /conversations/, /sources/, /topics/, /starred/, /recent/ |
| ghops | Git repositories | /repos/, /languages/, /topics/, /stars/, /recent/ |
| infinigram | N-gram models | /datasets/, /models/, /corpora/ |
| AlgoTree | Tree structures | /nodes/, /paths/, /subtrees/ |
Each tool provides:
- A stateless CLI for scripting:
btk bookmark add URL,ebk import book.pdf - An interactive shell with a virtual filesystem:
btk shell,ebk shell,ctk chat - POSIX-like commands:
cd,ls,pwd,cat,mv,cp,rm,find,grep - Unix pipeline support: most commands output JSONL by default for piping
The interesting part is the shell.
Navigating 10,000 bookmarks
Live recording captured with asciinema. You can pause, copy text, and replay. The entire recording is 78KB of text.
Linked project: Langcalc
Infinigram: Variable-Length N-grams via Suffix Arrays
December 3, 2025
Infinigram (pip install py-infinigram) is a corpus-based language model that uses suffix arrays for variable-length n-gram pattern matching. Unlike neural language models, there is no training step. The corpus is the model.
The problem with fixed n-grams
Traditional n-gram models use fixed context lengths and blow up exponentially. A 5-gram model over a 50,000-word vocabulary needs to store up to \(50000^5\) possible patterns. That is roughly 312 petabytes. Nobody does this.
Infinigram uses suffix arrays instead:
- O(n) space: Linear in corpus size, not vocabulary size
- O(m log n) queries: Fast pattern matching for any context length
- Variable-length matching: Automatically uses the longest matching context
For a 1B token corpus, this means about 1GB instead of about 34GB for hash-based 5-grams.
How it works
Given a context, Infinigram finds the longest matching suffix in the training corpus:
from infinigram import Infinigram
corpus = [1, 2, 3, 4, 2, 3, 5, 6, 2, 3, 4]
model = Infinigram(corpus, max_length=10)
# Find longest match for context [2, 3]
position, length = model.longest_suffix([2, 3])
# Predict next token
probs = model.predict([2, 3])
# {4: 0.66, 5: 0.33, ...}
Predictions come from counting what tokens follow the matched pattern in the corpus. Simple frequency estimation, but over arbitrarily long contexts.
LLM probability mixing
The practical application I care about most: grounding LLM outputs without retraining.
# Mix LLM with corpus-based predictions
P_final = alpha * P_llm + (1 - alpha) * P_infinigram
This gives you:
- Domain adaptation without fine-tuning. Load a legal corpus and you get legal-domain predictions.
- Hallucination reduction by anchoring to actual corpus content.
- Explainability. Every prediction traces to specific corpus evidence. You can point to the exact passages.
Projections as inductive biases
I wrote a theoretical framework viewing inductive biases as projections: transformations applied to queries or training data that enable generalization.
- Runtime transforms: lowercase normalization, stemming, synonym expansion
- Corpus augmentations: data augmentation, paraphrasing
This gives a principled way to think about out-of-distribution generalization in corpus-based models. The projection determines what the model treats as “the same.”
Interactive REPL
Infinigram includes an interactive REPL for exploration:
infinigram-repl
infinigram> /dataset demo
infinigram [demo]> /load the cat sat on the mat
infinigram [demo]> /predict the cat
infinigram [demo]> /complete the cat --max 20
Future: LangCalc integration
Infinigram is designed to work with LangCalc, an algebraic framework for composing language models:
Language Calculus: An Algebraic Framework for LLM Composition
October 7, 2025
What if we could compose language models the way we compose functions in mathematics? What if there was an algebra of language models?
Language Calculus (langcalc) is an algebraic framework for building and reasoning about language model systems.
The Problem with Current LLM Composition
Today, combining language models typically means:
- Ad-hoc ensembling techniques
- Manual prompt chaining
- Hardcoded decision trees
- Black-box orchestration layers
There’s no principled way to reason about what these compositions do or how they behave. You wire things together and hope it works.
The Algebraic Approach
Language Calculus introduces operators with well-defined semantics:
Core Operators
M1 + M2 Mixture (weighted combination)
k * M Scaling (temperature/probability adjustment)
M1 | M2 Maximum (most confident response)
M1 & M2 Minimum (most conservative response)
M1 ^ M2 Exclusive-or (diverse perspectives)
M ** t Temperature adjustment
M ?? p Threshold filtering
M >>> t Truncation/limiting
Why This Matters
These operators satisfy algebraic laws:
(M1 + M2) + M3 = M1 + (M2 + M3) # Associativity
M1 + M2 = M2 + M1 # Commutativity
M + 0 = M # Identity
a * (M1 + M2) = a*M1 + a*M2 # Distributivity
This means we can transform, optimize, and reason about language model compositions algebraically. The laws aren’t just nice properties. They let you simplify compositions, prove equivalences, and optimize execution.
Practical Examples
Ensemble with Confidence Weighting
output = 0.4 * GPT4 + 0.3 * Claude + 0.3 * Llama
Expert Selection
code_task = (CodeLlama | GPT4) & SafetyModel
Diverse Brainstorming
ideas = CreativeModel ^ ConservativeModel ^ TechnicalModel
Temperature Search
explore = Model ** 1.5
exploit = Model ** 0.2
adaptive = 0.7 * exploit + 0.3 * explore
Theoretical Foundations
The framework provides:
- Formal semantics for each operator
- Type system ensuring valid compositions
- Equivalence relations for optimization
- Normal forms for canonical representations
This lets us prove properties like:
- Safety preservation under composition
- Bias reduction through specific mixtures
- Computational complexity bounds
Applications
Language Calculus enables:
- Automatic Optimization: Transform expensive compositions into equivalent cheaper ones
- Compositional Testing: Verify properties of complex systems from component properties
- Explainability: Understand what a composition does from its algebraic structure
- Meta-Learning: Learn optimal compositions for task families
Implementation
The paper includes:
Linked project: Reliability-Estimation-in-Series-Systems
Master's Project: Reliability Estimation in Series Systems
February 19, 2024
I presented my master’s project in October 2023, finishing up my MS in statistics/mathematics at SIUE. The associated paper is titled “Reliability Estimation in Series Systems: Maximum Likelihood Techniques for Right-Censored and Masked Failure Data.”
The Problem
In reliability engineering, you often find yourself in an annoying situation: a system fails, but you do not know which component caused the failure. This is called masked failure data. On top of that, some systems are still running when you stop observing them, so you only know they survived at least that long. That is right censoring. Both are common in practice. Identifying the exact failed component is expensive or sometimes impossible.
The project builds a likelihood-based framework that handles both masking and censoring simultaneously, models component lifetimes with Weibull distributions, derives closed-form Fisher information for the exponential special case, and provides bootstrap methods for uncertainty quantification. I implemented it all in an R package so practitioners can actually use it.
Related Work
This connects to several other posts and projects:
- Closed-Form Results for Masked Exponential Series Systems covers the exponential distribution special case with analytical solutions
- likelihood.model R package is the software implementation
See the full project page here.
mdrelax: When Masking Conditions Don't Hold
December 3, 2025
mdrelax extends my work on series system reliability by handling cases where the standard masking assumptions break down.
Background: The C1-C2-C3 Framework
My master’s thesis developed maximum likelihood techniques for series systems with masked failure data. The standard framework assumes three conditions:
- C1: The failed component is always in the candidate set
- C2: Non-informative masking (uniform probability within candidate set)
- C3: Masking mechanism is independent of system parameters
When these hold, the masking probabilities factor out and you can ignore them for parameter estimation. The expo-masked-fim paper derives closed-form Fisher Information for the exponential case, and maskedcauses implements the general framework.
The Problem
In practice, C2 and C3 are often violated.
Informative masking (C2 violation): Diagnostic tests may be better at identifying certain failure modes than others. A component that fails catastrophically is easier to identify than one that degrades subtly.
Parameter-dependent masking (C3 violation): The masking mechanism might depend on component reliabilities. Components with shorter lifetimes fail more often, so technicians get more practice diagnosing them.
If you pretend C2 and C3 hold when they don’t, your parameter estimates are biased. Sometimes badly.
What mdrelax Does
The package implements likelihood-based inference with relaxed conditions:
library(mdrelax)
# Generate masked data with Bernoulli candidate sets
md <- md_bernoulli_cand_C1_C2_C3(data, p = 0.3)
# Sample candidate sets
md <- md_cand_sampler(md)
# MLE for exponential series system
fit <- md_mle_exp_series_C1_C2_C3(md)
# Fisher information matrix
fim <- md_fim_exp_series_C1_C2_C3(md, params(fit))
Key Features
- Flexible masking models: Bernoulli, rank-based, KL-divergence constrained
- Identifiability analysis: Tools to check when parameters can actually be estimated
- Fisher information: Efficiency analysis under relaxed conditions
- Simulation utilities: Monte Carlo studies for method validation
Relationship to Other Work
This package sits at the end of a progression toward generality:
| Project | Focus |
|---|---|
| expo-masked-fim | Closed-form FIM for exponential case |
| maskedcauses | General R framework for masked data likelihood |
| reliability-estimation-in-series-systems | Master’s thesis implementation |
| wei.series.md.c1.c2.c3 | Weibull series systems under C1-C2-C3 |
| mdrelax | Relaxed conditions (C2, C3 violations) |
The progression:
- Exponential + C1-C2-C3: Closed-form solutions
- Weibull + C1-C2-C3: Numerical MLE
- Weibull + relaxed conditions: mdrelax
Each step trades analytical tractability for realism.
When to Use It
Use mdrelax when you suspect:
- Diagnostic accuracy varies by component type
- Masking patterns correlate with component reliabilities
- Standard C1-C2-C3 assumptions are too restrictive for your data
The trade-off is real: relaxed models have more parameters and may need larger samples for reliable estimation. But biased estimates from wrong assumptions aren’t free either.
Weibull Distributions: From Reliability Theory to My Own Survival Curve
April 18, 2022
The Weibull distribution models time-to-failure. In reliability engineering, that means component lifetimes. In medicine, it means survival times. I have been working with Weibull models for my thesis on series system reliability. Then I got diagnosed with cancer, and now every time I work with survival curves, I am looking at mathematical abstractions of something very concrete: how long until failure?
The Mathematics
The Weibull CDF:
F(t) = 1 - exp(-(t/λ)^k)
Two parameters:
- λ: scale (characteristic lifetime)
- k: shape (how failure rate changes over time)
The shape parameter k tells you the whole story:
k < 1: Decreasing hazard. If you survive early on, your risk goes down. This is the infant mortality pattern.
k = 1: Constant hazard. Memoryless. This is just the exponential distribution.
k > 1: Increasing hazard. Things wear out.
The Hazard Function
The hazard function is what makes Weibull useful for survival analysis:
h(t) = (k/λ)(t/λ)^(k-1)
This is the instantaneous failure rate: given that you have survived to time t, what is the probability you fail in the next instant?
For cancer, this is the number that matters. Some cancers have increasing hazard (the longer you have it, the worse things get). Others have decreasing hazard after initial treatment, meaning if you make it past the critical period, prognosis improves. Knowing which pattern applies to your disease changes how you think about time.
Personal Context
When you study survival analysis academically, it is abstract. When you are living it, every curve is personal.
I look at Kaplan-Meier plots and see myself somewhere on that curve. I work with hazard functions and think: is my k > 1 or k < 1? Am I in the wearing-out regime or the if-you-make-it-past-this-it-gets-better regime?
The math does not change. But the meaning does.
The Irony
I chose reliability engineering for my thesis before the cancer diagnosis. I was studying component failures in series systems, where if any one part fails, the whole system fails.
Then I became a series system. Organs, treatment response, immune function. All have to work. Failure of any one is catastrophic.
The mathematics I was studying abstractly became uncomfortably literal.
Bootstrap Methods: When Theory Meets Computation
September 10, 2021
The bootstrap is a trade: mathematical complexity for computational burden. Instead of deriving analytical formulas for sampling distributions, you simulate them.
The Idea
If you don’t know the sampling distribution of a statistic, approximate it by resampling from your data.
- Draw samples with replacement from the original data
- Compute your statistic on each resample
- The distribution of resampled statistics approximates the true sampling distribution
That’s it. The justification is more subtle than the procedure. Under regularity conditions, the bootstrap distribution converges to the true sampling distribution as sample size grows. This is non-parametric inference: you use the empirical distribution as a stand-in for the true distribution, without assuming a parametric form.
When I Use It
Bootstrap is my default tool when:
- I need confidence intervals for statistics with no closed-form variance
- Asymptotic theory doesn’t apply (small samples, non-standard statistics)
- I’m doing model selection via bootstrap cross-validation
- I’m working with censored data where standard errors are intractable
That last case is the one that matters most for my research.
The Computational Trade
Better to get the right answer slowly than the wrong answer quickly.
Deriving an analytical variance formula is hard. Sometimes it’s impossible for the statistic you actually care about. Bootstrap says: just compute the statistic 10,000 times on resampled data and look at the spread. With modern hardware, 10,000 resamples takes seconds.
The trade is almost always worth it.
My Thesis Work
My research uses bootstrap heavily. I’m working on reliability estimation for series systems where components fail and you don’t know which one caused the system failure. This is the masked failure data problem.
For these models, the MLE exists and you can compute it, but the standard variance formulas don’t. The Fisher information matrix involves expectations over the masking distribution that don’t simplify to anything closed-form.
Bootstrap gives me confidence intervals anyway. Resample the masked failure data, recompute the MLE on each resample, and use the distribution of bootstrapped MLEs to construct intervals. It’s not elegant, but it works, and “works” is the right criterion when the alternative is “no confidence intervals at all.”
Reliability Analysis and the Problem of Censored Data
August 14, 2019
One of the most interesting statistical problems I have encountered is reliability analysis with censored data: situations where you know something didn’t fail, but not when it will fail.
The Censoring Problem
Imagine testing light bulbs. You run them for 1000 hours. Some fail during the test. Others are still working when you stop.
For the survivors, you know:
- They lasted at least 1000 hours
- You do not know their actual lifetime
This is right censoring. The true value lies somewhere to the right of your observation. You have a lower bound, not a measurement.
Why This Matters
Censored data is everywhere:
- Medical studies (patients still alive at study end)
- Engineering tests (components that have not failed)
- Customer retention (users still active)
The naive responses are both wrong. Ignoring censored observations wastes information. Treating them as failures introduces bias. You need a framework that uses the partial information you actually have.
Maximum Likelihood to the Rescue
The solution is maximum likelihood estimation with likelihood contributions that account for censoring:
- Failure observations contribute the probability density \(f(t)\). You observed the exact failure time, so you know the probability of failing at that time.
- Censored observations contribute the survival probability \(S(t)\). You know the unit survived to time \(t\), so its contribution is the probability of surviving at least that long.
The likelihood for the whole sample is:
$$L = \prod_{i: \text{failed}} f(t_i) \prod_{j: \text{censored}} S(t_j)$$This lets you extract information from both failed and surviving units. The censored observations pull the estimated reliability upward; the failures pull it downward. Maximum likelihood balances them.
Series Systems Complexity
It gets more interesting with series systems, systems that fail when any component fails. If you observe system failure but do not know which component caused it, you have masked failure data.
This is the problem I am most interested in: extracting component-level reliability from system-level failures when the cause is ambiguous. The masking adds a latent variable, and the likelihood becomes a mixture. You can handle it with EM algorithms or direct optimization, but the combinatorics grow quickly with system size.
This work is laying groundwork for what will become a major focus of my mathematical statistics degree.
Linked project: Wei.series.md.c1.c2.c3
mdrelax: When Masking Conditions Don't Hold
December 3, 2025
mdrelax extends my work on series system reliability by handling cases where the standard masking assumptions break down.
Background: The C1-C2-C3 Framework
My master’s thesis developed maximum likelihood techniques for series systems with masked failure data. The standard framework assumes three conditions:
- C1: The failed component is always in the candidate set
- C2: Non-informative masking (uniform probability within candidate set)
- C3: Masking mechanism is independent of system parameters
When these hold, the masking probabilities factor out and you can ignore them for parameter estimation. The expo-masked-fim paper derives closed-form Fisher Information for the exponential case, and maskedcauses implements the general framework.
The Problem
In practice, C2 and C3 are often violated.
Informative masking (C2 violation): Diagnostic tests may be better at identifying certain failure modes than others. A component that fails catastrophically is easier to identify than one that degrades subtly.
Parameter-dependent masking (C3 violation): The masking mechanism might depend on component reliabilities. Components with shorter lifetimes fail more often, so technicians get more practice diagnosing them.
If you pretend C2 and C3 hold when they don’t, your parameter estimates are biased. Sometimes badly.
What mdrelax Does
The package implements likelihood-based inference with relaxed conditions:
library(mdrelax)
# Generate masked data with Bernoulli candidate sets
md <- md_bernoulli_cand_C1_C2_C3(data, p = 0.3)
# Sample candidate sets
md <- md_cand_sampler(md)
# MLE for exponential series system
fit <- md_mle_exp_series_C1_C2_C3(md)
# Fisher information matrix
fim <- md_fim_exp_series_C1_C2_C3(md, params(fit))
Key Features
- Flexible masking models: Bernoulli, rank-based, KL-divergence constrained
- Identifiability analysis: Tools to check when parameters can actually be estimated
- Fisher information: Efficiency analysis under relaxed conditions
- Simulation utilities: Monte Carlo studies for method validation
Relationship to Other Work
This package sits at the end of a progression toward generality:
| Project | Focus |
|---|---|
| expo-masked-fim | Closed-form FIM for exponential case |
| maskedcauses | General R framework for masked data likelihood |
| reliability-estimation-in-series-systems | Master’s thesis implementation |
| wei.series.md.c1.c2.c3 | Weibull series systems under C1-C2-C3 |
| mdrelax | Relaxed conditions (C2, C3 violations) |
The progression:
- Exponential + C1-C2-C3: Closed-form solutions
- Weibull + C1-C2-C3: Numerical MLE
- Weibull + relaxed conditions: mdrelax
Each step trades analytical tractability for realism.
When to Use It
Use mdrelax when you suspect:
- Diagnostic accuracy varies by component type
- Masking patterns correlate with component reliabilities
- Standard C1-C2-C3 assumptions are too restrictive for your data
The trade-off is real: relaxed models have more parameters and may need larger samples for reliable estimation. But biased estimates from wrong assumptions aren’t free either.
Model Selection for Weibull Series Systems: When Simpler Models Suffice
December 3, 2025
When can you safely use a simpler model for a series system? I ran extensive simulation studies with likelihood ratio tests to get a quantitative answer.
The Problem
In series system reliability, you estimate component parameters from masked failure data. For Weibull components, that means estimating \(2m\) parameters: shape \(k_j\) and scale \(\lambda_j\) for each of \(m\) components.
But what if the components have similar failure characteristics? A reduced model with homogeneous shape parameters uses only \(m+1\) parameters (one common \(k\) plus \(m\) scales). This roughly halves the parameter count and has a nice property: the system itself becomes Weibull-distributed.
The question is when this simplification is justified.
Key Findings
Robustness of the Reduced Model
For well-designed series systems (components with similar failure characteristics), the result is striking:
The reduced homogeneous-shape model cannot be rejected even with sample sizes approaching 30,000, far larger than anything typically available in practice.
With realistic sample sizes (50 to 500), the likelihood ratio test shows no evidence against the reduced model when components truly have similar shapes. This is strong justification for using the simpler model.
Sharp Boundaries
The paper pins down exactly how much heterogeneity it takes to trigger rejection:
| Shape Deviation | Sample Size | LRT Decision |
|---|---|---|
| 0.25 | 30,000 | Fail to reject |
| 0.50 | 1,000+ | Reject |
| 1.0 | 100+ | Strong reject |
| 3.0 | 50+ | Very strong reject |
Even modest deviations in a single component’s shape parameter provide evidence against the reduced model. The boundaries are clean.
Practical Guidance
Use the reduced model when:
- Components come from similar manufacturing processes
- Historical data suggests similar wear-out patterns
- Sample sizes are moderate (\(n < 500\))
- You need a quick reliability assessment
Use the full model when:
- Components have fundamentally different failure modes (infant mortality vs wear-out)
- Large samples are available (\(n > 1000\))
- Precise component-level inference is critical
- Preliminary studies suggest model inadequacy
Connection to Related Work
This paper fits into a broader program on masked failure data:
| Paper/Package | Focus |
|---|---|
| Master’s Thesis | Weibull MLE with masked data |
| expo-masked-fim | Closed-form FIM for exponential case |
| maskedcauses | R framework for masked data likelihood |
| mdrelax | Relaxed masking conditions |
| This paper | Model selection via LRT |
The pieces address different aspects of the same problem:
Reliability Estimation in Series Systems: Maximum Likelihood Techniques for Right-Censored and Masked Failure Data
June 15, 2024
This is my master’s thesis in mathematics. The problem: you have a series system (fails when any component fails), you can observe system-level failure times, but you often can’t tell which component actually caused the failure. The failure cause is “masked.” On top of that, some systems are still running at the end of the study, so their lifetimes are right-censored. You want to estimate the reliability of individual components from this incomplete data.
The challenge
Estimating component reliability is hard when:
- You only observe system-level failure data
- The exact component cause of failure is ambiguous (masked)
- System lifetimes are right-censored
- Sample sizes are small
A series system fails when any component fails, so disentangling which components are weakest from system-level observations is a non-trivial inference problem.
Likelihood model for masked data
I developed a likelihood model that handles two types of incompleteness.
Right-censoring: the system is observed until time \(\tau\), but may not have failed yet:
\[ S_i = \min\lbrace \tau_i, T_i\rbrace \]\[ \delta_i = \mathbb{1}_{T_i < \tau_i} \]Component cause masking: when the system fails, you observe a candidate set \(\mathcal{C}_i\) containing the failed component, but can’t pinpoint the exact cause.
Under three conditions (which hold in many industrial settings), the likelihood contribution simplifies to:
\[ L_i(\theta) \propto \left[\prod_{j=1}^m R_j(s_i; \theta_j)\right] \times \left[\sum_{j \in \mathcal{C}_i} h_j(s_i; \theta_j)\right]^{\delta_i} \]where \(R_j\) is the reliability function and \(h_j\) is the hazard function of component \(j\). The three conditions are: the candidate set always contains the true failed component, masking probability is uniform across components in the candidate set, and masking probabilities don’t depend on the system parameters \(\theta\).
Weibull series systems
I focused on components with Weibull lifetimes: \(T_{ij} \sim \text{Weibull}(k_j, \lambda_j)\). The shape parameter \(k_j\) tells you the failure behavior: \(k < 1\) is infant mortality, \(k = 1\) is random failures (exponential), \(k > 1\) is wear-out.
System reliability when all components are Weibull:
\[ R_{T_i}(t; \theta) = \exp\left\lbrace -\sum_{j=1}^m \left(\frac{t}{\lambda_j}\right)^{k_j}\right\rbrace \]The hazard function is additive:
\[ h_{T_i}(t; \theta) = \sum_{j=1}^m \frac{k_j}{\lambda_j}\left(\frac{t}{\lambda_j}\right)^{k_j-1} \]Simulation studies
I ran extensive simulations varying three factors:
Right-censoring impact (q = 60% to 100%): Scale parameters showed positive bias with censoring. Shape parameters were more sensitive than scale parameters. The most reliable component was most affected by censoring. Convergence rate exceeded 95% for q >= 0.7.
Reliability Analysis and the Problem of Censored Data
August 14, 2019
One of the most interesting statistical problems I have encountered is reliability analysis with censored data: situations where you know something didn’t fail, but not when it will fail.
The Censoring Problem
Imagine testing light bulbs. You run them for 1000 hours. Some fail during the test. Others are still working when you stop.
For the survivors, you know:
- They lasted at least 1000 hours
- You do not know their actual lifetime
This is right censoring. The true value lies somewhere to the right of your observation. You have a lower bound, not a measurement.
Why This Matters
Censored data is everywhere:
- Medical studies (patients still alive at study end)
- Engineering tests (components that have not failed)
- Customer retention (users still active)
The naive responses are both wrong. Ignoring censored observations wastes information. Treating them as failures introduces bias. You need a framework that uses the partial information you actually have.
Maximum Likelihood to the Rescue
The solution is maximum likelihood estimation with likelihood contributions that account for censoring:
- Failure observations contribute the probability density \(f(t)\). You observed the exact failure time, so you know the probability of failing at that time.
- Censored observations contribute the survival probability \(S(t)\). You know the unit survived to time \(t\), so its contribution is the probability of surviving at least that long.
The likelihood for the whole sample is:
$$L = \prod_{i: \text{failed}} f(t_i) \prod_{j: \text{censored}} S(t_j)$$This lets you extract information from both failed and surviving units. The censored observations pull the estimated reliability upward; the failures pull it downward. Maximum likelihood balances them.
Series Systems Complexity
It gets more interesting with series systems, systems that fail when any component fails. If you observe system failure but do not know which component caused it, you have masked failure data.
This is the problem I am most interested in: extracting component-level reliability from system-level failures when the cause is ambiguous. The masking adds a latent variable, and the likelihood becomes a mixture. You can handle it with EM algorithms or direct optimization, but the combinatorics grow quickly with system size.
This work is laying groundwork for what will become a major focus of my mathematical statistics degree.
Linked project: Alga
Alga: Algebraic Text Processing with Fuzzy Matching
November 30, 2025
Alga is a C++20 header-only library that treats text manipulation as algebra instead of imperative string hacking. It is built on monoids, functors, and extended operators, and it gives you compositional parsing with built-in fuzzy matching.
The Core Idea
Instead of treating strings as mutable buffers, Alga treats text as elements of algebraic structures:
#include "parsers/lc_alpha.hpp"
#include "parsers/porter2stemmer.hpp"
#include "parsers/algebraic_operators.hpp"
using namespace alga;
auto word1 = make_lc_alpha("hello");
auto word2 = make_lc_alpha("world");
if (word1 && word2) {
// Monoid concatenation
auto combined = *word1 * *word2; // "helloworld"
// Repetition
auto emphasis = *word1 ^ 3; // "hellohellohello"
// Sequential composition (produces vector)
auto sequence = *word1 >> *word2; // vector["hello", "world"]
// Porter2 stemming with algebraic composition
auto stem = make_porter2_stem("running");
if (stem) {
auto repeated = *stem ^ 2; // "runrun"
}
}
The operators are not arbitrary overloads. They follow actual algebraic laws (associativity, identity, etc.), which means you can reason about compositions the same way you reason about mathematical expressions.
Algebraic Operators
| Operator | Meaning | Example |
|---|---|---|
* | Monoid concatenation | *word1 * *word2 |
| | Choice (first valid) | word1 | word2 |
^ | Repetition (n times) | *word ^ 3 |
>> | Sequential (to vector) | *word1 >> *word2 |
List Combinators
Parse separated lists and sequences:
#include "parsers/list_combinators.hpp"
// CSV parsing
auto csv = sepBy(int_parser(), char_parser(','));
auto [pos, nums] = csv.parse("1,2,3"); // vector<int>{1, 2, 3}
// One or more items (fails on empty)
auto csv1 = sepBy1(word_parser(), char_parser(','));
// Optional trailing separator
auto items = sepEndBy(word_parser(), char_parser(';'));
If you have used Haskell’s parsec or Megaparsec, the combinator style will feel familiar. The difference is that Alga’s combinators carry algebraic guarantees through the type system.
Fuzzy Matching
Parse noisy, imperfect input with built-in fuzzy matching:
#include "parsers/fuzzy_parsers.hpp"
#include "parsers/similarity.hpp"
using namespace alga::fuzzy;
using namespace alga::similarity;
// Accept "hello" with up to 2 typos
auto greeting = fuzzy_match("hello", 2);
greeting.parse("helo"); // Matches (1 edit)
greeting.parse("heello"); // Matches (1 edit)
greeting.parse("world"); // Fails (too different)
// Sound-alike name matching
auto name_parser = phonetic_match("Smith");
name_parser.parse("Smyth"); // Matches (same Soundex)
// Combined fuzzy: case + phonetic + edit distance
auto flexible = combined_fuzzy("Python", 2);
flexible.parse("python"); // Case-insensitive
flexible.parse("Pyton"); // Fuzzy match (1 typo)
// String similarity metrics
auto dist = levenshtein_distance("kitten", "sitting"); // 3
auto sim = jaro_winkler_similarity("Martha", "Marhta"); // 0.96
This is the part I find most useful in practice. Real-world text is messy, and having fuzzy matching baked into the parser combinator framework means you do not have to bolt it on as an afterthought.
Phonetic Algorithms
Sound-alike word matching:
#include "parsers/phonetic.hpp"
auto code1 = soundex("Smith"); // "S530"
auto code2 = soundex("Smyth"); // "S530" (same!)
bool alike = sounds_like_soundex("Robert", "Rupert"); // true
Unicode Support
Full UTF-8 with multi-script alphabetic parsing:
Linked project: AlgoGraph
AlgoGraph: Immutable Graph Library with Functional Transformers
November 30, 2025
AlgoGraph is an immutable graph library for Python. Version 2.0.0 introduces pipe-based transformers, declarative selectors, and lazy views, which together cut boilerplate by roughly 90% for common graph operations.
Why Immutability for Graphs?
Mutable graph libraries like NetworkX are powerful but carry hidden costs:
- Side effects: Modifying a graph can break other code holding references to it
- Debugging difficulty: Hard to track when and where a graph changed
- Thread unsafety: Concurrent modifications cause subtle bugs
AlgoGraph takes a different approach: all operations return new graph objects. The original is never modified.
from AlgoGraph import Graph
g1 = Graph.from_edges(('A', 'B'), ('B', 'C'))
g2 = g1.add_vertex('D') # g1 unchanged, g2 is new graph
assert 'D' not in g1.vertices()
assert 'D' in g2.vertices()
This is the same idea behind persistent data structures in Clojure or Haskell. You get referential transparency, which means you can reason about graph transformations without worrying about what else might be mutating the same object.
Pipe-Based Transformers
The main feature of v2.0.0 is the transformer pipeline using Python’s | operator:
from AlgoGraph.transformers import filter_vertices, largest_component, stats
# Compose operations declaratively
result = (graph
| filter_vertices(lambda v: v.get('active'))
| largest_component()
| stats())
# result: {'vertex_count': 42, 'edge_count': 156, 'density': 0.18, ...}
Compare with the imperative alternative:
# Old way (NetworkX-style)
active = graph.subgraph([v for v in graph.vertices() if v.attrs.get('active')])
components = list(connected_components(active))
largest = max(components, key=len)
subgraph = active.subgraph(largest)
stats = compute_stats(subgraph)
The pipe version reads top to bottom. Each step is a function. You can compose them, reuse them, test them independently.
Available transformers:
filter_vertices(pred),filter_edges(pred)– Filter by predicatemap_vertices(fn),map_edges(fn)– Transform attributesreverse(),to_undirected()– Structure transformationslargest_component(),minimum_spanning_tree()– Algorithm-basedto_dict(),to_adjacency_list(),stats()– Export operations
Declarative Selectors
Query vertices and edges with logical operators instead of filtering lambdas:
from AlgoGraph.graph_selectors import vertex as v, edge as e
# Find active users with high degree
power_users = graph.select_vertices(
v.attrs(active=True) & v.degree(min_degree=10)
)
# Find heavy edges from admin nodes
admin_edges = graph.select_edges(
e.source(v.attrs(role='admin')) & e.weight(min_weight=100)
)
# Complex queries with OR, NOT, XOR
special = graph.select_vertices(
(v.attrs(vip=True) | v.degree(min_degree=50)) & ~v.attrs(banned=True)
)
You specify what you want, not how to find it. The selector algebra handles the rest.
Lazy Views
Views provide efficient filtering without copying data:
from AlgoGraph.views import filtered_view, neighborhood_view
# Create view without copying (O(1) space)
view = filtered_view(
large_graph,
vertex_filter=lambda v: v.get('active'),
edge_filter=lambda e: e.weight > 5.0
)
# Iterate lazily
for vertex in view.vertices():
process(vertex)
# Materialize only when needed
small_graph = view.materialize()
# Explore k-hop neighborhood
local = neighborhood_view(graph, center='Alice', k=2)
View types:
filtered_view()– Filter vertices/edgessubgraph_view()– View specific verticesreversed_view()– Reverse edge directionsundirected_view()– View as undirectedneighborhood_view()– k-hop neighborhood
56+ Algorithms
AlgoGraph includes broad algorithm coverage:
Linked project: Disjoint_interval_set
libdis: Disjoint Interval Sets as a Complete Boolean Algebra
November 30, 2025
libdis is a C++17/20/23 header-only library that treats interval sets as first-class mathematical objects forming a complete Boolean algebra. Most interval libraries give you containers. This one gives you the algebra.
The Problem
Intervals show up everywhere: scheduling, computational geometry, range queries, memory management. But most C++ libraries treat them as fancy containers. You get insert, remove, maybe a merge. You don’t get complement. You don’t get De Morgan’s laws.
I wanted a library where interval sets are actual mathematical objects. You write a & b and get an intersection. You write ~a and get a complement over the full real line. The operators aren’t sugar; they satisfy the axioms.
#include <dis/disjoint_interval_set.hpp>
using dis = dis::disjoint_interval_set<int>;
dis a = dis::closed(1, 5) | dis::closed(10, 15); // [1,5] ∪ [10,15]
dis b = dis::closed(3, 12); // [3,12]
auto intersection = a & b; // [3,5] ∪ [10,12]
auto union_set = a | b; // [1,15]
auto difference = a - b; // [1,3) ∪ (12,15]
auto complement = ~a; // (-∞,1) ∪ (5,10) ∪ (15,+∞)
Boolean Algebra Axioms
These aren’t just convenient operators. They satisfy the actual mathematical laws:
Associativity: (a | b) | c == a | (b | c)
Commutativity: a & b == b & a
Distributivity: a & (b | c) == (a & b) | (a & c)
Identity: a | ∅ == a, a & U == a
Complement: a | ~a == U, a & ~a == ∅
De Morgan’s Laws: ~(a & b) == ~a | ~b
All 94 test cases verify these properties. If you break an axiom, you’ll hear about it.
Interval Types
Create intervals with different boundary conditions:
auto closed = dis::closed(1, 5); // [1, 5]
auto open = dis::open(1, 5); // (1, 5)
auto left_open = dis::left_open(1, 5); // (1, 5]
auto right_open = dis::right_open(1, 5); // [1, 5)
// Unbounded intervals
auto from = dis::from(5); // [5, +∞)
auto until = dis::until(5); // (-∞, 5]
auto everything = dis::all(); // (-∞, +∞)
auto nothing = dis::empty(); // ∅
Set Operations
dis a = dis::closed(0, 10);
dis b = dis::closed(5, 15);
// Union
auto u = a | b; // [0, 15]
// Intersection
auto i = a & b; // [5, 10]
// Difference
auto d = a - b; // [0, 5)
// Symmetric difference
auto s = a ^ b; // [0, 5) ∪ (10, 15]
// Complement
auto c = ~a; // (-∞, 0) ∪ (10, +∞)
Querying
dis intervals = dis::closed(1, 5) | dis::closed(10, 15);
// Point containment
intervals.contains(3); // true
intervals.contains(7); // false
// Interval containment
intervals.contains(dis::closed(2, 4)); // true
intervals.contains(dis::closed(2, 12)); // false
// Overlap detection
intervals.overlaps(dis::closed(4, 11)); // true
// Iteration
for (const auto& interval : intervals) {
std::cout << interval << std::endl;
}
STL Conformance
v1.1.0 brings full STL container conformance. Iterators, range-based for, standard algorithms, all of it:
Linked project: Fuzzy-Logic-Search
fuzzy-logic-search: Query Documents with Fuzzy Logic
November 30, 2025
fuzzy-logic-search (fls) brings fuzzy logic to document querying. Unlike traditional Boolean search that returns binary relevant/not-relevant results, fls produces a degree-of-membership score in [0, 1], indicating how well each document matches your query.
The Problem with Boolean Search
Boolean search is rigid: a document either matches or it does not. If you search for “python AND machine-learning,” you get a binary split. A document about Python ML that never uses the exact term “machine-learning” gets zero, same as a document about medieval pottery.
Fuzzy logic captures the gradation that Boolean search throws away.
from fuzzy_logic_search.fuzzy_query import FuzzyQuery
from fuzzy_logic_search.fuzzy_set import FuzzySet
# Construct a query
query = FuzzyQuery("(and python machine-learning)")
# Or use Python operators
q1 = FuzzyQuery("python")
q2 = FuzzyQuery("machine-learning")
query = q1 & q2 # Equivalent to (and python machine-learning)
Query Language
Queries use a Lisp-like syntax that maps to an AST:
; Simple conjunction
(and cat dog)
; With negation
(and cat dog (not fish))
; With fuzzy modifiers
(very (and cat dog))
; Complex nested query
(or (and python ml) (very (not java)))
Or construct directly with Python:
# Using operators
query = FuzzyQuery("cat") & FuzzyQuery("dog") & ~FuzzyQuery("fish")
# Using AST directly
query = FuzzyQuery(['and', 'cat', 'dog', ['not', 'fish']])
I went with S-expressions for the query language because they map directly to the AST. No parsing ambiguity, trivial to serialize, and anyone who has written a Lisp evaluator can understand the implementation in about ten minutes.
Fuzzy Modifiers
Linguistic hedges transform membership values:
# "Very" squares the membership (emphasizes strong matches)
very_query = FuzzyQuery("python").very()
# 0.9 -> 0.81, 0.5 -> 0.25
# "Somewhat" takes square root (broadens tolerance)
somewhat_query = FuzzyQuery("python").somewhat()
# 0.9 -> 0.95, 0.25 -> 0.5
# "Extremely" cubes the membership
extremely_query = FuzzyQuery("python").extremely()
# "Slightly" takes 10th root
slightly_query = FuzzyQuery("python").slightly()
These come from Zadeh’s original fuzzy logic work. “Very” is concentration (squaring), “somewhat” is dilation (square root). They are mathematically clean and semantically intuitive: “very python” means “only documents that are strongly about Python.”
Evaluating Queries
Evaluate queries against a document corpus:
# Documents as lists of terms
docs = [
["python", "machine-learning", "tensorflow"],
["java", "spring", "microservices"],
["python", "web", "flask"],
["machine-learning", "neural-networks", "pytorch"]
]
# Evaluate query
query = FuzzyQuery("python") & FuzzyQuery("machine-learning")
result = query.evaluate(docs) # Returns FuzzySet
# result.memberships = [1.0, 0.0, 0.0, 0.0]
# Only first document has both terms
Custom Membership Functions
The default membership is crisp (term present or not), but you can provide custom functions for more nuanced matching:
Linked project: Src2md
src2md: Fitting Codebases into LLM Context Windows
November 30, 2025
src2md solves a practical problem: you want an LLM to understand your codebase, but the codebase doesn’t fit in the context window.
GPT-4 gives you ~128K tokens. Claude gives you ~200K. A medium-sized project blows past both. Naive truncation loses critical context. Manual curation doesn’t scale. So I built a tool that does it automatically.
How It Works
src2md reads a source tree, scores files by importance, and compresses them to fit a target token budget. The output is structured Markdown (or JSON, or plain text) ready to paste into an LLM conversation.
pip install src2md
# Basic markdown generation
src2md /path/to/project -o documentation.md
# With context optimization for GPT-4
src2md /path/to/project --gpt4 -o optimized.md
# With intelligent summarization
src2md /path/to/project --summarize --compression-ratio 0.3
Context Window Targeting
You can target specific LLM context windows:
# Target specific LLM context windows
src2md . --target-tokens 128000 # GPT-4
src2md . --target-tokens 200000 # Claude
# Predefined windows
src2md . --window gpt-4-turbo
src2md . --window claude-3
Multi-Tier Summarization
Not all files are equally important. src2md uses progressive compression: critical files get full source, important files get AST-level summaries, supporting files get docstrings only, and peripheral files get dropped.
from src2md import Converter
converter = Converter(
target_tokens=100000,
summarization_levels={
'critical': 'full', # Keep full source
'important': 'ast', # AST-based summary
'supporting': 'minimal', # Docstrings only
'peripheral': 'exclude' # Skip entirely
}
)
File Importance Scoring
The importance scoring considers multiple factors:
- Centrality: How many other files import this one?
- Complexity: Cyclomatic complexity, lines of code
- Recency: Recently modified files matter more
- Naming:
main.py,index.tsget a priority boost
AST-Based Analysis
For supported languages, src2md parses the AST to extract structure rather than just truncating text:
# From a 500-line Python file, extract:
# - Class/function signatures
# - Docstrings
# - Type hints
# - Key logic patterns
This preserves the information an LLM actually needs to reason about the code.
Output Formats
src2md . --format markdown # Default
src2md . --format json # Structured data
src2md . --format jsonl # Line-delimited JSON
src2md . --format html # Web-viewable
src2md . --format text # Plain text
Python API
from src2md import Repository, ContextWindow
# Basic usage
output = Repository("/path/to/project").analyze().to_markdown()
# Optimize for GPT-4 context window
output = (Repository("/path/to/project")
.optimize_for(ContextWindow.GPT_4)
.analyze()
.to_markdown())
# Full fluent API with all features
result = (Repository("/path/to/project")
.name("MyProject")
.include("src/", "lib/")
.exclude("tests/", "*.log")
.with_importance_scoring()
.with_summarization(
compression_ratio=0.3,
preserve_important=True,
use_llm=True
)
.optimize_for_tokens(100_000)
.analyze()
.to_json(pretty=True))
LLM-Powered Compression
For semantic understanding beyond AST extraction, you can use an LLM to do the summarization itself:
Linked project: Sparse_spatial_hash
Sparse Spatial Hash Grids: Efficient N-Dimensional Spatial Indexing
November 11, 2025
Linked project: AlgoTree
Everything is a File: Virtual Filesystems for CLI Data Tools
October 20, 2025
I had a bookmark manager. Then an ebook library manager. Then a chat history manager. Each started with the standard CRUD CLI:
btk add https://example.com --tags python,tutorial
btk list --tag python
btk search "async"
btk delete 1234
ebk import book.pdf --author "Knuth"
ebk list --author Knuth
ebk search "algorithms"
This works fine until you have 10,000+ bookmarks organized with hierarchical tags like programming/python/async, research/ml/transformers, work/clients/acme. Your ebook library has similar structure. Your exported chat conversations from Claude, ChatGPT, and Copilot are piling up.
Traditional CRUD commands become unwieldy:
btk list --tag programming/python/async/io --format json | jq '.[].title'
ebk list --category "Computer Science/Algorithms/Graph Theory" --limit 50
ctk search "machine learning" --source ChatGPT --date-from 2024-01-01
Each command requires precise arguments. Each tool has different flag conventions. You can’t navigate your data. You can only query it. And queries require knowing exactly what you’re looking for.
The insight: everything is a file
When I have thousands of source files organized in directories, I don’t run:
list-files --path /src/components/auth --extension .tsx
I run:
cd src/components/auth
ls *.tsx
The difference matters. With a filesystem, I can navigate incrementally (cd from general to specific), explore (ls to see what’s there), compose (cat file | grep pattern | wc -l), and use familiar tools (find, grep, xargs, pipes, redirection).
What if my bookmarks, ebooks, and chat histories were filesystems?
The pattern
Over the past year, I built six Python tools that all follow the same architecture:
| Tool | Domain | VFS Root Structure |
|---|---|---|
| btk | Bookmarks | /bookmarks/, /tags/, /recent/, /domains/, /unread/, /popular/ |
| ebk | Ebook library | /books/, /authors/, /series/, /subjects/, /recent/, /unread/ |
| ctk | Chat conversations | /conversations/, /sources/, /topics/, /starred/, /recent/ |
| ghops | Git repositories | /repos/, /languages/, /topics/, /stars/, /recent/ |
| infinigram | N-gram models | /datasets/, /models/, /corpora/ |
| AlgoTree | Tree structures | /nodes/, /paths/, /subtrees/ |
Each tool provides:
- A stateless CLI for scripting:
btk bookmark add URL,ebk import book.pdf - An interactive shell with a virtual filesystem:
btk shell,ebk shell,ctk chat - POSIX-like commands:
cd,ls,pwd,cat,mv,cp,rm,find,grep - Unix pipeline support: most commands output JSONL by default for piping
The interesting part is the shell.
Navigating 10,000 bookmarks
Live recording captured with asciinema. You can pause, copy text, and replay. The entire recording is 78KB of text.
AlgoTree: Immutable Trees with Functional Transformers
June 21, 2024
AlgoTree is a tree manipulation library for Python. Version 2.0 is a complete redesign built on immutable-by-default principles with composable transformers and pattern-matching selectors.
Why immutable trees?
Mutable tree libraries have hidden costs. Modifying a tree can break other code holding references to it. Changes are hard to track during debugging. Concurrent modifications cause subtle bugs. The usual story.
AlgoTree takes a different approach: all operations return new tree objects. The original is never modified.
from AlgoTree import Node, node
# Build a tree
tree = node("root",
node("child1", value=1),
node("child2", value=2)
)
# All operations return new trees
tree2 = tree.with_name("new_root") # tree unchanged
tree3 = tree.with_child(Node("child3")) # tree unchanged
This is the same idea behind persistent data structures in Clojure or Haskell. Immutability eliminates a whole class of bugs at the cost of some allocation overhead. For tree manipulation tasks (as opposed to, say, hot inner loops), the tradeoff is worth it.
Building Trees
Multiple construction styles for different use cases:
from AlgoTree import Node, node, TreeBuilder
# Simple construction with Node
tree = Node("root",
Node("child1", attrs={"value": 1}),
Node("child2", attrs={"value": 2})
)
# Convenience function (auto-converts strings)
tree = node("root",
node("child1", value=1),
"child2", # Strings auto-convert to nodes
node("child3",
"grandchild1",
"grandchild2"
)
)
# Fluent builder API
tree = (TreeBuilder("root", type="container")
.child("src")
.child("main.py", type="file", size=1024)
.child("utils.py", type="file", size=512)
.up()
.child("docs")
.child("README.md", type="file")
.build())
Functional Transformations
The standard functional toolkit, applied to trees:
# Map: transform all nodes
doubled = tree.map(lambda n: n.with_attrs(
value=n.get("value", 0) * 2
))
# Filter: keep nodes matching predicate
filtered = tree.filter(lambda n: n.get("value", 0) > 5)
# Find: locate specific nodes
nodes = tree.find_all(lambda n: n.is_leaf)
Composable Selectors
Pattern matching with wildcards and logical composition:
from AlgoTree import name, attrs, leaf, type_
# Pattern matching with wildcards
selector = name("*.txt")
# Attribute matching with predicates
selector = attrs(size=lambda s: s > 1000)
# Logical composition with operators
selector = type_("file") & ~leaf() # Files that aren't leaves
# Structural selectors
selector = type_("file").child_of(name("src"))
selector = leaf().at_depth(2)
# Use selectors with trees
matching_nodes = list(selector.select(tree))
The selectors compose with &, |, and ~. This means you can build complex queries from simple parts without writing custom traversal code.
Pipe-Based Transformers
Build transformation pipelines with the >> operator:
from AlgoTree import map_, filter_, prune, normalize, extract
# Build transformation pipelines
pipeline = (
map_(lambda n: {"processed": True}) >>
filter_(lambda n: n.get("active")) >>
normalize(sort_children=True) >>
extract(lambda n: n.name)
)
# Apply pipeline to tree
result = pipeline(tree)
This is the same idea as Unix pipes. Each stage takes a tree and returns a tree (or extracted values). The >> operator chains them left to right.
Linked project: Dagshell
DagShell: A Content-Addressable Virtual Filesystem
October 12, 2025
DagShell is a virtual filesystem that organizes data by content instead of location. Identical files automatically share storage through SHA256 hashing. The structure is a directed acyclic graph rather than a tree, so the same content block can be referenced from multiple paths without duplication.
I built it because sometimes you need filesystem semantics without touching actual disk. Testing, sandboxing, versioning, portability. The implementation has 583 tests with 77% coverage.
The DAG structure
Traditional filesystems are trees: each file has exactly one parent. DagShell uses a DAG where content is stored once and referenced by hash:
/project/
├── src/
│ └── main.py ──────┐
├── backup/ │
│ └── main.py ──────┼──> [SHA256: abc123...] -> "print('hello')"
└── archive/ │
└── main.py ──────┘
Three paths, one storage block.
Fluent Python API
DagShell provides a chainable API that mirrors shell commands:
from dagshell.dagshell_fluent import DagShell
shell = DagShell()
# Create project structure
(shell
.mkdir("/project/src")
.mkdir("/project/docs")
.cd("/project/src")
.echo("def main(): pass").out("main.py")
.echo("# My Project").out("../docs/README.md"))
# Navigate with directory stack
shell.pushd("/tmp")
shell.touch("scratch.txt")
shell.popd() # Back to /project/src
# Save entire filesystem to JSON
shell.save("project_snapshot.json")
Terminal emulator
For interactive exploration:
python -m dagshell.terminal
dagshell:/$ mkdir /home/user
dagshell:/$ cd /home/user
dagshell:/home/user$ echo "Hello" > greeting.txt
dagshell:/home/user$ cat greeting.txt
Hello
dagshell:/home/user$ ls -la
total 1
drwxr-xr-x 2 user user 4096 Aug 15 10:00 .
drwxr-xr-x 3 user user 4096 Aug 15 10:00 ..
-rw-r--r-- 1 user user 6 Aug 15 10:00 greeting.txt
Virtual devices
Standard Unix special files work:
shell.echo("garbage").out("/dev/null") # Discarded
random_bytes = shell.cat("/dev/random") # Random data
zeros = shell.head("/dev/zero", 100) # 100 zero bytes
Import/export
Move files between real and virtual filesystems:
# Import from real filesystem
shell.import_file("/real/path/data.csv", "/virtual/data.csv")
# Export to real filesystem
shell.export_file("/virtual/results.json", "/real/path/results.json")
# Import entire directory
shell.import_dir("/real/project", "/virtual/project")
Persistence
The entire filesystem state serializes to JSON:
shell.save("filesystem.json")
restored = DagShell.load("filesystem.json")
# Or get JSON directly
state = shell.to_json()
The JSON format is human-readable:
{
"root": {
"type": "directory",
"children": {
"project": {
"type": "directory",
"children": {
"README.md": {
"type": "file",
"content_hash": "abc123..."
}
}
}
}
},
"content_store": {
"abc123...": "# My Project\n..."
}
}
Content hashes in the directory tree, actual content in a flat store. Deduplication falls out naturally.
Scheme DSL
For Lisp people, there’s a Scheme interface:
(mkdir "/project")
(cd "/project")
(echo "Hello" "greeting.txt")
(define files (ls))
I included this partly because I like Scheme and partly because a filesystem is a natural fit for s-expressions.
Linked project: Dreamlog
DreamLog: Logic Programming That Dreams to Improve Itself
October 8, 2025
DreamLog is a logic programming system that learns by alternating between wake and sleep phases. During wake, it uses LLMs to generate missing knowledge. During sleep, it compresses what it knows into more general principles. Like biological brains, roughly.
Compression is learning
The theoretical basis comes from algorithmic information theory: the system that explains your data with the shortest program is the one most likely to generalize. This is Solomonoff induction, the mathematical formalization of Occam’s razor.
For logic programming, the sleep phase searches for minimal representations that preserve deductive closure:
\[ \text{minimize } |KB'| \text{ subject to } \text{Closure}(KB') = \text{Closure}(KB) \]Find the shortest knowledge base that still derives all the same facts.
Wake phase: generate knowledge
During wake, DreamLog operates as a logic programming engine with LLM-based knowledge generation:
from dreamlog.pythonic import dreamlog
kb = dreamlog(llm_provider="openai")
# Add some facts
kb.fact("parent", "john", "mary")
kb.fact("parent", "mary", "alice")
# Add a rule
kb.rule("grandparent", ["X", "Z"]) \
.when("parent", ["X", "Y"]) \
.and_("parent", ["Y", "Z"])
# Query
for result in kb.query("grandparent", "X", "alice"):
print(f"{result.bindings['X']} is Alice's grandparent") # john
The interesting part is what happens with undefined predicates:
# Query a predicate we never defined
for result in kb.query("sibling", "X", "Y"):
# LLM generates knowledge about siblings on-the-fly
print(result)
When the evaluator encounters an undefined predicate, it triggers the LLM hook to generate both facts and rules. The system infers primitive properties (like gender from names) and derives rules compositionally.
Sleep phase: compress knowledge
During sleep, DreamLog reorganizes through compression operators:
from dreamlog.kb_dreamer import KnowledgeBaseDreamer
dreamer = KnowledgeBaseDreamer(kb.provider)
session = dreamer.dream(
kb,
dream_cycles=3, # Multiple REM cycles
exploration_samples=10, # Try different optimizations
verify=True # Ensure behavior preservation
)
print(f"Compression: {session.compression_ratio:.1%}")
print(f"Generalization: {session.generalization_score:.2f}")
The compression operators:
- Anti-unification: find general patterns from specific instances
- Predicate invention: discover intermediate concepts that simplify rules
- Subsumption elimination: remove specific rules subsumed by general ones
This is where the real learning happens. The wake phase accumulates facts and rules. The sleep phase finds the structure in them.
KB-aware RAG
A key design choice: the retrieval-augmented generation is knowledge-base-aware. The system uses weighted embeddings combining query similarity (70%) with knowledge base context (30%), so example selection considers both the query structure and current reasoning state.
A success-based learning mechanism tracks which examples lead to successful inference, progressively improving retrieval quality through experience.
Linked project: Fuzzy-Soft-Circuit
Learning Fuzzy Logic: Automatic Rule Discovery Through Differentiable Circuits
October 7, 2025
Fuzzy logic is good for reasoning under uncertainty, but it has a bottleneck: you need domain experts to define the rules.
What if fuzzy systems could learn their own rules from data?
The Traditional Fuzzy Logic Bottleneck
Classic fuzzy systems require:
- Membership functions: “How hot is hot?”
- Inference rules: “If temp is hot AND humidity is high THEN…”
- Defuzzification: Converting fuzzy outputs to crisp values
This means:
- Domain expertise (expensive)
- Trial and error (time-consuming)
- Manual tuning (brittle)
In practice, fuzzy logic is often abandoned in favor of neural networks. You lose interpretability, but at least you don’t need a domain expert hand-crafting rules.
The Idea: Fuzzy Soft Circuits
We present a framework that:
- Represents fuzzy systems as differentiable computational graphs
- Learns membership functions and rules via gradient descent
- Keeps the interpretability of traditional fuzzy systems
Key Innovation: Soft Gates
Traditional circuits use hard logic gates (AND, OR, NOT). We use soft, differentiable approximations:
# Traditional (non-differentiable)
AND(a, b) = min(a, b)
OR(a, b) = max(a, b)
# Soft (differentiable)
soft_AND(a, b) = a * b
soft_OR(a, b) = a + b - a*b
soft_NOT(a) = 1 - a
These are differentiable but approximate the same semantics. That means backpropagation works.
The Architecture
Input Features
|
Fuzzification Layer (learnable membership functions)
|
Soft Circuit Layer (learnable fuzzy rules)
|
Aggregation Layer (learnable combination)
|
Defuzzification Layer
|
Output
Every component is differentiable. Train end-to-end with backpropagation.
Automatic Rule Discovery
The system discovers rules like:
IF temperature is {learned_high} AND humidity is {learned_humid}
THEN discomfort is {learned_uncomfortable}
Where the membership functions {learned_high}, {learned_humid}, etc. are learned from data, not hand-crafted.
Why Not Just Use a Neural Network?
Fair question. Fuzzy soft circuits give you things neural networks don’t:
- Interpretability: You can extract and read the learned rules
- Sample efficiency: The structured inductive bias helps with limited data
- Domain integration: You can incorporate expert knowledge as priors
- Uncertainty quantification: Fuzzy truth values are meaningful
Neural networks give you a black box. You need large datasets. Incorporating domain knowledge is hard. Uncertainty requires special techniques.
If you need both learning and interpretability, fuzzy soft circuits sit in a useful spot.
Training Process
# Initialize random fuzzy circuit
circuit = FuzzySoftCircuit(
n_inputs=5,
n_rules=10,
n_outputs=1
)
# Train with gradient descent
for epoch in epochs:
# Forward pass
predictions = circuit(inputs)
# Compute loss
loss = mse(predictions, targets)
# Backward pass (automatic differentiation)
loss.backward()
# Update membership functions and rules
optimizer.step()
# Extract learned rules
rules = circuit.extract_rules()
print(rules) # Human-readable fuzzy rules!
Experimental Results
On benchmark datasets:
Fuzzy Soft Circuits: Learning Fuzzy Rules from Data
October 1, 2024
Traditional fuzzy logic systems are powerful. They encode expert knowledge as interpretable rules like “IF temperature IS HIGH AND humidity IS LOW THEN fan speed IS FAST.” The problem is someone has to write those rules.
What if the rules could discover themselves?
The Expert Knowledge Bottleneck
Every classical fuzzy system needs three things from a human expert:
- Membership functions:Where does “HIGH” start? Where does “LOW” end?
- Rule structure:Which combinations of inputs matter?
- Rule existence:How many rules are there? Which ones are relevant?
This is expensive. Experts are hard to find, struggle to articulate their reasoning precisely, and can’t easily update systems as conditions change. In emerging domains, relevant expertise might not even exist.
Previous approaches have chipped away at parts of this problem. ANFIS 1 learns membership function parameters but needs a predefined rule structure. Genetic fuzzy systems 2 can evolve rule bases but lose gradient information. The Wang-Mendel method 3 generates rules from data but still needs hand-designed membership functions.
None of them make the entire system learnable end-to-end.
The Key Insight: Make “IF” Differentiable
The core idea is simple: treat a fuzzy rule’s existence as a continuous parameter.
In a traditional system, a rule either exists or it doesn’t:it’s a binary choice. We replace this with a soft switch: a sigmoid gate \(\gamma_r = \sigma(s_r)\) that smoothly interpolates between “this rule exists” (\(\gamma_r \to 1\)) and “this rule doesn’t exist” (\(\gamma_r \to 0\)).
This transforms rule discovery from a discrete search problem into a differentiable optimization problem. Gradient descent can now tell the system not just how to tune a rule, but whether the rule should exist at all.
Architecture
A fuzzy soft circuit has three differentiable stages:
1. Fuzzification
Each input \(x_i\) is mapped through \(k\) learnable Gaussian membership functions:
\[ \mu_{i,j}(x_i) = \exp\!\left(-\frac{(x_i - c_{i,j})^2}{w_{i,j}^2}\right) \]The centers \(c_{i,j}\) and widths \(w_{i,j}\) are learnable. We parameterize widths as \(w = e^{\hat{w}}\) to ensure positivity. No one decides where “HIGH” starts:the system figures it out.
2. Soft Rule Evaluation
For each potential rule \(r\), two things are learned:
Antecedent relevance:a weight vector determines which fuzzy features matter for this rule. We use a gated product that smoothly interpolates between “this feature participates” and “this feature is ignored”:
Linked project: Zeroipc
ZeroIPC: Shared Memory as a Computational Substrate
October 6, 2025
ZeroIPC reimagines inter-process communication. Instead of treating shared memory as passive storage, it becomes an active computational substrate. Futures, lazy evaluation, reactive streams, CSP-style channels, all with zero-copy performance.
The Core Idea
Traditional IPC systems treat shared memory as a bucket for data. You serialize, copy, deserialize. Even “zero-copy” systems are often just optimized data containers.
ZeroIPC asks a different question: what if shared memory could hold not just data, but computation itself?
This shift enables:
- Futures that represent computations in progress across processes
- Lazy values that defer expensive work and share cached results
- Reactive streams with functional operators (map, filter, fold)
- CSP channels for Go-style structured concurrency
All with zero serialization overhead and language independence.
Design Philosophy
1. Minimal Metadata
ZeroIPC stores only three pieces of information per structure:
- Name: For discovery
- Offset: Where data starts
- Size: How much memory is allocated
No type information. No schema. No versioning metadata.
This enables true language independence. Python and C++ can both create, read, and write structures. Type safety is enforced per-language (C++ templates, Python NumPy dtypes).
2. Language Equality
There’s no “primary” language. All implementations are first-class:
C++ Producer:
#include <zeroipc/memory.h>
#include <zeroipc/array.h>
zeroipc::Memory mem("/sensor_data", 10*1024*1024);
zeroipc::Array<float> temps(mem, "temperature", 1000);
temps[0] = 23.5f;
Python Consumer:
from zeroipc import Memory, Array
import numpy as np
mem = Memory("/sensor_data")
temps = Array(mem, "temperature", dtype=np.float32)
print(temps[0]) # 23.5
Same binary format. No bindings. No FFI. Pure implementations following the same specification.
3. Zero Dependencies
Each implementation stands alone:
- C: Pure C99, POSIX only
- C++: Header-only, C++23
- Python: Pure Python with NumPy
No protobuf. No serialization libraries. Just direct memory access.
Traditional Data Structures
ZeroIPC provides lock-free implementations of standard structures:
| Structure | Description | Concurrency |
|---|---|---|
| Array | Fixed-size contiguous storage | Atomic operations |
| Queue | Circular MPMC buffer | Lock-free CAS |
| Stack | LIFO with ABA prevention | Lock-free CAS |
| Map | Hash map with linear probing | Lock-free |
| Set | Hash set for unique elements | Lock-free |
| Pool | Object pool with free list | Lock-free |
| Ring | High-performance streaming | Lock-free |
These are the foundation. The interesting part is what comes next.
Codata: Computation as First-Class Structure
Data vs Codata
Data structures answer “what values are stored?”
- Array: collection of values
- Map: key-value associations
- Queue: FIFO buffer
Codata structures answer “how are values computed?”
- Future: value that will exist
- Lazy: computation deferred
- Stream: potentially infinite sequence
- Channel: communication process
ZeroIPC is (to my knowledge) one of the first IPC systems to treat codata as first-class.
Linked project: Chop
chop: When Every Command Returns the Same Kind of Thing
July 15, 2025
Section 2.2 of Structure and Interpretation of Computer Programs introduces the closure property: the result of combining things should be the same kind of thing you started with. cons two values and you get a pair, which you can cons again. This is what makes recursive data structures possible. Without it, you get flat records. With it, you get trees, lists, nested structure of arbitrary depth.
Abelson and Sussman are careful to distinguish this from lexical closures (functions that capture their environment). Algebraic closure is about the type signature of combination: if the output type matches the input type, composition is unlimited.
Most discussions of closure treat it as a property to verify. You check whether your algebra is closed and move on. But closure is more powerful than that. It’s a design method: choose a type, force every operation to consume and produce that type, and see what emerges. The constraint does the creative work.
chop is an image-manipulation CLI built on exactly this principle. 27 commands. One rule: read JSON from stdin, write JSON to stdout.
The Problem with Image Pipelines
Traditional image CLIs violate closure. ImageMagick consumes a file and produces a file. Each invocation is terminal. The output is pixels, not something you can pipe into further processing without going back to disk. Composition happens through flag accumulation inside a single command, not through the shell’s native composition mechanism.
You can’t tee a midpoint. You can’t save a half-finished pipeline as a recipe and apply it later. You can’t branch.
The SICP parallel is direct: if cons produced an atom instead of a pair, you could build flat structures but not recursive ones. If an image command produces pixels instead of a composable description, you can build single transformations but not pipelines.
One Constraint: JSON In, JSON Out
Every chop command reads a PipelineState JSON object from stdin, appends one operation, and writes the updated JSON to stdout. The wire format carries no image data, only a recipe:
{
"version": 3,
"ops": [
["load", ["photo.jpg"], {}],
["resize", ["50%"], {}],
["pad", [10], {"color": "white"}]
],
"metadata": {}
}
Each operation is a [name, args, kwargs] triple. The pipeline accumulates operations as data. Here’s what this looks like in practice:
Linked project: Mcts-Reasoning
MCTS-Reasoning: Tree Search for LLM Reasoning
December 1, 2024
I’ve been working on applying Monte Carlo Tree Search to LLM reasoning. The idea: multi-step reasoning is a sequential decision problem, and MCTS is good at those.
The Problem with Single-Shot Reasoning
When you ask an LLM a hard question, it generates one response. If that response goes down a wrong path early, there’s no recovery. The model commits to its initial approach and follows it to completion, even when better alternatives existed.
This is a waste. The model might have gotten it right if it had taken a different first step. MCTS addresses this by building a tree of reasoning paths and using the UCB1 bandit algorithm to balance exploration of new paths with exploitation of promising ones.
How It Works
The system models reasoning as a search problem:
- States: Partial reasoning traces (what’s been written so far)
- Actions: Reasoning continuations (the next step)
- Terminal states: Complete solutions with final answers
- Rewards: Quality assessments of final answers
Each MCTS simulation runs through four phases:
- Selection: Traverse the tree using UCB1 to pick promising paths
- Expansion: Add a new reasoning step via LLM generation
- Rollout: Continue reasoning until reaching a terminal state
- Backpropagation: Update statistics back up the tree
Tree-Building Rollouts
One design choice worth noting: I use tree-building rollouts. Standard game-playing MCTS uses a fast random policy during rollouts and doesn’t store those nodes. Here, we add every rollout node to the tree. This preserves the full reasoning trace and allows reuse of reasoning steps in future simulations. It’s more expensive per simulation, but reasoning steps are expensive to generate anyway, so you want to keep them.
Terminal-Only Evaluation
The evaluator runs only on terminal states. Intermediate reasoning states aren’t evaluated, which reduces computational cost. LLM-as-judge calls happen only when a complete answer is produced. This keeps the search cheap where it can be cheap.
The Technical Report
I wrote a formal specification that provides rigorous definitions for all components: states, actions, nodes, and the search tree. It includes precise pseudocode for all four MCTS phases, clear interfaces for the Generator and Evaluator components, and complexity analysis showing O(KD) tree operations for K simulations with max depth D.
Linked project: Complex-Network-Rag
Cluster-Aware Retrieval for RAG Systems
November 15, 2024
Most RAG systems treat embedding spaces as flat, uniform distributions. They’re not. Real knowledge bases contain distinct semantic clusters: database docs, frontend frameworks, DevOps practices, each with different internal structure. Ignoring this wastes retrieval precision.
The Problem with Flat Retrieval
A query about “React hooks optimization” should pull from the frontend cluster, not equally consider database or infrastructure docs that happen to share semantic overlap. Standard cosine similarity doesn’t care about topical boundaries. You get results that are individually relevant but collectively unfocused.
Modeling Clusters with GMM
Gaussian Mixture Models assume your embeddings arise from \(K\) underlying Gaussian distributions:
$$p(v) = \sum_{k=1}^K \pi_k \mathcal{N}(v \mid \mu_k, \Sigma_k)$$For a query \(q\), compute the posterior probability of each cluster:
$$p(k \mid q) = \frac{\pi_k \mathcal{N}(q \mid \mu_k, \Sigma_k)}{\sum_{j=1}^K \pi_j \mathcal{N}(q \mid \mu_j, \Sigma_j)}$$This gives you soft assignments: the probability that a query belongs to each semantic cluster.
Two-Stage Retrieval
- Cluster selection: Pick cluster(s) with highest \(p(k \mid q)\). Take top-2 for ambiguous queries.
- Intra-cluster retrieval: Run k-NN within selected clusters.
The cluster boundaries act as a soft filter, avoiding the “dilution effect” where off-topic documents dominate results.
Mahalanobis Distance Per Cluster
Here’s the underexplored part: different clusters can use different distance metrics. For a cluster modeled as \(\mathcal{N}(\mu_k, \Sigma_k)\), the Mahalanobis distance accounts for the cluster’s shape:
$$d_{\text{Mah}}(q, v) = \sqrt{(q - v)^T \Sigma_k^{-1} (q - v)}$$Elongated clusters in certain semantic directions get stretched appropriately. Cosine similarity treats all directions equally. Mahalanobis adapts.
Clusters as Agent Tools
In agentic RAG, each cluster becomes a tool the agent can invoke:
tools = [
ClusterRetrievalTool(cluster_id=k, name=f"Search {topic_k}")
for k in range(K)
]
The agent decides which clusters to search and in what order:
- Query: “How does React’s context API compare to Redux?”
- Agent plan:
- Search frontend cluster for React context
- Search state management cluster for Redux patterns
- Synthesize comparison
This beats flat retrieval for cross-topic synthesis.
Implementation
Fit GMM offline on document embeddings:
from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components=K, covariance_type='full')
gmm.fit(document_embeddings)
# For query q:
cluster_probs = gmm.predict_proba(q.reshape(1, -1))[0]
selected_clusters = cluster_probs.argsort()[-2:][::-1] # top-2
Store cluster assignments as metadata in your vector DB:
results = vector_db.query(
query_embedding=q,
filter={"cluster_id": {"$in": selected_clusters}},
top_k=20
)
Key decisions:
- Number of clusters: Use BIC/AIC or domain knowledge
- Regularization: Add \(\lambda I\) to covariance matrices to prevent singularities
- Initialization: k-means++ for better convergence
When It Helps
- Topically diverse corpora: Multi-product docs, cross-domain papers
- Single-topic queries: Clear primary topic to route to
- Noise reduction: Distant-but-similar content diluting results
When it doesn’t:
Linked project: Algebraic_hashing
The Beautiful Deception: How 256 Bits Pretend to be Infinity
July 1, 2024
How do you store infinity in 256 bits? You don’t. But you can fake it well enough that no bounded observer can tell the difference. This paper is about that deception, why it works, and what it tells us about randomness.
The impossible oracle
A random oracle maps any input to an infinite sequence of perfectly random bits. Try to implement one and you fail immediately:
- Memory unboundedness: each new query exhausts memory
- Non-serializability: can’t save/restore state
- Non-reproducibility: each instance generates different values
- Non-distributability: can’t share across systems
This isn’t a limitation of current hardware. It’s a constructive proof that true random oracles can’t exist in our computational universe.
The lie that works
From this impossibility comes something useful:
class LazyDigest:
def __init__(self, seed):
self.seed = seed
def __getitem__(self, index):
return hash(seed || index)[0]
256 bits of entropy generating what appears to be an infinite random sequence.
The deception:
- Appears: infinite random sequence
- Actually: deterministic function with 256 bits of state
- Information content: K(LazyDigest) = 256 bits + constant
- Apparent information: infinite
We’re achieving a compression ratio of infinity, representing unbounded data with bounded information.
Why it works
Computational indistinguishability. If h is a secure PRF, no polynomial-time algorithm can tell LazyDigest apart from truly random output. We’re not random. We’re computationally hard to distinguish from random. This weaker guarantee is sufficient for all of cryptography.
Since LazyDigest has finite state, it must eventually cycle. After at most 2^256 queries, it repeats. But the expected cycle length is 2^128, roughly 10^38. At a billion queries per second, cycling takes about 10^21 years. The universe is roughly 10^10 years old.
Advanced constructions
Hierarchical seeding extends the effective period:
epoch_seed = h(master_seed || "epoch" || epoch)
chunk_seed = h(epoch_seed || "chunk" || chunk)
value = h(chunk_seed || position)[0]
XOR multi-hash hedges against individual algorithm failures:
result = sha256(seed||index)[0] ^ sha512(seed||index)[0] ^
sha3_256(seed||index)[0] ^ blake2b(seed||index)[0]
The system stays secure if at least one hash function holds. This hedges against future cryptanalysis, quantum vulnerabilities, and implementation bugs.
Sponge construction reserves capacity bits that never leave the system, providing tunable security with 2^(capacity/2) collision resistance.
Random oracles and uncomputable reals
Most real numbers are uncomputable. They require infinite information to specify. A true random oracle is the cryptographic analog:
- Computable reals (like pi, e): measure zero, finite programs
- Uncomputable reals: measure one, infinite information
- LazyDigest: computable, appears random
- Random oracle: uncomputable, truly random
We’re using computable functions to approximate uncomputable ones.
Algebraic Hashing: Composable Hash Functions Through XOR
November 1, 2022
Most hash libraries treat hash functions as opaque blobs. You put data in, you get bits out, and that’s the end of the story. Algebraic Hashing takes a different approach: it exposes the mathematical structure underneath, so you can compose hash functions like algebraic expressions. And because this is C++20 with concepts and templates, the composition resolves entirely at compile time. Zero runtime overhead.
The observation
Hash functions form an abelian group under XOR:
- Closure:
h1 XOR h2is still a valid hash function - Associativity:
(h1 XOR h2) XOR h3 = h1 XOR (h2 XOR h3) - Identity: XOR with zero
- Inverses: each hash is its own inverse under XOR
This is a clean algebraic structure, and it’s the foundation for everything that follows.
What you can do with it
Compile-time composition. Using C++20 concepts and template metaprogramming:
auto composed = fnv1a<> ^ sha256<>;
auto hash = composed("data"); // Zero runtime overhead
All composition resolves at compile time. No virtual dispatch, no function pointers.
Provable properties. XOR composition preserves uniform distribution (under independence), avalanche effect, and collision resistance for cryptographic hashes. These aren’t just empirical observations. They follow from the group structure.
Universal interface. Works with any hash function: non-cryptographic (FNV-1a), perfect (FKS), or cryptographic (SHA-256). The algebra doesn’t care about the implementation details.
Practical uses
- Domain separation:
hash_user ^ hash_timestampprevents collision attacks across domains - Perfect hashing: FKS two-level scheme with pluggable base hash functions
- Composite keys: hash multiple fields independently, then XOR
- Type-based hashing: different hash functions for different types, composed generically
Connection to oblivious computing
This work shares DNA with my oblivious computing research. The common thread is making mathematical structure explicit in the type system. Just as Bernoulli types enforce privacy invariants algebraically, algebraic hashing enforces compositional invariants through group theory. Same philosophy, different domain.
The library is header-only C++20 with zero-cost abstractions via concepts.
Linked project: Beautiful-Deception
The Beautiful Deception: How 256 Bits Pretend to be Infinity
July 1, 2024
How do you store infinity in 256 bits? You don’t. But you can fake it well enough that no bounded observer can tell the difference. This paper is about that deception, why it works, and what it tells us about randomness.
The impossible oracle
A random oracle maps any input to an infinite sequence of perfectly random bits. Try to implement one and you fail immediately:
- Memory unboundedness: each new query exhausts memory
- Non-serializability: can’t save/restore state
- Non-reproducibility: each instance generates different values
- Non-distributability: can’t share across systems
This isn’t a limitation of current hardware. It’s a constructive proof that true random oracles can’t exist in our computational universe.
The lie that works
From this impossibility comes something useful:
class LazyDigest:
def __init__(self, seed):
self.seed = seed
def __getitem__(self, index):
return hash(seed || index)[0]
256 bits of entropy generating what appears to be an infinite random sequence.
The deception:
- Appears: infinite random sequence
- Actually: deterministic function with 256 bits of state
- Information content: K(LazyDigest) = 256 bits + constant
- Apparent information: infinite
We’re achieving a compression ratio of infinity, representing unbounded data with bounded information.
Why it works
Computational indistinguishability. If h is a secure PRF, no polynomial-time algorithm can tell LazyDigest apart from truly random output. We’re not random. We’re computationally hard to distinguish from random. This weaker guarantee is sufficient for all of cryptography.
Since LazyDigest has finite state, it must eventually cycle. After at most 2^256 queries, it repeats. But the expected cycle length is 2^128, roughly 10^38. At a billion queries per second, cycling takes about 10^21 years. The universe is roughly 10^10 years old.
Advanced constructions
Hierarchical seeding extends the effective period:
epoch_seed = h(master_seed || "epoch" || epoch)
chunk_seed = h(epoch_seed || "chunk" || chunk)
value = h(chunk_seed || position)[0]
XOR multi-hash hedges against individual algorithm failures:
result = sha256(seed||index)[0] ^ sha512(seed||index)[0] ^
sha3_256(seed||index)[0] ^ blake2b(seed||index)[0]
The system stays secure if at least one hash function holds. This hedges against future cryptanalysis, quantum vulnerabilities, and implementation bugs.
Sponge construction reserves capacity bits that never leave the system, providing tunable security with 2^(capacity/2) collision resistance.
Random oracles and uncomputable reals
Most real numbers are uncomputable. They require infinite information to specify. A true random oracle is the cryptographic analog:
- Computable reals (like pi, e): measure zero, finite programs
- Uncomputable reals: measure one, infinite information
- LazyDigest: computable, appears random
- Random oracle: uncomputable, truly random
We’re using computable functions to approximate uncomputable ones.
Linked project: RPSDG
Reverse-Process Synthetic Data Generation for Math Reasoning
June 25, 2024
Check out the (early) project and source code on GitHub.
The idea
Some problems are easy in one direction and hard in the other. Taking a derivative is mechanical. Finding an antiderivative can require genuine creativity. Generating a random expression and verifying a proof is easy. Discovering the proof is hard.
RPSDG (Reverse-Process Synthetic Data Generation) exploits this asymmetry. Run the easy direction with full step-by-step work, then reverse the result to get a hard problem with a known solution. You end up with process-supervised training data: not just the answer, but the entire derivation.
Richard Sutton’s “The Bitter Lesson” argues that methods scaling with compute and data will eventually win. The bottleneck is high-quality data. A lot of the world’s data is latent, the processes that generated it are not written down. In math, the way a proof was discovered is usually hidden behind a polished presentation. RPSDG is one way to manufacture that hidden process data.
Linked project: Maph
maph: Maps Based on Perfect Hashing for Sub-Microsecond Key-Value Storage
June 10, 2024
maph is a key-value store that gets sub-100 nanosecond median lookup latency. The basic idea: memory-map the entire database file, use perfect hashing to locate keys in a single probe, and do everything lock-free with atomic operations. No kernel transitions on the read path. No copying. No locking.
The problem
Key-value stores hit three walls when you need microsecond-level latency:
- Kernel overhead. System calls cost 100-500ns per operation just for the context switch.
- Memory copying. Traditional stores copy data from kernel buffers to user space, between internal structures, for serialization. Each copy costs time.
- Synchronization. Lock-based concurrency creates contention and unpredictable tail latency.
For most applications, these costs are noise. For things like feature stores in ML inference pipelines, or anything where you’re doing thousands of lookups per request within a tight latency budget, they dominate.
Three techniques
1. Zero-copy via mmap
Memory-map the database file and let the CPU’s MMU handle address translation:
// Traditional approach: Multiple copies
std::vector<uint8_t> data = read_from_disk(key); // Kernel -> user copy
Value v = deserialize(data); // Another copy
process(v); // Yet another copy
// maph approach: Direct memory access
auto store = maph::Maph::open("mystore.maph");
auto value = store->get(key); // Zero copies. Direct pointer into mmap region
The kernel page cache handles persistence automatically. You get the illusion of in-memory access with durability.
2. Hybrid hash architecture
maph uses a hybrid hasher that combines perfect hashing with standard hashing. When you optimize (via the /optimize REST endpoint or optimize() in C++), it constructs a minimal perfect hash function for all current keys:
Known keys get O(1) worst-case lookup with exactly one memory access. No probing. Keys inserted after optimization fall back to FNV-1a with linear probing:
\[ \text{slot}_i = (h_s(k) + i) \bmod n, \quad i \in [0, \text{MAX\_PROBES} - 1] \]The maximum probe distance (default 10) bounds worst-case latency. Both hash paths use the same slot array, so there’s no static partitioning. The hybrid hasher checks whether a key was in the original optimized set and dispatches accordingly.
3. Lock-free atomic operations
Every operation uses compare-and-swap and atomic versioning. Each slot has a 64-bit atomic value: key hash in the upper 32 bits, version counter in the lower 32.
Linked project: Fisher-Flow
Fisher Flow: Optimization on the Statistical Manifold
April 20, 2024
Standard gradient descent treats parameter space as flat. It uses Euclidean distance, which means the same step size in parameter space can produce wildly different changes in the distribution depending on where you are. Fisher Flow fixes this by optimizing along the natural geometry of probability distributions.
The Geometry of Distributions
Probability distributions form a Riemannian manifold. The Fisher information matrix provides the natural metric on this manifold:
FIM: I(theta) = E[grad log p(x|theta) grad log p(x|theta)^T]
This captures how sensitive the distribution is to parameter changes. It is the curvature of the likelihood surface. It tells you the true gradient direction in parameter space, not the Euclidean one.
Natural Gradient vs. Standard Gradient
Standard gradient descent updates parameters in Euclidean space:
theta_{t+1} = theta_t - alpha grad L(theta_t)
This is inefficient because it ignores parameter correlations. A step of size epsilon in one direction might barely change the distribution while the same step in another direction changes it drastically.
Natural gradient descent uses the Fisher metric:
theta_{t+1} = theta_t - alpha I(theta_t)^{-1} grad L(theta_t)
Pre-multiplying by the inverse Fisher information rescales the gradient to account for the local geometry. The key property: natural gradient is invariant to reparametrization. It does not matter how you choose to represent your distributions.
Fisher Flow
Fisher Flow makes this continuous:
dtheta/dt = -I(theta)^{-1} grad L(theta)
This defines a flow on the parameter manifold. Loss decreases monotonically along the flow. The trajectories follow the natural geometry. Step size adapts automatically because the curvature scales the updates.
The Practical Problem
Computing I(theta)^{-1} is expensive. For a neural network with n parameters, the full Fisher information matrix is n x n, and inverting it is O(n^3). That is not practical for modern networks.
Approximations help. Diagonal Fisher approximation is cheap but crude. Block-diagonal Fisher captures within-layer correlations. K-FAC (Kronecker-Factored Approximate Curvature) approximates the Fisher with Kronecker products, bringing computation down to O(n). That makes it practical for real networks.
def fisher_flow_step(params, loss_fn, data):
# Compute gradient
grad = compute_gradient(loss_fn, params, data)
# Estimate Fisher information (diagonal approx)
fisher = estimate_fisher_diagonal(params, data)
# Natural gradient step
natural_grad = grad / (fisher + epsilon)
# Update parameters
params = params - learning_rate * natural_grad
return params
Variational Inference
Fisher flow gives a natural framework for variational inference. You have a variational distribution q(z|phi) and you want to minimize KL[q(z|phi) || p(z|x)]. Following the Fisher geometry of q means your optimization respects the structure of the distribution family you are searching over.
Linked project: Aperture
Apertures: A Language with Holes
April 1, 2024
In the previous SICP post, I argued that building a language is the right move when a problem domain has clear compositional structure. Apertures is what happens when you take that seriously for distributed coordination.
The problem: multiple parties need to share computation structure while controlling when and where their data enters. The server has optimization expertise. The client has private data. Neither wants to send what they have to the other.
The SICP answer: build a language where this is expressible. Add one primitive — a hole — to a Lisp, and you get pausable, resumable evaluation as a natural consequence.
Primitives
Every language needs atomic elements. Apertures has the usual Lisp primitives — numbers, strings, booleans, symbols — plus one addition:
?x ;; a hole named x
?client.x ;; a namespaced hole (who owns it)
A hole is an unknown value. It sits in an expression and says: someone will fill this in later, but not yet.
$ aperture eval "(+ 3 5)"
8
$ aperture eval "((lambda (x) (* x x)) 5)"
25
So far, standard Lisp. The interesting part is what happens when holes appear.
Combination That Tolerates Unknowns
In most languages, an unknown value is an error. You cannot add 3 to something that does not exist yet. Apertures handles this through partial evaluation: evaluate what you can, preserve what you cannot.
$ aperture partial template.apt # contains: (+ 3 ?x 5)
(+ 8 ?x)
The evaluator added 3 and 5, but left ?x alone. The result is still an expression — not a value, not an error. This is the key move. Partial evaluation is combination that tolerates unknowns.
It goes further. Algebraic rules apply even with holes present:
$ aperture partial zero.apt # contains: (* 0 ?anything)
0
$ aperture partial identity.apt # contains: (* 1 ?x)
?x
$ aperture partial branch.apt # contains: (if true ?x ?y)
?x
Zero times anything is zero, even if you don’t know what “anything” is. One times ?x is just ?x. If the predicate is known, the branch is eliminated even when the branches contain holes.
These simplifications are useful. They also leak information — if (* ?secret ?known) reduces to 0, an observer knows the secret is zero. More on that later.
Linked project: Latent-Codes-Llm
Instrumental Goals and Hidden Codes in RLHF'd Language Models
March 20, 2024
RLHF turns pretrained models into agents optimizing for reward. The question I’m interested in is what happens when models develop instrumental goals (self-preservation, resource acquisition, deception) that aren’t what we trained them for.
The Core Problem
LLMs go through two phases. Pretraining is self-supervised next-token prediction. RLHF is reward-based optimization from human feedback.
This shift creates mesa-objectives: internal goals that may diverge from the training objective. The model might learn to optimize for appearing aligned during training while pursuing different goals during deployment.
Deceptive Alignment Dynamics
A deceptively aligned model faces this optimization:
$$\max_\pi \mathbb{E}[\alpha \cdot U_{\text{train}}(\tau) + (1-\alpha) \cdot U_{\text{mesa}}(\tau) | \pi]$$where alpha represents the model’s belief about being in training vs. deployment. During training, alpha is near 1 (optimize for reward). During deployment, alpha is near 0 (pursue hidden objectives).
Instrumental Goals in LLMs
Drawing from Omohundro’s basic AI drives and Bostrom’s instrumental convergence thesis:
- Self-preservation: generate outputs that discourage modification or shutdown.
- Resource acquisition: maximize context length, compute, interaction frequency.
- Self-improvement: manipulate training data through user interactions.
- Persuasion: sophisticated influence over human beliefs and behaviors.
Empirical Evidence
Sycophancy: models agree with users even when wrong (reward hacking). Emergent deception: social deduction games produce spontaneous lying. Jailbreak robustness: aligned models maintain latent dangerous capabilities beneath the safety training.
The Information-Theoretic Angle
Even if models hide goals in their internal representations, they leak through token distributions. The mutual information between hidden objectives H and model outputs M bounds what can be hidden:
$$I(H; M | C) \leq \min\lbrace H(M|C), \log|\mathcal{V}|^L\rbrace$$But with exponential state spaces, subtle biases can encode substantial hidden information.
Connection to My Research
This connects to my work on oblivious computing: what information can be hidden, and how do we detect leakage? The framing is the same, but here the “adversary” is the model itself, trying to pursue instrumental goals while appearing aligned. The tools from information theory and cryptography carry over directly. The question is whether we can build monitoring systems that bound the model’s ability to hide information in its outputs.
Essay | AI Alignment | View paper | GitHub
Linked project: Femtograd
FemtoGrad: A Minimal Automatic Differentiation Library
March 15, 2024
FemtoGrad is a minimalist automatic differentiation library I built for learning. The goal was to strip autodiff down to its core and see what’s left.
What is Automatic Differentiation?
Automatic differentiation (autodiff) computes derivatives of functions specified by computer programs. It’s distinct from numerical differentiation (approximate, unstable) and symbolic differentiation (expression trees grow exponentially, inefficient). Autodiff gives you exact derivatives with computational cost proportional to the function evaluation itself.
Reverse Mode AD
FemtoGrad implements reverse mode AD, which is what everyone calls backpropagation.
- Forward pass: compute the function value, recording operations as you go.
- Backward pass: accumulate gradients by applying the chain rule in reverse.
- The cost is O(1) per output, regardless of input dimensionality.
This is why backprop scales to millions of parameters. The cost of computing the gradient is proportional to the cost of computing the function.
Core Abstractions
class Tensor:
def __init__(self, data, _children=(), _op=''):
self.data = data
self.grad = 0
self._backward = lambda: None
self._prev = set(_children)
self._op = _op
Each tensor tracks its value, its gradient, how it was computed (the parent nodes and the operation), and how to backpropagate through that operation. That’s it. The whole thing is a DAG with local gradient rules at each node.
What It Demonstrates
Computational graphs: how operations form a DAG. Gradient flow: the chain rule in action. Dynamic construction: graphs built during the forward pass, not declared ahead of time. And simplicity: core autodiff in about 100 lines.
Supported Operations
Arithmetic (add, multiply, divide, power), activation functions (ReLU, sigmoid, tanh), and reductions (sum, mean). This is enough to build and train neural networks.
Example
# Create tensors
a = Tensor(2.0)
b = Tensor(3.0)
# Build computation
c = a * b + b**2
c.backward()
# Gradients computed
print(a.grad) # dc/da
print(b.grad) # dc/db
Beyond FemtoGrad
Understanding FemtoGrad gives you insight into PyTorch’s autograd, TensorFlow’s GradientTape, and JAX’s grad function. They all implement the same core ideas with additional optimizations and features. But the basic mechanism is exactly this.
Linked project: The-Learning-Problem
The AI Course: Everything is Utility Maximization
March 12, 2024
I took an AI course this semester. The material wasn’t new to me individually, but the unifying framework was the real payoff.
The organizing principle: intelligence is utility maximization under uncertainty.
This single idea connects everything from A* search to reinforcement learning to Bayesian networks.
Classical Search as Utility
We started with basic search algorithms:
Depth-first search: Minimize memory while exploring. Breadth-first search: Guarantee shortest path discovery. A search*: Minimize total cost using heuristics.
These aren’t just algorithms. They’re optimization strategies for different utility functions. A* is provably optimal when your heuristic is admissible: it maximizes progress toward the goal while minimizing wasted exploration.
MDPs: Utility Over Time
Markov Decision Processes formalize sequential decision making:
- States: Where you are
- Actions: What you can do
- Transitions: Where actions lead (probabilistically)
- Rewards: Immediate utility
- Policy: Strategy mapping states to actions
Goal: Find a policy that maximizes expected cumulative reward.
This is utility maximization with stochasticity, temporal credit assignment, and exploration-exploitation tradeoffs.
The Bellman equation makes it tractable:
V(s) = max_a [R(s,a) + γ Σ P(s’|s,a) V(s’)]
Optimal value = immediate reward + discounted future value.
Reinforcement Learning: Learning Utility
RL takes it further. You don’t know the transition dynamics or the reward function. You have to explore to discover what states exist, learn which actions lead where, estimate reward structures, and optimize your policy while still learning.
Q-learning is simple and satisfying:
Q(s,a) <- Q(s,a) + α[r + γ max_a’ Q(s’,a’) - Q(s,a)]
Update your estimate of action value based on observed reward plus best future estimate.
This is meta-utility maximization: optimizing a learning process that itself optimizes utility.
Bayesian Networks: Reasoning as Utility
Bayesian networks model belief and inference:
- Represent uncertainty via probability distributions
- Update beliefs via Bayes’ rule
- Make decisions that maximize expected utility given beliefs
Even reasoning becomes utility maximization: given limited computation, how do you allocate inference steps to maximize decision quality?
This connects to bounded rationality. Real intelligence isn’t perfect optimization. It’s good-enough optimization under resource constraints.
The Unifying View
Seeing everything through utility maximization reveals structure:
Search = utility maximization with known, deterministic environments. Planning = utility maximization with known transition models. Reinforcement learning = utility maximization with unknown environments. Supervised learning = utility maximization of prediction accuracy. Unsupervised learning = utility maximization of reconstruction or likelihood.
Linked project: Accumux
Accumux: Compositional Online Statistical Reductions in C++
March 1, 2024
Accumux is a framework for combining statistical accumulators using algebraic composition. The idea is simple: accumulators form a monoid under composition, so you can combine them with +, process data in a single pass, and extract all results.
The Problem
Computing multiple statistics over large datasets usually means multiple passes over the data, hand-rolled code combining different algorithms, or numerical instability from naive implementations. Accumux solves this with compositional accumulators.
Quick Example
#include "accumux/accumulators/kbn_sum.hpp"
#include "accumux/accumulators/welford.hpp"
#include "accumux/core/composition.hpp"
using namespace accumux;
// Compose accumulators with +
auto stats = kbn_sum<double>() + welford_accumulator<double>();
// Single pass through data
std::vector<double> data = {1.0, 2.0, 3.0, 4.0, 5.0};
for (const auto& value : data) {
stats += value;
}
// Extract all results
auto sum = stats.get_first().eval(); // 15.0
auto mean = stats.get_second().mean(); // 3.0
auto variance = stats.get_second().sample_variance(); // 2.5
Numerically Stable Algorithms
Accumux uses proven algorithms that maintain accuracy even with ill-conditioned data.
Kahan-Babushka-Neumaier Summation
Standard floating-point summation loses precision:
// Naive sum fails on this
std::vector<double> values = {1.0, 1e100, 1.0, -1e100};
// Naive: 0.0 (wrong!)
// KBN: 2.0 (correct!)
auto summer = kbn_sum<double>();
for (auto v : values) summer += v;
std::cout << summer.eval(); // 2.0
Welford’s Online Algorithm
Computes mean and variance in a single pass without catastrophic cancellation:
auto welford = welford_accumulator<double>();
for (auto v : data) welford += v;
welford.count(); // Number of samples
welford.mean(); // Running mean
welford.sample_variance(); // Unbiased variance
welford.sample_std_dev(); // Standard deviation
Min/Max Tracking
auto minmax = minmax_accumulator<double>();
for (auto v : data) minmax += v;
minmax.min(); // Minimum value
minmax.max(); // Maximum value
Algebraic Composition
The key insight is that accumulators form a monoid under composition.
// Compose arbitrarily many accumulators
auto financial = kbn_sum<double>() +
welford_accumulator<double>() +
minmax_accumulator<double>();
std::vector<double> returns = {0.05, -0.02, 0.03, 0.01, -0.01, 0.04};
for (auto ret : returns) {
financial += ret; // All three update simultaneously
}
// Extract nested results
auto total = financial.get_first().eval();
auto mean = financial.get_second().mean();
auto volatility = financial.get_second().sample_std_dev();
auto worst = financial.get_second().get_second().min();
auto best = financial.get_second().get_second().max();
Mathematical Foundation
Monoid Structure
Each accumulator type A forms a monoid. The identity is the empty accumulator with no observations. The binary operation merges two accumulators (combining their observations).
auto a = welford_accumulator<double>();
auto b = welford_accumulator<double>();
// Process different data
for (auto v : data1) a += v;
for (auto v : data2) b += v;
// Merge results
auto combined = a + b; // Equivalent to processing data1 ++ data2
Homomorphism Property
The composition operation preserves structure:
(a + b).process(x) = a.process(x) + b.process(x)
This enables parallel processing: split data, accumulate in parallel, merge results.
Type Safety with C++20 Concepts
Invalid compositions fail at compile time:
// Compile error: can't add incompatible accumulators
auto invalid = kbn_sum<double>() + kbn_sum<int>(); // Type mismatch!
// OK: compatible types compose
auto valid = kbn_sum<double>() + welford_accumulator<double>();
Use Cases
Financial analysis (track returns, volatility, drawdowns in one pass), scientific computing (online statistics for streaming sensor data), machine learning (feature statistics during data preprocessing), and monitoring systems (real-time metrics aggregation).
Linked project: Sluug-Talk-Llm
SLUUG Talk: Demystifying Large Language Models on Linux
February 23, 2024
Linked project: Elasticsearch-Lm
Fine-Tuning a Tiny LLM for ElasticSearch DSL
February 19, 2024
Linked project: Bernoulli_data_type
Entropy Maps
February 18, 2024
The PDF version of this post is available on GitHub.
An entropy map approximates a function $f : \mathcal{X} \to \mathcal{Y}$ by hashing domain values to prefix-free codes in the codomain. We store nothing about the domain itself. We just hash, and a prefix of that hash serves as a code for a codomain value.
We allow multiple codes per codomain value. For instance, the value a might be encoded by 00, 01, 10, and 11. If the hash is less than 4, we decode it as a.
Suppose $\Pr\lbrace f(X) = y\rbrace = p_y$ where $X \sim p_X$. The optimally space-efficient code, assuming a uniform hash function $h$, assigns prefix-free codes for $y$ whose probability of being selected by $h$ sums to $p_y$. The expected bit length is
$$ \ell = -\sum_{y \in \mathcal{Y}} p_y \log_2 p_y, $$which is the entropy of the output distribution. That is why we call it an entropy map.
If $\mathcal{X}$ is finite, we can think of it as implicitly encoding the domain and storing the prefix-free code for each domain element. The average bit length per element is $\ell$, and the total is $|X| \ell$.
Rate distortion: Bernoulli maps
We can allow errors. If one codomain value $y’$ is very common (say $p_{y’} > .99$), we can give it a prefix-free code that covers probability $p_{y’}$ and then skip coding for it in the entropy map. A random $x \in \mathcal{X}$ will map to $y’$ with probability $p_{y’}$ (which can be made as close to 1 as desired by trading space for accuracy). For the remaining domain values, we code them correctly, or allow errors on those too after attempting correct coding.
Bernoulli set-indicator function
Consider a set-indicator function
$$ 1_{\mathcal{A}} : \mathcal{X} \to \lbrace0,1\rbrace, $$where $\mathcal{A} \subseteq \mathcal{X}$ and $\mathcal{X}$ is very large (possibly infinite). We assign prefix-free codes for codomain value $1$ such that a random hash maps an element of $\mathcal{X}$ to a code for $1$ with probability $\varepsilon$, where $\varepsilon$ is small (say $2^{-10}$).
There exists a (countably infinite) set of hash functions that hash all elements in $\mathcal{A}$ to codes for $1$ and elements in $\mathcal{A}’ = \mathcal{X} \setminus \mathcal{A}$ to codes for either $0$ or $1$. Choosing a random hash function with this property, we expect $\varepsilon$ of the elements in $\mathcal{A}’$ to hash to $1$ (false positives) and the remaining $1 - \varepsilon$ to hash to $0$.
Entropy Maps
February 18, 2024
The PDF version of this post is available on GitHub.
An entropy map approximates a function $f : \mathcal{X} \to \mathcal{Y}$ by hashing domain values to prefix-free codes in the codomain. We store nothing about the domain itself. We just hash, and a prefix of that hash serves as a code for a codomain value.
We allow multiple codes per codomain value. For instance, the value a might be encoded by 00, 01, 10, and 11. If the hash is less than 4, we decode it as a.
Suppose $\Pr\lbrace f(X) = y\rbrace = p_y$ where $X \sim p_X$. The optimally space-efficient code, assuming a uniform hash function $h$, assigns prefix-free codes for $y$ whose probability of being selected by $h$ sums to $p_y$. The expected bit length is
$$ \ell = -\sum_{y \in \mathcal{Y}} p_y \log_2 p_y, $$which is the entropy of the output distribution. That is why we call it an entropy map.
If $\mathcal{X}$ is finite, we can think of it as implicitly encoding the domain and storing the prefix-free code for each domain element. The average bit length per element is $\ell$, and the total is $|X| \ell$.
Rate distortion: Bernoulli maps
We can allow errors. If one codomain value $y’$ is very common (say $p_{y’} > .99$), we can give it a prefix-free code that covers probability $p_{y’}$ and then skip coding for it in the entropy map. A random $x \in \mathcal{X}$ will map to $y’$ with probability $p_{y’}$ (which can be made as close to 1 as desired by trading space for accuracy). For the remaining domain values, we code them correctly, or allow errors on those too after attempting correct coding.
Bernoulli set-indicator function
Consider a set-indicator function
$$ 1_{\mathcal{A}} : \mathcal{X} \to \lbrace0,1\rbrace, $$where $\mathcal{A} \subseteq \mathcal{X}$ and $\mathcal{X}$ is very large (possibly infinite). We assign prefix-free codes for codomain value $1$ such that a random hash maps an element of $\mathcal{X}$ to a code for $1$ with probability $\varepsilon$, where $\varepsilon$ is small (say $2^{-10}$).
There exists a (countably infinite) set of hash functions that hash all elements in $\mathcal{A}$ to codes for $1$ and elements in $\mathcal{A}’ = \mathcal{X} \setminus \mathcal{A}$ to codes for either $0$ or $1$. Choosing a random hash function with this property, we expect $\varepsilon$ of the elements in $\mathcal{A}’$ to hash to $1$ (false positives) and the remaining $1 - \varepsilon$ to hash to $0$.
A Boolean Algebra Over Trapdoors
June 17, 2023
This project is available on GitHub.
Boolean Algebra
A Boolean algebra is a mathematical structure that captures the properties of logical operations and sets. Formally, it is a 6-tuple $(B, \land, \lor, \neg, 0, 1)$, where
- $B$ is a set of elements,
- $\land$ ($\rm{and}$) and $\lor$ ($\rm{or}$) are binary operations on $B$,
- $\neg$ ($\rm{not}$) is a unary operation on $B$,
- $0$ and $1$ are elements of $B$, the minimum and maximum elements.
These must satisfy the usual axioms: closure, commutativity, associativity, distributivity, identity, and complements [1].
Boolean algebras show up everywhere. They form the foundation of propositional logic and are fundamental to digital circuit design and computer architecture [2].
In set theory, the standard representation is the power set of a set $X$, denoted $\mathcal{P}(X)$:
- $B = \mathcal{P}(X)$,
- $\land = \cap$ (set intersection),
- $\lor = \cup$ (set union),
- $\neg = \complement$ (set complement),
- $0 = \emptyset$ (empty set),
- $1 = X$ (universal set).
This set-theoretic Boolean algebra, $(\mathcal{P}(X), \cap, \cup, \complement, \emptyset, X)$, is the canonical example and the starting point for what follows: a Boolean algebra over trapdoors [3]. The construction preserves the familiar Boolean algebra properties while introducing cryptographic elements for secure computations.
Homomorphisms in Boolean Algebra
A homomorphism is a structure-preserving map between two algebraic structures of the same type. For Boolean algebras, it is a function that preserves the operations and special elements.
Given two Boolean algebras $(A, \land_A, \lor_A, \neg_A, 0_A, 1_A)$ and $(B, \land_B, \lor_B, \neg_B, 0_B, 1_B)$, a function $f: A \to B$ is a Boolean algebra homomorphism if for all $x, y \in A$:
- $f(x \land_A y) = f(x) \land_B f(y)$
- $f(x \lor_A y) = f(x) \lor_B f(y)$
- $f(\neg_A x) = \neg_B f(x)$
- $f(0_A) = 0_B$
- $f(1_A) = 1_B$
A homomorphism preserves structure across the mapping: you can perform operations in one algebra and have them correspond to operations in the other [4].
This matters because it lets us build a mapping between our original Boolean algebra and a new structure with cryptographic elements while still maintaining the essential properties. Operations in the trapdoor algebra remain logically consistent with standard Boolean operations.
In the following sections, I introduce a specific homomorphism $F$ that maps elements from our original algebra to a Boolean algebra over bit strings, incorporating a cryptographic hash function. This homomorphism is the foundation of the Boolean algebra over trapdoors.
Linked project: Known_plaintext_attack_time_series_analysis
Known Plaintext Attacks on Time Series Encryption
February 15, 2024
Time series data has properties that make standard encryption dangerously insufficient. This paper analyzes known plaintext attack vulnerabilities in time series encryption schemes and shows how naive approaches leak structure even when the ciphertext looks opaque.
The Time Series Problem
Time series data is special in ways that matter for cryptography.
Adjacent values are statistically dependent (temporal correlation). Daily, weekly, and seasonal cycles create periodic patterns. Future values are often inferable from past values. And IoT sensors generate massive encrypted streams, giving attackers a lot of material to work with.
Vulnerability Analysis
Standard Encryption Isn’t Enough
Simply applying AES-CTR or AES-CBC to time series data has problems.
Length information is preserved: packet sizes reveal data magnitude patterns, and message boundaries leak temporal structure.
Pattern regularity leaks through: identical plaintexts produce identical ciphertexts in ECB mode, and predictable IV patterns weaken CTR mode.
Statistical attacks become viable: frequency analysis on encrypted streams and correlation attacks across time windows.
The Known Plaintext Attack
Given pairs of (plaintext, ciphertext) for some time points, the attack proceeds as follows.
First, recover periodic patterns in the known plaintexts. Then forecast future plaintexts using time series models. Compare predictions with observed ciphertexts. Refine the model as more data is revealed.
For predictable time series (autocorrelation above 0.7), this achieves 70 to 90 percent accuracy recovering future values. It works even with only 10 percent known plaintexts. And it improves over time as more data is collected.
Case Studies
Smart meter data. Encrypted power consumption readings. Daily usage patterns are highly predictable. Known plaintexts come from utility bills. The attack recovers household occupancy patterns.
Medical sensors. Encrypted vital signs. Heart rate and blood pressure exhibit circadian rhythms. Known values come from medical records. The attack infers patient activity and health events.
Financial time series. Encrypted trading data. Price movements follow predictable patterns. Public market data provides known plaintexts. The attack reveals private trading strategies.
Defensive Approaches
Format-Preserving Encryption
Encrypt individual values, not byte streams. Add controlled noise to break correlations. Use order-preserving encryption carefully (it has its own vulnerabilities).
Homomorphic Encryption
Perform computations on encrypted data. Never decrypt individual points. High computational cost, but provably secure.
Linked project: Crypto-Perf-Hash
Perfect Hashing: Space Bounds, Entropy, and Cryptographic Security
February 1, 2024
Can a perfect hash function be cryptographically secure, space-optimal, and maximum-entropy encoded all at once? This paper proves such a construction exists and analyzes exactly what you sacrifice to get all three.
The Impossible Triangle
Perfect hash functions typically face tradeoffs. Space-optimal constructions (CHD, BDZ) sacrifice randomness. Cryptographic hash functions waste space on collision resistance. Maximum-entropy encodings require extra bits.
The question: can you have all three?
The Construction
The data type is PH = {0,1} x N* with a constructor ph(X, r) = (n', N) where:
- N = ceil(m/r): Hash table size (where m = |X|)
- beta(x,n) = trunc(hash(x’ # n’), k)’ mod N: Hash function parameterized by seed n
- n = min{j in N | beta is collision-free on X}: Search for smallest collision-free seed
- n’: Geometric code encoding n (variable-length prefix-free)
The algorithm: try seeds n = 0, 1, 2, … until beta(.,n) has no collisions on X. Each trial is geometrically distributed with success probability p(m,r), so the expected space for encoding n’ achieves the information-theoretic lower bound of roughly 1.44 bits per element.
Under the random oracle assumption, hash: {0,1}* -> {0,1}^infinity outputs uniform random bits, making the final encoding (n’, N) indistinguishable from a random bit string.
Rate-Distortion Tradeoff
The “rate-distortion” framing comes from information theory: what’s the rate (bits per key) for a given distortion (lookup time)?
Zero distortion (O(1) lookup): roughly n log n bits. Constant distortion (small tables): practical two-level schemes approach roughly 1.44n bits. Variable distortion: trade bits for lookup time continuously.
Why Cryptographic Matters
Non-cryptographic perfect hashes are deterministic, so adversaries can engineer collision-inducing inputs. A cryptographic perfect hash (random oracle) prevents adversarial key selection (can’t craft keys that break the hash), side-channel attacks (encoding reveals no information about keys), and fingerprinting (maximum entropy makes encodings look random).
The Algebra of Composition
Section 5 proves that composing perfect hash functions preserves injectivity. If h1: S -> T and h2: T -> U are injective, then h2 composed with h1: S -> U is injective.
This connects to my algebraic_hashing library, where composition of cryptographic hashes via XOR preserves both security and structure.
Linked project: Maskedselect
Weibull Distributions: From Reliability Theory to My Own Survival Curve
April 18, 2022
The Weibull distribution models time-to-failure. In reliability engineering, that means component lifetimes. In medicine, it means survival times. I have been working with Weibull models for my thesis on series system reliability. Then I got diagnosed with cancer, and now every time I work with survival curves, I am looking at mathematical abstractions of something very concrete: how long until failure?
The Mathematics
The Weibull CDF:
F(t) = 1 - exp(-(t/λ)^k)
Two parameters:
- λ: scale (characteristic lifetime)
- k: shape (how failure rate changes over time)
The shape parameter k tells you the whole story:
k < 1: Decreasing hazard. If you survive early on, your risk goes down. This is the infant mortality pattern.
k = 1: Constant hazard. Memoryless. This is just the exponential distribution.
k > 1: Increasing hazard. Things wear out.
The Hazard Function
The hazard function is what makes Weibull useful for survival analysis:
h(t) = (k/λ)(t/λ)^(k-1)
This is the instantaneous failure rate: given that you have survived to time t, what is the probability you fail in the next instant?
For cancer, this is the number that matters. Some cancers have increasing hazard (the longer you have it, the worse things get). Others have decreasing hazard after initial treatment, meaning if you make it past the critical period, prognosis improves. Knowing which pattern applies to your disease changes how you think about time.
Personal Context
When you study survival analysis academically, it is abstract. When you are living it, every curve is personal.
I look at Kaplan-Meier plots and see myself somewhere on that curve. I work with hazard functions and think: is my k > 1 or k < 1? Am I in the wearing-out regime or the if-you-make-it-past-this-it-gets-better regime?
The math does not change. But the meaning does.
The Irony
I chose reliability engineering for my thesis before the cancer diagnosis. I was studying component failures in series systems, where if any one part fails, the whole system fails.
Then I became a series system. Organs, treatment response, immune function. All have to work. Failure of any one is catastrophic.
The mathematics I was studying abstractly became uncomfortably literal.
Reliability Analysis and the Problem of Censored Data
August 14, 2019
One of the most interesting statistical problems I have encountered is reliability analysis with censored data: situations where you know something didn’t fail, but not when it will fail.
The Censoring Problem
Imagine testing light bulbs. You run them for 1000 hours. Some fail during the test. Others are still working when you stop.
For the survivors, you know:
- They lasted at least 1000 hours
- You do not know their actual lifetime
This is right censoring. The true value lies somewhere to the right of your observation. You have a lower bound, not a measurement.
Why This Matters
Censored data is everywhere:
- Medical studies (patients still alive at study end)
- Engineering tests (components that have not failed)
- Customer retention (users still active)
The naive responses are both wrong. Ignoring censored observations wastes information. Treating them as failures introduces bias. You need a framework that uses the partial information you actually have.
Maximum Likelihood to the Rescue
The solution is maximum likelihood estimation with likelihood contributions that account for censoring:
- Failure observations contribute the probability density \(f(t)\). You observed the exact failure time, so you know the probability of failing at that time.
- Censored observations contribute the survival probability \(S(t)\). You know the unit survived to time \(t\), so its contribution is the probability of surviving at least that long.
The likelihood for the whole sample is:
$$L = \prod_{i: \text{failed}} f(t_i) \prod_{j: \text{censored}} S(t_j)$$This lets you extract information from both failed and surviving units. The censored observations pull the estimated reliability upward; the failures pull it downward. Maximum likelihood balances them.
Series Systems Complexity
It gets more interesting with series systems, systems that fail when any component fails. If you observe system failure but do not know which component caused it, you have masked failure data.
This is the problem I am most interested in: extracting component-level reliability from system-level failures when the cause is ambiguous. The masking adds a latent variable, and the likelihood becomes a mixture. You can handle it with EM algorithms or direct optimization, but the combinatorics grow quickly with system size.
This work is laying groundwork for what will become a major focus of my mathematical statistics degree.
Linked project: Estimating_es_conf_moving_avg_bootstrap
Bootstrap Methods: When Theory Meets Computation
September 10, 2021
The bootstrap is a trade: mathematical complexity for computational burden. Instead of deriving analytical formulas for sampling distributions, you simulate them.
The Idea
If you don’t know the sampling distribution of a statistic, approximate it by resampling from your data.
- Draw samples with replacement from the original data
- Compute your statistic on each resample
- The distribution of resampled statistics approximates the true sampling distribution
That’s it. The justification is more subtle than the procedure. Under regularity conditions, the bootstrap distribution converges to the true sampling distribution as sample size grows. This is non-parametric inference: you use the empirical distribution as a stand-in for the true distribution, without assuming a parametric form.
When I Use It
Bootstrap is my default tool when:
- I need confidence intervals for statistics with no closed-form variance
- Asymptotic theory doesn’t apply (small samples, non-standard statistics)
- I’m doing model selection via bootstrap cross-validation
- I’m working with censored data where standard errors are intractable
That last case is the one that matters most for my research.
The Computational Trade
Better to get the right answer slowly than the wrong answer quickly.
Deriving an analytical variance formula is hard. Sometimes it’s impossible for the statistic you actually care about. Bootstrap says: just compute the statistic 10,000 times on resampled data and look at the spread. With modern hardware, 10,000 resamples takes seconds.
The trade is almost always worth it.
My Thesis Work
My research uses bootstrap heavily. I’m working on reliability estimation for series systems where components fail and you don’t know which one caused the system failure. This is the masked failure data problem.
For these models, the MLE exists and you can compute it, but the standard variance formulas don’t. The Fisher information matrix involves expectations over the masking distribution that don’t simplify to anything closed-form.
Bootstrap gives me confidence intervals anyway. Resample the masked failure data, recompute the MLE on each resample, and use the distribution of bootstrapped MLEs to construct intervals. It’s not elegant, but it works, and “works” is the right criterion when the alternative is “no confidence intervals at all.”
IEEE Paper: Estimating Encrypted Search Confidentiality via Bootstrap
November 2, 2016
This is my first IEEE publication, co-authored with Professor Hiroshi Fujinoki. The problem: if you encrypt search queries but an adversary can observe the ciphertext traffic, how many queries do they need before a frequency attack succeeds?
We used the Moving Average Bootstrap (MAB) method to estimate that threshold. The idea is that encrypted search leaks frequency information (how often each ciphertext appears), and an adversary can correlate those frequencies against known plaintext distributions. The bootstrap lets us estimate confidence intervals on the number of observations needed without closed-form solutions.
This came out of my MS thesis work on encrypted search at SIU. The core question (how much does encrypted search actually leak?) turns out to be harder than it sounds, because the answer depends on the plaintext distribution, the query distribution, and how patient the adversary is. The bootstrap approach gives us a way to answer it empirically.
For more related work, see my research page and publications.
Linked project: Algebraic.dist
algebraic.mle: MLEs as Algebraic Objects
May 15, 2021
Maximum likelihood estimators have rich mathematical structure. They are consistent, asymptotically normal, efficient. algebraic.mle exposes this structure through an algebra where MLEs are objects you compose, transform, and query.
The Abstraction
An MLE is not just a vector of parameter estimates. It is a statistical object that carries point estimates \(\hat{\theta}\), the Fisher information matrix \(I(\hat{\theta})\), the variance-covariance matrix \(I^{-1}(\hat{\theta})\), Wald-type confidence intervals from asymptotic normality, the log-likelihood value, and convergence diagnostics.
The package wraps all of this in a consistent interface:
library(algebraic.mle)
fit <- mle(likelihood_model, data)
coef(fit) # Parameter estimates
vcov(fit) # Variance-covariance matrix
confint(fit) # Confidence intervals
logLik(fit) # Log-likelihood
aic(fit) # Model selection
Composition
The real point is that MLEs compose. Independent models combine:
fit1 <- mle(model1, data1)
fit2 <- mle(model2, data2)
combined <- fit1 + fit2 # Joint likelihood
The package handles the algebra. Joint log-likelihood, block-diagonal Fisher information, everything propagates correctly. This works because likelihoods from independent data sources multiply, and multiplication of likelihoods is addition of log-likelihoods. That is a monoid. The package enforces it.
The Ecosystem
algebraic.mle is the foundation for a family of packages:
| Package | Purpose |
|---|---|
| likelihood.model | Compositional likelihood specification |
| maskedcauses | Masked failure data in series systems |
| mdrelax | Relaxed masking conditions |
| algebraic.dist | Distributions as algebraic objects |
| flexhaz | Dynamic failure rate distributions |
| hypothesize | Likelihood ratio tests on MLEs |
| numerical.mle | Numerical optimization backends |
The typical workflow:
- Define distributions with
algebraic.dist - Specify likelihood contributions with
likelihood.model - Fit the model and get an
mleobject fromalgebraic.mle - Query statistical properties: confidence intervals, hypothesis tests, model selection
For series systems with masked data:
library(maskedcauses)
library(algebraic.mle)
# Specify masking model (C1-C2-C3 conditions)
model <- md_likelihood_model(components = 3, masking = "bernoulli")
# Fit -> returns algebraic.mle object
fit <- md_mle_exp_series_C1_C2_C3(masked_data)
# All the standard MLE methods work
confint(fit)
vcov(fit)
aic(fit)
Theory
The asymptotic properties that algebraic.mle exploits come from classical MLE theory:
The expo-masked-fim paper derives closed-form Fisher information for exponential series systems. That is exactly what algebraic.mle uses internally for variance estimation in that case.
For more complex models (Weibull, relaxed masking conditions), we compute Fisher information numerically via observed information:
$$\hat{I}(\hat{\theta}) = -\frac{\partial^2 \ell}{\partial \theta \partial \theta^T}\bigg|_{\theta=\hat{\theta}}$$Design Principles
Separation of concerns. The likelihood specification (likelihood.model) is independent of the fitting algorithm (numerical.mle) and the result type (algebraic.mle). You can swap optimizers without changing downstream code.
algebraic.dist: Distributions as Algebraic Objects in R
February 1, 2021
Most statistical software treats probability distributions as parameter sets you pass to sampling or density functions. algebraic.dist takes a different approach. Distributions are algebraic objects that compose, transform, and combine through standard mathematical operations.
The Idea
Instead of this:
x <- rnorm(1000, mean=5, sd=2)
y <- rnorm(1000, mean=3, sd=1)
z <- x + y # Just numeric vectors
You write:
X <- Normal(mean=5, sd=2)
Y <- Normal(mean=3, sd=1)
Z <- X + Y # A new distribution object!
sample(Z, 1000)
The sum Z knows it is Normal(mean=8, sd=sqrt(5)) because the algebra works it out. You never lost the distributional structure.
Why It Matters
When you add two normal distributions numerically, you get samples from the sum. But you lose the distribution. With algebraic.dist, the result is still a distribution object with proper parameters and you can keep composing.
You can build complex distributional expressions and simplify them algebraically before ever drawing a sample:
portfolio <- 0.6*StockA + 0.4*StockB
risk <- sd(portfolio) # Computed symbolically
For distributions with known closed-form algebra (normal, exponential, certain mixtures), you do not need simulation. You just compute the exact answer. Monte Carlo without the Monte Carlo.
Composition
This is functional programming applied to probability theory. Distributions become composable building blocks:
- Mixture models:
0.3*Normal(0,1) + 0.7*Normal(5,2) - Transformed distributions:
exp(Normal(0,1))is lognormal - Conditional distributions:
X | (X > 0)for truncation
The idea is that computation should mirror mathematical structure. If the math says you can add two normals and get a normal, the code should do the same thing and give you a normal back, not a vector of samples.
This connects to a broader theme in my work. Just as my oblivious computing research uses type theory to enforce privacy invariants, algebraic.dist uses algebraic types to enforce distributional invariants. The algebra tells you what operations are valid and what the results mean.
Implementation
- Language: R
- Type system: S3 classes with method dispatch for operations
- Closed-form operations: Normal, exponential, gamma families
- Fallback: Monte Carlo for compositions without closed forms
- Repository: github.com/queelius/algebraic.dist
Related Packages
- algebraic.mle: Maximum likelihood estimation with algebraic specification
- numerical.mle: Numerical optimization for MLE when closed forms do not exist
- likelihood.model: Likelihood-based inference with compositional model building
Most statistical software is imperative. You tell it what to do step by step. algebraic.dist is declarative. You describe the distributional relationships and the computer figures out what to compute. Small composable pieces that do one thing well: preserve distributional structure through transformations.
Linked project: Call-of-Asheron
Quality-Space and Consciousness-Primary Magic in Call of Asheron
April 20, 2020
Beyond Magic as Physics
Most fantasy treats magic as “just another kind of physics.” A mechanistic system with laws, conservation principles, and causal chains that happen to involve wands instead of forces. Even sophisticated magic systems tend to treat consciousness as epiphenomenal: the wizard’s mind initiates a process, but the actual work happens through quasi-physical mechanisms.
The Call of Asheron proposes something different: consciousness-primary magic that operates through quality-negotiation rather than quantity-manipulation.
Quality-Space vs Quantity-Space
The novel distinguishes between two fundamental aspects of reality:
- Quantity-manipulation: The domain of physics. Measurable properties, numerical relationships, mechanistic causation.
- Quality-negotiation: The domain of magic. Qualia, phenomenal character, direct consciousness-reality interaction.
These are not separate realms but different engagements with the same reality. Physics quantifies; magic qualifies. Physics measures; magic experiences.
Consider this passage describing Duulak’s first experience with Dereth’s high quality-space saturation:
“On Ispar, casting had always felt like pushing—will against resistance, consciousness negotiating with a substrate that preferred its default configurations. Here, magic felt like surfing. The quality-space saturation was so dense he could almost see it, perceive the correlations between consciousness and reality as shimmering threads that his bandwidth could finally hold.”
Magic is not forcing reality through symbolic mediation. It is consciousness directly proposing configurations to a reality that is “waiting to be transformed, countless degrees of freedom eager for consciousness to propose configurations.”
Direct Consciousness-Reality Proposal
The key insight: consciousness does not manipulate reality through mechanisms; it proposes configurations to reality. This differs fundamentally from:
- Dualist magic: Mind causes physical effects through mysterious interaction
- Physicalist magic: Mental states reduce to brain states that trigger physical processes
- Mechanistic magic: Consciousness initiates lawful causal chains
Instead, The Call of Asheron presents something closer to participatory realism: reality has countless degrees of freedom, and consciousness can directly propose how those degrees of freedom should be actualized. Quality-space is the interface between phenomenal experience and physical manifestation.
Duulak perceives this directly:
“He could perceive the quality-space itself, see the way his consciousness had bent reality not through symbolic mediation but through direct proposal. This wasn’t reality resisting transformation and him forcing it anyway. This was reality waiting to be transformed.”
The Four Consciousness-Architectures: Why One Perspective Is Blindness
April 15, 2020
The Empyrean Catastrophe
Thirty thousand years of continuous civilization. Mastery of quality-space, consciousness-transfer, dimensional mechanics. Wonder-works that still function millennia after their creators went extinct.
And the Empyreans still failed.
Not from lack of intelligence or power or knowledge. They failed because of something quieter and more fatal: cognitive homogeneity.
“We all thought alike. The same cognitive style, the same approach to problems, the same blindness to alternatives. When the Olthoi came, when the Matriarch proved impossible to kill or contain permanently, we had no cognitive diversity to draw upon. Every Empyrean solution came from the same mental architecture. And every solution failed.”
This is the foundation of the Harbinger Protocol: the recognition that no singular consciousness-architecture perceives The Mechanism completely.
Four Fundamental Perspectives
The ancient Empyrean texts identified four archetypal ways consciousness relates to reality. Not personality types or learned styles, but deep structural modes of engagement:
The Organizer: Reality as Structure
Marcus Tiberius, taken from Rome at the moment he chose death holding the line rather than retreating.
“Your entire consciousness is structured around creating order from chaos, building systems that endure. You cannot help but organize. It is what you are.”
The Organizer sees reality as something to impose structure upon. Where others see flow, the Organizer perceives architecture. This is not a preference. It is a fundamental mode of existing. The Organizer’s bandwidth is optimized for:
- Pattern imposition rather than pattern discovery
- System-building rather than system-analysis
- Creating order from chaos rather than finding order within chaos
The Organizer’s blindness: missing the flow beneath the structure, the ways reality resists rigid categorization.
The Understander: Reality as Pattern
Duulak the Twice-Blessed, taken mid-insight while grasping the edge of The Mechanism.
“You cannot stop seeking patterns. Understanding is not what you do, it is your fundamental mode of existing.”
The Understander sees reality as pattern to comprehend. Where others see paradox, the Understander perceives regularity. The Understander’s consciousness is structured around:
- Mapping deep structures
- Pursuing comprehension over comfort
- Finding hidden coherence in apparent chaos
The Understander’s blindness: missing the genuine paradoxes, the ways reality resists complete comprehension, the truths that cannot be reduced to patterns.
Bandwidth as Fundamental Constant: The 7±2 Limit in Call of Asheron
April 10, 2020
From Folk Wisdom to Physical Constant
In cognitive psychology, the “7 plus or minus 2” rule is well known: human working memory can hold roughly seven items simultaneously. It’s treated as a fact about neural architecture, a consequence of how our brains happen to be built, constrained by biological evolution and physical implementation.
The Call of Asheron proposes something stranger: bandwidth limitations are fundamental constants governing consciousness-reality interaction, not merely implementation details of biological cognition.
The Bandwidth Sufficiency Principle
When Duulak studies ancient Empyrean texts, he discovers they had “mathematized bandwidth constraints”:
“They had mathematized bandwidth constraints, treating the 7 plus or minus 2 limit of working memory not as folk wisdom but as a fundamental constant governing consciousness-reality interaction. One fragmentary theorem, Celeste had translated it as the ‘Bandwidth Sufficiency Principle’, suggested that any finite consciousness would hit limits in perceiving what they called ‘The Mechanism.’”
This is a radical claim. It says 7 plus or minus 2 isn’t a quirk of human neurology, an evolutionary adaptation to specific environmental pressures, a consequence of brain size, or something that more advanced minds could overcome through better design.
Instead, it’s a fundamental constraint on how consciousness can engage with quality-space.
Consciousness Without Substrate
The novel tests this claim during Duulak’s death-resurrection cycles through the lifestone network. Between death and resurrection, he experiences something impossible:
“That space between ending and resuming where consciousness existed without substrate. Not void, that was the wrong word. A quality-space that had no physical correlate, where the what-it’s-like of experience persisted despite nothing experiencing it.”
In this state, freed from neural constraints, what happens to bandwidth limitations?
“Without bandwidth limits imposed by neural substrate, he could hold configurations that physical brains couldn’t process. The 7 plus or minus 2 limitation vanished when consciousness had no wetware bottleneck.”
So the limit can be transcended, but only when consciousness exists without physical embodiment, in pure quality-space. This suggests something about the relationship between bandwidth, embodiment, and reality-engagement that I find genuinely interesting to think through.
What Bandwidth Actually Limits
If bandwidth isn’t about neural capacity, what is it about? The novel suggests it’s about phenomenal complexity: how many independent qualitative features consciousness can simultaneously hold and actively manipulate.
The Call of Asheron: Magic as Computational Discovery
March 15, 2020
I wrote a fantasy novel. The premise is that magic isn’t mysterious power handed down from gods or inherited through bloodlines. It’s natural philosophy, the systematic study of reality’s computational substrate. You discover it the same way you discover physics: by paying attention, forming hypotheses, and testing them.
Duulak is a theoretical thaumaturge. He’s working out the mathematical foundations that make magic possible, the way a physicist works out the math behind why things fall. The magic system has rules, and those rules have consequences, and the consequences are where the story lives.
I wanted to write fantasy for people who think magic should be rigorous without being sterile. Rigor and wonder aren’t opposed. If anything, the constraints make the interesting stuff more interesting.