Synthesis: Codecs as Structure

May 15, 2026

Twelve posts, twelve codes, one thesis that refused to change. This is the closing summary.

A. The Twelve Codes Together

Every post in this series answered a version of the same question: given a source of positive integers, how do you represent its values compactly as a sequence of bits? The answers differ in shape, in assumptions, and in which distribution each code implicitly expects.

PostCodeImplied prior (one phrase)
1-2FoundationsPrefix-free codes are possible iff Kraft’s inequality holds
3Priors frameworkAny code defines a prior; the best code matches the source
4UnaryGeometric(1/2): value 1 is twice as likely as value 2, etc.
5aElias GammaPower-law: probability falls as 1/n^2
5bElias DeltaHeavier-tailed power law: slower decay for large values
5cElias OmegaRecursive structure: no fixed polynomial decay rate
6FibonacciNear-geometric with Zeckendorf structure; good for Zeckendorf-sparse integers
7Rice / GolombGeometric with known parameter m; optimal when m divides entropy
8VByteRoughly uniform over byte-aligned ranges; engineering favorite
9HuffmanSource-optimal given the exact symbol distribution
10Arithmetic codingApproaches entropy to an arbitrary fraction of a bit
11Succinct bit vectorsNot a code for integers: a representation that answers rank/select queries
12RoaringBitmapPolyalgorithm: picks array, bitset, or run-length per container chunk

Posts 1 and 2 (Kraft’s Inequality and McMillan’s Converse) established why prefix-free codes are the right unit of analysis. Post 3 (Universal Codes as Priors) named the frame: a code is a hypothesis about the source. Posts 4 through 10 filled in the catalogue. Posts 11 and 12 extended from integer coding to set representation, where the questions shift from “how long is this codeword?” to “how do you store membership?” and “how do you answer rank/select?”

Looking across all twelve, the main lesson is not that one code dominates. It is that the question “which code?” is always empirically answerable given a sample.

B. The Unifying Frame Restated

Post 3 introduced the codes-as-priors thesis with two instances behind it. We now have twelve. The thesis has not changed; it has only become more evidently true.

Read More

Arithmetic Coding

January 12, 2025

Huffman codes one symbol at a time. Arithmetic coding encodes the whole sequence as a single number. The difference is a factor of twelve, at least on the right source.

The Last Bit of Redundancy

Huffman coding gets expected codeword length within one bit of entropy. That is the best it can do, because codeword lengths must be integers while entropy is a real number.

The waste is structural. A symbol with probability $p = 0.7$ has optimal (fractional) length $-\log_2(0.7) \approx 0.515$ bits. Huffman rounds that up to 1 bit: 0.485 bits wasted per occurrence. For a nearly-deterministic source with $p_0 = 0.99$ and $p_1 = 0.01$, the entropy is $H \approx 0.081$ bits per symbol. Huffman is stuck at 1 bit per symbol. That is a factor-of-twelve gap, and Huffman cannot close it: a symbol that appears 99% of the time still gets a complete codeword.

Arithmetic coding steps back from per-symbol codewords entirely. It encodes an entire sequence as a single rational number in $[0, 1)$. The bit-length of that number converges to the entropy of the sequence as the sequence grows. No integer rounding, no per-symbol overhead.

This post builds an integer range coder in C++23 and demonstrates the factor-of-twelve improvement on the Bernoulli(0.99) source.


The Continuous View

Start with the unit interval $[0, 1)$. For a two-symbol source, partition it by probability: symbol 0 gets $[0, p_0)$ and symbol 1 gets $[p_0, 1)$.

To encode a sequence, begin with the full interval and narrow it with each symbol. After symbol $s_1$, restrict to the corresponding sub-interval. After $s_2$, apply the same proportional rule inside that sub-interval. After $L$ symbols, the interval has width $\prod_{i=1}^{L} p_{s_i}$.

Any number inside the final interval is a valid encoding. The shortest such number in binary requires approximately $-\log_2(\prod p_{s_i}) = \sum_{i=1}^{L} (-\log_2 p_{s_i})$ bits. As $L \to \infty$, bits per symbol approaches $H(p) = -\sum_k p_k \log_2 p_k$ exactly.

Decoding is the inverse: given the encoded number, determine at each step which sub-interval it falls in, recover the symbol, narrow the interval, and repeat.

The theory is complete. The practice is not. A real interval narrows exponentially fast: after a few dozen symbols you need arbitrary precision. The integer range coder fixes this with 32-bit arithmetic and a renormalization step.

Read More

Huffman Coding

August 4, 2024

Huffman coding is two things: the optimal length vector for a known distribution, and McMillan’s construction applied to that vector. This post develops both.

From Universal to Optimal

Every code in this series so far has been universal: no prior knowledge of the source distribution required. Elias gamma assigns shorter codewords to smaller integers regardless of which integers actually appear. Fibonacci does the same. VByte packs smaller values into fewer bytes without knowing whether your data clusters at the low end or the high end. Universal codes are defensive: they perform acceptably across a broad class of inputs by committing to none.

Huffman flips that stance. You bring a finite probability distribution. Huffman finds the prefix-free code with minimum expected codeword length for that distribution. The code is tuned to what you provide and will perform poorly on anything else. Call this the move from defensive to distribution-specific coding.

The payoff is real. Shannon’s source coding theorem says no prefix-free code can achieve expected length below $H(p) = -\sum_i p_i \log_2 p_i$. Huffman gets within 1 bit of that bound. For any prefix-free code, expected length satisfies

$$H(p) \le L(\text{code}) \le H(p) + 1.$$

The upper bound comes from the integer-length constraint: each codeword is a whole number of bits, and $\lceil -\log_2 p_i \rceil \le -\log_2 p_i + 1$. Huffman is optimal subject to this constraint. No prefix-free code does better without abandoning whole-bit codewords.

That last clause points to the limit of this post and the subject of the next. Arithmetic coding breaks the integer-length constraint by assigning fractional bits in effect, reaching entropy exactly in the limit.

The Algorithm

The four steps of Huffman’s construction:

  1. Create one leaf node per symbol, weighted by its probability.
  2. Push all leaves into a min-priority queue (lowest weight first).
  3. While the queue contains more than one node: extract the two lowest-weight nodes, merge them into a new internal node whose weight is their sum, push the internal node back.
  4. The remaining node is the root. The path from root to each leaf encodes that leaf’s codeword (“0” for left, “1” for right).

Here is the complete implementation from huffman.hpp:

Read More

VByte / Varint

February 25, 2024

Every code in this series so far operates at bit granularity. VByte does not. It gives up bit-level precision for byte-alignment, and in production systems, that trade wins most of the time.

The Practical Question

Every code in this series so far operates at bit granularity. Elias gamma encodes 1 in a single bit. Fibonacci coding uses exactly as many bits as the Zeckendorf representation requires. Bit packing is theoretically attractive because it minimizes the number of bits written, which minimizes the encoded size.

But bit packing is computationally expensive. Reading or writing a single bit requires a shift, a mask, and often a branch to handle byte boundaries. Encoding a sequence of integers this way burns CPU cycles that scale with the number of integers, independent of their values. For high-throughput applications, the overhead of bit manipulation can easily exceed the savings from compact encoding.

VByte (also called Varint in Google’s ecosystem, and LEB128 in the DWARF debug format) trades a small amount of length efficiency for byte-alignment. The idea is simple: encode each integer as a sequence of 7-bit groups, one per byte, with a continuation flag in the high bit of each byte. The result is self-delimiting, compact for small values, and requires no bit-level manipulation to decode.

VByte is the encoding used by Protocol Buffers for all integer fields. It appears in Apache Arrow, Parquet, Snappy’s block format, LevelDB’s metadata, and most production columnar file formats. These are high-throughput systems. Byte-alignment is why VByte is their choice over the more compact universal codes from posts 4 through 7.


The Encoding

VByte splits an integer into 7-bit groups, starting from the least significant bits. Each group occupies one byte where bits 0 through 6 carry 7 data bits and bit 7 is a continuation flag: 1 means more bytes follow, 0 means this is the last byte.

A value in $[0, 127]$ fits in a single byte (continuation flag clear). A value in $[128, 16383]$ requires two bytes (first byte has flag set, second has flag clear). The pattern continues: each additional byte adds 7 bits of capacity.

Read More

Rice / Golomb

September 17, 2023

Every code in this series so far has been fixed. Rice and Golomb are different: they take a parameter, and the parameter is your model of the data.

The First Parametric Code

Every code examined so far in this series has been monolithic. Unary coding is just unary coding. Elias gamma is just Elias gamma. Each one encodes all non-negative integers with a single fixed strategy. You do not get to choose anything about the code beyond whether to use it.

Rice and Golomb codes break this pattern. They are parametric: a single integer parameter, $k$ for Rice or $m$ for Golomb, tunes the code to a specific source distribution. Rice$(k)$ is not one code but a family of codes, one per value of $k$. Each member of the family is optimal for a specific geometric distribution. Choosing $k$ is choosing your prior precisely.

This matters because data sources are rarely uniform. Run-length encodings, inter-frame video differences, and the gap sequences in inverted indexes are all approximately geometrically distributed. If you know the mean of your source, you can pick $k$ so that Rice$(k)$ performs near-optimally, without the overhead of a Huffman table or arithmetic coding.

The key insight: for a geometric source with mean approximately $2^k$, Rice$(k)$ is within a small constant of entropy. No other universal code in this series achieves this. Elias gamma and delta perform well asymptotically but can be far from optimal for a specific geometric distribution with a known mean. Rice exploits that knowledge directly.


Rice Coding

Rice coding splits a non-negative integer $n$ into two parts: a quotient $q = \lfloor n / 2^k \rfloor = n \gg k$ and a remainder $r = n \bmod 2^k = n \mathbin{&} (2^k - 1)$.

The quotient is encoded in unary: $q$ zero bits followed by a stop bit of 1. The remainder is encoded in exactly $k$ bits, MSB first. The total codeword is the concatenation of these two parts.

Codeword examples for $k = 2$ (remainder is always 2 bits):

$n$$q$$r$CodewordBits
0001 003
1011 013
2021 103
3031 113
4100 1 004
5110 1 014

Codeword length: $(n \gg k) + 1 + k$ bits. The Kraft sum saturates to 1, so Rice is a complete prefix-free code.

Read More

Fibonacci Coding

April 23, 2023

Every code in this series so far has optimized expected length under some implied prior. Fibonacci coding does something different: it gives the decoder a way to recover from errors without help from a lower layer.

A Different Design Goal

All the codes in this series have aimed at the same target: assign short codewords to frequent symbols, with length growing roughly as $\log n$ for the $n$-th symbol under some implied prior. Elias gamma minimizes expected length for power-law distributions; delta and omega extend the recursion for heavier tails.

Fibonacci coding has a different goal. It does not optimize for average codeword length under a specific distribution. It optimizes for error resilience. In a stream of gamma-coded integers, a single bit flip in a codeword’s length prefix causes the decoder to misread that codeword’s length, then misread every subsequent codeword. The error propagates without limit until the decoder somehow reacquires sync. On a reliable channel this is a nonissue. On a noisy one, or in stored data that may have silently rotted, it is a serious problem.

Fibonacci coding avoids this. Every Fibonacci codeword ends in two consecutive 1 bits (“11”). This double-one marker appears nowhere else in the codeword. A single bit flip corrupts the codeword it hits, possibly spills into the next codeword, and then the decoder finds the next “11” and resynchronizes. At most two codewords are corrupted per error. The rest of the stream is intact.

The price is length overhead: Fibonacci codewords are approximately $1.44 \times \log_2 n$ bits long, compared to $\log_2 n$ bits for the entropy lower bound. On a reliable channel, that overhead is not worth paying. On a noisy channel, or in a long-running stream where rare bit errors must not lose the entire tail, the self-synchronization property is worth it.


Zeckendorf’s Theorem

Fibonacci numbers starting from $F_2 = 1$: $1, 2, 3, 5, 8, 13, 21, 34, \ldots$

Zeckendorf’s theorem: every positive integer $n$ has a unique representation as a sum of non-consecutive Fibonacci numbers. The greedy algorithm produces it by repeatedly subtracting the largest Fibonacci number that does not exceed $n$.

Read More

Elias Delta and Omega

November 13, 2022

Elias gamma spends too many bits saying how many bits it will use. Delta fixes that. Omega takes the fix one step further. This post is about what happens when you apply recursion to the length prefix.

Where Gamma Stops Being Good

Elias gamma, from the previous post, encodes a positive integer $n$ in $2\lfloor \log_2 n \rfloor + 1$ bits: a unary count of $\lfloor \log_2 n \rfloor$ zeros, then a stop bit, then the $\lfloor \log_2 n \rfloor$ trailing binary bits of $n$. For small $n$ this is fine. For large $n$, nearly half the bits are spent on the unary prefix alone.

The unary prefix is the bottleneck. It encodes the length $L = \lfloor \log_2 n \rfloor + 1$ in the most wasteful possible way: one bit per unit. For $n = 256$, that is 8 zero bits just to say “the payload is 8 bits long.” The payload itself is also 8 bits, so you are paying a 100% overhead on the length announcement. That is bad, and it gets worse as $n$ grows.

The fix is obvious once you see it: encode $L$ itself in some shorter code instead of unary. Elias delta does exactly this, replacing the unary length prefix with a gamma-coded length. Elias omega takes the idea one step further and applies the recursion to itself, all the way down.

Both codes are universal: they assign finite codewords to every positive integer, and the expected codeword length is within a constant factor of optimal for any source whose probabilities decrease with $n$. The improvement over gamma is real and measurable once $n$ grows past a few dozen.

This post shows both implementations, their implied priors, and the crossover points where each code wins. As in the rest of this series, the code is pedagogical: each header stands alone and the struct-with-encode/decode pattern maps directly onto the PFC library’s EliasDelta and EliasOmega in codecs.hpp.


Elias Delta

Algorithm. Let $L = \lfloor \log_2 n \rfloor + 1$ (the bit-width of $n$, equivalently std::bit_width(n)).

  1. Encode $L$ in Elias gamma.
  2. Write the $L - 1$ trailing bits of $n$ after its implicit leading 1, MSB first.

Gamma encodes $L$ (a small integer) in $O(\log \log n)$ bits instead of $O(\log n)$ bits for the unary prefix. The payload is identical to gamma’s: the trailing bits of $n$. The total length is $O(\log n + \log \log n)$.

Read More

Unary and Elias Gamma

June 19, 2022

Unary is older than information theory. Elias gamma is its 1975 improvement. Together they span the gap between optimal-but-impractical and practical-but-nearly-optimal. This post derives what each code bets on, and shows numerically what that means.

Unary and Elias Gamma

Unary is the oldest code in this series. It predates information theory by centuries: a shepherd counting sheep on a stick is using unary. Mark one notch per sheep; count the notches to decode. The codeword for \(n\) is \(n\) tally marks. Its information-theoretic justification came later, when Shannon showed it is exactly optimal for a geometric source.

Elias gamma is the 1975 extension by Peter Elias. It brings the codeword length from \(O(n)\) to \(O(\log n)\), making it practical for numbers beyond small single digits, while keeping the prefix-free property that makes self-delimiting streams possible.

Both codes are instances of the claim from Universal Codes as Priors: every prefix-free code is a bet about the source. Unary bets on a geometric distribution with parameter \(1/2\). Gamma bets on a power-law distribution with exponent \(\approx 2\). This post implements both, derives their implied priors, and shows numerically what the bets mean.


Unary: Geometric Prior

The encoding rule for unary is simple: to encode integer \(n \geq 1\), write \((n-1)\) zero bits followed by one 1 bit. The decoder reads bits until it sees the 1; the number of bits read is the decoded value.

Examples: \(1 \to\) 1, \(2 \to\) 01, \(3 \to\) 001, \(4 \to\) 0001.

struct Unary {
    using value_type = std::uint64_t;

    template<BitSink S>
    static void encode(value_type n, S& sink) {
        assert(n >= 1 && "Unary is undefined for n = 0");
        for (value_type i = 1; i < n; ++i) sink.write(false);
        sink.write(true);
    }

    template<BitSource S>
    static value_type decode(S& source) {
        value_type n = 1;
        while (!source.read()) ++n;
        return n;
    }
};

Length analysis. The codeword for \(n\) has length \(n\). The Kraft sum is \(\sum_{n=1}^{\infty} 2^{-n} = 1\): unary saturates Kraft exactly. The implied prior is \(p_n = 2^{-n}\): a geometric distribution with parameter \(1/2\), where each value is half as likely as the previous.

Optimality test. Because the implied prior is dyadic (all probabilities are powers of \(1/2\)) and Kraft saturates, unary achieves entropy exactly on this prior. For a 30-symbol truncation of geometric(1/2), the expected unary length equals the entropy to within the truncation tail (\(\approx 2^{-30}\)):

Read More

Universal Codes as Priors

January 15, 2022

When you pick a code for integers, you are making a bet about what integers the source will produce. The bet lives in the codeword lengths, not in a separate parameter. This post makes that precise.

Universal Codes as Priors

You want to compress a stream of positive integers. Which code should you use?

The question has more structure than it appears. A code for integers assigns a codeword to each integer. The codeword for 1 is short, for 2 a bit longer, for 100 much longer. The relative lengths encode an implicit bet: what fraction of the stream will be 1s? What fraction will be 100s? If the bet matches the source, the average codeword length will be close to the theoretical minimum, the entropy. If the bet is wrong, you pay an overhead proportional to how wrong you are.

The bet is not a separate parameter. It lives in the codeword lengths themselves. This is the central claim of this post:

Every prefix-free code is a prior over the integers. The codeword lengths determine, up to normalization, a probability distribution. The code is optimal for exactly the sources that match that distribution.

This post makes that claim precise and implements the tools to measure it. The rest of the series (posts 4 through 12) examines ten specific codes and the priors they embody.


The Correspondence: Lengths to Priors

For a prefix-free code with codeword lengths \((l_1, l_2, \ldots, l_n)\), define the unnormalized weight of symbol \(i\) as \(w_i = 2^{-l_i}\). This is the fraction of the Kraft budget consumed by that codeword.

If the code saturates Kraft (meaning \(\sum_i 2^{-l_i} = 1\)), then the weights are already a valid probability distribution: \(p_i = 2^{-l_i}\). If the code does not saturate (meaning \(\sum_i 2^{-l_i} < 1\)), normalize: \(p_i = 2^{-l_i} / \sum_j 2^{-l_j}\).

This is the inverse of Shannon’s prescription. Shannon says: given a distribution \(p_i\), the optimal codeword length is \(\lceil -\log_2 p_i \rceil\) bits. We reverse the direction: given a length \(l_i\), the implied probability is \(2^{-l_i}\).

The function implied_prior computes this map:

inline std::vector<double> implied_prior(const std::vector<std::size_t>& lengths) {
    std::vector<double> probs;
    probs.reserve(lengths.size());
    double total = 0.0;
    for (std::size_t l : lengths) {
        double p = std::ldexp(1.0, -static_cast<int>(l));
        probs.push_back(p);
        total += p;
    }
    // Normalize if Kraft sum is less than 1.
    if (total < 1.0) {
        for (double& p : probs) p /= total;
    }
    return probs;
}

Two examples show the range of priors you get in practice.

Read More

McMillan's Converse

September 13, 2020

Kraft’s inequality is necessary. McMillan’s theorem says it is also sufficient, and the proof is a construction.

McMillan’s Converse

The previous post in this series proved Kraft’s inequality: for any prefix-free binary code with codeword lengths \(l_1, l_2, \ldots, l_n\),

$$\sum_{i=1}^{n} 2^{-l_i} \leq 1.$$

Every prefix-free code satisfies it. No exceptions. But necessity alone is not the useful direction. The question I want answered is the converse: given a length vector that satisfies Kraft, does a prefix-free code with those lengths actually exist?

Yes, and McMillan’s theorem (1956) proves it. Better still, the proof is a construction: given any Kraft-satisfying length vector, you can produce a specific prefix-free code with those exact lengths. No search required. No verification required after the fact. The construction always terminates, always produces a valid code, because Kraft pre-certifies that the budget is sufficient.

This post proves the constructive direction, then goes further. McMillan proved something stronger than just the prefix-free converse. He showed that even uniquely-decodable codes that are not prefix-free must satisfy Kraft. The consequence is worth sitting with: there is no advantage to non-prefix-free designs. If a code can be uniquely decoded, a prefix-free code with the same lengths exists. Prefix-freeness is not a restriction you impose for convenience. It is just the cleanest form of what unique decodability requires.

The Construction

The construction is a left-to-right walk through an imaginary binary trie. Sort the lengths, then assign codewords by taking the next available leaf at each step.

Concretely: fix a counter at zero, and for each length \(l_i\) (in sorted order), emit the binary representation of counter >> (l_max - l_i) left-padded to \(l_i\) bits. Then advance the counter by \(2^{l_{\max} - l_i}\), which skips past the entire subtree rooted at the just-assigned codeword. That advance ensures the next codeword starts at the first unoccupied leaf position in the depth-\(l_{\max}\) trie.

Work through the example from post 1: lengths \(\{1, 2, 3, 3\}\). Sort: \(1, 2, 3, 3\). Take \(l_{\max} = 3\).

  • Length 1: counter is 0. Shift right by \(3 - 1 = 2\): emit 0 >> 2 = 0 as a 1-bit string, giving codeword "0". Advance counter by \(2^{3-1} = 4\). Counter is now 4.
  • Length 2: counter is 4 (binary 100). Shift right by \(3 - 2 = 1\): emit 4 >> 1 = 2 as a 2-bit string, giving "10". Advance by \(2^{3-2} = 2\). Counter is now 6.
  • Length 3: counter is 6 (binary 110). Shift right by \(3 - 3 = 0\): emit 6 as a 3-bit string, giving "110". Advance by \(2^0 = 1\). Counter is 7.
  • Length 3: counter is 7 (binary 111). Emit 7 as a 3-bit string: "111". Advance by 1. Counter is 8.

Result: {"0", "10", "110", "111"}. This is exactly the example code from post 1. The construction recovered it directly from the length vector, without any search.

Read More

Kraft's Inequality

March 22, 2020

Every prefix-free code satisfies one inequality. That inequality is also sufficient. This post develops the necessary direction.

Kraft’s Inequality

I want a code where each symbol maps to a bit string, and where any concatenation of codewords can be decoded unambiguously. The simplest way to guarantee that is prefix-freeness: no codeword is a prefix of any other. A prefix-free code is self-delimiting. The decoder reads bits left-to-right and knows exactly when each codeword ends, with no lookahead and no length headers.

The question I keep returning to is: which collections of lengths are actually achievable? If I want four codewords of lengths 1, 2, 3, and 3, can I build a prefix-free code with those lengths? What if I want two codewords of length 1? (No: there are only two 1-bit strings, and they are prefixes of everything longer.)

Kraft’s inequality is the answer. A length vector \((l_1, l_2, \ldots, l_n)\) is achievable by a prefix-free binary code only if

$$\sum_{i=1}^{n} 2^{-l_i} \leq 1.$$

This is the constraint you cannot escape. Any prefix-free code satisfies it. Any length vector that violates it cannot be realized as a prefix-free code, full stop.

The converse is also true: any length vector satisfying Kraft is realizable by some prefix-free code. That is McMillan’s theorem, and it is the subject of the next post in this series. This post develops the necessary direction: every prefix-free code satisfies Kraft.

The right tool for understanding why is the binary tree.

The Trie View

Represent each codeword as a path in a binary tree. Start at the root. For each bit, go left (0) or right (1). The codeword ends at a node, which I mark as a terminal. A code is prefix-free if and only if no terminal node has any descendants that are also terminals. Once you reach a terminal on the way down, you stop.

The example code \(\{A \to \texttt{0},\ B \to \texttt{10},\ C \to \texttt{110},\ D \to \texttt{111}\}\) has lengths \((1, 2, 3, 3)\). Its trie looks like this:

         root
        /    \
       0      1
      [A]    / \
            0   1
           [B]  / \
               0   1
              [C] [D]

A is at depth 1, left branch. B is at depth 2, right-then-left. C and D share a parent at depth 2, then split at depth 3. No codeword’s node is an ancestor of another’s: the code is prefix-free.

Read More