Kraft's Inequality
Which codeword-length vectors are achievable by prefix-free codes? Kraft's inequality is the answer.
A pedagogical series exploring information theory by construction in C++23
This series develops information theory the same way the Stepanov series develops algorithms: by construction, with minimal pedagogical code, and with the algebraic structure made explicit.
The Stepanov series argued that the algebraic structure of a type determines its algorithms. This companion series argues the analogous claim for bit-level encodings: the algebraic structure of a code determines its compression efficiency, its decoding cost, and its compositional behavior. Each universal code corresponds to a different prior over the integers; each entropy-optimal code (Huffman, arithmetic) is best under different assumptions; each succinct data structure is the right answer under different access patterns.
A prefix-free code assigns codeword lengths $(l_1, \ldots, l_n)$ to symbols. Kraft’s inequality characterizes which length vectors are achievable:
$$\sum_i 2^{-l_i} \leq 1.$$

Within this budget, every code is a different way of spending it. Allocating more bits to symbol $i$ means less budget for the others. Information theory’s optimum (assigning $-\log_2 p_i$ bits to a symbol with probability $p_i$) saturates Kraft exactly when the probabilities are dyadic, i.e. when each $p_i$ is a power of $1/2$.
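To make the budget concrete, here is a minimal check in the pedagogical style of this series (a sketch of ours, not PFC's API). It evaluates the Kraft sum exactly in 64-bit fixed point, so floating-point rounding never blurs the boundary between a saturated budget and a violated one; lengths are assumed to lie in $[1, 64]$.

```cpp
#include <cstdint>
#include <print>
#include <span>
#include <vector>

// Checks sum_i 2^{-l_i} <= 1 exactly in fixed point: the term 2^{-l}
// is stored as the integer 2^{64-l} (units of 2^{-64}), so the budget
// "1" is exactly one wraparound of a 64-bit accumulator.
// Assumes every length lies in [1, 64].
constexpr bool satisfies_kraft(std::span<const unsigned> lengths) {
    std::uint64_t acc = 0;  // running Kraft sum, in units of 2^{-64}
    bool full = false;      // true once the sum hits exactly 1
    for (unsigned l : lengths) {
        if (full) return false;  // budget already spent
        const std::uint64_t term = std::uint64_t{1} << (64 - l);
        acc += term;
        if (acc < term) {                // wrapped past 2^64, i.e. past 1
            if (acc != 0) return false;  // sum > 1: not achievable
            full = true;                 // sum == 1: budget saturated
        }
    }
    return true;
}

int main() {
    std::vector<unsigned> dyadic{1, 2, 3, 3};  // 1/2 + 1/4 + 1/8 + 1/8 = 1
    std::vector<unsigned> slack{2, 2, 3};      // 1/4 + 1/4 + 1/8 < 1
    std::vector<unsigned> over{1, 1, 2};       // 1/2 + 1/2 + 1/4 > 1
    std::println("{} {} {}", satisfies_kraft(dyadic),
                 satisfies_kraft(slack), satisfies_kraft(over));
    // prints: true true false
}
```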
This series walks through the codes that have shown up in practice, treating each one as a different hypothesis about the integer distribution it expects to encode. The choices are not arbitrary: each code is optimal somewhere, and recognizing where reframes “compression algorithm” as “model selection.”
The code in this series is pedagogical. The production version (with full STL integration, 31k+ test assertions, and the rich combinator library these posts only sketch) lives in PFC.
The Stepanov series develops algorithms from the algebra of types. The bridge posts (Bits Follow Types, When Lists Become Bits) connect that series to this one.
The posts in the series:

Which codeword-length vectors are achievable by prefix-free codes? Kraft's inequality is the answer.
Any length vector satisfying Kraft has a prefix-free code. Here is the construction.
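A sketch of that construction (the function name `kraft_construct` is ours): sort the lengths, then hand out codewords left to right along the unit interval, spending $2^{-l}$ of the budget per codeword. Because the lengths are processed in nondecreasing order, the running sum is a multiple of $2^{-l}$ at every step, so its top $l$ bits are the next codeword.

```cpp
#include <algorithm>
#include <cstdint>
#include <print>
#include <string>
#include <vector>

// Kraft's construction: with the lengths sorted and Kraft satisfied,
// each codeword is the top l bits of the running sum, and no codeword
// can be a prefix of a later (longer) one.
// Assumes lengths in [1, 64]; returns codewords in sorted-length order.
std::vector<std::string> kraft_construct(std::vector<unsigned> lengths) {
    std::ranges::sort(lengths);
    std::vector<std::string> codewords;
    std::uint64_t acc = 0;  // running Kraft sum, in units of 2^{-64}
    for (unsigned l : lengths) {
        const std::uint64_t top = acc >> (64 - l);  // top l bits of the sum
        std::string w(l, '0');
        for (unsigned b = 0; b < l; ++b)
            if ((top >> (l - 1 - b)) & 1) w[b] = '1';
        codewords.push_back(std::move(w));
        acc += std::uint64_t{1} << (64 - l);  // spend 2^{-l} of the budget
    }
    return codewords;
}

int main() {
    for (const std::string& w : kraft_construct({2, 1, 3, 3}))
        std::println("{}", w);  // prints 0, 10, 110, 111
}
```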
Every prefix-free code is a hypothesis about the source. The codeword lengths determine an implicit probability distribution; the code is optimal when that prior matches the true source.
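To make "hypothesis" precise: for lengths that saturate Kraft, the implicit distribution is $q_i = 2^{-l_i}$, and coding a true source $p$ with those lengths costs, per symbol,

$$\mathbb{E}_p[l] = \sum_i p_i \log_2 \frac{1}{q_i} = H(p) + D_{\mathrm{KL}}(p \,\|\, q),$$

so the excess over entropy is exactly the KL divergence, which vanishes precisely when $q = p$.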
Unary and Elias gamma are the two simplest universal codes. Unary encodes $n$ in $n$ bits; gamma in $2\lfloor\log_2 n\rfloor + 1$ bits. Each implies a different prior over the integers.
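A minimal gamma encoder, assuming $n \geq 1$ (the name `elias_gamma` is ours). Its zero-run prefix is itself a unary code for the bit length, which is what makes gamma self-delimiting.

```cpp
#include <bit>
#include <cstdint>
#include <print>
#include <string>

// Elias gamma for n >= 1: emit floor(log2 n) zeros, then the binary
// digits of n MSB-first (which begin with a 1). Total length is
// 2*floor(log2 n) + 1 bits, matching the formula above.
std::string elias_gamma(std::uint64_t n) {
    const int k = std::bit_width(n) - 1;  // k = floor(log2 n)
    std::string out(k, '0');              // unary prefix: "n has k+1 bits"
    for (int b = k; b >= 0; --b)          // then n itself, MSB first
        out.push_back(((n >> b) & 1) ? '1' : '0');
    return out;
}

int main() {
    for (std::uint64_t n : {1, 2, 5, 13})
        std::println("gamma({}) = {}", n, elias_gamma(n));
    // gamma(1) = 1, gamma(2) = 010, gamma(5) = 00101, gamma(13) = 0001101
}
```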
Elias delta and omega extend Elias gamma by recursively encoding the length prefix. Each step yields shorter codewords for large integers at a small constant cost for small ones.
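A sketch of delta, with the gamma helper repeated so the example stands alone (both names are ours, and $n \geq 1$ is assumed):

```cpp
#include <bit>
#include <cstdint>
#include <print>
#include <string>

// Elias gamma (as in the previous sketch), needed as delta's prefix.
std::string elias_gamma(std::uint64_t n) {
    const int k = std::bit_width(n) - 1;
    std::string out(k, '0');
    for (int b = k; b >= 0; --b)
        out.push_back(((n >> b) & 1) ? '1' : '0');
    return out;
}

// Elias delta for n >= 1: gamma-encode the bit count of n, then emit
// n's bits below its leading 1. Lengths grow like log n + 2 log log n,
// versus gamma's 2 log n: shorter for large n, one level of recursion
// deeper. Omega iterates this idea until the prefix encodes itself.
std::string elias_delta(std::uint64_t n) {
    const int k = std::bit_width(n) - 1;   // n has k+1 significant bits
    std::string out = elias_gamma(k + 1);  // recursively encode the length
    for (int b = k - 1; b >= 0; --b)       // n without its leading 1
        out.push_back(((n >> b) & 1) ? '1' : '0');
    return out;
}

int main() {
    for (std::uint64_t n : {1, 2, 13, 1000})
        std::println("delta({}) = {}", n, elias_delta(n));
    // delta(1) = 1, delta(2) = 0100, delta(13) = 00100101
}
```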
Fibonacci coding uses Zeckendorf's representation to produce self-synchronizing codewords. Every codeword ends in two consecutive ones; a single bit flip corrupts at most two codewords.
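A sketch of the encoder (the name `fib_encode` is ours, and $n \geq 1$ is assumed):

```cpp
#include <cstddef>
#include <cstdint>
#include <print>
#include <string>
#include <vector>

// Fibonacci coding for n >= 1, via the Zeckendorf representation:
// greedily take the largest Fibonacci number that fits, which never
// selects two consecutive ones. Bits are emitted for F(2)=1, F(3)=2,
// F(4)=3, ... smallest first, then a final '1' is appended, so "11"
// occurs exactly once, at the end of every codeword.
std::string fib_encode(std::uint64_t n) {
    std::vector<std::uint64_t> fib{1, 2};  // F(2), F(3), ...
    while (fib.back() <= n)
        fib.push_back(fib[fib.size() - 1] + fib[fib.size() - 2]);
    std::string rev;  // bits from the largest Fibonacci down
    bool started = false;
    for (std::size_t i = fib.size(); i-- > 0; ) {
        if (fib[i] <= n) {
            n -= fib[i];
            rev.push_back('1');
            started = true;
        } else if (started) {
            rev.push_back('0');
        }
    }
    std::string out(rev.rbegin(), rev.rend());  // smallest Fibonacci first
    out.push_back('1');                         // terminator: trailing "11"
    return out;
}

int main() {
    for (std::uint64_t n : {1, 2, 3, 4, 12})
        std::println("fib({}) = {}", n, fib_encode(n));
    // fib(1) = 11, fib(2) = 011, fib(3) = 0011, fib(4) = 1011, fib(12) = 101011
}
```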
Rice and Golomb codes are parametric: a single parameter $k$ (or $m$) tunes the code to a specific geometric distribution. Choosing $k$ is choosing your prior precisely.
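A sketch of the Rice case, where $m = 2^k$ turns the Golomb division into a shift (the name `rice_encode` is ours):

```cpp
#include <cstdint>
#include <print>
#include <string>

// Rice code with parameter k (Golomb with m = 2^k), for n >= 0:
// the quotient n >> k in unary (a run of ones closed by a zero),
// followed by the low k bits of n in binary. Geometric sources with
// mean near 2^k keep the quotient small, which is where this wins.
std::string rice_encode(std::uint64_t n, unsigned k) {
    std::string out(n >> k, '1');    // unary quotient
    out.push_back('0');              // unary terminator
    for (unsigned b = k; b-- > 0; )  // k-bit remainder, MSB first
        out.push_back(((n >> b) & 1) ? '1' : '0');
    return out;
}

int main() {
    for (std::uint64_t n : {0, 3, 9, 20})
        std::println("rice_2({}) = {}", n, rice_encode(n, 2));
    // rice_2(0) = 000, rice_2(3) = 011, rice_2(9) = 11001, rice_2(20) = 11111000
}
```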
VByte trades bit-level precision for byte alignment, and that trade wins in practice. Most production columnar databases and network protocols use VByte for integer encoding.
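A sketch of the common little-endian variant (LEB128-style: low seven bits first, high bit set while more bytes follow); the name `vbyte_encode` is ours, and other VByte variants permute the same ingredients.

```cpp
#include <cstdint>
#include <print>
#include <vector>

// VByte (LEB128-style variant): seven payload bits per byte, low
// bits first; the high bit says "another byte follows". Small values
// cost one byte, and every codeword ends on a byte boundary, which is
// what makes decoding branch-light compared to bit-granular codes.
std::vector<std::uint8_t> vbyte_encode(std::uint64_t n) {
    std::vector<std::uint8_t> out;
    while (n >= 0x80) {
        out.push_back(static_cast<std::uint8_t>(n) | 0x80);  // continuation set
        n >>= 7;
    }
    out.push_back(static_cast<std::uint8_t>(n));  // final byte, high bit clear
    return out;
}

int main() {
    for (std::uint64_t n : {5, 300, 1000000}) {
        std::print("vbyte({}) =", n);
        for (std::uint8_t byte : vbyte_encode(n))
            std::print(" {:#04x}", byte);
        std::println("");
    }
    // vbyte(5) = 0x05, vbyte(300) = 0xac 0x02, vbyte(1000000) = 0xc0 0x84 0x3d
}
```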
Given a finite distribution, Huffman's algorithm builds the prefix-free code with minimum expected length. It is the first entropy-optimal code in this series.
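A sketch that keeps only what this series cares about, the lengths (the name `huffman_lengths` is ours; ties in the heap are broken arbitrarily, which can permute lengths among equal-probability symbols without changing the expected length):

```cpp
#include <cstddef>
#include <print>
#include <queue>
#include <utility>
#include <vector>

// Huffman's algorithm, reduced to codeword lengths: repeatedly merge
// the two lightest subtrees; each merge pushes every leaf in both
// subtrees one level deeper, so it suffices to track, per subtree,
// its weight and its leaves' depths. The resulting lengths always
// saturate Kraft: sum_i 2^{-l_i} = 1.
std::vector<unsigned> huffman_lengths(const std::vector<double>& probs) {
    // heap entry: (subtree weight, list of (symbol index, depth))
    using Tree = std::pair<double, std::vector<std::pair<std::size_t, unsigned>>>;
    auto lighter = [](const Tree& a, const Tree& b) { return a.first > b.first; };
    std::priority_queue<Tree, std::vector<Tree>, decltype(lighter)> heap(lighter);
    for (std::size_t i = 0; i < probs.size(); ++i)
        heap.push({probs[i], {{i, 0u}}});
    while (heap.size() > 1) {
        Tree a = heap.top(); heap.pop();
        Tree b = heap.top(); heap.pop();
        for (auto& leaf : a.second) ++leaf.second;  // one level deeper
        for (auto& leaf : b.second) ++leaf.second;
        a.second.insert(a.second.end(), b.second.begin(), b.second.end());
        heap.push({a.first + b.first, std::move(a.second)});
    }
    std::vector<unsigned> lengths(probs.size());
    for (const auto& [symbol, depth] : heap.top().second)
        lengths[symbol] = depth;
    return lengths;
}

int main() {
    // A dyadic source: Huffman recovers exactly -log2 p_i bits.
    std::println("{}", huffman_lengths({0.5, 0.25, 0.125, 0.125}));
    // prints: [1, 2, 3, 3]
}
```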
Arithmetic coding closes the gap between Huffman's per-symbol integer lengths and true entropy. A single number in the unit interval encodes an entire sequence; 32-bit integer arithmetic makes it practical.
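A worked taste of the interval narrowing, on a toy source with $p(a) = 2/3$, $p(b) = 1/3$ and the message $aab$: each symbol shrinks the current interval to its own sub-interval,

$$[0,1) \;\xrightarrow{a}\; \left[0, \tfrac{2}{3}\right) \;\xrightarrow{a}\; \left[0, \tfrac{4}{9}\right) \;\xrightarrow{b}\; \left[\tfrac{8}{27}, \tfrac{12}{27}\right).$$

The final width is $\tfrac{2}{3} \cdot \tfrac{2}{3} \cdot \tfrac{1}{3} = \tfrac{4}{27}$, the product of the symbol probabilities, so any number in that interval identifies the whole message in about $\log_2 \tfrac{27}{4} \approx 2.75$ bits, with no rounding up to a whole bit per symbol.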