Stepanov

Below you will find pages that utilize the taxonomy term “Stepanov”

Bits Follow Types

April 23, 2026

Every type decomposes structurally. So does its codec.

Codecs as Functors

You have an optional<vector<pair<int, string>>>. The type decomposes structurally: it is an optional of a free monoid of products of an integer and a string. That decomposition is not an observation about memory layout. It is a statement about the algebraic structure of the type.

Now ask: does the codec decompose the same way?

If the answer is yes, you stop writing one-off encoders. You build a codec for optional<T> from a codec for T. You build a codec for vector<T> from a codec for T. The codec for optional<vector<pair<int, string>>> assembles from its parts with no manual layout decisions, no hand-placed length headers, no ad-hoc format negotiation.

This post argues that the answer is always yes, and shows what the machinery looks like. The thesis: codecs are not ad-hoc bit formats. They are constructions on the algebraic structure of types. The algebraic structure of a type determines its codec, the same way it determines its algorithms.

This extends Stepanov’s claim. The peasant algorithm post showed that algorithms arise from algebraic structure. The homomorphism post showed that structure-preserving maps are the natural morphisms. Here, we show the codec itself is a structure-preserving map, and that it lifts from leaf types to compound types by the same algebraic logic.

Bit I/O: The Foundation

Before combinators, we need concrete bit I/O. The approach taken here follows Stepanov’s move in the algorithm posts: state the concept first, then provide a model.

Two concepts govern bit-level I/O:

template<typename T>
concept BitSink = requires(T& s, bool bit) {
    { s.write(bit) } -> std::same_as<void>;
};

template<typename T>
concept BitSource = requires(T& s) {
    { s.read() } -> std::same_as<bool>;
    { s.peek() } -> std::convertible_to<bool>;
};

A BitSink accepts bits. A BitSource supplies them. A codec is an algorithm parameterized over BitSink and BitSource, not a class hierarchy. This is Stepanov’s move at the bit level: require only what the algorithm needs, let anything that satisfies the concept participate.

The standard models are BitWriter and BitReader, which pack bits into byte buffers in LSB-first order:

class BitWriter {
    std::span<std::uint8_t> buf_;
    std::size_t byte_idx_ = 0;
    std::uint8_t byte_ = 0;
    std::uint8_t bit_pos_ = 0;
public:
    explicit BitWriter(std::span<std::uint8_t> buf) noexcept : buf_(buf) {}

    void write(bool bit) noexcept {
        byte_ |= (bit ? std::uint8_t{1} : std::uint8_t{0}) << bit_pos_;
        if (++bit_pos_ == 8) {
            buf_[byte_idx_++] = byte_;
            byte_ = 0;
            bit_pos_ = 0;
        }
    }

    void align() noexcept {
        if (bit_pos_ > 0) {
            buf_[byte_idx_++] = byte_;
            byte_ = 0;
            bit_pos_ = 0;
        }
    }

    [[nodiscard]] std::size_t bytes_written() const noexcept {
        return byte_idx_ + (bit_pos_ > 0 ? 1 : 0);
    }
};

class BitReader {
    std::span<const std::uint8_t> buf_;
    std::size_t byte_idx_ = 0;
    std::uint8_t bit_pos_ = 0;
public:
    explicit BitReader(std::span<const std::uint8_t> buf) noexcept : buf_(buf) {}

    bool read() noexcept {
        bool bit = ((buf_[byte_idx_] >> bit_pos_) & 1) != 0;
        if (++bit_pos_ == 8) {
            ++byte_idx_;
            bit_pos_ = 0;
        }
        return bit;
    }

    [[nodiscard]] bool peek() const noexcept {
        return byte_idx_ < buf_.size();
    }
};

A codec concept rounds out the three-concept core:

When Lists Become Bits

April 23, 2026

The free monoid on a type lifts to bit space. It lifts injectively only when the element codec is prefix-free.

Prefix-Free Codes and the Free Monoid

You have a list of unsigned integers. Encode the list as a single bit string.

Fixed-width encoding wastes space. If you allocate 64 bits per integer, small values like 1 or 7 cost as much as values near $2^{64}$. Variable-width encoding recovers that space, but immediately raises a harder question: where does one encoded integer end and the next begin?

Two escape routes. First, prefix each encoded item with its length. That works, but the length headers are overhead, and you now need a codec for the lengths as well. Second, choose a code where the structure of the codewords makes boundaries unambiguous without any headers. These are prefix-free codes, and this is the right answer, in a precise categorical sense.

The “precise categorical sense” is what this post develops. Encoding a list as the concatenation of encoded elements is a monoid homomorphism from the free monoid on $T$ to the monoid of bit strings under concatenation. The universal property of the free monoid guarantees this homomorphism always exists. The question of whether the decoder can invert it comes down to exactly one property of the element codec: whether it is prefix-free.

The Free Monoid, Recalled

A monoid is a set with an associative binary operation and an identity element. The free monoid on a set $S$ is the set of all finite sequences of elements from $S$, with concatenation as the operation and the empty sequence as the identity.

“Free” means no equations hold except those forced by the monoid axioms. Nothing is identified with anything else. If you need commutativity or idempotency, you quotient the free monoid by additional equations. But the free monoid itself imposes nothing beyond associativity and identity.

The universal property says: given any monoid $M$ and any function $f: S \to M$, there is exactly one monoid homomorphism $\hat{f}: \text{Free}(S) \to M$ that extends $f$. That unique extension is fold:

$$\hat{f}([x_1, x_2, \ldots, x_n]) = f(x_1) \cdot f(x_2) \cdot \cdots \cdot f(x_n)$$

where $\cdot$ is the operation in $M$. The free-algebra post develops this in full. For this post, the one fact that matters is that fold is canonical: it is the unique way to extend a per-element map to a list-consuming function that respects the monoid structure.

Free Algebras: Why Lists and Polynomials Are Universal

March 13, 2026

Lists are everywhere in programming. Not because they are convenient. Because they are algebraically universal.

Why Lists?

Arrays are more cache-friendly. Hash maps have better lookup. Yet lists (sequences, vectors, streams) remain the default container in nearly every language. The standard explanation is convention, or ease of construction. The real explanation is algebraic.

A list is the free monoid. It is the most general monoid you can build from a set of generators. And the universal property of free monoids says that fold, the operation that processes a list element by element, is not a design pattern. It is a theorem.

The Free Monoid

Start with a set $S$ of generators. The free monoid on $S$ is the set of all finite sequences of elements from $S$, with concatenation as the operation and the empty sequence as the identity.

“Free” means: no equations hold except those forced by the monoid axioms (associativity and identity). In particular:

$[a, b] \neq [b, a]$. Commutativity is not imposed.
$[a, a] \neq [a]$. Idempotency is not imposed.
$[a, b, c] = [a] \cdot [b] \cdot [c]$. Every sequence is a product of singletons.

In C++:

template<typename T>
class free_monoid {
    std::vector<T> elements_;
public:
    free_monoid() = default;
    explicit free_monoid(T x) : elements_{std::move(x)} {}
    // ...
};

// Monoid operations via ADL
template<typename T>
free_monoid<T> op(const free_monoid<T>& a, const free_monoid<T>& b);  // concatenation

template<typename T>
free_monoid<T> identity(const free_monoid<T>&);  // empty sequence

The free_monoid<int> is the type of finite sequences of integers. Its operation is concatenation. It satisfies the Monoid concept. And it is the most general monoid on int: no structure beyond associativity and identity.

The Universal Property

Here is the key fact. Given any function $f: S \to M$ where $M$ is a monoid, there exists a unique monoid homomorphism $\overline{f}: \text{Free}(S) \to M$ extending $f$. This homomorphism is defined by:

$$\overline{f}([a_1, a_2, \ldots, a_n]) = f(a_1) \cdot f(a_2) \cdot \ldots \cdot f(a_n)$$

In code:

template<Monoid M, typename T, typename F>
M extend(F f, const free_monoid<T>& xs) {
    M result = identity(M{});
    for (const auto& x : xs.elements())
        result = op(result, f(x));
    return result;
}

This is fold. The universal property says fold is the only structure-preserving way to interpret a list in a monoid. Any function that respects the monoid structure must agree with fold.

Fold Is a Theorem

When you write std::accumulate or std::reduce, you are invoking the universal property. The homomorphism condition:

Homomorphisms: The Maps Between Structures

March 13, 2026

A homomorphism is a function that preserves algebraic structure. This post shows that fold, sum, length, and even the logarithm are all the same idea.

Structures and Maps

The series so far has built up algebraic structures: monoids in the peasant post, rings in the modular post, Euclidean domains in the polynomial post, product monoids in the accumulator post. But structures alone are half the story. The other half is the maps between them.

A homomorphism is a function $f: A \to B$ between two structures of the same kind that preserves the operation:

$$f(a \oplus b) = f(a) \oplus f(b)$$

The operation on the left is in $A$. The operation on the right is in $B$. The function $f$ “commutes” with the operation. That is the entire definition.

The Concept

We need a monoid concept for this post. A monoid has an associative binary operation and an identity element. As always, the operations are ADL free functions:

template<typename M>
concept Monoid = std::semiregular<M> &&
    requires(M a, M b) {
        { op(a, b) } -> std::convertible_to<M>;
        { identity(a) } -> std::convertible_to<M>;
    };

And a runtime check that a function preserves the monoid structure:

template<Monoid A, Monoid B, typename F>
bool is_homomorphism(F f, const A& a1, const A& a2) {
    return f(op(a1, a2)) == op(f(a1), f(a2));
}

This tests one pair of inputs. It is not a proof, but it catches violations.

Examples Everywhere

Length. The string monoid (concatenation, empty string) maps to the integers under addition. The map is length. And length is a homomorphism:

$$\text{length}(s_1 + s_2) = \text{length}(s_1) + \text{length}(s_2)$$

The length of a concatenation is the sum of the lengths. This is not a coincidence. It is the homomorphism property.

Sum. Lists of integers under concatenation form a monoid. The integers under addition form a monoid. The map sum is a homomorphism:

$$\text{sum}(xs \mathbin{+\!\!+} ys) = \text{sum}(xs) + \text{sum}(ys)$$

Product. Same source monoid, different target. Now the integers under multiplication:

$$\text{prod}(xs \mathbin{+\!\!+} ys) = \text{prod}(xs) \times \text{prod}(ys)$$

Logarithm. The positive reals under multiplication form a monoid. The reals under addition form a monoid. The logarithm maps one to the other:

$$\log(a \times b) = \log(a) + \log(b)$$

This is the defining property of the logarithm, stated as algebra. The logarithm is a homomorphism from $(\mathbb{R}^+, \times, 1)$ to $(\mathbb{R}, +, 0)$.

Count. For any type $T$, the function that sends a list to its length is a homomorphism from the list monoid to $(\mathbb{Z}, +, 0)$. This is the same as the string length example, generalized.

Lattices: Fixed Points and Iteration

March 13, 2026

Lattices have two operations of a different kind than rings. The structure determines a fixed-point algorithm.

Two Operations, Different Rules

Monoids have one binary operation. Rings have two (addition and multiplication) linked by distributivity. Lattices also have two operations, but with different laws entirely.

A lattice is a set with two operations:

meet ($\wedge$): greatest lower bound
join ($\vee$): least upper bound

Both are idempotent, commutative, and associative. And they satisfy the absorption laws:

$$a \wedge (a \vee b) = a \qquad a \vee (a \wedge b) = a$$

Absorption is what distinguishes lattices from a pair of unrelated monoids. It ties meet and join together: knowing one constrains the other.

A bounded lattice adds a least element (bottom, $\bot$) and a greatest element (top, $\top$). Bottom is the identity for join, top is the identity for meet.

In C++20 concepts, with ADL free functions:

template<typename L>
concept Lattice = std::semiregular<L> &&
    requires(L a, L b) {
        { meet(a, b) } -> std::convertible_to<L>;
        { join(a, b) } -> std::convertible_to<L>;
    };

template<typename L>
concept BoundedLattice = Lattice<L> &&
    requires(L a) {
        { bottom(a) } -> std::convertible_to<L>;
        { top(a) } -> std::convertible_to<L>;
    };

Four Examples

Sign lattice. Abstract signs of integers: bottom (unreachable), negative, zero, positive, top (unknown). Meet is greatest lower bound, join is least upper bound in the Hasse diagram. This is the classic abstract interpretation domain. You can define abstract arithmetic on it: pos * neg = neg, neg + neg = neg, pos + neg = top.

Intervals. Closed intervals $[a, b]$ ordered by inclusion. Meet is intersection. Join is the smallest enclosing interval. Bottom is the empty interval. Top is the full range. This is the foundation of interval arithmetic.

Divisors. Positive integers ordered by divisibility. Meet is gcd, join is lcm. Bottom is 1 (divides everything), top is 0 (everything divides 0). Lattice structure appearing in number theory.

Power sets. Subsets of $\{0, \ldots, N-1\}$. Meet is intersection (bitwise AND), join is union (bitwise OR). Bottom is the empty set, top is the full set.

All four satisfy BoundedLattice. All four satisfy the same laws. The concept constrains the interface; the laws constrain the semantics.

The Algorithm: Tarski’s Fixed-Point Theorem

Here is the payoff. Tarski’s theorem: any monotone function on a complete lattice has a least fixed point, computable by iterating from bottom.

Semirings: One Algorithm, Six Graph Problems

March 13, 2026

The peasant post showed that power() works on any monoid. But what happens when you have two operations instead of one?

From Monoids to Semirings

A monoid is one operation with an identity element. The peasant algorithm exploits this: give it any monoid and it computes powers by repeated squaring. The accumulator post used the same structure for streaming statistics.

A semiring is two monoids on the same set, linked by a compatibility condition. Formally, a semiring $(S, +, \times, 0, 1)$ satisfies:

$(S, +, 0)$ is a commutative monoid
$(S, \times, 1)$ is a monoid
$\times$ distributes over $+$: $a \times (b + c) = a \times b + a \times c$
$0$ annihilates: $0 \times a = a \times 0 = 0$

In C++20, using the same ADL free functions as the peasant post:

template<typename S>
concept Semiring = std::semiregular<S> &&
    requires(S a, S b) {
        { a + b } -> std::convertible_to<S>;
        { a * b } -> std::convertible_to<S>;
        { zero(a) } -> std::convertible_to<S>;
        { one(a) } -> std::convertible_to<S>;
    };

The concept captures syntax. The axioms (associativity, distributivity, annihilation) are semantic requirements that the programmer must ensure.

Five Semirings

The ordinary integers are a semiring. But there are others, and each one corresponds to a different graph problem.

Semiring	$+$	$\times$	$0$	$1$	Graph problem
Boolean	or	and	false	true	Reachability
Tropical min	min	plus	$\infty$	0	Shortest paths
Tropical max	max	plus	$-\infty$	0	Longest paths
Bottleneck	max	min	$-\infty$	$\infty$	Widest paths
Counting	plus	times	0	1	Number of paths

The naming of the tropical semirings is counterintuitive but standard. In the tropical min semiring, the “addition” is min and the “multiplication” is ordinary addition. This matters because matrix multiplication uses both operations, and we need the algebraic structure to be correct: the inner product of a row and column computes the best path through an intermediate node.

Each semiring is a small struct with operator+, operator*, and ADL functions zero() and one():

struct boolean_semiring {
    bool val;
    constexpr boolean_semiring operator+(boolean_semiring rhs) const {
        return boolean_semiring(val || rhs.val);
    }
    constexpr boolean_semiring operator*(boolean_semiring rhs) const {
        return boolean_semiring(val && rhs.val);
    }
};
constexpr boolean_semiring zero(boolean_semiring) { return boolean_semiring(false); }
constexpr boolean_semiring one(boolean_semiring)  { return boolean_semiring(true); }

Matrices Over a Semiring

Matrix multiplication requires addition and multiplication. That is exactly what a semiring provides. If $S$ is a semiring, then $n \times n$ matrices over $S$ form a semiring too, with entry-wise addition and the usual row-times-column product (using $S$’s operations).

Streaming Statistics, One Monoid at a Time

March 13, 2026

Accumulators are monoids. The same algebraic structure from the peasant post, in a different domain.

Accumulators as Monoids

An accumulator processes a stream of values, maintaining state that can be queried at any point. Write a class with operator+= for each statistic you need. Sum, mean, variance, min, max. Five statistics, five classes.

The problem is combinations. Sum and min? Write a sixth class. Sum, min, and max? A seventh. Every new combination requires new code.

But every accumulator has the same structure:

Process a value incrementally: operator+=(value)
Combine with another accumulator of the same type: operator+=(accumulator)
Extract a result: .eval()

Default construction gives you an empty accumulator: the identity element. Combination via += is associative. Together, a monoid. The peasant post used the same structure for exponentiation. Here we use it for streaming computation.

In C++20 concepts:

template<typename A>
concept Accumulator = std::semiregular<A> &&
    requires(A a, A b, typename A::value_type v) {
        typename A::value_type;
        { a += v } -> std::same_as<A&>;   // process one value
        { a += b } -> std::same_as<A&>;   // combine two accumulators
        { a.eval() };                       // extract result
    };

KBN: Compensated Summation

The simplest accumulator is a sum. But naive floating-point summation accumulates O(n) rounding error:

double sum = 0.0;
sum += 1.0;
for (int i = 0; i < 1'000'000; ++i)
    sum += 1e-10;
// Expected: 1.0001    Actual: ~1.00009999999...8

When you add a tiny number to a large one, the tiny number’s low-order bits get dropped. After a million additions, these losses add up.

Kahan-Babuska-Neumaier (KBN) summation tracks what gets lost:

template<std::floating_point T>
class kbn_sum {
    T sum_ = T(0);
    T comp_ = T(0);   // compensation for lost bits

public:
    using value_type = T;

    constexpr kbn_sum& operator+=(const T& v) {
        T t = sum_ + v;
        comp_ += abs_(sum_) >= abs_(v) ? (sum_ - t) + v
                                       : (v - t) + sum_;
        sum_ = t;
        return *this;
    }

    constexpr T eval() const { return sum_ + comp_; }
};

The correction term comp_ recovers the bits that floating-point addition drops. O(1) error instead of O(n), regardless of sequence length.

kbn_sum is a monoid:

Identity: kbn_sum{} (sum=0, compensation=0)
Operation: a += b (combine two compensated sums)

Welford: Online Mean and Variance

Computing the mean is just sum/count. Variance is harder. The textbook formula $\sigma^2 = \frac{1}{n}\sum(x_i - \bar{x})^2$ requires two passes: one for the mean, one for the deviations.

Welford’s algorithm computes both in a single pass:

welford& operator+=(const T& v) {
    ++n_;
    T delta = v - mean_;
    mean_ += delta / static_cast<T>(n_);
    T delta2 = v - mean_;   // uses *updated* mean
    m2_ += delta * delta2;
    return *this;
}

delta uses the old mean, delta2 uses the new mean. Their product accumulates into m2_, the sum of squared deviations. At any point, variance = m2_ / n.

What makes this a monoid is the combination formula. Given two independent Welford accumulators with means $\bar{x}_A, \bar{x}_B$ and counts $n_A, n_B$, Chan et al. showed how to merge them:

Duality: The Hidden Structure of Opposites

January 19, 2026

Many structures come in pairs. Recognizing duality lets you transfer insights between domains.

The Motivating Example

This collection includes two approaches to automatic differentiation:

Forward mode (in dual): Propagate derivatives alongside values, from inputs toward outputs
Reverse mode (in autodiff): Build a graph during forward evaluation, then propagate gradients backward from outputs toward inputs

These aren’t just two implementations of the same idea. They’re duals, mirror images with complementary strengths.

Forward mode computes one column of the Jacobian per pass. If $f: \mathbb{R}^n \to \mathbb{R}^m$, computing the full Jacobian takes $n$ passes. Reverse mode computes one row per pass, $m$ passes for the full Jacobian.

For neural network training, we have many inputs (millions of parameters) and one output (the loss). Reverse mode wins overwhelmingly: one backward pass gives all gradients. This is why backpropagation dominates deep learning.

For sensitivity analysis with few parameters and many outputs, forward mode wins. Same algorithm structure, opposite traversal direction, complementary use cases.

The mathematical explanation: forward mode computes Jacobian-vector products ($Jv$); reverse mode computes vector-Jacobian products ($v^T J$). These are transposes of each other. Duality is transposition.

Push vs Pull

Consider two ways to traverse a sequence:

Pull (iterator/consumer controls):

for (auto it = seq.begin(); it != seq.end(); ++it) {
    process(*it);  // Consumer pulls each element
}

Push (producer controls):

seq.for_each([](auto x) {
    process(x);  // Producer pushes each element
});

Same traversal. Same elements processed. But control flow is reversed:

Aspect	Pull (Iterator)	Push (Generator)
Who controls pace?	Consumer	Producer
Suspend/resume?	Consumer decides when to call `++`	Producer decides when to yield
Backpressure	Natural (just stop pulling)	Must be designed in
Composition	Chain iterators	Chain callbacks

C++ ranges are pull-based: view | filter | transform creates an iterator that pulls through the pipeline. Reactive streams (Rx) are push-based: events flow through a pipeline of observers.

These are duals. Given a pull-based algorithm, you can mechanically derive its push-based counterpart by reversing who initiates each step. The transformation preserves correctness because it’s just changing direction, not content.

Encode vs Decode

Compression algorithms come in pairs:

// Encoder: structure -> bits
auto encode(const Document& doc) -> Bitstream;

// Decoder: bits -> structure
auto decode(const Bitstream& bits) -> Document;

These must be inverses: decode(encode(x)) == x. But their implementations are often strikingly different:

Seeing Structure First

January 18, 2026

A reflection on eleven explorations in generic programming

The Question Behind the Code

What do these computations have in common?

Computing the millionth Fibonacci number
Finding the shortest path between cities in a weighted graph
Calculating compound interest over thirty years
Composing ten 3D rotations into one
Repeating a string n times

The answer: they’re all computed by the same twenty lines of code.

template<typename T>
constexpr T power(T const& base, T exp) {
    if (exp == zero(exp)) return one(exp);
    if (exp == one(exp))  return base;

    return even(exp)
        ? square(power(base, half(exp)))
        : product(base, power(base, decrement(exp)));
}

This shouldn’t work. Fibonacci numbers involve integer sequences. Shortest paths involve graphs. Rotations involve 3D geometry. Different domains, different mathematics.

Yet they share structure. Once you see it, a single algorithm serves them all.

This collection of eleven blog posts is an extended meditation on one idea: algorithms arise from algebraic structure. The posts cover different domains (number theory, calculus, linear algebra, polymorphism) but they circle the same insight. Recognize the structure; the algorithm follows.

The Principle

Alex Stepanov articulated this most clearly in Elements of Programming: “Generic programming is about abstracting and classifying algorithms and data structures.” But the deeper point is how to abstract. Not by common syntax or superficial similarity, but by the algebraic laws a type obeys.

Why does structure appear everywhere? Because reality has structure. The algebraic structures we discover in programming (groups, rings, monoids) are the same structures physicists discover in nature. Rotations form a group. Spacetime transformations form a group. This isn’t coincidence. We’re uncovering patterns that exist.

Noether’s theorem makes this precise: every continuous symmetry corresponds to a conservation law. Time-translation symmetry gives conservation of energy. Space-translation symmetry gives conservation of momentum. Rotational symmetry gives conservation of angular momentum. The symmetry groups of physics are algebraic structures.

When we recognize “this is a monoid” in our code, we’re tapping into the same mathematical substrate that governs physical law. The algorithms follow because the structure constrains what’s possible, both in computation and in nature.

Consider the power() function above. What does it require?

An associative binary operation (so we can regroup: $(a \cdot b) \cdot c = a \cdot (b \cdot c)$)
An identity element (so $1 \cdot x = x \cdot 1 = x$)
Halving and parity testing on the exponent

That’s it. Any type providing these operations, with these laws, can use this algorithm. The requirements are algebraic, not syntactic.

Choosing the Algebra

November 30, 2025

The rest of this series asks: given a structure, what algorithms does it support? This post inverts the question.

The Flip Side

The peasant post showed that power() works on any monoid. The semirings post showed that matrix multiplication over different semirings solves six graph problems. The thread running through the whole series is: algorithms arise from algebraic structure.

But there’s a flip side that we haven’t addressed directly.

Sometimes you’re stuck with an expensive algorithm not because the problem is hard, but because you’re working in the wrong algebra. Change the algebra, and the algorithm becomes trivial. The cost shows up somewhere else, always. But if the cheap operation is the one you actually need, you win.

This is a very old idea. Napier invented logarithms in 1614 to turn multiplication into addition. What’s worth noticing is that logarithms, odds ratios, tropical semirings, and quaternions are all doing the same thing.

The Pattern

A computational basis transform takes values from one domain and represents them in another, where different operations are cheap:

Domain	Cheap	Expensive
Log space	Multiplication (becomes addition)	Addition
Odds ratios	Bayesian updates (become multiplication)	Probability sums
Tropical $(\min, +)$	Shortest paths (become matrix mult)	Subtraction
Quaternions	Rotation composition	Euler angle extraction
Modular integers	Exponentiation	Ordering
Rationals	Exact arithmetic	Irrational representation

Each row follows the same structure. A transform $\varphi: D \to D’$ makes some operations cheaper and others more expensive. There is no free lunch.

This is not a deep theorem. It’s almost tautological: if you could make everything cheaper by relabeling, the labels would already be the standard ones. But making the pattern explicit helps you recognize when you’re paying for an operation you don’t need.

Three Examples

Log Space

The most familiar example. You have a million small probabilities and need their product.

// Standard: underflows to 0 after ~30 terms
double product = 1.0;
for (double p : probs) product *= p;  // 0.0

// Log domain: addition instead of multiplication
mutatio::lgd product(1.0);
for (double p : probs) product = product * mutatio::lgd(p);
// product.log() is finite. product.value() would overflow,
// but you stay in log space.

The algebra changed from $(\mathbb{R}^+, \times)$ to $(\mathbb{R}, +)$. Multiplication became addition. The isomorphism is $\log$.

Tropical Semirings

The semirings post showed this already. Replace $(+, \times)$ with $(\min, +)$ and matrix multiplication becomes shortest-path computation. That’s the same move: you changed the semiring to make the algorithm you wanted (all-pairs shortest paths) fall out of a generic operation (matrix power) that you already had.

Differentiation: Three Ways

January 15, 2025

A synthesis of three earlier posts, comparing forward-mode AD, reverse-mode AD, and numerical differentiation.

Computing derivatives shows up everywhere: optimization, machine learning, physics simulation, numerical analysis. This series has explored three distinct approaches:

Forward-mode AD via dual numbers
Reverse-mode AD via computational graphs
Numerical differentiation via finite differences

Each has different strengths. The right choice depends on the shape of your problem.

The Landscape

Method	Accuracy	Cost for $f: \mathbb{R}^n \to \mathbb{R}$	Cost for $f: \mathbb{R} \to \mathbb{R}^m$	Memory
Forward AD	Exact	$O(n)$ passes	$O(1)$ pass	$O(1)$
Reverse AD	Exact	$O(1)$ pass	$O(m)$ passes	$O(\text{ops})$
Finite Diff	$O(h^p)$	$O(n)$ evaluations	$O(n)$ evaluations	$O(1)$

The key point: problem structure determines the best method.

Forward-Mode AD: Dual Numbers

Forward-mode AD extends numbers with an infinitesimal $\varepsilon$ where $\varepsilon^2 = 0$. The derivative falls out of the arithmetic for free:

// f(x) = x^3 - 3x + 1
// f'(x) = 3x^2 - 3

auto x = dual<double>::variable(2.0);  // x = 2, dx = 1
auto f = x*x*x - 3.0*x + 1.0;

std::cout << f.value() << "\n";       // 3.0
std::cout << f.derivative() << "\n";  // 9.0

Strengths:

Simple implementation (operator overloading)
No memory overhead
Naturally composable for higher derivatives
Works with any function of overloaded operators

When to use:

Single input variable (or few inputs)
Computing Jacobian-vector products
Higher-order derivatives via nesting
Sensitivity analysis along one direction

Complexity: One forward pass per input variable. For f: R^n -> R^m, computing the full Jacobian requires n passes.

Reverse-Mode AD: Computational Graphs

Reverse-mode AD builds a computational graph during the forward pass, then propagates gradients backward via the chain rule:

auto f = [](const auto& x) {
    return sum(pow(x, 2.0));  // f(x) = sum(x^2)
};

auto df = grad(f);  // Returns gradient function
auto gradient = df(x);  // One backward pass for all partials

Strengths:

O(1) backward passes regardless of input dimension
Powers modern deep learning (backpropagation)
Efficient for loss functions: f: R^n -> R

When to use:

Many inputs, scalar output (neural networks)
Computing vector-Jacobian products
Optimization where you need the full gradient

Complexity: One forward pass to build the graph, one backward pass to compute all gradients. Memory scales with the number of operations because you have to store intermediate values.

Numerical Differentiation: Finite Differences

Approximate the derivative using the limit definition:

// Central difference: f'(x) ~ (f(x+h) - f(x-h)) / 2h
double df = central_difference(f, x);

Strengths:

Numerical Integration with Generic Concepts

August 28, 2023

Numerical integration (quadrature) for C++20.

Overview

The definite integral is the signed area under a curve:

$$\int_a^b f(x)\,dx$$

Most functions do not have closed-form antiderivatives, so we approximate integrals numerically using quadrature rules: weighted sums of function evaluations.

$$\int_a^b f(x)\,dx \approx \sum_i w_i f(x_i)$$

Different rules choose different nodes x_i and weights w_i. The tradeoff is always accuracy vs. computational cost.

Quick Start

#include <integration/integrate.hpp>
#include <cmath>
#include <iostream>

int main() {
    using namespace integration;

    // integral from 0 to pi of sin(x) dx = 2
    double result = integrate([](double x) { return std::sin(x); }, 0.0, 3.14159265);

    std::cout << "integral of sin(x) dx = " << result << "\n";  // ~2.0
}

The Quadrature Zoo

Basic Rules

Rule	Formula	Error	Exact for
Midpoint	$(b-a)f(m)$	$O(h^3)$	Linear
Trapezoidal	$\frac{b-a}{2}(f(a)+f(b))$	$O(h^3)$	Linear
Simpson’s	$\frac{b-a}{6}(f(a)+4f(m)+f(b))$	$O(h^5)$	Cubic

double m = midpoint_rule(f, a, b);
double t = trapezoidal_rule(f, a, b);
double s = simpsons_rule(f, a, b);

Composite Rules

Divide [a,b] into n subintervals and apply the basic rule to each:

// Error: O(h^2) where h = (b-a)/n
double m = composite_midpoint(f, a, b, 100);
double t = composite_trapezoidal(f, a, b, 100);

// Error: O(h^4) - much more accurate!
double s = composite_simpsons(f, a, b, 100);  // n must be even

Gauss-Legendre Quadrature

The optimal choice: n points exactly integrate polynomials of degree 2n-1.

// 5-point Gauss-Legendre: exact for degree <= 9
double g = gauss_legendre<5>(f, a, b);

Adaptive Integration

Automatically refines where the function is difficult:

// Recommended for general use
double result = integrate(f, a, b);              // Default tolerance 1e-10
double result = integrate(f, a, b, 1e-12);       // Custom tolerance

// With error estimate
auto [value, error, evals] = integrate_with_error(f, a, b);

Deriving Simpson’s Rule

Taylor expand $f(x)$ around the midpoint $m = (a+b)/2$. Odd powers vanish by symmetry when integrating from $a$ to $b$:

$$\int_a^b f(x)\,dx \approx (b-a)f(m) + f''(m)\frac{(b-a)^3}{24} + O(h^5)$$

Simpson’s rule is the unique combination of endpoint and midpoint values that cancels the $h^2$ error:

$$\int_a^b f(x)\,dx = \frac{b-a}{6}\left[f(a) + 4f(m) + f(b)\right] + O(h^5)$$

It also cancels the h^3 term. Simpson gets a “bonus degree” of accuracy for free. This is one of those happy accidents in numerical analysis.

Why Gauss-Legendre is Optimal

With $n$ evaluation points, we have $2n$ free parameters ($n$ nodes + $n$ weights). We can match $2n$ conditions: exact integration of $1, x, x^2, \ldots, x^{2n-1}$.

The nodes turn out to be roots of the $n$-th Legendre polynomial $P_n(x)$. Orthogonal polynomials arise naturally from the optimization. This is not a coincidence. It is the same reason Legendre polynomials show up in approximation theory: they are the optimal basis for polynomial approximation on [-1,1] with the right inner product.

Forward-Mode Automatic Differentiation

September 20, 2021

Forward-mode automatic differentiation via dual numbers for C++20.

Overview

Dual numbers are a simple yet powerful technique for computing exact derivatives. The key insight: if we extend our number system with an element epsilon where epsilon^2 = 0, then evaluating f(x + epsilon) yields f(x) + epsilon * f'(x). The derivative emerges automatically from the algebra.

Quick Start

#include <dual/dual.hpp>
#include <iostream>

int main() {
    using namespace dual;

    // Create a dual variable at x = 2
    auto x = dual<double>::variable(2.0);

    // Compute f(x) = x^3 - 3x + 1
    auto f = x*x*x - 3.0*x + 1.0;

    std::cout << "f(2) = " << f.value() << "\n";       // 3.0
    std::cout << "f'(2) = " << f.derivative() << "\n"; // 9.0
}

The Mathematics

A dual number has the form $a + b\varepsilon$ where $\varepsilon^2 = 0$. Arithmetic follows naturally:

$$(a + b\varepsilon) + (c + d\varepsilon) = (a+c) + (b+d)\varepsilon$$$$(a + b\varepsilon)(c + d\varepsilon) = ac + (ad + bc)\varepsilon + bd\varepsilon^2 = ac + (ad + bc)\varepsilon$$

Notice how the $bd\varepsilon^2$ term vanishes because $\varepsilon^2 = 0$.

For a function $f$, Taylor expansion gives:

$$f(a + b\varepsilon) = f(a) + bf'(a)\varepsilon + \frac{b^2}{2}f''(a)\varepsilon^2 + \cdots = f(a) + bf'(a)\varepsilon$$

If we set $b = 1$ (marking $x$ as “the variable we’re differentiating with respect to”), then:

$$f(x + \varepsilon) = f(x) + f'(x)\varepsilon$$

The derivative appears as the coefficient of epsilon!

API Reference

dual

The core dual number type.

// Create a variable for differentiation
auto x = dual<double>::variable(3.0);  // x = 3, dx = 1

// Create a constant
auto c = dual<double>::constant(2.0);  // c = 2, dc = 0

// Access values
double val = x.value();       // 3.0
double deriv = x.derivative(); // 1.0

// Arithmetic operators: +, -, *, /
auto y = sin(x*x) + exp(-x);

// Convenience function
auto [value, deriv] = differentiate([](auto x) { return x*x; }, 3.0);

Mathematical Functions

All standard math functions are supported with correct derivative propagation:

Basic: sqrt, cbrt, abs
Exponential: exp, exp2, expm1, log, log2, log10, log1p
Trigonometric: sin, cos, tan, asin, acos, atan, atan2
Hyperbolic: sinh, cosh, tanh, asinh, acosh, atanh
Power: pow, hypot
Special: erf, erfc

Higher-Order Derivatives

Second derivatives with dual2:

auto result = differentiate2([](auto x) { return sin(x); }, 1.0);
// result.value  = sin(1)
// result.first  = cos(1)
// result.second = -sin(1)

Arbitrary order with jets:

// Compute f, f', f'', f''', f'''' at x = 1
auto derivs = derivatives<4>([](auto x) { return exp(x); }, 1.0);
// All derivatives of e^x at x=1 equal e

Forward vs Reverse Mode

This library implements forward mode AD:

Polynomials as Euclidean Domains

July 14, 2020

The same GCD algorithm works for integers and polynomials. That’s not a coincidence. It’s because both are Euclidean domains.

The Observation

// For integers: gcd(48, 18) = 6
// For polynomials: gcd(x^3 - 1, x^2 - 1) = x - 1

// Same algorithm, different types
template<euclidean_domain E>
E gcd(E a, E b) {
    while (b != E(0)) {
        a = std::exchange(b, a % b);
    }
    return a;
}

That template compiles and works correctly for both integers and polynomials. The reason it works is algebraic: both types support division with remainder, and the remainder is always “smaller” than the divisor in a well-defined sense.

Quick Start

#include <polynomials/polynomial.hpp>
#include <iostream>

using namespace poly;

int main() {
    // Create polynomial x^2 - 1 = (x-1)(x+1)
    auto p = polynomial<double>{-1, 0, 1};

    // Create polynomial x^3 - 1 = (x-1)(x^2+x+1)
    auto q = polynomial<double>{-1, 0, 0, 1};

    // GCD should be (x - 1)
    auto g = gcd(p, q);

    std::cout << "gcd(x^2-1, x^3-1) has degree " << g.degree() << "\n";  // 1

    // Find roots of x^2 - 1
    auto roots = find_roots(p, -10.0, 10.0);
    for (double r : roots) {
        std::cout << "Root: " << r << "\n";  // -1 and 1
    }
}

API Reference

Creating Polynomials

// From dense coefficients (a[i] = coefficient of x^i)
polynomial<double> p{1, -2, 1};  // 1 - 2x + x^2

// Monomial: coefficient * x^degree
auto m = polynomial<double>::monomial(3.0, 4);  // 3x^4

// The variable x
auto x = polynomial<double>::x();  // x

// Constant
polynomial<double> c{5.0};  // 5

Arithmetic

auto sum = p + q;
auto diff = p - q;
auto prod = p * q;
auto [quot, rem] = divmod(p, q);  // Division with remainder
auto quot_only = p / q;
auto rem_only = p % q;

auto g = gcd(p, q);                    // Greatest common divisor
auto [g, s, t] = extended_gcd(p, q);   // Bezout: g = p*s + q*t
auto l = lcm(p, q);                    // Least common multiple
bool d = divides(p, q);                // Does p divide q?

Evaluation and Calculus

double val = evaluate(p, x);           // p(x)
auto dp = derivative(p);               // p'(x)
auto integral = antiderivative(p);     // integral of p
auto roots = find_roots(p, -10, 10);   // All real roots in interval
auto crit = stationary_points(p, -10, 10);  // Where p'(x) = 0

The Euclidean Domain Structure

What makes this work is shared algebraic structure. A Euclidean domain has a norm function and a division algorithm where the remainder is always smaller than the divisor:

Property	Integers	Polynomials
Norm	abs(n)	degree(p)
Division	a = b*q + r, abs(r) < abs(b)	a = b*q + r, deg(r) < deg(b)
GCD	gcd(48, 18) = 6	gcd(x^2-1, x-1) = x-1

The GCD algorithm doesn’t care which type it’s operating on. It only needs the division-with-remainder property. Stepanov’s whole point is exactly this: algorithms arise from algebraic structure. When you recognize that polynomials and integers share the same abstract structure, you immediately get:

Exact Rational Arithmetic

February 18, 2020

Floating-point lies to you.

double x = 0.1 + 0.2;
std::cout << (x == 0.3);  // Prints 0 (false!)

The number 0.1 has no exact binary representation, for the same reason 1/3 has no exact decimal representation. Floating-point represents numbers as m x 2^e, and most decimal fractions don’t land on a power of two.

Rational arithmetic fixes this. 1/3 stays exactly 1/3.

The Representation

A rational number is a pair (numerator, denominator) kept in lowest terms:

template<std::integral T>
class rat {
    T num_;  // numerator (carries sign)
    T den_;  // denominator (always positive)

    void reduce() {
        T g = std::gcd(abs(num_), den_);
        num_ /= g;
        den_ /= g;
    }
};

Three invariants, always maintained:

The denominator is positive (sign lives in the numerator)
GCD(|num|, den) = 1 (always reduced)
Zero is uniquely 0/1

Arithmetic

Addition needs a common denominator:

$$\frac{a}{b} + \frac{c}{d} = \frac{ad + bc}{bd}$$

Then reduce. In code:

rat operator+(rat const& rhs) const {
    return rat(num_ * rhs.den_ + rhs.num_ * den_,
               den_ * rhs.den_);
}

The constructor calls reduce() automatically.

Multiplication is simpler:

$$\frac{a}{b} \times \frac{c}{d} = \frac{ac}{bd}$$

Division multiplies by the reciprocal:

$$\frac{a/b}{c/d} = \frac{ad}{bc}$$

Exact Comparison

No floating-point fuzziness. Two reduced rationals are equal iff their numerators and denominators match:

bool operator==(rat const& rhs) const {
    return num_ == rhs.num_ && den_ == rhs.den_;
}

For ordering, cross-multiply (valid because denominators are positive):

$$\frac{a}{b} < \frac{c}{d} \iff ad < cb$$

The Mediant

The mediant of a/b and c/d is (a+c)/(b+d). It’s not the average. It has different, more interesting properties:

rat mediant(rat const& a, rat const& b) {
    return rat(a.numerator() + b.numerator(),
               a.denominator() + b.denominator());
}

If a/b < c/d, then a/b < mediant < c/d. The mediant is always in lowest terms when a/b and c/d are neighbors in the Stern-Brocot tree. And mediants generate all positive rationals exactly once.

The Stern-Brocot Tree

Start with 0/1 and 1/0 (representing infinity). Repeatedly take mediants:

Level 0:     0/1                     1/0
Level 1:     0/1       1/1           1/0
Level 2:     0/1   1/2   1/1   2/1   1/0
Level 3:  0/1 1/3 1/2 2/3 1/1 3/2 2/1 3/1 1/0

Every positive rational appears exactly once. The path from root to any node encodes its continued fraction. This connects to best rational approximations and Farey sequences.

GCD Ties Everything Together

Reducing fractions requires GCD. The algorithm is Euclid’s, from around 300 BCE:

T gcd(T a, T b) {
    while (b != 0) {
        a = a % b;
        std::swap(a, b);
    }
    return a;
}

The same algorithm works for any Euclidean domain. That’s not a coincidence. It’s a consequence of the algebraic structure.

Rational numbers form a field: every non-zero element has a multiplicative inverse (the reciprocal). The requirement that denominators be non-zero and fractions reduced comes from this algebraic structure, not from arbitrary convention.

How Iterators Give You N+M Instead of NxM

November 15, 2019

The problem is combinatorial. You have N algorithms (sort, search, find, copy) and M containers (array, list, tree, hash table). The naive approach: implement each algorithm for each container. That is NxM implementations.

The insight is to interpose an abstraction layer.

The Iterator Abstraction

Instead of algorithms knowing about containers directly, we define iterator categories, capabilities that algorithms require and containers provide:

Input: Single-pass read. You can advance (++) and dereference (*), but once you move forward, you cannot go back. Stream-like.

Forward: Multi-pass. You can iterate multiple times; begin() always gives the same starting point.

Bidirectional: Can go backward (--). Enables algorithms like reverse iteration.

Random-access: Can jump anywhere (+n, []). Enables binary search, sorting.

This is a hierarchy of requirements. Each level adds capabilities and enables more algorithms. An algorithm declares the weakest category it needs, and any container providing at least that category works.

A True Input Iterator

The input iterator category exists for a reason. Here is a working example that reads entropy from /dev/urandom:

#include <fstream>
#include <iterator>
#include <cstdint>

struct entropy_iterator {
    using iterator_category = std::input_iterator_tag;
    using value_type        = uint8_t;
    using difference_type   = std::ptrdiff_t;
    using pointer           = const uint8_t*;
    using reference         = uint8_t;  // returns by value, not reference

    std::ifstream* source = nullptr;
    uint8_t byte = 0;

    entropy_iterator() = default;  // sentinel (end iterator)

    explicit entropy_iterator(std::ifstream& s) : source(&s) {
        ++(*this);  // prime the first byte
    }

    uint8_t operator*() const { return byte; }

    entropy_iterator& operator++() {
        if (source && source->good()) {
            source->read(reinterpret_cast<char*>(&byte), 1);
            if (!source->good()) source = nullptr;
        }
        return *this;
    }

    entropy_iterator operator++(int) {
        auto tmp = *this;
        ++(*this);
        return tmp;
    }

    bool operator==(const entropy_iterator& other) const {
        return source == other.source;
    }
};

Use it like any input iterator:

int main() {
    std::ifstream urandom("/dev/urandom", std::ios::binary);
    entropy_iterator it(urandom);

    // generate 16 random bytes
    std::vector<uint8_t> key(16);
    std::copy_n(it, 16, key.begin());

    // or use in algorithms
    int sum = 0;
    for (int i = 0; i < 1000; ++i, ++it) {
        sum += *it;
    }
    // sum ≈ 127500 (mean of uniform [0,255] × 1000)
}

Each ++ consumes a fresh entropy byte from the kernel. You literally cannot iterate twice over the same sequence. This is why the input iterator category exists: some sources are inherently single-pass. Claiming forward iterator capabilities would be a lie.

The same pattern applies to network streams, sensor readings, and any source where data is consumed by reading it.

The Payoff

Now binary_search does not need to know about vectors, deques, or sorted arrays. It only needs random-access iterators. The algorithm expresses its requirements; the container provides capabilities. They compose through the iterator abstraction.

Is It Prime?

September 10, 2019

The Miller-Rabin primality test and the mathematics of certainty

The Problem

Given a large number n, is it prime? Trial division up to sqrt(n) is too slow for cryptographic-sized numbers. We need something faster, and we are willing to accept “probably prime” with quantifiable certainty.

Fermat’s Little Theorem

For prime $p$ and any $a$ not divisible by $p$:

$$a^{p-1} \equiv 1 \pmod{p}$$

This suggests a test: pick random a, compute a^(n-1) mod n. If the result is not 1, n is definitely composite. But if it is 1, n might be prime, or might be a Carmichael number that fools this test.

The Miller-Rabin Improvement

Miller and Rabin observed something stronger. For odd prime $p$, write $p-1 = 2^r \cdot d$ (factor out all 2s). Then the sequence:

$$a^d, a^{2d}, a^{4d}, \ldots, a^{2^r \cdot d} = a^{p-1}$$

must either:

Start with 1, or
Contain -1 (i.e., p-1) somewhere before reaching 1

Why? Because the only square roots of 1 mod p are plus or minus 1. If we ever see 1 without first seeing -1, we have found a non-trivial square root of 1, proving n is composite.

The Witness Test

bool witness_test(int64_t n, int64_t a) {
    // Write n-1 = 2^r x d
    int64_t d = n - 1;
    int r = 0;
    while ((d & 1) == 0) { d >>= 1; r++; }

    // Compute x = a^d mod n
    int64_t x = mod_pow(a, d, n);

    if (x == 1 || x == n - 1) return true;  // Probably prime

    // Square r-1 times, looking for n-1
    for (int i = 1; i < r; i++) {
        x = (x * x) % n;
        if (x == n - 1) return true;
    }

    return false;  // Definitely composite
}

If witness_test(n, a) returns false, n is definitely composite. The value a is a “witness” to compositeness.

Error Bounds

Here is the part I find most satisfying. For any composite n, at least 3/4 of all possible witnesses a in [2, n-2] will detect it. Each random witness has at most 1/4 chance of failing to detect a composite.

With $k$ independent witnesses:

$$P(\text{false positive}) \leq \left(\frac{1}{4}\right)^k$$

Witnesses	Error bound
10	$< 10^{-6}$
20	$< 10^{-12}$
40	$< 10^{-24}$

The error drops exponentially. 40 witnesses gives you a false positive probability smaller than the chance of a cosmic ray flipping a bit in your RAM during the computation.

Parameterizing by Error

Rather than asking “how many iterations?”, ask “what error rate is acceptable?”:

Modular Arithmetic as Rings

June 22, 2019

Finite algebraic structures and what they teach us about algorithms

The Stepanov Perspective

Stepanov’s central insight: algorithms arise from algebraic structure. The same algorithm that works on integers works on matrices, polynomials, and modular integers, not by accident, but because they share algebraic properties.

Integers modulo N form a ring: a set with addition and multiplication satisfying familiar laws. When N is prime, it is a field, meaning every non-zero element has a multiplicative inverse. Understanding these structures tells you which algorithms apply where.

The Ring Z/NZ

Integers modulo N, written Z/NZ, are equivalence classes:

0 = {…, -N, 0, N, 2N, …}
1 = {…, -N+1, 1, N+1, 2N+1, …}
…

Operations are inherited from integers:

[a] + [b] = [a + b]
[a] x [b] = [a x b]

The implementation keeps one representative per class, in [0, N):

template<int64_t N>
struct mod_int {
    int64_t v;  // Always in [0, N)

    static constexpr int64_t normalize(int64_t x) {
        x %= N;
        return x < 0 ? x + N : x;
    }

    constexpr mod_int(int64_t x) : v(normalize(x)) {}
};

Ring Axioms

A ring (R, +, x) satisfies:

Addition forms an abelian group:

Associative: (a + b) + c = a + (b + c)
Identity: a + 0 = a
Inverses: a + (-a) = 0
Commutative: a + b = b + a

Multiplication is a monoid:

Associative: (a x b) x c = a x (b x c)
Identity: a x 1 = 1 x a = a

Distributive:

a x (b + c) = a x b + a x c
(b + c) x a = b x a + c x a

These axioms enable algorithms. Power-by-squaring works because multiplication is associative. The extended GCD works because of the ring structure.

Fermat’s Little Theorem

When N is prime, something special happens: every non-zero element has a multiplicative inverse. The set of non-zero elements forms a multiplicative group of order N-1.

Fermat’s Little Theorem: for prime $p$ and $a$ not congruent to $0 \pmod{p}$:

$$a^{p-1} \equiv 1 \pmod{p}$$

This gives us the inverse:

$$a \cdot a^{p-2} = a^{p-1} \equiv 1 \pmod{p}$$

So $a^{p-2}$ is the multiplicative inverse of $a$:

constexpr mod_int inverse() const {
    return pow(N - 2);  // Using repeated squaring
}

The Connection to Peasant

The power function uses the same peasant algorithm from the previous post:

One Algorithm, Infinite Powers

March 15, 2019

How the Russian peasant algorithm reveals the universal structure of exponentiation

The Algorithm

Russian peasants had a clever method for multiplication that does not require memorizing times tables. To compute 23 x 17:

23    17
11    34     (halve, double)
 5    68
 2   136
 1   272

Add the right column wherever the left is odd: 17 + 34 + 68 + 272 = 391. That is 23 x 17.

Why does this work? Because we are really computing:

23 x 17 = (16 + 4 + 2 + 1) x 17 = 16x17 + 4x17 + 2x17 + 17

The algorithm only needs three operations on the multiplier:

half(n), integer division by 2
even(n), test if divisible by 2
Addition on the result

From Multiplication to Exponentiation

Here is the insight that makes this interesting: the same algorithm computes powers.

Replace “add to accumulator” with “multiply into accumulator” and “double the multiplicand” with “square the base”:

T power(T base, int exp) {
    T result = 1;
    while (exp > 0) {
        if (!even(exp)) result = result * base;
        base = base * base;
        exp = half(exp);
    }
    return result;
}

This is O(log n) multiplications instead of O(n). Computing 2^1000 takes about 10 multiplications, not 1000.

The Monoid Connection

The peasant algorithm works whenever you have:

An associative binary operation *
An identity element 1 where 1 * x = x * 1 = x

This structure is called a monoid. The algorithm computes x * x * ... * x (n times) using O(log n) operations.

What makes this powerful is that many things form monoids:

Type	Operation	Identity	Computing x^n gives you…
Integers	x	1	Powers
Matrices	x	I	Matrix powers
Strings	concat	""	String repetition
Functions	compose	id	Function iteration
Permutations	compose	id	Permutation powers
Quaternions	x	1	Rotation composition

Why Associativity Unlocks Efficiency

Why does the peasant algorithm achieve O(log n) instead of O(n)? The answer lies in a single algebraic law: associativity.

Associativity says $(a \cdot b) \cdot c = a \cdot (b \cdot c)$. This looks innocuous, but it means we can restructure computation without changing results. Consider computing $a^8$:

Naive:     a x a x a x a x a x a x a x a     (7 multiplications)
Peasant:   ((a^2)^2)^2                        (3 multiplications)

Both produce the same answer because we can freely regroup. The peasant algorithm exploits this freedom systematically: instead of accumulating one factor at a time, it squares intermediate results and combines them.

Method	Accuracy	Cost for \(f: \mathbb{R}^n \to \mathbb{R}\)	Cost for \(f: \mathbb{R} \to \mathbb{R}^m\)	Memory
Forward AD	Exact	\(O(n)\) passes	\(O(1)\) pass	\(O(1)\)
Reverse AD	Exact	\(O(1)\) pass	\(O(m)\) passes	\(O(\text{ops})\)
Finite Diff	\(O(h^p)\)	\(O(n)\) evaluations	\(O(n)\) evaluations	\(O(1)\)

Rule	Formula	Error	Exact for
Midpoint	\((b-a)f(m)\)	\(O(h^3)\)	Linear
Trapezoidal	\(\frac{b-a}{2}(f(a)+f(b))\)	\(O(h^3)\)	Linear
Simpson’s	\(\frac{b-a}{6}(f(a)+4f(m)+f(b))\)	\(O(h^5)\)	Cubic

Witnesses	Error bound
10	\(< 10^{-6}\)
20	\(< 10^{-12}\)
40	\(< 10^{-24}\)

Codecs as Functors

Bit I/O: The Foundation

Prefix-Free Codes and the Free Monoid

The Free Monoid, Recalled

Why Lists?

The Free Monoid

The Universal Property

Fold Is a Theorem

Structures and Maps

The Concept

Examples Everywhere

Two Operations, Different Rules

Four Examples

The Algorithm: Tarski’s Fixed-Point Theorem

From Monoids to Semirings

Five Semirings

Matrices Over a Semiring

Accumulators as Monoids

KBN: Compensated Summation

Welford: Online Mean and Variance

The Motivating Example

Push vs Pull

Encode vs Decode

The Question Behind the Code

The Principle

The Flip Side

The Pattern

Three Examples

Log Space

Tropical Semirings

The Landscape

Forward-Mode AD: Dual Numbers

Reverse-Mode AD: Computational Graphs

Numerical Differentiation: Finite Differences

Overview

Quick Start

The Quadrature Zoo

Basic Rules

Composite Rules

Gauss-Legendre Quadrature

Adaptive Integration

Deriving Simpson’s Rule

Why Gauss-Legendre is Optimal

Overview

Quick Start

The Mathematics

API Reference

dual

Mathematical Functions

Higher-Order Derivatives

Forward vs Reverse Mode

The Observation

Quick Start

API Reference

Creating Polynomials

Arithmetic

GCD and Related

Evaluation and Calculus

The Euclidean Domain Structure

The Representation

Arithmetic

Exact Comparison

The Mediant

The Stern-Brocot Tree

GCD Ties Everything Together

The Iterator Abstraction

A True Input Iterator

The Payoff

The Problem

Fermat’s Little Theorem

The Miller-Rabin Improvement

The Witness Test

Error Bounds

Parameterizing by Error

The Stepanov Perspective

The Ring Z/NZ

Ring Axioms

Fermat’s Little Theorem

The Connection to Peasant

The Algorithm