Regex Under the Hood

September 12, 2022 3 min read Updated: March 16, 2026

You type grep 'a*b+' file.txt and results appear. But what actually happened in that fraction of a second? Your pattern compiled into a state machine — a tiny computer with no memory except “where am I right now?”

This post builds that machine from scratch and lets you watch it think.

The Pipeline

Every regex engine follows the same path:

Regex → NFA → DFA → Match

Parse the regex and build a Nondeterministic Finite Automaton (NFA) using Thompson’s construction. The NFA can be in multiple states at once and can take free epsilon (ε) transitions without consuming input.
Convert the NFA to a Deterministic Finite Automaton (DFA) using subset construction. Each DFA state represents a set of NFA states. One state at a time, one transition per symbol.
Simulate the DFA on input — one character, one lookup, O(n). No backtracking.

Thompson’s Construction

Thompson’s construction builds NFAs by snapping together small templates. Each template has exactly one start and one accepting state:

Single character a:

──▶(q0)──a──▶((q1))

Concatenation ab — chain the fragments, connecting the first’s accept to the second’s start via ε:

──▶(q0)─a─▶(q1)─ε─▶(q2)─b─▶((q3))

Union a|b — new start fans out via ε, both accepts converge via ε:

       ┌─ε─▶(q2)─a─▶(q3)─ε─┐
──▶(q0)┤                     ├▶((q5))
       └─ε─▶(q4)─b─▶(q5)─ε─┘

Kleene star a* — ε-bypass for zero occurrences, ε-loop for repetition:

    ┌────────ε────────┐
    │   ┌──ε──┐       │
──▶(q0)─ε─▶(q1)─a─▶(q2)─ε─▶((q3))
    │                       ▲
    └───────────ε───────────┘

These compose mechanically. (a|b)*abb nests a union inside a star, then concatenates three symbols. The result is a graph of states and transitions that exactly encodes the regex’s meaning.

Try It

Type a regex and an input string. Build constructs the NFA, then Step through the input one character at a time. Watch which states light up — the NFA explores all possibilities simultaneously.

Pattern

Input

Switch to the DFA tab to see the determinized version — each DFA state is a set of NFA states, computed by subset construction. The legend below the graph shows the mapping.

Subset Construction

The NFA simulates nondeterminism by tracking all possible states simultaneously. Subset construction makes this explicit: each set of NFA states becomes a single DFA state.

The algorithm is a BFS:

Start with the ε-closure of the NFA’s initial state. That’s your first DFA state.
For each DFA state, for each symbol, compute: move (follow symbol transitions), then ε-closure. The resulting set is the next DFA state.
Repeat until no new DFA states are discovered.
A DFA state is accepting if it contains any NFA accepting state.

The result: a machine that makes exactly one transition per input character. No branching. No backtracking. O(n) matching, guaranteed.

What Regex Can’t Do

Finite states = finite memory = no unbounded counting.

A DFA can’t match “n as followed by n bs” (like aaabbb). It would need to remember how many as it saw, but it has no counter — just a fixed set of states. By the Pumping Lemma, any sufficiently long string accepted by a finite automaton contains a pumpable substring, which breaks the equal-count requirement.

This is why you can’t match balanced parentheses, valid HTML nesting, or palindromes with regex. The machine would need infinitely many states. For that, you need a pushdown automaton (context-free) or a Turing machine.

The limitation is exactly why regex is fast. Finite states means finite work per character. No stack, no tape, no backtracking — just a lookup table.

For the formal treatment: Sipser, “Introduction to the Theory of Computation,” Chapter 1.

Every expression is a tree. (2 + 3) * x looks like this:

    *            In Python: ('*', ('+', 2, 3), …

The Pipeline

Thompson’s Construction

Try It

Subset Construction

What Regex Can’t Do

Related Posts

Terminal 101: Learn Linux in Your Browser

Fuzzy Inference: Teaching Machines to Think in Shades of Grey

Facts and Degrees

How to Teach a Computer Calculus with Pattern Matching

Watch an LLM Think

Dive into Deep Learning

Notes

Discussion