Skip to main content

Regex Under the Hood

You type grep 'a*b+' file.txt and results appear. But what actually happened in that fraction of a second? Your pattern compiled into a state machine — a tiny computer with no memory except “where am I right now?”

This post builds that machine from scratch and lets you watch it think.

The Pipeline

Every regex engine follows the same path:

Regex → NFA → DFA → Match

  1. Parse the regex and build a Nondeterministic Finite Automaton (NFA) using Thompson’s construction. The NFA can be in multiple states at once and can take free epsilon (ε) transitions without consuming input.

  2. Convert the NFA to a Deterministic Finite Automaton (DFA) using subset construction. Each DFA state represents a set of NFA states. One state at a time, one transition per symbol.

  3. Simulate the DFA on input — one character, one lookup, O(n). No backtracking.

Thompson’s Construction

Thompson’s construction builds NFAs by snapping together small templates. Each template has exactly one start and one accepting state:

Single character a:

──▶(q0)──a──▶((q1))

Concatenation ab — chain the fragments, connecting the first’s accept to the second’s start via ε:

──▶(q0)─a─▶(q1)─ε─▶(q2)─b─▶((q3))

Union a|b — new start fans out via ε, both accepts converge via ε:

       ┌─ε─▶(q2)─a─▶(q3)─ε─┐
──▶(q0)┤                     ├▶((q5))
       └─ε─▶(q4)─b─▶(q5)─ε─┘

Kleene star a* — ε-bypass for zero occurrences, ε-loop for repetition:

    ┌────────ε────────┐
    │   ┌──ε──┐       │
──▶(q0)─ε─▶(q1)─a─▶(q2)─ε─▶((q3))
    │                       ▲
    └───────────ε───────────┘

These compose mechanically. (a|b)*abb nests a union inside a star, then concatenates three symbols. The result is a graph of states and transitions that exactly encodes the regex’s meaning.

Try It

Type a regex and an input string. Build constructs the NFA, then Step through the input one character at a time. Watch which states light up — the NFA explores all possibilities simultaneously.

abεεεεεεεεaεbεbεq6q7q4q5q0q1q2q3q8q9q10q11q12q13
a
a
b
b
Start: ε-closure({q6}) = {q0, q2, q4, q6, q7, q8}

Switch to the DFA tab to see the determinized version — each DFA state is a set of NFA states, computed by subset construction. The legend below the graph shows the mapping.

Subset Construction

The NFA simulates nondeterminism by tracking all possible states simultaneously. Subset construction makes this explicit: each set of NFA states becomes a single DFA state.

The algorithm is a BFS:

  1. Start with the ε-closure of the NFA’s initial state. That’s your first DFA state.
  2. For each DFA state, for each symbol, compute: move (follow symbol transitions), then ε-closure. The resulting set is the next DFA state.
  3. Repeat until no new DFA states are discovered.
  4. A DFA state is accepting if it contains any NFA accepting state.

The result: a machine that makes exactly one transition per input character. No branching. No backtracking. O(n) matching, guaranteed.

What Regex Can’t Do

Finite states = finite memory = no unbounded counting.

A DFA can’t match “n as followed by n bs” (like aaabbb). It would need to remember how many as it saw, but it has no counter — just a fixed set of states. By the Pumping Lemma, any sufficiently long string accepted by a finite automaton contains a pumpable substring, which breaks the equal-count requirement.

This is why you can’t match balanced parentheses, valid HTML nesting, or palindromes with regex. The machine would need infinitely many states. For that, you need a pushdown automaton (context-free) or a Turing machine.

The limitation is exactly why regex is fast. Finite states means finite work per character. No stack, no tape, no backtracking — just a lookup table.


Implementation adapted from nfa-tools. For the formal treatment: Sipser, “Introduction to the Theory of Computation,” Chapter 1.

Discussion