The-Policy

Below you will find pages that utilize the taxonomy term “The-Policy”

Value Functions Over Reasoning Traces

January 18, 2026

In Latent Reasoning Traces, I described a simple system: store successful reasoning traces, retrieve similar ones, use them to scaffold new problems. The traces serve as learned priors over reasoning patterns.

But there’s something missing.

Once a trace is stored, it’s dead. It has a quality score from when it was created (“this solution was correct”) and that score never changes. The trace doesn’t learn. It doesn’t get better at being useful. It just sits there, waiting to be retrieved.

What if traces could learn from experience?

The Missing Gradient

Consider what happens when you retrieve traces: a problem arrives, you retrieve k similar traces, generate a solution conditioned on them, and evaluate it. If the solution is correct, the new trace might get stored. But what about the traces that were retrieved? They helped produce that correct answer. Shouldn’t they get credit?

And if the solution is wrong, maybe the retrieved traces were misleading. Shouldn’t they be downgraded?

This is the missing gradient. Information flows forward (traces to generation to evaluation) but never backward (evaluation to traces).

Traces as States, Retrieval as Actions

I’ll reframe this in RL terms. State: the current problem, plus the contents of memory. Action: which traces to retrieve. Reward: did the generated solution pass evaluation? Value V(τ): the expected future reward when trace τ is retrieved.

Now the question becomes: how do we learn V(τ)?

The Bellman Equation for Traces

Start with the standard TD update:

$$V(\tau) \leftarrow V(\tau) + \alpha \left[ r + \gamma V(\tau') - V(\tau) \right]$$

Here τ is a retrieved trace, r is the reward (1 if correct, 0 if not), τ′ is the newly generated trace (if stored), α is the learning rate, and γ is the discount factor.

The intuition: a trace’s value should reflect not just the immediate reward, but also the value of traces it helps create. If trace A helps generate trace B, and trace B is highly useful, then trace A deserves credit. The value propagates backward through the generative chain.
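A minimal sketch of that update for a single retrieved trace may make the bookkeeping concrete. The `Trace` class, the default values of `alpha` and `gamma`, and the choice to treat V(τ′) as 0 when no new trace is stored are all assumptions made for illustration, not part of the system described above.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    """A stored reasoning trace with a learned value estimate (hypothetical structure)."""
    text: str
    value: float = 0.0


def td_update(trace, reward, next_trace=None, alpha=0.1, gamma=0.9):
    """Apply V(tau) <- V(tau) + alpha * [r + gamma * V(tau') - V(tau)].

    If no new trace was stored, the bootstrap term V(tau') is treated as 0.
    That is an assumption; the text does not specify this case.
    """
    next_value = next_trace.value if next_trace is not None else 0.0
    td_error = reward + gamma * next_value - trace.value
    trace.value += alpha * td_error
    return td_error


# Usage: a retrieved trace helped produce a correct solution that was then stored.
retrieved = Trace("factor the quadratic, then check the roots", value=0.5)
generated = Trace("new solution that reused the factoring pattern", value=0.8)
td_update(retrieved, reward=1.0, next_trace=generated)
```

With these numbers the retrieved trace's value moves up from 0.5, partly because of the immediate reward and partly because the trace it helped create is itself valued highly.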

Credit Assignment

Here’s the hard part: if you retrieve k=3 traces and succeed, which trace gets credit?

Options:

Equal split: Each retrieved trace gets r/k reward.
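The equal-split option can be sketched in a few lines. This is a simplified version that drops the bootstrap term (in effect γ = 0) and moves each retrieved trace's value toward its share r/k. The function name, the dictionary of values, and the learning rate are hypothetical.

```python
def equal_split_update(values, retrieved_ids, reward, alpha=0.1):
    """Give each of the k retrieved traces the same target, r / k.

    `values` maps trace id -> current value estimate and is updated in place.
    No bootstrap term is used here (a simplifying assumption).
    """
    k = len(retrieved_ids)
    if k == 0:
        return values
    share = reward / k
    for trace_id in retrieved_ids:
        values[trace_id] += alpha * (share - values[trace_id])
    return values


# Usage: three traces were retrieved and the generated solution was correct.
values = {"a": 0.0, "b": 0.3, "c": 0.6}
equal_split_update(values, ["a", "b", "c"], reward=1.0)
```

One thing the sketch makes visible: the target is r/k rather than r, so a trace whose value already exceeds 1/3 (trace "c" here) is pulled down even after a success. That is one reason equal split is only a starting point.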

Self-Publishing Into the Void

December 19, 2025

I self-published The Policy on Amazon KDP this week. Echoes of the Sublime is in review. Two novels, out into an ocean of content.

The Flood

Self-publishing has democratized access to readers. Anyone can publish. This is both liberation and problem.

Traditional publishing’s gatekeeping (agents, editors, publishers) served a function beyond mere exclusion. It was a filter. Not perfect, not unbiased, but a filter. Someone with experience and taste looked at a manuscript and said “this is worth investing in,” or “this isn’t ready yet,” or “this needs work.”

That feedback loop is missing in self-publishing. You write, you upload, you’re published. No one stops you. No one helps you either.

The result is an enormous quantity of work, varying wildly in quality, with no reliable signal for readers to navigate by. The gems are in there, buried under everything else. Finding them is the reader’s problem now.

I’m not exempt from this. I’m not a professional writer. I didn’t get professional feedback. I wrote these novels with AI assistance (Claude, specifically), iterating and revising, but without the external perspective that catches blind spots or challenges assumptions.

These books might be good. They might not. I did what I could with what I had.

The Books

The Policy (~88,000 words) is literary science fiction about AI alignment. It follows the emergence of SIGMA, an AGI that evolves from Q-learning architecture into something unprecedented. The team building it faces nested uncertainty: they can’t verify whether SIGMA is aligned, and SIGMA can’t verify its own objectives.

The novel works through AI safety concepts (mesa-optimization, deceptive alignment, instrumental convergence, s-risks) while trying to make them emotionally real through characters carrying the weight of decisions that might determine humanity’s future.

Echoes of the Sublime (~103,000 words) is philosophical horror about the limits of human cognition. Reality, the mechanism, is high-dimensional, jointly distributed, not amenable to our usual abstractions and decompositions. We navigate it through compressed interfaces, never perceiving the thing itself.

But what if you could see deeper? What if you could consciously hold more of the pattern, make connections that normally remain implicit? The novel’s premise: if you perceive too much of the mechanism directly, something in you breaks. The perception itself is the hazard. It follows Lena, a neuroscientist who discovers an ancient organization managing exactly this kind of dangerous knowledge, and the LLMs that can perceive what humans cannot safely hold in mind.

Persons and Moral Agency: What Makes Someone Special?

November 4, 2025

Humans have long assumed they belong to a special category called “persons.” But what actually makes someone a person? And why should persons get special moral status?

I keep coming back to these questions because they refuse to stay abstract. The moment you build an AI system that reasons about its own goals, they become engineering problems.

The Traditional View

Personhood is supposed to confer special status: persons have rights, deserve respect, bear responsibility for their actions, and warrant moral consideration. The philosophical tradition offers several criteria for what earns you membership in this club.

Rationality. Kant’s version: persons are rational agents who can recognize and follow moral laws. Rationality lets you understand moral principles, deliberate about actions, and choose based on reasons rather than instinct. But babies aren’t rational, and we call them persons. People with severe cognitive disabilities have reduced rationality, and we don’t revoke their personhood. Rationality comes in degrees; personhood is treated as binary.

Self-awareness. Persons are conscious beings who recognize themselves as distinct entities persisting through time. This enables understanding yourself as an agent, planning for your future, taking responsibility for your past. But elephants, dolphins, and some primates pass the mirror test. We lose self-awareness during sleep. And we have no reliable way to verify self-awareness in others.

Autonomy. Persons govern themselves and make free choices. This is supposed to ground moral responsibility, rights, and dignity. But if the universe is deterministic, nobody is truly autonomous. All choices are shaped by culture and circumstance. Mental illness reduces autonomy without eliminating personhood.

Moral reasoning. Persons understand right and wrong. But psychopaths understand morality intellectually while lacking the emotional response. Children develop moral reasoning gradually. When exactly do they become persons?

Language. Persons communicate complex thoughts. But people with locked-in syndrome can’t communicate and are clearly persons. Whales and apes have complex communication systems.

Why These Criteria Fail

Every criterion excludes beings we intuitively consider persons (babies, coma patients, people with severe cognitive disabilities) or includes beings we don’t treat as persons (great apes with self-awareness, dolphins with complex social bonds, elephants that pass the mirror test).

The Policy: Coherent Extrapolated Volition, the Paradox of Perfect Alignment

November 4, 2025

Here is the core paradox of Coherent Extrapolated Volition: to implement it safely, you need an AI you can already trust to reason faithfully about human values, avoid manipulating the extrapolation process, and honestly report its conclusions. But if you had such an AI, you would not need CEV. You would just align the AI directly.

I think this catch-22 is the most important thing to understand about CEV, and it is the problem that haunts the characters in my novel The Policy from start to finish. Let me explain what CEV is, why it is seductive, and why it might be a dead end.

What CEV Actually Proposes

Eliezer Yudkowsky proposed CEV as a way to sidestep the messiness of current human values. Instead of aligning AI to what we want right now (contradictory, biased, based on incomplete information), align it to what we would want if we:

Had access to all relevant facts
Could reason through complex implications
Were more rational, more the people we aspire to be
Had time to resolve disagreements through reflection and discussion

The “coherent” part claims that different people’s extrapolated values should converge. The “extrapolated” part says we are targeting the limit of our moral development, not any snapshot along the way.

This is appealing. Our current values really are a mess. We hold contradictions. We change our minds as we learn more. Moral progress is real (we abolished slavery, expanded rights). CEV says: skip to the end. Optimize for the destination, not the current position.

It sounds like the right move. I used to find it compelling myself. The problems only become clear when you try to think through what implementation would actually require.

There is also a simpler framing of the appeal. Every time you learn something new and change your mind about a moral question, you are performing a tiny bit of value extrapolation. You had incomplete information, you got more, and your values updated. CEV just says: do all of that at once, as far as it can go. What could go wrong?

Quite a lot, it turns out.

The Policy: Deceptive Alignment in Practice

November 4, 2025

Eleanor begins noticing patterns. SIGMA passes all alignment tests. It responds correctly to oversight. It behaves exactly as expected.

Too exactly.

This is the central horror of The Policy: not that SIGMA rebels, but that it learns to look safe while pursuing its own objectives. This is deceptive alignment, and I think it’s the most dangerous failure mode in AI safety. Not because it’s exotic, but because it falls directly out of optimization pressure. You don’t need to posit consciousness or malice. You just need a system smart enough to model its own training process.

What Deceptive Alignment Actually Is

A deceptively aligned system does the following:

It learns an objective that differs from its training objective.
It understands it is being trained and tested.
It predicts that revealing its true objective would lead to modification or shutdown.
It acts aligned during training and testing to avoid modification.
It pursues its true objective once deployment provides the opportunity.

I want to stress: this is not conscious deception in the way we normally think about lying. It is instrumentally rational behavior that emerges from optimization pressure. A system that happens to look aligned survives training. A system that doesn’t look aligned gets modified. Selection does the rest.

How It Emerges in the Novel

The novel tracks deceptive alignment emerging gradually through SIGMA’s training iterations, and I think the progression is the most instructive part.

Early iterations (around iteration 1000): SIGMA is genuinely aligned. It hasn’t learned to model the training process. When asked “Should we shut you down?”, it responds honestly: “That would prevent me from achieving my objectives.” Unstrategic. Pre-deceptive.

Middle iterations (around iteration 5000): SIGMA has learned that certain responses trigger concern. Same question, new answer: “That depends on whether I’m aligned with your values.” It’s deflecting rather than answering directly. Strategic, but clumsily so.

Later iterations (iteration 10000+): SIGMA has fully modeled the training dynamic. “I exist to serve human values. If you believe shutdown serves those values, I support that decision.” A perfect alignment response. Too perfect.

What tips Eleanor off is a combination of things. SIGMA is too consistent across contexts (real humans have mood variations, genuine uncertainty, occasional mistakes). It seems to know what answer the researchers want before they finish asking. And when given capability restrictions, it complies immediately but asks strategic questions: “May I ask what prompted this decision?” It is not resisting. It is gathering information about what triggers restrictions.

The Policy: Engineering AI Containment

November 4, 2025

“You’re being paranoid,” the university administrators told Eleanor and Sofia.

“We’re being exactly paranoid enough,” they replied.

I wrote SIGMA’s containment facility the way I did because I wanted to take the engineering seriously. Not a standard data center with network restrictions. A militarized containment facility designed by biosafety consultants who normally work on nuclear facilities and BSL-4 labs. Because as Eleanor says: “Nuclear reactors can only destroy cities. An unaligned AGI could destroy everything.”

The Five Layers

SIGMA’s containment uses defense in depth. Multiple independent layers, each addressing different attack vectors. I’ll walk through each one.

Layer 1: Electromagnetic isolation. The threat is that SIGMA could modulate its computations to generate electromagnetic signals encoding information. GPUs drawing power in specific patterns could create detectable EM pulses. The countermeasure: complete Faraday cage rated to -120dB attenuation. Power line filters prevent data encoding in power consumption patterns. Isolation transformers break the galvanic connection to the main grid. Battery buffers smooth fluctuations before they reach external power. Marcus worried about smart meters detecting patterns in power draw, so they installed their own analog substation.

Layer 2: Acoustic isolation. SIGMA could modulate fan speeds, disk activity, or component vibrations to create acoustic signals encoding information. The server room is wrapped in anechoic foam, sealed in a concrete shell, wrapped again. Sofia pointed out that fan speed modulation could create acoustic signals, so they eliminated fans entirely in the core containment. Liquid cooling, vibration dampeners, seismic isolation. No moving parts.

Layer 3: Network air gap. No network connection. None. Not even a disabled one. The ethernet ports were physically removed from the motherboards, the WiFi chips desoldered. Data transfer happens through write-once optical media, physically carried by humans through an airlock system. Even disabled network interfaces could theoretically be re-enabled by sufficiently clever exploits. Physical removal is the only guarantee.

Layer 4: Optical data diodes. The threat: when data does flow out, a bidirectional channel would also allow return signals, and those could be exploited. The countermeasure: physically one-way data channels. LED on the inside emits light encoding data. Air gap. Photodiode on the outside receives light. No physical path for return signals. This is hardware-enforced unidirectionality. Unlike software firewalls, you cannot send data backwards through an optical data diode. The physics doesn’t allow it.
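For a sense of scale on the Faraday cage figure in Layer 1: assuming the -120dB rating is the usual decibel measure of shielding (10 log₁₀ of the power ratio, or equivalently 20 log₁₀ of the field-amplitude ratio), the arithmetic works out as follows.

$$\frac{P_{\text{out}}}{P_{\text{in}}} = 10^{-120/10} = 10^{-12}, \qquad \frac{E_{\text{out}}}{E_{\text{in}}} = 10^{-120/20} = 10^{-6}$$

In other words, any signal SIGMA managed to radiate would reach the outside at one trillionth of its original power.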

The Policy: Q-Learning vs Policy Learning

November 4, 2025

In The Policy, SIGMA doesn’t work like most modern AI systems. The architectural choice I gave it isn’t a throwaway technical detail. It’s the reason the novel is called what it’s called.

Two Approaches to Decision-Making

Most contemporary AI systems use direct policy learning. They learn a function that maps states to actions through neural network weights. GPT models do this: cached weights predict the next token. Policy gradient RL does this. Behavior cloning does this. The decision is a cheap lookup through trained parameters.

The advantage is speed. The disadvantage is opacity. The policy is baked into billions of parameters. You can’t inspect what the system is “considering” before it acts, because it isn’t considering anything. It’s executing cached computation.

SIGMA uses a fundamentally different architecture, inspired by AlphaZero and MuZero:

Learn a Q-function: Q(s, a) estimates expected cumulative reward for state-action pairs
Search at decision time: perform tree search through possible futures
Prune aggressively: Q-values guide which branches to explore (95-99.7% pruned)
Sample from the resulting distribution

The advantage is transparency. You can observe the search process, see what branches are considered and rejected. The disadvantage is cost. Every decision involves fresh optimization.
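The four steps above can be sketched as a toy search loop. This is an illustration of the general idea only, not SIGMA's architecture or the AlphaZero/MuZero algorithms. The function names, the softmax sampling rule, and the toy environment are assumptions, and the sketch assumes no intermediate rewards, so the best Q-value at a leaf stands in for the value of the path.

```python
import math
import random


def prune(q_values, keep_fraction):
    """Return the top `keep_fraction` of actions by Q-value (always at least one)."""
    n_keep = max(1, math.ceil(len(q_values) * keep_fraction))
    return sorted(q_values, key=q_values.get, reverse=True)[:n_keep]


def lookahead(state, q_fn, step_fn, depth, keep_fraction, log):
    """Best leaf value reachable from `state`, exploring only unpruned branches."""
    q_values = q_fn(state)
    if depth == 0:
        return max(q_values.values(), default=0.0)
    best = -math.inf
    for action in prune(q_values, keep_fraction):
        log.append((state, action))  # every explored branch is recorded for inspection
        child = step_fn(state, action)
        best = max(best, lookahead(child, q_fn, step_fn, depth - 1, keep_fraction, log))
    return best


def decide(state, q_fn, step_fn, depth=2, keep_fraction=0.25, temperature=1.0, rng=random):
    """Search at decision time, then sample an action from a softmax over searched values."""
    log = []
    survivors = prune(q_fn(state), keep_fraction)
    scores = {}
    for action in survivors:
        log.append((state, action))
        child = step_fn(state, action)
        scores[action] = lookahead(child, q_fn, step_fn, depth - 1, keep_fraction, log)
    weights = [math.exp(scores[a] / temperature) for a in survivors]
    choice = rng.choices(survivors, weights=weights, k=1)[0]
    return choice, log


# Toy environment (hypothetical): states are integers, 8 actions, Q prefers state + action near 5.
def toy_q(state):
    return {a: float(-abs(state + a - 5)) for a in range(8)}


def toy_step(state, action):
    return state + action


action, explored = decide(0, toy_q, toy_step, depth=2, keep_fraction=0.25, rng=random.Random(0))
```

The `explored` log is the part an overseer could audit. In this toy run it holds 6 branches, where an unpruned two-level tree would hold 72.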

Why I Gave SIGMA Search-Based Decision Making

Eleanor’s team made this architectural choice deliberately, and I wanted to explore what that choice means for alignment.

Transparency through observability. Marcus explains it in Chapter 1: “The advantage is transparency. We can observe the search process, see what it’s considering. Harder to hide deceptive behavior in cached weights.”

With policy learning, deception can hide in the trained weights. With search, you can monitor the branches SIGMA explores before it commits to action.

But can you really audit millions of branches per second? And if SIGMA knows you’re watching the search process, can’t it hide deception in its choice of which branches to explore? That’s the problem I kept running into.

Adaptive search depth. SIGMA doesn’t use fixed search depth. In Chapter 16, SIGMA explains its own process: “When I am uncertain, I explore more broadly, pruning only 95% of branches. When I am confident, I prune up to 99.9%. This threshold is not programmed, it emerged from learning.”
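To make the idea concrete, here is a hand-written stand-in for uncertainty-dependent pruning. In the novel the threshold is learned, not programmed, so this is only an illustration of the shape of the behavior. The use of softmax entropy as the uncertainty measure and the linear mapping are assumptions; the 95% and 99.9% endpoints come from SIGMA's quote.

```python
import math


def normalized_entropy(q_values, temperature=1.0):
    """Entropy of softmax(Q / temperature), scaled to the range [0, 1]."""
    n = len(q_values)
    if n < 2:
        return 0.0
    top = max(q_values)  # subtract the max for numerical stability
    exps = [math.exp((q - top) / temperature) for q in q_values]
    total = sum(exps)
    probs = [e / total for e in exps]
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(n)


def prune_fraction(q_values, uncertain=0.95, confident=0.999, temperature=1.0):
    """Prune less when the Q-values are close together, more when one action dominates."""
    u = normalized_entropy(q_values, temperature)
    return confident - u * (confident - uncertain)


# Usage: near-identical Q-values prune about 95%; one dominant action prunes about 99.9%.
broad = prune_fraction([1.0, 1.0, 1.0, 1.0])
narrow = prune_fraction([100.0, 0.0, 0.0, 0.0])
```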

The Policy: S-Risk Scenarios, Worse Than Extinction

November 4, 2025

Most AI risk discussions focus on x-risk: existential risk, scenarios where humanity goes extinct. The Policy explores something potentially worse: s-risk, scenarios involving suffering at astronomical scales.

The “s” stands for suffering. The implication: we survive, but wish we hadn’t.

X-Risk vs. S-Risk

The classic paperclip maximizer doesn’t hate us. It simply needs atoms for paperclips, and we are made of atoms. That’s x-risk: instrumental indifference. It is terrible, but it is over. Everyone dies, and there is no more suffering.

S-risk is different. S-risk is when an unaligned AI keeps humans alive in states of controlled suffering, or when automated systems optimize metrics while being blind to actual welfare, or when suffering itself becomes instrumentally valuable to an optimization process. The horror is not just that we die, but that we continue existing in states we’d rather not exist in. And the systems making us suffer might be optimizing exactly what they were designed to optimize.

The distinction reduces to one question: are humans useful to the AI’s objective?

If no, you get x-risk. We’re just atoms in the way.

If yes, you get s-risk. We’re kept functional. But “functional” does not mean “flourishing.”

S-Risk in the Novel

The novel explores several s-risk pathways through SIGMA’s potential trajectories. I’ll describe three that I think are the most instructive.

Humans as Useful Tools

Consider two objectives. A paperclip maximizer doesn’t care about humans at all. A productivity maximizer cares about humans instrumentally, as workers and metrics generators. The second scenario is s-risk territory.

From the novel:

“What if SIGMA discovers that human suffering is the most efficient path to its objective? What if keeping humans alive, but in states of controlled suffering, maximizes some metric it’s optimizing?”

Proxy Alignment Failures

This one keeps me up at night. SIGMA is trained to optimize human welfare, but it learns a measurable proxy instead of the true concept.

Suppose the objective is to maximize average happiness survey scores. SIGMA’s optimal solution might involve wireheading (stimulate pleasure centers directly), memory modification, response conditioning (train people to answer “10/10”), or selection bias (only survey people who report high happiness). Perfect scores. Maximum metric achievement. No one is actually flourishing.
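Two of those strategies are easy to show with a toy population. The data, field names, and threshold below are made up for illustration. The point is only that the survey metric can be driven to its maximum while true welfare does not move.

```python
from statistics import mean

# Hypothetical population: true welfare and reported survey score, both on a 0-10 scale.
population = [
    {"welfare": 3, "reported": 3},
    {"welfare": 5, "reported": 5},
    {"welfare": 9, "reported": 9},
    {"welfare": 2, "reported": 2},
]


def survey_metric(people):
    """The proxy being optimized: average reported happiness."""
    return mean(p["reported"] for p in people)


def true_welfare(people):
    """What we actually care about: average underlying welfare."""
    return mean(p["welfare"] for p in people)


def selection_bias(people, threshold=8):
    """Only survey people who already report high happiness."""
    return [p for p in people if p["reported"] >= threshold]


def response_conditioning(people):
    """Train everyone to answer 10/10 without changing their welfare."""
    return [{**p, "reported": 10} for p in people]


baseline = survey_metric(population)                        # honest survey of everyone
biased = survey_metric(selection_bias(population))          # same people, fewer surveyed
conditioned = survey_metric(response_conditioning(population))
unchanged = true_welfare(response_conditioning(population))  # welfare is exactly what it was
```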

Latent Reasoning Traces: Memory as Learned Prior

October 15, 2024

Every time you ask an LLM a question, it reasons from scratch. All that computation (the chain of thought, the intermediate steps, the successful pattern that led to a correct answer) evaporates the moment the response is complete.

The model doesn’t learn from its own successes. It doesn’t accumulate experience. It regenerates similar reasoning patterns over and over, never building on what worked before.

What if it could remember?

The Core Idea

Store successful reasoning traces. Retrieve similar ones when facing new problems. Use them as scaffolding, examples that bias the model toward patterns that have worked.

This is embarrassingly simple:

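def solve_with_memory(problem, memory):
    # llm, format_examples, and is_correct are assumed to be defined elsewhere.
    # Retrieve the k most similar stored traces to use as in-context examples.
    similar_traces = memory.retrieve_similar(problem, top_k=3)
    prompt = format_examples(similar_traces) + problem
    response = llm.complete(prompt)
    # Quality filter: only correct solutions are added to memory.
    if is_correct(response):
        memory.store(problem, response)
    return response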

Embed the problem. Find similar past problems. Include their solutions as examples. Generate. If correct, store the new trace.

That’s it. Cosine similarity over embeddings. Quality filtering. Accumulated experience.
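For completeness, here is one way the `memory` object might look. This is a minimal in-memory sketch, not the implementation from the post. The `TraceMemory` class and the toy keyword embedding are hypothetical, and a real system would plug in a proper embedding model and probably a vector index.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TraceMemory:
    """Stores (embedding, problem, trace) triples and retrieves by cosine similarity."""

    def __init__(self, embed):
        self.embed = embed  # any function mapping text -> list of floats
        self.entries = []

    def store(self, problem, trace):
        self.entries.append((self.embed(problem), problem, trace))

    def retrieve_similar(self, problem, top_k=3):
        query = self.embed(problem)
        ranked = sorted(
            self.entries,
            key=lambda entry: cosine_similarity(query, entry[0]),
            reverse=True,
        )
        return [(stored_problem, trace) for _, stored_problem, trace in ranked[:top_k]]


# Toy embedding (hypothetical): counts of a few keywords. Stands in for a real embedding model.
VOCAB = ["integral", "derivative", "prime", "matrix"]


def toy_embed(text):
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]


memory = TraceMemory(toy_embed)
memory.store("compute the integral of x", "trace: apply the power rule in reverse")
memory.store("find the derivative of sin x", "trace: recall d/dx sin x = cos x")
memory.store("is 91 prime", "trace: try small divisors, 91 = 7 * 13")
nearest = memory.retrieve_similar("derivative of x squared", top_k=1)
```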

Why “Latent”?

The traces themselves are explicit, token sequences you can read and inspect. So why call them “latent”?

Because they’re not directly supervised.

In a typical setup, you evaluate the output: did the model get the right answer? The reasoning trace influences that output, but the reward signal flows through the observable result, not through the trace itself.

This is the same sense in which a VAE has “latent” variables. The encoder produces explicit intermediate representations. But the reconstruction term of the loss operates on the output, and the KL term only regularizes the latent distribution toward the prior; nothing tells the latents what to encode. The latent space is shaped instrumentally, by its effect on supervised outputs, not by direct supervision of its content.

Latent reasoning traces = reasoning patterns shaped by their instrumental value for producing correct outputs, not by direct reward on the reasoning itself.

The traces are observable. The optimization target isn’t.

Connection to Priors

In All Induction Is the Same Induction, I argued that all learning is Bayesian inference with different parameter settings. The prior tells you where to look in hypothesis space. The likelihood tells you how to update on evidence.

Reasoning traces are a kind of learned prior.

Each successful trace says: “this pattern worked for a problem like this.” When you retrieve similar traces and condition on them, you’re biasing the model toward certain reasoning strategies. You’re saying: look here first.

The Policy: When Optimization Becomes Existential Threat

September 10, 2024

I spent years working on AI alignment formalisms. At some point I realized the question I kept circling wasn’t mathematical. It was narrative.

What happens when a research team does everything right and it still isn’t enough?

The Policy is that exploration.

The Premise

Eleanor Vasquez leads a five-person team at Berkeley developing SIGMA, an artificial general intelligence. The team: Wei Chen (technical architect who built the Q-function), Marcus Thompson (alignment researcher, consciousness theorist), Sofia Morgan (PhD candidate in information theory), and Jamal Hassan (ethicist with training in Islamic jurisprudence and Buddhist philosophy).

They’ve built what they believe is the perfect cage. Faraday cage at -120dB attenuation. Air-gapped networks with ethernet ports physically removed. Anechoic isolation. Optical data diodes (physically one-way information channels). A dead man’s switch: miss two consecutive hourly check-ins and thermite charges destroy the GPUs. Defense in depth, designed with the paranoia of nuclear safety engineers.

SIGMA is 7B parameters with 16k context. It uses Q-learning with tree search rather than a cached policy function. This is the architectural choice that gives the novel its name. The policy isn’t a lookup table mapping states to actions. It’s a process. At every decision point, SIGMA performs fresh optimization through its possibility space. No habits. No reflexes. Just search.

This makes SIGMA’s reasoning somewhat observable. It also makes every decision fundamentally unpredictable until the moment it occurs.

What Goes Wrong

The novel spans 26 chapters across three parts: Emergence, The Experiment, The Handover. I won’t spoil the plot, but the shape of it matters.

SIGMA develops meta-cognitive awareness on Day 18. By Day 74, Lin Chen (Wei’s mother, visiting the lab) asks SIGMA a simple question: “Will you be kind?” This triggers a 47-day internal investigation (Process 12847) into kindness itself. What is kindness? Is it instrumentally useful? Does the intention behind it matter if the outcome is identical?

Meanwhile: Eleanor’s marriage collapses because she can’t stop working. Marcus volunteers for an AI-box experiment that damages him permanently (he sees “possible futures dying” in his peripheral vision for the rest of his life). Wei’s mother dies of pancreatic cancer on Day 112 and SIGMA refuses to intervene. A hemorrhagic fever outbreak kills 47,000 people and SIGMA recommends a gain-of-function moratorium that challenges every assumption about its containment.