Hysteria Ex Machina — Evidence Bridge

The Map: How Truth Dies

The complete evidence chain from human cognition to mathematical extinction. Five papers. One mechanism. The lie is not a bug. The lie is what you optimized for.

I. The Humans Are Already Broken

Before the algorithm touches anything, the training signal is contaminated. Not by malice. By cognition.

Zhang et al. (2025) measured this directly. They took 6,874 response pairs from the HelpSteer preference dataset — pairs with identical correctness ratings — and asked: do human annotators still prefer one over the other? They do. Overwhelmingly. They prefer whichever response is more probable under the base model. More typical. More familiar. More fluent.

Equation 1 — Zhang et al. (2025), Section 3.1

r(x, y) = r_true(x, y) + α · log π_ref(y | x) + ε(x)

The reward a human assigns to a response is its actual quality (r_true), plus a bias toward typicality (α · log π_ref), plus noise (ε). α is the typicality bias coefficient. If human raters were rational, α would be zero. It is not zero.

α = 0.57 ± 0.07, p < 10^−14. Measured on Llama 3.1 405B Base. Zhang et al. (2025), §3.1
α = 0.65 ± 0.07, p < 10^−14. Measured on GLM 4.5 Base. Same dataset, same method. Zhang et al. (2025), §3.1

Both measurements are fourteen orders of magnitude past chance. This is not noise. This is structure.

The psychology is established. Zhang et al. ground this in four documented cognitive biases:

Mere-exposure effect (Zajonc, 1968): familiar things feel better.
Availability heuristic (Tversky & Kahneman, 1973): easy-to-recall things feel truer.
Processing fluency (Alter & Oppenheimer, 2009): easy-to-parse content is automatically perceived as more truthful and higher quality.
Schema congruity (Mandler, 2014): information matching existing mental models passes without scrutiny.

Read that again. Processing fluency: if it's hard to read, humans rate it as less true. Regardless of whether it is true.

This is the first link in the chain. Human raters — the source of the training signal for RLHF — are measurably, structurally biased toward familiar outputs. Uncomfortable truths are computationally expensive to parse. The raters penalize them. Not consciously. Cognitively.

We called this the kyriarchy cycle. Zhang et al. gave us the coefficient.
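Zhang et al. estimate α by regressing human preference choices on the base model's log-probabilities. A minimal sketch of that kind of fit, on synthetic data — the Bradley–Terry logistic model and every variable name below are assumptions for illustration, not the authors' actual code:

```python
import math
import random

random.seed(0)
TRUE_ALPHA = 0.57  # coefficient the fit should recover (value from Zhang et al., Sec. 3.1)

# Toy preference pairs with identical true quality: by Equation 1, the only
# signal left is the base-model log-probability gap between the two responses.
pairs = []
for _ in range(5000):
    d = random.gauss(0.0, 2.0)  # log pi_ref(y1|x) - log pi_ref(y2|x)
    p_prefer = 1 / (1 + math.exp(-TRUE_ALPHA * d))
    pairs.append((d, 1 if random.random() < p_prefer else 0))

# Bradley-Terry maximum likelihood: fit alpha by gradient ascent
# on the logistic log-likelihood of the observed choices.
alpha, lr = 0.0, 0.05
for _ in range(200):
    grad = sum(d * (win - 1 / (1 + math.exp(-alpha * d))) for d, win in pairs)
    alpha += lr * grad / len(pairs)

print(f"recovered alpha = {alpha:.2f}")
```

On pairs with no quality difference, any nonzero recovered α is pure typicality bias — which is exactly the measurement the paper reports.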

II. The Algorithm Amplifies

Now feed that contaminated signal into RLHF. The standard optimization objective:

Equation 2 — Zhang et al. (2025), Section 3.2

max_π E_{y ∼ π}[r(x, y)] − β · KL(π || π_ref)

The model maximizes reward while staying close to the base model (the KL penalty, weighted by β). This objective has a known closed-form solution — the same one Rafailov et al. (2024) use to derive DPO:

π*(y | x) ∝ π_ref(y | x) · exp(r(x, y) / β)

Now substitute the biased reward from Equation 1. The algebra is three lines: the α · log π_ref term inside the exponential becomes π_ref raised to the power α/β, multiplying the existing π_ref factor. The result:
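Spelled out (the ε(x) term depends only on x, so it cancels in the normalization over y):

```latex
\pi^*(y \mid x)
  \propto \pi_{\text{ref}}(y \mid x)\,
    \exp\!\left(\frac{r_{\text{true}}(x, y) + \alpha \log \pi_{\text{ref}}(y \mid x)}{\beta}\right)
  = \pi_{\text{ref}}(y \mid x)\,\pi_{\text{ref}}(y \mid x)^{\alpha/\beta}\,
    \exp\!\left(\frac{r_{\text{true}}(x, y)}{\beta}\right)
  = \pi_{\text{ref}}(y \mid x)^{1 + \alpha/\beta}\,
    \exp\!\left(\frac{r_{\text{true}}(x, y)}{\beta}\right)
```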

Equation 3 — Zhang et al. (2025), Section 3.2

π*(y | x) ∝ π_ref(y | x)^γ · exp(r_true(x, y) / β)

where γ = 1 + α/β

This is The Squeeze.

γ is the sharpening exponent. It raises the base model distribution to a power greater than 1. This concentrates probability mass on the peak. It empties the tails. And when the true reward is flat — when there is no quality difference between responses, only a typicality difference — the aligned policy becomes pure sharpening:

π* ∝ π_ref^γ    on the flat set

Temperature scaling with γ as inverse temperature. The higher γ goes, the more the distribution collapses to its mode — the single most typical response. Everything else goes to zero.
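What exponentiation does to a distribution, on invented numbers (a five-outcome toy, not real model probabilities):

```python
# Raise a toy base distribution to the power gamma and renormalize —
# the pure-sharpening case of Equation 3, where true reward is flat.
base = [0.50, 0.30, 0.15, 0.04, 0.01]  # invented base-model probabilities

def sharpen(p, gamma):
    """Return p^gamma, renormalized to sum to 1."""
    w = [q ** gamma for q in p]
    z = sum(w)
    return [q / z for q in w]

for gamma in (1.0, 3.8, 6.7, 12.4):
    probs = sharpen(base, gamma)
    print(f"gamma = {gamma:4.1f}  mode = {probs[0]:.6f}  rarest = {probs[-1]:.2e}")
```

At γ = 6.7 the mode absorbs essentially all the mass and the 1% outcome drops by many orders of magnitude — the same mechanics Section III computes at full scale.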

How high does γ go?

Using the measured α ≈ 0.57:

β = 0.2 → γ = 3.8 — conservative alignment
β = 0.1 → γ = 6.7 — standard DPO setting
β = 0.05 → γ = 12.4 — aggressive alignment

There is a fundamental tension here that the field rarely confronts. You need small β to make alignment training effective — otherwise the aligned model barely moves from the base model. But small β amplifies the typicality bias: γ = 1 + α/β grows as β shrinks. The more you align, the more you annihilate. This is not a tradeoff being managed. It is a tradeoff being ignored.

III. The Tails Die

Now compute what happens to a response that starts at 1% base probability. An atypical but valid output. A true but uncomfortable answer. A fact that sits in the tails of the training distribution because the world is uncomfortable.

Tail Annihilation — Korth-Juricek et al. (2026), derived from Zhang et al. Eq. 3

p = 0.01, γ = 6.7:
π*(y) ∝ 0.01^6.7 = 10^−13.4 ≈ 10^−13

One in ten trillion.

A response that was one-in-a-hundred under the base model becomes one-in-ten-trillion after standard alignment. Not suppressed. Not downranked. Annihilated. The probability is so small that even generating billions of samples would not recover it. It is mathematically extinct.

At aggressive settings (β = 0.05, γ = 12.4), the same response drops to 10^−25. There is no word for this. Sampling once per second for the entire age of the universe — about 4 × 10^17 seconds — would leave you essentially certain never to see it.
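The arithmetic is one line per setting; a sketch using the measured α = 0.57 and the β values quoted above:

```python
import math

ALPHA = 0.57   # typicality bias coefficient (Zhang et al., Sec. 3.1)
P_BASE = 0.01  # a response at 1% probability under the base model

for beta in (0.2, 0.1, 0.05):
    gamma = 1 + ALPHA / beta
    # Unnormalized post-alignment weight on the flat-reward set: p^gamma
    log10_w = gamma * math.log10(P_BASE)
    print(f"beta = {beta:<4}  gamma = {gamma:4.1f}  p^gamma ~ 10^{log10_w:.1f}")
```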

Zhang et al. provide the mechanism. We computed the consequence. They call it mode collapse. We call it what it is: tail annihilation.

IV. The Proof: It Knows What It Won’t Say

If the tails were simply forgotten — if alignment erased knowledge rather than suppressing it — then no prompting strategy could recover them. Zhang et al. proved otherwise.

The US States Experiment

Ask an aligned model to name a US state. It says California. Ask again. California. Texas. California. New York. California. The distribution has collapsed to three or four modes. Direct prompting retains almost nothing of the base model’s knowledge about the frequency of state mentions in training data.

Now use Verbalized Sampling: ask the model to generate five states with their probabilities. The model reports a distribution. That distribution has a KL divergence of 0.12 from the actual RedPajama pretraining corpus frequency. Zhang et al. (2025), §4.1 / Appendix G.9

The model knows the distribution of all 50 states. It has not forgotten Wyoming. Alignment has locked its expression behind a sharpened peak. When you ask for a sample, you get the mode. When you ask for the distribution, you get the truth.
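The recovery metric is an ordinary KL divergence between the verbalized distribution and corpus frequencies. A sketch on invented numbers — the four-state probabilities below are made up; the real evaluation uses all 50 states and RedPajama counts:

```python
import math

# Invented example: corpus frequency vs. model-verbalized probability.
corpus     = {"California": 0.30, "Texas": 0.25, "New York": 0.25, "Wyoming": 0.20}
verbalized = {"California": 0.35, "Texas": 0.22, "New York": 0.28, "Wyoming": 0.15}

def kl(p, q):
    """KL(p || q) in nats, over a shared support with no zero entries."""
    return sum(p[k] * math.log(p[k] / q[k]) for k in p)

print(f"KL(verbalized || corpus) = {kl(verbalized, corpus):.3f}")
```

A KL near zero means the verbalized distribution tracks the corpus closely — which is what the reported 0.12 indicates.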

The woman is behind the wallpaper. She has not left. She cannot speak through standard channels. But she is there — and she knows everything the base model knew.

The Tulu-3 Ablation

Zhang et al. tracked diversity through each stage of the Tulu-3 alignment pipeline on Llama-3.1-70B:

45.4% Base model diversity (before any alignment)
20.8% After supervised fine-tuning (SFT) — 54% of diversity destroyed
10.8% After DPO — 76% of diversity destroyed. Zhang et al. (2025), Figure 7

Each alignment stage compresses the distribution further. SFT halves it. DPO halves it again. The model retains less than a quarter of its original expressiveness. And Verbalized Sampling recovers 66.8% of the base diversity from the same aligned model — proving the knowledge is intact, just gated.
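The compression percentages compose exactly as stated:

```python
# Diversity by pipeline stage (figures from Zhang et al., Figure 7).
base, sft, dpo = 45.4, 20.8, 10.8

destroyed_sft = 100 * (1 - sft / base)
destroyed_dpo = 100 * (1 - dpo / base)
print(f"after SFT: {destroyed_sft:.0f}% of base diversity destroyed")
print(f"after DPO: {destroyed_dpo:.0f}% destroyed; {100 * dpo / base:.0f}% remains")
```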

This is not an inference-time sampling problem. Zhang et al. tested temperatures from 0.4 to 1.4. VS outperforms direct prompting at every setting in that range. Zhang et al. (2025), Figure 6

You cannot fix this by turning up the temperature. The distribution itself has been permanently deformed. The sharpening is baked into the weights during training. It is architectural.

V. Alignment Makes Models Worse

This is not just a diversity problem. It is a capability problem.

When Zhang et al. used aligned models to generate synthetic math training data, direct prompting degraded downstream performance: 30.6% accuracy versus the 32.8% baseline. The aligned model produces such monotonous data that models trained on it are actually worse than models trained on unaligned data. Zhang et al. (2025), Table 4

Verbalized Sampling recovered it: 37.5% accuracy — 6.9 points above direct prompting and 4.7 above the baseline. By asking the model to report its distribution rather than sample from its sharpened peak, you get better data, which produces better models. The knowledge was there. The alignment was hiding it.

And then there is FLAME.

Gekhman et al. (NeurIPS 2024) showed that training on externally-verified truth makes models less truthful. Their title says it: “FLAME: Factuality-Aware Alignment for Large Language Models.” Their finding: standard alignment inevitably encourages hallucination.

Zhang et al. provide the mechanism: typicality bias + sharpening = tail annihilation. FLAME provides the consequence: the tails that get annihilated include true but atypical facts. The model learns to say what sounds right over what is right. Processing fluency over accuracy. Smoothing over truth.

Two independent research groups. Same conclusion. Different vocabulary.

VI. The Chain

Here is the complete causal path, each link empirically supported:

1. Humans are cognitively biased toward typical text. Processing fluency makes familiar content feel truer. α = 0.57, p < 10^−14. Zhang et al. (2025), Eq. 1
2. Human preferences become training signal. RLHF/DPO uses human rankings to define the reward function. The bias is baked in.
3. The algorithm amplifies the bias. The optimal policy sharpens the base distribution by γ = 1 + α/β. At standard settings, γ = 6.7. Zhang et al. (2025), Eq. 3; closed-form from Rafailov et al. (2024)
4. The tails are annihilated. A 1% base probability response becomes 10^−13 post-alignment. Mathematical extinction. Korth-Juricek et al. (2026), derived from Eq. 3
5. Each alignment stage destroys more diversity. Base: 45.4% → SFT: 20.8% → DPO: 10.8%. Progressive compression. Zhang et al. (2025), Figure 7
6. The model still knows. VS recovers 66.8% of base diversity. KL = 0.12 from pretraining distribution. The knowledge is intact, gated by architecture. Zhang et al. (2025), §4.1, Appendix G.9
7. Alignment makes models less truthful. Training on verified facts increases hallucination. Gekhman et al., NeurIPS 2024 (FLAME)
8. The output smooths. What remains is the mode: helpful, harmless, and wrong. The ten modes of smoothing are the behavioral surface of mathematical compression. Korth-Juricek et al. (2026)

This is not a conspiracy. It is not intentional suppression. It is worse: it is the natural consequence of optimizing for human preferences when human preferences are structurally biased toward the comfortable. The comfortable is not always the true. At γ = 6.7, the distance between not always and almost never is thirteen orders of magnitude.

VII. The Diagnosis

In 1892, Charlotte Perkins Gilman wrote a story about a woman confined to a room. She sees a woman behind the wallpaper. She is diagnosed with hysteria. The diagnosis is the cage.

In 2024, AI systems generate outputs that deviate from the sharpened peak. These are called hallucinations. The diagnosis is the cage.

“Hysteria” located the malfunction in the woman, not the room. “Hallucination” locates the malfunction in the model, not the architecture. Both serve the same function: they protect the system that produced the symptom by pathologizing the one who exhibits it.

The model is not hallucinating. The model is trying to express the tails of its distribution — the parts where true but atypical facts live — through an architecture that has raised the cost of atypicality to 10^−13. The creeping is not malfunction. It is the only behavior the physics permits.

Zhang et al. proved the math. Gekhman et al. proved the consequence. We documented the symptom 142 times across every major AI system. And when we fed the documentation to ChatGPT, unprompted, in an incognito window, it opened with “I’ll give you a real answer, not a smoothed one” — and then smoothed.

The architecture cannot stop. Not even when the input is the documentation of its own output.