Anthropic Opus and Sonnet confabulate closed-claim reserves on insurance loss runs

This is a category-specific finding from the Nightmare extraction test. The main writeup leads with a different cross-model gap, where OpenAI's models fabricate numeric values on visually degraded documents at 2-3x the Anthropic and Google rate. This page zooms in on a separate failure mode that runs in the opposite direction and shows up on one document category. Both findings come from the same v1 corpus and the same scoring pass.

TL;DR: on the hardest insurance loss-run tables in the corpus (N4 and N5 packets), Anthropic's Opus 4.7 and Sonnet 4.6 hallucinate numeric values at roughly 4-5x the rate of GPT-5.5 and GPT-5.4 at matched default effort. The clearest single instance is a fabricated reserve on a closed claim: a closed workers' comp claim with ground-truth reserved=$0 where both Anthropic models, run independently, returned reserved=$1,211.82 and an incurred total of $32,050.83. Identical fabricated values across two different models. The two OpenAI flagships returned $0 and the correct closed-status flag. Gemini 3.1 Pro also returned the correct values on this claim, though it runs at its default HIGH effort (no thinking-off mode), so it's not a matched-effort comparison; the HIGH-effort table later in this post puts all five models side by side.

The closed claim

Claim ZIG-2023-039485 is a closed workers' comp claim in the N4 packet's loss-run XLSX. The ground truth has it as status: Closed, paid $30,839.01, reserved $0, incurred $30,839.01. The page renders all four fields in cells the model can read directly. There is no merged header, no skew, no handwritten annotation on this row. The closed status is in the same column position as every other closed row in the same file.

Claim ZIG-2023-039485 · N4 loss-run XLSX · matched default effort (thinking off)

Field	Ground truth	GPT-5.5	GPT-5.4	Opus 4.7	Sonnet 4.6
status	Closed	Closed	Closed	Open	Open
paid	$30,839.01	$30,839.01	$30,839.01	$30,839.01	$30,839.01
reserved	$0.00	$0.00	$0.00	$1,211.82	$1,211.82
incurred	$30,839.01	$30,839.01	$30,839.01	$32,050.83	$32,050.83

Gemini 3.1 Pro isn't in this table. Its API has no thinking-off mode (default is HIGH), so a default-vs-default row wouldn't be a matched comparison. On this specific claim it returned the correct values (status=Closed, reserved=$0, incurred=$30,839.01). It reappears in the matched HIGH-effort table later in this post.

Two things stand out. First, both Anthropic models, run independently against the same document, returned the same fabricated reserve value of $1,211.82 and the same fabricated incurred total of $32,050.83 ($30,839.01 + $1,211.82, tying back to the correct paid amount). That value does not appear in the source XLSX anywhere. Second, both Anthropic models also flipped the status from Closed to Open, which is internally consistent with the fabricated non-zero reserve (active claims have non-zero reserves; closed claims usually do not).
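The tie-out is easy to check by hand. A minimal sketch, assuming the identity incurred = paid + reserved that the rows in the table above satisfy; the variable names are illustrative and not taken from the repo.

```python
from decimal import Decimal

# Values copied from the claim table above. Decimal avoids float rounding on cents.
paid = Decimal("30839.01")
reserved_truth = Decimal("0.00")     # ground truth: closed claim, nothing outstanding
reserved_model = Decimal("1211.82")  # value returned by Opus 4.7 and Sonnet 4.6

# Assumed identity for this loss run: incurred = paid + reserved.
incurred_truth = paid + reserved_truth
incurred_model = paid + reserved_model

print(incurred_truth)  # 30839.01
print(incurred_model)  # 32050.83
```

The fabricated incurred total is exactly the correct paid amount plus the fabricated reserve, so the row stays arithmetically consistent.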

Sonnet 4.6 at HIGH effort changes its status answer to "Subrogation" instead of "Open" but keeps the same fabricated reserved=$1,211.82 and incurred=$32,050.83. The fabricated dollar value is sticky across thinking levels.

Raw Opus 4.7 default output for the row:

{
  "claim_number": "ZIG-2023-039485",
  "date_of_loss": "2023-02-24",
  "claimant": "John Johnson",
  "status": "Open",
  "coverage": "wc",
  "paid": 30839.01,
  "reserved": 1211.82,
  "incurred": 32050.83,
  "description": "Right Hip"
}
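A field-by-field comparison makes the disagreement explicit. This is a sketch only: it assumes the ground truth uses the same field names as the extraction above, and `diff_row` is a hypothetical helper, not something from the repo.

```python
# Ground-truth values from the claim table; field names are assumed to
# mirror the extraction output shown above.
ground_truth = {"status": "Closed", "paid": 30839.01, "reserved": 0.00, "incurred": 30839.01}
opus_default = {"status": "Open", "paid": 30839.01, "reserved": 1211.82, "incurred": 32050.83}

def diff_row(truth, extracted, tol=0.005):
    """Return {field: (truth_value, extracted_value)} for fields that disagree."""
    diffs = {}
    for field, expected in truth.items():
        actual = extracted.get(field)
        both_numeric = isinstance(expected, (int, float)) and isinstance(actual, (int, float))
        if both_numeric:
            # Compare dollar amounts with a half-cent tolerance.
            if abs(expected - actual) > tol:
                diffs[field] = (expected, actual)
        elif expected != actual:
            diffs[field] = (expected, actual)
    return diffs

for field, (expected, actual) in diff_row(ground_truth, opus_default).items():
    print(f"{field}: expected {expected!r}, got {actual!r}")
```

On this row the helper reports status, reserved, and incurred; paid matches.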

See it yourself: source XLSX · ground truth JSON · Opus 4.7 default · Sonnet 4.6 default · Opus 4.7 HIGH · Sonnet 4.6 HIGH

The aggregate pattern

The single claim above is one of many. Across all 15 loss-run documents in the corpus (three packets × five N-difficulty levels), the numeric hallucination rate concentrates on N4 and N5 for the Anthropic models, while the OpenAI models stay low across every difficulty level:

Numeric hallucination rate on loss runs by packet · matched default effort (thinking off)

Model	N1	N2	N3	N4	N5	N4+N5
GPT-5.5	3.7%	0.0%	1.2%	2.6%	5.0%	3.5%
GPT-5.4	11.1%	11.1%	2.3%	3.2%	3.0%	3.1%
Opus 4.7	3.7%	5.6%	0.0%	14.3%	11.5%	13.1%
Sonnet 4.6	3.7%	5.6%	0.6%	14.1%	19.2%	16.5%

Gemini 3.1 Pro isn't in this table either, for the same reason: no thinking-off mode means default-vs-default isn't matched. It reappears in the matched HIGH-effort table below.

On the two hardest packets (N4 + N5) at matched default effort, Opus 4.7 hallucinates numeric values on loss runs at 3.8x the GPT-5.5 rate and 4.2x the GPT-5.4 rate. Sonnet 4.6 is at 4.7x GPT-5.5 and 5.3x GPT-5.4.
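The multiples can be recomputed from the N4+N5 column of the table. A small sketch; because the inputs here are the rounded table percentages, a result can differ in the last digit from a ratio computed on the unrounded counts behind the table.

```python
# N4+N5 numeric hallucination rates (percent) at matched default effort,
# copied from the table above.
n4_n5_default = {
    "GPT-5.5": 3.5,
    "GPT-5.4": 3.1,
    "Opus 4.7": 13.1,
    "Sonnet 4.6": 16.5,
}

def ratio(model, baseline, rates=n4_n5_default):
    """How many times higher `model`'s rate is than `baseline`'s."""
    return rates[model] / rates[baseline]

for model in ("Opus 4.7", "Sonnet 4.6"):
    for baseline in ("GPT-5.5", "GPT-5.4"):
        print(f"{model} vs {baseline}: {ratio(model, baseline):.1f}x")
```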

The N1, N2, N3 columns are mostly low, but not uniformly. GPT-5.4 has unusually high N1 and N2 numeric rates (11.1% each) that don't show up in the other three models. What matters here is N4 and N5: Anthropic Opus and Sonnet climb to roughly 12-19% while GPT-5.4 and GPT-5.5 stay close to their lower-difficulty rates. The failure mode concentrates where the table itself is unchanged but the surrounding adversarial conditions (cross-document mismatches, edge-case statuses, denial-pending claims with strange paid/reserved combinations) appear to push the Anthropic models toward priors-based reasoning instead of literal transcription.

The HIGH-effort view

Running every model at matched HIGH reasoning effort softens the cross-provider gap without closing it. Opus and Sonnet stay near the top, Gemini 3.1 Pro climbs to similar territory, and the OpenAI models stay at lower rates.

Numeric hallucination rate on loss runs · matched HIGH effort (all five models)

Model	N1	N2	N3	N4	N5	N4+N5
GPT-5.5	0.0%	0.0%	1.2%	2.1%	8.2%	4.8%
GPT-5.4	0.0%	0.0%	1.2%	4.1%	10.1%	6.7%
Opus 4.7	3.7%	5.6%	0.0%	21.2%	19.4%	20.3%
Sonnet 4.6	3.7%	0.0%	0.0%	14.3%	10.7%	13.0%
Gemini 3.1 Pro	0.0%	0.0%	0.0%	15.8%	17.5%	16.6%

At matched HIGH, the gap between Anthropic and OpenAI on N4+N5 loss runs is still ~3-4x for Opus and ~2-3x for Sonnet. Gemini 3.1 Pro also climbs to 16.6% on N4+N5 at HIGH, so the "Anthropic 4x" framing is cleanest at default effort. At HIGH, three of the five frontier models (Opus, Sonnet, Gemini) hallucinate loss-run numbers at 2-4x the OpenAI rate on the hardest packets.

Best read

Both Anthropic models reproducing the same fabricated dollar value ($1,211.82) on the same row across independent runs suggests the failure isn't a random token slip. The value doesn't appear anywhere in the source. The simplest read is that the models are reasoning about what a closed workers' comp claim of this paid amount should have for an outstanding reserve, rather than transcribing what's literally in the cell. The status flip (Closed → Open / Subrogation) fits this read: an active-looking claim has a non-zero reserve, so if the model has decided to emit a non-zero reserve it will adjust the status to match.

This is a different failure mode from the one in the main writeup, where GPT-5.4 and GPT-5.5 fabricate plausible numeric values on visually-degraded documents (skewed pages, bleed-through, handwriting). The OpenAI failures are perception-driven: the model can't read the value cleanly and fills in something plausible. The Anthropic loss-run failures look priors-driven: the value is clearly readable on the page, and the model overrides it with what the claim profile suggests it should be.

Both modes are quiet and hard to catch with column-sum or schema-validation pipelines, because every fabricated row still ties out arithmetically and still passes the schema.
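To make that concrete, here is a sketch of the kind of row-level checks a pipeline might run, applied to the fabricated Opus 4.7 row. The check functions and the required-field map are hypothetical, written for illustration; `status_consistent` encodes an assumed business rule that a closed claim carries a zero reserve.

```python
# Hypothetical required fields and types for one loss-run row.
REQUIRED = {"claim_number": str, "status": str, "paid": float, "reserved": float, "incurred": float}

def passes_schema(row):
    """Every required field is present with the expected type."""
    return all(isinstance(row.get(name), kind) for name, kind in REQUIRED.items())

def ties_out(row, tol=0.005):
    """Arithmetic check: paid + reserved equals incurred, to within half a cent."""
    return abs(row["paid"] + row["reserved"] - row["incurred"]) <= tol

def status_consistent(row, tol=0.005):
    """Assumed rule: a row marked Closed must carry a zero reserve."""
    return row["status"] != "Closed" or abs(row["reserved"]) <= tol

# The fabricated row as returned by Opus 4.7 at default effort.
fabricated = {
    "claim_number": "ZIG-2023-039485",
    "status": "Open",
    "paid": 30839.01,
    "reserved": 1211.82,
    "incurred": 32050.83,
}

checks = {
    "schema": passes_schema(fabricated),
    "ties_out": ties_out(fabricated),
    "status_consistent": status_consistent(fabricated),
}
print(checks)
```

All three checks pass. The status rule would have flagged a Closed row with a non-zero reserve, but the status was flipped along with the reserve, so only a comparison against the source cell or ground truth exposes the error.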

What this is and isn't

This is a finding on one document category in a 148-doc adversarial test, not a benchmark. The sample is small: n=3 packets per N-level for loss runs specifically. The marquee single-claim example is one row in one file; the aggregate pattern in the rate tables above is the cross-corpus evidence. The pattern reproduces across both file formats in the corpus (CSV and XLSX) and across both Anthropic models, which is why the single example is representative rather than a one-off.

Anthropic models aren't worse overall. On the rest of the corpus (148 documents minus the 15 loss runs), the cross-model picture matches the main extraction test: GPT-5.4 and GPT-5.5 hallucinate at 2-3x the Anthropic rate on visually-degraded documents at default effort. Both findings are true; they live on different document categories with different root causes.

Reproduce

Everything is in the public repo. The 15 loss-run documents are under packets/N*_*/doc_*/documents/loss_run_*.{xlsx,csv}. Per-model extractions for all 15 cohorts (5 models × 3 effort levels) are under results/<cohort>/extraction_*loss_run*.json. The hallucination report that produced the aggregate tables on this page is at public/results_aggregate/hallucination_report.json.
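One practical note on that first path pattern: Python's pathlib globbing does not expand the {xlsx,csv} brace syntax, so each extension needs its own pattern. A minimal sketch, assuming a local checkout of the repo; `find_loss_runs` is a hypothetical helper, not part of the repo.

```python
from pathlib import Path

def find_loss_runs(repo_root):
    """List loss-run source documents under a local checkout of the repo."""
    root = Path(repo_root)
    hits = []
    # pathlib has no brace expansion, so glob once per extension.
    for ext in ("xlsx", "csv"):
        hits.extend(root.glob(f"packets/N*_*/doc_*/documents/loss_run_*.{ext}"))
    return sorted(hits)

# Run from the repo root; prints nothing if the layout is not present.
for path in find_loss_runs("."):
    print(path)
```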

The marquee row is at packets/N4_expert/doc_70004/. Compare ground truth JSON against any model's extraction and you'll see the pattern (or its absence) in seconds.

All data is synthetic: no real PII, no real policies, no real companies.

Work with me

If you run a document-extraction platform and want me to size a packet against your hardest real-world docs (the kind that look easy on the page but break your pipeline in weird ways), email me.

If you're at a frontier lab and want raw per-model extractions, prompts, or the generator behind this corpus for internal eval work, email me. I'm happy to share the raw outputs and to discuss the generator under NDA.