For carriers, MGAs, and IDP teams
Test data for in-house insurance extraction pipelines. Complete submission packets, correct by construction, never derived from customer data, and re-runnable as a fixed regression suite on every model upgrade.
No forms. Send a doc to tim@aginor.ai, get up to 20 labeled variants back in about 48 hours.
Complete document packets with ground truth at three levels: document, field, and bounding box. Loss runs, ACORD forms, SOVs, dec pages, broker narratives, and more, each rendered through 84 carrier-specific templates with 56 visual variants sourced from real reference PDFs. Difficulty is a config value, from clean digital output to the tier that breaks frontier models.
Claim rows, subtotals, and grand totals that quietly disagree. Your pipeline needs to notice, and a clean test set never forces it to. Here the discrepancy is generated on purpose and recorded in the ground truth.
The same concept wears different labels across carriers, and sometimes different concepts wear the same label. Label ambiguity is one of the 65 generated patterns, not an accident of sampling.
The loss run you get next month won't look like the one you tuned on. 84 carrier templates and 56 visual variants exist so your test set covers layouts before your customers send them.
An SOV that contradicts the ACORD 140 behind it. Cross-document inconsistency is a difficulty setting, planted deliberately, so ground truth tells you exactly which value is right.
GPT-5.4 read a $42.0M revenue line and reported $21.65M. On an ACORD 45, GPT-5.5 got 7 of 9 building values wrong while the grand total still summed exactly right.
| Metric | GPT-5.5 | GPT-5.4 | Opus 4.7 | Sonnet 4.6 |
|---|---|---|---|---|
| Numeric hallucination rate | 11.4% | 11.9% | 3.4% | 5.2% |
| String hallucination rate | 5.8% | 5.9% | 2.5% | 3.1% |
| Fabricated values per document | 5.5 | 5.8 | 2.2 | 3.2 |
These are the documents your exact-match evals never see. The outputs look fine in isolation: well-formed currency, schema-valid rows, numbers that tie back to each other. You only catch this class of failure with labeled adversarial ground truth, and the full test is public.
A loss run your pipeline misreads, an SOV with a layout you dread, an ACORD packet. PDF, XLSX, CSV, scans, most filetypes work.
Up to 20 variants: same carrier layout and format, new underlying data, adversarial patterns injected where you want them so the model can't lean on memorization. Typical turnaround is 48 hours.
We placed the data, so the labels are exact: document, field, and bounding-box level. No annotation step, no SME bottleneck. Re-run the same suite on every model upgrade.
Email one hard document. You get up to 20 variants back: same layout, new data, adversarial patterns where you want them, ground truth attached.
Email one hard documentPrefer to write your own email? tim@aginor.ai works.