Test data for AI pipelines

Synthetic test data,
correct by construction.

Documents for extraction pipelines. Security logs for detection agents. The generator places every value, so the labels are exact, nothing comes from customer data, and every re-run is identical. At max difficulty the documents break every frontier model we tested, and the test set is public.

Email one hard document · See the extraction test

No forms. Send a doc to tim@aginor.ai and get up to 20 labeled variants back in about 48 hours. Security logs start from the scenario catalog.

Loss run · generated seed 4127 · variant 12 of 20

Claim	Status	Paid	Incurred
WC-2019-0244	Closed	4,120.00	4,120.00
WC-2020-1187	Open	18,275.50	20,575.50
GL-2021-0086	Closed	1,942.16	1,942.16
Total incurred			31,206.98

Planted error: the Incurred rows sum to 26,637.66, not 31,206.98. The generator placed the wrong total in the document and wrote the right one into the ground truth.
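The reconciliation above can be checked by hand or in a few lines. This is a minimal sketch that uses only the Incurred values and the stated total from the sample loss run; the `reconcile` helper is a hypothetical name for illustration, not part of any published tooling.

```python
from decimal import Decimal

# Incurred values and stated total copied from the sample loss run above.
incurred = [Decimal("4120.00"), Decimal("20575.50"), Decimal("1942.16")]
stated_total = Decimal("31206.98")

def reconcile(rows, stated):
    """Return (computed_sum, stated_minus_computed) for a stated total."""
    computed = sum(rows, Decimal("0"))
    return computed, stated - computed

computed, diff = reconcile(incurred, stated_total)
print(computed)  # 26637.66
print(diff)      # 4569.32
```

A non-zero difference means the document's total disagrees with its own rows, which is the condition the planted error is designed to create.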

The problem

The test set you need is the one you can't build from production data.

Customer data is off-limits

HIPAA, SOC 2, DPAs. The hard cases you'd most want to regression-test against are the ones you can't legally keep, reuse, or share with a new model vendor.

Edge cases multiply with every customer

Every new customer brings template variants, payer formats, scan quality, and tenant behavior your prompts didn't anticipate. The long tail outpaces any internal labeling effort.

Model upgrades shift pipeline behavior

A new frontier model drops and your numbers change. Without a fixed corpus you can re-run, you can't tell whether the upgrade helped, hurt, or broke specific cases.

Ambiguity is the hard part

"Premium" means three things across forms. A closed claim wears an open status. A benign login looks like an attack. Test data has to include the ambiguous cases on purpose, or you never learn whether your system survives them.

What failure looks like

Exhibit · from the public extraction test

GPT-5.4 read a $42.0M revenue line and reported $21.65M. GPT-5.5 fabricated revenue and COGS that subtract exactly to the real gross profit.

A loss run where the claim rows don't reconcile to the totals · insurance

An EOB where line items don't sum to the paid amount · healthcare

Fabricated values constructed to pass your column-sum checks · documents

An attack that's 0.2% of 50,000 events while a quarter of the stream mimics one · security

Every artifact ships with exact labels, so misses like these are measurable · see the data
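The fabricated-revenue exhibit is why an internal consistency check is not enough by itself. The sketch below uses made-up numbers (they are not taken from the public test set) and hypothetical helper names to show an extraction that passes a subtraction check and still has two wrong fields when compared against ground truth.

```python
from decimal import Decimal

# Hypothetical ground truth for one income statement (illustrative values only).
truth = {
    "revenue": Decimal("100.0"),
    "cogs": Decimal("60.0"),
    "gross_profit": Decimal("40.0"),
}

# Hypothetical model output: revenue and COGS are both wrong,
# but they still subtract to the correct gross profit.
extracted = {
    "revenue": Decimal("85.0"),
    "cogs": Decimal("45.0"),
    "gross_profit": Decimal("40.0"),
}

def internally_consistent(doc):
    """Check that revenue minus COGS equals gross profit."""
    return doc["revenue"] - doc["cogs"] == doc["gross_profit"]

def wrong_fields(doc, labels):
    """List the fields whose extracted value differs from the label."""
    return sorted(k for k in labels if doc.get(k) != labels[k])

print(internally_consistent(extracted))  # True
print(wrong_fields(extracted, truth))    # ['cogs', 'revenue']
```

Only the comparison against exact labels catches the error; the arithmetic check reports a clean document.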

How it works

Send one hard case

Email a document your pipeline misreads, or pick a scenario from the security log catalog. PDF, XLSX, CSV, scans, or Okta System Log streams.

The engine generates

Config-driven, deterministic generation. Up to 20 document variants or a 50,000-event log stream, with ground truth written in the same pass as the data. Typical document turnaround is 48 hours.

You test, then re-run forever

The labels are exact because the generator placed every value. No annotation step, no SME bottleneck. Same seed, same output, so every model upgrade is measured against a fixed corpus.
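As a rough illustration of "same seed, same output" with ground truth written in the same pass, here is a minimal sketch. It is not Aginor's engine: `generate_loss_run`, its parameters, and the value ranges are all assumptions made for the example.

```python
import random
from decimal import Decimal

def generate_loss_run(seed, n_claims=3, plant_wrong_total=True):
    """Hypothetical generator: return (document, ground_truth) from one seed."""
    rng = random.Random(seed)  # local RNG, so output depends only on the seed
    rows = []
    for i in range(n_claims):
        cents = rng.randint(50_000, 2_500_000)
        rows.append({
            "claim": f"WC-{2019 + i}-{rng.randint(1, 9999):04d}",
            "incurred": Decimal(cents) / 100,
        })
    true_total = sum((r["incurred"] for r in rows), Decimal("0"))
    # The planted error goes into the document; the correct value goes into the labels.
    offset = Decimal(rng.randint(1, 500_000)) / 100 if plant_wrong_total else Decimal("0")
    document = {"rows": rows, "total_incurred": true_total + offset}
    ground_truth = {"total_incurred": true_total, "total_is_wrong": plant_wrong_total}
    return document, ground_truth

# Re-running with the same seed reproduces the document and its labels exactly.
first = generate_loss_run(seed=1)
second = generate_loss_run(seed=1)
print(first == second)  # True
```

Because the labels come from the same variables that produced the document, there is no separate annotation step that could disagree with the data.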

Where it's deep

Insurance carriers and MGAs

SOVs, loss runs, ACORD forms, and full submission packets. 84 carrier templates, 19 ACORD forms, 65 adversarial patterns, with ground truth at the document, field, and bounding-box level.

SOVs · Loss runs · 19 ACORD forms · Dec pages · Full submissions

Insurance test data

Healthcare RCM teams

EOBs, CMS-1500s, UB-04s, and remittances through the clone-and-variants workflow. HIPAA means real claims can't become your test corpus. These documents are never derived from customer data, so nothing is off-limits.

EOBs · CMS-1500s · UB-04s · Remittances

Healthcare test data

Detection and AI SOC teams

Okta System Log scenarios with attacks, false-positive noise, and a realistic benign baseline. Deterministically generated, with investigation questions and answers as the ground truth.

Okta System Log · Attack scenarios · FP noise · Kill chains

Security log test data
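To see why the false-positive noise matters, take the figures quoted earlier on this page: an attack that is 0.2% of a 50,000-event stream, with a quarter of the stream mimicking one. The sketch below is back-of-envelope arithmetic only. It assumes the mimic events and attack events do not overlap, and it assumes a naive detector that flags everything attack-like; neither assumption comes from the scenario catalog.

```python
TOTAL_EVENTS = 50_000

attack_events = TOTAL_EVENTS * 2 // 1000  # 0.2% of the stream
mimic_events = TOTAL_EVENTS // 4          # a quarter of the stream looks like an attack

# Assumption: a naive detector flags every attack event and every mimic.
flagged = attack_events + mimic_events
precision = attack_events / flagged

print(attack_events, mimic_events)  # 100 12500
print(f"{precision:.2%}")           # 0.79%
```

Under those assumptions, fewer than 1 in 100 alerts is real, which is the situation an answer key lets you measure instead of guess at.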

Proof

The difficulty dial goes further than you need.

Difficulty is a config value. It starts at clean digital output and ends at the Nightmare tier: 148 adversarial insurance documents across 5 levels. On the clean tier, every frontier model we tested fabricates under 1% of numeric values. On the hardest tiers, every model climbs past 6% and the OpenAI flagships exceed 17%. The documents, ground truth, and scoring code are public.

It also answers a fair question: why not generate synthetic docs with an LLM? Because hard cases have to stay internally consistent. Totals that reconcile, values that agree across a 30-document packet, labels that are actually correct. LLM generation can't hold that together. A fuzzing engine with correct-by-construction ground truth can.

The security log engine runs on the same discipline: deterministic seeds, answer keys written with the stream, identical on every re-run.

148 adversarial documents · 5 difficulty tiers · 5 frontier models tested · 9 document categories

Read the extraction test · Dataset on HuggingFace · Code on GitHub

Tim Michaud, Founder

YC Alum · Previously Staff Security Eng @ Moveworks (acq. ServiceNow)

At Moveworks I spent years breaking 250+ AI agents serving Fortune 500 companies as the security eng on those rollouts. Before that I spent a decade in security research finding bugs in Apple, Chrome, and Qualcomm. Aginor came from putting those two things together: I know what production data does to agents, and I know how to generate the inputs that break them.

Start with one hard case.

Email a document your pipeline struggles with and get up to 20 labeled variants back in about 48 hours, or ask for the 42-scenario security log catalog. Ground truth attached either way.

Email one hard document

Testing detections? Ask for the log scenario catalog. Or just write tim@aginor.ai.
