For RCM engineering and product teams

Synthetic EOBs, CMS-1500s, and UB-04s with ground truth built in.

Test data for claims extraction pipelines. Real claims can't become test corpora under HIPAA. These documents are generated, never derived from customer data, with deterministic labels for every field.

Email one hard document · See the extraction test

No forms. Send a doc to tim@aginor.ai, get up to 20 labeled variants back in about 48 hours.

The problem

HIPAA means your best test cases are locked away.

Real claims can't be test corpora

Every hard EOB in production is PHI. You can't keep it, can't share it with a model vendor, can't check it into an eval harness. The documents you most need to test against are the ones compliance takes off the table.

Clean test sets hide the failures

An EOB where line items don't sum to the paid amount is the failure your clean test set never catches. Payers produce documents like that every day. Test data has to include them on purpose.
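A minimal sketch of the reconciliation check that such a document is meant to exercise. The function name, field names, and dollar amounts are invented for illustration and do not come from a real payer document; amounts are parsed as `Decimal` so the comparison is exact.

```python
from decimal import Decimal


def reconciles(line_item_paid, total_paid):
    """Return True when the line-item payments add up to the stated total."""
    line_sum = sum((Decimal(x) for x in line_item_paid), Decimal("0"))
    return line_sum == Decimal(total_paid)


# Hypothetical EOB values, invented for illustration only.
clean_eob = {"lines": ["120.00", "45.50", "0.00"], "total_paid": "165.50"}
hard_eob = {"lines": ["120.00", "45.50", "0.00"], "total_paid": "156.50"}

print(reconciles(clean_eob["lines"], clean_eob["total_paid"]))
print(reconciles(hard_eob["lines"], hard_eob["total_paid"]))
```

A test set built only from documents like `clean_eob` never shows what the pipeline does when this check fails.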

Payer format sprawl

Every payer formats EOBs and remits differently, and new variants keep coming. The long tail outpaces any internal labeling effort, and clinician labeling time is the most expensive kind.

Model upgrades shift accuracy

A new model drops and your extraction numbers move. Without a fixed labeled corpus you can re-run, you can't tell what got better and what silently broke on specific payer formats.
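One way to use a fixed corpus for this, sketched under assumptions: each run yields one `(payer, correct_fields, total_fields)` tuple per document, and the payer names and counts below are placeholders. The point is that an aggregate score can hold steady or move slightly while one payer format drops sharply.

```python
from collections import defaultdict


def accuracy_by_payer(results):
    """Aggregate field-level accuracy per payer.

    results: iterable of (payer, correct_fields, total_fields), one per document.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for payer, ok, n in results:
        correct[payer] += ok
        total[payer] += n
    return {payer: correct[payer] / total[payer] for payer in total}


def regressions(old_results, new_results, tolerance=0.0):
    """Return payers whose accuracy fell by more than `tolerance` between runs."""
    old = accuracy_by_payer(old_results)
    new = accuracy_by_payer(new_results)
    return {
        payer: (old[payer], new[payer])
        for payer in old
        if payer in new and new[payer] < old[payer] - tolerance
    }


# Placeholder numbers for two runs over the same fixed corpus.
old_run = [("payer_a", 18, 20), ("payer_b", 19, 20)]
new_run = [("payer_a", 20, 20), ("payer_b", 15, 20)]

print(regressions(old_run, new_run))
```

Here the new run improves on `payer_a` and regresses on `payer_b`, which a single corpus-wide number would blur together.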

What the difficulty dial generates

Exhibit · failure cases, generated on purpose

An EOB where line items don't sum to the paid amount. The failure your clean test set never catches.

Fax-quality scans, bleed-through, handwritten annotations · scan effects

The same field wearing different labels across payer formats · ambiguity

An EOB and its remit that disagree, with ground truth recording which is right · cross-doc

Totals engineered to pass your column-sum sanity checks · arithmetic

How it works

Email one document

A hard EOB, CMS-1500, UB-04, or remit. A de-identified sample or a representative template, whatever your compliance team allows. PDF, XLSX, CSV, scans.

The engine generates variants

Up to 20 variants: same layout and format, entirely new synthetic data, difficulty dialed where you need it. No LLM anywhere in the generation path. Typical turnaround is 48 hours.

You test, then re-run forever

Deterministic ground truth ships with every variant. No annotation queue, no clinician labeling bottleneck. The suite re-runs unchanged on every model upgrade.

Proof

The same engine broke five frontier models on insurance documents.

Healthcare runs on the generation engine we built for insurance: config-driven fuzzing with deterministic, correct-by-construction ground truth. On the insurance corpus, every frontier model we tested fabricated under 1% of numeric values on clean documents and more than 6% on the hardest tiers. The OpenAI flagships exceeded 17%. GPT-5.4 read a $42.0M revenue line and reported $21.65M. Insurance is where the prebuilt template library is deepest today; healthcare runs clone-and-variants, building around the documents you send.

Why not have an LLM draft synthetic EOBs? Because hard cases have to stay internally consistent: line items that sum, values that agree across the claim, labels that are right every time. LLM generation can't guarantee any of that. Correct-by-construction generation can.
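To illustrate the general idea of correct-by-construction generation, here is a toy sketch. It is not the engine described above; `generate_eob`, its parameters, and the field names are hypothetical. Values are drawn from a seeded random generator, and the labels are written from those same values, so they cannot disagree with the document, even when a mismatch is injected on purpose.

```python
import random


def generate_eob(seed, n_lines=3, inject_mismatch=False):
    """Build a toy EOB record and its labels from the same seeded values."""
    rng = random.Random(seed)  # same seed always yields the same document
    lines_cents = [rng.randrange(1_000, 50_000) for _ in range(n_lines)]
    true_total = sum(lines_cents)

    printed_total = true_total
    if inject_mismatch:
        # Deliberate hard case: the printed total no longer matches the lines.
        printed_total += rng.choice([-900, -90, 90, 900])

    document = {"line_paid_cents": lines_cents, "total_paid_cents": printed_total}
    ground_truth = {
        "line_paid_cents": list(lines_cents),
        "total_paid_cents": printed_total,  # what the page actually says
        "lines_sum_cents": true_total,      # what the lines add up to
        "totals_reconcile": printed_total == true_total,
    }
    return document, ground_truth


doc, truth = generate_eob(seed=7, inject_mismatch=True)
print(doc)
print(truth)
```

Because the labels are computed rather than annotated, the injected inconsistency is recorded as a fact about the document instead of surfacing later as a labeling error.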

148 adversarial documents · 5 difficulty tiers · 5 frontier models tested · public dataset and scoring code

Read the extraction test · Dataset on HuggingFace · Code on GitHub

Send the EOB that breaks your pipeline.

Email one hard document, de-identified or representative. You get up to 20 labeled variants back, same layout, new synthetic data, ground truth attached, in about 48 hours.

Email one hard document

Prefer to write your own email? tim@aginor.ai works.