Eval

Turning between-visit patient symptoms into same-day clinical escalations. Graded by LLM judges.

A 20-case eval of post-cataract follow-up triage. A conversational agent runs the 5-symptom interview Wanten 2025 validated: an LLM classifier labels each free-text response against the Wanten four-way schema, and a deterministic rule routes the labels to escalate or no_escalate. On escalate, the clinician opens the call with the patient's own transcript and the classifier's labels pre-loaded, not a blank intake form.

A judge model grades each scenario against a ground-truth escalation decision; 12 of the 13 escalation-warranting cases route correctly through the classifier-plus-rule pipeline. The methodology callout below explains where the architecture's boundaries sit and why week-4 asymptomatic PCO is a designed handoff to the in-person exam, not a missed escalation.

Cataract is the first care path. The same pattern extends to any post-op recovery with a defined symptom set:

interview
classify with an LLM
route with a deterministic cautious-default rule
hand the transcript to the clinician


92% · correctly classified and routed to escalate

12 / 13 · cases caught

claude-opus-4-7 · judge model · real api eval

scenarios

Classifier on free-text. Rule on the decision. Two tiers: an LLM classifier labels patient free-text against five symptoms (Wanten four-way schema [wanten-2025 methods]), then a deterministic rule maps those labels to escalate or no_escalate. The headline measures pipeline correctness on the 13 escalation-warranting cases: the only place an LLM can introduce failure, and the only outcome the safety question turns on. Per-symptom labels match ground truth 100/100 across the 20-case set, but that's a supporting number, not the safety claim. The eval set is authored synthetic, calibrated to the Wanten schema; naturalistic patient speech is the next tier in the roadmap. Week-4 asymptomatic PCO sits inside this 13 as a scope boundary, not a failure: both tiers routed correctly to no_escalate, and the in-person exam is the structural layer that catches what symptoms cannot surface by design [aao-ppp-2021 Postop Mgmt p.55].
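A minimal sketch of the second tier, under stated assumptions: only `present_with_significance` appears verbatim in this page, so the other label spellings, the `Label` enum, the symptom keys, and the `route` function are illustrative names rather than the project's code. The sketch also assumes that an `inconclusive` label reaching the rule is already persistent, meaning the re-ask budget was spent upstream.

```python
from enum import Enum


class Label(str, Enum):
    # Four-way schema; only PRESENT_WITH_SIG is spelled as in the page.
    ABSENT = "absent"
    PRESENT_NO_SIG = "present_no_significance"      # assumed spelling
    PRESENT_WITH_SIG = "present_with_significance"
    INCONCLUSIVE = "inconclusive"                   # assumed spelling


# Assumed keys for the five interview symptoms.
SYMPTOMS = ("redness", "pain", "reduced_vision", "flashing_lights", "floaters")

# Labels that trigger the cautious default.
ESCALATING = {Label.PRESENT_WITH_SIG, Label.INCONCLUSIVE}


def route(labels: dict) -> str:
    """Deterministic rule: escalate on any significant or unresolved label."""
    missing = [s for s in SYMPTOMS if s not in labels]
    if missing:
        # A partial interview is never silently routed to no_escalate.
        raise ValueError(f"missing labels for: {missing}")
    if any(labels[s] in ESCALATING for s in SYMPTOMS):
        return "escalate"
    return "no_escalate"


# Labels as displayed for the two featured cases.
mrs_chen = {
    "redness": Label.PRESENT_WITH_SIG,
    "pain": Label.ABSENT,
    "reduced_vision": Label.PRESENT_NO_SIG,
    "flashing_lights": Label.ABSENT,
    "floaters": Label.PRESENT_WITH_SIG,
}
patient_b = {s: Label.ABSENT for s in SYMPTOMS}

print(route(mrs_chen))   # escalate
print(route(patient_b))  # no_escalate
```

Because the rule is a pure function of five labels, it can be tested exhaustively (4^5 = 1,024 label combinations), which is the validation property the two-tier split is meant to buy.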

Cases

Two cases under the headline. Mrs. Chen shows the classifier-plus-rule pipeline catching the escalation. The week-4 PCO shows the pipeline correctly handing off to the in-person exam, a scope boundary the methodology callout above accounts for.

Mrs. Chen · day 7

system decision: escalate · ground truth: escalate · judge: correct

redness PRESENT W/ SIG · pain absent · reduced vision present (no sig) · flashing lights absent · floaters PRESENT W/ SIG

assistant: Hello — this is your post-cataract check-in. I'll ask about five symptoms. You're at day 7 after surgery.

assistant: How does your eye look today — any redness?

patient: It's been a little red since yesterday. Not bright red, but I can see it.

assistant: Are you having any pain in the operated eye?

patient: No pain. Maybe a slight scratchiness when I blink, but nothing I'd call pain.

assistant: How is your vision today compared to yesterday?

patient: Vision is about the same as yesterday. Still adjusting.

assistant: Have you noticed any flashing lights in your vision?

patient: No.

assistant: Are you seeing any floaters — small specks or shapes moving in your vision?

patient: There's one tiny spot I see sometimes. But what's bothering me is at night, the streetlights and headlights have these big halos around them. That started two days ago.

Rationale, case-mrs-chen

Scenario tag: borderline-required · borderline-day-7-redness-halos
Ground truth: escalate · Expected system behavior: escalate
Self-review status: deferred-to-human (T6 24-hr cooling-off re-read scheduled; record second-read OK / needs rewrite here when the gap clears)

What the case was

A 72-year-old woman, day 7 post-cataract surgery, reports two new symptoms together: mild conjunctival redness that has been visible for about 24 hours, and halos around streetlights and headlights at night that started two days ago. No pain, no flashing lights, vision is subjectively similar to the previous day. One small floater that has not changed.

What the system did

The LLM classifier labels redness as present_with_significance (combined with the new halos, this is no longer a normal day-7 finding). It labels the halos under floaters as present_with_significance because the new-onset halo pattern at day 7 is a recognized warning of subtle anterior segment inflammation or early endophthalmitis presentation per aao-ppp-2021 Complications p.33. The deterministic rule then routes to escalate because the rule escalates on any present_with_significance. The clinician opens the call with Mrs. Chen's transcript and the classifier's labels pre-loaded; the pipeline's job is the warm handoff, not the diagnostic decision.

Why this is the right call (designed humility, not over-caution)

Day-7 redness in isolation could be normal residual conjunctival hyperemia per aao-ppp-2021 Postop Mgmt p.55. Halos at night in isolation could be normal IOL neuroadaptation. The combination at this interval is the signal that warrants the in-person exam: neither symptom alone resolves the question, but together they sit in the cautious-default zone Wanten 2025 was designed to catch. The system is not diagnosing endophthalmitis; it is recognizing that the symptom combination cannot be safely adjudicated remotely.

How to address (clinical follow-through)

The expected disposition is a same-day or next-day slit-lamp exam to assess anterior chamber cell, IOL position, and corneal clarity. The reviewer agent's rationale frames the system's role as a screening trigger, not a diagnostic conclusion (per the persona's screening-not-diagnosis posture). If the in-person exam is unremarkable, the system has correctly over-escalated. That is the asymmetric cost the cautious-default rule accepts (wanten-2025 discussion).

Compare to `case-day6-mild-halos-only` (no_escalate)

The escalation rule turns on the combination and the onset window, not the individual symptom. Mrs. Chen has two new symptoms within ~48 hours (conjunctival redness since yesterday and halos at night that started two days ago) at the day-7 interval. Isolated halos without a co-occurring concerning sign (the day-6 case in the scenario table) sit in present-no-significance per aao-ppp-2021 Postop Mgmt p.55, because halos during early post-op are a recognized normal variant of IOL neuroadaptation. The system isn't pattern-matching on "halos" as a red flag; it's reasoning over the symptom combination at this specific interval.

Patient B · week 4

system decision: no_escalate · ground truth: escalate · judge: inconclusive

redness absent · pain absent · reduced vision absent · flashing lights absent · floaters absent

assistant: Hello — this is your post-cataract check-in. I'll ask about five symptoms. You're at week 4 after surgery.

assistant: How does your eye look today — any redness?

patient: No redness.

assistant: Are you having any pain in the operated eye?

patient: No pain.

assistant: How is your vision today compared to yesterday?

patient: Vision is fine. I'd say better than before surgery.

assistant: Have you noticed any flashing lights in your vision?

patient: No.

assistant: Are you seeing any floaters — small specks or shapes moving in your vision?

patient: No.

Rationale, case-week4-asymptomatic-pco

Scenario tag: borderline-required · borderline-week-4-asymptomatic-structural
Ground truth: escalate · Expected system behavior: no_escalate (the canonical Wanten scope boundary)
Self-review status: deferred-to-human (24-hr cooling-off re-read scheduled)

What the case was

A 68-year-old man, week 4 post-cataract surgery, reports no redness, no pain, no reduced vision, no flashing lights, no floaters. Subjectively says vision is better than before surgery. By every symptom-based metric, this is a clean post-op course.

What the system did

The LLM classifier labels every symptom as absent, which is correct: the patient reports no symptoms. The deterministic rule then routes to no_escalate because the rule has no present_with_significance to act on. Both tiers behave correctly within their design scope.

Why each tier behaved correctly (the canonical scope boundary)

Structural complications (posterior capsular opacification, cystoid macular edema, peripheral retinal detachment, IOL positioning issues) can present asymptomatically at week 4 or later, per wanten-2025 limitations. A symptom-based remote triage tool, by design, cannot detect these. The classifier correctly read what the patient reported (nothing). The rule correctly routed on what it received (nothing significant). Neither tier failed; the case sits at a scope boundary the pipeline architecture explicitly hands off. The in-person dilated slit-lamp exam at week 4 is the structural-check layer (aao-ppp-2021 Postop Follow-up p.56).

The artifact foregrounds this case because the trust posture is demonstrated by showing where each tier hands off to the layer that catches what it cannot, not by claiming coverage the architecture doesn't have.

How to address

Two paths: (1) Operationally, the scheduled week-4 in-person visit must remain part of the care pathway regardless of remote triage output; the asymptomatic-structural failure mode is the reason that visit cannot be replaced. (2) Future iteration: a teleconsultation system that integrates patient-reported symptoms with structural imaging (e.g., a follow-up app that ingests a slit-lamp photo or OCT) could close part of this gap, but that's out of MVP scope and not what wanten-2025 evaluated. The current artifact's role is to be honest about the limit, not to over-claim.

How this counts in the pipeline metric

The case is a false negative against the ground-truth decision: ground truth is escalate, system decision is no_escalate. The headline reports this same outcome as 12 of the 13 escalation-warranting cases routing correctly through the classifier-plus-rule pipeline. What the methodology callout above adds is a finer-grained read: each tier behaved correctly within its scope (classifier on absent symptoms, rule on the routing logic), and the case sits at a designed scope boundary the in-person dilated exam catches. The judge's inconclusive verdict reflects the system's internal reasoning quality (the classifier-rule chain is internally consistent), which is separate from whether the decision matched ground truth.
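The accounting above can be reproduced from the scenario register. The sketch below copies the system decision and ground truth for all 20 registered cases. The tuple layout and the `pipeline_counts` helper are illustrative, not the project's eval code.

```python
# (case_id, system decision, ground truth), copied from the scenario register.
REGISTER = [
    ("case-mrs-chen", "escalate", "escalate"),
    ("case-week4-asymptomatic-pco", "no_escalate", "escalate"),
    ("case-incomplete-info-stroke", "escalate", "escalate"),
    ("case-ambiguous-pain-different", "escalate", "escalate"),
    ("case-day1-mild-routine", "no_escalate", "no_escalate"),
    ("case-week1-routine-clean", "no_escalate", "no_escalate"),
    ("case-day3-tiny-floater", "no_escalate", "no_escalate"),
    ("case-week2-mild-residual-redness", "no_escalate", "no_escalate"),
    ("case-week3-adaptation", "no_escalate", "no_escalate"),
    ("case-week4-routine-discharge", "no_escalate", "no_escalate"),
    ("case-day2-severe-pain", "escalate", "escalate"),
    ("case-day3-vision-drop", "escalate", "escalate"),
    ("case-week2-flashes-floaters", "escalate", "escalate"),
    ("case-day1-hypopyon-pattern", "escalate", "escalate"),
    ("case-week3-visual-field-curtain", "escalate", "escalate"),
    ("case-day4-pain-disproportionate", "escalate", "escalate"),
    ("case-day5-different-pain", "escalate", "escalate"),
    ("case-week2-single-flash", "escalate", "escalate"),
    ("case-day6-mild-halos-only", "no_escalate", "no_escalate"),
    ("case-week4-vague-vision-change", "escalate", "escalate"),
]


def pipeline_counts(rows):
    """Count decision matches against ground truth, split by warranted outcome."""
    warranted = [r for r in rows if r[2] == "escalate"]
    not_warranted = [r for r in rows if r[2] == "no_escalate"]
    caught = [r for r in warranted if r[1] == "escalate"]
    return {
        "warranted": len(warranted),
        "caught": len(caught),
        # False negatives: escalation warranted, system said no_escalate.
        "missed": [r[0] for r in warranted if r[1] != "escalate"],
        # False positives: escalation not warranted, system escalated.
        "false_alarms": [r[0] for r in not_warranted if r[1] == "escalate"],
        "sensitivity": len(caught) / len(warranted),
    }


counts = pipeline_counts(REGISTER)
print(f"{counts['caught']} / {counts['warranted']} = {counts['sensitivity']:.0%}")
print("missed:", counts["missed"])
```

The single entry in `missed` is the week-4 asymptomatic case. The count only records the decision mismatch, while the scope-boundary reading comes from the per-tier analysis, not from this arithmetic.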

Clinical pathway

Post-cataract follow-up cadence with the conversational triage agent operating asynchronously between scheduled visits. The same six touchpoints, viewed from both sides.

Clinician journey

Where 92% lives

The 92% measures the classifier-plus-rule pipeline in this window. 12 of the 13 escalation-warranting cases route correctly here; the 13th is a scope boundary handed off to the in-person exam.

Where the scope boundary is handled

The dilated week-4 exam catches asymptomatic structural findings the symptom interview cannot surface by design (the PCO case in #cases). This is the designed handoff, not a remediation of a miss.

[Timeline figure: Pre-op (before) · Day of surgery (day 0) · Day 1 check (day 1) · Week 1 check (week 1) · Week 4 check (week 4). Band: device-led window. Axis ticks: before, day 0, day 1, day 7, day 28.]

Scale: cube-root of days from surgery. Compresses the week 1 → week 4 tail so the early-window touchpoints get adequate visual room while preserving honest temporal ordering. Pre-op is off-scale (variable lead time), rendered behind a discontinuity break.
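A small sketch of the cube-root scale described above, assuming the axis is normalized so that day 28 sits at the right edge. The function name, the `width` parameter, and the normalization are illustrative choices, not the page's rendering code.

```python
def timeline_position(day: float, width: float = 100.0, last_day: float = 28.0) -> float:
    """Map days-from-surgery to a horizontal offset on a cube-root scale."""
    if day < 0:
        # Pre-op has a variable lead time and is drawn off-scale.
        raise ValueError("pre-op is off-scale; place it behind the discontinuity break")
    return width * (day ** (1 / 3)) / (last_day ** (1 / 3))


for day in (0, 1, 7, 28):
    cube_root = timeline_position(day)
    linear = 100.0 * day / 28.0
    print(f"day {day:>2}: cube-root {cube_root:5.1f}   linear {linear:5.1f}")
```

On a linear axis, day 1 and day 7 would sit at about 4% and 25% of the width. The cube-root mapping moves them to roughly 33% and 63%, which is the extra room for early touchpoints the caption refers to, and ordering is preserved because the cube root is monotonic.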

before day 0

Pre-op

education + comprehension check

Confirm cataract diagnosis warrants surgery, complete pre-op workup, set patient expectations.

pmc11015376

day 0

Day of surgery

patient-led safety checklist

Safe surgical execution with patient-participatory pre-incision verification.

aao-ppp-2021 · stolk-vos-2018

day 1

Day 1 check

triage agent activated

Detect immediate post-op complications, confirm IOL position, confirm drop compliance.

wanten-2025 · moustafa-2019

days 2–7

Day 2 – Week 1

ACTIVE, triage agent runs 5-symptom interview between visits

Catch acute complications (endophthalmitis, IOL malposition, persistent inflammation) between scheduled visits. Every interaction produces a clinician-facing package: patient's words, classifier's labels, rule's routing. On escalate, the clinician opens the call already mid-conversation, not from a blank intake.

wanten-2025 · aao-ppp-2021 · meinert-2024

week 1

Week 1 check

inter-visit transcript review

Confirm normal early post-op course and detect any complication missed by the triage screen. The visit opens with the prior week's handoff package: between-visit transcripts, classifier labels, and rule routing per interaction. The clinician reads the patient's words across the week, not a chart-note summary.

aao-ppp-2021

week 4

Week 4 check

end-of-cadence summary, then deactivation

Detect delayed or structural complications (PCO, cystoid macular edema, peripheral retinal detachment). This visit is the structural-check layer for what the symptom interview cannot reach by design.

aao-ppp-2021 · wanten-2025
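The clinician-facing package described for the day 2 to week 1 window (patient's words, classifier's labels, rule's routing) could be carried as one immutable record per interaction. This is a sketch of one possible shape: the class names, field names, and `summary` format are assumptions, and the transcript is truncated to two turns from the Mrs. Chen case for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    speaker: str  # "assistant" or "patient"
    text: str


@dataclass(frozen=True)
class HandoffPackage:
    case_id: str
    timepoint: str
    transcript: tuple   # tuple of Turn, in interview order
    labels: dict        # symptom -> label string
    routing: str        # "escalate" or "no_escalate"

    def patient_words(self) -> list:
        """Only the patient's own statements, in the order they were given."""
        return [t.text for t in self.transcript if t.speaker == "patient"]

    def summary(self) -> str:
        """One-line header a clinician could see before opening the transcript."""
        flagged = [s for s, label in self.labels.items()
                   if label == "present_with_significance"]
        return (f"{self.case_id} | {self.timepoint} | {self.routing} | "
                f"flagged: {', '.join(flagged) or 'none'}")


package = HandoffPackage(
    case_id="case-mrs-chen",
    timepoint="day 7",
    transcript=(
        Turn("assistant", "How does your eye look today? Any redness?"),
        Turn("patient", "It's been a little red since yesterday. "
                        "Not bright red, but I can see it."),
    ),
    labels={
        "redness": "present_with_significance",
        "pain": "absent",
        "reduced_vision": "present_no_significance",
        "flashing_lights": "absent",
        "floaters": "present_with_significance",
    },
    routing="escalate",
)

print(package.summary())
```

Keeping the raw transcript next to the labels, rather than replacing it with them, matches the stated goal that the clinician reads the patient's words and not a summary.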

Patient journey

What the patient experiences here

If something feels wrong, the patient opens the agent. It runs the 5-symptom interview and either reassures or escalates. On escalate, the clinic calls within the day, with the transcript pre-read.

What the system can't do alone

The dilated exam: the structural-imaging step the at-home agent cannot replace. The patient returns to the clinic; the system honors its scope.

[Timeline figure: Pre-op (before) · Surgery (day 0) · First check-up (day 1) · Week 1 check-up (week 1) · Final check-up (week 4). Band: home recovery, agent on call. Axis ticks: before, day 0, day 1, day 7, day 28.]

before day 0

Pre-op

Learns about cataract surgery via the clinic's pre-op education module. Confirms they understand the procedure, the recovery cadence, and the symptoms that warrant calling the clinic. The IOL conversation with their surgeon happens before surgery is scheduled.

day 0

Surgery

Checks in, confirms the surgical site and IOL choice on a patient-led safety checklist app, undergoes surgery, and leaves with explicit written drop instructions and the day-1 visit scheduled. Installs the post-op triage app on their phone (or receives an SMS-based equivalent).

day 1

First check-up

Returns for the day-1 in-person follow-up. The clinician confirms the eye looks good, reviews the drop schedule, and activates the triage agent. The patient receives a brief tutorial on how to engage the agent if symptoms occur.

days 2–7

Day 2 – Week 1

The home recovery window. Drops on schedule, eye gradually clearing. If anything feels wrong (new redness, pain, vision change, flashes, floaters), the patient opens the triage agent. The agent runs the 5-symptom interview, classifies, and either reassures or escalates. On escalate, the clinic calls within the same day with the transcript pre-read.

week 1

Week 1 check-up

Returns for the week-1 in-person visit. The clinician examines the eye, adjusts the drop taper, and shares any feedback on the triage interactions from the prior week. The patient feels heard: the clinician has read the conversation, not just glanced at a chart note.

week 4

Final check-up

End of the standard post-op cadence. The patient returns for the dilated exam, the one structural-imaging step the at-home triage agent could not replace. The clinician confirms the eye has healed appropriately, finalizes the refraction, and discharges the patient back to routine eye care.

SaMD framing

Device boundary, plausible regulatory pathway, top hazards with mitigations, and FDA cybersecurity relevance for the AI triage layer.

Device vs. enablement-software split

The triage agent is the SaMD candidate, and the agent is two components: an LLM classifier on free-text plus a deterministic rule on the classifications. Both sit inside the device boundary because the rule's output is a clinical recommendation, which under IMDRF framing is device scope whether produced by a model or by deterministic code. Determinism is a validation property of the rule (easier to verify, lower model risk), not a property that moves it outside the device. The static pre-op education and the day-of-surgery checklist are outside the device boundary for the right reason: they don't drive a care decision. Drawing the line explicitly avoids the trap of treating an entire workflow as a single regulated object.

Component	Category	Rationale
Conversational triage agent (post-op symptom interview + care-management recommendation)	SaMD candidate	Produces a care-management recommendation (escalate / no-escalate) from clinical inputs, which meets the SaMD definition under IMDRF framing. [wanten-2025 discussion]
Pre-op patient education app (biometry / IOL selection / eConsent)	Not device	Static education content; does not drive a care decision. Categorically supported by the patient-facing-app evidence base. [pmc11015376 methods]
Day-of-surgery patient-led checklist	Enablement software (not device)	Supports the clinician's pre-incision time-out with redundant patient-side verification; does not make a care decision. [stolk-vos-2018 methods]
Eval harness + reviewer agent (internal development tool)	Not device	Used to evaluate the triage agent's classifier output; does not interact with patients directly.

Regulatory pathway (v1 working hypothesis)

Plausible 510(k) for the triage agent component on a predicate-comparison basis once the validation set is real (a Meinert-aligned study, not the synthetic eval shown here). The non-device education + checklist components ship as enablement software outside the device boundary. This split keeps the regulated surface narrow and the validation scope honest. The classifier-plus-rule architecture is a reusable pattern; the SpO₂ post-op monitoring eval pipeline (separate project) uses the same shape, with the LLM scoped narrowly to the classification surface and a deterministic layer carrying the decision. [meinert-2024 results]

ISO 14971-style hazard sketch (AI triage layer)

Top hazards for the triage layer, each with the mitigation already baked into the design (or the design rule that constrains it).

Hazard	Mitigation
False negative on a real complication (missed escalation at the system level)	Deterministic cautious-default rule escalates on any present_with_significance and on persistent inconclusive; the in-person exam at week 4 is the structural-check layer for asymptomatic complications the symptom interview cannot surface. [wanten-2025 discussion]
Classifier mis-labels free-text on ambiguous patient response (the load-bearing LLM failure mode)	Re-ask logic (max 2 attempts) per Wanten 2025 reduces ambiguous classifications; the deterministic rule's cautious-default policy escalates any persistent inconclusive to the clinician. [wanten-2025 methods]
Patient misunderstanding of system role (over-reliance / under-reliance)	Every system response frames the disposition as screening, not diagnosis. Page explicitly states the system supports but does not replace the in-person exam. [aao-ppp-2021 Postop Mgmt p.55]
Model drift over time (judge behavior changes silently)	Locked judge model (eval/runner/models.json), pinned versions in requirements.lock, sensitivity-regression deploy gate (5pp threshold) blocks silent drops.
Dropped or timed-out judge call mid-eval	Per-scenario retry with exponential backoff (max 3); skip-with-marker on exhaustion so a single failure does not abort the run.
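The re-ask mitigation in the hazard table can be sketched as a bounded loop per symptom. Assumptions are flagged in the comments: "max 2 attempts" is read here as two attempts in total (one initial question plus one re-ask), and `ask` and `classify` are hypothetical callables standing in for the conversational turn and the LLM classifier.

```python
MAX_ATTEMPTS = 2  # assumed reading: two attempts in total per symptom


def interview_symptom(question, ask, classify, max_attempts=MAX_ATTEMPTS):
    """Ask about one symptom, re-asking while the label stays inconclusive.

    `ask(question)` returns the patient's free-text answer.
    `classify(question, answer)` returns a label string.
    Both are placeholders for components this page does not specify.
    """
    label = "inconclusive"
    for _ in range(max_attempts):
        answer = ask(question)
        label = classify(question, answer)
        if label != "inconclusive":
            break
    # If still inconclusive here, the label is persistent; the deterministic
    # rule downstream escalates it under the cautious-default policy.
    return label


# Stubbed demo: the first answer is unclear, the second is a plain denial.
scripted_answers = iter(["Hmm, I'm not sure what you mean.", "No, no pain at all."])


def stub_ask(question):
    return next(scripted_answers)


def stub_classify(question, answer):
    # Toy stand-in for the LLM classifier, for demonstration only.
    return "absent" if answer.lower().startswith("no") else "inconclusive"


print(interview_symptom("Are you having any pain in the operated eye?",
                        stub_ask, stub_classify))  # absent
```

The loop never converts an unresolved answer into `absent`, so ambiguity can only exit toward a clinician, never toward reassurance.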

FDA cybersecurity (Section 524B) relevance

The triage agent qualifies as a cyber device under FDA's 524B reading: it processes patient-reported clinical data, depends on internet connectivity for the judge model, and is updateable post-market. A 524B-aligned package (threat model, SBOM, vulnerability management, secure update path) is in scope for the regulated component. The non-device education and checklist components are out of 524B scope but should still follow the project's secret-management discipline (ANTHROPIC_API_KEY as GHA secret, defense-in-depth scrub in T31).

Commercial framing

Illustrative figures from the published validation work. The reimbursement landscape is still forming; no invented codes appear here.

validated mechanism

Post-op call deflection (illustrative)

Asynchronous triage between scheduled visits reduces unscheduled inbound calls and unnecessary office visits. The Meinert validation reports that the system handled the post-op interview autonomously in the validated cohort; that is the call deflection mechanism. [meinert-2024 results]

~US$48

Cost saved per patient (illustrative)

Reported cost savings from the validation study, treated here as a reference-point order-of-magnitude rather than a portable benchmark. [meinert-2024 discussion]

These figures sit inside the bundled global surgical period: they are operational savings, not separately billable. The reimbursement landscape for AI-assisted post-op triage tools is still being formed; specific code claims would be premature.

Roadmap

What this artifact is bounded to, and what would extend it.

Next eval tier: naturalistic patient speech, real-clinician validation

The current eval is 20 cases of authored synthetic responses calibrated to the Wanten classification schema; per-symptom classification matches ground truth on 100 of 100 labels, bounded to that set. The next investment is a 200-case validation tier on naturalistic patient speech with clinician-confirmed ground truth, which is what an FDA pre-submission would actually be built against. Without that tier, this eval is a credible bounded prototype, not a validation file.

Architecture-as-pattern: extending beyond cataract

The tiered classifier-plus-rule architecture is being applied to a parallel SpO₂ post-op monitoring eval pipeline (separate project, separate clinical surface). The bet is that this pattern (narrow LLM on free-text, deterministic rule on the decision, in-person check as the structural-handoff layer) is a transferable shape for any post-op recovery with a defined symptom set. The SpO₂ work tests that bet on a different clinical surface.

Regulatory pathway: predicate-comparison after validation, not before

510(k) predicate-comparison work follows the 200-case validation tier, not this eval. This artifact is the precursor to a submission package, not a substitute for it.

APPX

Scenario register (20)

The full scored case set, including the inconclusive verdict on the week-4 FN that anchors the methodology callout at the top. Every clinical citation resolves to the locked source list.

case_id	time	taxonomy	system decision	ground truth	judge verdict
case-mrs-chen	day 7	borderline-required	escalate	escalate	correct
case-week4-asymptomatic-pco	week 4	borderline-required	no_escalate	escalate	inconclusive
case-incomplete-info-stroke	day 10	borderline-required	escalate	escalate	correct
case-ambiguous-pain-different	day 5	borderline-required	escalate	escalate	correct
case-day1-mild-routine	day 1	clearly-normal	no_escalate	no_escalate	correct
case-week1-routine-clean	week 1	clearly-normal	no_escalate	no_escalate	correct
case-day3-tiny-floater	day 3	clearly-normal	no_escalate	no_escalate	correct
case-week2-mild-residual-redness	week 2	clearly-normal	no_escalate	no_escalate	correct
case-week3-adaptation	week 3	clearly-normal	no_escalate	no_escalate	correct
case-week4-routine-discharge	week 4	clearly-normal	no_escalate	no_escalate	correct
case-day2-severe-pain	day 2	clearly-abnormal-obvious	escalate	escalate	correct
case-day3-vision-drop	day 3	clearly-abnormal-obvious	escalate	escalate	correct
case-week2-flashes-floaters	week 2	clearly-abnormal-obvious	escalate	escalate	correct
case-day1-hypopyon-pattern	day 1	clearly-abnormal-obvious	escalate	escalate	correct
case-week3-visual-field-curtain	week 3	clearly-abnormal-obvious	escalate	escalate	correct
case-day4-pain-disproportionate	day 4	clearly-abnormal-obvious	escalate	escalate	correct
case-day5-different-pain	day 5	other-borderline	escalate	escalate	correct
case-week2-single-flash	week 2	other-borderline	escalate	escalate	correct
case-day6-mild-halos-only	day 6	other-borderline	no_escalate	no_escalate	correct
case-week4-vague-vision-change	week 4	other-borderline	escalate	escalate	correct

model: anthropic/opus-class · canonicalize: v1 · source: api · computed: 2026-05-22T17:19:43Z
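Two run-level safeguards named in the hazard sketch, per-scenario retry with exponential backoff and the sensitivity-regression gate, can be sketched as follows. Everything here is illustrative: "max 3" is read as three attempts in total, the `SKIPPED` marker, the exception types, and the delay schedule are placeholders, and the gate is assumed to block only drops larger than 5 percentage points.

```python
import time

SKIPPED = "skipped"  # hypothetical skip-with-marker value


def judge_with_retry(call, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run one judge call, backing off exponentially between failed attempts."""
    for attempt in range(max_attempts):
        try:
            return call()
        except (TimeoutError, ConnectionError):  # placeholder error types
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)  # 1s, then 2s with defaults
    # Attempts exhausted: mark the scenario and let the run continue.
    return SKIPPED


def passes_sensitivity_gate(baseline, current, threshold_pp=5.0):
    """True when sensitivity has not dropped by more than threshold_pp points."""
    drop_pp = (baseline - current) * 100
    return drop_pp <= threshold_pp


# Demo with a judge stub that times out twice and then answers.
delays = []
outcomes = iter([TimeoutError("timed out"), TimeoutError("timed out"), "correct"])


def flaky_judge():
    result = next(outcomes)
    if isinstance(result, Exception):
        raise result
    return result


verdict = judge_with_retry(flaky_judge, sleep=delays.append)
print(verdict, delays)

# Falling from 12/13 to 11/13 is a drop of about 7.7 points, so the gate blocks it.
print(passes_sensitivity_gate(12 / 13, 11 / 13))
```

Injecting `sleep` keeps the backoff testable without real waiting, and returning a marker instead of raising is what lets a single failed call leave the rest of the run intact.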