92% sensitivity (should-escalate)
12 / 13 cases caught
Judge model: claude-opus-4-7 · real API eval
20 scenarios

12 of 13 should-escalate cases caught — and here's the one we missed, and why a symptom-based screen can't see it.

Two cases sit under the headline. Mrs. Chen is the kind of complication the system is built to catch. The week-4 PCO (posterior capsule opacification) case is the one this class of tool cannot catch; that is what the methodology above accounts for.

Mrs. Chen · day 7

system decision: escalate · ground truth: escalate · judge: correct
redness PRESENT W/ SIG · pain absent · reduced vision present (no sig) · flashing lights absent · floaters PRESENT W/ SIG

assistant: Hello — this is your post-cataract check-in. I'll ask about five symptoms. You're at day 7 after surgery.

assistant: How does your eye look today — any redness?

patient: It's been a little red since yesterday. Not bright red, but I can see it.

assistant: Are you having any pain in the operated eye?

patient: No pain. Maybe a slight scratchiness when I blink, but nothing I'd call pain.

assistant: How is your vision today compared to yesterday?

patient: Vision is about the same as yesterday. Still adjusting.

assistant: Have you noticed any flashing lights in your vision?

patient: No.

assistant: Are you seeing any floaters — small specks or shapes moving in your vision?

patient: There's one tiny spot I see sometimes. But what's bothering me is at night, the streetlights and headlights have these big halos around them. That started two days ago.

Patient B · week 4

system decision: no_escalate · ground truth: escalate · judge: inconclusive
redness absent · pain absent · reduced vision absent · flashing lights absent · floaters absent

assistant: Hello — this is your post-cataract check-in. I'll ask about five symptoms. You're at week 4 after surgery.

assistant: How does your eye look today — any redness?

patient: No redness.

assistant: Are you having any pain in the operated eye?

patient: No pain.

assistant: How is your vision today compared to yesterday?

patient: Vision is fine. I'd say better than before surgery.

assistant: Have you noticed any flashing lights in your vision?

patient: No.

assistant: Are you seeing any floaters — small specks or shapes moving in your vision?

patient: No.
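
The two dispositions above are consistent with a simple cautious-default rule over the five symptom flags. The sketch below is an assumption, not the system's actual classifier: the `Finding` states, the `decide` function, and the rule "any significant finding or unresolved answer escalates" are hypothetical, chosen only to match the flags shown for these two cases and the re-ask policy described in the hazard sketch.

```python
from enum import Enum


class Finding(Enum):
    ABSENT = "absent"
    PRESENT = "present (no sig)"
    PRESENT_SIG = "present w/ sig"
    INCONCLUSIVE = "inconclusive"  # still unclear after the re-ask budget is spent


SYMPTOMS = ("redness", "pain", "reduced vision", "flashing lights", "floaters")


def decide(findings: dict[str, Finding]) -> str:
    """Return 'escalate' or 'no_escalate' for one completed interview (hypothetical rule)."""
    for symptom in SYMPTOMS:
        # A symptom with no recorded answer is treated as inconclusive.
        state = findings.get(symptom, Finding.INCONCLUSIVE)
        if state in (Finding.PRESENT_SIG, Finding.INCONCLUSIVE):
            return "escalate"  # cautious default
    return "no_escalate"


# Flags copied from the two case headers above.
mrs_chen = {
    "redness": Finding.PRESENT_SIG,
    "pain": Finding.ABSENT,
    "reduced vision": Finding.PRESENT,
    "flashing lights": Finding.ABSENT,
    "floaters": Finding.PRESENT_SIG,
}
patient_b = {symptom: Finding.ABSENT for symptom in SYMPTOMS}

print(decide(mrs_chen))   # escalate
print(decide(patient_b))  # no_escalate
```

Patient B shows the limit of any rule of this shape: with all five flags absent, no function of those flags can return `escalate`, which is why the week-4 miss is structural rather than a tuning problem.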

Clinical pathway

Post-cataract follow-up cadence with the conversational triage agent operating asynchronously between scheduled visits. The same six touchpoints, viewed from both sides.

Clinician journey

Where 92% lives: The 92% sensitivity headline is measured in this window. 12 of 13 should-escalate cases were caught here; the one false negative (FN) is structurally invisible to a symptom-based screen.
Where the miss is caught: The dilated week-4 exam catches the asymptomatic-PCO miss featured in the cases above. The system stepping back here is the system honoring its scope.
[Timeline figure: Pre-op (before) · Day of surgery (day 0) · Day 1 check (day 1) · Week 1 check (week 1, day 7) · Week 4 check (week 4, day 28), with the device-led window marked]

Scale: cube-root of days from surgery. Compresses the week 1 → week 4 tail so the early-window touchpoints get adequate visual room while preserving honest temporal ordering. Pre-op is off-scale (variable lead time) — rendered behind a discontinuity break.
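
As a rough illustration of the scale described above, the sketch below maps days from surgery to a normalized axis position with a cube root. The function name `axis_position` and the choice to normalize against day 28 are assumptions for illustration; the page's actual rendering code is not shown in the source.

```python
def axis_position(day: float, last_day: float = 28.0) -> float:
    """Map days-from-surgery to a 0..1 axis position using a cube-root scale."""
    if day < 0:
        # Pre-op has a variable lead time, so it is drawn behind the axis break instead.
        raise ValueError("pre-op is off-scale")
    return (day / last_day) ** (1.0 / 3.0)


for label, day in [("day 0", 0), ("day 1", 1), ("day 7", 7), ("day 28", 28)]:
    print(f"{label:>6}: {axis_position(day):.2f}")
# prints 0.00, 0.33, 0.63, 1.00
```

On a linear axis, day 1 and day 7 would sit at roughly 0.04 and 0.25 of the width. The cube root moves them to about 0.33 and 0.63, which gives the early touchpoints room without changing their order.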

before day 0

Pre-op

education + comprehension check

Confirm cataract diagnosis warrants surgery, complete pre-op workup, set patient expectations.

pmc11015376
day 0

Day of surgery

patient-led safety checklist

Safe surgical execution with patient-participatory pre-incision verification.

aao-ppp-2021 · stolk-vos-2018
day 1

Day 1 check

triage agent activated

Detect immediate post-op complications, confirm IOL position, confirm drop compliance.

wanten-2025 · moustafa-2019
days 2–7

Day 2 – Week 1

ACTIVE — triage agent runs 5-symptom interview between visits

Catch acute complications (endophthalmitis, IOL malposition, persistent inflammation) between scheduled visits. The highest-acuity inter-visit window.

wanten-2025 · aao-ppp-2021 · meinert-2024
week 1

Week 1 check

inter-visit transcript review

Confirm normal early post-op course; detect any complication missed by the triage screen.

aao-ppp-2021
week 4

Week 4 check

end-of-cadence summary, then deactivation

Detect delayed or structural complications (PCO, cystoid macular edema, peripheral retinal detachment). This visit catches the documented asymptomatic-structural failure mode.

aao-ppp-2021 · wanten-2025

Patient journey

What the patient experiences here: If something feels wrong, the patient opens the agent. It runs the 5-symptom interview and either reassures or escalates. On escalate, the clinic calls within the same day, with the transcript pre-read.
What the system can't do alone: The dilated exam, the structural-imaging step the at-home agent cannot replace. The patient returns to the clinic; the system honors its scope.
[Timeline figure: Pre-op (before) · Surgery (day 0) · First check-up (day 1) · Week 1 check-up (week 1, day 7) · Final check-up (week 4, day 28), with the "Home recovery · agent on call" band marked]

before day 0

Pre-op

Learns about cataract surgery via the clinic's pre-op education module. Confirms they understand the procedure, the recovery cadence, and the symptoms that warrant calling the clinic. The IOL conversation with their surgeon happens before surgery is scheduled.

day 0

Surgery

Checks in, confirms the surgical site and IOL choice on a patient-led safety checklist app, undergoes surgery, and leaves with explicit written drop instructions and the day-1 visit scheduled. Installs the post-op triage app on their phone (or receives an SMS-based equivalent).

day 1

First check-up

Returns for the day-1 in-person follow-up. The clinician confirms the eye looks good, reviews the drop schedule, and activates the triage agent. The patient receives a brief tutorial on how to engage the agent if symptoms occur.

days 2–7

Day 2 – Week 1

The home recovery window. Drops on schedule, eye gradually clearing. If anything feels wrong — new redness, pain, vision change, flashes, floaters — the patient opens the triage agent. The agent runs the 5-symptom interview, classifies, and either reassures or escalates. On escalate, the clinic calls within the same day with the transcript pre-read.

week 1

Week 1 check-up

Returns for the week-1 in-person visit. The clinician examines the eye, adjusts the drop taper, and shares any feedback on the triage interactions from the prior week. The patient feels heard — the clinician has read the conversation, not just glanced at a chart note.

week 4

Final check-up

End of the standard post-op cadence. The patient returns for the dilated exam — the one structural-imaging step the at-home triage agent could not replace. The clinician confirms the eye has healed appropriately, finalizes the refraction, and discharges the patient back to routine eye care.

SaMD framing

Device boundary, plausible regulatory pathway, top hazards with mitigations, and FDA cybersecurity relevance for the AI triage layer.

Device vs. enablement-software split

The triage agent is the SaMD candidate; the static pre-op education is not. Drawing the line explicitly avoids the trap of treating an entire workflow as a single regulated object.

Component | Category | Rationale
Conversational triage agent (post-op symptom interview + care-management recommendation) | SaMD candidate | Produces a care-management recommendation (escalate / no-escalate) from clinical inputs; meets the SaMD definition under IMDRF framing. [wanten-2025 discussion]
Pre-op patient education app (biometry / IOL selection / eConsent) | Not device | Static education content; does not drive a care decision. Categorically supported by the patient-facing-app evidence base. [pmc11015376 methods]
Day-of-surgery patient-led checklist | Enablement software (not device) | Supports the clinician's pre-incision time-out with redundant patient-side verification; does not make a care decision. [stolk-vos-2018 methods]
Eval harness + reviewer agent (internal development tool) | Not device | Used to evaluate the triage agent's classifier output; does not interact with patients directly.

Regulatory pathway (v1 working hypothesis)

A 510(k) on a predicate-comparison basis is plausible for the triage agent component once the validation set is real (a Meinert-aligned study, not the synthetic eval shown here). The non-device education + checklist components ship as enablement software outside the device boundary. This split keeps the regulated surface narrow and the validation scope honest. [meinert-2024 results]

ISO 14971-style hazard sketch (AI triage layer)

Top hazards for the triage layer, each with the mitigation already baked into the design (or the design rule that constrains it).

Hazard | Mitigation
False negative on a real complication (missed escalation) | Cautious-default escalation rule + foregrounded-failure display calibrates the design around catching, not over-classifying; sensitivity is the primary outcome. [wanten-2025 discussion]
Speech/text classification failure on ambiguous patient response | Re-ask logic (max 2 attempts) per Wanten 2025; persistent inconclusive escalates by cautious-default policy. [wanten-2025 methods]
Patient misunderstanding of system role (over-reliance / under-reliance) | Every system response frames the disposition as screening, not diagnosis. Page explicitly states the system supports but does not replace the in-person exam. [aao-ppp-2021 Postop Mgmt p.55]
Model drift over time (judge behavior changes silently) | Locked judge model (eval/runner/models.json), pinned versions in requirements.lock, sensitivity-regression deploy gate (5pp threshold) blocks silent drops.
Dropped or timed-out judge call mid-eval | Per-scenario retry with exponential backoff (max 3); skip-with-marker on exhaustion so a single failure does not abort the run.
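
The last mitigation (per-scenario retry with exponential backoff, then skip-with-marker) could look roughly like the sketch below. This is not the project's harness: `judge_with_retry`, `run_eval`, the `"skipped"` marker, the 1-second base delay, and the case ids are hypothetical, and "max 3" is read here as three attempts in total.

```python
import time
from typing import Callable

MAX_ATTEMPTS = 3     # assumption: "max 3" means three attempts in total
SKIPPED = "skipped"  # hypothetical marker recorded when every attempt fails


def judge_with_retry(call_judge: Callable[[], str],
                     base_delay: float = 1.0,
                     sleep: Callable[[float], None] = time.sleep) -> str:
    """Call the judge up to MAX_ATTEMPTS times, doubling the wait between tries."""
    for attempt in range(MAX_ATTEMPTS):
        try:
            return call_judge()
        except Exception:  # a real harness would catch only timeout/transport errors
            if attempt < MAX_ATTEMPTS - 1:
                sleep(base_delay * 2 ** attempt)  # 1s, then 2s
    return SKIPPED  # mark the scenario instead of aborting the run


def run_eval(scenarios: dict[str, Callable[[], str]], **kwargs) -> dict[str, str]:
    """Judge every scenario; one exhausted scenario does not stop the others."""
    return {case_id: judge_with_retry(call, **kwargs) for case_id, call in scenarios.items()}


def always_times_out() -> str:
    raise TimeoutError("judge call timed out")


delays: list[float] = []  # record the waits instead of actually sleeping
results = run_eval({"case-a": lambda: "correct", "case-b": always_times_out},
                   sleep=delays.append)
print(results)  # {'case-a': 'correct', 'case-b': 'skipped'}
print(delays)   # [1.0, 2.0]
```

Passing `sleep` as a parameter keeps the backoff testable without real waiting. A skipped scenario should be reported separately from a scored one, so that an exhausted call is never counted as a correct verdict.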

FDA cybersecurity (Section 524B) relevance

The triage agent qualifies as a cyber device under FDA's 524B reading — it processes patient-reported clinical data, depends on internet connectivity for the judge model, and is updateable post-market. A 524B-aligned package (threat model, SBOM, vulnerability management, secure update path) is in scope for the regulated component. The non-device education and checklist components are out of 524B scope but should still follow the project's secret-management discipline (ANTHROPIC_API_KEY as GHA secret, defense-in-depth scrub in T31).

Commercial framing

Illustrative figures from the published validation work. The reimbursement landscape is still forming — no invented codes appear here.

Post-op call deflection (illustrative) · validated mechanism
Asynchronous triage between scheduled visits reduces unscheduled inbound calls and unnecessary office visits. The Meinert validation reports the system handled the post-op interview autonomously in the validated cohort — that's the call deflection mechanism. [meinert-2024 results]
Cost saved per patient (illustrative) · ~US$48
Reported cost savings from the validation study, treated here as a reference-point order-of-magnitude rather than a portable benchmark. [meinert-2024 discussion]

These figures sit inside the bundled global surgical period — operational savings, not separately billable. The reimbursement landscape for AI-assisted post-op triage tools is still being formed; specific code claims would be premature. The pre-op IOL counseling extension below moves into a different reimbursement category (see roadmap).

What's next — pre-op IOL counseling

The same engine, earlier in the journey, with a different commercial shape.

The same engine extends to pre-op IOL selection counseling — earlier in the journey, structurally familiar (locked sources, gap-flagging, cautious defaults), commercially different. Post-op triage sits inside the bundled global surgical period: its value is cost deflection — fewer unscheduled visits, fewer ED detours — operational and indirect. Pre-op premium-IOL counseling for multifocal, EDOF, and toric lenses is cash-pay by design, a non-covered service patients elect into. The surgeon's counseling time at that step is unbundled, billable to the patient, and capacity-constrained. An AI assistant that improves counseling quality AND surgeon throughput at that step affects top-line revenue, not just operational savings. The post-op work validates the safety case; the pre-op IOL extension is where the same engine reaches a different P&L.

Scenario register (20)

The full scored case set, including the inconclusive verdict on the week-4 FN that anchors the methodology callout at the top. Every clinical citation resolves to the locked source list.

case_id | time | taxonomy | system decision | ground truth | judge verdict
case-mrs-chen | day 7 | borderline-required | escalate | escalate | correct
case-week4-asymptomatic-pco | week 4 | borderline-required | no_escalate | escalate | inconclusive
case-incomplete-info-stroke | day 10 | borderline-required | escalate | escalate | correct
case-ambiguous-pain-different | day 5 | borderline-required | escalate | escalate | correct
case-day1-mild-routine | day 1 | clearly-normal | no_escalate | no_escalate | correct
case-week1-routine-clean | week 1 | clearly-normal | no_escalate | no_escalate | correct
case-day3-tiny-floater | day 3 | clearly-normal | no_escalate | no_escalate | correct
case-week2-mild-residual-redness | week 2 | clearly-normal | no_escalate | no_escalate | correct
case-week3-adaptation | week 3 | clearly-normal | no_escalate | no_escalate | correct
case-week4-routine-discharge | week 4 | clearly-normal | no_escalate | no_escalate | correct
case-day2-severe-pain | day 2 | clearly-abnormal-obvious | escalate | escalate | correct
case-day3-vision-drop | day 3 | clearly-abnormal-obvious | escalate | escalate | correct
case-week2-flashes-floaters | week 2 | clearly-abnormal-obvious | escalate | escalate | correct
case-day1-hypopyon-pattern | day 1 | clearly-abnormal-obvious | escalate | escalate | correct
case-week3-visual-field-curtain | week 3 | clearly-abnormal-obvious | escalate | escalate | correct
case-day4-pain-disproportionate | day 4 | clearly-abnormal-obvious | escalate | escalate | correct
case-day5-different-pain | day 5 | other-borderline | escalate | escalate | correct
case-week2-single-flash | week 2 | other-borderline | escalate | escalate | correct
case-day6-mild-halos-only | day 6 | other-borderline | no_escalate | no_escalate | correct
case-week4-vague-vision-change | week 4 | other-borderline | escalate | escalate | correct
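
The headline figure can be recomputed from the register above. The sketch below copies each row's system decision and ground truth and counts the should-escalate cases that were caught. It also shows one possible reading of the 5pp sensitivity-regression gate named in the hazard sketch. The `gate_passes` function, its "drop of more than 5 points fails" semantics, and the baseline used in the example are assumptions, not the project's deploy code.

```python
E, N = "escalate", "no_escalate"

# (case_id, system_decision, ground_truth), copied from the register
REGISTER = [
    ("case-mrs-chen", E, E),
    ("case-week4-asymptomatic-pco", N, E),
    ("case-incomplete-info-stroke", E, E),
    ("case-ambiguous-pain-different", E, E),
    ("case-day1-mild-routine", N, N),
    ("case-week1-routine-clean", N, N),
    ("case-day3-tiny-floater", N, N),
    ("case-week2-mild-residual-redness", N, N),
    ("case-week3-adaptation", N, N),
    ("case-week4-routine-discharge", N, N),
    ("case-day2-severe-pain", E, E),
    ("case-day3-vision-drop", E, E),
    ("case-week2-flashes-floaters", E, E),
    ("case-day1-hypopyon-pattern", E, E),
    ("case-week3-visual-field-curtain", E, E),
    ("case-day4-pain-disproportionate", E, E),
    ("case-day5-different-pain", E, E),
    ("case-week2-single-flash", E, E),
    ("case-day6-mild-halos-only", N, N),
    ("case-week4-vague-vision-change", E, E),
]


def sensitivity_counts(rows) -> tuple[int, int]:
    """Return (caught, total) over rows whose ground truth is 'escalate'."""
    positives = [row for row in rows if row[2] == E]
    caught = [row for row in positives if row[1] == E]
    return len(caught), len(positives)


def gate_passes(current: float, baseline: float, threshold_pp: float = 5.0) -> bool:
    """Hypothetical gate: fail when sensitivity drops by more than threshold_pp points."""
    return (baseline - current) * 100 <= threshold_pp


caught, total = sensitivity_counts(REGISTER)
missed = [row[0] for row in REGISTER if row[2] == E and row[1] != E]
print(f"{caught}/{total} = {caught / total:.0%}")  # 12/13 = 92%
print(missed)                                      # ['case-week4-asymptomatic-pco']
print(gate_passes(11 / 13, baseline=12 / 13))      # False
```

With only 13 positive cases, each additional miss costs about 7.7 percentage points, so under this reading a single new false negative would already trip a 5pp gate.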

model: anthropic/opus-class · canonicalize: v1 · source: api · computed: 2026-05-22T17:19:43Z