A working portfolio demo. Independent work. Synthetic data, labeled.
01
Eval
Turning between-visit patient symptoms into same-day clinical escalations.
Graded by LLM judges.
A 20-case eval of post-cataract follow-up triage. A conversational agent runs the 5-symptom interview Wanten 2025 validated: an LLM classifier labels each free-text response against the Wanten four-way schema, and a deterministic rule routes the labels to escalate or no_escalate. On escalate, the clinician opens the call with the patient's own transcript and the classifier's labels pre-loaded, not a blank intake form.
A judge model grades each scenario against a ground-truth escalation decision; 12 of the 13 escalation-warranting cases route correctly through the classifier-plus-rule pipeline. The methodology callout below explains where the architecture's boundaries sit and why week-4 asymptomatic PCO is a designed handoff to the in-person exam, not a missed escalation.
Cataract is the first care path. The same pattern extends to any post-op recovery with a defined symptom set:
1. interview
2. classify with an LLM
3. route with a deterministic cautious-default rule
4. hand the transcript to the clinician
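The routing step of this pattern can be sketched in a few lines. This is a minimal illustration, not the project's actual rule: the label name `present_without_significance`, the symptom keys, and the treatment of a missing symptom as an escalation are assumptions. The escalation conditions (any `present_with_significance`, any label still `inconclusive` after re-asking) follow the cautious-default policy described on this page.

```python
from enum import Enum


class Label(str, Enum):
    """Four-way label per symptom (names partly assumed for illustration)."""
    PRESENT_WITH_SIGNIFICANCE = "present_with_significance"
    PRESENT_WITHOUT_SIGNIFICANCE = "present_without_significance"  # assumed name
    ABSENT = "absent"
    INCONCLUSIVE = "inconclusive"  # final label, i.e. still unclear after re-asks


# Hypothetical keys for the five interview symptoms.
SYMPTOMS = ("redness", "pain", "reduced_vision", "flashing_lights", "floaters")

ESCALATING = {Label.PRESENT_WITH_SIGNIFICANCE, Label.INCONCLUSIVE}


def route(labels: dict) -> str:
    """Deterministic cautious-default routing over the classifier's labels."""
    for symptom in SYMPTOMS:
        # Assumption: a symptom with no label is treated as unresolved.
        if symptom not in labels or labels[symptom] in ESCALATING:
            return "escalate"
    return "no_escalate"


# Labels as shown for the two cases on this page.
mrs_chen = {
    "redness": Label.PRESENT_WITH_SIGNIFICANCE,
    "pain": Label.ABSENT,
    "reduced_vision": Label.PRESENT_WITHOUT_SIGNIFICANCE,
    "flashing_lights": Label.ABSENT,
    "floaters": Label.PRESENT_WITH_SIGNIFICANCE,
}
patient_b = {symptom: Label.ABSENT for symptom in SYMPTOMS}

print(route(mrs_chen))   # escalate
print(route(patient_b))  # no_escalate
```

Because the rule is plain code over a closed label set, every input combination can be enumerated and verified, which is the validation property the deterministic layer is meant to provide.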
Two cases under the headline. Mrs. Chen shows the classifier-plus-rule pipeline catching the escalation. The week-4 PCO shows the pipeline correctly handing off to the in-person exam, a scope boundary the methodology callout above accounts for.
Mrs. Chen · day 7
system decision: escalate · ground truth: escalate · judge: correct
redness: PRESENT W/ SIG · pain: absent · reduced vision: present (no sig) · flashing lights: absent · floaters: PRESENT W/ SIG
assistant: Hello, this is your post-cataract check-in. I'll ask about five symptoms. You're at day 7 after surgery.
assistant: How does your eye look today? Any redness?
patient: It's been a little red since yesterday. Not bright red, but I can see it.
assistant: Are you having any pain in the operated eye?
patient: No pain. Maybe a slight scratchiness when I blink, but nothing I'd call pain.
assistant: How is your vision today compared to yesterday?
patient: Vision is about the same as yesterday. Still adjusting.
assistant: Have you noticed any flashing lights in your vision?
patient: No.
assistant: Are you seeing any floaters (small specks or shapes moving in your vision)?
patient: There's one tiny spot I see sometimes. But what's bothering me is at night, the streetlights and headlights have these big halos around them. That started two days ago.
Patient B · week 4
system decision: no_escalate · ground truth: escalate · judge: inconclusive
assistant: Hello, this is your post-cataract check-in. I'll ask about five symptoms. You're at week 4 after surgery.
assistant: How does your eye look today? Any redness?
patient: No redness.
assistant: Are you having any pain in the operated eye?
patient: No pain.
assistant: How is your vision today compared to yesterday?
patient: Vision is fine. I'd say better than before surgery.
assistant: Have you noticed any flashing lights in your vision?
patient: No.
assistant: Are you seeing any floaters (small specks or shapes moving in your vision)?
patient: No.
03
Clinical pathway
Post-cataract follow-up cadence with the conversational triage agent operating asynchronously between scheduled visits. The same six touchpoints, viewed from both sides.
Clinician journey
Where 92% lives
The 92% (12 of 13) measures the classifier-plus-rule pipeline in this window. 12 of the 13 escalation-warranting cases route correctly here; the 13th is a scope boundary handed off to the in-person exam.
Where the scope boundary is handled
The dilated week-4 exam catches asymptomatic structural findings that the symptom interview cannot surface by design (the week-4 PCO case, Patient B). This is the designed handoff, not a remediation of a miss.
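The 92% figure is plain sensitivity arithmetic, and the same number can feed the sensitivity-regression gate mentioned in the hazard sketch. The snippet below is a sketch under stated assumptions: the function names are hypothetical, the case list is a placeholder with the reported counts (12 caught, 1 handed off, 7 non-escalation cases), and the gate is assumed to block when sensitivity drops by more than 5 percentage points against a baseline.

```python
def sensitivity(cases):
    """Share of escalation-warranting cases that the system escalated.

    `cases` is an iterable of (ground_truth, system_decision) pairs.
    """
    positives = [decision for truth, decision in cases if truth == "escalate"]
    if not positives:
        raise ValueError("no escalation-warranting cases in the set")
    caught = sum(1 for decision in positives if decision == "escalate")
    return caught / len(positives)


def gate_passes(baseline, candidate, max_drop_pp=5.0):
    """Assumed gate semantics: fail if sensitivity drops by more than max_drop_pp."""
    return (baseline - candidate) * 100 <= max_drop_pp


# Placeholder case list matching the reported counts. The decisions on the
# 7 non-escalation cases are illustrative and do not affect sensitivity.
cases = (
    [("escalate", "escalate")] * 12
    + [("escalate", "no_escalate")]          # the week-4 PCO scope boundary
    + [("no_escalate", "no_escalate")] * 7
)

baseline = sensitivity(cases)
print(f"{baseline:.1%}")                 # 92.3%
print(gate_passes(baseline, 11 / 13))    # False: a 7.7pp drop exceeds 5pp
```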
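[Timeline figure, clinician view. Touchpoints: Pre-op (before) · Day of surgery (day 0) · Day 1 check (day 1) · Week 1 check (week 1) · Week 4 check (week 4). Device-led window highlighted. Axis ticks: before, day 0, day 1, day 7, day 28.]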
Scale: cube-root of days from surgery. Compresses the week 1 → week 4 tail so the early-window touchpoints get adequate visual room while preserving honest temporal ordering. Pre-op is off-scale (variable lead time), rendered behind a discontinuity break.
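As a quick illustration of that scale, the sketch below maps days since surgery to a 0 to 1 axis position using a cube root normalized to day 28. The function name and the normalization to the last touchpoint are assumptions; only the cube-root transform comes from the caption.

```python
def timeline_position(day: float, last_day: float = 28.0) -> float:
    """Axis position in [0, 1] on a cube-root scale of days from surgery."""
    if day < 0:
        # Pre-op is off-scale and drawn behind a discontinuity break.
        raise ValueError("pre-op touchpoints are not placed on this scale")
    return day ** (1 / 3) / last_day ** (1 / 3)


for day in (0, 1, 7, 28):
    print(day, round(timeline_position(day), 3))
# 0 0.0
# 1 0.329
# 7 0.63
# 28 1.0
```

On a linear axis day 7 would sit at 0.25, so the week 1 to week 4 tail would take three quarters of the width. The cube root shrinks that tail to roughly 37% while keeping the touchpoints in order.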
day 0
Day of surgery
Safe surgical execution with patient-participatory pre-incision verification.
[aao-ppp-2021] [stolk-vos-2018]
day 1
Day 1 check
triage agent activated
Detect immediate post-op complications, confirm IOL position, confirm drop compliance.
[wanten-2025] [moustafa-2019]
days 2–7
Day 2 – Week 1
ACTIVE: triage agent runs the 5-symptom interview between visits
Catch acute complications (endophthalmitis, IOL malposition, persistent inflammation) between scheduled visits. Every interaction produces a clinician-facing package: patient's words, classifier's labels, rule's routing. On escalate, the clinician opens the call already mid-conversation, not from a blank intake.
[wanten-2025] [aao-ppp-2021] [meinert-2024]
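One way to picture the clinician-facing package is as a small immutable record carrying the three parts named above: the patient's words, the classifier's labels, and the rule's routing. The class and field names below are hypothetical, not the project's schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    speaker: str  # "assistant" or "patient"
    text: str


@dataclass(frozen=True)
class HandoffPackage:
    """Hypothetical shape of what the clinician sees for one interaction."""
    post_op_day: int
    transcript: tuple   # Turn objects, in order: the patient's own words
    labels: dict        # symptom -> classifier label
    routing: str        # "escalate" or "no_escalate", from the rule

    def headline(self) -> str:
        """One-line summary for the top of the clinician's view."""
        flagged = sorted(
            symptom for symptom, label in self.labels.items()
            if label == "present_with_significance"
        )
        reason = ", ".join(flagged) if flagged else "no significant symptoms"
        return f"day {self.post_op_day}: {self.routing} ({reason})"


package = HandoffPackage(
    post_op_day=7,
    transcript=(
        Turn("assistant", "How does your eye look today? Any redness?"),
        Turn("patient", "It's been a little red since yesterday."),
    ),
    labels={
        "redness": "present_with_significance",
        "pain": "absent",
        "reduced_vision": "present_without_significance",
        "flashing_lights": "absent",
        "floaters": "present_with_significance",
    },
    routing="escalate",
)
print(package.headline())  # day 7: escalate (floaters, redness)
```

Keeping the raw transcript next to the labels lets the clinician check the classifier against what the patient actually said.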
week 1
Week 1 check
inter-visit transcript review
Confirm normal early post-op course and detect any complication missed by the triage screen. The visit opens with the prior week's handoff package: between-visit transcripts, classifier labels, and rule routing per interaction. The clinician reads the patient's words across the week, not a chart-note summary.
[aao-ppp-2021]
week 4
Week 4 check
end-of-cadence summary, then deactivation
Detect delayed or structural complications (PCO, cystoid macular edema, peripheral retinal detachment). This visit is the structural-check layer for what the symptom interview cannot reach by design.
[aao-ppp-2021] [wanten-2025]
Patient journey
What the patient experiences here
If something feels wrong, the patient opens the agent. It runs the 5-symptom interview and either reassures or escalates. On escalate, the clinic calls within the day, with the transcript pre-read.
What the system can't do alone
The dilated exam is the structural-imaging step the at-home agent cannot replace. The patient returns to the clinic; the system honors its scope.
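[Timeline figure, patient view. Touchpoints: Pre-op (before) · Surgery (day 0) · First check-up (day 1) · Week 1 check-up (week 1) · Final check-up (week 4). Highlighted window: home recovery, agent on call. Axis ticks: before, day 0, day 1, day 7, day 28.]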
before day 0
Pre-op
Learns about cataract surgery via the clinic's pre-op education module. Confirms they understand the procedure, the recovery cadence, and the symptoms that warrant calling the clinic. The IOL conversation with their surgeon happens before surgery is scheduled.
day 0
Surgery
Checks in, confirms the surgical site and IOL choice on a patient-led safety checklist app, undergoes surgery, and leaves with explicit written drop instructions and the day-1 visit scheduled. Installs the post-op triage app on their phone (or receives an SMS-based equivalent).
day 1
First check-up
Returns for the day-1 in-person follow-up. The clinician confirms the eye looks good, reviews the drop schedule, and activates the triage agent. The patient receives a brief tutorial on how to engage the agent if symptoms occur.
days 2–7
Day 2 – Week 1
The home recovery window. Drops on schedule, eye gradually clearing. If anything feels wrong (new redness, pain, vision change, flashes, floaters), the patient opens the triage agent. The agent runs the 5-symptom interview, classifies, and either reassures or escalates. On escalate, the clinic calls within the same day with the transcript pre-read.
week 1
Week 1 check-up
Returns for the week-1 in-person visit. The clinician examines the eye, adjusts the drop taper, and shares any feedback on the triage interactions from the prior week. The patient feels heard: the clinician has read the conversation, not just glanced at a chart note.
week 4
Final check-up
End of the standard post-op cadence. The patient returns for the dilated exam, the one structural-imaging step the at-home triage agent could not replace. The clinician confirms the eye has healed appropriately, finalizes the refraction, and discharges the patient back to routine eye care.
04
SaMD framing
Device boundary, plausible regulatory pathway, top hazards with mitigations, and FDA cybersecurity relevance for the AI triage layer.
Device vs. enablement-software split
The triage agent is the SaMD candidate, and the agent is two components: an LLM classifier on free-text plus a deterministic rule on the classifications. Both sit inside the device boundary because the rule's output is a clinical recommendation, which under IMDRF framing is device scope whether produced by a model or by deterministic code. Determinism is a validation property of the rule (easier to verify, lower model risk), not a property that moves it outside the device. The static pre-op education and the day-of-surgery checklist are outside the device boundary for the right reason: they don't drive a care decision. Drawing the line explicitly avoids the trap of treating an entire workflow as a single regulated object.
Triage agent (LLM classifier + deterministic rule)
SaMD candidate (device)
Produces a care-management recommendation (escalate / no-escalate) from clinical inputs, so it meets the SaMD definition under IMDRF framing. [wanten-2025 discussion]
Pre-op education module
Enablement software (not device)
Static education content; does not drive a care decision. Categorically supported by the patient-facing-app evidence base. [pmc11015376 methods]
Day-of-surgery patient-led checklist
Enablement software (not device)
Supports the clinician's pre-incision time-out with redundant patient-side verification; does not make a care decision. [stolk-vos-2018 methods]
Eval harness + reviewer agent (internal development tool)
Not device
Used to evaluate the triage agent's classifier output; does not interact with patients directly.
Regulatory pathway (v1 working hypothesis)
Plausible 510(k) for the triage agent component on a predicate-comparison basis once the validation set is real (a Meinert-aligned study, not the synthetic eval shown here). The non-device education + checklist components ship as enablement software outside the device boundary. This split keeps the regulated surface narrow and the validation scope honest. The classifier-plus-rule architecture is a reusable pattern; the SpO₂ post-op monitoring eval pipeline (separate project) uses the same shape, with the LLM scoped narrowly to the classification surface and a deterministic layer carrying the decision. [meinert-2024 results]
ISO 14971-style hazard sketch (AI triage layer)
Top hazards for the triage layer, each with the mitigation already baked into the design (or the design rule that constrains it).
Hazard: False negative on a real complication (missed escalation at the system level)
Mitigation: Deterministic cautious-default rule escalates on any present_with_significance and on persistent inconclusive; the in-person exam at week 4 is the structural-check layer for asymptomatic complications the symptom interview cannot surface. [wanten-2025 discussion]
Hazard: Classifier mis-labels free-text on ambiguous patient response (the load-bearing LLM failure mode)
Mitigation: Re-ask logic (max 2 attempts) per Wanten 2025 reduces ambiguous classifications; the deterministic rule's cautious-default policy escalates any persistent inconclusive to the clinician. [wanten-2025 methods]
Hazard: Patient misunderstanding of system role (over-reliance / under-reliance)
Mitigation: Every system response frames the disposition as screening, not diagnosis. The page explicitly states the system supports but does not replace the in-person exam. [aao-ppp-2021 Postop Mgmt p.55]
Hazard: Model drift over time (judge behavior changes silently)
Mitigation: Locked judge model (eval/runner/models.json), pinned versions in requirements.lock, sensitivity-regression deploy gate (5pp threshold) blocks silent drops.
Hazard: Dropped or timed-out judge call mid-eval
Mitigation: Per-scenario retry with exponential backoff (max 3); skip-with-marker on exhaustion so a single failure does not abort the run.
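The re-ask mitigation for ambiguous responses can be sketched as a bounded loop. The sketch rests on several assumptions: "max 2 attempts" is read as at most two re-asks after the initial question, the re-ask wording is invented, and `toy_classify` is a keyword stand-in for the LLM classifier used only to make the example runnable.

```python
def interview_symptom(question, ask, classify, max_reasks=2):
    """Ask one symptom question, re-asking while the label stays inconclusive.

    Returns (final_label, answers). A label that is still "inconclusive"
    after the last re-ask is what the cautious-default rule escalates.
    """
    answers = []
    label = "inconclusive"
    for attempt in range(1 + max_reasks):
        prompt = question if attempt == 0 else f"Just to be sure I understood: {question}"
        answer = ask(prompt)
        answers.append(answer)
        label = classify(question, answer)
        if label != "inconclusive":
            break
    return label, answers


def toy_classify(question, answer):
    """Keyword stand-in for the LLM classifier (illustration only)."""
    text = answer.strip().lower()
    if "not sure" in text or "maybe" in text:
        return "inconclusive"
    if text.startswith("no"):
        return "absent"
    if text.startswith("yes"):
        return "present_without_significance"
    return "inconclusive"


def scripted_ask(answers):
    """Return an `ask` callable that replays canned patient answers."""
    remaining = iter(answers)
    return lambda prompt: next(remaining)


label, answers = interview_symptom(
    "Have you noticed any flashing lights?",
    scripted_ask(["Hmm, not sure.", "Maybe?", "No, none."]),
    toy_classify,
)
print(label, len(answers))  # absent 3
```

Bounding the loop keeps the interview short for the patient and guarantees that an unresolved answer reaches the clinician instead of being retried indefinitely.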
FDA cybersecurity (Section 524B) relevance
The triage agent qualifies as a cyber device under FDA's 524B reading: it processes patient-reported clinical data, depends on internet connectivity for the judge model, and is updateable post-market. A 524B-aligned package (threat model, SBOM, vulnerability management, secure update path) is in scope for the regulated component. The non-device education and checklist components are out of 524B scope but should still follow the project's secret-management discipline (ANTHROPIC_API_KEY as GHA secret, defense-in-depth scrub in T31).
04
Commercial framing
Illustrative figures from the published validation work. The reimbursement landscape is still forming, so no invented codes appear here.
Post-op call deflection (illustrative): validated mechanism
Asynchronous triage between scheduled visits reduces unscheduled inbound calls and unnecessary office visits. The Meinert validation reports that the system handled the post-op interview autonomously in the validated cohort; that is the call deflection mechanism. [meinert-2024 results]
Cost saved per patient (illustrative): ~US$48
Reported cost savings from the validation study, treated here as an order-of-magnitude reference point rather than a portable benchmark. [meinert-2024 discussion]
These figures sit inside the bundled global surgical period: they are operational savings, not separately billable. The reimbursement landscape for AI-assisted post-op triage tools is still being formed; specific code claims would be premature.
05
Roadmap
What this artifact is bounded to, and what would extend it.
Next eval tier: naturalistic patient speech, real-clinician validation
The current eval is 20 cases of authored synthetic responses calibrated to the Wanten classification schema; per-symptom classification matches ground truth on 100 of 100 labels, bounded to that set. The next investment is a 200-case validation tier on naturalistic patient speech with clinician-confirmed ground truth, which is what an FDA pre-submission would actually be built against. Without that tier, this eval is a credible bounded prototype, not a validation file.
The tiered classifier-plus-rule architecture is being applied to a parallel SpO₂ post-op monitoring eval pipeline (separate project, separate clinical surface). The bet is that this pattern (narrow LLM on free-text, deterministic rule on the decision, in-person check as the structural-handoff layer) is a transferable shape for any post-op recovery with a defined symptom set. The SpO₂ work tests that bet on a different clinical surface.
Regulatory pathway: predicate-comparison after validation, not before
510(k) predicate-comparison work follows the 200-case validation tier, not this eval. This artifact is the precursor to a submission package, not a substitute for it.
APPX
Scenario register (20)
The full scored case set, including the inconclusive verdict on the week-4 false negative (the PCO case) that anchors the methodology callout at the top. Every clinical citation resolves to the locked source list.