// Case study / AI workflows

Designing AI workflows that don't outsource the thinking.

A production alert-triage system with explicit decisions about where AI assists and where human judgement is preserved — and why those lines are where they are.

230raw alerts reviewed

39posts after triage

83%alert noise reduction

1critical miss surfaced

What happened

A production triage workflow compressed noise and surfaced a miss.

A production AI alert-triage system compressed 230 raw alerts to 39 posts across a two-week run on real on-call volume. That is an 83% reduction in alert noise. More importantly, it surfaced one critical feature failure that the human review had missed.

The architecture is deliberately conservative: AI surfaces and ranks; humans still diagnose and act. It makes the human judgement step clearer, not absent.

Situation

The signal was present, but not readable.

A noisy alert channel at a major QSR analytics stack was producing roughly 230 alerts per fortnight from production and staging monitors. The known failure mode was legibility rather than absence of data: a prior critical checkout regression had gone unactioned because technical percentages obscured the signal.

Task

Compress the noise without removing the human gate.

The brief was to surface what matters, suppress what does not, escalate when a suppressed signal worsens, and never delegate the diagnose-and-act decision to the model. The workflow needed to reduce review burden while preserving accountability.

1Suppress known issues

Reads the alert channel, records monitors already posted today, and suppresses repeats unless a signal escalates significantly.

2Pull, dedupe, pre-screen

Pulls the last hour of production and staging alerts, groups firings by monitor name, and drops known low-signal monitor errors before expensive review.

3Trend + error breakdown

Queries Amplitude for each surviving monitor's seven-day chart and surfaces error-type breakdown so the summary has context.

4Gate, group, post

Posts Critical or Warning findings only. Suppressed monitors hitting 2x prior value override suppression. Co-firing monitors group under one root-cause hypothesis.

fig. 01 / responsibility boundarymodel assists / human owns action

AI handlesSurface

Pulls, groups, summarises, ranks, and points attention toward likely signal.

Gate handlesEscalate

Overrides suppression when the same issue worsens instead of letting prior posts silence the problem.

Human handlesDecide

Diagnosis, ownership, and action remain with the on-call team.

Outcome

One missed critical issue justified the architecture.

The signal that justifies the design is not only the alert count. It is the critical feature failure the on-call team's human review had missed. The system surfaced it inside the V1 run; without that surface, it could have continued silently.

230 raw alerts to 39 posts across the two-week production run.
83% reduction in alert noise while preserving escalation paths.
Hourly alert review cycle replacing multi-hour manual review.
V2 build in progress based on the observed V1 surface and suppression patterns.

Design lesson

Suppression is the dangerous part.

Most alert-triage systems focus on suppressing duplicates. This one also asks when suppression should fail. The override rule exists because a system can become unsafe if it silences a worsening issue with its own earlier post.

Claude API Amplitude AI workflow design Human-AI collaboration Alert triage

More casesFull case library.Sixteen active cases across seven categories: question discovery, instrumentation, experimentation, AI workflows, adoption, marketing analytics, and data engineering. AboutBio, skills, credentials.The longer version: methodology, credentials, and what I am available for. Get in touchContact and context.Email, LinkedIn, GitHub, the CV. Happy to walk through this case or any other in detail.