// Case study / AI workflows
Designing AI workflows that don't outsource the thinking.
A production alert-triage system with explicit decisions about where AI assists and where human judgement is preserved — and why those lines are where they are.
What happened
A production triage workflow compressed noise and surfaced a miss.
A production AI alert-triage system compressed 230 raw alerts to 39 posts across a two-week run on real on-call volume. That is an 83% reduction in alert noise. More importantly, it surfaced one critical feature failure that the human review had missed.
The architecture is deliberately conservative: AI surfaces and ranks; humans still diagnose and act. It makes the human judgement step clearer, not absent.
Situation
The signal was present, but not readable.
A noisy alert channel at a major QSR analytics stack was producing roughly 230 alerts per fortnight from production and staging monitors. The known failure mode was not absence of data. It was legibility: a prior critical checkout regression had gone unactioned because technical percentages obscured the signal.
Task
Compress the noise without removing the human gate.
The brief was to surface what matters, suppress what does not, escalate when a suppressed signal worsens, and never delegate the diagnose-and-act decision to the model. The workflow needed to reduce review burden while preserving accountability.
Reads the alert channel, records monitors already posted today, and suppresses repeats unless a signal escalates significantly.
Pulls the last hour of production and staging alerts, groups firings by monitor name, and drops known low-signal monitor errors before expensive review.
Queries Amplitude for each surviving monitor's seven-day chart and surfaces error-type breakdown so the summary has context.
Posts Critical or Warning findings only. Suppressed monitors hitting 2x prior value override suppression. Co-firing monitors group under one root-cause hypothesis.
Pulls, groups, summarises, ranks, and points attention toward likely signal.
Overrides suppression when the same issue worsens instead of letting prior posts silence the problem.
Diagnosis, ownership, and action remain with the on-call team.
Outcome
One missed critical issue justified the architecture.
The signal that justifies the design is not only the alert count. It is the critical feature failure the on-call team's human review had missed. The system surfaced it inside the V1 run; without that surface, it could have continued silently.
- 230 raw alerts to 39 posts across the two-week production run.
- 83% reduction in alert noise while preserving escalation paths.
- Hourly alert review cycle replacing multi-hour manual review.
- V2 build in progress based on the observed V1 surface and suppression patterns.
Design lesson
Suppression is the dangerous part.
Most alert-triage systems focus on suppressing duplicates. This one also asks when suppression should fail. The override rule exists because a system can become unsafe if it silences a worsening issue with its own earlier post.