// Case study / Alert engineering
Clean taxonomy, trustworthy alerts, enriched context.
Three layers of remediation on an alert channel where genuine checkout failures were buried in noise from low-traffic Active% triggers.
What happened
A noisy alert channel became a more trustworthy operating surface.
A major quick-service restaurant (QSR) brand's Amplitude alert channel was firing on too many low-signal events. Genuine checkout failures were buried in low-traffic Active% noise. I shipped redesigned thresholds with troubleshooting guidance, then authored two implementation-ready improvements: a hierarchical error-category ID scheme and a cart_id enrichment specification for checkout error events.
The taxonomy work was designed to reduce false positives in the AI alert-triage system by giving the model cleaner categories and richer context to reason from.
Context
The alert channel was correct but exhausting.
Low-traffic monitors produced noisy Active% triggers that drowned out genuinely important checkout failures. On-call responders had to do detective work before they could know whether an alert represented a real checkout problem, which basket was affected, or how the error should be categorised.
Task
Make the alert itself carry the start of the diagnosis.
The work needed to reduce false positives while improving diagnostic context. That required changes at the monitor threshold layer, the error taxonomy layer, and the event-schema layer.
- Monitors redesigned around absolute rate plus rate-of-change, reducing overreaction to low-traffic Active% volatility.
- Engineering-facing troubleshooting guidance added directly on each chart, so the alert carried the start of the diagnosis.
- Hierarchical error-category IDs such as SYS-INF-001, with deterministic mapping rules and governance against schema drift.
- Specification to add cart_id to Checkout Started Error Received, so error and success events shared the same context level.
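The threshold idea above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the production monitor: the names `Window` and `should_alert`, the reading of "absolute rate" as errors per minute, and both threshold values are hypothetical. The source states only that monitors were redesigned around absolute rate plus rate-of-change.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Error count observed over one monitoring window."""
    errors: int
    minutes: float


def error_rate(window: Window) -> float:
    """Errors per minute for a window (an absolute rate, not a share of users)."""
    if window.minutes <= 0:
        raise ValueError("window length must be positive")
    return window.errors / window.minutes


def should_alert(
    current: Window,
    previous: Window,
    min_rate: float = 2.0,      # hypothetical floor, in errors per minute
    min_increase: float = 1.5,  # hypothetical: current rate must be >= 1.5x previous
) -> bool:
    """Fire only when the rate is both high in absolute terms and rising."""
    now = error_rate(current)
    before = error_rate(previous)
    if now < min_rate:
        return False  # too few errors to matter, however large the percentage swing
    if before == 0:
        return True  # errors appeared from nothing and already exceed the floor
    return now / before >= min_increase
```

With these assumed values, a low-traffic window that triples from 1 to 3 errors in ten minutes stays silent, while a jump from 20 to 60 errors fires. A sustained high rate that is no longer rising would need its own rule, which this sketch leaves out.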
Action
The alert channel was redesigned across layers, not just muted.
I redesigned alert thresholds, added troubleshooting guidance to the relevant Amplitude charts, designed a hierarchical error-category ID scheme with governance rules, and authored the specification to enrich checkout error events with cart_id. The goal was not fewer alerts for their own sake; it was alerts that responders could interpret and act on without recreating the context from scratch.
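A hierarchical ID scheme with deterministic mapping might look like the sketch below. Only the example ID SYS-INF-001 comes from the case study. The segment names (domain, subdomain, sequence), the substring rules, the second ID `PAY-DEC-001`, and the fallback `UNK-UNK-000` are all hypothetical, chosen to show how ordered rules give the same ID for the same input every time.

```python
import re
from typing import NamedTuple

# Assumed shape: three uppercase letters, three uppercase letters, three digits.
CATEGORY_ID = re.compile(r"([A-Z]{3})-([A-Z]{3})-(\d{3})")


class CategoryId(NamedTuple):
    domain: str     # first segment, e.g. "SYS"
    subdomain: str  # second segment, e.g. "INF"
    sequence: int   # third segment, e.g. 1


def parse_category_id(raw: str) -> CategoryId:
    """Split an ID into its levels, rejecting anything off-pattern."""
    match = CATEGORY_ID.fullmatch(raw)
    if match is None:
        raise ValueError(f"malformed category ID: {raw!r}")
    domain, subdomain, sequence = match.groups()
    return CategoryId(domain, subdomain, int(sequence))


# Hypothetical rules: (substring of the raw error message, category ID).
# Order matters: the first matching rule wins, so the mapping is deterministic.
MAPPING_RULES: list[tuple[str, str]] = [
    ("timeout", "SYS-INF-001"),
    ("card declined", "PAY-DEC-001"),
]
FALLBACK_ID = "UNK-UNK-000"  # hypothetical catch-all for unmapped errors


def categorise(message: str) -> str:
    """Map a raw error message to a category ID using the ordered rules."""
    lowered = message.lower()
    for needle, category_id in MAPPING_RULES:
        if needle in lowered:
            return category_id
    return FALLBACK_ID
```

Keeping the rules in one ordered list means a new category is added by appending a rule and registering its ID, rather than by editing scattered conditionals.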
Outcome
Alerts became more actionable before AI entered the loop.
- Thresholds redesigned around absolute rate plus rate-of-change.
- Engineering-facing troubleshooting guidance added directly to charts.
- Hierarchical error-category ID scheme designed and proposed, with deterministic mapping rules and governance guidelines.
- cart_id enrichment specified for checkout error events as a ready-for-dev ticket, so failures could be correlated to basket instances.
- False-positive reduction support created for the AI alert-triage workflow.
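To illustrate what the cart_id enrichment makes possible, the sketch below groups events by basket and reports carts that hit a checkout error with no later success. The event name `Checkout Started Error Received` is from the case study. The success event name, the simplified event shape (`event_type`, `time`, `event_properties`), and the function name are assumptions made for the example.

```python
from collections import defaultdict

ERROR_EVENT = "Checkout Started Error Received"  # event name from the case study
SUCCESS_EVENT = "Checkout Completed"             # hypothetical success event name


def carts_with_unrecovered_errors(events: list[dict]) -> list[str]:
    """Return cart IDs that saw a checkout error and no later success."""
    by_cart: dict[str, list[dict]] = defaultdict(list)
    for event in events:
        cart_id = event.get("event_properties", {}).get("cart_id")
        if cart_id is not None:  # events without cart_id cannot be correlated
            by_cart[cart_id].append(event)

    unrecovered = []
    for cart_id, cart_events in by_cart.items():
        # Timestamp of the most recent error for this cart, if any.
        last_error = max(
            (e["time"] for e in cart_events if e["event_type"] == ERROR_EVENT),
            default=None,
        )
        if last_error is None:
            continue
        recovered = any(
            e["event_type"] == SUCCESS_EVENT and e["time"] > last_error
            for e in cart_events
        )
        if not recovered:
            unrecovered.append(cart_id)
    return sorted(unrecovered)
```

An error event that lacks cart_id is skipped here, which is the gap the enrichment specification is meant to close: without the shared key, the failure cannot be tied to a basket at all.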
Design lesson
Alerting problems are often taxonomy problems wearing an operations costume.
If errors are inconsistently categorised or stripped of the context success events carry, the alert channel forces humans to reconstruct basic facts. Fixing that starts upstream in instrumentation and schema governance.
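One way to make schema governance checkable is to compare the category IDs that instrumentation actually emits against an agreed registry. The sketch below is an assumption about how such a check could look, not the governance process described in the case study. The function name and the example IDs other than SYS-INF-001 are hypothetical.

```python
def find_drift(registered: set[str], observed: list[str]) -> dict[str, list[str]]:
    """Compare category IDs seen in events against the agreed registry."""
    seen = set(observed)
    return {
        # IDs emitted by instrumentation that nobody registered: drift.
        "unregistered": sorted(seen - registered),
        # Registered IDs that never appear: candidates for review, not errors.
        "unobserved": sorted(registered - seen),
    }
```

Run against a sample of recent events, a non-empty `unregistered` list flags categories that were introduced outside the scheme.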