How Anagha uses ML models trained on golden signals to predict production failures 8–22 minutes before they impact users — and auto-remediate 94% of incidents without paging a human.
The standard alerting model is fundamentally backward: a threshold is crossed, a human is paged, a human diagnoses, a human remediates. At each handoff, time passes — and users are experiencing degraded service during every second of that chain. The average enterprise MTTR for a P1 incident is 4.2 hours. The average cost per hour of downtime is $300K. That's $1.26M per P1 incident in direct cost, before customer trust damage is calculated.
Worse: threshold-based alerting is either too sensitive (alert fatigue from hundreds of false positives per week) or too coarse (catching failures only after they're severe enough to breach absolute thresholds). Neither approach catches the subtle degradation patterns that precede major failures — the slow memory leak building over four hours, the database connection pool creeping toward saturation, the P99 latency ratcheting up 2ms per minute.
The reality: In Anagha's pre-engagement assessments, 73% of P1 incidents were detectable 8–22 minutes before customer impact — but no alert fired because no threshold was configured for the specific metric trajectory that caused the failure.
Anagha instruments every service with OpenTelemetry — capturing the four golden signals (latency, traffic, errors, saturation) plus extended signals: connection pool utilization, GC pause duration, Kafka consumer lag, database slow query rate, external dependency latency, and business-level signals (order success rate, login failure rate, payment processing time). Metrics are scraped by Prometheus every 15 seconds. Long-term storage in Thanos (S3-backed, 13-month retention) enables seasonal model training. All metrics are labeled with service, version, region, and canary cohort — enabling the ML model to attribute anomalies to specific versions.
Raw metrics are noisy. The feature engineering layer transforms raw time series into ML-ready features: rolling averages (5m, 15m, 1h), rate-of-change derivatives (first and second order — "how fast is latency accelerating?"), seasonal decomposition (day-of-week, hour-of-day baselines), inter-service correlation features (when service A latency rises, does service B's error rate follow?), and cross-metric anomaly signals. These ~140 features per service feed the prediction models.
Two ML layers run in parallel. An anomaly detection layer (Isolation Forest + Prophet seasonality models) runs continuously, flagging individual metrics that deviate from historical norms — using adaptive thresholds that account for time-of-day, day-of-week, and recent drift. This catches point anomalies that static thresholds miss.
A failure prediction layer (LSTM for sequence modeling + XGBoost for structured tabular features) is trained on historical incident data: what did the metric constellation look like 5, 10, 20 minutes before each past incident? The model learns the multivariate signatures that precede failures — not individual threshold breaches, but the interaction patterns that indicate systemic stress. Precision: 94.2%. False positive rate: 3.1% (well below the alert-fatigue threshold).
For the ~80% of incident patterns that have a known remediation, the system acts autonomously before a human is involved. Anagha maintains a library of executable runbooks: horizontal scale-out (k8s HPA scale-up when pod CPU + memory both in 80th percentile); rolling restart of pods showing memory leak signatures; circuit breaker trip for degraded downstream services; traffic reroute to healthy regional replicas; cache warming for cold-start latency spikes. Each automated action is logged with the anomaly signal that triggered it, creating a full audit trail.
Result: 94% of incidents resolved by automated runbooks in under 60 seconds. Human on-call engineers are paged only for novel failure patterns, data integrity concerns, or cascading failures that require architectural judgment.
Anagha's SRE intelligence platform takes 6 weeks to deploy on your existing Prometheus/Grafana stack. No rip-and-replace required.