The Challenge
ShopStream's rapid growth to 300+ microservices left the SRE team drowning in alert noise. They had 2,400 active alerts (most of them stale), no distributed tracing, and a mean time to detect (MTTD) exceeding 25 minutes during peak-traffic incidents, each costing $50K+ per hour in lost revenue.
Our Approach
- Metrics Pipeline — Deployed Prometheus with Thanos for long-term storage and federated queries across 8 Kubernetes clusters.
- Logging — Replaced the ELK stack with Grafana Loki, cutting log-storage costs 10× while retaining fast label-based querying.
- Distributed Tracing — Instrumented all services with OpenTelemetry and Grafana Tempo for end-to-end request tracing.
- SLO Framework — Defined SLIs/SLOs for every critical user journey with burn-rate alerting and error budget policies.
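The burn-rate alerting mentioned above typically follows the multiwindow, multi-burn-rate pattern: page only when both a long and a short window agree that the error budget is burning too fast. A minimal sketch of one Prometheus rule, assuming a hypothetical checkout-availability SLI recorded as `sli:checkout:error_ratio:*` and a 99.9% SLO (the recording-rule names and thresholds are illustrative, not ShopStream's actual rules):

```yaml
groups:
  - name: slo-checkout-availability
    rules:
      - alert: CheckoutHighErrorBudgetBurn
        # Page when the 1h burn rate is 14.4x the budget AND the 5m rate
        # agrees — roughly 2% of a 30-day error budget burned in one hour.
        # 0.001 is the error budget implied by a 99.9% availability SLO.
        expr: |
          sli:checkout:error_ratio:rate1h > (14.4 * 0.001)
          and
          sli:checkout:error_ratio:rate5m > (14.4 * 0.001)
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "Checkout availability burning error budget 14.4x too fast"
```

The short window keeps the alert from firing on stale data after the spike has passed, which is a large part of how noisy threshold alerts get retired.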
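Loki's cost advantage comes from indexing only labels, not full log content: queries select streams by label first, then filter the raw lines. A hypothetical LogQL query against a `checkout` service (label names are illustrative):

```logql
# Grep error streams for one service, then count the rate of matches
sum(rate({app="checkout", namespace="prod", level="error"} |= "timeout" [5m]))
```

Because only the small label set is indexed, storage is cheap; the trade-off is that free-text filtering happens at query time over the selected streams.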
Results
MTTD dropped from 25 minutes to under 90 seconds. Alert count was reduced by 92% while meaningful coverage increased. The platform maintained 99.995% uptime through Black Friday and Cyber Monday — zero revenue-impacting incidents.
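For a sense of scale on the 99.995% figure, the error-budget arithmetic is straightforward; a quick sketch (the 30-day window is illustrative):

```python
def downtime_budget_minutes(slo: float, window_days: float) -> float:
    """Allowed downtime, in minutes, for a given SLO over a window."""
    return (1.0 - slo) * window_days * 24 * 60

# A 99.995% SLO leaves about 2.16 minutes of downtime per 30-day window
print(round(downtime_budget_minutes(0.99995, 30), 2))  # → 2.16
```

That budget is tight enough that a 25-minute MTTD alone would have blown it many times over, which is why detection time was the first target.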
"For the first time, our on-call engineers can actually sleep through the night. The signal-to-noise ratio is incredible." — Head of SRE, ShopStream