[Verse 1]
When your service crashes at three AM
Users screaming, patience wearing thin
You need metrics that predict the pain
Before your reputation goes down the drain
Three letters spell your saving grace
SLI, SLO, SLA - know your place

[Chorus]
Indicators measure what you see
Objectives set where you should be
Agreements promise what you'll do
SLI, SLO, SLA - the reliability crew
Measure, target, promise true
That's the trio pulling you through

[Verse 2]
SLI's your telescope on system health
Response time, errors, uptime wealth
Raw numbers flowing from your code
Percentiles paint the truth untold
Ninety-nine percent requests succeed
These indicators plant the seed

[Chorus]
Indicators measure what you see
Objectives set where you should be
Agreements promise what you'll do
SLI, SLO, SLA - the reliability crew
Measure, target, promise true
That's the trio pulling you through

[Verse 3]
SLO's your compass pointing north
Ninety-nine-point-nine and so forth
Internal goals that shape your plan
Error budget in your hand
When you breach that sacred line
Time to debug and realign

[Bridge]
SLA's the contract signed in ink
Legal promise, missing link
External face of what you vow
Service credits when you bow
Error budgets keep you sane
Calculate acceptable pain

[Chorus]
Indicators measure what you see
Objectives set where you should be
Agreements promise what you'll do
SLI, SLO, SLA - the reliability crew
Measure, target, promise true
That's the trio pulling you through

[Outro]
From metrics raw to promises made
That's how trust in tech is played
SLI feeds SLO feeds SLA
Reliability's the CTO way
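The "ninety-nine-point-nine" target and the error budget from Verse 3 can be made concrete with a little arithmetic. The following is a minimal sketch, assuming a 30-day measurement window; the window length and the function name `error_budget_minutes` are illustrative choices, not something the song specifies.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window.

    slo_target is a fraction such as 0.999 (assumed convention).
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be a fraction in (0, 1]")
    total_minutes = window_days * 24 * 60
    # The budget is whatever share of the window the SLO does not claim.
    return (1.0 - slo_target) * total_minutes


budget = error_budget_minutes(0.999)
print(f"{budget:.1f} minutes")  # prints "43.2 minutes"
```

Under these assumptions, a 99.9% objective leaves about 43 minutes of downtime per 30 days, while a 99% objective leaves about 432 minutes.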
# The Case of the Vanishing Customers

## 1. THE MYSTERY

The morning coffee at TechFlow Solutions had gone cold hours ago. Sarah Martinez, the startup's CEO, stared at her laptop screen in bewilderment, surrounded by crumpled printouts and empty energy drink cans. For the third month running, customers were abandoning their online shopping platform at an alarming rate, but the pattern made no sense.

"Look at this," she muttered to her development team gathered around the conference table. "Our server logs show everything's running perfectly. No crashes, no major outages. But customer complaints are pouring in about slow loading times and failed purchases. Yesterday, we lost our biggest client, MegaRetail Corp, and they're threatening to sue us for breach of contract."

She held up a thick legal document. "They're claiming we violated our service agreement, but I don't understand how. Our monitoring dashboard shows green lights across the board!"

The numbers painted a confusing picture: the system reported 98.5% uptime, yet customers complained about frequent unavailability. Response times averaged 2.3 seconds according to their internal tools, but users reported waiting 8-10 seconds for pages to load. The disconnect between what their systems claimed and what customers experienced was driving everyone crazy.

## 2. THE EXPERT ARRIVES

Just as Sarah was about to call another emergency meeting, her phone buzzed. "Sarah, I heard you might need some help," came the calm voice of Dr. Elena Rodriguez, a reliability engineering consultant who specialized in helping companies understand their service quality. "I'm in the building next door consulting for another client. Mind if I stop by?"

Twenty minutes later, Dr. Rodriguez walked into the chaotic conference room, her tablet already displaying charts and graphs. She had the kind of focused energy that comes from solving impossible puzzles for a living.
After a quick survey of the printed reports scattered across the table and a few pointed questions about their monitoring setup, she raised her eyebrows in recognition. "Ah, I see the problem. You're measuring the wrong things in the wrong way."

## 3. THE CONNECTION

"Think of your situation like a pizza delivery service," Dr. Rodriguez began, pulling up a chair. "Imagine you promise customers that pizza will arrive in 30 minutes or less. But instead of timing from when the customer calls to when the pizza reaches their door, you're only measuring how long it takes to bake the pizza in your oven. You might have the fastest oven in town, but if your delivery drivers get lost, your customers are still getting cold pizza an hour later."

Sarah leaned forward, intrigued. "So you're saying our internal monitoring isn't showing us what customers actually experience?"

"Exactly! This is where three crucial concepts come into play: SLIs, SLOs, and SLAs. Think of them as a building with three floors." Dr. Rodriguez drew three boxes, stacked one on top of another, on the whiteboard. "The foundation level is SLI - Service Level Indicators. These are the actual measurements of what's happening. The middle level is SLO - Service Level Objectives, which are the targets you set for yourself. The top level is SLA - Service Level Agreements, which are the promises you make to customers with consequences if you break them."

## 4. THE EXPLANATION

Dr. Rodriguez turned back to the whiteboard with enthusiasm. "Let's start with SLIs - Service Level Indicators. These are like the speedometer, fuel gauge, and temperature readings in your car. They tell you what's actually happening right now. But here's the critical part: you need to measure what matters to your customers, not what's convenient for you to measure."

"Your current monitoring is like checking if your server's CPU is happy, but your customers care about whether they can complete a purchase quickly.
Real SLIs for an e-commerce platform should measure things like: How long does it take for a page to fully load from a customer's perspective? What percentage of purchase transactions complete successfully? How often can customers access your site when they try?" She drew arrows connecting internal metrics to customer-facing outcomes.

"SLOs - Service Level Objectives - are the targets you set based on those measurements," she continued. "Using our pizza analogy, if your SLI measures actual delivery time from order to doorstep, your SLO might be '95% of pizzas delivered within 30 minutes.' For your platform, you might set an SLO of '99% of page loads complete within 3 seconds' or '99.9% of purchase attempts succeed.'"

"The key insight," Dr. Rodriguez emphasized, tapping the whiteboard, "is that SLOs should be based on what keeps customers happy, not what makes your engineering team comfortable. And here's the secret: you set SLOs slightly better than what you need for customers, giving yourself a buffer before problems become customer-facing issues."

"Finally, SLAs - Service Level Agreements - are the formal promises you make to customers, usually with penalties if you fail. These should be less ambitious than your internal SLOs. If your internal objective is 99.9% uptime, your customer SLA might promise 99.5% uptime. This gives you room to have internal issues without breaking customer promises."

## 5. THE SOLUTION

"Let's solve your mystery step by step," Dr. Rodriguez said, pulling out her tablet. "First, we need to identify proper SLIs that match customer experience. Instead of measuring server uptime from your data center, let's set up synthetic monitoring that attempts to load pages and complete purchases from different locations, just like real customers do."

Within an hour, they had configured new monitoring tools.
The results were eye-opening: while their servers stayed up 98.5% of the time, customers could only successfully complete the full shopping experience 91% of the time. Pages that loaded in 2.3 seconds on their internal network took 8+ seconds for customers on slower connections.

"Now for SLOs," Dr. Rodriguez continued. "Based on industry research, customers abandon shopping carts if pages take longer than 4 seconds to load. So let's set an internal SLO of '95% of page loads complete within 3 seconds from customer locations' and '99% of purchase transactions complete successfully end-to-end.' These targets give you early warning when you're heading toward customer frustration."

"For your SLA with MegaRetail Corp, instead of promising vague 'uptime,' promise specific customer outcomes: '98% of page loads within 4 seconds' and '99% purchase success rate.' This aligns your legal commitments with what actually matters to their business."

## 6. THE RESOLUTION

Three weeks later, Sarah called Dr. Rodriguez with excitement in her voice. "Elena, you solved it! Once we started measuring what customers actually experience and set realistic targets, everything clicked into place. We discovered our CDN wasn't working properly for international customers, and our payment processing had timeout issues we never saw from inside our network."

"The best part? MegaRetail Corp not only dropped their threat to sue but signed an expanded contract. When we showed them our new SLI dashboards proving we could measure and guarantee their actual user experience, they gained confidence in our reliability."

Sarah paused, then chuckled. "I finally understand why that legal document seemed impossible to fulfill - we were promising outcomes we weren't even measuring! Now we measure first, set internal goals higher than customer needs, then promise only what our data proves we can deliver."

As Dr. Rodriguez packed up her consulting materials, she smiled.
Another mystery solved through the simple power of measuring what matters, setting realistic internal targets, and making promises based on real data rather than wishful thinking.
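
The measuring step Dr. Rodriguez describes can be sketched in a few lines. This is a hypothetical illustration, not TechFlow's actual tooling: the `Probe` record, the sample data, and the helper names are invented here. Only the two targets (95% of page loads within 3 seconds, 99% of purchases succeeding) come from the story.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    """One synthetic check run from a customer-like location (hypothetical)."""
    load_seconds: float
    purchase_ok: bool


def fraction(flags):
    """Share of True values among the given booleans."""
    flags = list(flags)
    if not flags:
        raise ValueError("no probes recorded")
    return sum(flags) / len(flags)


def evaluate(probes, latency_limit=3.0, latency_slo=0.95, purchase_slo=0.99):
    """Compute two SLIs from probe results and compare them to the SLOs."""
    fast = fraction(p.load_seconds <= latency_limit for p in probes)
    ok = fraction(p.purchase_ok for p in probes)
    return {
        "latency_sli": fast,
        "purchase_sli": ok,
        "latency_slo_met": fast >= latency_slo,
        "purchase_slo_met": ok >= purchase_slo,
    }


# Made-up sample: 10 probes, 2 slow page loads, 1 failed purchase.
probes = (
    [Probe(1.2, True)] * 7
    + [Probe(8.4, True), Probe(9.1, False), Probe(2.8, True)]
)
report = evaluate(probes)
print(report)
```

With this invented sample, 8 of 10 page loads finish within 3 seconds and 9 of 10 purchases succeed, so both objectives are missed. A server-side uptime check would not have shown either gap.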