How Service Level Objectives work.

Cloud-native software and its supporting tools and infrastructure generate metrics and data points every second that indicate a system’s state and performance. Service-level objectives (SLOs) define or support a set of higher-level business goals, which you can measure using the data and insights from observability tools.

The goal of SLOs is to deliver more reliable, resilient, and responsive services that meet or exceed user expectations. Reliability and responsiveness are often measured in nines on the way to 100%. For example, an objective for system availability can be:

* 90% – one 9
* 99% – two 9s
* 99.9% – three 9s
* 99.99% – four 9s
* 99.999% – five 9s
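To make the nines concrete, each target can be converted into the amount of downtime it permits. The following is a minimal sketch, assuming a 30-day measurement window; the function name `allowed_downtime` and the window length are illustrative choices, not part of any standard.

```python
from datetime import timedelta


def allowed_downtime(availability_pct: float,
                     window: timedelta = timedelta(days=30)) -> timedelta:
    """Return the downtime an availability target permits over the window.

    The 30-day default window is an assumption made for illustration.
    """
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    # The permitted downtime is the unavailable fraction of the window.
    return window * (1 - availability_pct / 100)


if __name__ == "__main__":
    for target in (90, 99, 99.9, 99.99, 99.999):
        minutes = allowed_downtime(target).total_seconds() / 60
        print(f"{target}% -> about {minutes:.2f} minutes per 30 days")
```

Under these assumptions, three 9s allows roughly 43 minutes of downtime in 30 days, while four 9s allows a little over 4 minutes.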

Each additional nine usually involves greater cost and complexity to achieve. Users need a certain level of responsiveness, but beyond that point they can no longer detect a difference, so further improvement adds cost without improving their experience. Setting SLOs is part science and part art, striking a balance between statistical perfection and realistic goals.

You can set service-level objectives based on individual indicators, such as batch throughput, request latency, and failures per second. You can also create SLOs based on aggregate indicators, for example, the application performance index (Apdex), an open standard that scores user satisfaction from 0 to 1 by comparing measured response times against a target threshold.
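As an illustration of an aggregate indicator, the standard Apdex formula counts a sample as satisfied when its response time is at most the threshold T, as tolerating when it is between T and 4T, and as frustrated otherwise; the score is (satisfied + tolerating / 2) / total. The sketch below assumes response times in seconds, and the function and parameter names are hypothetical.

```python
from typing import Iterable


def apdex(response_times_s: Iterable[float], threshold_s: float) -> float:
    """Compute an Apdex score in [0, 1] for the given response times."""
    if threshold_s <= 0:
        raise ValueError("threshold_s must be positive")
    samples = list(response_times_s)
    if not samples:
        raise ValueError("at least one sample is required")
    # Satisfied: at or under the threshold T.
    satisfied = sum(1 for t in samples if t <= threshold_s)
    # Tolerating: above T but no more than 4T; these count as half.
    tolerating = sum(1 for t in samples if threshold_s < t <= 4 * threshold_s)
    return (satisfied + tolerating / 2) / len(samples)
```

For example, with a 0.5-second threshold and samples of 0.1, 0.2, 0.6, 1.0, and 3.0 seconds, two samples are satisfied, two are tolerating, and one is frustrated, giving a score of 0.6. An SLO could then be stated as keeping this score above a chosen value.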

Gathering and analyzing metrics over time will help you determine the overall effectiveness of your SLOs so you can tune them as your processes mature and improve. These trends also help you adjust business objectives and service-level agreements (SLAs).
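One simple way to review an SLO over time is to compare the attained ratio in each measurement window against the target. The sketch below assumes you already export counts of good and total events per window from your observability tooling; the `Window` type, the `attainment` function, and the sample figures are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Window(NamedTuple):
    label: str   # for example, a week or month identifier
    good: int    # events that met the objective
    total: int   # all events observed in the window


def attainment(windows: List[Window],
               target: float) -> List[Tuple[str, float, bool]]:
    """Return (label, attained ratio, target met?) for each window."""
    results = []
    for w in windows:
        if w.total <= 0:
            raise ValueError(f"window {w.label!r} has no events")
        ratio = w.good / w.total
        results.append((w.label, ratio, ratio >= target))
    return results


if __name__ == "__main__":
    # Illustrative data only, not real measurements.
    history = [Window("week-1", 9_990, 10_000), Window("week-2", 9_940, 10_000)]
    for label, ratio, met in attainment(history, target=0.995):
        print(f"{label}: {ratio:.2%} {'met' if met else 'missed'}")
```

A target that is met in every window with a wide margin may be too loose, while one that is missed repeatedly may be unrealistic for the current system; either pattern is a signal to revisit the objective.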