Why SRE should be part of your IT organization

SRE brings an engineering approach to IT operations

Excellence in IT operations is almost always a priority nowadays. Distributed and hybrid systems and service models make reliability targets increasingly hard to achieve.

Failures happen, whether in your on-premises or cloud data center. What matters is what we learn from them and how we establish solid processes to react quickly and efficiently, and to prevent specific issues once we understand why they occurred.

SRE is what you get when you treat operations as if it’s a software problem

Google

SRE (Site Reliability Engineering) is a set of practices that aims to increase the quality of IT operations through a measurable, engineered, and continuous improvement approach.

SRE - IT Processes in scope

SRE deals with several processes. Let’s look at them.

Availability

Time-Based Availability

Usually, availability is measured on a time basis, as the ratio of uptime to total service time.

Here are some examples of time-based availability target values for a given IT service/system:

  • a target of 99.5% availability means 1.83 days a year of downtime is tolerated;
  • a target of 99.9% availability means 8.76 hours a year of downtime is tolerated;
  • a target of 99.95% availability means 4.38 hours a year of downtime is tolerated;
  • a target of 99.99% availability means 52.6 minutes a year of downtime is tolerated.
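The arithmetic behind these figures can be reproduced with a few lines of Python. This is only a minimal sketch: the function name allowed_downtime and the 365-day year are assumptions made for illustration, not part of any standard.

```python
from datetime import timedelta


def allowed_downtime(target_percent: float,
                     period: timedelta = timedelta(days=365)) -> timedelta:
    """Return the downtime tolerated over `period` for an availability target.

    `target_percent` is the availability target, e.g. 99.9 for "three nines".
    The default period is a 365-day year (an assumption of this sketch).
    """
    if not 0 <= target_percent <= 100:
        raise ValueError("target_percent must be between 0 and 100")
    # The tolerated downtime is the share of the period not covered by the target.
    return period * (1 - target_percent / 100)


for target in (99.5, 99.9, 99.95, 99.99):
    hours = allowed_downtime(target).total_seconds() / 3600
    print(f"{target}% availability tolerates about {hours:.2f} hours of downtime a year")
```

Passing a different period (for example timedelta(days=30)) gives the tolerated downtime for a monthly target.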

Time-based availability can be less meaningful for highly distributed systems (e.g. global services), as the service is probably at least partially up somewhere in the world at any given time. This can be addressed by “regionalizing” targets of course, as done by the following cloud providers at the time of writing this post:

  • AWS commits to use commercially reasonable efforts to make Amazon EC2 available for each AWS region with a Monthly Uptime Percentage of at least 99.99%;

  • Microsoft Azure guarantees, for all Virtual Machines that have two or more instances deployed across two or more Availability Zones in the same Azure region, Virtual Machine Connectivity to at least one instance at least 99.99% of the time.

Aggregate Availability

Another way to measure availability is based on technical behavior, which means, for instance, measuring the success rate of the requests a system serves over a defined rolling time window (e.g. a day).

For example, if a given system should serve 1M requests a day, a target of 99.99% means that 100 errors a day are tolerated. The end-to-end measurement of the systems a user needs in order to consume a given IT service is obtained by weighting each sub-system according to its impact on user-perceived availability.
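As a rough sketch of the figures above, the snippet below computes a request-based availability, the number of tolerated errors for a target, and an end-to-end value. The weighted average used for the end-to-end figure is only one possible way of balancing sub-systems, and all function names and the weights are hypothetical.

```python
from typing import Mapping, Tuple


def aggregate_availability(successful_requests: int, total_requests: int) -> float:
    """Success rate of served requests over the measurement window."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return successful_requests / total_requests


def tolerated_errors(total_requests: int, target: float) -> int:
    """Failed requests tolerated for a target expressed as a fraction (0.9999)."""
    # round() absorbs floating-point noise such as 99.99999999998899.
    return round(total_requests * (1 - target))


def weighted_availability(subsystems: Mapping[str, Tuple[float, float]]) -> float:
    """Weighted average of sub-system availabilities.

    `subsystems` maps a name to (weight, availability); the weight expresses
    how much the sub-system matters to user-perceived availability
    (an assumption of this sketch).
    """
    total_weight = sum(weight for weight, _ in subsystems.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weight * avail for weight, avail in subsystems.values()) / total_weight


print(tolerated_errors(1_000_000, 0.9999))
print(weighted_availability({"frontend": (0.5, 0.9995), "payments": (0.5, 0.9990)}))
```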

Performance

Availability is not the only indicator relevant to SRE (and to quality of service): performance indicators such as error rate, request latency, and system throughput play a key role too.

Monitoring capabilities are key success factors to properly instrument and measure such indicators.
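To make these indicators concrete, here is a minimal sketch of how they could be derived from raw request records. In practice a monitoring platform does this for you; the Request record, the function names, and the nearest-rank percentile method are assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Request:
    latency_ms: float   # time taken to serve the request
    succeeded: bool     # False if the request ended in an error


def error_rate(requests: Sequence[Request]) -> float:
    """Share of requests that ended in an error."""
    return sum(not r.succeeded for r in requests) / len(requests)


def latency_percentile(requests: Sequence[Request], percentile: float) -> float:
    """Latency at the given percentile, using the nearest-rank method."""
    ordered = sorted(r.latency_ms for r in requests)
    rank = max(1, math.ceil(percentile / 100 * len(ordered)))
    return ordered[rank - 1]


def throughput(requests: Sequence[Request], window_seconds: float) -> float:
    """Requests served per second over the observation window."""
    return len(requests) / window_seconds
```

Percentiles (such as the 95th or 99th) are often preferred over averages for latency, because an average can hide the slow requests that a minority of users experience.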

Service Levels

Service levels are defined by:

  • SLIs (Service Level Indicators), such as availability, system throughput, error rate, request latency;
  • SLOs (Service Level Objectives), which define a target value (or range of values) for SLIs, for example, setting the availability target to 99.9% for a given service or setting a specific request latency target to be less than 100ms;
  • SLAs (Service Level Agreements), contracts with users stating the consequences of meeting or missing specific SLOs. Their definition depends on the level of formalization required with users and the confidence in achieving SLOs;
  • Error budget, which defines how much unreliability can be tolerated: it is the complement of the SLO (for example, 0.1% for a 99.9% target), and the difference between the defined SLO and the actual measurement (typically from your monitoring platforms) gives the budget of missed reliability that can still be tolerated for the considered timeframe.
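The relationship between an SLO and its error budget can be sketched in Python as follows. This is one common request-based formulation, not the only one; the function names and the example numbers are hypothetical.

```python
def error_budget(slo: float, total_events: int) -> float:
    """Number of failed events tolerated in the timeframe for a given SLO.

    `slo` is a fraction, e.g. 0.999 for a 99.9% target.
    """
    if not 0 < slo < 1:
        raise ValueError("slo must be strictly between 0 and 1")
    return (1 - slo) * total_events


def remaining_budget(slo: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent.

    1.0 means nothing has been consumed, 0.0 means the budget is exhausted,
    and a negative value means the SLO has been missed.
    """
    allowed_failures = error_budget(slo, total_events)
    actual_failures = total_events - good_events
    return 1 - actual_failures / allowed_failures


# Hypothetical month: 1M requests, 400 of them failed, against a 99.9% SLO.
print(f"{remaining_budget(0.999, 999_600, 1_000_000):.0%} of the error budget is left")
```

A team could use the remaining fraction as a simple decision input: while it stays comfortably positive there is room for riskier changes, and when it approaches zero the focus shifts to reliability work.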

This framework enables more risk-aware decision-making and prioritization between naturally conflicting items such as velocity and reliability: as long as SLOs are met and the error budget is not exhausted, your IT Organization can pursue velocity, innovation, and whatever else matters within these caps, such as lowering technical debt.

Good examples of error budget usage are investing in the automation of key process steps to shift the reliability of IT systems left (such as software code quality checks, automated testing, code reviews, deploy automation, security checks), and experimenting with or testing response procedures. Both build further confidence in reaching targets, which can in turn be used to address backlogged issues or to evaluate raising targets over time.

How much of this you share with your organization’s business side depends, in my opinion, on BRM (Business Relationship Management) maturity: there’s no need to do it at the start, so that your IT Organization can first practice defining indicators and targets.

Capacity Planning & Provision

To ensure that performance (and availability) indicators are met, future demand must be forecast and IT systems capacity adjusted to keep up with day-by-day growth.

The provision of capacity is, of course, a related process.

Emergency Response

Service targets can be met not only by designing resilience to failures, but also through an effective and efficient response to production issues.

Here are some of the key areas of excellence that SRE can help to achieve within this process:

  • Crisis (or “war”) room management: this sub-process ensures procedures to define roles, engagement of subject matter experts, events logging, proper communication to the IT organization, the Company, and Customers to effectively manage complex production incidents;
  • Postmortems, i.e. blameless and analytical in-depth retrospectives on what happened during production incidents, focusing on what can be learned in terms of mitigation and prevention, which can lead to actions definition and prioritization;
  • Improvement actions review process, ensuring regular progress monitoring, escalation, and prioritization, with due management level involvement.

A postmortem template can be found here and a real postmortem example can be found here.

Relevant KPIs are:

  • Mean Time to Detect (MTTD): this indicator can lead to improvements in monitoring/alerting capabilities or operational procedures;
  • Mean Time to Repair (MTTR): this indicator says a lot about the ability to resolve, or at least mitigate, issues, and can lead to improvements in knowledge sharing, documentation, and troubleshooting practices;
  • Mean Time to Failure (MTTF), which helps to evaluate the reliability of a given hardware/system component or service.
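These KPIs can be computed from incident timestamps, as in the minimal sketch below. Definitions vary between organizations: here MTTR is measured from the start of the failure to its resolution, and MTTF as the average uptime between a resolution and the next failure. Both choices, like the Incident record and the function names, are assumptions of this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Sequence


@dataclass
class Incident:
    started: datetime    # when the failure began
    detected: datetime   # when monitoring or a person noticed it
    resolved: datetime   # when service was restored


def _mean(durations: List[timedelta]) -> timedelta:
    if not durations:
        raise ValueError("not enough incidents to compute a mean")
    return sum(durations, timedelta()) / len(durations)


def mttd(incidents: Sequence[Incident]) -> timedelta:
    """Mean Time to Detect: average delay between failure and detection."""
    return _mean([i.detected - i.started for i in incidents])


def mttr(incidents: Sequence[Incident]) -> timedelta:
    """Mean Time to Repair: average delay between failure and resolution."""
    return _mean([i.resolved - i.started for i in incidents])


def mttf(incidents: Sequence[Incident]) -> timedelta:
    """Mean Time to Failure: average uptime between consecutive incidents."""
    ordered = sorted(incidents, key=lambda i: i.started)
    uptimes = [nxt.started - prev.resolved for prev, nxt in zip(ordered, ordered[1:])]
    return _mean(uptimes)
```

Tracking these values over time, rather than looking at a single snapshot, is what shows whether monitoring and response practices are actually improving.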

Testing emergency response is a key success factor to ensure continuous improvement: staging environments can be used for safe procedures testing.

Site Reliability Engineers can also be required to take part in on-call rotations, to ensure proper service time windows coverage.

Change Management and Production Readiness

Preparing for production launch of a new feature/service is a complex task. SRE can ensure proper actions have been implemented and considered, interact with release managers in charge, and help to evaluate potential non-mitigated risks to ensure the safest and most successful launch to production.

A good template to create a launch plan can be found here.

Key success factors to implement or improve SRE practices

(Photo by Steven Lelham on Unsplash)

In my experience, the following are, among others, key factors for getting greater value out of introducing SRE:

  • DevOps & Engineering culture and practices
  • Monitoring practices
  • Infrastructure as Code practices
  • Service excellence culture
  • Blameless culture
  • Production services technical knowledge
  • Leadership to manage people change and their empowerment

Let’s take a closer look at some of them.

DevOps & Engineering culture and practices

When an IT organization has already established DevOps and software engineering practices, it can build on the engineering approach, the automation targets, and the measurement culture already in place, which together create an ideal environment for building an SRE culture.

Monitoring Practices

Measuring availability, especially aggregate availability, and the related SLOs requires adequate monitoring capabilities.

Infrastructure as Code practices

The implementation of IaC (Infrastructure as Code) alone brings great value in enforcing observability and availability of IT services, because it lets infrastructure assets benefit from all the capabilities already leveraged (or that can be leveraged) for software development (e.g. CI/CD, automated tests, software quality controls and gates, release management, and engineering).

IaC brings more stable systems and services, thanks to the shift-left of practices that can anticipate potential issues in production. In my opinion, it is one of the greatest areas of co-operation between DevOps and SRE.

Service Excellence Culture

I consider service excellence culture to be the common understanding within the whole IT Organization that ensuring the targeted service quality to consumers must be a top priority.

This means leaders and team members all recognize the importance of practices that build resilient services, learn from mistakes or issues in production, and identify proper incident mitigation, prevention, and service improvement actions, tracked through a proper governance process.

Blameless culture

Hope is not a strategy

Traditional SRE saying (Site Reliability Engineering book, Google)

Failures happen. Full stop. It doesn’t mean IT professionals should stand passive waiting for them to happen; on the contrary, all possible mitigation and prevention actions should be put in place, but with the focus on continuous improvement and not on who did something wrong.

Which priorities?

(Photo by Denise Jans on Unsplash)

As usual, there’s no easy answer.

In my experience, the maturity level in processes and practices described above drives prioritization.

For example, if incidents are frequent and take considerable effort and time to be repaired, then investing in monitoring, emergency response and production readiness practices should be prioritized. The SLI/SLO/SLA framework can be implemented later.

Also, provisioning and capacity planning can be addressed by SREs to ensure efficiency and an end-to-end look, but in my opinion, they’re not priority number one: sharing indicators and letting IT asset owners be in charge of capacity and provisioning can be a working model too.

I initially brought too many processes into scope when setting up SRE practices, and soon realized I needed to prioritize better.

Champions

Motivated, talented, and engineering-oriented team members can be a success factor in introducing SRE in your IT Organization.

That’s what I did and it worked great.

Useful Resources from Google

(Photo on top by Bill Oxford on Unsplash)