One day, Mark was in a meeting with his SRE team when their pager announced a major service outage.
They immediately set up a crisis room, appointed the incident commander, and brought on-call engineers in.
Monitoring systems were deep red, and a few minutes later Mark got a call from the CEO informing him that marketing had to launch a new product in a few hours: “We need to be back online, Mark. Now!”
The crisis room wasn’t able to troubleshoot the root cause of the incident: too many alerts, from practically all systems. Mark suggested taking a higher-level view of the events: “If we’re having such a mess, something ‘central’ could be the root cause. I suggest we start looking at our network and firewalls,” he said.
It turned out he was right: an hour later, they found that a core network device was hit by a bug in its operating system version, causing erratic packet handling, failing without the expected symptoms, and blocking high-availability failover. Luckily, a very good network engineer, John, was on call.
Mark escalated the case with the network vendor, but the support case was immediately closed due to the end-of-life status of the device. Mark was frustrated; it was unexpected.
They were alone. John then started looking at public documentation and similar cases, and found that a similar incident had already occurred at several companies on that OS version, along with a documented workaround that had worked.
The incident commander quickly authorized the workaround, and the outage was fixed within half an hour. Overall, it took more than three hours to get all services back.
The postmortem showed that the device was not alone: more than 30% of key network devices had reached end-of-life, and an urgent upgrade initiative was approved, though it disrupted planned activities.
This short and fictional story outlines a few elements:
- obsolescence risks can easily turn into severe issues
- network obsolescence risk looked poorly managed: asset owners (network engineering) and SRE teams were both unaware of the obsolescence status, which also means SRE was probably playing a weak “check and balance” role
- given all the above, there were probably several more risks linked to obsolescence and service availability that went unwatched.
I think this happens in many companies, even when IT risk management practices are in place, mainly due to:
- vendors shrinking support windows
- dependencies among assets (software, architecture, and infrastructure components), which slow down upgrade initiatives
- poor risk management culture and weak enforcement of team accountability.
CENTRAL VERSUS DISTRIBUTED RISK MANAGEMENT APPROACHES
In my experience, formal processes for managing risks help, but can hardly capture the essence of IT risks. For instance:
- as they’re brought to top management’s attention, they’re often simplified and maybe a little optimistic, tending to show a “we’ve got things under control” picture
- risk definition and progress tracking can feel like an external, annoying task, thus failing to properly enforce team accountability.
I am in favor of a distributed risk approach, because it scales, keeps accountability on teams’ shoulders, and can better account for asset- and service-specific nuances. It can also feed centralized processes.
I find this approach inspiring: both easy and effective. Top risks are declared and mitigation actions listed, with owners when necessary. That’s it. And all publicly.
I started adopting this technique with my teams (though not publicly) some time ago: the major obstacle was the culture switch, but it eventually started working.
This more risk-aware approach also deepened our understanding, forcing us to find new ways to overcome risks with velocity in mind. Automation, engineering techniques, a right-sized testing approach, strong prioritization: they all become effective knives to cut complexity into manageable slices.
It also put us in a more aware and constructive position when discussing priorities and budget allocation with our top management.
HOW TO IMPLEMENT AN EFFECTIVE BIGGEST RISKS MANAGEMENT PRACTICE
Define Your Biggest Risks
I suggest starting simple, by asking your team: “What are our biggest risks?” and giving a few hints about risks to consider.
Biggest Service Quality Risks
- Insufficient service resilience
- Poor service quality, at the performance level for instance
- Insufficient protection against a security breach.
Biggest People Risks
- Poor team motivation, a key engine to get progress
- Loss of key team members, potentially weakening key processes or service management
- Unclear accountability and poor team empowerment
- Insufficient skills to face IT complexity and evolution.
Biggest Technology Risks
- Increase of Technical Debt
- Loss of key technology modernization opportunities
- Slow reduction of complexity and dependencies
- Insufficient velocity (which can be mapped, for instance, to our capabilities roadmap, support for business initiatives, or software lifecycle management).
Biggest Financial Risks
- Missed financial targets, e.g. in spending optimization or budget allocation for initiatives.
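Once declared, the biggest risks can be kept as a small, structured register so they stay visible and reviewable. A minimal sketch in Python; the categories, field names, and entries are illustrative assumptions, not a prescribed schema:

```python
from dataclasses import dataclass

@dataclass
class Risk:
    category: str     # e.g. "technology", "people", "service quality"
    description: str
    mitigation: str   # planned mitigation initiative
    owner: str = ""   # optional: a named owner, when necessary

# Illustrative entries only; a real register comes out of the team discussion.
register = [
    Risk("technology", "Increase of technical debt",
         "Quarterly refactoring budget", owner="platform team"),
    Risk("people", "Loss of key team members",
         "Documentation and pairing rotation"),
    Risk("service quality", "Insufficient service resilience",
         "Resilience testing on critical paths", owner="SRE"),
]

def by_category(risks, category):
    """Return the declared risks for one category."""
    return [r for r in risks if r.category == category]

for r in by_category(register, "technology"):
    print(f"{r.description} -> {r.mitigation}")
```

Keeping the register this small is deliberate: a list short enough to read aloud in a team meeting is more likely to be reviewed than a formal risk database.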
Define Mitigation Initiatives
The next step is asking your teams to define mitigation initiatives for your biggest risks.
They can be for instance:
- technology evolutions
- process adjustments
- reviews of team scope, focus, and functional model
- training program definitions.
Define Drivers To Set Your Priority Map
Drivers come into play to support prioritization. You can’t do everything immediately, but you must act on the biggest priorities first.
For instance, risks and related mitigation initiatives can be prioritized by drivers such as service criticality, security exposure, and team and member relevance.
With this simple approach, you can define your priority map, which brings clarity and offers a common understanding of risks and the effectiveness of the mitigation approach.
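One way to sketch such a priority map: rate each risk against the shared drivers and rank by a weighted score. The driver names, weights, and ratings below are illustrative assumptions, not a standard model:

```python
# Shared drivers and their weights, agreed upon with the team (illustrative).
WEIGHTS = {
    "service_criticality": 0.5,
    "security_exposure": 0.3,
    "team_relevance": 0.2,
}

def priority_score(drivers: dict) -> float:
    """Weighted sum of driver ratings (each rated 1-5 by the team)."""
    return sum(WEIGHTS[name] * rating for name, rating in drivers.items())

# Hypothetical ratings for two risks from the register.
risks = {
    "EOL network devices":    {"service_criticality": 5, "security_exposure": 4, "team_relevance": 3},
    "Unclear accountability": {"service_criticality": 2, "security_exposure": 1, "team_relevance": 5},
}

# Highest score first: this ordering is the priority map.
priority_map = sorted(risks, key=lambda r: priority_score(risks[r]), reverse=True)
for name in priority_map:
    print(f"{priority_score(risks[name]):.1f}  {name}")
```

The exact weights matter less than the fact that they are shared: once agreed upon, the same drivers can be reused in planning and budget discussions with top management.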
Wrapping up, I strongly recommend to:
- Act at the team level on risk management, focusing on the culture switch; you’ll also “feed” the central process more easily
- Start simple, with the biggest risks, to ease the culture switch
- Push your teams to work on strong prioritization, with shared drivers, when defining mitigation initiatives
- Define and regularly review your priority map and start using it as a basis for planning and budget allocation discussions with your top management.
“A major lesson in risk management is that a ‘receding sea’ is not a lucky offer of an extra piece of free beach, but the warning sign of an upcoming tsunami.”