4.16 AM - New York, New York
Patricia was dreaming about New York, her favorite place in the world, when her mobile rang and woke her. It was 4.16 am. No good news comes at this hour, she thought as she answered the call.
It was Mark, her VP of IT Operations. “Patricia, sorry for waking you, but we’re handling a pretty bad situation here, and we’re running out of options. Our main applications became unavailable to our customers more than three hours ago, and we still can’t understand what’s happening. That’s why I decided to escalate this incident to you.”
I want to go back to my dream, thought Patricia. “Well done, Mark,” she heard herself saying instead. “Can you please summarize the sequence of events and the technical information you’ve gathered so far?”
“Sure: at 1.11 am, all our internet-exposed applications became unavailable. We’ve gone through a lot of checks, but no service looks wrong. My system engineers are still examining the logs, but we’ve found nothing so far.”
“Did we have any planned changes tonight?” “No, nothing,” replied Mark.
This is terrible, thought Patricia.
08.30 AM - CRISIS
The troubleshooting was still going on at 8.30 am, and the war room was getting more and more chaotic when Patricia got a call from her boss John, the COO. “Hi Patricia, what’s happening? I tried to open our mobile app and got a weird message, and I just got a call from our CEO, who’s getting a lot of angry calls from our customers.”
“Hi John, I’m sorry I did not inform you, but we’re still trying to troubleshoot this issue. Without any clue, actually.” “Look, Patricia, we’ve got to sort it out and quickly. Our crisis management procedure must be applied, and we need to inform the board in no more than 30 minutes”, John said. “Sure, I’ll let you know,” replied Patricia. But she knew she wouldn’t have good news by then.
At 9.00, Patricia got the invitation to join the Crisis Committee. She explained what happened; John added the team was doing everything possible and had escalated the incident to all providers and suppliers at the top management level. The Committee asked to be briefed every 30 minutes and approved a communication to the stakeholders.
At 10.30 am, the Company was under heavy attack on social media, and the press started to ask for explanations. Still, Patricia’s team had no clue.
At 11.15 am, Jane, the CEO’s assistant, set up an urgent call between the CEO, Patricia, and the managers involved, in which the CEO asked for an immediate resolution.
At 12.43 pm, Julie, VP of Applications Management, told Mark she had discovered an unplanned change made at 1.09 am that night. “Really? What was it about?” asked Mark. “It was a minor change to one of our libraries.” “Let’s roll it back!” said Mark.
After the rollback, all services came back online. Mark shared the excellent news with Patricia, who then called her boss and informed the Crisis Committee.
At 1.00 pm sharp, Jane arranged another call with Patricia, Mark, and Julie. “First of all,” the CEO said, “thank you for your efforts to bring our services back online. But I want to be clear that I won’t tolerate another event like this one. I want Patricia and all of you to start working immediately on a plan that will take us to a far higher maturity level! We can’t, I repeat, we can’t afford to lose our services for almost 12 hours again. And this is the third major incident this month!”
And today was only the 12th, thought Patricia.
1.30 PM - LEADERSHIP, FINALLY
Patricia took a walk in the nearby park, as she was not only exhausted but also furious. Mostly with herself, she realized.
She had been the CTO of the Company for almost six months. She had come from a similar role in a far less complex company, and she had to admit she was struggling.
There were so many conflicting and ever-changing priorities, each requiring a mountain of work. Also, Patricia was tired. She worked no fewer than 14 hours a day, and often during weekends, following project rollouts or handling her backlog.
Her boss was supportive but also very demanding.
Mark was a good but inexperienced manager; she had appointed him VP of IT Operations two months earlier, and he had brought a lot of improvements, but she realized she had to coach him on a broader approach.
She went through the events of the previous hours. How many weak points, she thought. She found a bench, took out her notebook, and started to write them down:
- We weren’t able to gather helpful information from the logs and monitoring systems
- It wasn’t clear who was leading the troubleshooting; an evident lack of coordination impaired the analysis efforts
- Suppliers in charge of infrastructure and application services were involved, but their contributions were poor
- We missed communicating regularly and clearly, as we were too busy trying to solve the incident
- We didn’t know there was an unplanned change, which was the root cause of the incident
Patricia felt a bit better and went back to work. She immediately arranged a meeting with her managers, Mark and Julie, to share her list of weaknesses and ask them to confirm it and explain the main causes. Patricia noted their answers down, which uncovered aspects she hadn’t fully focused on yet:
- Inadequacy of monitoring systems, both in terms of functionality and scope of implementation
- Poor internal technical skills and general demotivation of staff
- High turnover in both application and infrastructure supplier teams
- Unclear incident handling organization model
- Change management procedures applied mostly to significant changes
- Lacking or obsolete operational procedures
- No retrospectives on past incidents
- Lack of metrics, which impairs our priority setting
Why didn’t I do this before? she asked herself. Some of the items were known and already being tackled, but not deeply enough, she admitted to herself. Now she had a clearer picture of where to direct their efforts.
She asked her team to build a plan on the above topics and have it ready by the following day at noon. She reviewed it, and after some minor changes and adding some context, she sent an email to her boss, getting great feedback encouraging her to share the plan with the CEO.
4.45 PM - SUCCESS
At 4.45 pm she wrote to the CEO:
First of all, I wish to state that I take full responsibility for what happened yesterday. An untraced change to one of our libraries caused the incident.
The whole team is committed to achieving a far better quality of service; we know how to do it, and we have already started some of the required steps.
We need to adopt a more industrial approach and better practices in our IT Operations and SW Lifecycle Management, and the following steps are our priorities:
- Monitoring systems improvement to achieve broader and deeper observation capabilities on our systems, particularly during outages. We started a software selection process last month, and once the spending is approved, we estimate we will have the new system in place in five months. Meanwhile, we’ll leverage our current solutions to improve alerting and shed light on at least some of our dark areas.
- Engineers increase and skill improvement, as we rely too much on suppliers: I’ve already talked to our HR Director to support our training and hiring plans, which we’ll finalize by the end of the current week.
- Supplier governance enforcement: they need to act as Partners, investing more, documenting operational activities, and ensuring the stability and proper skills of their teams. If not, there won’t be any room for them. I’ve already delivered this message over the last few weeks, but I’ll reinforce it with their top managers in the coming days.
- Site Reliability Engineering (SRE) practice adoption: SRE, introduced by Google, brings a retrospective and engineering culture to managing Operations; our training proposal already includes it.
- Change management process improvement to ensure a complete tracing of changes.
- Proper KPIs definition and measurement, to track our progress and regularly report to you and John, beginning at the end of the current month.
The above steps will be completed within the next six months, continuously adding value to our Company.
We’ll finalize the plan and the economics by the end of next week.
Susanne, the CEO, replied within minutes, expressing satisfaction and total commitment to support the plan.
Patricia shared the good news with the team: they all felt energized and confident.
The detailed plan was approved, and six months later, the team had completed most of the program and improved enormously.
Also, Patricia and her team had a clearer picture of how to move forward, mainly thanks to the SRE practice adoption path they had started and the new skills the team had acquired.
In this fictional story, IT Operations shows a low level of maturity; the CTO’s approach, though, is robust, the act of a leader.
In my experience, when they face issues with an analytic approach, leaders can get great results and turn problems into successes.
KPI definition and measurement is a fundamental part of the plan.
Which KPIs can Patricia use to measure the team’s improvements regularly?
Here are some examples:
- service availability
- MTTD (Mean Time To Detect, the average time it takes to detect an incident)
- MTTR (Mean Time To Repair, the average time it takes to repair an incident)
- number of traced RFCs (Requests for Change)
- team satisfaction index derived from regular surveys
- number of training hours consumed by the team
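The first three KPIs can be derived from a simple incident log. The sketch below is only an illustration, not something taken from Patricia’s plan: the `Incident` record, the function names, and the sample timestamps are all hypothetical. It also assumes that repair time is counted from detection to resolution and that availability is the share of the reporting period not spent in an outage; teams define these measures in different ways.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative."""
    started: datetime    # when the service actually failed
    detected: datetime   # when the team became aware of the failure
    resolved: datetime   # when the service was restored


def mean_time_to_detect(incidents: list[Incident]) -> timedelta:
    """MTTD: average delay between failure and detection."""
    total = sum((i.detected - i.started for i in incidents), timedelta())
    return total / len(incidents)


def mean_time_to_repair(incidents: list[Incident]) -> timedelta:
    """MTTR: average delay between detection and resolution (assumed definition)."""
    total = sum((i.resolved - i.detected for i in incidents), timedelta())
    return total / len(incidents)


def availability(incidents: list[Incident], period: timedelta) -> float:
    """Fraction of the period during which the service was up."""
    downtime = sum((i.resolved - i.started for i in incidents), timedelta())
    return 1 - downtime / period


# Made-up sample data, for illustration only.
sample = [
    Incident(datetime(2024, 3, 5, 1, 0),
             datetime(2024, 3, 5, 1, 10),
             datetime(2024, 3, 5, 3, 10)),
    Incident(datetime(2024, 3, 12, 1, 0),
             datetime(2024, 3, 12, 1, 30),
             datetime(2024, 3, 12, 12, 30)),
]

mttd = mean_time_to_detect(sample)
mttr = mean_time_to_repair(sample)
uptime = availability(sample, period=timedelta(days=30))

print(f"MTTD: {mttd}")
print(f"MTTR: {mttr}")
print(f"Availability over 30 days: {uptime:.2%}")
```

Tracking these values month over month, rather than looking at a single figure, is what would let a team like Patricia’s show whether the plan is working.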