“Mark,” said his CEO, “what is happening to our new app? I’m seeing weird behaviors, and I’m receiving phone calls from disappointed customers.”
“I don’t know,” Mark replied honestly. “I’ll gather my team to find out.”
Mark got his team together to check on the new app’s status. “We released it last night, didn’t we?” “Yes, we did,” said Henrich, a senior mobile software engineer. “I personally covered the testing, and it went perfectly. I don’t understand.”
Henrich then looked at the dashboard and quickly found out there was an API call failing randomly. “Why didn’t we see it before?”
The project team quickly fixed the issue, and Mark informed the CEO and the CTO about what had happened. The CEO escalated the incident to the CTO, expressing his disappointment and asking for explanations and improvement actions.
The postmortem clarified that the API call generating the issue wasn’t covered by the project’s tests, at either the unit or the integration level. Testing in staging did not find it either, as this functionality couldn’t be exercised there due to missing test data. Once in production, the absence of user traffic did not immediately reveal it, as no specific alert had been instrumented. Finally, the lack of observability in staging prevented the team from identifying the errors before launch.
After receiving the postmortem outcomes, the CTO set up a meeting with all IT leaders to discuss their quality assurance practices. It turned out there were many gaps, at both the process and capability level, well known to engineers but poorly prioritized at the leadership level.
The CTO quickly decided to set up a software quality improvement program, starting with an assessment of current practices and gaps to define an improvement plan, which he then presented to the CEO, getting immediate approval. A few months later, that plan was in its execution phase, concrete value had already been delivered, and the IT team’s awareness of software quality assurance practices had moved a step forward.
This fictional story tells a lot, in my opinion: from a lack of software engineering practices to leadership insufficiently aware of their relevance and priority. In my experience, this is not unusual, mainly due to resistance to change, application and architecture complexity, the plurality of technologies, and the speed of modern technology options.
The world of software testing is huge, and this post doesn’t pretend to summarize it all. I’ll use the weaknesses the story above outlines to focus on some key areas of the practice, though, and I’ll add links to useful material to extend the vision beyond these practical examples.
Testing Phases and Environments

The integration testing phase should have identified this issue, as the problem was due to a malfunctioning API call; covering it at a different level, such as end-to-end testing, could also have worked.
This phase, together with the preceding unit testing phase, is of paramount importance, as it can anticipate most of the issues that software can suffer in later stages, all the way to production.
That is why they represent the base of the testing pyramid.
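To make the point concrete, here is a minimal sketch of the kind of test that would have caught the failing call before release. The client function `fetch_profile`, the endpoint, and the fake responses are all hypothetical illustrations, not details from the story.

```python
# Sketch of a test around a hypothetical API client wrapper.
# `fetch_profile` and its endpoint are illustrative assumptions.

def fetch_profile(http_get, user_id):
    """Call the profile API; surface non-200 responses as errors."""
    response = http_get(f"/api/v1/profiles/{user_id}")
    if response["status"] != 200:
        raise RuntimeError(f"profile API failed with {response['status']}")
    return response["body"]

def test_happy_path():
    fake_get = lambda url: {"status": 200, "body": {"id": 42}}
    assert fetch_profile(fake_get, 42) == {"id": 42}

def test_failure_is_surfaced():
    # The intermittent 5xx that reached production would fail loudly here.
    fake_get = lambda url: {"status": 503, "body": None}
    try:
        fetch_profile(fake_get, 42)
        assert False, "expected RuntimeError"
    except RuntimeError:
        pass

test_happy_path()
test_failure_is_surfaced()
```

The design choice matters more than the code: by injecting the HTTP function, the same wrapper can be exercised with fakes in unit tests and with real traffic in integration tests.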
Test environments let testers verify software in isolated, semi-isolated, integrated, and merged scenarios, up to production.
Staging, specifically, should let testers simulate production as closely as possible: shortcomings here create risks for production readiness. Adequate data sets, environment availability, and functional equivalence to production are key elements to mitigate them.
In our fictional story, we observed two main weaknesses in the staging environment: lack of observability and lack of test data. The former is probably a symptom of poor attention to environment health.
Test Data Management
Test data management is another key ingredient of proper software quality: data can come from production, provided it is adequately protected (e.g., masked or anonymized to comply with privacy regulations), but it can also be synthetically generated.
Ensuring the availability of necessary test data sets is a key practice, which makes testing both effective and fast.
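Both strategies can be sketched in a few lines. This is a minimal illustration, with hypothetical field names; real masking and synthesis pipelines are considerably richer.

```python
# Sketch of two test-data strategies: masking production-like records
# and generating synthetic ones. Fields are hypothetical examples.
import hashlib
import random

def mask_record(record):
    """Pseudonymize sensitive fields so production data can be reused safely."""
    masked = dict(record)
    digest = lambda s: hashlib.sha256(s.encode()).hexdigest()
    masked["email"] = digest(record["email"])[:12] + "@example.test"
    masked["name"] = "user_" + digest(record["name"])[:8]
    return masked

def synthetic_records(n, seed=0):
    """Generate deterministic synthetic records for repeatable test runs."""
    rng = random.Random(seed)
    return [
        {"name": f"user_{i}", "email": f"user_{i}@example.test",
         "plan": rng.choice(["free", "pro", "enterprise"])}
        for i in range(n)
    ]

prod_row = {"name": "Jane Doe", "email": "jane@corp.com", "plan": "pro"}
safe_row = mask_record(prod_row)
assert "jane" not in safe_row["email"]  # no PII leaks into test data
```

The seeded generator is deliberate: deterministic data sets make failing tests reproducible, which is part of what makes testing "both effective and fast".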
Observability

The APM solution did detect the issue in our story, but only in production, and only once user traffic arose.
Synthetic traffic could at least have exposed the issue earlier, but it wasn’t part of the test strategy for the new app.
Also, no alert was instrumented in production for the API call, and the same gap would have existed in staging even with an APM there. And there wasn’t one.
Observability is another key ingredient of proper software quality: a lack of tools and processes (e.g., why was the alert not instrumented?) can undermine it heavily, exposing the whole IT organization to risks and issues in the availability and resilience of its software services.
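The missing alert can be sketched as a synthetic check: probe the API call repeatedly and fire when the error rate breaches a threshold. The probe, attempt count, and 5% threshold are illustrative assumptions, not the team’s actual tooling.

```python
# Sketch of a synthetic check for the API call that failed in the story.
# Probe shape and threshold are hypothetical.

def error_rate(results):
    """Fraction of failed probe results (True = success)."""
    return 1 - sum(results) / len(results)

def check_api(probe, attempts=20, threshold=0.05):
    """Run the synthetic probe repeatedly; return an alert dict on breach."""
    results = [probe() for _ in range(attempts)]
    rate = error_rate(results)
    if rate > threshold:
        return {"alert": "api_call_failing", "error_rate": rate}
    return None

# With a probe failing 25% of the time, the alert fires:
flaky = iter([True, True, True, False] * 5)
assert check_api(lambda: next(flaky)) == {"alert": "api_call_failing",
                                          "error_rate": 0.25}
```

Crucially, the same check can run in staging before launch: that is exactly the observability the story’s team lacked.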
Modern Capabilities and Patterns
As I said, software quality assurance is a huge and complex topic: software engineering and DevOps practices continuously offer evolving approaches, such as:
- shift-left testing, meaning fast, repeatable, and automated testing in earlier stages, leveraging test automation
- design for testing, enhancing modularity, and considering testability as a requirement
- producing software with “lights on”, committing fast and reducing long-lived branches to avoid source code “going dark”
- observability integration, ensuring that information from observability tools can trigger quality gates
- security testing integration.
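Several of these patterns converge in the idea of a quality gate: a pipeline step that blocks promotion unless agreed thresholds are met. Here is a minimal sketch in that spirit; the metric names and thresholds are illustrative assumptions, not a standard.

```python
# Sketch of a quality gate combining test results, coverage,
# and security findings. Metric names and thresholds are hypothetical.

def quality_gate(metrics, min_coverage=0.80, max_critical_vulns=0):
    """Return (passed, reasons) for a build's quality metrics."""
    reasons = []
    if metrics.get("tests_failed", 0) > 0:
        reasons.append(f"{metrics['tests_failed']} failing tests")
    if metrics.get("coverage", 0.0) < min_coverage:
        reasons.append(f"coverage {metrics['coverage']:.0%} "
                       f"below {min_coverage:.0%}")
    if metrics.get("critical_vulns", 0) > max_critical_vulns:
        reasons.append(f"{metrics['critical_vulns']} critical vulnerabilities")
    return (len(reasons) == 0, reasons)

ok, why = quality_gate({"tests_failed": 0, "coverage": 0.86,
                        "critical_vulns": 0})
assert ok and why == []
```

In a real pipeline the metrics would come from the test runner, coverage tool, and security scanner, and the observability integration mentioned above could feed production signals into the same gate.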
Skills and SW Engineering Practices
None of the above approaches can be implemented without proper skills and software engineering practices. Investing in hiring and training is a priority to ensure maturity progression in this area.
Also, scaling and distributing testing activities towards developers is key to ensuring testing is properly addressed from the earliest software development stages, while keeping central governance of testing practices.
Understanding how the “IT machine” works is the basis for ensuring good software quality.
Moving towards shift-left testing and leveraging software engineering practices is a great IT transformation opportunity, as it ensures improvements in both software resilience and velocity.
Adopting the (ever-moving) latest patterns shouldn’t be the target, though: continuous improvement is. Understanding modern approaches must lead to pragmatic adoption, keeping the organization’s maturity level and overall business priorities in mind.
“Quality is never an accident; it is always the result of intelligent effort.”