
After the Software Outage – Postmortems and Process Improvement

Posted by Brenda Barrioz on Sep 26, 2019, 2:20:19 PM

 

There is no such thing as a good software outage!

Anytime an IT service is unavailable to your employees, suppliers or customers, the costs begin mounting rapidly, taking shape as lost productivity, lost revenue, lost reputation or most likely a combination of all three.


 

Crosscode Panoptics’ capabilities for automated application mapping, audit trail capture, and governance give our users the tools to dig deep into software outages, understand precisely what went wrong, and take steps to keep it from happening again.

 

Ideally, any organization that is responsible for managing software infrastructure will have a process in place to systematically review software outages and other major IT incidents. These activities go by many names, such as post-incident review or after-action report. Here we will use the slightly morbid but widely understood term postmortem.

 

 

Postmortem

In a postmortem, the team studies the event to be sure there is a solid understanding of what led to the outage, assesses what went right and what went wrong during the response, and develops a recommendation to reduce the likelihood of the event recurring.

The output of the postmortem is the postmortem report. Typical elements of the postmortem report include:

  • Incident Description/High-Level Summary
  • Root Cause Analysis
  • Review of Actions and Effectiveness
  • Lessons Learned
  • Recommended Actions and Remediations
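
The report elements listed above can be captured in a simple structured record so that incomplete reports are easy to spot. The following is a minimal sketch in Python. The `PostmortemReport` class and its field names are hypothetical, chosen only to mirror the list above, and are not part of any Crosscode product.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PostmortemReport:
    """Hypothetical container mirroring the report elements listed above."""
    summary: str                                   # incident description / high-level summary
    root_cause: str = ""                           # outcome of the root cause analysis
    actions_reviewed: List[str] = field(default_factory=list)
    lessons_learned: List[str] = field(default_factory=list)
    recommendations: List[str] = field(default_factory=list)

    def missing_sections(self) -> List[str]:
        """Return the names of sections that have not been filled in yet."""
        sections = {
            "summary": self.summary,
            "root_cause": self.root_cause,
            "actions_reviewed": self.actions_reviewed,
            "lessons_learned": self.lessons_learned,
            "recommendations": self.recommendations,
        }
        # An empty string or empty list counts as "not filled in".
        return [name for name, value in sections.items() if not value]
```

A team could call `missing_sections()` before circulating a report to confirm that no element has been skipped.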

The root cause analysis (RCA) is, in many ways, the key to a genuinely useful postmortem. Critically, an RCA takes the team beyond what happened to begin to understand why the incident occurred. Was the outage caused by a faulty patch? A memory leak? Human error?

While there are countless potential outage root causes, at a high level we can distinguish two broad categories: physical events, such as power outages and hardware failures, and software events. Across the board, poorly managed and poorly understood software change is a leading cause of unscheduled downtime. Studies from organizations such as Gartner and the Uptime Institute consistently place software change issues at the center of 30% or more of all outages.

Among the many issues IT professionals face as systems become more distributed and complex is that software change-related outages are harder to diagnose. It is extremely difficult (if not utterly impossible) to analyze a software outage if you don’t already have a solid understanding of your application infrastructure, along with a solution that monitors system changes and tracks those changes in a meaningful way.

Crosscode Panoptics’ audit trail capabilities can bring your RCA activities to the next level, saving significant time in reviewing system changes and improving the completeness and accuracy of the results.

Built on top of the industry’s most comprehensive automated dependency maps, Panoptics’ audit trail captures changes directly from the runtime environment. The visual tool makes it easy to identify system changes and when they occurred, significantly lowering the bar on the effort and expertise needed to collect the necessary information for an RCA. Panoptics also reliably captures database, code, and configuration changes that often go undetected by other tools.

With Panoptics, your postmortem team can quickly and confidently extract a shortlist of suspect changes. With the shortlist in hand, other Panoptics features support a more in-depth analysis, letting you see the difference in the context of application dependencies and trace the effects of both real and hypothetical change across the software environment.
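
To illustrate the idea of a shortlist of suspect changes, here is a minimal Python sketch that filters recorded changes down to those made shortly before an outage began. The `ChangeRecord` type, the `shortlist_changes` function, and the 24-hour default window are all assumptions made for illustration. They do not represent the Panoptics audit trail format or API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class ChangeRecord:
    """Hypothetical audit-trail entry; not a Panoptics data type."""
    component: str
    kind: str              # e.g. "code", "database", "configuration"
    changed_at: datetime


def shortlist_changes(
    changes: List[ChangeRecord],
    outage_start: datetime,
    lookback: timedelta = timedelta(hours=24),  # assumed window; tune per incident
) -> List[ChangeRecord]:
    """Return changes made within `lookback` before the outage, newest first."""
    window_start = outage_start - lookback
    suspects = [c for c in changes if window_start <= c.changed_at <= outage_start]
    # The most recent changes are listed first, since they are often reviewed first.
    return sorted(suspects, key=lambda c: c.changed_at, reverse=True)
```

A time window is only a first filter. Each change that survives it still needs to be examined against the application's dependencies, as described above.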

Once the team has identified the root cause of the outage, a review of the response effectiveness and a compilation of lessons learned help to round out the picture. The two fundamental areas of inquiry here are whether the organization had the understanding and wherewithal to bring the system back online as quickly as possible and, perhaps more importantly, whether the outage was foreseeable and preventable. Not surprisingly, in hindsight, most organizations determine change-related software failures to have been avoidable.

The final step of the postmortem process is to take all the information gathered and develop recommendations to prevent a recurrence. Crosscode’s automated Governance Operating System (GOeS) enables you to apply lessons learned and take clear, proactive steps to harden your system against similar future failures.

Using GOeS, you create custom rules to govern changes at the most detailed level and send an alert whenever a change doesn’t comply with one of your rules. Rules can be configured to execute anytime an element such as a method, class or database is added, deleted or updated. Rules can also execute when various conditions are met, such as when a change creates a new dependency or creates a new security issue.
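
The general pattern of checking each change against a set of rules and alerting on violations can be sketched in a few lines of Python. This is a generic illustration only. The `Change` and `Rule` types, the `check_change` function, and the two sample rules are hypothetical and do not reflect the actual GOeS rule syntax or interface, which this post does not describe.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Change:
    """Hypothetical description of a single change to the system."""
    element_type: str                  # e.g. "method", "class", "database"
    action: str                        # "added", "deleted", or "updated"
    name: str
    creates_new_dependency: bool = False


@dataclass
class Rule:
    """Hypothetical rule: a description plus a test for non-compliance."""
    description: str
    violated_by: Callable[[Change], bool]


# Sample rules, invented for illustration.
RULES = [
    Rule("Database elements must not be deleted without review",
         lambda c: c.element_type == "database" and c.action == "deleted"),
    Rule("Changes must not introduce new dependencies",
         lambda c: c.creates_new_dependency),
]


def check_change(change: Change, rules: List[Rule]) -> List[str]:
    """Return the description of every rule the change does not comply with."""
    return [rule.description for rule in rules if rule.violated_by(change)]
```

In this sketch, an alert would be raised whenever `check_change` returns a non-empty list for an incoming change.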

Crosscode Panoptics was designed with a single purpose: to help organizations manage today’s complex application environments. Our tools give you unique insight into your application infrastructure across the entire software development lifecycle. If your responsibilities include managing software infrastructure and understanding how changes impact that infrastructure, contact us today for a personalized demonstration and let us show you the difference Panoptics can make.

Topics: Agile, devops, Best Practices, Crosscode, Digital Transformation, Panoptics, IT, Cloud Migration, softwareoutage, failure