Summary of Live Site Incidents on CHT deployments (Q1 2022)

In the first quarter of 2022, we began tracking live site incidents across various CHT deployments. A live site incident is an issue in production that has a demonstrable impact on users, such as an inability to perform their regular work. We respond to these issues with the highest priority and ensure they are appropriately resolved. From the sample of instances we tracked, we observed that:

  • 20% of incidents were fixed after the CHT upgrade
  • 40% were because of dashboard errors
  • 27% were because of couch2pg errors
  • 20% were due to configuration errors

Beyond that, we also observed that 50% of incidents were reported by users, and 60% have been checked for system regressions.

Some recommendations to avoid future disruptions:

  • Upgrade the instances regularly
  • Monitor dashboards and couch2pg for errors
  • Test the configuration

How have the live site incidents affected you? Do you have any suggestions? Please share.

@binod - this is a great list - thanks for putting it together! I was wondering about this item:

  • 20% of incidents were fixed after the CHT upgrade

Do you know how much of the 20% was just a matter of upgrading to the latest available CHT version vs. how much of it exposed a novel bug that needed to be fixed, so the incident had to wait longer to be resolved?

Another way of asking might be: Could all 20% of these incidents be avoided by staying up to date?

Looking back at the incident reports, around two-thirds of that 20% of incidents would have been avoided by staying up to date. For the remaining one-third, we found and fixed the bug in cht-core.
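
To put that split in terms of all tracked incidents, here is a rough back-of-the-envelope calculation. It is only a sketch: it treats "around two-thirds" as exactly 2/3, which is an assumption, and the variable names are purely illustrative.

```python
from fractions import Fraction

# Figures quoted in this thread. Treating "around two-thirds" as exactly 2/3
# is an assumption made for illustration only.
upgrade_fixed_share = Fraction(20, 100)  # incidents fixed after a CHT upgrade
avoidable_fraction = Fraction(2, 3)      # of those, avoidable by staying up to date

# Share of ALL incidents that staying up to date would have avoided.
avoidable_share = upgrade_fixed_share * avoidable_fraction

# Share of ALL incidents that needed a new fix in cht-core.
novel_bug_share = upgrade_fixed_share - avoidable_share

avoidable_pct = round(float(avoidable_share) * 100, 1)
novel_bug_pct = round(float(novel_bug_share) * 100, 1)

print(f"Avoidable by staying up to date: ~{avoidable_pct}% of all incidents")
print(f"Needed a new cht-core fix:       ~{novel_bug_pct}% of all incidents")
```

Under that assumption, roughly 13% of all tracked incidents could have been avoided by upgrading regularly, and roughly 7% required a new bug fix.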

Awesome - thanks for the info! I really appreciate you sharing these stats publicly.