Mitigating “Database has a global failure” (“Unexpected error” Dialog)

co-author @diana

“Database has a global failure” is an intermittent error that can affect any CHT user. All users are at some level of risk, but some users experience this error 60x more frequently than others. This post explains the known factors contributing to this variation and provides steps that can be taken to reduce error rates, both at the user level and at the project level.

Causes

Analysis shows that 92% of these failures occur when devices restore from sleep or when the app is minimised in Android. We believe this is most likely to occur if the app sleeps while navigating between tabs. The underlying cause is an open issue in Chrome.

Other known causes of this failure are running out of disk storage space (QuotaExceeded) and I/O errors like an actual failure to write to disk or other OS/hardware failure. For the technically inclined, see this.

What Do Users See?

When this failure occurs, the user will see an “Unexpected error” dialog which prompts them to reload the page. If the user closes the dialog (or if they are not running cht-core 3.3 or above) the user may experience a range of broken experiences: infinite spinners, failures to load, persistent errors on all tabs, inability to complete workflows, etc.

Prevalence

According to data across four larger CHT projects, this failure occurs for about 2% of users per month. However, there is significant variation across users and projects: in some projects the failure occurs for 4% of users per month, and in others for only 0.02% of users per month.

This data is based only on confirmed measurable cases and a significant portion of this issue’s prevalence may not currently be observable.

Risk Factors

Old cht-android versions

Users running cht-android 0.x + Android <10 have a ~3.9x higher rate of Global Database Failures compared with users running cht-android 1.x or comparable experiences.

Many Documents

Failure rates increase roughly exponentially with the number of documents on the device.

Devices

Certain devices appear to suffer from this issue more often than others. For example, users with Tecno B-Class devices experience this failure at a 3.3x higher rate than other users in the same project with other devices including: Tecno K-Class, Tecno W-Class, or Samsung devices.

We do not have a list of devices which suffer from this beyond Tecno B-Class Devices. You can run this query to identify problematic devices based on your own project’s telemetry data. If you do this - please share your findings!
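As a rough illustration of the comparison behind figures like "3.3x higher rate", here is a minimal Python sketch. It is not the linked query. It assumes you have already exported telemetry into rows of (user, device model, whether the user had a failure in the period); that row shape and the function names are hypothetical.

```python
from collections import defaultdict

def device_failure_stats(records):
    """Count (users_with_failure, total_users) per device model.

    `records` is an iterable of (user_id, device_model, had_failure) tuples.
    This shape is an assumption for illustration, not a CHT telemetry schema.
    """
    users = defaultdict(set)
    failed = defaultdict(set)
    for user_id, model, had_failure in records:
        users[model].add(user_id)
        if had_failure:
            failed[model].add(user_id)
    return {model: (len(failed[model]), len(users[model])) for model in users}

def relative_rate(stats, model):
    """Failure rate of `model` divided by the pooled rate of all other models.

    Returns None when the ratio is undefined (no comparison users or no
    failures among them).
    """
    failures, total = stats[model]
    other_failures = sum(f for m, (f, _) in stats.items() if m != model)
    other_total = sum(n for m, (_, n) in stats.items() if m != model)
    if total == 0 or other_total == 0 or other_failures == 0:
        return None
    return (failures / total) / (other_failures / other_total)
```

A ratio well above 1 for a model suggests it is worth a closer look, though small user counts per model can make the ratio noisy.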

Cht-Core Version Doesn’t Appear to Matter

This is an open issue in the latest version of cht-core. There is no data to suggest that this failure affects users differently based on their cht-core version.

For all CHT versions, an affected user becomes incapable of reading or writing persistent data. This report indicates that the symptoms are more severe on the Tasks Tab after cht-core@v3.8.

User-Level Mitigation

Reload or Restart the Application

This failure can be resolved by reloading the CHT Android Application. The easiest way to do this is to have users click the “Reload” button whenever an “Unexpected Error” dialog appears.

If the dialog is closed or isn’t available, here are details on how to force quit and restart an application in Android.

The surest way to direct a non-technical user to reload an Android application may be to tell them to restart their device.

Ensure sufficient disk space

This failure can occur when a device runs out of disk space. If this is the case, users must free up storage space before reloading the application.

What is sufficient disk space? Maintaining ~1GB of free disk space is a good target for users. For Medic-led projects, we alert administrators when users fall below 250MB.
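If you want to apply the same thresholds to free-space values you already collect, a minimal Python sketch might look like this. The function name, the labels, and the use of decimal units (1GB = 1,000,000,000 bytes) are assumptions for illustration; this is not how Medic's alerting is implemented.

```python
# Thresholds taken from the post; decimal units are an assumption.
TARGET_FREE_BYTES = 1_000_000_000  # ~1GB target for users
ALERT_FREE_BYTES = 250_000_000     # 250MB administrator alert threshold

def classify_free_space(free_bytes):
    """Label a device's free space as 'alert', 'low', or 'ok'."""
    if free_bytes < ALERT_FREE_BYTES:
        return "alert"  # below the 250MB alert threshold
    if free_bytes < TARGET_FREE_BYTES:
        return "low"    # above the alert threshold but under the ~1GB target
    return "ok"
```

For example, a device reporting 500,000,000 free bytes would be labelled "low": not yet at the alert threshold, but below the suggested target.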

Plug In to Electricity and Adjust Power Settings

If a device is experiencing this issue it may be helpful to plug the device into electricity, turn off battery optimisation for the CHT application, delay the onset of Android sleep, and keep the app in the foreground.

NOT RECOMMENDED: Clear Device Data and Re-Sync

Users across multiple projects have independently mitigated these failures by clearing all data from the device and logging in fresh. This mitigation can cause data-loss, server congestion, and significant project-level disruption. This is not recommended. Our evidence at this time suggests it is unnecessary.

If you encounter instances of this failure which cannot be resolved by reloading the application please share!

Project-Level Mitigation

Upgrade Cht-Android

Upgrade your users to the latest version of cht-android and latest version of Chrome through the Google Play Store.

Keep Document Counts on Devices Within CHT Limits

You can keep failure rates low by keeping document counts low.

Keeping the number of contacts and reports on users’ devices within CHT’s replication limit (10,000 docs) is important for both client-side and server-side stability. Many problems manifest when this limit is not maintained. Purging and Replication Depth are the features available for maintaining an appropriate volume of these document types on users’ devices.

Projects should also monitor the number of task documents created per user. This is separate from the replication limit and is best achieved using Postgres. One target would be to keep fewer than 5,000 task documents per user within the previous 60 days.
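The counting logic behind that target can be sketched in Python, assuming you have exported task documents as (user, creation time) rows from your analytics database. The row shape and names here are hypothetical, and in practice the same aggregation would more likely be written as a Postgres query.

```python
from collections import Counter
from datetime import datetime, timedelta

TASK_DOC_TARGET = 5_000      # target from the post: fewer than 5,000 per user
WINDOW = timedelta(days=60)  # "within the previous 60 days"

def users_over_task_target(task_docs, now, target=TASK_DOC_TARGET):
    """Return {user: count} for users at or above the target within the window.

    `task_docs` is an iterable of (user, created_at) pairs, where created_at
    is a datetime. This shape is an assumption, not a CHT schema.
    """
    cutoff = now - WINDOW
    counts = Counter(user for user, created_at in task_docs if created_at >= cutoff)
    return {user: n for user, n in counts.items() if n >= target}
```

Running this on a schedule and reviewing any users it returns is one way to notice task document growth before it affects devices.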

Device Procurement

You should avoid Tecno B-Class devices when procuring hardware for CHT projects.

When procuring new hardware models for a CHT project, consider asking here on the forum whether others have experience with that particular hardware.

Links

Steps to reproduce this issue can be found in: cht-core#8149 and cht-core#8155.
