Background:
One of our multi-cluster CHT instances recently had a partial database crash that caused a production outage; CHT was down.
Discovery of the bug:
The issue started with couch2pg failing to sync data from the medic-users-meta database to postgres. We then checked and realized that the monitoring endpoint api was also timing out. The first thing to check when debugging any CHT issue is the status of the api, sentinel, and couchdb pods. While the api and sentinel pods were running fine, there were errors on the couchdb pods pointing at specific couchdb shards:
2024-12-02T05:49:42.664 ERROR: Error fetching feedback count: {
    at Object.generateErrorFromResponse (/service/api/node_modules/pouchdb-errors/lib/index.js:104:18)
    at /service/api/node_modules/pouchdb-abstract-mapreduce/lib/index.js:391:31
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async Promise.all (index 6)
    at async Promise.all (index 0)
...
[error] 2024-12-02T05:49:42.661864Z couchdb@couchdb-2.our-host-prod.svc.cluster.local <0.17993.1569> 4c36c89be1 fabric_worker_timeout reduce_view,'couchdb@couchdb-1.our-host-prod.svc.cluster.local',<<"shards/55555554-6aaaaaa8/medic-users-meta.1698722729">>
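For reference, here is a minimal sketch of how to check pod status and pull logs like the ones above. It assumes the namespace kube-namespace and the couchdb deployment names used later in this post; adjust them to your installation.
# List the pods and confirm whether api, sentinel, and couchdb are all Running
kubectl -n kube-namespace get pods
# Tail the logs of each couchdb deployment and look for errors mentioning shards
kubectl -n kube-namespace logs deployment/cht-couchdb-1 --tail=200
kubectl -n kube-namespace logs deployment/cht-couchdb-2 --tail=200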
Restoring the failed shard:
If you look at the end of the error, it points at the shard shards/55555554-6aaaaaa8/medic-users-meta.1698722729.
In a single-cluster installation, the shard would be on your couchdb pod. In a multi-cluster installation, you need to find which couchdb pod holds the shard mentioned in the failure. Look in /opt/couchdb/data/shards/ on each couchdb pod to see whether it contains the failed shard(s). In our case, the corrupted shard was on couchdb-1.
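As a rough sketch (using the namespace and couchdb-1 pod name that appear later in this post), you can list the shard files on each couchdb pod like this:
# Check whether this pod holds the failed shard; repeat against couchdb-2 and couchdb-3
kubectl -n kube-namespace exec cht-couchdb-1-56ddd646d6-c4mf9 -- \
  ls -lh /opt/couchdb/data/shards/55555554-6aaaaaa8/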
We keep regular automated backups of our couchdb, so we could go back to a specific date and pick a backup to work with. We mounted the backup to disks and looked for the failed shard mentioned in the error. In a partial failure like this, it's important to restore only the failed database and failed shard. If we were to restore the entire backup, we'd lose the data collected after that backup was taken.
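For example, assuming the backup is mounted at a hypothetical /mnt/couchdb-backup, the shard file can be located with:
# Find the backed-up shard file for medic-users-meta in the mounted backup
find /mnt/couchdb-backup -name 'medic-users-meta.1698722729.couch' -path '*55555554-6aaaaaa8*'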
Now that we have identified the location of the backed-up shard and the location of the failed shard, the next step is to copy the backup directly to the production cluster. From the backup location, execute the following command:
kubectl cp medic-users-meta.1698722729.couch kube-namespace/cht-couchdb-1-56ddd646d6-c4mf9:/opt/couchdb/data/shards/55555554-6aaaaaa8/
With this command, we copy the shard to the running production instance using kubectl cp. Here, kube-namespace is the kubernetes namespace of the instance and cht-couchdb-1-56ddd646d6-c4mf9 is the name of the couchdb-1 pod. You don't need to change the path /opt/couchdb/data/shards/55555554-6aaaaaa8/ if you did not change where couchdb data is stored during the initial installation. The copy should complete without any error and replaces the failed shard with a working shard from the backup.
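Before restarting anything, it is worth confirming that the copy landed where expected, for example:
# Confirm the restored shard file is present on the couchdb-1 pod and has a sensible size
kubectl -n kube-namespace exec cht-couchdb-1-56ddd646d6-c4mf9 -- \
  ls -lh /opt/couchdb/data/shards/55555554-6aaaaaa8/medic-users-meta.1698722729.couch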
Now restart all couchdb deployments:
kubectl -n kube-namespace rollout restart deployment cht-couchdb-1
kubectl -n kube-namespace rollout restart deployment cht-couchdb-2
kubectl -n kube-namespace rollout restart deployment cht-couchdb-3
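You can watch the restarts complete before moving on:
# Wait for each couchdb deployment to finish rolling out
kubectl -n kube-namespace rollout status deployment cht-couchdb-1
kubectl -n kube-namespace rollout status deployment cht-couchdb-2
kubectl -n kube-namespace rollout status deployment cht-couchdb-3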
This completes the restore of the failed database shard on a production instance.
At the end:
We were able to restore only because we had a regular backup. It is strongly recommended to configure regular automatic backups of CHT and couchdb clusters.
CHT also has a monitoring endpoint that can be used to check the health of an instance, or used with CHT Watchdog to monitor the status of instances.
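For example, a quick manual check against the monitoring endpoint (assuming your instance is reachable at https://<cht-host> and exposes the standard v2 monitoring path):
# Fetch instance health metrics from the CHT monitoring endpoint
curl -s https://<cht-host>/api/v2/monitoring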