Database shard corruption

Critical Production Issue: CouchDB Shard Corruption (read_beyond_eof) – CHT 5.0.0

Hello Community,

We are facing a critical production outage due to CouchDB shard corruption and would appreciate guidance from the Medic team and community members.


:desktop_computer: System Details

  • CHT Version: 5.0.0-custom-20260119

  • CouchDB Version: 3.5.0

  • Deployment: Docker (standard CHT stack)

  • Database: medic

  • Total Documents: ~98,845

  • Environment: Production (Linux 5.15)


:red_exclamation_mark: Problem Summary

The CHT API entered a crash loop and the application became inaccessible (502 Bad Gateway via nginx).

Initially, we suspected view index corruption. However, further investigation revealed that even _all_docs fails.

Running inside the CouchDB container:

curl -u medic:password 'http://localhost:5984/medic/_all_docs?limit=1'

Returns:

{
  "error": "badmatch",
  "reason": "{error,{read_beyond_eof,
           \"./data/shards/6aaaaaa9-7ffffffd/medic.1758278474.couch\",
           ...}}"
}

This indicates corruption in the shard file:

shards/6aaaaaa9-7ffffffd/medic.1758278474.couch
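For anyone triaging the same failure, the affected shard path can be pulled out of the error body programmatically rather than by eye. A minimal sketch, assuming the error shape shown above (the Erlang term in `reason` quotes the shard file path):

```python
import json
import re

def corrupt_shard_path(error_body: str):
    """Extract the .couch shard path embedded in a read_beyond_eof error.

    Assumes the CouchDB error JSON shape shown above, where the Erlang
    term in "reason" contains the quoted shard file path.
    """
    reason = json.loads(error_body).get("reason", "")
    match = re.search(r'"([^"]*?\.couch)"', reason)
    return match.group(1) if match else None

# Example using the error returned by _all_docs above:
body = json.dumps({
    "error": "badmatch",
    "reason": '{error,{read_beyond_eof,'
              '"./data/shards/6aaaaaa9-7ffffffd/medic.1758278474.couch",'
              '...}}',
})
print(corrupt_shard_path(body))
# → ./data/shards/6aaaaaa9-7ffffffd/medic.1758278474.couch
```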


:magnifying_glass_tilted_left: What We Have Tried

  • Ran _view_cleanup

  • Performed database compaction

  • Deleted and rebuilt view indexes

  • Replaced the corrupted shard file with a backup copy

  • Restarted services carefully

  • Verified multiple shard ranges exist (12 total)

Despite these efforts, _all_docs continues to fail with read_beyond_eof.

This confirms it is not only view index corruption, but actual shard-level data file corruption.


:package: Backup Status

  • No full filesystem snapshot from before corruption

  • We DO have complete document data in the analytics database (cht_sync)

  • Table v1.couchdb contains full document JSON (doc column)

Example:

SELECT COUNT(*) FROM v1.couchdb;
SELECT doc->>'type', COUNT(*) FROM v1.couchdb GROUP BY 1;

So business data appears intact in analytics.
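Before treating the analytics copy as a recovery source, it may be worth validating that every exported row actually parses as JSON and carries an `_id`. A rough sketch; the row format (an iterable of raw JSON strings from the `doc` column above) is an assumption:

```python
import json

def validate_docs(rows):
    """Check that each exported `doc` value parses as JSON and has an _id.

    rows: iterable of raw JSON strings, e.g. the doc column streamed
    out of v1.couchdb. Returns (ok_count, list_of_bad_row_indexes).
    """
    ok, bad = 0, []
    for i, raw in enumerate(rows):
        try:
            doc = json.loads(raw)
        except (TypeError, ValueError):
            bad.append(i)
            continue
        if isinstance(doc, dict) and doc.get("_id"):
            ok += 1
        else:
            bad.append(i)
    return ok, bad

rows = ['{"_id": "abc", "type": "person"}', '{"type": "missing-id"}']
print(validate_docs(rows))  # → (1, [1])
```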


:brain: Suspected Root Cause

  • Improper shutdown during write

  • Possible memory/disk pressure

  • Corruption of B-tree offsets inside .couch shard file

  • Leading to read_beyond_eof errors


:counterclockwise_arrows_button: Proposed Recovery Plan

Since shard repair does not appear possible:

  1. Reset CouchDB with a clean data directory

  2. Recreate the medic database

  3. Export all documents from v1.couchdb

  4. Reinsert documents using _bulk_docs

  5. Reinstall CHT system documents

  6. Allow views to rebuild
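Steps 3–4 of the plan above could be sketched roughly as follows: strip the stale `_rev` from each exported document (the old revision tree no longer exists in a freshly recreated database) and batch the documents into `_bulk_docs` payloads. The batch size is illustrative and the actual POST to CouchDB is left out:

```python
import json

def bulk_docs_batches(raw_docs, batch_size=500):
    """Yield {'docs': [...]} payloads ready to POST to /medic/_bulk_docs.

    raw_docs: iterable of JSON strings, e.g. from v1.couchdb's doc column.
    The stale _rev is dropped so CouchDB assigns fresh revisions.
    """
    batch = []
    for raw in raw_docs:
        doc = json.loads(raw)
        doc.pop("_rev", None)  # old revision tree is gone after the reset
        batch.append(doc)
        if len(batch) == batch_size:
            yield {"docs": batch}
            batch = []
    if batch:
        yield {"docs": batch}

# Example: 3 docs with batch size 2 produce two payloads
raws = [json.dumps({"_id": f"doc{i}", "_rev": "1-abc"}) for i in range(3)]
payloads = list(bulk_docs_batches(raws, batch_size=2))
print(len(payloads))           # → 2
print(payloads[0]["docs"][0])  # → {'_id': 'doc0'}
```

Note that the CHT system and design documents (step 5) would come from a fresh CHT install rather than from the analytics export.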


:red_question_mark: Questions

  1. Is there any supported way to repair a corrupted shard in CouchDB 3.5.0?

  2. Is rebuilding from analytics (v1.couchdb) considered safe in CHT disaster recovery?

  3. Are there recommended hardening steps to prevent shard corruption in Docker deployments?

  4. Should we upgrade CouchDB before re-importing data?


Our objective is to restore system functionality safely, avoid silent data loss, and follow recommended CHT best practices.

Any guidance would be greatly appreciated.

Thank you.
Arjun


Hi @sablearjun-ola - welcome to the forum!

I’m sorry to hear you’re having issues with your CHT instance. Can you clarify what version 5.0.0-custom-20260119 is? Is it a custom-built version of the CHT? If so, can you try running the official 5.0.1 release of the CHT?

However, I’m not sure the version really matters here - from what you’ve described, I think you’re looking at data loss :(

To answer your questions:

Is there any supported way to repair a corrupted shard in CouchDB 3.5.0?

If a shard is corrupt and reporting read_beyond_eof errors, the supported way is to restore from backup.

Is rebuilding from analytics (v1.couchdb) considered safe in CHT disaster recovery?

Data in PostgreSQL populated from either couch2pg or CHT Sync can continue to be used while the upstream CHT instance is offline due to data corruption. However, you cannot rebuild the data in PostgreSQL without restoring the CHT instance.

Additionally, populating PostgreSQL involves aggregating and consolidating the data. As such, there is intentional data loss, and CouchDB data cannot be recreated from couch2pg or CHT Sync data in PostgreSQL.

Are there recommended hardening steps to prevent shard corruption in Docker deployments?

Running production CHT instances in a data center or with a cloud provider is likely the best safeguard. These facilities have high-availability hardware with multiple levels of redundant power, which avoids hard shutdowns of VMs or bare-metal servers.

For more budget-constrained deployments, running on a dedicated server used only for production services, backed by a UPS, can help.

Should we upgrade CouchDB before re-importing data?

If you have a valid backup of the corrupt shard that you’re trying to restore, do not change the version of CouchDB.

cc @diana @binod - we had a private chat about this topic. Adding you in case I missed anything or made any mistakes!

Hi @sablearjun-ola

I’m sorry you’re dealing with CouchDB corruption, especially at the main database shard level.

@mrjones is correct on all points, but I can chime in with some targeted answers.

In terms of this question:

Should we upgrade CouchDB before re-importing data?

The answer is that you should never do that: you should always run the version of CouchDB that comes bundled with the CHT version you are running. Changing this version, or even CouchDB settings, can create all sorts of elusive and often surprising side effects that are hard to track down.