Critical Production Issue: CouchDB Shard Corruption (read_beyond_eof) – CHT 5.0.0
Hello Community,
We are facing a critical production outage due to CouchDB shard corruption and would appreciate guidance from the Medic team and community members.
System Details
- CHT Version: 5.0.0-custom-20260119
- CouchDB Version: 3.5.0
- Deployment: Docker (standard CHT stack)
- Database: medic
- Total Documents: ~98,845
- Environment: Production (Linux 5.15)
Problem Summary
The CHT API entered a crash loop and the application became inaccessible (502 Bad Gateway via nginx).
Initially, we suspected view index corruption. However, further investigation revealed that even _all_docs fails.
Running inside the CouchDB container:
curl -u medic:password 'http://localhost:5984/medic/_all_docs?limit=1'
Returns:
{
"error": "badmatch",
"reason": "{error,{read_beyond_eof,
\"./data/shards/6aaaaaa9-7ffffffd/medic.1758278474.couch\",
...}}"
}
This indicates corruption in the shard file:
shards/6aaaaaa9-7ffffffd/medic.1758278474.couch
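For reference, the hex range in that path is the slice of the document-ID hash space this shard covers. Below is a minimal Python sketch for estimating which documents live in the affected range. It assumes CouchDB's default placement is the CRC-32 of the raw document ID, which is our understanding for non-partitioned databases and should be verified before relying on it. The function names are ours, for illustration only.

```python
import re
import zlib

# Matches the "<start>-<end>" hex range directory in a shard file path.
SHARD_RANGE = re.compile(r"shards/([0-9a-f]{8})-([0-9a-f]{8})/")


def parse_shard_range(path):
    """Return (start, end) as integers from a shard file path."""
    match = SHARD_RANGE.search(path)
    if match is None:
        raise ValueError(f"no shard range found in {path!r}")
    return int(match.group(1), 16), int(match.group(2), 16)


def doc_in_shard(doc_id, start, end):
    """Assumption: placement is the CRC-32 of the UTF-8 document ID."""
    return start <= zlib.crc32(doc_id.encode("utf-8")) <= end


# Range taken from the error message above.
start, end = parse_shard_range("shards/6aaaaaa9-7ffffffd/medic.1758278474.couch")
```

If the assumption holds and the 12 ranges are roughly equal in size, about one twelfth of the ~98,845 documents (around 8,200) would sit in this shard.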
What We Have Tried
- Ran _view_cleanup
- Performed database compaction
- Deleted and rebuilt view indexes
- Replaced the corrupted shard file with a backup copy
- Restarted services carefully
- Verified multiple shard ranges exist (12 total)
Despite these efforts, _all_docs continues to fail with read_beyond_eof.
This confirms it is not only view index corruption, but actual shard-level data file corruption.
Backup Status
- No full filesystem snapshot from before corruption
- We DO have complete document data in the analytics database (cht_sync)
- Table v1.couchdb contains full document JSON (doc column)
Example:
SELECT COUNT(*) FROM v1.couchdb;
SELECT doc->>'type', COUNT(*) FROM v1.couchdb GROUP BY 1;
So business data appears intact in analytics.
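Before relying on the analytics copy, we think it is worth checking what the stored JSON actually contains. The sketch below is a hypothetical audit over rows already fetched from v1.couchdb and decoded into Python dicts; the function name and report keys are ours. It counts documents missing `_id` or `_rev`, deleted documents, and documents whose `_attachments` entries are stubs. A stub carries metadata only, so the binary content would not be recoverable from the JSON alone.

```python
from collections import Counter


def audit_docs(docs):
    """Summarize exported documents before attempting a re-import."""
    report = Counter()
    for doc in docs:
        report["total"] += 1
        if "_id" not in doc:
            report["missing_id"] += 1
        if "_rev" not in doc:
            report["missing_rev"] += 1
        if doc.get("_deleted"):
            report["deleted"] += 1
        attachments = doc.get("_attachments") or {}
        if any(att.get("stub") for att in attachments.values()):
            report["attachment_stubs"] += 1
    return dict(report)


# Made-up sample rows, only to show the shape of the input.
sample = [
    {"_id": "a", "_rev": "1-abc", "type": "person"},
    {"_id": "b", "_rev": "2-def", "_attachments": {"content": {"stub": True}}},
    {"_id": "c", "_deleted": True},
]
print(audit_docs(sample))
```

Any non-zero count for missing revisions or attachment stubs would mean the analytics copy is not a complete replacement for the lost shard data.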
Suspected Root Cause
- Improper shutdown during write
- Possible memory/disk pressure
- Corruption of B-tree offsets inside the .couch shard file
- Leading to read_beyond_eof errors
Proposed Recovery Plan
Since shard repair does not appear possible:
- Reset CouchDB with a clean data directory
- Recreate the medic database
- Export all documents from v1.couchdb
- Reinsert documents using _bulk_docs
- Reinstall CHT system documents
- Allow views to rebuild
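The export and re-insert steps could be sketched roughly as follows. This is a minimal sketch, not a tested recovery tool: the function names, batch size, URL and credentials are placeholders. It assumes every exported document still has its `_id` and `_rev`, and it posts with `new_edits: false` so that CouchDB stores the given revisions instead of generating new ones. Whether that is the right choice for CHT (for example for offline users who still hold the old revisions) is one of the things we would like confirmed.

```python
import base64
import json
import urllib.request


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def bulk_payload(docs):
    """Build a _bulk_docs body that keeps the original _rev values."""
    return {"docs": docs, "new_edits": False}


def post_bulk(db_url, user, password, docs):
    """POST one batch to <db_url>/_bulk_docs and return the parsed reply."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(
        f"{db_url}/_bulk_docs",
        data=json.dumps(bulk_payload(docs)).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def restore(db_url, user, password, docs, batch_size=500):
    """Upload all docs in batches and collect any per-document errors."""
    errors = []
    for batch in chunked(docs, batch_size):
        reply = post_bulk(db_url, user, password, batch)
        # The reply is a JSON array; keep only rows that report an error.
        errors.extend(row for row in reply if "error" in row)
    return errors
```

A call would look like `restore("http://localhost:5984/medic", "medic", "password", docs)`, where `docs` is the list of document dicts exported from v1.couchdb. We would run it against a scratch database first and compare document counts before touching production.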
Questions
- Is there any supported way to repair a corrupted shard in CouchDB 3.5.0?
- Is rebuilding from analytics (v1.couchdb) considered safe in CHT disaster recovery?
- Are there recommended hardening steps to prevent shard corruption in Docker deployments?
- Should we upgrade CouchDB before re-importing data?
Our objective is to restore system functionality safely, avoid silent data loss, and follow recommended CHT best practices.
Any guidance would be greatly appreciated.
Thank you.
Arjun