Database shard corruption

Critical Production Issue: CouchDB Shard Corruption (read_beyond_eof) – CHT 5.0.0

Hello Community,

We are facing a critical production outage due to CouchDB shard corruption and would appreciate guidance from the Medic team and community members.


:desktop_computer: System Details

  • CHT Version: 5.0.0-custom-20260119

  • CouchDB Version: 3.5.0

  • Deployment: Docker (standard CHT stack)

  • Database: medic

  • Total Documents: ~98,845

  • Environment: Production (Linux 5.15)


:red_exclamation_mark: Problem Summary

The CHT API entered a crash loop and the application became inaccessible (502 Bad Gateway via nginx).

Initially, we suspected view index corruption. However, further investigation revealed that even _all_docs fails.

Running inside the CouchDB container:

curl -u medic:password 'http://localhost:5984/medic/_all_docs?limit=1'

Returns:

{
  "error": "badmatch",
  "reason": "{error,{read_beyond_eof,
           \"./data/shards/6aaaaaa9-7ffffffd/medic.1758278474.couch\",
           ...}}"
}

This indicates corruption in the shard file:

shards/6aaaaaa9-7ffffffd/medic.1758278474.couch
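For anyone triaging the same failure, the affected shard path can be pulled out of the error body programmatically rather than by eye. A minimal sketch, assuming the error shape shown above (the Erlang term in `reason` quotes the shard file path):

```python
import json
import re

def corrupt_shard_path(error_body: str):
    """Extract the .couch shard path embedded in a read_beyond_eof error.

    Assumes the CouchDB error JSON shape shown above, where the Erlang
    term in "reason" contains the quoted shard file path.
    """
    reason = json.loads(error_body).get("reason", "")
    match = re.search(r'"([^"]*?\.couch)"', reason)
    return match.group(1) if match else None

# Example using the error returned by _all_docs above:
body = json.dumps({
    "error": "badmatch",
    "reason": '{error,{read_beyond_eof,'
              '"./data/shards/6aaaaaa9-7ffffffd/medic.1758278474.couch",'
              '...}}',
})
print(corrupt_shard_path(body))
# → ./data/shards/6aaaaaa9-7ffffffd/medic.1758278474.couch
```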


:magnifying_glass_tilted_left: What We Have Tried

  • Ran _view_cleanup

  • Performed database compaction

  • Deleted and rebuilt view indexes

  • Replaced the corrupted shard file with a backup copy

  • Restarted services carefully

  • Verified multiple shard ranges exist (12 total)

Despite these efforts, _all_docs continues to fail with read_beyond_eof.

This confirms it is not only view index corruption, but actual shard-level data file corruption.


:package: Backup Status

  • No full filesystem snapshot from before corruption

  • We DO have complete document data in the analytics database (cht_sync)

  • Table v1.couchdb contains full document JSON (doc column)

Example:

SELECT COUNT(*) FROM v1.couchdb;
SELECT doc->>'type', COUNT(*) FROM v1.couchdb GROUP BY 1;

So business data appears intact in analytics.
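Before treating the analytics copy as a recovery source, it may be worth validating that every exported row actually parses as JSON and carries an `_id`. A rough sketch; the row format (an iterable of raw JSON strings from the `doc` column above) is an assumption:

```python
import json

def validate_docs(rows):
    """Check that each exported `doc` value parses as JSON and has an _id.

    rows: iterable of raw JSON strings, e.g. the doc column streamed
    out of v1.couchdb. Returns (ok_count, list_of_bad_row_indexes).
    """
    ok, bad = 0, []
    for i, raw in enumerate(rows):
        try:
            doc = json.loads(raw)
        except (TypeError, ValueError):
            bad.append(i)
            continue
        if isinstance(doc, dict) and doc.get("_id"):
            ok += 1
        else:
            bad.append(i)
    return ok, bad

rows = ['{"_id": "abc", "type": "person"}', '{"type": "missing-id"}']
print(validate_docs(rows))  # → (1, [1])
```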


:brain: Suspected Root Cause

  • Improper shutdown during write

  • Possible memory/disk pressure

  • Corruption of B-tree offsets inside .couch shard file

  • Leading to read_beyond_eof errors


:counterclockwise_arrows_button: Proposed Recovery Plan

Since shard repair does not appear possible:

  1. Reset CouchDB with a clean data directory

  2. Recreate the medic database

  3. Export all documents from v1.couchdb

  4. Reinsert documents using _bulk_docs

  5. Reinstall CHT system documents

  6. Allow views to rebuild
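Steps 3–4 of the plan above could be sketched roughly as follows: strip the stale `_rev` from each exported document (the old revision tree no longer exists in a freshly recreated database) and batch the documents into `_bulk_docs` payloads. The batch size is illustrative and the actual POST to CouchDB is left out:

```python
import json

def bulk_docs_batches(raw_docs, batch_size=500):
    """Yield {'docs': [...]} payloads ready to POST to /medic/_bulk_docs.

    raw_docs: iterable of JSON strings, e.g. from v1.couchdb's doc column.
    The stale _rev is dropped so CouchDB assigns fresh revisions.
    """
    batch = []
    for raw in raw_docs:
        doc = json.loads(raw)
        doc.pop("_rev", None)  # old revision tree is gone after the reset
        batch.append(doc)
        if len(batch) == batch_size:
            yield {"docs": batch}
            batch = []
    if batch:
        yield {"docs": batch}

# Example: 3 docs with batch size 2 produce two payloads
raws = [json.dumps({"_id": f"doc{i}", "_rev": "1-abc"}) for i in range(3)]
payloads = list(bulk_docs_batches(raws, batch_size=2))
print(len(payloads))           # → 2
print(payloads[0]["docs"][0])  # → {'_id': 'doc0'}
```

Note that the CHT system and design documents (step 5) would come from a fresh CHT install rather than from the analytics export.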


:red_question_mark: Questions

  1. Is there any supported way to repair a corrupted shard in CouchDB 3.5.0?

  2. Is rebuilding from analytics (v1.couchdb) considered safe in CHT disaster recovery?

  3. Are there recommended hardening steps to prevent shard corruption in Docker deployments?

  4. Should we upgrade CouchDB before re-importing data?


Our objective is to restore system functionality safely, avoid silent data loss, and follow recommended CHT best practices.

Any guidance would be greatly appreciated.

Thank you.
Arjun


Hi @sablearjun-ola - welcome to the forum!

I’m sorry to hear you’re having issues with your CHT instance. Can you clarify what version 5.0.0-custom-20260119 is? Is it a custom-built version of the CHT? If so, can you try running the official 5.0.1 release of the CHT?

However, I’m not sure the version really matters here - from what you’ve described, I think you’re looking at data loss :(

To answer your questions:

Is there any supported way to repair a corrupted shard in CouchDB 3.5.0?

If a shard is corrupt and reporting read_beyond_eof errors, the supported way is to restore from backup.

Is rebuilding from analytics (v1.couchdb) considered safe in CHT disaster recovery?

Data in PostgreSQL populated from either couch2pg or CHT Sync can continue to be used while the upstream CHT instance is offline due to data corruption. However, you cannot rebuild the data in PostgreSQL without restoring the CHT instance.

Additionally, populating PostgreSQL involves aggregating and consolidating the data. As such, there is intentional data loss, and CouchDB data cannot be recreated from couch2pg or CHT Sync data in PostgreSQL.

Are there recommended hardening steps to prevent shard corruption in Docker deployments?

Running production CHT instances in a data center or with a cloud provider is likely the best safeguard. These facilities have high-availability hardware with multiple levels of redundant power, which avoids hard shutdowns of VMs or bare-metal servers.

For more budget-constrained deployments, running on a dedicated server used only for production services, backed by a UPS, can help.

Should we upgrade CouchDB before re-importing data?

If you have a valid backup of the corrupt shard that you’re trying to restore, do not change the version of CouchDB.

cc @diana @binod - we had a private chat about this topic. Adding you in case I missed anything or made any mistakes!

Hi @sablearjun-ola

I’m sorry you’re dealing with CouchDB corruption, especially at the main database shard level.

@mrjones is correct on all points, but I can chime in with some targeted answers.

In terms of this question:

Should we upgrade CouchDB before re-importing data?

The answer is that you should never do that: you should always run the version of CouchDB that comes bundled with the CHT version you are running. Changing this version, or even CouchDB settings, can create all sorts of elusive and often surprising side effects that are hard to track down.