Failed CouchDB cluster init

I am deploying CHT on Kubernetes and getting the errors described below. So far I have ensured that:

  • erlang cookies are present and identical on each node
  • DNS names resolve to the correct IPs and are reachable (inter-pod communication is working)

Does anyone have an idea how to resolve the rexi_DOWN,noproc error? And how can I trigger creation of the default databases (e.g. _users) on initialisation?
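
(For context, I understand that once cluster membership is healthy the system databases can be created by hand with something like the commands below; the hostname, port and credentials are placeholders for my setup. I would prefer the init process to do this automatically, though.)

curl -X PUT http://admin:password@cht-couchdb-0.cht-couchdb.cht-k8s.svc.cluster.local:5984/_users
curl -X PUT http://admin:password@cht-couchdb-0.cht-couchdb.cht-k8s.svc.cluster.local:5984/_replicator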

Error logs:
[error] 2025-06-17T02:58:26.340266Z couchdb@cht-couchdb-0.cht-couchdb.cht-k8s.svc.cluster.local emulator -------- Error in process <0.503.0> on node 'couchdb@cht-couchdb-0.cht-couchdb-couchdb.cht-k8s.svc.cluster.local' with exit value:
{{rexi_DOWN,{'couchdb@cht-couchdb-1.cht-couchdb.cht-k8s.svc.cluster.local',noproc}},[{mem3_rpc,rexi_call,3,[{file,"src/mem3_rpc.erl"},{line,384}]},{mem3_seeds,'-start_replication/1-fun-0-',1,[{file,"src/mem3_seeds.erl"},{line,107}]}]}

…

[notice] 2025-06-17T02:58:26.497064Z couchdb@cht-couchdb-0.cht-couchdb.cht-k8s.svc.cluster.local <0.609.0> -------- Missing system database _users
[notice] 2025-06-17T02:58:31.417394Z couchdb@cht-couchdb-0.cht-couchdb.cht-k8s.svc.cluster.local <0.561.0> -------- chttpd_auth_cache changes listener died because the _users database does not exist. Create the database to silence this notice.
[error] 2025-06-17T02:58:31.417576Z couchdb@cht-couchdb-0.cht-couchdb.cht-k8s.svc.cluster.local emulator -------- Error in process <0.765.0> on node 'couchdb@cht-couchdb-0.cht-couchdb.cht-k8s.svc.cluster.local' with exit value:
{database_does_not_exist,[{mem3_shards,load_shards_from_db,[<<"_users">>],[{file,"src/mem3_shards.erl"},{line,430}]},{mem3_shards,load_shards_from_disk,1,[{file,"src/mem3_shards.erl"},{line,405}]},{mem3_shards,load_shards_from_disk,2,[{file,"src/mem3_shards.erl"},{line,434}]},{mem3_shards,for_docid,3,[{file,"src/mem3_shards.erl"},{line,100}]},{fabric_doc_open,go,3,[{file,"src/fabric_doc_open.erl"},{line,39}]},{chttpd_auth_cache,ensure_auth_ddoc_exists,2,[{file,"src/chttpd_auth_cache.erl"},{line,214}]},{chttpd_auth_cache,listen_for_changes,1,[{file,"src/chttpd_auth_cache.erl"},{line,160}]}]}

hi @Felix_Otieno

Can you please share your node name setup? It's hard to tell from the logs which machine this error is triggered on, because two node names are mentioned in the error:

[error] 2025-06-17T02:58:26.340266Z couchdb@cht-couchdb-0.cht-couchdb.cht-k8s.svc.cluster.local emulator -------- Error in process <0.503.0> on node 'couchdb@cht-couchdb-0.cht-couchdb-couchdb.cht-k8s.svc.cluster.local' with exit value:
{{rexi_DOWN,{'couchdb@cht-couchdb-1.cht-couchdb.cht-k8s.svc.cluster.local',noproc}},[{mem3_rpc,rexi_call,3,[{file,"src/mem3_rpc.erl"},{line,384}]},{mem3_seeds,'-start_replication/1-fun-0-',1,[{file,"src/mem3_seeds.erl"},{line,107}]}]}

You have both cht-couchdb-0.cht-couchdb.cht-k8s.svc.cluster.local and cht-couchdb-1.cht-couchdb.cht-k8s.svc.cluster.local?
I think we need more logs, from all of your CouchDB nodes, to even begin troubleshooting this.

hi @diana

Many thanks for your response.

I have 4 nodes in total: 1 control plane and 3 workers with hostnames worker1, worker2 and worker3.
I am setting this up on cross-cloud bare-metal servers and adjusted the values.yaml file as below:

...
couchdb:
  password: "password"
  secret: "uuidgen-secret"
  user: "medic"
  uuid: "uuidgen-couchID"
  clusteredCouch_enabled: true
  couchdb_node_storage_size: 350Gi
clusteredCouch:
  noOfCouchDBNodes: 3
...

environment: "remote"
cluster_type: "k3s-k3d"
...

nodes:
  node-1: "worker1"
  node-2: "worker2"
  node-3: "worker3"
...

couchdb_data:
  preExistingDataAvailable: "true"
  dataPathOnDiskForCouchDB: "couchdb-%d"
...

local_storage:
  preExistingDiskPath-1: "/home/user/couchdb"
  preExistingDiskPath-2: "/home/user/couchdb"
  preExistingDiskPath-3: "/home/user/couchdb"
...

To set up the worker nodes, I rsync'd the pre-existing data to the 3 worker nodes.
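
Roughly along these lines, once per worker (the source path and SSH user are illustrative, not my exact values):

rsync -avz /path/to/existing-couchdb-data/ user@worker1:/home/user/couchdb/
rsync -avz /path/to/existing-couchdb-data/ user@worker2:/home/user/couchdb/
rsync -avz /path/to/existing-couchdb-data/ user@worker3:/home/user/couchdb/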

I have noticed that:

  • When I check the cluster setup, the state is single_node_disabled despite the clusteredCouch_enabled setting in values.yaml:

# curl -X GET http://admin:password@couchdb-1.cht-k8s.svc.cluster.local:5984/_cluster_setup
{"state":"single_node_disabled"}
  • When I get the membership, cluster_nodes contains an entry that references localhost:
# curl -X GET http://admin:password@couchdb-1.cht-k8s.svc.cluster.local:5984/_membership
{"all_nodes":["couchdb@couchdb-1.cht-k8s.svc.cluster.local","couchdb@couchdb-2.cht-k8s.svc.cluster.local","couchdb@couchdb-3.cht-k8s.svc.cluster.local"],"cluster_nodes":["couchdb@127.0.0.1","couchdb@couchdb-1.cht-k8s.svc.cluster.local","couchdb@couchdb-2.cht-k8s.svc.cluster.local","couchdb@couchdb-3.cht-k8s.svc.cluster.local"]}
  • When I remove the localhost node (roughly as sketched after this list) and try to finish the setup:
# curl -X POST http://admin:password@couchdb-1.cht-k8s.svc.cluster.local:5984/_cluster_setup -H "Content-Type: application/json" -d '{"action": "finish_cluster"}'
{"error":"setup_error","reason":"Cluster setup unable to sync admin passwords"}
  • When I check the admins, I see the same admin hash on all the nodes:
# curl -X GET http://admin:password@couchdb-1.cht-k8s.svc.cluster.local:5984/_node/_local/_config/admins/
{"admin":"-pbkdf2-279886f05fa3cc2a9,b04b2bc4210,10"}
# curl -X GET http://admin:password@couchdb-2.cht-k8s.svc.cluster.local:5984/_node/_local/_config/admins/
{"admin":"-pbkdf2-279886f05fa3cc2a9,b04b2bc4210,10"}
# curl -X GET http://admin:password@couchdb-3.cht-k8s.svc.cluster.local:5984/_node/_local/_config/admins/
{"admin":"-pbkdf2-279886f05fa3cc2a9,b04b2bc4210,10"}

Please see the logs from the pods at this link

This is likely because of the data you're preloading: it references a node that no longer exists.
This is fairly advanced to fix. We have a script that will help you resolve the data-to-node mapping, but you'd still need to understand what is required to get this to work.

Where did you get the data from? Was it a single-node CouchDB?

Yes. The data is from a single-node CouchDB that we have been using for a while.

I am curious to have a look at the script you mentioned. How can I get hold of it? I am hoping that by looking at its structure and the purpose of its components, I will be able to understand what I need to do.

This is the documentation for the script: Migration from Docker Compose CHT 3.x to 3-Node Clustered CHT 4.x on K3s – Community Health Toolkit

I know it describes a migration from 3.x to a 4.x k8s cluster, but what it does is essentially split the data from a single node across three nodes and resolve the node names in the data mapping. I think it's pretty safe to follow this guide as-is at first: Migration from Docker Compose CHT 3.x to 3-Node Clustered CHT 4.x on K3s – Community Health Toolkit
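
For context on why that localhost node lingers: every database's shard map lives in the node-local _dbs database and references node names explicitly. You can inspect it with something like the request below (the database name is just an example, and the response shape is illustrative rather than your exact data):

# curl http://admin:password@couchdb-1.cht-k8s.svc.cluster.local:5984/_node/_local/_dbs/medic
{"_id":"medic","by_node":{"couchdb@127.0.0.1":["00000000-1fffffff","..."]},"by_range":{"00000000-1fffffff":["couchdb@127.0.0.1"]},"...":"..."}

The steps in that guide rewrite these by_node/by_range entries so they point at the new clustered node names.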


I followed the guide and was able to resolve the issues we were facing. Today is our 4th day running CHT on a cluster and everything is working as expected. Thank you so much, @diana, for your help. I sincerely appreciate it.
