Slow Syncing + Replication

We have a node running CHT 4.6.0, and since December we have been experiencing slow syncing and replication. Find our specs below.

Server Specs:
Cores: 8
RAM: 16GB
OS: Ubuntu 22.04 LTS

Watchdog Stats (last 90 days):
Monthly Active Users: 769
Replication 50th Percentile: 21.9 s 43.4 mins
Replication 90th Percentile: 41.1 s 56.7 mins
Replication Max: 3 mins 1 hour

CouchDB Stats:
Mode: Single node
Doc count: 8,800,965

What We Have Tried So Far:

  • Upgraded our infrastructure to its current state
  • Upgraded CHT, but due to bugs with the upgrade service we could only get from v4.3.1 to v4.6.0
  • Adjusted the replication depth for some of our user groups to reduce load on the server (see the sketch after this list)
  • Attempted to force a manual upgrade through Docker to v4.11.0, to pick up some of the improvements to how replication is done, but it would just hang when running migrations
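For reference, the currently configured depths can be read back from the settings API; the host and admin credentials below are placeholders, not our actual values, and this assumes jq is installed:

# Read the configured replication_depth from app settings (placeholder host/credentials)
curl -s -u admin:password https://cht.example.org/api/v1/settings | jq '.replication_depth'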

Hi @danielmwakanema!

How many docs does each user have to replicate?

You can check it with this API:
api/v1/users-doc-count
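For example, with admin credentials and your instance URL (both are placeholders below):

# Returns replication doc counts per user (placeholder host/credentials)
curl -s -u admin:password https://cht.example.org/api/v1/users-doc-count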

Hi @danielmwakanema

Your server specs look quite low for an instance that serves almost 800 users. I would first recommend upgrading your infrastructure further.

I think @binod's question is very relevant; the number of docs (including purged docs) has a big impact on replication times.

Thirdly, can you please share which bugs in the upgrade service you are referring to? Have you tried upgrading the upgrade service itself first?

Thanks.

@binod I will get back to you with this shortly.

@diana, what server specs are recommended?

About the upgrade service, we encountered the view indexing and CouchDB crashing bugs. We tried restarting the affected services each time, but the upgrade never really finished. We referred to the docs here: Troubleshooting 4.x upgrades | Community Health Toolkit.

@danielmwakanema

I see. To move the upgrade forward, do follow the guidelines in the Troubleshooting doc you linked.

For recommended server specs, I’m deferring to @hareet who has most experience.

@danielmwakanema - thanks for sharing your issue! As @diana mentioned, before following any of the troubleshooting steps, be sure to upgrade your upgrade service. Assuming you’re using Docker to host, do this by pulling the image with:

docker pull public.ecr.aws/s5s3h4s7/cht-upgrade-service

Then be sure to stop and start your upgrade service so that the new image is loaded:

docker stop $(docker ps -aq --filter "name=cht-upgrade-service")
docker start $(docker ps -aq --filter "name=cht-upgrade-service")

You can check before and after to see whether your image is updated with a docker images call; it should have been created ~4 months ago:

docker images public.ecr.aws/s5s3h4s7/cht-upgrade-service
REPOSITORY                                    TAG       IMAGE ID       CREATED        SIZE
public.ecr.aws/s5s3h4s7/cht-upgrade-service   latest    bf1133f540ed   4 months ago   396MB

@elijah Do you have the specs used for KE eCHIS for a comparable county size?


I think this is maybe the worst replication performance I’ve seen for any CHT 4.x instance.

0 users at replication_limit
Says you’re on 4.3.1 though… Didn’t you say you upgraded? Do you want help with that?

@kenn we reverted it later yesterday because v4.6.0 kept crashing during indexing.

Can you please be a little more verbose? We need a bit more information to help you.

we could only upgrade up to v4.6.0, from v4.3.1

So is this right?

  1. You upgraded to 4.6.0
  2. The upgrade to 4.6.0 completed (?)
  3. You had some crashes
  4. After some time of running 4.6.0, you downgraded back to 4.3.1?

Can you provide some details about the crashes? Logs? At least which service is crashing? Why exactly do you think it was “during indexing” (usually this completes before the upgrade)? Did you check disk space, which is the usual cause of crashes during indexing?
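For example, something along these lines can show whether index builds are still running and how much disk is free; the CouchDB container name and credentials below are placeholders:

# List active CouchDB tasks; view index builds appear with type "indexer" (placeholder container name/credentials)
docker exec cht-couchdb curl -s -u admin:password http://localhost:5984/_active_tasks
# Check free disk space on the host
df -h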

@kenn, the upgrade to v4.6.0 completed. However, CouchDB would crash during the indexing that happens after a restart, with the following error:
(error shown in the attached screenshot, taken 2025-02-14 at 10.55.19)

We know it was during indexing because indexing tasks were running at those times; we verified this through Fauxton. We also verified that we have enough disk space.

For the KE deployment, the most comparable instance is Isiolo, with ~800 users. Server specs are 16 cores, 8 GB RAM, and 500 GB disk. It has been operating smoothly so far. I recommend monitoring resource consumption periodically to identify where the bottleneck is.

Thanks @elijah. This is good feedback.