I/O performance optimisation

We are working with a hosting provider with slow I/O and are also looking to explore potential optimisation strategies to help mitigate the situation. Below are some options we are considering:

  • Cold storage. Do we know if it yields significant improvements?
  • RAID setup

Are there others we can consider?


Thanks for posting your question @derick!

The very best solution would be to use a hosting solution with solid I/O. Medic has been hosting the CHT for many years across hundreds of instances and has not had this issue. The rest of my response assumes you cannot improve I/O and are forced to use the current hosting provider.

@diana, @jkuester, @elijah and I have been working on this issue for some time in a private ticket, which I’ll summarize:

  • While the private ticket is about a clone of a VM, the underlying slow storage issue applies to your production VMs as well
  • The issue being faced is that upgrades fail
  • The main issue is that iowait is very high, the exact symptom you would expect with slow I/O (slow reads and writes to disk)
  • Current focus of the ticket has been to serially stage views during upgrades in hopes that this is less I/O intensive so that upgrades succeed
  • We’re pursuing use of the pre-stage branch of the CHToolbox to test serial upgrades

To your question:

Cold storage. Do we know if it yields significant improvements?

Any activity that reduces I/O will help, including cold storage. However, cold storage is very hard to implement and may cause CHWs to sync large amounts of data, and we do not provide documentation around it. An easier solution would be purging.
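For anyone unfamiliar with purging: it’s configured with a server-side purge function in app settings. Per the CHT purging docs, the function receives the user context, a contact, and that contact’s reports and messages, and returns the ids of docs to purge. A minimal sketch, where the 90-day window and the use of `reported_date` as the cutoff are illustrative choices, not a recommendation:

```javascript
// Illustrative CHT purge function (configured under `purge` in app
// settings). Reports older than the window are purged from the server,
// so they no longer contribute to view builds or sync.
const WINDOW_MS = 90 * 24 * 60 * 60 * 1000; // example: 90 days

const purgeFn = (userCtx, contact, reports, messages) => {
  const cutoff = Date.now() - WINDOW_MS;
  return reports
    .filter(report => report.reported_date < cutoff) // reported_date is epoch ms
    .map(report => report._id);
};
```

Keep in mind that purging reduces future reads; it does not make writes any faster.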

However, we’re seeing errors like this in our tests:

could not load validation funs + corrupted_data

This suggests that when the server is under load, any transaction could fail. That is, with your current number of users writing the current number of new documents, any transaction may fail and need to be retried. I’m doubtful that deleting documents via cold storage or implementing purging will help.
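If transient failures under load are unavoidable, the usual client-side mitigation is to retry the operation with exponential backoff. A generic sketch (the operation and its error are stand-ins, not CHT APIs):

```javascript
// Generic retry-with-exponential-backoff wrapper. `op` is any async
// operation (e.g. a CouchDB write) that may fail transiently under load.
const withRetry = async (op, { attempts = 5, baseDelayMs = 100 } = {}) => {
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      return await op();
    } catch (err) {
      if (attempt === attempts - 1) throw err; // out of retries: surface the error
      // Exponential backoff: 100ms, 200ms, 400ms, ...
      await new Promise(res => setTimeout(res, baseDelayMs * 2 ** attempt));
    }
  }
};
```

Retries paper over the symptom rather than fixing the slow storage, but they can keep individual transactions from being lost while the underlying I/O issue is addressed.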

RAID setup

We’re seeing an iowait (from iostat) on your VMs as high as 40. A single SSD should be able to support any given CHT instance. So, yes, RAID might help, but so would a single SSD local to the VM. RAID is one solution to "Slow I/O", but maybe not the best given how prescriptive it is. Hypothetically speaking, if your I/O-bound VM is on a heavily congested network connection to a SAN that somehow doesn’t have RAID, demanding RAID on the SAN won’t solve the heavily congested network connection.

Understanding why the VM has such bad iowait would be the first step. Maybe you can ask your hosting provider for a bare-metal box instead of a VM? I assume the SAN gives you features like disk snapshots, rollbacks and easily increasing the size of the VM disk. However, if your CHT instance can’t serve the basic function of supporting CHWs, these SAN features aren’t worth it.

Hi @derick

So sorry to hear about your disk performance issues!

Unfortunately, cold storage via CouchDB purging turned out to be unfeasible, due to the need to rebuild view indexes after a significant number of docs are purged. The purging operation is also quite slow (even on performant hardware!).

Another option for cold storage would be filtered replication: replicate only the needed documents to a new database and start the CHT project using that database.
This would most likely involve some downtime, and replicating to the new DB will take additional resources.
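To make this concrete: a selector-based filtered replication can be kicked off by POSTing a body like the one built below to CouchDB’s `/_replicate` endpoint. The database URLs and the `type` values are placeholders for illustration, not the actual CHT doc types you’d want to keep:

```javascript
// Build a request body for CouchDB's POST /_replicate endpoint that
// copies only docs matching a Mango selector into a fresh database.
// URLs and `type` values are placeholders.
const buildFilteredReplication = (source, target, types) => ({
  source,
  target,
  create_target: true,                 // create the target db if missing
  selector: { type: { $in: types } },  // Mango selector: keep only these types
});

const body = buildFilteredReplication(
  'http://localhost:5984/medic',
  'http://localhost:5984/medic-slim',
  ['person', 'clinic']
);
```

Selector-based replication is simple to set up, but because the selector is evaluated per document on the source, it can be slow on a large database.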

While it’s true that a smaller database will reduce the number of disk reads, the writes will still be slow. So it’s unknown how much of an impact a smaller database will have with low disk performance.

@diana We do that when setting up staging/UAT instances – basically a slim instance where a user can log in with their prod credentials, view their hierarchy and complete workflows, but they don’t have any of their old data (data_records). Essentially we do a filtered replication of only contacts, _users and user-settings. This method is very quick and you can come up with a "clone" in only a few hours.

Applying it to this problem, I think it would work and the users would log in quickly, but then they would all start uploading documents en masse (documents which are not present in the slim instance), leading to the same performance issues. Another challenge is that the fastest CouchDB filtered replication method (using views) cannot be applied here, since views not building is a big part of the problem. Replications using a selector or a filter were too slow in our tests. So ultimately an external filtered replication might be considered, which is where we ended up.
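By "external filtered replication" I mean a script that reads the source’s `_changes` feed with `include_docs=true`, filters the docs itself, and writes the survivors to the target via `_bulk_docs` with `new_edits: false` (preserving source revisions, as the replicator does). A sketch of the filter-and-payload step, with placeholder doc types:

```javascript
// Given one batch from the source's _changes feed (include_docs=true),
// keep only the wanted doc types and shape a _bulk_docs payload for the
// target. `new_edits: false` preserves source revisions, like replication.
const WANTED_TYPES = new Set(['person', 'clinic', 'user-settings']); // placeholders

const toBulkDocsPayload = (changes) => ({
  new_edits: false,
  docs: changes
    .map(change => change.doc)
    .filter(doc => doc && !doc._deleted && WANTED_TYPES.has(doc.type)),
});
```

Because the filtering happens in the script rather than inside CouchDB, the source only has to stream its changes feed, which is much cheaper than evaluating a selector or a filter function per document.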


Hi @rmayore

I don’t think this will be a problem, because the amount of docs on users’ devices is likely much less than what exists on the server, due to purging. You do have purging enabled, right?

I checked out the purging configuration… I think it’s well set up (most docs are set to be purged after a 1-3 month window).