Thanks for posting your question @derick!
The very best solution would be to use a hosting solution with solid I/O. Medic has hosted the CHT for many years across hundreds of instances and has not had this issue. The rest of my response assumes you cannot improve I/O and are forced to use your current hosting provider.
@diana, @jkuester, @elijah and I have been working on this issue for some time in a private ticket, which I'll summarize:
- While the private ticket is about a clone of a VM, the underlying slow storage issue applies to your production VMs as well
- The issue being faced is that upgrades fail
- The main issue is that `iowait` is very high, the exact symptom you would expect with slow I/O (slow reads and writes to disk)
- Current focus of the ticket has been to serially stage views during upgrades in hopes that this is less I/O intensive so that upgrades succeed
- We're pursuing using the `pre-stage` branch of the CHToolbox to test serial upgrades
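To make "serially" concrete, here is a minimal sketch of warming CouchDB views one at a time. It is not what the `pre-stage` branch does internally; it only illustrates the idea of never having two index builds competing for the disk. The function names, the database name, and the design document and view names are all hypothetical. The only CouchDB behaviour it relies on is that a view query (`GET /{db}/_design/{ddoc}/_view/{view}`) by default does not return until the index is up to date.

```python
from typing import Callable, Iterable, List, Tuple
from urllib.parse import quote


def view_urls(base_url: str, db: str, views: Iterable[Tuple[str, str]]) -> List[str]:
    """Build one CouchDB view query URL per (design_doc, view) pair.

    limit=0 asks for no rows, so the request only triggers/awaits indexing.
    """
    base = base_url.rstrip("/")
    return [
        f"{base}/{quote(db, safe='')}/_design/{quote(ddoc, safe='')}"
        f"/_view/{quote(view, safe='')}?limit=0"
        for ddoc, view in views
    ]


def warm_serially(urls: Iterable[str], fetch: Callable[[str], object]) -> List[str]:
    """Query each view in turn; the next request starts only after the
    previous call to fetch() has returned."""
    done = []
    for url in urls:
        # Assumption: fetch() is a blocking HTTP GET without a short timeout,
        # so it returns once CouchDB has brought this index up to date.
        fetch(url)
        done.append(url)
    return done
```

In real use `fetch` could be something like `urllib.request.urlopen` with authentication added. The trade-off is that serial indexing takes longer in wall-clock time, in exchange for a lower peak I/O load.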
To your question:
Cold storage. Do we know if it has yields significant improvements?
Anything you can do to reduce I/O will help, including cold storage. However, cold storage is very hard to implement and may cause CHWs to sync large amounts of data. We do not provide documentation around this. An easier solution would be purging.
However, we're seeing errors like this in our tests:
could not load validation funs + corrupted_data
This suggests that when the server is under load, any transaction could fail. That is, with your current number of users writing the current number of new documents, any transaction may fail and need to be retried. I'm doubtful that deleting documents via cold storage or implementing purging will help.
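For illustration only, this is roughly what "fail and need to be retried" means for any client or script writing to the server: a generic retry wrapper with exponential backoff and jitter. It is a sketch, not something the CHT ships; the `retry` function and its parameters are hypothetical, and retrying only hides the symptom of slow I/O, it does not fix it.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(
    operation: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation() until it succeeds or attempts run out.

    The wait doubles after each failure, plus random jitter so that many
    clients do not all retry at the same moment against a loaded server.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay))
    raise ValueError("attempts must be at least 1")
```

A real caller would catch only the errors it knows to be transient rather than every `Exception`.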
RAID setup
We're seeing an `iowait` (from `iostat`) on your VMs as high as `40`. A single SSD should be able to support any given CHT instance. So, yes, RAID might help, but so would having a single SSD local to the VM. RAID is one solution to "Slow I/O", but maybe not the best given how prescriptive it is. Hypothetically speaking, if your I/O-bound VM is on a heavily congested network connection to a SAN that somehow doesn't have RAID, demanding RAID on the SAN won't solve the heavily congested network connection.
Understanding why the VM has such bad `iowait` would be the first step. Maybe you can ask your hosting provider for a bare-metal box instead of a VM? I assume the SAN gives you features like disk snapshots, rollbacks and easily increasing the size of the VM disk. However, if your CHT instance can't serve the basic function of supporting CHWs, these SAN features aren't worth it.
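If it helps when talking to the hosting provider, here is a small sketch for logging `iowait` over time on a Linux VM without `iostat`. It reads the aggregate `cpu` line in `/proc/stat` twice and reports the share of CPU time spent in iowait between the two samples, which approximates the figure `iostat` prints. The function names are my own, and it assumes a reasonably modern kernel that exposes at least the first eight counters (user, nice, system, idle, iowait, irq, softirq, steal).

```python
import time
from typing import List


def parse_cpu_line(stat_text: str) -> List[int]:
    """Return user, nice, system, idle, iowait, irq, softirq, steal
    from the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            # guest/guest_nice are skipped: they are already counted in user/nice
            return [int(v) for v in fields[1:9]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_percent(before: str, after: str) -> float:
    """Percentage of CPU time spent in iowait between two /proc/stat samples."""
    deltas = [b - a for a, b in zip(parse_cpu_line(before), parse_cpu_line(after))]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total  # index 4 is the iowait counter


def sample(interval: float = 5.0, path: str = "/proc/stat") -> float:
    """Take two samples `interval` seconds apart and return iowait percent."""
    with open(path) as f:
        before = f.read()
    time.sleep(interval)
    with open(path) as f:
        after = f.read()
    return iowait_percent(before, after)
```

Running `sample()` in a loop during an upgrade attempt and during normal use gives a simple record of when the disk becomes the bottleneck.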