Investigate adding more shards as a potential avenue for improved performance

The bottleneck

One of the biggest bottlenecks we’ve seen in pre-clustering CHT was view indexing while the instance was under high load. This usually occurred with end-of-month syncing patterns, where large numbers of users push many docs in a short time frame and attempt to download docs at the same time. For these actions to succeed, the CouchDB server needs to index all the newly pushed docs in a timely manner, and then respond to the view query requests required for replication.
When CouchDB can’t keep up with view indexing, replication requests fail with timeouts, and inspecting the logs shows entries like this:

2022-09-01 07:54:58 ERROR: Server error: {
  error: 'timeout',
  reason: 'The request could not be processed in a reasonable amount of time.',
  status: 500,
  name: 'timeout',
  message: 'The request could not be processed in a reasonable amount of time.'
}

Increasing CPU resources (CPU being the main bottleneck) worked to an extent, but eventually the returns diminished entirely: the instances were severely slowed down, yet resource usage showed plenty of capacity still available.
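When these timeouts appear, it helps to confirm that CouchDB really is busy building views. The _active_tasks endpoint lists running indexer tasks and their progress; below is a minimal diagnostic sketch (not part of the CHT itself), assuming the requests package and placeholder credentials/URL:

```python
# List running view indexer tasks and their progress.
# COUCH_URL and AUTH are placeholders - adjust for your deployment.
import requests

COUCH_URL = "http://localhost:5984"
AUTH = ("admin", "password")

tasks = requests.get(f"{COUCH_URL}/_active_tasks", auth=AUTH).json()
for task in tasks:
    if task["type"] == "indexer":
        print(f'{task["design_document"]} on {task["database"]}: {task.get("progress", 0)}% done')
```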

CouchDB Shards

CouchDB data is split into chunks called shards. Distribution of data between these shards is handled entirely by CouchDB and is not something the CHT manages in any way.
You can think of every shard as a tiny server that is queried and written to individually, like a small database with its own indexes. When you make a view query request, CouchDB requests data from all shards and consolidates the responses, which means all shards need up-to-date indexes before the server can fulfill the request. Each shard gets its own single indexing process per design document (ddoc), which is the main reason for this investigation.
In CHT 4.x, we introduced clustering and increased the default number of shards in the CouchDB config (from 8 to 12), so that CouchDB can use more resources and the shards can be evenly distributed across a 3-node cluster, which is the default setup for clustering.
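To check what an existing database is using, the database info endpoint reports the shard count (q) and replica count (n); new databases pick up their default q from the [cluster] section of the CouchDB config. A minimal sketch, again with placeholder credentials/URL:

```python
# A quick check of the shard layout of the medic database.
# COUCH_URL and AUTH are placeholders - point them at your CouchDB admin API.
import requests

COUCH_URL = "http://localhost:5984"
AUTH = ("admin", "password")

info = requests.get(f"{COUCH_URL}/medic", auth=AUTH).json()
print(info["cluster"])  # e.g. {'q': 12, 'n': 1, 'w': 1, 'r': 1}
```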
The experiment

When many documents are pushed to the server and significant view indexing is happening, CouchDB will start a distinct (ddocs x shards) number of indexing processes. For example, if a database has 2 ddocs and 8 shards, there will be 2 * 8 = 16 indexing processes, so the optimal setup gives CouchDB access to at least 16 CPU cores to ensure the least resource contention. Having more than 16 cores will not improve view indexing time on an otherwise idle server, though the extra cores are still useful for fulfilling other requests on a busy one. For a 2-ddoc database with 12 shards, the ideal number of cores is 24, and so on.
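As a back-of-the-envelope helper, the same rule of thumb can be written down directly (purely illustrative):

```python
# One indexing process per (design document x shard); an otherwise idle server
# indexes fastest when it has at least that many cores available.
def concurrent_indexers(ddocs: int, shards: int) -> int:
    return ddocs * shards

# e.g. a database with 2 ddocs, at various shard counts:
for shards in (8, 12, 16, 24):
    cores = concurrent_indexers(2, shards)
    print(f"{shards} shards -> up to {cores} indexers, so ideally >= {cores} cores")
```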

Tests have confirmed this. Below are the findings from a view indexing test run on various standard AWS instance types with different numbers of cores, deploying the CHT Core 4.16.0 ddocs with varying shard counts and tracking how long it takes to rebuild the indexes. This experiment uses a generated data set of 800,000 documents, with each user having 2,200 documents.

| Instance type | Nbr. cores | Nbr. shards | View rebuild (min) |
|---|---|---|---|
| c5.2xlarge | 8 | 12 | 44.24 |
| c5.4xlarge | 16 | 12 | 21.47 |
| c5.4xlarge | 16 | 16 | 18.41 |
| c5.4xlarge | 16 | 16 | 20.18 |
| c5.4xlarge | 16 | 24 | 16.99 |
| c5.4xlarge | 16 | 12 | 23.45 |
| c5.4xlarge | 16 | 16 | 20.69 |
| c5.4xlarge | 16 | 24 | 19.81 |
| c5.9xlarge | 36 | 12 | 18.46 |
| c5.9xlarge | 36 | 18 | 15.27 |
| c5.9xlarge | 36 | 36 | 9.09 |
| c5.9xlarge | 24 (out of 36) | 12 | 20.04 |
| c5.9xlarge | 24 (out of 36) | 24 | 14.91 |
| c5.9xlarge | 24 (out of 36) | 36 | 11.42 |
| c5.2xlarge x3 | 24 (8 * 3) | 12 | 19.75 |
| c5.2xlarge x3 | 24 (8 * 3) | 24 | 13.74 |
| c5.9xlarge | 36 | 12 | 18.3 |
| c5.9xlarge | 36 | 16 | 15.23 |
| c5.9xlarge | 36 | 36 | 9.44 |
| c5.9xlarge | 36 | 24 | 12.48 |
| c5.9xlarge | 36 | 36 | 10.27 |
| c5.12xlarge | 48 | 12 | 16.21 |
| c5.12xlarge | 48 | 16 | 13.17 |
| c5.12xlarge | 48 | 24 | 9.98 |
| c5.12xlarge | 48 | 48 | 6.5 |

For a machine with 36 cores, increasing shard count from 12 to 36 reduced indexing time by 50%!
Yes, having more shards potentially eliminates one of the biggest bottlenecks in CHT replication.

The caveat

Because view queries need to consolidate responses from every shard, the performance of these requests degrades significantly as the number of shards increases.
Along with measuring view rebuild time, an initial replication scalability suite was run, in which a number of users concurrently run initial replication and the completion time is averaged.

These are the results:

| Instance type | Nbr. cores | Nbr. shards | View indexing (min) | Replicate 50 users (sec) | Replicate 100 users (sec) |
|---|---|---|---|---|---|
| c5.9xlarge | 36 | 12 | 18.46 | 47 | 92 |
| c5.9xlarge | 36 | 18 | 15.27 | 58 | 110 |
| c5.9xlarge | 36 | 36 | 9.09 | 83 | 177 |
| c5.9xlarge | 24 (out of 36) | 12 | 20.04 | 57 | 104 |
| c5.9xlarge | 24 (out of 36) | 24 | 14.91 | 70 | 138 |
| c5.9xlarge | 24 (out of 36) | 36 | 11.42 | 93 | 180 |
| c5.2xlarge x3 | 24 (8 * 3) | 12 | 19.75 | 55 | 105 |
| c5.2xlarge x3 | 24 (8 * 3) | 24 | 13.74 | 77 | 136 |

  • Going from 12 to 36 shards on a 36-core machine reduced view indexing time by 50%, but increased replication time by 76% for 50 users and by 92% for 100 users.
  • Going from 12 to 36 shards on a 24-core machine reduced indexing time by 43%, but increased replication time by 73% for 100 users.
  • Going from 12 to 24 shards on a 24-core machine reduced indexing time by 25%, but increased replication time by 32% for 100 users.
  • For a 3-node CouchDB cluster running on machines with 8 CPU cores each, going from 12 to 24 shards reduced view indexing time by 30%, but increased replication time by 30% for 100 users.

Conclusion

This experiment shows that there is no one-size-fits-all solution for all deployments, and every solution comes with its advantages and disadvantages.

Modifying the number of shards is not a trivial endeavor. There are two ways in which this can be achieved:

  • Using the built-in _reshard CouchDB endpoint. This has limitations, as it currently only supports splitting a shard in two, which means you can only ever double the number of shards you have (8 → 16, 12 → 24, etc.).
  • Changing the CouchDB setting for the number of shards (the q value), creating a new database, and replicating data into it from the medic database. This requires a script to also copy local documents, which hold important client syncing information, and then setting the new database as the main one and restarting API and Sentinel.

Both options require significant downtime due to view rebuilding, and both require additional storage; a rough sketch of each is shown below.
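For illustration only, here is roughly what the two options look like against the CouchDB HTTP API. The URL, credentials and the medic-new database name are placeholders, and neither option should be run against a production instance without a tested runbook; note in particular that replication does not copy _local documents, hence the extra script mentioned above.

```python
# Rough sketch only - both options need downtime and extra storage.
import requests

COUCH_URL = "http://localhost:5984"
COUCH_URL_WITH_AUTH = "http://admin:password@localhost:5984"  # placeholder
AUTH = ("admin", "password")  # placeholder admin credentials

# Option 1: built-in resharding. This splits every shard range of "medic" in
# two (e.g. q=12 -> q=24); progress can be followed at GET /_reshard/jobs.
jobs = requests.post(f"{COUCH_URL}/_reshard/jobs", auth=AUTH,
                     json={"type": "split", "db": "medic"})
print(jobs.json())

# Option 2: create a new database with the desired q and replicate into it.
# _local documents are NOT copied by replication and need a separate script,
# and API/Sentinel must be pointed at the new database and restarted afterwards.
requests.put(f"{COUCH_URL}/medic-new", params={"q": "24"}, auth=AUTH)
requests.post(f"{COUCH_URL}/_replicate", auth=AUTH, json={
    "source": f"{COUCH_URL_WITH_AUTH}/medic",
    "target": f"{COUCH_URL_WITH_AUTH}/medic-new",
})
```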

The CHT currently has two ddocs that are queried by the CHT API for replication and should always be up to date: medic and medic-client.
The best approach is to tune instances before their initial deployment, choosing the shard count relative to the number of CPU cores available for the deployment.
For existing deployments, it depends on whether they are hitting, or are close to hitting, the view indexing bottleneck: if replication requests time out, CouchDB reports view indexing timeouts, and the project already has more than twice as many cores as shards, then increasing the shard count is worth considering. This is more likely to affect deployments that were migrated from CHT 3.x to CHT 4.x (they hit the bottleneck more often and under less load), because CHT 3.x used only 8 shards and the 3.x-to-4.x migration process does not update the shard count.

Is clustering still worth it?

Results from this experiment show no clear advantage to clustering: we see the same view indexing and replication performance when using a cluster as when using a single machine with the same number of cores.
When we ran performance tests pre-4.x and chose clustering as the main avenue for scaling the CHT, the replication algorithm was different and so was the CouchDB version (2.3.1 vs 3.4.2); the improvements made since then could explain this change.


Wow! This is amazing research. Thanks for taking the time to dive into this and publish a detailed and helpful write up here.

There’s a really important line in the conclusion:

Results from this experiment show no clear advantage for clustering

I want to confirm the implications of this:

  • We should stop recommending deployments run multi-node Couch
  • Instead, we should publish guides on when and how to increase shards when load demands it and extra cores are available
  • We should update our recommendations for running Kubernetes to note that only large, multi-tenant deployments should consider it

I bring this up to foster some public discussion about how we proceed here. I know that some national deployments have been struggling to implement Kubernetes. While it offers amazing high availability features and lends itself well to multi-tenancy, these deployments have also been mired in the complexity and training required to keep such a system up and running.

Thanks @mrjones

I think I can confirm that these are the implications. In today’s world of virtualization, the cost of having 32 cores is roughly the same whether they are in a single machine or in 4 machines with 8 cores each (at least, that’s how it seems on AWS), so the only consideration becomes performance.
I would be more confident confirming this if we knew of one large deployment running on a single-node Couch that handles load acceptably.

There might still be advantages to running services with resource isolation; at the moment, the default Docker approach dumps all containers onto one machine (I know that docker-compose allows for resource allocation, but this is not something we currently instruct).

I believe that some of our documentation already sends the correct message:

When:

  • you can no longer vertically scale your CHT instance because of hardware limitations
  • vertically scaling stops yielding better performance (currently estimated to be 32 cores and 200GB of RAM)
  • you’re starting a new deployment and you predict a large number of users (in excess of 1k)

This is where I was thinking we can further emphasize or add some extra information about sharding: Vertical vs Horizontal scaling | Community Health Toolkit
