The bottleneck
One of the biggest bottlenecks we've seen in pre-clustering CHT deployments was view indexing while the instance was under high load. This usually occurred with end-of-month syncing patterns, where large numbers of users push many docs in a short time frame and attempt to download docs at the same time. For these actions to succeed, the CouchDB server needs to index all of the newly pushed docs in a timely manner and then respond to the view query requests required for replication.
When CouchDB can't keep up with view indexing, replication requests fail with timeouts, and the logs contain entries like this:

    2022-09-01 07:54:58 ERROR: Server error: {
      error: 'timeout',
      reason: 'The request could not be processed in a reasonable amount of time.',
      status: 500,
      name: 'timeout',
      message: 'The request could not be processed in a reasonable amount of time.'
    }
Increasing CPU resources (CPU being the main bottleneck) helped to an extent, but only up to a point, beyond which additional resources made no difference: the instances were severely slowed even though resource monitoring showed plenty of capacity still available.
CouchDB Shards
CouchDB data is split into chunks, called shards. Distribution of data between these shards is handled entirely by CouchDB, and is not something that the CHT manages in any way.
You can think of every shard as a tiny server that is queried and written to individually, like a small database with its own indexes. When you make a view query request to CouchDB, it requests data from all shards and consolidates the responses, which means every shard needs up-to-date indexes before the server can fulfill the request. Each shard gets a single indexing process per design document (ddoc), and this limit of one indexer per shard per ddoc is the main reason for this investigation.
In CHT 4.x, we introduced clustering and increased the default number of shards in the CouchDB config from 8 to 12, so that CouchDB can use more resources and the shards can be evenly distributed across a 3-node cluster, which is the default clustered setup.
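To see how a given deployment is sharded, CouchDB exposes this information over HTTP: the database metadata includes the q value the database was created with, and the _shards endpoint shows which node hosts each shard range. Here is a minimal sketch in Python, assuming a local CouchDB reachable with placeholder admin credentials:

```python
# Minimal sketch: inspect the shard layout of the medic database.
# The URL and credentials are placeholders; adjust for your deployment.
import requests

COUCH = "http://admin:password@localhost:5984"

# Database metadata includes the cluster options it was created with.
meta = requests.get(f"{COUCH}/medic").json()
print("q (shards):", meta["cluster"]["q"])
print("n (replicas):", meta["cluster"]["n"])

# _shards shows which node hosts each shard range.
shards = requests.get(f"{COUCH}/medic/_shards").json()["shards"]
for shard_range, nodes in shards.items():
    print(shard_range, "->", nodes)
```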
The experiment
When many documents are pushed to the server and significant view indexing is happening, CouchDB starts a distinct indexing process for every (ddoc, shard) pair. For example, if a database has 2 ddocs and 8 shards, there will be 2 * 8 = 16 indexing processes, so the optimal setup gives CouchDB access to at least 16 CPU cores to minimize resource contention; having more than 16 cores will not improve view indexing time further (this applies to an otherwise idle server - on a busy server the extra cores are still useful for fulfilling other requests). For a 2-ddoc database with 12 shards, the ideal number of cores is 24, and so on.
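To make the arithmetic concrete, here is a tiny illustration of the (ddocs x shards) rule described above, using only the example numbers from this paragraph:

```python
# One indexing process per shard per design document (ddoc): a full rebuild on an
# otherwise idle server is least contended when cores >= ddocs * shards.
ddocs = 2    # e.g. medic and medic-client
shards = 12  # the CouchDB q value

indexing_processes = ddocs * shards
print(f"{indexing_processes} concurrent indexing processes during a full rebuild")
print(f"=> at least {indexing_processes} CPU cores to avoid contention")
```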
Tests have confirmed this. Below are the findings from a view indexing test run on various standard AWS instance types with differing numbers of cores, deploying the CHT Core 4.16.0 ddocs with varying shard counts and tracking how long it takes to rebuild the indexes. The experiment uses a generated data set of 800,000 documents, with each user having 2,200 documents.
| Instance type | Nbr. cores | Nbr. shards | View rebuild (min) |
|---|---|---|---|
| c5.2xlarge | 8 | 12 | 44.24 |
| c5.4xlarge | 16 | 12 | 21.47 |
| c5.4xlarge | 16 | 16 | 18.41 |
| c5.4xlarge | 16 | 16 | 20.18 |
| c5.4xlarge | 16 | 24 | 16.99 |
| c5.4xlarge | 16 | 12 | 23.45 |
| c5.4xlarge | 16 | 16 | 20.69 |
| c5.4xlarge | 16 | 24 | 19.81 |
| c5.9xlarge | 36 | 12 | 18.46 |
| c5.9xlarge | 36 | 18 | 15.27 |
| c5.9xlarge | 36 | 36 | 9.09 |
| c5.9xlarge | 24 (out of 36) | 12 | 20.04 |
| c5.9xlarge | 24 (out of 36) | 24 | 14.91 |
| c5.9xlarge | 24 (out of 36) | 36 | 11.42 |
| c5.2xlarge x3 | 24 (8 * 3) | 12 | 19.75 |
| c5.2xlarge x3 | 24 (8 * 3) | 24 | 13.74 |
| c5.9xlarge | 36 | 12 | 18.3 |
| c5.9xlarge | 36 | 16 | 15.23 |
| c5.9xlarge | 36 | 36 | 9.44 |
| c5.9xlarge | 36 | 24 | 12.48 |
| c5.9xlarge | 36 | 36 | 10.27 |
| c5.12xlarge | 48 | 12 | 16.21 |
| c5.12xlarge | 48 | 16 | 13.17 |
| c5.12xlarge | 48 | 24 | 9.98 |
| c5.12xlarge | 48 | 48 | 6.5 |
For a machine with 36 cores, increasing shard count from 12 to 36 reduced indexing time by 50%!
Yes, having more shards potentially eliminates one of the biggest bottlenecks in CHT replication.
The caveat
Because view queries need to consolidate responses from every shard, the performance of these requests degrades significantly as the number of shards increases.
Along with measuring view rebuild time, an initial replication scalability suite was run, in which a number of users concurrently run initial replication and the completion times are averaged.
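The shape of that measurement is roughly the following. This is only an illustration of the methodology, not the actual suite, and run_initial_replication() is a hypothetical stand-in for whatever drives a user's replication against the CHT instance:

```python
# Illustration only: time N concurrent initial replications and average them.
import time
from concurrent.futures import ThreadPoolExecutor

def run_initial_replication(user: str) -> float:
    """Hypothetical helper: run one user's initial replication, return seconds taken."""
    start = time.monotonic()
    # ... drive the CHT replication endpoints for this user here ...
    return time.monotonic() - start

users = [f"chw_{i}" for i in range(50)]  # 50 or 100 concurrent users in the runs below
with ThreadPoolExecutor(max_workers=len(users)) as pool:
    durations = list(pool.map(run_initial_replication, users))

print(f"average initial replication time: {sum(durations) / len(durations):.1f}s")
```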
These are the results:
| Instance type | Nbr. cores | Nbr. shards | View indexing (min) | Replicate 50 users (sec) | Replicate 100 users (sec) |
|---|---|---|---|---|---|
| c5.9xlarge | 36 | 12 | 18.46 | 47 | 92 |
| c5.9xlarge | 36 | 18 | 15.27 | 58 | 110 |
| c5.9xlarge | 36 | 36 | 9.09 | 83 | 177 |
| c5.9xlarge | 24 (out of 36) | 12 | 20.04 | 57 | 104 |
| c5.9xlarge | 24 (out of 36) | 24 | 14.91 | 70 | 138 |
| c5.9xlarge | 24 (out of 36) | 36 | 11.42 | 93 | 180 |
| c5.2xlarge x3 | 24 (8 * 3) | 12 | 19.75 | 55 | 105 |
| c5.2xlarge x3 | 24 (8 * 3) | 24 | 13.74 | 77 | 136 |
- Going from 12 to 36 shards on a 36-core machine reduced view indexing time by 50%, but increased replication time by 76% for 50 users and by 92% for 100 users.
- Going from 12 to 36 shards on a 24-core machine reduced indexing time by 43%, but increased replication time by 73% for 100 users.
- Going from 12 to 24 shards on a 24-core machine reduced indexing time by 25%, but increased replication time by 32% for 100 users.
- For a 3-node CouchDB cluster running on machines with 8 CPU cores each, going from 12 to 24 shards reduced view indexing time by 30%, but increased replication time by 30% for 100 users.
Conclusion
This experiment shows that there is no one-size-fits-all shard configuration: every option comes with its own advantages and disadvantages.
Modifying the number of shards is not a trivial endeavor. There are two ways in which this can be achieved:
1. By using the built-in _reshard CouchDB endpoint (a sketch of this option is shown after this list). This has limitations: it currently only supports splitting a shard in two, which means you can only ever double the number of shards you have (8 → 16, 12 → 24, etc.).
2. By changing the CouchDB setting for the number of shards (the q value), creating a new database, and replicating the data from the medic database into it. This requires a script that also copies the local documents, which hold important client syncing information, followed by setting the new database as the main database and restarting API and Sentinel.
Both these options require significant downtime due to view rebuilding, as well as additional storage.
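For reference, here is a minimal sketch of option 1 using CouchDB's _reshard API, assuming admin access at a placeholder URL. It submits one split job per shard of the medic database and then reports job state; run it only during a planned maintenance window:

```python
# Sketch: double the shard count of the medic database via CouchDB's _reshard API.
import requests

COUCH = "http://admin:password@localhost:5984"  # placeholder credentials

# Submit split jobs for every shard of the medic database (each shard splits in two).
resp = requests.post(f"{COUCH}/_reshard/jobs", json={"type": "split", "db": "medic"})
resp.raise_for_status()

# Check the state of the resharding jobs; wait until all report "completed".
jobs = requests.get(f"{COUCH}/_reshard/jobs").json()["jobs"]
for job in jobs:
    print(job["id"], job["job_state"])
```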
The CHT currently has two ddocs that are queried by the CHT API for replication and should always be up to date: medic and medic-client.
The best approach is to tune instances before initial deployment, adjusting the shard count relative to the number of CPU cores available to the deployment.
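As a rough illustration of sizing at creation time, CouchDB accepts an explicit q parameter when a database is created. Everything below is a placeholder (URL, credentials, database name); on a real CHT deployment the databases are created for you at startup, so in practice the same effect is achieved by setting the default q in the CouchDB configuration before the first start:

```python
# Sketch only: create a database with an explicit shard count (q) sized to the
# available cores, following the cores ~= ddocs * shards rule of thumb above.
import requests

COUCH = "http://admin:password@localhost:5984"  # placeholder credentials
CORES = 48
DDOCS = 2  # medic and medic-client are the ddocs queried during replication

q = CORES // DDOCS  # e.g. 24 shards for a 48-core machine with 2 ddocs
resp = requests.put(f"{COUCH}/example-db", params={"q": q})
resp.raise_for_status()
print(f"created example-db with q={q}")
```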
For existing deployments, it depends on whether they are hitting, or are close to hitting, the view indexing bottleneck: if replication requests time out, CouchDB reports view indexing timeouts, and the project already has more than double the number of cores as shards, then increasing the shard count is worth considering. Deployments that were migrated from CHT 3.x to CHT 4.x are likely to hit this bottleneck more often, and under less load, because CHT 3.x used only 8 shards and the 3.x-to-4.x migration process does not update the shard count.
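One quick way to check whether view indexing is the bottleneck is CouchDB's _active_tasks endpoint, which lists one indexer task per shard per ddoc currently being rebuilt, along with its progress. A minimal sketch, assuming admin access at a placeholder URL:

```python
# Diagnostic sketch: list active view indexer tasks and their progress.
import requests

COUCH = "http://admin:password@localhost:5984"  # placeholder credentials

tasks = requests.get(f"{COUCH}/_active_tasks").json()
indexers = [t for t in tasks if t["type"] == "indexer"]
print(f"{len(indexers)} shard indexers currently running")
for t in indexers:
    print(t["database"], t["design_document"], f'{t.get("progress", 0)}% done')
```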
Is clustering still worth it?
Results from this experiment show no clear advantage for clustering: we see the same view indexing and replication performance when using a cluster as when using a single machine with the same number of cores.
When we ran performance tests before 4.x and chose clustering as the main avenue for scaling the CHT, the CHT replication algorithm was different and so was the CouchDB version (2.3.1 vs 3.4.2); improvements made over time could explain this change.