The Medic team has probably been quite loud about this, but to recap: in version 5.x, which will be published soon, we include a new version of CouchDB along with many more improvements. More details about 5.x are available here.
The new version of CouchDB includes a new search engine, called Nouveau, which we are using in 5.x for free-text searches. It turns out the Nouveau engine performs very well at large key index queries, which makes it a good candidate for the queries used for replication.
The replication implementation that uses this new engine is ~40% faster for users with large amounts of data.
Given that correct replication is at the base of offline users’ workflows, it’s extremely important to test changes to the algorithm very thoroughly. For this effort, I would like to ask for community assistance.
The algorithm can be found on the test-replication-nouveau branch and can be installed through the Admin Upgrade page. A test would involve:
using either your current test CHT deployment OR creating a new test deployment on CHT v4.21 or master
create or make sure you have some offline users
fully replicate (log in through the app) with the offline users, or call /api/v1/initial-replication/get-ids with their accounts, and cache the list of documents that the users download
upgrade to the test-replication-nouveau branch through the Admin Upgrade page, then fully replicate (log in through the app) with the same offline users, or call /api/v1/initial-replication/get-ids again, and cache the list of documents that the users download
compare the list of docs that the same users downloaded on one version versus the other
The test passes if the users download the same documents on both versions.
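To make the comparison step concrete, here is a minimal sketch of how the doc-id lists could be captured and diffed, assuming Python with the requests library; the base URL, the user list and the doc_ids_revs field name are assumptions, so check the actual response from your instance and adjust.

```python
import json
import requests

BASE_URL = 'https://my-test-instance.example.com'  # hypothetical CHT instance
USERS = {'offline_user_1': 'secret'}                # hypothetical offline test users

def get_doc_ids(user, password):
    resp = requests.get(f'{BASE_URL}/api/v1/initial-replication/get-ids',
                        auth=(user, password))
    resp.raise_for_status()
    # Assumption: the ids are listed under "doc_ids_revs"; check the real payload.
    return sorted(entry['id'] for entry in resp.json().get('doc_ids_revs', []))

def save(user, label):
    # Run with label '4.21' before the upgrade and 'nouveau' after it.
    with open(f'{user}.{label}.json', 'w') as f:
        json.dump(get_doc_ids(user, USERS[user]), f)

def compare(user):
    with open(f'{user}.4.21.json') as f_before, open(f'{user}.nouveau.json') as f_after:
        before, after = set(json.load(f_before)), set(json.load(f_after))
    # The test passes when both difference sets are empty.
    print(user, 'only on 4.21:', before - after, 'only on nouveau:', after - before)
```

Run save() for each user on both versions, then compare(); both difference sets should be empty.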
More elaborate testing would include the following (see the spot-check sketch after the list):
making sure replication_depth for contacts and/or reports is respected
making sure reports with needs_signoff get downloaded by supervisors
making sure sensitive documents are not downloaded by users
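For these checks, one low-effort option is to spot-check a few hand-picked document ids against the cached lists from the sketch above; all ids below are placeholders to be replaced with real ones from your own data (for example a needs_signoff report a supervisor should receive, or a document outside a user’s replication_depth or marked sensitive that they should never receive).

```python
import json

def spot_check(user, label, expected_ids=(), forbidden_ids=()):
    # Reads the id list cached by the earlier sketch (e.g. 'supervisor_1.nouveau.json').
    with open(f'{user}.{label}.json') as f:
        ids = set(json.load(f))
    missing = [i for i in expected_ids if i not in ids]
    leaked = [i for i in forbidden_ids if i in ids]
    print(user, label, 'missing expected docs:', missing, 'unexpected docs:', leaked)

# Hypothetical ids - replace with real ones from your own data:
# spot_check('supervisor_1', 'nouveau',
#            expected_ids=['report-uuid-with-needs-signoff'],
#            forbidden_ids=['doc-uuid-outside-replication-depth', 'sensitive-doc-uuid'])
```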
We would really appreciate the community assisting us with testing this improvement!
Please let us know if there’s any additional setup or details that would be helpful.
Thank you in advance!
I’m working with a clone of a production instance that has 4,805,515 docs in the medic database and ~800 users, hosted on an 8 CPU / 16 GB RAM EC2 instance in Docker, currently on CHT 4.21.1. We have users with anything from 1k documents all the way up to one with over 350k documents.
Our plan is to test a bunch on 4.21.1 and then upgrade to cinque and re-run the same tests.
We’ll report back our findings! Please also send suggestions of what to test our way.
@diana - I have 5 test users on my cloned production instance with anywhere from 5k to 200k documents each. I’ve written a test script to query the /api/v1/initial-replication/get-ids API 5 times for each user. We can then average the times together. The users on the cloned instance have had their passwords reset to be all the same using curl.
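In case it helps anyone reproduce this, here is a rough sketch along the lines of the script described above (not the actual script): it times five calls to /api/v1/initial-replication/get-ids per user and writes one CSV row per run. The base URL, usernames, shared password and the doc_ids_revs field name are assumptions to adjust for your own clone.

```python
import csv
import time
import requests

BASE_URL = 'https://my-cloned-instance.example.com'    # hypothetical cloned instance
USERS = ['user1', 'user2', 'user3', 'user4', 'user5']  # hypothetical usernames
PASSWORD = 'shared-test-password'                      # all test users share this password
RUNS = 5

with open('get-ids-timings.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['user', 'run', 'seconds', 'doc_count'])
    for user in USERS:
        for run in range(1, RUNS + 1):
            start = time.monotonic()
            resp = requests.get(f'{BASE_URL}/api/v1/initial-replication/get-ids',
                                auth=(user, PASSWORD))
            resp.raise_for_status()
            elapsed = time.monotonic() - start
            # Assumption: doc count is the length of "doc_ids_revs"; adjust to the real payload.
            doc_count = len(resp.json().get('doc_ids_revs', []))
            writer.writerow([user, run, round(elapsed, 2), doc_count])
```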
I’ll record a CSV of the results for 4.21 and then re-run the script after we upgrade to the cinque version. Here’s an example of what the data will look like:
After doing tests with 5 users on a clone of a production instance using the above script, here are our results:
| User | 4.21 avg (seconds) | Cinque avg (seconds) | Doc Count | % Faster |
|---|---|---|---|---|
| user 1 | 6.23 | 2.95 | 2,307 | 52.55% |
| user 2 | 187.84 | 68.63 | 83,049 | 63.46% |
| user 3 | 284.63 | 104.81 | 126,236 | 63.18% |
| user 4 | 56.61 | 39.85 | 110,959 | 29.60% |
| user 5 | 289.25 | 89.47 | 121,833 | 69.07% |
Full disclosure: we discarded the fastest and slowest times to try to get more realistic results. Using Google Sheets and the 5 data points per user, this formula was used to calculate the average times: `=(sum(J24:J28)-max(J24:J28)-min(J24:J28))/(count(J24:J28)-2)`
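For reference, the same trimmed average in a few lines of Python, in case anyone prefers scripting it over a spreadsheet (the run times in the example are made up):

```python
# Drop the single fastest and slowest run, then average the remaining runs -
# the same calculation as the Sheets formula above.
def trimmed_avg(times):
    times = sorted(times)
    return sum(times[1:-1]) / (len(times) - 2)

print(trimmed_avg([5.20, 6.01, 6.23, 6.45, 7.90]))  # hypothetical run times -> 6.23
```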
Not everyone will see this type of improvement, but we saw an almost 70% improvement - this is astounding!!
I appreciate the thorough testing! I know the process was cumbersome and there were many hurdles. The numbers look great, I’m really happy with the results.
Should we move contacts_by_depth to Nouveau also?
So that all the indexes that replication depends on use the same indexer.
This would be less for performance directly, since contacts_by_depth isn’t as much of a bottleneck as docs_by_replication_key, and more for simplicity and consistency.
Building and updating indexes has different performance characteristics in Lucene compared to CouchDB map/reduce views.
With one index in Nouveau and one as a CouchDB map/reduce view, the overall performance of replication depends on both CouchDB map/reduce and Lucene.
With both in Nouveau, replication depends only on Lucene.
Practically, it might not make that much difference; in particular, it appears that the maximum concurrency of Nouveau indexing processes is still determined by the number of shards, despite Lucene not having the same concept of shards as CouchDB (it’s one Lucene index per CouchDB shard, apparently).
But it could reduce complexity by having only one indexing engine for the whole replication process.
I think this can be something to test in a next iteration.
I think we’re quite safe with contacts_by_depth’s performance, so there is no strong argument for doing it quickly.
I personally don’t really like the Lucene search API, and we don’t have a good way of calling it, compared to the PouchDB APIs, which are clean and documented.
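For illustration only, here is roughly what the two call shapes look like when querying CouchDB directly, based on the CouchDB view and Nouveau HTTP APIs; the database URL, the Nouveau index name, the key shape and the Lucene query string are all placeholders rather than actual CHT design documents, and response field names may differ.

```python
import json
import requests

COUCH = 'http://admin:pass@localhost:5984/medic'  # hypothetical direct CouchDB access

# Map/reduce view: structured, key-based query against a view in the medic ddoc.
view = requests.get(
    f'{COUCH}/_design/medic/_view/contacts_by_depth',
    params={'key': json.dumps(['some-place-uuid']), 'limit': 10},  # placeholder key
).json()

# Nouveau (Lucene) index: a Lucene query string passed as "q".
nouveau = requests.get(
    f'{COUCH}/_design/medic/_nouveau/some_nouveau_index',  # hypothetical index name
    params={'q': 'key:"some-place-uuid"', 'limit': 10},    # placeholder Lucene query
).json()

print(sorted(view), sorted(nouveau))  # inspect the (different) response shapes
```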