Why the CHT uses CouchDB

gareth · August 23, 2022, 1:00am

A lot has changed in the CHT since it launched over 10 years ago. In the beginning it was designed for small SMS-based projects, and it has since matured to support nationwide CHW programs with rich smartphone applications. However, one thing that has remained the same is the use of CouchDB as the primary database. While there are many database options available, each with different strengths and weaknesses, CouchDB meets many of the requirements for the CHT.

Benefits of CouchDB

Offline first applications

One of the main advantages of CouchDB is that it provides a robust master-to-master replication protocol out of the box. This means two separate databases can be kept in sync, with the database handling document updates, deletions, and conflicts. It works seamlessly with PouchDB, a database that runs in the browser, so web applications can create documents locally and replicate them back to the server when an internet connection is available. This is a fundamental component of offline-first applications, which are essential for Community Health Workers because many of the places they work have no internet availability at all.
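To make the offline-first idea concrete, here is a deliberately simplified Python sketch. It is not how CouchDB or PouchDB are implemented: real replication tracks revisions and conflicts, which this toy ignores. The `Replica` class and `push` function are hypothetical names used only for illustration.

```python
class Replica:
    """Toy document store: maps a document ID to its body."""

    def __init__(self):
        self.docs = {}
        self.pending = []  # IDs written locally but not yet pushed

    def put(self, doc_id, body):
        """Write locally; this always succeeds, even with no connection."""
        self.docs[doc_id] = body
        if doc_id not in self.pending:
            self.pending.append(doc_id)


def push(local, server, online):
    """Copy pending documents to the server; do nothing while offline."""
    if not online:
        return 0
    pushed = 0
    for doc_id in local.pending:
        server.docs[doc_id] = local.docs[doc_id]
        pushed += 1
    local.pending.clear()
    return pushed


phone = Replica()
server = Replica()
phone.put("report-1", {"type": "visit"})          # created while offline
offline_result = push(phone, server, online=False)  # nothing is sent
online_result = push(phone, server, online=True)    # sent once connected
```

The point is only that writes never wait for the network; the sync step catches up whenever a connection appears.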

Open and free

CouchDB is built to be open source and free from license fees which makes it a natural fit for the CHT. Partners can be assured that not only will their programs be free to run forever, but they have complete ownership of their data. Importantly this includes where the data is stored which allows for compliance with laws for safeguarding Protected Health Information.

Schemaless

The CHT has always been highly configurable, particularly with the structure of the data being collected. Because of this the Core Framework utilizes the schemaless nature of CouchDB to store documents in whatever format is needed for a given workflow.

Community support

CouchDB also has a thriving community and provides responsive support over a variety of channels free of fees. This has been invaluable for debugging issues and determining best practice for development.

Clustering

For larger deployments, CouchDB natively supports clustering to balance the load over multiple machines. While the CHT doesn’t yet support database clustering, this feature is coming in 4.0.0, and testing shows a significant improvement in scalability. Keep watching the forum for more updates.

APIs

Virtually anything is possible with CouchDB’s extensive list of APIs, and because they use REST there is a wide range of tools to make scripting and integration easy.

Downsides of CouchDB

While it’s done a great job getting the CHT this far, there are some limitations to CouchDB.

Filtered replication

CHWs should only see a portion of the data on the server database. This is for two reasons: the phone can’t physically store as much data as the server, and for privacy reasons the user should only be able to access the information required for their job. To accomplish this, the CHT filters the replication to just a specific list of document IDs. Unfortunately this doesn’t perform well at scale. The reason is quite technical, but essentially every user has to skip over all the documents uploaded by other users to get to just the data they should sync. As filtered replication makes up the majority of requests to the server, this is a significant limitation. Incremental improvements have been made to how filtering is done, but more are now needed to keep up with the increasing scale of CHT deployments.
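A toy model in Python may help show why this gets expensive. It is only a sketch of the behaviour described above, not CouchDB's actual implementation; the feed layout, the document IDs, and the `filtered_changes` function are all made up for illustration.

```python
def filtered_changes(changes_feed, allowed_ids):
    """Return (matching changes, number of feed entries scanned)."""
    allowed = set(allowed_ids)
    matched = []
    scanned = 0
    for change in changes_feed:
        scanned += 1  # every entry is visited, even other users' documents
        if change["id"] in allowed:
            matched.append(change)
    return matched, scanned


# Hypothetical feed: 1,000 users with 50 documents each, interleaved.
feed = [
    {"seq": seq, "id": f"user{seq % 1000}-doc{seq // 1000}"}
    for seq in range(50_000)
]

# One user is only allowed to sync their own 50 documents.
my_ids = [f"user7-doc{n}" for n in range(50)]
matched, scanned = filtered_changes(feed, my_ids)
print(len(matched), "matched out of", scanned, "scanned")
```

In this model the work per user grows with the total size of the database rather than with the amount of data that user actually needs, and that cost is paid again for every user who syncs.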

Data queries

Another main limitation is that certain queries are very difficult to run on CouchDB, for example those needed for bespoke investigations or data dashboards. To mitigate this, it is recommended to sync the data to a data warehouse, which better supports these sorts of queries.

Views are slow to build

CouchDB uses map/reduce views which are indexed by executing a function on every document in the database and caching the result, which makes them quick and easy to query. However if a new view is created, or an existing view is updated with a new index function, then the cache needs to be updated for every document in the database, which can take a very long time. You can read more about this in the CouchDB documentation.
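The following Python stand-in sketches the idea. Real CouchDB views are typically written in JavaScript and stored in a B-tree, so treat this purely as an illustration; `build_index`, `reports_by_form`, and the sample document fields are assumptions made for the example.

```python
from collections import defaultdict


def build_index(docs, map_fn):
    """Run map_fn over every document and cache the emitted rows by key."""
    index = defaultdict(list)
    calls = 0
    for doc in docs:
        calls += 1  # one map call per document in the database
        for key, value in map_fn(doc):
            index[key].append((doc["_id"], value))
    return dict(index), calls


def reports_by_form(doc):
    """Hypothetical map function: emit (form name, 1) for report documents."""
    if doc.get("type") == "data_record":
        yield doc["form"], 1


docs = [
    {"_id": "a", "type": "data_record", "form": "pregnancy"},
    {"_id": "b", "type": "data_record", "form": "delivery"},
    {"_id": "c", "type": "person", "name": "Example"},
    {"_id": "d", "type": "data_record", "form": "pregnancy"},
]

index, calls = build_index(docs, reports_by_form)

# Querying only reads the cached rows; the "reduce" step here is a sum.
count = sum(value for _doc_id, value in index.get("pregnancy", []))
```

Querying the cached index is cheap, but changing `reports_by_form` invalidates the cache, so `build_index` has to visit every document again.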

This limitation is mitigated in the CHT by querying by ID when possible, by reusing views in preference to creating new ones, and by implementing view warming on upgrade so the reindexing can happen without any downtime.

The future

In the near term the CHT will a) support clustering, b) be upgraded to use the latest version of CouchDB, and c) undergo investigation to further improve the filtered replication algorithm that is currently limiting scale. Additionally, work is ongoing to develop a data pipeline product which will sync data to PostgreSQL so that queries are easy to develop and efficient to execute. These changes will help mitigate the first two downsides listed above.

Further in the future it’s quite possible the CHT will be modified to use a different database altogether. This would be a large project requiring a significant amount of code changes and have inherent risks, but this will be weighed against the benefits it would bring.

Ultimately the CHT will continue to evolve to help partners hit their goals and improve health outcomes in the hardest to reach places.

simon · August 23, 2022, 4:41am

Very insightful, and I’m sure it will be for our partners as well. Thanks for putting this together @gareth.

bamatic · August 24, 2022, 12:36am

Hi there,
Being able to use NoSQL-compatible databases other than CouchDB sounds great, and having the CHT’s database on a serverless cloud service with the CHT running in a separate VM might be really cool. But the marriage between PouchDB and CouchDB is so good that breaking it will take a lot of work to find a good alternative.

What would be a good way to sync data between CouchDB and a data warehouse like BigQuery without the PostgreSQL instance?

Currently we ingest data into BigQuery from PostgreSQL views and materialized views on a daily basis, but our ingestion tasks are often refused by the PostgreSQL server with ‘too many connections for user’, even though we don’t use any parallelism: we ingest data from one view and, once that has finished, from the next one, one by one.
Could you advise us on a way to ingest the JSON docs directly from the CouchDB database?
Thank you

gareth · August 24, 2022, 10:25pm

Getting data out of CouchDB is reasonably straightforward using the changes API. The changes then need to be loaded into your data warehouse. The next generation version of this for the CHT → PG is cht-sync, which should be reasonably easy to modify for whatever data store you want.
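As a rough sketch of what consuming the changes feed could look like, the Python below builds a `_changes` request URL and splits one response batch into rows to upsert, IDs to delete, and the checkpoint to resume from. The host, the database name `medic`, the batch size, and the helper names are assumptions for illustration; the HTTP call itself and the load into the warehouse are left out, and this is not how cht-sync is implemented.

```python
import json
from urllib.parse import urlencode


def changes_url(base_url, db, since="0", limit=100):
    """Build a _changes URL that resumes from a saved checkpoint."""
    query = urlencode({"since": since, "limit": limit, "include_docs": "true"})
    return f"{base_url}/{db}/_changes?{query}"


def split_batch(response):
    """Split a parsed _changes response into upserts, deletes, and a checkpoint."""
    upserts, deletes = [], []
    for row in response["results"]:
        if row.get("deleted"):
            deletes.append(row["id"])
        else:
            upserts.append(row["doc"])
    return upserts, deletes, response["last_seq"]


# Hand-written sample in the shape CouchDB returns; the values are made up.
sample = json.loads("""
{
  "results": [
    {"seq": "1-abc", "id": "doc-1", "changes": [{"rev": "1-x"}],
     "doc": {"_id": "doc-1", "_rev": "1-x", "type": "example"}},
    {"seq": "2-def", "id": "doc-2", "changes": [{"rev": "2-y"}],
     "deleted": true,
     "doc": {"_id": "doc-2", "_rev": "2-y", "_deleted": true}}
  ],
  "last_seq": "2-def",
  "pending": 0
}
""")

url = changes_url("http://localhost:5984", "medic", since="0", limit=100)
upserts, deletes, checkpoint = split_batch(sample)
```

Saving `checkpoint` after each successfully loaded batch and passing it back as `since` means an interrupted run can pick up where it left off instead of starting from the beginning.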