CouchDB _purge and upgrading to version 4

Hello
We are planning two actions in our largest project, the one in Mali, and we would like your advice on the best order in which to perform them, if the order matters at all.

We want to delete some documents from the CouchDB database, a significant number of documents.
We want to update to the latest version of CHT 4.
So our goal is to have a CHT 4 version with a reduced number of documents.

Is there any advantage to performing the CouchDB purge of these documents once we have the CHT 4.x release?
Which action, if it matters, would be best to take first?

Hi @bamatic.

The order is not important, but you will have an advantage if you _purge before you upgrade because:

  • you will need to index views, and indexing will take less time with fewer documents
  • if you need to move data files around for the migration, you will have smaller files, which could be easier to manage
  • you can assess the success of your purge before taking on the upgrade

I’m curious what your plan is: how are you filtering the docs, are you going to write a script to purge, how many docs do you plan to delete?
I think this is the first time any partner has attempted something like this. I’d love to assist or contribute, if that’s welcome.

Hi Diana, thank you very much for this. Your offer to help us is really welcome.

We really need your expertise: given the huge number of docs that we want to _purge (if possible), we need a rough estimate of how long the purge and the installation will take.

In our specific case, we are already ingesting the PostgreSQL data into our data warehouse, so we will select the _id values of the docs to purge from our data warehouse.

Maybe we could give your team the list of _id values that we want to _purge. You could then copy these docs from the medic database to a new medic-cold database, and from medic-sentinel to a new medic-sentinel-cold, and _purge them using our id list. Once that is finished, we can launch the CHT 4 installation.
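
For reference, here is a minimal sketch of how the purge requests could be assembled from an id list. It assumes the current revisions are looked up first (for example with `POST /{db}/_all_docs` and a `keys` body) and that each batch is then sent to CouchDB's `POST /{db}/_purge`, whose body maps each doc id to a list of revisions. The function name `purge_bodies` and the batch size are illustrative and are not taken from the purging script mentioned in this thread. The value 100 matches what I understand to be the default limit on ids per purge request in CouchDB 3, so please check your own configuration.

```python
from typing import Dict, Iterable, Iterator, List


def purge_bodies(rows: Iterable[dict], batch_size: int = 100) -> Iterator[Dict[str, List[str]]]:
    """Turn _all_docs rows into request bodies for POST /{db}/_purge.

    Each row is expected to look like {"id": ..., "key": ..., "value": {"rev": ...}}.
    Rows without a revision (for example {"key": ..., "error": "not_found"}) are skipped.
    """
    batch: Dict[str, List[str]] = {}
    for row in rows:
        rev = (row.get("value") or {}).get("rev")
        if "id" not in row or rev is None:
            continue  # nothing to purge for this id
        batch[row["id"]] = [rev]
        if len(batch) >= batch_size:
            yield batch
            batch = {}  # start a fresh dict so the yielded one is not mutated
    if batch:
        yield batch
```

Each yielded dict can be serialized as JSON and posted to `_purge` by whatever HTTP client the script already uses.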

Usually, when upgrading CHT, we ask our users to sync all data and to shut down their devices, so we can start purging with no users connected.

We currently have:

type             count
info             26,821,420
data_record      18,902,253
task              6,815,813
person              621,996
tombstone           578,007
contact             130,543
feedback             54,480
target               17,996
task:outbound         5,535
user-settings           576
null                    364

We want to purge all data_records older than 2022-01-01 (reported_date < 1640908800000), all tasks with authoredOn before 2022-01-01, all data_records with form home_visit older than 2024-01-01, and all self_assessment and epi_daily_report docs.
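
The rules above could be written as a single predicate, sketched below. It assumes that `reported_date` and `authoredOn` are stored as epoch milliseconds, that `self_assessment` and `epi_daily_report` are values of the `form` field on data_records, and that cutoffs are taken at midnight UTC; the names `should_purge`, `CUTOFF_2022` and `CUTOFF_2024` are made up for illustration. Note that 2022-01-01T00:00:00Z is 1640995200000, while the value 1640908800000 quoted above is 2021-12-31T00:00:00Z, one day earlier, so it is worth confirming which cutoff is intended.

```python
from datetime import datetime, timezone


def epoch_ms(year: int, month: int, day: int) -> int:
    """Midnight UTC of the given date, in epoch milliseconds."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp() * 1000)


CUTOFF_2022 = epoch_ms(2022, 1, 1)
CUTOFF_2024 = epoch_ms(2024, 1, 1)

# Assumption: these are form names on data_record docs.
ALWAYS_PURGE_FORMS = {"self_assessment", "epi_daily_report"}


def should_purge(doc: dict) -> bool:
    """Return True if the doc matches the purge rules described in the post."""
    doc_type = doc.get("type")
    if doc_type == "task":
        authored = doc.get("authoredOn")
        return isinstance(authored, (int, float)) and authored < CUTOFF_2022
    if doc_type != "data_record":
        return False  # contacts, persons, targets, etc. are never purged here
    form = doc.get("form")
    if form in ALWAYS_PURGE_FORMS:
        return True
    reported = doc.get("reported_date")
    if not isinstance(reported, (int, float)):
        return False  # no usable date: keep the doc rather than guess
    if form == "home_visit":
        return reported < CUTOFF_2024
    return reported < CUTOFF_2022
```

The same conditions would normally be expressed as a SQL query against the data warehouse; the Python version is only meant to make the rules explicit and easy to review.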

In addition to these data_records and tasks, we want to purge all the related info docs and tombstones.
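
As a rough sketch, the ids of the related info docs could be derived from the main id list. This assumes that CHT names an info doc `<doc id>-info` and keeps it in medic-sentinel, which should be verified against the warehouse data before relying on it. The helper name `info_doc_ids` is hypothetical. Tombstones are not covered here, since their ids are better selected directly from the data warehouse.

```python
from typing import Iterable, List

# Assumption: CHT info docs are named "<doc id>-info".
INFO_SUFFIX = "-info"


def info_doc_ids(doc_ids: Iterable[str]) -> List[str]:
    """Derive the info doc id for each doc id, skipping ids that are already info docs."""
    return [
        f"{doc_id}{INFO_SUFFIX}"
        for doc_id in doc_ids
        if not doc_id.endswith(INFO_SUFFIX)
    ]
```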

The home_visits alone already number more than 13,000,000; with the other data records and tasks we will have about 18-20 million docs, plus a similar number of info docs.

Thanks for all the detail, @bamatic.

I am not sure how long purging would take. I can run a few tests and come back with some info, but these would be guesses, with no guarantees. The only way to get a good estimate is to test purging on a cloned version of your database - which I would encourage for other reasons too, such as validating that your workflows still work afterwards, that you are happy with the set of docs that gets purged, and that the database behaves as expected.
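
Once a test run on the clone has been timed, the total duration can be extrapolated with simple arithmetic, as in the sketch below. The function name and the sample numbers are hypothetical, and the calculation assumes the purge rate stays roughly constant over the whole run, which may not hold on a database of this size.

```python
def estimate_purge_hours(total_docs: int, sample_docs: int, sample_seconds: float) -> float:
    """Linearly extrapolate total purge time (in hours) from a timed sample."""
    if sample_docs <= 0:
        raise ValueError("sample_docs must be positive")
    return total_docs * sample_seconds / sample_docs / 3600


# Hypothetical example: if 100,000 docs were purged in 6 minutes on the clone,
# 20,000,000 docs would take about 20 hours at the same rate.
hours = estimate_purge_hours(20_000_000, 100_000, 360)
```

Timing several samples of different sizes on the clone would show whether the linear assumption is reasonable.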

Ideally, you would not need users to be offline while this is happening. For performance reasons, you might just avoid running the script during high traffic. You could also do this as a separate task from upgrading to 4.x, and you could even run with cold storage for a while.

I’m very happy to assist with this.

Thanks @diana,
cloning the database and performing a _purge there is definitely the best option.

We could provide the list of docs to purge, and we will need to adapt the “purge script”. We already have this work from you (GitHub - dianabarsan/purging), but how could we get your team’s support to have the Mali project’s database cloned?

Hi @bamatic

I’ve created a ticket requesting a clone for muso-mali prod.

Hi @bamatic

The clone for muso is available. Can we collaborate to generate a list of document ids to be purged?
Thanks!

Hi @diana, great!
We already have a first list. I will share it with you, along with the filtering criteria, very soon. I just need a little time to find and verify what has been done.

Thank you

Hi @diana
I’ve shared the files in this GitHub issue.

Hi @diana
Thanks a lot for supporting us.
I’m wondering if we can plan a date to run the purging script on the cloned database.
Let us know if we can help. Maybe we could modify the purging script to adapt it to the final filtering rules retained at Muso.