Decreasing the number of reports modified by a big move-contact operation

bamatic · March 3, 2023, 12:55pm

Hi community,
I’ve been thinking about how we could decrease the number of documents affected by a move-contact operation, and I have two ideas in mind:

1. If the longest workflow in a given project lasts X days, I think we do not need to modify reports that are older than X days.

2. If the documents are purged, the modification of these documents could be skipped.

The problem comes if things change later. If some skipped documents (skipped because they were purged, or because they were older than the max workflow duration) get unpurged, or are needed again because the max workflow duration has been extended, they won’t have the “contact lineage” modified.

For example, say I’m purging patient_assessment data_records older than 45 days, and now I have another workflow with a duration of 120 days whose tasks depend on patient_assessment. I will modify purge.js so that patient_assessment is purged only if older than 120 days, so some data_records will be unpurged. But as a CHT app workflow developer, can I expect these data_records from the (near) past to trigger tasks that are “important” for my workflow? It’s quite possible, but usually when I develop a workflow and deploy it on a given date, I expect it to work with data submitted after the deployment date.
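To make the 45 to 120 day change concrete, here is a minimal sketch of a purge function in the general shape a CHT purge.js uses (a function that receives the reports for a contact and returns the ids to purge). The per-form retention table and its values are illustrative assumptions, not taken from a real project.

```javascript
// Sketch only: retention values and the RETENTION_DAYS table are assumptions.
const purgeConfig = {
  run_every_days: 7,
  fn: function (userCtx, contact, reports, messages) {
    // Constants live inside the function so it does not depend on outer scope.
    const DAY_MS = 24 * 60 * 60 * 1000;
    // Raising patient_assessment from 45 to 120 means reports aged
    // between 45 and 120 days are no longer returned for purging.
    const RETENTION_DAYS = { patient_assessment: 120 };
    const now = Date.now();
    return reports
      .filter((report) => {
        const days = RETENTION_DAYS[report.form];
        // Forms without a retention entry are never purged by this sketch.
        return days !== undefined && report.reported_date < now - days * DAY_MS;
      })
      .map((report) => report._id);
  },
};
// In a real purge.js this object would be assigned to module.exports.
```

With this change, a patient_assessment submitted 60 days ago is kept, while one submitted 130 days ago is still purged.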

I’m thinking about adding some configuration, something like modify_purged_reports_when_move_contacts; if it is false, the move-contacts command would skip these modifications. I mean: if a form is intended for a role in its context (maybe we should add an intended_for_role field to the form context?) and the document is purged for this role, skip the modification of the doc.

If this is difficult or makes no sense, we could think about a parameter passed to cht-conf with the move-contact operation, something like cht --url=xxxx --move-contacts --skip-older-than=300
and cht-conf would not download reports older than 300 days.
So the responsibility for doing this would lie with each project.

Another idea is to take all purged docs, or docs older than a given duration, out of CouchDB, so that unpurging them later would be impossible.

jkuester · March 3, 2023, 3:36pm

I definitely recognize the challenge of moving contacts with large amounts of historical records. Just wanted to add some of my initial thoughts here!

I think you are right that implementing something like modify_purged_reports_when_move_contacts would require some kind of association between particular types of records and roles (since there is no universal concept of the doc being “purged” that is not associated to particular roles). This kind of thing might be more of a paradigm shift than is worth doing for just this use-case… (unless there was other value we could obtain from that kind of mapping)

A --skip-older-than parameter for move-contacts would, I think, be simpler to implement. I think cht-conf would still have to retrieve all the reports (so it can check the date), but I would expect that avoiding updates to a significant percentage of the reports would still have a positive effect on performance (and reduce the overall load on the system caused by the move). The downside, of course, would be the inconsistent lineage data for older reports.
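As a rough sketch of what that date check could look like once the reports have been retrieved: the helper below is hypothetical (it is not part of cht-conf), and `skipOlderThanDays` simply mirrors the proposed --skip-older-than value.

```javascript
// Hypothetical helper: split retrieved reports into those whose lineage
// would be rewritten and those that would be skipped as too old.
function partitionByAge(reports, skipOlderThanDays, now = Date.now()) {
  const MS_PER_DAY = 24 * 60 * 60 * 1000;
  const cutoff = now - skipOlderThanDays * MS_PER_DAY;
  const toUpdate = [];
  const skipped = [];
  for (const report of reports) {
    // reported_date may be epoch milliseconds or a date string.
    const reportedAt = new Date(report.reported_date).getTime();
    // An unparseable date yields NaN, which fails the comparison,
    // so such reports fall on the "update" side.
    (reportedAt < cutoff ? skipped : toUpdate).push(report);
  }
  return { toUpdate, skipped };
}
```

Only the `toUpdate` list would be written back, which is where the reduced load would come from; the `skipped` list is exactly the set of reports left with the old lineage.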

Another idea is to take all purged docs, or docs older than a given duration, out of CouchDB, so that unpurging them later would be impossible.

This proposal is something that has come up several times recently and I appreciate it getting raised again here in the context of moving contacts! Permanently “archiving” historical data (so it is no longer in the main CouchDB) could help alleviate a number of performance-related issues on long-running instances.

I know that @mrjones, @iesmail, and I were recently discussing the challenges with moving contacts within the CHT. It is clear that the current tooling/support for moving contacts in the CHT is inadequate. In my opinion, we may need to rethink the approach of de-normalizing the contact lineage to the various reports. (But, of course, that data is there to make the contact-specific replication work properly. So changing how that data is stored would require updating the fundamentals of the replication algorithm…)

bamatic · March 4, 2023, 1:56pm

Thanks @jkuester for your answer

The downside, of course, would be the inconsistent lineage data for older reports.

One consideration to keep in mind: suppose we have chw_area_A, with chw_A associated to it, inside supervisor_area_A, which has supervisor_A in it. chw_A sent several reports during 2020, and chw_area_A is then moved on 2022-12-12 to supervisor_area_B, which has supervisor_B in it.

Looking at a data_record of a patient in one of the households of chw_area_A, before the move-contact I will have something like:

{
  "_id": "abc",
  "_rev": "1-qsd",
  "form": "patient_assessment",
  "type": "data_record",
  "reported_date": "2020-01-14",
  "contact": {
    "_id": "chw_A_uuid",
    "parent": {
      "_id": "chw_area_A_uuid",
      "parent": {
        "_id": "supervisor_area_A_uuid"
      }
    }
  }
}

and after the move-contacts I’ll have:

{
  "_id": "abc",
  "_rev": "2-qqssd",
  "form": "patient_assessment",
  "type": "data_record",
  "reported_date": "2020-01-14",
  "contact": {
    "_id": "chw_A_uuid",
    "parent": {
      "_id": "chw_area_A_uuid",
      "parent": {
        "_id": "supervisor_area_B_uuid"
      }
    }
  }
}

So the data analyst will find that this data_record was reported in 2020 by chw_A, belonging to chw_area_A in supervisor_area_B, even though supervisor_area_B was created in 2022. This can even make sense, because the household was not created in 2022 and was already there in 2020. The difficulty is, for example, with supervisor performance evaluations: supervisor_A’s performance KPIs for past years will change after the move-contact operation, and this makes no sense. If supervisor_A did a lot of supervision_visits in 2020 in households belonging to chw_area_A, this work will now be counted for supervisor_area_B.
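The lineage rewrite illustrated by the two documents above can be sketched as a walk up the contact's parent chain. This is only an illustration of the idea, not the actual cht-conf implementation, and the function name is hypothetical.

```javascript
// Sketch: return a copy of `report` in which everything above the moved
// contact is replaced by `newParent`. Returns null if the report's
// contact lineage does not include the moved contact.
function replaceLineage(report, movedId, newParent) {
  const copy = JSON.parse(JSON.stringify(report)); // leave the input untouched
  let node = copy.contact;
  while (node) {
    if (node._id === movedId) {
      node.parent = newParent;
      return copy;
    }
    node = node.parent;
  }
  return null;
}
```

Called as replaceLineage(doc, 'chw_area_A_uuid', { _id: 'supervisor_area_B_uuid' }), only the lineage changes. Fields such as reported_date stay as they were, which is why the 2020 report ends up attached to an area that did not exist in 2020.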

Another difficulty we have is explaining to a supervisor who began in 2022 in a new area that she has data from 2020 in her new area.

So we need to maintain externally, I mean outside CouchDB, a table or something similar to track these move-contact events, so that data analysts can know whether the household was in supervisor_area_A or supervisor_area_B at a given time.
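A minimal sketch of such a tracking table and the lookup an analyst would need. The log structure and field names are assumptions for illustration; dates are assumed to be ISO strings (YYYY-MM-DD) so that they compare correctly as text.

```javascript
// Hypothetical move log kept outside CouchDB: one entry per move.
const moveLog = [
  {
    contactId: 'chw_area_A_uuid',
    from: 'supervisor_area_A_uuid',
    to: 'supervisor_area_B_uuid',
    movedAt: '2022-12-12',
  },
];

// Return the parent a contact had on `date`. `currentParentId` is the
// parent found in CouchDB today, used when no later move is recorded.
function parentAt(log, contactId, date, currentParentId) {
  const moves = log
    .filter((move) => move.contactId === contactId)
    .sort((a, b) => a.movedAt.localeCompare(b.movedAt));
  // The first move after `date` tells us where the contact was before it.
  const nextMove = moves.find((move) => move.movedAt > date);
  return nextMove ? nextMove.from : currentParentId;
}
```

For a report dated 2020-01-14 this returns supervisor_area_A_uuid, so the 2020 supervision work can still be attributed to supervisor_A even though the document itself now points to supervisor_area_B.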

At Muso we are adding a lineage group to all our app forms to get the ids of the whole lineage of the patient (or of the contact the form is about), so the complete lineage is saved at the moment of form submission.
Something like:

begin group	lineage
calculate	c50_patient_uuid	…/…/inputs/contact/_id
calculate	c50_family_uuid	…/…/inputs/contact/parent/_id
calculate	c40_chw_area_uuid	…/…/inputs/contact/parent/parent/_id
calculate	c30_supervisor_area_uuid	…/…/inputs/contact/parent/parent/parent/_id
calculate	c20_health_area_uuid	…/…/inputs/contact/parent/parent/parent/parent/_id
calculate	c10_site_uuid	…/…/inputs/contact/parent/parent/parent/parent/parent/_id
end group
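With a group like this saved on each report, an analyst can detect reports whose lineage was rewritten after submission. The sketch below assumes the group ends up under fields.lineage on the data_record and compares the recorded supervisor area with the report's current contact lineage; both function names are hypothetical.

```javascript
// Collect every _id found while walking up a contact's parent chain.
function lineageIds(contact) {
  const ids = [];
  for (let node = contact; node; node = node.parent) {
    if (node._id) ids.push(node._id);
  }
  return ids;
}

// Assumption: the form's "lineage" group is stored at report.fields.lineage.
function movedSinceSubmission(report) {
  const recorded = report.fields && report.fields.lineage
    && report.fields.lineage.c30_supervisor_area_uuid;
  if (!recorded) return false; // nothing recorded, so we cannot tell
  return !lineageIds(report.contact).includes(recorded);
}
```

The submission-time ids are plain form fields, so a lineage rewrite of the contact does not touch them.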

What I want to say is that sometimes not updating the data_records’ contact lineage can be better if they are old, since in our use case these old data_records will never be used on the supervisor devices.
But it is also clear that, if old data from 2018 can be modified in 2023, keeping a copy of the original data outside CouchDB, or even erasing old data from CouchDB and loading it into a data warehouse, can be an option.