Replication Depth Tool

Due to a huge problem in KE of the number of docs users are downloading, there is need to understand the composition of the number of docs a user downloads during login. Therefore I’ve been tasked with creating a script that for a given replication depth, returns a detailed breakdown of the various document types, all adding up to the count from /api/v1/users-doc-count.

Example:
Input: mary

Output:

[
    {
        "username": "mary",
        "facilities": 10,
        "replication_depth": 0,
        "report_depth": 0,
        "contacts": 10,
        "data_records": 2,
        "tasks": 15,
        "targets": 0,
        "total": 27
    },
    {
        "username": "mary",
        "facilities": 10,
        "replication_depth": 1,
        "report_depth": 0,
        "contacts": 30,
        "data_records": 2,
        "tasks": 75,
        "targets": 10,
        "total": 117
    },
    {
        "username": "mary",
        "facilities": 10,
        "replication_depth": 1,
        "report_depth": 1,
        "contacts": 30,
        "data_records": 22,
        "tasks": 75,
        "targets": 10,
        "total": 137
    },
    ...
]

I’ve tried to look at the code at authorization.js, but I just can’t understand the full logic of calculating the docs. Would appreciate some explanation on what happens at step 3 below! This is what I have so far.

  1. From the usernames, get all the facilities from user.facility_id

  2. For each facility, get the contacts at every depth starting from 0, by querying _view/contacts_by_depth

  3. Get all data records belonging too the contacts (stuck here!! Specifically I’m missing the logic on how the DRs are counted, and how _view/docs_by_replication_key is used)

  4. Get all tasks belonging too the contacts

  5. Get all targets belonging too the contacts

Considerations for this solution

  • Ignore needs_signoff, replicate_primary_contact, private docs, and any other type of doc outside contacts, tasks, targets, DRs. The counts don’t have to add up exactly to what is in /api/v1/users-doc-count
  • Tried building from postgres but numbers still way off

Hello @rmayore

This is really great, it would be very helpful to understand in depth which documents the KE users are downloading.

Get all data records belonging too the contacts (stuck here!! Specifically I’m missing the logic on how the DRs are counted, and how _view/docs_by_replication_key is used)

Here is some background about `docs_by_replication_key`

Replication uses two views; contacts_by_depth and docs_by_replication_key.

each facility_id for a user is the id of a contact in the hierarchy.

contacts_by_depth gets all the descendants of a facility_id, up to the configured replication_depth.

That gives it all the contact ids, but its needs the ids of all the documents, including contacts but also reports, tasks and targets. This is where docs_by_replication_key comes in; it is a link between all documents and the contact hierarchy.

It may be helpful to look at the source code; the summary is

  • reports are linked to their ‘subject’, which is the patient id, place id, or contact id. which can have different formats, and there are many exceptions, but is always a contact id representing who or what the report is about. The exact field used will depend on the form configuration.
  • reports are also linked to contacts directly using the contact field; and to all that contact’s parents.
  • tasks are linked to user
  • targets are to owner
  • contacts are included in this view, but the point is to link documents to contacts, so contacts just link to themselves.

Tried building from postgres but numbers still way off

Can you share the query you are using? I also was not able to get a query working, but if we could use postgres, it could be helpful to get more flexible answers more quickly.

@twier thanks a lot for the quick reply. Indeed that was my initial understanding, and with that understanding I tried to ‘join’ docs_by_replication_key with contacts_by_depth

  • If a row by docs_by_replication_key has doc.value.type == “contact”, Ignore since it’s already accounted for by contacts_by_depth
  • If a row by docs_by_replication_key has doc.value.type == “data_record”, join on doc.value.subject
  • If a row by docs_by_replication_key has doc.value.type == “task”, join on doc.key
  • If a row by docs_by_replication_key has doc.value.type == “target”, join on doc.key

The results were still further from what I expected. Perhaps I should call _view/docs_by_replication_key with the contact ID, for every contact returned by contacts_by_depth? Or is it enough to call _view/docs_by_replication_key with just the facility ID?

Below is the PG query I wrote… I bet it can still be optimized or simplified

WITH users AS (
    SELECT * FROM couchdb
    WHERE doc ->> '_id' = ANY (ARRAY [
                    'org.couchdb.user:user1',
                    'org.couchdb.user:user2',
                    'org.couchdb.user:user3',
                    'org.couchdb.user:user4',
                    'org.couchdb.user:user5',
                    'org.couchdb.user:user6'
                    ])
),
contacts AS (
    SELECT users.doc ->> '_id' AS username, jsonb_array_elements_text(users.doc -> 'facility_id') AS facility_id, cm.*
    FROM users
    INNER JOIN contactview_metadata cm ON cm.uuid = users.doc ->> 'contact_id'
),
users_distinct AS (
    select username, parent_uuid, name
    FROM contacts
    GROUP BY username, parent_uuid, name

),
facilities_count AS (
    SELECT username, count(facility_id) AS facilities
    FROM contacts
    GROUP BY username
),
cu_docs AS (
    SELECT contacts.username, count(*) AS docs
    FROM contacts
    --- chw_area is just the doc.place_id, could be a CU, a CHW Area, a County
    INNER JOIN form_metadata fm ON fm.chw_area = contacts.facility_id
    GROUP BY contacts.username
),
areas AS (
    SELECT contacts.username, contacts.facility_id, cm.uuid AS area_uuid
    FROM contacts
    INNER JOIN contactview_metadata cm ON cm.parent_uuid = contacts.facility_id
    WHERE cm.contact_type= 'd_community_health_volunteer_area'
),
area_docs AS (
    SELECT areas.username, count(*) AS docs
    FROM areas
    INNER JOIN form_metadata fm ON fm.chw_area = areas.area_uuid
    GROUP BY areas.username
),
areas_count AS (
    SELECT username, count(area_uuid) AS chv_areas
    FROM areas
    GROUP BY username
),
chvs AS (
    SELECT areas.username, areas.area_uuid, cm.uuid AS chv_uuid
    FROM areas
    INNER JOIN contactview_metadata cm ON cm.parent_uuid = areas.area_uuid
    WHERE cm.contact_type= 'person'
),
chv_count AS (
    SELECT username, count(chv_uuid) AS chvs
    FROM chvs
    GROUP BY username
),

households AS (
    SELECT areas.username, areas.area_uuid, cm.uuid AS household_uuid
    FROM areas
    INNER JOIN contactview_metadata cm ON cm.parent_uuid = areas.area_uuid
    WHERE cm.contact_type= 'e_household'
),
households_count AS (
    SELECT username, count(household_uuid) AS households
    FROM households
    GROUP BY username
),

clients AS (
    SELECT households.username, households.household_uuid, cm.uuid AS client_uuid
    FROM households
    INNER JOIN contactview_metadata cm ON cm.parent_uuid = households.household_uuid
    WHERE cm.contact_type= 'f_client'
),
clients_count AS (
    SELECT username, count(client_uuid) AS clients
    FROM clients
    GROUP BY username
)

SELECT ud.username,
        ud.name,
       facilities_count.facilities AS cu_facilities,
       cu_docs.docs AS cu_docs,
       areas_count.chv_areas AS chv_reas,
       area_docs.docs AS area_docs,
       chv_count.chvs AS chvs,
       households_count.households AS households,
       clients_count.clients AS clients
FROM users_distinct ud
INNER JOIN facilities_count ON facilities_count.username = ud.username
INNER JOIN areas_count ON areas_count.username = ud.username
INNER JOIN chv_count ON chv_count.username = ud.username
INNER JOIN households_count ON households_count.username = ud.username
INNER JOIN clients_count ON clients_count.username = ud.username
INNER JOIN cu_docs ON cu_docs.username = ud.username
INNER JOIN area_docs ON area_docs.username = ud.username

docs_by_replication_key should be called with every contact returned by contacts_by_depth; just the facility_id will return many many fewer results because it will only return docs linked directly to the facility_id, and not any of its children. In a realistic hierarchy, the subject for most reports will be patients or households and facility id will be a higher level contact like a CHW area or a supervisor containing many CHWs as children, so just using the facility_id directly will miss most reports.

@twier Thanks I’m going to try this. In the beginning I was unsure because looking at the view’s code, it seemed like a DR doc is emitted for the contact, and also the parent and their parents and so on up the hierarchy (in the while loop). So I thought ultimately a data record (even one as low as a patients) is emitted both for the contact and all the parents up the hierarchy, including the facility Id.

Yes, the key thing is that there are multiple contacts associated with the report. the contact field itself (and its parents) is the contact of the user that submitted the report. The patient or household will be the subject (which does not emit all its parents).

So actually usually the facility id will be an ancestor of the user that submitted the report. But there are other cases like users with multiple facility ids, or maybe other cases that I’m not thinking about. In any case it would be more accurate to simulate what the api server actually does and use all the results of contacts_by_depth

1 Like

Thanks to the great help from @twier , I managed to get the first draft of the solution working. For a given facility and depth, I am able to get the number of contacts, tasks, targets and data records that would be replicated. Even more than that I’m able to get the number of each DR form that would be replicated.

However, I have 2 more questions remaining before I further refine and submit the solution:

  1. For a given depth, does the view contacts_by_depth only fetch the contacts at exactly that depth, or all the contacts from depth 0 to that depth?
  2. After fetching the data records using docs_by_replication_key for every contact returned by contacts_by_depth and adding them all up, is there a possibility of some DR docs being duplicated?

For question 1, I think the answer is contacts_by_depth only fetches the contacts at exactly that depth since making the request with an impossible large depth like 10 returns 0 contacts.

it fetches contacts at exactly the specified depth, but you can specify a range instead of an exact depth, e.g.

GET /medic/_design/medic/_view/contacts_by_depth?startkey=["id",0]&endkey=["id",3]

yes, there could be duplicate results from docs_by_replication_key

@twier thanks for the great suggestion on using the key range!!

Finally managed to get a per-depth breakdown of all replication docs.

Example:

[
    {
        "username": "username",
        "facilities": 10,
        "depth": 0,
        "contacts": 6,
        "data_records": 44114,
        "tasks": 0,
        "targets": 0,
        "moh-515-summary-community_health_assistant-verified": 29,
        "hh_flood_status": 100,
        "cebs_signal_verification": 722,
        "mute_person": 649,
        "mute_household": 2270,
        "death_confirmation": 5,
        "moh-515-summary-offline_sub_county_supervisor-verified": 11,
        "sha_insurance": 8328,
        "death_review": 6,
        "ntd_service": 40,
        "moh-515-summary": 72,
        "death_report": 125,
        "cebs_signal_reporting": 3551,
        "commodity_discrepancy_amendment": 6,
        "cha_verbal_autopsy": 5,
        "u5_assessment": 28007,
        "moh_515_verify": 40,
        "sgbv": 107,
        "commodities_order": 16,
        "commodity_received": 4,
        "cebs_signal_reporting_verification": 2,
        "community_event": 19
    },
    {
        "username": "username",
        "facilities": 10,
        "depth": 1,
        "contacts": 73,
        "data_records": 129955,
        "tasks": 0,
        "targets": 103,
        "moh-515-summary-community_health_assistant-verified": 23,
        "hh_flood_status": 200,
        "cebs_signal_verification": 1430,
        "mute_person": 1050,
        "commodity_count": 236,
        "mute_household": 3282,
        "cadre_signal_resolution": 31,
        "death_confirmation": 8,
        "moh-515-summary-offline_sub_county_supervisor-verified": 9,
        "chv_supervision": 176,
        "sha_insurance": 12824,
        "death_review": 10,
        "ntd_service": 24,
        "moh-515-summary": 60,
        "death_report": 174,
        "cebs_signal_reporting": 4758,
        "follow_up_chv_not_visited": 118,
        "chv_consumption_log": 57892,
        "commodity_discrepancy_amendment": 12,
        "cha_verbal_autopsy": 8,
        "u5_assessment": 47437,
        "moh_515_verify": 32,
        "sgbv": 58,
        "commodity_stockout": 21,
        "cha_supervision_calendar": 1,
        "commodity_supply": 65,
        "commodities_order": 8,
        "commodity_received": 4,
        "commodity_received_confirmation": 2,
        "cebs_signal_reporting_verification": 2
    },
    {
        "username": "username",
        "facilities": 10,
        "depth": 2,
        "contacts": 5820,
        "data_records": 183774,
        "tasks": 0,
        "targets": 875,
        "moh-515-summary-community_health_assistant-verified": 23,
        "chv_signal_reporting": 96,
        "commodity_count": 236,
        "mute_household": 5147,
        "cadre_signal_resolution": 32,
        "sha_insurance": 20065,
        "ntd_service": 36,
        "moh-515-summary": 60,
        "socio_economic_survey": 2327,
        "chv_consumption_log": 57892,
        "hap_survey": 5950,
        "u5_assessment": 71158,
        "sgbv": 87,
        "commodity_stockout": 22,
        "hh_flood_status": 376,
        "cebs_signal_verification": 1904,
        "mute_person": 1575,
        "wash": 6472,
        "death_confirmation": 9,
        "moh-515-summary-offline_sub_county_supervisor-verified": 9,
        "chv_supervision": 176,
        "death_review": 11,
        "delivery_note": 154,
        "household_member_registration_reminder": 73,
        "unmute_household": 4,
        "death_report": 261,
        "cebs_signal_reporting": 9347,
        "follow_up_chv_not_visited": 118,
        "commodity_discrepancy_amendment": 21,
        "cha_verbal_autopsy": 8,
        "moh_515_verify": 32,
        "cha_supervision_calendar": 1,
        "commodity_supply": 65,
        "commodities_order": 8,
        "training:immunization_service": 1,
        "commodity_received_confirmation": 2,
        "mute_household_confirmed": 1,
        "approve_mute_household": 1,
        "training:new_person_registration": 1,
        "commodity_received": 6,
        "training:hh_registration": 1,
        "cha_signal_verification": 3,
        "training:death_muting": 1,
        "cebs_signal_reporting_verification": 2
    },
    {
        "username": "username",
        "facilities": 10,
        "depth": 3,
        "contacts": 34268,
        "data_records": 264551,
        "tasks": 0,
        "targets": 875,
        "moh-515-summary-community_health_assistant-verified": 23,
        "chv_signal_reporting": 96,
        "defaulter_follow_up": 88,
        "commodity_count": 236,
        "family_planning": 9252,
        "mute_household": 5147,
        "cadre_signal_resolution": 32,
        "pregnancy_home_visit": 1315,
        "sha_insurance": 26486,
        "patient_details_reminder": 198,
        "ntd_service": 48,
        "moh-515-summary": 60,
        "socio_economic_survey": 2327,
        "unmute_person": 26,
        "chv_consumption_log": 57892,
        "hap_survey": 5950,
        "u5_assessment": 96210,
        "sgbv": 201,
        "over_five_assessment": 14137,
        "commodity_stockout": 22,
        "first_aid": 101,
        "hh_flood_status": 376,
        "postnatal_care_service_newborn": 134,
        "cebs_signal_verification": 1904,
        "mute_person": 2088,
        "wash": 6472,
        "death_confirmation": 11,
        "moh-515-summary-offline_sub_county_supervisor-verified": 9,
        "chv_supervision": 176,
        "postnatal_care_service": 1217,
        "death_review": 13,
        "delivery_note": 154,
        "household_member_registration_reminder": 73,
        "unmute_household": 4,
        "pregnancy_screening": 462,
        "death_report": 328,
        "CLIENT_DETAILS_MISMATCH": 50,
        "cebs_signal_reporting": 9347,
        "follow_up_chv_not_visited": 118,
        "socio_economic_survey_person": 607,
        "immunization_service": 7301,
        "commodity_discrepancy_amendment": 22,
        "cha_verbal_autopsy": 10,
        "moh_515_verify": 32,
        "referral_follow_up": 9243,
        "treatment_follow_up": 3387,
        "hypertension": 695,
        "diabetes": 376,
        "cha_supervision_calendar": 1,
        "commodity_supply": 65,
        "commodities_order": 8,
        "commodity_received_confirmation": 2,
        "training:new_person_registration": 1,
        "training:death_muting": 1,
        "training:immunization_service": 1,
        "muac_follow_up": 2,
        "mute_household_confirmed": 1,
        "approve_mute_household": 1,
        "commodity_received": 6,
        "training:hh_registration": 1,
        "cha_signal_verification": 3,
        "cebs_signal_reporting_verification": 2
    },
    {
        "username": "username",
        "facilities": 10,
        "depth": 4,
        "contacts": 34268,
        "data_records": 264551,
        "tasks": 0,
        "targets": 875,
        "moh-515-summary-community_health_assistant-verified": 23,
        "chv_signal_reporting": 96,
        "defaulter_follow_up": 88,
        "commodity_count": 236,
        "family_planning": 9252,
        "mute_household": 5147,
        "cadre_signal_resolution": 32,
        "pregnancy_home_visit": 1315,
        "sha_insurance": 26486,
        "patient_details_reminder": 198,
        "ntd_service": 48,
        "moh-515-summary": 60,
        "socio_economic_survey": 2327,
        "unmute_person": 26,
        "chv_consumption_log": 57892,
        "hap_survey": 5950,
        "u5_assessment": 96210,
        "sgbv": 201,
        "over_five_assessment": 14137,
        "commodity_stockout": 22,
        "first_aid": 101,
        "hh_flood_status": 376,
        "postnatal_care_service_newborn": 134,
        "cebs_signal_verification": 1904,
        "mute_person": 2088,
        "wash": 6472,
        "death_confirmation": 11,
        "moh-515-summary-offline_sub_county_supervisor-verified": 9,
        "chv_supervision": 176,
        "postnatal_care_service": 1217,
        "death_review": 13,
        "delivery_note": 154,
        "household_member_registration_reminder": 73,
        "unmute_household": 4,
        "pregnancy_screening": 462,
        "death_report": 328,
        "CLIENT_DETAILS_MISMATCH": 50,
        "cebs_signal_reporting": 9347,
        "follow_up_chv_not_visited": 118,
        "socio_economic_survey_person": 607,
        "immunization_service": 7301,
        "commodity_discrepancy_amendment": 22,
        "cha_verbal_autopsy": 10,
        "moh_515_verify": 32,
        "referral_follow_up": 9243,
        "treatment_follow_up": 3387,
        "hypertension": 695,
        "diabetes": 376,
        "cha_supervision_calendar": 1,
        "commodity_supply": 65,
        "commodities_order": 8,
        "commodity_received_confirmation": 2,
        "training:new_person_registration": 1,
        "training:death_muting": 1,
        "training:immunization_service": 1,
        "muac_follow_up": 2,
        "mute_household_confirmed": 1,
        "approve_mute_household": 1,
        "commodity_received": 6,
        "training:hh_registration": 1,
        "cha_signal_verification": 3,
        "cebs_signal_reporting_verification": 2
    },
    {
        "username": "username",
        "facilities": 10,
        "depth": 5,
        "contacts": 34268,
        "data_records": 264564,
        "tasks": 0,
        "targets": 875,
        "moh-515-summary-community_health_assistant-verified": 23,
        "chv_signal_reporting": 96,
        "defaulter_follow_up": 88,
        "commodity_count": 236,
        "family_planning": 9264,
        "mute_household": 5147,
        "cadre_signal_resolution": 32,
        "pregnancy_home_visit": 1315,
        "sha_insurance": 26486,
        "patient_details_reminder": 198,
        "ntd_service": 48,
        "moh-515-summary": 60,
        "socio_economic_survey": 2327,
        "unmute_person": 26,
        "chv_consumption_log": 57892,
        "hap_survey": 5950,
        "u5_assessment": 96211,
        "sgbv": 201,
        "over_five_assessment": 14137,
        "commodity_stockout": 22,
        "first_aid": 101,
        "hh_flood_status": 376,
        "postnatal_care_service_newborn": 134,
        "cebs_signal_verification": 1904,
        "mute_person": 2088,
        "wash": 6472,
        "death_confirmation": 11,
        "moh-515-summary-offline_sub_county_supervisor-verified": 9,
        "chv_supervision": 176,
        "postnatal_care_service": 1217,
        "death_review": 13,
        "delivery_note": 154,
        "household_member_registration_reminder": 73,
        "unmute_household": 4,
        "pregnancy_screening": 462,
        "death_report": 328,
        "CLIENT_DETAILS_MISMATCH": 50,
        "cebs_signal_reporting": 9347,
        "follow_up_chv_not_visited": 118,
        "socio_economic_survey_person": 607,
        "immunization_service": 7301,
        "commodity_discrepancy_amendment": 22,
        "cha_verbal_autopsy": 10,
        "moh_515_verify": 32,
        "referral_follow_up": 9243,
        "treatment_follow_up": 3387,
        "hypertension": 695,
        "diabetes": 376,
        "cha_supervision_calendar": 1,
        "commodity_supply": 65,
        "commodities_order": 8,
        "commodity_received_confirmation": 2,
        "training:new_person_registration": 1,
        "training:death_muting": 1,
        "training:immunization_service": 1,
        "muac_follow_up": 2,
        "mute_household_confirmed": 1,
        "approve_mute_household": 1,
        "commodity_received": 6,
        "training:hh_registration": 1,
        "cha_signal_verification": 3,
        "cebs_signal_reporting_verification": 2
    }
]

Some quick thoughts on the results:

1. Still some considerable difference between what we get here and what we get from the login page.

Example:

depth login script
0 70,160 44,120
1 182,002 130,171
2 227,947 190,469

2.With the understanding of the KE echis v2 hierarchy (a_countyb_sub_countyc_community_health_unitd_community_health_volunteer_areae_householdf_client), it is surprising that even at depth 0, a facility at c_community_health_unit level is still downloading docs (e.g u5_assessment) for a subject at the f_client level.

This to me is the biggest cause for supervisors having too many documents. Our supervisors have very limited workflows so it’s impossible for them at depth 0 to have only 6 direct contacts but 70k documents.

3. With purging, the expectation is these numbers should be considerably reduced. I’m waiting for the next purging cycle to run before I can update these findings with more data on the login document count.

cc @kenn per the work you’re doing on the other user doc count ticket

1 Like

This sounds like needs_signoff. The u5_assessment sets needs_signoff to true for every assessment. This would mean that every assessment will replicate to every user with an ancestor facility_id - regardless of the depth setting.

I see 24 workflows in eCHIS-KE which set needs_signoff to true.

Ignore needs_signoff

Can your script include this bit of logic also? I think it will explain a lot of the gap.

I did a profile of a user in produciton with lots of documents. 65k of 67k reports had needs_signoff: true.

Should we turn on purging for this role? Most important to purge are cha_performance_verification, chp_performance_self_assessment, hh_flood_status, sgbv, sha_insurance, u5_assessment.

Unsure if this is a reasonable suggestion. But I’m unclear if it would be significantly more performant for heavy users like this to have a new feature which omits docs with needs_signoff: true. I see the needs_signoff value is already tracked in the docs_by_replication_depth view so it seems possibly easy to omit them when a flag it turned on. Unclear if that would be significantly work and whether it would be significantly more performant vs purging. Tossing it out there.