Deduplication prototype

We previously helped build a just-in-time (JIT) solution to help prevent new duplicates from being created. Our focus has now shifted to fixing existing duplicates: specifically, letting users decide what action to take (merge/delete/archive) on records flagged as potential duplicates, and adding an approval/audit step before those actions are finalized. These “actions”, which we’ll get into in a bit, could perhaps eventually be handled by the cht-conf tool [1][2].

While the JIT approach focussed on prevention, this next step is about correction. Currently, most of the automatic correction (merging, deleting, and archiving of duplicates) happens in our Databricks space. However, items with low confidence scores still require manual intervention. These records need to be flagged and sent back to the instance for review by users who have the right context to make informed decisions.

Building the prototype

To handle this, we’ve been experimenting with a prototype screen that lets users manage flagged duplicates directly. It borrows a few pieces from our WIP plugin PR [1][2][3]. The prototype is still rough, and some parts are hardcoded to our context, but it’s functional enough to prove the concept.

It integrates with NgRx state and reacts to database document changes like other CHT screens. We’re also reusing the existing duplicate component (with a few small tweaks) to display the same configurable summary information seen in the JIT flow.

Updating the ddocs

To make the prototype work, we needed to adjust a couple of views that drive replication and querying. This is where we’d really appreciate some guidance.

medic-client/views/doc_by_type:
By default, this view picked up our new document type (duplicates) automatically. However, to query the right docs for each user, we had to expose an additional state property (explained in more detail below). We also started emitting the parent as a value to ensure that offline users only replicate documents relevant to their specific scope.
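A minimal sketch of what such a map function might look like, assuming the view keys on [type, state] and emits the parent lineage as the value. This is a hypothetical illustration, not the actual medic-client view source. In CouchDB `emit` is a global; here it is passed in (along with a small `buildRows` harness) so the function can be run in Node.

```javascript
// Hypothetical sketch, not the actual medic-client view source.
// In a real design doc `emit` is a global; it is injected here so this runs in Node.
const mapDocByType = (doc, emit) => {
  if (doc.type === 'duplicates') {
    // Key on [type, state] so callers can query one state or the whole type by range.
    // Emit the parent lineage as the value so the API can check the user's facility.
    emit(['duplicates', doc.state], { parent: doc.parent });
    return;
  }
  emit([doc.type]);
};

// Small harness that mimics how CouchDB turns emit() calls into view rows.
const buildRows = (docs) => {
  const rows = [];
  docs.forEach(doc => {
    mapDocByType(doc, (key, value) => rows.push({ id: doc._id, key, value }));
  });
  return rows;
};

const rows = buildRows([
  { _id: 'dup-1', type: 'duplicates', state: 'issued', parent: { _id: 'facility-a' } },
  { _id: 'form-1', type: 'form' },
]);
console.log(JSON.stringify(rows));
```

With keys shaped like this, a range query from ['duplicates'] to ['duplicates', {}] returns every duplicates doc regardless of state, while a single key such as ['duplicates', 'issued'] narrows it to one state.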

medic/views/docs_by_replication_key:
Since CHT only allows for replication of certain document types, we added duplicates to the list.



We’ll get into the concerns a little later.

Extending the Authorization Service

To support these changes, we made a small change in the API space. In our app version (4.3.x), it was within the authorization service’s getAuthorizationContext method. We introduced an extra doc_by_type query that returns our duplicates documents, which are then added to the subjects array for replication (since the original contacts_by_depth query alone was insufficient for our needs).

While this works, it does introduce some extra sync delay, since we’re now making an additional query. This method is hit quite often by offline users, so we’d like to optimize that - possibly by supporting replication by state or “depth”, so that users only pull down what’s actually relevant to them.

A snippet of the change:

const getDuplicateSubjects = (row, facilityId) => {
  const subjects = [];

  if(row.value){
    let current = row.value.parent;
    while (current){
      if(current._id === facilityId){
        subjects.push(row.id);
        break;
      }
      current = current.parent;
    }
  }

  return subjects;
};

const getAuthorizationContext = (userCtx) => {
  const authorizationCtx = getContextObject(userCtx);

  return Promise.all([
    db.medic.query('medic/contacts_by_depth', { keys: authorizationCtx.contactsByDepthKeys }),
    db.medic.query('medic-client/doc_by_type', { startkey: ['duplicates'],  endkey: ['duplicates', {}] })
  ]).then(([contactsByDepthResults, duplicatesResults]) => {
    contactsByDepthResults.rows.forEach(row => {
      const subjects = getContactSubjects(row);
      authorizationCtx.subjectIds.push(...subjects);

      if (usesReportDepth(authorizationCtx)) {
        const subjectDepth = row.key[1];
        subjects.forEach(subject => authorizationCtx.subjectsDepth[subject] = subjectDepth);
      }
    });

    duplicatesResults.rows.forEach(row => {
      const subjects = getDuplicateSubjects(row, authorizationCtx.userCtx.facility_id);
      console.log('Duplicate subjects: ', subjects);
      authorizationCtx.subjectIds.push(...subjects);
      
      // TODO: replication depth
    });
...
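To illustrate the lineage walk, here is the helper run against a made-up view row. The IDs and the row shape (parent lineage in `value.parent`) are placeholders for illustration only.

```javascript
// Same helper as in the snippet above, with `current` declared locally.
const getDuplicateSubjects = (row, facilityId) => {
  const subjects = [];
  let current = row.value && row.value.parent;
  while (current) {
    if (current._id === facilityId) {
      subjects.push(row.id);
      break;
    }
    current = current.parent;
  }
  return subjects;
};

// Made-up view row: value.parent mirrors the lineage stored on the duplicates doc.
const row = {
  id: 'dup-1',
  key: ['duplicates', 'issued'],
  value: { parent: { _id: 'area-1', parent: { _id: 'district-1' } } },
};

console.log(getDuplicateSubjects(row, 'district-1')); // [ 'dup-1' ]
console.log(getDuplicateSubjects(row, 'district-2')); // []
```

A user whose facility appears anywhere in the lineage gets the doc ID as a subject; everyone else gets an empty array.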

Document schema and structure

Each duplicates document defines a set of potentially duplicate contacts, along with the actions to take on each one. Here’s an example:

{
  "_id": "7bc83fa55fced88ad97a7c30e703f379",
  "_rev": "12-2107b24705b3ae899ff8de8124d1e784",
  "type": "duplicates",
  "name": "Test 10",
  "contact_type": "dwelling",
  "issue_date": 1759304952000,
  "due_date": 1760428152000,
  "priority": "high",
  "state": "approved",
  "comment": "Some comment here",
  "items": {
    "46b8a610-a011-487d-85f4-e460fd0de7f0": "canonical",
    "0e931c6c-a0e7-4b9e-a2ce-ddb8c4033b64": "merge",
    "fb84b045-c8bf-404e-b46f-4e100cee0ae6": "delete"
  },
  "parent": {
    "_id": "0e931c6c-a0e7-4b9e-a2ce-ddb8c4033b64",
    "parent": {
      "_id": "f4657403-f5df-4eff-a5dc-befd3bfb732d",
      "parent": {
        "_id": "4d2f7817-ad7e-4f5e-8262-a41e860dbd15"
      }
    }
  },
  "meta": {
    "created_by": "system",
    "created_by_person_uuid": "",
    "created_by_place_uuid": "",
    "last_edited_by": "anro-chw-4125",
    "last_edited_by_person_uuid": "3558df86-3844-4c7f-afe4-0703d8097198",
    "last_edited_by_place_uuid": "0e931c6c-a0e7-4b9e-a2ce-ddb8c4033b64",
    "last_edited_on": 1760717235855,
    "audited_by": "admin",
    "audited_by_person_uuid": "",
    "audited_by_place_uuid": "",
    "audited_on": 1760717284678
  }
}
| Property | Value(s) | Purpose |
| --- | --- | --- |
| type | duplicates | Differentiates this doc from others |
| name | < string > | Title shown in the list |
| contact_type | < string > | Based on the contact types in the app config. Here, the value outlines the type of the contact IDs in this doc |
| issue_date/due_date | < ms timestamp > | Used for ordering and deadlines |
| priority | low, medium, high | Controls list ordering (high = top) |
| state | issued, pending, approved | Defines visibility and workflow stage |
| comment | < string > | Optional feedback or audit notes |
| items | {id: action} | Maps each duplicate contact to its action (canonical, merge, delete, archive) |
| parent | < object > | Mirrors CHT’s parent structure, used for replication |
| meta | < object > | Standard metadata fields for audit trail |
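
The allowed values above lend themselves to a simple sanity check before a doc is written. The sketch below is hypothetical: the function name is made up, and two rules are assumptions rather than part of the schema (actions are only checked once the doc is pending or approved, and a reviewed doc has exactly one canonical item).

```javascript
// Allowed values taken from the property list above.
const STATES = ['issued', 'pending', 'approved'];
const PRIORITIES = ['low', 'medium', 'high'];
const ACTIONS = ['canonical', 'merge', 'delete', 'archive'];

// Hypothetical validator: returns a list of problems, empty when the doc looks fine.
const validateDuplicatesDoc = (doc) => {
  const errors = [];
  if (doc.type !== 'duplicates') {
    errors.push('type must be "duplicates"');
  }
  if (!STATES.includes(doc.state)) {
    errors.push(`unknown state: ${doc.state}`);
  }
  if (!PRIORITIES.includes(doc.priority)) {
    errors.push(`unknown priority: ${doc.priority}`);
  }
  // Assumption: actions are only filled in once the CHW has submitted the doc.
  if (doc.state === 'pending' || doc.state === 'approved') {
    const actions = Object.values(doc.items || {});
    actions
      .filter(action => !ACTIONS.includes(action))
      .forEach(action => errors.push(`unknown action: ${action}`));
    // Assumption: a reviewed doc has exactly one canonical record.
    if (actions.filter(action => action === 'canonical').length !== 1) {
      errors.push('exactly one item must be marked canonical');
    }
  }
  return errors;
};

const example = {
  type: 'duplicates',
  state: 'approved',
  priority: 'high',
  items: { 'contact-a': 'canonical', 'contact-b': 'merge', 'contact-c': 'delete' },
};
console.log(validateDuplicatesDoc(example)); // []
console.log(validateDuplicatesDoc({ ...example, items: { 'contact-a': 'merge' } }));
// [ 'exactly one item must be marked canonical' ]
```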

Considering custom views

One concern, as mentioned in the ddoc section, is the maintainability of modifying the ddocs. Since CHT warns that core design docs may be overwritten during upgrades, it might be safer to use custom indexes/views, functionality we helped introduce a while back via the cht-conf tool.

That said, I’m a little hesitant to rely on a custom view in such a core part of the app. Off the top of my head, to facilitate something like that we’d need to:

  1. query whether additional view(s) are available on the medic db
  2. maintain a dynamic list of views and their params, whose results should be included as replication subjects. Depending on the complexity, the contents might need to be parsed as js. For example, in our context, CHWs should only replicate docs with a state of “issued”.

Since offline users hit this often, it could be an issue if not done optimally. Still, for CHT to support diverse plugins this might be the right direction to explore. Maybe I’m overthinking it, but I’d be interested to hear if anyone has approached this differently.
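
As a rough illustration of the first step in the list above, the availability check could be done once against the design docs and the result reused, rather than probing on every replication request. Everything named here (the helper, the `custom` design doc, the view names and the config shape) is hypothetical; the only thing relied on is that CouchDB design docs list their views under a `views` property.

```javascript
// Hypothetical helper: keeps only the configured views that exist on the given design docs.
// `ddocs` are design documents as stored in CouchDB ({ _id: '_design/<name>', views: { ... } }).
const filterAvailableViews = (ddocs, configuredViews) => {
  const available = new Set();
  ddocs.forEach(ddoc => {
    const ddocName = ddoc._id.replace(/^_design\//, '');
    Object.keys(ddoc.views || {}).forEach(view => available.add(`${ddocName}/${view}`));
  });
  return configuredViews.filter(entry => available.has(entry.view));
};

// Made-up design doc and configuration, for illustration only.
const ddocs = [
  { _id: '_design/custom', views: { duplicates_by_state: { map: 'function(doc) { /* ... */ }' } } },
];
const configured = [
  { view: 'custom/duplicates_by_state', keys: [['issued']] },
  { view: 'custom/not_uploaded_yet', keys: [] },
];

console.log(filterAvailableViews(ddocs, configured).map(entry => entry.view));
// [ 'custom/duplicates_by_state' ]
```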

Current output

The prototype represents a first step toward giving users direct control over managing duplicate data. It’s still very rough and un-styled but seems to work.

We introduced an additional “Duplicate” nav button:
image

To maintain consistency with the rest of CHT, items for selection are displayed on the left-hand side and “content” is displayed on the right. The content is also displayed in cards to mirror contact “profile” screens. The big difference is that the editing of the doc takes place right in the content pane.

As a CHW:

A list of “issued” docs will be displayed to the user. As explained above, each item contains a collection of record IDs, of the same type, that have been flagged as duplicates. After pressing the “edit” button, additional controls appear next to the “open” button. Selecting an item as “canonical” displays an “action” dropdown on the remaining items. After the user submits the changes, the state moves to “pending”, and the record is removed from view.


As an auditor (admin):

A list of “pending” docs will be displayed to the auditor. After pressing the “edit” button, the user is able to view the selections made by the CHW. Based on the state of the doc, the auditor can decide to Reject or Approve it. A comment should be entered when rejecting a doc, to guide the CHW in making the correct capture. Once the state changes, the record is removed from view.
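
The two walkthroughs above amount to a small state machine, sketched below. This is not the prototype's code. Two points are assumptions: that a rejected doc goes back to “issued” so the CHW sees it again, and that the rejection comment is enforced rather than just recommended.

```javascript
// Transitions described in the walkthrough.
// Assumption: a rejected doc returns to "issued" so the CHW can correct it.
const TRANSITIONS = {
  submit: { from: 'issued', to: 'pending' },
  approve: { from: 'pending', to: 'approved' },
  reject: { from: 'pending', to: 'issued' },
};

// Returns a copy of the doc in its new state, or throws if the event is not allowed.
const applyTransition = (doc, event, comment = '') => {
  const transition = Object.prototype.hasOwnProperty.call(TRANSITIONS, event) && TRANSITIONS[event];
  if (!transition) {
    throw new Error(`Unknown event: ${event}`);
  }
  if (doc.state !== transition.from) {
    throw new Error(`Cannot ${event} a doc in state "${doc.state}"`);
  }
  // Assumption: the comment is mandatory on rejection.
  if (event === 'reject' && !comment.trim()) {
    throw new Error('A comment is required when rejecting');
  }
  return { ...doc, state: transition.to, comment: comment || doc.comment };
};

const issued = { _id: 'dup-1', type: 'duplicates', state: 'issued' };
const pending = applyTransition(issued, 'submit');
const rejected = applyTransition(pending, 'reject', 'Wrong canonical record selected');
console.log(pending.state, rejected.state); // pending issued
```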



We’d really appreciate any thoughts or advice.

  • Is there a better long-term approach?
  • Any suggestions for optimizing replication to avoid added sync delays, while still allowing for extending functionality?
  • Thoughts on the schema or flow, anything we might be missing?

This is awesome @Anro! Thank you for the clear and detailed documentation. :+1:

For your approach here, what happens once the auditor approves the dedupe decision made by the CHW? You mentioned “Maybe these ‘actions’, that we’ll get into in a bit, could eventually be handled by the cht-conf tool”. But does that mean that currently you do not have any automated “merge” process? I do agree that there is overlap here with the merge-contacts action in cht-conf, which has a bunch of logic for dealing with the necessary updates to the contact’s reports, etc.

Another question I had was how you are triggering the initial dupe checking for the CHW on the client side? I see you mention listening for doc changes. Are you just re-running the dupe checking every time a contact changes? Or are you identifying the potential dupes on the server-side (e.g. Databricks) and then pushing the duplicates doc down to the CHW so they can make a decision?

Regarding your ddoc changes and the duplicates document type, my first thought is that perhaps we could just capture this information directly on the contact docs themselves instead of having a separate duplicates doc type (similar to how muting contacts works currently). That would eliminate several of the view changes you needed related to replication (but would still need a new view to get the contacts with the duplicates property).

Are your auditor users also offline? I am trying to figure out why you needed the extra subjects from the doc_by_type query during replication. Might need some kind of needs_signoff override for contacts (like we currently have for reports)… (Actually, looking at this again, I think perhaps the doc_by_type query is to make sure the CHW has access to both contacts when they make the initial duplication decision, right? Are you checking for dupes on the server-side in such a way that you could identify dupes that do not share the same parent contact?)


Naturally, my first thought here was this might make a good plugin. :sweat_smile: (Which, just FYI, thanks for your patience on that end. We are shooting to get a formal squad kicked off next week! :crossed_fingers:) However, I think the various replication considerations might mean this needs to be a more formal part of the CHT data model. :thinking:

I like the UI in your screenshots for reviewing the duplicates and making the decision! From a workflow perspective, I wonder if it would be better for the users to launch into that view from a task instead of it being a separate top-level tab in the UI? I am thinking that whenever a user needs to make a decision about duplicates, they would get a task that shows up on the normal Tasks tab. Selecting that task would take them into that dupe-review view. (So, “Duplicate review” just becomes a new type of task action.) A task-based workflow here should allow for high levels of customization regarding assigning responsibility for duplicate review to specific users (and I think should also make it pretty simple to trigger the “audit” step once the CHW has made a decision about the duplicates).

Regarding the final operation to update the contacts once the duplicates have been approved, IMHO we need to add proper endpoints to the CHT api server to support deleting and merging contacts. We have staggered along this long with second-class support in cht-conf, but these are fundamental data operations that belong in the api server… Once we have that support in the server, though, it becomes trivial to update Sentinel to actually trigger the operations on the duplicates and the updates can happen automatically. :+1:


Anyway, sorry for my reply being all over the place here, but once again I am really happy that you all have put so much good thought and effort into this!

Thank you for taking the time to read through our walls of text and provide such valuable feedback @jkuester!

At this time, Databricks is going to pick up the approved items and perform the necessary actions. Databricks is already set to handle identifying potential duplicates and performing automatic merging, deletion, or archiving of high-confidence items, as well as flagging low-confidence items for manual intervention through the UI and pushing them back into the app. It’s specific to our environment (for now), so it seems fitting to reuse that functionality.
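
Whatever ends up consuming the approved docs, a first step would probably be to group the items by action. The sketch below is only an illustration in JavaScript, not our Databricks job; the helper name and the contact IDs are made up.

```javascript
// Hypothetical helper: groups the IDs in a duplicates doc by their chosen action.
const groupItemsByAction = (doc) => {
  const groups = { canonical: [], merge: [], delete: [], archive: [] };
  Object.entries(doc.items || {}).forEach(([id, action]) => {
    if (!Object.prototype.hasOwnProperty.call(groups, action)) {
      throw new Error(`Unknown action "${action}" for ${id}`);
    }
    groups[action].push(id);
  });
  return groups;
};

// Made-up approved doc, trimmed down to the relevant properties.
const approved = {
  type: 'duplicates',
  state: 'approved',
  items: { 'contact-a': 'canonical', 'contact-b': 'merge', 'contact-c': 'delete' },
};

const groups = groupItemsByAction(approved);
console.log(groups.canonical); // [ 'contact-a' ]
console.log(groups.merge);     // [ 'contact-b' ]
console.log(groups.delete);    // [ 'contact-c' ]
```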

That said, for this to be useful to the larger community, we would need to build it in a way that allows different contexts to push and consume these documents in their own ways - whether that’s Databricks, other cloud functions, the cht-conf tool, or Sentinel + API.

We’re fortunate that our backend team (using Databricks) is set to handle the initial identification and push the duplicate document down, both retroactively and eventually in “real time” as new records are consumed by our data processing job. This means devices stay performant and we can compare docs over a larger scope (compared to on-device comparison on change). Of course, this isn’t directly useful to the CHT yet. Initially, I was hoping that Sentinel could bridge that gap - since it already manages work like aggregating feedback docs, creating login users based on the contact_for_user prop, and handling purging (and probably more). However, please correct me if I’m wrong, it seems Sentinel only has a single contact in context at any given time, which makes duplicate detection tricky.

If we could extend Sentinel to handle duplicates with an opt-in approach similar to the purge script, that could be quite beneficial. That said, if memory serves, its contents are meant to be “self contained”, which means we can’t rely on our existing duplicate flagging logic in .props files. As a side-note, because of our complex dedupe requirements, we’ve had to move some of our .props logic into xml-forms-context-utils.service, which also allowed us to create tests. I’m curious if others have encountered similar issues.
Manual flagging of duplicates can also be facilitated on the UI using the same floating “+” button pattern.

Duplicates don’t necessarily have to be limited to contacts - similar to the JIT functionality, this could apply to other document types like reports. This could become even more relevant with the introduction of plugins. Having all related duplicate documents visible in one place, instead of scattered throughout, also makes it easier to track what is going on. That said, I haven’t worked with the muting functionality yet, so I may be missing your point there. As you’ve noted, we can’t really get away from some level of view change. If we’re to support diverse plugins, we may also need to facilitate the replication of a variety of doc types - not just the core CHT ones. Maybe that’s a discussion for the upcoming plugin squad :sweat_smile:.

At the moment, the auditor is hardcoded as admin, an online user. That’s just specific to our context (and may also change). Other deployments might have offline auditors and online initial editors. That’s unlikely, but I’m hesitant to make assumptions that could prematurely foreclose options.

The doc_by_type query in our replication logic is written to only consider:

  • duplicates specific to the facility of the offline users, and
  • duplicates with a state relevant to that user’s role (issued = CHW, pending = auditor)

Everything else is excluded from replication. For example, if both the CHW and auditor (like a team leader) are offline users, each will only replicate documents relevant to their own scope and permissions. The 2nd filter still needs to be implemented.
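
A rough sketch of how the two filters could combine, assuming the view rows are keyed on [type, state] and carry the parent lineage as their value. The role names `chw` and `auditor` are placeholders rather than real role identifiers, and the helper names are made up.

```javascript
// Hypothetical role-to-state mapping (issued = CHW, pending = auditor).
// The role names are placeholders, not real CHT role identifiers.
const STATES_BY_ROLE = { chw: ['issued'], auditor: ['pending'] };

// True when facilityId appears anywhere in the emitted parent lineage.
const isInLineage = (parent, facilityId) => {
  for (let current = parent; current; current = current.parent) {
    if (current._id === facilityId) {
      return true;
    }
  }
  return false;
};

// Expects rows keyed on [type, state] with the parent lineage as the value.
const filterDuplicateRows = (rows, userCtx) => {
  const allowedStates = new Set();
  (userCtx.roles || []).forEach(role => {
    (STATES_BY_ROLE[role] || []).forEach(state => allowedStates.add(state));
  });
  return rows
    .filter(row => allowedStates.has(row.key[1]))
    .filter(row => isInLineage(row.value && row.value.parent, userCtx.facility_id))
    .map(row => row.id);
};

const viewRows = [
  { id: 'dup-1', key: ['duplicates', 'issued'], value: { parent: { _id: 'area-1', parent: { _id: 'district-1' } } } },
  { id: 'dup-2', key: ['duplicates', 'pending'], value: { parent: { _id: 'area-1', parent: { _id: 'district-1' } } } },
  { id: 'dup-3', key: ['duplicates', 'issued'], value: { parent: { _id: 'area-9' } } },
];

console.log(filterDuplicateRows(viewRows, { roles: ['chw'], facility_id: 'area-1' }));
// [ 'dup-1' ]
console.log(filterDuplicateRows(viewRows, { roles: ['auditor'], facility_id: 'district-1' }));
// [ 'dup-2' ]
```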

Our server will search for duplicates across a broad scope, but the flagging rollout will be gradual, level by level, similar to how we approach the JIT implementation on our side. The idea is to resolve duplicates from the topmost place first, then move down through the children, forcing the grouping of items where it makes sense. However, yes, it is possible that some duplicates may not share the same parent. We could allow for these items.


We’re on the same page there :sweat_smile:. At least this conversation is helping with highlighting potential plugin considerations. Thank you for getting the squad moving.

You raise a good point about task-based workflows. While tasks seem to fit this case nicely, we don’t necessarily want to restrict this functionality to offline users.

For context, we have an OTL (team lead) that needs to capture death reports for patients flagged by the CHWs in their team. The challenge is that the OTL must replicate a LARGE number of records, and tolerate slow performance, to access the tasks they need to catch up on. It gets worse the higher up the hierarchy you move. A little bit of a pain point for us :sweat_smile:.

By allowing any user (online or offline) to pull the duplicate data directly, they can act on it without replicating the entire hierarchy just to see a notification. Of course, this data is stored “globally” instead of as a single record created by a task - but at least it is shared by a few users.

Your closing remarks make a lot of sense. Definitely aligns with our thinking. I should have read your comment before drafting my response :sweat_smile:.

Very interesting! :thinking:

The great/challenging part of the de-dupe workflow is that there are quite a few different aspects and moving parts. I have tried to capture the high-level steps in your workflow in this diagram (let me know what I missed :sweat_smile:):

The more I look at this, the more I think it is unrealistic to try and define a single duplicate-resolution process that will work for all CHT deployments. Instead, maybe we should focus on which reusable features/components we need in the CHT platform to support the workflow we have outlined above. Drilling in on this:

  • For #2, I think we definitely should leave this in an external system (e.g. Databricks), at least for now.
    • The logic is likely to be highly org-specific.
    • Also, as you pointed out, Sentinel (and really CouchDB in general) is not really a good place for doing duplicate detection because it is not optimized for longitudinal data views. To perform these operations at scale, it is almost certainly better to stream the data (#1 and #9) into a different datastore. E.g. cht-sync can be used to send data to Postgres.
  • For #3, data can currently be sent directly to Couch, but we are also adding more endpoints to the API server around basic CRUD for contacts/reports. Long-term it would be best to do all data manipulation through the CHT api server instead of directly modifying the database.
  • For #4 and #6, we need to make sure the data model used to track duplicates is compatible with the replication algorithm/views. (Or update the algorithm to have the support we need.)
    • Honestly, I still don’t see the benefit of a duplicates doc type over having either just a duplicate_contacts object property on the contact itself OR we could have a report doc with type: 'data_record' and form: 'duplicates'.
    • I think the report doc would offer most of the benefits you get from the duplicates type (can query the medic-client/reports_by_form) with the added bonus that it should work “out of the box” with our replication algorithm.
    • Also, if you had an actual duplicates form, you could use that to manually mark contacts as duplicates…
  • For #5, we should be able to use a task to notify the user that they need to take action.
    • With the CHT Plugins on the brain, I am inclined to think that the UI for conflict resolution would be a good plugin. It would be useful to be able to “launch” plugins from a task.
  • Also, it would be great to trigger #7 from a task, but as you noted this will be a problem if the auditor is an online user. Ultimately, I am determined to get support for filtering data access to online users which would allow for task support (or maybe we need a new task paradigm for users with access to all the instance data…).
    • Regardless, the auditor’s UI can also be a CHT Plugin.
  • Finally, as I mentioned before, I think it would be great if #10 was just simply a bit of logic to trigger calls to a CHT api endpoint to perform the merge-contacts operation async.

So, maybe with plugins and a few new API endpoints, we could make all this work? :thinking: :sweat_smile:


You said it! This is part of our broader drive to improve overall data quality in our solution. Thank you so much for taking the time to respond in such detail. Referencing the points outlined in the diagram made it A LOT easier to follow.

I think you’re right. It’s unfortunate, as we really want to contribute back the things we benefit from, but given what’s currently possible, it seems unlikely to provide the value we hoped for. That said, perhaps as changes are implemented over time, a clearer path or solution will present itself.

Right, let’s get into making the CHT and the required info gel well.

  • #2 Agreed.

  • #3 Agreed. It would be great to benefit from the CHT’s built-in APIs and the safety checks that come along with them when pushing doc changes. A quick skim through the PR shows great progress - well done! I realise that the push for plugins has ramped up quite recently, but I was wondering - were the API improvements developed with that in mind, by any chance?

  • #4/#6 We agree that some changes will be needed to make the replication algorithm/views compatible. That said, perhaps due to our own limited understanding, we’re not entirely sure that reusing existing doc types would meet our needs.
    The duplicates type in our case represents a group of all items of a specific type (contact, report, or others in the future) that are considered potential duplicates of another.

    • The idea is to have one record that contains all the IDs of the items that need to be reconciled - small and simple, just one write. That way, we don’t have to manage multiple contact references or worry about keeping them in sync.
      When downstream processes the action, it needs to load one doc. It can then simply remove or archive items without loading each contact individually. Only when an item is marked as “canonical” and “merge” do we load the contact’s content and handle the merging logic.
    • If we were to use medic-client/reports_by_form, the only way I can maybe see this working, since we need the "state" to filter what gets replicated, would be to suffix report types (e.g. “duplicates-issued”, “duplicates-pending”). As the solution matures, we may need more filters, and multiple suffixes could become tricky to manage.
    • Forms can only be created by offline users (if memory serves), but the person marking duplicates isn’t necessarily offline. In our case, there will likely be a mix of both - so this might need its own dedicated screen.

    Even if we could make the existing approach work, what happens when another type is introduced by a different plugin? We might not be able to extend existing doc types to fit those needs. We’re probably not the only ones wanting to make CHT a one-stop shop.

    Please do correct us if we’re misunderstanding anything.

    At the moment, our thinking to support various doc types in the long term looks something like this:

    • Leave the specifics of the view and filter keys to the implementation context by allowing dynamic/custom views to be queried through the replication API. This could leverage the recent cht-conf changes that allow uploading custom indexes to the instance.
    • Since the app settings are already available to the API via config.get, and settings.json allows non-strict property uploads, we could use this to define an entry similar to replication depth. Each object in the array could include a role, applicable view(s), and keys for those views.
    • For the user in ctx we grab their role, look for a matching role entry in this list, grab the relevant custom view(s), and pass in the accompanying keys.
    • Use the user’s facility_id to further filter the records.
  • #5 Agreed. If the task filtering work we’ve been meaning to finish (sorry!) helps make CHWs’ workloads easier to manage, I don’t see why a “notification” to complete the dedupe work couldn’t also be surfaced as a task.

  • #7 The online filter idea would be awesome!

  • #10 Agreed.
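
To make the role-based lookup under #4/#6 a bit more concrete, here is a rough sketch of how an entry in the app settings could be resolved into view queries for the user in context. The `replication_views` property, its shape, the role names and the view name are all hypothetical; nothing like this exists in CHT today.

```javascript
// Hypothetical settings shape; `replication_views` is not an existing CHT setting.
const settings = {
  replication_views: [
    { role: 'chw', view: 'custom/duplicates_by_state', keys: [['issued']] },
    { role: 'team_lead', view: 'custom/duplicates_by_state', keys: [['pending']] },
  ],
};

// Returns the custom view queries that apply to the roles of the user in context.
const getCustomViewQueries = (appSettings, userCtx) => {
  const roles = new Set(userCtx.roles || []);
  return (appSettings.replication_views || [])
    .filter(entry => roles.has(entry.role))
    .map(entry => ({ view: entry.view, options: { keys: entry.keys } }));
};

const queries = getCustomViewQueries(settings, { roles: ['chw'], facility_id: 'facility-a' });
console.log(JSON.stringify(queries));
```

Each returned entry could then be passed to the same kind of view query the authorization service already makes, with the rows filtered by the user’s facility_id afterwards.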

Plugins make the world go ’round :sweat_smile:. The APIs are going to be key to making all of this come together.