Mitigate duplicate data capture

Anro · February 6, 2024, 12:35pm

We’ve run into an issue where our CHWs have started capturing duplicate entries.
After some investigation it seems to be a multifaceted problem.

We used a version of CHT that had the 50 contact limit, limiting work visibility.
Users were onboarded and trained with minimal data, possibly unaware of just how much the lists would grow and how to work through them in the UI.
Users weren’t initially trained to use the “search everything bar” at the top.
Users sometimes create an entry as a location/plot instead of a household (one layer down).
There seems to be no mechanism to “look up” previously created records once in the creation flow.

From a training & design perspective we’re/we’ve:

Upgraded to a newer commit where the 50 contact limit issue had been addressed.
Working with designers to create more “descriptive” icons, to clearly differentiate between the types of hierarchy people and places.
Training users how to structure search terms when using the search bar.
Emphasized that users need to first look through the left-hand side list before simply creating a new contact.
Training users to provide more descriptive place (plot/location) names, and updating existing entries.

That being said, as devs we’re trying to make the usage of the app as seamless as possible.
There are two areas that need to be addressed:

Since suggesting the usage of the search bar, we’ve encountered some unexpected behavior as noted in this thread.
While some inconsistencies between the server & client have been addressed, the way the search functions fundamentally remains the same. We’re hoping that training our CHWs to use more descriptive names will largely mitigate the short & ambiguous location names, however, if the need remains to tweak the “ignored minimum search term length” - how would one go about doing so? I’m reluctant to do so, as I’m sure this will have a performance impact.
Our CHWs were used to the previous app halting duplicate contact creations, so it’s understandable that they would assume the same of this platform. Since recently learning of the descendant-of-current-contact completion we thought of two ways to possibly address this:

Back porting the descendant-of-current-contact functionality to our instance version (4.2.2). I’m unsure how possible this would be, however I’m hoping it would be a quick implementation due to being CHT’s own code. We’d then, as soon as a person tries to register a household, use the parent ID (Indawo) to do a lookup to check for existing households that match that string.
Write an extension library that can, using fuzzy logic, query the database for records that match a given string. Notify the user if the entity already exists, halt form creation, and perhaps also redirect the user to the entity they are interested in. Again, I’m not sure how possible it is to expose the db to such an extension-lib, so would greatly appreciate any guidance/thoughts.

jkuester · February 6, 2024, 8:11pm

tweak the “ignored minimum search term length” - how would one go about doing so?

By “minimum search term length”, are you talking about the number of characters required to actually search with (which is 3 characters)? If so, I do not think there is any way of adjusting this via configuration (definitely would have negative performance impacts). That being said, @gareth, I was testing the latest code in master and did find the behavior a bit surprising. If I have 2 persons: Test 01 and Test 02, and I search for “Test 01”, then I see results for both Test 01 and Test 02. I expected to only see Test 01 because the total number of characters in my search string was >=3. Is this intentional behavior? (And for the record, if I change the names to Test 001 and Test 002 and then search for “Test 001”, then I only see a result for Test 001.)

Back porting the descendant-of-current-contact functionality to our instance version (4.2.2)

Functionally, the descendant-of-current-contact functionality seems like a pretty good approach to helping avoid creating duplicate contacts! Unfortunately, new features like this do not get patched back. There are a number of reasons for that, but one of the primary considerations for why we only release bug fixes in a patch release is to provide all partners as simple an upgrade path as possible to be able to take advantage of the bug fixes. This is why new features are only shipped on major/minor releases. To be able to take advantage of new functionality like this, we highly recommend planning to upgrade to the upcoming 4.6.0 release.

Write an extension library that can, using fuzzy logic, query the database for records that match a given string.

Sadly, I can confirm that currently the extension-lib cannot support this kind of functionality yet. There are two main features missing to be able to do something like this:

Currently extension-libs cannot do async calls. They are great for normal sync computations, but doing things like DB lookups will require being able to return values asynchronously.
Also, at this point, we do not have any kind of interface defined that would allow an extension-lib function to be able to get data from the CHT. So, even if you could make your extension function async, it would still need some way to go do a lookup.

@gareth, I would be curious about your current thoughts on these limitations and if you think we perhaps have a clear enough idea of future possibilities for extension-libs that it would be helpful to log an issue for these?

gareth · February 7, 2024, 6:46am

The way our search is set up is that the overall search string is split up into key words and then short words are dropped. This means the “01” in your search phrase is completely ignored. This is to improve performance because if we indexed on, for example, “a” the size of the index would be enormous, and the results wouldn’t be useful because it would match on many docs in the database.
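To make the behavior above concrete, here is a minimal sketch of the "split into key words, drop the short ones" idea. It is not the CHT implementation: the function names, the whitespace splitting, the lowercasing, and the prefix matching are all assumptions made for illustration. Only the 3-character minimum comes from this thread.

```typescript
// Assumption: terms shorter than this are ignored (the 3-character minimum discussed above).
const MIN_TERM_LENGTH = 3;

// Hypothetical helper: split free text into lowercase key words and drop the short ones.
function toKeywords(text: string): string[] {
  return text
    .toLowerCase()
    .split(/\s+/)
    .filter((word) => word.length >= MIN_TERM_LENGTH);
}

// Hypothetical helper: keep the names for which every remaining search term
// is a prefix of at least one indexed key word.
function search(names: string[], query: string): string[] {
  const terms = toKeywords(query);
  if (terms.length === 0) {
    return []; // nothing left to search with
  }
  return names.filter((name) => {
    const indexed = toKeywords(name);
    return terms.every((term) => indexed.some((word) => word.indexOf(term) === 0));
  });
}

// "01" is shorter than 3 characters, so only "test" is used and both contacts match.
const shortSuffix = search(["Test 01", "Test 02"], "Test 01");   // ["Test 01", "Test 02"]

// "001" is long enough to be kept, so it narrows the result to a single contact.
const longSuffix = search(["Test 001", "Test 002"], "Test 001"); // ["Test 001"]
```

Under these assumptions the sketch reproduces both observations from earlier in the thread: "Test 01" returns both contacts, while "Test 001" returns only one.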

There is a notable change in the next CHT Core release - in previous versions your search term would have returned no results, because while the short key was being ignored during indexing, it wasn’t being ignored in the client side. As of 4.6.0 as you noticed you get both results as it’s better to get too many results than not enough. I hope this will go some way to addressing the duplicate creation issue in this thread as it won’t appear as though “test 01” doesn’t exist.

I think this could work but there would be quite a lot of effort involved in exposing the right APIs to complete this.

On the other hand the ability to detect during contact creation if this contact looks like one that already exists seems like a really useful addition to CHT Core that would benefit all projects. I’m not sure how this should be implemented but let’s raise an issue and start the discussion.

Anro · February 7, 2024, 12:03pm

Let me preface my response by saying we’re planning on upgrading our instance as quickly as possible, but unfortunately this will take some time. We’ve written quite a bit of code that ties into grunt (which will now need to be facilitated by node from v4.4 onwards), made changes to how db requests are made, and must investigate the impact of switching over to the new CouchDB version. For these reasons, and due diligence, each upgrade will be quite a process.

Yes exactly, tweaking the 3-character requirement to trigger a search. Thank you for confirming that there’s no way to do this via configuration. Would you perhaps know where to tweak that in code?
Again, while I’m reluctant to make such a change due to the performance impact (and having to unnecessarily maintain additional custom code), if I can showcase deterioration in performance with some figures – I might be able to dissuade this avenue/implementation.
Since, as @gareth has said, this affects both the indexing and the actual search functionality, I’m assuming it’s also not a quick change?

Thank you for explaining the approach to feature/fix releases, and for noting that the descendant-of-current-contact would work for our purpose.
To be clear, I did not expect CHT would back-port this feature to v4.3, I was more so asking if it would be a possible endeavor for us to undertake – as we need to actively try to reduce duplicate captures while we wait for the upgrade path.
I’ve had a look at this PR, and was wondering if it’s as simple as implementing some of these changes, or if it’s building on top of other features implemented after our version (v4.3) and therefore is a lost cause.

Thank you so much for providing more detail on this. I’ve been wondering for some time if the extension-libs can be used for async operations, while I’m saddened that it can’t, I am glad to have a decisive answer.

Initially we hoped to follow a similar approach like we have with our date_diff extension:

Provide an entry to our namespace in the main.ts in the webapp/src/ts folder:
Create the type in the polyfill.ts file as well:
Then reference that namespace in our extension lib, in our case hooking into the moment lib:


I’ve not yet had time to play around with it, but I saw PouchDB is available in that main.js space, would it be possible to run queries against that?

I think it will help with addressing duplicate record creation in some cases. In other cases, the search result might return quite a few records if they’re similarly named, such as Indawo 1… 99. From a CHW’s perspective it might be a bit confusing as they have much less screen real-estate and they’re including a number in their search term that’s doing nothing. We have started conversations/training to provide more descriptive naming, but it will take time.

Thank you @jkuester and @gareth for taking the time to look into this, it is greatly appreciated!

gareth · February 7, 2024, 1:21pm

Yes, it would be possible. The downside is that because it’s not a supported API, it’s not documented, and any upgrade to CHT Core may break it without warning because the developers won’t know that you’re accessing it directly. These sorts of changes make it harder and harder to upgrade in future (as you’ve found with the grunt tasks - I had no idea you were relying on those!).

My preference is to get customisations you want to make supported in CHT Core and help you get on to the latest version and keep up to date from there.

But ultimately the CHT is open source and you’re more than welcome to fork it to do whatever you need!

jkuester · February 7, 2024, 3:22pm

I think (have not tested) it is just a matter of changing this check in the CouchDB view and then this value in the app code. Like Gareth said, this will have a significant impact on the size of this Couch view and the subsequent query performance. (And, this view will have to be re-indexed. This will take a long time for any non-trivial data set.)

It is hard to be sure (since it seems like the challenges for stuff like this are always the little things you don’t anticipate), but I briefly had a look through that code and nothing jumped out at me as being particularly difficult to back-port to 4.2. One concern I had was how much it was affected by the Angular upgrade in 4.5, but even that looks tangential to the search changes needed for descendant-of-current-contact…

I am curious why you decided to add moment globally instead of just using something like webpack to package it with your extension code? (I would expect that after tree-shaking, the size difference would be pretty minimal in terms of kb of code…) I had always figured the webpack approach would be the cleanest way to do dependencies in extension-libs, so I would like to know if you ran into some kind of problems with that approach! (Also, as a side note, a date_diff function seems like a generic feature that could be useful to many partners. If you are finding it useful, consider perhaps up-streaming it as a built-in CHT function!)

I definitely agree that this really should be a built-in feature in the CHT! Looks like we already have an issue raised, but I have added a link to this conversation:

github.com/medic/cht-core

Prevent and/or merge duplicate contacts

opened 04:38PM - 18 Apr 20 UTC

ecsalomon

Type: Feature

**Is your feature request related to a problem? Please describe.**
[This excellent report](https://docs.google.com/document/d/1TibEbmNiZPAr9PKHiBw0DTZn-KO2XHtOb6Q702caEBM/) by @marialma shows that duplicate contacts are an issue for a non-trivial amount of CHWs in some deployments. Maria has enumerated a large number of issues that arise for care delivery, privacy, and analytics from this. For example:
- For client-initiated health assessment, duplicate contacts create issues of linking phone numbers to patients or households
- Recreating a patient who moves to a new household means that the CHT and analysts lose the history of that patient, which may impact care delivery (e.g., lose the context of a registered high risk pregnancy)

**Describe the solution you'd like**
There are several potential solutions:
1. When entering a new contact, search existing contacts for exact or extremely close (say, Levenshtein distance = 2) names and ask the CHW to confirm that they want to create a new contact.
2. Allow CHWs to "move" contacts between parent contacts (e.g., when someone moves to a new household within a catchment area)
3. For older deployments, let's say a data scientist sets up an entity matching service on the backend to identify likely duplicates, allow a CHW/supervisor to review and confirm duplicates and merge them (so that all documents become associated with the merged/new contact) and deconflict contact info (e.g., date of birth).

**Describe alternatives you've considered**
There are likely some training solutions (e.g., advising CHWs of the problems created by duplicate contacts). But training will never address everything.

**Additional context**

Anro · February 14, 2024, 10:19am

Thank you @jkuester, I’ll be sure to fiddle around with these values and test the results.
@gareth, in this thread, mentioned it might be worthwhile re-evaluating how search does, and should, function.

Too true! So far the functionality seems to be working, at least from a form usage perspective.
Running the accompanying tests might tell a different story, but we’ll see.
Thank you for taking the time to give it a once over.

We’ve also tried implementing a way to redirect a user to the doc they’re interested in.
We tried to dynamically construct a href value, then set that anchor tag as the content for another field - with the hopes of eventually styling a button.
Unfortunately, the only way we’ve been able to achieve redirection, due to the dynamic values, is by using hyperlink markdown - which unfortunately opens a new tab.
It would be awesome if there’s a way to do this more elegantly, do you perhaps know of a mechanism?

The test xlsx can be found here for your reference.

Has Levenshtein Distance (LD) or something similar ever been considered as a way to fuzzy search db records? Would something like that be possible?

I’m reluctant to admit my knowledge regarding webpack bundling is a bit sparse. Simply importing the moment library in the extension file did not work. I was unsure how to go about it, and received help from another dev in order to get this working solution.
How would one go about utilizing webpack to package it with an extension? For future reference.

We’d be more than happy to contribute the code upstream! Thank you for highlighting the space where this needs to be implemented.

gareth · February 14, 2024, 10:57am

The way the query is implemented is using a couchdb view which indexes every word in the doc. This index is cached for performance. I can’t think of a way to use fuzzy matching with this because LD (for example) compares two strings, so the cached index couldn’t be used.
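As a rough illustration of why this does not fit a keyed index, here is a minimal sketch of Levenshtein distance and a duplicate check built on it. This is not CHT code: `findLikelyDuplicates` is a hypothetical helper, the in-memory list of names stands in for whatever records would be compared, and the default threshold of 2 is borrowed from the suggestion in the linked issue.

```typescript
// Classic dynamic-programming Levenshtein distance between two strings.
function levenshtein(a: string, b: string): number {
  // previous[j] is the distance between the first i-1 characters of a
  // and the first j characters of b.
  let previous: number[] = [];
  for (let j = 0; j <= b.length; j++) {
    previous.push(j);
  }
  for (let i = 1; i <= a.length; i++) {
    const current: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const substitution = previous[j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1);
      const deletion = previous[j] + 1;
      const insertion = current[j - 1] + 1;
      current.push(Math.min(substitution, deletion, insertion));
    }
    previous = current;
  }
  return previous[b.length];
}

// Hypothetical helper: compare the candidate against EVERY existing name.
// There is no key to look up, which is why a cached index does not help here.
function findLikelyDuplicates(candidate: string, existing: string[], maxDistance = 2): string[] {
  const needle = candidate.trim().toLowerCase();
  return existing.filter(
    (name) => levenshtein(needle, name.trim().toLowerCase()) <= maxDistance
  );
}

const matches = findLikelyDuplicates("Indawo 1", ["Indawo 2", "Indawo 10", "Plot 7"]);
// matches is ["Indawo 2", "Indawo 10"]
```

Note that numbered names such as "Indawo 1" and "Indawo 2" are only one edit apart, so a distance threshold on its own would flag them as likely duplicates. Any real implementation would probably need to combine it with other fields or with more descriptive names.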

A long time ago before the CHT supported offline users lucene was used to do the search, but there was no obvious client side implementation for offline users so we dropped it in favour of the solution in use today.

jkuester · February 15, 2024, 7:53pm

I added a response in this thread to some of the design proposals included in both threads.

Regarding manually including a redirect link in a form, I think you are on the right path with the anchor. Setting target="_self" as an attribute on the anchor (as you probably know) should make it open the link in the same tab. But, of course, this is not useful for markdown links. And, as you have noticed, our dynamic-url widget only works for markdown links… or does it?

It took some tinkering, but I found the proper format to use for your html anchor in a form label to actually be able to trigger the dynamic-url functionality for your custom link!

<a target="_self" href="#" rel="noopener" class="dynamic-url">
    ${my_name}
    <div class="url hidden">${link}</div>
</a>

With that as your “label text”, it should dynamically load both the link text (from my_name) and the actual link href value (from link). (The dynamic-url/url classes here trigger the dynamic-url widget code to automatically read the link value and set it as the href on the anchor.)

For bonus points, you can re-use the bootstrap classes we already have loaded and update the anchor to have: class="dynamic-url btn btn-primary" style="display:inline-block;" and you get a very nice looking button! (Though it should be noted that these bootstrap classes are not formally guaranteed to exist in future versions of the CHT. Really all of this is pretty much “use at your own risk”…)

Anro · February 16, 2024, 9:51am

Thank you for your response, to both of these threads.

That’s so cool, never would have thought to reuse the markdown widget to augment the anchor!
Thank you so much for taking the time to tinker around with this, it works brilliantly:

Should the display classes no longer exist in the future, the item just ends up looking like the markdown link we intended to use in the beginning:

I acknowledge the “use at your own risk” caveat. Since this is a piece of convenience functionality, I think it worthwhile to include (and it looks pretty). The backup of manually going back to the list already exists.

Anro · March 19, 2024, 3:07pm

We’ve opened a PR with the date_diff functionality as outlined above. Thank you for pointing us to the “built-in function” direction.
The original difference-in-months method now uses the same function.

We also up-streamed our drawing widget integration, if that would be useful to the larger community.

Anro · November 6, 2024, 12:06pm

We have proposed a duplicate prevention prototype here and would greatly appreciate feedback from the community