Upper limit on number of docs server will sync

jkuester · July 28, 2022, 7:02pm

I have been discussing an issue with @melema and @Patrick_K where occasionally users on large instances get misconfigured such that the user has an excessive amount of docs associated with them (e.g. the users are added at the wrong level in the hierarchy). In those cases, these users can have a very negative impact on server performance when they try to sync.

I know that when a user is associated with 10000 or more docs that a warning will be presented on the client side asking if they really want to sync that many docs. However, the user can simply continue the sync, negatively affecting the performance for everyone else on the server.

Is there a way to actually prevent a user with too many docs from syncing altogether?

Users are trying to replicate more docs sometimes on purpose and sometimes due t…o misconfiguration. This leads to poor user experience and can impact other users by monopolising server resources. We need to do a better job of communicating the maximum number of recommended docs and also warning when users have been configured incorrectly. 1. @SCdF Has made a start on an API to return the number of docs a user will replicate - take [his PR](https://github.com/medic/medic/pull/5356) and get it production ready and and merged. Consider caching to reduce load. 2. When creating a user through medic-conf or the admin app query this API and warn (but don't block) the configurer if the doc count is too high. 3. When a user replication request is received log a warning in the API log if the doc count is too high. Consider how to not spam the logs, possibly batching the writes together so it's written once per hour. 4. When a user does an initial replication show a confirm dialog so they are aware that their doc count is unusually high. The definition of "too high" is up for debate but I think we should warn at 10,000 docs for now. This can be raised when we improve scalability.

gareth · July 28, 2022, 11:19pm

There is a warning shown to administrators when creating users that would replicate too many docs either via the admin app or cht-conf. However I suspect this warning doesn’t show when moving contacts. That would be a useful addition to cht-conf.

After the fact, these misconfigured users can be detected using the APIs created in this issue.

It would be possible to block replication completely when the user is over the limit, but I worry that be jarring for existing users that create one more doc and suddenly are blocked from replication. One compromise would be to have a warning limit as well as a higher limit that blocks the user completely, so there’s time for the admin to fix the replication/purging settings before the user is cut off.

Another issue is the user is only shown the warning the first time they log in, so a user who has been misconfigured will not see the warning at all.

Do you know how far over the limit the users were?

mrjones · July 29, 2022, 4:30pm

I didn’t know about the API resulting from the issue @gareth cited, so I enjoyed exploring it just now. Here’s my findings!

The API was released in CHT 3.11.0, is documented here, and needs a users with _admin role and is called at at /api/v1/users-doc-count.

I have only three users in my local dev instance, all with less than 100 docs. I can call it, pass the results into jq and see the output:

curl -s https://medic:password@192-168-68-108.my.local-ip.co/api/v1/users-doc-count | jq
{
  "limit": 10000,
  "users": [
    {
      "_id": "replication-count-abdul",
      "_rev": "1-0a35a0e096a985510662dc2fd4417eca",
      "user": "abdul",
      "date": 1659111882929,
      "count": 46
    },
    {
      "_id": "replication-count-mrjones",
      "_rev": "1-4de640ce4ef29b812b30b5b8a6c040c7",
      "user": "mrjones",
      "date": 1659101850205,
      "count": 34
    },
    {
      "_id": "replication-count-mrjones-replacement",
      "_rev": "1-7a5593328e882f851ddc80e87763bcf8",
      "user": "mrjones-replacement",
      "date": 1659110993531,
      "count": 40
    }
  ]
}

However, what if I have 10s, 100s or even 1000s of users? How can I easily know which users are over the limit of 10000? Thanks to the power of jq, we can easily filter out any users above 10,000 and show just their count and username. For my case, I’ll filter at or above 40 to show the filter working:

curl -s https://medic:password@192-168-68-108.my.local-ip.co/api/v1/users-doc-count | \
   jq '.users[] | select(.count >= 40) | .count, .user'
46
"abdul"
40
"mrjones-replacement"

Hopefully this helps anyone else reading this thread - cheers!

oyierphil · January 13, 2023, 6:26pm

@mrjones
Today I have seen a strange warning while logging using one of the CHV accounts as below:

Not sure how to go about it, please advise

jkuester · January 13, 2023, 6:42pm

That warning is existing functionality (described above by Gareth). It indicates the client has noticed the current user is associated with an unusually large number of documents.

Selecting “Continue” will result in the client trying to replicate all of the 10907 documents to the device and continue as normal. This may be successful, but is highly discouraged. That number of documents can cause serious performance issues on a device and result in a significantly degraded user experience. As noted above, this number of documents related to a user can also have a negative affect on the performance of the whole server.

It is worth investigating why this particular user has access to so many docs. Is it a supervisor user that has access to many contacts below them in the hierarchy (and their associated reports)? Are there other users with similar doc counts (@mrjones curl query above can help make this determination)? We have even seen cases where a misconfiguration in a form resulted in all the reports created for that form being associated with a single contact.

oyierphil · January 13, 2023, 7:08pm

@jkuester, this a CHV account, I was informed that the user had challenges logging in. Was trying to login from my desktop to understand the challenge and received the error above. I am curious to find out why the user has access to so many docs, we want our server to perform optimally

Running curl as advised by @mrjones, I see a number of users with documents above 10,000 as below:
“count”: 11268, “count”: 18359, “count”: 13739, “count”: 10856, “count”: 13681, “count”: 10736, “count”: 18323, “count”: 12496, “count”: 10927, “count”: 18000

Not sure what this means to our server and the users devices

jkuester · January 13, 2023, 7:33pm

For a user with an unexpectedly high number of documents, the first thing I would do is double-check the person contact that the user is associated with and double-check the place that person belongs to. Make sure the user is associated to the expected contact and the contact is located at the proper place in the hierarchy (a contact at to high of a level, could have access to way more docs than intended/necessary).

Beyond that, you will probably need to manually sample the docs in the DB for that user and look for anything abnormal/unexpected.

oyierphil · January 13, 2023, 7:37pm

@jkuester, thank you, how do you determine this: [quote=“jkuester, post:5, topic:2055”]
misconfiguration in a form resulted in all the reports created for that form being associated with a single contact.
[/quote]

Running GET /api/v1/users-doc-count for the above CHW:
s-doc-count?user=“XXXX”
{“limit”:10000,“users”:{“_id”:“replication-count-XXXX”,“_rev”:“54-1ece3103e07d9642da6879d6fe671e43”,“user”:“XXXX”,“date”:1673630884149,“count”:10907}}

Trying to review the implication of this since I have users with almost 20,000 documents as shared above, thank you

jkuester · January 13, 2023, 10:17pm

There is no easy way I know of to determine if a misconfiguration is the cause of excessive docs. A manual review of the documents available to the user is required, looking for reports not generated by that user’s activity. Sadly, I don’t think there is any way to check on the backend, which docs are available for a given user. The answer to that question requires some non-trivial calculations based on the replication_depth configuration and the results of the medic/contacts_by_depth and medic/docs_by_replication_key view queries.

Probably the easiest way to review the docs for a user is to try logging in as that user on a device with plenty of storage and power and a good internet connection (e.g. use the browser on a computer). Then you can see which contacts/reports are available for the user.

oyierphil · January 16, 2023, 2:32pm

@jkuester, was logging in as the user above from my laptop, a Dell Coi7 with enough capacity I think, and I still got the warning, “You are about to download 10,90 docs…”. Do you wish to continue?
I didn’t continue since wanted to know the implication of saying yes

oyierphil · January 17, 2023, 1:29pm

@jkuester, was trying to reset password of a different user on the admin app using my laptop and got the same error, thus couldn’t complete the task. We still have a long way to go with data collection, will have to find a way of handling such errors

From the config, we had set the replication depth of the data collection agents to three (3) as follows:
“replication_depth”:[
{
“role”:“chw_supervisor”,
“depth”:3
},
{
“role”:“chw”,
“depth”:3
},
{
“role”:“health_worker”,
“depth”:3
}
],

jkuester · January 17, 2023, 1:52pm

I still got the warning, “You are about to download 10,90 docs…”. Do you wish to continue?
I didn’t continue since wanted to know the implication of saying yes

This is expected. The warning is just based off of the number of docs (and not on the specifics of the client device). For testing/evaluation purposes, it should be fine to replicate this many docs. The main considerations would be:

Does the device have enough power to handle it? (sounds like you laptop should be good)
Can you replicate during an off-peak time for the server? (so that the database performance is not negatively impacted when many other users are trying to replicate)
Is this just for testing and not regular usage? (The main challenges we have seen when users have this many documents is that continued usage over time can gradually degrade server performance (especially if there are many users with similar numbers of documents). I would not expect any problems from simply replicating users like this a couple times to do testing/investigation work.)

If you are noticing a growing number of users with similar doc counts, it is definilty time to review your replication depth and purging configurations. These are the two main tools available to help manage the number of docs being replicated to clients. (Note that neither of these will affect the data on the server.) All collected data will be stored as expected on the server. These configurations simply allow for limiting the amount of docs synced to clients to just the sub-set necessary for the users.

jkuester · January 17, 2023, 3:01pm

I would start by adjusting replication_depth since that is often the most clean way to limit the number of docs associated with a user. I do not know the specifics of your project’s hierarchy, but the general approach here is to start with your highest-level offline users and determine the lowest level of data that they need to have access to in order to do their jobs. E.g. a chw_supervisor probably needs to have access to data about chws (and reports about the CHWs), but the supervisor might not need to directly see patients or reports about patients. Setting the replication_depth configuration accordingly can dramatically reduce the number of documents that supervisor users need to replicate.

Often it is the case that a user like a supervisor does not need access to the majority of the reports at a particular level, but certain workflows required those supervisor users to see particular types of reports. In that case, the needs_signoff flag can be set in those forms to allow for replication to the supervisor while the rest of the reports can still be filtered out by the replication_depth config.

oyierphil · January 17, 2023, 3:25pm

@jkuester, our hierarchy is as follows:
Place User
National Office - National Officer
County Office - County Officer
Sub-County Office - Sub-County Officer
Link Facility - (CHW Supervisors (CHAs), Health workers (Lab Technologists))
CHW Area - (CHWs/CHVs)
Households - (Household members)

So I wish to review the replication depth as follows:
CHWs - Depth 2 (HH and HH Members)
CHW Supervisors - Depth 1 (CHWs)
Health workers - Depth 1 (They register cases under the link facility, thus need the client registration data), not sure if they have to pull any documents

Please review and advise

antony · January 17, 2023, 6:04pm

@oyierphil based on the hierarchy you have shared, you can set the replication depth for supervisors at 1 and as @jkuester mentioned you can use needs_sign off flag to allow supervisors to access certain reports. The health providers and CHWs can use the default replication depth; I don’t think you need to adjust their replication depth.

jkuester · January 17, 2023, 8:35pm

Expanding on this a bit, if you set your config with depth=1 like this (as @antony mentioned, the CHW users can just use the default replication behavior since they need access to everything below them in the hierarchy):

"replication_depth":[
  {"role":"chw_supervisor","depth":1},
  {"role":"health_worker","depth":1}
],

then you end up with a replication matrix that looks like this:

Documents	chw_supervisor	health_worker	chw
`Link Facility`
All reports about `Link Facility`
Person children of `Link Facility`
Reports about person children of `Link Facility`
`CHW Area`
All reports about `CHW Area`
Person children of `CHW Area`
Reports about person children of `CHW Area`
`Household`
All reports about `Household`
Person children of `Household`
Reports about person children of `Household`

Depending on how your config is set up, this may work great for you (and would greatly reduce the number of docs replicated by chw_supervisor/health_worker users).

One thing to watch out for with this config (as noted in a recent feature request) is that with depth=1, the chw_supervisor/health_worker users will not be able to see the CHW contacts or reports about CHW contacts (since the CHW contacts would be “Person children of CHW Area”). Increasing to depth=2 would allow chw_supervisor/health_worker users to see CHW contacts, but would also sync the Household place contacts and all the reports associated with these contacts. (This might be acceptable if the total number of households per Link Facility is not too high and there are not too many reports about these contacts.) Additionally, assuming your CHT instance is on 3.10+, you can avoid replicating the reports about these CHWs and Households by setting the report_depth option:

"replication_depth":[
  {"role":"chw_supervisor","depth":2,"report_depth":1 },
  {"role":"health_worker","depth":2,"report_depth":1 }
],

This would result in a replication matrix that looks like this:

Documents	chw_supervisor	health_worker	chw
`Link Facility`
All reports about `Link Facility`
Person children of `Link Facility`
Reports about person children of `Link Facility`
`CHW Area`
All reports about `CHW Area`
Person children of `CHW Area`
Reports about person children of `CHW Area` submitted by current user
Reports about person children of `CHW Area` submitted by other users
`Household`
Reports about `Household` submitted by current user
Reports about `Household` submitted by other users
Person children of `Household`
Reports about person children of `Household`

oyierphil · January 18, 2023, 9:34am

@jkuester, this is great feedback, would the default replication depth for CHWs be 3, since we have CHV Areas, then household, then household members?

Blockquote
Increasing to depth=2 would allow chw_supervisor/health_worker users to see CHW contacts, but would also sync the Household place contacts and all the reports associated with these contacts

We only register Households through the CHW Areas and by CHWs, the link facility team ( chw_supervisor, health_worker) don’t register households, they register clients and immediately proceed to CIF forms, thus setting the depth at 1 should be fine, I have made the changes to the config as above, monitoring the app

oyierphil · January 18, 2023, 1:48pm

@jkuester, thank you, the health workers are now working.
I have seen a similar challenge with a CHW today while resetting the password

The replication setting for ChWs is as follows:
{ “role”:“chw”, “depth”:3 },

She logged in, can register clients but gets an error when trying to fill CIF form the registered clients below:

Trying to login as the user in the link facility takes time, most of the issues today have been from the one health facility, any idea where to look?

And the console logs as below, please review and advise:

jkuester · January 18, 2023, 8:00pm

would the default replication depth for CHWs be 3, since we have CHV Areas, then household, then household members?

Effectively yes. Technically the “default replication depth” is unlimited. So, users with no configured replication depth will have access to all the data at the place they belong to and below (so everything in their CHV Area, all the housholds, and the household members).

She logged in, can register clients but gets an error when trying to fill CIF form the registered clients below:

Is this for a chw user or a chw_supervisor/heath_worker user? You mentioned a CHW, but then above you were saying that the chw_supervisor/heath_worker “immediately proceed to CIF forms”. I just want to be sure I understand which user encountered this error! If it was a CHW user, then it might just be a mis-configuration of the user. You can check the user in the App Management console and be sure that it is associated with a contact.

If it is coming from a chw_supervisor/heath_worker user, then it is possible the error was caused by changing the replication_depth settings somehow. It would still be worth double-checking the user in the App Management console first to see if they are associated with the expected contact. If the contact is set correctly for the user, and none of the chw_supervisor/heath_worker users can submit a CIF form, then we will need to evaluate the details of those CIF forms to understand what went wrong.

Trying to login as the user in the link facility takes time

How much time does it take for a user with a good internet connection to login? In your screenshots I see that it is replicating 5307 docs. It would not surprise me to hear that it takes several hours to do the initial replication of that much data on a slower internet connection (or if there is a lot of traffic on the server). Subsequent syncs for that user should be much faster than the initial replication. However, if the initial replication is taking many hours, that would be concerning.

I have seen a similar challenge with a CHW today

If you are starting to see chw users with 10,000+ docs, then it is time to reach for the next tool in the box: server-side purging. We cannot reduce the data available to CHWs with the replication_depth (since they need access to all the households that belong to their area), so the next best option is to reduce the time that reports of certain types are stored on the chw’s device (by default, reports are never removed from the device). Do you have any purge config specified currently?

oyierphil · January 24, 2023, 12:56pm

@jkuester, thank you, the chw_supervisor/heath_worker have now settled, had to reconfigure the associated contacts
The current challenge is with CHWs passing the document limit, another case just now as below:

Blockquote
If you are starting to see chw users with 10,000+ docs, then it is time to reach for the next tool in the box: server-side purging.

We don’t have any server-side or client-side purging rule, we started the project around September last year and hope to complete data collection by March this year. I have read about purging on the online documentation, which has many warnings on thoroughly test before taking to production server , will follow the guide on server-side purging and test the rule on test VM, which doesn’t have a lot of data tomorrow.

What would you advise, client side or server side purging, given our current challenges with document limits? I have read about a possibility with our config, which I have reviewed severally and havent picked a culprit. Would appreciate guidance on writing purge rules so that we don’t loose any data, our greatest worry

Upper limit on number of docs server will sync

Related: