Error loading tasks

There is an error loading tasks in the CHW App, and the logs show a database global failure.
Some findings on the possible causes have been documented here; the document also includes the ADB logs for the different mobile devices.

Kindly review and share any ideas on how to resolve this intermittent bug, and what causes it if it's a known issue.

Hi @Jeanfaith234

What version of CHT Core are you on?

How common is this error? i.e. how many users are successfully loading tasks vs. how many are getting an error like this?

Have you tried logging the same users in with a different device? This will help narrow down if it’s a user issue or a device issue.

@gareth
CHT 3.16

The bug is intermittent, but it usually comes up toward the end of the month when almost every CHW is syncing their data. The numbers keep increasing into the first week of the new month, and by then the affected users are roughly even with those who are not affected.

Logging the same user into a different mobile device sometimes works and sometimes fails.

Do you have a workaround or does the bug just come right if the user keeps trying?

After the bug occurs, does the user have to download everything again, or does the app just recover on its own?

@gareth would the user telemetry record for the day the exception was thrown help?

@gareth the current workaround is to sync, clear data, and log in again, though this doesn’t scale due to the large number of affected users.

It seems suspicious that the bug is more common when there is heavy server load. However, for the record, I have checked the CouchDB logs from 2023-02-07 (the day the errors in the linked document were recorded) and did not see anything out of the ordinary. There were no server errors (and particularly no timeout errors).

The errors appear to be somehow related to task generation, and my hunch is that these users have a large number of contacts, reports, or tasks being generated.

@rmayore Yes, I think the user telemetry may help. Also, if you could find out how many of each type of doc the user has, that would be very interesting, particularly the tasks, which should be visible in the user’s meta database in CouchDB. Some of this may include PHI, so please summarize or redact the information if posting here.
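
If couch2pg is available, one rough way to approximate this from the server side is to group docs by type using the user field that task docs (and, I believe, target docs) carry. This is only a sketch: it assumes the same couchdb table as the queries later in this thread, it will not count contacts or reports replicated to the device, and the username below is a placeholder.

SELECT
    doc ->> 'user' AS chw,
    doc ->> 'type' AS doc_type,
    count(*)       AS docs
FROM couchdb
WHERE doc ->> 'type' IN ('task', 'target')
  AND doc ->> 'user' = 'org.couchdb.user:example' -- placeholder username
GROUP BY chw, doc_type;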


@gareth Telemetry for the date when the error occurred:

{
  "_id": "telemetry-2023-2-7-username-uuid",
  "_rev": "rev,
  "type": "telemetry",
  "dbInfo": {
    "adapter": "idb",
    "db_name": "medic-user-username",
    "doc_count": 8580,
    "update_seq": 8697,
    "auto_compaction": true,
    "idb_attachment_format": "binary"
  },
  "device": {
    "screen": {
      "width": 320,
      "height": 570
    },
    "userAgent": "Mozilla/5.0 (Linux; Android 7.0; TECNO WX3P Build/NRD90M) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Crosswalk/23.53.589.4 Mobile Safari/537.36 org.livinggoods.smarthealth.region1.ug/v0.4.33",
    "deviceInfo": {
      "error": true,
      "message": "Problem fetching device info: class java.lang.NullPointerException: Attempt to invoke virtual method 'int android.net.NetworkCapabilities.getLinkDownstreamBandwidthKbps()' on a null object reference"
    },
    "hardwareConcurrency": 4
  },
  "metrics": {
    "boot_time": {
      "max": 97988.19000000002,
      "min": 15857.235000000004,
      "sum": 195200.95000000004,
      "count": 5,
      "sumsqr": 12092588125.843254
    },
    "tasks:load": {
      "max": 20818586,
      "min": 20818586,
      "sum": 20818586,
      "count": 1,
      "sumsqr": 433413523039396
    },
    "tasks:refresh": {
      "max": 4818,
      "min": 4818,
      "sum": 4818,
      "count": 1,
      "sumsqr": 23213124
    },
    "rules-engine:targets": {
      "max": 2398,
      "min": 749,
      "sum": 3147,
      "count": 2,
      "sumsqr": 6311405
    },
    "search:contacts:types": {
      "max": 24402.655000000028,
      "min": 10277.375,
      "sum": 48171.98999999906,
      "count": 3,
      "sumsqr": 883146992.5812253
    },
    "analytics:targets:load": {
      "max": 193794,
      "min": 161,
      "sum": 194879,
      "count": 3,
      "sumsqr": 37556994133
    },
    "boot_time:2_2:to_purge": {
      "max": 15636.380000000001,
      "min": 15367.72,
      "sum": 31004.1,
      "count": 2,
      "sumsqr": 480663197.5028
    },
    "rules-engine:initialize": {
      "max": 8793,
      "min": 5675,
      "sum": 33267,
      "count": 5,
      "sumsqr": 227583245
    },
    "boot_time:2:to_bootstrap": {
      "max": 87683.54500000001,
      "min": 5754.285,
      "sum": 148091.76,
      "count": 6,
      "sumsqr": 8624915479.627401
    },
    "replication:medic:to:docs": {
      "max": 12,
      "min": 0,
      "sum": 16,
      "count": 36,
      "sumsqr": 148
    },
    "search:reports:subjectIds": {
      "max": 6649.395000001416,
      "min": 1688.1900000000605,
      "sum": 37432.03500000015,
      "count": 13,
      "sumsqr": 127562173.52883837
    },
    "user_settings:language:en": {
      "max": 1,
      "min": 1,
      "sum": 5,
      "count": 5,
      "sumsqr": 5
    },
    "replication:meta:sync:docs": {
      "max": 2,
      "min": 0,
      "sum": 2,
      "count": 15,
      "sumsqr": 4
    },
    "replication:user-initiated": {
      "max": 1,
      "min": 1,
      "sum": 2,
      "count": 2,
      "sumsqr": 2
    },
    "boot_time:2_3:to_purge_meta": {
      "max": 3708.5499999999993,
      "min": 1604.765000000003,
      "sum": 15043.219999999992,
      "count": 6,
      "sumsqr": 41218758.22589995
    },
    "boot_time:purging:undefined": {
      "max": 1,
      "min": 1,
      "sum": 6,
      "count": 6,
      "sumsqr": 6
    },
    "boot_time:purging_meta:true": {
      "max": 1,
      "min": 1,
      "sum": 6,
      "count": 6,
      "sumsqr": 6
    },
    "replication:medic:from:docs": {
      "max": 1,
      "min": 0,
      "sum": 2,
      "count": 33,
      "sumsqr": 2
    },
    "rules-engine:targets:queued": {
      "max": 20667249,
      "min": 3,
      "sum": 20667252,
      "count": 2,
      "sumsqr": 427135181228010
    },
    "replication:medic:to:failure": {
      "max": 20415433,
      "min": 4410,
      "sum": 28033065,
      "count": 27,
      "sumsqr": 426252552278261
    },
    "replication:medic:to:success": {
      "max": 434085,
      "min": 1646,
      "sum": 537273,
      "count": 12,
      "sumsqr": 189774910589
    },
    "replication:meta:sync:failure": {
      "max": 185182,
      "min": 2336,
      "sum": 664723,
      "count": 7,
      "sumsqr": 102160248453
    },
    "replication:meta:sync:success": {
      "max": 29624,
      "min": 3617,
      "sum": 87187,
      "count": 8,
      "sumsqr": 1408293543
    },
    "replication:medic:from:failure": {
      "max": 17618871,
      "min": 1753,
      "sum": 23940868,
      "count": 27,
      "sumsqr": 319400275675204
    },
    "replication:medic:from:success": {
      "max": 780111,
      "min": 27007,
      "sum": 2247066,
      "count": 8,
      "sumsqr": 1289097564638
    },
    "rules-engine:tasks:all-contacts": {
      "max": 20787268,
      "min": 4306,
      "sum": 20791574,
      "count": 2,
      "sumsqr": 432110529445460
    },
    "boot_time:3:to_angular_bootstrap": {
      "max": 1866.3349999999991,
      "min": 1146.9150000000027,
      "sum": 6670.750000000002,
      "count": 5,
      "sumsqr": 9263241.917600002
    },
    "rules-engine:tasks:some-contacts": {
      "max": 6053,
      "min": 3288,
      "sum": 9341,
      "count": 2,
      "sumsqr": 47449753
    },
    "rules-engine:tasks:dirty-contacts": {
      "max": 1727,
      "min": 0,
      "sum": 29359,
      "count": 20,
      "sumsqr": 50702993
    },
    "boot_time:1:to_first_code_execution": {
      "max": 12213.300000000001,
      "min": 7951.165000000001,
      "sum": 54934.645,
      "count": 6,
      "sumsqr": 515045362.95017505
    },
    "rules-engine:targets:dirty-contacts": {
      "max": 1727,
      "min": 0,
      "sum": 12089,
      "count": 8,
      "sumsqr": 20877703
    },
    "rules-engine:tasks:all-contacts:queued": {
      "max": 6,
      "min": 2,
      "sum": 17,
      "count": 6,
      "sumsqr": 61
    },
    "rules-engine:ensureTaskFreshness:cancel": {
      "max": 54,
      "min": 25,
      "sum": 170,
      "count": 5,
      "sumsqr": 6306
    },
    "rules-engine:tasks:some-contacts:queued": {
      "max": 1,
      "min": 0,
      "sum": 1,
      "count": 2,
      "sumsqr": 1
    },
    "replication:medic:to:failure:reason:offline:server": {
      "max": 1,
      "min": 1,
      "sum": 26,
      "count": 26,
      "sumsqr": 26
    },
    "replication:medic:to:ms-since-last-replicated-date": {
      "max": 602988943,
      "min": 4664,
      "sum": 1802369862,
      "count": 39,
      "sumsqr": 737948026911367300
    },
    "replication:meta:sync:failure:reason:offline:server": {
      "max": 1,
      "min": 1,
      "sum": 7,
      "count": 7,
      "sumsqr": 7
    },
    "replication:medic:from:failure:reason:offline:server": {
      "max": 1,
      "min": 1,
      "sum": 25,
      "count": 25,
      "sumsqr": 25
    },
    "replication:medic:from:ms-since-last-replicated-date": {
      "max": 602988909,
      "min": 4656,
      "sum": 1202197300,
      "count": 35,
      "sumsqr": 378403573074891260
    }
  },
  "metadata": {
    "day": 7,
    "user": "username",
    "year": 2023,
    "month": 2,
    "deviceId": "570c7269-36a7-42a1-b66b-7db3cc1bd538",
    "versions": {
      "app": "3.16.0",
      "forms": {
        "forms list"
      },
      "settings": "13-b0ba9b9a9ef213b31ecc6592c2567697"
    }
  }
}

Here is the “Reload experience” shared today:

  1. User logs in. Tasks are working as expected.
  2. After some variable time the user sees “Error Loading Tasks”
  3. At this point the console errors show the specific error "Database has a global failure"
  4. If the user hits Reload, the page reloads in a “normal time to view tasks tab”, but the user reliably and consistently sees “Error Loading Tasks”

@Jeanfaith234 Can you confirm this is correct?

Some other details:

  • Problem onset is inconsistent. Once it happens the problem occurs consistently forever. This is why users are encouraged to clear cache and resync.
  • Anecdotally this affects about 200 users per month
  • Anecdotally this affects Spark and Pop Tecno devices the most

As a next step @Jeanfaith234 and LG Team have agreed to capture some data on the failures:

  1. A list of usernames and date of onset for the “Failure to Load Tasks” error
  2. From this we can look at patterns across users (doc count, etc)
  3. We will use telemetry to obtain the frequency with which the issue occurs per device type (see the query sketch after this list)
  4. We will also get a list of the specific devices in use for this project for which the error occurs less frequently
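
As a starting point for items 3 and 4, and assuming the users-meta databases are replicated into the same couch2pg couchdb table used elsewhere in this thread, the telemetry docs could be grouped by the device model parsed out of device.userAgent (visible in the telemetry doc shared above). A rough sketch only, with those assumptions:

SELECT
    substring(doc #>> '{device,userAgent}' from 'Android [^;]+; ([^)]+) Build') AS device_model,
    count(DISTINCT doc #>> '{metadata,user}') AS distinct_users,
    count(*) AS telemetry_docs
FROM couchdb
WHERE doc ->> 'type' = 'telemetry'
GROUP BY device_model
ORDER BY distinct_users DESC;

Joining this against the list of affected usernames should show whether particular models (for example the Spark and Pop devices mentioned above) are over-represented.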

Thanks for sharing the telemetry and the reload experience. I think we have some data to go on in trying to replicate this.

Some thoughts:

> After some variable time the user sees “Error Loading Tasks”

> Problem onset is inconsistent. Once it happens the problem occurs consistently forever.

It seems that the device triggers getting tasks for all contacts:

 "rules-engine:tasks:all-contacts": {
      "max": 20787268,
      "min": 4306,
      "sum": 20791574,
      "count": 2,
      "sumsqr": 432110529445460
    },
"tasks:load": {
      "max": 20818586,
      "min": 20818586,
      "sum": 20818586,
      "count": 1,
      "sumsqr": 433413523039396
    },

According to these entries, generating tasks does succeed, and tasks are loaded successfully once - recording telemetry happens after task generation.

 "rules-engine:tasks:dirty-contacts": {
      "max": 1727,
      "min": 0,
      "sum": 29359,
      "count": 20,
      "sumsqr": 50702993
    },

Having 1727 dirty contacts makes me think that emissions have automatically expired at 7 days post calculation.

Another valuable metric for attempting to replicate this would be the total number of tasks that you expect this user to generate, and I don’t think we have that information in telemetry. Does anyone happen to know this number?

> Anecdotally this affects about 200 users per month

Is there anything common between these users? A common role, or a common workflow?

I’ve also created an issue to fix the error in the deviceInfo section of the telemetry entry: Telemetry entry reporting Java NullPointerException error in `deviceInfo` section · Issue #304 · medic/cht-android · GitHub

Here is a useful query which can give us this answer and allow us to look at general task health per user:

SELECT
    s.chw,
    s.task,
    current_database() as partner,
    count(s.state) as docs
FROM (
    SELECT 
        doc->>'user' as chw,
        doc#>>'{emission,contact,name}' as contact,
        doc#>>'{state}' as state,
        (doc#>>'{emission,title}') as task,  
        date_trunc('day', to_timestamp((doc ->> 'authoredOn')::bigint / 1000)) AS day
    FROM 
        couchdb
    WHERE 
        doc ->> 'type' = 'task'
        AND (doc ->> 'authoredOn')::double precision / 1000 >= extract(epoch from date_trunc('day', now() - '120 days'::interval))
) s
GROUP BY
    chw, task;

This assumes the task information is replicating successfully. It may not be 100% accurate, but it is worth looking at.
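
A small variant of the same query, under the same assumptions about the couchdb table, would give the per-user totals directly, which makes outliers easier to spot:

SELECT
    doc ->> 'user' AS chw,
    count(*) AS task_docs
FROM couchdb
WHERE doc ->> 'type' = 'task'
GROUP BY chw
ORDER BY task_docs DESC;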

@kenn that is correct.

@kenn sharing this via Slack.

@kenn As requested earlier, today was the expected date for the issue to recur, so I have attached the list of users affected since this morning.

TULA VHTS WITH REOCCURING ERRORS.pdf (34.0 KB)


Thanks to both for your follow-ups.

I’ve done a quick analysis of the task data shared by @rmayore specifically looking at:

  1. Number of task documents total by task title
  2. Number of task docs per user
  3. Deep-dive into the most common tasks and heaviest users
  4. Specific look at doc counts for users experiencing “Error Loading Tasks”

My findings are:

  1. Generally, the task documents per user look quite healthy, with a mean of around 250 docs/user.
  2. The users with the most task documents (~3000 docs) owe this to high rates of a “Covid Follow up” task. We should investigate this to ensure the task is behaving as expected.
  3. The “Upcoming EDD reminder” task triggers for few users, but those who do see it see it a lot (>300 docs/CHW). This is less important, but we can also investigate it.
  4. The users listed by @Faith Kinuthia here have between 0 and 120 task documents. This is very low, so there is nothing interesting in this data set.

I’ve also created a new thread here with a theory and recommendations on what may be provoking this problem. I’d be happy to discuss this with you on our call on Thursday or prior to it.

Can we try this query, @rmayore, so we can figure out what is happening for these users?

select *
from couchdb
where doc ->> 'type' = 'task'
and doc ->> 'user' = 'org.couchdb.user:kawalahar'
and doc #>> '{emission,title,0,content}' = 'Covid Follow up'
limit 150;

@kenn shared the results via Slack.

Sorry, but the results from that query were inconclusive. Can you send me the data again, but without the limit 150?

select *
from couchdb
where doc ->> 'type' = 'task'
and doc ->> 'user' = 'org.couchdb.user:kawalahar'
and doc #>> '{emission,title,0,content}' = 'Covid Follow up'
;