Gateway timeouts on password reset

Anro · February 6, 2025, 3:05pm

Recently, we started experiencing errors when setting passwords on login profiles.

We suspect this happens because the user is linked to a large amount of data. When the profile submission fires, the user-info request attempts to determine the number of records associated with this user in order to display a warning. However, the request takes too long, exceeding 30 seconds in our test case, resulting in a 504 Gateway Timeout.

We believe this might be due to differences in our NGINX configuration:

keepalive_timeout 65;
proxy_read_timeout <not set, use nginx default>

According to the NGINX documentation, the default proxy_read_timeout is 60 seconds, which seems to contradict our theory.
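
For reference, here is a minimal sketch of the upstream proxy timeouts written out explicitly at the values the NGINX documentation gives as defaults. It assumes the directives sit in whichever `http`, `server`, or `location` context proxies to the backend. It is illustrative only and not copied from any real config.

```nginx
# Documented NGINX defaults, written out explicitly (illustrative only).
proxy_connect_timeout 60s;  # time allowed to establish the upstream connection
proxy_send_timeout    60s;  # max gap between two successive writes to the upstream
proxy_read_timeout    60s;  # max gap between two successive reads from the upstream
```

If the 504 arrives before 60 seconds have elapsed, none of these defaults would explain it, and a timeout elsewhere in the chain (for example, a load balancer in front of NGINX) may be the one that fires.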

From what we can see in the following issue thread, increasing the timeout resolves a variety of problems:
Align nginx and ALB timeout values · Issue #8214 · medic/cht-core

We have been somewhat conservative with our timeout settings, as they help ensure that unresponsive servers do not leave connections open unnecessarily. However, we may have set them too strictly due to a lack of understanding. Additionally, a different timeout setting may be the actual cause of the issue.

We would appreciate any guidance on this matter.

diana · February 6, 2025, 3:27pm

Hi @Anro

In the CHT-Core nginx config, proxy_read_timeout is set to 3600 (cht-core/nginx/nginx.conf at master · medic/cht-core · GitHub).
I understand you wish to be conservative about timeouts; however, large requests can take longer. I suggest you experiment with increasing this timeout. Our safe recommendation is 3600 seconds.

Anro · February 7, 2025, 12:11pm

Hi @diana,

We’re happy to increase the timeout slightly for most routes. However, if there are specific routes that genuinely require an hour, please let us know.

diana · February 10, 2025, 10:43am

Hi @Anro

The “heavy lifting” is performed on:

/api/v1/initial-replication/get-ids
/api/v1/replication/get-ids
/api/v1/replication/get-deletes
/medic/_bulk_get
/medic/_bulk_docs
/medic/_all_docs

These endpoints should have timeouts above the default. I recommend you experiment with an increased timeout and check whether any of your requests get dropped. We export a metrics endpoint that will give you endpoint stats: API to interact with CHT Applications | Community Health Toolkit. This can be monitored through watchdog to track the stats over time, so you can check roughly what your timeout should be.
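
One way to keep a conservative default while giving only the endpoints listed above more time is a dedicated `location` block. The following is a rough sketch, not the CHT-Core configuration: the upstream address `http://api:5988` and the `300s` value are assumptions chosen for illustration, and the right value should come from the request durations you actually observe.

```nginx
# Sketch only: the upstream address and the 300s value are assumptions,
# not values taken from the CHT-Core nginx.conf.

# Heavy replication and bulk endpoints get a longer read timeout.
location ~ ^/(api/v1/(initial-replication/get-ids|replication/(get-ids|get-deletes))|medic/(_bulk_get|_bulk_docs|_all_docs))$ {
    proxy_pass         http://api:5988;  # hypothetical upstream; no URI part, as required in a regex location
    proxy_read_timeout 300s;             # illustrative; tune from observed request durations
}

# Everything else keeps a conservative timeout.
location / {
    proxy_pass         http://api:5988;  # same hypothetical upstream
    proxy_read_timeout 60s;              # the NGINX default, stated explicitly
}
```

NGINX matches `location` blocks against the request path without the query string, so parameters on these requests do not affect which block applies.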