Speeding up Sentinel

Hello there,

We are running a CHT instance with messaging-driven workflows. It has a number of transitions configured.
As a result, during peak hours Sentinel has quite a backlog to work through. We know that Sentinel processes transitions in series, which results in delays in sending out messages, for example.

We would like to know if increasing the number of replicas in the CHT Docker configuration, so that we have multiple instances of Sentinel running in tandem, would help alleviate our troubles without introducing unexpected behaviour.

Additionally, though not completely related, we see that Sentinel only processes a transition once but anything that goes to Outbound is retried indefinitely if it keeps failing. Is there a way to opt out of this? In our case it looks like we have about 600 outbound requests that are stuck in the backlog and keep getting retried.

Hi @danielmwakanema

At this moment, Sentinel cannot be scaled up - you cannot add multiple instances of it (because they would end up processing the same docs).

I do have a few questions about your performance considerations:

  1. How big of a backlog do you have during peak hours? How many messages does Sentinel need to process?
  2. How do you host your CouchDB? What resources does it have available?

Sentinel transitions should not be CPU-heavy (you can check this in your monitoring), but they do require fast connections to the database, so it is more likely that your bottleneck is CouchDB.
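
If it helps to put a number on the backlog, the API exposes a monitoring endpoint that reports it. A minimal sketch, assuming your instance serves /api/v2/monitoring and you have jq available; the host and credentials are placeholders:

```sh
# Fetch the monitoring payload and show the Sentinel section,
# which includes the current transition backlog.
curl -s "https://admin:password@cht.example.com/api/v2/monitoring" | jq '.sentinel'
```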

@danielmwakanema - if you have not already, you might consider installing CHT Watchdog. This makes it trivial to monitor the Sentinel backlog over time, as well as to send alerts when undesirable conditions you’d like to know about come up.
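
For reference, getting a basic Watchdog instance running is roughly the following. This is only a sketch: the repository is the public medic/cht-watchdog repo, and its README describes the config file you point at your CHT instance, plus any extra compose files your deployment needs.

```sh
# Clone CHT Watchdog and bring it up with Docker Compose.
# Edit the instance configuration described in the README first,
# so Watchdog knows the URL and credentials of your CHT instance.
git clone https://github.com/medic/cht-watchdog.git
cd cht-watchdog
docker compose up -d
```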

A powerful setup can be to add cAdvisor to the Watchdog deployment. This adds detailed per-container resource use, allowing deep insight into the CPU, RAM and disk used as the Sentinel backlog increases over time. A final layer you might consider is Node Exporter, to get monitoring on your bare-metal host(s).
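
As a rough illustration of the cAdvisor piece, this is the stock cAdvisor container rather than a Watchdog-specific recipe (the Watchdog docs describe the recommended way to wire it into the existing Compose project); Prometheus can then scrape it on port 8080:

```sh
# Run cAdvisor so it can read per-container CPU, RAM and disk stats
# from the Docker host.
docker run -d \
  --name=cadvisor \
  --publish=8080:8080 \
  --volume=/:/rootfs:ro \
  --volume=/var/run:/var/run:ro \
  --volume=/sys:/sys:ro \
  --volume=/var/lib/docker/:/var/lib/docker:ro \
  gcr.io/cadvisor/cadvisor:latest
```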

Best of luck!


@diana and @mrjones at the time I was noticing a 200k+ backlog for Sentinel, but over time it has normalized.

The real problem that was reported, however, and that led me to look at Sentinel and Outbound, was that messages take a while before they transition from pending and are sent to RapidPro. In most cases the messages do get to RapidPro, but at times it takes eight minutes for that to happen.

Interestingly, this is an issue we are facing on both our live and development instances, and the latter is hosted by Medic.

Here’s what I tried:

  • I created test messages in the Messaging tab
  • I verified that they are created almost immediately in CouchDB with the initial pending state
  • I grepped Sentinel logs to watch for the transition that sends the messages to RapidPro; it often took a while before it was run (roughly what I ran is sketched after this list)
  • Going through the logs also showed that requests to the Medic RapidPro instance we use are being throttled, but I am not clear how this is contributing to the issue we are facing
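
For reference, these are roughly the checks I ran. The container name, database, credentials and doc id are placeholders for our setup, and the exact doc shape may differ by message type (tasks vs scheduled_tasks):

```sh
# Check the message doc in CouchDB to confirm its tasks are still pending.
curl -s "http://admin:password@localhost:5984/medic/<message-doc-id>" | jq '.tasks[].state'

# Watch Sentinel's logs for the transition that hands the messages off.
docker logs --follow --tail 100 cht_sentinel_1 2>&1 | grep -i transition
```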

@diana currently we are running a single node instance with 16 cores, 24GB RAM and ~600GB storage.

@diana and @mrjones find some screenshots from our Watchdog instance below:



Your help will be greatly appreciated. The crux of our intervention is the messaging component. If you need further information, feel free to ask.

Hi @danielmwakanema

Indeed there is a delay between when a task gets created and when it is sent. This is due to some imposed limitations that are not resource constraints.

For example, API queries the message queue every minute and will only send 25 messages at a time. You can actually see a log entry in the API logs saying sending <number> messages - because yes, it’s actually API that sends the messages to RapidPro, not Sentinel.
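
If you want to confirm this on your side, that entry is easy to spot in the API container's logs; the container name here is just an example and depends on your deployment:

```sh
# Tail the API logs and filter for the periodic outgoing-message batches.
docker logs --follow --tail 500 cht_api_1 2>&1 | grep -i "sending"
```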

Another limitation, which it seems you are also encountering, is on the RapidPro side: RapidPro throttles when too many messages are sent over a short period of time.

The Sentinel backlog that you mention could be due to resource constraints, but it seems that was a one-off and I’m not sure at the moment this is influencing your situation in any way.

I think it’s safe to assume that SMS messages have some latency before being sent, and this latency is due to both systems imposing artificial limits.
