I’m looking at Nairobi data in watchdog and I don’t see any metrics even though the instance is online. How do you go about investigating this monitoring failure?
Hi Kenn,
We discovered that the MoH Kenya infrastructure team had blocked IPs outside Kenya to limit the information that port scanners could collect about the eCHIS infrastructure. An update to the firewall rules has been requested and I’ve added you to the email thread.
Ahh, great @elijah, glad that we have figured out the likely source of the issue.
For the record, here are the troubleshooting steps that I was following to try to figure this out from the Watchdog end of things:
Front-end:
- Check the time range configured for your Grafana page. If it is tightly scoped, try expanding it to cover a couple days. When Grafana says “No data” it just means it has no metrics for that time range.
- Check the raw metrics page in Grafana (`explore/metrics`) and see if any metrics are being collected at all for that instance (this will help to determine if it is just a dashboard issue).
- Check the CHT Admin Overview > Monitoring Stack > Scrape Duration panel and see if the scrape you care about is happening (or maybe timing out). In the Metrics Explorer page you can also just directly look at the `scrape_duration_seconds` metric and use the `instance` label to qualify which instance you care about.
- Try directly curling the endpoints you expect to be providing metrics data (e.g. the `api/v2/monitoring` endpoint of the CHT instance you are trying to monitor).
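If you would rather script the last check than run curl by hand, here is a minimal Python sketch. It assumes the endpoint returns JSON; the function name `probe_monitoring`, the `fetch` parameter (there only so the function can be exercised without a network) and the example hostname are my own inventions, not part of Watchdog or CHT.

```python
import json
import socket
import urllib.error
import urllib.request


def probe_monitoring(base_url, timeout=10, fetch=urllib.request.urlopen):
    """Request <base_url>/api/v2/monitoring and summarize what happened.

    The 10 second default only mirrors the scrape timeout mentioned in
    this thread; adjust it to whatever your scrape job actually uses.
    """
    url = base_url.rstrip("/") + "/api/v2/monitoring"
    try:
        with fetch(url, timeout=timeout) as response:
            body = json.loads(response.read().decode("utf-8"))
    except (socket.timeout, TimeoutError):
        # The request hung, which is what a firewall silently dropping
        # packets usually looks like from the client side.
        return {"url": url, "ok": False, "reason": "timed out"}
    except urllib.error.URLError as err:
        # Covers DNS failures, refused connections and HTTP error codes.
        return {"url": url, "ok": False, "reason": str(err.reason)}
    except ValueError:
        return {"url": url, "ok": False, "reason": "response was not JSON"}
    keys = sorted(body) if isinstance(body, dict) else []
    return {"url": url, "ok": True, "top_level_keys": keys}
```

Calling `probe_monitoring("https://cht.example.org")` (hypothetical host) from both your laptop and the Watchdog server lets you compare results: success from one and a timeout from the other points at network filtering rather than at the CHT instance itself.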
Back-end:
- Check back-end services to see if they are running/logging errors
- Try spinning up a local watchdog instance against the same CHT instance to see if you can repro the issue locally.
- Prometheus should be listening on `PROMETHEUS_PORT` (default `9090`) both locally and on your deployed Watchdog instance. If you can access that port from the browser, you can use the website and go to Status > Targets and check on the last error message from the relevant scrape job. If you cannot access the port from a browser, you can get the same data as JSON via the `:9090/api/v1/targets` Prometheus endpoint (e.g. with curl).
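The targets JSON can be long, so here is a rough Python sketch for pulling out just the scrape jobs that are not healthy. The field names (`data.activeTargets`, `health`, `lastError`, `scrapeUrl`, `labels.job`) are the ones the Prometheus targets endpoint returns; the helper names and the `http://localhost:9090` default are assumptions for illustration.

```python
import json
import urllib.request


def failing_targets(payload):
    """Return (job, scrape URL, last error) for every target that is not up."""
    targets = payload.get("data", {}).get("activeTargets", [])
    return [
        (
            target.get("labels", {}).get("job", ""),
            target.get("scrapeUrl", ""),
            target.get("lastError", ""),
        )
        for target in targets
        if target.get("health") != "up"
    ]


def fetch_targets(prometheus_base="http://localhost:9090", timeout=10):
    """Download and decode the targets payload from a Prometheus server."""
    url = prometheus_base.rstrip("/") + "/api/v1/targets"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Running `failing_targets(fetch_targets())` on the Watchdog host should list each unhealthy scrape job next to its last error message, which is the same information the Status > Targets page shows.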
In this case, my findings are consistent with what Elijah has noted. In #3 (the Scrape Duration panel) I can see that Watchdog is trying to scrape the Nairobi instance, but the scrape duration is always 10sec (which is the max duration when we time out). For bonus confirmation, I checked the Prometheus `api/v1/targets` data (#7, the last step above) and confirmed that the Nairobi scrape is failing with the error `context deadline exceeded`.
FTR, I was able to scrape the metrics successfully from my local machine, so I guess they are not blocking all IPs… Whatever the case, it does appear that connections from `watchdog.app.medicmobile.org` are not being allowed through.
This was discussed (see private link) back on Feb 21st. That discussion came up with a nice list of IPs to make sure are on the allow list:
- SSL Labs: `64.41.200.0/24` & `2600:C02:1020:4202::/64` - and see more details if needed.
- Watchdog: `13.38.33.129` (no IPv6)
- UptimeRobot: Many Dozens of IPs
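For anyone who wants to sanity-check a source address against this list before or after the firewall change, here is a small Python sketch using the standard `ipaddress` module. It only encodes the SSL Labs and Watchdog entries quoted above; the UptimeRobot ranges are left out because they are not enumerated in this thread. The names `ALLOW_LIST` and `is_allowed` are made up for the example, and this is not how the firewall itself is configured.

```python
import ipaddress

# Entries copied from the list above; the single Watchdog address is
# written as a /32 so every entry can be treated as a network.
ALLOW_LIST = [
    ipaddress.ip_network("64.41.200.0/24"),           # SSL Labs (IPv4)
    ipaddress.ip_network("2600:C02:1020:4202::/64"),  # SSL Labs (IPv6)
    ipaddress.ip_network("13.38.33.129/32"),          # Watchdog
]


def is_allowed(address, networks=ALLOW_LIST):
    """Return True if the address falls inside any allow-listed network."""
    ip = ipaddress.ip_address(address)
    # Membership tests across IP versions simply evaluate to False.
    return any(ip in network for network in networks)
```

For example, `is_allowed("13.38.33.129")` is true for the Watchdog address, while a neighbouring address such as `13.38.33.130` is not covered by any entry.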