Discrepancies over time in Watchdog Replication Apdex

mrjones · August 4, 2025, 9:58pm

Hi all! I’m confused about how Watchdog’s Replication Apdex behaves over time. For example, when I look at a large instance over the past 10 days (2025-07-24 00:00:00 - now), I see an Apdex of about 55 for 6 of the 10 days, on the left of the graph:

However, when I zoom out to the Last 30 days, that 6-day lull is mostly gone, replaced by a brief dip to just above 80. The high of above 90 in the last day or so is gone too, and the score never gets above 88.

My concern is that if I look at the second graph I might think all is well apart from a day or two of so-so performance, whereas the first graph suggests terrible things were happening for days and days.

Is there some averaging going on? I clicked on “Edit” on that graph but the calculation (see below and GH Deep link) isn’t very approachable:

(
    (
        sum(increase(cht_api_http_request_duration_seconds_bucket{instance=~"$cht_instance",route=~".*/get-ids",le="180",code=~"^2..$|^3..$"}[$__range]))
        + (
            sum(increase(cht_api_http_request_duration_seconds_bucket{instance=~"$cht_instance",route=~".*/get-ids",le="360",code=~"^2..$|^3..$"}[$__range]))
            - sum(increase(cht_api_http_request_duration_seconds_bucket{instance=~"$cht_instance",route=~".*/get-ids",le="180",code=~"^2..$|^3..$"}[$__range]))
        ) / 2
    ) / sum(increase(cht_api_http_request_duration_seconds_count{instance=~"$cht_instance",route=~".*/get-ids",code=~"^2..$|^3..$"}[$__range]))
) * 100

diana · August 5, 2025, 9:40am

The first graph is indeed confusing.

My guess is that the average is across all requests, while the graph shows the average for a specific date and does not show the total number of requests that succeeded on that date.

So if on 7/25 there were 5 requests with an Apdex of 50, but on 8/2 there were 100 requests with an Apdex of 95, the request-weighted average (for just these two days) would end up being about 93.
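
To make that arithmetic concrete, here is a small sketch in Python. The request counts and scores are the made-up numbers from the example above, not real Watchdog data, and the variable names are only for illustration. It compares a request-weighted average with a plain per-day average:

```python
# Hypothetical per-day data from the example: (request_count, apdex_score).
days = [
    (5, 50.0),    # 7/25: few requests, poor score
    (100, 95.0),  # 8/2: many requests, good score
]

total_requests = sum(count for count, _ in days)

# Request-weighted average: busy days dominate the result.
weighted_average = sum(count * score for count, score in days) / total_requests

# Plain average of the daily scores: every day counts equally.
simple_average = sum(score for _, score in days) / len(days)

print(f"weighted by requests: {weighted_average:.1f}")
print(f"simple per-day mean:  {simple_average:.1f}")
```

The weighted figure comes out at roughly 92.9 while the per-day mean is 72.5, so a quiet bad day barely moves a score that is computed across all requests.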

We can also see that during the low apdex period there is a really high error rate, and I believe that recent changes made it so errored requests are not counted towards the average.

mrjones · August 6, 2025, 9:32pm

Thanks @diana !

Checking the query more closely, I see it’s not actually an average of all requests, but only of requests to .*/get-ids that return a 2xx or 3xx response. Here’s a formatted excerpt of the query where you can more easily see the route value:

cht_api_http_request_duration_seconds_count{
   instance=~"$cht_instance",
   route=~".*/get-ids",
   code=~"^2..$|^3..$"
}

But yeah, I guess the original Apdex math we’ve recreated…

        SatisfiedCount + (0.5 * ToleratingCount) + (0 * FrustratedCount)
Apdex = ----------------------------------------------------------------
                                  TotalSamples

…means that over a longer period of time, the scores will smooth out like we’re seeing?
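
A rough way to check that intuition is the sketch below. It assumes, as in the query, that "satisfied" means finished within the le="180" bucket and "tolerating" means between the le="180" and le="360" buckets. The daily counts are invented for illustration. Since the query uses [$__range], my understanding is that each point is computed over a window as long as the dashboard's time range, which would be why a longer range pools more requests into each score:

```python
def apdex(satisfied, tolerating, total):
    """Apdex as a 0-100 score; frustrated requests contribute nothing."""
    if total == 0:
        return None  # no samples, so the score is undefined
    return 100 * (satisfied + 0.5 * tolerating) / total


# Hypothetical daily counts: (satisfied, tolerating, total).
bad_day = (40, 30, 100)      # low traffic, slow responses
good_day = (900, 60, 1000)   # high traffic, fast responses

per_day = [apdex(*bad_day), apdex(*good_day)]

# Pooling both days into one window, as a longer range would do.
pooled = tuple(a + b for a, b in zip(bad_day, good_day))
over_both_days = apdex(*pooled)

print(per_day)                    # the bad day stands out on its own
print(round(over_both_days, 1))   # pooled score sits close to the good day
```

With these numbers the two days score 55 and 93 on their own, but about 89.5 when pooled, because the pooled score is weighted by request volume rather than by day.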

@kenn - if you wanna weigh in, your input would be most welcome! You originally added this feature to Watchdog and you and I have tried recently to improve the API Apdex docs.