Hosting Total Cost of Ownership Squad

24 Mar 2026 Call

Attending

Notes

eCHIS KE

  • waiting on migration date to move data centers away from Konza
  • “big 5” instances are still seeing high CPU use (>99%) - but no one wants to change them before the move away from Konza
  • no changes to k8s hosting or remaining 4.21 → 5.1 upgrade.
  • some instances have high CPU use but have >3k users on only 8 cores, for example
  • would a multi-node couch instance be helpful? or what about a read only couch node?
    • strongly suggest not doing this, as multi-node has been shown to give only nominal performance gains while adding massive complexity for backup and restore and general day-to-day maintenance.
    • read only node is not supported by couch :frowning:

TCO v2:

  • Split up views into different ddocs
  • discussion about these graphs charting disk use across upgrades
  • splitting ddocs does indeed use more CPU and IOPS, which makes the upgrade take longer (150% in this test case)
  • for TCOv2, backing off to a simpler PR that will use constants for DDOC paths instead of hard-coded values, which makes it easier to re-balance the indexes at a later date. We’ll go on to TCO v3 and come back to v2 later as v3 is easier to achieve.
  • moving on to focus on index improvements in TCO v3: will try to merge improvements to existing indexes as they are now
  • by delaying v2 to focus on v3 now, we are knowingly causing painful re-indexes of medic client
  • amazing real world benefits of a production upgrade of 33% disk savings!! (private repo)
  • discuss interactive hosting cost calculator PR question and how it might benefit TCO planning call for MoH KE
  • otherwise interactive hosting cost calculator is moving along nicely
  • note that calculator is based on Operating Expenses (OpEx), where you pay month to month for resources as you do in AWS. It doesn’t speak well to the Capital Expenses (CapEx) of buying hardware, but may lend itself to coming up with CPU/RAM/DISK for CapEx
  • likely ship calculator in the next couple of weeks
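
A rough Python sketch of the OpEx vs CapEx note above: monthly cost is each resource times a monthly unit price, and the same specs can be summed as a starting point for hardware sizing. `InstanceSpec`, `PLACEHOLDER_RATES`, `monthly_opex` and `capex_sizing` are hypothetical names, and the rates are made-up placeholders, not figures from the calculator or any provider.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class InstanceSpec:
    """Resources for one hosted instance (hypothetical shape)."""
    cpu_cores: int
    ram_gb: int
    disk_gb: int


# Placeholder monthly unit prices; NOT real cloud or calculator rates.
PLACEHOLDER_RATES: Dict[str, float] = {"cpu_core": 10.0, "ram_gb": 2.0, "disk_gb": 0.25}


def monthly_opex(spec: InstanceSpec, rates: Dict[str, float] = PLACEHOLDER_RATES) -> float:
    """Month-to-month cost: each resource times its monthly unit price."""
    return (
        spec.cpu_cores * rates["cpu_core"]
        + spec.ram_gb * rates["ram_gb"]
        + spec.disk_gb * rates["disk_gb"]
    )


def capex_sizing(specs: List[InstanceSpec]) -> InstanceSpec:
    """Sum CPU/RAM/disk across instances as a starting point for hardware sizing."""
    return InstanceSpec(
        cpu_cores=sum(s.cpu_cores for s in specs),
        ram_gb=sum(s.ram_gb for s in specs),
        disk_gb=sum(s.disk_gb for s in specs),
    )
```

The sizing function only totals resources; it says nothing about hardware purchase price, which is the part the notes flag as not well covered.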

31 Mar 2026 Call

Attending

Notes

eCHIS KE

  • Discuss turkana in k3s that is restarting
    • k8s error is “OOMkilled” for couch, but API is “terminated with 1: error” - v odd
    • it is noted that API, sentinel and all services are restarting. - this is because couch crashes and these are sympathetic crashes
    • diving into grafana dashboard
    • there is an overlap of API and couch running more than one instance for minutes at a time, which is suspicious and not good. Diana wonders about strategy.type: Recreate vs RollingUpdate
    • diana also notes: The default deployment strategy in Kubernetes is RollingUpdate
    • Elijah to set up log streaming so we can see logs from prior OOMed containers
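
For the Recreate vs RollingUpdate question above, a minimal Python sketch of a patch body that would switch a workload to `Recreate`, so the old pod is stopped before the new one starts. This assumes the couch and API workloads are Kubernetes Deployments, which the notes do not confirm. `recreate_strategy_patch` is a hypothetical helper name.

```python
import json


def recreate_strategy_patch() -> dict:
    """Build a patch body that switches a Deployment to the Recreate strategy.

    rollingUpdate is set to None (JSON null) so leftover rolling-update
    settings are cleared; Kubernetes rejects them when type is Recreate.
    """
    return {"spec": {"strategy": {"type": "Recreate", "rollingUpdate": None}}}


if __name__ == "__main__":
    # Hypothetical usage: pass the printed JSON to
    # `kubectl patch deployment <name> -p '<json>'`.
    print(json.dumps(recreate_strategy_patch()))
```

The trade-off is a short window with no pod running during each rollout, in exchange for never having two copies overlap.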

TCO v2:

7 Apr 2026 Call

Attending

Notes

eCHIS KE

  • tk

TCO v2:

  • Optimize indexes to reduce disk use (Hosting TCO 3.0)

  • Discuss cold storage

    • old issue based on just tasks. this ticket specifies massive list of UUIDs to purge. in future could offer a feature “purge tasks older than X” or “purge reports older than Y”
    • POC is underway with new purge endpoint in couch 3.5.0
    • promising results!
    • need to be careful about views getting too far out of date where a race condition causes them to no longer be indexed and queries fail
    • testing on small 8 core EC2 instance with 2M docs
    • 1M docs purged in about the time it takes to write 1M, which is a good speed
    • docs would be written to some un-indexed DB (eg “cold storage”) as a fall back
    • indexes get rebuilt as IDs are put into cold storage. currently batch size is 10k docs purged, and then all views are queried so they’re up to date. This causes heavy CPU use during purging
    • UUIDs on offline clients that get purged on server could get re-uploaded. need to be careful to ensure that too high a percent doesn’t get re-uploaded. it will only be tasks or reports that are updated offline after last sync that will be pushed up, i.e. very old reports and tasks very likely will not get pushed up.
    • per above, feature should likely be opinionated (ie delete on server and on client)
    • this will work back to 4.20 when we released CouchDB 3.5
    • By end of April, Diana wants to do more tests on small server (~8 cores/16GB) to make sure it’s stable across large amount of docs
    • Any CHT core endpoints in support of this will be after 5.2
  • Medic EKS cluster had a number of instances upgrade 4.x->5.x and CPU usage was high enough on couch k8s nodes to cause brief outages

    • one idea is to go back to prior 4.x → 4.x upgrades to see how the cluster behaved when NP did a bunch of upgrades prior
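
The batched purge from the cold storage notes above can be sketched in Python as below. It assumes each purge request body maps a doc ID to the list of revisions to purge, and only builds the batches; sending them and querying the views after each batch is left out. `purge_batches` is a hypothetical helper, and the 10k default mirrors the batch size mentioned in the notes.

```python
from itertools import islice
from typing import Dict, Iterator, List


def purge_batches(
    revs_by_id: Dict[str, List[str]], batch_size: int = 10_000
) -> Iterator[Dict[str, List[str]]]:
    """Yield purge request bodies of at most batch_size doc IDs each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    items = iter(revs_by_id.items())
    while True:
        # Take the next slice of (doc_id, revs) pairs; empty means we are done.
        batch = dict(islice(items, batch_size))
        if not batch:
            return
        yield batch
```

A smaller batch size would mean more frequent view refreshes but shorter CPU spikes per batch, which may matter on the small test servers.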

14 Apr 2026 Call

Attending

Notes

eCHIS KE

  • Nairobi
    • was unhealthy for past 6 weeks - with high CPU and high sync replication latency
    • a couch restart fixed it (for unknown reasons)
    • sentinel backlog came down to zero, but then started to creep up
    • logs were shared and it was discovered there was a HUGE volume of outbound push failures
    • by having these failures return a 200 instead of a 404 sentinel stopped retrying and sentinel backlog promptly went back to zero
  • turkana
    • still restarting
    • nothing in couchdb logs, so restarts seem to be coming from hypervisor
    • possibly compare to other k8s instances which don’t have the problem to see if there’s a difference?
  • discuss the “fit for future” document, and how a national deployment will scale to 100k+ users

TCO:

  • Optimize indexes to reduce disk use (TCO v3.0)
    • no updates
  • TCO v2
    • minor work done on renaming constants in prep for future work
  • cold storage (TCO v4?)
    • more tests run, still showing promising results on churn
    • no major announcements/discoveries
    • holding off on writing purged docs to cold storage DB b/c writes are boring and a known quantity. Looking for unknowns that might have negative impact
    • first iteration will be simplistic and accept a list of a boatload of UUIDs. this requires deployments to create this list first before they can purge → cold storage
    • hoping to ship this first iteration in 5.3, which assumes 5.2 ships in weeks not months from now
    • maybe cht core 5.4 has automated cold storage enabled?
    • good discussion on what do we ship? do we enable something by default? maybe all tasks older than 6mo are auto cold storaged?

21 Apr 2026 Call

Attending

Notes

eCHIS KE

  • discuss Nairobi having 100k’s of push errors a day and not knowing. While log analysis should be part of daily or weekly best practice, exposing this in the monitoring API, and thus watchdog, would make it a lot easier to monitor and alert
  • @mrjones to file an improvement ticket on core to expose this monitoring endpoint in a future release

TCO:

  • Optimize indexes to reduce disk use (TCO v3.0)

    • tests are failing :frowning:
    • in one test, consistently failing on web driver io, possible regression in test
    • Diana offering to take one of the PRs to review to offload Josh
    • discussion around failed tests with API & SMS flows - failing in sentinel-api-transitions.spec.js test
    • failed test error is:
     5) transitions
        should run all sync transitions and all async transitions:  AssertionError: expected 12 to equal 10
       + expected - actual
    
       -12
       +10
    
       at cht-core/tests/integration/transitions/sentinel-api-transitions.spec.js:502:50
       at process.processTicksAndRejections (node:internal/process/task_queues:105:5)
    
    • tom to work on this offline after the squad call and try and isolate the issue so others might work on it
  • TCO v2

    • minor work done on renaming constants in prep for future work
    • no updates
  • cold storage (TCO v4)

    • should wait until full featured ticket is ready in a TBD API endpoint
    • need to check for regressions in phone performance and other offline possible things that might get impacted
    • tests being done on a tiny c4.large EC2 instance with 2.7 mil docs - see slack, but error log is:
     [notice] 2026-04-17T17:20:59.863194Z couchdb@127.0.0.1 <0.3597.0> 57f92006c1 haproxy:5984 172.18.0.3 admin GET /medic/_design/medic-offline-freetext 404 ok 9
     [notice] 2026-04-17T17:21:02.418474Z couchdb@127.0.0.1 <0.4055.0> a5c56501b6 undefined 172.18.0.6 admin GET /_up 200 ok 14
     [os_mon] cpu supervisor port (cpu_sup): Erlang has closed
     [os_mon] memory supervisor port (memsup): Erlang has closed
     Killed
    
    • diana to try to repro the crash outside of CHT in couch only, as well as bump ec2 instance from 2core → 4core & 8GB → 16GB to see if crash persists

28 Apr 2026 Call

Attending

Details

  • Cold storage (Diana)
    • Not many updates from last week
    • Small VM was crashing Couch
      • Tried a few workarounds (including bumping the resources)
      • Have not been able to replicate with direct requests to Couch.
      • Giving machine more resources made the crashing stop
  • Refactoring views (Tom)
    • 3 open PRs in progress
    • Looking into possibly removing indexing instead of just refactoring.
    • Tests failing reliably and Tom has been investigating issue with wdio.
    • Integration tests change Sentinel timing around change hydration, which triggers a latent race condition.
    • Targeting to ship all these index changes in 5.3
  • MoH_KE
    • Waiting to move to new data center.
    • Weird issue with Sentinel not processing backlog. Hangs and does not continue until the container is restarted.
    • Seeing this behavior on multiple instances (the most busy).

5 May 2026 Call

Attending

Details

Ice Breaker

  • bolognese
  • Josh recommends: chicory + coffee + milk (milk takes edge off chicory)

MoH_KE

  • discuss instance health dashboard (internal link)
  • Look at AI/BI integration against 6 instances and what the impact is
  • nginx rate limiting may solve one problem, but if something else is done, like an additional index being added which takes weeks to build, nginx rate limiting won’t help
  • for integrations to not hit couchdb directly, TCO squad recommends these ideas:
    • Using CHT Sync’s Postgres DB (confirmed this is working and at 1min latency)
    • Set up a new server with couch replication. as CHT Sync puts all counties in one DB and table, this could offer a nice permission structure of 1:1 per county
    • Materialized views in CHT Sync Postgres is another solution that could both offer speed and per county granular permissions

Cold storage

  • Diana’s under PR PR PR PR business and no time for cold storage work
  • Josh mentioned he feels this is the most important feature - more important than other PR from C4GT et al.
  • as it’s not ready now, not targeting 5.2, so hopeful in 5.3. However, like cold storage, this is top priority and if it’s ready tomorrow, we should ship in 5.2

Index optimizing/refactoring

Breaking up Indexing into smaller pieces

  • on hold while we pursue other proven methods (see cold storage and index optimizing above)
  • ticket

12 May 2026 Call

Attending

Details

Ice breaker:

  • how do you like your coffee roasted? med? dark? light?
  • DIY roasting with flour sifter and heat gun

Sync health:

Cold storage

  • top priority!!
  • PR PR PR work and other features is diluting focus on this feature
  • once it comes up to be worked, there’s a good amount of effort to figure out how this works without too much negative impact, so it will not be super quick

Couch 3.5.2 coming soon! brings a bunch of performance improvements, including features in 3.5.1 we couldn’t take because of a bug which should be fixed!

19 May 2026 Call

Attending

Details

Ice breaker:

  • weather in the US vs mountains and conspiracy theory
  • eurovision

TCO:

  • hard limit
    • josh has made no progress since last week
    • pre purge vs post purge limits need to be researched
    • at top of josh’s (oversubscribed) queue
  • optimizing indexes
    • one PR merged
    • doc summaries PR out for review to Josh
    • discuss pros and cons of making API do work in terms of processing lots of documents (vs streaming to client and having it be done on browser side)
    • larger discussion of API & Sentinel being single threaded
    • tom waiting for review from Diana on race condition for sync transitions PR which is blocking other work for improving indexes.
    • all 4 items in improving index ticket will cause expensive index rebuilding so they all need to be shipped together so this expense is only paid once (as opposed to multiple times) - likely 5.3 is the soonest, but really, when they’re all done. This could be sooner or could be later.

26 May 2026 Call

Attending

Details

House keeping

  • welcome Sarah Mirembe, joined briefly at the top of the call

TCO:

  • hard limit

    • no updates due to other work
  • optimizing indexes

    • 3 of the 4 issues are merged into the mega PR which in turn will be merged to main
    • looking to merge it all to same branch even though CI will fail to get performance baseline of improvements
    • Optimize large map reduce views #10636 has timing issue which hits pre-existing race condition - Tom stuck here, but has some possible ideas around locking (pseudo locking)?

replication health

cold storage

Off topic, but chatted about graviton testing (private repo)

2 Jun 2026 Call

Attending

Details

eCHIS KE

cold storage

  • issue is going well, almost done

  • still targeting 5.3

  • Diana would love to have eCHIS KE test alpha version of this to validate effectiveness

  • Elijah confirmed should be available

  • hard limit

    • no updates - still outstanding
  • optimizing indexes

    • work in progress, same updates as stewardship call prior

replication health

  • replication failure monitoring
  • changed structure of storage to keep 30 days of data per PR feedback
  • storing failures per day in a doc, and then index that by user, day, and error
  • this ensures monitoring endpoint is cheap to call
  • will enable whole deployment monitoring as well as individual user debugging
  • PR for monitoring
  • next step will be showing a list of users that can never replicate
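
The per-day storage described above can be sketched in Python as one counter per day keyed by (user, error), pruned to 30 days. This is an in-memory stand-in under assumed names (`FailureStore`, `record_failure`, `prune`, `failures_for_user`); the real doc shape and index live in the PR and are not shown here.

```python
from collections import Counter
from datetime import date, timedelta
from typing import Dict

RETENTION_DAYS = 30  # from the notes: keep 30 days of data

# ISO day string -> Counter keyed by (user, error); one "doc" per day.
FailureStore = Dict[str, Counter]


def record_failure(store: FailureStore, day: date, user: str, error: str) -> None:
    """Increment the (user, error) count in that day's doc, creating it if needed."""
    store.setdefault(day.isoformat(), Counter())[(user, error)] += 1


def prune(store: FailureStore, today: date) -> None:
    """Drop daily docs outside the retention window (today counts as day 1)."""
    cutoff = (today - timedelta(days=RETENTION_DAYS)).isoformat()
    # ISO date strings sort chronologically, so string comparison is enough.
    for day in [d for d in store if d <= cutoff]:
        del store[day]


def failures_for_user(store: FailureStore, user: str) -> Dict[str, int]:
    """Per-day failure totals for one user (individual debugging)."""
    totals: Dict[str, int] = {}
    for day, counts in sorted(store.items()):
        n = sum(c for (u, _), c in counts.items() if u == user)
        if n:
            totals[day] = n
    return totals
```

Because reads only touch at most 30 small daily docs, a summary call stays cheap regardless of how many raw failures occurred.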

9 Jun 2026 Call

Attending

Details

  • eCHIS KE

    • 5.1.3 is out, Elijah planning upgrading next week for affected counties with the issue
    • seeing Nouveau error on CHT 4.22 which is odd: 2026-06-09 08:23:11.792error{{case_clause,{error,nouveau_not_enabled}},[{ken_server,maybe_start_job,4,[{file,"src/ken_server.erl"},.... Diana thinks this is safe - it’s couchdb trying to index the Nouveau indexes that were created in prep for upgrade that got abandoned
    • loki log retention defaults to indefinite, will set this to sane value to ensure disk isn’t depleted
  • cold storage

    • had beefy test running, overwhelmed API so config changes weren’t being processed, causes tests to fail
    • Discuss single threaded nature of API
  • hard limit

    • no update currently
    • at top of Josh’s queue after UI extensions which is going out soon - and he hopes to be working on it today
    • recap of feature: when server hits predefined upper limit (say 100k docs), the replication request is hard reset
    • existing users failed replication log will be updated
  • optimizing indexes

    • PRs under review, actively being worked on
  • replication failure monitoring

    • optimistically targeting 5.2, but may get bumped to 5.3
    • adding to `/api/v2/monit
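
The hard limit recap above can be sketched in Python as a check that rejects the request and appends to the user's failure log. All names here (`DOC_LIMIT`, `FailureLog`, `ReplicationDecision`, `check_hard_limit`) are hypothetical, and treating the limit as "strictly more than" is an assumption; the notes only say "hits".

```python
from dataclasses import dataclass, field
from typing import List

DOC_LIMIT = 100_000  # example ceiling from the notes ("say 100k docs")


@dataclass
class ReplicationDecision:
    allowed: bool
    reason: str = ""


@dataclass
class FailureLog:
    """Stand-in for the user's failed replication log."""
    entries: List[str] = field(default_factory=list)


def check_hard_limit(user: str, doc_count: int, log: FailureLog,
                     limit: int = DOC_LIMIT) -> ReplicationDecision:
    """Reject the replication request when the user's doc count exceeds the limit."""
    if doc_count > limit:
        reason = f"{user}: {doc_count} docs exceeds limit of {limit}"
        log.entries.append(reason)  # record the failure for later debugging
        return ReplicationDecision(allowed=False, reason=reason)
    return ReplicationDecision(allowed=True)
```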

16 Jun 2026 Call

Attending

Details

  • eCHIS KE

    • no upgrades to 5.1.3 patch release yet.
    • still waiting for large instances still on 4.x to be upgraded
  • cold storage

    • no updates issue
    • will come after 5.2 (currently focused on couch db uplift to 3.5.2 & timeout issue)
    • as it’s coming after 5.2, which will likely have couchdb 3.5.2, new couch has new purging endpoint which is supposed to be faster
  • hard limit

    • we’ll be setting limits on : too many subject IDs, too many contact IDs, and finally all doc IDs
    • limiting subject IDs will prevent uploads which we likely want to allow
    • will write entry to replication failure log for each event
    • do we want to cache the failures so we can easily look up whether we try to replicate or just auto fail until cache expires
    • Things that might trigger a cache clearing: purging, replication depth change, user’s contact is moved in hierarchy or other admin based actions
    • considering setting a “can not replicate, too many docs” flag on the _users document, which an admin needs to reset to unblock the user and enable them to sync again
    • ideally they should be able to push new data even if they can’t pull , which would mean ignoring limits on subject
  • replication failure monitoring

    • in the end, the feature should enable easy debugging of what the username is, what role do they have, where are they in the hierarchy - these enable the app admin to fix the user
    • user doc count gives count of docs for users who replicate successfully and the replication failure logs give replication errors and the endpoint will give a nice tidy synopsis of list of users who are broken and not showing up in either list. eg. user mrjones hasn’t replicated in 17 days, but they tried 1224 times, they’re likely misconfigured.
    • there is a use case Josh knows of where a supervisor wants to report on their CHW’s success or failures
    • there’s gonna be some interplay with this feature and the hard limits feature, but the superset of users who are affected is important
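
The "mrjones" example in the replication failure monitoring notes above can be sketched in Python as a filter over per-user state: flag anyone who keeps trying but has not succeeded recently. `UserSyncState`, `likely_broken` and both thresholds are hypothetical; the notes do not specify cut-offs.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Optional


@dataclass
class UserSyncState:
    last_success: Optional[date]  # None if the user has never replicated
    failed_attempts: int          # failures recorded in the failure log


def likely_broken(users: Dict[str, UserSyncState], today: date,
                  stale_days: int = 14, min_attempts: int = 100) -> List[str]:
    """Users who keep trying to replicate but have not succeeded recently."""
    flagged = []
    for name, state in sorted(users.items()):
        stale = (state.last_success is None
                 or (today - state.last_success).days >= stale_days)
        if stale and state.failed_attempts >= min_attempts:
            flagged.append(name)
    return flagged
```

Users blocked by the hard limit would presumably show up in the same list, which is where the interplay between the two features comes in.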