Hosting Total Cost of Ownership Squad

24 Mar 2026 Call

Attending

Notes

eCHIS KE

  • waiting on migration date to move data centers away from Konza
  • “big 5” instances are still seeing high CPU use (>99%), but no one wants to change them before the move away from Konza
  • no changes to k8s hosting or remaining 4.21 → 5.1 upgrade.
  • some instances with high CPU use are serving >3k users on only 8 cores, for example
  • would a multi-node couch instance be helpful? or what about a read only couch node?
    • strongly suggest not doing this: multi-node has been shown to give only nominal performance gains while adding massive complexity to backup, restore and general day-to-day maintenance.
    • read only node is not supported by couch

TCO v2:

  • Split up views into different ddocs
  • discussion about these graphs charting disk use across upgrades
  • splitting ddocs does indeed use more CPU and IOPS, which makes the upgrade take longer (150% in this test case)
  • for TCO v2, backing off to a simpler PR that uses constants for ddoc paths instead of hard-coded values, which makes it easier to re-balance the indexes at a later date. We’ll go on to TCO v3 and come back to v2 later, as v3 is easier to achieve.
  • moving on to focus on index improvements in TCO v3: will try to merge improvements to existing indexes as they are now
  • by delaying v2 and focusing on v3 now, we are knowingly causing painful re-indexes of medic client
  • amazing real-world benefit: a production upgrade showed 33% disk savings!! (private repo)
  • discussed a question on the interactive hosting cost calculator PR and how the calculator might benefit the TCO planning call for MoH KE
  • otherwise interactive hosting cost calculator is moving along nicely
  • note that the calculator is based on Operating Expenses (OpEx), where you pay month to month for resources as you do in AWS. It doesn’t speak well to the Capital Expenses (CapEx) of buying hardware, but may lend itself to coming up with CPU/RAM/disk sizing for CapEx
  • likely ship calculator in the next couple of weeks
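The OpEx vs CapEx point above can be made concrete with a little arithmetic. This is a minimal sketch only, not the calculator's actual logic: the `Sizing` type, both function names, and every price used with them are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sizing:
    """Resources a deployment needs; the same figures feed OpEx and CapEx planning."""
    cpu_cores: int
    ram_gb: int
    disk_gb: int


def monthly_opex(sizing: Sizing, per_core: float, per_gb_ram: float, per_gb_disk: float) -> float:
    """Month-to-month cost when renting resources (all unit prices are hypothetical)."""
    return (
        sizing.cpu_cores * per_core
        + sizing.ram_gb * per_gb_ram
        + sizing.disk_gb * per_gb_disk
    )


def capex_monthly_equivalent(hardware_price: float, lifetime_months: int) -> float:
    """Spread a one-off hardware purchase over its expected life for comparison."""
    if lifetime_months <= 0:
        raise ValueError("lifetime_months must be positive")
    return hardware_price / lifetime_months
```

The same `Sizing` values could be handed to a hardware vendor for a CapEx quote, which is the sense in which an OpEx calculator may still help with CapEx planning.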

31 Mar 2026 Call

Attending

Notes

eCHIS KE

  • Discuss Turkana in k3s that is restarting
    • k8s error is “OOMKilled” for couch, but API is “terminated with 1: error”, which is very odd
    • API, sentinel and all other services are restarting too: couch crashes first and the others are sympathetic crashes
    • diving into grafana dashboard
    • there is an overlap where API and couch are each running more than one instance for minutes at a time, which is suspicious and not good. Diana wonders about strategy.type: Recreate vs RollingUpdate
    • Diana also notes: the default deployment strategy in Kubernetes is RollingUpdate
    • Elijah to set up log streaming so we can see logs from prior OOMed containers
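For the Recreate vs RollingUpdate question, here is a minimal sketch of the patch body that would switch a workload's strategy. It assumes the couch and API workloads are Kubernetes Deployments, and the function name is hypothetical. Whether Recreate is the right choice here is still an open question from the call.

```python
import json


def recreate_strategy_patch() -> dict:
    """Build a merge patch that sets a Deployment's strategy to Recreate."""
    return {
        "spec": {
            "strategy": {
                "type": "Recreate",
                # None serialises to JSON null, which removes any existing
                # rollingUpdate settings; the API rejects rollingUpdate
                # when the strategy type is Recreate.
                "rollingUpdate": None,
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(recreate_strategy_patch()))
```

The printed JSON could be applied with `kubectl patch deployment <name> --type merge -p '<json>'`. With Recreate, the old pod is terminated before the new one starts, so two couch or API pods would not overlap during a rollout, at the cost of a short gap in availability.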

TCO v2:

7 Apr 2026 Call

Attending

Notes

eCHIS KE

  • tk

TCO v2:

  • Optimize indexes to reduce disk use (Hosting TCO 3.0)

  • Discuss cold storage

    • the old issue was based on just tasks. this ticket specifies a massive list of UUIDs to purge. in future we could offer a feature like “purge tasks older than X” or “purge reports older than Y”
    • POC is underway with new purge endpoint in couch 3.5.0
    • promising results!
    • need to be careful about views getting too far out of date, where a race condition can cause them to stop being indexed and queries to fail
    • testing on small 8 core EC2 instance with 2M docs
    • 1M docs were purged in about the time it takes to write 1M docs, which is a good speed
    • docs would be written to some un-indexed DB (eg “cold storage”) as a fallback
    • indexes get rebuilt as IDs are put into cold storage. currently the batch size is 10k docs purged, and then all views are queried so they’re up to date. This causes heavy CPU use during purging
    • UUIDs on offline clients that get purged on the server could get re-uploaded. need to be careful to ensure that too high a percentage doesn’t get re-uploaded. only tasks or reports that are updated offline after the last sync will be pushed up, i.e. very old reports and tasks very likely will not get pushed up.
    • per above, feature should likely be opinionated (ie delete on server and on client)
    • this will work as far back as 4.20, when we released CouchDB 3.5
    • By end of April, Diana wants to do more tests on a small server (~8 cores/16GB) to make sure it’s stable across a large number of docs
    • Any CHT core endpoints in support of this will be after 5.2
  • Medic EKS cluster had a number of instances upgrade 4.x → 5.x, and CPU usage was high enough on couch k8s nodes to cause brief outages

    • one idea is to go back to prior 4.x → 4.x upgrades to see how the cluster behaved when NP did a bunch of upgrades previously
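The batching described in the cold storage notes (purge 10k docs, then query all views before the next batch) can be sketched as below. This is not the POC's code: the function name is hypothetical, and the payload shape is an assumption based on CouchDB's documented `_purge` request body, which maps each doc ID to a list of revisions.

```python
from typing import Dict, Iterator, List


def purge_batches(
    revs_by_id: Dict[str, List[str]], batch_size: int = 10_000
) -> Iterator[Dict[str, List[str]]]:
    """Yield purge payloads of at most batch_size docs each.

    Each yielded dict maps doc ID -> revisions to purge. The caller is
    expected to send one payload, bring the views up to date, and only
    then move on to the next payload.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: Dict[str, List[str]] = {}
    for doc_id, revs in revs_by_id.items():
        batch[doc_id] = list(revs)
        if len(batch) == batch_size:
            yield batch
            batch = {}
    if batch:  # final, partially filled batch
        yield batch
```

Keeping the batch size as a parameter makes it easy to trade CPU spikes during view updates against total purge time while testing on the small server.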

14 Apr 2026 Call

Attending

Notes

eCHIS KE

  • Nairobi
    • was unhealthy for the past 6 weeks, with high CPU and high sync replication latency
    • a couch restart fixed it (for unknown reasons)
    • sentinel backlog came down to zero, but then started to creep up
    • logs were shared and it was discovered there was a HUGE volume of outbound push failures
    • after these failing pushes were changed to return a 200 instead of a 404, sentinel stopped retrying and the sentinel backlog promptly went back to zero
  • Turkana
    • still restarting
    • nothing in couchdb logs, so the restarts seem to be coming from the hypervisor
    • possibly compare to other k8s instances which don’t have the problem to see if there’s a difference?
  • discuss the “fit for future” document, and how a national deployment will scale to 100k+ users

TCO:

  • Optimize indexes to reduce disk use (TCO v3.0)
    • no updates
  • TCO v2
    • minor work done on renaming constants in prep for future work
  • cold storage (TCO v4?)
    • more tests run, still showing promising results on churn
    • no major announcements/discoveries
    • holding off on writing purged docs to cold storage DB b/c writes are boring and a known quantity. Looking for unknowns that might have negative impact
    • first iteration will be simplistic and accept a list of a boatload of UUIDs. this requires deployments to create this list first before they can purge → cold storage
    • hoping to ship this first iteration in 5.3, which assumes 5.2 ships in weeks not months from now
    • maybe cht core 5.4 has automated cold storage enabled?
    • good discussion on what do we ship? do we enable something by default? maybe all tasks older than 6mo are auto cold storaged?
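Since the first iteration expects deployments to supply the UUID list, an age-based rule such as "all tasks older than 6mo" would first need to be turned into that list. A minimal sketch under loud assumptions: the `type` and `timestamp_ms` field names are hypothetical and are not taken from the CHT data model, and 183 days stands in for six months.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Mapping


def to_epoch_ms(moment: datetime) -> int:
    """Convert an aware datetime to milliseconds since the Unix epoch."""
    return int(moment.timestamp() * 1000)


def ids_older_than(
    docs: Iterable[Mapping],
    now: datetime,
    max_age_days: int = 183,
    doc_type: str = "task",
) -> List[str]:
    """Return the _id of every doc of doc_type older than max_age_days.

    Docs without a timestamp are skipped so nothing is selected by accident.
    """
    cutoff_ms = to_epoch_ms(now - timedelta(days=max_age_days))
    selected = []
    for doc in docs:
        timestamp = doc.get("timestamp_ms")  # hypothetical field name
        if doc.get("type") != doc_type or timestamp is None:
            continue
        if timestamp < cutoff_ms:
            selected.append(doc["_id"])
    return selected
```

Passing `now` in explicitly keeps the selection reproducible, which matters if the same list is later compared against what offline clients try to re-upload.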