Hosting Total Cost of Ownership Squad

27 May 2025 Call

Attending

Notes

  • General discussion about “upgrades take a lot of space” - very likely Hosting TCO Squad 2.0 focus
    • why does CHT upgrading (adding new DDocs) cause more than a 100% increase in disk space?
    • view changes are what cause a large increase in disk space, which do not happen every upgrade
    • @jkuester to file POC ticket in CHT Core to show how disk space goes up on a generic couch instance with view re-indexing. Discuss w/ @diana and @twier. We’ll then take this over to Couch Slack channel for questions.
  • @elijah - looking to have VMs to test eCHIS KE upgrades on clones of their production from 4.11 to nouveau@master
  • @mrjones - confirm how much extra disk space is used when upgrading 4.19/4.18 → nouveau@master

3 Jun 2025 Call

Attending

Notes

  • Review main ticket
  • with any ddoc change, all views are rebuilt
    • See related comment on research ticket
    • Q: do we know for certain if a 4.19 → 4.20 upgrade will cause this spike in disk use? A: From the current diff of 4.19.0...master, I do not see any changes to the ddoc files
    • Need for more disk space upgrade testing on ec2 test instance with prod data?
  • MoH KE update on testing Nouveau on prod data
    • had to work through MoH process for provisioning a new VM
    • got a large and small cloned instance
    • will commence testing once ready - but need to make sure large instance has enough free space before proceeding
  • hosting TCO 2.0
    • consider moving views to more/different ddocs?
    • existing research ticket
    • current ticket status is “it’s tricky”
    • still looking into this effort though as it’s still quite promising - it is likely a key part of path forward to address 5x disk space in upgrading
    • relevant findings in shard/cpu research on forums
    • Tom exploring why exactly the space used is more than the ~5x we see in production. Ideally it’s ~2x.
    • early research showing map reduce (including, but beyond, freetext) views are resource intensive (both CPU & Disk)
    • some early research on removing freetext views

10 Jun 2025 Call

Attending

Notes

  • Review main ticket
  • Review large MoH deployment testing Nouveau branch on prod data
    • How much spare disk space do they need?
    • Sugat concerned about more than 5x as seen in research ticket
    • Elijah: I’m in the process of procuring additional storage to begin the upgrade process and wanted to clarify that we settled on 5x current capacity. E.g. one clone has a 5TB volume with 3.2TB utilization: should we expand to 16TB, or up to 25TB to get some margin of safety?
    • recommendation: 16TB should be fine since the utilization is 3.2TB
    • Interrupted upgrades should both not lose all progress, and be able to resume where they left off
    • recommend starting in MoH Data Center with 8vCPU/16GB RAM to see how it goes. Success or failure will help inform the TCO Squad’s next steps.
  • Maybe if a 3.2TB instance needs >16TB (25TB!!?!), do we need to ship TCO V2 (eg in 4.2x) before TCO v1 (in 5.0) which will reduce total space needed
  • TCO Squad agrees that we should wait for Elijah’s testing in MoH datacenter, which may take weeks, possibly months in the worst case, to complete. In the interim, upgrade disk space research can continue
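
The 16TB vs 25TB sizing question above is simple arithmetic. A minimal sketch, assuming the squad's working rule of thumb of 5x current utilization; the function names are illustrative and not part of any CHT tool:

```python
def required_volume_tb(used_tb: float, multiplier: float = 5.0) -> float:
    """Volume size needed if an upgrade may temporarily need `multiplier` times current usage."""
    return used_tb * multiplier


def effective_multiplier(volume_tb: float, used_tb: float) -> float:
    """How many times the current usage fits on a volume of the given size."""
    return volume_tb / used_tb


used = 3.2  # TB in use on the clone discussed above
print(f"5x rule: {required_volume_tb(used):.1f} TB")
for size in (16, 25):
    print(f"{size} TB volume = {effective_multiplier(size, used):.2f}x current usage")
```

The 16TB option sits right at the 5x rule with no margin, while 25TB leaves room for a spike of nearly 8x.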

24 Jun 2025 Call

Attending

Notes

  • Review main ticket
  • Does eCHIS KE need to test Nouveau before we release 5.0?
    • nothing will really change for the branch if we don’t test MoH KE before the release
    • eCHIS KE only has 50% of avail storage, so upgrades won’t be easy b/c they don’t have 5x avail free disk
    • input from eCHIS KE: prepare community for what they’ll need to upgrade and what the benefits will be
    • eCHIS has two instances, one small and one big, small is done and big is maybe 3/4 done.
    • will retest and run CHT Toolbox to closely monitor disk space
    • Despite the earlier choice to wait, we will not wait for eCHIS KE test results. If we get lucky and it’s done beforehand, we’ll incorporate findings, but it’s not blocking.
  • k8s effort underway
  • review 5.0 milestone

8 Jul 2025 Call

Attending

Notes

  • Review main ticket

  • eCHIS KE upgrade tests

    • small test completed, but larger test instance has stalled - it keeps rebuilding views in a loop. Possible to pre-stage the views?
    • this issue isn’t tied to Nouveau, but to issues endemic to KE Datacenter
    • current theory is that disk access to SAN is so slow (high IO Wait) that happy path upgrade fails
    • possibly have a “go in serial” as opposed to “go in parallel” check box on “upgrade” button?
  • action items:

    • get a copy of upgrade logs
    • manually stage views for just medic-client to see if it works - should be good enough to update CHT Toolbox to do it. Already successfully done!
    • create a branch of CHT Toolbox to auto-serially upgrade an instance
    • set up cAdvisor and Node Exporter to get hard numbers on how fast/slow hardware/disk/network is
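
The "go in serial" idea above can be sketched against plain CouchDB. This is a minimal illustration, not how CHT Toolbox or the CHT upgrade service works: it assumes an unauthenticated CouchDB URL, and the helper names are hypothetical. It relies on the fact that querying any one view of a ddoc makes CouchDB build the index for every view in that ddoc, and that `limit=0` returns no rows.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Iterable, List, Tuple


def view_warm_url(base_url: str, db: str, ddoc: str, view: str) -> str:
    """Build a view query URL; limit=0 returns no rows but still triggers the index build."""
    parts = (db, "_design", ddoc, "_view", view)
    path = "/".join(urllib.parse.quote(p, safe="") for p in parts)
    return f"{base_url.rstrip('/')}/{path}?limit=0"


def http_get(url: str) -> object:
    """Blocking GET. On very large indexes this can time out at a proxy before the build ends."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def warm_serially(
    base_url: str,
    db: str,
    views: Iterable[Tuple[str, str]],
    fetch: Callable[[str], object] = http_get,
) -> List[str]:
    """Query one (ddoc, view) pair at a time so this script never starts two builds at once."""
    warmed = []
    for ddoc, view in views:
        fetch(view_warm_url(base_url, db, ddoc, view))
        warmed.append(f"{ddoc}/{view}")
    return warmed
```

For example, `warm_serially("http://localhost:5984", "medic", [("medic-client", "some_view")])` would warm one ddoc, where `some_view` stands in for any real view name in that ddoc. Authentication and retry-after-timeout handling are left out on purpose.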

22 Jul 2025 Call

Attending

Notes

  • MoH KE upgrade test:
    • Josh provided branch of CHToolbox to do sequential updates
    • CHToolbox - it’s working but active tasks shown in TUI are different than active tasks shown in couch Fauxton e.g. tasks outside of medic db (eg medic-client). Can we look into why these non-medic indexes are getting run?
    • Any way to get the index to resume if it gets interrupted half way through (or 20% of the way through)?
    • Josh will try writing 1 ddoc at a time and warm it which would hopefully allow just one index at a time and also avoid other dbs like medic-client from being active
  • 5.0 effort - what do we need to do to finish up main ticket to ship 5.0
    • @twier to follow up with @Lorena_Rodriguez about implications of new helm charts being merged to cht-core master and what this means for 5.0 (eg: are these new helm charts a breaking change for folks running 4.x helm charts now?)
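
One way to cross-check the TUI against Fauxton, as raised above, is to summarize the raw `/_active_tasks` response directly. A minimal sketch, assuming the standard CouchDB 3.x task fields (`type`, `database`, `design_document`, `progress`) and shard-style database names; the sample tasks are made up for illustration:

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

# Clustered CouchDB reports shard files, e.g. "shards/00000000-7fffffff/medic.1700000000"
_SHARD_RE = re.compile(r"^shards/[0-9a-f]+-[0-9a-f]+/(?P<db>.+)\.\d+$")


def logical_db(name: str) -> str:
    """Strip the shard range and timestamp suffix, leaving the logical database name."""
    match = _SHARD_RE.match(name)
    return match.group("db") if match else name


def summarize_indexers(tasks: List[dict]) -> Dict[Tuple[str, str], float]:
    """Average indexer progress per (database, design document), ignoring other task types."""
    grouped = defaultdict(list)
    for task in tasks:
        if task.get("type") != "indexer":
            continue
        key = (logical_db(task.get("database", "")), task.get("design_document", ""))
        grouped[key].append(task.get("progress", 0))
    return {key: sum(vals) / len(vals) for key, vals in grouped.items()}


# Made-up sample shaped like a GET /_active_tasks response
sample = [
    {"type": "indexer", "database": "shards/00000000-7fffffff/medic.1700000000",
     "design_document": "_design/medic-client", "progress": 20},
    {"type": "indexer", "database": "shards/80000000-ffffffff/medic.1700000000",
     "design_document": "_design/medic-client", "progress": 40},
    {"type": "database_compaction", "database": "shards/00000000-7fffffff/medic.1700000000",
     "progress": 5},
]
for (db, ddoc), progress in summarize_indexers(sample).items():
    print(f"{db} {ddoc}: {progress:.0f}%")
```

Grouping by both database and design document makes it easier to see which indexes are running outside the one being upgraded.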

29 Jul 2025 Call

Attending

Notes

  • 5.0 milestone
    • TCO branch is now up to date with latest from master
    • CHT Conf ticket with PR is under way
    • Fix k3s integration was closed b/c it’s no longer needed with upstream helm changes
    • Helm Chart changes
      • are pending - Tom is focused on applying changes to CHT Core instead of targeting helm charts
      • Next step is getting this reviewed
    • docs ticket
      • no need to document docker upgrade, it will “just work”
      • effort 1: update docs on what Nouveau is and how it works in general
      • effort 2: how do you upgrade a k8s deployment from 4.x to 5.x? or…is that too much and maybe 5.0 as a whole will need to cover helm 4.x → 5.x process and this squad doesn’t need to cover it
      • For effort 2, Josh to open docs issue about “how to upgrade to new helm charts for 5.0” and add to 5.0 milestone
  • eCHIS KE efforts
    • elijah still working with branch of CHT Toolbox to auto-serially upgrade an instance.
    • has had to restart couch container a few times which has crashed during upgrade test
    • noted that Toolbox doesn’t always auto continue from one DB to another
    • medic and then medic-client are the really hard ones, after which it should go very fast. failing just the once between these should just need one restart and maybe isn’t worth fixing?
    • there is an upcoming deadline Aug 29th to have all instances with latest feature (specifically contact summary) - this is not tenable
    • crazy ideas to work around slow IO:
      • implement pre-compute caching
      • add queuing mechanism for users, allowing at least 1 sync per day
      • deny a large portion of users access to the system to reduce IO requirements
      • rate limit ingress
      • drop free-text indexes - or even remove search all together?
      • what other major features can we fully remove
      • not a crazy idea at all, but TCO feature is likely where it should go as it’s a logical destination
      • fork couch to possibly stop re-indexing when you do something like delete free-text indexes
      • follow example from other NGO with huge, slow couch: consider indexing off the VMs, set replication to off VM instance, run-reindex on other non-VM, stop both and sneaker-net back the data to production
      • There’s ~10 instances that are doing poorly out of 47

5 Aug 2025 Call

Attending

Notes

  • Review main ticket
  • 5.0 milestone
  • eCHIS KE
    • Have restarted Kwale clone upgrade test to see how long it will take now that eCHIS Data Center is happy. Prior effort was sequential upgrade via CHToolbox. New retest is done “normally” with concurrency
    • eCHIS KE very likely (80%?) to upgrade to 5.0 when it comes out - according to Elijah
    • Nairobi has ~100 users who are over 10k doc limit. Some are even over 100k :frowning: This will have an impact on the server
    • Current action item for eCHIS Instances:
      • upgrade to 4.x branch with 5m → 30m downward sync
      • fix sentinel backlog by fast forwarding sequence ID
      • fix users who have >10k documents to sync (above replication limit). If they’re all “focal persons” that are offline account - maybe migrate to online account? Caution: There’s no way to find users who have failed to log in because they can’t sync the large amount of users
      • CHT 4.21 has CouchDB 3.5 which has shown massive performance improvements specifically around disk use (reduce locking calls to disk). This should be the 4th to do for eCHIS KE to improve performance
    • Diana excited about new endpoint that takes a time frame to get document IDs instead of sequenced ID. Is in CouchDB PR, will hopefully be released in couch 3.x soon
    • Stewardship teammates should sign up for Gitlab to get access - ping Elijah with account name

Action items:

  • @mrjones to create 3 gitlab tickets for “Current action” above

12 Aug 2025 Call

Attending

Notes

  • 5.0 milestone
    • look at maybe dropping Android 5.0? and 6.0 and 7.0?
    • Realized this can be a CHT Android 2.0 release and can not at all be tied to CHT Core 5.0
  • Review main ticket
    • nothing pressing, following up last ticket which is just docs
  • eCHIS KE
    • Tom still testing Upgrade failures in eCHIS VMs that have API crash, and fix is to restart API
  • TCO v2
    • Discuss splitting out DDocs for better performance, in that changing one index in a DDoc with a small number of indexes will only rebuild that DDoc and not just about every DDoc like we do today
    • Josh to file a research ticket around rebuilding DDocs and how it can saturate the CPUs

Action items:

  • @jkuester - look at adding CI to CHT Android to see if it supports testing on Android 5.0 (NB - this is not tied to CHT Core 5.0, just a good way to deprecate support for really old Android)

19 Aug 2025 Call

Attending

Notes

  • TCO v2 parent issue
  • 5.0 milestone
  • eCHIS KE
    • tom found bug on upgrade 4.11 to 4.21
    • Elijah reports that the goal is to have all instances on 4.21 by end of August
    • Bug fix won’t be for days (weeks?) so a work around to crashing upgrades will be to use CHToolbox prestage feature. Note that pre-stage has been merged to main. Also note that sentinel sequence ID reset feature branch will include pre-stage feature

Action items:

  • @jkuester - add feature to CHToolbox: sentinel sequence ID reset to zero out sentinel backlog
  • @jkuester - look at adding CI to CHT Android to see if it supports testing on Android 5.0 (NB - this is not tied to CHT Core 5.0, just a good way to deprecate support for really old Android)
  • @mrjones - create ticket in eCHIS KE gitlab to propose merging VMs for low use instances and re-allocate the CPU/RAM to higher need instances like Nairobi

26 Aug 2025 Call

Attending

Notes

  • eCHIS KE
    • for instances not on 4.21, try going to 4.15 first which has the 2 second fix, then go to 4.21
    • Review node exporter graphs to see about upgrade correlation to lower iowait
  • TCO v2
    • working on compaction issue

Action items:

  • @jkuester - look at adding CI to CHT Android to see if it supports testing on Android 5.0 (NB - this is not tied to CHT Core 5.0, just a good way to deprecate support for really old Android)
  • @mrjones - create sentinel backlog status and disk use status (siaya and kwale full)
  • @mrjones - check out jump server access to eCHIS KE

2 Sep 2025 Call

Attending

Notes

  • 5.0/eCHIS KE
    • going to be released in low weeks not months

    • Elijah deleted a lot/all of the stale compaction files to clear up space - these have not come back - but the concern is that they may yet still. We should keep an eye out for them, but hopefully all the code changes have fixed the issue.

    • not going to be 5x the total space needed to upgrade, it should be same as 4.11/13 → 4.21 space needed

    • There’s two VM clones in eCHIS KE that they’d like to reclaim to save on resources - Tom agrees good to reclaim - but we’ll do a spot upgrade to 4.21 → 5.0 and measure disk savings

    • end users will have to re-index medic-client , which is no worse than it has been in the past

    • confirming that the big 9 instances that are CPU oversubscribed will get more resources

    • Hopefully CHT Toolbox won’t be needed for upgrade

  • TCO v2

Action items:

  • @jkuester - follow up to see if the removal of old migration code affects rebuilding indexes for 4.21 → 5.0 upgrade. report back to group/slack/elijah
  • @jkuester - look at adding CI to CHT Android to see if it supports testing on Android 5.0 (NB - this is not tied to CHT Core 5.0, just a good way to deprecate support for really old Android)
  • @mrjones - check out jump server access to eCHIS KE
  • @mrjones - set up call with kenn/elijah/tom/mrjones to maybe gain insight as to how/why/when replication latency goes up as compared to other activity in Node Exporter or Watchdog
  • @twier - on a test VM, check upgrade from 4.21 → 5.0 , measure time and disk and check for any complications

9 Sep 2025 Call

Attending

Notes

  • 5.0
  • eCHIS KE
  • TCO v2
    • no progress on main ticket or sub-tickets
    • did close some tickets that didn’t pan out or were already fixed in couchdb 3.5 which shipped in CHT 4.x
    • dig into what correlates with the “happy times” per recent research in most of Aug.
    • compaction may be causing issues? Compaction is currently set to only run at night
  • Improving replication performance
    • Diana has a branch that replaces the server side replication : Caught up on what nouveau is, because I wanted to try to see how replication would look with a nouveau view. Turned out to be pretty easy and fast and created an issue
    • How do we validate this works? That it holistically scales and doesn’t have negative knock-on effects on the rest of the server.
    • Suggested: EC2 instance so that it’s isolated and other EKS resources don’t affect performance. This would be side by side to compare couchdb vs nouveau serving the replication endpoint
    • Run same tests as EC2 on eCHIS KE test VM that’s already provisioned to get an idea of performance with their large datasets

Action items:

  • @jkuester - look at adding CI to CHT Android to see if it supports testing on Android 5.0 (NB - this is not tied to CHT Core 5.0, just a good way to deprecate support for really old Android)
  • @twier - on a test VM, check upgrade from 4.21 → 5.0 , measure time and disk and check for any complications
  • @mrjones - get a copy of high use instances couch logs with the log script

16 Sep 2025 Call

Attending

Notes

  • 5.0
    • @twier tested VM to check upgrade from 4.21 → 5.0 , measure time and disk and check for any complications - see ticket
    • Kwale only saw ~10% reduction overall, vs the 34% seen on Zanzibar test
    • We should be sure to measure space in the CHT data directory, not total free space. For an instance w/ 1.5TB that has 40GB of savings, this will be a very small amount of the 1.5TB, but possibly more of just the data directory
  • eCHIS KE
    • @jkuester - look through production logs from eCHIS KE to try and uncover crashes, errors, anomalies etc.
    • @jkuester has another set of more recent logs to look through see if he can figure patterns in sentinel et al craziness
    • Adding more CPUs didn’t seem to help - adding endless CPUs likely isn’t a good idea until we know what is causing the excessive CPU use
  • TCO v2
    • tk
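
The point above about measuring the CHT data directory rather than total free space can be sketched as follows. This is a rough illustration only: it sums apparent file sizes (not allocated blocks), the function names are hypothetical, and the 500GB data directory figure is invented for the example.

```python
import os


def dir_size_bytes(path: str) -> int:
    """Sum apparent file sizes under `path`, skipping symlinks and files that vanish mid-walk."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            try:
                if not os.path.islink(full):
                    total += os.path.getsize(full)
            except OSError:
                # e.g. a compaction file removed by CouchDB while we were walking
                continue
    return total


def percent_saved(before: float, after: float) -> float:
    """Savings as a percentage of the starting size; units just need to match."""
    return (before - after) / before * 100


saved_gb = 40
print(f"vs 1.5TB total disk: {percent_saved(1500, 1500 - saved_gb):.1f}%")
# Hypothetical 500GB data directory, for illustration only
print(f"vs 500GB data directory: {percent_saved(500, 500 - saved_gb):.1f}%")
```

Running `dir_size_bytes` on the data directory before and after an upgrade, and feeding both numbers to `percent_saved`, gives a figure that is comparable across instances with different volume sizes.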

Action items:

  • @jkuester - look at adding CI to CHT Android to see if it supports testing on Android 5.0 (NB - this is not tied to CHT Core 5.0, just a good way to deprecate support for really old Android)
  • @jkuester - analyze Zanz and Kwale CSVs from test upgrades and see if we can predict savings of a 5.x upgrade
  • @mrjones - retest Zanzibar upgrade by going from 4.5.2 → 4.21 first, then to 5.0 and measure the second upgrade
  • @mrjones - test 30 min on migori, kisii, siaya and vihaga:
    • take planning back to Slack - ran out of time to finish the plan
    • stop just nginx? stop just HA Proxy?

23 Sep 2025 Call

Attending

Notes

  • 5.0
    • What to do about MoH Zanzibar dataset not being available to re-test upgrades?
    • Maybe do an eCHIS KE instance again?
      • what happened to Kwale CSV monitoring disk use? @twier to ensure it’s shared
    • also do MoH CIV to test medic hosted
  • eCHIS KE
    • Diana testing on master @ 5.0 on an EC2 instance to test replication index in Nouveau
      • reproduced a CPU error where couch crashes, and then API crashes. Here’s the couch error: [os_mon] cpu supervisor port (cpu_sup): Erlang has closed
      • everything in node exporter and watchdog to monitor stuffs
  • TCO v2

Action items:

  • @jkuester - look at adding CI to CHT Android to see if it supports testing on Android 5.0 (NB - this is not tied to CHT Core 5.0, just a good way to deprecate support for really old Android)
  • @jkuester - analyze Zanz and Kwale CSVs from test upgrades and see if we can predict savings of a 5.x upgrade
  • @elijah & @twier - test Nairobi upgrade 4.21 → master @ 5.0 on the kwale clone VM
  • @mrjones - test MoH CIV upgrade 4.9.0 → 4.21 → master @ 5.0

30 Sep 2025 Call

Attending

Notes

  • 5.0
    • tk
  • eCHIS KE
    • Josh looked at logs from instances and noticed that Nairobi only processed ~1k documents when it should have done 10x to 100x more. Why is it moving so slow? CPU!!

    • When there’s spare CPU we see a more normal sentinel processing speed

    • Why does a spike in network traffic correlate to a drop in CPU use? We would expect the opposite?!

    • Does purging have an effect? A bit of research was done, but it’s otherwise running every 24 hours.

    • Run test to see if we can narrow down CPU use source

      • 30 min total
      • 15 min stop HAProxy
      • 15 Min keep HAProxy stopped, restart Couch
      • at the end, start HAProxy and ensure all services are working
  • TCO v2
    • tk

Action items:

  • @jkuester - look at adding CI to CHT Android to see if it supports testing on Android 5.0 (NB - this is not tied to CHT Core 5.0, just a good way to deprecate support for really old Android)
  • @jkuester - analyze Zanz and Kwale CSVs from test upgrades and see if we can predict savings of a 5.x upgrade
  • @elijah & @twier - test Nairobi upgrade 4.21 → master @ 5.0 on the kwale clone VM
  • @mrjones - test MoH CIV upgrade 4.9.0 → 4.21 → master @ 5.0
  • @mrjones - get copy of /opt/couchdb/etc/ from Nairobi and share
  • @mrjones - file ticket on HAProxy and Couch restart tests above and execute

7 Oct 2025 Call

Attending

Notes

  • eCHIS KE
    • Kwale clone upgrade coming along, slow and steady. used normal admin web GUI to initiate, all services running nominally with no restarts of services
    • transition back to k8s - with a 1:1 node mapping, what is the intent? Suggest very tall nodes if this is a for sure thing. The prior reallocation of CPU/RAM (see 1 and 2) seemed to go very quickly, was that considered a failure? How will k8s improve performance?
    • review testing of turning off Sentinel and impact on CPU and what next steps are
    • confirming that targeting master for upgrade tests is still helpful. consider targeting

Action items:

  • @jkuester - look at adding CI to CHT Android to see if it supports testing on Android 5.0 (NB - this is not tied to CHT Core 5.0, just a good way to deprecate support for really old Android)
  • @jkuester - analyze Kwale CSVs from test upgrades and see if we can predict savings of a 5.x upgrade. still need CSVs from @twier before getting started.
  • @elijah & @twier - test Nairobi upgrade 4.21 → master @ 5.0 on the kwale clone VM
  • @mrjones - test MoH CIV Lumbini NE upgrade 4.9.0 → 4.21 → master @ 5.0

14 Oct 2025 Call

Attending

Notes

  • eCHIS KE

    • mrjones to follow up with gentle, respectful k8s questions about how to quantify impact of k8s migration
    • Diana to follow up on stopping background cleanup on certain instances. not all instances will benefit:
    • how to better monitor what couch is up to? how can we know what is taking up the CPU when couch has gone rogue?
  • 5.0

    • @jkuester happy to see the test results of indexing both to master and to branch with Cinque index to see side by side comparison of the different impacts on upgrades:
    • @elijah was able to upgrade Nairobi 4.21 → master @ 5.0 but hit the upgrade bug. Was able to work around it by clicking “upgrade” again in admin GUI
    • plan is to get a beta build of 5.0 out a week from now so eCHIS KE can test a Nairobi clone upgrade from 4.21 → 5.0 beta. This will mean we need to merge the Cinque PR.
    • what to do about user-devices improvement? The user-devices can crush a CPU so the CHT becomes unusable to sync etc. while the user-devices request is being processed.

Action items:

  • @jkuester - look at adding CI to CHT Android to see if it supports testing on Android 5.0 (NB - this is not tied to CHT Core 5.0, just a good way to deprecate support for really old Android)
  • @mrjones - test MoH CIV Lumbini NE upgrade 4.9.0 → 4.21 → master @ 5.0 →
  • @mrjones - add upgrade bug info to both the preparing for 5.0 doc and the troubleshooting 4.0 upgrades doc
  • @mrjones - add a deprecated warning to user-devices API in docs and link to user-devices improvement ticket.
  • @Diana is going to temporarily stop Background Cleanup tasks to see what impact this has on performance

21 Oct 2025 Call

Attending

Notes

  • eCHIS KE
    • tk
  • Muso Mali, currently on 4.15 in Google Kubernetes Engine (GKE) on Multi-node CouchDB with 37,551,919 docs in medic DB
    • looking for way to reduce cost
    • tried reducing size of GKE node types (CPU/RAM) but was unable to support existing users. compaction was especially bad for end users
    • have focused on reducing storage until more space is needed to upgrade
    • Docs are in progress for how to migrate from 4.x → 5.x helm chart upgrade
  • 5.0
    • Cinque introduces more ephemeral disk use during upgrade to 5.x. The benefit is pretty massive (>40% faster API calls to /api/v1/initial-replication/get-ids, in turn, less CPU use) - we’re all good with this trade off?
    • More work in v2 purging for 5.1 on Nouveau, but index changes in preparation in the same Cinque PR
    • 5.0 Beta build is behind, but this week is reasonable
    • Comments on removing old UI in 5.0
      • Muso Mali is on old UI
      • Training cards may be possible, but not tested/ready yet
      • Muso & WA would prefer to wait on being forced to take new UI
      • would love to get public feedback on forum post

Action items:

  • @diana - 5.0 beta build this week (Oct 20) for eCHIS KE et al. to test on
  • @mrjones - publish forum post on how to find users with high doc count
  • Hiell - follow up with Ulrich for invite to Hosting TCO Slack channel

What are we learning? Curious what the space savings (if any) are from the upgrades we’ve managed to run.