Hosting Total Cost of Ownership Squad

27 May 2025 Call

Attending

Notes

  • General discussion about “upgrades take a lot of space” - very likely Hosting TCO Squad 2.0 focus
    • why does CHT upgrading (adding new DDocs) cause more than a 100% increase in disk space?
    • view changes are what cause a large increase in disk space, which do not happen every upgrade
    • @jkuester to file POC ticket in CHT Core to show how disk space goes up on a generic couch instance with view re-indexing. Discuss w/ @diana and @twier. We’ll then take this over to the Couch Slack channel for questions.
  • @elijah - looking to have VMs to test eCHIS KE upgrades on clones of their production instances, from 4.11 to nouveau@master
  • @mrjones - confirm how much extra disk space is used when upgrading 4.19/4.18 → nouveau@master

3 Jun 2025 Call

Attending

Notes

  • Review main ticket
  • if any ddoc changes, all views are rebuilt
    • See related comment on research ticket
    • Q: do we know for certain if a 4.19 → 4.20 upgrade will cause this spike in disk use? A: From the current diff of 4.19.0...master, I do not see any changes to the ddoc files
    • Need for more disk space upgrade testing on ec2 test instance with prod data?
  • MoH KE update on testing Nouveau on prod data
    • had to work through MoH process for provisioning a new VM
    • got a large and small cloned instance
    • will commence testing once ready - but need to make sure large instance has enough free space before proceeding
  • hosting TCO 2.0
    • consider moving views to more/different ddocs?
    • existing research ticket
    • current ticket status is “it’s tricky”
    • still looking into this effort though as it’s still quite promising - it is likely a key part of path forward to address 5x disk space in upgrading
    • relevant findings in shard/cpu research on forums
    • Tom exploring why exactly the space used is more than the ~5x we see in production. Ideally it’s ~2x.
    • early research showing map reduce (including, but beyond, freetext) views are resource intensive (both CPU & Disk)
    • some early research on removing freetext views

10 Jun 2025 Call

Attending

Notes

  • Review main ticket
  • Review large MoH deployment testing Nouveau branch on prod data
    • How much spare disk space do they need?
    • Sugat concerned about more than 5x as seen in research ticket
    • Elijah: I’m in the process of procuring additional storage to begin the upgrade process and wanted to clarify that we settled on 5x current capacity e.g. one clone has a 5TB volume with 3.2TB utilization, should we expand to 16TB or up to 25TB to get some margin of safety.
    • recommendation: 16TB should be fine since the utilization is 3.2TB
    • Interrupted upgrades should both not lose all progress, and be able to resume where they left off
    • recommend starting in MoH Data Center with 8vCPU/16GB RAM to see how it goes. Either success or failure will help inform the TCO Squad’s next steps.
  • If a 3.2TB instance needs >16TB (25TB!!?!), do we need to ship TCO v2 (eg in 4.2x) before TCO v1 (in 5.0) which will reduce total space needed?
  • TCO Squad agrees that we should wait for Elijah’s testing in MoH datacenter, which may take weeks, possibly months in the worst case, to complete. In the interim, upgrade disk space research can continue
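
The sizing question above comes down to which number the 5x multiplier is applied to. A minimal sketch of that arithmetic follows; the function names are made up for illustration and the 5x factor is the working assumption from these notes, not a measured value.

```python
def required_volume_tb(used_tb: float, multiplier: float = 5.0) -> float:
    """Volume size needed if the data on disk may grow to `multiplier` times
    its current size during an upgrade (5x is the squad's working assumption)."""
    return used_tb * multiplier


def additional_tb_needed(volume_tb: float, used_tb: float, multiplier: float = 5.0) -> float:
    """Extra storage to procure on top of the existing volume (never negative)."""
    return max(0.0, required_volume_tb(used_tb, multiplier) - volume_tb)


if __name__ == "__main__":
    # Figures from the notes: a 5TB volume with 3.2TB utilization.
    print(required_volume_tb(3.2))         # 5x the utilization
    print(required_volume_tb(5.0))         # 5x the whole volume
    print(additional_tb_needed(5.0, 3.2))  # storage to add for the 5x-utilization target
```

Applying 5x to utilization gives the 16TB recommendation, while applying it to the full 5TB volume gives the 25TB figure.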

24 Jun 2025 Call

Attending

Notes

  • Review main ticket
  • Does eCHIS KE need to test Nouveau before we release 5.0?
    • nothing will really change for the branch if we don’t test MoH KE before the release
    • eCHIS KE only has 50% of avail storage, so upgrades won’t be easy b/c they don’t have 5x avail free disk
    • input from eCHIS KE: prepare community for what they’ll need to upgrade and what the benefits will be
    • eCHIS has two instances, one small and one big, small is done and big is maybe 3/4 done.
    • will retest and run CHT Toolbox to closely monitor disk space
    • Despite the earlier choice to wait, we will not wait for eCHIS KE test results. If we get lucky and it’s done beforehand, we’ll incorporate findings, but it is not blocking.
  • k8s effort underway
  • review 5.0 milestone

8 Jul 2025 Call

Attending

Notes

  • Review main ticket

  • eCHIS KE upgrade tests

    • small test completed, but larger test instance has stalled - it keeps rebuilding views in a loop. Possible to pre-stage the views?
    • this issue isn’t tied to Nouveau, but to issues endemic to KE Datacenter
    • current theory is that disk access to SAN is so slow (high IO Wait) that happy path upgrade fails
    • possibly have a “go in serial” as opposed to “go in parallel” check box on “upgrade” button?
  • action items:

    • get a copy of upgrade logs
    • manually stage views for just medic-client to see if it works - should be good enough to update CHT Toolbox to do it. Update: already successfully done!
    • create a branch of CHT Toolbox to auto-serially upgrade an instance
    • set up cAdvisor and Node Exporter to get hard numbers on how fast/slow hardware/disk/network is
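
For context on the "hard numbers" item above, here is a rough sketch of how an IO wait percentage can be derived from two readings of the aggregate "cpu" line in Linux's /proc/stat. Node Exporter reports the same underlying counters per mode. The helper names and the sample readings are invented for illustration and are not data from the KE datacenter.

```python
# Order of the first eight counters on the aggregate "cpu" line of /proc/stat.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line: str) -> dict:
    """Turn a '/proc/stat' aggregate cpu line into a {field: ticks} mapping."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    return dict(zip(FIELDS, values))


def iowait_percent(before: str, after: str) -> float:
    """Share of CPU time spent waiting on IO between two samples, in percent."""
    a, b = parse_cpu_line(before), parse_cpu_line(after)
    deltas = {name: b[name] - a[name] for name in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        return 0.0
    return 100.0 * deltas["iowait"] / total


if __name__ == "__main__":
    # Two made-up samples taken some seconds apart.
    sample_1 = "cpu  100 0 50 800 50 0 0 0 0 0"
    sample_2 = "cpu  150 0 70 1000 280 0 0 0 0 0"
    print(f"iowait: {iowait_percent(sample_1, sample_2):.1f}%")
```

A sustained high percentage during view rebuilds would support the slow-SAN theory, while a low one would point elsewhere.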

22 Jul 2025 Call

Attending

Notes

  • MoH KE upgrade test:
    • Josh provided branch of CHToolbox to do sequential updates
    • CHToolbox - it’s working, but the active tasks shown in the TUI are different than the active tasks shown in couch Fauxton, e.g. tasks outside of the medic db (eg medic-client). Can we look into why these non-medic indexes are getting run?
    • Any way to get the index to resume if it gets interrupted halfway through (or 20% of the way through)?
    • Josh will try writing 1 ddoc at a time and warm it which would hopefully allow just one index at a time and also avoid other dbs like medic-client from being active
  • 5.0 effort - what do we need to do to finish up main ticket to ship 5.0
    • @twier to follow up with @Lorena_Rodriguez about implications of new helm charts being merged to cht-core master and what this means for 5.0 (eg: are these new helm charts a breaking change for folks running 4.x helm charts now?)
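
The "write 1 ddoc at a time and warm it" idea from the MoH KE item can be sketched as an ordered list of CouchDB HTTP requests. This is only an illustration and not how CHToolbox is implemented: the function name and the shape of the input are assumptions, and the sketch builds the plan without sending anything. It relies on standard CouchDB behaviour, where querying any one view of a design document builds the index for all views in that document, and where /_active_tasks lists running indexer tasks.

```python
from urllib.parse import quote


def plan_sequential_warm(db: str, ddocs: dict) -> list:
    """Return (method, path) steps that write and warm one design doc at a time.

    `ddocs` maps a design doc name (without the '_design/' prefix) to its body.
    """
    steps = []
    for name, body in ddocs.items():
        base = f"/{quote(db, safe='')}/_design/{quote(name, safe='')}"
        # 1. Write only this design doc.
        steps.append(("PUT", base))
        views = sorted(body.get("views", {}))
        if views:
            # 2. Query a single view with limit=0 to trigger the index build.
            steps.append(("GET", f"{base}/_view/{quote(views[0], safe='')}?limit=0"))
            # 3. Poll active tasks until the indexer is done before moving on.
            steps.append(("GET", "/_active_tasks"))
    return steps


if __name__ == "__main__":
    # Hypothetical design docs, for illustration only.
    example = {
        "ddoc-a": {"views": {"by_type": {}, "by_date": {}}},
        "ddoc-b": {"views": {"by_parent": {}}},
    }
    for method, path in plan_sequential_warm("medic", example):
        print(method, path)
```

Because each design doc is written and warmed before the next one is touched, only one index should be building at any time.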

29 Jul 2025 Call

Attending

Notes

  • 5.0 milestone
    • TCO branch is now up to date with latest from master
    • CHT Conf ticket with PR is under way
    • Fix k3s integration was closed b/c it’s no longer needed with upstream helm changes
    • Helm Chart changes
      • are pending - Tom is focused on applying changes to CHT Core instead of targeting helm charts
      • Next step is getting this reviewed
    • docs ticket
      • no need to document the docker upgrade, it will “just work”
      • effort 1: update docs on what Nouveau is and how it works in general
      • effort 2: how do you upgrade a k8s deployment from 4.x to 5.x? or…is that too much and maybe 5.0 as a whole will need to cover helm 4.x → 5.x process and this squad doesn’t need to cover it
      • For effort 2, Josh to open docs issue about “how to upgrade to new helm charts for 5.0” and add to 5.0 milestone
  • eCHIS KE efforts
    • elijah still working with branch of CHT Toolbox to auto-serially upgrade an instance.
    • has had to restart the couch container a few times after it crashed during the upgrade test
    • noted that Toolbox doesn’t always auto continue from one DB to another
    • medic and then medic-client are the really hard ones, after which it should go very fast. failing just the once between these should just need one restart and maybe isn’t worth fixing?
    • there is an upcoming deadline Aug 29th to have all instances with latest feature (specifically contact summary) - this is not tenable
    • crazy ideas to work around slow IO:
      • implement pre-compute caching
      • add queuing mechanism for users, allowing at least 1 sync per day
      • deny a large portion of users access to the system to reduce IO requirements
      • rate limit ingress
      • drop free-text indexes - or even remove search altogether?
      • what other major features can we fully remove?
      • not a crazy idea at all, but TCO feature is likely where it should go as it’s a logical destination
      • fork couch to possibly stop re-indexing when you do something like delete free-text indexes
      • follow example from other NGO with huge, slow couch: consider indexing off the VMs, set up replication to an off-VM instance, run the re-index on that non-VM instance, stop both and sneaker-net the data back to production
      • There’s ~10 instances that are doing poorly out of 47

5 Aug 2025 Call

Attending

Notes

  • Review main ticket
  • 5.0 milestone
  • eCHIS KE
    • Have restarted the Kwale clone upgrade test to see how long it will take now that the eCHIS Data Center is happy. Prior effort was a sequential upgrade via CHToolbox. The new retest is done “normally” with concurrency
    • eCHIS KE very likely (80%?) to upgrade to 5.0 when it comes out - according to Elijah
    • Nairobi has ~100 users who are over 10k doc limit. Some are even over 100k :frowning: This will have an impact on the server
    • Current action item for eCHIS Instances:
      • upgrade to 4.x branch with 5m → 30m downward sync
      • fix sentinel backlog by fast forwarding sequence ID
      • fix users who have >10k documents to sync (above replication limit). If they’re all “focal persons” that are offline account - maybe migrate to online account? Caution: There’s no way to find users who have failed to log in because they can’t sync the large amount of users
      • CHT 4.21 has CouchDB 3.5 which has shown massive performance improvements specifically around disk use (reduced locking calls to disk). This should be the 4th to-do for eCHIS KE to improve performance
    • Diana excited about a new endpoint that takes a time frame to get document IDs instead of a sequence ID. It is in a CouchDB PR and will hopefully be released in couch 3.x soon
    • Stewardship teammates should sign up for Gitlab to get access - ping Elijah with account name

Action items:

  • @mrjones to create 3 gitlab tickets for “Current action” above

12 Aug 2025 Call

Attending

Notes

  • 5.0 milestone
    • look at maybe dropping Android 5.0? and 6.0 and 7.0?
    • Realized this can be a CHT Android 2.0 release and does not need to be tied to CHT Core 5.0 at all
  • Review main ticket
    • nothing pressing, following up last ticket which is just docs
  • eCHIS KE
    • Tom still testing upgrade failures in eCHIS VMs where the API crashes; the fix is to restart the API
  • TCO v2
    • Discuss splitting out DDocs for better performance: changing one index in a DDoc with a small number of indexes will only rebuild that DDoc, and not just about every DDoc like we do today
    • Josh to file a research ticket around rebuilding DDocs and how it can saturate the CPUs
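
To make the DDoc-splitting argument concrete, here is a toy model of which views get rebuilt after a change. It assumes, as in CouchDB, that all views in a design document share one index, so changing any view rebuilds every view in that document. The hashing below is a simplified stand-in and not CouchDB's actual signature algorithm, and the design doc and view names are invented.

```python
import hashlib
import json


def signature(ddoc: dict) -> str:
    """Stand-in fingerprint of a design doc's views (not CouchDB's real one)."""
    payload = json.dumps(ddoc.get("views", {}), sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


def ddocs_to_rebuild(old: dict, new: dict) -> list:
    """Names of design docs whose views are new or changed between two releases."""
    return sorted(
        name for name, ddoc in new.items()
        if ddoc.get("views")
        and (name not in old or signature(old[name]) != signature(ddoc))
    )


def views_rebuilt(old: dict, new: dict) -> int:
    """Number of views re-indexed, counting every view in each affected ddoc."""
    return sum(len(new[name]["views"]) for name in ddocs_to_rebuild(old, new))


if __name__ == "__main__":
    # One big ddoc: editing a single view rebuilds all three.
    mono_old = {"all": {"views": {"a": "v1", "b": "v1", "c": "v1"}}}
    mono_new = {"all": {"views": {"a": "v2", "b": "v1", "c": "v1"}}}
    print(views_rebuilt(mono_old, mono_new))

    # Same views split across three ddocs: only the edited one rebuilds.
    split_old = {n: {"views": {n: "v1"}} for n in ("a", "b", "c")}
    split_new = {**split_old, "a": {"views": {"a": "v2"}}}
    print(views_rebuilt(split_old, split_new))
```

The trade-off this model leaves out is that more design docs can mean more indexers running at once, which ties into the CPU saturation question in the research ticket.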

Action items:

  • @jkuester - look at adding CI to CHT Android to see if it supports testing on Android 5.0 (NB - this is not tied to CHT Core 5.0, just a good opportunity to deprecate support for really old Android)

19 Aug 2025 Call

Attending

Notes

  • TCO v2 parent issue
  • 5.0 milestone
  • eCHIS KE
    • tom found bug on upgrade 4.11 to 4.21
    • Elijah reports that the goal is to have all instances on 4.21 by end of August
    • Bug fix won’t land for days (weeks?), so a workaround for crashing upgrades will be to use the CHToolbox pre-stage feature. Note that pre-stage has been merged to main. Also note that the sentinel sequence ID reset feature branch will include the pre-stage feature

Action items:

  • @jkuester - add feature to CHToolbox: sentinel sequence ID reset to zero out sentinel backlog
  • @jkuester - look at adding CI to CHT Android to see if it supports testing on Android 5.0 (NB - this is not tied to CHT Core 5.0, just a good opportunity to deprecate support for really old Android)
  • @mrjones - create ticket in eCHIS KE gitlab to propose merging VMs for low use instances and re-allocate the CPU/RAM to higher need instances like Nairobi

26 Aug 2025 Call

Attending

Notes

  • eCHIS KE
    • for instances not on 4.21, try going to 4.15 first which has the 2 second fix, then go to 4.21
    • Review node exporter graphs to see about upgrade correlation to lower iowait
  • TCO v2
    • working on compaction issue

Action items:

  • @jkuester - look at adding CI to CHT Android to see if it supports testing on Android 5.0 (NB - this is not tied to CHT Core 5.0, just a good opportunity to deprecate support for really old Android)
  • @mrjones - create sentinel backlog status and disk use status (siaya and kwale full)
  • @mrjones - check out jump server access to eCHIS KE