Hosting Total Cost of Ownership Squad

Hi @derick - thanks for the comment!

As far as Hosting TCO v1 goes - the effort to reduce the overall disk space used by the CHT day to day - we’re not learning anything new, and this is a good thing! Let’s dive into why.

Almost exactly a year ago, the ticket was opened to implement Couch 3.5’s Nouveau in the CHT. In this ticket, a number of key questions were asked; if all the answers were positive, it would very likely be a win for the community at large.

Just a month later, in Nov of 2024 we had good answers:

how much disk savings do we see?

I saw ~25% disk savings [on a production data set]

Every single upgrade since then has shown about this same level of disk savings. In the best case scenario, it’s been above 30%.
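As a back-of-envelope illustration of what that level of savings means (the instance sizes below are made up; only the ~25% figure comes from our testing):

```shell
# Illustrative only: apply the ~25% savings we keep observing to some
# hypothetical pre-upgrade instance sizes.
savings_pct=25
for gb in 100 500 2000; do
  echo "${gb}GB -> $(( gb * (100 - savings_pct) / 100 ))GB"
done
```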

By April of 2025 there was some concern that maybe our test results weren’t valid, so we re-tested using a number of production datasets. The high watermark of just under 35% savings still held.

And so it’s true with the tests this month - we’re still seeing the same savings hold, and larger instances stand to gain the most. We can’t promise any specific amount, but it’s looking good! Here’s the chart we cite often that summarizes all this:


28 Oct 2025 Call

Attending

Notes

  • 5.0

    • last issue (remove unused views) is still outstanding. @twier to work on it. Discussing which views are where, specifically what is or is not in the medic ddoc. Tom to create a PR for the smallest viable change that can fit safely in 5.0.
    • Be sure to message around no training needed, no UI changes.
    • 5.0 requires Chrome 107 for future angular uplift. research was done showing UG is most affected, but surmountable.
  • eCHIS KE

    • based on our research in eCHIS, a forum post was published on finding high doc-count users
    • @elijah working on testing CHT Core @ master for Nairobi clone, will start test upgrade tomorrow
  • Hosting TCO v2

Action items:

4 Nov 2025 Call

Attending

Notes

  • 5.0 (main milestone ticket)

  • eCHIS KE

    • ticket to test 5.0 in Nairobi created
      • compaction completed, upgrade started, eta middle of next week to finish test
      • instance is public and URL shared on call
      • what is measurement of success for test upgrade? successful upgrade and all systems nominal. Specific disk space savings is not a goal, just success of upgrade
    • does eCHIS want a hard limit on doc count?
      • it would be very welcome: we have a multi-group effort and any one group can increase the number of documents without informing the other groups. Having this hard limit would be a great way for everyone to work within the same rules
      • West Africa also interested in this feature
      • is using contacts as a proxy for all documents reasonable? Elijah’s concern is that number of contacts stays fixed, what about using number of reports as a proxy?
      • likely feature would just set hard limit on number of reports
      • being able to report on users who have hit the limit will be important, but we also have an improvement to monitoring endpoint which should help track users approaching the limit
  • Hosting TCO v2 (main ticket: Reduce disk space during upgrades )

    • NA
  • CHToolbox vs CHT Conf

    • should we move CHToolbox to the Medic org? Maybe an experiment in having community-based scripts/code? No big pressure
    • is it confusing for the community to have two CLI tools? Possibly! For now the toolbox is viewed as a place for Josh (and others?) to put one-off scripts that can leverage existing CRUD-style code

Action items:

  • None

11 Nov 2025 Call

Attending

Notes

  • 5.0 (main milestone ticket)

    • see eCHIS KE “ticket to test 5.0 in Nairobi created”
  • eCHIS KE

    • ticket to test 5.0 in Nairobi created
      • upgrade started Nov 3rd, 12:10:20
      • as of Nov 10 10.20am Pacific - upgrade is 81% done
      • don’t block release of 5.0 on compaction, as soon as upgrade is successful, we can release
      • still do compaction to get final disk use numbers for TCO effort and to report back to MoH
      • expected to complete by EOW
  • Hosting TCO v2 (main ticket: Reduce disk space during upgrades )

    • On hold

Action items:

  • None

18 Nov 2025 Call

Attending

Notes

  • eCHIS KE
    • NA
  • 5.0
    • Test upgrade succeeded, but disk space wasn’t saved.
    • Looking into file sizes on disk in production vs clone of production where tests were done, we see compaction didn’t complete
    • Errors were found in the logs confirming compaction died; the exact cause isn’t known:
      [info] 2025-11-14T00:25:23.387450Z couchdb@127.0.0.1 <0.726.0> -------- db shards/95555553-aaaaaaa7/medic-user-foobar-meta.1726644348 died with reason {{badmatch,{error,{noproc,{gen_server,call,[<0.3043468.0>,{pread_iolists,[24835]},infinity]},[{gen_server,call,3,[{file,"gen_server.erl"},{line,419}]},{couch_file,pread_binaries,2,[{file,"src/couch_file.erl"},{line,194}]},{couch_file,pread_binary,2,[{file,"src/couch_file.erl"},{line,179}]},{couch_file,pread_term,2,[{file,"src/couch_file.erl"},{line,167}]},{couch_btree,get_node,2,[{file,"src/couch_btree.erl"},{line,474}]},{couch_btree,stream_node,8,[{file,"src/couch_btree.erl"},{line,1069}]},{couch_btree,fold,4,[{file,"src/couch_btree.erl"},{line,242}]},{couch_bt_engine,fold_docs_int,5,[{file,"src/couch_bt_engine.erl"},{line,1131}]}]}}},[{couch_db_engine,trigger_on_compact,1,[{file,"src/couch_db_engine.erl"},{line,993}]},{couch_bt_engine_compactor,start,4,[{file,"src/couch_bt_engine_compactor.erl"},{line,54}]}]}
      
    • the main error string to look for was died with reason {{badmatch
    • It’s further suspected that this is a timeout due to resource contention: the test instance has 8 CPUs vs production’s 48, and 8 CPUs is likely not enough - but this is only a theory
    • we started compaction on medic-client and we’ll watch this over the next ~24 hours to see if it completes
    • latest logs from docker were published in google drive (private link)
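For anyone chasing a similar failure, here’s a self-contained sketch of the log check (the sample log line and file name are illustrative; the commented-out `_compact` call is the standard CouchDB compaction API):

```shell
# Write one sample line carrying the crash signature, then count occurrences.
# Against a real deployment you'd grep your exported CouchDB/docker logs.
cat > sample-couch.log <<'EOF'
[info] couchdb@127.0.0.1 db shards/.../medic-user-foobar-meta died with reason {{badmatch,{error,...}}}
EOF
crashes=$(grep -cF 'died with reason {{badmatch' sample-couch.log)
echo "compaction crashes found: ${crashes}"

# To restart compaction on a database, CouchDB's standard API is:
#   curl -X POST http://admin:pass@localhost:5984/medic-client/_compact \
#        -H 'Content-Type: application/json'
# and progress then shows up as "database_compaction" under GET /_active_tasks.
```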

Action items:

  • @jkuester - reach out to the CouchDB Slack to see if they have any insights as to why compaction failed.

25 Nov 2025 Call

Attending

Notes

  • 5.0.0 k8s issue:
    • Issue with EKS deployment of 5.0.0 helm charts.
    • Helm charts might need to be updated to add a separate Persistent Volume for Nouveau because EBS volumes have an inherent property of not allowing multiple mounts
      • This might be a config option? (one that was set differently in test) - Multi-Attach enabled: false
      • Considerations and limitations of Multi-attach.
      • Possible that the instances we tested the helm charts with were using io2 volumes which have broad support for multi-attach, but maybe the Demo instance is running an older io1 volume (which has a much more narrow support path for multi-attach).
      • The Demo instance is running on gp2 (and not io2) so there is no support for multi-attach.
      • The decision is that we will move the Nouveau container inside the Couch container’s pod so they can share the same volume without needing multi-attach.
    • Also an issue with “rolling upgrade”. Our current helm charts do not set the configuration to force k8s to not “roll” the upgrade. This means that, by default, k8s tries to spin up new containers before destroying the old ones. This is obviously a serious problem when it comes to the CHT (particularly Couch) containers.
      • To fix this, we need to set the config in our helm charts to prevent “rolling” the updates (at least for the Couch container).
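A minimal sketch of that fix in Deployment terms (the name below is hypothetical, not our actual chart values; if Couch runs as a StatefulSet, the equivalent knob is `updateStrategy`):

```yaml
# Hypothetical Deployment fragment: "Recreate" tears down the old pod before
# starting the new one, so old and new containers never contend for one volume.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: couchdb            # hypothetical; not the actual chart's name
spec:
  strategy:
    type: Recreate         # default is RollingUpdate, which overlaps old/new pods
  # ...rest of the Deployment spec unchanged...
```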

02 Dec 2025 Call

Attending

Notes

  • 5.0.0

    • chat a bit about 5.0.1 happenings and why
  • chat about NPM vulnerability to supply chain attacks - anything Medic should do different? Anything stewardship/app serv teammates with sensitive tokens on disk should do different?

    • change release note script to prompt for GH token instead of having it on disk in the clear?
    • for agentic work, how do we ensure we’re not exposing/sharing private keys and CHT passwords? A teammate was using Claude and saw it had used local contexts and was returning full URLs with production passwords in them b/c it had been given all that info :frowning:
  • Hosting TCO v2

    • Josh doesn’t have time to look into the CPU saturation ticket, and Tom is looking into splitting up DDocs (but is presumably busy too)
    • Diana brought up Background Indexing - specifically batch_channels: This setting controls the number of background view builds that can be running in parallel at any given time. The default is 20.
    • discussion of using multi-agent tooling on Hosting TCO v2? Sugat feels multi-agent isn’t ready
    • 5.0 release was taking up a lot of time/resources - with it not being released, team wants to really focus on v2 effort.
    • very likely, the biggest win will be splitting up the views in the ddocs - today every time we make a change you have to rebuild all the ddocs, as opposed to just the one affected if they were split up. Currently at ~4 ddocs; this would go to ~35
    • there could be an “advanced” way to upgrade where you get to choose which ddocs to index on upgrade - in the works! We could hard-code a description of what each ddoc does.
  • unrelated to tco, but chat about upgrade service e2e issues
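For reference, a minimal sketch of where that setting lives (CouchDB local.ini; 20 is the documented default for the ken background index builder):

```ini
; Background index builder (ken): how many full view builds run in parallel.
[ken]
batch_channels = 20
```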

tasks:

  • @mrjones - change release notes script to prompt for token instead of keeping in the clear on disk in a json file

9 Dec 2025 Call

Attending

Notes

  • Hosting TCO v2
    • touch on 2849 and 10220. What is the CPU/time cost to split up ddocs vs the cost to run the split-up architecture day to day? 1 CPU per index per ddoc per shard.
    • what’s “too many” JS processes? If we go with the highest ddoc count, we reduce the amount of space needed for upgrades (because so few views will be updated), but then we need a lot of CPUs
    • Background Indexing (aka the ken config) will come into play, especially as we increase the number of indexing processes
    • We should run all the combos of 1:1 of views to ddocs vs how it is today and performance on 2, 4, 8, 16, 32 and 64 core
    • Because of code-path complexity and the tech debt we’d take on, we won’t consider more than one configuration for the number of indexes - we’ll have a one-size-fits-all
    • Diana is still interested in working on partial re-indexing on upgrade too
    • maybe use scalability test to automate testing?
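Rough math on the concurrency point above (a shard count of 8 is illustrative, and the ~4 vs ~35 ddoc counts are from the discussion; real numbers depend on q and the final design):

```shell
# Rough upper bound on concurrent view-builder processes during a full rebuild,
# assuming roughly one process per ddoc per shard.
shards=8
today=$(( 4 * shards ))    # ~4 ddocs currently
split=$(( 35 * shards ))   # ~35 ddocs after a 1:1 view<->ddoc split
echo "today: ~${today} builders; after split: ~${split} builders"
```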

tasks:

  • @mrjones - change release notes script to prompt for token instead of keeping in the clear on disk in a json file

15 Dec 2025 Call

Attending

Notes

  • eCHIS KE

    • diving into why Baringo had such bad downtime - no exact resolution except that couch was crashing
    • Diana suspects medic-sentinel db might be corrupt?
    • Elijah will open Gitlab ticket to track and share logs etc.
  • Hosting TCO v2

    • @twier - did a bunch more testing of splitting views up so they’re 1:1 view <> ddoc
    • doesn’t seem to have a negative effect on CPU & disk: it’s about the same amount of space and CPU use.
    • tests were mostly ad hoc, no Prometheus measuring etc.
    • separately - Tom also looking into adding postgres into CHT core as a way to save disk space

tasks

  • @diana - update view/ddoc ticket to have specific test we should be running on master vs ticket’s branch so we can get realistic test results we’re confident in.

6 Jan 2026 Call

Attending

Notes

  • eCHIS KE

  • Hosting TCO v2

    • on the docs site, PR: “interactive cost calculator”
    • idea: what if we update the upgrade service to:
      • allow sequential rebuilds/updates of views
      • delete views first, thus making this an upgrade that pulls the server offline until views are rebuilt
      • could be backwards compatible to 4.x (maybe back to 4.5? 4.0?!?)
      • could use TBD CHT Conf features which could then be both baked into the CHT upgrade service (literally add CHT Conf to the upgrade service) as well as allow k8s deployments to do the same trick using CHT Conf
      • the tricky part will be how to prevent CHWs from accessing the server (eg stop API, stop nginx/ingress) but still allow CHT Conf/Upgrade Service access
      • could also add an app settings value: if set to the new style, “do upgrade with downtime”; if not present, do the old-style 5x-disk-space upgrade.
  • TCONext (v3?)

tasks

  • @diana - update view/ddoc ticket to have specific test we should be running on master vs ticket’s branch so we can get realistic test results we’re confident in.

13 Jan 2026 Call

Attending

Notes

TCO v2:

  • Upgrade paths discussion around “add design document comparison support during upgrades” and “Support ‘downtime upgrade’ with minimal disk space requirements”
  • What is the UX for when you delete your views but run out of disk space in the TBD feature?
  • Early research on the downtime upgrade is that we’re only getting ~25% savings (so ~3.8x space vs 5x space), but once we get the ddocs broken up, this approach in 10560 will be more viable
  • discuss UX for add design document comparison support feature
    • mrjones: collapse the list of affected ddocs and show a count in the summary.
    • mrjones: If possible, for each release (not branches) show whether or not that version requires a ddoc re-index. This would massively encourage folks to take the cheap upgrades that don’t index, and inform them to be cautious with the others that require indexing
  • discuss downtime upgrade and how it will work with CHT Conf and k8s with helm
  • discuss feedback on backup page
    • add “restore” to title
    • add a link to restore at the top of the page so folks know it’s a key part of backup
  • discuss future more granular perms for online users coming in future couch release
  • touch on security improvements that Jonathan brought up (private slack thread) around PIN vs encryption at rest vs partner feature requests
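Worked example of those disk multiples for a hypothetical 100GB dataset (the 5x and ~25% figures are from the discussion above; the dataset size is made up):

```shell
# Peak disk needed during an upgrade, under each approach.
data_gb=100
standard=$(( data_gb * 5 ))            # standard upgrade peaks near 5x
downtime=$(( standard * 75 / 100 ))    # early downtime-upgrade tests saved ~25%
echo "standard peak: ${standard}GB; downtime-upgrade peak: ~${downtime}GB"
```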

20 Jan 2026 Call

Attending

Notes

TCO v2:

  • Support “downtime upgrade” with minimal disk space requirements
    • Best possible disk use with destructive upgrade approach is 2.6x
    • the risk is that if you delete all indexes and something goes wrong, the server is down until you restore from backup or revert and rebuild on the existing code base. This could be especially bad if you are low on disk space, follow this path, and then run out of disk space - restoring from backup is the only option.
    • medic-client is by far the largest use of space; it is the first spike in the red line of the graph
    • See earlier Slack discussion on same
  • Split up views into different ddocs
    • @diana suggests grouping views by the things that use them - focusing on what the Android client uses. eg “this ddoc for sentinel, this ddoc for sync, etc.”
    • this feature is likely the most important to ship ASAP
    • no immediate blockers known; increased IOPS may have an impact on day-to-day performance, which we’re watching out for
  • Selectively only index specific / necessary ddocs when upgrading to a new version
    • when coupled with broken-down ddocs, this could help
    • tom to update ticket, no updates since May of last year

eCHIS KE

  • 5.0 upgrade plan - how is current 5x disk space requirements being addressed?
  • plan is to upgrade 5 instances at a time; no current plan for the lack of disk space
  • need to figure out how to shrink volumes, 186TB of 200TB in use :frowning:
  • Looking at internal dashboard all but 4 VMs are well under 50%, most under 30% disk use.
  • Maybe one idea would be to simulate having a NAS with a lot of storage that can be dynamically allocated to the VM most in need. Something like:
    • find out which VMs are on which hypervisors
    • create a “storage” VM with a lot of disk on the hypervisor with the CHT instance you want to upgrade
    • NFS mount the storage VM
    • copy the couch data to the NFS mount
    • reconfigure the CHT instance to use the couch data on the NFS mount
    • upgrade the CHT and use a bunch of extra disk space, but you have more room via NFS. As well, because it’s on the same hypervisor, the disk access should be very fast, hopefully as fast as if the data were on the VM directly
    • when the upgrade is done, move couch data back to boot volume, unmount NFS
    • when all upgrades are done, delete the storage VM, returning the storage to the unused storage pool
    • net result should be that no extra disk space needs to be permanently assigned to any one VM, but all instances are upgraded.

27 Jan 2026 Call

Attending

Notes

Welcome @Jude_Zambarakji !

TCO v1:

TCO v2:

  • Split up views into different ddocs
  • great update from last week: Attached is a little analysis of where views are used, built from grepping each one’s name in cht-core, separating references in webapp, sentinel, api, and shared-libs; for shared-libs it goes one layer deeper to try and see where it’s ‘actually’ used.
  • Look into ways of possibly not having offline clients rebuild medic_client, as this will be almost as bad as a 1st-time login
  • moving a view from one ddoc to another doesn’t cause a re-index

eCHIS KE

03 Feb 2026 Call

Attending

Notes

Welcome @Jude_Zambarakji !

TCO v2:

  • Split up views into different ddocs
  • most recent effort has seen high iowait, but IOPS have been fine - confusing!
  • would be a good test to compare concurrent indexing vs sequential with more ddocs

eCHIS KE

  • working on 4.21 → 5.0.1 upgrade effort
  • indexing is taking a long time during upgrades
  • having to move instances to k8s so they have more flexibility with disk space. Not a PVC, but something like an NFS mount, so it’s tied to one VM and can’t move
  • 34 of 47 instances upgraded
  • larger instances are the ones that are outstanding/slow. These are expected this week
  • VMs are renamed to follow which counties are on them
  • Discussion about k8s affinity (pinning a service like couchdb to a specific node with high CPU/RAM count) and how MoH KE is using this
  • aware that they’re losing the upgrade service and planning to move to helm (or DIY direct k8s calls?) for future upgrades

10 Feb 2026 Call

Attending

Notes

TCO v2:

  • Split up views into different ddocs
  • There is a branch that Tom is working on getting CI to pass on
  • Using erlang shell to get size of indexes from Nouveau Kwale clone to see how big they are - results posted in table on ticket
  • some indexes are only used by offline users (tasks_by_contact) and some are only used by online users (doc_summaries_by_id)
  • there could be big gains by optimizing and removing these indexes (TCOv3? re-prioritize to make this TCOv2?)

eCHIS KE

  • 4 of the larger instances have just completed 4.21 → 5 upgrades; they were then migrated docker → k8s
  • there’s been some shard corruption, will try to rebuild the shard first, then recover from backup if need be. rebuilding shard avoids rebuilding index so try that first
  • fix for search issue found by eCHIS KE will come in 5.0.2 first, and be baked into all future versions like 5.2
  • 5.0.1 → 5.0.2 upgrade is free, in that there’s no re-indexing. this means it will effectively be instant to upgrade, as compared

17 Feb 2026 Call

Attending

Notes

TCO v2:

eCHIS KE

  • still migrating instances in k8s to use SAN instead of local VM disk.
  • Using a SAN allows disks to grow and shrink, whereas the Konza datacenter’s built-in feature for growing a VM disk doesn’t allow you to shrink it.
  • Hope to have all instances on 5.0.2, on k8s, and on SAN by end of Q1
  • 5.0.1 → 5.0.2 upgrade is free, in that there’s no re-indexing. this means it will effectively be instant to upgrade, as compared

24 Feb 2026 Call

Attending

Notes

eCHIS KE

  • is eCHIS KE over-provisioned on disk?
    • For example, some instances have 2TB provisioned, but only spike up to 500GB during upgrade and idle at 150GB
    • 2 migrations in play: move to k8s and consolidating instances into few VMs
    • @twier to put together a report on how much echis KE is over-provisioned on specific values to share with eCHIS KE
  • Earlier batch size optimization changed in settings has been getting dropped
    • this is stored in DB, so shouldn’t be “getting dropped”
    • default was 30,000 which should be enough, so not needed in 5.x
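The over-provisioning example above, worked through (all figures are the illustrative ones from the bullet, not a real instance):

```shell
# 2TB provisioned, 500GB upgrade-time peak, 150GB idle.
provisioned=2000; peak=500; idle=150
peak_pct=$(( peak * 100 / provisioned ))
idle_pct=$(( idle * 100 / provisioned ))
echo "peak use: ${peak_pct}% of provisioned; idle use: ${idle_pct}%"
```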

TCO v2:

  • Add frontend performance benchmarking suite
    • what should we do with metrics like this? Answer: put it in a super easy to access CSV file in a GH repo
    • What is missing from these tests? Answer: Large/Complex form - maybe use a complex form we used when testing Enketo uplift that a partner provided?
    • This won’t answer “how will this affect users with 200k docs vs 20k docs”, just app front end experience, all things being equal
    • This will run on the same type of EC2 instance to ensure consistent performance from run to run
  • Split up views into different ddocs
    • CI was green so images were created which was handy for testing
    • Finding: hitting “stage” is failing b/c the EC2 instance is running out of RAM @ 8GB with 4M docs in the medic DB - a regression which will need to be addressed