Hosting Total Cost of Ownership Squad

Hi @derick - thanks for the comment!

As far as Hosting TCO v1 goes - the effort to reduce the overall disk space used by the CHT day to day - we’re not learning anything new, and this is a good thing! Let’s dive into why.

Almost exactly a year ago, the ticket was opened to implement Couch 3.5’s Nouveau in the CHT. That ticket asked a number of key questions; if all the answers came back positive, adopting Nouveau would very likely be a win for the community at large.

Just a month later, in Nov of 2024, we had good answers:

how much disk savings do we see?

I saw ~25% disk savings [on a production data set]

Every single upgrade since then has shown about the same level of disk savings. In the best-case scenario, it’s been above 30%.

By April of 2025 there was some concern that maybe our test results weren’t valid, so we re-tested using a number of production datasets. The high-water mark of just under 35% savings still held.

And the same holds true for this month’s tests - we’re still seeing the same savings, and larger instances stand to gain the most. We can’t promise any specific amount, but it’s looking good! Here’s the chart we cite often that summarizes all this:
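
For anyone who wants to sanity-check these numbers on their own instance, here’s a minimal sketch of how the before/after disk use can be read, assuming a CouchDB admin at a placeholder URL and the usual `medic` database name (both assumptions, not specifics from the tests above):

```python
# Sketch: read a database's on-disk file size from CouchDB's db info endpoint.
# Capture this before and after the upgrade + compaction to compute savings.
import requests

COUCH = "http://localhost:5984"  # placeholder URL
AUTH = ("admin", "password")     # placeholder credentials

info = requests.get(f"{COUCH}/medic", auth=AUTH).json()
print(f"medic file size on disk: {info['sizes']['file']:,} bytes")
```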


28 Oct 2025 Call

Attending

Notes

  • 5.0

    • last issue (remove unused views) is still outstanding; @twier to work on it. Discussed which views live where, specifically what is or is not in the medic ddoc. Tom to create a PR for the smallest viable change that can safely fit in 5.0.
    • Be sure to message that no training is needed and there are no UI changes.
    • 5.0 requires Chrome 107 for the future Angular uplift. Research showed UG is most affected, but it’s surmountable.
  • eCHIS KE

    • based on our research in eCHIS, a forum post was published on finding high-doc-count users
    • @elijah is working on testing CHT Core @ master on the Nairobi clone and will start the test upgrade tomorrow
  • Hosting TCO v2

Action items:

4 Nov 2025 Call

Attending

Notes

  • 5.0 (main milestone ticket)

  • eCHIS KE

    • ticket to test 5.0 in Nairobi created
      • compaction completed, upgrade started, ETA middle of next week to finish the test
      • instance is public and URL shared on call
      • what is the measure of success for the test upgrade? A successful upgrade and all systems nominal. A specific disk-space savings number is not a goal, just the success of the upgrade
    • does eCHIS want a hard limit on doc count?
      • it would be very welcome: we have a multi-group effort, and any one group can increase the number of documents without informing the other groups. Having this hard limit would be a great way for everyone to work within the same rules
      • West Africa also interested in this feature
      • is using contacts as a proxy for all documents reasonable? Elijah’s concern is that the number of contacts stays fixed; what about using the number of reports as a proxy instead?
      • the likely feature would just set a hard limit on the number of reports
      • being able to report on users who have hit the limit will be important, but we also have an improvement to the monitoring endpoint which should help track users approaching the limit (see the sketch after this list)
  • Hosting TCO v2 (main ticket: Reduce disk space during upgrades)

    • NA
  • CHToolbox vs CHT Conf

    • should we move CHToolbox to the Medic org? Maybe as an experiment in community-based scripts/code? No big pressure.
    • is it confusing for the community to have two CLI tools? Possibly! For now, the toolbox is viewed as a place for Josh (and others?) to put one-off scripts that can leverage existing CRUD-style code
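
On the doc-count reporting point above: a rough sketch of one way to flag heavy users today, assuming per-user meta databases named `medic-user-<name>-meta` (as seen in CHT logs) and a CouchDB admin at a placeholder URL. Using the meta database’s `doc_count` as a proxy and the 10k threshold are illustrative assumptions, not the proposed feature:

```python
# Sketch: flag user meta databases whose doc_count exceeds a threshold.
import requests

COUCH = "http://localhost:5984"  # placeholder
AUTH = ("admin", "password")     # placeholder
LIMIT = 10_000                   # illustrative, not the proposed hard limit

for db in requests.get(f"{COUCH}/_all_dbs", auth=AUTH).json():
    if db.startswith("medic-user-") and db.endswith("-meta"):
        doc_count = requests.get(f"{COUCH}/{db}", auth=AUTH).json()["doc_count"]
        if doc_count > LIMIT:
            print(f"{db}: {doc_count} docs")
```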

Action items:

  • None

11 Nov 2025 Call

Attending

Notes

  • 5.0 (main milestone ticket)

    • see eCHIS KE “ticket to test 5.0 in Nairobi created”
  • eCHIS KE

    • ticket to test 5.0 in Nairobi created
      • upgrade started Nov 3rd, 12:10:20
      • as of Nov 10, 10:20am Pacific - upgrade is 81% done
      • don’t block the release of 5.0 on compaction; as soon as the upgrade is successful, we can release
      • still do compaction to get final disk-use numbers for the TCO effort and to report back to the MoH (see the sketch after this list)
      • expected to complete by EOW
  • Hosting TCO v2 (main ticket: Reduce disk space during upgrades)

    • On hold
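
On the compaction point above, a minimal sketch for kicking it off by hand, assuming a CouchDB admin at a placeholder URL; `medic` as the target database is an assumption here. Compaction runs in the background, so the request returns immediately:

```python
# Sketch: trigger compaction on a database. CouchDB requires the JSON
# Content-Type header on this endpoint; it returns {"ok": true} once queued.
import requests

resp = requests.post(
    "http://localhost:5984/medic/_compact",  # placeholder URL
    auth=("admin", "password"),              # placeholder credentials
    headers={"Content-Type": "application/json"},
)
resp.raise_for_status()
print(resp.json())
```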

Action items:

  • None

18 Nov 2025 Call

Attending

Notes

  • eCHIS KE
    • NA
  • 5.0
    • Test upgrade succeeded, but disk space wasn’t saved.
    • Looking at file sizes on disk in production vs. the clone of production where the tests were done, we see compaction didn’t complete.
    • Errors were found in the logs confirming compaction died; the exact cause isn’t known:
      [info] 2025-11-14T00:25:23.387450Z couchdb@127.0.0.1 <0.726.0> -------- db shards/95555553-aaaaaaa7/medic-user-foobar-meta.1726644348 died with reason {{badmatch,{error,{noproc,{gen_server,call,[<0.3043468.0>,{pread_iolists,[24835]},infinity]},[{gen_server,call,3,[{file,"gen_server.erl"},{line,419}]},{couch_file,pread_binaries,2,[{file,"src/couch_file.erl"},{line,194}]},{couch_file,pread_binary,2,[{file,"src/couch_file.erl"},{line,179}]},{couch_file,pread_term,2,[{file,"src/couch_file.erl"},{line,167}]},{couch_btree,get_node,2,[{file,"src/couch_btree.erl"},{line,474}]},{couch_btree,stream_node,8,[{file,"src/couch_btree.erl"},{line,1069}]},{couch_btree,fold,4,[{file,"src/couch_btree.erl"},{line,242}]},{couch_bt_engine,fold_docs_int,5,[{file,"src/couch_bt_engine.erl"},{line,1131}]}]}}},[{couch_db_engine,trigger_on_compact,1,[{file,"src/couch_db_engine.erl"},{line,993}]},{couch_bt_engine_compactor,start,4,[{file,"src/couch_bt_engine_compactor.erl"},{line,54}]}]}
      
    • the main error string to look for is “died with reason {{badmatch”
    • It’s further suspected that this is a timeout error due to resource contention. The test instance has 8 CPUs vs production’s 48; 8 CPUs is likely not enough, but this is still just a theory.
    • we started compaction on medic-client and we’ll watch it over the next ~24 hours to see if it completes (see the monitoring sketch after this list)
    • latest logs from docker were published in google drive (private link)
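
For watching that medic-client compaction run, a minimal sketch polling CouchDB’s `_active_tasks` endpoint (URL, credentials, and polling interval are placeholders):

```python
# Sketch: poll _active_tasks for database_compaction progress. If the task
# disappears before reaching 100%, check the logs for the
# 'died with reason {{badmatch' string called out above.
import time
import requests

COUCH = "http://localhost:5984"  # placeholder
AUTH = ("admin", "password")     # placeholder

while True:
    tasks = requests.get(f"{COUCH}/_active_tasks", auth=AUTH).json()
    compactions = [t for t in tasks if t["type"] == "database_compaction"]
    if not compactions:
        print("no compaction tasks running - finished (or died; check logs)")
        break
    for t in compactions:
        print(f"{t.get('database', '?')}: {t.get('progress', '?')}% complete")
    time.sleep(60)
```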

Action items:

  • @jkuester - reach out to the CouchDB Slack to see if they have any insights as to why compaction failed.

25 Nov 2025 Call

Attending

Notes

  • 5.0.0 k8s issue:
    • Issue with EKS deployment of 5.0.0 helm charts.
    • Helm charts might need to be updated to add a separate Persistent Volume for Nouveau, because EBS volumes inherently don’t allow multiple mounts.
      • This might be a config option (one that was set differently in testing) - Multi-Attach enabled: false
      • Considerations and limitations of Multi-attach.
      • Possible that the instances we tested the helm charts with were using io2 volumes, which have broad support for multi-attach, but maybe the Demo instance is running an older io1 volume (which has much narrower support for multi-attach).
      • The Demo instance is running on gp2 (and not io2) so there is no support for multi-attach.
      • The decision is that we will move the Nouveau container inside the Couch container’s pod so they can share the same volume without needing multi-attach.
    • Also an issue with “rolling upgrade”. Our current helm charts do not set the configuration that forces k8s not to “roll” the upgrade. This means that, by default, k8s tries to spin up new containers before destroying the old ones. This is obviously a serious problem for the CHT (particularly Couch) containers.
      • To fix this, we need to set the config in our helm charts to prevent “rolling” the updates (at least for the Couch container); see the sketch below.
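
A minimal sketch of what that change amounts to at the API level, assuming the Couch containers run as a Deployment (a StatefulSet would set `updateStrategy` instead); in the helm charts themselves this is just the strategy field on the workload template, and the names below are placeholders:

```python
# Sketch: switch a workload from the default RollingUpdate strategy to
# Recreate, so old pods are torn down before new ones come up.
# Requires the `kubernetes` Python client and a working kubeconfig.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

patch = {"spec": {"strategy": {"type": "Recreate"}}}
apps.patch_namespaced_deployment(
    name="cht-couchdb",  # placeholder deployment name
    namespace="cht",     # placeholder namespace
    body=patch,
)
```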

02 Dec 2025 Call

Attending

Notes

  • 5.0.0

    • chat a bit about 5.0.1 happenings and why
  • chat about NPM vulnerability to supply-chain attacks - anything Medic should do differently? Anything stewardship/app serv teammates with sensitive tokens on disk should do differently?

    • change the release-notes script to prompt for the GH token instead of having it on disk in the clear? (see the sketch after this list)
    • for agentic work, how do we ensure we’re not exposing/sharing private keys and CHT passwords? A teammate was using Claude and saw it had used local contexts and was returning full URLs with production passwords in them because it had been given all that info :frowning:
  • Hosting TCO v2

    • Josh doesn’t have time to look into the CPU saturation ticket, and Tom is looking into splitting up DDocs (but presumably is busy too)
    • Diana brought up Background Indexing - specifically batch_channels: This setting controls the number of background view builds that can be running in parallel at any given time. The default is 20.
    • discussion of using multi-agent tooling on Hosting TCO v2? Sugat feels multi-agent isn’t ready
    • The 5.0 release was taking up a lot of time/resources - with it now being released, the team wants to really focus on the v2 effort.
    • very likely, the biggest win will be splitting up the views across ddocs - because today, every time we make a change, you have to rebuild all the views in a ddoc as opposed to just the one that changed if they were split up. Currently at ~4 ddocs; this would go to ~35.
    • there could be an “advanced” way to upgrade where you get to choose which ddocs to index on upgrade - in the works! We could hard-code a description of what each ddoc does.
  • unrelated to TCO, but chatted about upgrade service e2e issues
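
On the release-notes token item above, a minimal sketch of the prompt-instead-of-file approach; the env-var fallback and variable names are assumptions, not how the current script works:

```python
# Sketch: ask for the GitHub token at runtime rather than reading it from a
# JSON file sitting on disk in the clear. getpass hides the typed input.
import os
from getpass import getpass

token = os.environ.get("GITHUB_TOKEN") or getpass("GitHub token: ")
```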

tasks:

  • @mrjones - change release notes script to prompt for token instead of keeping in the clear on disk in a json file

9 Dec 2025 Call

Attending

Notes

  • Hosting TCO v2
    • touched on 2849 and 10220. What is the CPU/time cost to split up the ddocs vs. the cost to run this split-up architecture day to day? 1 CPU per index per ddoc per shard (see the back-of-envelope sketch after this list).
    • what’s “too many” JS processes? If we split the most, we reduce the amount of space needed for upgrades (because so few views will be updated), but then you need a lot of CPUs
    • Background Indexing (aka the ken config) will come into play, especially as the number of indexing processes grows
    • We should run all the combos - 1:1 views-to-ddocs vs. how it is today - and measure performance on 2, 4, 8, 16, 32 and 64 cores
    • Because of the code-path complexity and tech debt we’d take on, we won’t consider more than one configuration for the number of indexes - we’ll have a one-size-fits-all
    • Diana is still interested in working on partial re-indexing on upgrade too
    • maybe use scalability test to automate testing?
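
A back-of-envelope sketch of the “1 CPU per index per ddoc per shard” concern above; the ddoc counts come from the discussion, q=8 shards is a placeholder (the real shard count depends on the deployment), and the worst case assumes every index rebuilds at once:

```python
# Sketch: worst-case concurrent indexers = ddocs x shards (one index per ddoc).
q = 8  # shards per database (placeholder, not confirmed for CHT)

for ddocs in (4, 35):  # roughly today vs. after splitting views out
    print(f"{ddocs} ddocs x {q} shards -> up to {ddocs * q} concurrent indexers")
```

With ~35 ddocs that’s up to 280 indexers per database in the worst case, which is where the ken `batch_channels` cap (default 20, per the 02 Dec notes) comes into play.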

tasks:

  • @mrjones - change release notes script to prompt for token instead of keeping in the clear on disk in a json file