Hosting Total Cost of Ownership Squad

Hi @derick - thanks for the comment!

As far as Hosting TCO v1 goes - the effort to reduce the overall disk space used by the CHT day to day - we’re not learning anything new, and this is a good thing! Let’s dive into why.

Almost exactly a year ago, the ticket was opened to implement Couch 3.5’s Nouveau in the CHT. In this ticket, a number of key questions were asked; if all the answers were positive, it would very likely be a win for the community at large.

Just a month later, in Nov of 2024 we had good answers:

how much disk savings do we see?

I saw ~25% disk savings [on a production data set]

Every single upgrade since then has shown about this same level of disk savings. In the best case scenario, it’s been above 30%.
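As a back-of-envelope illustration of what that level of savings means (the instance sizes below are made up; only the ~25% figure comes from our testing):

```shell
# Illustrative only: apply the ~25% savings we keep observing to some
# hypothetical pre-upgrade instance sizes.
savings_pct=25
for gb in 100 500 2000; do
  echo "${gb}GB -> $(( gb * (100 - savings_pct) / 100 ))GB"
done
```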

By April of 2025 there was some concern that maybe our test results weren’t valid, so we re-tested using a number of production datasets. The high watermark of just under 35% savings still held.

And so it’s true with the tests this month - we’re still seeing the same savings hold, and larger instances stand to gain the most. We can’t promise any specific amount, but it’s looking good! Here’s the chart we cite often that summarizes all this:


28 Oct 2025 Call

Attending

Notes

  • 5.0

    • last issue (remove unused views) is still outstanding. @twier to work on it. Discussing which views are where, specifically what is or is not in the medic ddoc. Tom to create a PR for the smallest viable change that can fit safely in 5.0.
    • Be sure to message around no training needed, no UI changes.
    • 5.0 requires Chrome 107 for future angular uplift. research was done showing UG is most affected, but surmountable.
  • eCHIS KE

    • based on our research in eCHIS, a forum post was published on finding high doc-count users
    • @elijah working on testing CHT Core @ master for Nairobi clone, will start test upgrade tomorrow
  • Hosting TCO v2

Action items:

4 Nov 2025 Call

Attending

Notes

  • 5.0 (main milestone ticket)

  • eCHIS KE

    • ticket to test 5.0 in Nairobi created
      • compaction completed, upgrade started, eta middle of next week to finish test
      • instance is public and URL shared on call
      • what is measurement of success for test upgrade? successful upgrade and all systems nominal. Specific disk space savings is not a goal, just success of upgrade
    • does eCHIS want a hard limit on doc count?
      • it would be very welcome: we have a multi-group effort and any one group can increase the number of documents without informing the other groups. Having this hard limit would be a great way for everyone to work within the same rules
      • West Africa also interested in this feature
      • is using contacts as a proxy for all documents reasonable? Elijah’s concern is that number of contacts stays fixed, what about using number of reports as a proxy?
      • likely feature would just set hard limit on number of reports
      • being able to report on users who have hit the limit will be important, but we also have an improvement to monitoring endpoint which should help track users approaching the limit
  • Hosting TCO v2 (main ticket: Reduce disk space during upgrades )

    • NA
  • CHToolbox vs CHT Conf

    • should we move CHToolbox to the Medic org? Maybe an experiment in having community-based scripts/code? No big pressure
    • is it confusing for the community to have two CLI tools? Possibly! For now the toolbox is viewed as a place for Josh (and others?) to put one-off scripts that can leverage existing CRUD-style code

Action items:

  • None

11 Nov 2025 Call

Attending

Notes

  • 5.0 (main milestone ticket)

    • see eCHIS KE “ticket to test 5.0 in Nairobi created”
  • eCHIS KE

    • ticket to test 5.0 in Nairobi created
      • upgrade started Nov 3rd, 12:10:20
      • as of Nov 10 10.20am Pacific - upgrade is 81% done
      • don’t block release of 5.0 on compaction, as soon as upgrade is successful, we can release
      • still do compaction to get final disk use numbers for TCO effort and to report back to MoH
      • expected to complete by EOW
  • Hosting TCO v2 (main ticket: Reduce disk space during upgrades )

    • On hold

Action items:

  • None

18 Nov 2025 Call

Attending

Notes

  • eCHIS KE
    • NA
  • 5.0
    • Test upgrade succeeded, but disk space wasn’t saved.
    • Looking into file sizes on disk in production vs clone of production where tests were done, we see compaction didn’t complete
    • Errors were found in the logs confirming compaction died; the exact cause isn’t known:
      [info] 2025-11-14T00:25:23.387450Z couchdb@127.0.0.1 <0.726.0> -------- db shards/95555553-aaaaaaa7/medic-user-foobar-meta.1726644348 died with reason {{badmatch,{error,{noproc,{gen_server,call,[<0.3043468.0>,{pread_iolists,[24835]},infinity]},[{gen_server,call,3,[{file,"gen_server.erl"},{line,419}]},{couch_file,pread_binaries,2,[{file,"src/couch_file.erl"},{line,194}]},{couch_file,pread_binary,2,[{file,"src/couch_file.erl"},{line,179}]},{couch_file,pread_term,2,[{file,"src/couch_file.erl"},{line,167}]},{couch_btree,get_node,2,[{file,"src/couch_btree.erl"},{line,474}]},{couch_btree,stream_node,8,[{file,"src/couch_btree.erl"},{line,1069}]},{couch_btree,fold,4,[{file,"src/couch_btree.erl"},{line,242}]},{couch_bt_engine,fold_docs_int,5,[{file,"src/couch_bt_engine.erl"},{line,1131}]}]}}},[{couch_db_engine,trigger_on_compact,1,[{file,"src/couch_db_engine.erl"},{line,993}]},{couch_bt_engine_compactor,start,4,[{file,"src/couch_bt_engine_compactor.erl"},{line,54}]}]}
      
    • the main error string to look for was died with reason {{badmatch
    • It’s further suspected that this is a timeout due to resource contention: the test instance has 8 CPUs vs production’s 48, and 8 CPUs is likely not enough - but this is only a theory
    • we started compaction on medic-client and we’ll watch this over the next ~24 hours to see if it completes
    • latest logs from docker were published in google drive (private link)
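For anyone chasing a similar failure, here’s a self-contained sketch of the log check (the sample log line and file name are illustrative; the commented-out `_compact` call is the standard CouchDB compaction API):

```shell
# Write one sample line carrying the crash signature, then count occurrences.
# Against a real deployment you'd grep your exported CouchDB/docker logs.
cat > sample-couch.log <<'EOF'
[info] couchdb@127.0.0.1 db shards/.../medic-user-foobar-meta died with reason {{badmatch,{error,...}}}
EOF
crashes=$(grep -cF 'died with reason {{badmatch' sample-couch.log)
echo "compaction crashes found: ${crashes}"

# To restart compaction on a database, CouchDB's standard API is:
#   curl -X POST http://admin:pass@localhost:5984/medic-client/_compact \
#        -H 'Content-Type: application/json'
# and progress then shows up as "database_compaction" under GET /_active_tasks.
```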

Action items:

  • @jkuester - reach out to the CouchDB Slack to see if they have any insights as to why compaction failed.

25 Nov 2025 Call

Attending

Notes

  • 5.0.0 k8s issue:
    • Issue with EKS deployment of 5.0.0 helm charts.
    • Helm charts might need to be updated to add a separate Persistent Volume for Nouveau because EBS volumes have an inherent property of not allowing multiple mounts
      • This might be a config option? (one that was set differently in test) - Multi-Attach enabled: false
      • Considerations and limitations of Multi-attach.
      • Possible that the instances we tested the helm charts with were using io2 volumes which have broad support for multi-attach, but maybe the Demo instance is running an older io1 volume (which has a much more narrow support path for multi-attach).
      • The Demo instance is running on gp2 (and not io2) so there is no support for multi-attach.
      • The decision is that we will move the Nouveau container inside the Couch container’s pod so they can share the same volume without needing multi-attach.
    • Also an issue with “rolling upgrade”. Our current helm charts do not set the configuration to force k8s to not “roll” the upgrade. This means that, by default, k8s tries to spin up new containers before destroying the old ones. This is obviously a serious problem when it comes to the CHT (particularly Couch) containers.
      • To fix this, we need to set the config in our helm charts to prevent “rolling” the updates (at least for the Couch container).
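A minimal sketch of that fix in Deployment terms (the name below is hypothetical, not our actual chart values; if Couch runs as a StatefulSet, the equivalent knob is `updateStrategy`):

```yaml
# Hypothetical Deployment fragment: "Recreate" tears down the old pod before
# starting the new one, so old and new containers never contend for one volume.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: couchdb            # hypothetical; not the actual chart's name
spec:
  strategy:
    type: Recreate         # default is RollingUpdate, which overlaps old/new pods
  # ...rest of the Deployment spec unchanged...
```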

02 Dec 2025 Call

Attending

Notes

  • 5.0.0

    • chat a bit about 5.0.1 happenings and why
  • chat about NPM vulnerability to supply chain attacks - anything Medic should do different? Anything stewardship/app serv teammates with sensitive tokens on disk should do different?

    • change release note script to prompt for GH token instead of having it on disk in the clear?
    • for agentic work, how do we ensure we’re not exposing/sharing private keys and CHT passwords? A teammate was using Claude and saw it had used local contexts and was returning full URLs with production passwords in them b/c it had been given all that info :frowning:
  • Hosting TCO v2

    • Josh doesn’t have time to look into the CPU saturation ticket, and Tom is looking into splitting up DDocs (but is presumably busy too)
    • Diana brought up Background Indexing - specifically batch_channels: This setting controls the number of background view builds that can be running in parallel at any given time. The default is 20.
    • discussion of using multi-agent tooling on Hosting TCO v2? Sugat feels multi-agent isn’t ready
    • 5.0 release was taking up a lot of time/resources - with it not being released, team wants to really focus on v2 effort.
    • very likely, the biggest win will be splitting up the views in the ddocs - today every time we make a change you have to rebuild all the ddocs, as opposed to just the one affected if they were split up. Currently at ~4 ddocs; this would go to ~35
    • there could be an “advanced” way to upgrade where you get to choose which ddocs to index on upgrade - in the works! We could hard-code a description of what each ddoc does.
  • unrelated to tco, but chat about upgrade service e2e issues
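For reference, a minimal sketch of where that setting lives (CouchDB local.ini; 20 is the documented default for the ken background index builder):

```ini
; Background index builder (ken): how many full view builds run in parallel.
[ken]
batch_channels = 20
```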

tasks:

  • @mrjones - change release notes script to prompt for token instead of keeping in the clear on disk in a json file

9 Dec 2025 Call

Attending

Notes

  • Hosting TCO v2
    • touch on 2849 and 10220. What is the CPU/time cost to split up ddocs vs the cost to run the split-up architecture day to day? 1 CPU per index per ddoc per shard.
    • what’s “too many” JS processes? If we go with the highest ddoc count, we reduce the amount of space needed for upgrades (because so few views will be updated), but then we need a lot of CPUs
    • Background Indexing (aka the ken config) will come into play, especially as we increase the number of indexing processes
    • We should run all the combos of 1:1 of views to ddocs vs how it is today and performance on 2, 4, 8, 16, 32 and 64 core
    • Because of code-path complexity and the tech debt we’d take on, we won’t consider more than one configuration for the number of indexes - we’ll have a one-size-fits-all
    • Diana is still interested in working on partial re-indexing on upgrade too
    • maybe use scalability test to automate testing?
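Rough math on the concurrency point above (a shard count of 8 is illustrative, and the ~4 vs ~35 ddoc counts are from the discussion; real numbers depend on q and the final design):

```shell
# Rough upper bound on concurrent view-builder processes during a full rebuild,
# assuming roughly one process per ddoc per shard.
shards=8
today=$(( 4 * shards ))    # ~4 ddocs currently
split=$(( 35 * shards ))   # ~35 ddocs after a 1:1 view<->ddoc split
echo "today: ~${today} builders; after split: ~${split} builders"
```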

tasks:

  • @mrjones - change release notes script to prompt for token instead of keeping in the clear on disk in a json file

15 Dec 2025 Call

Attending

Notes

  • eCHIS KE

    • diving into why Baringo had such bad downtime - no exact resolution except that couch was crashing
    • Diana suspects medic-sentinel db might be corrupt?
    • Elijah will open Gitlab ticket to track and share logs etc.
  • Hosting TCO v2

    • @twier - did a bunch more testing of splitting views up so they’re 1:1 view <> ddoc
    • doesn’t seem to have a negative effect on CPU & disk: it’s about the same amount of space and CPU use.
    • tests were mostly ad hoc, no Prometheus measuring etc.
    • separately - Tom also looking into adding postgres into CHT core as a way to save disk space

tasks

  • @diana - update view/ddoc ticket to have specific test we should be running on master vs ticket’s branch so we can get realistic test results we’re confident in.

6 Jan 2026 Call

Attending

Notes

  • eCHIS KE

  • Hosting TCO v2

    • on the docs site, PR: “interactive cost calculator”
    • idea: what if we update the upgrade service to:
      • allow sequential rebuilds/updates of views
      • delete views first, thus making this an upgrade that pulls the server offline until views are rebuilt
      • could be backwards compatible to 4.x (maybe back to 4.5? 4.0?!?)
      • could use TBD CHT Conf features which could then be both baked into the CHT upgrade service (literally add CHT Conf to the upgrade service) as well as allow k8s deployments to do the same trick using CHT Conf
      • the tricky part will be how to prevent CHWs from accessing the server (eg stop API, stop nginx/ingress) but still allow CHT Conf/Upgrade Service access
      • could also add an app settings value: if set to the new style, “do upgrade with downtime”; if not present, do the old-style 5x-disk-space upgrade.
  • TCONext (v3?)

tasks

  • @diana - update view/ddoc ticket to have specific test we should be running on master vs ticket’s branch so we can get realistic test results we’re confident in.

13 Jan 2026 Call

Attending

Notes

TCO v2:

  • Upgrade paths discussion around “add design document comparison support during upgrades” and “Support ‘downtime upgrade’ with minimal disk space requirements”
  • What is the UX for when you delete your views but run out of disk space in the TBD feature?
  • Early research on the downtime upgrade is that we’re only getting ~25% savings (so ~3.8x space vs 5x space), but once we get the ddocs broken up, this approach in 10560 will be more viable
  • discuss UX for add design document comparison support feature
    • mrjones: collapse the list of affected ddocs and show a count in the summary.
    • mrjones: If possible, for each release (not branches) show whether or not that version requires a ddoc re-index. This would massively encourage folks to take the cheap upgrades that don’t index, and inform them to be cautious with the others that require indexing
  • discuss downtime upgrade and how it will work with CHT Conf and k8s with helm
  • discuss feedback on backup page
    • add “restore” to title
    • add a link to restore at the top of the page so folks know it’s a key part of backup
  • discuss future more granular perms for online users coming in future couch release
  • touch on security improvements that Jonathan brought up (private slack thread) around PIN vs encryption at rest vs partner feature requests
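Worked example of those disk multiples for a hypothetical 100GB dataset (the 5x and ~25% figures are from the discussion above; the dataset size is made up):

```shell
# Peak disk needed during an upgrade, under each approach.
data_gb=100
standard=$(( data_gb * 5 ))            # standard upgrade peaks near 5x
downtime=$(( standard * 75 / 100 ))    # early downtime-upgrade tests saved ~25%
echo "standard peak: ${standard}GB; downtime-upgrade peak: ~${downtime}GB"
```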

20 Jan 2026 Call

Attending

Notes

TCO v2:

  • Support “downtime upgrade” with minimal disk space requirements
    • Best possible disk use with destructive upgrade approach is 2.6x
    • the risk is that if you delete all indexes and something goes wrong, the server is down until you restore from backup or revert and rebuild on the existing code base. This could be especially bad if you are low on disk space, follow this path, and then run out of disk space - restoring from backup is the only option.
    • medic-client is by far the largest use of space; it is the first spike in the red line of the graph
    • See earlier Slack discussion on same
  • Split up views into different ddocs
    • @diana suggests grouping views by the things that use them - focusing on what the Android client uses. eg “this ddoc for sentinel, this ddoc for sync, etc.”
    • this feature is likely the most important to ship ASAP
    • no immediate blockers known; increased IOPS may have an impact on day-to-day performance, which we’re watching out for
  • Selectively only index specific / necessary ddocs when upgrading to a new version
    • when coupled with broken-down ddocs, this could help
    • tom to update ticket, no updates since May of last year

eCHIS KE

  • 5.0 upgrade plan - how is current 5x disk space requirements being addressed?
  • plan is to upgrade 5 instances at a time; no current plan for the lack of disk space
  • need to figure out how to shrink volumes, 186TB of 200TB in use :frowning:
  • Looking at internal dashboard all but 4 VMs are well under 50%, most under 30% disk use.
  • Maybe one idea would be to simulate having a NAS with a lot of storage that can be dynamically allocated to the VM most in need. Something like:
    • find out which VMs are on which hypervisors
    • create a “storage” VM with a lot of disk on the hypervisor with the CHT instance you want to upgrade
    • NFS mount the storage VM
    • copy the couch data to the NFS mount
    • reconfigure the CHT instance to use the couch data on the NFS mount
    • upgrade the CHT and use a bunch of extra disk space, but you have more room via NFS. As well, because it’s on the same hypervisor, the disk access should be very fast, hopefully as fast as if the data were on the VM directly
    • when the upgrade is done, move couch data back to boot volume, unmount NFS
    • when all upgrades are done, delete the storage VM, returning the storage to the unused storage pool
    • net result should be that no extra disk space needs to be permanently assigned to any one VM, but all instances are upgraded.

27 Jan 2026 Call

Attending

Notes

Welcome @Jude_Zambarakji !

TCO v1:

TCO v2:

  • Split up views into different ddocs
  • great update from last week: Attached is a little analysis of where views are used, built from grepping each one’s name in cht-core, separating references in webapp, sentinel, api, and shared-libs; for shared-libs it goes one layer deeper to try and see where it’s ‘actually’ used.
  • Look into ways of possibly not having offline clients rebuild medic_client, as this will be almost as bad as a 1st-time login
  • moving a view from one ddoc to another doesn’t cause a re-index

eCHIS KE

03 Feb 2026 Call

Attending

Notes

Welcome @Jude_Zambarakji !

TCO v2:

  • Split up views into different ddocs
  • most recent effort has seen high iowait, but IOPS have been fine - confusing!
  • would be a good test to compare concurrent indexing vs sequential with more ddocs

eCHIS KE

  • working on 4.21 → 5.0.1 upgrade effort
  • indexing is taking a long time during upgrades
  • having to move instances to k8s so they have more flexibility with disk space. Not a PVC, but something like an NFS mount, so it’s tied to one VM and can’t move
  • 34 of 47 instances upgraded
  • larger instances are the ones that are outstanding/slow. These are expected this week
  • VMs are renamed to follow which counties are on them
  • Discussion about k8s affinity (pinning a service like couchdb to a specific node with high CPU/RAM count) and how MoH KE is using this
  • aware that they’re losing the upgrade service and planning to move to helm (or DIY direct k8s calls?) for future upgrades

10 Feb 2026 Call

Attending

Notes

TCO v2:

  • Split up views into different ddocs
  • There is a branch that Tom is working on getting CI to pass on
  • Using erlang shell to get size of indexes from Nouveau Kwale clone to see how big they are - results posted in table on ticket
  • some indexes are only used by offline users (tasks_by_contact) and some are only used by online users (doc_summaries_by_id)
  • there could be big gains by optimizing and removing these indexes (TCOv3? re-prioritize to make this TCOv2?)

eCHIS KE

  • 4 of the larger instances have just completed 4.21 → 5 upgrades; they were then migrated docker → k8s
  • there’s been some shard corruption, will try to rebuild the shard first, then recover from backup if need be. rebuilding shard avoids rebuilding index so try that first
  • fix for search issue found by eCHIS KE will come in 5.0.2 first, and be baked into all future versions like 5.2
  • 5.0.1 → 5.0.2 upgrade is free, in that there’s no re-indexing. this means it will effectively be instant to upgrade, as compared

17 Feb 2026 Call

Attending

Notes

TCO v2:

eCHIS KE

  • still migrating instances in k8s to use SAN instead of local VM disk.
  • Using a SAN allows disks to grow and shrink, whereas the Konza datacenter’s built-in feature for growing a VM disk doesn’t allow you to shrink it.
  • Hope to have all instances on 5.0.2, on k8s, and on SAN by end of Q1
  • 5.0.1 → 5.0.2 upgrade is free, in that there’s no re-indexing. this means it will effectively be instant to upgrade, as compared

24 Feb 2026 Call

Attending

Notes

eCHIS KE

  • is eCHIS KE over-provisioned on disk?
    • For example, some instances have 2TB provisioned, but only spike up to 500GB during upgrade and idle at 150GB
    • 2 migrations in play: move to k8s and consolidating instances into few VMs
    • @twier to put together a report on how much echis KE is over-provisioned on specific values to share with eCHIS KE
  • Earlier batch size optimization changed in settings has been getting dropped
    • this is stored in DB, so shouldn’t be “getting dropped”
    • default was 30,000 which should be enough, so not needed in 5.x
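The over-provisioning example above, worked through (all figures are the illustrative ones from the bullet, not a real instance):

```shell
# 2TB provisioned, 500GB upgrade-time peak, 150GB idle.
provisioned=2000; peak=500; idle=150
peak_pct=$(( peak * 100 / provisioned ))
idle_pct=$(( idle * 100 / provisioned ))
echo "peak use: ${peak_pct}% of provisioned; idle use: ${idle_pct}%"
```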

TCO v2:

  • Add frontend performance benchmarking suite
    • what should we do with metrics like this? Answer: put it in a super easy to access CSV file in a GH repo
    • What is missing from these tests? Answer: Large/Complex form - maybe use a complex form we used when testing Enketo uplift that a partner provided?
    • This won’t answer “how will this affect users with 200k docs vs 20k docs”, just app front end experience, all things being equal
    • This will run on the same type of EC2 instance to ensure consistent performance from run to run
  • Split up views into different ddocs
    • CI was green so images were created which was handy for testing
    • Finding: hitting “stage” is failing b/c the EC2 instance is running out of RAM @ 8GB with 4M docs in the medic DB - a regression which will need to be addressed