General discussion about “upgrades take a lot of space” - very likely Hosting TCO Squad 2.0 focus
why does a CHT upgrade (adding new DDocs) cause a more than 100% increase in disk space?
view changes, which do not happen on every upgrade, are what cause the large increase in disk space
@jkuester to file POC ticket in CHT Core to show how disk space goes up on a generic couch instance with view re-indexing. Discuss w/ @diana and @twier. We’ll then take this over to the Couch Slack channel for questions.
Q: do we know for certain if a 4.19 → 4.20 upgrade will cause this spike in disk use? A: From the current diff of 4.19.0...master, I do not see any changes to the ddoc files
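One way to sanity-check such a diff: CouchDB derives a view group's index from a signature of its view source, so any added or changed map/reduce function implies (re)indexing. A minimal sketch (the ddoc bodies here are made-up samples, not CHT's real ddocs):

```python
# Sketch: predict whether a new design doc will trigger view rebuilds by
# diffing view definitions. Any changed or added map/reduce function
# means the index must be (re)built. Ddoc bodies are made-up samples.

def changed_views(old_ddoc, new_ddoc):
    """Names of views whose map/reduce source differs or is new."""
    old_views = old_ddoc.get("views", {})
    return [name for name, body in new_ddoc.get("views", {}).items()
            if old_views.get(name) != body]

old = {"views": {"by_type": {"map": "function(doc){ emit(doc.type); }"}}}
new = {"views": {"by_type": {"map": "function(doc){ emit(doc.type, 1); }"},
                 "by_date": {"map": "function(doc){ emit(doc.date); }"}}}

print(changed_views(old, new))  # ['by_type', 'by_date']
```

An empty result for every ddoc in the 4.19.0...master diff would support the "no spike" answer above.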
Need for more disk-space upgrade testing on an EC2 test instance with prod data?
MoH KE update on testing Nouveau on prod data
had to work through MoH process for provisioning a new VM
got a large and small cloned instance
will commence testing once ready - but need to make sure large instance has enough free space before proceeding
Review large MoH deployment testing Nouveau branch on prod data
How much spare disk space do they need?
Sugat is concerned it could be more than 5x, as seen in the research ticket
Elijah: I’m in the process of procuring additional storage to begin the upgrade process and wanted to clarify that we settled on 5x current capacity. E.g. one clone has a 5TB volume with 3.2TB utilized: should we expand to 16TB, or up to 25TB to get some margin of safety?
recommendation: 16TB should be fine since the utilization is 3.2TB
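The arithmetic behind that recommendation, as a worked example (the 5x multiplier is the worst case from the research ticket, applied to utilized data rather than the provisioned volume size):

```python
# Headroom math for the clone above: multiply the *utilized* space
# (3.2TB), not the provisioned 5TB volume, by the worst-case factor.
utilized_tb = 3.2
worst_case_factor = 5
needed_tb = utilized_tb * worst_case_factor
print(needed_tb)  # 16.0, so a 16TB volume matches the recommendation
```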
Interrupted upgrades should both not lose all progress, and be able to resume where they left off
recommend starting in the MoH Data Center with 8vCPU/16GB RAM to see how it goes. Success or failure alike will inform the TCO Squad’s next steps.
If a 3.2TB instance really needs >16TB (25TB!?), do we need to ship TCO v2 (e.g. in 4.2x), which will reduce the total space needed, before TCO v1 (in 5.0)?
TCO Squad agrees that we should wait for Elijah’s testing in the MoH datacenter to complete; this may take weeks, possibly months in the worst case. In the interim, upgrade disk-space research can continue
Does eCHIS KE need to test Nouveau before we release 5.0?
nothing will really change for the branch if we don’t test on MoH KE before the release
eCHIS KE only has 50% of its storage available, so upgrades won’t be easy because they don’t have 5x free disk available
input from eCHIS KE: prepare community for what they’ll need to upgrade and what the benefits will be
eCHIS has two instances, one small and one big, small is done and big is maybe 3/4 done.
will retest and run CHT Toolbox to closely monitor disk space
Despite the earlier choice to wait, we will not wait for eCHIS KE test results. If we get lucky and testing is done beforehand, we’ll incorporate the findings, but it is not blocking.
Josh provided a branch of CHToolbox to do sequential updates
CHToolbox - it’s working, but the active tasks shown in the TUI differ from the active tasks shown in couch Fauxton, e.g. tasks outside of the medic db (eg medic-client). Can we look into why these non-medic indexes are getting run?
Any way to get an index to resume if it gets interrupted halfway through (or 20% of the way through)?
Josh will try writing 1 ddoc at a time and warming it, which would hopefully allow just one index at a time and also avoid other dbs like medic-client becoming active
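A minimal sketch of that sequential warm loop, assuming stock CouchDB endpoints (`PUT /{db}/_design/{name}`, a view query to kick off indexing, `GET /_active_tasks`). Transport functions are injected, so nothing here talks to a real server; note the exact-name db match, which avoids the "medic matches medic-client" confusion seen in the TUI:

```python
def task_db_name(task):
    """Plain db name from an _active_tasks `database` field, which looks
    like 'shards/00000000-ffffffff/medic.1700000000'."""
    return task.get("database", "").split("/")[-1].rsplit(".", 1)[0]

def indexer_tasks(active_tasks, database):
    """Indexer tasks running against exactly `database` (so 'medic'
    does not match 'medic-client')."""
    return [t for t in active_tasks
            if t.get("type") == "indexer" and task_db_name(t) == database]

def warm_sequentially(ddocs, database, put_ddoc, query_view, get_tasks):
    """Write one ddoc, touch one of its views to start indexing, and wait
    for indexing to finish before moving on to the next ddoc."""
    for ddoc in ddocs:
        put_ddoc(database, ddoc)                       # PUT /{db}/_design/{name}
        first_view = next(iter(ddoc["views"]))
        query_view(database, ddoc["_id"], first_view)  # GET .../_view/{v}?limit=1
        while indexer_tasks(get_tasks(), database):    # GET /_active_tasks
            pass  # real code would time.sleep() between polls

# Stubbed run: one ddoc, indexing "finishes" on the second poll.
calls = []
polls = [[{"type": "indexer", "database": "shards/00-ff/medic.1700000000"}], []]
warm_sequentially(
    [{"_id": "_design/medic", "views": {"by_type": {}}}], "medic",
    put_ddoc=lambda db, d: calls.append(("put", d["_id"])),
    query_view=lambda db, d, v: calls.append(("warm", d, v)),
    get_tasks=lambda: polls.pop(0),
)
print(calls)  # [('put', '_design/medic'), ('warm', '_design/medic', 'by_type')]
```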
5.0 effort - what do we need to do to finish up the main ticket to ship 5.0
@twier to follow up with @Lorena_Rodriguez about implications of new helm charts being merged to cht-core master and what this means for 5.0 (eg: are these new helm charts a breaking change for folks running 4.x helm charts now?)
no need to document the Docker upgrade; it will “just work”
effort 1: update docs on what Nouveau is and how it works in general
effort 2: how do you upgrade a k8s deployment from 4.x to 5.x? Or is that too much, and maybe 5.0 as a whole will need to cover the helm 4.x → 5.x process so this squad doesn’t need to cover it?
For effort 2, Josh to open docs issue about “how to upgrade to new helm charts for 5.0” and add to 5.0 milestone
eCHIS KE efforts
Elijah is still working with the branch of CHT Toolbox to auto-serially upgrade an instance.
has had to restart the couch container a few times after it crashed during the upgrade test
noted that Toolbox doesn’t always auto-continue from one DB to another
medic and then medic-client are the really hard ones; after those it should go very fast. Failing just the once between them should only need one restart, so maybe it isn’t worth fixing?
there is an upcoming deadline of Aug 29th to have all instances on the latest feature (specifically contact summary) - this is not tenable
crazy ideas to work around slow IO:
implement pre-compute caching
add queuing mechanism for users, allowing at least 1 sync per day
deny a large portion of users access to the system to reduce IO requirements
rate limit ingress
drop free-text indexes - or even remove search altogether?
what other major features can we fully remove?
not a crazy idea at all; the TCO feature is likely where it should go, as that’s a logical destination
fork couch to possibly stop re-indexing when you do something like delete free-text indexes
follow the example of another NGO with a huge, slow couch: consider indexing off the VMs. Set up replication to an off-VM instance, run the re-index there, stop both, and sneaker-net the data back to production
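For the free-text-index idea specifically, the CouchDB-level moves would be deleting the relevant design doc (which needs its current `_rev`) and then `POST /{db}/_view_cleanup`, since deleted ddocs leave their index files on disk until a cleanup runs. A sketch; the ddoc name and rev below are hypothetical placeholders, not CHT's real names:

```python
# Sketch of "drop free-text indexes" at the CouchDB level. The ddoc
# name 'medic-freetext' and the rev are hypothetical placeholders.
# Deleted ddocs leave index files on disk until _view_cleanup runs.

def drop_ddoc_requests(db, ddoc_name, rev):
    """(method, path) sequence to remove a ddoc and reclaim its index files."""
    return [
        ("DELETE", f"/{db}/_design/{ddoc_name}?rev={rev}"),
        ("POST", f"/{db}/_view_cleanup"),
    ]

for method, path in drop_ddoc_requests("medic", "medic-freetext", "3-abc123"):
    print(method, path)
```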
There are ~10 instances out of 47 that are doing poorly
Have restarted the Kwale clone upgrade test now that the eCHIS Data Center is happy, to see how long it will take. The prior effort was a sequential upgrade via CHToolbox; the new retest is done “normally” with concurrency
eCHIS KE very likely (80%?) to upgrade to 5.0 when it comes out - according to Elijah
Nairobi has ~100 users who are over the 10k doc limit; some are even over 100k. This will have an impact on the server
Current action items for eCHIS Instances:
upgrade to 4.x branch with 5m → 30m downward sync
fix sentinel backlog by fast forwarding sequence ID
fix users who have >10k documents to sync (above the replication limit). If they’re all “focal persons” with offline accounts, maybe migrate them to online accounts? Caution: there’s no way to find users who have failed to log in because they can’t sync their large number of docs
CHT 4.21 has CouchDB 3.5, which has shown massive performance improvements, specifically around disk use (reduced locking calls to disk). This should be the 4th to-do for eCHIS KE to improve performance
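The sequence fast-forward in the list above amounts to pointing sentinel's processed-seq metadata at the main db's current `update_seq` so the backlog is skipped. A sketch of the doc rewrite only; it assumes the metadata lives at `medic-sentinel/_local/transitions-seq` with the seq in `value`, which should be verified against the deployed CHT version before touching prod:

```python
# Sketch of zeroing out the sentinel backlog by fast-forwarding its
# processed sequence. ASSUMPTION (verify per CHT version): metadata doc
# is medic-sentinel/_local/transitions-seq, seq stored in `value`.

def fast_forward(meta_doc, current_update_seq):
    """Updated metadata doc pointing at the db's current update_seq."""
    doc = dict(meta_doc)  # don't mutate the fetched doc
    doc["value"] = current_update_seq
    return doc

meta = {"_id": "_local/transitions-seq", "value": "1000-g1AAAA"}
updated = fast_forward(meta, "987654-g1AAAB")
print(updated["value"])  # 987654-g1AAAB
```

In real use the new seq would come from `GET /medic` (`update_seq`) and the result would be `PUT` back to the `_local` doc.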
Diana is excited about a new endpoint that takes a time frame to get document IDs instead of a sequence ID. It’s in a CouchDB PR; hopefully released in couch 3.x soon
Stewardship teammates should sign up for GitLab to get access - ping Elijah with your account name
Action items:
@mrjones to create 3 gitlab tickets for “Current action” above
nothing pressing; following up on the last ticket, which is just docs
eCHIS KE
Tom is still testing upgrade failures in eCHIS VMs where the API crashes; the fix is to restart the API
TCO v2
Discuss splitting out DDocs for better performance: a change to a DDoc with a small number of indexes will only rebuild that DDoc, instead of just about every DDoc like we do today
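The mechanics: CouchDB builds one index file per design doc covering all of its views, so editing any view invalidates the whole group. A sketch (derived names are hypothetical) of splitting a combined ddoc into single-view ddocs so a change rebuilds only one small index:

```python
# Sketch of the ddoc split: one multi-view design doc becomes several
# single-view design docs (the derived names are hypothetical), so a
# change to one view rebuilds only that view's index, not the group.

def split_ddoc(ddoc):
    """Turn one multi-view ddoc into a list of single-view ddocs."""
    base = ddoc["_id"].removeprefix("_design/")
    return [{"_id": f"_design/{base}-{view}", "views": {view: body}}
            for view, body in ddoc["views"].items()]

combined = {"_id": "_design/medic",
            "views": {"by_type": {"map": "function(doc){ emit(doc.type); }"},
                      "by_date": {"map": "function(doc){ emit(doc.date); }"}}}

for d in split_ddoc(combined):
    print(d["_id"])  # _design/medic-by_type, then _design/medic-by_date
```

The trade-off is more index files and more ddocs to manage, in exchange for much cheaper incremental changes.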
@jkuester - look at adding CI to CHT Android to see if it supports testing on Android 5.0 (NB - this is not tied to CHT Core 5.0, just a good time to deprecate support for really old Android)
Elijah reports that the goal is to have all instances on 4.21 by end of August
The bug fix won’t land for days (weeks?), so a workaround for crashing upgrades will be to use the CHToolbox pre-stage feature. Note that pre-stage has been merged to main. Also note that the sentinel sequence ID reset feature branch will include the pre-stage feature
Action items:
@jkuester - add feature to CHToolbox: sentinel sequence ID reset to zero out sentinel backlog
@jkuester - look at adding CI to CHT Android to see if it supports testing on Android 5.0 (NB - this is not tied to CHT Core 5.0, just a good time to deprecate support for really old Android)
@mrjones - create ticket in eCHIS KE gitlab to propose merging VMs for low use instances and re-allocate the CPU/RAM to higher need instances like Nairobi
for instances not on 4.21, try going to 4.15 first, which has the 2-second fix, then go to 4.21
Review node exporter graphs to see whether upgrades correlate with lower iowait
TCO v2
working on compaction issue
Action items:
@jkuester - look at adding CI to CHT Android to see if it supports testing on Android 5.0 (NB - this is not tied to CHT Core 5.0, just a good time to deprecate support for really old Android)
@mrjones - create sentinel backlog status and disk use status (Siaya and Kwale full)
@mrjones - check out jump server access to eCHIS KE