waiting on migration date to move data centers away from Konza
“big 5” instances are still seeing high CPU use (>99%) - but no one wants to change them before move away from Konza
no changes to k8s hosting or remaining 4.21 → 5.1 upgrade.
some instances have high CPU use, but they're serving >3k users on only 8 cores, for example
would a multi-node Couch instance be helpful? or what about a read-only Couch node?
strongly suggest not doing this, as multi-node has been shown to give only nominal performance gains while adding massive complexity to backup/restore and general day-to-day maintenance.
discussion about these graphs charting disk use across upgrades
splitting ddocs does indeed use more CPU and IOPS, which makes the upgrade take longer (150% in this test case)
for TCO v2, backing off to a simpler PR that will use constants for DDOC paths instead of hard-coded values, which makes it easier to re-balance the indexes at a later date. We'll go on to TCO v3 and come back to v2 later, as v3 is easier to achieve.
moving on to focus on index improvements in TCO v3: will try to merge improvements to existing indexes as they are now
by delaying v2 and focusing on v3 now, we're knowingly causing painful re-indexes of medic client
amazing real world benefits of a production upgrade of 33% disk savings!! (private repo)
discuss interactive hosting cost calculator PR question and how it might benefit TCO planning call for MoH KE
otherwise interactive hosting cost calculator is moving along nicely
note that the calculator is based on Operating Expenses (OpEx), where you pay month to month for resources as in AWS; it doesn't speak well to the Capital Expense (CapEx) of buying hardware, but may lend itself to coming up with CPU/RAM/disk sizing for CapEx
likely ship calculator in the next couple of weeks
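the OpEx vs CapEx distinction above can be sketched as a toy break-even calculation; all figures here are hypothetical and not from the calculator itself:

```python
# Toy break-even sketch: month-to-month cloud (OpEx) vs buying
# hardware up front (CapEx). All numbers are hypothetical.

def breakeven_months(capex_upfront: float, opex_per_month: float) -> float:
    """Months after which owned hardware becomes cheaper than renting."""
    return capex_upfront / opex_per_month

# e.g. a $6,000 server vs a $250/month cloud instance
months = breakeven_months(6000, 250)
print(months)  # 24.0 -> owning wins after two years (ignoring power, ops, etc.)
```

a real comparison would also fold in power, bandwidth, and ops time on the CapEx side, which is why the calculator sticks to OpEx for now.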
there is an overlap where more than one copy of API and Couch run per instance for minutes at a time, which is suspicious and not good. Diana wonders about strategy.type: Recreate vs RollingUpdate
Diana also notes: the default deployment strategy in Kubernetes is RollingUpdate
Elijah to set up log streaming so we can see logs from prior OOMed containers
this should reduce size of indexes, thus make upgrades go quicker and reduce disk space
current guess is that if we reduce index size by, say, 25%, overall disk use during the upgrade would go down by the same amount
current estimate is weeks not months for subset TCO v3 features
Discuss cold storage
old issue was based on just tasks; this ticket specifies a massive list of UUIDs to purge. in future we could offer a feature like "purge tasks older than X" or "purge reports older than Y"
POC is underway with the new purge endpoint in CouchDB 3.5.0
promising results!
need to be careful about views getting too far out of date, where a race condition causes them to no longer be indexed and queries to fail
testing on small 8 core EC2 instance with 2M docs
1M docs purged in the amount of time it takes to write 1M, which is a good speed
docs would be written to some un-indexed DB (e.g. "cold storage") as a fallback
indexes get rebuilt as IDs are put into cold storage. current batch size is 10k docs purged, and then all views are queried so they're up to date. This causes heavy CPU use during purging
for UUIDs on offline clients that get purged on the server, the doc could get re-uploaded. need to be careful to ensure that too high a percentage doesn't get re-uploaded. only tasks or reports that are updated offline after the last sync will be pushed up, i.e. very old reports and tasks very likely will not get pushed up.
per above, feature should likely be opinionated (i.e. delete on server and on client)
this will work back to 4.20, when we shipped CouchDB 3.5
By end of April, Diana wants to do more tests on a small server (~8 cores/16GB) to make sure it's stable across a large number of docs
Any CHT core endpoints in support of this will be after 5.2
Medic EKS cluster had a number of instances upgraded 4.x → 5.x, and CPU usage was high enough on Couch k8s nodes to cause brief outages
one idea is to go back to prior 4.x → 4.x upgrades to see how the cluster behaved when NP did a bunch of upgrades previously
more tests run, still showing promising results on churn
no major announcements/discoveries
holding off on writing purged docs to cold storage DB b/c writes are boring and a known quantity. Looking for unknowns that might have negative impact
first iteration will be simplistic and accept a list of a boatload of UUIDs. this requires deployments to create this list first before they can purge → cold storage
hoping to ship this first iteration in 5.3, which assumes 5.2 ships in weeks not months from now
maybe cht core 5.4 has automated cold storage enabled?
good discussion on what we ship: do we enable something by default? maybe all tasks older than 6 months are auto cold-storaged?