CHT on Kubernetes

Hello,
There is minimal documentation on installing CHT on Kubernetes: Kubernetes vs Docker | Community Health Toolkit
I would like to know if anyone has implemented this and can share their experience or documentation, so that I can refer to it and make a case for migrating from Docker to Kubernetes. My focus is having a much more robust infrastructure.
Thanks,
Job

Hi Job,

We’ve migrated CHT projects from Docker to Kubernetes and are working on adding guides in PR#1557. I’d be happy to walk through the process with you when you’re ready.

Hello @elijah
At ICRC we’d like to deploy our first CHT instance on Kubernetes. Are there any resources (Helm charts, YAML files) we can use?
Thanks

Hi Frederic,

CHT’s helm charts are hosted in this repository: GitHub - medic/helm-charts

Thanks @elijah
I suppose we should use helm-charts/charts/cht-chart-4x at main · medic/helm-charts · GitHub, correct?

Confirmed, Frederic: this is the correct link for the recommended CHT v4 deployment.

Hello,
I was looking at the Helm charts and tried to set them up locally in a Docker environment with Kubernetes, but I couldn’t find clear documentation or a process to follow. Could you please share documentation for both Windows and Linux environments?

Hi Job,

Helm charts can be deployed to a Kubernetes cluster with the command helm install <release-name> /path/to/chart. Additional details are available on the Helm documentation site: Helm | Using Helm
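For reference, here is a rough sketch of installing the CHT 4.x chart from the medic/helm-charts repository. The release name cht, the namespace cht-dev, and the values file path are examples, not requirements; adjust them for your environment.

git clone https://github.com/medic/helm-charts.git
cd helm-charts

# Install the chart into its own namespace with your values file
helm install cht ./charts/cht-chart-4x \
  --namespace cht-dev --create-namespace \
  -f /path/to/your/values.yaml

# Watch the pods come up
kubectl get pods -n cht-dev -w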

Hi Elijah,
Yes that’s exactly what I thought, but I kept on getting this error on a couple of yaml files:
Error: INSTALLATION FAILED: YAML parse error on cht-chart-4x/templates/api-deployment.yaml: error converting YAML to JSON: yaml: line 35: mapping values are not allowed in this context

Please share the values.yaml file that you’re using for this installation.

project_name: "msfecare" # e.g. mrjones-dev
namespace: "ecare-dev" # e.g. "mrjones-dev"
chtversion: 4.10.0
# cht_image_tag: 4.1.1-4.1.1 #- This is filled in automatically by the deploy script. Don't uncomment this line.

# If images are cached, the same image tag will never be pulled twice. For development, this means that it's not
# possible to upgrade to a newer version of the same branch, as the old image will always be reused.
# For development instances, set this value to false.
cache_images: true

# Don't change upstream-servers unless you know what you're doing.
upstream_servers:
  docker_registry: "public.ecr.aws/medic"
  builds_url: "https://staging.dev.medicmobile.org/_couch/builds_4"
upgrade_service:
  tag: 0.32

# CouchDB Settings
couchdb:
  password: "Password" # Avoid using non-url-safe characters in password
  secret: "f9053a0a-ef77-4be3-994d-87d6732600fd" # for prod, change to output of `uuidgen
  user: "medic"
  uuid: "7300115e-1a98-4607-a37c-50e0c9913767" # for prod, change to output of `uuidgen`
  clusteredCouch_enabled: false
  couchdb_node_storage_size: 100Mi
clusteredCouch:
  noOfCouchDBNodes: 3
toleration:   # This is for the couchdb pods. Don't change this unless you know what you're doing.
  key: "dev-couchdb-only"
  operator: "Equal"
  value: "true"
  effect: "NoSchedule"
ingress:
  annotations:
    groupname: "dev-cht-alb"
    tags: "Environment=dev,Team=QA"
    certificate: "arn:aws:iam::720541322708:server-certificate/2024-wildcard-dev-medicmobile-org-chain"
  # Ensure the host is not already taken. Valid characters for a subdomain are:
  #   a-z, 0-9, and - (but not as first or last character).
  host: "<subdomain>.dev.medicmobile.org"  # e.g. "mrjones.dev.medicmobile.org"
  hosted_zone_id: "Z3304WUAJTCM7P"
  load_balancer: "dualstack.k8s-devchtalb-3eb0781cbb-694321496.eu-west-2.elb.amazonaws.com"

environment: "remote"  # "local", "remote"
cluster_type: "eks" # "eks" or "k3s-k3d"
cert_source: "eks-medic" # "eks-medic" or "specify-file-path" or "my-ip-co"
certificate_crt_file_path: "/path/to/certificate.crt" # Only required if cert_source is "specify-file-path"
certificate_key_file_path: "/path/to/certificate.key" # Only required if cert_source is "specify-file-path"

nodes:
  # If using clustered couchdb, add the nodes here: node-1: name-of-first-node, node-2: name-of-second-node, etc.
  # Add equal number of nodes as specified in clusteredCouch.noOfCouchDBNodes
  node-1: "" # This is the name of the first node where couchdb will be deployed
  node-2: "" # This is the name of the second node where couchdb will be deployed
  node-3: "" # This is the name of the third node where couchdb will be deployed
  # For single couchdb node, use the following:
  # Leave it commented out if you don't know what it means.
  # Leave it commented out if you want to let kubernetes deploy this on any available node. (Recommended)
  # single_node_deploy: "gamma-cht-node" # This is the name of the node where all components will be deployed - for non-clustered configuration. 

# Applicable only if using k3s
k3s_use_vSphere_storage_class: "false" # "true" or "false"
# vSphere specific configurations. If you set "true" for k3s_use_vSphere_storage_class, fill in the details below.
vSphere:
  datastoreName: "DatastoreName"  # Replace with your datastore name
  diskPath: "path/to/disk"         # Replace with your disk path

# -----------------------------------------
#       Pre-existing data section
# -----------------------------------------
couchdb_data:
  preExistingDataAvailable: "false" #If this is false, you don't have to fill in details in local_storage or remote.
  dataPathOnDiskForCouchDB: "data" # This is the path where couchdb data will be stored. Leave it as data if you don't have pre-existing data.
    # To mount to a specific subpath (If data is from an old 3.x instance for example): dataPathOnDiskForCouchDB: "storage/medic-core/couchdb/data"
    # To mount to the root of the volume: dataPathOnDiskForCouchDB: ""
    # To use the default "data" subpath, remove the subPath line entirely from values.yaml or name it "data" or use null.
    # for Multi-node configuration, you can use %d to substitute with the node number.
    # You can use %d for each node to be substituted with the node number.
    # If %d doesn't exist, the same path will be used for all nodes.
    # example: test-path%d will be test-path1, test-path2, test-path3 for 3 nodes.
    # example: test-path will be test-path for all nodes.
  partition: "0" # This is the partition number for the EBS volume. Leave it as 0 if you don't have a partitioned disk.

# If preExistingDataAvailable is true, fill in the details below.
# For local_storage, fill in the details if you are using k3s-k3d cluster type.
local_storage:  #If using k3s-k3d cluster type and you already have existing data.
  preExistingDiskPath-1: "/var/lib/couchdb1" #If node1 has pre-existing data.
  preExistingDiskPath-2: "/var/lib/couchdb2" #If node2 has pre-existing data.
  preExistingDiskPath-3: "/var/lib/couchdb3" #If node3 has pre-existing data.
# For ebs storage when using eks cluster type, fill in the details below.
ebs:
  preExistingEBSVolumeID-1: "vol-0123456789abcdefg" # If you have already created the EBS volume, put the ID here.
  preExistingEBSVolumeID-2: "vol-0123456789abcdefg" # If you have already created the EBS volume, put the ID here.
  preExistingEBSVolumeID-3: "vol-0123456789abcdefg" # If you have already created the EBS volume, put the ID here.
  preExistingEBSVolumeSize: "100Gi" # The size of the EBS volume.

Thanks Job,

The following deployment files within the template directory reference cht_image_tag, which has been replaced by chtversion in values.yaml:

  • api-deployment.yaml
  • couchdb-single-deployment.yaml
  • haproxy-deployment.yaml
  • healthcheck-deployment.yaml
  • sentinel-deployment.yaml

After making this update the charts will compile successfully.
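For illustration only, the fix in each template is of this shape (the actual template lines and image names in the chart will differ; cht-api here is just an example):

# before: interpolates a value that values.yaml no longer defines
image: "{{ .Values.upstream_servers.docker_registry }}/cht-api:{{ .Values.cht_image_tag }}"
# after: use the chtversion value that values.yaml does define
image: "{{ .Values.upstream_servers.docker_registry }}/cht-api:{{ .Values.chtversion }}"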

Hello Elijah,
Thank you very much for this. I made the update and I am now able to run the installation. Now onto my next issue:

  • All pods start except couchdb and haproxy, whether I run a single CouchDB node or a clustered setup.
  • The CouchDB logs:
[notice] 2025-01-08T11:18:18.773849Z couchdb@127.0.0.1 <0.109.0> -------- config: [admins] medic set to '****' for reason nil
[info] 2025-01-08T11:18:18.812620Z couchdb@127.0.0.1 <0.254.0> -------- Apache CouchDB has started. Time to relax.
[notice] 2025-01-08T11:18:18.818237Z couchdb@127.0.0.1 <0.353.0> -------- rexi_server : started servers
[notice] 2025-01-08T11:18:18.819214Z couchdb@127.0.0.1 <0.357.0> -------- rexi_buffer : started servers
[warning] 2025-01-08T11:18:18.838070Z couchdb@127.0.0.1 <0.365.0> -------- creating missing database: _nodes
[info] 2025-01-08T11:18:18.838123Z couchdb@127.0.0.1 <0.366.0> -------- open_result error {not_found,no_db_file} for _nodes
[warning] 2025-01-08T11:18:18.889595Z couchdb@127.0.0.1 <0.381.0> -------- creating missing database: _dbs
[warning] 2025-01-08T11:18:18.889595Z couchdb@127.0.0.1 <0.382.0> -------- creating missing database: _dbs
[info] 2025-01-08T11:18:18.889640Z couchdb@127.0.0.1 <0.384.0> -------- open_result error {not_found,no_db_file} for _dbs
[notice] 2025-01-08T11:18:18.907307Z couchdb@127.0.0.1 <0.396.0> -------- mem3_reshard_dbdoc start init()
[notice] 2025-01-08T11:18:18.926356Z couchdb@127.0.0.1 <0.398.0> -------- mem3_reshard start init()
[notice] 2025-01-08T11:18:18.926461Z couchdb@127.0.0.1 <0.399.0> -------- mem3_reshard db monitor <0.399.0> starting
[notice] 2025-01-08T11:18:18.930542Z couchdb@127.0.0.1 <0.398.0> -------- mem3_reshard starting reloading jobs
[notice] 2025-01-08T11:18:18.930639Z couchdb@127.0.0.1 <0.398.0> -------- mem3_reshard finished reloading jobs
[info] 2025-01-08T11:18:18.952029Z couchdb@127.0.0.1 <0.405.0> -------- Apache CouchDB has started. Time to relax.
[info] 2025-01-08T11:18:18.952116Z couchdb@127.0.0.1 <0.405.0> -------- Apache CouchDB has started on http://0.0.0.0:5984/
[notice] 2025-01-08T11:18:18.967965Z couchdb@127.0.0.1 <0.426.0> -------- chttpd_auth_cache changes listener died because the _users database does not exist. Create the database to silence this notice.
[error] 2025-01-08T11:18:18.968299Z couchdb@127.0.0.1 emulator -------- Error in process <0.427.0> on node 'couchdb@127.0.0.1' with exit value:
{database_does_not_exist,[{mem3_shards,load_shards_from_db,[<<"_users">>],[{file,"src/mem3_shards.erl"},{line,430}]},{mem3_shards,load_shards_from_disk,1,[{file,"src/mem3_shards.erl"},{line,405}]},{mem3_shards,load_shards_from_disk,2,[{file,"src/mem3_shards.erl"},{line,434}]},{mem3_shards,for_docid,3,[{file,"src/mem3_shards.erl"},{line,100}]},{fabric_doc_open,go,3,[{file,"src/fabric_doc_open.erl"},{line,39}]},{chttpd_auth_cache,ensure_auth_ddoc_exists,2,[{file,"src/chttpd_auth_cache.erl"},{line,214}]},{chttpd_auth_cache,listen_for_changes,1,[{file,"src/chttpd_auth_cache.erl"},{line,160}]}]}
[error] 2025-01-08T11:18:18.968424Z couchdb@127.0.0.1 emulator -------- Error in process <0.427.0> on node 'couchdb@127.0.0.1' with exit value:
{database_does_not_exist,[{mem3_shards,load_shards_from_db,[<<"_users">>],[{file,"src/mem3_shards.erl"},{line,430}]},{mem3_shards,load_shards_from_disk,1,[{file,"src/mem3_shards.erl"},{line,405}]},{mem3_shards,load_shards_from_disk,2,[{file,"src/mem3_shards.erl"},{line,434}]},{mem3_shards,for_docid,3,[{file,"src/mem3_shards.erl"},{line,100}]},{fabric_doc_open,go,3,[{file,"src/fabric_doc_open.erl"},{line,39}]},{chttpd_auth_cache,ensure_auth_ddoc_exists,2,[{file,"src/chttpd_auth_cache.erl"},{line,214}]},{chttpd_auth_cache,listen_for_changes,1,[{file,"src/chttpd_auth_cache.erl"},{line,160}]}]}
[notice] 2025-01-08T11:18:19.015081Z couchdb@127.0.0.1 <0.474.0> -------- Missing system database _users
Waiting for cht couchdb
  • HAProxy logs:
# servers are added at runtime, in entrypoint.sh, based on couchdb-1.ecare.svc.cluster.local,couchdb-2.ecare.svc.cluster.local,couchdb-3.ecare.svc.cluster.local
  server couchdb-1.ecare.svc.cluster.local couchdb-1.ecare.svc.cluster.local:5984 check agent-check agent-inter 5s agent-addr healthcheck.ecare.svc.cluster.local agent-port 5555
  server couchdb-2.ecare.svc.cluster.local couchdb-2.ecare.svc.cluster.local:5984 check agent-check agent-inter 5s agent-addr healthcheck.ecare.svc.cluster.local agent-port 5555
  server couchdb-3.ecare.svc.cluster.local couchdb-3.ecare.svc.cluster.local:5984 check agent-check agent-inter 5s agent-addr healthcheck.ecare.svc.cluster.local agent-port 5555
[alert] 007/111000 (1) : parseBasic loaded
[alert] 007/111000 (1) : parseCookie loaded
[alert] 007/111000 (1) : replacePassword loaded
[NOTICE]   (1) : haproxy version is 2.6.17-a7cab98
[NOTICE]   (1) : path to executable is /usr/local/sbin/haproxy
[ALERT]    (1) : config : [/usr/local/etc/haproxy/backend.cfg:7] : 'server couchdb-servers/couchdb-1.ecare.svc.cluster.local' : parsing agent-addr failed. Check if 'healthcheck.ecare.svc.cluster.local' is correct address..
[ALERT]    (1) : config : [/usr/local/etc/haproxy/backend.cfg:8] : 'server couchdb-servers/couchdb-2.ecare.svc.cluster.local' : parsing agent-addr failed. Check if 'healthcheck.ecare.svc.cluster.local' is correct address..
[ALERT]    (1) : config : [/usr/local/etc/haproxy/backend.cfg:9] : 'server couchdb-servers/couchdb-3.ecare.svc.cluster.local' : parsing agent-addr failed. Check if 'healthcheck.ecare.svc.cluster.local' is correct address..
[ALERT]    (1) : config : Error(s) found in configuration file : /usr/local/etc/haproxy/backend.cfg
[ALERT]    (1) : config : Fatal errors found in configuration.

Hi Job,

The CouchDB logs indicate that it’s trying to create its system databases but the writes are failing.

I would recommend setting up CHT locally using k3s, starting with a single-node deployment, then moving to a multi-node deployment, and finally to the production environment. This approach will make it easier to isolate where specific issues are occurring.
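For a first local attempt, here is a rough sketch of the values that typically change from the file you pasted, reusing its own field names (the storage size and host are placeholders, and cert_source should be whichever of the documented options fits your local TLS setup):

environment: "local"
cluster_type: "k3s-k3d"
couchdb:
  clusteredCouch_enabled: false
  couchdb_node_storage_size: 5Gi   # placeholder; size it for your local disk
ingress:
  host: "cht.local"                # placeholder; any hostname that resolves to your machine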

Hi Job,

Hoping you have resolved your issue. Here are some other helpful paths for similar scenarios:

  • In your CouchDB logs, "Waiting for cht couchdb" implies the cluster is unable to talk to the other CouchDB nodes. Can you log into the couchdb-1 pod and manually curl localhost:5984, then curl the service IP that couchdb-1 is listed on when running kubectl get services? Lastly, try accessing each CouchDB node from inside the others (see the command sketch after this list). Your logs imply the service links between the CouchDB nodes are not working, so the cluster does not come up, and CouchDB then tries a fresh install and fails at creating the system databases.

  • fabric_rpc errors seen in your CouchDB logs are related to resource constraints. Can you pull monitoring metrics and resource usage for the project?

  • From your HAProxy logs, the healthcheck pod or healthcheck service is not running correctly. Please restart it and verify that the Kubernetes service resources are working. You will want to restart healthcheck, then restart haproxy, but only after you are sure the service IPs between the CouchDB nodes are reachable manually.
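For illustration, a sketch of the checks described above (pod, service, and deployment names are assumptions based on the ecare namespace in your logs; confirm the real names with kubectl get pods,services -n ecare):

# 1. Is CouchDB answering inside its own pod?
kubectl exec -n ecare couchdb-1-0 -- curl -s http://localhost:5984/_up

# 2. Can one CouchDB node reach another via the cluster DNS names HAProxy uses?
kubectl exec -n ecare couchdb-1-0 -- curl -s http://couchdb-2.ecare.svc.cluster.local:5984/_up

# 3. Does the healthcheck service that HAProxy's agent-check points at have live endpoints?
kubectl get endpoints healthcheck -n ecare

# 4. Only once the above work, restart healthcheck and then haproxy
kubectl rollout restart deployment/healthcheck deployment/haproxy -n ecare   # deployment names are assumptions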

Hello @hareet ,
I am still stuck; my latest error was with persistent volumes.
I will try again today and share the logs as soon as possible.
Thanks,
Job

Can anyone please share a Helm chart that works on localhost without pre-existing data? I think there is something wrong with the Helm charts published on GitHub. I am no expert, but each time I try them I end up with a new error. I am more than willing to have a quick call on this.
Thanks in advance.

Hi @Job_Isabai

Sorry about your errors with persistent volumes.
For local development, we have a version of the helm charts that we also use for e2e testing in the CI suite. You can find them at cht-core/tests/helm at master · medic/cht-core · GitHub
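A rough sketch of trying those charts locally, assuming a standard chart layout with its own default values.yaml (the release name and namespace are examples):

git clone https://github.com/medic/cht-core.git
helm install cht-local ./cht-core/tests/helm \
  --namespace cht-dev --create-namespace
# Override any defaults as needed, e.g. -f my-values.yaml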

@Job_Isabai
Local development and persistent volumes are supported: https://github.com/medic/helm-charts/blob/main/charts/cht-chart-4x/values.yaml#L45

Can you share your values.yaml file (be sure to edit out any sensitive information)? And paste your recent error?

Hi @diana ,
I got this error a couple of times on haproxy. Here is the HAProxy config, its startup logs, and my values.yaml:

global
  maxconn 60000
  spread-checks 5
  lua-load-per-thread /usr/local/etc/haproxy/parse_basic.lua
  lua-load-per-thread /usr/local/etc/haproxy/parse_cookie.lua
  lua-load-per-thread /usr/local/etc/haproxy/replace_password.lua
  log stdout len 65535 local2 debug
  tune.bufsize 32768
  tune.buffers.limit 60000
http-errors json
  errorfile 200 /usr/local/etc/haproxy/errors/200-json.http
  errorfile 400 /usr/local/etc/haproxy/errors/400-json.http
  errorfile 401 /usr/local/etc/haproxy/errors/401-json.http
  errorfile 403 /usr/local/etc/haproxy/errors/403-json.http
  errorfile 404 /usr/local/etc/haproxy/errors/404-json.http
  errorfile 405 /usr/local/etc/haproxy/errors/405-json.http
  errorfile 407 /usr/local/etc/haproxy/errors/407-json.http
  errorfile 408 /usr/local/etc/haproxy/errors/408-json.http
  errorfile 410 /usr/local/etc/haproxy/errors/410-json.http
  errorfile 413 /usr/local/etc/haproxy/errors/413-json.http
  errorfile 421 /usr/local/etc/haproxy/errors/421-json.http
  errorfile 422 /usr/local/etc/haproxy/errors/422-json.http
  errorfile 425 /usr/local/etc/haproxy/errors/425-json.http
  errorfile 429 /usr/local/etc/haproxy/errors/429-json.http
  errorfile 500 /usr/local/etc/haproxy/errors/500-json.http
  errorfile 501 /usr/local/etc/haproxy/errors/501-json.http
  errorfile 502 /usr/local/etc/haproxy/errors/502-json.http
  errorfile 503 /usr/local/etc/haproxy/errors/503-json.http
  errorfile 504 /usr/local/etc/haproxy/errors/504-json.http
defaults
  mode http
  option http-ignore-probes
  option httplog
  option forwardfor
  option redispatch
  option http-server-close
  timeout client 15000000
  timeout server 360000000
  timeout connect 1500000
  timeout http-keep-alive 5m
  errorfiles json
  stats enable
  stats refresh 30s
  stats auth medic:Secret_1
  stats uri /haproxy?stats
frontend http-in
  bind  0.0.0.0:5984
  acl has_user req.hdr(x-medic-user) -m found
  acl has_cookie req.hdr(cookie) -m found
  acl has_basic_auth req.hdr(authorization) -m found
  declare capture request len 400000
  http-request set-header x-medic-user %[lua.parseBasic] if has_basic_auth
  http-request set-header x-medic-user %[lua.parseCookie] if !has_basic_auth !has_user has_cookie
  http-request capture req.body id 0 # capture.req.hdr(0)
  http-request capture req.hdr(x-medic-service) len 200 # capture.req.hdr(1)
  http-request capture req.hdr(x-medic-user) len 200 # capture.req.hdr(2)
  http-request capture req.hdr(user-agent) len 600 # capture.req.hdr(3)
  capture response header Content-Length len 10 # capture.res.hdr(0)
  http-response set-header Connection Keep-Alive
  http-response set-header Keep-Alive timeout=18000
  log global
  log-format "%ci,%s,%ST,%Ta,%Ti,%TR,%[capture.req.method],%[capture.req.uri],%[capture.req.hdr(1)],%[capture.req.hdr(2)],'%[capture.req.hdr(0),lua.replacePassword]',%B,%Tr,%[capture.res.hdr(0)],'%[capture.req.hdr(3)]'"
  default_backend couchdb-servers
backend couchdb-servers
  balance leastconn
  retry-on all-retryable-errors
  log global
  retries 5
  # servers are added at runtime, in entrypoint.sh, based on couchdb-1.msfecare.svc.cluster.local,couchdb-2.msfecare.svc.cluster.local,couchdb-3.msfecare.svc.cluster.local
  server couchdb-1.msfecare.svc.cluster.local couchdb-1.msfecare.svc.cluster.local:5984 check agent-check agent-inter 5s agent-addr healthcheck.msfecare.svc.cluster.local agent-port 5555
  server couchdb-2.msfecare.svc.cluster.local couchdb-2.msfecare.svc.cluster.local:5984 check agent-check agent-inter 5s agent-addr healthcheck.msfecare.svc.cluster.local agent-port 5555
  server couchdb-3.msfecare.svc.cluster.local couchdb-3.msfecare.svc.cluster.local:5984 check agent-check agent-inter 5s agent-addr healthcheck.msfecare.svc.cluster.local agent-port 5555
[alert] 035/192700 (1) : parseBasic loaded
[alert] 035/192700 (1) : parseCookie loaded
[alert] 035/192700 (1) : replacePassword loaded
[NOTICE]   (1) : haproxy version is 2.6.15-446b02c
[NOTICE]   (1) : path to executable is /usr/local/sbin/haproxy
[ALERT]    (1) : config : [/usr/local/etc/haproxy/backend.cfg:7] : 'server couchdb-servers/couchdb-1.msfecare.svc.cluster.local' : parsing agent-addr failed. Check if 'healthcheck.msfecare.svc.cluster.local' is correct address..
[ALERT]    (1) : config : [/usr/local/etc/haproxy/backend.cfg:8] : 'server couchdb-servers/couchdb-2.msfecare.svc.cluster.local' : parsing agent-addr failed. Check if 'healthcheck.msfecare.svc.cluster.local' is correct address..
[ALERT]    (1) : config : [/usr/local/etc/haproxy/backend.cfg:9] : 'server couchdb-servers/couchdb-3.msfecare.svc.cluster.local' : parsing agent-addr failed. Check if 'healthcheck.msfecare.svc.cluster.local' is correct address..
[ALERT]    (1) : config : Error(s) found in configuration file : /usr/local/etc/haproxy/backend.cfg
[ALERT]    (1) : config : Fatal errors found in configuration.
namespace: "msfecare"
cht_image_tag: "4.5.0"

upstream_servers:
  docker_registry: "public.ecr.aws/medic"

couchdb:
  password: "Secret_1"
  secret: "f9053a0a-ef77-4be3-994d-87d6732600fd"
  user: "medic"
  uuid: "7300115e-1a98-4607-a37c-50e0c9913767"
  db_name: "ecare"

local_path:
  preExistingDiskPath-1: "/Users/Isabai/Documents/GitHub/cht-core/tests/helm/templates/srv1"
  preExistingDiskPath-2: "/Users/Isabai/Documents/GitHub/cht-core/tests/helm/templates/srv2"
  preExistingDiskPath-3: "/Users/Isabai/Documents/GitHub/cht-core/tests/helm/templates/srv3"