CouchDB backup script with couchbackup

Dear all,

I am a bit surprised not to find a thread about DB backup; apparently just copying the files is not possible on a distributed DB.

I would like some feedback on how people are doing it.

Here is the latest approach I tried (someone else is trying to restore from it):

# load the env vars from the docker compose .env
source cht/upgrade-service/.env
# create the target folder
mkdir -p backup
# for each db
for db in $(curl -s "https://${COUCHDB_USER}:${COUCHDB_PASSWORD}@cht-url.com/_all_dbs" | jq -r '.[]'); do
    # backup with attachments
    couchbackup --db "$db" --attachments true --url "https://${COUCHDB_USER}:${COUCHDB_PASSWORD}@cht-url.com" > "backup/${db}.txt"
done
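
For the restore side, the same package ships a couchrestore command that reads the dump from stdin; here is a rough sketch (untested, the target URL is a placeholder, and the target db may need to be created first):

# sketch: restore each dump into a target CouchDB (target-cht-url.com is a placeholder)
for f in backup/*.txt; do
    db=$(basename "$f" .txt)
    # create the target db first if it does not exist yet
    curl -s -o /dev/null -X PUT "https://${COUCHDB_USER}:${COUCHDB_PASSWORD}@target-cht-url.com/${db}"
    couchrestore --url "https://${COUCHDB_USER}:${COUCHDB_PASSWORD}@target-cht-url.com" --db "$db" < "$f"
done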

br

EDIT: I removed the part saying copying was not the best idea, as it is recommended by CouchDB, so my statement was simply wrong.


Thanks for starting this thread @delcroip ! I’m “one who documents” and not so much “one who runs production”, so I’ll be watching this thread closely in case our documentation needs updating.

Our backup docs (these are Docker-centric, but apply to all hosting styles) suggest copying files. The CouchDB docs also say:

you can … copy the actual .couch files from the CouchDB data directory (by default, data/) at any time, without problem. CouchDB’s append-only storage format for both databases and secondary indexes ensures that this will work without issue.

For CHT instances Medic hosts, we use cloud backups of the data directory (EC2 volume snapshots).

Both solutions should work equally well for single- or multi-node deployments.
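
For reference, the file-copy approach can be as simple as this sketch (the paths are examples, not our actual layout; per the CouchDB guidance above, the view index .shards directory is copied before the database shards):

# sketch: snapshot the CouchDB data directory to a dated folder (example paths)
SRC=/srv/couchdb/data
DEST=/backups/couchdb-$(date +%F)
mkdir -p "$DEST"
# copy the view indexes (.shards) first, then the database shards
rsync -a "$SRC/.shards" "$DEST/"
rsync -a --exclude '.shards' "$SRC/" "$DEST/"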

Two questions:

  1. What does the couchbackup executable do?
  2. Are you having issues with trying to restore? And if yes, can you specify what the issue is?

thanks!


Presumably it is GitHub - IBM/couchbackup: Cloudant backup and restore library and command-line utility

It looks a bit like pg_dump but for Couch; however, the lack of support for attachments would mean that this is not suitable for doing full backups that could be restored to a working state…

Also, regarding the copying of .couch files, the couchbackup docs note:

The easiest way to backup a CouchDB database is to copy the “.couch” file. This is fine on a single-node instance, but when running multi-node Cloudant or using CouchDB 2.0 or greater, the “.couch” file only has a single shard of data. This utility allows simple backups of CouchDB or Cloudant database using the HTTP API.

This tool can script the backup of your databases. Move the backup and log files to cheap Object Storage so that you have copies of your precious data.

So, I guess the main advantage of the couchbackup approach is that you can do a remote backup of a database (over HTTP). It still seems that for a production DB of any non-trivial size, it would be much better just to copy the .couch files.

Ah - thanks Josh!

Does couchbackup do a full backup from scratch every time, or does it do a handshake to discover the last synced sequence ID and only send the delta to save some time? The file-based backup approach will only copy the new or changed files - a massive time saver!

I note that some issues suggest seeking out CouchDB docs on how to do backup of a multi-node cluster.

Indeed I am using @cloudant/couchbackup - npm

We are working on a handover of a server and the recipients were not so happy with the .couch files (that is the first thing I did; I will ask for more details, it is possible they do have several CouchDB nodes); they suggested couchbackup.
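
A quick way to check whether it really is a multi-node cluster is the _membership endpoint; a sketch with the same placeholder URL as above:

# sketch: list the cluster nodes; more than one entry means a multi-node cluster
curl -s "https://${COUCHDB_USER}:${COUCHDB_PASSWORD}@cht-url.com/_membership" | jq '.cluster_nodes'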

@jkuester if you look at the command, I added --attachments true, so it seems to support attachments (the size changes a lot with that option activated, x5 in our case).

Overall the dump size seems lower with couchbackup than the .couch files (factor of 7); I wonder if it is a more compact format or just that some data is lost.
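
One way to rule out data loss would be to compare document counts between the source and the restored copy (some of the size difference may simply be un-compacted space in the append-only .couch files); a sketch, with SOURCE_URL and TARGET_URL as placeholders:

# sketch: compare doc counts per database between source and restored instance
for db in $(curl -s "$SOURCE_URL/_all_dbs" | jq -r '.[]'); do
    src=$(curl -s "$SOURCE_URL/$db" | jq '.doc_count')
    dst=$(curl -s "$TARGET_URL/$db" | jq '.doc_count')
    echo "$db: source=$src target=$dst"
done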

I will update this thread once I get more information.


:+1: I guess I was more worried about all the “experimental” language in the docs. I usually try to avoid too much experimental stuff when it comes to prod backups… :sweat_smile:

That being said, if you are not collecting any images/files/etc in your forms, then you might be in pretty good shape even without backing up attachments. The CHT server stores some things in attachments for the special system docs (e.g. form metadata), but in theory you could re-deploy your config from its SCM.
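
If you want to see which docs actually carry attachments before deciding, here is a rough sketch (the db name and limit are just examples; a real database would need paging through _all_docs):

# sketch: list ids of docs that have at least one attachment (first 1000 docs only)
curl -s "$URL/medic/_all_docs?include_docs=true&limit=1000" \
  | jq -r '.rows[] | select(.doc._attachments != null) | .id'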

@delcroip if you happen to be free this Thursday, this would be a great topic to discuss at the CHT Dev Hour!

Usually we do our backups with a script that copies the files and sends them to object storage:

#!/bin/bash

# Exoscale object storage (S3-compatible) configuration
export AWS_ACCESS_KEY_ID="<MY_ACCESS_KEY>"
export AWS_SECRET_ACCESS_KEY="<MY_SECRET_KEY>"

# Passphrase used by duplicity to encrypt the backup
export PASSPHRASE="<MY_EXPORT_PASSPHRASE>"

# Change this if different bucket
HOSTNAME="<MY-CHT-INSTANCE>"

BUCKET_NAME="<MY-STORAGE_BUCKET_NAME>"
# Endpoint of the bucket - adapt if different bucket location
S3_ENDPOINT="https://<OBJECT_STORAGE_URL>"
SOURCE_DIR="/<COUCHDB_PARENT_PATH>/couchd"
DATE=$(date +%Y-%m-%d)
FOLDERNAME="couch-data-prod"

## According to the CouchDB documentation we should first copy .shards (the view indexes)
## and after that the other files of the db
duplicity "${SOURCE_DIR}/.shards" \
	"s3://${BUCKET_NAME}/${HOSTNAME}/${FOLDERNAME}-${DATE}" --s3-endpoint-url=${S3_ENDPOINT} --allow-source-mismatch

# Backup the rest of the data
duplicity --exclude "${SOURCE_DIR}/.shards" \
	"${SOURCE_DIR}" \
	"s3://${BUCKET_NAME}/${HOSTNAME}/${FOLDERNAME}-${DATE}" --s3-endpoint-url=${S3_ENDPOINT} --allow-source-mismatch

We configured the object storage access to be write-only (no updates), so ransomware cannot encrypt the files already on the object storage.

Only one server, dedicated to that task, has write access on the object storage to perform the backup rotation (object storage mounted as a directory):

  • keep daily backups for 14 days
  • keep weekly backups for 2 months
  • keep monthly backups for a year
  • keep yearly backups
#!/bin/bash
# BACKUP_DIR is the directory where the object storage bucket is mounted
find "$BACKUP_DIR"/*/ -type f -exec sh -c '
      for file do
          path=$file

          month_day=$(date -r "$path" +"%d")
          week_day=$(date -r "$path" +"%u")
          year_day=$(date -r "$path" +"%j")

          # age of the backup file in days
          data_diff=$(( ( $(date +"%s") - $(date -r "$path" +"%s") ) / (60*60*24) ))
          # keep everything for the first 14 days
          if [ "$data_diff" -gt 14 ]; then
              if [ "$data_diff" -lt 60 ]; then
                  # between 2 weeks and 2 months: keep only weekly, monthly and yearly backups
                  if [ "$week_day" -ne 1 ] && [ "$month_day" -ne 1 ] && [ "$year_day" -ne 1 ]; then
                      rm -f "$file"
                  fi
              else
                  if [ "$data_diff" -lt 360 ]; then
                      # between 2 months and a year: keep only monthly and yearly backups
                      if [ "$month_day" -ne 1 ] && [ "$year_day" -ne 1 ]; then
                          rm -f "$file"
                      fi
                  else
                      # older than a year: keep only yearly backups
                      if [ "$year_day" -ne 1 ]; then
                          rm -f "$file"
                      fi
                  fi
              fi
          fi
      done' sh {} +
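
To check what backup sets actually ended up on the object storage (for example before and after the rotation runs), duplicity’s collection-status can be used; a sketch reusing the placeholders from the backup script above:

# sketch: list the backup chains stored in the bucket for a given day
duplicity collection-status \
	"s3://${BUCKET_NAME}/${HOSTNAME}/${FOLDERNAME}-${DATE}" --s3-endpoint-url=${S3_ENDPOINT}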


@delcroip @jkuester - oh no! I somehow didn’t get notified about further updates to this thread, including the scheduled call and the great script that @delcroip posted directly above. I’m sorry!

That said, I don’t have too much to add. The script posted above looks excellent! I will add one nugget of wisdom: you don’t truly have a backup until you have done a restore. Be sure to do a full restore now and again to verify it all works!
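
For the duplicity-based backups above, a restore test could look something like this sketch (same placeholder variables as the backup script; it restores into a scratch directory that a throwaway CouchDB could then be pointed at):

# sketch: needs the same AWS_* keys and PASSPHRASE exported as in the backup script
duplicity restore \
	"s3://${BUCKET_NAME}/${HOSTNAME}/${FOLDERNAME}-${DATE}" /tmp/couch-restore-test --s3-endpoint-url=${S3_ENDPOINT}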


@mrjones, I agree, and they did manage the restore part.
