CHT stuck on the API startup screen

To give a little background on the issue, it started with the application crashing due to an OS error. Running the dmesg command showed messages along the lines of Read-only file system. After finding no other viable solution, we asked the cloud service provider to restart the instance. This prompted the app to go to the API startup screen, as seen in the screenshot below:

The app has been stuck at this stage for the past 4 days and has not changed even after multiple restarts of the container. For some more context, here are some details regarding the environment and system resources (the commands used to gather them are sketched after the list):

  • CHT Version: 4.X
  • CHT Conf: 3.19.2
  • Disk Usage: 13% (of 150GB)
  • Memory Usage: 775MB (of 16GB)
  • OS: Ubuntu 20.04 LTS
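
For reference, these figures come from standard tools, and the read-only mount can be confirmed the same way. A minimal sketch, assuming a typical Ubuntu host where the root filesystem is the one that went read-only:

# Disk and memory usage reported above
df -h /
free -h

# Confirm whether the root filesystem is currently mounted read-only
findmnt -no OPTIONS /
sudo dmesg -T | grep -i "read-only"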

Another issue, which may or may not be related, is that we are unable to access the Docker logs. When we run the command to access them:

sudo docker logs --follow --tail 200 cht_api_1

The following message shows up:
error from daemon in stream: Error grabbing logs: error getting reference to decompressed log file: error while copying decompressed log stream to file: unexpected EOF

I did not manage to find any similar issue on the forum, and this occurred on a production machine, so there’s a bit of a rush on our end.

If anyone has encountered this issue before, please let me know if and how it was eventually solved. If you are trying to diagnose this issue, please let me know what more information you need from our end.

@sanjay @yuv @Prajwol


Hi @gkesh

Thanks for including lots of detail about your issue! This allows us to skip a few steps in debugging this.

I couldn’t find anyone reporting a similar docker error except for: Running docker logs consumes a large amount a storage space if the logs are compressed · Issue #41678 · moby/moby · GitHub, where the whole logs were requested.
So I’m inclined to believe that either:

  1. The log file is very large (?) and docker has trouble decompressing it
  2. Since your original error was about disk write permissions, docker might be trying to decompress the log file in a location that it doesn’t have write access to.

To eliminate option 1, could you please remove all containers and start the CHT again? You would need to run docker rm for each CHT container, then start the CHT again and retry the docker logs cht_api_1 command afterwards. Removing the containers ensures their logs start fresh, so there’s no chance the files have grown so large that decompressing them becomes a problem.
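
A rough sketch of those steps, assuming the default cht_* container names from a standard docker-compose setup (check docker ps -a for the actual names on your host, and use your own compose files to bring the CHT back up):

# Confirm the container names first
sudo docker ps -a

# Stop and remove the CHT containers (the names below are assumptions)
sudo docker stop cht_api_1 cht_sentinel_1 cht_haproxy_1 cht_couchdb_1 cht_nginx_1
sudo docker rm cht_api_1 cht_sentinel_1 cht_haproxy_1 cht_couchdb_1 cht_nginx_1

# Bring the CHT back up with your usual compose files, then re-check the logs
sudo docker compose up -d    # or docker-compose, depending on your setup
sudo docker logs --follow --tail 200 cht_api_1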

Thank you for your quick response @diana.

I will check the instance to see if any of these problems exist in the system.

As I said in the post above, this is a production instance, so we cannot remove the containers or take any destructive actions that could hamper the data present in the instance.

Please let me know if there are any other options that we can explore.

For now, no other container shows an error when manually accessing logs from /var/lib/docker/containers/.../local-logs, except for couchdb, which shows the following:

[error] 2023-09-05T06:06:43.064399Z couchdb@127.0.0.1 <0.1604.0> -------- CRASH REPORT Process  (<0.1604.0>) with 2 neighbors crashed with reason: no case clause matching {mrheader,727,0,{223139,[],223572},nil,[{{233022,{114,[],6290},10342},nil,nil,687,0},{{367574,{2064,[],251444},133857},nil,nil,704,0},{nil,nil,nil,0,0},{{367822,{2,[],68},160},nil,nil,312,0},{{385322,{186,[],15187},17707},nil,nil,704,0},{{395335,{186,[{0,186,0,0,0}],9585},9828},nil,nil,704,0},{nil,nil,nil,0,0},{{401164,{114,[],2750},6066},nil,nil,687,0},{{408428,{114,[true],6752},7063},nil,nil,687,0},{{414693,{114,[],2978},6512},nil,nil,687,0},{{420837,{114,[],2752},6124},nil,nil,727,0},{{433374,...},...},...]} at couch_bt_engine_header:downgrade_partition_header/1(line:225) <= lists:foldl/3(line:1263) <= couch_bt_engine:init_state/4(line:790) <= couch_bt_engine:init/2(line:159) <= couch_db_engine:init/3(line:722) <= couch_db_updater:init/1(line:32) <= proc_lib:init_p_do_apply/3(line:247); initial_call: {couch_db_updater,init,['Argument__1']}, ancestors: [<0.1603.0>], message_queue_len: 0, messages: [], links: [<0.1603.0>,<0.1605.0>], dictionary: [{io_priority,{db_update,<<"shards/95555553-aaaaaaa7/medic-sentinel.1...">>}},...], trap_exit: false, status: running, heap_size: 1598, stack_size: 27, reductions: 817
[error] 2023-09-05T06:08:19.125669Z couchdb@127.0.0.1 <0.4029.0> -------- CRASH REPORT Process  (<0.4029.0>) with 2 neighbors crashed with reason: no case clause matching {mrheader,821,0,{484276,[],92028},nil,[{{522637,{291,[],182204},120843},nil,nil,821,0},{nil,nil,nil,0,0},{nil,nil,nil,0,0},{{487489,{300,[],40298},29129},nil,nil,816,0},{nil,nil,nil,0,0},{nil,nil,nil,0,0},{{248289,{103,[{174308209776416,103,1691914099589,1693148200881,294984006818723137501912846}],11388},8951},nil,nil,791,0},{nil,nil,nil,0,0},{{501916,{2778,[],172744},153415},nil,nil,816,0}]} at couch_bt_engine_header:downgrade_partition_header/1(line:225) <= lists:foldl/3(line:1263) <= couch_bt_engine:init_state/4(line:790) <= couch_bt_engine:init/2(line:159) <= couch_db_engine:init/3(line:722) <= couch_db_updater:init/1(line:32) <= proc_lib:init_p_do_apply/3(line:247); initial_call: {couch_db_updater,init,['Argument__1']}, ancestors: [<0.4028.0>], message_queue_len: 0, messages: [], links: [<0.4028.0>,<0.4030.0>], dictionary: [{io_priority,{db_update,<<"shards/3fffffff-55555553/medic-sentinel.1...">>}},...], trap_exit: false, status: running, heap_size: 987, stack_size: 27, reductions: 809
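
For reference, this is roughly how we are reading those files directly. A sketch only: the container name is an assumption, and the directory under /var/lib/docker/containers/ is the full container ID reported by docker inspect (the local-logs layout depends on the configured log driver):

# Find the full container ID of the CouchDB container
sudo docker inspect --format '{{.Id}}' cht_couchdb_1

# List the raw log files for that container
sudo ls /var/lib/docker/containers/<container-id>/local-logs/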

cc: @sanjay

I understand, but your instance is down so some measures are necessary.

  1. Destroying the containers should not remove data.
  2. You can back up the data before destroying the containers (one way to do this is sketched below).
  3. You should already be backing up data, in case this turns out to be a data corruption issue.
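
A minimal sketch of such a backup, assuming the CouchDB data is bind-mounted from the host; the cht_couchdb_1 name and the data path are assumptions, so check your compose files or docker inspect for the real values:

# Find where the CouchDB container stores its data on the host
sudo docker inspect --format '{{ json .Mounts }}' cht_couchdb_1

# Stop CouchDB so the files are not being written, then archive the data directory
sudo docker stop cht_couchdb_1
sudo tar -czf couchdb-data-backup-$(date +%F).tar.gz -C /path/to/couchdb-data .
sudo docker start cht_couchdb_1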

Hey @diana,

I was wondering if you missed the logs I posted in my previous comment, which showed crash reports in the couchdb logs. They point to an error in the medic-sentinel shards with the following messages:

[error] 2023-09-17T07:21:07.986973Z couchdb@127.0.0.1 <0.20981.663> -------- rexi_server: from: couchdb@127.0.0.1(<0.9577.664>) mfa: fabric_rpc:all_docs/3 exit:timeout [{rexi,init_stream,1,[{file,"src/rexi.erl"},{line,265}]},{rexi,stream2,3,[{file,"src/rexi.erl"},{line,205}]},{fabric_rpc,view_cb,2,[{file,"src/fabric_rpc.erl"},{line,462}]},{couch_mrview,finish_fold,2,[{file,"src/couch_mrview.erl"},{line,682}]},{rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,140}]}]
[error] 2023-09-17T07:21:07.988479Z couchdb@127.0.0.1 <0.32146.663> -------- rexi_server: from: couchdb@127.0.0.1(<0.9577.664>) mfa: fabric_rpc:all_docs/3 exit:timeout [{rexi,init_stream,1,[{file,"src/rexi.erl"},{line,265}]},{rexi,stream2,3,[{file,"src/rexi.erl"},{line,205}]},{fabric_rpc,view_cb,2,[{file,"src/fabric_rpc.erl"},{line,462}]},{couch_mrview,finish_fold,2,[{file,"src/couch_mrview.erl"},{line,682}]},{rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,140}]}]
[error] 2023-09-17T07:21:07.988633Z couchdb@127.0.0.1 <0.18349.664> -------- rexi_server: from: couchdb@127.0.0.1(<0.9577.664>) mfa: fabric_rpc:all_docs/3 exit:timeout [{rexi,init_stream,1,[{file,"src/rexi.erl"},{line,265}]},{rexi,stream2,3,[{file,"src/rexi.erl"},{line,205}]},{fabric_rpc,view_cb,2,[{file,"src/fabric_rpc.erl"},{line,462}]},{couch_mrview,finish_fold,2,[{file,"src/couch_mrview.erl"},{line,682}]},{rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,140}]}]
[error] 2023-09-17T07:21:07.990486Z couchdb@127.0.0.1 <0.28804.663> -------- rexi_server: from: couchdb@127.0.0.1(<0.9577.664>) mfa: fabric_rpc:all_docs/3 exit:timeout [{rexi,init_stream,1,[{file,"src/rexi.erl"},{line,265}]},{rexi,stream2,3,[{file,"src/rexi.erl"},{line,205}]},{fabric_rpc,view_cb,2,[{file,"src/fabric_rpc.erl"},{line,462}]},{couch_mrview,finish_fold,2,[{file,"src/couch_mrview.erl"},{line,682}]},{rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,140}]}]
[error] 2023-09-17T07:24:34.801860Z couchdb@127.0.0.1 <0.28082.664> -------- CRASH REPORT Process  (<0.28082.664>) with 2 neighbors crashed with reason: no case clause matching {mrheader,727,0,{223139,[],223572},nil,[{{233022,{114,[],6290},10342},nil,nil,687,0},{{367574,{2064,[],251444},133857},nil,nil,704,0},{nil,nil,nil,0,0},{{367822,{2,[],68},160},nil,nil,312,0},{{385322,{186,[],15187},17707},nil,nil,704,0},{{395335,{186,[{0,186,0,0,0}],9585},9828},nil,nil,704,0},{nil,nil,nil,0,0},{{401164,{114,[],2750},6066},nil,nil,687,0},{{408428,{114,[true],6752},7063},nil,nil,687,0},{{414693,{114,[],2978},6512},nil,nil,687,0},{{420837,{114,[],2752},6124},nil,nil,727,0},{{433374,...},...},...]} at couch_bt_engine_header:downgrade_partition_header/1(line:225) <= lists:foldl/3(line:1263) <= couch_bt_engine:init_state/4(line:790) <= couch_bt_engine:init/2(line:159) <= couch_db_engine:init/3(line:722) <= couch_db_updater:init/1(line:32) <= proc_lib:init_p_do_apply/3(line:247); initial_call: {couch_db_updater,init,['Argument__1']}, ancestors: [<0.23656.664>], message_queue_len: 0, messages: [], links: [<0.23656.664>,<0.23623.664>], dictionary: [{io_priority,{db_update,<<"shards/95555553-aaaaaaa7/medic-sentinel.1...">>}},...], trap_exit: false, status: running, heap_size: 1598, stack_size: 27, reductions: 817
[error] 2023-09-17T07:24:34.802836Z couchdb@127.0.0.1 <0.21583.664> -------- CRASH REPORT Process  (<0.21583.664>) with 2 neighbors crashed with reason: no case clause matching {mrheader,821,0,{484276,[],92028},nil,[{{522637,{291,[],182204},120843},nil,nil,821,0},{nil,nil,nil,0,0},{nil,nil,nil,0,0},{{487489,{300,[],40298},29129},nil,nil,816,0},{nil,nil,nil,0,0},{nil,nil,nil,0,0},{{248289,{103,[{174308209776416,103,1691914099589,1693148200881,294984006818723137501912846}],11388},8951},nil,nil,791,0},{nil,nil,nil,0,0},{{501916,{2778,[],172744},153415},nil,nil,816,0}]} at couch_bt_engine_header:downgrade_partition_header/1(line:225) <= lists:foldl/3(line:1263) <= couch_bt_engine:init_state/4(line:790) <= couch_bt_engine:init/2(line:159) <= couch_db_engine:init/3(line:722) <= couch_db_updater:init/1(line:32) <= proc_lib:init_p_do_apply/3(line:247); initial_call: {couch_db_updater,init,['Argument__1']}, ancestors: [<0.12235.664>], message_queue_len: 0, messages: [], links: [<0.12235.664>,<0.11643.664>], dictionary: [{io_priority,{db_update,<<"shards/3fffffff-55555553/medic-sentinel.1...">>}},...], trap_exit: false, status: running, heap_size: 987, stack_size: 27, reductions: 809
[error] 2023-09-17T07:28:07.369064Z couchdb@127.0.0.1 <0.3462.665> -------- CRASH REPORT Process  (<0.3462.665>) with 2 neighbors crashed with reason: no case clause matching {mrheader,821,0,{484276,[],92028},nil,[{{522637,{291,[],182204},120843},nil,nil,821,0},{nil,nil,nil,0,0},{nil,nil,nil,0,0},{{487489,{300,[],40298},29129},nil,nil,816,0},{nil,nil,nil,0,0},{nil,nil,nil,0,0},{{248289,{103,[{174308209776416,103,1691914099589,1693148200881,294984006818723137501912846}],11388},8951},nil,nil,791,0},{nil,nil,nil,0,0},{{501916,{2778,[],172744},153415},nil,nil,816,0}]} at couch_bt_engine_header:downgrade_partition_header/1(line:225) <= lists:foldl/3(line:1263) <= couch_bt_engine:init_state/4(line:790) <= couch_bt_engine:init/2(line:159) <= couch_db_engine:init/3(line:722) <= couch_db_updater:init/1(line:32) <= proc_lib:init_p_do_apply/3(line:247); initial_call: {couch_db_updater,init,['Argument__1']}, ancestors: [<0.14500.664>], message_queue_len: 0, messages: [], links: [<0.14500.664>,<0.12117.664>], dictionary: [{io_priority,{db_update,<<"shards/3fffffff-55555553/medic-sentinel.1...">>}},...], trap_exit: false, status: running, heap_size: 987, stack_size: 27, reductions: 809
[error] 2023-09-17T07:28:07.376409Z couchdb@127.0.0.1 <0.2324.665> -------- CRASH REPORT Process  (<0.2324.665>) with 2 neighbors crashed with reason: no case clause matching {mrheader,727,0,{223139,[],223572},nil,[{{233022,{114,[],6290},10342},nil,nil,687,0},{{367574,{2064,[],251444},133857},nil,nil,704,0},{nil,nil,nil,0,0},{{367822,{2,[],68},160},nil,nil,312,0},{{385322,{186,[],15187},17707},nil,nil,704,0},{{395335,{186,[{0,186,0,0,0}],9585},9828},nil,nil,704,0},{nil,nil,nil,0,0},{{401164,{114,[],2750},6066},nil,nil,687,0},{{408428,{114,[true],6752},7063},nil,nil,687,0},{{414693,{114,[],2978},6512},nil,nil,687,0},{{420837,{114,[],2752},6124},nil,nil,727,0},{{433374,...},...},...]} at couch_bt_engine_header:downgrade_partition_header/1(line:225) <= lists:foldl/3(line:1263) <= couch_bt_engine:init_state/4(line:790) <= couch_bt_engine:init/2(line:159) <= couch_db_engine:init/3(line:722) <= couch_db_updater:init/1(line:32) <= proc_lib:init_p_do_apply/3(line:247); initial_call: {couch_db_updater,init,['Argument__1']}, ancestors: [<0.10794.664>], message_queue_len: 0, messages: [], links: [<0.10794.664>,<0.21046.664>], dictionary: [{io_priority,{db_update,<<"shards/95555553-aaaaaaa7/medic-sentinel.1...">>}},...], trap_exit: false, status: running, heap_size: 1598, stack_size: 27, reductions: 817

Please let me know if this can help diagnose the issue. All other containers seem to be running without a hitch.

cc: @sanjay @yuv

Hi @gkesh

This is an internal CouchDB error. Did you try restarting the container?
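
If not, restarting just the CouchDB container is a low-risk first step. A sketch, assuming the container is named cht_couchdb_1 (use the name shown by docker ps on your host):

sudo docker restart cht_couchdb_1
sudo docker logs --tail 100 cht_couchdb_1    # check whether the crash reports reappear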