Description of problem (please be as detailed as possible and provide log snippets):

When the case was initially opened, there were two issues. One was a recent Ceph daemon crash, which was verified as not ongoing and was archived. The other was that the noobaa-db-pg-0 pod was in "Pending" status with the following error:

  Warning  FailedMount  0s (x663 over 22h)  kubelet  MountVolume.MountDevice failed for volume "pvc-54a1120a-de56-4cd4-883b-ec289766d8e1" : rpc error: code = Internal desc = 'fsck' found errors on device /dev/rbd0 but could not correct them: fsck from util-linux 2.32.1
  /dev/rbd0 contains a file system with errors, check forced.
  /dev/rbd0: Inode 788022, end of extent exceeds allowed value (logical block 40, physical block 3178746, len 2)
  /dev/rbd0: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.

We were able to run fsck manually; the "RUN fsck MANUALLY" message went away and the volume mounted, but the noobaa-db-pg-0 pod still alternates between "Pending" and CrashLoopBackOff (CLBO) status. The customer states this started when personnel added extra worker nodes to the cluster. In this cluster, OCP and ODF share the same datastore.

Version of all relevant components (if applicable):

CSV:
  NAME                              DISPLAY                                    VERSION   REPLACES                          PHASE
  egressip-ipam-operator.v1.2.4     Egressip Ipam Operator                     1.2.4     egressip-ipam-operator.v1.2.3     Succeeded
  mcg-operator.v4.10.5              NooBaa Operator                            4.10.5    mcg-operator.v4.10.4              Succeeded
  ocs-operator.v4.10.5              OpenShift Container Storage                4.10.5    ocs-operator.v4.10.4              Succeeded
  odf-csi-addons-operator.v4.10.5   CSI Addons                                 4.10.5    odf-csi-addons-operator.v4.10.4   Succeeded
  odf-operator.v4.10.5              OpenShift Data Foundation                  4.10.5    odf-operator.v4.10.4              Succeeded
  postgresoperator.v5.1.3           Crunchy Postgres for Kubernetes            5.1.3     postgresoperator.v5.1.2           Succeeded
  rhacs-operator.v3.71.0            Advanced Cluster Security for Kubernetes   3.71.0    rhacs-operator.v3.70.1            Succeeded

Cluster Version:
  NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
  version   4.10.21   True        False         58d     Cluster version is 4.10.21

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Yes, the Crunchy Postgres DB is impacted. The customer cannot prepare production to go live.

Is there any workaround available to the best of your knowledge?

We were able to successfully run fsck on /dev/rbd0 as a workaround by mapping the csi-vol image while rsh'd into the csi pod (a rough sketch follows below). The "RUN fsck MANUALLY" error went away, but the noobaa-db-pg-0 status did not change from "Pending."

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

3

Can this issue be reproduced?

It is believed this issue stemmed from a Ceph crash that caused /dev/rbd0 (ocs-storagecluster-cephblockpool/csi-vol-a0f93fe6-0e70-11ed-a841-0a580a3d1a07) to have data inconsistencies that needed to be fixed with fsck. We cannot reproduce it in a testing environment.
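For reference, a rough sketch of the fsck workaround described above. This is an approximation of what was run, not a verified procedure: <csi-rbdplugin-pod> is a placeholder for the actual plugin pod name, the rbd map call may need monitor/credential flags (-m, --id, --keyring) taken from the cluster's CSI configuration, the image may not map as /dev/rbd0, and e2fsck is used only because the reported errors are ext4 extent errors.

  # rsh into the CSI RBD plugin pod on the node where the mount was attempted
  # (<csi-rbdplugin-pod> is a placeholder):
  $ oc -n openshift-storage rsh <csi-rbdplugin-pod>

  # Map the RBD image backing the PVC, repair the filesystem, then unmap:
  sh-4.4# rbd map ocs-storagecluster-cephblockpool/csi-vol-a0f93fe6-0e70-11ed-a841-0a580a3d1a07
  sh-4.4# e2fsck -fy /dev/rbd0
  sh-4.4# rbd unmap /dev/rbd0

  # Delete the Pending pod so the StatefulSet recreates it and retries the mount:
  $ oc -n openshift-storage delete pod noobaa-db-pg-0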
Additional info:

The initial error on noobaa-db-pg-0 ("Pending"):

  Warning  FailedMount  0s (x663 over 22h)  kubelet  MountVolume.MountDevice failed for volume "pvc-54a1120a-de56-4cd4-883b-ec289766d8e1" : rpc error: code = Internal desc = 'fsck' found errors on device /dev/rbd0

changed to the following after running fsck manually:

  Normal  SuccessfulAttachVolume  pod/noobaa-db-pg-0  AttachVolume.Attach succeeded for volume "pvc-54a1120a-de56-4cd4-883b-ec289766d8e1"

But we are still seeing:

  Warning  BackOff  3m18s (x730 over 163m)  kubelet  Back-off restarting failed container

  $ oc get po noobaa-db-pg-0
  NAME             READY   STATUS             RESTARTS         AGE
  noobaa-db-pg-0   0/1     CrashLoopBackOff   230 (118s ago)   19h

Container status:

  name: db
  ready: false
  restartCount: 35
  started: false
  state:
    waiting:
      message: back-off 5m0s restarting failed container=db pod=noobaa-db-pg-0_openshift-storage(cae4419b-ced3-4819-8af6-50a38dc45d83)
      reason: CrashLoopBackOff

  $ oc logs noobaa-db-pg-0
  waiting for server to start....2022-09-17 13:08:49.166 UTC [22] LOG:  starting PostgreSQL 12.11 on x86_64-redhat-linux-gnu, compiled by gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-10), 64-bit
  2022-09-17 13:08:49.167 UTC [22] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
  2022-09-17 13:08:49.178 UTC [22] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
  2022-09-17 13:08:49.259 UTC [22] LOG:  redirecting log output to logging collector process
  2022-09-17 13:08:49.259 UTC [22] HINT:  Future log output will appear in directory "log".
  ... stopped waiting
  pg_ctl: could not start server

Per the HINT above, the actual reason pg_ctl could not start the server should be recorded in the "log" directory under the PostgreSQL data directory rather than in the container stdout (a sketch of how to pull it follows this comment).

Thanks!

Regards,

Craig Wayman
TSE Red Hat OpenShift Data Foundations (ODF)
Customer Experience and Engagement, NA
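[Sketch referenced above] A minimal way to pull the PostgreSQL startup error, under assumptions: the data directory in this image is assumed to be /var/lib/pgsql/data/userdata (verify against the pod spec), <latest-logfile> is a placeholder, and oc debug is used to get a shell copy of the pod since the container keeps crashing; the RWO PVC may fail to attach if the debug copy lands on a different node.

  $ oc -n openshift-storage debug pod/noobaa-db-pg-0
  sh-4.4$ ls /var/lib/pgsql/data/userdata/log
  sh-4.4$ tail -n 100 /var/lib/pgsql/data/userdata/log/<latest-logfile>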
Good Morning,

I understand, and this makes sense, as the case was initially opened because noobaa-db was down and the customer was getting a message to run fsck manually on /dev/rbd0, so data corruption was the initial issue/concern. We successfully ran fsck manually, and the noobaa-db PVC mount then transitioned from failed to successful, but the StatefulSet and pod went into CLBO.

I will inform the customer of the fix mentioned: backing up the "$PGDATA/pg_logical/replorigin_checkpoint" file and then removing it (a rough sketch of the commands I plan to send follows below). I will also emphasize backing the file up first to ensure there is no data loss.

Appreciate your time and effort.

Regards,

Craig Wayman
TSE Red Hat OpenShift Data Foundations (ODF)
Customer Experience and Engagement, NA
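A minimal sketch of those replorigin_checkpoint steps, under assumptions: the db container must stay up long enough to rsh into (otherwise the volume would have to be reached another way, e.g. from a debug copy of the pod), $PGDATA is assumed to be set in the container environment (the literal path /var/lib/pgsql/data/userdata used with oc cp is also an assumption), and the backup is kept on the PVC because /tmp inside the container does not survive a restart.

  # rsh into the database pod:
  $ oc -n openshift-storage rsh noobaa-db-pg-0

  # Back up the checkpoint file BEFORE removing it; keep the copy on the PVC
  # so it survives the pod restart:
  sh-4.4$ cp "$PGDATA/pg_logical/replorigin_checkpoint" "$PGDATA/replorigin_checkpoint.bak"

  # Optionally keep a copy off the pod as well (run from the workstation;
  # path assumes the default data directory for this image):
  $ oc -n openshift-storage cp noobaa-db-pg-0:/var/lib/pgsql/data/userdata/pg_logical/replorigin_checkpoint ./replorigin_checkpoint.bak

  # Remove the file and restart the pod; PostgreSQL should recreate it at the
  # next checkpoint:
  sh-4.4$ rm "$PGDATA/pg_logical/replorigin_checkpoint"
  sh-4.4$ exit
  $ oc -n openshift-storage delete pod noobaa-db-pg-0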