Bug 2128078

Summary: [GSS] noobaa-db-pg-0 Pending/CLBO even after Running fsck Manually
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Craig Wayman <crwayman>
Component: Multi-Cloud Object Gateway
Assignee: Craig Wayman <crwayman>
Status: CLOSED WONTFIX
QA Contact: krishnaram Karthick <kramdoss>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.10
CC: bkunal, etamir, hnallurv, mmuench, nbecker, ocs-bugs, odf-bz-bot, shchan, sheggodu, usrivast
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-01-08 07:58:05 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Craig Wayman 2022-09-19 19:12:14 UTC
Description of problem (please be as detailed as possible and provide log snippets):

  When the case was initially opened, there were two issues. One was a recent CEPH daemon crash, which was verified as not ongoing and was archived. The other was that the noobaa-db-pg-0 pod was in “Pending” status with the following error:

Warning  FailedMount  0s (x663 over 22h)   kubelet  MountVolume.MountDevice failed for volume "pvc-54a1120a-de56-4cd4-883b-ec289766d8e1" : rpc error: code = Internal desc = 'fsck' found errors on device /dev/rbd0 but could not correct them: fsck from util-linux 2.32.1
/dev/rbd0 contains a file system with errors, check forced.
/dev/rbd0: Inode 788022, end of extent exceeds allowed value
  (logical block 40, physical block 3178746, len 2)

/dev/rbd0: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.

  We were able to run fsck manually; however, although the "RUN fsck MANUALLY" message went away and the volume mounted, the noobaa-db-pg-0 pod still cycles between “Pending” and CrashLoopBackOff (CLBO) status.

  The customer states this started when extra worker nodes were added to the cluster. The cluster shares the same datastore for both OPN and ODF.


Version of all relevant components (if applicable):
CSV:
NAME                              DISPLAY                                    VERSION   REPLACES                          PHASE
egressip-ipam-operator.v1.2.4     Egressip Ipam Operator                     1.2.4     egressip-ipam-operator.v1.2.3     Succeeded
mcg-operator.v4.10.5              NooBaa Operator                            4.10.5    mcg-operator.v4.10.4              Succeeded
ocs-operator.v4.10.5              OpenShift Container Storage                4.10.5    ocs-operator.v4.10.4              Succeeded
odf-csi-addons-operator.v4.10.5   CSI Addons                                 4.10.5    odf-csi-addons-operator.v4.10.4   Succeeded
odf-operator.v4.10.5              OpenShift Data Foundation                  4.10.5    odf-operator.v4.10.4              Succeeded
postgresoperator.v5.1.3           Crunchy Postgres for Kubernetes            5.1.3     postgresoperator.v5.1.2           Succeeded
rhacs-operator.v3.71.0            Advanced Cluster Security for Kubernetes   3.71.0    rhacs-operator.v3.70.1            Succeeded

Cluster Version:
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.21   True        False         58d     Cluster version is 4.10.21


Does this issue impact your ability to continue to work with the product (please explain in detail what the user impact is)?

  Yes, the Crunchy Postgres DB is impacted. The customer cannot prepare production to go live.


Is there any workaround available to the best of your knowledge?

  As a workaround, we were able to run fsck on /dev/rbd0 by mapping the csi-vol manually while rsh’d into a CSI pod. The “RUN fsck MANUALLY” error went away, but the noobaa-db-pg-0 status did not change from “Pending.”
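
  For reference, the workaround looked roughly like the following (a sketch only: the CSI pod name, monitor endpoints, credentials, and the resulting rbd device number are placeholders that have to be taken from the affected cluster, and the workload consuming the PVC must be stopped first so the image is not mounted anywhere else):

  # rsh into a csi-rbdplugin pod (placeholder name) and map the backing image by hand
  oc -n openshift-storage rsh <csi-rbdplugin-pod>
  rbd device map ocs-storagecluster-cephblockpool/csi-vol-a0f93fe6-0e70-11ed-a841-0a580a3d1a07 \
      -m <mon-endpoints> --id <cephx-user> --keyfile <keyfile>
  # repair the ext4 filesystem flagged by the mount error (device name may differ from rbd0)
  e2fsck -fy /dev/rbd0
  rbd device unmap /dev/rbd0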

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

3


Is this issue reproducible?

  It is believed this issue stemmed from a CEPH crash that left /dev/rbd0 (backed by ocs-storagecluster-cephblockpool/csi-vol-a0f93fe6-0e70-11ed-a841-0a580a3d1a07) with data inconsistencies that had to be fixed with fsck. We cannot reproduce it in a testing environment.
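
  (For context, the PVC-to-image mapping above can be read from the PV object; a sketch, assuming the usual ceph-csi volumeAttributes fields are populated:

  oc get pv pvc-54a1120a-de56-4cd4-883b-ec289766d8e1 \
      -o jsonpath='{.spec.csi.volumeAttributes.pool}{"/"}{.spec.csi.volumeAttributes.imageName}{"\n"}'
  )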


Additional info:

The initial error while noobaa-db-pg-0 was “Pending”:

Warning  FailedMount  0s (x663 over 22h)   kubelet  MountVolume.MountDevice failed for volume "pvc-54a1120a-de56-4cd4-883b-ec289766d8e1" : rpc error: code = Internal desc = 'fsck' found errors on device /dev/rbd0


After running fsck manually, it changed to the following:


Normal    SuccessfulAttachVolume   pod/noobaa-db-pg-0    AttachVolume.Attach succeeded for volume "pvc-54a1120a-de56-4cd4-883b-ec289766d8e1"

But we’re still seeing:

Warning  BackOff       3m18s (x730 over 163m)  kubelet      Back-off restarting failed container

 oc get po noobaa-db-pg-0
NAME             READY   STATUS             RESTARTS         AGE
noobaa-db-pg-0   0/1     CrashLoopBackOff   230 (118s ago)   19h


  - name: db
    ready: false
    restartCount: 35
    started: false
    state:
      waiting:
        message: back-off 5m0s restarting failed container=db pod=noobaa-db-pg-0_openshift-storage(cae4419b-ced3-4819-8af6-50a38dc45d83)
        reason: CrashLoopBackOff
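
  (The fragment above is from the pod's status.containerStatuses; for reference, it can be pulled with something like:

  oc -n openshift-storage get pod noobaa-db-pg-0 -o jsonpath='{.status.containerStatuses}'
  )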



oc logs noobaa-db-pg-0
waiting for server to start....2022-09-17 13:08:49.166 UTC [22] LOG:  starting PostgreSQL 12.11 on x86_64-redhat-linux-gnu, compiled by gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-10), 64-bit
2022-09-17 13:08:49.167 UTC [22] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2022-09-17 13:08:49.178 UTC [22] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
2022-09-17 13:08:49.259 UTC [22] LOG:  redirecting log output to logging collector process
2022-09-17 13:08:49.259 UTC [22] HINT:  Future log output will appear in directory "log".
... stopped waiting
pg_ctl: could not start server
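
  Since the server redirects its output to the logging collector, the actual startup failure should be in the "log" directory under the data directory on the PVC rather than in the pod log above. A sketch of how to look for it (assuming the container stays up long enough for rsh; otherwise the PVC would have to be mounted from a debug pod, and $PGDATA is the variable referenced later in this bug):

  oc -n openshift-storage logs noobaa-db-pg-0 --previous    # output of the last crashed container
  oc -n openshift-storage rsh noobaa-db-pg-0
  ls -lt "$PGDATA/log" && tail -n 50 "$PGDATA/log"/*.log    # newest PostgreSQL server log files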


  Thanks!


Regards,


Craig Wayman
TSE Red Hat OpenShift Data Foundations (ODF) 
Customer Experience and Engagement, NA

Comment 19 Craig Wayman 2022-11-07 15:39:21 UTC
Good Morning,

  I understand, and this makes sense: the case was initially opened because noobaa-db was down and the customer was being told to run fsck manually on /dev/rbd0, so data corruption was the initial issue/concern. We successfully ran fsck manually, and the noobaa-db PVC mount then transitioned from failed to successful, but the statefulset and pod went to CLBO. I will inform the customer of the fix mentioned: backing up the "$PGDATA/pg_logical/replorigin_checkpoint" file and then removing it. I will also put emphasis on backing it up to ensure there is no data loss. I appreciate your time and effort.
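
  A rough outline of that step, for the record (a sketch: whether it is run via rsh into the db container or from a debug pod that mounts the noobaa-db PVC depends on whether the container stays up long enough, the backup path is an assumption meant to land on the persistent volume, and the final pod restart is only an example):

  oc -n openshift-storage rsh noobaa-db-pg-0
  # back up the replication-origin checkpoint first so nothing is lost
  cp -a "$PGDATA/pg_logical/replorigin_checkpoint" "$PGDATA/replorigin_checkpoint.bak"
  # remove the checkpoint file so PostgreSQL can recreate it
  rm "$PGDATA/pg_logical/replorigin_checkpoint"
  exit
  oc -n openshift-storage delete pod noobaa-db-pg-0   # restart the db pod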

Regards,


Craig Wayman
TSE Red Hat OpenShift Data Foundations (ODF) 
Customer Experience and Engagement, NA