Bug 2038387

Summary: [GSS][RCA] OSDs have rocksdb corruption reporting block checksum mismatch
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: kelwhite
Component: ceph
Assignee: Adam Kupczyk <akupczyk>
Status: CLOSED NOTABUG
QA Contact: Harish NV Rao <hnallurv>
Severity: high
Priority: urgent
Version: 4.6
CC: akupczyk, assingh, awyatt, bniver, dhellard, gsitlani, khover, madam, mhackett, mmuench, muagarwa, nojha, ocs-bugs, odf-bz-bot, tpetr, tprinz, vumrao, ykaul
Hardware: Unspecified
OS: Unspecified
Last Closed: 2022-04-12 14:12:46 UTC
Type: Bug

Comment 7 Dwayne 2022-01-11 13:23:57 UTC
Set Customer Escalation flag = Yes, per ACE EN-EN-47237.


From SA Trey Prinz,

I spoke to the customer this morning and he is very appreciative of everyone's help getting their system back up.

At this stage, they need a Root Cause Analysis as soon as possible. Their peak filing season starts this Friday and runs through 1/20/2022, and they want to understand what happened so they can be sure they are prepared.

The customer told me that if the system goes down during this time, it would make the news and cost the state millions in penalties and interest, because they would have to move the filing deadline out.

The customer wants to understand why there were two OSD failures, especially since the drives backing them are RAID-type drives provided by the Nutanix storage system. Seeing two failures like this strikes him as odd.

Any help you can provide would be appreciated.

Comment 10 khover 2022-01-13 00:06:26 UTC
Kelson and I looked through the sos reports and could not find any errors related to a disk failure.

Comment 14 khover 2022-01-14 19:28:19 UTC
Thank you Vikhyat and Adam,

We reviewed your analysis with the customer on a call 1/14.

Nutanix engineers and VMware representatives stated that no issues with the HW/storage stack were observed in their logs.

We will need assistance capturing more verbose debug logs if the customer hits this issue again. 

Specifically, we need the commands to run and where to run them, i.e., in the rook-ceph-tools pod or at the node layer.
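
For illustration, a minimal sketch of the kind of settings that could be applied from the rook-ceph-tools pod to get more verbose BlueStore/BlueFS/rocksdb logging on the next occurrence (assuming osd.2 is the affected OSD; the exact daemons and debug levels should be confirmed by engineering):

# oc rsh -n openshift-storage deploy/rook-ceph-tools
sh-4.4# ceph config set osd.2 debug_bluestore 20/20
sh-4.4# ceph config set osd.2 debug_bluefs 20/20
sh-4.4# ceph config set osd.2 debug_rocksdb 20/20
sh-4.4# ceph config set osd.2 debug_osd 10/10

Once the logs are captured, the overrides can be removed again, e.g.:

sh-4.4# ceph config rm osd.2 debug_bluestore
sh-4.4# ceph config rm osd.2 debug_bluefs
sh-4.4# ceph config rm osd.2 debug_rocksdb
sh-4.4# ceph config rm osd.2 debug_osd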

Comment 19 khover 2022-01-19 21:50:53 UTC
++from a running osd pod

# oc rsh rook-ceph-osd-2-bc9df5c5d-94cqt 
Defaulting container name to osd.
Use 'oc describe pod/rook-ceph-osd-2-bc9df5c5d-94cqt -n openshift-storage' to see all of the containers in this pod.


Step 2:
# cd /var/log/ceph
# pwd
/var/log/ceph

# mkdir bluefs 


sh-4.4# ceph-bluestore-tool --out-dir /var/log/ceph/bluefs --path /var/lib/ceph/osd/ceph-2 bluefs-export
inferring bluefs devices from bluestore path
 slot 1 /var/lib/ceph/osd/ceph-2/block -> /var/lib/ceph/osd/ceph-2/block
unable to open /var/lib/ceph/osd/ceph-2/block: (11) Resource temporarily unavailable
2022-01-19T14:33:59.297+0000 7fc2c56a13c0 -1 bdev(0x564b605f1000 /var/lib/ceph/osd/ceph-2/block) open failed to lock /var/lib/ceph/osd/ceph-2/block: (11) Resource temporarily unavailable


++ created a CLBO state for osd-2 by unmounting the OSD's volume in AWS, which replicates what the customer will see when "Corruption: block checksum mismatch" is hit. 

rook-ceph-osd-2-bc9df5c5d-94cqt                                   1/2     CrashLoopBackOff   45 (4m27s ago)   27h
 

# oc rsh rook-ceph-osd-2-bc9df5c5d-94cqt 
Defaulting container name to osd.
Use 'oc describe pod/rook-ceph-osd-2-bc9df5c5d-94cqt -n openshift-storage' to see all of the containers in this pod.
error: unable to upgrade connection: container not found ("osd")

Comment 20 Vikhyat Umrao 2022-01-19 21:56:14 UTC
(In reply to khover from comment #19)
> ++from a running osd pod
> 
> # oc rsh rook-ceph-osd-2-bc9df5c5d-94cqt 
> Defaulting container name to osd.
> Use 'oc describe pod/rook-ceph-osd-2-bc9df5c5d-94cqt -n openshift-storage'
> to see all of the containers in this pod.
> 
> 
> Step 2:) 
> # cd /var/log/ceph
> # pwd
> /var/log/ceph
> 
> # mkdir bluefs 
> 
> 
> sh-4.4# ceph-bluestore-tool --out-dir /var/log/ceph/bluefs --path
> /var/lib/ceph/osd/ceph-2 bluefs-export
> inferring bluefs devices from bluestore path
>  slot 1 /var/lib/ceph/osd/ceph-2/block -> /var/lib/ceph/osd/ceph-2/block
> unable to open /var/lib/ceph/osd/ceph-2/block: (11) Resource temporarily
> unavailable

The ceph-bluestore-tool and ceph-objectstore-tool can only be run when the OSD is stopped (or dead); they cannot be run against a live OSD. In this case the OSD is running, and that is why you get the above error: "(11) Resource temporarily unavailable".


> 2022-01-19T14:33:59.297+0000 7fc2c56a13c0 -1 bdev(0x564b605f1000
> /var/lib/ceph/osd/ceph-2/block) open failed to lock
> /var/lib/ceph/osd/ceph-2/block: (11) Resource temporarily unavailable
> 
> 
> ++ created CLBO state for osd-2 by unmounting volume for osd in aws which
> replicates what the cu will see when Corruption: block checksum mismatch is
> hit. 
> 
> rook-ceph-osd-2-bc9df5c5d-94cqt                                   1/2    
> CrashLoopBackOff   45 (4m27s ago)   27h
>  
> 
> # oc rsh rook-ceph-osd-2-bc9df5c5d-94cqt 
> Defaulting container name to osd.
> Use 'oc describe pod/rook-ceph-osd-2-bc9df5c5d-94cqt -n openshift-storage'
> to see all of the containers in this pod.
> error: unable to upgrade connection: container not found ("osd")
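
To make the sequence concrete, a sketch of what this looks like for osd-2 (assuming the ceph-osd process in the container has been stopped, for example by overriding the container command with a sleep as attempted in the following comments):

++ while ceph-osd is still running it holds an exclusive lock on the block device, so the tool fails with EAGAIN (11):

sh-4.4# ceph-bluestore-tool --out-dir /var/log/ceph/bluefs --path /var/lib/ceph/osd/ceph-2 bluefs-export
unable to open /var/lib/ceph/osd/ceph-2/block: (11) Resource temporarily unavailable

++ once the ceph-osd process is no longer running in the container, the same command should be able to export the BlueFS contents, including the rocksdb .sst files, under the out-dir:

sh-4.4# ceph-bluestore-tool --out-dir /var/log/ceph/bluefs --path /var/lib/ceph/osd/ceph-2 bluefs-export
sh-4.4# ls /var/log/ceph/bluefs/db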

Comment 23 khover 2022-01-20 15:40:44 UTC
Customer state after the corruption: OSDs stuck in CLBO

rook-ceph-osd-1-57969fb978-6mwj2                                  1/2     CrashLoopBackOff   1 (5s ago)   3d22h


[root@vm250-43 ~]# oc scale deployment rook-ceph-operator --replicas=0 -n openshift-storage
deployment.apps/rook-ceph-operator scaled
[root@vm250-43 ~]# oc scale deployment ocs-operator --replicas=0 -n openshift-storage
deployment.apps/ocs-operator scaled


# oc get deployment rook-ceph-osd-1 -oyaml > rook-ceph-osd-1-deployment.yaml

[root@vm250-43 ~]# oc patch deployment/rook-ceph-osd-1 -n openshift-storage -p '{"spec": {"template": {"spec": {"containers": [{"name": "osd", "command": ["sleep", "infinity"], "args": []}]}}}}'
deployment.apps/rook-ceph-osd-1 patched

rook-ceph-osd-1-57969fb978-6mwj2                                  0/2     Terminating   5          3d22h


# oc patch deployment/rook-ceph-osd-1  -n openshift-storage --type='json' -p '[{"op":"remove", "path":"/spec/template/spec/containers/0/livenessProbe"}]'
deployment.apps/rook-ceph-osd-1 patched


rook-ceph-osd-1-57969fb978-6mwj2                                  0/2     Terminating   5          3d22h

stuck terminating ^^

delete osd pod 

rook-ceph-osd-1-7bc6dbcd6d-9mtfr                                  0/2     Init:0/4    0          9s

stuck init ^^

# oc rsh rook-ceph-osd-1-7bc6dbcd6d-9mtfr
Defaulting container name to osd.
Use 'oc describe pod/rook-ceph-osd-1-7bc6dbcd6d-9mtfr -n openshift-storage' to see all of the containers in this pod.
error: unable to upgrade connection: container not found ("osd")

Comment 25 khover 2022-01-20 16:57:14 UTC
Hello Ashish,

Agreed that removing the disks is not a correct reproducer. 

Essentially, the customer had no disk as far as OCS is concerned; the corruption caused the OSD pod to go into CLBO.

The question I have is: how do we fetch the rocksdb .sst files when the pod is in CLBO?

If the customer hits this again, the OSD pod will be in CLBO even though the disk is present.

Comment 26 Skip Wyatt 2022-01-24 14:55:39 UTC
Could we override the cmd or entrypoint of that particular OSD pod so that it sleeps indefinitely? That would prevent the CLBO, unless the readiness and health probes are too sensitive.

Then we could access the sst files with "oc rsync".
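
A rough sketch of that approach, combining it with the scale-down and patch steps from comment 23 (assuming osd-1 is the affected OSD and its device is still present, unlike the artificial reproducer above where the volume was detached; the pod name placeholder below must be replaced with the actual pod name, and other probes may also need to be removed):

# oc scale deployment rook-ceph-operator --replicas=0 -n openshift-storage
# oc scale deployment ocs-operator --replicas=0 -n openshift-storage
# oc patch deployment/rook-ceph-osd-1 -n openshift-storage -p '{"spec": {"template": {"spec": {"containers": [{"name": "osd", "command": ["sleep", "infinity"], "args": []}]}}}}'
# oc patch deployment/rook-ceph-osd-1 -n openshift-storage --type='json' -p '[{"op":"remove", "path":"/spec/template/spec/containers/0/livenessProbe"}]'

With the osd container idling instead of crash-looping, export BlueFS and copy the result off the pod:

# oc rsh -n openshift-storage -c osd deploy/rook-ceph-osd-1
sh-4.4# ceph-bluestore-tool --out-dir /var/log/ceph/bluefs --path /var/lib/ceph/osd/ceph-1 bluefs-export
sh-4.4# exit
# oc rsync -n openshift-storage -c osd <osd-1-pod-name>:/var/log/ceph/bluefs ./bluefs-osd-1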