Set Customer Escalation flag = Yes, per ACE EN-EN-47237.

From SA Trey Prinz: I spoke to the customer this morning, and he is very appreciative of everyone's help getting their system back up. At this stage, they need a Root Cause Analysis as soon as possible. Their peak filing season starts this Friday and runs through 1/20/2022, and they want to make sure they understand what happened so that they can ensure they are prepared. The customer told me that if the system goes down during this time, it would make the news and would also cost the state millions in penalties and interest, because they would have to move the filing deadline out. The customer wants to understand why there were two OSD failures, especially since the drives backing them are RAID-type drives provided by the Nutanix storage system. He also feels it is a little odd that they saw two failures like this. Any help you can provide would be appreciated.
Kelson and I looked through the sosreports and could not find any errors related to a disk failure.
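For reference, a sketch of the kind of search that can be run against the sosreports for disk-level errors; the paths below are assumptions based on a typical sosreport layout and may differ per node:

# grep -iE 'blk_update_request|i/o error|medium error|sense key' var/log/messages
# grep -iE 'blk_update_request|i/o error|medium error' sos_commands/kernel/dmesg
# grep -iE 'blk_update_request|i/o error|medium error' sos_commands/logs/journalctl_--no-pager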
Thank you Vikhyat and Adam. We reviewed your analysis with the customer on a call on 1/14. Nutanix engineers and representatives of VMware stated that they observed no issues with the HW/storage stack in their logs. We will need assistance capturing more verbose debug logs if the customer hits this issue again. Specifically, we need the exact commands to run and where to run them, i.e. in the rook-ceph-tools pod or at the node layer.
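As a starting point until engineering confirms the exact subsystems and levels, a sketch of one way to raise the relevant Ceph debug levels from the rook-ceph-tools pod (the tools deployment name and the chosen subsystems/levels are assumptions and should be validated before handing them to the customer):

++ from the rook-ceph-tools pod
# oc rsh -n openshift-storage deploy/rook-ceph-tools

++ raise BlueStore / block-device / RocksDB logging for the OSDs (20 is the most verbose)
# ceph config set osd debug_bluestore 20
# ceph config set osd debug_bdev 20
# ceph config set osd debug_rocksdb 10

++ verify what a given OSD is actually running with
# ceph config show osd.2 debug_bluestore

++ revert once the data is captured, as these levels generate a lot of log volume
# ceph config rm osd debug_bluestore
# ceph config rm osd debug_bdev
# ceph config rm osd debug_rocksdb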
++ from a running osd pod

# oc rsh rook-ceph-osd-2-bc9df5c5d-94cqt
Defaulting container name to osd.
Use 'oc describe pod/rook-ceph-osd-2-bc9df5c5d-94cqt -n openshift-storage' to see all of the containers in this pod.

Step 2:)
# cd /var/log/ceph
# pwd
/var/log/ceph

# mkdir bluefs

sh-4.4# ceph-bluestore-tool --out-dir /var/log/ceph/bluefs --path /var/lib/ceph/osd/ceph-2 bluefs-export
inferring bluefs devices from bluestore path
 slot 1 /var/lib/ceph/osd/ceph-2/block -> /var/lib/ceph/osd/ceph-2/block
unable to open /var/lib/ceph/osd/ceph-2/block: (11) Resource temporarily unavailable
2022-01-19T14:33:59.297+0000 7fc2c56a13c0 -1 bdev(0x564b605f1000 /var/lib/ceph/osd/ceph-2/block) open failed to lock /var/lib/ceph/osd/ceph-2/block: (11) Resource temporarily unavailable

++ created CLBO state for osd-2 by unmounting the volume for the osd in AWS, which replicates what the customer will see when "Corruption: block checksum mismatch" is hit.

rook-ceph-osd-2-bc9df5c5d-94cqt    1/2    CrashLoopBackOff    45 (4m27s ago)    27h

# oc rsh rook-ceph-osd-2-bc9df5c5d-94cqt
Defaulting container name to osd.
Use 'oc describe pod/rook-ceph-osd-2-bc9df5c5d-94cqt -n openshift-storage' to see all of the containers in this pod.
error: unable to upgrade connection: container not found ("osd")
(In reply to khover from comment #19)
> ++ from a running osd pod
>
> # oc rsh rook-ceph-osd-2-bc9df5c5d-94cqt
> Defaulting container name to osd.
> Use 'oc describe pod/rook-ceph-osd-2-bc9df5c5d-94cqt -n openshift-storage' to see all of the containers in this pod.
>
> Step 2:)
> # cd /var/log/ceph
> # pwd
> /var/log/ceph
>
> # mkdir bluefs
>
> sh-4.4# ceph-bluestore-tool --out-dir /var/log/ceph/bluefs --path /var/lib/ceph/osd/ceph-2 bluefs-export
> inferring bluefs devices from bluestore path
> slot 1 /var/lib/ceph/osd/ceph-2/block -> /var/lib/ceph/osd/ceph-2/block
> unable to open /var/lib/ceph/osd/ceph-2/block: (11) Resource temporarily unavailable

The ceph-bluestore-tool or ceph-objectstore-tool can only be run when the OSD is stopped (or dead); it cannot be run against a live OSD. In this case the OSD is running, and because of that you get the above error: "(11) Resource temporarily unavailable".

> 2022-01-19T14:33:59.297+0000 7fc2c56a13c0 -1 bdev(0x564b605f1000 /var/lib/ceph/osd/ceph-2/block) open failed to lock /var/lib/ceph/osd/ceph-2/block: (11) Resource temporarily unavailable
>
> ++ created CLBO state for osd-2 by unmounting the volume for the osd in AWS, which replicates what the customer will see when "Corruption: block checksum mismatch" is hit.
>
> rook-ceph-osd-2-bc9df5c5d-94cqt    1/2    CrashLoopBackOff    45 (4m27s ago)    27h
>
> # oc rsh rook-ceph-osd-2-bc9df5c5d-94cqt
> Defaulting container name to osd.
> Use 'oc describe pod/rook-ceph-osd-2-bc9df5c5d-94cqt -n openshift-storage' to see all of the containers in this pod.
> error: unable to upgrade connection: container not found ("osd")
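For reference, a quick sketch of how to confirm whether a ceph-osd process is still holding the block device before attempting the export (run inside the osd container; assumes ps is available in the image):

sh-4.4# ps -ef | grep '[c]eph-osd'
++ if this returns a running ceph-osd process, the tool will fail with (11) Resource temporarily unavailable as above; the OSD must be stopped first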
Customer state after corruption - osds stuck in CLBO:

rook-ceph-osd-1-57969fb978-6mwj2    1/2    CrashLoopBackOff    1 (5s ago)    3d22h

[root@vm250-43 ~]# oc scale deployment rook-ceph-operator --replicas=0 -n openshift-storage
deployment.apps/rook-ceph-operator scaled
[root@vm250-43 ~]# oc scale deployment ocs-operator --replicas=0 -n openshift-storage
deployment.apps/ocs-operator scaled

# oc get deployment rook-ceph-osd-1 -oyaml > rook-ceph-osd-1-deployment.yaml

[root@vm250-43 ~]# oc patch deployment/rook-ceph-osd-1 -n openshift-storage -p '{"spec": {"template": {"spec": {"containers": [{"name": "osd", "command": ["sleep", "infinity"], "args": []}]}}}}'
deployment.apps/rook-ceph-osd-1 patched

rook-ceph-osd-1-57969fb978-6mwj2    0/2    Terminating    5    3d22h

# oc patch deployment/rook-ceph-osd-1 -n openshift-storage --type='json' -p '[{"op":"remove", "path":"/spec/template/spec/containers/0/livenessProbe"}]'
deployment.apps/rook-ceph-osd-1 patched

rook-ceph-osd-1-57969fb978-6mwj2    0/2    Terminating    5    3d22h
stuck terminating ^^

delete osd pod

rook-ceph-osd-1-7bc6dbcd6d-9mtfr    0/2    Init:0/4    0    9s
stuck init ^^

# oc rsh rook-ceph-osd-1-7bc6dbcd6d-9mtfr
Defaulting container name to osd.
Use 'oc describe pod/rook-ceph-osd-1-7bc6dbcd6d-9mtfr -n openshift-storage' to see all of the containers in this pod.
error: unable to upgrade connection: container not found ("osd")
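For reference, a sketch of how the stuck pods can be handled and inspected (pod names taken from the output above; the force delete is an assumption about how the Terminating pod was cleared in this repro):

++ clear the pod stuck in Terminating
# oc delete pod rook-ceph-osd-1-57969fb978-6mwj2 -n openshift-storage --force --grace-period=0

++ the replacement pod sticks in Init because the backing volume was removed in AWS; describe/events show why
# oc describe pod rook-ceph-osd-1-7bc6dbcd6d-9mtfr -n openshift-storage
# oc get events -n openshift-storage --field-selector involvedObject.name=rook-ceph-osd-1-7bc6dbcd6d-9mtfr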
Hello Ashish, agreed that this is not a correct reproducer, since it removes the disks; essentially the customer had no disk at all as far as OCS is concerned. In the customer's case it is the corruption that causes the osd pod to CLBO. The question I have is: how do we fetch the sst files when the pod is in CLBO? If the customer hits this again, the osd pod will be in CLBO even though the disk is present.
Could we override the cmd or entrypoint of that particular OSD pod so that it sits at a sleep indefinitely? That could prevent the CLBO, unless the readiness and liveness probes are too sensitive. Then we could access the sst files with "oc rsync". A rough sketch of that flow is below.
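To make that concrete, a minimal sketch of the proposed flow, reusing the commands already captured in this bug; osd-1 and the pod name are placeholders, and restoring the original deployment spec afterwards (from the saved rook-ceph-osd-1-deployment.yaml) is assumed:

++ stop the operators so they do not revert the changes
# oc scale deployment rook-ceph-operator --replicas=0 -n openshift-storage
# oc scale deployment ocs-operator --replicas=0 -n openshift-storage

++ save the original deployment, then replace the osd entrypoint with sleep and drop the liveness probe
# oc get deployment rook-ceph-osd-1 -n openshift-storage -oyaml > rook-ceph-osd-1-deployment.yaml
# oc patch deployment/rook-ceph-osd-1 -n openshift-storage -p '{"spec": {"template": {"spec": {"containers": [{"name": "osd", "command": ["sleep", "infinity"], "args": []}]}}}}'
# oc patch deployment/rook-ceph-osd-1 -n openshift-storage --type='json' -p '[{"op":"remove", "path":"/spec/template/spec/containers/0/livenessProbe"}]'

++ with ceph-osd no longer running in the container, the block device lock is free, so bluefs-export should work
# oc rsh -n openshift-storage <rook-ceph-osd-1-pod>
sh-4.4# mkdir -p /var/log/ceph/bluefs
sh-4.4# ceph-bluestore-tool --out-dir /var/log/ceph/bluefs --path /var/lib/ceph/osd/ceph-1 bluefs-export
sh-4.4# exit

++ copy the exported BlueFS contents (including the RocksDB .sst files) off the pod
# mkdir osd-1-bluefs
# oc rsync <rook-ceph-osd-1-pod>:/var/log/ceph/bluefs ./osd-1-bluefs -c osd -n openshift-storage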