I propose we move this out to 4.5. The crash collector does not impact cluster health. Its controller watches the status of the Ceph daemon pods and restarts the collector whenever a pod's state changes; most likely the OSD pod was deleted and the crash collector controller was simply responding to that event. @Seb, any concerns with this issue before we move it out to 4.5?
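If anyone wants to observe that behavior from the CLI, here is a minimal sketch; the openshift-storage namespace and the app=rook-ceph-crashcollector label are assumptions based on a typical OCS internal-mode deployment, and <osd-pod-name> is a placeholder:

# Watch the crashcollector pods while a Ceph daemon pod changes state;
# deleting an OSD pod should trigger the controller to react to that event.
$ oc get pod -n openshift-storage -l app=rook-ceph-crashcollector -w
$ oc delete pod -n openshift-storage <osd-pod-name>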
It looks like the crash collector tried to post a crash but hit an error accessing the crash directory. Can you share the content of dataDirHostPath (/var/lib/rook on the host)? Ultimately, restarting the pod should resolve the issue once the volume comes back. Also, this component is not critical to the health of the cluster, so we can move it out of 4.4. One more thing: we might need to improve the error handling of the ceph-crash daemon, which means changes in RHCS, so that can't be done in OCS 4.4. Thanks.
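For reference, a rough way to inspect that directory and bounce the pod; <node-name> and <crashcollector-pod-name> are placeholders, and the exact layout under /var/lib/rook may differ per deployment:

# Inspect the crash directory under dataDirHostPath on the affected node
$ oc debug node/<node-name>
sh-4.2# chroot /host
sh-4.2# ls -l /var/lib/rook
# Then restart the crash collector pod on that node
$ oc delete pod -n openshift-storage <crashcollector-pod-name>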
Done, hopefully it's clear.
Bug not reproduced.

Setup:
Provider: AWS
OCP Version: 4.5.0-0.nightly-2020-08-13-071731
OCS Version: ocs-operator.v4.5.0-54.ci
Mode: Internal
LSO:
$ oc get sc
NAME                          PROVISIONER                             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
gp2                           kubernetes.io/aws-ebs                   Delete          WaitForFirstConsumer   true                   161m
localblock                    kubernetes.io/no-provisioner            Delete          WaitForFirstConsumer   false                  148m
ocs-storagecluster-ceph-rbd   openshift-storage.rbd.csi.ceph.com      Delete          Immediate              true                   136m
ocs-storagecluster-cephfs     openshift-storage.cephfs.csi.ceph.com   Delete          Immediate              true                   136m

Test process:

1. Check the rook-ceph-crashcollector pod status:
$ oc get pod -n openshift-storage -o wide | grep rook-ceph-crashcollector
rook-ceph-crashcollector-ip-10-0-150-110-b89f9f678-ntb2j   1/1   Running   0   136m   10.128.2.20   ip-10-0-150-110.us-east-2.compute.internal   <none>   <none>
rook-ceph-crashcollector-ip-10-0-178-133-67ff7d6b7-jnj8n   1/1   Running   0   136m   10.129.2.19   ip-10-0-178-133.us-east-2.compute.internal   <none>   <none>
rook-ceph-crashcollector-ip-10-0-218-74-5c7b59dd8d-mt9bb   1/1   Running   0   136m   10.131.0.25   ip-10-0-218-74.us-east-2.compute.internal    <none>   <none>

2. Open a debug shell on the "ip-10-0-218-74.us-east-2.compute.internal" worker node:
$ oc debug node/ip-10-0-218-74.us-east-2.compute.internal
Starting pod/ip-10-0-218-74us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.218.74
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host /bin/bash
[root@ip-10-0-218-74 /]#

3. Remove nvme1n1:
[root@ip-10-0-218-74 /]# rm /mnt/local-storage/localblock/nvme1n1

4. Check the rook-ceph-crashcollector pod status:
$ oc get pod -n openshift-storage -o wide | grep rook-ceph-crashcollector
rook-ceph-crashcollector-ip-10-0-150-110-b89f9f678-ntb2j   1/1   Running   0   144m   10.128.2.20   ip-10-0-150-110.us-east-2.compute.internal   <none>   <none>
rook-ceph-crashcollector-ip-10-0-178-133-67ff7d6b7-jnj8n   1/1   Running   0   144m   10.129.2.19   ip-10-0-178-133.us-east-2.compute.internal   <none>   <none>
rook-ceph-crashcollector-ip-10-0-218-74-5c7b59dd8d-mt9bb   1/1   Running   0   144m   10.131.0.25   ip-10-0-218-74.us-east-2.compute.internal    <none>   <none>

5. Remove nvme2n1:
[root@ip-10-0-218-74 /]# rm /mnt/local-storage/localblock/nvme2n1

6. Check the rook-ceph-crashcollector pod status:
$ oc get pod -n openshift-storage -o wide | grep rook-ceph-crashcollector
rook-ceph-crashcollector-ip-10-0-150-110-b89f9f678-ntb2j   1/1   Running   0   148m   10.128.2.20   ip-10-0-150-110.us-east-2.compute.internal   <none>   <none>
rook-ceph-crashcollector-ip-10-0-178-133-67ff7d6b7-jnj8n   1/1   Running   0   147m   10.129.2.19   ip-10-0-178-133.us-east-2.compute.internal   <none>   <none>
rook-ceph-crashcollector-ip-10-0-218-74-5c7b59dd8d-mt9bb   1/1   Running   0   148m   10.131.0.25   ip-10-0-218-74.us-east-2.compute.internal    <none>   <none>

Note: noobaa-db-0 is in Pending state; I did not check its state before deleting nvme2n1.
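As an additional sanity check, one could confirm that no crash reports are pending in the cluster; this assumes the rook-ceph toolbox pod (label app=rook-ceph-tools) is deployed, which is not the case by default:

$ TOOLS_POD=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name)
$ oc -n openshift-storage exec -it $TOOLS_POD -- ceph crash ls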
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Container Storage 4.5.0 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:3754