Bug 1834939 - rook-ceph-crashcollector went into CLBO when the LocalVolume symlink is deleted
Summary: rook-ceph-crashcollector went into CLBO when the LocalVolume symlink is deleted
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: rook
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: OCS 4.5.0
Assignee: Sébastien Han
QA Contact: Oded
URL:
Whiteboard:
Depends On:
Blocks: 1859307
 
Reported: 2020-05-12 17:28 UTC by Pratik Surve
Modified: 2023-12-15 17:53 UTC
CC List: 9 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
.`crash-collector` runs smoothly on OpenShift Container Platform
Previously, the `crash-collector` deployment lacked the permissions needed to run on OpenShift Container Platform. The appropriate security context has been added to allow access to a path on the host.
Clone Of:
Environment:
Last Closed: 2020-09-15 10:17:01 UTC
Embargoed:




Links:
GitHub rook/rook pull 5548 (closed): ceph: add missing security context to crash collector (last updated 2021-02-05 15:15:54 UTC)
Red Hat Product Errata RHBA-2020:3754 (last updated 2020-09-15 10:17:23 UTC)
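For anyone verifying the linked PR on a running cluster, one way to confirm that the crash-collector pods now carry a security context is to inspect their deployments. This is only a sketch; the label selector and jsonpath expression are assumptions and may need adjusting:

$ oc -n openshift-storage get deploy -l app=rook-ceph-crashcollector \
    -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.template.spec.containers[0].securityContext}{"\n"}{end}'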

Comment 2 Travis Nielsen 2020-05-12 17:37:25 UTC
I would propose we move this out to 4.5. The crash collector does not impact the cluster health.

The crash collector watches the status of Ceph daemon pods and restarts whenever there is a change in pod state. Most likely the OSD pod was deleted, and the crash collector controller was responding to that event.
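(For reference, that restart behavior can be observed live by watching the crash-collector pods while a Ceph daemon pod is deleted; the grep filter below is just an illustration:)

$ oc get pod -n openshift-storage -w | grep rook-ceph-crashcollector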

@Seb, any concern with this issue before we move it out to 4.5?

Comment 4 Sébastien Han 2020-05-13 10:43:25 UTC
It looks like the crash collector tried to post a crash but there was an issue accessing the crash directory.
Can you share the content of dataDirHostPath (/var/lib/rook on the host)?
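(One way to collect this, with the node name as a placeholder:)

$ oc debug node/<worker-node> -- chroot /host ls -lR /var/lib/rook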

Ultimately, restarting the pod should solve the issue if the volume comes back.
Also, this component is not critical to the health of the cluster, so we can move it out of 4.4.
One more thing: we might need to improve the error handling of the ceph_crash daemon, which means changes in RHCS, so this can't be done in OCS 4.4.
 
Thanks.

Comment 11 Sébastien Han 2020-08-06 06:59:59 UTC
Done, hopefully it's clear.

Comment 12 Oded 2020-08-13 15:06:04 UTC
Bug not reproduced.

Setup:
Provider: AWS
OCP Version: 4.5.0-0.nightly-2020-08-13-071731
OCS Version: ocs-operator.v4.5.0-54.ci
Mode: Internal
LSO:
$ oc get sc
NAME                          PROVISIONER                             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
gp2                           kubernetes.io/aws-ebs                   Delete          WaitForFirstConsumer   true                   161m
localblock                    kubernetes.io/no-provisioner            Delete          WaitForFirstConsumer   false                  148m
ocs-storagecluster-ceph-rbd   openshift-storage.rbd.csi.ceph.com      Delete          Immediate              true                   136m
ocs-storagecluster-cephfs     openshift-storage.cephfs.csi.ceph.com   Delete          Immediate              true                   136m



Test Process:
1. Check the rook-ceph-crashcollector pod status:
$ oc get pod -n openshift-storage -o wide | grep rook-ceph-crashcollector 
rook-ceph-crashcollector-ip-10-0-150-110-b89f9f678-ntb2j          1/1     Running     0          136m   10.128.2.20    ip-10-0-150-110.us-east-2.compute.internal   <none>           <none>
rook-ceph-crashcollector-ip-10-0-178-133-67ff7d6b7-jnj8n          1/1     Running     0          136m   10.129.2.19    ip-10-0-178-133.us-east-2.compute.internal   <none>           <none>
rook-ceph-crashcollector-ip-10-0-218-74-5c7b59dd8d-mt9bb          1/1     Running     0          136m   10.131.0.25    ip-10-0-218-74.us-east-2.compute.internal    <none>           <none>

2. Connect to the "ip-10-0-218-74.us-east-2.compute.internal" worker node:
$ oc debug node/ip-10-0-218-74.us-east-2.compute.internal
Starting pod/ip-10-0-218-74us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.218.74
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host /bin/bash
[root@ip-10-0-218-74 /]#

3. Remove the nvme1n1 symlink:
[root@ip-10-0-218-74 /]# rm /mnt/local-storage/localblock/nvme1n1
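(For reference, the remaining symlinks under the localblock directory can be listed to confirm the removal; output will vary by node:)

[root@ip-10-0-218-74 /]# ls -l /mnt/local-storage/localblock/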


4. Check the rook-ceph-crashcollector pod status:
$ oc get pod -n openshift-storage -o wide | grep rook-ceph-crashcollector 
rook-ceph-crashcollector-ip-10-0-150-110-b89f9f678-ntb2j          1/1     Running     0          144m   10.128.2.20    ip-10-0-150-110.us-east-2.compute.internal   <none>           <none>
rook-ceph-crashcollector-ip-10-0-178-133-67ff7d6b7-jnj8n          1/1     Running     0          144m   10.129.2.19    ip-10-0-178-133.us-east-2.compute.internal   <none>           <none>
rook-ceph-crashcollector-ip-10-0-218-74-5c7b59dd8d-mt9bb          1/1     Running     0          144m   10.131.0.25    ip-10-0-218-74.us-east-2.compute.internal    <none>           <none>

5. Remove the nvme2n1 symlink:
[root@ip-10-0-218-74 /]# rm /mnt/local-storage/localblock/nvme2n1

6. Check the rook-ceph-crashcollector pod status:
$ oc get pod -n openshift-storage -o wide | grep rook-ceph-crashcollector 
rook-ceph-crashcollector-ip-10-0-150-110-b89f9f678-ntb2j          1/1     Running     0          148m   10.128.2.20    ip-10-0-150-110.us-east-2.compute.internal   <none>           <none>
rook-ceph-crashcollector-ip-10-0-178-133-67ff7d6b7-jnj8n          1/1     Running     0          147m   10.129.2.19    ip-10-0-178-133.us-east-2.compute.internal   <none>           <none>
rook-ceph-crashcollector-ip-10-0-218-74-5c7b59dd8d-mt9bb          1/1     Running     0          148m   10.131.0.25    ip-10-0-218-74.us-east-2.compute.internal    <none>           <none>


** noobaa-db-0 is in Pending state; I did not check its status before deleting nvme2n1.
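(A quick way to follow up on why that pod is Pending, using a standard oc command:)

$ oc describe pod noobaa-db-0 -n openshift-storage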

Comment 14 errata-xmlrpc 2020-09-15 10:17:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Container Storage 4.5.0 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3754

