I propose we move this out to 4.5. The crash collector does not impact cluster health. Its controller watches the status of the Ceph daemon pods and restarts the collector whenever a pod's state changes; most likely the OSD pod was deleted and the crash collector controller was simply responding to that event. @Seb, any concerns with this issue before we move it out to 4.5?
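If anyone wants to observe that behavior from the CLI, here is a minimal sketch; the openshift-storage namespace and the app=rook-ceph-crashcollector label are assumptions based on a typical OCS internal-mode deployment, and <osd-pod-name> is a placeholder:

# Watch the crashcollector pods while a Ceph daemon pod changes state;
# deleting an OSD pod should trigger the controller to react to that event.
$ oc get pod -n openshift-storage -l app=rook-ceph-crashcollector -w
$ oc delete pod -n openshift-storage <osd-pod-name>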
It looks like the crash collector tried to post a crash but hit an error accessing the crash directory. Can you share the content of dataDirHostPath (/var/lib/rook on the host)? Ultimately, restarting the pod should resolve the issue once the volume comes back. Also, this component is not critical to the health of the cluster, so we can move it out of 4.4. One more thing: we might need to improve the error handling of the ceph-crash daemon, which means changes in RHCS, so that can't be done in OCS 4.4. Thanks.
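For reference, a rough way to inspect that directory and bounce the pod; <node-name> and <crashcollector-pod-name> are placeholders, and the exact layout under /var/lib/rook may differ per deployment:

# Inspect the crash directory under dataDirHostPath on the affected node
$ oc debug node/<node-name>
sh-4.2# chroot /host
sh-4.2# ls -l /var/lib/rook
# Then restart the crash collector pod on that node
$ oc delete pod -n openshift-storage <crashcollector-pod-name>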
Done, hopefully it's clear.
Bug not reproduced.

Setup:
Provider: AWS
OCP Version: 4.5.0-0.nightly-2020-08-13-071731
OCS Version: ocs-operator.v4.5.0-54.ci
Mode: Internal
LSO:
$ oc get sc
NAME                          PROVISIONER                             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
gp2                           kubernetes.io/aws-ebs                   Delete          WaitForFirstConsumer   true                   161m
localblock                    kubernetes.io/no-provisioner            Delete          WaitForFirstConsumer   false                  148m
ocs-storagecluster-ceph-rbd   openshift-storage.rbd.csi.ceph.com      Delete          Immediate              true                   136m
ocs-storagecluster-cephfs     openshift-storage.cephfs.csi.ceph.com   Delete          Immediate              true                   136m

Test process:

1. Check the rook-ceph-crashcollector pod status:
$ oc get pod -n openshift-storage -o wide | grep rook-ceph-crashcollector
rook-ceph-crashcollector-ip-10-0-150-110-b89f9f678-ntb2j   1/1   Running   0   136m   10.128.2.20   ip-10-0-150-110.us-east-2.compute.internal   <none>   <none>
rook-ceph-crashcollector-ip-10-0-178-133-67ff7d6b7-jnj8n   1/1   Running   0   136m   10.129.2.19   ip-10-0-178-133.us-east-2.compute.internal   <none>   <none>
rook-ceph-crashcollector-ip-10-0-218-74-5c7b59dd8d-mt9bb   1/1   Running   0   136m   10.131.0.25   ip-10-0-218-74.us-east-2.compute.internal    <none>   <none>

2. Open a debug shell on the "ip-10-0-218-74.us-east-2.compute.internal" worker node:
$ oc debug node/ip-10-0-218-74.us-east-2.compute.internal
Starting pod/ip-10-0-218-74us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.218.74
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host /bin/bash
[root@ip-10-0-218-74 /]#

3. Remove nvme1n1:
[root@ip-10-0-218-74 /]# rm /mnt/local-storage/localblock/nvme1n1

4. Check the rook-ceph-crashcollector pod status:
$ oc get pod -n openshift-storage -o wide | grep rook-ceph-crashcollector
rook-ceph-crashcollector-ip-10-0-150-110-b89f9f678-ntb2j   1/1   Running   0   144m   10.128.2.20   ip-10-0-150-110.us-east-2.compute.internal   <none>   <none>
rook-ceph-crashcollector-ip-10-0-178-133-67ff7d6b7-jnj8n   1/1   Running   0   144m   10.129.2.19   ip-10-0-178-133.us-east-2.compute.internal   <none>   <none>
rook-ceph-crashcollector-ip-10-0-218-74-5c7b59dd8d-mt9bb   1/1   Running   0   144m   10.131.0.25   ip-10-0-218-74.us-east-2.compute.internal    <none>   <none>

5. Remove nvme2n1:
[root@ip-10-0-218-74 /]# rm /mnt/local-storage/localblock/nvme2n1

6. Check the rook-ceph-crashcollector pod status:
$ oc get pod -n openshift-storage -o wide | grep rook-ceph-crashcollector
rook-ceph-crashcollector-ip-10-0-150-110-b89f9f678-ntb2j   1/1   Running   0   148m   10.128.2.20   ip-10-0-150-110.us-east-2.compute.internal   <none>   <none>
rook-ceph-crashcollector-ip-10-0-178-133-67ff7d6b7-jnj8n   1/1   Running   0   147m   10.129.2.19   ip-10-0-178-133.us-east-2.compute.internal   <none>   <none>
rook-ceph-crashcollector-ip-10-0-218-74-5c7b59dd8d-mt9bb   1/1   Running   0   148m   10.131.0.25   ip-10-0-218-74.us-east-2.compute.internal    <none>   <none>

Note: noobaa-db-0 is in Pending state; I did not check its state before deleting nvme2n1.
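As an additional sanity check, one could confirm that no crash reports are pending in the cluster; this assumes the rook-ceph toolbox pod (label app=rook-ceph-tools) is deployed, which is not the case by default:

$ TOOLS_POD=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name)
$ oc -n openshift-storage exec -it $TOOLS_POD -- ceph crash ls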
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Container Storage 4.5.0 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:3754