Bug 2224493 - Panic when operator is fencing a node where PV is not provisioned by CSI
Summary: Panic when operator is fencing a node where PV is not provisioned by CSI
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.14
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ODF 4.14.0
Assignee: Subham Rai
QA Contact: Joy John Pinto
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-07-21 06:37 UTC by Subham Rai
Modified: 2023-11-08 18:53 UTC
CC: 5 users

Fixed In Version: 4.14.0-96
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-11-08 18:52:55 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github rook rook issues 12558 0 None open Panic when operator is fencing a node 2023-07-21 06:37:46 UTC
Github rook rook pull 12563 0 None open rbd: node fencing, skip pv when pv is not backed by csi 2023-07-21 06:46:37 UTC
Red Hat Product Errata RHSA-2023:6832 0 None None None 2023-11-08 18:53:47 UTC

Description Subham Rai 2023-07-21 06:37:46 UTC
Description of problem (please be detailed as possible and provide log
snippets):


This is a negative case where one deployment pod consuming a PV not provisioned by CSI and another deployment pod using an RBD RWO volume are on the same node, and node fencing is triggered.

In this case the rook operator panics. More details: https://github.com/rook/rook/issues/12558
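The upstream fix (rook PR 12563) skips PVs that are not backed by CSI during node fencing. Below is a minimal, self-contained Go sketch of that nil-check pattern; the struct names are simplified stand-ins for the `k8s.io/api/core/v1` types rook actually uses, not rook's real code.

```go
package main

import "fmt"

// CSIPersistentVolumeSource is a stand-in for the CSI source field of a PV.
type CSIPersistentVolumeSource struct {
	Driver       string
	VolumeHandle string
}

// PersistentVolume is a stand-in for corev1.PersistentVolume; CSI is nil
// when the PV was not provisioned by a CSI driver (e.g. an LSO local volume).
type PersistentVolume struct {
	Name string
	CSI  *CSIPersistentVolumeSource
}

// fencingCandidates returns only CSI-backed PVs. Dereferencing pv.CSI
// without this nil check is the kind of access that caused the panic.
func fencingCandidates(pvs []PersistentVolume) []PersistentVolume {
	var out []PersistentVolume
	for _, pv := range pvs {
		if pv.CSI == nil {
			fmt.Printf("skipping pv %q: not provisioned by CSI\n", pv.Name)
			continue
		}
		out = append(out, pv)
	}
	return out
}

func main() {
	pvs := []PersistentVolume{
		{Name: "local-pv-1"}, // LSO-style PV, no CSI source
		{Name: "pvc-abc", CSI: &CSIPersistentVolumeSource{Driver: "openshift-storage.rbd.csi.ceph.com"}},
	}
	for _, pv := range fencingCandidates(pvs) {
		fmt.Println("fencing check for:", pv.Name)
	}
}
```

With the guard in place, the non-CSI PV is logged and skipped instead of causing a nil-pointer dereference.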


Version of all relevant components (if applicable):
4.14

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

yes

Is there any workaround available to the best of your knowledge?
No.


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?
Yes


Can this issue reproduce from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1.
2.
3.


Actual results:


Expected results:


Additional info:

Comment 4 Subham Rai 2023-08-01 15:06:27 UTC
Already in builds.

Comment 7 Yuli Persky 2023-10-17 14:22:01 UTC
How can I simulate a situation where the operator panics while fencing a node whose PV is not provisioned by CSI?
Can you please provide reproduction instructions?
Was there any automated test that ran and caused this?

Comment 8 Subham Rai 2023-10-18 13:40:53 UTC
You can use LSO to create the PV (as mentioned in the upstream link) and use that PV to bind the application pod, then follow similar steps afterward.
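As a sketch of that setup, the manifests below show a non-CSI local PV (the kind LSO creates) and a claim that binds it; all names, the storage class, the disk path, and the node name are hypothetical illustrations, not values from this bug.

```yaml
# Illustrative only: a local (non-CSI) PV pinned to one node.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-pv-example
spec:
  capacity:
    storage: 10Gi
  accessModes: ["ReadWriteOnce"]
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-sc
  local:
    path: /mnt/local-storage/disk1
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values: ["compute-0"]
---
# A claim that binds the local PV; mount it in one deployment pod, and put
# a second pod with an RBD RWO PVC on the same node to set up the scenario.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: local-claim
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: local-sc
  resources:
    requests:
      storage: 10Gi
```

With both pods scheduled on the same node, shutting the node down and applying the out-of-service taint triggers the fencing path described above.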

> Was there any automated test that ran and caused this?
No, it was detected by an upstream user: https://github.com/rook/rook/issues/12558

Comment 9 Joy John Pinto 2023-11-06 16:54:52 UTC
Verified with OCP 4.14.0-0.nightly-2023-11-05-194730 and ODF 4.14.0-161

Created a non-CSI deployment pod and a CSI deployment pod on the same node (compute-0) and shut down node compute-0.

Added taint to compute-0 (oc adm taint nodes compute-0 node.kubernetes.io/out-of-service=nodeshutdown:NoExecute)


All pods in the openshift-storage namespace came back online after a 2-3 minute delay.


[jopinto@jopinto new]$ oc get pods -n openshift-storage
NAME                                                              READY   STATUS      RESTARTS       AGE
csi-addons-controller-manager-6749c89487-bww85                    2/2     Running     1              5m54s
csi-cephfsplugin-hfsqn                                            2/2     Running     0              9h
csi-cephfsplugin-hw4cb                                            2/2     Running     0              9h
csi-cephfsplugin-provisioner-54c89b944d-7svgs                     5/5     Running     0              9h
csi-cephfsplugin-provisioner-54c89b944d-mv9dt                     5/5     Running     0              5m54s
csi-rbdplugin-provisioner-669449fdcb-7zff2                        6/6     Running     0              9h
csi-rbdplugin-provisioner-669449fdcb-m55s2                        6/6     Running     0              9h
csi-rbdplugin-v4fxr                                               3/3     Running     0              9h
csi-rbdplugin-vzkqg                                               3/3     Running     0              9h
noobaa-core-0                                                     1/1     Running     0              5m50s
noobaa-db-pg-0                                                    1/1     Running     0              5m50s
noobaa-endpoint-b69796f8-njl74                                    1/1     Running     0              5m54s
noobaa-operator-686c6444d9-9hg9l                                  2/2     Running     1              5m56s
ocs-metrics-exporter-65c7d9bbbb-529f5                             1/1     Running     0              9h
ocs-operator-5d87659678-g7lkv                                     1/1     Running     3 (2m4s ago)   5m54s
odf-console-674bbff5d9-jw6d7                                      1/1     Running     0              9h
odf-operator-controller-manager-7bf98567cb-gnt8j                  2/2     Running     2 (155m ago)   9h
rook-ceph-crashcollector-compute-1-5c5bf77958-2pjdr               1/1     Running     0              9h
rook-ceph-crashcollector-compute-2-7774c577bf-789lf               1/1     Running     0              9h
rook-ceph-exporter-compute-1-55f5d44457-nh58q                     1/1     Running     0              9h
rook-ceph-exporter-compute-2-6c5c857c9d-nr47l                     1/1     Running     0              9h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-6968bc788lzbz   2/2     Running     2 (61s ago)    9h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-5d94d89dwkpc7   2/2     Running     2 (28s ago)    9h
rook-ceph-mgr-a-79666b4f-7gz5r                                    2/2     Running     0              9h
rook-ceph-mon-a-7bdff6fbf8-f4sdn                                  0/2     Pending     0              5m54s
rook-ceph-mon-b-74d6676bf4-tdjdz                                  2/2     Running     0              9h
rook-ceph-mon-c-6bbcb64766-l22fz                                  2/2     Running     0              9h
rook-ceph-operator-595c4f8ddf-b6swb                               1/1     Running     0              5m54s
rook-ceph-osd-0-b9779fffb-wg85l                                   2/2     Running     0              9h
rook-ceph-osd-1-864567b969-5h5fh                                  2/2     Running     0              9h
rook-ceph-osd-2-7bf4bf998f-4llvc                                  0/2     Pending     0              5m56s
rook-ceph-osd-prepare-593ac2d7fb3c46046eaebe7605f35856-f976t      0/1     Completed   0              9h
rook-ceph-osd-prepare-ec1c36da76df2d2d8ecb6228fbd1c000-6bzj5      0/1     Completed   0              9h
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-6c89f5899v6m   2/2     Running     0              9h
rook-ceph-tools-5bbc55fdf-g878r                                   1/1     Running     0              9h

[jopinto@jopinto new]$ oc get pods -n test2
NAME                           READY   STATUS    RESTARTS   AGE
simple-app-7649fdb746-nrtrd    1/1     Running   0          9m39s
simple-app1-779d9ddf59-8fgh7   1/1     Running   0          9m39s

Comment 11 errata-xmlrpc 2023-11-08 18:52:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.14.0 security, enhancement & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:6832

