Bug 2224493 - Panic when operator is fencing a node where PV is not provisioned by CSI
Summary: Panic when operator is fencing a node where PV is not provisioned by CSI
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.14
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ODF 4.14.0
Assignee: Subham Rai
QA Contact: Joy John Pinto
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-07-21 06:37 UTC by Subham Rai
Modified: 2023-11-08 18:53 UTC
CC: 5 users

Fixed In Version: 4.14.0-96
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-11-08 18:52:55 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github rook rook issues 12558 0 None open Panic when operator is fencing a node 2023-07-21 06:37:46 UTC
Github rook rook pull 12563 0 None open rbd: node fencing, skip pv when pv is not backed by csi 2023-07-21 06:46:37 UTC
Red Hat Product Errata RHSA-2023:6832 0 None None None 2023-11-08 18:53:47 UTC

Description Subham Rai 2023-07-21 06:37:46 UTC
Description of problem (please be detailed as possible and provide log
snippets):


This is a negative case where one deployment pod consuming a PV not provisioned by CSI and another deployment pod using an RBD RWO volume are on the same node, and node fencing is triggered.

In this case the rook operator panics. More details: https://github.com/rook/rook/issues/12558
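The upstream fix (rook PR 12563) skips PVs that are not backed by CSI during node fencing. Below is a minimal, self-contained Go sketch of that nil-check pattern; the struct names are simplified stand-ins for the `k8s.io/api/core/v1` types rook actually uses, not rook's real code.

```go
package main

import "fmt"

// CSIPersistentVolumeSource is a stand-in for the CSI source field of a PV.
type CSIPersistentVolumeSource struct {
	Driver       string
	VolumeHandle string
}

// PersistentVolume is a stand-in for corev1.PersistentVolume; CSI is nil
// when the PV was not provisioned by a CSI driver (e.g. an LSO local volume).
type PersistentVolume struct {
	Name string
	CSI  *CSIPersistentVolumeSource
}

// fencingCandidates returns only CSI-backed PVs. Dereferencing pv.CSI
// without this nil check is the kind of access that caused the panic.
func fencingCandidates(pvs []PersistentVolume) []PersistentVolume {
	var out []PersistentVolume
	for _, pv := range pvs {
		if pv.CSI == nil {
			fmt.Printf("skipping pv %q: not provisioned by CSI\n", pv.Name)
			continue
		}
		out = append(out, pv)
	}
	return out
}

func main() {
	pvs := []PersistentVolume{
		{Name: "local-pv-1"}, // LSO-style PV, no CSI source
		{Name: "pvc-abc", CSI: &CSIPersistentVolumeSource{Driver: "openshift-storage.rbd.csi.ceph.com"}},
	}
	for _, pv := range fencingCandidates(pvs) {
		fmt.Println("fencing check for:", pv.Name)
	}
}
```

With the guard in place, the non-CSI PV is logged and skipped instead of causing a nil-pointer dereference.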


Version of all relevant components (if applicable):
4.14

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

yes

Is there any workaround available to the best of your knowledge?
No.


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?
Yes


Can this issue reproduce from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1.
2.
3.


Actual results:


Expected results:


Additional info:

Comment 4 Subham Rai 2023-08-01 15:06:27 UTC
Already in builds.

Comment 7 Yuli Persky 2023-10-17 14:22:01 UTC
How can I simulate a situation where the operator panics while fencing a node whose PV is not provisioned by CSI?
Can you please provide reproduction instructions?
Was there any automated test that ran and caused this?

Comment 8 Subham Rai 2023-10-18 13:40:53 UTC
You can use LSO to create the PV (as mentioned in the upstream link) and use that PV to bind the application pod, then follow similar steps afterward.
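As a sketch of that setup, the manifests below show a non-CSI local PV (the kind LSO creates) and a claim that binds it; all names, the storage class, the disk path, and the node name are hypothetical illustrations, not values from this bug.

```yaml
# Illustrative only: a local (non-CSI) PV pinned to one node.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-pv-example
spec:
  capacity:
    storage: 10Gi
  accessModes: ["ReadWriteOnce"]
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-sc
  local:
    path: /mnt/local-storage/disk1
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values: ["compute-0"]
---
# A claim that binds the local PV; mount it in one deployment pod, and put
# a second pod with an RBD RWO PVC on the same node to set up the scenario.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: local-claim
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: local-sc
  resources:
    requests:
      storage: 10Gi
```

With both pods scheduled on the same node, shutting the node down and applying the out-of-service taint triggers the fencing path described above.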

> Was there any automated test that ran and caused this?
No, it was detected by an upstream user: https://github.com/rook/rook/issues/12558

Comment 9 Joy John Pinto 2023-11-06 16:54:52 UTC
Verified with OCP 4.14.0-0.nightly-2023-11-05-194730 and ODF 4.14.0-161

Created a non-CSI deployment pod and a CSI deployment pod on the same node (compute-0) and shut down node compute-0.

Added taint to compute-0 (oc adm taint nodes compute-0 node.kubernetes.io/out-of-service=nodeshutdown:NoExecute)


All pods in the openshift-storage namespace came back online after a 2-3 minute delay.


[jopinto@jopinto new]$ oc get pods -n openshift-storage
NAME                                                              READY   STATUS      RESTARTS       AGE
csi-addons-controller-manager-6749c89487-bww85                    2/2     Running     1              5m54s
csi-cephfsplugin-hfsqn                                            2/2     Running     0              9h
csi-cephfsplugin-hw4cb                                            2/2     Running     0              9h
csi-cephfsplugin-provisioner-54c89b944d-7svgs                     5/5     Running     0              9h
csi-cephfsplugin-provisioner-54c89b944d-mv9dt                     5/5     Running     0              5m54s
csi-rbdplugin-provisioner-669449fdcb-7zff2                        6/6     Running     0              9h
csi-rbdplugin-provisioner-669449fdcb-m55s2                        6/6     Running     0              9h
csi-rbdplugin-v4fxr                                               3/3     Running     0              9h
csi-rbdplugin-vzkqg                                               3/3     Running     0              9h
noobaa-core-0                                                     1/1     Running     0              5m50s
noobaa-db-pg-0                                                    1/1     Running     0              5m50s
noobaa-endpoint-b69796f8-njl74                                    1/1     Running     0              5m54s
noobaa-operator-686c6444d9-9hg9l                                  2/2     Running     1              5m56s
ocs-metrics-exporter-65c7d9bbbb-529f5                             1/1     Running     0              9h
ocs-operator-5d87659678-g7lkv                                     1/1     Running     3 (2m4s ago)   5m54s
odf-console-674bbff5d9-jw6d7                                      1/1     Running     0              9h
odf-operator-controller-manager-7bf98567cb-gnt8j                  2/2     Running     2 (155m ago)   9h
rook-ceph-crashcollector-compute-1-5c5bf77958-2pjdr               1/1     Running     0              9h
rook-ceph-crashcollector-compute-2-7774c577bf-789lf               1/1     Running     0              9h
rook-ceph-exporter-compute-1-55f5d44457-nh58q                     1/1     Running     0              9h
rook-ceph-exporter-compute-2-6c5c857c9d-nr47l                     1/1     Running     0              9h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-6968bc788lzbz   2/2     Running     2 (61s ago)    9h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-5d94d89dwkpc7   2/2     Running     2 (28s ago)    9h
rook-ceph-mgr-a-79666b4f-7gz5r                                    2/2     Running     0              9h
rook-ceph-mon-a-7bdff6fbf8-f4sdn                                  0/2     Pending     0              5m54s
rook-ceph-mon-b-74d6676bf4-tdjdz                                  2/2     Running     0              9h
rook-ceph-mon-c-6bbcb64766-l22fz                                  2/2     Running     0              9h
rook-ceph-operator-595c4f8ddf-b6swb                               1/1     Running     0              5m54s
rook-ceph-osd-0-b9779fffb-wg85l                                   2/2     Running     0              9h
rook-ceph-osd-1-864567b969-5h5fh                                  2/2     Running     0              9h
rook-ceph-osd-2-7bf4bf998f-4llvc                                  0/2     Pending     0              5m56s
rook-ceph-osd-prepare-593ac2d7fb3c46046eaebe7605f35856-f976t      0/1     Completed   0              9h
rook-ceph-osd-prepare-ec1c36da76df2d2d8ecb6228fbd1c000-6bzj5      0/1     Completed   0              9h
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-6c89f5899v6m   2/2     Running     0              9h
rook-ceph-tools-5bbc55fdf-g878r                                   1/1     Running     0              9h

[jopinto@jopinto new]$ oc get pods -n test2
NAME                           READY   STATUS    RESTARTS   AGE
simple-app-7649fdb746-nrtrd    1/1     Running   0          9m39s
simple-app1-779d9ddf59-8fgh7   1/1     Running   0          9m39s

Comment 11 errata-xmlrpc 2023-11-08 18:52:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.14.0 security, enhancement & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:6832

