Bug 2280202

Summary: [4.14.z clone] Rook deletes ceph csi deployments and daemonsets even if it isn't the owner
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Mudit Agarwal <muagarwa>
Component: rook
Assignee: Parth Arora <paarora>
Status: CLOSED ERRATA
QA Contact: Jilju Joy <jijoy>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 4.15
CC: lgangava, muagarwa, nberry, odf-bz-bot, paarora, resoni, sheggodu, tnielsen
Target Milestone: ---
Keywords: AutomationTriaged
Target Release: ODF 4.14.8
Hardware: Unspecified
OS: Unspecified
Whiteboard: isf-provider
Fixed In Version: 4.14.8-3
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 2279992
Environment:
Last Closed: 2024-06-12 07:38:51 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
Bug Depends On: 2279992    
Bug Blocks:    

Description Mudit Agarwal 2024-05-13 06:55:25 UTC
+++ This bug was initially created as a clone of Bug #2279992 +++

Please refer to https://github.com/rook/rook/issues/13942 for more details.
Fixed by https://github.com/rook/rook/pull/13966 (implemented in 4.16)

A backport is needed for 4.15.

Since we aren't aiming to backport all of the content surrounding this fix, the testing is as below:

1. This BZ affects only provider-client deployments, specifically where the client is running in the same cluster as the provider; for everything else, regression testing alone is enough.
2. For an affected deployment with this fix (a command sketch follows the steps below):

a. restart the ocs-client-operator-controller-manager-* pod in the client operator namespace
b. wait for/verify the presence of "openshift-storage"-prefixed resources in "oc get csidriver"
c. add "ROOK_CSI_DISABLE_DRIVER: true" to the "rook-ceph-operator-config" ConfigMap in the "openshift-storage" namespace
d. restart the "rook-ceph-operator" pod; "oc get csidriver" should still list the "openshift-storage"-prefixed resources

--- Additional comment from RHEL Program Management on 2024-05-10 12:11:19 UTC ---

This bug previously had no release flag set; the release flag 'odf-4.16.0' has now been set to '?', so the bug is proposed to be fixed in the ODF 4.16.0 release. Note that the 3 Acks (pm_ack, devel_ack, qa_ack), if any were set while the release flag was missing, have now been reset, since Acks are to be set against a release flag.

--- Additional comment from RHEL Program Management on 2024-05-10 12:11:19 UTC ---

The 'Target Release' is not to be set manually at the Red Hat OpenShift Data Foundation product.

The 'Target Release' will be set automatically after the 3 Acks (pm, devel, qa) are set to "+" for a specific release flag and that release flag is automatically set to "+".

--- Additional comment from Leela Venkaiah Gangavarapu on 2024-05-10 13:16:21 UTC ---

>>> The below was performed on a 4.14 cluster with 4.14 ODF and should be similar for 4.15 OCP/ODF

$ type ko kc
ko is aliased to `kubectl -nopenshift-storage'
kc is aliased to `kubectl -nopenshift-storage-client'

$ k get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.14   True        False         23h     Cluster version is 4.14.14

$ ko get csv
NAME                              DISPLAY                              VERSION   REPLACES   PHASE
mcg-operator.v4.14.6              NooBaa Operator                      4.14.6               Succeeded
ocs-operator.v4.14.6              Container Storage                    4.14.6               Succeeded
odf-csi-addons-operator.v4.14.6   CSI Addons                           4.14.6               Succeeded
odf-operator.v4.14.6              IBM Storage Fusion Data Foundation   4.14.6               Succeeded

$ kc get csv
NAME                              DISPLAY                            VERSION   REPLACES   PHASE
ocs-client-operator.v4.14.6       OpenShift Data Foundation Client   4.14.6               Succeeded
odf-csi-addons-operator.v4.14.6   CSI Addons                         4.14.6               Succeeded

$ ko get cm rook-ceph-operator-config -oyaml | yq -r .data
CSI_PLUGIN_TOLERATIONS: |
  - effect: NoSchedule
    key: node.ocs.openshift.io/storage
    value: "true"
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
CSI_PROVISIONER_NODE_AFFINITY: node-role.kubernetes.io/master=
CSI_PROVISIONER_TOLERATIONS: |
  - effect: NoSchedule
    key: node.ocs.openshift.io/storage
    value: "true"
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
ROOK_CSI_ENABLE_CEPHFS: "false"
ROOK_CSI_ENABLE_RBD: "false"

Issue:
------

>>> openshift-storage* resources may already have been deleted by rook if it has restarted or reconciled resources
$ k get csidriver
NAME                                    ATTACHREQUIRED   PODINFOONMOUNT   STORAGECAPACITY   TOKENREQUESTS   REQUIRESREPUBLISH   MODES        AGE
openshift-storage.cephfs.csi.ceph.com   true             false            false             <unset>         false               Persistent   100s
openshift-storage.rbd.csi.ceph.com      true             false            false             <unset>         false               Persistent   100s
topolvm.io                              false            true             true              <unset>         false               Persistent   19h

$ ko rollout restart deploy/rook-ceph-operator
deployment.apps/rook-ceph-operator restarted

>>> openshift-storage* resources are deleted by rook
$ k get csidriver
NAME         ATTACHREQUIRED   PODINFOONMOUNT   STORAGECAPACITY   TOKENREQUESTS   REQUIRESREPUBLISH   MODES        AGE
topolvm.io   false            true             true              <unset>         false               Persistent   19h

With fix:
---------

>>> the new field (ROOK_CSI_DISABLE_DRIVER: "true") is added manually
$ ko get cm rook-ceph-operator-config -oyaml | yq -r .data
CSI_PLUGIN_TOLERATIONS: |
  - effect: NoSchedule
    key: node.ocs.openshift.io/storage
    value: "true"
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
CSI_PROVISIONER_NODE_AFFINITY: node-role.kubernetes.io/master=
CSI_PROVISIONER_TOLERATIONS: |
  - effect: NoSchedule
    key: node.ocs.openshift.io/storage
    value: "true"
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
ROOK_CSI_DISABLE_DRIVER: "true"
ROOK_CSI_ENABLE_CEPHFS: "false"
ROOK_CSI_ENABLE_RBD: "false"

$ kc rollout restart deploy/ocs-client-operator-controller-manager
deployment.apps/ocs-client-operator-controller-manager restarted

$ date; kc delete $(kc get po -oname | grep ocs-client-operator-controller-manager)
Friday 10 May 2024 06:15:40 PM IST
pod "ocs-client-operator-controller-manager-69bff4c7b6-vlg6l" deleted

$ date; k get csidriver 
Friday 10 May 2024 06:16:08 PM IST
NAME                                    ATTACHREQUIRED   PODINFOONMOUNT   STORAGECAPACITY   TOKENREQUESTS   REQUIRESREPUBLISH   MODES        AGE
openshift-storage.cephfs.csi.ceph.com   true             false            false             <unset>         false               Persistent   7s
openshift-storage.rbd.csi.ceph.com      true             false            false             <unset>         false               Persistent   7s
topolvm.io                              false            true             true              <unset>         false               Persistent   20h

$ date; ko rollout restart deploy/rook-ceph-operator
Friday 10 May 2024 06:16:39 PM IST
deployment.apps/rook-ceph-operator restarted

$ date; ko get po -lapp=rook-ceph-operator
Friday 10 May 2024 06:18:10 PM IST
NAME                                  READY   STATUS    RESTARTS   AGE
rook-ceph-operator-668c5cfb4f-jkzdg   1/1     Running   0          89s

>>> openshift-storage* resources are not touched by rook. Workloads already using CephFS volumes (RWX+fsGroup) should be scaled down completely and back up after the fix (a scale-down/up sketch follows the listing below); testing that specifically is at QE's discretion.
$ date; k get csidriver 
Friday 10 May 2024 06:18:23 PM IST
NAME                                    ATTACHREQUIRED   PODINFOONMOUNT   STORAGECAPACITY   TOKENREQUESTS   REQUIRESREPUBLISH   MODES        AGE
openshift-storage.cephfs.csi.ceph.com   true             false            false             <unset>         false               Persistent   2m22s
openshift-storage.rbd.csi.ceph.com      true             false            false             <unset>         false               Persistent   2m22s
topolvm.io                              false            true             true              <unset>         false               Persistent   20h
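
A possible scale-down/up sequence for an existing CephFS consumer, as noted above; the namespace, deployment name, label, and replica count are placeholders:

# scale a workload that mounts a CephFS RWX volume fully down, wait for
# its pods to terminate, then scale it back up so it remounts through
# the restored driver
kubectl -n my-app-ns scale deploy/my-cephfs-app --replicas=0
kubectl -n my-app-ns wait --for=delete pod -l app=my-cephfs-app --timeout=120s
kubectl -n my-app-ns scale deploy/my-cephfs-app --replicas=2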


-----

>>> A bare-minimum smoke test (a manifest sketch follows this listing)
$ k get -ndefault pod,pvc -ltype
NAME                  READY   STATUS    RESTARTS   AGE
pod/csirbd-demo-pod   1/1     Running   0          40s

NAME                            STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  AGE
persistentvolumeclaim/rbd-pvc   Bound    pvc-ca1ee8bd-df2d-498d-bc5f-1649fb018013   1Gi        RWO            ocs-storagecluster-ceph-rbd   42s

$ k exec -ndefault pod/csirbd-demo-pod -- /bin/sh -c 'echo hello world > /var/lib/www/html/index.html'

$ k exec -ndefault pod/csirbd-demo-pod -- /bin/sh -c 'cat /var/lib/www/html/index.html'
hello world
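
For reference, a manifest roughly matching the smoke-test objects above; the exact YAML is not attached to this bug, so the container image and the "type" label value are assumptions (only the label key is visible in the "-ltype" selector):

kubectl apply -n default -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rbd-pvc
  labels:
    type: smoke-test          # label value assumed
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: ocs-storagecluster-ceph-rbd
---
apiVersion: v1
kind: Pod
metadata:
  name: csirbd-demo-pod
  labels:
    type: smoke-test          # label value assumed
spec:
  containers:
    - name: web-server
      image: nginx            # image assumed; any container that writes to the mount path works
      volumeMounts:
        - name: mypvc
          mountPath: /var/lib/www/html
  volumes:
    - name: mypvc
      persistentVolumeClaim:
        claimName: rbd-pvc
EOF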


Note:
1. Tested w/ quay.io/paarora/rook-ceph-operator:v209 provided by Parth, thanks!
2. Issue initially reported at https://github.ibm.com/ProjectAbell/abell-tracking/issues/36413

thanks.

--- Additional comment from RHEL Program Management on 2024-05-13 05:58:09 UTC ---

Since this bug has severity set to 'urgent', it is being proposed as a blocker for the currently set release flag. Please resolve ASAP.

--- Additional comment from Mudit Agarwal on 2024-05-13 05:59:18 UTC ---

Parth, when can we have the PR ready?

--- Additional comment from Parth Arora on 2024-05-13 06:06:05 UTC ---

Will raise a backport PR today; we can merge it by EOD as well.

Comment 11 errata-xmlrpc 2024-06-12 07:38:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Data Foundation 4.14.8 Bug Fix Update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2024:3861