Bug 2280202 - [4.14.z clone] Rook deletes ceph csi deployments and daemonsets even if it isn't the owner
Summary: [4.14.z clone] Rook deletes ceph csi deployments and daemonsets even if it isn't the owner
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.15
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ODF 4.14.8
Assignee: Parth Arora
QA Contact: Jilju Joy
URL:
Whiteboard: isf-provider
Depends On: 2279992
Blocks:
 
Reported: 2024-05-13 06:55 UTC by Mudit Agarwal
Modified: 2024-08-20 05:19 UTC
CC List: 8 users

Fixed In Version: 4.14.8-3
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 2279992
Environment:
Last Closed: 2024-06-12 07:38:51 UTC
Embargoed:




Links
Github red-hat-storage rook pull 659 (open): Bug 2280202: csi: add a new flag to disable csi driver (last updated 2024-05-27 14:39:54 UTC)
Red Hat Product Errata RHBA-2024:3861 (last updated 2024-06-12 07:38:54 UTC)

Description Mudit Agarwal 2024-05-13 06:55:25 UTC
+++ This bug was initially created as a clone of Bug #2279992 +++

Please refer to https://github.com/rook/rook/issues/13942 for more details.
Fixed by https://github.com/rook/rook/pull/13966 (implemented in 4.16).

A backport is needed for 4.15.

Since we aren't aiming to backport all of the content surrounding this fix, the testing is as below:

1. This BZ affects only provider-client deployments, specifically those where the client runs in the same cluster as the provider; for everything else, regression testing alone is enough.
2. For an affected deployment with this fix (a command sketch follows this list):

a. restart the ocs-client-operator-controller-manager-* pod in the client operator namespace
b. wait for and verify the presence of "openshift-storage"-prefixed resources in "oc get csidriver"
c. add "ROOK_CSI_DISABLE_DRIVER: true" to the "rook-ceph-operator-config" cm in the "openshift-storage" ns
d. restart the "rook-ceph-operator" pod; "oc get csidriver" should still list the "openshift-storage"-prefixed resources
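
A minimal command sketch of steps a-d, assuming the client operator runs in the openshift-storage-client namespace and that the deployment and ConfigMap names match the transcript further below; the patch invocation is just one way to add the flag:

$ oc -n openshift-storage-client rollout restart deploy/ocs-client-operator-controller-manager   # step a: restart the client operator (equivalent to deleting its pod)
$ oc get csidriver | grep '^openshift-storage'                                                    # step b: verify the openshift-storage prefixed drivers exist
$ oc -n openshift-storage patch cm rook-ceph-operator-config --type merge -p '{"data":{"ROOK_CSI_DISABLE_DRIVER":"true"}}'   # step c: add the flag as the string "true"
$ oc -n openshift-storage rollout restart deploy/rook-ceph-operator                               # step d: restart rook
$ oc get csidriver | grep '^openshift-storage'                                                    # step d: drivers should still be listed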

--- Additional comment from RHEL Program Management on 2024-05-10 12:11:19 UTC ---

This bug, which previously had no release flag set, now has the release flag 'odf-4.16.0' set to '?', and so is being proposed to be fixed in the ODF 4.16.0 release. Note that the 3 Acks (pm_ack, devel_ack, qa_ack), if any were set while the release flag was missing, have now been reset, since the Acks are to be set against a release flag.

--- Additional comment from RHEL Program Management on 2024-05-10 12:11:19 UTC ---

The 'Target Release' is not to be set manually for the Red Hat OpenShift Data Foundation product.

The 'Target Release' will be set automatically once the 3 Acks (pm, devel, qa) are set to "+" for a specific release flag and that release flag is auto set to "+".

--- Additional comment from Leela Venkaiah Gangavarapu on 2024-05-10 13:16:21 UTC ---

>>> Below is performed on a 4.14 cluster with 4.14 ODF and should be similar for 4.15 OCP/ODF

$ type ko kc
ko is aliased to `kubectl -nopenshift-storage'
kc is aliased to `kubectl -nopenshift-storage-client'

$ k get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.14   True        False         23h     Cluster version is 4.14.14

$ ko get csv
NAME                              DISPLAY                              VERSION   REPLACES   PHASE
mcg-operator.v4.14.6              NooBaa Operator                      4.14.6               Succeeded
ocs-operator.v4.14.6              Container Storage                    4.14.6               Succeeded
odf-csi-addons-operator.v4.14.6   CSI Addons                           4.14.6               Succeeded
odf-operator.v4.14.6              IBM Storage Fusion Data Foundation   4.14.6               Succeeded

$ kc get csv
NAME                              DISPLAY                            VERSION   REPLACES   PHASE
ocs-client-operator.v4.14.6       OpenShift Data Foundation Client   4.14.6               Succeeded
odf-csi-addons-operator.v4.14.6   CSI Addons                         4.14.6               Succeeded

$ ko get cm rook-ceph-operator-config -oyaml | yq -r .data
CSI_PLUGIN_TOLERATIONS: |
  - effect: NoSchedule
    key: node.ocs.openshift.io/storage
    value: "true"
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
CSI_PROVISIONER_NODE_AFFINITY: node-role.kubernetes.io/master=
CSI_PROVISIONER_TOLERATIONS: |
  - effect: NoSchedule
    key: node.ocs.openshift.io/storage
    value: "true"
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
ROOK_CSI_ENABLE_CEPHFS: "false"
ROOK_CSI_ENABLE_RBD: "false"

Issue:
------

>>> openshift-storage* resources may already have been deleted by rook if it has restarted or reconciled resources
$ k get csidriver
NAME                                    ATTACHREQUIRED   PODINFOONMOUNT   STORAGECAPACITY   TOKENREQUESTS   REQUIRESREPUBLISH   MODES        AGE
openshift-storage.cephfs.csi.ceph.com   true             false            false             <unset>         false               Persistent   100s
openshift-storage.rbd.csi.ceph.com      true             false            false             <unset>         false               Persistent   100s
topolvm.io                              false            true             true              <unset>         false               Persistent   19h

$ ko rollout restart deploy/rook-ceph-operator
deployment.apps/rook-ceph-operator restarted

>>> openshift-storage* resources are deleted by rook
$ k get csidriver
NAME         ATTACHREQUIRED   PODINFOONMOUNT   STORAGECAPACITY   TOKENREQUESTS   REQUIRESREPUBLISH   MODES        AGE
topolvm.io   false            true             true              <unset>         false               Persistent   19h

With fix:
---------

>>> the new field (ROOK_CSI_DISABLE_DRIVER: "true") is added manually (a patch sketch follows the ConfigMap dump below)
$ ko get cm rook-ceph-operator-config -oyaml | yq -r .data
CSI_PLUGIN_TOLERATIONS: |
  - effect: NoSchedule
    key: node.ocs.openshift.io/storage
    value: "true"
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
CSI_PROVISIONER_NODE_AFFINITY: node-role.kubernetes.io/master=
CSI_PROVISIONER_TOLERATIONS: |
  - effect: NoSchedule
    key: node.ocs.openshift.io/storage
    value: "true"
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
ROOK_CSI_DISABLE_DRIVER: "true"
ROOK_CSI_ENABLE_CEPHFS: "false"
ROOK_CSI_ENABLE_RBD: "false"
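
>>> For reference only, a minimal sketch of one way the field could have been added; the exact edit command is not shown in the transcript, and this reuses the ko alias defined above

$ ko patch cm rook-ceph-operator-config --type merge -p '{"data":{"ROOK_CSI_DISABLE_DRIVER":"true"}}'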

$ kc rollout restart deploy/ocs-client-operator-controller-manager
deployment.apps/ocs-client-operator-controller-manager restarted

$ date; kc delete $(kc get po -oname | grep ocs-client-operator-controller-manager)
Friday 10 May 2024 06:15:40 PM IST
pod "ocs-client-operator-controller-manager-69bff4c7b6-vlg6l" deleted

$ date; k get csidriver 
Friday 10 May 2024 06:16:08 PM IST
NAME                                    ATTACHREQUIRED   PODINFOONMOUNT   STORAGECAPACITY   TOKENREQUESTS   REQUIRESREPUBLISH   MODES        AGE
openshift-storage.cephfs.csi.ceph.com   true             false            false             <unset>         false               Persistent   7s
openshift-storage.rbd.csi.ceph.com      true             false            false             <unset>         false               Persistent   7s
topolvm.io                              false            true             true              <unset>         false               Persistent   20h

$ date; ko rollout restart deploy/rook-ceph-operator
Friday 10 May 2024 06:16:39 PM IST
deployment.apps/rook-ceph-operator restarted

$ date; ko get po -lapp=rook-ceph-operator
Friday 10 May 2024 06:18:10 PM IST
NAME                                  READY   STATUS    RESTARTS   AGE
rook-ceph-operator-668c5cfb4f-jkzdg   1/1     Running   0          89s

>>> openshift-storage* resources are not touched by rook. Workloads already using cephfs volumes (RWX+fsGroup) should be scaled down completely and back up after the fix; however, testing that specifically is at QE discretion (a scale sketch follows the listing below).
$ date; k get csidriver 
Friday 10 May 2024 06:18:23 PM IST
NAME                                    ATTACHREQUIRED   PODINFOONMOUNT   STORAGECAPACITY   TOKENREQUESTS   REQUIRESREPUBLISH   MODES        AGE
openshift-storage.cephfs.csi.ceph.com   true             false            false             <unset>         false               Persistent   2m22s
openshift-storage.rbd.csi.ceph.com      true             false            false             <unset>         false               Persistent   2m22s
topolvm.io                              false            true             true              <unset>         false               Persistent   20h
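
>>> A minimal sketch of scaling such a workload down and back up; the deployment name my-cephfs-app, the namespace app-ns, and the replica count are hypothetical placeholders

$ kubectl -n app-ns scale deploy/my-cephfs-app --replicas=0    # scale down completely
$ kubectl -n app-ns scale deploy/my-cephfs-app --replicas=1    # scale back up after the fix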


-----

>>> A bare minimum smoke test (a manifest sketch follows the transcript)
$ k get -ndefault pod,pvc -ltype
NAME                  READY   STATUS    RESTARTS   AGE
pod/csirbd-demo-pod   1/1     Running   0          40s

NAME                            STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  AGE
persistentvolumeclaim/rbd-pvc   Bound    pvc-ca1ee8bd-df2d-498d-bc5f-1649fb018013   1Gi        RWO            ocs-storagecluster-ceph-rbd   42s

$ k exec -ndefault pod/csirbd-demo-pod -- /bin/sh -c 'echo hello world > /var/lib/www/html/index.html'

$ k exec -ndefault pod/csirbd-demo-pod -- /bin/sh -c 'cat /var/lib/www/html/index.html'
hello world
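
>>> For reference, a minimal sketch of manifests that could back this smoke test. The PVC/pod names, storage class, label key, and mount path match the transcript; the image, label value, and exact field layout are assumptions

$ cat <<'EOF' | kubectl -ndefault apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rbd-pvc
  labels:
    type: smoke-test        # label value is a placeholder; only the key "type" appears in the transcript
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi
  storageClassName: ocs-storagecluster-ceph-rbd
---
apiVersion: v1
kind: Pod
metadata:
  name: csirbd-demo-pod
  labels:
    type: smoke-test
spec:
  containers:
    - name: web-server
      image: nginx            # assumed image; any image that can write to the mount path works
      volumeMounts:
        - name: mypvc
          mountPath: /var/lib/www/html
  volumes:
    - name: mypvc
      persistentVolumeClaim:
        claimName: rbd-pvc
EOF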


Note:
1. Tested w/ quay.io/paarora/rook-ceph-operator:v209 provided by Parth, thanks!
2. Issue initially reported at https://github.ibm.com/ProjectAbell/abell-tracking/issues/36413

thanks.

--- Additional comment from RHEL Program Management on 2024-05-13 05:58:09 UTC ---

Since this bug has severity set to 'urgent', it is being proposed as a blocker for the currently set release flag. Please resolve ASAP.

--- Additional comment from Mudit Agarwal on 2024-05-13 05:59:18 UTC ---

Parth, when can we have the PR ready?

--- Additional comment from Parth Arora on 2024-05-13 06:06:05 UTC ---

Will raise a backport PR today, and we can merge it by EOD as well.

Comment 11 errata-xmlrpc 2024-06-12 07:38:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Data Foundation 4.14.8 Bug Fix Update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2024:3861

