+++ This bug was initially created as a clone of Bug #2279992 +++

Please refer to https://github.com/rook/rook/issues/13942 for more details.
Fixed by https://github.com/rook/rook/pull/13966 (implemented in 4.16); a backport is needed for 4.15.

Since we aren't aiming to backport all of the content surrounding this fix, the testing is as below:

1. This BZ affects only provider-client deployments, specifically where the client also runs in the same cluster as the provider; for everything else, regression testing alone is enough.
2. For an affected deployment with this fix (a command sketch of these steps appears further down in this report):
   a. restart the ocs-client-operator-controller-manager-* pod in the client operator namespace
   b. wait for/verify the presence of "openshift-storage"-prefixed resources in "oc get csidriver"
   c. add "ROOK_CSI_DISABLE_DRIVER: true" to the "rook-ceph-operator-config" ConfigMap in the "openshift-storage" namespace
   d. restart the "rook-ceph-operator" pod; "oc get csidriver" should still list the "openshift-storage"-prefixed resources

--- Additional comment from RHEL Program Management on 2024-05-10 12:11:19 UTC ---

This bug, having no release flag set previously, is now set with the release flag 'odf-4.16.0' to '?', and so is being proposed to be fixed in the ODF 4.16.0 release. Note that the 3 Acks (pm_ack, devel_ack, qa_ack), if any were previously set while the release flag was missing, have now been reset, since the Acks are to be set against a release flag.

--- Additional comment from RHEL Program Management on 2024-05-10 12:11:19 UTC ---

The 'Target Release' is not to be set manually for the Red Hat OpenShift Data Foundation product. The 'Target Release' will be auto-set appropriately after the 3 Acks (pm, devel, qa) are set to "+" for a specific release flag and that release flag gets auto-set to "+".

--- Additional comment from Leela Venkaiah Gangavarapu on 2024-05-10 13:16:21 UTC ---

>>> The below was performed on a 4.14 cluster w/ 4.14 ODF and should be similar for 4.15 OCP/ODF

$ type ko kc
ko is aliased to `kubectl -nopenshift-storage'
kc is aliased to `kubectl -nopenshift-storage-client'

$ k get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.14   True        False         23h     Cluster version is 4.14.14

$ ko get csv
NAME                              DISPLAY                              VERSION   REPLACES   PHASE
mcg-operator.v4.14.6              NooBaa Operator                      4.14.6               Succeeded
ocs-operator.v4.14.6              Container Storage                    4.14.6               Succeeded
odf-csi-addons-operator.v4.14.6   CSI Addons                           4.14.6               Succeeded
odf-operator.v4.14.6              IBM Storage Fusion Data Foundation   4.14.6               Succeeded

$ kc get csv
NAME                              DISPLAY                            VERSION   REPLACES   PHASE
ocs-client-operator.v4.14.6       OpenShift Data Foundation Client   4.14.6               Succeeded
odf-csi-addons-operator.v4.14.6   CSI Addons                         4.14.6               Succeeded

$ ko get cm rook-ceph-operator-config -oyaml | yq -r .data
CSI_PLUGIN_TOLERATIONS: |
  - effect: NoSchedule
    key: node.ocs.openshift.io/storage
    value: "true"
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
CSI_PROVISIONER_NODE_AFFINITY: node-role.kubernetes.io/master=
CSI_PROVISIONER_TOLERATIONS: |
  - effect: NoSchedule
    key: node.ocs.openshift.io/storage
    value: "true"
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
ROOK_CSI_ENABLE_CEPHFS: "false"
ROOK_CSI_ENABLE_RBD: "false"

Issue:
------
>>> openshift-storage* resources may have been deleted by rook if it has restarted or already reconciled resources

$ k get csidriver
NAME                                    ATTACHREQUIRED   PODINFOONMOUNT   STORAGECAPACITY   TOKENREQUESTS   REQUIRESREPUBLISH   MODES        AGE
openshift-storage.cephfs.csi.ceph.com   true             false            false             <unset>         false               Persistent   100s
openshift-storage.rbd.csi.ceph.com      true             false            false             <unset>         false               Persistent   100s
topolvm.io                              false            true             true              <unset>         false               Persistent   19h

$ ko rollout restart deploy/rook-ceph-operator
deployment.apps/rook-ceph-operator restarted

>>> openshift-storage* resources are deleted by rook

$ k get csidriver
NAME         ATTACHREQUIRED   PODINFOONMOUNT   STORAGECAPACITY   TOKENREQUESTS   REQUIRESREPUBLISH   MODES        AGE
topolvm.io   false            true             true              <unset>         false               Persistent   19h

With fix:
---------
>>> new field (ROOK_CSI_DISABLE_DRIVER: "true") is added manually

$ ko get cm rook-ceph-operator-config -oyaml | yq -r .data
CSI_PLUGIN_TOLERATIONS: |
  - effect: NoSchedule
    key: node.ocs.openshift.io/storage
    value: "true"
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
CSI_PROVISIONER_NODE_AFFINITY: node-role.kubernetes.io/master=
CSI_PROVISIONER_TOLERATIONS: |
  - effect: NoSchedule
    key: node.ocs.openshift.io/storage
    value: "true"
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
ROOK_CSI_DISABLE_DRIVER: "true"
ROOK_CSI_ENABLE_CEPHFS: "false"
ROOK_CSI_ENABLE_RBD: "false"

$ kc rollout restart deploy/ocs-client-operator-controller-manager
deployment.apps/ocs-client-operator-controller-manager restarted

$ date; kc delete $(kc get po -oname | grep ocs-client-operator-controller-manager)
Friday 10 May 2024 06:15:40 PM IST
pod "ocs-client-operator-controller-manager-69bff4c7b6-vlg6l" deleted

$ date; k get csidriver
Friday 10 May 2024 06:16:08 PM IST
NAME                                    ATTACHREQUIRED   PODINFOONMOUNT   STORAGECAPACITY   TOKENREQUESTS   REQUIRESREPUBLISH   MODES        AGE
openshift-storage.cephfs.csi.ceph.com   true             false            false             <unset>         false               Persistent   7s
openshift-storage.rbd.csi.ceph.com      true             false            false             <unset>         false               Persistent   7s
topolvm.io                              false            true             true              <unset>         false               Persistent   20h

$ date; ko rollout restart deploy/rook-ceph-operator
Friday 10 May 2024 06:16:39 PM IST
deployment.apps/rook-ceph-operator restarted

$ date; ko get po -lapp=rook-ceph-operator
Friday 10 May 2024 06:18:10 PM IST
NAME                                  READY   STATUS    RESTARTS   AGE
rook-ceph-operator-668c5cfb4f-jkzdg   1/1     Running   0          89s

>>> openshift-storage* resources are not touched by rook. Workloads already using cephfs volumes (RWX+fsGroup) should be scaled down completely and back up after the fix; however, testing that specifically is at QE's discretion.

$ date; k get csidriver
Friday 10 May 2024 06:18:23 PM IST
NAME                                    ATTACHREQUIRED   PODINFOONMOUNT   STORAGECAPACITY   TOKENREQUESTS   REQUIRESREPUBLISH   MODES        AGE
openshift-storage.cephfs.csi.ceph.com   true             false            false             <unset>         false               Persistent   2m22s
openshift-storage.rbd.csi.ceph.com      true             false            false             <unset>         false               Persistent   2m22s
topolvm.io                              false            true             true              <unset>         false               Persistent   20h

-----
>>> A bare minimum smoke test (manifests for this are sketched at the end of this report)

$ k get -ndefault pod,pvc -ltype
NAME                  READY   STATUS    RESTARTS   AGE
pod/csirbd-demo-pod   1/1     Running   0          40s

NAME                            STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  AGE
persistentvolumeclaim/rbd-pvc   Bound    pvc-ca1ee8bd-df2d-498d-bc5f-1649fb018013   1Gi        RWO            ocs-storagecluster-ceph-rbd   42s

$ k exec -ndefault pod/csirbd-demo-pod -- /bin/sh -c 'echo hello world > /var/lib/www/html/index.html'

$ k exec -ndefault pod/csirbd-demo-pod -- /bin/sh -c 'cat /var/lib/www/html/index.html'
hello world

Note:
1. Tested w/ quay.io/paarora/rook-ceph-operator:v209 provided by Parth, thanks!
2. Issue initially reported at https://github.ibm.com/ProjectAbell/abell-tracking/issues/36413, thanks.

--- Additional comment from RHEL Program Management on 2024-05-13 05:58:09 UTC ---

Since this bug has severity set to 'urgent', it is being proposed as a blocker for the currently set release flag. Please resolve ASAP.

--- Additional comment from Mudit Agarwal on 2024-05-13 05:59:18 UTC ---

Parth, when can we have the PR ready?
--- Additional comment from Parth Arora on 2024-05-13 06:06:05 UTC ---

Will raise a backport PR today, and we can merge it by EOD as well.
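For QE convenience, here is a minimal command sketch of verification steps 2a-2d from the description. It assumes the client operator runs in the openshift-storage-client namespace (as in the verification transcript above) and uses 'oc patch' as just one way of adding the key that was edited into the ConfigMap manually there:

# 2a. restart the client operator so it (re)creates the CSI drivers
oc -n openshift-storage-client rollout restart deploy/ocs-client-operator-controller-manager

# 2b. wait for the openshift-storage prefixed CSIDriver objects to appear
oc get csidriver | grep '^openshift-storage'

# 2c. tell rook not to manage/remove the CSI drivers
oc -n openshift-storage patch cm rook-ceph-operator-config --type merge \
  -p '{"data":{"ROOK_CSI_DISABLE_DRIVER":"true"}}'

# 2d. restart rook and confirm the openshift-storage prefixed CSIDriver objects survive
oc -n openshift-storage rollout restart deploy/rook-ceph-operator
oc get csidriver | grep '^openshift-storage'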
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Data Foundation 4.14.8 Bug Fix Update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2024:3861
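For reference, a minimal sketch of manifests that reproduce the bare-minimum smoke test from the verification comment above. The names, namespace, size, access mode, storage class, label key and mount path are taken from that output; the container image and the 'type' label value are assumptions, and any image that can write to the mount path would do:

oc apply -n default -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rbd-pvc
  labels:
    type: smoke-test            # label key matches the '-ltype' selector used above; value is arbitrary
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: ocs-storagecluster-ceph-rbd
---
apiVersion: v1
kind: Pod
metadata:
  name: csirbd-demo-pod
  labels:
    type: smoke-test
spec:
  containers:
    - name: web-server
      image: nginx              # assumption: any image able to write under the mount path works
      volumeMounts:
        - name: mypvc
          mountPath: /var/lib/www/html
  volumes:
    - name: mypvc
      persistentVolumeClaim:
        claimName: rbd-pvc
EOF

# write and read back through the RBD-backed volume
oc exec -n default pod/csirbd-demo-pod -- /bin/sh -c 'echo hello world > /var/lib/www/html/index.html'
oc exec -n default pod/csirbd-demo-pod -- /bin/sh -c 'cat /var/lib/www/html/index.html'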