Tested in version:
------------------
ODF - quay.io/rhceph-dev/ocs-registry:4.12.0-87
OCP - 4.12.0-0.nightly-2022-10-25-210451

Test steps:
-----------
1. Added a taint to all compute (worker) nodes:
   oc adm taint nodes compute-0 compute-1 compute-2 xyz=true:NoSchedule

2. Deleted all the pods under openshift-storage; all pods went to Pending state:
   $ oc get pods -n openshift-storage
   NAME                                               READY   STATUS    RESTARTS   AGE
   csi-addons-controller-manager-79dc65bd67-2h9cp     0/2     Pending   0          61s
   noobaa-core-0                                      0/1     Pending   0          30s
   noobaa-db-pg-0                                     0/1     Pending   0          40s
   noobaa-endpoint-748f784f4c-xkqdz                   0/1     Pending   0          48s
   noobaa-operator-6d5495b584-h2msg                   0/1     Pending   0          46s
   ocs-metrics-exporter-765fc45d7c-nnmnt              0/1     Pending   0          42s
   ocs-operator-765c6d89-z9g5w                        0/1     Pending   0          25s
   odf-console-7777b966dc-9g7w2                       0/1     Pending   0          23s
   odf-operator-controller-manager-56bf57bcfc-tx74r   0/2     Pending   0          21s
   rook-ceph-operator-649b6764d8-mgftx                0/1     Pending   0          19s

3. Added the below to the storagecluster CR spec:
   placement:
     noobaa-standalone:
       tolerations:
       - effect: NoSchedule
         key: xyz
         operator: Equal
         value: "true"
       - effect: NoSchedule
         key: node.ocs.openshift.io/storage
         operator: Equal
         value: "true"

4. Added the below to all subscriptions under openshift-storage:
   tolerations:
   - effect: NoSchedule
     key: xyz
     operator: Equal
     value: "true"

Note:
1. A few pods stayed in Pending state; no tolerations were added to the 4 pods mentioned (noobaa-core, noobaa-db-pg, noobaa-endpoint, noobaa-default-backing-store) until the pods were force-respun:
   $ oc get pods -n openshift-storage
   NAME                                               READY   STATUS    RESTARTS   AGE
   csi-addons-controller-manager-6c97cbfb4-chwnq      2/2     Running   0          49m
   noobaa-core-0                                      0/1     Pending   0          7m49s
   noobaa-db-pg-0                                     0/1     Pending   0          7m49s
   noobaa-endpoint-69ff749cc-55sk5                    1/1     Running   0          7m22s
   noobaa-endpoint-748f784f4c-4z645                   0/1     Pending   0          8m20s
   noobaa-operator-54f6557997-vqh5r                   1/1     Running   0          49m
   ocs-metrics-exporter-86bff75b98-zpt8b              1/1     Running   0          48m
   ocs-operator-7fbc857858-hshr9                      1/1     Running   0          48m
   odf-console-9d9d748dd-795kz                        1/1     Running   0          49m
   odf-operator-controller-manager-57d949f6db-pssc5   2/2     Running   0          49m
   rook-ceph-operator-7d5df87845-g6hzt                1/1     Running   0          48m

The tolerations were noted on the noobaa CR, but they do not seem to propagate to the pods. Tried this multiple times with multiple combinations on the storagecluster CR; it only works when the pods are respun forcefully.

Have a live cluster to debug.

Raising a need-info on @usrivast to confirm the behaviour and update the test steps.
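For the debugging session on the live cluster, one way to compare what landed on the noobaa CR against what the stuck pods actually carry is shown below. This is a minimal sketch: it assumes the NooBaa CR in openshift-storage is named "noobaa" and exposes its tolerations under spec.tolerations; the pod names are the ones from the output above and will differ on a fresh run.

   # Tolerations recorded on the noobaa CR (as propagated from the storagecluster placement)
   $ oc get noobaa noobaa -n openshift-storage -o jsonpath='{.spec.tolerations}'

   # Tolerations actually present on the pods that stayed Pending
   $ oc get pod noobaa-core-0 -n openshift-storage -o jsonpath='{.spec.tolerations}'
   $ oc get pod noobaa-db-pg-0 -n openshift-storage -o jsonpath='{.spec.tolerations}'

If the CR shows the xyz toleration but the pod specs do not, that would match the observation that the setting does not reach the pods without a forced respin.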
The above test was performed on an MCG standalone cluster on vSphere.
Tested in version:
------------------
ODF - quay.io/rhceph-dev/ocs-registry:4.12.0-87
OCP - 4.12.0-0.nightly-2022-10-25-210451

Test steps:
-----------
1) Deploy MCG standalone
2) Taint the nodes
3) Respin all the pods under openshift-storage and let them go to Pending state
4) Apply tolerations in the storage CR and the 4 subscriptions
5) Observe whether the pods respin automatically and come to Running state
6) For the pods that come to Running state, check the pod YAML to ensure it has the tolerations

Observation:
------------
** Content added to the storagecluster spec:
   placement:
     noobaa-standalone:
       tolerations:
       - effect: NoSchedule
         key: xyz
         operator: Equal
         value: "true"
       - effect: NoSchedule
         key: node.ocs.openshift.io/storage
         operator: Equal
         value: "true"

*** Content added to all 4 subscriptions' config:
   tolerations:
   - effect: NoSchedule
     key: xyz
     operator: Equal
     value: "true"

*** After applying the above tolerations to all subscriptions and the storage CR, the output was:
   $ oc get pods
   NAME                                               READY   STATUS    RESTARTS   AGE
   csi-addons-controller-manager-5d58ffdcdc-dwsht     2/2     Running   0          119s
   noobaa-core-0                                      0/1     Pending   0          11m
   noobaa-db-pg-0                                     0/1     Pending   0          11m
   noobaa-endpoint-75dbbccfc5-6h5j9                   0/1     Pending   0          12m
   noobaa-operator-54b56ff554-4c4zn                   1/1     Running   0          113s
   ocs-metrics-exporter-78747df-w8hws                 1/1     Running   0          94s
   ocs-operator-d758f8ddf-g4qqn                       1/1     Running   0          94s
   odf-console-654bcc65bc-twth9                       1/1     Running   0          109s
   odf-operator-controller-manager-75dd69d4dc-6tql5   2/2     Running   0          109s
   rook-ceph-operator-78fc5d5648-xzxp6                1/1     Running   0          94s

**** Tried deleting the pods forcefully; it still did not help:
   [sraghave@localhost ~]$ oc delete pod noobaa-core-0 noobaa-db-pg-0 noobaa-endpoint-75dbbccfc5-6h5j9
   pod "noobaa-core-0" deleted
   pod "noobaa-db-pg-0" deleted
   pod "noobaa-endpoint-75dbbccfc5-6h5j9" deleted
   [sraghave@localhost ~]$ oc get pods
   NAME                                               READY   STATUS    RESTARTS   AGE
   csi-addons-controller-manager-5d58ffdcdc-dwsht     2/2     Running   0          4m2s
   noobaa-core-0                                      1/1     Running   0          93s
   noobaa-db-pg-0                                     0/1     Pending   0          92s
   noobaa-endpoint-75dbbccfc5-cx2d5                   0/1     Pending   0          90s
   noobaa-operator-54b56ff554-4c4zn                   1/1     Running   0          3m56s
   ocs-metrics-exporter-78747df-w8hws                 1/1     Running   0          3m37s
   ocs-operator-d758f8ddf-g4qqn                       1/1     Running   0          3m37s
   odf-console-654bcc65bc-twth9                       1/1     Running   0          3m52s
   odf-operator-controller-manager-75dd69d4dc-6tql5   2/2     Running   0          3m52s
   rook-ceph-operator-78fc5d5648-xzxp6                1/1     Running   0          3m37s

Failing QA based on the above observations.

Have a live cluster to debug; will also upload logs in some time.

Note: Similar observations were made on 4.11.3 as well, on BZ-2131168.
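As a follow-up on the live cluster, the commands below can show why noobaa-db-pg-0 and the new noobaa-endpoint pod are still Pending after the forced delete, i.e. which taints the scheduler says they do not tolerate. This is a minimal sketch using standard oc/kubectl output paths; the pod name is the one from the run above and will differ on a fresh attempt.

   # Scheduler events list the taints the Pending pod does not tolerate
   $ oc describe pod noobaa-db-pg-0 -n openshift-storage | grep -A10 Events

   # Taints currently applied to each node, for comparison
   $ oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints}{"\n"}{end}'

If the events still report the xyz taint as untolerated, that would confirm the recreated pods were built without the toleration from the storagecluster/noobaa CR.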
Logs are being copied here: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-2115613/