Tested version:
----------------
ODF - quay.io/rhceph-dev/ocs-registry:4.11.3-5
OCP - 4.11.9

Test steps:
-----------
1. Deploy MCG standalone
2. Taint the nodes
3. Respin all the pods under openshift-storage and let them go to Pending state
4. Apply tolerations in the storage CR and the 4 subscriptions (see the sketch after this comment)
5. Observe whether the pods respin automatically and come to Running state
6. For the pods that come to Running state, check the pod YAML to ensure it has the tolerations

Observation:
-------------
** Content added to the storagecluster spec:

  placement:
    noobaa-standalone:
      tolerations:
      - effect: NoSchedule
        key: xyz
        operator: Equal
        value: "true"
      - effect: NoSchedule
        key: node.ocs.openshift.io/storage
        operator: Equal
        value: "true"

*** Content added to all 4 subs config:

  tolerations:
  - effect: NoSchedule
    key: xyz
    operator: Equal
    value: "true"

*** After applying the above tolerations to all subs and the storage CR, the output was:

NAME                                               READY   STATUS    RESTARTS   AGE
csi-addons-controller-manager-6fc9dbc6f7-2kxd8     2/2     Running   0          66s
noobaa-core-0                                      0/1     Pending   0          15m
noobaa-db-pg-0                                     0/1     Pending   0          15m
noobaa-endpoint-59f5c68f8f-dqf4h                   0/1     Pending   0          16m
noobaa-operator-57d447fcb5-grwbq                   1/1     Running   0          5m57s
ocs-metrics-exporter-6654597c54-s8h7v              1/1     Running   0          5m41s
ocs-operator-7b7487d774-qpwfr                      1/1     Running   0          5m41s
odf-console-666c8b4bbd-th2tb                       1/1     Running   0          38s
odf-operator-controller-manager-579c8448df-rmz8c   2/2     Running   0          38s
rook-ceph-operator-5674659b57-4kqcg                1/1     Running   0          5m42s

**** Tried to delete the pods forcefully and it still did not help:

$ oc delete pod noobaa-core-0 noobaa-db-pg-0 noobaa-endpoint-59f5c68f8f-dqf4h
pod "noobaa-core-0" deleted
pod "noobaa-db-pg-0" deleted
pod "noobaa-endpoint-59f5c68f8f-dqf4h" deleted

NAME                                               READY   STATUS    RESTARTS   AGE
csi-addons-controller-manager-6fc9dbc6f7-2kxd8     2/2     Running   0          9m
noobaa-core-0                                      0/1     Pending   0          4m36s
noobaa-db-pg-0                                     1/1     Running   0          4m35s
noobaa-endpoint-59f5c68f8f-gw2jx                   0/1     Pending   0          4m32s
noobaa-operator-57d447fcb5-grwbq                   1/1     Running   0          13m
ocs-metrics-exporter-6654597c54-s8h7v              1/1     Running   0          13m
ocs-operator-7b7487d774-qpwfr                      1/1     Running   0          13m
odf-console-666c8b4bbd-th2tb                       1/1     Running   0          8m32s
odf-operator-controller-manager-579c8448df-rmz8c   2/2     Running   0          8m32s
rook-ceph-operator-5674659b57-4kqcg                1/1     Running   0          13m

Failing QA based on the above observations.

Have a live cluster to debug; will upload logs in some time as well.
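For reproducibility, a sketch of how the snippets above can be applied with oc patch. The storagecluster name ocs-storagecluster is an assumption and may differ on a standalone MCG deployment; the loop patches every subscription in the namespace rather than naming the 4 subs individually:

$ # assumed storagecluster name; adjust to match "oc -n openshift-storage get storagecluster"
$ oc -n openshift-storage patch storagecluster ocs-storagecluster --type merge -p '
spec:
  placement:
    noobaa-standalone:
      tolerations:
      - effect: NoSchedule
        key: xyz
        operator: Equal
        value: "true"'

$ # spec.config.tolerations is the standard OLM SubscriptionConfig field
$ for sub in $(oc -n openshift-storage get subscription -o name); do
    oc -n openshift-storage patch "$sub" --type merge -p '
spec:
  config:
    tolerations:
    - effect: NoSchedule
      key: xyz
      operator: Equal
      value: "true"'
  done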
Logs: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-2131168/
Hi,

* We verified the BZs on non-standalone clusters too, and we did not see noobaa pods in Pending state on 4.11 and 4.12:
  4.12 - https://bugzilla.redhat.com/show_bug.cgi?id=2121842
  4.11 - https://bugzilla.redhat.com/show_bug.cgi?id=2125147

* Even after restarts, the pod does not reach Running state every time, so I am ruling out pod restarts as a workaround for this issue.

Please correct me if my understanding is wrong: Kubernetes behaviour does not change between standalone MCG and non-standalone clusters, so I do not expect pods in Pending state on standalone MCG when the same scenario works perfectly on non-standalone clusters. Is it something to do with the noobaa operator?

Please check the failed_qa logs from comment #9.

Raising need-info on @nbecker @belimele @bkunal. Let me know if we can meet to discuss this further.
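For anyone picking up the live cluster, a minimal sketch of how to confirm whether the Pending pods are missing the toleration (pod name taken from the output in comment #9; everything else is standard oc usage):

$ oc -n openshift-storage describe pod noobaa-core-0 | tail -n 10
# the Events section at the end is expected to show a FailedScheduling event
# mentioning the untolerated xyz taint if the toleration was not propagated

$ oc -n openshift-storage get pod noobaa-core-0 -o jsonpath='{.spec.tolerations}{"\n"}'
# if the xyz toleration is absent from this list, the operator has not pushed
# the placement from the storagecluster CR into the pod spec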
Versions:
----------
OCP - 4.11.0-0.nightly-2022-11-30-175707
ODF - quay.io/rhceph-dev/ocs-registry:4.11.4-4

Test steps:
-----------
1. Deploy MCG standalone
2. Taint the nodes:
   oc adm taint nodes compute-0 compute-1 compute-2 xyz=true:NoSchedule
3. Apply tolerations in the storage CR and the 4 subscriptions
4. Observe whether the pods respin automatically and come to Running state
5. For the pods that come to Running state, check the pod YAML to ensure it has the tolerations (see the sketch after this comment)
6. Respin all the pods just to check whether the toleration persists

Observation:
-------------
1. The noobaa-default-backing-store pod does not get respun automatically after updating the storage cluster and the subscriptions with the required toleration. However, a forced restart of the pod works fine: after the forced restart, the noobaa-default-backing-store pod comes up with the desired toleration.

** Content added to the storagecluster spec:

  placement:
    noobaa-standalone:
      tolerations:
      - effect: NoSchedule
        key: xyz
        operator: Equal
        value: "true"
      - effect: NoSchedule
        key: node.ocs.openshift.io/storage
        operator: Equal
        value: "true"

*** Content added to all 4 subs config:

  tolerations:
  - effect: NoSchedule
    key: xyz
    operator: Equal
    value: "true"

Moving the BZ to Verified state.
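For steps 5 and 6, one way to check the tolerations in the pod spec without dumping the full YAML (jsonpath sketch; works for any of the noobaa pods, noobaa-db-pg-0 used here as an example):

$ oc -n openshift-storage get pod noobaa-db-pg-0 \
    -o jsonpath='{range .spec.tolerations[*]}{.key}={.value}:{.effect}{"\n"}{end}'
# expected to include xyz=true:NoSchedule and
# node.ocs.openshift.io/storage=true:NoSchedule among the entries
# (default tolerations with operator Exists print with an empty value)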
New BZ to track the above-mentioned issue: https://bugzilla.redhat.com/show_bug.cgi?id=2149872
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Data Foundation 4.11.4 Bug Fix Update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2022:8877