Bug 2131168
| Summary: | [Backport to 4.11][GSS][noobaa]Fix Tolerations setting for NooBaa in standalone mode | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Bipin Kunal <bkunal> |
| Component: | ocs-operator | Assignee: | Utkarsh Srivastava <usrivast> |
| Status: | CLOSED ERRATA | QA Contact: | Shrivaibavi Raghaventhiran <sraghave> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.10 | CC: | belimele, etamir, kjosy, kramdoss, madam, muagarwa, nbecker, ocs-bugs, odf-bz-bot, rcyriac, sheggodu, sostapov, sraghave, usrivast |
| Target Milestone: | --- | Flags: | sheggodu: needinfo? (usrivast) |
| Target Release: | ODF 4.11.4 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 2115613 | | |
| : | 2131169 (view as bug list) | Environment: | |
| Last Closed: | 2022-12-07 11:19:24 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 2115613 | | |
| Bug Blocks: | 2131169 | | |
Hi,

* We verified the BZs on non-standalone clusters as well and did not see NooBaa pods stuck in Pending on 4.11 and 4.12:
  * 4.12 https://bugzilla.redhat.com/show_bug.cgi?id=2121842
  * 4.11 https://bugzilla.redhat.com/show_bug.cgi?id=2125147
* Even after restarting the pods, they do not reach Running consistently, so restarting is ruled out as a workaround for this issue.

Please correct me if my understanding is wrong: Kubernetes behaviour does not change between standalone MCG and non-standalone clusters, so I do not expect pods in Pending state on standalone MCG when the same setup works on non-standalone clusters. Is it something to do with the NooBaa operator? Please check the failed_qa logs from comment #9.

Raising need-info on @nbecker @belimele @bkunal. Let me know if we can meet to discuss this further.

Versions:
----------
OCP - 4.11.0-0.nightly-2022-11-30-175707
ODF - quay.io/rhceph-dev/ocs-registry:4.11.4-4
Test steps:
-----------
1. Deploy MCG standalone
2. Taint the nodes
oc adm taint nodes compute-0 compute-1 compute-2 xyz=true:NoSchedule
3. Apply tolerations in the storage CR and all 4 subscriptions
4. Observe whether the pods are respun automatically and come to the Running state
5. For the pods that reach the Running state, check the pod YAML to ensure it contains the tolerations (see the verification sketch after this list)
6. Respin all the pods to verify that the tolerations persist
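A minimal verification sketch for steps 5 and 6. The pod name is an example from this run and the default openshift-storage namespace is assumed; exact pod names on another cluster will differ:

# Print the tolerations from a pod's spec (pod name is an example).
oc get pod noobaa-core-0 -n openshift-storage -o jsonpath='{.spec.tolerations}{"\n"}'

# Respin the pod and confirm it is recreated with the same tolerations.
oc delete pod noobaa-core-0 -n openshift-storage
oc get pod noobaa-core-0 -n openshift-storage -o jsonpath='{.spec.tolerations}{"\n"}'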
Observation:
-------------
1. The noobaa-default-backing-store pods are not respun automatically after updating the storage cluster and subscriptions with the required toleration.
However, a force restart of the pod works fine; after the force restart, the noobaa-default-backing-store pod comes up with the desired toleration.
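A sketch of the force restart described above, assuming the default openshift-storage namespace; the backing-store pod name is looked up rather than hard-coded, and the placeholder below should be replaced with the actual name:

# Find the backing-store pod, then delete it so its owner recreates it with the new tolerations.
oc get pods -n openshift-storage | grep noobaa-default-backing-store
oc delete pod <noobaa-default-backing-store-pod-name> -n openshift-storage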
** Content added to storagecluster
spec:
  placement:
    noobaa-standalone:
      tolerations:
      - effect: NoSchedule
        key: xyz
        operator: Equal
        value: "true"
      - effect: NoSchedule
        key: node.ocs.openshift.io/storage
        operator: Equal
        value: "true"
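One way to apply the placement snippet above is a merge patch against the StorageCluster. This is a sketch; the CR name ocs-storagecluster is the usual default but may differ on a given cluster:

oc patch storagecluster ocs-storagecluster -n openshift-storage --type merge -p \
  '{"spec":{"placement":{"noobaa-standalone":{"tolerations":[
     {"effect":"NoSchedule","key":"xyz","operator":"Equal","value":"true"},
     {"effect":"NoSchedule","key":"node.ocs.openshift.io/storage","operator":"Equal","value":"true"}]}}}'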
*** Content added to all 4 subs
config:
  tolerations:
  - effect: NoSchedule
    key: xyz
    operator: Equal
    value: "true"
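The subscription config above can be applied the same way with a merge patch. The subscription name odf-operator is only an example here, so list the four subscriptions first and patch each one:

oc get subscriptions -n openshift-storage
oc patch subscription odf-operator -n openshift-storage --type merge -p \
  '{"spec":{"config":{"tolerations":[{"effect":"NoSchedule","key":"xyz","operator":"Equal","value":"true"}]}}}'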
Moving the BZ to verified state.

New BZ to track the above-mentioned issue: https://bugzilla.redhat.com/show_bug.cgi?id=2149872

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Data Foundation 4.11.4 Bug Fix Update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:8877
Tested version:
----------------
ODF - quay.io/rhceph-dev/ocs-registry:4.11.3-5
OCP - 4.11.9

Test steps:
-----------
1. Deploy MCG standalone
2. Taint the nodes
3. Respin all the pods under openshift-storage and let them go to the Pending state
4. Apply tolerations in the storage CR and all 4 subscriptions
5. Observe whether the pods are respun automatically and come to the Running state
6. For the pods that reach the Running state, check the pod YAML to ensure it contains the tolerations

Observation:
-------------
** Content added to storagecluster

spec:
  placement:
    noobaa-standalone:
      tolerations:
      - effect: NoSchedule
        key: xyz
        operator: Equal
        value: "true"
      - effect: NoSchedule
        key: node.ocs.openshift.io/storage
        operator: Equal
        value: "true"

*** Content added to all 4 subs

config:
  tolerations:
  - effect: NoSchedule
    key: xyz
    operator: Equal
    value: "true"

*** After applying the above tolerations to all subs and the storage CR, the output was:

NAME                                               READY   STATUS    RESTARTS   AGE
csi-addons-controller-manager-6fc9dbc6f7-2kxd8     2/2     Running   0          66s
noobaa-core-0                                      0/1     Pending   0          15m
noobaa-db-pg-0                                     0/1     Pending   0          15m
noobaa-endpoint-59f5c68f8f-dqf4h                   0/1     Pending   0          16m
noobaa-operator-57d447fcb5-grwbq                   1/1     Running   0          5m57s
ocs-metrics-exporter-6654597c54-s8h7v              1/1     Running   0          5m41s
ocs-operator-7b7487d774-qpwfr                      1/1     Running   0          5m41s
odf-console-666c8b4bbd-th2tb                       1/1     Running   0          38s
odf-operator-controller-manager-579c8448df-rmz8c   2/2     Running   0          38s
rook-ceph-operator-5674659b57-4kqcg                1/1     Running   0          5m42s

**** Tried to delete the pods forcefully and it still did not help:

$ oc delete pod noobaa-core-0 noobaa-db-pg-0 noobaa-endpoint-59f5c68f8f-dqf4h
pod "noobaa-core-0" deleted
pod "noobaa-db-pg-0" deleted
pod "noobaa-endpoint-59f5c68f8f-dqf4h" deleted

NAME                                               READY   STATUS    RESTARTS   AGE
csi-addons-controller-manager-6fc9dbc6f7-2kxd8     2/2     Running   0          9m
noobaa-core-0                                      0/1     Pending   0          4m36s
noobaa-db-pg-0                                     1/1     Running   0          4m35s
noobaa-endpoint-59f5c68f8f-gw2jx                   0/1     Pending   0          4m32s
noobaa-operator-57d447fcb5-grwbq                   1/1     Running   0          13m
ocs-metrics-exporter-6654597c54-s8h7v              1/1     Running   0          13m
ocs-operator-7b7487d774-qpwfr                      1/1     Running   0          13m
odf-console-666c8b4bbd-th2tb                       1/1     Running   0          8m32s
odf-operator-controller-manager-579c8448df-rmz8c   2/2     Running   0          8m32s
rook-ceph-operator-5674659b57-4kqcg                1/1     Running   0          13m

Fail QA based on the above observations.
Have a live cluster to debug; will upload logs in some time.
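To confirm why the pods above stay Pending, the scheduler events and the pod's tolerations can be inspected; a sketch, assuming the default openshift-storage namespace and one of the Pending pod names from this run:

# Scheduler events typically report a node taint that the pod does not tolerate.
oc describe pod noobaa-core-0 -n openshift-storage | tail -n 20
oc get pod noobaa-core-0 -n openshift-storage -o jsonpath='{.spec.tolerations}{"\n"}'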