Bug 2115613
| Summary: | [GSS][NooBaa] Fix Tolerations setting for NooBaa in standalone mode | |||
|---|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Karun Josy <kjosy> | |
| Component: | ocs-operator | Assignee: | Utkarsh Srivastava <usrivast> | |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Vishakha Kathole <vkathole> | |
| Severity: | high | Docs Contact: | ||
| Priority: | unspecified | |||
| Version: | 4.10 | CC: | bkunal, hnallurv, kramdoss, muagarwa, ocs-bugs, odf-bz-bot, sostapov, tdesala, usrivast | |
| Target Milestone: | --- | Flags: | sheggodu: needinfo? (usrivast) | |
| Target Release: | ODF 4.12.0 | |||
| Hardware: | All | |||
| OS: | Linux | |||
| Whiteboard: | ||||
| Fixed In Version: | 4.12.0-74 | Doc Type: | No Doc Update | |
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 2131168 | Environment: | |
| Last Closed: | 2023-02-08 14:06:28 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 2131168 | |||
Above test was performed on an MCG standalone cluster on vSphere.
Tested in version:
--------------------
ODF - quay.io/rhceph-dev/ocs-registry:4.12.0-87
OCP - 4.12.0-0.nightly-2022-10-25-210451
Test steps:
------------
1) Deploy MCG standalone
2) Taint the nodes
3) Respin all the pods under openshift-storage and let them go to Pending state
4) Apply tolerations in the storage CR and the 4 subscriptions
5) Observe whether the pods respin automatically and come to Running state
6) For the pods that come to Running state, check the pod YAML to ensure the tolerations are present (a command sketch for these steps follows below)
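A minimal command sketch of the taint / respin / verification steps above; the taint key xyz and the node names follow this test, and the jsonpath check is just one way to confirm a toleration landed on a pod:
# step 2: taint the worker nodes (node names illustrative)
$ oc adm taint nodes compute-0 compute-1 compute-2 xyz=true:NoSchedule
# step 3: respin the NooBaa pods so they go Pending on the tainted nodes
$ oc -n openshift-storage delete pod noobaa-core-0 noobaa-db-pg-0
# step 6: check whether the tolerations were propagated into the pod spec
$ oc -n openshift-storage get pod noobaa-core-0 -o jsonpath='{.spec.tolerations}'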
Observation:
-------------
** Content added to the storagecluster CR:
spec:
placement:
noobaa-standalone:
tolerations:
- effect: NoSchedule
key: xyz
operator: Equal
value: "true"
- effect: NoSchedule
key: node.ocs.openshift.io/storage
operator: Equal
value: "true"
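One possible way to apply and verify this placement block non-interactively; this is a sketch only, and the storagecluster name ocs-storagecluster is an assumption (list the actual name with oc get storagecluster -n openshift-storage):
$ oc -n openshift-storage patch storagecluster ocs-storagecluster --type merge \
  -p '{"spec":{"placement":{"noobaa-standalone":{"tolerations":[{"effect":"NoSchedule","key":"xyz","operator":"Equal","value":"true"},{"effect":"NoSchedule","key":"node.ocs.openshift.io/storage","operator":"Equal","value":"true"}]}}}}'
$ oc -n openshift-storage get storagecluster ocs-storagecluster -o yaml | grep -A 12 noobaa-standalone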
*** Content added to all 4 subscriptions:
config:
tolerations:
- effect: NoSchedule
key: xyz
operator: Equal
value: "true"
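For reference, this config: block sits under spec: in each OLM Subscription. A minimal sketch, with the subscription name illustrative (the four actual names can be listed with oc get subscriptions -n openshift-storage):
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: odf-operator            # repeat for each of the 4 subscriptions
  namespace: openshift-storage
spec:
  config:
    tolerations:
    - effect: NoSchedule
      key: xyz
      operator: Equal
      value: "true"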
*** After applying the above tolerations to all subscriptions and the storage CR, the output was:
$ oc get pods
NAME READY STATUS RESTARTS AGE
csi-addons-controller-manager-5d58ffdcdc-dwsht 2/2 Running 0 119s
noobaa-core-0 0/1 Pending 0 11m
noobaa-db-pg-0 0/1 Pending 0 11m
noobaa-endpoint-75dbbccfc5-6h5j9 0/1 Pending 0 12m
noobaa-operator-54b56ff554-4c4zn 1/1 Running 0 113s
ocs-metrics-exporter-78747df-w8hws 1/1 Running 0 94s
ocs-operator-d758f8ddf-g4qqn 1/1 Running 0 94s
odf-console-654bcc65bc-twth9 1/1 Running 0 109s
odf-operator-controller-manager-75dd69d4dc-6tql5 2/2 Running 0 109s
rook-ceph-operator-78fc5d5648-xzxp6 1/1 Running 0 94s
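To confirm why the NooBaa pods stay Pending, one way (a sketch) is to check whether the pod spec actually carries the xyz toleration and to look at the scheduling events:
$ oc -n openshift-storage get pod noobaa-db-pg-0 -o jsonpath='{.spec.tolerations}'
$ oc -n openshift-storage describe pod noobaa-db-pg-0
# the Tolerations and Events sections should show whether the xyz toleration is missing
# and the FailedScheduling reason (untolerated taint)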
**** Tried deleting the pods forcefully, and it still did not fully help:
[sraghave@localhost ~]$ oc delete pod noobaa-core-0 noobaa-db-pg-0 noobaa-endpoint-75dbbccfc5-6h5j9
pod "noobaa-core-0" deleted
pod "noobaa-db-pg-0" deleted
pod "noobaa-endpoint-75dbbccfc5-6h5j9" deleted
[sraghave@localhost ~]$ oc get pods
NAME READY STATUS RESTARTS AGE
csi-addons-controller-manager-5d58ffdcdc-dwsht 2/2 Running 0 4m2s
noobaa-core-0 1/1 Running 0 93s
noobaa-db-pg-0 0/1 Pending 0 92s
noobaa-endpoint-75dbbccfc5-cx2d5 0/1 Pending 0 90s
noobaa-operator-54b56ff554-4c4zn 1/1 Running 0 3m56s
ocs-metrics-exporter-78747df-w8hws 1/1 Running 0 3m37s
ocs-operator-d758f8ddf-g4qqn 1/1 Running 0 3m37s
odf-console-654bcc65bc-twth9 1/1 Running 0 3m52s
odf-operator-controller-manager-75dd69d4dc-6tql5 2/2 Running 0 3m52s
rook-ceph-operator-78fc5d5648-xzxp6 1/1 Running 0 3m37s
Failing QA based on the above observations.
Have a live cluster to debug; will also upload logs in some time.
Note: Similar observations were made on 4.11.3 as well, in BZ 2131168.
Logs are being copied here: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-2115613/
Tested in version:
--------------------
ODF - quay.io/rhceph-dev/ocs-registry:4.12.0-87
OCP - 4.12.0-0.nightly-2022-10-25-210451
Test Steps:
--------------
1. Added taint to all compute nodes (workers):
oc adm taint nodes compute-0 compute-1 compute-2 xyz=true:NoSchedule
2. Deleted all the pods under openshift-storage; all pods are in Pending state:
$ oc get pods -n openshift-storage
NAME READY STATUS RESTARTS AGE
csi-addons-controller-manager-79dc65bd67-2h9cp 0/2 Pending 0 61s
noobaa-core-0 0/1 Pending 0 30s
noobaa-db-pg-0 0/1 Pending 0 40s
noobaa-endpoint-748f784f4c-xkqdz 0/1 Pending 0 48s
noobaa-operator-6d5495b584-h2msg 0/1 Pending 0 46s
ocs-metrics-exporter-765fc45d7c-nnmnt 0/1 Pending 0 42s
ocs-operator-765c6d89-z9g5w 0/1 Pending 0 25s
odf-console-7777b966dc-9g7w2 0/1 Pending 0 23s
odf-operator-controller-manager-56bf57bcfc-tx74r 0/2 Pending 0 21s
rook-ceph-operator-649b6764d8-mgftx 0/1 Pending 0 19s
3. Added the below in the storagecluster CR:
spec:
  placement:
    noobaa-standalone:
      tolerations:
      - effect: NoSchedule
        key: xyz
        operator: Equal
        value: "true"
      - effect: NoSchedule
        key: node.ocs.openshift.io/storage
        operator: Equal
        value: "true"
4. Added the below in all subscriptions under openshift-storage:
tolerations:
- effect: NoSchedule
  key: xyz
  operator: Equal
  value: "true"
Note:
1. A few pods continued to stay in Pending state. No tolerations got added to the 4 pods mentioned (noobaa-core, noobaa-db-pg, noobaa-endpoint, noobaa-default-backing-store) until a forced respin of the pods.
$ oc get pods -n openshift-storage
NAME READY STATUS RESTARTS AGE
csi-addons-controller-manager-6c97cbfb4-chwnq 2/2 Running 0 49m
noobaa-core-0 0/1 Pending 0 7m49s
noobaa-db-pg-0 0/1 Pending 0 7m49s
noobaa-endpoint-69ff749cc-55sk5 1/1 Running 0 7m22s
noobaa-endpoint-748f784f4c-4z645 0/1 Pending 0 8m20s
noobaa-operator-54f6557997-vqh5r 1/1 Running 0 49m
ocs-metrics-exporter-86bff75b98-zpt8b 1/1 Running 0 48m
ocs-operator-7fbc857858-hshr9 1/1 Running 0 48m
odf-console-9d9d748dd-795kz 1/1 Running 0 49m
odf-operator-controller-manager-57d949f6db-pssc5 2/2 Running 0 49m
rook-ceph-operator-7d5df87845-g6hzt 1/1 Running 0 48m
Noted the tolerations on the NooBaa CR, but it seems they did not propagate to the pods. Tried it multiple times with multiple combinations on the storagecluster CR; it only works when the pods are respun forcefully.
Have a live cluster to debug.
Raising a need-info on @usrivast to confirm the behaviour and update the test steps.
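To compare what the operator wrote to the NooBaa CR against what actually reached the pods, a sketch (the NooBaa CR is usually named noobaa in openshift-storage, but that name is an assumption here):
$ oc -n openshift-storage get noobaa noobaa -o yaml | grep -B 2 -A 10 tolerations
$ oc -n openshift-storage get pod noobaa-db-pg-0 -o jsonpath='{.spec.tolerations}'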