Bug 2115613
| Summary: | [GSS][NooBaa] Fix Tolerations setting for NooBaa in standalone mode | |||
|---|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Karun Josy <kjosy> | |
| Component: | ocs-operator | Assignee: | Utkarsh Srivastava <usrivast> | |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Vishakha Kathole <vkathole> | |
| Severity: | high | Docs Contact: | ||
| Priority: | unspecified | |||
| Version: | 4.10 | CC: | bkunal, hnallurv, kramdoss, muagarwa, ocs-bugs, odf-bz-bot, sostapov, tdesala, usrivast | |
| Target Milestone: | --- | Flags: | sheggodu: needinfo? (usrivast) | |
| Target Release: | ODF 4.12.0 | |||
| Hardware: | All | |||
| OS: | Linux | |||
| Whiteboard: | ||||
| Fixed In Version: | 4.12.0-74 | Doc Type: | No Doc Update | |
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 2131168 | Environment: | |
| Last Closed: | 2023-02-08 14:06:28 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 2131168 | |||
Above test was performed on an MCG standalone cluster on vSphere.
Tested in version:
--------------------
ODF - quay.io/rhceph-dev/ocs-registry:4.12.0-87
OCP - 4.12.0-0.nightly-2022-10-25-210451
Test steps:
------------
1) Deploy MCG standalone
2) Taint the nodes
3) Respin all the pods under openshift-storage and let them go to Pending state
4) Apply tolerations in the storage CR and the 4 subscriptions
5) Observe whether the pods respin automatically and come to Running state
6) For the pods that come to Running state, check the pod YAML to ensure the tolerations are present (a command sketch for these steps follows below)
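A minimal command sketch of the taint / respin / verification steps above; the taint key xyz and the node names follow this test, and the jsonpath check is just one way to confirm a toleration landed on a pod:
# step 2: taint the worker nodes (node names illustrative)
$ oc adm taint nodes compute-0 compute-1 compute-2 xyz=true:NoSchedule
# step 3: respin the NooBaa pods so they go Pending on the tainted nodes
$ oc -n openshift-storage delete pod noobaa-core-0 noobaa-db-pg-0
# step 6: check whether the tolerations were propagated into the pod spec
$ oc -n openshift-storage get pod noobaa-core-0 -o jsonpath='{.spec.tolerations}'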
Observation:
-------------
** Content added to the storagecluster CR:
spec:
placement:
noobaa-standalone:
tolerations:
- effect: NoSchedule
key: xyz
operator: Equal
value: "true"
- effect: NoSchedule
key: node.ocs.openshift.io/storage
operator: Equal
value: "true"
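One possible way to apply and verify this placement block non-interactively; this is a sketch only, and the storagecluster name ocs-storagecluster is an assumption (list the actual name with oc get storagecluster -n openshift-storage):
$ oc -n openshift-storage patch storagecluster ocs-storagecluster --type merge \
  -p '{"spec":{"placement":{"noobaa-standalone":{"tolerations":[{"effect":"NoSchedule","key":"xyz","operator":"Equal","value":"true"},{"effect":"NoSchedule","key":"node.ocs.openshift.io/storage","operator":"Equal","value":"true"}]}}}}'
$ oc -n openshift-storage get storagecluster ocs-storagecluster -o yaml | grep -A 12 noobaa-standalone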
*** Content added to all 4 subscriptions:
config:
tolerations:
- effect: NoSchedule
key: xyz
operator: Equal
value: "true"
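For reference, this config: block sits under spec: in each OLM Subscription. A minimal sketch, with the subscription name illustrative (the four actual names can be listed with oc get subscriptions -n openshift-storage):
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: odf-operator            # repeat for each of the 4 subscriptions
  namespace: openshift-storage
spec:
  config:
    tolerations:
    - effect: NoSchedule
      key: xyz
      operator: Equal
      value: "true"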
*** After applying the above tolerations to all subscriptions and the storage CR, the output was:
$ oc get pods
NAME READY STATUS RESTARTS AGE
csi-addons-controller-manager-5d58ffdcdc-dwsht 2/2 Running 0 119s
noobaa-core-0 0/1 Pending 0 11m
noobaa-db-pg-0 0/1 Pending 0 11m
noobaa-endpoint-75dbbccfc5-6h5j9 0/1 Pending 0 12m
noobaa-operator-54b56ff554-4c4zn 1/1 Running 0 113s
ocs-metrics-exporter-78747df-w8hws 1/1 Running 0 94s
ocs-operator-d758f8ddf-g4qqn 1/1 Running 0 94s
odf-console-654bcc65bc-twth9 1/1 Running 0 109s
odf-operator-controller-manager-75dd69d4dc-6tql5 2/2 Running 0 109s
rook-ceph-operator-78fc5d5648-xzxp6 1/1 Running 0 94s
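To confirm why the NooBaa pods stay Pending, one way (a sketch) is to check whether the pod spec actually carries the xyz toleration and to look at the scheduling events:
$ oc -n openshift-storage get pod noobaa-db-pg-0 -o jsonpath='{.spec.tolerations}'
$ oc -n openshift-storage describe pod noobaa-db-pg-0
# the Tolerations and Events sections should show whether the xyz toleration is missing
# and the FailedScheduling reason (untolerated taint)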
**** Tried deleting the pods forcefully, and it still did not fully help:
[sraghave@localhost ~]$ oc delete pod noobaa-core-0 noobaa-db-pg-0 noobaa-endpoint-75dbbccfc5-6h5j9
pod "noobaa-core-0" deleted
pod "noobaa-db-pg-0" deleted
pod "noobaa-endpoint-75dbbccfc5-6h5j9" deleted
[sraghave@localhost ~]$ oc get pods
NAME READY STATUS RESTARTS AGE
csi-addons-controller-manager-5d58ffdcdc-dwsht 2/2 Running 0 4m2s
noobaa-core-0 1/1 Running 0 93s
noobaa-db-pg-0 0/1 Pending 0 92s
noobaa-endpoint-75dbbccfc5-cx2d5 0/1 Pending 0 90s
noobaa-operator-54b56ff554-4c4zn 1/1 Running 0 3m56s
ocs-metrics-exporter-78747df-w8hws 1/1 Running 0 3m37s
ocs-operator-d758f8ddf-g4qqn 1/1 Running 0 3m37s
odf-console-654bcc65bc-twth9 1/1 Running 0 3m52s
odf-operator-controller-manager-75dd69d4dc-6tql5 2/2 Running 0 3m52s
rook-ceph-operator-78fc5d5648-xzxp6 1/1 Running 0 3m37s
Failing QA based on the above observations.
Have a live cluster to debug; will also upload logs in some time.
Note: Similar observations were made on 4.11.3 as well, in BZ 2131168.
Logs are being copied here: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-2115613/
Tested in version:
--------------------
ODF - quay.io/rhceph-dev/ocs-registry:4.12.0-87
OCP - 4.12.0-0.nightly-2022-10-25-210451
Test Steps:
--------------
1. Added taint to all compute nodes (workers):
oc adm taint nodes compute-0 compute-1 compute-2 xyz=true:NoSchedule
2. Deleted all the pods under openshift-storage; all pods are in Pending state:
$ oc get pods -n openshift-storage
NAME READY STATUS RESTARTS AGE
csi-addons-controller-manager-79dc65bd67-2h9cp 0/2 Pending 0 61s
noobaa-core-0 0/1 Pending 0 30s
noobaa-db-pg-0 0/1 Pending 0 40s
noobaa-endpoint-748f784f4c-xkqdz 0/1 Pending 0 48s
noobaa-operator-6d5495b584-h2msg 0/1 Pending 0 46s
ocs-metrics-exporter-765fc45d7c-nnmnt 0/1 Pending 0 42s
ocs-operator-765c6d89-z9g5w 0/1 Pending 0 25s
odf-console-7777b966dc-9g7w2 0/1 Pending 0 23s
odf-operator-controller-manager-56bf57bcfc-tx74r 0/2 Pending 0 21s
rook-ceph-operator-649b6764d8-mgftx 0/1 Pending 0 19s
3. Added the below in the storagecluster CR:
spec:
  placement:
    noobaa-standalone:
      tolerations:
      - effect: NoSchedule
        key: xyz
        operator: Equal
        value: "true"
      - effect: NoSchedule
        key: node.ocs.openshift.io/storage
        operator: Equal
        value: "true"
4. Added the below in all subscriptions under openshift-storage:
tolerations:
- effect: NoSchedule
  key: xyz
  operator: Equal
  value: "true"
Note:
1. A few pods continued to stay in Pending state. No tolerations got added to the 4 pods mentioned (noobaa-core, noobaa-db-pg, noobaa-endpoint, noobaa-default-backing-store) until a forced respin of the pods.
$ oc get pods -n openshift-storage
NAME READY STATUS RESTARTS AGE
csi-addons-controller-manager-6c97cbfb4-chwnq 2/2 Running 0 49m
noobaa-core-0 0/1 Pending 0 7m49s
noobaa-db-pg-0 0/1 Pending 0 7m49s
noobaa-endpoint-69ff749cc-55sk5 1/1 Running 0 7m22s
noobaa-endpoint-748f784f4c-4z645 0/1 Pending 0 8m20s
noobaa-operator-54f6557997-vqh5r 1/1 Running 0 49m
ocs-metrics-exporter-86bff75b98-zpt8b 1/1 Running 0 48m
ocs-operator-7fbc857858-hshr9 1/1 Running 0 48m
odf-console-9d9d748dd-795kz 1/1 Running 0 49m
odf-operator-controller-manager-57d949f6db-pssc5 2/2 Running 0 49m
rook-ceph-operator-7d5df87845-g6hzt 1/1 Running 0 48m
Noted the tolerations on the NooBaa CR, but it seems they did not propagate to the pods. Tried it multiple times with multiple combinations on the storagecluster CR; it only works when the pods are respun forcefully.
Have a live cluster to debug.
Raising a need-info on @usrivast to confirm the behaviour and update the test steps.
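To compare what the operator wrote to the NooBaa CR against what actually reached the pods, a sketch (the NooBaa CR is usually named noobaa in openshift-storage, but that name is an assumption here):
$ oc -n openshift-storage get noobaa noobaa -o yaml | grep -B 2 -A 10 tolerations
$ oc -n openshift-storage get pod noobaa-db-pg-0 -o jsonpath='{.spec.tolerations}'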