Bug 2115613 - [GSS][NooBaa] Fix Tolerations setting for NooBaa in standalone mode [NEEDINFO]
Summary: [GSS][NooBaa] Fix Tolerations setting for NooBaa in standalone mode
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ocs-operator
Version: 4.10
Hardware: All
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.12.0
Assignee: Utkarsh Srivastava
QA Contact: Vishakha Kathole
URL:
Whiteboard:
Depends On:
Blocks: 2131168
 
Reported: 2022-08-05 02:18 UTC by Karun Josy
Modified: 2023-08-09 17:00 UTC
CC: 9 users

Fixed In Version: 4.12.0-74
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 2131168 (view as bug list)
Environment:
Last Closed: 2023-02-08 14:06:28 UTC
Embargoed:
sheggodu: needinfo? (usrivast)




Links
System ID Private Priority Status Summary Last Updated
Github red-hat-storage ocs-ci pull 7060 0 None Merged Non ocs taint and tolerations for NooBaa in standalone mode 2023-04-17 12:34:54 UTC
Github red-hat-storage ocs-operator pull 1827 0 None Merged add "noobaa-standalone" daemonplacement for MCG standalone deployment 2022-09-23 10:32:59 UTC

Comment 8 Shrivaibavi Raghaventhiran 2022-11-03 13:57:24 UTC
Tested in version:
--------------------
ODF - quay.io/rhceph-dev/ocs-registry:4.12.0-87
OCP - 4.12.0-0.nightly-2022-10-25-210451

Test Steps:
--------------
1. Added a taint to all compute nodes (workers):
oc adm taint nodes compute-0 compute-1 compute-2 xyz=true:NoSchedule
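
To confirm the taint landed, a quick sanity check such as the following can be used (not part of the original steps; node names as above):

$ oc describe nodes compute-0 compute-1 compute-2 | grep -i taints
# each node is expected to report xyz=true:NoSchedule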

2. Deleted all the pods under openshift-storage; all pods went into Pending state:
$ oc get pods -n openshift-storage
NAME                                               READY   STATUS    RESTARTS   AGE
csi-addons-controller-manager-79dc65bd67-2h9cp     0/2     Pending   0          61s
noobaa-core-0                                      0/1     Pending   0          30s
noobaa-db-pg-0                                     0/1     Pending   0          40s
noobaa-endpoint-748f784f4c-xkqdz                   0/1     Pending   0          48s
noobaa-operator-6d5495b584-h2msg                   0/1     Pending   0          46s
ocs-metrics-exporter-765fc45d7c-nnmnt              0/1     Pending   0          42s
ocs-operator-765c6d89-z9g5w                        0/1     Pending   0          25s
odf-console-7777b966dc-9g7w2                       0/1     Pending   0          23s
odf-operator-controller-manager-56bf57bcfc-tx74r   0/2     Pending   0          21s
rook-ceph-operator-649b6764d8-mgftx                0/1     Pending   0          19s

3. Added the following to the storagecluster CR:
spec:
  placement:
    noobaa-standalone:    
      tolerations:
      - effect: NoSchedule
        key: xyz
        operator: Equal
        value: "true"
      - effect: NoSchedule
        key: node.ocs.openshift.io/storage
        operator: Equal
        value: "true"

4. Added the following to all subscriptions under openshift-storage:
      tolerations:
      - effect: NoSchedule
        key: xyz
        operator: Equal
        value: "true"

Note: 

1. A few pods continued to stay in Pending state. No tolerations were added to the 4 pods mentioned (noobaa-core, noobaa-db-pg, noobaa-endpoint, noobaa-default-backing-store) until the pods were forcefully respun.
$ oc get pods -n openshift-storage
NAME                                               READY   STATUS    RESTARTS   AGE
csi-addons-controller-manager-6c97cbfb4-chwnq      2/2     Running   0          49m
noobaa-core-0                                      0/1     Pending   0          7m49s
noobaa-db-pg-0                                     0/1     Pending   0          7m49s
noobaa-endpoint-69ff749cc-55sk5                    1/1     Running   0          7m22s
noobaa-endpoint-748f784f4c-4z645                   0/1     Pending   0          8m20s
noobaa-operator-54f6557997-vqh5r                   1/1     Running   0          49m
ocs-metrics-exporter-86bff75b98-zpt8b              1/1     Running   0          48m
ocs-operator-7fbc857858-hshr9                      1/1     Running   0          48m
odf-console-9d9d748dd-795kz                        1/1     Running   0          49m
odf-operator-controller-manager-57d949f6db-pssc5   2/2     Running   0          49m
rook-ceph-operator-7d5df87845-g6hzt                1/1     Running   0          48m

Noted the tolerations on the NooBaa CR, but they did not seem to propagate to the pods.
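
A way to compare what landed on the CR versus the pod (a sketch; assumes the NooBaa CR is named noobaa, the usual default):

$ oc get noobaa noobaa -n openshift-storage -o yaml | grep -A 8 tolerations
$ oc get pod noobaa-core-0 -n openshift-storage -o yaml | grep -A 8 tolerations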

Tried it multiple times with multiple combinations in the storagecluster CR; it only works when the pods are forcefully respun.

A live cluster is available for debugging.

Raising a needinfo on @usrivast to confirm the behaviour and update the test steps.

Comment 9 Shrivaibavi Raghaventhiran 2022-11-03 14:06:52 UTC
The above test was performed on an MCG standalone cluster on vSphere.

Comment 12 Shrivaibavi Raghaventhiran 2022-11-04 12:36:50 UTC
Tested in version:
--------------------
ODF - quay.io/rhceph-dev/ocs-registry:4.12.0-87
OCP - 4.12.0-0.nightly-2022-10-25-210451

Test steps:
------------
1) Deploy MCG standalone
2) Taint the nodes
3) Respin all the pods under openshift-storage and let them go into Pending state
4) Apply tolerations to the storage CR and the 4 subscriptions
5) Observe whether the pods respin automatically and come to Running state
6) For the pods that come to Running state, check the pod YAML to ensure it has the tolerations (see the sketch after this list)
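
One way to run the check in step 6 across all pods at once (a sketch, not part of the original steps): list each pod with its toleration keys and look for the xyz key.

$ oc get pods -n openshift-storage \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.tolerations[*].key}{"\n"}{end}'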

Observation:
-------------

** Content added to storagecluster
spec:
  placement:
    noobaa-standalone:    
      tolerations:
      - effect: NoSchedule
        key: xyz
        operator: Equal
        value: "true"
      - effect: NoSchedule
        key: node.ocs.openshift.io/storage
        operator: Equal
        value: "true"

*** Content added to all 4 subs
  config:
    tolerations:
    - effect: NoSchedule
      key: xyz
      operator: Equal
      value: "true"

*** After applying the above tolerations to all subscriptions and the storage CR, the output was:

$ oc get pods
NAME                                               READY   STATUS    RESTARTS   AGE
csi-addons-controller-manager-5d58ffdcdc-dwsht     2/2     Running   0          119s
noobaa-core-0                                      0/1     Pending   0          11m
noobaa-db-pg-0                                     0/1     Pending   0          11m
noobaa-endpoint-75dbbccfc5-6h5j9                   0/1     Pending   0          12m
noobaa-operator-54b56ff554-4c4zn                   1/1     Running   0          113s
ocs-metrics-exporter-78747df-w8hws                 1/1     Running   0          94s
ocs-operator-d758f8ddf-g4qqn                       1/1     Running   0          94s
odf-console-654bcc65bc-twth9                       1/1     Running   0          109s
odf-operator-controller-manager-75dd69d4dc-6tql5   2/2     Running   0          109s
rook-ceph-operator-78fc5d5648-xzxp6                1/1     Running   0          94s

**** Tried deleting the pods forcefully, and it still did not help:

[sraghave@localhost ~]$ oc delete pod noobaa-core-0 noobaa-db-pg-0 noobaa-endpoint-75dbbccfc5-6h5j9
pod "noobaa-core-0" deleted
pod "noobaa-db-pg-0" deleted
pod "noobaa-endpoint-75dbbccfc5-6h5j9" deleted

[sraghave@localhost ~]$ oc get pods
NAME                                               READY   STATUS    RESTARTS   AGE
csi-addons-controller-manager-5d58ffdcdc-dwsht     2/2     Running   0          4m2s
noobaa-core-0                                      1/1     Running   0          93s
noobaa-db-pg-0                                     0/1     Pending   0          92s
noobaa-endpoint-75dbbccfc5-cx2d5                   0/1     Pending   0          90s
noobaa-operator-54b56ff554-4c4zn                   1/1     Running   0          3m56s
ocs-metrics-exporter-78747df-w8hws                 1/1     Running   0          3m37s
ocs-operator-d758f8ddf-g4qqn                       1/1     Running   0          3m37s
odf-console-654bcc65bc-twth9                       1/1     Running   0          3m52s
odf-operator-controller-manager-75dd69d4dc-6tql5   2/2     Running   0          3m52s
rook-ceph-operator-78fc5d5648-xzxp6                1/1     Running   0          3m37s



Failing QA based on the above observations.

A live cluster is available for debugging; logs will also be uploaded shortly.

Note: Similar observations were made on 4.11.3 as well; see BZ-2131168.

Comment 13 Shrivaibavi Raghaventhiran 2022-11-04 12:45:39 UTC
Logs are being copied here: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-2115613/

