Bug 2131168 - [Backport to 4.11][GSS][noobaa]Fix Tolerations setting for NooBaa in standalone mode [NEEDINFO]
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ocs-operator
Version: 4.10
Hardware: All
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.11.4
Assignee: Utkarsh Srivastava
QA Contact: Shrivaibavi Raghaventhiran
URL:
Whiteboard:
Depends On: 2115613
Blocks: 2131169
Reported: 2022-09-30 06:59 UTC by Bipin Kunal
Modified: 2023-08-09 17:00 UTC

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of: 2115613
Clones: 2131169
Environment:
Last Closed: 2022-12-07 11:19:24 UTC
Embargoed:
sheggodu: needinfo? (usrivast)


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | red-hat-storage ocs-operator pull 1827 | 0 | None | Merged | add "noobaa-standalone" daemonplacement for MCG standalone deployment | 2022-10-11 12:53:18 UTC
Github | red-hat-storage ocs-operator pull 1837 | 0 | None | Merged | Bug 2131168: [release-4.11] add "noobaa-standalone" daemonplacement for MCG standalone deployment | 2022-11-21 05:14:12 UTC
Red Hat Bugzilla | 2149872 | 1 | unspecified | CLOSED | Toleration not added to the noobaa-default-backing-store pod after editing storagecluster CR and subs with toleration | 2023-08-09 16:49:53 UTC
Red Hat Product Errata | RHBA-2022:8877 | 0 | None | None | None | 2022-12-07 11:19:40 UTC

Comment 8 Shrivaibavi Raghaventhiran 2022-11-04 11:48:33 UTC
Tested version:
----------------
ODF - quay.io/rhceph-dev/ocs-registry:4.11.3-5
OCP - 4.11.9

Test steps:
-----------
1. Deploy MCG standalone
2. Taint the nodes
3. Respin all the pods under openshift-storage and let them go to Pending state
4. Apply tolerations to the storage CR and the 4 subscriptions
5. Observe whether the pods respin automatically and come to Running state
6. For the pods that come to Running state, check the pod YAML to ensure it has the tolerations
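
The check in the last step can be scripted. The following is a minimal sketch, assuming the pod has been exported as JSON (for example with `oc get pod <name> -o json`). The helper name `missing_tolerations` and the sample pod are hypothetical and not taken from the test run. The comparison is an exact field match and does not account for fields the API server may leave implicit.

```python
def missing_tolerations(pod, expected):
    """Return the expected tolerations that are absent from the pod spec.

    Compares key, operator, value and effect exactly; a toleration that
    omits one of these fields is treated as having None for it.
    """
    fields = ("key", "operator", "value", "effect")

    def normalize(toleration):
        return tuple(toleration.get(field) for field in fields)

    actual = pod.get("spec", {}).get("tolerations", [])
    present = {normalize(t) for t in actual}
    return [t for t in expected if normalize(t) not in present]


# Tolerations the test expects to find on each pod.
EXPECTED = [
    {"effect": "NoSchedule", "key": "xyz", "operator": "Equal", "value": "true"},
    {"effect": "NoSchedule", "key": "node.ocs.openshift.io/storage",
     "operator": "Equal", "value": "true"},
]

# Hypothetical pod document standing in for real `oc get pod -o json` output.
sample_pod = {
    "metadata": {"name": "noobaa-core-0"},
    "spec": {
        "tolerations": [
            {"effect": "NoSchedule", "key": "xyz", "operator": "Equal", "value": "true"},
        ]
    },
}

for toleration in missing_tolerations(sample_pod, EXPECTED):
    print("missing toleration for key:", toleration["key"])
```

An empty result means every expected toleration is present on the pod.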

Observation:
-------------

** Content added to storagecluster
spec:
  placement:
    noobaa-standalone:    
      tolerations:
      - effect: NoSchedule
        key: xyz
        operator: Equal
        value: "true"
      - effect: NoSchedule
        key: node.ocs.openshift.io/storage
        operator: Equal
        value: "true"

*** Content added to all 4 subs
  config:
    tolerations:
    - effect: NoSchedule
      key: xyz
      operator: Equal
      value: "true"
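
To reason about whether a set of tolerations covers a node taint, the sketch below follows the matching rules described in the Kubernetes taints and tolerations documentation. It is an illustration only, not the scheduler's code, and the function names are hypothetical.

```python
def tolerates(toleration, taint):
    """Return True if a single toleration matches a single taint."""
    # An empty toleration effect matches every taint effect.
    effect = toleration.get("effect")
    if effect and effect != taint["effect"]:
        return False

    operator = toleration.get("operator", "Equal")  # Equal is the default
    key = toleration.get("key")
    if not key:
        # An empty key is only meaningful with Exists and matches all keys.
        return operator == "Exists"
    if key != taint["key"]:
        return False
    if operator == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def untolerated(taints, tolerations):
    """Return the taints that none of the tolerations match."""
    return [
        taint for taint in taints
        if not any(tolerates(tol, taint) for tol in tolerations)
    ]


# The taint used in this test and the tolerations added to the storagecluster.
node_taints = [{"key": "xyz", "value": "true", "effect": "NoSchedule"}]
cr_tolerations = [
    {"effect": "NoSchedule", "key": "xyz", "operator": "Equal", "value": "true"},
    {"effect": "NoSchedule", "key": "node.ocs.openshift.io/storage",
     "operator": "Equal", "value": "true"},
]

print(untolerated(node_taints, cr_tolerations))  # taints left uncovered
print(untolerated(node_taints, []))              # a pod with no tolerations
```

A pod whose spec never receives the tolerations keeps the `xyz` taint uncovered, which is consistent with a pod staying in Pending on tainted nodes.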

*** After applying the above tolerations to all subs and the storage CR, the output was

NAME                                               READY   STATUS    RESTARTS   AGE
csi-addons-controller-manager-6fc9dbc6f7-2kxd8     2/2     Running   0          66s
noobaa-core-0                                      0/1     Pending   0          15m
noobaa-db-pg-0                                     0/1     Pending   0          15m
noobaa-endpoint-59f5c68f8f-dqf4h                   0/1     Pending   0          16m
noobaa-operator-57d447fcb5-grwbq                   1/1     Running   0          5m57s
ocs-metrics-exporter-6654597c54-s8h7v              1/1     Running   0          5m41s
ocs-operator-7b7487d774-qpwfr                      1/1     Running   0          5m41s
odf-console-666c8b4bbd-th2tb                       1/1     Running   0          38s
odf-operator-controller-manager-579c8448df-rmz8c   2/2     Running   0          38s
rook-ceph-operator-5674659b57-4kqcg                1/1     Running   0          5m42s

**** Tried to delete the pods forcefully, but it still did not help

$ oc delete pod noobaa-core-0 noobaa-db-pg-0 noobaa-endpoint-59f5c68f8f-dqf4h
pod "noobaa-core-0" deleted
pod "noobaa-db-pg-0" deleted
pod "noobaa-endpoint-59f5c68f8f-dqf4h" deleted

NAME                                               READY   STATUS    RESTARTS   AGE
csi-addons-controller-manager-6fc9dbc6f7-2kxd8     2/2     Running   0          9m
noobaa-core-0                                      0/1     Pending   0          4m36s
noobaa-db-pg-0                                     1/1     Running   0          4m35s
noobaa-endpoint-59f5c68f8f-gw2jx                   0/1     Pending   0          4m32s
noobaa-operator-57d447fcb5-grwbq                   1/1     Running   0          13m
ocs-metrics-exporter-6654597c54-s8h7v              1/1     Running   0          13m
ocs-operator-7b7487d774-qpwfr                      1/1     Running   0          13m
odf-console-666c8b4bbd-th2tb                       1/1     Running   0          8m32s
odf-operator-controller-manager-579c8448df-rmz8c   2/2     Running   0          8m32s
rook-ceph-operator-5674659b57-4kqcg                1/1     Running   0          13m
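
Listings like the one above can be filtered mechanically. This is a small sketch, assuming the default column layout of the pod listing shown in this comment. The function name `not_running` is hypothetical, and the sample is a subset of the rows above. It looks only at the STATUS column and does not inspect READY.

```python
def not_running(table):
    """Return (name, status) for every pod whose STATUS is not Running."""
    lines = [line for line in table.strip().splitlines() if line.strip()]
    header = lines[0].split()
    name_col = header.index("NAME")
    status_col = header.index("STATUS")

    result = []
    for line in lines[1:]:
        columns = line.split()
        if columns[status_col] != "Running":
            result.append((columns[name_col], columns[status_col]))
    return result


# Subset of the listing captured after the pods were deleted.
SAMPLE = """
NAME                                               READY   STATUS    RESTARTS   AGE
noobaa-core-0                                      0/1     Pending   0          4m36s
noobaa-db-pg-0                                     1/1     Running   0          4m35s
noobaa-endpoint-59f5c68f8f-gw2jx                   0/1     Pending   0          4m32s
noobaa-operator-57d447fcb5-grwbq                   1/1     Running   0          13m
"""

for name, status in not_running(SAMPLE):
    print(name, status)
```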


Failing QA based on the above observations.

A live cluster is available for debugging; I will also upload the logs shortly.

Comment 9 Shrivaibavi Raghaventhiran 2022-11-04 12:06:32 UTC
Logs: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-2131168/

Comment 13 Shrivaibavi Raghaventhiran 2022-11-17 06:51:38 UTC
Hi,
* We also verified the BZs on non-standalone clusters and did not see noobaa pods in Pending on 4.11 or 4.12.
4.12 https://bugzilla.redhat.com/show_bug.cgi?id=2121842
4.11 https://bugzilla.redhat.com/show_bug.cgi?id=2125147

* Even after restarts, the pod does not reliably reach Running, so restarting is ruled out as a workaround (WA) for this issue.

Please correct me if my understanding is wrong: Kubernetes behaviour does not differ between standalone MCG and non-standalone clusters, so I would not expect pods to stay in Pending on standalone MCG when the same setup works on non-standalone clusters. Could this be related to the noobaa operator?

Please check the failed_qa logs from comment #9

Raising need-info on @nbecker @belimele @bkunal

Let me know if we can meet to have more discussions on this.

Comment 25 Shrivaibavi Raghaventhiran 2022-12-01 09:11:50 UTC
Versions:
----------
OCP - 4.11.0-0.nightly-2022-11-30-175707
ODF - quay.io/rhceph-dev/ocs-registry:4.11.4-4

Test steps:
-----------
1. Deploy MCG standalone
2. Taint the nodes
oc adm taint nodes compute-0 compute-1 compute-2 xyz=true:NoSchedule
3. Apply tolerations to the storage CR and the 4 subscriptions
4. Observe whether the pods respin automatically and come to Running state
5. For the pods that come to Running state, check the pod YAML to ensure it has the tolerations
6. Respin all the pods to check that the toleration persists
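
The taint in step 2 and the toleration in step 3 have to agree on key, value and effect. The sketch below derives a toleration from a taint written in the `key=value:effect` form used by `oc adm taint`. The helper `taint_to_toleration` is hypothetical and handles only this form (with the value optional).

```python
def taint_to_toleration(spec):
    """Convert 'key=value:effect' (or 'key:effect') into a toleration dict."""
    key_value, separator, effect = spec.rpartition(":")
    if not separator or not key_value or not effect:
        raise ValueError(f"expected key[=value]:effect, got {spec!r}")

    key, equals, value = key_value.partition("=")
    if equals:
        # A taint with a value needs an Equal toleration carrying that value.
        return {"effect": effect, "key": key, "operator": "Equal", "value": value}
    # A taint without a value is matched by key alone.
    return {"effect": effect, "key": key, "operator": "Exists"}


toleration = taint_to_toleration("xyz=true:NoSchedule")
print(toleration)
```

For the taint used here, the result has the same fields as the `xyz` toleration added to the storagecluster and the subscriptions.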

Observation:
-------------
1. The noobaa-default-backing-store pods are not automatically respun after updating the storagecluster and subs with the required toleration.

However, a forced restart of the pod works fine. After the forced restart, the noobaa-default-backing-store pod comes up with the desired toleration.

** Content added to storagecluster
spec:
  placement:
    noobaa-standalone:    
      tolerations:
      - effect: NoSchedule
        key: xyz
        operator: Equal
        value: "true"
      - effect: NoSchedule
        key: node.ocs.openshift.io/storage
        operator: Equal
        value: "true"

*** Content added to all 4 subs
  config:
    tolerations:
    - effect: NoSchedule
      key: xyz
      operator: Equal
      value: "true"
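
The storagecluster content above can also be expressed as a patch document. This is a sketch only: the comment does not say how the CR was edited, so building a JSON merge patch is an assumption, and `placement_patch` is a hypothetical helper. Note that a JSON merge patch replaces a list wholesale, so the full tolerations list has to be supplied each time.

```python
import json


def placement_patch(tolerations, placement_key="noobaa-standalone"):
    """Build a merge-patch body that sets tolerations under spec.placement."""
    return {
        "spec": {
            "placement": {
                placement_key: {"tolerations": list(tolerations)},
            }
        }
    }


tolerations = [
    {"effect": "NoSchedule", "key": "xyz", "operator": "Equal", "value": "true"},
    {"effect": "NoSchedule", "key": "node.ocs.openshift.io/storage",
     "operator": "Equal", "value": "true"},
]

patch = placement_patch(tolerations)
print(json.dumps(patch, indent=2))
```

The printed JSON mirrors the YAML fragment shown for the storagecluster and could be passed to whatever tool is used to update the CR.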


Moving the BZ to verified state

Comment 28 Shrivaibavi Raghaventhiran 2022-12-01 09:28:37 UTC
New BZ to track the above-mentioned issue: https://bugzilla.redhat.com/show_bug.cgi?id=2149872

Comment 32 errata-xmlrpc 2022-12-07 11:19:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Data Foundation 4.11.4 Bug Fix Update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:8877

