Tested version:
----------------
ODF - quay.io/rhceph-dev/ocs-registry:4.11.3-5
OCP - 4.11.9

Test steps:
-----------
1. Deploy MCG standalone
2. Taint the nodes
3. Respin all the pods under openshift-storage and let them go to Pending state
4. Apply tolerations in the storage CR and the 4 subscriptions (see the sketch after this comment)
5. Observe whether the pods respin automatically and come to Running state
6. For the pods that come to Running state, check the pod YAML to ensure it has the tolerations

Observation:
-------------
** Content added to the storagecluster spec:

  placement:
    noobaa-standalone:
      tolerations:
      - effect: NoSchedule
        key: xyz
        operator: Equal
        value: "true"
      - effect: NoSchedule
        key: node.ocs.openshift.io/storage
        operator: Equal
        value: "true"

*** Content added to all 4 subs config:

  tolerations:
  - effect: NoSchedule
    key: xyz
    operator: Equal
    value: "true"

*** After applying the above tolerations to all subs and the storage CR, the output was:

NAME                                               READY   STATUS    RESTARTS   AGE
csi-addons-controller-manager-6fc9dbc6f7-2kxd8     2/2     Running   0          66s
noobaa-core-0                                      0/1     Pending   0          15m
noobaa-db-pg-0                                     0/1     Pending   0          15m
noobaa-endpoint-59f5c68f8f-dqf4h                   0/1     Pending   0          16m
noobaa-operator-57d447fcb5-grwbq                   1/1     Running   0          5m57s
ocs-metrics-exporter-6654597c54-s8h7v              1/1     Running   0          5m41s
ocs-operator-7b7487d774-qpwfr                      1/1     Running   0          5m41s
odf-console-666c8b4bbd-th2tb                       1/1     Running   0          38s
odf-operator-controller-manager-579c8448df-rmz8c   2/2     Running   0          38s
rook-ceph-operator-5674659b57-4kqcg                1/1     Running   0          5m42s

**** Tried to delete the pods forcefully and it still did not help:

$ oc delete pod noobaa-core-0 noobaa-db-pg-0 noobaa-endpoint-59f5c68f8f-dqf4h
pod "noobaa-core-0" deleted
pod "noobaa-db-pg-0" deleted
pod "noobaa-endpoint-59f5c68f8f-dqf4h" deleted

NAME                                               READY   STATUS    RESTARTS   AGE
csi-addons-controller-manager-6fc9dbc6f7-2kxd8     2/2     Running   0          9m
noobaa-core-0                                      0/1     Pending   0          4m36s
noobaa-db-pg-0                                     1/1     Running   0          4m35s
noobaa-endpoint-59f5c68f8f-gw2jx                   0/1     Pending   0          4m32s
noobaa-operator-57d447fcb5-grwbq                   1/1     Running   0          13m
ocs-metrics-exporter-6654597c54-s8h7v              1/1     Running   0          13m
ocs-operator-7b7487d774-qpwfr                      1/1     Running   0          13m
odf-console-666c8b4bbd-th2tb                       1/1     Running   0          8m32s
odf-operator-controller-manager-579c8448df-rmz8c   2/2     Running   0          8m32s
rook-ceph-operator-5674659b57-4kqcg                1/1     Running   0          13m

Failing QA based on the above observations.

Have a live cluster to debug; will upload logs in some time as well.
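For reproducibility, a sketch of how the snippets above can be applied with oc patch. The storagecluster name ocs-storagecluster is an assumption and may differ on a standalone MCG deployment; the loop patches every subscription in the namespace rather than naming the 4 subs individually:

$ # assumed storagecluster name; adjust to match "oc -n openshift-storage get storagecluster"
$ oc -n openshift-storage patch storagecluster ocs-storagecluster --type merge -p '
spec:
  placement:
    noobaa-standalone:
      tolerations:
      - effect: NoSchedule
        key: xyz
        operator: Equal
        value: "true"'

$ # spec.config.tolerations is the standard OLM SubscriptionConfig field
$ for sub in $(oc -n openshift-storage get subscription -o name); do
    oc -n openshift-storage patch "$sub" --type merge -p '
spec:
  config:
    tolerations:
    - effect: NoSchedule
      key: xyz
      operator: Equal
      value: "true"'
  done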
Logs: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-2131168/
Hi,

* We verified the BZs on non-standalone clusters too, and we did not see noobaa pods in Pending state on 4.11 and 4.12:
  4.12 - https://bugzilla.redhat.com/show_bug.cgi?id=2121842
  4.11 - https://bugzilla.redhat.com/show_bug.cgi?id=2125147

* Even after restarts, the pod does not reach Running state every time, so I am ruling out pod restarts as a workaround for this issue.

Please correct me if my understanding is wrong: Kubernetes behaviour does not change between standalone MCG and non-standalone clusters, so I do not expect pods in Pending state on standalone MCG when the same scenario works perfectly on non-standalone clusters. Is it something to do with the noobaa operator?

Please check the failed_qa logs from comment #9.

Raising need-info on @nbecker @belimele @bkunal. Let me know if we can meet to discuss this further.
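For anyone picking up the live cluster, a minimal sketch of how to confirm whether the Pending pods are missing the toleration (pod name taken from the output in comment #9; everything else is standard oc usage):

$ oc -n openshift-storage describe pod noobaa-core-0 | tail -n 10
# the Events section at the end is expected to show a FailedScheduling event
# mentioning the untolerated xyz taint if the toleration was not propagated

$ oc -n openshift-storage get pod noobaa-core-0 -o jsonpath='{.spec.tolerations}{"\n"}'
# if the xyz toleration is absent from this list, the operator has not pushed
# the placement from the storagecluster CR into the pod spec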
Versions:
----------
OCP - 4.11.0-0.nightly-2022-11-30-175707
ODF - quay.io/rhceph-dev/ocs-registry:4.11.4-4

Test steps:
-----------
1. Deploy MCG standalone
2. Taint the nodes:
   oc adm taint nodes compute-0 compute-1 compute-2 xyz=true:NoSchedule
3. Apply tolerations in the storage CR and the 4 subscriptions
4. Observe whether the pods respin automatically and come to Running state
5. For the pods that come to Running state, check the pod YAML to ensure it has the tolerations (see the sketch after this comment)
6. Respin all the pods just to check whether the toleration persists

Observation:
-------------
1. The noobaa-default-backing-store pod does not get respun automatically after updating the storage cluster and the subscriptions with the required toleration. However, a forced restart of the pod works fine: after the forced restart, the noobaa-default-backing-store pod comes up with the desired toleration.

** Content added to the storagecluster spec:

  placement:
    noobaa-standalone:
      tolerations:
      - effect: NoSchedule
        key: xyz
        operator: Equal
        value: "true"
      - effect: NoSchedule
        key: node.ocs.openshift.io/storage
        operator: Equal
        value: "true"

*** Content added to all 4 subs config:

  tolerations:
  - effect: NoSchedule
    key: xyz
    operator: Equal
    value: "true"

Moving the BZ to Verified state.
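For steps 5 and 6, one way to check the tolerations in the pod spec without dumping the full YAML (jsonpath sketch; works for any of the noobaa pods, noobaa-db-pg-0 used here as an example):

$ oc -n openshift-storage get pod noobaa-db-pg-0 \
    -o jsonpath='{range .spec.tolerations[*]}{.key}={.value}:{.effect}{"\n"}{end}'
# expected to include xyz=true:NoSchedule and
# node.ocs.openshift.io/storage=true:NoSchedule among the entries
# (default tolerations with operator Exists print with an empty value)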
New BZ to track the above-mentioned issue: https://bugzilla.redhat.com/show_bug.cgi?id=2149872
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Data Foundation 4.11.4 Bug Fix Update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2022:8877