Bug 2005970 - [4.8.z clone] Not able to add toleration for MDS pods via StorageCluster yaml
Summary: [4.8.z clone] Not able to add toleration for MDS pods via StorageCluster yaml
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: ocs-operator
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: OCS 4.8.3
Assignee: Subham Rai
QA Contact: Shrivaibavi Raghaventhiran
URL:
Whiteboard:
Depends On: 2005937
Blocks: 1999158
 
Reported: 2021-09-20 15:12 UTC by Mudit Agarwal
Modified: 2021-10-18 12:16 UTC
CC: 10 users

Fixed In Version: quay.io/rhceph-dev/ocs-registry:4.8.3-11
Doc Type: No Doc Update
Doc Text:
Clone Of: 2005937
Environment:
Last Closed: 2021-10-18 12:16:24 UTC
Embargoed:




Links
  Github red-hat-storage/ocs-operator pull 1350 (Merged): add check for PodAntiAffinity not nil (last updated 2021-09-29 06:29:13 UTC)
  Github red-hat-storage/ocs-operator pull 1354 (open): Bug 2005970: [release-4.8] add check for PodAntiAffinity not nil (last updated 2021-09-29 12:09:10 UTC)
  Red Hat Product Errata RHBA-2021:3881 (last updated 2021-10-18 12:16:44 UTC)

Comment 6 Shrivaibavi Raghaventhiran 2021-10-11 09:26:36 UTC
Version Tested:
-----------------
OCS - ocs-operator.v4.8.3
OCP - 4.8.13

Build Used:
-------------
quay.io/rhceph-dev/ocs-registry:latest-stable-4.8.3

Steps Followed:
----------------
1. Tainted all masters and workers with a non-OCS taint
2. Added the tolerations in the Subscription (for the operators), the rook-ceph-operator-config ConfigMap, and the StorageCluster (a sketch of the StorageCluster toleration follows the events below)
3. Respinned all pods one by one
4. MDS and OSD pods stuck in Pending state with the events below:

Warning  FailedScheduling  42s   default-scheduler  0/6 nodes are available: 6 node(s) had taint {xyz: true}, that the pod didn't tolerate.

5. ocs-operator pod in CrashLoopBackOff (CLBO) state with the events below:
Events:
  Type     Reason          Age                    From               Message
  ----     ------          ----                   ----               -------
  Normal   Scheduled       9m48s                  default-scheduler  Successfully assigned openshift-storage/ocs-operator-85db4fb4df-8w5m9 to compute-2
  Normal   AddedInterface  9m46s                  multus             Add eth0 [10.131.0.77/23] from openshift-sdn
  Normal   Pulled          9m45s                  kubelet            Successfully pulled image "quay.io/rhceph-dev/ocs4-ocs-rhel8-operator@sha256:4ba021f1c9e544a8798f08870fa210838c2a715f309ed7b9b525e73f4b47ce21" in 553.69436ms
  Normal   Pulled          9m17s                  kubelet            Successfully pulled image "quay.io/rhceph-dev/ocs4-ocs-rhel8-operator@sha256:4ba021f1c9e544a8798f08870fa210838c2a715f309ed7b9b525e73f4b47ce21" in 579.155894ms
  Normal   Pulled          8m37s                  kubelet            Successfully pulled image "quay.io/rhceph-dev/ocs4-ocs-rhel8-operator@sha256:4ba021f1c9e544a8798f08870fa210838c2a715f309ed7b9b525e73f4b47ce21" in 556.637204ms
  Warning  BackOff         7m57s (x5 over 8m49s)  kubelet            Back-off restarting failed container
  Normal   Pulled          7m45s                  kubelet            Successfully pulled image "quay.io/rhceph-dev/ocs4-ocs-rhel8-operator@sha256:4ba021f1c9e544a8798f08870fa210838c2a715f309ed7b9b525e73f4b47ce21" in 582.280178ms
  Normal   Created         7m44s (x4 over 9m45s)  kubelet            Created container ocs-operator
  Normal   Started         7m44s (x4 over 9m45s)  kubelet            Started container ocs-operator
  Warning  ProbeError      7m39s (x2 over 9m39s)  kubelet            Readiness probe error: Get "http://10.131.0.77:8081/readyz": dial tcp 10.131.0.77:8081: connect: connection refused
body:
  Warning  Unhealthy  7m39s (x2 over 9m39s)  kubelet  Readiness probe failed: Get "http://10.131.0.77:8081/readyz": dial tcp 10.131.0.77:8081: connect: connection refused
  Normal   Pulling    4m33s (x6 over 9m46s)  kubelet  Pulling image "quay.io/rhceph-dev/ocs4-ocs-rhel8-operator@sha256:4ba021f1c9e544a8798f08870fa210838c2a715f309ed7b9b525e73f4b47ce21"
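
For reference, below is a minimal sketch of the kind of toleration added to the StorageCluster in step 2. Only the taint key "xyz" is visible in the scheduler event above; the value, the NoSchedule effect, and the placement keys (all, mds) are assumptions for illustration rather than values copied from this cluster:

apiVersion: ocs.openshift.io/v1
kind: StorageCluster
metadata:
  name: ocs-storagecluster
  namespace: openshift-storage
spec:
  placement:
    all:                         # applied to all OCS-managed pods
      tolerations:
      - key: "xyz"               # assumed taint key, taken from the scheduler event above
        operator: "Equal"
        value: "true"
        effect: "NoSchedule"     # assumed effect
    mds:                         # per-component override for the MDS pods
      tolerations:
      - key: "xyz"
        operator: "Equal"
        value: "true"
        effect: "NoSchedule"

Merging user-supplied placement like this is what the linked PRs address, by adding a check that PodAntiAffinity is not nil.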

Pods in Pending/CLBO state:
-----------------------------
 oc get pods -n openshift-storage -o wide | grep -v Running
NAME                                                              READY   STATUS             RESTARTS   AGE     IP            NODE        NOMINATED NODE   READINESS GATES
ocs-operator-85db4fb4df-8w5m9                                     0/1     CrashLoopBackOff   5          7m27s   10.131.0.77   compute-2   <none>           <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-6bc47b7f7p6tv   0/2     Pending            0          4m57s   <none>        <none>      <none>           <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-844b45bfcnvfv   0/2     Pending            0          4m52s   <none>        <none>      <none>           <none>
rook-ceph-osd-0-84f5bfb967-986jr                                  0/2     Pending            0          3m43s   <none>        <none>      <none>           <none>
rook-ceph-osd-1-8d6855ff4-n6hn8                                   0/2     Pending            0          3m27s   <none>        <none>      <none>           <none>
rook-ceph-osd-2-5b4456f87c-66mdq                                  0/2     Pending            0          3m4s    <none>        <none>      <none>           <none>
rook-ceph-tools-54f69f496f-phx2q                                  0/1     Pending            0          88s     <none>        <none>      <none>           <none>


A live cluster is available if anyone wants to debug.

ocs must-gather is being uploaded to: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-1999158/

With the above observations, moving the BZ to Assigned state.

Comment 7 Mudit Agarwal 2021-10-11 10:02:06 UTC
Yes, a live cluster is required to verify whether the fix is present in the current image. Please share it.

Comment 8 Mudit Agarwal 2021-10-11 10:59:38 UTC
Checked the cluster; it is using a build which doesn't have any of the 4.8.3 fixes:

[muagarwa@mudits-workstation ~]$ oc image info --filter-by-os=linux/amd64 registry-proxy.engineering.redhat.com/rh-osbs/ocs4-rook-ceph-rhel8-operator:v4.8.3-3|grep upstream-vcs-ref
               upstream-vcs-ref=ec9c00b94d74bf74ae2dd28027a35822bbb0c321

The last commit is https://github.com/red-hat-storage/rook/commit/ec9c00b94d74bf74ae2dd28027a35822bbb0c321, which was for 4.8.2.
The installed build doesn't include any newer commits; not sure whether this is a build issue or the wrong build was used.

Comment 9 Shrivaibavi Raghaventhiran 2021-10-12 14:52:16 UTC
Tested version:
----------------
OCS - ocs-operator.v4.8.3
OCP - 4.8.13

$ oc image info --filter-by-os=linux/amd64 registry-proxy.engineering.redhat.com/rh-osbs/ocs4-rook-ceph-rhel8-operator:v4.8.3-11|grep upstream-vcs-ref
               upstream-vcs-ref=bae71a92621d74c76c9f085abdd72ef3720ac556

Test Steps:
------------
1. Tainted all nodes with a non-OCS taint
2. Edited the StorageCluster, the rook-ceph-operator-config ConfigMap, and the Subscription to add the tolerations (see the sketch after this list)
3. Respinned all the pods
4. Rebooted all nodes one by one
5. Added capacity via the UI
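
For reference, a corresponding sketch of the ConfigMap and Subscription edits from step 2, using the same assumed taint key/value/effect as in comment 6. The CSI_PLUGIN_TOLERATIONS / CSI_PROVISIONER_TOLERATIONS keys are the usual Rook CSI toleration settings; the Subscription name is a placeholder and only the toleration-related fields are shown:

apiVersion: v1
kind: ConfigMap
metadata:
  name: rook-ceph-operator-config
  namespace: openshift-storage
data:
  CSI_PLUGIN_TOLERATIONS: |        # tolerations for the CSI plugin (daemonset) pods
    - key: "xyz"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"
  CSI_PROVISIONER_TOLERATIONS: |   # tolerations for the CSI provisioner pods
    - key: "xyz"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: ocs-operator               # placeholder; use the actual Subscription name in openshift-storage
  namespace: openshift-storage
spec:
  config:
    tolerations:
    - key: "xyz"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"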

Observations:
--------------
No issues seen; all tolerations were intact after the pod respins and node reboots.

Add capacity was successful and the new OSDs are up and running.

With the above observations, moving the BZ to Verified state.

Comment 13 errata-xmlrpc 2021-10-18 12:16:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OCS 4.8.3 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3881

