Bug 2005970 - [4.8.z clone] Not able to add toleration for MDS pods via StorageCluster yaml
Summary: [4.8.z clone] Not able to add toleration for MDS pods via StorageCluster yaml
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: ocs-operator
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: OCS 4.8.3
Assignee: Subham Rai
QA Contact: Shrivaibavi Raghaventhiran
URL:
Whiteboard:
Depends On: 2005937
Blocks: 1999158
 
Reported: 2021-09-20 15:12 UTC by Mudit Agarwal
Modified: 2021-10-18 12:16 UTC
CC: 10 users

Fixed In Version: quay.io/rhceph-dev/ocs-registry:4.8.3-11
Doc Type: No Doc Update
Doc Text:
Clone Of: 2005937
Environment:
Last Closed: 2021-10-18 12:16:24 UTC
Embargoed:




Links
  Github red-hat-storage/ocs-operator pull 1350 (Merged): add check for PodAntiAffinity not nil (last updated 2021-09-29 06:29:13 UTC)
  Github red-hat-storage/ocs-operator pull 1354 (open): Bug 2005970: [release-4.8] add check for PodAntiAffinity not nil (last updated 2021-09-29 12:09:10 UTC)
  Red Hat Product Errata RHBA-2021:3881 (last updated 2021-10-18 12:16:44 UTC)

Comment 6 Shrivaibavi Raghaventhiran 2021-10-11 09:26:36 UTC
Version Tested:
-----------------
OCS - ocs-operator.v4.8.3
OCP - 4.8.13

Build Used:
-------------
quay.io/rhceph-dev/ocs-registry:latest-stable-4.8.3

Steps Followed:
----------------
1. Tainted all masters and workers with a non-OCS taint
2. Added the tolerations in the Subscription (for the operators), the rook-ceph-operator-config ConfigMap, and the StorageCluster (a sketch of the StorageCluster toleration follows the events below)
3. Respinned all pods one by one
4. MDS and OSD pods stuck in Pending state with the events below:

Warning  FailedScheduling  42s   default-scheduler  0/6 nodes are available: 6 node(s) had taint {xyz: true}, that the pod didn't tolerate.

5. ocs-operator pod in CrashLoopBackOff (CLBO) state with the events below:
Events:
  Type     Reason          Age                    From               Message
  ----     ------          ----                   ----               -------
  Normal   Scheduled       9m48s                  default-scheduler  Successfully assigned openshift-storage/ocs-operator-85db4fb4df-8w5m9 to compute-2
  Normal   AddedInterface  9m46s                  multus             Add eth0 [10.131.0.77/23] from openshift-sdn
  Normal   Pulled          9m45s                  kubelet            Successfully pulled image "quay.io/rhceph-dev/ocs4-ocs-rhel8-operator@sha256:4ba021f1c9e544a8798f08870fa210838c2a715f309ed7b9b525e73f4b47ce21" in 553.69436ms
  Normal   Pulled          9m17s                  kubelet            Successfully pulled image "quay.io/rhceph-dev/ocs4-ocs-rhel8-operator@sha256:4ba021f1c9e544a8798f08870fa210838c2a715f309ed7b9b525e73f4b47ce21" in 579.155894ms
  Normal   Pulled          8m37s                  kubelet            Successfully pulled image "quay.io/rhceph-dev/ocs4-ocs-rhel8-operator@sha256:4ba021f1c9e544a8798f08870fa210838c2a715f309ed7b9b525e73f4b47ce21" in 556.637204ms
  Warning  BackOff         7m57s (x5 over 8m49s)  kubelet            Back-off restarting failed container
  Normal   Pulled          7m45s                  kubelet            Successfully pulled image "quay.io/rhceph-dev/ocs4-ocs-rhel8-operator@sha256:4ba021f1c9e544a8798f08870fa210838c2a715f309ed7b9b525e73f4b47ce21" in 582.280178ms
  Normal   Created         7m44s (x4 over 9m45s)  kubelet            Created container ocs-operator
  Normal   Started         7m44s (x4 over 9m45s)  kubelet            Started container ocs-operator
  Warning  ProbeError      7m39s (x2 over 9m39s)  kubelet            Readiness probe error: Get "http://10.131.0.77:8081/readyz": dial tcp 10.131.0.77:8081: connect: connection refused
body:
  Warning  Unhealthy  7m39s (x2 over 9m39s)  kubelet  Readiness probe failed: Get "http://10.131.0.77:8081/readyz": dial tcp 10.131.0.77:8081: connect: connection refused
  Normal   Pulling    4m33s (x6 over 9m46s)  kubelet  Pulling image "quay.io/rhceph-dev/ocs4-ocs-rhel8-operator@sha256:4ba021f1c9e544a8798f08870fa210838c2a715f309ed7b9b525e73f4b47ce21"
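
For reference, below is a minimal sketch of the kind of toleration added to the StorageCluster in step 2. Only the taint key "xyz" is visible in the scheduler event above; the value, the NoSchedule effect, and the placement keys (all, mds) are assumptions for illustration rather than values copied from this cluster:

apiVersion: ocs.openshift.io/v1
kind: StorageCluster
metadata:
  name: ocs-storagecluster
  namespace: openshift-storage
spec:
  placement:
    all:                         # applied to all OCS-managed pods
      tolerations:
      - key: "xyz"               # assumed taint key, taken from the scheduler event above
        operator: "Equal"
        value: "true"
        effect: "NoSchedule"     # assumed effect
    mds:                         # per-component override for the MDS pods
      tolerations:
      - key: "xyz"
        operator: "Equal"
        value: "true"
        effect: "NoSchedule"

Merging user-supplied placement like this is what the linked PRs address, by adding a check that PodAntiAffinity is not nil.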

Pods in Pending/CLBO state:
-----------------------------
 oc get pods -n openshift-storage -o wide | grep -v Running
NAME                                                              READY   STATUS             RESTARTS   AGE     IP            NODE        NOMINATED NODE   READINESS GATES
ocs-operator-85db4fb4df-8w5m9                                     0/1     CrashLoopBackOff   5          7m27s   10.131.0.77   compute-2   <none>           <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-6bc47b7f7p6tv   0/2     Pending            0          4m57s   <none>        <none>      <none>           <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-844b45bfcnvfv   0/2     Pending            0          4m52s   <none>        <none>      <none>           <none>
rook-ceph-osd-0-84f5bfb967-986jr                                  0/2     Pending            0          3m43s   <none>        <none>      <none>           <none>
rook-ceph-osd-1-8d6855ff4-n6hn8                                   0/2     Pending            0          3m27s   <none>        <none>      <none>           <none>
rook-ceph-osd-2-5b4456f87c-66mdq                                  0/2     Pending            0          3m4s    <none>        <none>      <none>           <none>
rook-ceph-tools-54f69f496f-phx2q                                  0/1     Pending            0          88s     <none>        <none>      <none>           <none>


A live cluster is available if anyone wants to debug.

ocs must-gather is being uploaded to: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-1999158/

With the above observations, moving the BZ to Assigned state.

Comment 7 Mudit Agarwal 2021-10-11 10:02:06 UTC
Yes, a live cluster is required to verify whether the fix is present in the current image. Please share it.

Comment 8 Mudit Agarwal 2021-10-11 10:59:38 UTC
Checked the cluster; it is using a build which doesn't have any of the 4.8.3 fixes:

[muagarwa@mudits-workstation ~]$ oc image info --filter-by-os=linux/amd64 registry-proxy.engineering.redhat.com/rh-osbs/ocs4-rook-ceph-rhel8-operator:v4.8.3-3|grep upstream-vcs-ref
               upstream-vcs-ref=ec9c00b94d74bf74ae2dd28027a35822bbb0c321

The last commit is https://github.com/red-hat-storage/rook/commit/ec9c00b94d74bf74ae2dd28027a35822bbb0c321, which was for 4.8.2.
The installed build doesn't include any newer commits; not sure whether this is a build issue or the wrong build was used.

Comment 9 Shrivaibavi Raghaventhiran 2021-10-12 14:52:16 UTC
Tested version:
----------------
OCS - ocs-operator.v4.8.3
OCP - 4.8.13

$ oc image info --filter-by-os=linux/amd64 registry-proxy.engineering.redhat.com/rh-osbs/ocs4-rook-ceph-rhel8-operator:v4.8.3-11|grep upstream-vcs-ref
               upstream-vcs-ref=bae71a92621d74c76c9f085abdd72ef3720ac556

Test Steps:
------------
1. Tainted all nodes with a non-OCS taint
2. Edited the StorageCluster, the rook-ceph-operator-config ConfigMap, and the Subscription to add the tolerations (see the sketch after this list)
3. Respinned all the pods
4. Rebooted all nodes one by one
5. Added capacity via the UI
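
For reference, a corresponding sketch of the ConfigMap and Subscription edits from step 2, using the same assumed taint key/value/effect as in comment 6. The CSI_PLUGIN_TOLERATIONS / CSI_PROVISIONER_TOLERATIONS keys are the usual Rook CSI toleration settings; the Subscription name is a placeholder and only the toleration-related fields are shown:

apiVersion: v1
kind: ConfigMap
metadata:
  name: rook-ceph-operator-config
  namespace: openshift-storage
data:
  CSI_PLUGIN_TOLERATIONS: |        # tolerations for the CSI plugin (daemonset) pods
    - key: "xyz"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"
  CSI_PROVISIONER_TOLERATIONS: |   # tolerations for the CSI provisioner pods
    - key: "xyz"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: ocs-operator               # placeholder; use the actual Subscription name in openshift-storage
  namespace: openshift-storage
spec:
  config:
    tolerations:
    - key: "xyz"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"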

Observations:
--------------
No issues seen; all tolerations were intact after the pod respins and node reboots.

Add capacity was successful and the new OSDs are up and running.

With the above observations, moving the BZ to Verified state.

Comment 13 errata-xmlrpc 2021-10-18 12:16:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OCS 4.8.3 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3881

