Version Tested:
-----------------
OCS - ocs-operator.v4.8.3
OCP - 4.8.13

Build Used:
-------------
quay.io/rhceph-dev/ocs-registry:latest-stable-4.8.3

Steps Followed:
----------------
1. Tainted all masters and workers with a non-OCS taint
2. Added the tolerations in the subscription (for the operators), in the configmap rook-ceph-operator-config, and in the storagecluster (a sketch of the toleration shape is at the end of this comment)
3. Respun all pods one by one
4. MDS and OSD pods are stuck in Pending state with the description below:

  Warning  FailedScheduling  42s  default-scheduler  0/6 nodes are available: 6 node(s) had taint {xyz: true}, that the pod didn't tolerate.

5. ocs-operator is in CrashLoopBackOff (CLBO) state with the description below:

Events:
  Type     Reason          Age                    From               Message
  ----     ------          ----                   ----               -------
  Normal   Scheduled       9m48s                  default-scheduler  Successfully assigned openshift-storage/ocs-operator-85db4fb4df-8w5m9 to compute-2
  Normal   AddedInterface  9m46s                  multus             Add eth0 [10.131.0.77/23] from openshift-sdn
  Normal   Pulled          9m45s                  kubelet            Successfully pulled image "quay.io/rhceph-dev/ocs4-ocs-rhel8-operator@sha256:4ba021f1c9e544a8798f08870fa210838c2a715f309ed7b9b525e73f4b47ce21" in 553.69436ms
  Normal   Pulled          9m17s                  kubelet            Successfully pulled image "quay.io/rhceph-dev/ocs4-ocs-rhel8-operator@sha256:4ba021f1c9e544a8798f08870fa210838c2a715f309ed7b9b525e73f4b47ce21" in 579.155894ms
  Normal   Pulled          8m37s                  kubelet            Successfully pulled image "quay.io/rhceph-dev/ocs4-ocs-rhel8-operator@sha256:4ba021f1c9e544a8798f08870fa210838c2a715f309ed7b9b525e73f4b47ce21" in 556.637204ms
  Warning  BackOff         7m57s (x5 over 8m49s)  kubelet            Back-off restarting failed container
  Normal   Pulled          7m45s                  kubelet            Successfully pulled image "quay.io/rhceph-dev/ocs4-ocs-rhel8-operator@sha256:4ba021f1c9e544a8798f08870fa210838c2a715f309ed7b9b525e73f4b47ce21" in 582.280178ms
  Normal   Created         7m44s (x4 over 9m45s)  kubelet            Created container ocs-operator
  Normal   Started         7m44s (x4 over 9m45s)  kubelet            Started container ocs-operator
  Warning  ProbeError      7m39s (x2 over 9m39s)  kubelet            Readiness probe error: Get "http://10.131.0.77:8081/readyz": dial tcp 10.131.0.77:8081: connect: connection refused body:
  Warning  Unhealthy       7m39s (x2 over 9m39s)  kubelet            Readiness probe failed: Get "http://10.131.0.77:8081/readyz": dial tcp 10.131.0.77:8081: connect: connection refused
  Normal   Pulling         4m33s (x6 over 9m46s)  kubelet            Pulling image "quay.io/rhceph-dev/ocs4-ocs-rhel8-operator@sha256:4ba021f1c9e544a8798f08870fa210838c2a715f309ed7b9b525e73f4b47ce21"

Pods in Pending/CLBO state:
-----------------------------
oc get pods -n openshift-storage -o wide | grep -v Running
NAME                                                              READY   STATUS             RESTARTS   AGE     IP            NODE        NOMINATED NODE   READINESS GATES
ocs-operator-85db4fb4df-8w5m9                                     0/1     CrashLoopBackOff   5          7m27s   10.131.0.77   compute-2   <none>           <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-6bc47b7f7p6tv   0/2     Pending            0          4m57s   <none>        <none>      <none>           <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-844b45bfcnvfv   0/2     Pending            0          4m52s   <none>        <none>      <none>           <none>
rook-ceph-osd-0-84f5bfb967-986jr                                  0/2     Pending            0          3m43s   <none>        <none>      <none>           <none>
rook-ceph-osd-1-8d6855ff4-n6hn8                                   0/2     Pending            0          3m27s   <none>        <none>      <none>           <none>
rook-ceph-osd-2-5b4456f87c-66mdq                                  0/2     Pending            0          3m4s    <none>        <none>      <none>           <none>
rook-ceph-tools-54f69f496f-phx2q                                  0/1     Pending            0          88s     <none>        <none>      <none>           <none>

Have a live cluster if anyone wants to debug.

ocs-mustgather being uploaded: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-1999158/

With the above observations, moving the BZ to the Assigned state.
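For reference, the tolerations added in step 2 were of roughly the following shape. This is a minimal sketch only: it assumes the taint key is xyz with value "true" and effect NoSchedule (the effect is not captured in the scheduler message above), and the usual OCS/Rook configuration knobs are shown; the exact keys used on this cluster may differ.

  # Subscription (OLM) - tolerations for the operator pods
  spec:
    config:
      tolerations:
      - key: "xyz"
        operator: "Equal"
        value: "true"
        effect: "NoSchedule"

  # ConfigMap rook-ceph-operator-config - tolerations for the CSI pods
  data:
    CSI_PLUGIN_TOLERATIONS: |
      - key: xyz
        operator: Equal
        value: "true"
        effect: NoSchedule
    CSI_PROVISIONER_TOLERATIONS: |
      - key: xyz
        operator: Equal
        value: "true"
        effect: NoSchedule

  # StorageCluster - tolerations for the Ceph daemons (mon/mds/osd, etc.)
  spec:
    placement:
      all:
        tolerations:
        - key: xyz
          operator: Equal
          value: "true"
          effect: NoSchedule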
Yes, a live cluster is required to verify whether the fix is present in the current image. Please share.
Checked the cluster; it is using a build which doesn't have any 4.8.3 fixes.

[muagarwa@mudits-workstation ~]$ oc image info --filter-by-os=linux/amd64 registry-proxy.engineering.redhat.com/rh-osbs/ocs4-rook-ceph-rhel8-operator:v4.8.3-3 | grep upstream-vcs-ref
upstream-vcs-ref=ec9c00b94d74bf74ae2dd28027a35822bbb0c321

The last commit in that build is https://github.com/red-hat-storage/rook/commit/ec9c00b94d74bf74ae2dd28027a35822bbb0c321, which was for 4.8.2. The installed build doesn't have any new commits. Not sure if this is a build issue or whether the wrong build was used.
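For anyone re-checking this on the live cluster, one way to confirm which rook image (and hence which upstream commit) is actually running, rather than inspecting the registry tag, is roughly the following. This is a sketch; it assumes the default openshift-storage namespace and the rook-ceph-operator deployment name.

  # read the image reference from the running rook-ceph-operator deployment
  $ IMG=$(oc -n openshift-storage get deployment rook-ceph-operator \
      -o jsonpath='{.spec.template.spec.containers[0].image}')
  # then inspect that exact image for its upstream commit
  $ oc image info --filter-by-os=linux/amd64 "$IMG" | grep upstream-vcs-ref

The upstream-vcs-ref printed can then be compared against the commits expected in the 4.8.3 build.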
Tested version:
----------------
OCS - ocs-operator.v4.8.3
OCP - 4.8.13

$ oc image info --filter-by-os=linux/amd64 registry-proxy.engineering.redhat.com/rh-osbs/ocs4-rook-ceph-rhel8-operator:v4.8.3-11 | grep upstream-vcs-ref
upstream-vcs-ref=bae71a92621d74c76c9f085abdd72ef3720ac556

Test Steps:
------------
1. Tainted all nodes with a non-OCS taint (example commands are sketched at the end of this comment)
2. Edited the storagecluster, the configmap rook-ceph-operator-config, and the subscription to add the tolerations
3. Respun all the pods
4. Rebooted all nodes one by one
5. Added capacity via the UI

Observations:
--------------
No issues seen; all tolerations were intact after the pod respins and node reboots.
Add-capacity was successful and the new OSDs are up and running.

With the above observations, moving the BZ to the Verified state.
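For completeness, the taint and verification steps above map to commands of roughly this form. These are illustrative only: xyz=true:NoSchedule is an assumed taint key/value/effect, and <pod-name> is a placeholder.

  # step 1: taint all nodes with a non-OCS taint
  $ oc adm taint nodes --all xyz=true:NoSchedule
  # step 3: respin the openshift-storage pods
  $ oc delete pods --all -n openshift-storage
  # spot-check that the tolerations survived the respins and reboots
  $ oc get pod <pod-name> -n openshift-storage -o jsonpath='{.spec.tolerations}'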
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OCS 4.8.3 bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3881