Description of problem (please be as detailed as possible and provide log snippets):

[Arbiter] When both of the nodes hosting rgw pods are down, the rgw service is unavailable.

Version of all relevant components (if applicable):
OCP version: 4.7.0-0.nightly-2021-04-03-065651
OCS version: ocs-operator.v4.7.0-339.ci

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
2

Can this issue be reproduced?
Yes

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Deploy OCP 4.7 over BM UPI
2. Deploy OCS with arbiter enabled
3. Power off the nodes hosting the rgw pods

Actual results:

rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-654789blvkjc   0/2   Pending       0   121m    <none>        <none>                    <none>   <none>
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-654789btg9sr   2/2   Terminating   0   23h     10.131.0.24   argo007.ceph.redhat.com   <none>   <none>
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-8445b44cgr7d   2/2   Terminating   0   6h49m   10.129.2.9    argo006.ceph.redhat.com   <none>   <none>
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-8445b44pqhbz   0/2   Pending       0   149m    <none>        <none>                    <none>   <none>

Expected results:
The rgw pods should be rescheduled to another node in the same zone.

Additional info:

# snippet from describe of rgw-pod
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  121m  default-scheduler  0/8 nodes are available: 1 node(s) didn't match Pod's node affinity, 2 node(s) didn't match pod affinity/anti-affinity, 2 node(s) didn't match pod anti-affinity rules, 2 node(s) had taint {node.kubernetes.io/unreachable: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
  Warning  FailedScheduling  121m  default-scheduler  0/8 nodes are available: 1 node(s) didn't match Pod's node affinity, 2 node(s) didn't match pod affinity/anti-affinity, 2 node(s) didn't match pod anti-affinity rules, 2 node(s) had taint {node.kubernetes.io/unreachable: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
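As a minimal sketch of the commands used around these steps (assuming the default openshift-storage namespace and the Rook label app=rook-ceph-rgw on the rgw pods):

```
# Identify which nodes currently host the rgw pods
oc get pods -n openshift-storage -l app=rook-ceph-rgw -o wide

# After powering those nodes off, watch the replacement pods and the
# scheduling failures they hit
oc get pods -n openshift-storage -l app=rook-ceph-rgw -o wide -w
oc get events -n openshift-storage --field-selector reason=FailedScheduling
```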
@Pratik A couple questions:

1. The repro steps say you powered down one node, but the results look like you powered down two nodes. I'm still investigating the pod affinity, but just want to be clear on the repro.
2. If you start the nodes again, rgw pods start, right? This doesn't seem like a blocker unless the rgw pods don't start up again when the nodes come back.
I see now: the required anti-affinity that keeps the rgw pods in different zones is preventing a new rgw pod from starting in the same zone while the failed rgw pod is still terminating. We need to add the rgw pods to the list of pods to be force deleted, in addition to this PR: https://github.com/rook/rook/pull/7266

@Rohan could you take a look? Honestly I don't see why this would be a blocker for a tech preview feature, but it should be a small, low-risk change to get into 4.7 anyway.
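For reference, a minimal sketch (using only the deployment and pod names already shown in this report) of how to look at the anti-affinity in question, plus a manual stop-gap to unblock scheduling while the operator-side fix is pending:

```
# Inspect the required podAntiAffinity that keeps the two rgw pods of one
# object store in different zones
oc -n openshift-storage get deployment \
  rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a \
  -o jsonpath='{.spec.template.spec.affinity.podAntiAffinity}'

# Manual workaround only: force delete the rgw pod stuck in Terminating so
# the Pending replacement can be scheduled (the proper fix is for the
# operator to do this automatically)
oc -n openshift-storage delete pod <stuck-terminating-rgw-pod> \
  --grace-period=0 --force
```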
@nberry The MDS labels are already in the code that removes pods stuck in the terminating state, so it should not cause a similar issue.

The upstream PR for the RGW fix: https://github.com/rook/rook/pull/7537
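As a quick cross-check, the pods in question can be listed by the usual Rook app labels; note that the label values below (app=rook-ceph-mds, app=rook-ceph-rgw) are an assumption based on standard Rook naming, not something confirmed in this bug:

```
# List mds and rgw pods by their (assumed) Rook app labels
oc get pods -n openshift-storage -l 'app in (rook-ceph-mds,rook-ceph-rgw)' \
  -o wide --show-labels
```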
I'm trying to verify the problem with:

- OCP: 4.7.0-0.nightly-2021-04-15-110345
- LSO 4.7.0-202104030128.p0
- OCS 4.7.0-353.ci latest-stable-47 (4.7.0-rc5)

So I identified the 2 nodes hosting rgw pods:

```
$ oc get pods -n openshift-storage -o wide | grep -i rgw
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-86df96847fnr   2/2   Running   0   22h   10.128.4.25   compute-5   <none>   <none>
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-85b94c8wl8b6   2/2   Running   0   22h   10.128.2.13   compute-0   <none>   <none>
```

and shut them down via the vSphere console (so that vSphere reports them as powered off) at 16:50 CEST.

After some time (about 6 minutes), the affected rgw pods went into Terminating state:

```
$ oc get pods -n openshift-storage -o wide | grep -i rgw
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-86df96847fnr   2/2   Terminating   0   22h   10.128.4.25   compute-5   <none>   <none>
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-86df9689xd5q   0/2   Pending       0   83s   <none>        <none>      <none>   <none>
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-85b94c8v94l5   0/2   Pending       0   93s   <none>        <none>      <none>   <none>
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-85b94c8wl8b6   2/2   Terminating   0   22h   10.128.2.13   compute-0   <none>   <none>
```

and it stays this way. Looking into the output of `oc describe pod/rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-86df9689xd5q` I see:

```
Events:
  Type     Reason            Age  From               Message
  ----     ------            ---- ----               -------
  Warning  FailedScheduling  11m  default-scheduler  0/9 nodes are available: 2 node(s) had taint {node.kubernetes.io/unreachable: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 4 node(s) didn't match pod affinity/anti-affinity, 4 node(s) didn't match pod anti-affinity rules.
  Warning  FailedScheduling  11m  default-scheduler  0/9 nodes are available: 2 node(s) had taint {node.kubernetes.io/unreachable: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 4 node(s) didn't match pod affinity/anti-affinity, 4 node(s) didn't match pod anti-affinity rules.
```

Moving the bug back to ASSIGNED.

Additional info
---------------

```
$ oc get nodes -L topology.kubernetes.io/zone
NAME              STATUS     ROLES    AGE     VERSION           ZONE
compute-0         NotReady   worker   4d23h   v1.20.0+c8905da   data-a
compute-1         Ready      worker   4d23h   v1.20.0+c8905da   data-a
compute-2         Ready      worker   4d23h   v1.20.0+c8905da   data-a
compute-3         Ready      worker   4d23h   v1.20.0+c8905da   data-b
compute-4         Ready      worker   4d23h   v1.20.0+c8905da   data-b
compute-5         NotReady   worker   4d23h   v1.20.0+c8905da   data-b
control-plane-0   Ready      master   4d23h   v1.20.0+c8905da   data-a
control-plane-1   Ready      master   4d23h   v1.20.0+c8905da   data-b
control-plane-2   Ready      master   4d23h   v1.20.0+c8905da   foo-arbiter
```
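A small sketch of commands that can be run in parallel to capture the same timeline (same namespace and zone label as above; the taint column is just to confirm the node.kubernetes.io/unreachable taint the scheduler reports):

```
# Watch node readiness and the rgw pods while the nodes are powered off
# (run in separate terminals)
oc get nodes -L topology.kubernetes.io/zone -w
oc get pods -n openshift-storage -o wide -w | grep -i rgw

# Confirm the taints on the downed nodes
oc get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
```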
I powered the nodes back on at 17:34 CEST, and then waited for all pods to be running again. But now I see that the rgw pods are running on different nodes compared to the original state:

```
$ oc get pods -o wide -n openshift-storage | grep rgw
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-86df9689xd5q   2/2   Running   0   65m   10.131.0.40   compute-4   <none>   <none>
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-85b94c8v94l5   2/2   Running   0   65m   10.131.3.30   compute-1   <none>   <none>
```

Looking at the ceph_rgw_metadata prometheus query, it seems that the startup of the powered-down nodes indeed coincides with the start of the new rgw pods (on different nodes though). I have no idea why this is happening.
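A rough way to cross-check that timing without Prometheus is to compare the pods' recorded start time with the nodes' Ready transition (pod and node names taken from the output above):

```
# Start time of the replacement rgw pods
for p in rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-86df9689xd5q \
         rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-85b94c8v94l5; do
  oc -n openshift-storage get pod "$p" \
    -o jsonpath='{.metadata.name}{"\t"}{.status.startTime}{"\n"}'
done

# When the previously powered-off nodes last became Ready
for n in compute-0 compute-5; do
  oc get node "$n" \
    -o jsonpath='{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].lastTransitionTime}{"\n"}'
done
```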
Additional details for comment 13 (from oc describe pod/rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-86df9689xd5q):

```
Type     Reason            Age  From               Message
----     ------            ---- ----               -------
Warning  FailedScheduling  77m  default-scheduler  0/9 nodes are available: 2 node(s) had taint {node.kubernetes.io/unreachable: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 4 node(s) didn't match pod affinity/anti-affinity, 4 node(s) didn't match pod anti-affinity rules.
Warning  FailedScheduling  77m  default-scheduler  0/9 nodes are available: 2 node(s) had taint {node.kubernetes.io/unreachable: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 4 node(s) didn't match pod affinity/anti-affinity, 4 node(s) didn't match pod anti-affinity rules.
Normal   Scheduled         37m  default-scheduler  Successfully assigned openshift-storage/rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-86df9689xd5q to compute-4
Normal   AddedInterface    37m  multus             Add eth0 [10.131.0.40/23]
Normal   Pulled            37m  kubelet            Container image "quay.io/rhceph-dev/rhceph@sha256:d2e99edf733960244256ad82e761acffe9f09e76749bb769469b4b929b25c509" already present on machine
Normal   Created           37m  kubelet            Created container chown-container-data-dir
Normal   Started           37m  kubelet            Started container chown-container-data-dir
Normal   Pulled            37m  kubelet            Container image "quay.io/rhceph-dev/rhceph@sha256:d2e99edf733960244256ad82e761acffe9f09e76749bb769469b4b929b25c509" already present on machine
Normal   Created           37m  kubelet            Created container rgw
Normal   Started           37m  kubelet            Started container rgw
Normal   Pulled            37m  kubelet            Container image "quay.io/rhceph-dev/rhceph@sha256:d2e99edf733960244256ad82e761acffe9f09e76749bb769469b4b929b25c509" already present on machine
Normal   Created           37m  kubelet            Created container log-collector
Normal   Started           37m  kubelet            Started container log-collector
```

Here we can see further evidence that the pod only started after the powered-off nodes were started again.
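The same timeline can also be pulled from the event stream, sorted by timestamp (a minimal sketch):

```
# Event history for the rgw pods, ordered by time
oc get events -n openshift-storage --sort-by=.lastTimestamp | grep -i rgw
```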
The operator log does not have any lines like the following for the rgw or any other pods:

"force deleting pod <pod-name> that appears to be stuck terminating"

It seems nodes 0 and 5 were down long enough that we should have seen this message for all the pods on those nodes. If the terminating pods are not force deleted by the operator, the new rgw pods will stay stuck in the Pending state and the old ones will stay in Terminating. Then, as soon as the nodes come back up, the pods will finally be scheduled.

@Martin Did you see all the other pods on those nodes (besides rgw) also stuck while the nodes were down? If so, the conditions for the node being down must not have been detected by Rook. But if the rgw pods were the only ones stuck, then this BZ still needs to be fixed.
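A minimal sketch of how to check for that message and for the node state the operator would be reacting to (operator deployment and node names as used in this bug):

```
# Look for the force-delete message quoted above in the operator log
oc logs -n openshift-storage deploy/rook-ceph-operator \
  | grep -i 'stuck terminating'

# Cross-check what the API reports for the downed nodes (Ready condition
# and unreachable taint), which is the state the operator acts on
oc get node compute-0 compute-5 -o wide
oc describe node compute-0 | grep -i -A2 taint
```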
I would also recommend this *not* be a blocker for 4.7 and that we move it to 4.8. The odds of both nodes where the rgw pods are running going down at the same time are low.
I tested a build of https://github.com/openshift/rook/tree/release-4.7 on an AWS dev cluster to check whether rgw pods stuck in the Terminating state are force deleted on node failure, and it works. Confirmed in the rook-ceph-operator logs:

`2021-04-20 23:48:51.705697 I | op-k8sutil: pod "rook-ceph-rgw-my-store-a-6f8ff8698-92xvf" deletion succeeded`

@Martin, looks like the operator didn't detect the node was down in your case.
(In reply to Travis Nielsen from comment #15)
> The operator log does not have any lines like the following for the rgw or
> any other pods:
>
> "force deleting pod <pod-name> that appears to be stuck terminating"
>
> It seems nodes 0 and 5 were down long enough that we should have seen this
> message for all the pods on those nodes.

Agreed.

> If the terminating pods are not force deleted by the operator, the new rgw
> pods will stay stuck in the Pending state and the old ones will stay in
> Terminating. Then, as soon as the nodes come back up, the pods will finally
> be scheduled.
>
> @Martin Did you see all the other pods on those nodes (besides rgw) also
> stuck while the nodes were down?

We can check the must-gather tarball referenced in comment 12 to answer that, since I fetched it at about 17:25-17:30, after reproducing the problem but before starting the nodes again (shutdown happened at 16:50 CEST, startup at 17:34 CEST).

Let's cd into the directory with the k8s yaml dump of all OCS pods:

```
$ tar --force-local -xf mbukatov-0415a_20210415T151318.must-gather.2021-04-20T17\:10+02\:00.tar.gz
$ cd ./mbukatov-0415a_20210415T151318.must-gather-ocs.2021-04-20T17:10+02:00/quay-io-rhceph-dev-ocs-must-gather-sha256-0c61617e2ae8d332167ab4bc86dfa9e7b74e002a8c344eee25700ea2069df6a0/namespaces/openshift-storage/pods
$ ls
compute-1-debug  compute-2-debug  compute-3-debug  compute-4-debug
csi-cephfsplugin-8jbp6  csi-cephfsplugin-979mx  csi-cephfsplugin-lsd49  csi-cephfsplugin-n9c46
csi-cephfsplugin-provisioner-f975d886c-gd6cf  csi-cephfsplugin-provisioner-f975d886c-vbzbn
csi-cephfsplugin-s986p  csi-cephfsplugin-vchr7
csi-rbdplugin-5brdd  csi-rbdplugin-822t7  csi-rbdplugin-f6x48  csi-rbdplugin-fbbgq
csi-rbdplugin-hwh42  csi-rbdplugin-ksskf
csi-rbdplugin-provisioner-6bbf798bfb-5gg7v  csi-rbdplugin-provisioner-6bbf798bfb-d2zlp
csi-rbdplugin-provisioner-6bbf798bfb-dtvdh  csi-rbdplugin-provisioner-6bbf798bfb-sp6jx
must-gather-xv9cj-helper  noobaa-core-0  noobaa-db-pg-0
noobaa-endpoint-56c55b887b-pzzks  noobaa-operator-58569bcccf-wqsdk
ocs-metrics-exporter-69845d9ffb-pnrns  ocs-operator-6c4795f48d-47jw6
rook-ceph-crashcollector-compute-0-7b7464676b-2fr2k  rook-ceph-crashcollector-compute-0-7b7464676b-v89kn
rook-ceph-crashcollector-compute-1-7b56695857-k6fm8  rook-ceph-crashcollector-compute-2-57df775c68-qhdcf
rook-ceph-crashcollector-compute-3-8898fd575-br5f5   rook-ceph-crashcollector-compute-4-689fbb5b78-lh4mt
rook-ceph-crashcollector-compute-5-5d65d696f-t7bnh   rook-ceph-crashcollector-compute-5-5d65d696f-xn92c
rook-ceph-crashcollector-control-plane-2-644b688bc6-b9zfd
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-584cf59bh2scw
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-567b98d8ggpmt
rook-ceph-mgr-a-7f9b5b666-v8k8h
rook-ceph-mon-a-5685b776b6-qmtnx  rook-ceph-mon-b-fc6bb97b5-gqbv5  rook-ceph-mon-b-fc6bb97b5-j4s2n
rook-ceph-mon-c-756c8487d6-l7ppf  rook-ceph-mon-d-67b5b987-6j442   rook-ceph-mon-d-67b5b987-f4tvc
rook-ceph-mon-e-76bcc96ff8-69sc9  rook-ceph-operator-569468bdcc-nxhln
rook-ceph-osd-0-844c4b7646-mpg5f  rook-ceph-osd-1-7cbf8b8f49-9wvqp
rook-ceph-osd-2-5568d584dc-rm2tx  rook-ceph-osd-3-545686578c-p7bdx
rook-ceph-osd-prepare-ocs-deviceset-ocsvolumeset-0-data-0qmwgmx
rook-ceph-osd-prepare-ocs-deviceset-ocsvolumeset-1-data-06vbtnx
rook-ceph-osd-prepare-ocs-deviceset-ocsvolumeset-2-data-09ks9gf
rook-ceph-osd-prepare-ocs-deviceset-ocsvolumeset-3-data-0fsd62j
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-86df96847fnr
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-86df9689xd5q
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-85b94c8v94l5
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-85b94c8wl8b6
rook-ceph-tools-69c5449589-wn84v
```

The following pods were allegedly running on one of the shutdown nodes (we can see the expected rgw pod there):

```
$ rg 'nodeName: compute-0'
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-85b94c8wl8b6/rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-85b94c8wl8b6.yaml
639:    nodeName: compute-0

csi-cephfsplugin-979mx/csi-cephfsplugin-979mx.yaml
475:    nodeName: compute-0

rook-ceph-crashcollector-compute-0-7b7464676b-2fr2k/rook-ceph-crashcollector-compute-0-7b7464676b-2fr2k.yaml
492:    nodeName: compute-0

rook-ceph-mon-d-67b5b987-f4tvc/rook-ceph-mon-d-67b5b987-f4tvc.yaml
878:    nodeName: compute-0

csi-rbdplugin-hwh42/csi-rbdplugin-hwh42.yaml
476:    nodeName: compute-0

noobaa-core-0/noobaa-core-0.yaml
487:    nodeName: compute-0

csi-rbdplugin-provisioner-6bbf798bfb-dtvdh/csi-rbdplugin-provisioner-6bbf798bfb-dtvdh.yaml
547:    nodeName: compute-0
```

And when we check the status of these pods, we see that all OCS pods, not just rgw, were reported as running even though the node was down:

```
$ rg 'nodeName: compute-0' | cut -d: -f1 | xargs grep ' phase: '
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-85b94c8wl8b6/rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-85b94c8wl8b6.yaml:  phase: Running
csi-cephfsplugin-979mx/csi-cephfsplugin-979mx.yaml:  phase: Running
rook-ceph-crashcollector-compute-0-7b7464676b-2fr2k/rook-ceph-crashcollector-compute-0-7b7464676b-2fr2k.yaml:  phase: Running
rook-ceph-mon-d-67b5b987-f4tvc/rook-ceph-mon-d-67b5b987-f4tvc.yaml:  phase: Running
csi-rbdplugin-hwh42/csi-rbdplugin-hwh42.yaml:  phase: Running
noobaa-core-0/noobaa-core-0.yaml:  phase: Running
csi-rbdplugin-provisioner-6bbf798bfb-dtvdh/csi-rbdplugin-provisioner-6bbf798bfb-dtvdh.yaml:  phase: Running
```

> If so, the conditions for the node being down must not have been detected by
> Rook. But if the rgw pods were the only ones stuck, then this BZ still needs
> to be fixed.

Looks like the first case applies.
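On a live cluster the same check can be done directly, without digging through the must-gather (a sketch; node name as above):

```
# Show the reported phase of every openshift-storage pod scheduled on the
# downed node
oc get pods -n openshift-storage -o wide \
  --field-selector spec.nodeName=compute-0

# Or just the phases
oc get pods -n openshift-storage --field-selector spec.nodeName=compute-0 \
  -o custom-columns=NAME:.metadata.name,PHASE:.status.phase
```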
(In reply to Rohan Gupta from comment #17)
> @Martin, looks like the operator didn't detect the node was down in your
> case.

This is a good point, I'm investigating this.
(In reply to Martin Bukatovic from comment #19)
> (In reply to Rohan Gupta from comment #17)
> > @Martin, looks like the operator didn't detect the node was down in your
> > case.
>
> This is a good point, I'm investigating this.

I reported a separate BZ 1953100 for that.
While reporting BZ 1953100, I retried the reproducer for this bug with:

- OCP 4.7.0-0.nightly-2021-04-21-093400
- LSO 4.7.0-202104090228.p0
- OCS 4.7.0-353.ci (4.7.0-rc5)

And this time, I see no problem with rgw pod redeployment.

Verification steps:

- at 19:37 CEST I shut down the compute-1 and compute-3 nodes
- at 19:40 CEST the nodes are declared NotReady
- at 19:43 CEST all OCS pods are still reported as running on the affected nodes
- at 19:44 CEST the first pods from compute-3 are marked as Terminating, and redeployment of these pods on a different node is triggered
- at 19:45 CEST all pods are running again, with the exception of the OSDs, which stay Pending because these pods can't be moved across nodes

Before the shutdown, the rgw pods were running on the 2 nodes I powered off:

```
$ oc get pods -n openshift-storage -o wide | grep -i rgw
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-68fcc5fc6cnf   2/2   Running   0   63m   10.131.0.120   compute-1
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-69f6b4cttvkg   2/2   Running   0   22h   10.129.2.238   compute-3
```

And at 19:45 CEST, about 8 minutes after the node shutdown, the rgw pods are running on other nodes again:

```
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-68fcc5fxjrbm   2/2   Running   0   36s   10.128.4.29   compute-0   <none>   <none>
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-69f6b4cm6v8r   2/2   Running   0   47s   10.130.2.20   compute-5   <none>   <none>
```
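A small sketch for capturing such a timeline with wall-clock timestamps on future runs (assumes GNU grep for --line-buffered):

```
# Prefix every rgw pod state change with the local time while the nodes are
# down; stop with Ctrl-C once everything is Running again
oc get pods -n openshift-storage -o wide -w \
  | grep --line-buffered -i rgw \
  | while read -r line; do echo "$(date '+%H:%M:%S') $line"; done
```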
I would like to understand the connection between BZ 1953100 and this BZ better, to see whether we should mark BZ 1953100 as a blocker, or whether it makes sense to verify this bug without fixing BZ 1953100. This is because I performed the verification 2 times, and in one case BZ 1953100 prevented the verification. If that scenario is likely to happen often, we can't mark this BZ as verified based on the current evidence and without further discussion.
What you describe sounds expected, see https://bugzilla.redhat.com/show_bug.cgi?id=1953100#c3

Moving back to ON_QA so this can be marked as verified if you agree.
If BZ 1953100 is evaluated as not valid, we need to go back and assume the problem is part of this bug. The problem is that the failure I observed in comment 11 (allegedly caused by a misrepresented node status) is exactly the same as the one described in this bug. With a 50% verification success rate, and without the problem isolated, we can't consider this bug fixed.

One way out would be to automate this use case and execute it something like 10 times (a small but >2 number agreed on in advance), checking whether the issue occurs or not. A sketch of such a loop is shown below.
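A rough sketch of what such an automated loop could look like; power_off_nodes and power_on_nodes are hypothetical placeholders for whatever the platform provides (vSphere, IPMI, ...), and app=rook-ceph-rgw is assumed to be the label on the rgw pods:

```
#!/usr/bin/env bash
# Repeat-runner sketch for this scenario, not a finished test.
set -euo pipefail
RUNS=${RUNS:-10}
NS=openshift-storage

for i in $(seq 1 "$RUNS"); do
  echo "=== run $i ==="
  # nodes currently hosting the rgw pods
  nodes=$(oc get pods -n "$NS" -l app=rook-ceph-rgw \
            -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' | sort -u)
  echo "powering off: $nodes"
  power_off_nodes $nodes          # hypothetical platform helper

  # wait up to ~10 minutes for 2 rgw pods to be Running on other nodes
  ok=false
  for _ in $(seq 1 60); do
    sleep 10
    ready=$(oc get pods -n "$NS" -l app=rook-ceph-rgw \
              -o jsonpath='{range .items[*]}{.status.phase}{" "}{.spec.nodeName}{"\n"}{end}' \
            | grep '^Running' | grep -cvF "$nodes" || true)
    if [ "$ready" -ge 2 ]; then ok=true; break; fi
  done
  $ok && echo "run $i: rgw rescheduled OK" || echo "run $i: rgw NOT rescheduled (bug reproduced)"

  power_on_nodes $nodes           # hypothetical platform helper
  # give the cluster time to settle before the next run
  sleep 600
done
```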
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2041
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days