Description of problem (please be as detailed as possible and provide log snippets):

[Arbiter] When both of the nodes hosting rgw pods are down, the rgw service is unavailable.

Version of all relevant components (if applicable):
OCP version: 4.7.0-0.nightly-2021-04-03-065651
OCS version: ocs-operator.v4.7.0-339.ci

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
2

Can this issue be reproduced?
Yes

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Deploy OCP 4.7 over BM UPI
2. Deploy OCS with arbiter enabled
3. Power off the nodes hosting the rgw pods

Actual results:

rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-654789blvkjc   0/2   Pending       0   121m    <none>        <none>                    <none>   <none>
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-654789btg9sr   2/2   Terminating   0   23h     10.131.0.24   argo007.ceph.redhat.com   <none>   <none>
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-8445b44cgr7d   2/2   Terminating   0   6h49m   10.129.2.9    argo006.ceph.redhat.com   <none>   <none>
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-8445b44pqhbz   0/2   Pending       0   149m    <none>        <none>                    <none>   <none>

Expected results:
The rgw pods should be rescheduled to another node in the same zone.

Additional info:

# snippet from describe of rgw-pod
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  121m  default-scheduler  0/8 nodes are available: 1 node(s) didn't match Pod's node affinity, 2 node(s) didn't match pod affinity/anti-affinity, 2 node(s) didn't match pod anti-affinity rules, 2 node(s) had taint {node.kubernetes.io/unreachable: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
  Warning  FailedScheduling  121m  default-scheduler  0/8 nodes are available: 1 node(s) didn't match Pod's node affinity, 2 node(s) didn't match pod affinity/anti-affinity, 2 node(s) didn't match pod anti-affinity rules, 2 node(s) had taint {node.kubernetes.io/unreachable: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
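As a minimal sketch of the commands used around these steps (assuming the default openshift-storage namespace and the Rook label app=rook-ceph-rgw on the rgw pods):

```
# Identify which nodes currently host the rgw pods
oc get pods -n openshift-storage -l app=rook-ceph-rgw -o wide

# After powering those nodes off, watch the replacement pods and the
# scheduling failures they hit
oc get pods -n openshift-storage -l app=rook-ceph-rgw -o wide -w
oc get events -n openshift-storage --field-selector reason=FailedScheduling
```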
@Pratik A couple questions:

1. The repro steps say you powered down one node, but the results look like you powered down two nodes. I'm still investigating the pod affinity, but just want to be clear on the repro.
2. If you start the nodes again, rgw pods start, right? This doesn't seem like a blocker unless the rgw pods don't start up again when the nodes come back.
I see now: the required anti-affinity that keeps the rgw pods in different zones is preventing a new rgw pod from starting in the same zone while the failed rgw pod is still terminating. We need to add the rgw pods to the list of pods to be force deleted, in addition to this PR: https://github.com/rook/rook/pull/7266

@Rohan could you take a look? Honestly I don't see why this would be a blocker for a tech preview feature, but it should be a small, low-risk change to get into 4.7 anyway.
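For reference, a minimal sketch (using only the deployment and pod names already shown in this report) of how to look at the anti-affinity in question, plus a manual stop-gap to unblock scheduling while the operator-side fix is pending:

```
# Inspect the required podAntiAffinity that keeps the two rgw pods of one
# object store in different zones
oc -n openshift-storage get deployment \
  rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a \
  -o jsonpath='{.spec.template.spec.affinity.podAntiAffinity}'

# Manual workaround only: force delete the rgw pod stuck in Terminating so
# the Pending replacement can be scheduled (the proper fix is for the
# operator to do this automatically)
oc -n openshift-storage delete pod <stuck-terminating-rgw-pod> \
  --grace-period=0 --force
```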
@nberry The MDS labels are already in the code that removes pods stuck in the terminating state, so it should not cause a similar issue.

The upstream PR for the RGW fix: https://github.com/rook/rook/pull/7537
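As a quick cross-check, the pods in question can be listed by the usual Rook app labels; note that the label values below (app=rook-ceph-mds, app=rook-ceph-rgw) are an assumption based on standard Rook naming, not something confirmed in this bug:

```
# List mds and rgw pods by their (assumed) Rook app labels
oc get pods -n openshift-storage -l 'app in (rook-ceph-mds,rook-ceph-rgw)' \
  -o wide --show-labels
```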
I'm trying to verify the problem with:

- OCP: 4.7.0-0.nightly-2021-04-15-110345
- LSO 4.7.0-202104030128.p0
- OCS 4.7.0-353.ci latest-stable-47 (4.7.0-rc5)

So I identified the 2 nodes hosting rgw pods:

```
$ oc get pods -n openshift-storage -o wide | grep -i rgw
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-86df96847fnr   2/2   Running   0   22h   10.128.4.25   compute-5   <none>   <none>
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-85b94c8wl8b6   2/2   Running   0   22h   10.128.2.13   compute-0   <none>   <none>
```

and shut them down via the vSphere console (so that vSphere reports them as powered off) at 16:50 CEST.

After some time (about 6 minutes), the affected rgw pods went into Terminating state:

```
$ oc get pods -n openshift-storage -o wide | grep -i rgw
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-86df96847fnr   2/2   Terminating   0   22h   10.128.4.25   compute-5   <none>   <none>
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-86df9689xd5q   0/2   Pending       0   83s   <none>        <none>      <none>   <none>
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-85b94c8v94l5   0/2   Pending       0   93s   <none>        <none>      <none>   <none>
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-85b94c8wl8b6   2/2   Terminating   0   22h   10.128.2.13   compute-0   <none>   <none>
```

and it stays this way. Looking into the output of `oc describe pod/rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-86df9689xd5q` I see:

```
Events:
  Type     Reason            Age  From               Message
  ----     ------            ---- ----               -------
  Warning  FailedScheduling  11m  default-scheduler  0/9 nodes are available: 2 node(s) had taint {node.kubernetes.io/unreachable: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 4 node(s) didn't match pod affinity/anti-affinity, 4 node(s) didn't match pod anti-affinity rules.
  Warning  FailedScheduling  11m  default-scheduler  0/9 nodes are available: 2 node(s) had taint {node.kubernetes.io/unreachable: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 4 node(s) didn't match pod affinity/anti-affinity, 4 node(s) didn't match pod anti-affinity rules.
```

Moving the bug back to ASSIGNED.

Additional info
---------------

```
$ oc get nodes -L topology.kubernetes.io/zone
NAME              STATUS     ROLES    AGE     VERSION           ZONE
compute-0         NotReady   worker   4d23h   v1.20.0+c8905da   data-a
compute-1         Ready      worker   4d23h   v1.20.0+c8905da   data-a
compute-2         Ready      worker   4d23h   v1.20.0+c8905da   data-a
compute-3         Ready      worker   4d23h   v1.20.0+c8905da   data-b
compute-4         Ready      worker   4d23h   v1.20.0+c8905da   data-b
compute-5         NotReady   worker   4d23h   v1.20.0+c8905da   data-b
control-plane-0   Ready      master   4d23h   v1.20.0+c8905da   data-a
control-plane-1   Ready      master   4d23h   v1.20.0+c8905da   data-b
control-plane-2   Ready      master   4d23h   v1.20.0+c8905da   foo-arbiter
```
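A small sketch of commands that can be run in parallel to capture the same timeline (same namespace and zone label as above; the taint column is just to confirm the node.kubernetes.io/unreachable taint the scheduler reports):

```
# Watch node readiness and the rgw pods while the nodes are powered off
# (run in separate terminals)
oc get nodes -L topology.kubernetes.io/zone -w
oc get pods -n openshift-storage -o wide -w | grep -i rgw

# Confirm the taints on the downed nodes
oc get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
```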
I powered the nodes back on at 17:34 CEST, and then waited for all pods to be running again. But now I see that the rgw pods are running on different nodes compared to the original state:

```
$ oc get pods -o wide -n openshift-storage | grep rgw
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-86df9689xd5q   2/2   Running   0   65m   10.131.0.40   compute-4   <none>   <none>
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-85b94c8v94l5   2/2   Running   0   65m   10.131.3.30   compute-1   <none>   <none>
```

Looking at the ceph_rgw_metadata prometheus query, it seems that the startup of the powered-down nodes indeed coincides with the start of the new rgw pods (on different nodes though). I have no idea why this is happening.
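A rough way to cross-check that timing without Prometheus is to compare the pods' recorded start time with the nodes' Ready transition (pod and node names taken from the output above):

```
# Start time of the replacement rgw pods
for p in rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-86df9689xd5q \
         rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-85b94c8v94l5; do
  oc -n openshift-storage get pod "$p" \
    -o jsonpath='{.metadata.name}{"\t"}{.status.startTime}{"\n"}'
done

# When the previously powered-off nodes last became Ready
for n in compute-0 compute-5; do
  oc get node "$n" \
    -o jsonpath='{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].lastTransitionTime}{"\n"}'
done
```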
Additional details for comment 13 (from oc describe pod/rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-86df9689xd5q):

```
Type     Reason            Age  From               Message
----     ------            ---- ----               -------
Warning  FailedScheduling  77m  default-scheduler  0/9 nodes are available: 2 node(s) had taint {node.kubernetes.io/unreachable: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 4 node(s) didn't match pod affinity/anti-affinity, 4 node(s) didn't match pod anti-affinity rules.
Warning  FailedScheduling  77m  default-scheduler  0/9 nodes are available: 2 node(s) had taint {node.kubernetes.io/unreachable: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 4 node(s) didn't match pod affinity/anti-affinity, 4 node(s) didn't match pod anti-affinity rules.
Normal   Scheduled         37m  default-scheduler  Successfully assigned openshift-storage/rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-86df9689xd5q to compute-4
Normal   AddedInterface    37m  multus             Add eth0 [10.131.0.40/23]
Normal   Pulled            37m  kubelet            Container image "quay.io/rhceph-dev/rhceph@sha256:d2e99edf733960244256ad82e761acffe9f09e76749bb769469b4b929b25c509" already present on machine
Normal   Created           37m  kubelet            Created container chown-container-data-dir
Normal   Started           37m  kubelet            Started container chown-container-data-dir
Normal   Pulled            37m  kubelet            Container image "quay.io/rhceph-dev/rhceph@sha256:d2e99edf733960244256ad82e761acffe9f09e76749bb769469b4b929b25c509" already present on machine
Normal   Created           37m  kubelet            Created container rgw
Normal   Started           37m  kubelet            Started container rgw
Normal   Pulled            37m  kubelet            Container image "quay.io/rhceph-dev/rhceph@sha256:d2e99edf733960244256ad82e761acffe9f09e76749bb769469b4b929b25c509" already present on machine
Normal   Created           37m  kubelet            Created container log-collector
Normal   Started           37m  kubelet            Started container log-collector
```

Here we can see further evidence that the pod only started after the powered-off nodes were started again.
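The same timeline can also be pulled from the event stream, sorted by timestamp (a minimal sketch):

```
# Event history for the rgw pods, ordered by time
oc get events -n openshift-storage --sort-by=.lastTimestamp | grep -i rgw
```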
The operator log does not have any lines like the following for the rgw or any other pods:

"force deleting pod <pod-name> that appears to be stuck terminating"

It seems nodes 0 and 5 were down long enough that we should have seen this message for all the pods on those nodes. If the terminating pods are not force deleted by the operator, the new rgw pods will stay stuck in the Pending state and the old ones will stay in Terminating. Then, as soon as the nodes come back up, the pods will finally be scheduled.

@Martin Did you see all the other pods on those nodes (besides rgw) also stuck while the nodes were down? If so, the conditions for the node being down must not have been detected by Rook. But if the rgw pods were the only ones stuck, then this BZ still needs to be fixed.
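A minimal sketch of how to check for that message and for the node state the operator would be reacting to (operator deployment and node names as used in this bug):

```
# Look for the force-delete message quoted above in the operator log
oc logs -n openshift-storage deploy/rook-ceph-operator \
  | grep -i 'stuck terminating'

# Cross-check what the API reports for the downed nodes (Ready condition
# and unreachable taint), which is the state the operator acts on
oc get node compute-0 compute-5 -o wide
oc describe node compute-0 | grep -i -A2 taint
```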
I would also recommend this *not* be a blocker for 4.7 and that we move it to 4.8. The odds of both nodes where the rgw pods are running going down at the same time are low.
I tested a build of https://github.com/openshift/rook/tree/release-4.7 on an AWS dev cluster to check whether rgw pods stuck in the Terminating state are force deleted on node failure, and it works. Confirmed in the rook-ceph-operator logs:

`2021-04-20 23:48:51.705697 I | op-k8sutil: pod "rook-ceph-rgw-my-store-a-6f8ff8698-92xvf" deletion succeeded`

@Martin, looks like the operator didn't detect the node was down in your case.
(In reply to Travis Nielsen from comment #15)
> The operator log does not have any lines like the following for the rgw or
> any other pods:
>
> "force deleting pod <pod-name> that appears to be stuck terminating"
>
> It seems nodes 0 and 5 were down long enough that we should have seen this
> message for all the pods on those nodes.

Agreed.

> If the terminating pods are not force deleted by the operator, the new rgw
> pods will stay stuck in the Pending state and the old ones will stay in
> Terminating. Then, as soon as the nodes come back up, the pods will finally
> be scheduled.
>
> @Martin Did you see all the other pods on those nodes (besides rgw) also
> stuck while the nodes were down?

We can check the must-gather tarball referenced in comment 12 to answer that, since I fetched it at about 17:25-17:30, after reproducing the problem but before starting the nodes again (shutdown happened at 16:50 CEST, startup at 17:34 CEST).

Let's cd into the directory with the k8s yaml dump of all OCS pods:

```
$ tar --force-local -xf mbukatov-0415a_20210415T151318.must-gather.2021-04-20T17\:10+02\:00.tar.gz
$ cd ./mbukatov-0415a_20210415T151318.must-gather-ocs.2021-04-20T17:10+02:00/quay-io-rhceph-dev-ocs-must-gather-sha256-0c61617e2ae8d332167ab4bc86dfa9e7b74e002a8c344eee25700ea2069df6a0/namespaces/openshift-storage/pods
$ ls
compute-1-debug  compute-2-debug  compute-3-debug  compute-4-debug
csi-cephfsplugin-8jbp6  csi-cephfsplugin-979mx  csi-cephfsplugin-lsd49  csi-cephfsplugin-n9c46
csi-cephfsplugin-provisioner-f975d886c-gd6cf  csi-cephfsplugin-provisioner-f975d886c-vbzbn
csi-cephfsplugin-s986p  csi-cephfsplugin-vchr7
csi-rbdplugin-5brdd  csi-rbdplugin-822t7  csi-rbdplugin-f6x48  csi-rbdplugin-fbbgq
csi-rbdplugin-hwh42  csi-rbdplugin-ksskf
csi-rbdplugin-provisioner-6bbf798bfb-5gg7v  csi-rbdplugin-provisioner-6bbf798bfb-d2zlp
csi-rbdplugin-provisioner-6bbf798bfb-dtvdh  csi-rbdplugin-provisioner-6bbf798bfb-sp6jx
must-gather-xv9cj-helper  noobaa-core-0  noobaa-db-pg-0
noobaa-endpoint-56c55b887b-pzzks  noobaa-operator-58569bcccf-wqsdk
ocs-metrics-exporter-69845d9ffb-pnrns  ocs-operator-6c4795f48d-47jw6
rook-ceph-crashcollector-compute-0-7b7464676b-2fr2k  rook-ceph-crashcollector-compute-0-7b7464676b-v89kn
rook-ceph-crashcollector-compute-1-7b56695857-k6fm8  rook-ceph-crashcollector-compute-2-57df775c68-qhdcf
rook-ceph-crashcollector-compute-3-8898fd575-br5f5   rook-ceph-crashcollector-compute-4-689fbb5b78-lh4mt
rook-ceph-crashcollector-compute-5-5d65d696f-t7bnh   rook-ceph-crashcollector-compute-5-5d65d696f-xn92c
rook-ceph-crashcollector-control-plane-2-644b688bc6-b9zfd
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-584cf59bh2scw
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-567b98d8ggpmt
rook-ceph-mgr-a-7f9b5b666-v8k8h
rook-ceph-mon-a-5685b776b6-qmtnx  rook-ceph-mon-b-fc6bb97b5-gqbv5  rook-ceph-mon-b-fc6bb97b5-j4s2n
rook-ceph-mon-c-756c8487d6-l7ppf  rook-ceph-mon-d-67b5b987-6j442   rook-ceph-mon-d-67b5b987-f4tvc
rook-ceph-mon-e-76bcc96ff8-69sc9  rook-ceph-operator-569468bdcc-nxhln
rook-ceph-osd-0-844c4b7646-mpg5f  rook-ceph-osd-1-7cbf8b8f49-9wvqp
rook-ceph-osd-2-5568d584dc-rm2tx  rook-ceph-osd-3-545686578c-p7bdx
rook-ceph-osd-prepare-ocs-deviceset-ocsvolumeset-0-data-0qmwgmx
rook-ceph-osd-prepare-ocs-deviceset-ocsvolumeset-1-data-06vbtnx
rook-ceph-osd-prepare-ocs-deviceset-ocsvolumeset-2-data-09ks9gf
rook-ceph-osd-prepare-ocs-deviceset-ocsvolumeset-3-data-0fsd62j
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-86df96847fnr
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-86df9689xd5q
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-85b94c8v94l5
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-85b94c8wl8b6
rook-ceph-tools-69c5449589-wn84v
```

The following pods were allegedly running on one of the shutdown nodes (we can see the expected rgw pod there):

```
$ rg 'nodeName: compute-0'
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-85b94c8wl8b6/rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-85b94c8wl8b6.yaml
639:    nodeName: compute-0

csi-cephfsplugin-979mx/csi-cephfsplugin-979mx.yaml
475:    nodeName: compute-0

rook-ceph-crashcollector-compute-0-7b7464676b-2fr2k/rook-ceph-crashcollector-compute-0-7b7464676b-2fr2k.yaml
492:    nodeName: compute-0

rook-ceph-mon-d-67b5b987-f4tvc/rook-ceph-mon-d-67b5b987-f4tvc.yaml
878:    nodeName: compute-0

csi-rbdplugin-hwh42/csi-rbdplugin-hwh42.yaml
476:    nodeName: compute-0

noobaa-core-0/noobaa-core-0.yaml
487:    nodeName: compute-0

csi-rbdplugin-provisioner-6bbf798bfb-dtvdh/csi-rbdplugin-provisioner-6bbf798bfb-dtvdh.yaml
547:    nodeName: compute-0
```

And when we check the status of these pods, we see that all OCS pods, not just rgw, were reported as running even though the node was down:

```
$ rg 'nodeName: compute-0' | cut -d: -f1 | xargs grep ' phase: '
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-85b94c8wl8b6/rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-85b94c8wl8b6.yaml:  phase: Running
csi-cephfsplugin-979mx/csi-cephfsplugin-979mx.yaml:  phase: Running
rook-ceph-crashcollector-compute-0-7b7464676b-2fr2k/rook-ceph-crashcollector-compute-0-7b7464676b-2fr2k.yaml:  phase: Running
rook-ceph-mon-d-67b5b987-f4tvc/rook-ceph-mon-d-67b5b987-f4tvc.yaml:  phase: Running
csi-rbdplugin-hwh42/csi-rbdplugin-hwh42.yaml:  phase: Running
noobaa-core-0/noobaa-core-0.yaml:  phase: Running
csi-rbdplugin-provisioner-6bbf798bfb-dtvdh/csi-rbdplugin-provisioner-6bbf798bfb-dtvdh.yaml:  phase: Running
```

> If so, the conditions for the node being down must not have been detected by
> Rook. But if the rgw pods were the only ones stuck, then this BZ still needs
> to be fixed.

Looks like the first case applies.
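On a live cluster the same check can be done directly, without digging through the must-gather (a sketch; node name as above):

```
# Show the reported phase of every openshift-storage pod scheduled on the
# downed node
oc get pods -n openshift-storage -o wide \
  --field-selector spec.nodeName=compute-0

# Or just the phases
oc get pods -n openshift-storage --field-selector spec.nodeName=compute-0 \
  -o custom-columns=NAME:.metadata.name,PHASE:.status.phase
```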
(In reply to Rohan Gupta from comment #17)
> @Martin, looks like the operator didn't detect the node was down in your
> case.

This is a good point, I'm investigating this.
(In reply to Martin Bukatovic from comment #19)
> (In reply to Rohan Gupta from comment #17)
> > @Martin, looks like the operator didn't detect the node was down in your
> > case.
>
> This is a good point, I'm investigating this.

I reported a separate BZ 1953100 for that.
While reporting BZ 1953100, I retried the reproducer for this bug with:

- OCP 4.7.0-0.nightly-2021-04-21-093400
- LSO 4.7.0-202104090228.p0
- OCS 4.7.0-353.ci (4.7.0-rc5)

And this time, I see no problem with rgw pod redeployment.

Verification steps:

- at 19:37 CEST I shut down the compute-1 and compute-3 nodes
- at 19:40 CEST the nodes are declared NotReady
- at 19:43 CEST all OCS pods are still reported as running on the affected nodes
- at 19:44 CEST the first pods from compute-3 are marked as Terminating, and redeployment of these pods on a different node is triggered
- at 19:45 CEST all pods are running again, with the exception of the OSDs, which stay Pending because these pods can't be moved across nodes

Before the shutdown, the rgw pods were running on the 2 nodes I powered off:

```
$ oc get pods -n openshift-storage -o wide | grep -i rgw
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-68fcc5fc6cnf   2/2   Running   0   63m   10.131.0.120   compute-1
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-69f6b4cttvkg   2/2   Running   0   22h   10.129.2.238   compute-3
```

And at 19:45 CEST, about 8 minutes after the node shutdown, the rgw pods are running on other nodes again:

```
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-68fcc5fxjrbm   2/2   Running   0   36s   10.128.4.29   compute-0   <none>   <none>
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-69f6b4cm6v8r   2/2   Running   0   47s   10.130.2.20   compute-5   <none>   <none>
```
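A small sketch for capturing such a timeline with wall-clock timestamps on future runs (assumes GNU grep for --line-buffered):

```
# Prefix every rgw pod state change with the local time while the nodes are
# down; stop with Ctrl-C once everything is Running again
oc get pods -n openshift-storage -o wide -w \
  | grep --line-buffered -i rgw \
  | while read -r line; do echo "$(date '+%H:%M:%S') $line"; done
```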
I would like to understand the connection between BZ 1953100 and this BZ better, to see whether we should mark BZ 1953100 as a blocker, or whether it makes sense to verify this bug without fixing BZ 1953100. This is because I performed the verification 2 times, and in one case BZ 1953100 prevented the verification. If that scenario is likely to happen often, we can't mark this BZ as verified based on the current evidence and without further discussion.
What you describe sounds expected, see https://bugzilla.redhat.com/show_bug.cgi?id=1953100#c3

Moving back to ON_QA so this can be marked as verified if you agree.
If BZ 1953100 is evaluated as not valid, we need to go back and assume the problem is part of this bug. The problem is that the failure I observed in comment 11 (allegedly caused by a misrepresented node status) is exactly the same as the one described in this bug. With a 50% verification success rate, and without the problem isolated, we can't consider this bug fixed.

One way out would be to automate this use case and execute it something like 10 times (a small but >2 number agreed on in advance), checking whether the issue occurs or not. A sketch of such a loop is shown below.
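A rough sketch of what such an automated loop could look like; power_off_nodes and power_on_nodes are hypothetical placeholders for whatever the platform provides (vSphere, IPMI, ...), and app=rook-ceph-rgw is assumed to be the label on the rgw pods:

```
#!/usr/bin/env bash
# Repeat-runner sketch for this scenario, not a finished test.
set -euo pipefail
RUNS=${RUNS:-10}
NS=openshift-storage

for i in $(seq 1 "$RUNS"); do
  echo "=== run $i ==="
  # nodes currently hosting the rgw pods
  nodes=$(oc get pods -n "$NS" -l app=rook-ceph-rgw \
            -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' | sort -u)
  echo "powering off: $nodes"
  power_off_nodes $nodes          # hypothetical platform helper

  # wait up to ~10 minutes for 2 rgw pods to be Running on other nodes
  ok=false
  for _ in $(seq 1 60); do
    sleep 10
    ready=$(oc get pods -n "$NS" -l app=rook-ceph-rgw \
              -o jsonpath='{range .items[*]}{.status.phase}{" "}{.spec.nodeName}{"\n"}{end}' \
            | grep '^Running' | grep -cvF "$nodes" || true)
    if [ "$ready" -ge 2 ]; then ok=true; break; fi
  done
  $ok && echo "run $i: rgw rescheduled OK" || echo "run $i: rgw NOT rescheduled (bug reproduced)"

  power_on_nodes $nodes           # hypothetical platform helper
  # give the cluster time to settle before the next run
  sleep 600
done
```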
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2041
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days