Description of problem (please be as detailed as possible and provide log snippets):

Does this issue impact your ability to continue to work with the product (please explain in detail what the user impact is)?

Is there any workaround available to the best of your knowledge? No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Keep the workloads running for a longer time. In this case the cluster had been up and running for the past month.
2. In the longevity cluster we observed that one rbd-mirror pod goes into 'ContainerStatusUnknown' status, while the other rbd-mirror pod is fine:
```
odf-pods | grep mirror
rook-ceph-rbd-mirror-a-54bb9868f-hqqx7   0/2   ContainerStatusUnknown   2   23d
rook-ceph-rbd-mirror-a-54bb9868f-scfdm   2/2   Running                  0   4d7h
```

Actual results:

Expected results: Only one rbd-mirror pod is expected to be present.

Additional info:
1) Did not perform any DR operations
2) An OSD crash was observed at this time
3) The Submariner connection was also degraded at this time

Cluster info:
OCP - 4.13.0-0.nightly-2023-06-05-164816
ODF - 4.13.0-219.snaptrim
MCO - 4.13.0-219
ACM - 2.8
Submariner - submariner.v0.15.0

Must-gather logs:
c1 - http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/keerthana/rbd-mirror/c1/
c2 - http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/keerthana/rbd-mirror/c2/
hub - http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/keerthana/rbd-mirror/hub/

Live setup is available for debugging:
c1 - https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/25313/
c2 - https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/25312/
hub - https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/25311/
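To check the same pod state on the live setups above, a sketch (`odf-pods` in step 2 looks like a local alias, and the namespace is assumed to be `openshift-storage`):
```
# Equivalent of the `odf-pods | grep mirror` alias, with node placement included
oc -n openshift-storage get pods -o wide | grep rbd-mirror
```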
The error is:
```
Message: The node was low on resource: ephemeral-storage. Threshold quantity: 19236596271, available: 18664628Ki. Container rbd-mirror was using 342044Ki, request is 0, has larger consumption of ephemeral-storage. Container log-collector was using 5940Ki, request is 0, has larger consumption of ephemeral-storage.
```
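For reference, this message is recorded in the stuck pod's status and can be pulled with something like the following (namespace again assumed to be `openshift-storage`):
```
# Full eviction detail from the pod status
oc -n openshift-storage get pod rook-ceph-rbd-mirror-a-54bb9868f-hqqx7 \
  -o jsonpath='{.status.reason}{": "}{.status.message}{"\n"}'

# Node-pressure evictions also leave events behind
oc -n openshift-storage get events --field-selector reason=Evicted
```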
@srai There is one more rbd-mirror pod that got created and is in Running state. Could you please explain a bit more about that? And why is the old pod stuck in the "ContainerStatusUnknown" state?
If you look at the pod placement, the newer pod is on a different node (10.135.1.12), which doesn't have low resources:
```
pod/rook-ceph-osd-prepare-535693a906b30a53d2ba66acba7a8140-jdlk2   0/1   Completed                0   21d   10.135.1.12    compute-2   <none>   <none>
pod/rook-ceph-rbd-mirror-a-54bb9868f-hqqx7                         0/2   ContainerStatusUnknown   2   23d   10.133.2.158   compute-0   <none>   <none>
```
The failing pod is on a different node (10.133.2.158), which has the error:
```
Message: The node was low on resource: ephemeral-storage. Threshold quantity: 19236596271, available: 18664628Ki. Container rbd-mirror was using 342044Ki, request is 0, has larger consumption of ephemeral-storage. Container log-collector was using 5940Ki, request is 0, has larger consumption of ephemeral-storage.
```
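A quick way to confirm the ephemeral-storage pressure on compute-0, the node hosting the failed pod (a sketch; the grep pattern just picks the relevant fields out of the `describe` output):
```
# Capacity/allocatable ephemeral-storage and the DiskPressure condition on the node
oc describe node compute-0 | grep -iE 'ephemeral-storage|diskpressure'

# Actual filesystem usage on the node (oc debug starts a helper pod on that node)
oc debug node/compute-0 -- chroot /host df -h /var
```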
@srai Don't we expect the stale pod to get deleted after the creation of the new one? What would be the right behavior here?
Also, there is a similar Kubernetes issue: https://github.com/kubernetes/kubernetes/issues/104107. I don't think this is something we can fix from Rook or ODF.
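If the stale pod needs to be cleaned up manually in the meantime, deleting it should be safe since the ReplicaSet already has a healthy replacement running (a sketch; namespace assumed to be `openshift-storage`):
```
oc -n openshift-storage delete pod rook-ceph-rbd-mirror-a-54bb9868f-hqqx7

# Or clean up every pod left in Failed phase in the namespace
oc -n openshift-storage delete pods --field-selector=status.phase=Failed
```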
I looked deeply into this, and there are two ways we could address the issue:
1. The node is low on resources, and Kubernetes considers the available ephemeral storage too low to keep the rbd-mirror pod, since neither the log-collector container nor the rbd-mirror container has `spec.containers[].resources.limits.ephemeral-storage` set. So we need to free up space on the node.
2. We could possibly set `spec.containers[].resources.limits.ephemeral-storage` on the containers (see the sketch below).

But I'll suggest that the 2nd is not the right fix IMO: the low resources on the node are the RCA, we don't know what the right value for `spec.containers[].resources.limits.ephemeral-storage` would be, it could lead to other issues, and it could also restrict log collection if we set a very low value. Also, everything was working until the node ran low on resources. I think we don't have anything to do here.
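For completeness, a minimal sketch of what option 2 would look like if someone still wanted to try it. Assumptions: `openshift-storage` namespace, container index 0 is `rbd-mirror`, and the `1Gi` value is arbitrary, which is exactly the problem described above; a direct deployment patch like this would also likely be reverted by the Rook operator's reconciliation, so a real change would have to go through the operator's resource settings.
```
# Illustration only -- not the recommended fix, and the limit value is a guess.
# Verify the container index first:
#   oc -n openshift-storage get deployment rook-ceph-rbd-mirror-a \
#     -o jsonpath='{.spec.template.spec.containers[*].name}'
oc -n openshift-storage patch deployment rook-ceph-rbd-mirror-a --type=json -p='[
  {"op": "add",
   "path": "/spec/template/spec/containers/0/resources/limits",
   "value": {"ephemeral-storage": "1Gi"}}
]'
```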
Moving out of 4.14; not a blocker.