Bug 2223847 - [RDR] rbd-mirror pod goes to ContainerStatusUnknown status [NEEDINFO]
Summary: [RDR] rbd-mirror pod goes to ContainerStatusUnknown status
Keywords:
Status: NEW
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.13
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Subham Rai
QA Contact: Neha Berry
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-07-19 06:24 UTC by kmanohar
Modified: 2023-08-11 14:49 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:
srai: needinfo? (kmanohar)



Description kmanohar 2023-07-19 06:24:54 UTC
Description of problem (please be as detailed as possible and provide log snippets):


Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?


Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Keep the workloads running for an extended period. In this case the cluster had been up and running for the past month.
2. On the longevity cluster we observed that one rbd-mirror pod goes to 'ContainerStatusUnknown' status. The other rbd-mirror pod's status is fine.

odf-pods | grep mirror
rook-ceph-rbd-mirror-a-54bb9868f-hqqx7                            0/2     ContainerStatusUnknown   2                23d
rook-ceph-rbd-mirror-a-54bb9868f-scfdm                            2/2     Running                  0                4d7h
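
(`odf-pods` above is presumably a local alias. A sketch of the equivalent checks with plain `oc`, assuming the default openshift-storage namespace and the pod name from the listing above:)

```
# List the rbd-mirror pods together with the nodes they are scheduled on.
oc get pods -n openshift-storage -o wide | grep rbd-mirror

# Show why the failed pod ended up in ContainerStatusUnknown (termination reason/message).
oc describe pod rook-ceph-rbd-mirror-a-54bb9868f-hqqx7 -n openshift-storage | grep -iE -A3 'reason|message'
```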


Actual results:

Expected results:
Only one rbd-mirror pod is expected to be present.


Additional info:
 1) No DR operations were performed
 2) An OSD crash was observed at this time
 3) The Submariner connection was also degraded at this time

Cluster info:

OCP - 4.13.0-0.nightly-2023-06-05-164816
ODF - 4.13.0-219.snaptrim
MCO - 4.13.0-219
ACM - 2.8
Submariner - submariner.v0.15.0


Must gather logs :-
c1 - http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/keerthana/rbd-mirror/c1/

c2 - http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/keerthana/rbd-mirror/c2/

hub - http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/keerthana/rbd-mirror/hub/


Live setup is available for debugging
c1 - https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/25313/

c2 - https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/25312/

hub - https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/25311/

Comment 3 Subham Rai 2023-07-20 12:15:56 UTC
The error is

```
Message:      The node was low on resource: ephemeral-storage. Threshold quantity: 19236596271, available: 18664628Ki. Container rbd-mirror was using 342044Ki, request is 0, has larger consumption of ephemeral-storage. Container log-collector was using 5940Ki, request is 0, has larger consumption of ephemeral-storage. 
```
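
(For context: 18664628Ki is roughly 19.1 GB, just under the ~19.2 GB eviction threshold quoted above, so the kubelet evicted the pod for ephemeral-storage pressure. A sketch of how to confirm this on the node side, assuming the node name compute-0 shown in a later comment:)

```
# Check whether the node reported disk pressure.
oc describe node compute-0 | grep -i -A2 DiskPressure

# Check free space backing ephemeral storage (on RHCOS this is typically the /var filesystem).
oc debug node/compute-0 -- chroot /host df -h /var
```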

Comment 4 kmanohar 2023-07-27 04:24:10 UTC
@srai There is one more rbd-mirror pod that got created and is in the Running state. Could you please explain a bit more about that, and why the old pod is stuck in the "ContainerStatusUnknown" state?

Comment 5 Subham Rai 2023-07-27 05:40:48 UTC
If you look, the newer pod is on a different node, 10.135.1.12, which doesn't have low resources:
```
pod/rook-ceph-osd-prepare-535693a906b30a53d2ba66acba7a8140-jdlk2      0/1     Completed                0                21d     10.135.1.12    compute-2   <none>           <none>
pod/rook-ceph-rbd-mirror-a-54bb9868f-hqqx7                            0/2     ContainerStatusUnknown   2                23d     10.133.2.158   compute-0   <none>           <none>
```

Whereas the failing pod, which is on a different node, 10.133.2.158, has the error:
```
Message:      The node was low on resource: ephemeral-storage. Threshold quantity: 19236596271, available: 18664628Ki. Container rbd-mirror was using 342044Ki, request is 0, has larger consumption of ephemeral-storage. Container log-collector was using 5940Ki, request is 0, has larger consumption of ephemeral-storage. 
```

Comment 6 kmanohar 2023-07-27 08:23:11 UTC
@

Comment 7 kmanohar 2023-07-27 08:25:55 UTC
@srai Don't we expect the stale pod to get deleted after the creation of the new one? What would be the right behavior here?

Comment 9 Subham Rai 2023-08-02 09:28:14 UTC
Also, there is a similar Kubernetes issue: https://github.com/kubernetes/kubernetes/issues/104107. I don't think this is something we can fix from Rook or ODF.
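
(For reference, Kubernetes only garbage-collects such Failed pods once the controller-manager's terminated-pod threshold is crossed, so manual cleanup is an option; a sketch, assuming the openshift-storage namespace:)

```
# Remove the stale evicted rbd-mirror pod by name...
oc delete pod rook-ceph-rbd-mirror-a-54bb9868f-hqqx7 -n openshift-storage

# ...or remove every pod left in the Failed phase in the namespace.
oc delete pods -n openshift-storage --field-selector status.phase=Failed
```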

Comment 11 Subham Rai 2023-08-07 04:15:53 UTC
I looked deeply into this, and there are two ways we could address the issue:

1. The node is low on resources, and Kubernetes considers the available ephemeral storage too low to keep the rbd-mirror pod, since its log-collector container runs without `spec.containers[].resources.limits.ephemeral-storage` set, and the same is true for the rbd-mirror container. So we need to free up space on the node.

2. We could possibly set `spec.containers[].resources.limits.ephemeral-storage` on the containers (see the sketch below).
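
A minimal sketch of what option 2 could look like, with purely illustrative values (deployment name inferred from the pod name; in an ODF cluster the operators own this deployment, so a direct patch like this may be reconciled away and the limits would normally have to be set through the operator's resource configuration instead):

```
# Sketch only: add ephemeral-storage limits to both containers of the rbd-mirror deployment.
oc -n openshift-storage patch deployment rook-ceph-rbd-mirror-a --type=strategic -p '{
  "spec": {"template": {"spec": {"containers": [
    {"name": "rbd-mirror",    "resources": {"limits": {"ephemeral-storage": "1Gi"}}},
    {"name": "log-collector", "resources": {"limits": {"ephemeral-storage": "512Mi"}}}
  ]}}}
}'
```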

But I'd suggest the 2nd option is not the right fix, IMO: the node being low on resources is the RCA, and we don't know the right value to put in `spec.containers[].resources.limits.ephemeral-storage`; a wrong value could lead to other issues and, if set too low, could also restrict log collection. Also, everything was working fine until the node ran low on resources.

I don't think there is anything for us to do here.

Comment 12 Subham Rai 2023-08-08 15:38:08 UTC
Moving out of 4.14; not a blocker.

