Description of problem (please be as detailed as possible and provide log snippets):

Does this issue impact your ability to continue to work with the product (please explain in detail what the user impact is)?

Is there any workaround available to the best of your knowledge? No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Keep the workloads running for a longer time. In this case the cluster had been up and running for the past month.
2. In the longevity cluster we observed that one rbd-mirror pod goes into 'ContainerStatusUnknown' status, while the other rbd-mirror pod is fine:
```
odf-pods | grep mirror
rook-ceph-rbd-mirror-a-54bb9868f-hqqx7   0/2   ContainerStatusUnknown   2   23d
rook-ceph-rbd-mirror-a-54bb9868f-scfdm   2/2   Running                  0   4d7h
```

Actual results:

Expected results: Only one rbd-mirror pod is expected to be present.

Additional info:
1) Did not perform any DR operations
2) An OSD crash was observed at this time
3) The Submariner connection was also degraded at this time

Cluster info:
OCP - 4.13.0-0.nightly-2023-06-05-164816
ODF - 4.13.0-219.snaptrim
MCO - 4.13.0-219
ACM - 2.8
Submariner - submariner.v0.15.0

Must-gather logs:
c1 - http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/keerthana/rbd-mirror/c1/
c2 - http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/keerthana/rbd-mirror/c2/
hub - http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/keerthana/rbd-mirror/hub/

Live setup is available for debugging:
c1 - https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/25313/
c2 - https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/25312/
hub - https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/25311/
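To check the same pod state on the live setups above, a sketch (`odf-pods` in step 2 looks like a local alias, and the namespace is assumed to be `openshift-storage`):
```
# Equivalent of the `odf-pods | grep mirror` alias, with node placement included
oc -n openshift-storage get pods -o wide | grep rbd-mirror
```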
The error is:
```
Message: The node was low on resource: ephemeral-storage. Threshold quantity: 19236596271, available: 18664628Ki. Container rbd-mirror was using 342044Ki, request is 0, has larger consumption of ephemeral-storage. Container log-collector was using 5940Ki, request is 0, has larger consumption of ephemeral-storage.
```
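For reference, this message is recorded in the stuck pod's status and can be pulled with something like the following (namespace again assumed to be `openshift-storage`):
```
# Full eviction detail from the pod status
oc -n openshift-storage get pod rook-ceph-rbd-mirror-a-54bb9868f-hqqx7 \
  -o jsonpath='{.status.reason}{": "}{.status.message}{"\n"}'

# Node-pressure evictions also leave events behind
oc -n openshift-storage get events --field-selector reason=Evicted
```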
@srai There is one more rbd-mirror pod that got created and is in Running state. Could you please explain a bit more about that? And why is the old pod stuck in the "ContainerStatusUnknown" state?
If you look at the pod placement, the newer pod is on a different node (10.135.1.12), which doesn't have low resources:
```
pod/rook-ceph-osd-prepare-535693a906b30a53d2ba66acba7a8140-jdlk2   0/1   Completed                0   21d   10.135.1.12    compute-2   <none>   <none>
pod/rook-ceph-rbd-mirror-a-54bb9868f-hqqx7                         0/2   ContainerStatusUnknown   2   23d   10.133.2.158   compute-0   <none>   <none>
```
The failing pod is on a different node (10.133.2.158), which has the error:
```
Message: The node was low on resource: ephemeral-storage. Threshold quantity: 19236596271, available: 18664628Ki. Container rbd-mirror was using 342044Ki, request is 0, has larger consumption of ephemeral-storage. Container log-collector was using 5940Ki, request is 0, has larger consumption of ephemeral-storage.
```
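A quick way to confirm the ephemeral-storage pressure on compute-0, the node hosting the failed pod (a sketch; the grep pattern just picks the relevant fields out of the `describe` output):
```
# Capacity/allocatable ephemeral-storage and the DiskPressure condition on the node
oc describe node compute-0 | grep -iE 'ephemeral-storage|diskpressure'

# Actual filesystem usage on the node (oc debug starts a helper pod on that node)
oc debug node/compute-0 -- chroot /host df -h /var
```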
@srai Don't we expect the stale pod to get deleted after the creation of the new one? What would be the right behavior here?
Also, there is a similar Kubernetes issue: https://github.com/kubernetes/kubernetes/issues/104107. I don't think this is something we can fix from Rook or ODF.
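If the stale pod needs to be cleaned up manually in the meantime, deleting it should be safe since the ReplicaSet already has a healthy replacement running (a sketch; namespace assumed to be `openshift-storage`):
```
oc -n openshift-storage delete pod rook-ceph-rbd-mirror-a-54bb9868f-hqqx7

# Or clean up every pod left in Failed phase in the namespace
oc -n openshift-storage delete pods --field-selector=status.phase=Failed
```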
I looked deeply into this, and there are two ways we could address the issue:
1. The node is low on resources, and Kubernetes considers the available ephemeral storage too low to keep the rbd-mirror pod, since neither the log-collector container nor the rbd-mirror container has `spec.containers[].resources.limits.ephemeral-storage` set. So we need to free up space on the node.
2. We could possibly set `spec.containers[].resources.limits.ephemeral-storage` on the containers (see the sketch below).

But I'll suggest that the 2nd is not the right fix IMO: the low resources on the node are the RCA, we don't know what the right value for `spec.containers[].resources.limits.ephemeral-storage` would be, it could lead to other issues, and it could also restrict log collection if we set a very low value. Also, everything was working until the node ran low on resources. I think we don't have anything to do here.
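For completeness, a minimal sketch of what option 2 would look like if someone still wanted to try it. Assumptions: `openshift-storage` namespace, container index 0 is `rbd-mirror`, and the `1Gi` value is arbitrary, which is exactly the problem described above; a direct deployment patch like this would also likely be reverted by the Rook operator's reconciliation, so a real change would have to go through the operator's resource settings.
```
# Illustration only -- not the recommended fix, and the limit value is a guess.
# Verify the container index first:
#   oc -n openshift-storage get deployment rook-ceph-rbd-mirror-a \
#     -o jsonpath='{.spec.template.spec.containers[*].name}'
oc -n openshift-storage patch deployment rook-ceph-rbd-mirror-a --type=json -p='[
  {"op": "add",
   "path": "/spec/template/spec/containers/0/resources/limits",
   "value": {"ephemeral-storage": "1Gi"}}
]'
```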
Moving out of 4.14; not a blocker.