Bug 2102440 - Ceph daemons are crashing while any app pod remained in container creating state [NEEDINFO]
Summary: Ceph daemons are crashing while any app pod remained in container creating state
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Kotresh HR
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-06-29 22:56 UTC by Amrita Mahapatra
Modified: 2023-08-09 16:37 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-10-13 09:35:48 UTC
Embargoed:
khiremat: needinfo? (ammahapa)



Description Amrita Mahapatra 2022-06-29 22:56:47 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

Ceph daemons crash while an app pod fails to move to Running and remains stuck in the ContainerCreating state. Although the PVC is in the Bound state, the pod fails with the following error events:

Events:
  Type     Reason       Age                 From               Message
  ----     ------       ----                ----               -------
  Normal   Scheduled    5m5s                default-scheduler  Successfully assigned openshift-storage/pod-test-cephfs-271a7a8f35df44a299eab46f to ip-10-0-171-178.us-east-2.compute.internal by ip-10-0-182-125
  Warning  FailedMount  3m5s                kubelet            MountVolume.SetUp failed for volume "pvc-8b42c93f-a1d6-4b26-b828-d84070fd4736" : rpc error: code = DeadlineExceeded desc = context deadline exceeded
  Warning  FailedMount  47s (x2 over 3m2s)  kubelet            Unable to attach or mount volumes: unmounted volumes=[mypvc], unattached volumes=[mypvc kube-api-access-lk5vj]: timed out waiting for the condition
  Warning  FailedMount  35s (x8 over 3m1s)  kubelet            MountVolume.SetUp failed for volume "pvc-8b42c93f-a1d6-4b26-b828-d84070fd4736" : rpc error: code = Internal desc = mount failed: exit status 32
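
The crashes presumably surface as a Ceph health warning for recently crashed daemons (the workaround below archives them so the health check passes). A minimal way to confirm from the toolbox pod (pod name taken from the workaround example below; substitute the one in your cluster):

$ oc -n openshift-storage exec rook-ceph-tools-9f8c8976f-zk8ps -- ceph health detail
$ oc -n openshift-storage exec rook-ceph-tools-9f8c8976f-zk8ps -- ceph crash ls
# 'ceph crash ls' lists the recorded crash reports; 'ceph crash info <crash-id>' prints the backtrace for a single entry.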


Version of all relevant components (if applicable):
Validated with:
OCP version: 4.11.0-0.nightly-2022-06-28-160049
OCS version: 4.11.0-107
ceph version: 16.2.8-59.el8cp


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)? No


Is there any workaround available to the best of your knowledge? Yes.
Archiving the crash reports clears the warning, after which the Ceph health check passes.
Example:
[ammahapa@ammahapa ~]$ oc get pods | grep tool
rook-ceph-tools-9f8c8976f-zk8ps                                   1/1     Running    

[ammahapa@ammahapa ~]$ oc -n openshift-storage exec rook-ceph-tools-9f8c8976f-zk8ps -- ceph crash archive-all
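
After archiving, the health check should come back clean. A quick verification (hedged sketch; same toolbox pod as above):

$ oc -n openshift-storage exec rook-ceph-tools-9f8c8976f-zk8ps -- ceph health
# Expected to report HEALTH_OK once the archived crashes were the only outstanding warnings.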

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)? 3


Can this issue be reproduced? It happens intermittently.


Can this issue reproduce from the UI? No


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Deploy an ODF 4.11 cluster
2. Enable NFS using the patch command:
  [ammahapa@ammahapa ~]$ oc patch -n openshift-storage storageclusters.ocs.openshift.io ocs-storagecluster --patch '{"spec": {"nfs":{"enable": true}}}' --type merge

3. Check that the CephNFS resource was created:
  [ammahapa@ammahapa ~]$ oc get cephnfs
NAME                         AGE
ocs-storagecluster-cephnfs   10s

4. Check that the NFS-Ganesha pod is up and running:
[ammahapa@ammahapa ~]$ oc get pods | grep rook-ceph-nfs
rook-ceph-nfs-ocs-storagecluster-cephnfs-a-f7767ddc8-897nq        2/2     Running 

5. Enable the NFS CSI driver (ROOK_CSI_ENABLE_NFS):
oc --namespace openshift-storage patch configmap rook-ceph-operator-config --type merge --patch '{"data":{"ROOK_CSI_ENABLE_NFS": "true"}}'

6. Create NFS PVCs with the storageclass ocs-storagecluster-ceph-nfs (example manifests are sketched after step 7)

7. Create a pod with the NFS PVC mounted
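
The report does not include the manifests used in steps 6 and 7. The following is a minimal sketch with assumed names (nfs-pvc-test, pod-nfs-test), an assumed ReadWriteMany access mode and 1Gi size, and a generic UBI image; only the storage class name and the mypvc volume name are taken from this report:

oc -n openshift-storage apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-pvc-test                  # hypothetical name
spec:
  accessModes:
    - ReadWriteMany                   # assumption; NFS volumes are typically RWX
  resources:
    requests:
      storage: 1Gi                    # assumed size
  storageClassName: ocs-storagecluster-ceph-nfs
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-nfs-test                  # hypothetical name
spec:
  containers:
    - name: app
      image: registry.access.redhat.com/ubi8/ubi-minimal
      command: ["sleep", "3600"]
      volumeMounts:
        - name: mypvc
          mountPath: /mnt/nfs
  volumes:
    - name: mypvc                     # volume name matches the one in the mount events above
      persistentVolumeClaim:
        claimName: nfs-pvc-test
EOF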


Actual results: Sometimes the pod does not move to the Running state and remains in ContainerCreating, even though the PVC is in the Bound state.
Example error events:

Events:
  Type     Reason       Age                 From               Message
  ----     ------       ----                ----               -------
  Normal   Scheduled    5m5s                default-scheduler  Successfully assigned openshift-storage/pod-test-cephfs-271a7a8f35df44a299eab46f to ip-10-0-171-178.us-east-2.compute.internal by ip-10-0-182-125
  Warning  FailedMount  3m5s                kubelet            MountVolume.SetUp failed for volume "pvc-8b42c93f-a1d6-4b26-b828-d84070fd4736" : rpc error: code = DeadlineExceeded desc = context deadline exceeded
  Warning  FailedMount  47s (x2 over 3m2s)  kubelet            Unable to attach or mount volumes: unmounted volumes=[mypvc], unattached volumes=[mypvc kube-api-access-lk5vj]: timed out waiting for the condition
  Warning  FailedMount  35s (x8 over 3m1s)  kubelet            MountVolume.SetUp failed for volume "pvc-8b42c93f-a1d6-4b26-b828-d84070fd4736" : rpc error: code = Internal desc = mount failed: exit status 32


Expected results: The app pod should move to the Running state and the Ceph daemons should not crash.


Additional info:

Comment 3 Mudit Agarwal 2022-07-25 06:59:44 UTC
Not a 4.11 blocker

