Bug 2062652

Summary: ODF 4.9 deployment in AWS failed because MountVolume.SetUp failed for volume "pvc-..." : mount failed: exit status 32
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Component: rook
Version: 4.9
Status: CLOSED NOTABUG
Severity: low
Priority: unspecified
Reporter: Daniel Horák <dahorak>
Assignee: Travis Nielsen <tnielsen>
QA Contact: Elad <ebenahar>
CC: madam, mmuench, ocs-bugs, odf-bz-bot, shan
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Doc Type: If docs needed, set a value
Type: Bug
Regression: ---
Last Closed: 2022-03-28 15:21:33 UTC

Description Daniel Horák 2022-03-10 10:14:10 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
  In one of our ODF 4.9 deployments in AWS, one of the mon pods failed to start
  because of a problem mounting its volume.

  The cluster is configured behind a proxy, but I don't think the failure is
  related to that.

  All pods in openshift-storage namespace are running except this one:
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  rook-ceph-mon-a-5db478b498-ppvxg                   0/2     Init:0/2   0          25m     <none>        ip-10-0-64-109.us-east-2.compute.internal   <none>           <none>
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  Following are the events from the failing pod:
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  Type     Reason            Age                   From               Message
  ----     ------            ----                  ----               -------
  Warning  FailedScheduling  42m                   default-scheduler  0/6 nodes are available: 1 node(s) didn't match pod anti-affinity rules, 2 node(s) had volume node affinity conflict, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
  Warning  FailedScheduling  42m                   default-scheduler  0/6 nodes are available: 1 node(s) didn't match pod anti-affinity rules, 2 node(s) had volume node affinity conflict, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
  Normal   Scheduled         42m                   default-scheduler  Successfully assigned openshift-storage/rook-ceph-mon-a-5db478b498-ppvxg to ip-10-0-64-109.us-east-2.compute.internal
  Warning  FailedMount       35m (x2 over 38m)     kubelet            Unable to attach or mount volumes: unmounted volumes=[ceph-daemon-data], unattached volumes=[rook-config-override rook-ceph-mons-keyring rook-ceph-log rook-ceph-crash ceph-daemon-data kube-api-access-v6pg2]: timed out waiting for the condition
  Warning  FailedMount       31m                   kubelet            Unable to attach or mount volumes: unmounted volumes=[ceph-daemon-data], unattached volumes=[ceph-daemon-data kube-api-access-v6pg2 rook-config-override rook-ceph-mons-keyring rook-ceph-log rook-ceph-crash]: timed out waiting for the condition
  Warning  FailedMount       28m (x3 over 40m)     kubelet            Unable to attach or mount volumes: unmounted volumes=[ceph-daemon-data], unattached volumes=[rook-ceph-crash ceph-daemon-data kube-api-access-v6pg2 rook-config-override rook-ceph-mons-keyring rook-ceph-log]: timed out waiting for the condition
  Warning  FailedMount       22m                   kubelet            Unable to attach or mount volumes: unmounted volumes=[ceph-daemon-data], unattached volumes=[kube-api-access-v6pg2 rook-config-override rook-ceph-mons-keyring rook-ceph-log rook-ceph-crash ceph-daemon-data]: timed out waiting for the condition
  Warning  FailedMount       19m (x3 over 26m)     kubelet            Unable to attach or mount volumes: unmounted volumes=[ceph-daemon-data], unattached volumes=[rook-ceph-log rook-ceph-crash ceph-daemon-data kube-api-access-v6pg2 rook-config-override rook-ceph-mons-keyring]: timed out waiting for the condition
  Warning  FailedMount       7m15s (x25 over 42m)  kubelet            MountVolume.SetUp failed for volume "pvc-b9da5fd9-62e2-4057-9432-743585026318" : mount failed: exit status 32
Mounting command: mount
Mounting arguments:  -o bind /var/lib/kubelet/plugins/kubernetes.io/aws-ebs/mounts/aws/us-east-2b/vol-02f13c2bb90670c71 /var/lib/kubelet/pods/a0a171a5-fd59-4d73-9856-6bfa4ee33fc7/volumes/kubernetes.io~aws-ebs/pvc-b9da5fd9-62e2-4057-9432-743585026318
Output: mount: /var/lib/kubelet/pods/a0a171a5-fd59-4d73-9856-6bfa4ee33fc7/volumes/kubernetes.io~aws-ebs/pvc-b9da5fd9-62e2-4057-9432-743585026318: special device /var/lib/kubelet/plugins/kubernetes.io/aws-ebs/mounts/aws/us-east-2b/vol-02f13c2bb90670c71 does not exist.
  Warning  FailedMount  108s (x2 over 4m5s)  kubelet  Unable to attach or mount volumes: unmounted volumes=[ceph-daemon-data], unattached volumes=[rook-ceph-mons-keyring rook-ceph-log rook-ceph-crash ceph-daemon-data kube-api-access-v6pg2 rook-config-override]: timed out waiting for the condition
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
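
  The failing bind mount comes from the in-tree AWS EBS plugin: kubelet tries to
  bind-mount the plugin's global mount path for vol-02f13c2bb90670c71 into the pod,
  but that path does not exist on the node. A minimal diagnostic sketch, assuming
  the node is still reachable; the node name and paths are copied from the events
  above:
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  # Open a debug shell on the node the mon pod was scheduled to.
  oc debug node/ip-10-0-64-109.us-east-2.compute.internal
  chroot /host

  # Does the in-tree aws-ebs plugin's global mount point for the volume exist?
  ls -l /var/lib/kubelet/plugins/kubernetes.io/aws-ebs/mounts/aws/us-east-2b/vol-02f13c2bb90670c71

  # Is the EBS volume attached to the node as a block device at all?
  lsblk

  # Any attach/mount errors for the PVC in the kubelet log?
  journalctl -u kubelet | grep -i pvc-b9da5fd9-62e2-4057-9432-743585026318
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~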

  The list of PVCs looks OK:
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  NAMESPACE           NAME              STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
  openshift-storage   rook-ceph-mon-a   Bound    pvc-b9da5fd9-62e2-4057-9432-743585026318   50Gi       RWO            gp2            25m
  openshift-storage   rook-ceph-mon-b   Bound    pvc-472a1a0a-e2e3-49b0-9342-f95641cf2966   50Gi       RWO            gp2            25m
  openshift-storage   rook-ceph-mon-c   Bound    pvc-ed46b84a-7b90-43bd-9205-02f882c0160f   50Gi       RWO            gp2            25m
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
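
  To confirm whether the backing EBS volume really exists, one could map the PVC to
  its PV and check the volume on the AWS side. This is only a sketch: it assumes the
  gp2 StorageClass uses the in-tree kubernetes.io/aws-ebs provisioner (which matches
  the plugin path in the events) and that AWS CLI credentials for the cluster's
  account are available.
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  # PVC -> PV name.
  oc -n openshift-storage get pvc rook-ceph-mon-a -o jsonpath='{.spec.volumeName}{"\n"}'

  # PV -> AWS EBS volume ID (the in-tree provisioner stores it in spec.awsElasticBlockStore).
  oc get pv pvc-b9da5fd9-62e2-4057-9432-743585026318 \
    -o jsonpath='{.spec.awsElasticBlockStore.volumeID}{"\n"}'

  # Check the volume state and attachment on the AWS side.
  aws ec2 describe-volumes --region us-east-2 --volume-ids vol-02f13c2bb90670c71 \
    --query 'Volumes[0].{State:State,Attachments:Attachments}'
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~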


Version of all relevant components (if applicable):
  OCP version: 4.10.0-0.nightly-2022-03-09-043557
  odf-operator.v4.9.3
  ocs-operator.v4.9.3
  mcg-operator.v4.9.3


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?
  I'm not aware of any, but the cluster has already been destroyed, so I'm not
  sure whether it would have been possible to fix the issue.


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
  3


Is this issue reproducible?
  We saw it only once, so it might also be a temporary AWS issue.


Can this issue be reproduced from the UI?
  n/a


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Deploy OCP cluster behind proxy.
2. Try to deploy ODF cluster there.


Actual results:
  ODF deployment failed; one of the rook-ceph-mon pods didn't start.


Expected results:
  ODF deployment passes and all pods in the openshift-storage namespace are running.


Additional info:
  As mentioned above, we have seen this issue only once, so it might be a
  temporary AWS problem.

Comment 3 Travis Nielsen 2022-03-10 18:09:20 UTC
The mons are not in quorum and not responding to any commands. There are also no mgr or OSD pods in the cluster, which means there has never been a successful reconcile to create all the Ceph resources.

The operator log shows that the mon reconcile is failing and that ceph status commands are failing [1]:

2022-03-09T21:19:34.825052754Z 2022-03-09 21:19:34.825012 E | ceph-cluster-controller: failed to reconcile CephCluster "openshift-storage/ocs-storagecluster-cephcluster". failed to reconcile cluster "ocs-storagecluster-cephcluster": failed to configure local ceph cluster: failed to create cluster: failed to start ceph monitors: failed to start mon pods: failed to check mon quorum b: failed to wait for mon quorum: exceeded max retry count waiting for monitors to reach quorum
2022-03-09T21:19:34.825052754Z 2022-03-09 21:19:34.825040 I | op-k8sutil: Reporting Event openshift-storage:ocs-storagecluster-cephcluster Warning:ReconcileFailed:failed to reconcile cluster "ocs-storagecluster-cephcluster": failed to configure local ceph cluster: failed to create cluster: failed to start ceph monitors: failed to start mon pods: failed to check mon 
...
2022-03-09T21:22:16.678618789Z 2022-03-09 21:22:16.678579 E | clusterdisruption-controller: failed to check cluster health: failed to get status. . timed out: exit status 1
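
For reference, mon quorum can be checked directly once the rook-ceph toolbox is enabled; in this state the commands would be expected to hang or time out, matching the operator log above. A hedged sketch only (the enableCephTools toggle on the ocsinit OCSInitialization is the ODF-documented way to get the toolbox; adjust if it differs in your version):
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Enable the rook-ceph toolbox (ODF-specific toggle; adjust if the field differs).
oc -n openshift-storage patch ocsinitialization ocsinit --type json \
  --patch '[{"op": "replace", "path": "/spec/enableCephTools", "value": true}]'

# Once the tools pod is up, check quorum and overall health.
TOOLS_POD=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name | head -n1)
oc -n openshift-storage exec "$TOOLS_POD" -- ceph status
oc -n openshift-storage exec "$TOOLS_POD" -- ceph quorum_status --format json-pretty
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~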

The mon-a pod shows that its EBS volume does not exist [2]:

  Warning  FailedMount       7m15s (x25 over 42m)  kubelet            MountVolume.SetUp failed for volume "pvc-b9da5fd9-62e2-4057-9432-743585026318" : mount failed: exit status 32
Mounting command: mount
Mounting arguments:  -o bind /var/lib/kubelet/plugins/kubernetes.io/aws-ebs/mounts/aws/us-east-2b/vol-02f13c2bb90670c71 /var/lib/kubelet/pods/a0a171a5-fd59-4d73-9856-6bfa4ee33fc7/volumes/kubernetes.io~aws-ebs/pvc-b9da5fd9-62e2-4057-9432-743585026318
Output: mount: /var/lib/kubelet/pods/a0a171a5-fd59-4d73-9856-6bfa4ee33fc7/volumes/kubernetes.io~aws-ebs/pvc-b9da5fd9-62e2-4057-9432-743585026318: special device /var/lib/kubelet/plugins/kubernetes.io/aws-ebs/mounts/aws/us-east-2b/vol-02f13c2bb90670c71 does not exist.
  Warning  FailedMount  108s (x2 over 4m5s)  kubelet  Unable to attach or mount volumes: unmounted volumes=[ceph-daemon-data], unattached volumes=[rook-ceph-mons-keyring rook-ceph-log rook-ceph-crash ceph-daemon-data kube-api-access-v6pg2 rook-config-override]: timed out waiting for the condition


The simplest resolution may be to scrap the install and reinstall, since the mon volume had an issue provisioning its EBS volume.
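
If scrapping the install is the chosen route, a minimal sketch of the retry is below; the full uninstall procedure in the ODF docs covers additional cleanup (CRDs, namespace, node labels), so treat this only as the first step:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Delete the StorageCluster so the operator tears down the Ceph resources,
# then recreate it (or rerun the deployment from the console) to retry.
oc -n openshift-storage delete storagecluster ocs-storagecluster
oc -n openshift-storage get pods -w   # wait for the rook-ceph pods to terminate
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~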

[1] http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-010aup3c33-ua/j-010aup3c33-ua_20220309T194925/logs/failed_testcase_ocs_logs_1646855666/deployment_ocs_logs/ocs_must_gather/registry-redhat-io-odf4-ocs-must-gather-rhel8-sha256-820defc6534c620640768027af8ccaa7bdebe12d3868ce5a5b11f64a9f387e86/namespaces/openshift-storage/pods/rook-ceph-operator-67fbd96785-b696q/rook-ceph-operator/rook-ceph-operator/logs/current.log

[2] http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-010aup3c33-ua/j-010aup3c33-ua_20220309T194925/logs/failed_testcase_ocs_logs_1646855666/test_deployment_ocs_logs/ocs_must_gather/registry-redhat-io-odf4-ocs-must-gather-rhel8-sha256-820defc6534c620640768027af8ccaa7bdebe12d3868ce5a5b11f64a9f387e86/namespaces/openshift-storage/oc_output/pods

Comment 4 Travis Nielsen 2022-03-21 16:33:23 UTC
Shall we close this or move to another component?

Comment 5 Daniel Horák 2022-03-22 10:15:05 UTC
I think we can close this bug, because we haven't faced it again, so it might have been a temporary issue on the AWS side.
If we face this issue again in the future, we can provide more info and reopen it.

Comment 6 Sébastien Han 2022-03-28 15:21:33 UTC
(In reply to Daniel Horák from comment #5)
> I think we can close this bug, because we haven't faced it again, so it
> might have been a temporary issue on the AWS side.
> If we face this issue again in the future, we can provide more info and
> reopen it.

Closing then, thanks!