Description of problem (please be as detailed as possible and provide log snippets):

In one of our ODF 4.9 deployments in AWS, one of the mon pods failed to start because of a problem mounting its volume. The cluster is configured behind a proxy, but I don't think the failure is related to that.

All pods in the openshift-storage namespace are running except this one:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
rook-ceph-mon-a-5db478b498-ppvxg   0/2   Init:0/2   0   25m   <none>   ip-10-0-64-109.us-east-2.compute.internal   <none>   <none>
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Following are the events from the failing pod:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Type     Reason            Age                   From               Message
----     ------            ----                  ----               -------
Warning  FailedScheduling  42m                   default-scheduler  0/6 nodes are available: 1 node(s) didn't match pod anti-affinity rules, 2 node(s) had volume node affinity conflict, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
Warning  FailedScheduling  42m                   default-scheduler  0/6 nodes are available: 1 node(s) didn't match pod anti-affinity rules, 2 node(s) had volume node affinity conflict, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
Normal   Scheduled         42m                   default-scheduler  Successfully assigned openshift-storage/rook-ceph-mon-a-5db478b498-ppvxg to ip-10-0-64-109.us-east-2.compute.internal
Warning  FailedMount       35m (x2 over 38m)     kubelet            Unable to attach or mount volumes: unmounted volumes=[ceph-daemon-data], unattached volumes=[rook-config-override rook-ceph-mons-keyring rook-ceph-log rook-ceph-crash ceph-daemon-data kube-api-access-v6pg2]: timed out waiting for the condition
Warning  FailedMount       31m                   kubelet            Unable to attach or mount volumes: unmounted volumes=[ceph-daemon-data], unattached volumes=[ceph-daemon-data kube-api-access-v6pg2 rook-config-override rook-ceph-mons-keyring rook-ceph-log rook-ceph-crash]: timed out waiting for the condition
Warning  FailedMount       28m (x3 over 40m)     kubelet            Unable to attach or mount volumes: unmounted volumes=[ceph-daemon-data], unattached volumes=[rook-ceph-crash ceph-daemon-data kube-api-access-v6pg2 rook-config-override rook-ceph-mons-keyring rook-ceph-log]: timed out waiting for the condition
Warning  FailedMount       22m                   kubelet            Unable to attach or mount volumes: unmounted volumes=[ceph-daemon-data], unattached volumes=[kube-api-access-v6pg2 rook-config-override rook-ceph-mons-keyring rook-ceph-log rook-ceph-crash ceph-daemon-data]: timed out waiting for the condition
Warning  FailedMount       19m (x3 over 26m)     kubelet            Unable to attach or mount volumes: unmounted volumes=[ceph-daemon-data], unattached volumes=[rook-ceph-log rook-ceph-crash ceph-daemon-data kube-api-access-v6pg2 rook-config-override rook-ceph-mons-keyring]: timed out waiting for the condition
Warning  FailedMount       7m15s (x25 over 42m)  kubelet            MountVolume.SetUp failed for volume "pvc-b9da5fd9-62e2-4057-9432-743585026318" : mount failed: exit status 32
Mounting command: mount
Mounting arguments: -o bind /var/lib/kubelet/plugins/kubernetes.io/aws-ebs/mounts/aws/us-east-2b/vol-02f13c2bb90670c71 /var/lib/kubelet/pods/a0a171a5-fd59-4d73-9856-6bfa4ee33fc7/volumes/kubernetes.io~aws-ebs/pvc-b9da5fd9-62e2-4057-9432-743585026318
Output: mount: /var/lib/kubelet/pods/a0a171a5-fd59-4d73-9856-6bfa4ee33fc7/volumes/kubernetes.io~aws-ebs/pvc-b9da5fd9-62e2-4057-9432-743585026318: special device /var/lib/kubelet/plugins/kubernetes.io/aws-ebs/mounts/aws/us-east-2b/vol-02f13c2bb90670c71 does not exist.
Warning  FailedMount       108s (x2 over 4m5s)   kubelet            Unable to attach or mount volumes: unmounted volumes=[ceph-daemon-data], unattached volumes=[rook-ceph-mons-keyring rook-ceph-log rook-ceph-crash ceph-daemon-data kube-api-access-v6pg2 rook-config-override]: timed out waiting for the condition
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

List of PVCs looks ok:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NAMESPACE           NAME              STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
openshift-storage   rook-ceph-mon-a   Bound    pvc-b9da5fd9-62e2-4057-9432-743585026318   50Gi       RWO            gp2            25m
openshift-storage   rook-ceph-mon-b   Bound    pvc-472a1a0a-e2e3-49b0-9342-f95641cf2966   50Gi       RWO            gp2            25m
openshift-storage   rook-ceph-mon-c   Bound    pvc-ed46b84a-7b90-43bd-9205-02f882c0160f   50Gi       RWO            gp2            25m
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Version of all relevant components (if applicable):
OCP version: 4.10.0-0.nightly-2022-03-09-043557
odf-operator.v4.9.3
ocs-operator.v4.9.3
mcg-operator.v4.9.3

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?
I'm not aware of any, but the cluster was already destroyed, so I'm not sure if it would have been possible to fix the issue.

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
3

Can this issue be reproduced?
We saw it only once, so it might also have been a temporary AWS issue.

Can this issue be reproduced from the UI?
n/a

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Deploy an OCP cluster behind a proxy.
2. Try to deploy an ODF cluster there.

Actual results:
ODF deployment failed; one of the rook-ceph-mon pods didn't start.

Expected results:
ODF deployment passes and all pods in the openshift-storage namespace are running.

Additional info:
As mentioned above, we saw this issue only once, so it might be some temporary AWS problem.
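Not a fix, but for the record: a minimal triage sketch we could run the next time this reproduces, to confirm whether the EBS volume from the mount error still exists in AWS and is attached to the node the pod was scheduled on. The jsonpath fields assume the in-tree aws-ebs provisioner behind the gp2 storage class; the volume ID, node name, and region below are taken from the events above.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Which PV / EBS volume backs the failing mon PVC
oc get pvc rook-ceph-mon-a -n openshift-storage -o jsonpath='{.spec.volumeName}{"\n"}'
oc get pv pvc-b9da5fd9-62e2-4057-9432-743585026318 -o jsonpath='{.spec.awsElasticBlockStore.volumeID}{"\n"}'

# Does the EBS volume exist in AWS, and is it attached anywhere?
aws ec2 describe-volumes --region us-east-2 --volume-ids vol-02f13c2bb90670c71 \
  --query 'Volumes[0].{State:State,Attachments:Attachments}'

# Is the kubelet global mount point present on the node?
oc debug node/ip-10-0-64-109.us-east-2.compute.internal -- chroot /host \
  ls /var/lib/kubelet/plugins/kubernetes.io/aws-ebs/mounts/aws/us-east-2b/
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~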
The mons are not in quorum and not responding to any commands. There are also no mgr or OSD pods in the cluster, which means there has never been a successful reconcile to create all the Ceph resources.

The operator log shows that the mon reconcile is failing and any ceph status commands are failing [1]:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2022-03-09T21:19:34.825052754Z 2022-03-09 21:19:34.825012 E | ceph-cluster-controller: failed to reconcile CephCluster "openshift-storage/ocs-storagecluster-cephcluster". failed to reconcile cluster "ocs-storagecluster-cephcluster": failed to configure local ceph cluster: failed to create cluster: failed to start ceph monitors: failed to start mon pods: failed to check mon quorum b: failed to wait for mon quorum: exceeded max retry count waiting for monitors to reach quorum
2022-03-09T21:19:34.825052754Z 2022-03-09 21:19:34.825040 I | op-k8sutil: Reporting Event openshift-storage:ocs-storagecluster-cephcluster Warning:ReconcileFailed:failed to reconcile cluster "ocs-storagecluster-cephcluster": failed to configure local ceph cluster: failed to create cluster: failed to start ceph monitors: failed to start mon pods: failed to check mon ...
2022-03-09T21:22:16.678618789Z 2022-03-09 21:22:16.678579 E | clusterdisruption-controller: failed to check cluster health: failed to get status. . timed out: exit status 1
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The mon-a pod shows its EBS volume does not exist [2]:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Warning  FailedMount  7m15s (x25 over 42m)  kubelet  MountVolume.SetUp failed for volume "pvc-b9da5fd9-62e2-4057-9432-743585026318" : mount failed: exit status 32
Mounting command: mount
Mounting arguments: -o bind /var/lib/kubelet/plugins/kubernetes.io/aws-ebs/mounts/aws/us-east-2b/vol-02f13c2bb90670c71 /var/lib/kubelet/pods/a0a171a5-fd59-4d73-9856-6bfa4ee33fc7/volumes/kubernetes.io~aws-ebs/pvc-b9da5fd9-62e2-4057-9432-743585026318
Output: mount: /var/lib/kubelet/pods/a0a171a5-fd59-4d73-9856-6bfa4ee33fc7/volumes/kubernetes.io~aws-ebs/pvc-b9da5fd9-62e2-4057-9432-743585026318: special device /var/lib/kubelet/plugins/kubernetes.io/aws-ebs/mounts/aws/us-east-2b/vol-02f13c2bb90670c71 does not exist.
Warning  FailedMount  108s (x2 over 4m5s)   kubelet  Unable to attach or mount volumes: unmounted volumes=[ceph-daemon-data], unattached volumes=[rook-ceph-mons-keyring rook-ceph-log rook-ceph-crash ceph-daemon-data kube-api-access-v6pg2 rook-config-override]: timed out waiting for the condition
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The simplest resolution may be to scrap the install and reinstall, since the mon volume had an issue provisioning the EBS volume.

[1] http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-010aup3c33-ua/j-010aup3c33-ua_20220309T194925/logs/failed_testcase_ocs_logs_1646855666/deployment_ocs_logs/ocs_must_gather/registry-redhat-io-odf4-ocs-must-gather-rhel8-sha256-820defc6534c620640768027af8ccaa7bdebe12d3868ce5a5b11f64a9f387e86/namespaces/openshift-storage/pods/rook-ceph-operator-67fbd96785-b696q/rook-ceph-operator/rook-ceph-operator/logs/current.log
[2] http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-010aup3c33-ua/j-010aup3c33-ua_20220309T194925/logs/failed_testcase_ocs_logs_1646855666/test_deployment_ocs_logs/ocs_must_gather/registry-redhat-io-odf4-ocs-must-gather-rhel8-sha256-820defc6534c620640768027af8ccaa7bdebe12d3868ce5a5b11f64a9f387e86/namespaces/openshift-storage/oc_output/pods
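For completeness, a sketch of how the failed reconcile and the missing daemons can be confirmed on a live cluster before deciding to scrap and reinstall. The resource name assumes the default ODF-created CephCluster and the label selectors assume Rook's standard app labels:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# CephCluster phase/message should show the ReconcileFailed condition quoted above
oc get cephcluster ocs-storagecluster-cephcluster -n openshift-storage \
  -o jsonpath='{.status.phase}{"\n"}{.status.message}{"\n"}'

# Confirm that no mgr/osd pods were ever created
oc get pods -n openshift-storage -l app=rook-ceph-mgr
oc get pods -n openshift-storage -l app=rook-ceph-osd

# Follow the operator's mon reconcile attempts
oc logs -n openshift-storage deploy/rook-ceph-operator --tail=200 | grep -E 'ceph-cluster-controller|op-mon'
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~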
Shall we close this or move to another component?
I think we can close this bug, because we haven't faced it again, so it might have been some temporary issue on the AWS side. If we face this issue in the future, we can provide more info and reopen it.
(In reply to Daniel Horák from comment #5)
> I think we can close this bug, because we haven't faced it again, so it
> might have been some temporary issue on the AWS side.
> If we face this issue in the future, we can provide more info and reopen it.

Closing then, thanks!