Description of problem (please be as detailed as possible and provide log snippets):

Arbiter deployment (3M + 6W) failed; alertmanager-main-0 is stuck in the ContainerCreating state.

Version of all relevant components (if applicable):
ocs-registry:4.9.0-228.ci
openshift installer (4.9.0-0.nightly-2021-11-06-034743)

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes, not able to install an arbiter deployment.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Can this issue be reproduced?
3/3

Can this issue be reproduced from the UI?
Not tried

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Install an arbiter deployment using ocs-ci
2. Check alertmanager status
3.

Actual results:
$ oc -n openshift-monitoring get Pod
NAME                  READY   STATUS              RESTARTS   AGE
alertmanager-main-0   0/5     ContainerCreating   0          54m
alertmanager-main-1   0/5     ContainerCreating   0          54m
alertmanager-main-2   0/5     ContainerCreating   0          54m

Expected results:
All the pods in openshift-monitoring should be in Running status.

Additional info:

> $ oc -n openshift-monitoring describe pod alertmanager-main-0
Name:                 alertmanager-main-0
Namespace:            openshift-monitoring
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Node:                 compute-1/10.1.160.236
Start Time:           Mon, 08 Nov 2021 13:57:03 +0530
Labels:               alertmanager=main
                      app=alertmanager
                      app.kubernetes.io/component=alert-router
                      app.kubernetes.io/instance=main
                      app.kubernetes.io/managed-by=prometheus-operator
                      app.kubernetes.io/name=alertmanager
                      app.kubernetes.io/part-of=openshift-monitoring
                      app.kubernetes.io/version=0.22.2
                      controller-revision-hash=alertmanager-main-7677898c78
                      statefulset.kubernetes.io/pod-name=alertmanager-main-0
Annotations:          kubectl.kubernetes.io/default-container: alertmanager
                      openshift.io/scc: nonroot
Status:               Pending
Events:
  Type     Reason                  Age                   From                     Message
  ----     ------                  ----                  ----                     -------
  Warning  FailedScheduling        56m                   default-scheduler        0/9 nodes are available: 9 pod has unbound immediate PersistentVolumeClaims.
  Normal   Scheduled               56m                   default-scheduler        Successfully assigned openshift-monitoring/alertmanager-main-0 to compute-1
  Normal   SuccessfulAttachVolume  56m                   attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-5f571d2e-44e2-4055-a352-bd01349ec438"
  Warning  FailedMount             54m                   kubelet                  MountVolume.MountDevice failed for volume "pvc-5f571d2e-44e2-4055-a352-bd01349ec438" : rpc error: code = Internal desc = rbd: map failed with error an error (exit status 110) occurred while running rbd args: [--id csi-rbd-node -m 172.30.246.98:6789,172.30.118.99:6789,172.30.6.101:6789,172.30.92.74:6789,172.30.57.143:6789 --keyfile=***stripped*** map ocs-storagecluster-cephblockpool/csi-vol-a5f3512f-406d-11ec-93cb-0a580a810213 --device-type krbd --options noudev], rbd error output: rbd: sysfs write failed rbd: map failed: (110) Connection timed out
  Warning  FailedMount             52m                   kubelet                  MountVolume.MountDevice failed for volume "pvc-5f571d2e-44e2-4055-a352-bd01349ec438" : rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000002-a5f3512f-406d-11ec-93cb-0a580a810213 already exists
  Warning  FailedMount             48m (x3 over 52m)     kubelet                  MountVolume.MountDevice failed for volume "pvc-5f571d2e-44e2-4055-a352-bd01349ec438" : rpc error: code = DeadlineExceeded desc = context deadline exceeded
  Warning  FailedMount             42m (x2 over 51m)     kubelet                  Unable to attach or mount volumes: unmounted volumes=[my-alertmanager-claim], unattached volumes=[config-volume tls-assets my-alertmanager-claim secret-alertmanager-main-tls secret-alertmanager-main-proxy secret-alertmanager-kube-rbac-proxy alertmanager-trusted-ca-bundle kube-api-access-qt2lt]: timed out waiting for the condition
  Warning  FailedMount             40m                   kubelet                  Unable to attach or mount volumes: unmounted volumes=[my-alertmanager-claim], unattached volumes=[secret-alertmanager-kube-rbac-proxy alertmanager-trusted-ca-bundle kube-api-access-qt2lt config-volume tls-assets my-alertmanager-claim secret-alertmanager-main-tls secret-alertmanager-main-proxy]: timed out waiting for the condition
  Warning  FailedMount             33m (x4 over 49m)     kubelet                  Unable to attach or mount volumes: unmounted volumes=[my-alertmanager-claim], unattached volumes=[secret-alertmanager-main-tls secret-alertmanager-main-proxy secret-alertmanager-kube-rbac-proxy alertmanager-trusted-ca-bundle kube-api-access-qt2lt config-volume tls-assets my-alertmanager-claim]: timed out waiting for the condition
  Warning  FailedMount             28m (x2 over 47m)     kubelet                  Unable to attach or mount volumes: unmounted volumes=[my-alertmanager-claim], unattached volumes=[kube-api-access-qt2lt config-volume tls-assets my-alertmanager-claim secret-alertmanager-main-tls secret-alertmanager-main-proxy secret-alertmanager-kube-rbac-proxy alertmanager-trusted-ca-bundle]: timed out waiting for the condition
  Warning  FailedMount             25m                   kubelet                  MountVolume.MountDevice failed for volume "pvc-5f571d2e-44e2-4055-a352-bd01349ec438" : rpc error: code = Internal desc = rbd: map failed with error an error (exit status 110) occurred while running rbd args: [--id csi-rbd-node -m 172.30.118.99:6789,172.30.6.101:6789,172.30.92.74:6789,172.30.57.143:6789,172.30.246.98:6789 --keyfile=***stripped*** map ocs-storagecluster-cephblockpool/csi-vol-a5f3512f-406d-11ec-93cb-0a580a810213 --device-type krbd --options noudev], rbd error output: rbd: sysfs write failed rbd: map failed: (110) Connection timed out
  Warning  FailedMount             24m (x2 over 26m)     kubelet                  Unable to attach or mount volumes: unmounted volumes=[my-alertmanager-claim], unattached volumes=[secret-alertmanager-main-proxy secret-alertmanager-kube-rbac-proxy alertmanager-trusted-ca-bundle kube-api-access-qt2lt config-volume tls-assets my-alertmanager-claim secret-alertmanager-main-tls]: timed out waiting for the condition
  Warning  FailedMount             22m (x2 over 53m)     kubelet                  Unable to attach or mount volumes: unmounted volumes=[my-alertmanager-claim], unattached volumes=[tls-assets my-alertmanager-claim secret-alertmanager-main-tls secret-alertmanager-main-proxy secret-alertmanager-kube-rbac-proxy alertmanager-trusted-ca-bundle kube-api-access-qt2lt config-volume]: timed out waiting for the condition
  Warning  FailedMount             22m                   kubelet                  MountVolume.MountDevice failed for volume "pvc-5f571d2e-44e2-4055-a352-bd01349ec438" : rpc error: code = Internal desc = rbd: map failed with error an error (exit status 110) occurred while running rbd args: [--id csi-rbd-node -m 172.30.92.74:6789,172.30.57.143:6789,172.30.246.98:6789,172.30.118.99:6789,172.30.6.101:6789 --keyfile=***stripped*** map ocs-storagecluster-cephblockpool/csi-vol-a5f3512f-406d-11ec-93cb-0a580a810213 --device-type krbd --options noudev], rbd error output: rbd: sysfs write failed rbd: map failed: (110) Connection timed out
  Warning  FailedMount             19m (x7 over 38m)     kubelet                  (combined from similar events): MountVolume.MountDevice failed for volume "pvc-5f571d2e-44e2-4055-a352-bd01349ec438" : rpc error: code = Internal desc = rbd: map failed with error an error (exit status 110) occurred while running rbd args: [--id csi-rbd-node -m 172.30.57.143:6789,172.30.246.98:6789,172.30.118.99:6789,172.30.6.101:6789,172.30.92.74:6789 --keyfile=***stripped*** map ocs-storagecluster-cephblockpool/csi-vol-a5f3512f-406d-11ec-93cb-0a580a810213 --device-type krbd --options noudev], rbd error output: rbd: sysfs write failed rbd: map failed: (110) Connection timed out
  Warning  FailedMount             9m51s (x7 over 46m)   kubelet                  MountVolume.MountDevice failed for volume "pvc-5f571d2e-44e2-4055-a352-bd01349ec438" : rpc error: code = Internal desc = rbd: map failed with error an error (exit status 110) occurred while running rbd args: [--id csi-rbd-node -m 172.30.57.143:6789,172.30.246.98:6789,172.30.118.99:6789,172.30.6.101:6789,172.30.92.74:6789 --keyfile=***stripped*** map ocs-storagecluster-cephblockpool/csi-vol-a5f3512f-406d-11ec-93cb-0a580a810213 --device-type krbd --options noudev], rbd error output: rbd: sysfs write failed rbd: map failed: (110) Connection timed out
  Warning  FailedMount             3m58s (x4 over 13m)   kubelet                  Unable to attach or mount volumes: unmounted volumes=[my-alertmanager-claim], unattached volumes=[my-alertmanager-claim secret-alertmanager-main-tls secret-alertmanager-main-proxy secret-alertmanager-kube-rbac-proxy alertmanager-trusted-ca-bundle kube-api-access-qt2lt config-volume tls-assets]: timed out waiting for the condition

> pvc
$ oc -n openshift-monitoring get pvc
NAME                                        STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  AGE
my-alertmanager-claim-alertmanager-main-0   Bound    pvc-5f571d2e-44e2-4055-a352-bd01349ec438   40Gi       RWO            ocs-storagecluster-ceph-rbd   57m
my-alertmanager-claim-alertmanager-main-1   Bound    pvc-95b4edec-47e9-469c-8645-01ccf769cfd7   40Gi       RWO            ocs-storagecluster-ceph-rbd   57m
my-alertmanager-claim-alertmanager-main-2   Bound    pvc-3803c53e-c3cc-429b-b3b8-6f16fd5f5e60   40Gi       RWO            ocs-storagecluster-ceph-rbd   57m
my-prometheus-claim-prometheus-k8s-0        Bound    pvc-0cb2311a-c412-46c6-afb8-a4355ea9574f   40Gi       RWO            ocs-storagecluster-ceph-rbd   57m
my-prometheus-claim-prometheus-k8s-1        Bound    pvc-c9f5aa2d-cd83-4c55-8c6a-d57dcc7e2cd8   40Gi       RWO            ocs-storagecluster-ceph-rbd   57m

> job: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/2241/console
> must gather: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-012vu2clva36-t4b/j-012vu2clva36-t4b_20211108T073634/logs/failed_testcase_ocs_logs_1636357274/test_deployment_ocs_logs/ocs_must_gather/
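(Suggestion, not from the original report.) Since every failed mount above ends in "rbd: map failed: (110) Connection timed out" against the same set of mon addresses, one quick check is whether those mon endpoints are reachable at all from the node the pod was scheduled on. A rough sketch, assuming cluster-admin access and reusing the node name (compute-1) and mon IPs/port from the events above:

# Check basic TCP reachability of each Ceph mon endpoint from the host
# namespace of compute-1; 6789 is the mon v1 port used by the failing rbd map.
for mon in 172.30.246.98 172.30.118.99 172.30.6.101 172.30.92.74 172.30.57.143; do
  if oc debug node/compute-1 -- chroot /host timeout 5 bash -c "echo > /dev/tcp/${mon}/6789" 2>/dev/null; then
    echo "${mon}:6789 reachable"
  else
    echo "${mon}:6789 NOT reachable"
  fi
done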
Vijay, can you please gather dmesg logs from the node where the mount is failing? Also, when did this test last pass? I don't remember any recent changes in this area.
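For reference, a minimal way to collect those dmesg logs (a sketch, assuming cluster-admin access; compute-1 is the node from the describe output above, adjust as needed):

# Dump kernel messages from the affected node and keep a local copy.
oc debug node/compute-1 -- chroot /host dmesg -T > compute-1-dmesg.log

# Quick scan for the krbd/libceph errors behind the "map failed: (110)" events.
grep -iE 'rbd|libceph' compute-1-dmesg.log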
Just connected to the cluster to check the status of the pods in the monitoring namespace:

$ oc -n openshift-monitoring get pod
NAME                                           READY   STATUS              RESTARTS   AGE
alertmanager-main-0                            0/5     ContainerCreating   0          11h
alertmanager-main-1                            0/5     ContainerCreating   0          11h
alertmanager-main-2                            0/5     ContainerCreating   0          11h
cluster-monitoring-operator-75f48597b5-tdw5m   2/2     Running             0          11h
grafana-756487f787-hwlqt                       2/2     Running             0          11h
kube-state-metrics-6bcc85759f-n2gm2            3/3     Running             0          11h
node-exporter-7pw44                            2/2     Running             2          12h
node-exporter-hkdsp                            2/2     Running             2          11h
node-exporter-jkbsq                            2/2     Running             2          11h
node-exporter-kxvd9                            2/2     Running             2          11h
node-exporter-l5n96                            2/2     Running             2          12h
node-exporter-pfn5v                            2/2     Running             2          12h
node-exporter-pzb5w                            2/2     Running             2          11h
node-exporter-t97z7                            2/2     Running             2          11h
node-exporter-tgw7c                            2/2     Running             2          11h
openshift-state-metrics-769bdd45bc-4j6wx       3/3     Running             0          11h
prometheus-adapter-578ff485dd-hkgrd            1/1     Running             0          11h
prometheus-adapter-578ff485dd-zv2rg            1/1     Running             0          11h
prometheus-k8s-0                               0/7     Init:0/1            0          11h
prometheus-k8s-1                               0/7     Init:0/1            0          11h
prometheus-operator-5c5f4d6d94-mb7bf           2/2     Running             0          11h
telemeter-client-66fc8dd69f-xtx62              3/3     Running             0          11h
thanos-querier-5b767cc45c-8x55h                5/5     Running             0          11h
thanos-querier-5b767cc45c-rxscl                5/5     Running             0          11h

pbalogh@pbalogh-mac arbiter-bug $ oc -n openshift-monitoring get pvc
NAME                                        STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  AGE
my-alertmanager-claim-alertmanager-main-0   Bound    pvc-5f571d2e-44e2-4055-a352-bd01349ec438   40Gi       RWO            ocs-storagecluster-ceph-rbd   11h
my-alertmanager-claim-alertmanager-main-1   Bound    pvc-95b4edec-47e9-469c-8645-01ccf769cfd7   40Gi       RWO            ocs-storagecluster-ceph-rbd   11h
my-alertmanager-claim-alertmanager-main-2   Bound    pvc-3803c53e-c3cc-429b-b3b8-6f16fd5f5e60   40Gi       RWO            ocs-storagecluster-ceph-rbd   11h
my-prometheus-claim-prometheus-k8s-0        Bound    pvc-0cb2311a-c412-46c6-afb8-a4355ea9574f   40Gi       RWO            ocs-storagecluster-ceph-rbd   11h
my-prometheus-claim-prometheus-k8s-1        Bound    pvc-c9f5aa2d-cd83-4c55-8c6a-d57dcc7e2cd8   40Gi       RWO            ocs-storagecluster-ceph-rbd   11h

pbalogh@pbalogh-mac arbiter-bug $ oc -n openshift-monitoring get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                                            STORAGECLASS                  REASON   AGE
local-pv-1b0a6be3                          512Gi      RWO            Delete           Bound    openshift-storage/ocs-deviceset-localblock-1-data-1kxzq6         localblock                             11h
local-pv-28399b17                          512Gi      RWO            Delete           Bound    openshift-storage/ocs-deviceset-localblock-2-data-05bwtr         localblock                             11h
local-pv-2e878e2b                          512Gi      RWO            Delete           Bound    openshift-storage/ocs-deviceset-localblock-2-data-26fps9         localblock                             11h
local-pv-6a965969                          512Gi      RWO            Delete           Bound    openshift-storage/ocs-deviceset-localblock-0-data-2b2fng         localblock                             11h
local-pv-7d8fb2a3                          512Gi      RWO            Delete           Bound    openshift-storage/ocs-deviceset-localblock-1-data-26rw9g         localblock                             11h
local-pv-91941fae                          512Gi      RWO            Delete           Bound    openshift-storage/ocs-deviceset-localblock-2-data-1542rl         localblock                             11h
local-pv-949a2847                          512Gi      RWO            Delete           Bound    openshift-storage/ocs-deviceset-localblock-3-data-2g6rnn         localblock                             11h
local-pv-b5299f85                          512Gi      RWO            Delete           Bound    openshift-storage/ocs-deviceset-localblock-0-data-1ftw7d         localblock                             11h
local-pv-b75bfb86                          512Gi      RWO            Delete           Bound    openshift-storage/ocs-deviceset-localblock-0-data-0zlttw         localblock                             11h
local-pv-be797c94                          512Gi      RWO            Delete           Bound    openshift-storage/ocs-deviceset-localblock-3-data-1kdf9d         localblock                             11h
local-pv-d49b092a                          512Gi      RWO            Delete           Bound    openshift-storage/ocs-deviceset-localblock-3-data-09nncq         localblock                             11h
local-pv-e112c5d2                          512Gi      RWO            Delete           Bound    openshift-storage/ocs-deviceset-localblock-1-data-0t7qrm         localblock                             11h
pvc-0cb2311a-c412-46c6-afb8-a4355ea9574f   40Gi       RWO            Delete           Bound    openshift-monitoring/my-prometheus-claim-prometheus-k8s-0        ocs-storagecluster-ceph-rbd            11h
pvc-3803c53e-c3cc-429b-b3b8-6f16fd5f5e60   40Gi       RWO            Delete           Bound    openshift-monitoring/my-alertmanager-claim-alertmanager-main-2   ocs-storagecluster-ceph-rbd            11h
pvc-5f571d2e-44e2-4055-a352-bd01349ec438   40Gi       RWO            Delete           Bound    openshift-monitoring/my-alertmanager-claim-alertmanager-main-0   ocs-storagecluster-ceph-rbd            11h
pvc-893c6731-3f7b-4819-9876-9fe78bf94f79   50Gi       RWO            Delete           Bound    openshift-storage/db-noobaa-db-pg-0                              ocs-storagecluster-ceph-rbd            11h
pvc-95b4edec-47e9-469c-8645-01ccf769cfd7   40Gi       RWO            Delete           Bound    openshift-monitoring/my-alertmanager-claim-alertmanager-main-1   ocs-storagecluster-ceph-rbd            11h
pvc-c9f5aa2d-cd83-4c55-8c6a-d57dcc7e2cd8   40Gi       RWO            Delete           Bound    openshift-monitoring/my-prometheus-claim-prometheus-k8s-1        ocs-storagecluster-ceph-rbd            11h

Events from oc -n openshift-monitoring describe pod prometheus-k8s-0:

Events:
  Type     Reason       Age                   From     Message
  ----     ------       ----                  ----     -------
  Warning  FailedMount  79m (x16 over 9h)     kubelet  MountVolume.MountDevice failed for volume "pvc-0cb2311a-c412-46c6-afb8-a4355ea9574f" : rpc error: code = Internal desc = rbd: map failed with error an error (exit status 110) occurred while running rbd args: [--id csi-rbd-node -m 172.30.246.98:6789,172.30.118.99:6789,172.30.6.101:6789,172.30.92.74:6789,172.30.57.143:6789 --keyfile=***stripped*** map ocs-storagecluster-cephblockpool/csi-vol-a2807ae4-406d-11ec-93cb-0a580a810213 --device-type krbd --options noudev], rbd error output: rbd: sysfs write failed rbd: map failed: (110) Connection timed out
  Warning  FailedMount  55m (x24 over 11h)    kubelet  MountVolume.MountDevice failed for volume "pvc-0cb2311a-c412-46c6-afb8-a4355ea9574f" : rpc error: code = Internal desc = rbd: map failed with error an error (exit status 110) occurred while running rbd args: [--id csi-rbd-node -m 172.30.92.74:6789,172.30.57.143:6789,172.30.246.98:6789,172.30.118.99:6789,172.30.6.101:6789 --keyfile=***stripped*** map ocs-storagecluster-cephblockpool/csi-vol-a2807ae4-406d-11ec-93cb-0a580a810213 --device-type krbd --options noudev], rbd error output: rbd: sysfs write failed rbd: map failed: (110) Connection timed out
  Warning  FailedMount  46m (x23 over 11h)    kubelet  MountVolume.MountDevice failed for volume "pvc-0cb2311a-c412-46c6-afb8-a4355ea9574f" : rpc error: code = Internal desc = rbd: map failed with error an error (exit status 110) occurred while running rbd args: [--id csi-rbd-node -m 172.30.6.101:6789,172.30.92.74:6789,172.30.57.143:6789,172.30.246.98:6789,172.30.118.99:6789 --keyfile=***stripped*** map ocs-storagecluster-cephblockpool/csi-vol-a2807ae4-406d-11ec-93cb-0a580a810213 --device-type krbd --options noudev], rbd error output: rbd: sysfs write failed rbd: map failed: (110) Connection timed out
  Warning  FailedMount  15m (x27 over 11h)    kubelet  MountVolume.MountDevice failed for volume "pvc-0cb2311a-c412-46c6-afb8-a4355ea9574f" : rpc error: code = Internal desc = rbd: map failed with error an error (exit status 110) occurred while running rbd args: [--id csi-rbd-node -m 172.30.57.143:6789,172.30.246.98:6789,172.30.118.99:6789,172.30.6.101:6789,172.30.92.74:6789 --keyfile=***stripped*** map ocs-storagecluster-cephblockpool/csi-vol-a2807ae4-406d-11ec-93cb-0a580a810213 --device-type krbd --options noudev], rbd error output: rbd: sysfs write failed rbd: map failed: (110) Connection timed out
  Warning  FailedMount  6m15s (x66 over 10h)  kubelet  MountVolume.MountDevice failed for volume "pvc-0cb2311a-c412-46c6-afb8-a4355ea9574f" : rpc error: code = Internal desc = rbd: map failed with error an error (exit status 110) occurred while running rbd args: [--id csi-rbd-node -m 172.30.118.99:6789,172.30.6.101:6789,172.30.92.74:6789,172.30.57.143:6789,172.30.246.98:6789 --keyfile=***stripped*** map ocs-storagecluster-cephblockpool/csi-vol-a2807ae4-406d-11ec-93cb-0a580a810213 --device-type krbd --options noudev], rbd error output: rbd: sysfs write failed rbd: map failed: (110) Connection timed out
  Warning  FailedMount  43s (x355 over 11h)   kubelet  (combined from similar events): Unable to attach or mount volumes: unmounted volumes=[my-prometheus-claim], unattached volumes=[prometheus-trusted-ca-bundle tls-assets config secret-metrics-client-certs secret-kube-rbac-proxy prometheus-k8s-rulefiles-0 secret-prometheus-k8s-htpasswd secret-prometheus-k8s-proxy configmap-kubelet-serving-ca-bundle configmap-serving-certs-ca-bundle secret-prometheus-k8s-thanos-sidecar-tls config-out my-prometheus-claim web-config secret-kube-etcd-client-certs metrics-client-ca secret-grpc-tls kube-api-access-2nl5n secret-prometheus-k8s-tls]: timed out waiting for the condition
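(Suggestion, not from the original comment.) Since both the alertmanager and prometheus PVCs fail to map against the same mon endpoints, it may also be worth checking the Ceph side directly. A sketch, assuming the rook-ceph toolbox is used (it is not deployed by default; the ocsinitialization patch below is the usual way to enable it in OCS 4.x):

# Enable the Ceph toolbox pod.
oc patch ocsinitialization ocsinit -n openshift-storage --type json \
  --patch '[{"op": "replace", "path": "/spec/enableCephTools", "value": true}]'

# Once the toolbox pod is running, check overall health and the mon quorum
# that csi-rbd-node is trying to reach.
TOOLS_POD=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name | head -n 1)
oc -n openshift-storage exec "${TOOLS_POD}" -- ceph status
oc -n openshift-storage exec "${TOOLS_POD}" -- ceph mon dump
oc -n openshift-storage exec "${TOOLS_POD}" -- ceph osd pool stats ocs-storagecluster-cephblockpool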
BTW, on 4.9 we haven't had any successful deployment of this combination with Arbiter because of other blockers we hit earlier: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-trigger-vsphere-upi-2az-rhcos-lso-vmdk-arbiter-3m-6w-tier4b/3/

The last run where this combination worked was on 4.8, started 3 months 15 days ago. Since we don't run this combination on a regular basis - mainly for the first y-stream versions like 4.8.0 and 4.9.0 - we don't have many results for it.
Ilya/Sunny, please take a look.
Hello,

Any update here? We are still blocking the resource and cluster because of this. If no one replies here, we are going to destroy the cluster by tomorrow.

@muagarwa FYI
(In reply to Petr Balogh from comment #12)
> Hello,
>
> Any update here? We are still blocking the resource and cluster because of
> this. If no one replies here, we are going to destroy the cluster by
> tomorrow.
>
> @muagarwa FYI

You can clean it up. The logging makes clear what's happened.
(In reply to Greg Farnum from comment #14)
> (In reply to Petr Balogh from comment #12)
> > Hello,
> >
> > Any update here? We are still blocking the resource and cluster because of
> > this. If no one replies here, we are going to destroy the cluster by
> > tomorrow.
> >
> > @muagarwa FYI
>
> You can clean it up. The logging makes clear what's happened.

Thanks. Destroyed the cluster.
Running verification job here: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-trigger-vsphere-upi-2az-rhcos-lso-vmdk-arbiter-3m-6w-tier4b/13
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/2335/console

I see the deployment passed on this job, and it is now running the tier4b suite. As the deployment that this issue blocked has now passed, I am marking this one as verified.