Bug 2021068
| Field | Value |
|---|---|
| Summary | [Tracker for Ceph BZ #2022190] [arbiter]: alertmanager-main-0 is in ContainerCreating state |
| Product | [Red Hat Storage] Red Hat OpenShift Data Foundation |
| Reporter | Vijay Avuthu <vavuthu> |
| Component | ceph |
| Assignee | Greg Farnum <gfarnum> |
| Status | CLOSED CURRENTRELEASE |
| QA Contact | Petr Balogh <pbalogh> |
| Severity | urgent |
| Docs Contact | |
| Priority | unspecified |
| Version | 4.9 |
| CC | bniver, ebenahar, hnallurv, idryomov, jarrpa, madam, mbukatov, mmuench, mrajanna, muagarwa, ocs-bugs, odf-bz-bot, owasserm, pbalogh, rcyriac, rtalur, sostapov, sunkumar |
| Target Milestone | --- |
| Keywords | Automation, Regression, Tracking |
| Target Release | ODF 4.9.0 |
| Hardware | Unspecified |
| OS | Unspecified |
| Whiteboard | |
| Fixed In Version | v4.9.0-247.ci |
| Doc Type | No Doc Update |
| Doc Text | |
| Story Points | --- |
| Clone Of | |
| | 2022190 (view as bug list) |
| Environment | |
| Last Closed | 2022-01-07 17:46:31 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| CRM | |
| Verified Versions | |
| Category | --- |
| oVirt Team | --- |
| RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- |
| Target Upstream Version | |
| Embargoed | |
| Bug Depends On | 2022190, 2025079 |
| Bug Blocks | 1974344, 1992247, 2029744 |
Description (Vijay Avuthu, 2021-11-08 09:25:41 UTC)
Vijay, can you please gather dmesg logs on the node where the mount is failing? Also, when did this test last pass? I don't remember any recent changes in this area.

I just connected to the cluster to check the status of the pods in the monitoring namespace:

    $ oc -n openshift-monitoring get pod
    NAME                                           READY   STATUS              RESTARTS   AGE
    alertmanager-main-0                            0/5     ContainerCreating   0          11h
    alertmanager-main-1                            0/5     ContainerCreating   0          11h
    alertmanager-main-2                            0/5     ContainerCreating   0          11h
    cluster-monitoring-operator-75f48597b5-tdw5m   2/2     Running             0          11h
    grafana-756487f787-hwlqt                       2/2     Running             0          11h
    kube-state-metrics-6bcc85759f-n2gm2            3/3     Running             0          11h
    node-exporter-7pw44                            2/2     Running             2          12h
    node-exporter-hkdsp                            2/2     Running             2          11h
    node-exporter-jkbsq                            2/2     Running             2          11h
    node-exporter-kxvd9                            2/2     Running             2          11h
    node-exporter-l5n96                            2/2     Running             2          12h
    node-exporter-pfn5v                            2/2     Running             2          12h
    node-exporter-pzb5w                            2/2     Running             2          11h
    node-exporter-t97z7                            2/2     Running             2          11h
    node-exporter-tgw7c                            2/2     Running             2          11h
    openshift-state-metrics-769bdd45bc-4j6wx       3/3     Running             0          11h
    prometheus-adapter-578ff485dd-hkgrd            1/1     Running             0          11h
    prometheus-adapter-578ff485dd-zv2rg            1/1     Running             0          11h
    prometheus-k8s-0                               0/7     Init:0/1            0          11h
    prometheus-k8s-1                               0/7     Init:0/1            0          11h
    prometheus-operator-5c5f4d6d94-mb7bf           2/2     Running             0          11h
    telemeter-client-66fc8dd69f-xtx62              3/3     Running             0          11h
    thanos-querier-5b767cc45c-8x55h                5/5     Running             0          11h
    thanos-querier-5b767cc45c-rxscl                5/5     Running             0          11h

    pbalogh@pbalogh-mac arbiter-bug $ oc -n openshift-monitoring get pvc
    NAME                                        STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  AGE
    my-alertmanager-claim-alertmanager-main-0   Bound    pvc-5f571d2e-44e2-4055-a352-bd01349ec438   40Gi       RWO            ocs-storagecluster-ceph-rbd   11h
    my-alertmanager-claim-alertmanager-main-1   Bound    pvc-95b4edec-47e9-469c-8645-01ccf769cfd7   40Gi       RWO            ocs-storagecluster-ceph-rbd   11h
    my-alertmanager-claim-alertmanager-main-2   Bound    pvc-3803c53e-c3cc-429b-b3b8-6f16fd5f5e60   40Gi       RWO            ocs-storagecluster-ceph-rbd   11h
    my-prometheus-claim-prometheus-k8s-0        Bound    pvc-0cb2311a-c412-46c6-afb8-a4355ea9574f   40Gi       RWO            ocs-storagecluster-ceph-rbd   11h
    my-prometheus-claim-prometheus-k8s-1        Bound    pvc-c9f5aa2d-cd83-4c55-8c6a-d57dcc7e2cd8   40Gi       RWO            ocs-storagecluster-ceph-rbd   11h

    pbalogh@pbalogh-mac arbiter-bug $ oc -n openshift-monitoring get pv
    NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                                             STORAGECLASS                  REASON   AGE
    local-pv-1b0a6be3                          512Gi      RWO            Delete           Bound    openshift-storage/ocs-deviceset-localblock-1-data-1kxzq6          localblock                             11h
    local-pv-28399b17                          512Gi      RWO            Delete           Bound    openshift-storage/ocs-deviceset-localblock-2-data-05bwtr          localblock                             11h
    local-pv-2e878e2b                          512Gi      RWO            Delete           Bound    openshift-storage/ocs-deviceset-localblock-2-data-26fps9          localblock                             11h
    local-pv-6a965969                          512Gi      RWO            Delete           Bound    openshift-storage/ocs-deviceset-localblock-0-data-2b2fng          localblock                             11h
    local-pv-7d8fb2a3                          512Gi      RWO            Delete           Bound    openshift-storage/ocs-deviceset-localblock-1-data-26rw9g          localblock                             11h
    local-pv-91941fae                          512Gi      RWO            Delete           Bound    openshift-storage/ocs-deviceset-localblock-2-data-1542rl          localblock                             11h
    local-pv-949a2847                          512Gi      RWO            Delete           Bound    openshift-storage/ocs-deviceset-localblock-3-data-2g6rnn          localblock                             11h
    local-pv-b5299f85                          512Gi      RWO            Delete           Bound    openshift-storage/ocs-deviceset-localblock-0-data-1ftw7d          localblock                             11h
    local-pv-b75bfb86                          512Gi      RWO            Delete           Bound    openshift-storage/ocs-deviceset-localblock-0-data-0zlttw          localblock                             11h
    local-pv-be797c94                          512Gi      RWO            Delete           Bound    openshift-storage/ocs-deviceset-localblock-3-data-1kdf9d          localblock                             11h
    local-pv-d49b092a                          512Gi      RWO            Delete           Bound    openshift-storage/ocs-deviceset-localblock-3-data-09nncq          localblock                             11h
    local-pv-e112c5d2                          512Gi      RWO            Delete           Bound    openshift-storage/ocs-deviceset-localblock-1-data-0t7qrm          localblock                             11h
    pvc-0cb2311a-c412-46c6-afb8-a4355ea9574f   40Gi       RWO            Delete           Bound    openshift-monitoring/my-prometheus-claim-prometheus-k8s-0         ocs-storagecluster-ceph-rbd            11h
    pvc-3803c53e-c3cc-429b-b3b8-6f16fd5f5e60   40Gi       RWO            Delete           Bound    openshift-monitoring/my-alertmanager-claim-alertmanager-main-2    ocs-storagecluster-ceph-rbd            11h
    pvc-5f571d2e-44e2-4055-a352-bd01349ec438   40Gi       RWO            Delete           Bound    openshift-monitoring/my-alertmanager-claim-alertmanager-main-0    ocs-storagecluster-ceph-rbd            11h
    pvc-893c6731-3f7b-4819-9876-9fe78bf94f79   50Gi       RWO            Delete           Bound    openshift-storage/db-noobaa-db-pg-0                               ocs-storagecluster-ceph-rbd            11h
    pvc-95b4edec-47e9-469c-8645-01ccf769cfd7   40Gi       RWO            Delete           Bound    openshift-monitoring/my-alertmanager-claim-alertmanager-main-1    ocs-storagecluster-ceph-rbd            11h
    pvc-c9f5aa2d-cd83-4c55-8c6a-d57dcc7e2cd8   40Gi       RWO            Delete           Bound    openshift-monitoring/my-prometheus-claim-prometheus-k8s-1         ocs-storagecluster-ceph-rbd            11h
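All of the RBD-backed PVCs are Bound while the consuming pods sit in ContainerCreating/Init, so provisioning succeeded and the failure is in the node-side MountDevice (rbd map) step. A minimal triage sketch for that situation, assuming cluster-admin access; the label selector, container name, and placeholder pod name follow the usual ODF/ceph-csi naming and should be treated as assumptions:

```shell
# Find the node the stuck pod is scheduled on
oc -n openshift-monitoring get pod prometheus-k8s-0 -o wide

# Locate the RBD node-plugin pod running on that node
# (label is the one ceph-csi/ODF normally uses; adjust if it differs)
oc -n openshift-storage get pods -l app=csi-rbdplugin -o wide

# Inspect that plugin's logs for the failing NodeStageVolume / rbd map attempts
oc -n openshift-storage logs <csi-rbdplugin-pod-on-that-node> -c csi-rbdplugin | grep -i 'map failed'
```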
Events from oc -n openshift-monitoring describe pod prometheus-k8s-0:

    Events:
      Type     Reason       Age                   From     Message
      ----     ------       ----                  ----     -------
      Warning  FailedMount  79m (x16 over 9h)     kubelet  MountVolume.MountDevice failed for volume "pvc-0cb2311a-c412-46c6-afb8-a4355ea9574f" : rpc error: code = Internal desc = rbd: map failed with error an error (exit status 110) occurred while running rbd args: [--id csi-rbd-node -m 172.30.246.98:6789,172.30.118.99:6789,172.30.6.101:6789,172.30.92.74:6789,172.30.57.143:6789 --keyfile=***stripped*** map ocs-storagecluster-cephblockpool/csi-vol-a2807ae4-406d-11ec-93cb-0a580a810213 --device-type krbd --options noudev], rbd error output: rbd: sysfs write failed rbd: map failed: (110) Connection timed out
      Warning  FailedMount  55m (x24 over 11h)    kubelet  MountVolume.MountDevice failed for volume "pvc-0cb2311a-c412-46c6-afb8-a4355ea9574f" : rpc error: code = Internal desc = rbd: map failed with error an error (exit status 110) occurred while running rbd args: [--id csi-rbd-node -m 172.30.92.74:6789,172.30.57.143:6789,172.30.246.98:6789,172.30.118.99:6789,172.30.6.101:6789 --keyfile=***stripped*** map ocs-storagecluster-cephblockpool/csi-vol-a2807ae4-406d-11ec-93cb-0a580a810213 --device-type krbd --options noudev], rbd error output: rbd: sysfs write failed rbd: map failed: (110) Connection timed out
      Warning  FailedMount  46m (x23 over 11h)    kubelet  MountVolume.MountDevice failed for volume "pvc-0cb2311a-c412-46c6-afb8-a4355ea9574f" : rpc error: code = Internal desc = rbd: map failed with error an error (exit status 110) occurred while running rbd args: [--id csi-rbd-node -m 172.30.6.101:6789,172.30.92.74:6789,172.30.57.143:6789,172.30.246.98:6789,172.30.118.99:6789 --keyfile=***stripped*** map ocs-storagecluster-cephblockpool/csi-vol-a2807ae4-406d-11ec-93cb-0a580a810213 --device-type krbd --options noudev], rbd error output: rbd: sysfs write failed rbd: map failed: (110) Connection timed out
      Warning  FailedMount  15m (x27 over 11h)    kubelet  MountVolume.MountDevice failed for volume "pvc-0cb2311a-c412-46c6-afb8-a4355ea9574f" : rpc error: code = Internal desc = rbd: map failed with error an error (exit status 110) occurred while running rbd args: [--id csi-rbd-node -m 172.30.57.143:6789,172.30.246.98:6789,172.30.118.99:6789,172.30.6.101:6789,172.30.92.74:6789 --keyfile=***stripped*** map ocs-storagecluster-cephblockpool/csi-vol-a2807ae4-406d-11ec-93cb-0a580a810213 --device-type krbd --options noudev], rbd error output: rbd: sysfs write failed rbd: map failed: (110) Connection timed out
      Warning  FailedMount  6m15s (x66 over 10h)  kubelet  MountVolume.MountDevice failed for volume "pvc-0cb2311a-c412-46c6-afb8-a4355ea9574f" : rpc error: code = Internal desc = rbd: map failed with error an error (exit status 110) occurred while running rbd args: [--id csi-rbd-node -m 172.30.118.99:6789,172.30.6.101:6789,172.30.92.74:6789,172.30.57.143:6789,172.30.246.98:6789 --keyfile=***stripped*** map ocs-storagecluster-cephblockpool/csi-vol-a2807ae4-406d-11ec-93cb-0a580a810213 --device-type krbd --options noudev], rbd error output: rbd: sysfs write failed rbd: map failed: (110) Connection timed out
      Warning  FailedMount  43s (x355 over 11h)   kubelet  (combined from similar events): Unable to attach or mount volumes: unmounted volumes=[my-prometheus-claim], unattached volumes=[prometheus-trusted-ca-bundle tls-assets config secret-metrics-client-certs secret-kube-rbac-proxy prometheus-k8s-rulefiles-0 secret-prometheus-k8s-htpasswd secret-prometheus-k8s-proxy configmap-kubelet-serving-ca-bundle configmap-serving-certs-ca-bundle secret-prometheus-k8s-thanos-sidecar-tls config-out my-prometheus-claim web-config secret-kube-etcd-client-certs metrics-client-ca secret-grpc-tls kube-api-access-2nl5n secret-prometheus-k8s-tls]: timed out waiting for the condition
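Every map attempt fails with exit status 110 (ETIMEDOUT) while the krbd client tries to reach the five mon Service addresses, which is exactly where the dmesg requested above would help: the kernel normally logs libceph connect/timeout messages for this. A hedged sketch of collecting that, assuming cluster-admin access; the node name compute-0 is a placeholder for the node hosting prometheus-k8s-0:

```shell
# Pull kernel ring-buffer messages related to the Ceph kernel client from the node
oc debug node/compute-0 -- chroot /host dmesg | grep -iE 'libceph|rbd'

# Optional raw reachability check against one mon Service IP from the same node
# (assumes curl is present on the RHCOS host)
oc debug node/compute-0 -- chroot /host curl -v --connect-timeout 5 telnet://172.30.246.98:6789
```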
BTW, on 4.9 we didn't have any successful deployment of this combination with arbiter because of the other blockers we hit earlier:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-trigger-vsphere-upi-2az-rhcos-lso-vmdk-arbiter-3m-6w-tier4b/3/

The last run where it worked was on 4.8, started 3 months and 15 days ago. We don't run this combination on a regular basis (mainly for the first y-stream versions such as 4.8.0 and 4.9.0), so we don't have many results for it.

Ilya/Sunny, please take a look.

Hello, any update here? We are still holding the resources and the cluster because of this. If no one replies here, we are going to destroy the cluster by tomorrow.

@muagarwa FYI

(In reply to Petr Balogh from comment #12)
> Hello,
>
> any update here? We are still holding the resources and the cluster because
> of this. If no one replies here, we are going to destroy the cluster by
> tomorrow.
>
> @muagarwa FYI

You can clean it up. The logging makes clear what's happened.

(In reply to Greg Farnum from comment #14)
> (In reply to Petr Balogh from comment #12)
> > Hello,
> >
> > any update here? We are still holding the resources and the cluster
> > because of this. If no one replies here, we are going to destroy the
> > cluster by tomorrow.
> >
> > @muagarwa FYI
>
> You can clean it up. The logging makes clear what's happened.

Thanks. Destroyed the cluster.

Running the verification job here:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-trigger-vsphere-upi-2az-rhcos-lso-vmdk-arbiter-3m-6w-tier4b/13
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/2335/console

I see the deployment passed in this job and it is now running the tier4b suite. As this issue blocked the deployment, which has now passed, I am marking this bug as verified.
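A minimal sketch of the checks behind that verification, assuming cluster-admin access on the redeployed arbiter cluster; the rook-ceph-tools deployment name is the usual ODF toolbox and is an assumption here:

```shell
# All monitoring pods should reach Running and the RBD-backed PVCs should stay Bound
oc -n openshift-monitoring get pod
oc -n openshift-monitoring get pvc

# All mons (including the arbiter) should be listed and the cluster healthy;
# requires the ceph toolbox to be enabled in openshift-storage
oc -n openshift-storage rsh deploy/rook-ceph-tools ceph mon dump
oc -n openshift-storage rsh deploy/rook-ceph-tools ceph status
```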