Bug 2182820
| Summary: | VMware - rook-ceph-osd-1 pod experiences multiple CrashLoopBackOff, rook-ceph-mds-pas-testing-cephfs pod fails to get an IP and node | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Yuli Persky <ypersky> |
| Component: | rook | Assignee: | Travis Nielsen <tnielsen> |
| Status: | CLOSED NOTABUG | QA Contact: | Neha Berry <nberry> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 4.12 | CC: | jopinto, jpeimer, kramdoss, ocs-bugs, odf-bz-bot, rperiyas, srai |
| Target Milestone: | --- | Keywords: | Automation, Performance |
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-05-30 15:20:31 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Yuli Persky
2023-03-29 18:51:11 UTC
Please note that must-gather logs and ocs-operator logs are located at: rhsqe-repo.lab.eng.blr.redhat.com:/var/www/html/OCS/ocs-qe-bugs/bz-2182820/

Please note the pod status:

```
(yulienv38) [ypersky@ypersky ocs-ci]$ oc get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
csi-addons-controller-manager-c9c696595-q7tg4 2/2 Running 0 7d6h 10.131.0.14 compute-2 <none> <none>
csi-cephfsplugin-4d4kr 2/2 Running 0 7d6h 10.1.160.211 compute-1 <none> <none>
csi-cephfsplugin-h2x5s 2/2 Running 0 7d6h 10.1.160.219 compute-0 <none> <none>
csi-cephfsplugin-provisioner-7874d85cb6-4csm2 5/5 Running 0 3d3h 10.131.1.248 compute-2 <none> <none>
csi-cephfsplugin-provisioner-7874d85cb6-w5tht 5/5 Running 0 7d6h 10.129.2.26 compute-1 <none> <none>
csi-cephfsplugin-rgd7h 2/2 Running 0 7d6h 10.1.160.209 compute-2 <none> <none>
csi-rbdplugin-8rbfk 3/3 Running 0 7d6h 10.1.160.211 compute-1 <none> <none>
csi-rbdplugin-g8s6g 3/3 Running 0 7d6h 10.1.160.219 compute-0 <none> <none>
csi-rbdplugin-klb9b 3/3 Running 0 7d6h 10.1.160.209 compute-2 <none> <none>
csi-rbdplugin-provisioner-58d9d7fd6c-dmfrp 6/6 Running 0 7d6h 10.131.0.16 compute-2 <none> <none>
csi-rbdplugin-provisioner-58d9d7fd6c-vc7mc 6/6 Running 0 7d6h 10.129.2.25 compute-1 <none> <none>
noobaa-core-0 1/1 Running 0 3d3h 10.131.1.250 compute-2 <none> <none>
noobaa-db-pg-0 1/1 Running 0 7d6h 10.131.0.28 compute-2 <none> <none>
noobaa-endpoint-6f767c5799-kkms6 1/1 Running 0 3d3h 10.131.1.247 compute-2 <none> <none>
noobaa-operator-5667fcb59c-rc924 1/1 Running 1 (24h ago) 3d3h 10.129.3.60 compute-1 <none> <none>
ocs-metrics-exporter-696b8bdc7c-kjpqx 1/1 Running 0 7d6h 10.131.0.12 compute-2 <none> <none>
ocs-operator-768bd84df7-hckqt 1/1 Running 0 7d6h 10.129.2.23 compute-1 <none> <none>
odf-console-54978b6774-x9k2q 1/1 Running 0 7d6h 10.129.2.22 compute-1 <none> <none>
odf-operator-controller-manager-8d6549758-dbpzt 2/2 Running 0 7d6h 10.129.2.21 compute-1 <none> <none>
rook-ceph-crashcollector-compute-0-b64cb7fc9-lbnpb 1/1 Running 0 3d2h 10.128.3.117 compute-0 <none> <none>
rook-ceph-crashcollector-compute-1-66c4d45546-nbgn4 1/1 Running 0 7d6h 10.129.2.32 compute-1 <none> <none>
rook-ceph-crashcollector-compute-2-68c465997-45zgm 1/1 Running 0 7d6h 10.131.0.22 compute-2 <none> <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-79d8f4bb8sfbc 2/2 Running 0 7d6h 10.131.0.25 compute-2 <none> <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-79577447vlcfw 2/2 Running 0 3d3h 10.128.3.119 compute-0 <none> <none>
rook-ceph-mds-pas-testing-cephfs-a-5cd65db484-xpbb4 2/2 Running 0 7d2h 10.129.2.71 compute-1 <none> <none>
rook-ceph-mds-pas-testing-cephfs-b-694cb8cc4f-9sj9v 0/2 Pending 0 7d2h <none> <none> <none> <none>
rook-ceph-mgr-a-7474cd5645-d48l7 2/2 Running 0 3d3h 10.131.1.243 compute-2 <none> <none>
rook-ceph-mon-a-58fd5cf765-gm2kg 2/2 Running 0 7d6h 10.131.0.18 compute-2 <none> <none>
rook-ceph-mon-b-66677dfc69-dvxrp 2/2 Running 0 7d6h 10.129.2.28 compute-1 <none> <none>
rook-ceph-mon-c-747bf9df65-wkqwq 2/2 Running 0 3d3h 10.128.3.118 compute-0 <none> <none>
rook-ceph-operator-fbb9bcddb-ztq4w 1/1 Running 0 7d6h 10.129.2.24 compute-1 <none> <none>
rook-ceph-osd-0-7f4987fc57-vkjkk 2/2 Running 0 7d6h 10.131.0.21 compute-2 <none> <none>
rook-ceph-osd-1-86895f9f75-tsw74 0/2 Init:CrashLoopBackOff 883 (112s ago) 3d3h 10.128.3.116 compute-0 <none> <none>
rook-ceph-osd-2-58dbf9b77-kxnmm 2/2 Running 0 7d6h 10.129.2.31 compute-1 <none> <none>
rook-ceph-osd-prepare-ocs-deviceset-1-data-0btgcv-4m55s 0/1 Completed 0 7d6h 10.129.2.30 compute-1 <none> <none>
rook-ceph-osd-prepare-ocs-deviceset-2-data-0v9wcv-ds8bp 0/1 Completed 0 7d6h 10.131.0.20 compute-2 <none> <none>
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-6f67fcb2gp2q 2/2 Running 0 7d6h 10.131.0.26 compute-2 <none> <none>
rook-ceph-tools-56c5c9c865-qwdnd 1/1 Running 0 7d6h 10.131.0.27 compute-2 <none> <none>
(yulienv38) [ypersky@ypersky ocs-ci]$
```

The setup (the cluster) still exists; this is https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/22398/

cluster name: ypersky-12b
kubeconfig: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/ypersky-12b/ypersky-12b_20230322T113816/openshift-cluster-dir/auth/kubeconfig

Could you share the must-gather? It seems like the cluster is not running now.

In the must-gather at [1], the OSD pod is not starting because the disk is no longer found [2].

```
2023-03-29T15:23:57.218966410Z failed to read label for /var/lib/ceph/osd/ceph-1/block: (6) No such device or address
2023-03-29T15:23:57.219088647Z 2023-03-29T15:23:57.217+0000 7f4ce8d9b540 -1 bluestore(/var/lib/ceph/osd/ceph-1/block) _read_bdev_label failed to open /var/lib/ceph/osd/ceph-1/block: (6) No such device or address
```

I suspect the VMware disk just became unavailable, and now the cluster is simply in a degraded state. The cluster is still keeping the data safe and would allow reads/writes. In a production cluster, the OSD would need to be replaced to regain full data protection (a sketch of that replacement flow is at the end of this report).

1. Do you see any other side effects besides the OSD being down?
2. Are you able to repro this?

[1] http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-2182820/
[2] http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-2182820/logs-20230329-182520/ocs-must-gather-us/quay-io-ocs-dev-ocs-must-gather-sha256-3d57a983f6c2b53ef2dd4ee3a5a4df7bc86daf53abb703e9a7872d07c00ed3c7/namespaces/openshift-storage/pods/rook-ceph-osd-1-86895f9f75-tsw74/activate/activate/logs/current.log
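A minimal triage sketch of how the "disk is no longer found" theory can be checked from the cluster side, assuming the `openshift-storage` namespace and reusing the pod names from the listing above (the `activate` init-container name comes from the must-gather path in [2]):

```
# Init-container log from the crash-looping OSD pod (same log as [2])
oc logs rook-ceph-osd-1-86895f9f75-tsw74 -c activate -n openshift-storage

# Overall Ceph health and OSD topology, via the toolbox pod
oc rsh -n openshift-storage rook-ceph-tools-56c5c9c865-qwdnd ceph status
oc rsh -n openshift-storage rook-ceph-tools-56c5c9c865-qwdnd ceph osd tree

# Check whether the backing disk is still visible on compute-0, the node hosting osd-1
oc debug node/compute-0 -- chroot /host lsblk

# Scheduling events for the Pending MDS pod
oc describe pod rook-ceph-mds-pas-testing-cephfs-b-694cb8cc4f-9sj9v -n openshift-storage
```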
@Subham Rai - for must-gather logs please refer to comment #2: must-gather logs and ocs-operator logs are located at: rhsqe-repo.lab.eng.blr.redhat.com:/var/www/html/OCS/ocs-qe-bugs/bz-2182820/

Moving out of 4.13 while under investigation; it appears to be an environmental issue.

@Travis Nielsen, per your questions:

1. Do you see any other side effects besides the OSD being down? - I described all the behaviour I was able to analyze.
2. Are you able to repro this? - I tried to reproduce it on a VMware non-LSO cluster and did not see this behaviour. Therefore, what I am going to do is recreate a VMware LSO 4.12 cluster on the same DC and run the Perf suite again. I will update with the result, whether the issue was reproduced or not.

I'm keeping the "needinfo" on myself, to update the results of the reproduction attempt.

Unfortunately I was not able to reproduce this issue during the following runs. Therefore I removed the needinfo, since it looks like I cannot contribute more. However, if I encounter this behaviour in the future, I'll update here.

Please reopen if you can repro
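As noted in the analysis above, in a production cluster the failed OSD would have to be replaced to regain full data protection. A minimal sketch of that flow for OSD id 1, assuming the `ocs-osd-removal` template that ships with ODF (verify against the device-replacement procedure for the exact release; on an LSO-backed cluster the LocalVolume/PV for the replaced disk also has to be recreated before a new OSD is provisioned):

```
# Scale down the failed OSD deployment so the removal job can purge OSD 1
oc scale deployment rook-ceph-osd-1 --replicas=0 -n openshift-storage

# Run the OSD-removal job template for the failed OSD id
# (template and job names per the ODF device-replacement docs)
oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=1 | oc create -f -

# Follow the removal job until it completes
oc logs -l job-name=ocs-osd-removal-job -n openshift-storage -f
```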