Created attachment 1954509 [details]
full versions of all the components of the cluster

Description of problem (please be as detailed as possible and provide log snippets):

During a run of the whole Performance test suite on VMware, the rook-ceph-osd-1 pod experiences multiple CrashLoopBackOffs and the rook-ceph-mds-pas-testing-cephfs pod stays Pending without being assigned an IP or node.

sh-4.4$ ceph status
  cluster:
    id:     74cb797c-bf88-4008-a3fb-7df929d6692a
    health: HEALTH_WARN
            insufficient standby MDS daemons available
            1 osds down
            1 host (1 osds) down
            1 rack (1 osds) down
            Degraded data redundancy: 22238/66714 objects degraded (33.333%), 300 pgs degraded, 513 pgs undersized

  services:
    mon: 3 daemons, quorum a,b,c (age 3d)
    mgr: a(active, since 3d)
    mds: 2/2 daemons up, 1 hot standby
    osd: 3 osds: 2 up (since 3d), 3 in (since 7d)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 2/2 healthy
    pools:   14 pools, 513 pgs
    objects: 22.24k objects, 84 GiB
    usage:   171 GiB used, 3.8 TiB / 4 TiB avail
    pgs:     22238/66714 objects degraded (33.333%)
             300 active+undersized+degraded
             213 active+undersized

  io:
    client: 2.6 KiB/s rd, 174 KiB/s wr, 3 op/s rd, 3 op/s wr

sh-4.4$

Tests that ran before the Ceph health became not OK:

tests.e2e.performance.csi_tests.test_bulk_pod_attachtime_performance (Passed)
tests.e2e.performance.csi_tests.test_pod_attachtime (Passed)
tests.e2e.performance.csi_tests.test_pod_reattachtime (Passed)
tests.e2e.performance.csi_tests.test_pvc_bulk_clone_performance (Passed)
tests.e2e.performance.csi_tests.test_pvc_bulk_creation_deletion_performance (Passed)
tests.e2e.performance.csi_tests.test_pvc_clone_performance (Passed)
tests.e2e.performance.csi_tests.test_pvc_creation_deletion_performance (Passed)
tests.e2e.performance.csi_tests.test_pvc_multi_clone_performance (Passed)
tests.e2e.performance.csi_tests.test_pvc_multi_snapshot_performance (Failed - bug in the test)
tests.e2e.performance.csi_tests.test_pvc_snapshot_performance (Passed)

The next test (the FIO test) failed to start the FIO client, and all subsequent tests failed on the Ceph health check (Ceph not OK).
Version of all relevant components (if applicable):

Driver versions
================

OCP versions
==============
clientVersion:
  buildDate: "2023-03-18T02:15:12Z"
  compiler: gc
  gitCommit: eed143055ede731029931ad204b19cd2f565ef1a
  gitTreeState: clean
  gitVersion: 4.13.0-202303180002.p0.geed1430.assembly.stream-eed1430
  goVersion: go1.19.6
  major: ""
  minor: ""
  platform: linux/amd64
kustomizeVersion: v4.5.7
openshiftVersion: 4.12.7
releaseClientVersion: 4.13.0-0.nightly-2023-03-19-052243
serverVersion:
  buildDate: "2023-02-21T11:03:12Z"
  compiler: gc
  gitCommit: b6d1f054747e9886f61dd85316deac3415e2726f
  gitTreeState: clean
  gitVersion: v1.25.4+18eadca
  goVersion: go1.19.4
  major: "1"
  minor: "25"
  platform: linux/amd64

Cluster version:
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.7    True        False         7d5h    Cluster version is 4.12.7

OCS versions
==============
NAME                              DISPLAY                       VERSION   REPLACES                          PHASE
mcg-operator.v4.12.1              NooBaa Operator               4.12.1    mcg-operator.v4.12.0              Succeeded
ocs-operator.v4.12.1              OpenShift Container Storage   4.12.1    ocs-operator.v4.12.0              Succeeded
odf-csi-addons-operator.v4.12.1   CSI Addons                    4.12.1    odf-csi-addons-operator.v4.12.0   Succeeded
odf-operator.v4.12.1              OpenShift Data Foundation     4.12.1    odf-operator.v4.12.0              Succeeded

ODF (OCS) build:
full_version: 4.12.1-19

Rook versions
===============
rook: v4.12.1-0.f4e99907f9b9f05a190303465f61d12d5d24cace
go: go1.18.9

Ceph versions
===============
ceph version 16.2.10-138.el8cp (a63ae467c8e1f7503ea3855893f1e5ca189a71b9) pacific (stable)

RHCOS versions
================
NAME              STATUS   ROLES                  AGE    VERSION           INTERNAL-IP    EXTERNAL-IP    OS-IMAGE                                                        KERNEL-VERSION                 CONTAINER-RUNTIME
compute-0         Ready    worker                 7d5h   v1.25.4+18eadca   10.1.160.219   10.1.160.219   Red Hat Enterprise Linux CoreOS 412.86.202303011010-0 (Ootpa)   4.18.0-372.46.1.el8_6.x86_64   cri-o://1.25.2-10.rhaos4.12.git0a083f9.el8
compute-1         Ready    worker                 7d5h   v1.25.4+18eadca   10.1.160.211   10.1.160.211   Red Hat Enterprise Linux CoreOS 412.86.202303011010-0 (Ootpa)   4.18.0-372.46.1.el8_6.x86_64   cri-o://1.25.2-10.rhaos4.12.git0a083f9.el8
compute-2         Ready    worker                 7d5h   v1.25.4+18eadca   10.1.160.209   10.1.160.209   Red Hat Enterprise Linux CoreOS 412.86.202303011010-0 (Ootpa)   4.18.0-372.46.1.el8_6.x86_64   cri-o://1.25.2-10.rhaos4.12.git0a083f9.el8
control-plane-0   Ready    control-plane,master   7d5h   v1.25.4+18eadca   10.1.160.222   10.1.160.222   Red Hat Enterprise Linux CoreOS 412.86.202303011010-0 (Ootpa)   4.18.0-372.46.1.el8_6.x86_64   cri-o://1.25.2-10.rhaos4.12.git0a083f9.el8
control-plane-1   Ready    control-plane,master   7d5h   v1.25.4+18eadca   10.1.160.217   10.1.160.217   Red Hat Enterprise Linux CoreOS 412.86.202303011010-0 (Ootpa)   4.18.0-372.46.1.el8_6.x86_64   cri-o://1.25.2-10.rhaos4.12.git0a083f9.el8
control-plane-2   Ready    control-plane,master   7d5h   v1.25.4+18eadca   10.1.160.220   10.1.160.220   Red Hat Enterprise Linux CoreOS 412.86.202303011010-0 (Ootpa)   4.18.0-372.46.1.el8_6.x86_64   cri-o://1.25.2-10.rhaos4.12.git0a083f9.el8

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes, since the cluster is not usable.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
3

Can this issue be reproduced?
I am not sure.

Can this issue be reproduced from the UI?
Not relevant.

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Deploy a 4.12 VMware cluster.
2. Run the Performance suite (marker "performance").
3. Check Ceph health and pod status (all pods should be up and running, Ceph health should be OK); see the command sketch at the end of this comment.

Actual results:
Ceph health is not OK. During the run of the whole Performance test suite on VMware, the rook-ceph-osd-1 pod experiences multiple CrashLoopBackOffs and the rook-ceph-mds-pas-testing-cephfs pod stays Pending without being assigned an IP or node.

Expected results:
All the pods should be up and running and Ceph health should be OK.

Additional info:

Relevant Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/22470/

Tests that ran before the Ceph health became not OK:

tests.e2e.performance.csi_tests.test_bulk_pod_attachtime_performance (Passed)
tests.e2e.performance.csi_tests.test_pod_attachtime (Passed)
tests.e2e.performance.csi_tests.test_pod_reattachtime (Passed)
tests.e2e.performance.csi_tests.test_pvc_bulk_clone_performance (Passed)
tests.e2e.performance.csi_tests.test_pvc_bulk_creation_deletion_performance (Passed)
tests.e2e.performance.csi_tests.test_pvc_clone_performance (Passed)
tests.e2e.performance.csi_tests.test_pvc_creation_deletion_performance (Passed)
tests.e2e.performance.csi_tests.test_pvc_multi_clone_performance (Passed)
tests.e2e.performance.csi_tests.test_pvc_multi_snapshot_performance (Failed - bug in the test)
tests.e2e.performance.csi_tests.test_pvc_snapshot_performance (Passed)

The next test (the FIO test) failed to start the FIO client, and all subsequent tests failed on the Ceph health check (Ceph not OK).

A link to the must-gather logs will be posted in the next comment, along with a link to the ocs-operator logs.
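For the check in step 3, something along these lines can be used (a minimal sketch, not the exact ocs-ci health check; it assumes the default openshift-storage namespace and the standard app=rook-ceph-tools label on the toolbox pod):

# List all ODF pods and their state
oc -n openshift-storage get pods -o wide

# Run the Ceph health checks from the rook-ceph-tools pod
TOOLS_POD=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name)
oc -n openshift-storage rsh "$TOOLS_POD" ceph health detail
oc -n openshift-storage rsh "$TOOLS_POD" ceph status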
Please note that the must-gather logs and ocs-operator logs are located at: rhsqe-repo.lab.eng.blr.redhat.com:/var/www/html/OCS/ocs-qe-bugs/bz-2182820/
Please note the pod status:

(yulienv38) [ypersky@ypersky ocs-ci]$ oc get pods -o wide
NAME                                                              READY   STATUS                  RESTARTS         AGE    IP             NODE        NOMINATED NODE   READINESS GATES
csi-addons-controller-manager-c9c696595-q7tg4                     2/2     Running                 0                7d6h   10.131.0.14    compute-2   <none>           <none>
csi-cephfsplugin-4d4kr                                            2/2     Running                 0                7d6h   10.1.160.211   compute-1   <none>           <none>
csi-cephfsplugin-h2x5s                                            2/2     Running                 0                7d6h   10.1.160.219   compute-0   <none>           <none>
csi-cephfsplugin-provisioner-7874d85cb6-4csm2                     5/5     Running                 0                3d3h   10.131.1.248   compute-2   <none>           <none>
csi-cephfsplugin-provisioner-7874d85cb6-w5tht                     5/5     Running                 0                7d6h   10.129.2.26    compute-1   <none>           <none>
csi-cephfsplugin-rgd7h                                            2/2     Running                 0                7d6h   10.1.160.209   compute-2   <none>           <none>
csi-rbdplugin-8rbfk                                               3/3     Running                 0                7d6h   10.1.160.211   compute-1   <none>           <none>
csi-rbdplugin-g8s6g                                               3/3     Running                 0                7d6h   10.1.160.219   compute-0   <none>           <none>
csi-rbdplugin-klb9b                                               3/3     Running                 0                7d6h   10.1.160.209   compute-2   <none>           <none>
csi-rbdplugin-provisioner-58d9d7fd6c-dmfrp                        6/6     Running                 0                7d6h   10.131.0.16    compute-2   <none>           <none>
csi-rbdplugin-provisioner-58d9d7fd6c-vc7mc                        6/6     Running                 0                7d6h   10.129.2.25    compute-1   <none>           <none>
noobaa-core-0                                                     1/1     Running                 0                3d3h   10.131.1.250   compute-2   <none>           <none>
noobaa-db-pg-0                                                    1/1     Running                 0                7d6h   10.131.0.28    compute-2   <none>           <none>
noobaa-endpoint-6f767c5799-kkms6                                  1/1     Running                 0                3d3h   10.131.1.247   compute-2   <none>           <none>
noobaa-operator-5667fcb59c-rc924                                  1/1     Running                 1 (24h ago)      3d3h   10.129.3.60    compute-1   <none>           <none>
ocs-metrics-exporter-696b8bdc7c-kjpqx                             1/1     Running                 0                7d6h   10.131.0.12    compute-2   <none>           <none>
ocs-operator-768bd84df7-hckqt                                     1/1     Running                 0                7d6h   10.129.2.23    compute-1   <none>           <none>
odf-console-54978b6774-x9k2q                                      1/1     Running                 0                7d6h   10.129.2.22    compute-1   <none>           <none>
odf-operator-controller-manager-8d6549758-dbpzt                   2/2     Running                 0                7d6h   10.129.2.21    compute-1   <none>           <none>
rook-ceph-crashcollector-compute-0-b64cb7fc9-lbnpb                1/1     Running                 0                3d2h   10.128.3.117   compute-0   <none>           <none>
rook-ceph-crashcollector-compute-1-66c4d45546-nbgn4               1/1     Running                 0                7d6h   10.129.2.32    compute-1   <none>           <none>
rook-ceph-crashcollector-compute-2-68c465997-45zgm                1/1     Running                 0                7d6h   10.131.0.22    compute-2   <none>           <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-79d8f4bb8sfbc   2/2     Running                 0                7d6h   10.131.0.25    compute-2   <none>           <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-79577447vlcfw   2/2     Running                 0                3d3h   10.128.3.119   compute-0   <none>           <none>
rook-ceph-mds-pas-testing-cephfs-a-5cd65db484-xpbb4               2/2     Running                 0                7d2h   10.129.2.71    compute-1   <none>           <none>
rook-ceph-mds-pas-testing-cephfs-b-694cb8cc4f-9sj9v               0/2     Pending                 0                7d2h   <none>         <none>      <none>           <none>
rook-ceph-mgr-a-7474cd5645-d48l7                                  2/2     Running                 0                3d3h   10.131.1.243   compute-2   <none>           <none>
rook-ceph-mon-a-58fd5cf765-gm2kg                                  2/2     Running                 0                7d6h   10.131.0.18    compute-2   <none>           <none>
rook-ceph-mon-b-66677dfc69-dvxrp                                  2/2     Running                 0                7d6h   10.129.2.28    compute-1   <none>           <none>
rook-ceph-mon-c-747bf9df65-wkqwq                                  2/2     Running                 0                3d3h   10.128.3.118   compute-0   <none>           <none>
rook-ceph-operator-fbb9bcddb-ztq4w                                1/1     Running                 0                7d6h   10.129.2.24    compute-1   <none>           <none>
rook-ceph-osd-0-7f4987fc57-vkjkk                                  2/2     Running                 0                7d6h   10.131.0.21    compute-2   <none>           <none>
rook-ceph-osd-1-86895f9f75-tsw74                                  0/2     Init:CrashLoopBackOff   883 (112s ago)   3d3h   10.128.3.116   compute-0   <none>           <none>
rook-ceph-osd-2-58dbf9b77-kxnmm                                   2/2     Running                 0                7d6h   10.129.2.31    compute-1   <none>           <none>
rook-ceph-osd-prepare-ocs-deviceset-1-data-0btgcv-4m55s           0/1     Completed               0                7d6h   10.129.2.30    compute-1   <none>           <none>
rook-ceph-osd-prepare-ocs-deviceset-2-data-0v9wcv-ds8bp           0/1     Completed               0                7d6h   10.131.0.20    compute-2   <none>           <none>
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-6f67fcb2gp2q   2/2     Running                 0                7d6h   10.131.0.26    compute-2   <none>           <none>
rook-ceph-tools-56c5c9c865-qwdnd                                  1/1     Running                 0                7d6h   10.131.0.27    compute-2   <none>           <none>
(yulienv38) [ypersky@ypersky ocs-ci]$
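The two unhealthy pods above can be inspected with something along these lines (a sketch; it assumes the default openshift-storage namespace, the pod names are taken from the listing above, and the "activate" init container name matches the must-gather layout):

# Why is the MDS pod Pending? The Events section usually shows the scheduling failure.
oc -n openshift-storage describe pod rook-ceph-mds-pas-testing-cephfs-b-694cb8cc4f-9sj9v

# Which init container of the OSD pod is crash-looping, and with what error?
oc -n openshift-storage describe pod rook-ceph-osd-1-86895f9f75-tsw74
oc -n openshift-storage logs rook-ceph-osd-1-86895f9f75-tsw74 -c activate --previous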
The setup (the cluster) still exists; this is https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/22398/

Cluster name: ypersky-12b
kubeconfig: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/ypersky-12b/ypersky-12b_20230322T113816/openshift-cluster-dir/auth/kubeconfig
Could you share the must-gather? It seems the cluster is not running now.
In the must-gather at [1], the OSD pod is not starting because the disk is no longer found [2]:

2023-03-29T15:23:57.218966410Z failed to read label for /var/lib/ceph/osd/ceph-1/block: (6) No such device or address
2023-03-29T15:23:57.219088647Z 2023-03-29T15:23:57.217+0000 7f4ce8d9b540 -1 bluestore(/var/lib/ceph/osd/ceph-1/block) _read_bdev_label failed to open /var/lib/ceph/osd/ceph-1/block: (6) No such device or address

I suspect the VMware disk simply became unavailable, and the cluster is now in a degraded state. The cluster is still keeping the data safe and allows reads/writes. In a production cluster, the OSD would need to be replaced to regain full data protection.

1. Do you see any other side effects besides the OSD being down?
2. Are you able to repro this?

[1] http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-2182820/
[2] http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-2182820/logs-20230329-182520/ocs-must-gather-us/quay-io-ocs-dev-ocs-must-gather-sha256-3d57a983f6c2b53ef2dd4ee3a5a4df7bc86daf53abb703e9a7872d07c00ed3c7/namespaces/openshift-storage/pods/rook-ceph-osd-1-86895f9f75-tsw74/activate/activate/logs/current.log
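If the cluster is still up, the "disk no longer visible" theory can be checked from the node and the toolbox (a sketch, assuming compute-0 hosts OSD 1 as in the pod listing above and that the toolbox deployment is named rook-ceph-tools):

# Confirm which OSD is down and on which host
oc -n openshift-storage rsh deploy/rook-ceph-tools ceph osd tree

# List the block devices the node currently sees; if the VMware disk backing
# OSD 1 was detached from the VM, it would be missing from this output.
oc debug node/compute-0 -- chroot /host lsblk -o NAME,SIZE,TYPE,MOUNTPOINT

# Same check against a node with a healthy OSD, for comparison
oc debug node/compute-1 -- chroot /host lsblk -o NAME,SIZE,TYPE,MOUNTPOINT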
@Subham Rai - for must-gather logs please refer to comment #2: the must-gather logs and ocs-operator logs are located at rhsqe-repo.lab.eng.blr.redhat.com:/var/www/html/OCS/ocs-qe-bugs/bz-2182820/
Moving out of 4.13 while under investigation; it appears to be an environmental issue.
@Travis Nielsen, per your questions:

1. Do you see any other side effects besides the OSD being down? - I described all the behaviour I was able to analyze.
2. Are you able to repro this? - I tried to reproduce it on a VMware (non-LSO) cluster and did not see this behaviour. Therefore, what I am going to do is recreate a VMware LSO 4.12 cluster on the same DC and run the Perf suite again. I will update on the result - whether the issue was reproduced or not.

I'm keeping the needinfo on myself, to update with the results of the reproduction attempt.
Unfortunately, I was not able to reproduce this issue during the following runs, so I removed the needinfo, since it looks like I cannot contribute more. However, if I encounter this behaviour in the future, I'll update here.
Please reopen if you can repro