Bug 2277603
| Summary: | Ceph health is going to Error state on ODF 4.15.x on IBM Power | | |
| --- | --- | --- | --- |
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Pooja Soni <posoni> |
| Component: | ceph | Assignee: | Radoslaw Zarzynski <rzarzyns> |
| ceph sub component: | RADOS | QA Contact: | Elad <ebenahar> |
| Status: | NEW --- | Docs Contact: | |
| Severity: | high | | |
| Priority: | unspecified | CC: | aaaggarw, akupczyk, bhubbard, bniver, brgardne, jcaratza, kramdoss, muagarwa, nojha, odf-bz-bot, paarora, rzarzyns, sheggodu, sostapov, tnielsen |
| Version: | 4.15 | Flags: | brgardne: needinfo? (bhubbard), bhubbard: needinfo? (akupczyk), brgardne: needinfo? (aaaggarw) |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | ppc64le | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Pooja Soni
2024-04-28 14:21:02 UTC
Must gather logs: https://drive.google.com/file/d/14R_3VhDIXHHv9ZRyCUBJ3BqvWVmyXZ41/view?usp=sharing

Hi, what is Tier1? What operations were performed on the cluster before it went to the current state?

Tier1 is the test suite that is executed as part of zstream 4.15.2, and the following test case is part of that suite: test_selinux_relabel_for_existing_pvc[5]. After running the test_selinux_relabel_for_existing_pvc[5] test case we are seeing this issue. Link to the test case: https://github.com/red-hat-storage/ocs-ci/blob/6de27377af27d626991b2b0b590f534a91a81400/tests/cross_functional/kcs/test_selinux_relabel_solution.py#L227

This test case creates a PVC, attaches it to a pod, creates multiple directories with files, and applies SELinux relabeling (see the workload sketch after this comment block). The issue is seen on multiple clusters with ODF 4.15.2 installed.

(In reply to Pooja Soni from comment #5) Thanks for the details. Do you have a live cluster that I can take a look at?

Got a live cluster from Aaruni. Didn't see any issues at the Rook level, but the osd.0 pod is crashing. Hi Radoslaw, can you take a look at the crashing OSD pod? Thanks.

We tried again on a different setup and it failed again with the error:

Ceph cluster health is not OK. Health: HEALTH_ERR 1/60838 objects unfound (0.002%); Reduced data availability: 37 pgs peering; Possible data damage: 1 pg recovery_unfound; Degraded data redundancy: 41007/182514 objects degraded (22.468%), 75 pgs degraded, 109 pgs undersized; 5 daemons have recently crashed; 1 slow ops, oldest one blocked for 120 sec, daemons [osd.1,osd.2] have slow ops.

Must gather log for this setup: https://drive.google.com/file/d/1G_zg9vF8xI3c74hZBxmK4Q-Q-3VmZwtk/view?usp=sharing

I got the same issue running Tier1 on ODF 4.14.7: Ceph health went into the error state after execution of the test_selinux_relabel_for_existing_pvc[5] test case. I faced the Ceph health issue on 4.15.3 as well.
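For readers unfamiliar with the test case, this is roughly the kind of workload it applies. It is a minimal sketch, not the ocs-ci implementation: the mount point, the directory/file counts, and the explicit `chcon` invocation are assumptions standing in for the PVC attach and relabel steps the test performs.

```bash
#!/bin/bash
# Hypothetical stand-in for the test workload: populate a PVC-backed mount
# with many small files, then force a recursive SELinux relabel over it.
MOUNT=/mnt/pvc-data   # assumed mount point of the attached PVC inside the pod

for d in $(seq 1 50); do
  mkdir -p "$MOUNT/dir-$d"
  for f in $(seq 1 200); do
    echo "payload $d-$f" > "$MOUNT/dir-$d/file-$f"
  done
done

# Recursive relabel of every file on the volume; in the real test this is
# effectively what the container runtime does when the pod mounts the
# volume with a fresh SELinux context.
chcon -R -t container_file_t "$MOUNT"
```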
sh-5.1$ ceph health detail
HEALTH_ERR 1/652 objects unfound (0.153%); 5 scrub errors; Possible data damage: 1 pg recovery_unfound, 5 pgs inconsistent; Degraded data redundancy: 3/1956 objects degraded (0.153%), 1 pg degraded, 1 pg undersized; 6 daemons have recently crashed
[WRN] OBJECT_UNFOUND: 1/652 objects unfound (0.153%)
    pg 12.1d has 1 unfound objects
[ERR] OSD_SCRUB_ERRORS: 5 scrub errors
[ERR] PG_DAMAGED: Possible data damage: 1 pg recovery_unfound, 5 pgs inconsistent
    pg 1.4 is active+clean+inconsistent, acting [1,2,0]
    pg 1.17 is active+clean+inconsistent, acting [1,2,0]
    pg 10.6 is active+clean+inconsistent, acting [2,1,0]
    pg 10.e is active+clean+inconsistent, acting [2,0,1]
    pg 10.f is active+clean+inconsistent, acting [1,2,0]
    pg 12.1d is active+recovery_unfound+undersized+degraded+remapped, acting [1,0], 1 unfound
[WRN] PG_DEGRADED: Degraded data redundancy: 3/1956 objects degraded (0.153%), 1 pg degraded, 1 pg undersized
    pg 12.1d is stuck undersized for 4d, current state active+recovery_unfound+undersized+degraded+remapped, last acting [1,0]
[WRN] RECENT_CRASH: 6 daemons have recently crashed
    osd.2 crashed on host rook-ceph-osd-2-868c7458ff-92dlg at 2024-05-23T13:05:44.743307Z
    osd.2 crashed on host rook-ceph-osd-2-868c7458ff-92dlg at 2024-05-23T13:07:40.116351Z
    osd.0 crashed on host rook-ceph-osd-0-7d7df95c84-qm6mm at 2024-05-23T13:08:25.743047Z
    osd.2 crashed on host rook-ceph-osd-2-868c7458ff-92dlg at 2024-05-23T13:11:15.966533Z
    osd.2 crashed on host rook-ceph-osd-2-868c7458ff-92dlg at 2024-05-23T13:12:09.088452Z
    osd.0 crashed on host rook-ceph-osd-0-7d7df95c84-qm6mm at 2024-05-23T13:09:36.226574Z
sh-5.1$

Must gather after excluding test_selinux_relabel_for_existing_pvc[5]: https://drive.google.com/file/d/1olFF_AOYda9aFdJbXkvdv6kIIR3k0H_2/view?usp=drive_link

Must gather with test_selinux_relabel_for_existing_pvc[5]: https://drive.google.com/file/d/1qwvRw5Gzp4Y58TIboJMLgQhqsrSy0UUW/view?usp=sharing

> (...) significant percentage of objects would not read properly.

We don't know how many objects had been read before the campaign was aborted due to the testcase failure – we lack information on the denominator. Anyway, let's check. Further data points are useful even if the PR is unrelated.

I created a branch with the 2 commits [1] reverted: https://gitlab.cee.redhat.com/ceph/ceph/-/commits/wip-bz-2277603-revert-gh53483

It is based on e9529323dd7ab3b0e8cdf84e17a1b58c2b42948c, which was reported in https://bugzilla.redhat.com/show_bug.cgi?id=2277603#c8.

Regards,
Radek

[1]: https://github.com/ceph/ceph/pull/53483

Copying the latest info from Aaruni. Source: https://bugzilla.redhat.com/show_bug.cgi?id=2280999#c42

> Could you retest this with `debug_osd = 20` and `debug_bluestore = 20` both? Please capture the m-g and coredumps again as well.
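As a side note, the damaged PGs in the health output above can be triaged with standard Ceph commands from the toolbox pod. This is a hedged sketch of the usual sequence; the PG ids are taken from the output above, and the destructive last step is only appropriate once the unfound object is confirmed unrecoverable:

```bash
# Show which object PG 12.1d cannot find and which OSDs have been probed.
ceph pg 12.1d list_unfound

# Ask Ceph to repair the PGs that scrub flagged as inconsistent.
for pg in 1.4 1.17 10.6 10.e 10.f; do
  ceph pg repair "$pg"
done

# Last resort (data loss if no prior version of the object exists): give up
# on the unfound object so the PG can return to active+clean.
ceph pg 12.1d mark_unfound_lost revert
```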
Setting `debug_osd = 20` and `debug_bluestore = 20`:

[root@rdr-odf414sel-bastion-0 ~]# oc -n openshift-storage rsh rook-ceph-tools-6cb655c7d-mfmt5
sh-5.1$ ceph -s
  cluster:
    id:     3f44f905-74ed-456e-8288-0cea8e7d0c93
    health: HEALTH_OK
  services:
    mon: 3 daemons, quorum a,b,c (age 10h)
    mgr: a(active, since 10h)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 10h), 3 in (since 10h)
    rgw: 1 daemon active (1 hosts, 1 zones)
  data:
    volumes: 1/1 healthy
    pools:   12 pools, 169 pgs
    objects: 2.60k objects, 7.9 GiB
    usage:   24 GiB used, 1.4 TiB / 1.5 TiB avail
    pgs:     169 active+clean
  io:
    client: 853 B/s rd, 16 KiB/s wr, 1 op/s rd, 1 op/s wr

sh-5.1$ ceph tell osd.* config get debug_osd
osd.0: { "debug_osd": "1/5" }
osd.1: { "debug_osd": "1/5" }
osd.2: { "debug_osd": "1/5" }
sh-5.1$ ceph tell osd.* config set debug_osd 20
osd.0: { "success": "" }
osd.1: { "success": "" }
osd.2: { "success": "" }
sh-5.1$ ceph tell osd.* config get debug_osd
osd.0: { "debug_osd": "20/20" }
osd.1: { "debug_osd": "20/20" }
osd.2: { "debug_osd": "20/20" }
sh-5.1$ ceph tell osd.* config get debug_bluestore
osd.0: { "debug_bluestore": "1/5" }
osd.1: { "debug_bluestore": "1/5" }
osd.2: { "debug_bluestore": "1/5" }
sh-5.1$ ceph tell osd.* config set debug_bluestore 20
osd.0: { "success": "" }
osd.1: { "success": "" }
osd.2: { "success": "" }
sh-5.1$ ceph tell osd.* config get debug_bluestore
osd.0: { "debug_bluestore": "20/20" }
osd.1: { "debug_bluestore": "20/20" }
osd.2: { "debug_bluestore": "20/20" }
sh-5.1$ exit
exit

Executed tier1 tests and ceph health went to the warn state.

ceph health status:

[root@rdr-odf414sel-bastion-0 ~]# oc -n openshift-storage rsh rook-ceph-tools-6cb655c7d-mfmt5
sh-5.1$ ceph -s
  cluster:
    id:     3f44f905-74ed-456e-8288-0cea8e7d0c93
    health: HEALTH_WARN
            1 daemons have recently crashed
  services:
    mon: 3 daemons, quorum a,b,c (age 13h)
    mgr: a(active, since 13h)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 2h), 3 in (since 13h)
    rgw: 1 daemon active (1 hosts, 1 zones)
  data:
    volumes: 1/1 healthy
    pools:   13 pools, 201 pgs
    objects: 3.18k objects, 10 GiB
    usage:   31 GiB used, 1.4 TiB / 1.5 TiB avail
    pgs:     201 active+clean
  io:
    client: 1023 B/s rd, 418 KiB/s wr, 1 op/s rd, 1 op/s wr

sh-5.1$ ceph crash ls
ID                                                                ENTITY  NEW
2024-06-05T07:00:33.318587Z_892175db-bd0a-48cc-a0e4-23da2970608a  osd.2    *
sh-5.1$
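One caveat worth noting about the debug setup above: `ceph tell` only adjusts the running daemon, so an OSD that crashes and restarts (as osd.2 did here) comes back at the default 1/5 levels. A hedged sketch of making the levels survive restarts by writing them to the cluster configuration database instead, using standard Ceph commands from the toolbox pod:

```bash
# Persist the debug levels in the mon config database; every OSD, including
# one restarted after a crash, picks these up on startup.
ceph config set osd debug_osd 20
ceph config set osd debug_bluestore 20

# Confirm what a given OSD will use.
ceph config get osd.0 debug_osd
```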
pods:

[root@rdr-odf414sel-bastion-0 ~]# oc get pods -n openshift-storage
NAME                                                              READY   STATUS      RESTARTS       AGE
csi-addons-controller-manager-5f8cf65876-257w4                    2/2     Running     0              19m
csi-cephfsplugin-c7wlj                                            2/2     Running     0              13h
csi-cephfsplugin-k7fg8                                            2/2     Running     0              13h
csi-cephfsplugin-provisioner-7b6b7c8bbf-4fh97                     5/5     Running     0              13h
csi-cephfsplugin-provisioner-7b6b7c8bbf-7stpt                     5/5     Running     0              13h
csi-cephfsplugin-rfxpz                                            2/2     Running     0              13h
csi-nfsplugin-d4m25                                               2/2     Running     0              67m
csi-nfsplugin-gx78q                                               2/2     Running     0              67m
csi-nfsplugin-provisioner-6c874556f8-p7n7r                        5/5     Running     0              67m
csi-nfsplugin-provisioner-6c874556f8-zf24r                        5/5     Running     0              67m
csi-nfsplugin-xb969                                               2/2     Running     0              67m
csi-rbdplugin-6gxnf                                               3/3     Running     0              13h
csi-rbdplugin-csqwz                                               3/3     Running     0              13h
csi-rbdplugin-mm785                                               3/3     Running     0              13h
csi-rbdplugin-provisioner-6c64c96886-2smdw                        6/6     Running     0              13h
csi-rbdplugin-provisioner-6c64c96886-m5tz9                        6/6     Running     0              13h
noobaa-core-0                                                     1/1     Running     0              66m
noobaa-db-pg-0                                                    1/1     Running     0              66m
noobaa-endpoint-654d57b548-c4svt                                  1/1     Running     0              128m
noobaa-endpoint-654d57b548-s8tj4                                  1/1     Running     0              67m
noobaa-operator-7bdd5cb576-rbh2p                                  2/2     Running     0              13h
ocs-metrics-exporter-67b9f9855d-f7rdv                             1/1     Running     1 (67m ago)    13h
ocs-operator-c8d7c579f-pvfln                                      1/1     Running     0              13h
odf-console-7888dd6746-rl7n2                                      1/1     Running     0              13h
odf-operator-controller-manager-7749cdb995-dw7mc                  2/2     Running     0              13h
rook-ceph-crashcollector-worker-0-554f85b66-gnh52                 1/1     Running     0              13h
rook-ceph-crashcollector-worker-1-646b58c45f-gvm59                1/1     Running     0              13h
rook-ceph-crashcollector-worker-2-754dbbbfd6-b4fsn                1/1     Running     0              13h
rook-ceph-exporter-worker-0-666bd75845-tnn57                      1/1     Running     0              13h
rook-ceph-exporter-worker-1-6f9d4c69c6-j6g88                      1/1     Running     0              13h
rook-ceph-exporter-worker-2-54fcd47ccc-cspg5                      1/1     Running     0              13h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-6487686cs8tlz   2/2     Running     0              13h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-f8b946dfqqrgm   2/2     Running     0              13h
rook-ceph-mgr-a-744479898c-jnrnx                                  2/2     Running     0              13h
rook-ceph-mon-a-574b64f99-xkvd7                                   2/2     Running     0              13h
rook-ceph-mon-b-6dffc99fdd-pf7f9                                  2/2     Running     0              13h
rook-ceph-mon-c-768c9bd57d-49m6f                                  2/2     Running     0              13h
rook-ceph-nfs-ocs-storagecluster-cephnfs-a-749b7554b9-8zlpx       2/2     Running     0              67m
rook-ceph-operator-857ccc7545-ns89c                               1/1     Running     0              67m
rook-ceph-osd-0-5548bc9796-cbzgl                                  2/2     Running     0              13h
rook-ceph-osd-1-778c784bd5-jcd2t                                  2/2     Running     0              13h
rook-ceph-osd-2-8c55ccff8-dtjrm                                   2/2     Running     1 (131m ago)   13h
rook-ceph-osd-prepare-28a84de044d061da27bf0f4f43d99cab-qcqms      0/1     Completed   0              13h
rook-ceph-osd-prepare-3535e54116e131773d2accb9756b7796-46r9l      0/1     Completed   0              13h
rook-ceph-osd-prepare-92eebb804a903eedb1a420bdd15a845a-28rvv      0/1     Completed   0              13h
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-857895bt8bq7   2/2     Running     0              13h
rook-ceph-tools-6cb655c7d-mfmt5                                   1/1     Running     0              13h
ux-backend-server-996cffddb-6j59g                                 2/2     Running     0              13h

coredump: https://drive.google.com/file/d/1HQDVMFUPtMCAgUcEOj-tPBeINsH3u5IG/view?usp=sharing
must-gather: https://drive.google.com/file/d/17Xkfp2cwYuSli5SDeLwb6BfWPBkilfCc/view?usp=sharing

-----

@bhubbard I hope these test results show the issue with the logs you need. It looks like only osd.2 may have crashed in this test run.

-----

@aaaggarw it looks like Justin was able to get an RHCS build that strips out the commit Brad suspected could be the problematic one. Are you able to run through the ODF test suite using that RHCS image, to determine if it is exacerbating the issue?

BZ is turning some of the text into a hyperlink, but AFAICT, the image is just as the text appears here: registry-proxy.engineering.redhat.com/rh-osbs/rhceph:6-17.2.6.TEST.bz2277603

*** Bug 2280973 has been marked as a duplicate of this bug. ***

Thanks Parth for pushing the image to public repo.
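For whoever picks up the coredump linked above: a hedged sketch of pulling the recorded backtrace out of the cluster and out of the coredump. `ceph crash info` is a standard command and the crash ID comes from the `ceph crash ls` output earlier; the coredump filename and the availability of matching ceph-osd debuginfo are assumptions:

```bash
# From the toolbox pod: dump the metadata and backtrace Ceph recorded for
# the osd.2 crash listed by `ceph crash ls`.
ceph crash info 2024-06-05T07:00:33.318587Z_892175db-bd0a-48cc-a0e4-23da2970608a

# On a host with a matching ceph-osd binary and its debuginfo installed,
# extract a full backtrace from the captured coredump (filename assumed):
gdb /usr/bin/ceph-osd core.ceph-osd.osd2 \
    -ex 'thread apply all bt full' -ex 'quit' > osd2-backtrace.txt
```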
I tried updating OCS-operator CSV with `quay.io/paarora/ceph:v700`:

[root@rdr-odf4152ver-bastion-0 ~]# oc edit csv ocs-operator.v4.15.2-rhodf -n openshift-storage
clusterserviceversion.operators.coreos.com/ocs-operator.v4.15.2-rhodf edited
[root@rdr-odf4152ver-bastion-0 ~]# oc get csv ocs-operator.v4.15.2-rhodf -n openshift-storage -o yaml |grep ceph:v700
        value: quay.io/paarora/ceph:v700

pods:

[root@rdr-odf4152ver-bastion-0 ~]# oc get pods
NAME                                                              READY   STATUS             RESTARTS        AGE
ceph-file-controller-detect-version-vkfqp                         0/1     CrashLoopBackOff   3 (20s ago)     93s
ceph-nfs-controller-detect-version-7znlr                          0/1     CrashLoopBackOff   3 (22s ago)     93s
ceph-object-controller-detect-version-mkpw2                       0/1     CrashLoopBackOff   3 (19s ago)     96s
csi-addons-controller-manager-76d57c4596-6mdqt                    2/2     Running            0               2d10h
csi-cephfsplugin-cb25t                                            2/2     Running            1 (3d2h ago)    3d2h
csi-cephfsplugin-cshbv                                            2/2     Running            0               3d2h
csi-cephfsplugin-provisioner-7545464b85-lvvmv                     6/6     Running            0               3d2h
csi-cephfsplugin-provisioner-7545464b85-rjnfz                     6/6     Running            0               2d11h
csi-cephfsplugin-vgwc5                                            2/2     Running            0               3d2h
csi-nfsplugin-bdj7z                                               2/2     Running            0               2d17h
csi-nfsplugin-jkdt7                                               2/2     Running            0               2d17h
csi-nfsplugin-provisioner-68b6ff9c5-kzjrf                         5/5     Running            0               2d17h
csi-nfsplugin-provisioner-68b6ff9c5-m7plr                         5/5     Running            0               2d11h
csi-nfsplugin-r9xwp                                               2/2     Running            0               2d17h
csi-rbdplugin-76zhn                                               3/3     Running            0               3d2h
csi-rbdplugin-dt6j9                                               3/3     Running            0               3d2h
csi-rbdplugin-provisioner-7d9c7cf9b8-jcfs4                        6/6     Running            0               2d11h
csi-rbdplugin-provisioner-7d9c7cf9b8-pmz2z                        6/6     Running            0               3d2h
csi-rbdplugin-tg99f                                               3/3     Running            0               3d2h
noobaa-core-0                                                     1/1     Running            0               2d11h
noobaa-db-pg-0                                                    1/1     Running            0               2d11h
noobaa-endpoint-67cfb66697-njpln                                  1/1     Running            0               2d18h
noobaa-operator-5f7f746647-gx24j                                  1/1     Running            0               2d11h
ocs-metrics-exporter-5856bbf967-dddp8                             1/1     Running            0               2d11h
ocs-operator-67f46d8666-5sgzb                                     1/1     Running            0               2m7s
odf-console-c69c7864b-znhgx                                       1/1     Running            0               3d2h
odf-operator-controller-manager-65dbb56b4d-j2vt2                  2/2     Running            1 (2d17h ago)   3d2h
rook-ceph-crashcollector-worker-0-5fdb4ff8f7-25cg5                1/1     Running            0               2d10h
rook-ceph-crashcollector-worker-1-7559dc47dd-mwgsg                1/1     Running            0               2d22h
rook-ceph-crashcollector-worker-2-7d87957c55-94fz9                1/1     Running            0               2d22h
rook-ceph-detect-version-97r5r                                    0/1     CrashLoopBackOff   3 (20s ago)     97s
rook-ceph-exporter-worker-0-777dfb96cc-jhvpj                      1/1     Running            0               2d10h
rook-ceph-exporter-worker-1-789c4d9d9b-qkvzg                      1/1     Running            0               2d22h
rook-ceph-exporter-worker-2-77bf6f5747-fchzk                      1/1     Running            0               2d22h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-86488d667qzk2   2/2     Running            8 (2d10h ago)   2d22h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-64b8575fwdzv4   2/2     Running            8 (2d10h ago)   2d11h
rook-ceph-mgr-a-6f7d9b9494-4dxdv                                  3/3     Running            0               2d22h
rook-ceph-mgr-b-7566b65bff-m5bcw                                  3/3     Running            0               2d22h
rook-ceph-mon-a-5857cb9688-6qcl5                                  2/2     Running            0               2d22h
rook-ceph-mon-b-6f67b5555d-hwjwp                                  2/2     Running            0               2d10h
rook-ceph-mon-c-79f49ff657-gbwq6                                  2/2     Running            0               2d22h
rook-ceph-nfs-ocs-storagecluster-cephnfs-a-68dccf77dc-2jn92       2/2     Running            0               2d17h
rook-ceph-operator-7bb4b69c4-586x7                                1/1     Running            0               2d17h
rook-ceph-osd-0-7ff4959fb5-9p276                                  2/2     Running            0               2d22h
rook-ceph-osd-1-6cc85dd567-qwjzv                                  2/2     Running            0               2d22h
rook-ceph-osd-2-557fb8b4cf-tcd2k                                  2/2     Running            0               2d11h
rook-ceph-osd-prepare-15dcf7c38e6e2cb4e402ac852df2376a-8fqjk      0/1     Completed          0               3d2h
rook-ceph-osd-prepare-f9b54076c6b8e10416ddb285c1f7f548-4fsk4      0/1     Completed          0               3d2h
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-656d598jhkx7   2/2     Running            0               2d22h
rook-ceph-tools-65d7788bc6-crjnb                                  1/1     Running            0               3d2h
ux-backend-server-775c4c4956-zjpxm                                2/2     Running            0               2d11h
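The detect-version pods are the first consumers of the new image, and the failure mode shown below ("exec /rook/copied-binaries/rook: no such file or directory") is worth decoding before running anything else. On an existing binary, that error classically means the kernel cannot find the binary's ELF interpreter, which is one plausible outcome of running a ppc64le rook binary inside an image built only for another architecture. A hedged diagnostic sketch (the multi-arch hypothesis is mine, not confirmed in this bug):

```bash
# Inspect which OS/arch variants the test image actually ships; on a
# ppc64le cluster a linux/ppc64le entry must be present.
oc image info quay.io/paarora/ceph:v700

# Alternatively, look at the raw manifest list with skopeo + jq.
skopeo inspect --raw docker://quay.io/paarora/ceph:v700 \
  | jq '.manifests[].platform'
```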
[root@rdr-odf4152ver-bastion-0 ~]# oc describe pod rook-ceph-detect-version-97r5r
Name:             rook-ceph-detect-version-97r5r
Namespace:        openshift-storage
Priority:         0
Service Account:  rook-ceph-cmd-reporter
Node:             worker-0/10.20.180.190
Start Time:       Thu, 06 Jun 2024 05:28:05 -0400
Labels:           app=rook-ceph-detect-version
                  batch.kubernetes.io/controller-uid=67299e35-f26a-4c46-ad45-98edf53f6883
                  batch.kubernetes.io/job-name=rook-ceph-detect-version
                  controller-uid=67299e35-f26a-4c46-ad45-98edf53f6883
                  job-name=rook-ceph-detect-version
                  rook-version=v4.15.2-0.62c88d463a1f068556dd5e48b736ea5354cd79a0
Annotations:      k8s.ovn.org/pod-networks: {"default":{"ip_addresses":["10.129.3.65/23"],"mac_address":"0a:58:0a:81:03:41","gateway_ips":["10.129.2.1"],"routes":[{"dest":"10.128.0.0...
                  k8s.v1.cni.cncf.io/network-status: [{ "name": "ovn-kubernetes", "interface": "eth0", "ips": [ "10.129.3.65" ], "mac": "0a:58:0a:81:03:41", "default": true, "dns": {} }]
                  openshift.io/scc: restricted-v2
                  seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status:           Running
SeccompProfile:   RuntimeDefault
IP:               10.129.3.65
IPs:
  IP:           10.129.3.65
Controlled By:  Job/rook-ceph-detect-version
Init Containers:
  init-copy-binaries:
    Container ID:  cri-o://ac2a2b1c2acdfedb72898edf3451a5c2e085a47f16726aa5567c2d3b329a46d4
    Image:         registry.redhat.io/odf4/rook-ceph-rhel9-operator@sha256:400ccfcf6f63db49823e6e71908b9d5b96486a0c0c9a083bb283f9fe29efa01d
    Image ID:      registry.redhat.io/odf4/rook-ceph-rhel9-operator@sha256:400ccfcf6f63db49823e6e71908b9d5b96486a0c0c9a083bb283f9fe29efa01d
    Port:          <none>
    Host Port:     <none>
    Command:
      cp
    Args:
      --archive
      --force
      --verbose
      /usr/local/bin/rook
      /rook/copied-binaries
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 06 Jun 2024 05:28:06 -0400
      Finished:     Thu, 06 Jun 2024 05:28:06 -0400
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /rook/copied-binaries from rook-copied-binaries (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-m5m64 (ro)
Containers:
  cmd-reporter:
    Container ID:  cri-o://1c6ae897d3b8d347ef21e0e1341d1a9579e8e199b3df60cad178f37117aae260
    Image:         quay.io/paarora/ceph:v700
    Image ID:      quay.io/paarora/ceph@sha256:2895e3af6615ecc0631e6c73dc5e03ac7a4319357c35021925699cf6227f312d
    Port:          <none>
    Host Port:     <none>
    Command:
      /rook/copied-binaries/rook
    Args:
      cmd-reporter
      --command
      {"cmd":["ceph"],"args":["--version"]}
      --config-map-name
      rook-ceph-detect-version
      --namespace
      openshift-storage
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Thu, 06 Jun 2024 05:29:22 -0400
      Finished:     Thu, 06 Jun 2024 05:29:22 -0400
    Ready:          False
    Restart Count:  3
    Environment:    <none>
    Mounts:
      /rook/copied-binaries from rook-copied-binaries (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-m5m64 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  rook-copied-binaries:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  kube-api-access-m5m64:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                             node.ocs.openshift.io/storage=true:NoSchedule
Events:
  Type     Reason          Age                From               Message
  ----     ------          ----               ----               -------
  Normal   Scheduled       112s               default-scheduler  Successfully assigned openshift-storage/rook-ceph-detect-version-97r5r to worker-0
  Normal   AddedInterface  112s               multus             Add eth0 [10.129.3.65/23] from ovn-kubernetes
  Normal   Pulled          112s               kubelet            Container image "registry.redhat.io/odf4/rook-ceph-rhel9-operator@sha256:400ccfcf6f63db49823e6e71908b9d5b96486a0c0c9a083bb283f9fe29efa01d" already present on machine
  Normal   Created         112s               kubelet            Created container init-copy-binaries
  Normal   Started         112s               kubelet            Started container init-copy-binaries
  Normal   Pulling         111s               kubelet            Pulling image "quay.io/paarora/ceph:v700"
  Normal   Pulled          81s                kubelet            Successfully pulled image "quay.io/paarora/ceph:v700" in 29.391s (29.391s including waiting)
  Normal   Created         36s (x4 over 81s)  kubelet            Created container cmd-reporter
  Normal   Started         36s (x4 over 81s)  kubelet            Started container cmd-reporter
  Normal   Pulled          36s (x3 over 80s)  kubelet            Container image "quay.io/paarora/ceph:v700" already present on machine
  Warning  BackOff         11s (x7 over 79s)  kubelet            Back-off restarting failed container cmd-reporter in pod rook-ceph-detect-version-97r5r_openshift-storage(d454922f-1f35-4bdb-bf8e-6b65ef18fc7a)

error:

[root@rdr-odf4152ver-bastion-0 ~]# oc logs pod/rook-ceph-detect-version-97r5r
Defaulted container "cmd-reporter" out of: cmd-reporter, init-copy-binaries (init)
exec /rook/copied-binaries/rook: no such file or directory

The image is updated in the ocs-operator pod:

[root@rdr-odf4152ver-bastion-0 ~]# oc describe pod ocs-operator-67f46d8666-5sgzb |grep CEPH_IMAGE
      ROOK_CEPH_IMAGE:  registry.redhat.io/odf4/rook-ceph-rhel9-operator@sha256:400ccfcf6f63db49823e6e71908b9d5b96486a0c0c9a083bb283f9fe29efa01d
      CEPH_IMAGE:       quay.io/paarora/ceph:v700
[root@rdr-odf4152ver-bastion-0 ~]#

and the ocs-operator pod is running. Also, ceph health is in the OK state:

[root@rdr-odf4152ver-bastion-0 ~]# oc get cephcluster
NAME                             DATADIRHOSTPATH   MONCOUNT   AGE    PHASE   MESSAGE                        HEALTH      EXTERNAL   FSID
ocs-storagecluster-cephcluster   /var/lib/rook     3          3d3h   Ready   Cluster created successfully   HEALTH_OK              b7029531-6b77-4478-8935-d806637a7bb4
[root@rdr-odf4152ver-bastion-0 ~]# oc rsh rook-ceph-tools-65d7788bc6-crjnb
sh-5.1$ ceph -s
  cluster:
    id:     b7029531-6b77-4478-8935-d806637a7bb4
    health: HEALTH_OK
  services:
    mon: 3 daemons, quorum a,b,c (age 2d)
    mgr: b(active, since 2d), standbys: a
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 2d), 3 in (since 3d)
    rgw: 1 daemon active (1 hosts, 1 zones)
  data:
    volumes: 1/1 healthy
    pools:   13 pools, 201 pgs
    objects: 16.77k objects, 61 GiB
    usage:   147 GiB used, 1.3 TiB / 1.5 TiB avail
    pgs:     201 active+clean
  io:
    client: 1.5 KiB/s rd, 418 KiB/s wr, 2 op/s rd, 2 op/s wr

Should I execute tier tests on the setup?

Yes. Please run the test suite(s) that have been causing OSDs to crash to determine if the reverted commit alleviates the issue.

I hope the following error will not cause any issue while running the test suite.
[root@rdr-odf4152ver-bastion-0 ~]# oc get pods |grep -v "Completed\|Running"
NAME                                          READY   STATUS             RESTARTS      AGE
ceph-file-controller-detect-version-5kwvf     0/1     CrashLoopBackOff   2 (18s ago)   37s
ceph-nfs-controller-detect-version-qtn7p      0/1     CrashLoopBackOff   2 (20s ago)   36s
ceph-object-controller-detect-version-7mj6h   0/1     CrashLoopBackOff   2 (21s ago)   40s
rook-ceph-detect-version-2kfrd                0/1     CrashLoopBackOff   2 (23s ago)   39s
[root@rdr-odf4152ver-bastion-0 ~]# oc logs pod/ceph-file-controller-detect-version-5kwvf
Defaulted container "cmd-reporter" out of: cmd-reporter, init-copy-binaries (init)
exec /rook/copied-binaries/rook: no such file or directory
[root@rdr-odf4152ver-bastion-0 ~]# oc logs pod/ceph-object-controller-detect-version-7mj6h
Defaulted container "cmd-reporter" out of: cmd-reporter, init-copy-binaries (init)
exec /rook/copied-binaries/rook: no such file or directory
[root@rdr-odf4152ver-bastion-0 ~]# oc logs pod/rook-ceph-detect-version-2kfrd
Defaulted container "cmd-reporter" out of: cmd-reporter, init-copy-binaries (init)
exec /rook/copied-binaries/rook: no such file or directory

Executing the tier test on the setup. Will upload the results once done.

Tier tests are completed and they ran as expected. Ceph health is also in the OK state, but its status shows Progressing due to the cmd-reporter:

[root@rdr-odf4152ver-bastion-0 scripts]# oc get pods -n openshift-storage
NAME                                                              READY   STATUS    RESTARTS        AGE
csi-addons-controller-manager-76d57c4596-rmd2m                    2/2     Running   0               47h
csi-cephfsplugin-cb25t                                            2/2     Running   1 (6d22h ago)   6d22h
csi-cephfsplugin-cshbv                                            2/2     Running   0               6d22h
csi-cephfsplugin-provisioner-7545464b85-lvvmv                     6/6     Running   0               6d22h
csi-cephfsplugin-provisioner-7545464b85-rjnfz                     6/6     Running   0               6d6h
csi-cephfsplugin-vgwc5                                            2/2     Running   0               6d22h
csi-rbdplugin-76zhn                                               3/3     Running   0               6d22h
csi-rbdplugin-dt6j9                                               3/3     Running   0               6d22h
csi-rbdplugin-provisioner-7d9c7cf9b8-jcfs4                        6/6     Running   0               6d6h
csi-rbdplugin-provisioner-7d9c7cf9b8-pmz2z                        6/6     Running   0               6d22h
csi-rbdplugin-tg99f                                               3/3     Running   0               6d22h
noobaa-core-0                                                     1/1     Running   0               2d1h
noobaa-db-pg-0                                                    1/1     Running   0               2d1h
noobaa-endpoint-67cfb66697-v2q9b                                  1/1     Running   0               2d1h
noobaa-operator-5f7f746647-rnqmn                                  1/1     Running   0               2d1h
ocs-metrics-exporter-5856bbf967-dddp8                             1/1     Running   0               6d6h
ocs-operator-67f46d8666-4mslb                                     1/1     Running   0               2d1h
odf-console-c69c7864b-znhgx                                       1/1     Running   0               6d22h
odf-operator-controller-manager-65dbb56b4d-j2vt2                  2/2     Running   1 (6d12h ago)   6d22h
rook-ceph-crashcollector-worker-0-5fdb4ff8f7-22k6t                1/1     Running   0               2d1h
rook-ceph-crashcollector-worker-1-7559dc47dd-mwgsg                1/1     Running   0               6d18h
rook-ceph-crashcollector-worker-2-7d87957c55-94fz9                1/1     Running   0               6d18h
rook-ceph-exporter-worker-0-777dfb96cc-5jt7j                      1/1     Running   0               2d1h
rook-ceph-exporter-worker-1-789c4d9d9b-qkvzg                      1/1     Running   0               6d18h
rook-ceph-exporter-worker-2-77bf6f5747-fchzk                      1/1     Running   0               6d18h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-86488d667qzk2   2/2     Running   8 (6d6h ago)    6d18h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-64b8575fwdzv4   2/2     Running   8 (6d6h ago)    6d6h
rook-ceph-mgr-a-6f7d9b9494-4dxdv                                  3/3     Running   0               6d18h
rook-ceph-mgr-b-7566b65bff-m5bcw                                  3/3     Running   0               6d18h
rook-ceph-mon-a-5857cb9688-6qcl5                                  2/2     Running   0               6d18h
rook-ceph-mon-b-6f67b5555d-lj2pg                                  2/2     Running   0               2d1h
rook-ceph-mon-c-79f49ff657-gbwq6                                  2/2     Running   0               6d18h
rook-ceph-nfs-ocs-storagecluster-cephnfs-a-68dccf77dc-2jn92       2/2     Running   0               6d12h
rook-ceph-operator-7bb4b69c4-sbntz                                1/1     Running   0               2d1h
rook-ceph-osd-0-7ff4959fb5-9p276                                  2/2     Running   0               6d18h
rook-ceph-osd-1-6cc85dd567-qwjzv                                  2/2     Running   0               6d18h
rook-ceph-osd-2-557fb8b4cf-f6snc                                  2/2     Running   0               2d1h
rook-ceph-osd-prepare-15dcf7c38e6e2cb4e402ac852df2376a-8fqjk      0/1     Completed   0               6d22h
rook-ceph-osd-prepare-f9b54076c6b8e10416ddb285c1f7f548-4fsk4      0/1     Completed   0               6d22h
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-656d598jhkx7   2/2     Running     0               6d18h
rook-ceph-tools-65d7788bc6-crjnb                                  1/1     Running     0               6d22h
ux-backend-server-775c4c4956-zjpxm                                2/2     Running     0               6d6h

[root@rdr-odf4152ver-bastion-0 scripts]# oc get cephcluster -n openshift-storage
NAME                             DATADIRHOSTPATH   MONCOUNT   AGE     PHASE         MESSAGE                   HEALTH      EXTERNAL   FSID
ocs-storagecluster-cephcluster   /var/lib/rook     3          6d22h   Progressing   Detecting Ceph version    HEALTH_OK              b7029531-6b77-4478-8935-d806637a7bb4
[root@rdr-odf4152ver-bastion-0 scripts]# oc describe cephcluster -n openshift-storage
Name:         ocs-storagecluster-cephcluster
Namespace:    openshift-storage
Labels:       app=ocs-storagecluster
Annotations:  <none>
API Version:  ceph.rook.io/v1
Kind:         CephCluster
Metadata:
  Creation Timestamp:  2024-06-03T06:31:58Z
  Finalizers:
    cephcluster.ceph.rook.io
  Generation:  4
  Owner References:
    API Version:           ocs.openshift.io/v1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  StorageCluster
    Name:                  ocs-storagecluster
    UID:                   297077f4-c0be-4df5-94c0-fb657002cd61
  Resource Version:        5964439
  UID:                     e418bbb8-45ef-46a6-b6c5-7e0301f06a8e
Spec:
  Ceph Version:
    Image:  quay.io/paarora/ceph:v700
  Cleanup Policy:
    Sanitize Disks:
  Continue Upgrade After Checks Even If Not Healthy:  true
  Crash Collector:
  Csi:
    Cephfs:
      Kernel Mount Options:  ms_mode=prefer-crc
    Read Affinity:
      Enabled:  true
  Dashboard:
  Data Dir Host Path:  /var/lib/rook
  Disruption Management:
    Machine Disruption Budget Namespace:  openshift-machine-api
    Manage Pod Budgets:                   true
  External:
  Health Check:
    Daemon Health:
      Mon:
      Osd:
      Status:
  Labels:
    Exporter:
      rook.io/managedBy:  ocs-storagecluster
    Mgr:
      Odf - Resource - Profile:
    Mon:
      Odf - Resource - Profile:
    Monitoring:
      rook.io/managedBy:  ocs-storagecluster
    Osd:
      Odf - Resource - Profile:
  Log Collector:
    Enabled:       true
    Max Log Size:  500Mi
    Periodicity:   daily
  Mgr:
    Count:  2
    Modules:
      Enabled:  true
      Name:     pg_autoscaler
      Enabled:  true
      Name:     balancer
  Mon:
    Count:  3
  Monitoring:
    Enabled:  true
  Network:
    Connections:
      requireMsgr2:  true
    Multi Cluster Service:
  Placement:
    All:
      Node Affinity:
        Required During Scheduling Ignored During Execution:
          Node Selector Terms:
            Match Expressions:
              Key:       cluster.ocs.openshift.io/openshift-storage
              Operator:  Exists
      Tolerations:
        Effect:    NoSchedule
        Key:       node.ocs.openshift.io/storage
        Operator:  Equal
        Value:     true
    Arbiter:
      Tolerations:
        Effect:    NoSchedule
        Key:       node-role.kubernetes.io/master
        Operator:  Exists
    Mon:
      Node Affinity:
        Required During Scheduling Ignored During Execution:
          Node Selector Terms:
            Match Expressions:
              Key:       cluster.ocs.openshift.io/openshift-storage
              Operator:  Exists
      Pod Anti Affinity:
        Required During Scheduling Ignored During Execution:
          Label Selector:
            Match Expressions:
              Key:       app
              Operator:  In
              Values:    rook-ceph-mon
          Topology Key:  kubernetes.io/hostname
  Priority Class Names:
    Mgr:  system-node-critical
    Mon:  system-node-critical
    Osd:  system-node-critical
  Resources:
    Mgr:
      Limits:
        Cpu:     2
        Memory:  3Gi
      Requests:
        Cpu:     1
        Memory:  1536Mi
    Mon:
      Limits:
        Cpu:     1
        Memory:  2Gi
      Requests:
        Cpu:     1
        Memory:  2Gi
  Security:
    Key Rotation:
      Enabled:  false
    Kms:
  Storage:
    Flapping Restart Interval Hours:  24
    Storage Class Device Sets:
      Count:  3
      Name:   ocs-deviceset-localblock-0
      Placement:
        Node Affinity:
          Required During Scheduling Ignored During Execution:
            Node Selector Terms:
              Match Expressions:
                Key:       cluster.ocs.openshift.io/openshift-storage
                Operator:  Exists
        Tolerations:
          Effect:    NoSchedule
          Key:       node.ocs.openshift.io/storage
          Operator:  Equal
          Value:     true
        Topology Spread Constraints:
          Label Selector:
            Match Expressions:
              Key:       app
              Operator:  In
              Values:    rook-ceph-osd
          Max Skew:            1
          Topology Key:        kubernetes.io/hostname
          When Unsatisfiable:  ScheduleAnyway
      Prepare Placement:
        Node Affinity:
          Required During Scheduling Ignored During Execution:
            Node Selector Terms:
              Match Expressions:
                Key:       cluster.ocs.openshift.io/openshift-storage
                Operator:  Exists
        Tolerations:
          Effect:    NoSchedule
          Key:       node.ocs.openshift.io/storage
          Operator:  Equal
          Value:     true
        Topology Spread Constraints:
          Label Selector:
            Match Expressions:
              Key:       app
              Operator:  In
              Values:    rook-ceph-osd rook-ceph-osd-prepare
          Max Skew:            1
          Topology Key:        kubernetes.io/hostname
          When Unsatisfiable:  ScheduleAnyway
      Resources:
        Limits:
          Cpu:     2
          Memory:  5Gi
        Requests:
          Cpu:     2
          Memory:  5Gi
      Tune Fast Device Class:  true
      Volume Claim Templates:
        Metadata:
          Annotations:
            Crush Device Class:  ssd
        Spec:
          Access Modes:  ReadWriteOnce
          Resources:
            Requests:
              Storage:         100Gi
          Storage Class Name:  localblock
          Volume Mode:         Block
        Status:
    Store:
Status:
  Ceph:
    Capacity:
      Bytes Available:  1453224570880
      Bytes Total:      1610612736000
      Bytes Used:       157388165120
      Last Updated:     2024-06-06T09:51:29Z
    Fsid:             b7029531-6b77-4478-8935-d806637a7bb4
    Health:           HEALTH_OK
    Last Changed:     2024-06-03T22:39:11Z
    Last Checked:     2024-06-06T09:51:29Z
    Previous Health:  HEALTH_WARN
    Versions:
      Mds:      ceph version 17.2.6-196.el9cp (cbbf2cfb549196ca18c0c9caff9124d83ed681a4) quincy (stable): 2
      Mgr:      ceph version 17.2.6-196.el9cp (cbbf2cfb549196ca18c0c9caff9124d83ed681a4) quincy (stable): 2
      Mon:      ceph version 17.2.6-196.el9cp (cbbf2cfb549196ca18c0c9caff9124d83ed681a4) quincy (stable): 3
      Osd:      ceph version 17.2.6-196.el9cp (cbbf2cfb549196ca18c0c9caff9124d83ed681a4) quincy (stable): 3
      Overall:  ceph version 17.2.6-196.el9cp (cbbf2cfb549196ca18c0c9caff9124d83ed681a4) quincy (stable): 11
      Rgw:      ceph version 17.2.6-196.el9cp (cbbf2cfb549196ca18c0c9caff9124d83ed681a4) quincy (stable): 1
  Conditions:
    Last Heartbeat Time:   2024-06-06T09:51:30Z
    Last Transition Time:  2024-06-03T06:33:46Z
    Message:               Cluster created successfully
    Reason:                ClusterCreated
    Status:                True
    Type:                  Ready
    Last Heartbeat Time:   2024-06-10T04:47:39Z
    Last Transition Time:  2024-06-10T04:47:39Z
    Message:               Detecting Ceph version
    Reason:                ClusterProgressing
    Status:                True
    Type:                  Progressing
  Message:              Detecting Ceph version
  Observed Generation:  2
  Phase:                Progressing
  State:                Creating
  Storage:
    Device Classes:
      Name:  ssd
    Osd:
      Store Type:
        Bluestore:  3
  Version:
    Image:    registry.redhat.io/rhceph/rhceph-6-rhel9@sha256:9dbd051cfcdb334aad33a536cc115ae1954edaea5f8cb5943ad615f1b41b0226
    Version:  17.2.6-196
Events:
  Type     Reason           Age                   From                          Message
  ----     ------           ----                  ----                          -------
  Warning  ReconcileFailed  29m (x102 over 2d1h)  rook-ceph-cluster-controller  failed to reconcile CephCluster "openshift-storage/ocs-storagecluster-cephcluster". failed to reconcile cluster "ocs-storagecluster-cephcluster": failed to configure local ceph cluster: failed the ceph version check: failed to complete ceph version job: failed to run CmdReporter rook-ceph-detect-version successfully. failed waiting for results ConfigMap rook-ceph-detect-version. timed out waiting for results ConfigMap
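The ReconcileFailed event above shows the operator stuck waiting for the CmdReporter's results ConfigMap, which the crashing detect-version pod never writes. A hedged sketch of how the reconcile could be unstuck once a working image is in place; the resource names come from the events above, and the recovery steps are a suggestion rather than a documented procedure:

```bash
# Remove the stale detect-version job and its results ConfigMap so the
# operator can re-create them on the next reconcile.
oc -n openshift-storage delete job rook-ceph-detect-version --ignore-not-found
oc -n openshift-storage delete configmap rook-ceph-detect-version --ignore-not-found

# Bounce the rook operator to trigger an immediate reconcile.
oc -n openshift-storage rollout restart deploy/rook-ceph-operator

# Watch the CephCluster phase move from Progressing back to Ready.
oc -n openshift-storage get cephcluster -w
```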