Bug 2277603

Summary: Ceph health is going to Error state on ODF 4.15.x on IBM Power
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Pooja Soni <posoni>
Component: ceph
Sub component: RADOS
Assignee: Radoslaw Zarzynski <rzarzyns>
QA Contact: Elad <ebenahar>
Status: NEW
Severity: high
Priority: unspecified
CC: aaaggarw, akupczyk, bhubbard, bniver, brgardne, jcaratza, kramdoss, muagarwa, nojha, odf-bz-bot, paarora, rzarzyns, sheggodu, sostapov, tnielsen
Version: 4.15
Flags: brgardne: needinfo? (bhubbard)
       bhubbard: needinfo? (akupczyk)
       brgardne: needinfo? (aaaggarw)
Hardware: ppc64le
OS: Linux
Type: Bug

Description Pooja Soni 2024-04-28 14:21:02 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
While running the Tier1 suite on ODF 4.15.2, Ceph health goes into an error state.
sh-5.1$ ceph health
HEALTH_ERR 1/654 objects unfound (0.153%); 7 scrub errors; Possible data damage: 1 pg recovery_unfound, 4 pgs inconsistent; Degraded data redundancy: 3/1962 objects degraded (0.153%), 1 pg degraded; 3 slow ops, oldest one blocked for 265101 sec, daemons [osd.1,osd.2] have slow ops.
sh-5.1$

Version of all relevant components (if applicable):
ODF version - 4.15.2
OCP version - 4.15.2

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?
We can skip the failing test case and continue with the rest of the test execution (see the example below).
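
For example, assuming the standard ocs-ci pytest wrapper passes ordinary pytest options through, the test module can be deselected like this (path taken from the ocs-ci link in comment #5; the exact node ID may differ):

run-ci -m tier1 --deselect tests/cross_functional/kcs/test_selinux_relabel_solution.py <other ocs-ci options...>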

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Install ODF 4.15.2 and execute the Tier1 test suite.
2. During execution of the test_selinux_relabel_for_existing_pvc[5] test case, Ceph health goes into an error state.


Actual results:


Expected results:


Additional info:

Comment 4 Santosh Pillai 2024-04-29 04:18:38 UTC
Hi, 
what is Tier1?
What operations were performed on the cluster before it went to the current state?

Comment 5 Pooja Soni 2024-04-29 06:28:44 UTC
Tier1 is the test suite that is executed as part of the zstream 4.15.2 validation, and the test case below is part of that suite:
test_selinux_relabel_for_existing_pvc[5]

After running the test_selinux_relabel_for_existing_pvc[5] test case we see this issue. Link to the test case: https://github.com/red-hat-storage/ocs-ci/blob/6de27377af27d626991b2b0b590f534a91a81400/tests/cross_functional/kcs/test_selinux_relabel_solution.py#L227

This test case creates a PVC, attaches it to a pod, creates multiple directories with files, and applies SELinux relabeling. The issue is seen on multiple clusters with ODF 4.15.2 installed.
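
For context, a rough sketch of the kind of load the test generates (purely illustrative; paths, counts, and the chcon-based relabel are assumptions, the authoritative steps are in the ocs-ci source linked above):

# inside a pod with the test PVC mounted at /mnt/data (illustrative path)
for d in $(seq 1 100); do
  mkdir -p /mnt/data/dir-$d
  for f in $(seq 1 100); do echo sample > /mnt/data/dir-$d/file-$f; done
done
# force an SELinux relabel of every file on the volume
chcon -R -t container_file_t /mnt/data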

Comment 6 Santosh Pillai 2024-04-29 09:30:46 UTC
(In reply to Pooja Soni from comment #5)
> Tier1 is the test suite that is executed as part of the zstream 4.15.2
> validation, and the test case below is part of that suite:
> test_selinux_relabel_for_existing_pvc[5]
> 
> After running the test_selinux_relabel_for_existing_pvc[5] test case we see
> this issue. Link to the test case:
> https://github.com/red-hat-storage/ocs-ci/blob/6de27377af27d626991b2b0b590f534a91a81400/tests/cross_functional/kcs/test_selinux_relabel_solution.py#L227
> 
> This test case creates a PVC, attaches it to a pod, creates multiple
> directories with files, and applies SELinux relabeling. The issue is seen
> on multiple clusters with ODF 4.15.2 installed.

Thanks for the details. 
Do you have a live cluster that I can take a look at?

Comment 7 Santosh Pillai 2024-04-29 11:33:40 UTC
Got a live cluster from Aaruni. Didn't see any issues at the Rook level, but the osd.0 pod is crashing.

Hi Radoslaw, 
Can you take a look at the OSD pod crashing?

Thanks.

Comment 11 Pooja Soni 2024-04-29 15:21:10 UTC
We tried again on a different setup and it failed again with this error:

Ceph cluster health is not OK. Health: HEALTH_ERR 1/60838 objects unfound (0.002%); Reduced data availability: 37 pgs peering; Possible data damage: 1 pg recovery_unfound; Degraded data redundancy: 41007/182514 objects degraded (22.468%), 75 pgs degraded, 109 pgs undersized; 5 daemons have recently crashed; 1 slow ops, oldest one blocked for 120 sec, daemons [osd.1,osd.2] have slow ops.

Must gather log for this setup - https://drive.google.com/file/d/1G_zg9vF8xI3c74hZBxmK4Q-Q-3VmZwtk/view?usp=sharing

Comment 14 Pooja Soni 2024-05-15 10:38:31 UTC
I got the same issue when running Tier1 on ODF 4.14.7. Ceph health went into an error state after execution of the test_selinux_relabel_for_existing_pvc[5] test case.

Comment 18 Pooja Soni 2024-05-28 11:57:02 UTC
I hit the same Ceph health issue on 4.15.3 as well.
sh-5.1$ ceph health detail
HEALTH_ERR 1/652 objects unfound (0.153%); 5 scrub errors; Possible data damage: 1 pg recovery_unfound, 5 pgs inconsistent; Degraded data redundancy: 3/1956 objects degraded (0.153%), 1 pg degraded, 1 pg undersized; 6 daemons have recently crashed
[WRN] OBJECT_UNFOUND: 1/652 objects unfound (0.153%)
    pg 12.1d has 1 unfound objects
[ERR] OSD_SCRUB_ERRORS: 5 scrub errors
[ERR] PG_DAMAGED: Possible data damage: 1 pg recovery_unfound, 5 pgs inconsistent
    pg 1.4 is active+clean+inconsistent, acting [1,2,0]
    pg 1.17 is active+clean+inconsistent, acting [1,2,0]
    pg 10.6 is active+clean+inconsistent, acting [2,1,0]
    pg 10.e is active+clean+inconsistent, acting [2,0,1]
    pg 10.f is active+clean+inconsistent, acting [1,2,0]
    pg 12.1d is active+recovery_unfound+undersized+degraded+remapped, acting [1,0], 1 unfound
[WRN] PG_DEGRADED: Degraded data redundancy: 3/1956 objects degraded (0.153%), 1 pg degraded, 1 pg undersized
    pg 12.1d is stuck undersized for 4d, current state active+recovery_unfound+undersized+degraded+remapped, last acting [1,0]
[WRN] RECENT_CRASH: 6 daemons have recently crashed
    osd.2 crashed on host rook-ceph-osd-2-868c7458ff-92dlg at 2024-05-23T13:05:44.743307Z
    osd.2 crashed on host rook-ceph-osd-2-868c7458ff-92dlg at 2024-05-23T13:07:40.116351Z
    osd.0 crashed on host rook-ceph-osd-0-7d7df95c84-qm6mm at 2024-05-23T13:08:25.743047Z
    osd.2 crashed on host rook-ceph-osd-2-868c7458ff-92dlg at 2024-05-23T13:11:15.966533Z
    osd.2 crashed on host rook-ceph-osd-2-868c7458ff-92dlg at 2024-05-23T13:12:09.088452Z
    osd.0 crashed on host rook-ceph-osd-0-7d7df95c84-qm6mm at 2024-05-23T13:09:36.226574Z
sh-5.1$
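
For reference, the standard RADOS triage for the inconsistent/unfound PGs listed above would look roughly like this (PG IDs taken from the output above; this inspects and repairs the symptoms but does not address the root-cause OSD crashes):

# list the exact objects that failed scrub in an inconsistent PG
rados list-inconsistent-obj 1.4 --format=json-pretty
# ask the primary OSD to repair the PG
ceph pg repair 1.4
# inspect the unfound object and the PG's peering/recovery state
ceph pg 12.1d list_unfound
ceph pg 12.1d query
# details on the recent daemon crashes
ceph crash ls
ceph crash info <crash-id>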

Comment 19 Pooja Soni 2024-05-29 04:51:33 UTC
must gather after excluding test_selinux_relabel_for_existing_pvc[5] - https://drive.google.com/file/d/1olFF_AOYda9aFdJbXkvdv6kIIR3k0H_2/view?usp=drive_link

must gather with test_selinux_relabel_for_existing_pvc[5] - 
https://drive.google.com/file/d/1qwvRw5Gzp4Y58TIboJMLgQhqsrSy0UUW/view?usp=sharing

Comment 25 Radoslaw Zarzynski 2024-06-03 17:42:59 UTC
> (...) significant percentage of objects would not read properly.

We don't know how many objects had been read before the campaign
was aborted due to the testcase failure; we lack information
on the denominator.

Anyway, let's check. Further data points are useful even if the PR
is unrelated.

I created a branch with the 2 commits [1] reverted:

https://gitlab.cee.redhat.com/ceph/ceph/-/commits/wip-bz-2277603-revert-gh53483

It is based on e9529323dd7ab3b0e8cdf84e17a1b58c2b42948c, which was
reported in https://bugzilla.redhat.com/show_bug.cgi?id=2277603#c8.

Regards,
Radek

[1]: https://github.com/ceph/ceph/pull/53483
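
For the record, a revert branch like that is typically produced along these lines (a sketch; <sha-1> and <sha-2> stand for the two commits from PR #53483, which are not spelled out in this comment):

git clone https://gitlab.cee.redhat.com/ceph/ceph.git
cd ceph
git checkout -b wip-bz-2277603-revert-gh53483 e9529323dd7ab3b0e8cdf84e17a1b58c2b42948c
git revert --no-edit <sha-1> <sha-2>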

Comment 30 Blaine Gardner 2024-06-05 16:57:04 UTC
Copying the latest info from Aaruni. Source: https://bugzilla.redhat.com/show_bug.cgi?id=2280999#c42



> Could you retest this with `debug_osd = 20` and `debug_bluestore = 20` both? Please capture the m-g and coredumps again as well.

setting `debug_osd = 20` and `debug_bluestore = 20`
 
[root@rdr-odf414sel-bastion-0 ~]# oc -n openshift-storage rsh rook-ceph-tools-6cb655c7d-mfmt5
sh-5.1$ ceph -s
  cluster:
    id:     3f44f905-74ed-456e-8288-0cea8e7d0c93
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum a,b,c (age 10h)
    mgr: a(active, since 10h)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 10h), 3 in (since 10h)
    rgw: 1 daemon active (1 hosts, 1 zones)
 
  data:
    volumes: 1/1 healthy
    pools:   12 pools, 169 pgs
    objects: 2.60k objects, 7.9 GiB
    usage:   24 GiB used, 1.4 TiB / 1.5 TiB avail
    pgs:     169 active+clean
 
  io:
    client:   853 B/s rd, 16 KiB/s wr, 1 op/s rd, 1 op/s wr
 
sh-5.1$ ceph tell osd.* config get debug_osd
osd.0: {
    "debug_osd": "1/5"
}
osd.1: {
    "debug_osd": "1/5"
}
osd.2: {
    "debug_osd": "1/5"
}
sh-5.1$ 
sh-5.1$ ceph tell osd.* config set debug_osd 20
osd.0: {
    "success": ""
}
osd.1: {
    "success": ""
}
osd.2: {
    "success": ""
}
sh-5.1$ 
sh-5.1$ ceph tell osd.* config get debug_osd
osd.0: {
    "debug_osd": "20/20"
}
osd.1: {
    "debug_osd": "20/20"
}
osd.2: {
    "debug_osd": "20/20"
}

sh-5.1$ ceph tell osd.* config get debug_bluestore
osd.0: {
    "debug_bluestore": "1/5"
}
osd.1: {
    "debug_bluestore": "1/5"
}
osd.2: {
    "debug_bluestore": "1/5"
}
sh-5.1$ 
sh-5.1$ ceph tell osd.* config set debug_bluestore 20
osd.0: {
    "success": ""
}
osd.1: {
    "success": ""
}
osd.2: {
    "success": ""
}
sh-5.1$ 
sh-5.1$ ceph tell osd.* config get debug_bluestore
osd.0: {
    "debug_bluestore": "20/20"
}
osd.1: {
    "debug_bluestore": "20/20"
}
osd.2: {
    "debug_bluestore": "20/20"
}
sh-5.1$ exit
exit
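
Worth noting: `ceph tell` injects the debug levels into the running daemons only, so they are lost when a daemon restarts, which matters here because the OSDs are crashing. Persisting the levels in the mon config database would survive restarts (a suggestion, not something done in this run):

ceph config set osd debug_osd 20
ceph config set osd debug_bluestore 20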

Executed the Tier1 tests and Ceph health went into the warn state.


ceph health status:

[root@rdr-odf414sel-bastion-0 ~]# oc -n openshift-storage rsh rook-ceph-tools-6cb655c7d-mfmt5
sh-5.1$ ceph -s
  cluster:
    id:     3f44f905-74ed-456e-8288-0cea8e7d0c93
    health: HEALTH_WARN
            1 daemons have recently crashed
 
  services:
    mon: 3 daemons, quorum a,b,c (age 13h)
    mgr: a(active, since 13h)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 2h), 3 in (since 13h)
    rgw: 1 daemon active (1 hosts, 1 zones)
 
  data:
    volumes: 1/1 healthy
    pools:   13 pools, 201 pgs
    objects: 3.18k objects, 10 GiB
    usage:   31 GiB used, 1.4 TiB / 1.5 TiB avail
    pgs:     201 active+clean
 
  io:
    client:   1023 B/s rd, 418 KiB/s wr, 1 op/s rd, 1 op/s wr
 
sh-5.1$ 
sh-5.1$ ceph crash ls
ID                                                                ENTITY  NEW  
2024-06-05T07:00:33.318587Z_892175db-bd0a-48cc-a0e4-23da2970608a  osd.2    *   
sh-5.1$ 

pods:

[root@rdr-odf414sel-bastion-0 ~]# oc get pods -n openshift-storage
NAME                                                              READY   STATUS      RESTARTS       AGE
csi-addons-controller-manager-5f8cf65876-257w4                    2/2     Running     0              19m
csi-cephfsplugin-c7wlj                                            2/2     Running     0              13h
csi-cephfsplugin-k7fg8                                            2/2     Running     0              13h
csi-cephfsplugin-provisioner-7b6b7c8bbf-4fh97                     5/5     Running     0              13h
csi-cephfsplugin-provisioner-7b6b7c8bbf-7stpt                     5/5     Running     0              13h
csi-cephfsplugin-rfxpz                                            2/2     Running     0              13h
csi-nfsplugin-d4m25                                               2/2     Running     0              67m
csi-nfsplugin-gx78q                                               2/2     Running     0              67m
csi-nfsplugin-provisioner-6c874556f8-p7n7r                        5/5     Running     0              67m
csi-nfsplugin-provisioner-6c874556f8-zf24r                        5/5     Running     0              67m
csi-nfsplugin-xb969                                               2/2     Running     0              67m
csi-rbdplugin-6gxnf                                               3/3     Running     0              13h
csi-rbdplugin-csqwz                                               3/3     Running     0              13h
csi-rbdplugin-mm785                                               3/3     Running     0              13h
csi-rbdplugin-provisioner-6c64c96886-2smdw                        6/6     Running     0              13h
csi-rbdplugin-provisioner-6c64c96886-m5tz9                        6/6     Running     0              13h
noobaa-core-0                                                     1/1     Running     0              66m
noobaa-db-pg-0                                                    1/1     Running     0              66m
noobaa-endpoint-654d57b548-c4svt                                  1/1     Running     0              128m
noobaa-endpoint-654d57b548-s8tj4                                  1/1     Running     0              67m
noobaa-operator-7bdd5cb576-rbh2p                                  2/2     Running     0              13h
ocs-metrics-exporter-67b9f9855d-f7rdv                             1/1     Running     1 (67m ago)    13h
ocs-operator-c8d7c579f-pvfln                                      1/1     Running     0              13h
odf-console-7888dd6746-rl7n2                                      1/1     Running     0              13h
odf-operator-controller-manager-7749cdb995-dw7mc                  2/2     Running     0              13h
rook-ceph-crashcollector-worker-0-554f85b66-gnh52                 1/1     Running     0              13h
rook-ceph-crashcollector-worker-1-646b58c45f-gvm59                1/1     Running     0              13h
rook-ceph-crashcollector-worker-2-754dbbbfd6-b4fsn                1/1     Running     0              13h
rook-ceph-exporter-worker-0-666bd75845-tnn57                      1/1     Running     0              13h
rook-ceph-exporter-worker-1-6f9d4c69c6-j6g88                      1/1     Running     0              13h
rook-ceph-exporter-worker-2-54fcd47ccc-cspg5                      1/1     Running     0              13h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-6487686cs8tlz   2/2     Running     0              13h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-f8b946dfqqrgm   2/2     Running     0              13h
rook-ceph-mgr-a-744479898c-jnrnx                                  2/2     Running     0              13h
rook-ceph-mon-a-574b64f99-xkvd7                                   2/2     Running     0              13h
rook-ceph-mon-b-6dffc99fdd-pf7f9                                  2/2     Running     0              13h
rook-ceph-mon-c-768c9bd57d-49m6f                                  2/2     Running     0              13h
rook-ceph-nfs-ocs-storagecluster-cephnfs-a-749b7554b9-8zlpx       2/2     Running     0              67m
rook-ceph-operator-857ccc7545-ns89c                               1/1     Running     0              67m
rook-ceph-osd-0-5548bc9796-cbzgl                                  2/2     Running     0              13h
rook-ceph-osd-1-778c784bd5-jcd2t                                  2/2     Running     0              13h
rook-ceph-osd-2-8c55ccff8-dtjrm                                   2/2     Running     1 (131m ago)   13h
rook-ceph-osd-prepare-28a84de044d061da27bf0f4f43d99cab-qcqms      0/1     Completed   0              13h
rook-ceph-osd-prepare-3535e54116e131773d2accb9756b7796-46r9l      0/1     Completed   0              13h
rook-ceph-osd-prepare-92eebb804a903eedb1a420bdd15a845a-28rvv      0/1     Completed   0              13h
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-857895bt8bq7   2/2     Running     0              13h
rook-ceph-tools-6cb655c7d-mfmt5                                   1/1     Running     0              13h
ux-backend-server-996cffddb-6j59g                                 2/2     Running     0              13h

coredump : https://drive.google.com/file/d/1HQDVMFUPtMCAgUcEOj-tPBeINsH3u5IG/view?usp=sharing

must-gather: https://drive.google.com/file/d/17Xkfp2cwYuSli5SDeLwb6BfWPBkilfCc/view?usp=sharing


-----

@bhubbard I hope these test results capture the issue with the logs you need. It looks like only osd.2 may have crashed in this test run.

-----

@aaaggarw it looks like Justin was able to get an RHCS build that strips out the commit Brad suspected could be the problematic one. Are you able to run through the ODF test suite using that RHCS image, to determine whether that commit is exacerbating the issue?

BZ is turning some of the text into a hyperlink, but AFAICT, the image is just as the text appears here: registry-proxy.engineering.redhat.com/rh-osbs/rhceph:6-17.2.6.TEST.bz2277603

Comment 31 Brad Hubbard 2024-06-06 00:43:51 UTC
*** Bug 2280973 has been marked as a duplicate of this bug. ***

Comment 34 Aaruni Aggarwal 2024-06-06 09:36:31 UTC
Thanks Parth for pushing the image to the public repo.

I tried updating the ocs-operator CSV with `quay.io/paarora/ceph:v700`:


[root@rdr-odf4152ver-bastion-0 ~]# oc edit csv ocs-operator.v4.15.2-rhodf -n openshift-storage 
clusterserviceversion.operators.coreos.com/ocs-operator.v4.15.2-rhodf edited

[root@rdr-odf4152ver-bastion-0 ~]# oc get csv ocs-operator.v4.15.2-rhodf -n openshift-storage -o yaml |grep ceph:v700
                  value: quay.io/paarora/ceph:v700

pods:

[root@rdr-odf4152ver-bastion-0 ~]# oc get pods
NAME                                                              READY   STATUS             RESTARTS        AGE
ceph-file-controller-detect-version-vkfqp                         0/1     CrashLoopBackOff   3 (20s ago)     93s
ceph-nfs-controller-detect-version-7znlr                          0/1     CrashLoopBackOff   3 (22s ago)     93s
ceph-object-controller-detect-version-mkpw2                       0/1     CrashLoopBackOff   3 (19s ago)     96s
csi-addons-controller-manager-76d57c4596-6mdqt                    2/2     Running            0               2d10h
csi-cephfsplugin-cb25t                                            2/2     Running            1 (3d2h ago)    3d2h
csi-cephfsplugin-cshbv                                            2/2     Running            0               3d2h
csi-cephfsplugin-provisioner-7545464b85-lvvmv                     6/6     Running            0               3d2h
csi-cephfsplugin-provisioner-7545464b85-rjnfz                     6/6     Running            0               2d11h
csi-cephfsplugin-vgwc5                                            2/2     Running            0               3d2h
csi-nfsplugin-bdj7z                                               2/2     Running            0               2d17h
csi-nfsplugin-jkdt7                                               2/2     Running            0               2d17h
csi-nfsplugin-provisioner-68b6ff9c5-kzjrf                         5/5     Running            0               2d17h
csi-nfsplugin-provisioner-68b6ff9c5-m7plr                         5/5     Running            0               2d11h
csi-nfsplugin-r9xwp                                               2/2     Running            0               2d17h
csi-rbdplugin-76zhn                                               3/3     Running            0               3d2h
csi-rbdplugin-dt6j9                                               3/3     Running            0               3d2h
csi-rbdplugin-provisioner-7d9c7cf9b8-jcfs4                        6/6     Running            0               2d11h
csi-rbdplugin-provisioner-7d9c7cf9b8-pmz2z                        6/6     Running            0               3d2h
csi-rbdplugin-tg99f                                               3/3     Running            0               3d2h
noobaa-core-0                                                     1/1     Running            0               2d11h
noobaa-db-pg-0                                                    1/1     Running            0               2d11h
noobaa-endpoint-67cfb66697-njpln                                  1/1     Running            0               2d18h
noobaa-operator-5f7f746647-gx24j                                  1/1     Running            0               2d11h
ocs-metrics-exporter-5856bbf967-dddp8                             1/1     Running            0               2d11h
ocs-operator-67f46d8666-5sgzb                                     1/1     Running            0               2m7s
odf-console-c69c7864b-znhgx                                       1/1     Running            0               3d2h
odf-operator-controller-manager-65dbb56b4d-j2vt2                  2/2     Running            1 (2d17h ago)   3d2h
rook-ceph-crashcollector-worker-0-5fdb4ff8f7-25cg5                1/1     Running            0               2d10h
rook-ceph-crashcollector-worker-1-7559dc47dd-mwgsg                1/1     Running            0               2d22h
rook-ceph-crashcollector-worker-2-7d87957c55-94fz9                1/1     Running            0               2d22h
rook-ceph-detect-version-97r5r                                    0/1     CrashLoopBackOff   3 (20s ago)     97s
rook-ceph-exporter-worker-0-777dfb96cc-jhvpj                      1/1     Running            0               2d10h
rook-ceph-exporter-worker-1-789c4d9d9b-qkvzg                      1/1     Running            0               2d22h
rook-ceph-exporter-worker-2-77bf6f5747-fchzk                      1/1     Running            0               2d22h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-86488d667qzk2   2/2     Running            8 (2d10h ago)   2d22h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-64b8575fwdzv4   2/2     Running            8 (2d10h ago)   2d11h
rook-ceph-mgr-a-6f7d9b9494-4dxdv                                  3/3     Running            0               2d22h
rook-ceph-mgr-b-7566b65bff-m5bcw                                  3/3     Running            0               2d22h
rook-ceph-mon-a-5857cb9688-6qcl5                                  2/2     Running            0               2d22h
rook-ceph-mon-b-6f67b5555d-hwjwp                                  2/2     Running            0               2d10h
rook-ceph-mon-c-79f49ff657-gbwq6                                  2/2     Running            0               2d22h
rook-ceph-nfs-ocs-storagecluster-cephnfs-a-68dccf77dc-2jn92       2/2     Running            0               2d17h
rook-ceph-operator-7bb4b69c4-586x7                                1/1     Running            0               2d17h
rook-ceph-osd-0-7ff4959fb5-9p276                                  2/2     Running            0               2d22h
rook-ceph-osd-1-6cc85dd567-qwjzv                                  2/2     Running            0               2d22h
rook-ceph-osd-2-557fb8b4cf-tcd2k                                  2/2     Running            0               2d11h
rook-ceph-osd-prepare-15dcf7c38e6e2cb4e402ac852df2376a-8fqjk      0/1     Completed          0               3d2h
rook-ceph-osd-prepare-f9b54076c6b8e10416ddb285c1f7f548-4fsk4      0/1     Completed          0               3d2h
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-656d598jhkx7   2/2     Running            0               2d22h
rook-ceph-tools-65d7788bc6-crjnb                                  1/1     Running            0               3d2h
ux-backend-server-775c4c4956-zjpxm                                2/2     Running            0               2d11h

[root@rdr-odf4152ver-bastion-0 ~]# oc describe pod rook-ceph-detect-version-97r5r
Name:             rook-ceph-detect-version-97r5r
Namespace:        openshift-storage
Priority:         0
Service Account:  rook-ceph-cmd-reporter
Node:             worker-0/10.20.180.190
Start Time:       Thu, 06 Jun 2024 05:28:05 -0400
Labels:           app=rook-ceph-detect-version
                  batch.kubernetes.io/controller-uid=67299e35-f26a-4c46-ad45-98edf53f6883
                  batch.kubernetes.io/job-name=rook-ceph-detect-version
                  controller-uid=67299e35-f26a-4c46-ad45-98edf53f6883
                  job-name=rook-ceph-detect-version
                  rook-version=v4.15.2-0.62c88d463a1f068556dd5e48b736ea5354cd79a0
Annotations:      k8s.ovn.org/pod-networks:
                    {"default":{"ip_addresses":["10.129.3.65/23"],"mac_address":"0a:58:0a:81:03:41","gateway_ips":["10.129.2.1"],"routes":[{"dest":"10.128.0.0...
                  k8s.v1.cni.cncf.io/network-status:
                    [{
                        "name": "ovn-kubernetes",
                        "interface": "eth0",
                        "ips": [
                            "10.129.3.65"
                        ],
                        "mac": "0a:58:0a:81:03:41",
                        "default": true,
                        "dns": {}
                    }]
                  openshift.io/scc: restricted-v2
                  seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status:           Running
SeccompProfile:   RuntimeDefault
IP:               10.129.3.65
IPs:
  IP:           10.129.3.65
Controlled By:  Job/rook-ceph-detect-version
Init Containers:
  init-copy-binaries:
    Container ID:  cri-o://ac2a2b1c2acdfedb72898edf3451a5c2e085a47f16726aa5567c2d3b329a46d4
    Image:         registry.redhat.io/odf4/rook-ceph-rhel9-operator@sha256:400ccfcf6f63db49823e6e71908b9d5b96486a0c0c9a083bb283f9fe29efa01d
    Image ID:      registry.redhat.io/odf4/rook-ceph-rhel9-operator@sha256:400ccfcf6f63db49823e6e71908b9d5b96486a0c0c9a083bb283f9fe29efa01d
    Port:          <none>
    Host Port:     <none>
    Command:
      cp
    Args:
      --archive
      --force
      --verbose
      /usr/local/bin/rook
      /rook/copied-binaries
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 06 Jun 2024 05:28:06 -0400
      Finished:     Thu, 06 Jun 2024 05:28:06 -0400
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /rook/copied-binaries from rook-copied-binaries (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-m5m64 (ro)
Containers:
  cmd-reporter:
    Container ID:  cri-o://1c6ae897d3b8d347ef21e0e1341d1a9579e8e199b3df60cad178f37117aae260
    Image:         quay.io/paarora/ceph:v700
    Image ID:      quay.io/paarora/ceph@sha256:2895e3af6615ecc0631e6c73dc5e03ac7a4319357c35021925699cf6227f312d
    Port:          <none>
    Host Port:     <none>
    Command:
      /rook/copied-binaries/rook
    Args:
      cmd-reporter
      --command
      {"cmd":["ceph"],"args":["--version"]}
      --config-map-name
      rook-ceph-detect-version
      --namespace
      openshift-storage
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Thu, 06 Jun 2024 05:29:22 -0400
      Finished:     Thu, 06 Jun 2024 05:29:22 -0400
    Ready:          False
    Restart Count:  3
    Environment:    <none>
    Mounts:
      /rook/copied-binaries from rook-copied-binaries (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-m5m64 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  rook-copied-binaries:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  kube-api-access-m5m64:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                             node.ocs.openshift.io/storage=true:NoSchedule
Events:
  Type     Reason          Age                From               Message
  ----     ------          ----               ----               -------
  Normal   Scheduled       112s               default-scheduler  Successfully assigned openshift-storage/rook-ceph-detect-version-97r5r to worker-0
  Normal   AddedInterface  112s               multus             Add eth0 [10.129.3.65/23] from ovn-kubernetes
  Normal   Pulled          112s               kubelet            Container image "registry.redhat.io/odf4/rook-ceph-rhel9-operator@sha256:400ccfcf6f63db49823e6e71908b9d5b96486a0c0c9a083bb283f9fe29efa01d" already present on machine
  Normal   Created         112s               kubelet            Created container init-copy-binaries
  Normal   Started         112s               kubelet            Started container init-copy-binaries
  Normal   Pulling         111s               kubelet            Pulling image "quay.io/paarora/ceph:v700"
  Normal   Pulled          81s                kubelet            Successfully pulled image "quay.io/paarora/ceph:v700" in 29.391s (29.391s including waiting)
  Normal   Created         36s (x4 over 81s)  kubelet            Created container cmd-reporter
  Normal   Started         36s (x4 over 81s)  kubelet            Started container cmd-reporter
  Normal   Pulled          36s (x3 over 80s)  kubelet            Container image "quay.io/paarora/ceph:v700" already present on machine
  Warning  BackOff         11s (x7 over 79s)  kubelet            Back-off restarting failed container cmd-reporter in pod rook-ceph-detect-version-97r5r_openshift-storage(d454922f-1f35-4bdb-bf8e-6b65ef18fc7a)

error: 

[root@rdr-odf4152ver-bastion-0 ~]# oc logs pod/rook-ceph-detect-version-97r5r 
Defaulted container "cmd-reporter" out of: cmd-reporter, init-copy-binaries (init)
exec /rook/copied-binaries/rook: no such file or directory
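
One plausible explanation (an assumption, not confirmed here): "no such file or directory" on exec of a binary that clearly exists usually means the binary's ELF interpreter is missing, which happens when the image and binary architectures don't match. Since this cluster is ppc64le, the test image may have been built for amd64 only. The image architecture can be checked with, for example:

skopeo inspect docker://quay.io/paarora/ceph:v700 | grep -i architecture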

Comment 35 Aaruni Aggarwal 2024-06-06 09:56:56 UTC
The image is updated in the ocs-operator pod:

[root@rdr-odf4152ver-bastion-0 ~]# oc describe pod ocs-operator-67f46d8666-5sgzb |grep CEPH_IMAGE
      ROOK_CEPH_IMAGE:                    registry.redhat.io/odf4/rook-ceph-rhel9-operator@sha256:400ccfcf6f63db49823e6e71908b9d5b96486a0c0c9a083bb283f9fe29efa01d
      CEPH_IMAGE:                         quay.io/paarora/ceph:v700
[root@rdr-odf4152ver-bastion-0 ~]# 

and the ocs-operator pod is running. Also, Ceph health is in the OK state:

[root@rdr-odf4152ver-bastion-0 ~]# oc get cephcluster
NAME                             DATADIRHOSTPATH   MONCOUNT   AGE    PHASE   MESSAGE                        HEALTH      EXTERNAL   FSID
ocs-storagecluster-cephcluster   /var/lib/rook     3          3d3h   Ready   Cluster created successfully   HEALTH_OK              b7029531-6b77-4478-8935-d806637a7bb4

[root@rdr-odf4152ver-bastion-0 ~]# oc rsh rook-ceph-tools-65d7788bc6-crjnb
sh-5.1$ 
sh-5.1$ ceph -s
  cluster:
    id:     b7029531-6b77-4478-8935-d806637a7bb4
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum a,b,c (age 2d)
    mgr: b(active, since 2d), standbys: a
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 2d), 3 in (since 3d)
    rgw: 1 daemon active (1 hosts, 1 zones)
 
  data:
    volumes: 1/1 healthy
    pools:   13 pools, 201 pgs
    objects: 16.77k objects, 61 GiB
    usage:   147 GiB used, 1.3 TiB / 1.5 TiB avail
    pgs:     201 active+clean
 
  io:
    client:   1.5 KiB/s rd, 418 KiB/s wr, 2 op/s rd, 2 op/s wr
 
Should I execute tier tests on the setup?

Comment 36 Blaine Gardner 2024-06-06 21:08:20 UTC
Yes. Please run the test suite(s) that have been causing OSDs to crash, to determine whether reverting the commit alleviates the issue.

Comment 37 Aaruni Aggarwal 2024-06-07 02:50:32 UTC
I hope the following error will not cause any issues while running the test suite.

[root@rdr-odf4152ver-bastion-0 ~]# oc get pods |grep -v "Completed\|Running"
NAME                                                              READY   STATUS             RESTARTS        AGE
ceph-file-controller-detect-version-5kwvf                         0/1     CrashLoopBackOff   2 (18s ago)     37s
ceph-nfs-controller-detect-version-qtn7p                          0/1     CrashLoopBackOff   2 (20s ago)     36s
ceph-object-controller-detect-version-7mj6h                       0/1     CrashLoopBackOff   2 (21s ago)     40s
rook-ceph-detect-version-2kfrd                                    0/1     CrashLoopBackOff   2 (23s ago)     39s

[root@rdr-odf4152ver-bastion-0 ~]# oc logs pod/ceph-file-controller-detect-version-5kwvf  
Defaulted container "cmd-reporter" out of: cmd-reporter, init-copy-binaries (init)
exec /rook/copied-binaries/rook: no such file or directory

[root@rdr-odf4152ver-bastion-0 ~]# oc logs pod/ceph-object-controller-detect-version-7mj6h 
Defaulted container "cmd-reporter" out of: cmd-reporter, init-copy-binaries (init)
exec /rook/copied-binaries/rook: no such file or directory
 
[root@rdr-odf4152ver-bastion-0 ~]# oc logs pod/rook-ceph-detect-version-2kfrd
Defaulted container "cmd-reporter" out of: cmd-reporter, init-copy-binaries (init)
exec /rook/copied-binaries/rook: no such file or directory


Executing the tier tests on the setup. Will upload the results once done.

Comment 40 Aaruni Aggarwal 2024-06-10 05:53:34 UTC
Tier tests are completed and they ran as expected. Ceph health is also in the OK state, but the CephCluster status shows Progressing because the rook-ceph-detect-version cmd-reporter job keeps failing.

[root@rdr-odf4152ver-bastion-0 scripts]# oc get pods -n openshift-storage
NAME                                                              READY   STATUS      RESTARTS        AGE
csi-addons-controller-manager-76d57c4596-rmd2m                    2/2     Running     0               47h
csi-cephfsplugin-cb25t                                            2/2     Running     1 (6d22h ago)   6d22h
csi-cephfsplugin-cshbv                                            2/2     Running     0               6d22h
csi-cephfsplugin-provisioner-7545464b85-lvvmv                     6/6     Running     0               6d22h
csi-cephfsplugin-provisioner-7545464b85-rjnfz                     6/6     Running     0               6d6h
csi-cephfsplugin-vgwc5                                            2/2     Running     0               6d22h
csi-rbdplugin-76zhn                                               3/3     Running     0               6d22h
csi-rbdplugin-dt6j9                                               3/3     Running     0               6d22h
csi-rbdplugin-provisioner-7d9c7cf9b8-jcfs4                        6/6     Running     0               6d6h
csi-rbdplugin-provisioner-7d9c7cf9b8-pmz2z                        6/6     Running     0               6d22h
csi-rbdplugin-tg99f                                               3/3     Running     0               6d22h
noobaa-core-0                                                     1/1     Running     0               2d1h
noobaa-db-pg-0                                                    1/1     Running     0               2d1h
noobaa-endpoint-67cfb66697-v2q9b                                  1/1     Running     0               2d1h
noobaa-operator-5f7f746647-rnqmn                                  1/1     Running     0               2d1h
ocs-metrics-exporter-5856bbf967-dddp8                             1/1     Running     0               6d6h
ocs-operator-67f46d8666-4mslb                                     1/1     Running     0               2d1h
odf-console-c69c7864b-znhgx                                       1/1     Running     0               6d22h
odf-operator-controller-manager-65dbb56b4d-j2vt2                  2/2     Running     1 (6d12h ago)   6d22h
rook-ceph-crashcollector-worker-0-5fdb4ff8f7-22k6t                1/1     Running     0               2d1h
rook-ceph-crashcollector-worker-1-7559dc47dd-mwgsg                1/1     Running     0               6d18h
rook-ceph-crashcollector-worker-2-7d87957c55-94fz9                1/1     Running     0               6d18h
rook-ceph-exporter-worker-0-777dfb96cc-5jt7j                      1/1     Running     0               2d1h
rook-ceph-exporter-worker-1-789c4d9d9b-qkvzg                      1/1     Running     0               6d18h
rook-ceph-exporter-worker-2-77bf6f5747-fchzk                      1/1     Running     0               6d18h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-86488d667qzk2   2/2     Running     8 (6d6h ago)    6d18h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-64b8575fwdzv4   2/2     Running     8 (6d6h ago)    6d6h
rook-ceph-mgr-a-6f7d9b9494-4dxdv                                  3/3     Running     0               6d18h
rook-ceph-mgr-b-7566b65bff-m5bcw                                  3/3     Running     0               6d18h
rook-ceph-mon-a-5857cb9688-6qcl5                                  2/2     Running     0               6d18h
rook-ceph-mon-b-6f67b5555d-lj2pg                                  2/2     Running     0               2d1h
rook-ceph-mon-c-79f49ff657-gbwq6                                  2/2     Running     0               6d18h
rook-ceph-nfs-ocs-storagecluster-cephnfs-a-68dccf77dc-2jn92       2/2     Running     0               6d12h
rook-ceph-operator-7bb4b69c4-sbntz                                1/1     Running     0               2d1h
rook-ceph-osd-0-7ff4959fb5-9p276                                  2/2     Running     0               6d18h
rook-ceph-osd-1-6cc85dd567-qwjzv                                  2/2     Running     0               6d18h
rook-ceph-osd-2-557fb8b4cf-f6snc                                  2/2     Running     0               2d1h
rook-ceph-osd-prepare-15dcf7c38e6e2cb4e402ac852df2376a-8fqjk      0/1     Completed   0               6d22h
rook-ceph-osd-prepare-f9b54076c6b8e10416ddb285c1f7f548-4fsk4      0/1     Completed   0               6d22h
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-656d598jhkx7   2/2     Running     0               6d18h
rook-ceph-tools-65d7788bc6-crjnb                                  1/1     Running     0               6d22h
ux-backend-server-775c4c4956-zjpxm                                2/2     Running     0               6d6h


[root@rdr-odf4152ver-bastion-0 scripts]# oc get cephcluster -n openshift-storage
NAME                             DATADIRHOSTPATH   MONCOUNT   AGE     PHASE         MESSAGE                  HEALTH      EXTERNAL   FSID
ocs-storagecluster-cephcluster   /var/lib/rook     3          6d22h   Progressing   Detecting Ceph version   HEALTH_OK              b7029531-6b77-4478-8935-d806637a7bb4

 
[root@rdr-odf4152ver-bastion-0 scripts]# oc describe cephcluster -n openshift-storage
Name:         ocs-storagecluster-cephcluster
Namespace:    openshift-storage
Labels:       app=ocs-storagecluster
Annotations:  <none>
API Version:  ceph.rook.io/v1
Kind:         CephCluster
Metadata:
  Creation Timestamp:  2024-06-03T06:31:58Z
  Finalizers:
    cephcluster.ceph.rook.io
  Generation:  4
  Owner References:
    API Version:           ocs.openshift.io/v1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  StorageCluster
    Name:                  ocs-storagecluster
    UID:                   297077f4-c0be-4df5-94c0-fb657002cd61
  Resource Version:        5964439
  UID:                     e418bbb8-45ef-46a6-b6c5-7e0301f06a8e
Spec:
  Ceph Version:
    Image:  quay.io/paarora/ceph:v700
  Cleanup Policy:
    Sanitize Disks:
  Continue Upgrade After Checks Even If Not Healthy:  true
  Crash Collector:
  Csi:
    Cephfs:
      Kernel Mount Options:  ms_mode=prefer-crc
    Read Affinity:
      Enabled:  true
  Dashboard:
  Data Dir Host Path:  /var/lib/rook
  Disruption Management:
    Machine Disruption Budget Namespace:  openshift-machine-api
    Manage Pod Budgets:                   true
  External:
  Health Check:
    Daemon Health:
      Mon:
      Osd:
      Status:
  Labels:
    Exporter:
      rook.io/managedBy:  ocs-storagecluster
    Mgr:
      Odf - Resource - Profile:  
    Mon:
      Odf - Resource - Profile:  
    Monitoring:
      rook.io/managedBy:  ocs-storagecluster
    Osd:
      Odf - Resource - Profile:  
  Log Collector:
    Enabled:       true
    Max Log Size:  500Mi
    Periodicity:   daily
  Mgr:
    Count:  2
    Modules:
      Enabled:  true
      Name:     pg_autoscaler
      Enabled:  true
      Name:     balancer
  Mon:
    Count:  3
  Monitoring:
    Enabled:  true
  Network:
    Connections:
      requireMsgr2:  true
    Multi Cluster Service:
  Placement:
    All:
      Node Affinity:
        Required During Scheduling Ignored During Execution:
          Node Selector Terms:
            Match Expressions:
              Key:       cluster.ocs.openshift.io/openshift-storage
              Operator:  Exists
      Tolerations:
        Effect:    NoSchedule
        Key:       node.ocs.openshift.io/storage
        Operator:  Equal
        Value:     true
    Arbiter:
      Tolerations:
        Effect:    NoSchedule
        Key:       node-role.kubernetes.io/master
        Operator:  Exists
    Mon:
      Node Affinity:
        Required During Scheduling Ignored During Execution:
          Node Selector Terms:
            Match Expressions:
              Key:       cluster.ocs.openshift.io/openshift-storage
              Operator:  Exists
      Pod Anti Affinity:
        Required During Scheduling Ignored During Execution:
          Label Selector:
            Match Expressions:
              Key:       app
              Operator:  In
              Values:
                rook-ceph-mon
          Topology Key:  kubernetes.io/hostname
  Priority Class Names:
    Mgr:  system-node-critical
    Mon:  system-node-critical
    Osd:  system-node-critical
  Resources:
    Mgr:
      Limits:
        Cpu:     2
        Memory:  3Gi
      Requests:
        Cpu:     1
        Memory:  1536Mi
    Mon:
      Limits:
        Cpu:     1
        Memory:  2Gi
      Requests:
        Cpu:     1
        Memory:  2Gi
  Security:
    Key Rotation:
      Enabled:  false
    Kms:
  Storage:
    Flapping Restart Interval Hours:  24
    Storage Class Device Sets:
      Count:  3
      Name:   ocs-deviceset-localblock-0
      Placement:
        Node Affinity:
          Required During Scheduling Ignored During Execution:
            Node Selector Terms:
              Match Expressions:
                Key:       cluster.ocs.openshift.io/openshift-storage
                Operator:  Exists
        Tolerations:
          Effect:    NoSchedule
          Key:       node.ocs.openshift.io/storage
          Operator:  Equal
          Value:     true
        Topology Spread Constraints:
          Label Selector:
            Match Expressions:
              Key:       app
              Operator:  In
              Values:
                rook-ceph-osd
          Max Skew:            1
          Topology Key:        kubernetes.io/hostname
          When Unsatisfiable:  ScheduleAnyway
      Prepare Placement:
        Node Affinity:
          Required During Scheduling Ignored During Execution:
            Node Selector Terms:
              Match Expressions:
                Key:       cluster.ocs.openshift.io/openshift-storage
                Operator:  Exists
        Tolerations:
          Effect:    NoSchedule
          Key:       node.ocs.openshift.io/storage
          Operator:  Equal
          Value:     true
        Topology Spread Constraints:
          Label Selector:
            Match Expressions:
              Key:       app
              Operator:  In
              Values:
                rook-ceph-osd
                rook-ceph-osd-prepare
          Max Skew:            1
          Topology Key:        kubernetes.io/hostname
          When Unsatisfiable:  ScheduleAnyway
      Resources:
        Limits:
          Cpu:     2
          Memory:  5Gi
        Requests:
          Cpu:                 2
          Memory:              5Gi
      Tune Fast Device Class:  true
      Volume Claim Templates:
        Metadata:
          Annotations:
            Crush Device Class:  ssd
        Spec:
          Access Modes:
            ReadWriteOnce
          Resources:
            Requests:
              Storage:         100Gi
          Storage Class Name:  localblock
          Volume Mode:         Block
        Status:
    Store:
Status:
  Ceph:
    Capacity:
      Bytes Available:  1453224570880
      Bytes Total:      1610612736000
      Bytes Used:       157388165120
      Last Updated:     2024-06-06T09:51:29Z
    Fsid:               b7029531-6b77-4478-8935-d806637a7bb4
    Health:             HEALTH_OK
    Last Changed:       2024-06-03T22:39:11Z
    Last Checked:       2024-06-06T09:51:29Z
    Previous Health:    HEALTH_WARN
    Versions:
      Mds:
        ceph version 17.2.6-196.el9cp (cbbf2cfb549196ca18c0c9caff9124d83ed681a4) quincy (stable):  2
      Mgr:
        ceph version 17.2.6-196.el9cp (cbbf2cfb549196ca18c0c9caff9124d83ed681a4) quincy (stable):  2
      Mon:
        ceph version 17.2.6-196.el9cp (cbbf2cfb549196ca18c0c9caff9124d83ed681a4) quincy (stable):  3
      Osd:
        ceph version 17.2.6-196.el9cp (cbbf2cfb549196ca18c0c9caff9124d83ed681a4) quincy (stable):  3
      Overall:
        ceph version 17.2.6-196.el9cp (cbbf2cfb549196ca18c0c9caff9124d83ed681a4) quincy (stable):  11
      Rgw:
        ceph version 17.2.6-196.el9cp (cbbf2cfb549196ca18c0c9caff9124d83ed681a4) quincy (stable):  1
  Conditions:
    Last Heartbeat Time:   2024-06-06T09:51:30Z
    Last Transition Time:  2024-06-03T06:33:46Z
    Message:               Cluster created successfully
    Reason:                ClusterCreated
    Status:                True
    Type:                  Ready
    Last Heartbeat Time:   2024-06-10T04:47:39Z
    Last Transition Time:  2024-06-10T04:47:39Z
    Message:               Detecting Ceph version
    Reason:                ClusterProgressing
    Status:                True
    Type:                  Progressing
  Message:                 Detecting Ceph version
  Observed Generation:     2
  Phase:                   Progressing
  State:                   Creating
  Storage:
    Device Classes:
      Name:  ssd
    Osd:
      Store Type:
        Bluestore:  3
  Version:
    Image:    registry.redhat.io/rhceph/rhceph-6-rhel9@sha256:9dbd051cfcdb334aad33a536cc115ae1954edaea5f8cb5943ad615f1b41b0226
    Version:  17.2.6-196
Events:
  Type     Reason           Age                   From                          Message
  ----     ------           ----                  ----                          -------
  Warning  ReconcileFailed  29m (x102 over 2d1h)  rook-ceph-cluster-controller  failed to reconcile CephCluster "openshift-storage/ocs-storagecluster-cephcluster". failed to reconcile cluster "ocs-storagecluster-cephcluster": failed to configure local ceph cluster: failed the ceph version check: failed to complete ceph version job: failed to run CmdReporter rook-ceph-detect-version successfully. failed waiting for results ConfigMap rook-ceph-detect-version. timed out waiting for results ConfigMap
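
If the cluster needs to be unwedged, one plausible sequence (an assumption based on the event above, not something verified in this bug) is to restore a working multi-arch ceph image reference in the CSV and then delete the stuck detect-version jobs so the operator recreates them (job names inferred from the pod names in comment #37):

oc -n openshift-storage delete job --ignore-not-found \
  rook-ceph-detect-version \
  ceph-file-controller-detect-version \
  ceph-nfs-controller-detect-version \
  ceph-object-controller-detect-version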