Bug 2280973
| Summary: | Ceph health is going to Error state on ODF4.14.7 on IBM Power | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Pooja Soni <posoni> |
| Component: | rook | Assignee: | Parth Arora <paarora> |
| Status: | CLOSED DUPLICATE | QA Contact: | Neha Berry <nberry> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.14 | CC: | aaaggarw, bhubbard, brgardne, dkhandel, lithomas, odf-bz-bot, paarora, radoslaw.zak, rzarzyns, tnielsen |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | ppc64le | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2024-06-06 00:43:51 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Pooja Soni
2024-05-17 13:42:31 UTC
Must-gather logs: https://drive.google.com/file/d/1kwGQbGLmwZ6BMAo5Fdl5dNlWyTr2DDd4/view?usp=drive_link

As we have completed the development freeze for ODF 4.16, this non-blocker BZ is being moved out of the release. If this is a blocker, feel free to propose it as one with a justification note.

In fact, this issue is seen on both ODF 4.15.2 and ODF 4.14.7, and we created a BZ for ODF 4.15.2 as well: https://bugzilla.redhat.com/show_bug.cgi?id=2277603

This issue is seen on ODF 4.14.7 and ODF 4.15.2; we have tried multiple clusters and it is consistent. We have not seen this issue on ODF 4.14.6 and earlier, and for ODF 4.15 it is not seen on ODF 4.15.1 and earlier. Also, no related issue is seen on ODF 4.16.0.

A custom 4.14.7 build `bz-2280973` with rhceph version 6.1z4 is available for testing.

> Let me know once you have the build for `4.14.7 with rhceph version 6.1z4` and a build for `4.15.{2,3}` with version 6.1z4 on IBM Power systems.
We got the build for 4.14.7 with rhceph version 6.1z4 and are testing it, but we are still waiting for the 4.15.2 and 4.15.3 builds with rhceph version 6.1z4.
We tested ODF 4.14.7 with the new build containing rhceph version 6.1z4 and did not face any issue while running the test cases; Ceph health is OK and all pods are running.

I reran tier1 on ODF 4.14.7 after setting the debug log level to 20, and Ceph health went into an error state:
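For context, one common way to raise Ceph daemon verbosity on a Rook-managed cluster (the "debug log level to 20" mentioned above) is Rook's `rook-config-override` ConfigMap, which injects ceph.conf settings into the daemons. A minimal sketch, assuming the default `openshift-storage` namespace and that OSD/MDS debug logging is the target (the exact subsystems used in this test are not recorded here):

```yaml
# Hedged sketch: ceph.conf overrides via Rook's config override ConfigMap.
# Daemons pick these up on restart; values shown are illustrative.
apiVersion: v1
kind: ConfigMap
metadata:
  name: rook-config-override
  namespace: openshift-storage
data:
  config: |
    [global]
    debug osd = 20
    debug mds = 20
```

The same options can also be set live from the toolbox pod with `ceph config set global debug_osd 20`.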
```
sh-5.1$ ceph health detail
HEALTH_ERR 1 MDSs report slow metadata IOs; 1 MDSs report slow requests; 1/25670 objects unfound (0.004%); Possible data damage: 1 pg recovery_unfound; Degraded data redundancy: 3/77010 objects degraded (0.004%), 1 pg degraded; 2 slow ops, oldest one blocked for 12486 sec, osd.0 has slow ops
[WRN] MDS_SLOW_METADATA_IO: 1 MDSs report slow metadata IOs
    mds.ocs-storagecluster-cephfilesystem-b(mds.0): 100+ slow metadata IOs are blocked > 30 secs, oldest blocked for 12484 secs
[WRN] MDS_SLOW_REQUEST: 1 MDSs report slow requests
    mds.ocs-storagecluster-cephfilesystem-b(mds.0): 24 slow requests are blocked > 30 secs
[WRN] OBJECT_UNFOUND: 1/25670 objects unfound (0.004%)
    pg 10.6 has 1 unfound objects
[ERR] PG_DAMAGED: Possible data damage: 1 pg recovery_unfound
    pg 10.6 is active+recovery_unfound+degraded, acting [0,2,1], 1 unfound
[WRN] PG_DEGRADED: Degraded data redundancy: 3/77010 objects degraded (0.004%), 1 pg degraded
    pg 10.6 is active+recovery_unfound+degraded, acting [0,2,1], 1 unfound
[WRN] SLOW_OPS: 2 slow ops, oldest one blocked for 12486 sec, osd.0 has slow ops
sh-5.1$
```
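For triage, a `recovery_unfound` PG like the one above would typically be inspected from the rook-ceph toolbox pod with `ceph pg <pgid> query` and `ceph pg <pgid> list_unfound` (and, only as a last resort once recovery is truly impossible, `ceph pg <pgid> mark_unfound_lost revert|delete`). A minimal sketch that pulls the damaged PG id out of a saved `ceph health detail` excerpt, using the two lines quoted above as sample input:

```shell
# Extract the damaged PG id from a ceph-health-detail excerpt, then print
# the follow-up commands one would run in the toolbox against that PG.
pg=$(awk '/recovery_unfound/ && $1 == "pg" {print $2; exit}' <<'EOF'
[ERR] PG_DAMAGED: Possible data damage: 1 pg recovery_unfound
    pg 10.6 is active+recovery_unfound+degraded, acting [0,2,1], 1 unfound
EOF
)
echo "damaged pg: ${pg}"
echo "inspect with: ceph pg ${pg} query ; ceph pg ${pg} list_unfound"
```

In a live cluster the input would come from `ceph health detail` directly rather than a here-document.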
ODF details:

```
[root@4147-63b9-bastion-0 scripts]# oc get pods -n openshift-storage
NAME READY STATUS RESTARTS AGE
csi-addons-controller-manager-79cb669559-2fhnl 2/2 Running 0 7m45s
csi-cephfsplugin-42j82 2/2 Running 0 3h10m
csi-cephfsplugin-5vskv 2/2 Running 0 3h10m
csi-cephfsplugin-gn9m6 2/2 Running 0 3h10m
csi-cephfsplugin-provisioner-5d7bf56669-2pnnh 5/5 Running 0 3h10m
csi-cephfsplugin-provisioner-5d7bf56669-tl6dt 5/5 Running 0 3h10m
csi-nfsplugin-dszl7 2/2 Running 0 78m
csi-nfsplugin-jc29j 2/2 Running 0 78m
csi-nfsplugin-provisioner-6c874556f8-89ctt 5/5 Running 0 78m
csi-nfsplugin-provisioner-6c874556f8-cmtgq 5/5 Running 0 78m
csi-nfsplugin-vcpvw 2/2 Running 0 78m
csi-rbdplugin-2589h 3/3 Running 0 3h10m
csi-rbdplugin-75plr 3/3 Running 0 3h10m
csi-rbdplugin-mdzj4 3/3 Running 0 3h10m
csi-rbdplugin-provisioner-58b4d778f4-cdbc8 6/6 Running 0 3h10m
csi-rbdplugin-provisioner-58b4d778f4-kqcfd 6/6 Running 0 3h10m
noobaa-core-0 1/1 Running 0 3h7m
noobaa-db-pg-0 1/1 Running 0 3h7m
noobaa-endpoint-7458dd6f4d-sc9tz 1/1 Running 0 3h5m
noobaa-operator-5c8c964858-72kgn 2/2 Running 0 3h11m
ocs-metrics-exporter-7ffffb7c9d-4tpzl 1/1 Running 0 3h11m
ocs-operator-5bff7bdf4c-cs9wf 1/1 Running 0 3h11m
odf-console-6f7998946b-bwg56 1/1 Running 0 3h11m
odf-operator-controller-manager-5568dd9487-26hl2 2/2 Running 0 3h11m
rook-ceph-crashcollector-worker-0-648f4b8788-vwrdv 1/1 Running 0 3h7m
rook-ceph-crashcollector-worker-1-646b58c45f-4qlzf 1/1 Running 0 3h7m
rook-ceph-crashcollector-worker-2-69f5449956-xk6c2 1/1 Running 0 3h8m
rook-ceph-exporter-worker-0-6b555f4675-cr6qt 1/1 Running 0 3h7m
rook-ceph-exporter-worker-1-6f9d4c69c6-x8fm5 1/1 Running 0 3h7m
rook-ceph-exporter-worker-2-c45956ff5-tgr6x 1/1 Running 0 3h8m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-696bd4668jzvp 2/2 Running 0 3h7m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-6d5ddfd8cdfrt 2/2 Running 0 3h7m
rook-ceph-mgr-a-54bfbc8c97-r9p8h 2/2 Running 0 3h8m
rook-ceph-mon-a-8596bdf8f5-9xnwm 2/2 Running 0 3h9m
rook-ceph-mon-b-78dccc67ff-kxg56 2/2 Running 0 3h8m
rook-ceph-mon-c-f685957cc-zm9fq 2/2 Running 0 3h8m
rook-ceph-operator-5fd7f59d9-56pgt 1/1 Running 0 78m
rook-ceph-osd-0-6cd77b9f49-4dnbx 2/2 Running 0 3h8m
rook-ceph-osd-1-54898769b4-76n67 2/2 Running 0 3h8m
rook-ceph-osd-2-6c9477fd7c-hphh2 2/2 Running 0 3h8m
rook-ceph-osd-prepare-21204917fbfd08a6e447ee561bd006f2-nk9lq 0/1 Completed 0 3h8m
rook-ceph-osd-prepare-386c0b59249f47cb659730da72766997-4q8k6 0/1 Completed 0 3h8m
rook-ceph-osd-prepare-874523f6970b8f4e8d48f12f057a3742-9xq6w 0/1 Completed 0 3h8m
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-66d6ccbq8nhq 2/2 Running 0 3h7m
rook-ceph-tools-6cb655c7d-rj8ff 1/1 Running 0 3h1m
ux-backend-server-8fd45d994-kf8gm 2/2 Running 0 3h11m
[root@4147-63b9-bastion-0 scripts]# oc get cephcluster -n openshift-storage
NAME DATADIRHOSTPATH MONCOUNT AGE PHASE MESSAGE HEALTH EXTERNAL FSID
ocs-storagecluster-cephcluster /var/lib/rook 3 3h10m Ready Cluster created successfully HEALTH_ERR   3abc68f2-f6c7-481e-b640-e2fa6e5a4e8b
[root@4147-63b9-bastion-0 scripts]# oc get storagecluster -n openshift-storage
NAME AGE PHASE EXTERNAL CREATED AT VERSION
ocs-storagecluster 3h11m Progressing 2024-05-29T09:34:03Z 4.14.7
[root@4147-63b9-bastion-0 scripts]# oc get csv -A
NAMESPACE NAME DISPLAY VERSION REPLACES PHASE
openshift-local-storage local-storage-operator.v4.14.0-202404030309 Local Storage 4.14.0-202404030309 Succeeded
openshift-operator-lifecycle-manager packageserver Package Server 0.0.1-snapshot Succeeded
openshift-storage mcg-operator.v4.14.7-rhodf NooBaa Operator 4.14.7-rhodf mcg-operator.v4.14.6-rhodf Succeeded
openshift-storage ocs-operator.v4.14.7-rhodf OpenShift Container Storage 4.14.7-rhodf ocs-operator.v4.14.6-rhodf Succeeded
openshift-storage odf-csi-addons-operator.v4.14.7-rhodf CSI Addons 4.14.7-rhodf odf-csi-addons-operator.v4.14.6-rhodf Succeeded
openshift-storage odf-operator.v4.14.7-rhodf OpenShift Data Foundation 4.14.7-rhodf odf-operator.v4.14.6-rhodf Succeeded
[root@4147-63b9-bastion-0 scripts]#
```
Here are the must-gather logs for the same: https://drive.google.com/file/d/16ZBrkvzU8pnNk0wLTmGY2hxd72dQct_c/view?usp=drive_link
In Pooja's setup, coredumps are missing: in her must-gather, the coredump directory is empty, possibly because no daemon crash is happening in her setup.

```
[root@4147-63b9-bastion-0 odf-debug20-selinux]# cd quay-io-rhceph-dev-ocs-must-gather-sha256-41894d86060275bc9094bb4819f9b38cd2ca8beec15a58f6c34ccc12d9deb588/ceph/
[root@4147-63b9-bastion-0 ceph]# ls
ceph_daemon_log_worker-0  event-filter.html   journal_worker-2  kernel_worker-2       must_gather_commands_json_output
ceph_daemon_log_worker-1  journal_worker-0    kernel_worker-0   logs                  namespaces
ceph_daemon_log_worker-2  journal_worker-1    kernel_worker-1   must_gather_commands  timestamp
[root@4147-63b9-bastion-0 ceph]#
```

The coredump directory is also empty on all three worker nodes:

```
[root@4147-63b9-bastion-0 ~]# oc debug node/worker-0
Starting pod/worker-0-debug-wn578 ...
To use host binaries, run `chroot /host`
Pod IP: 9.114.99.13
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-5.1# ls /var/lib/systemd/coredump/
sh-5.1# exit
exit
sh-4.4# exit
exit
Removing debug pod ...
[root@4147-63b9-bastion-0 ~]# oc debug node/worker-1
Starting pod/worker-1-debug-tqhgh ...
To use host binaries, run `chroot /host`
Pod IP: 9.114.99.6
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-5.1# ls /var/lib/systemd/coredump/
sh-5.1# exit
exit
sh-4.4# exit
exit
Removing debug pod ...
[root@4147-63b9-bastion-0 ~]# oc debug node/worker-2
Starting pod/worker-2-debug-tf6dt ...
To use host binaries, run `chroot /host`
Pod IP: 9.114.99.11
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-5.1# ls /var/lib/systemd/coredump/
sh-5.1# exit
exit
sh-4.4# exit
exit
Removing debug pod ...
```

*** This bug has been marked as a duplicate of bug 2277603 ***
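As a side note, the per-node coredump check performed interactively above can be scripted. A minimal sketch, assuming the same three worker nodes and a bastion host with a logged-in `oc` session (it degrades to a note when `oc` is not on PATH):

```shell
# Check each worker node for systemd coredumps via a one-shot debug pod.
# An empty listing means no daemon on that node has crashed.
NODES="worker-0 worker-1 worker-2"
seen=""
for node in $NODES; do
  echo "== ${node} =="
  seen="${seen} ${node}"
  if command -v oc >/dev/null 2>&1; then
    # --quiet suppresses the debug-pod startup banner.
    oc debug node/"${node}" --quiet -- chroot /host ls /var/lib/systemd/coredump/
  else
    echo "(oc not available; skipping)"
  fi
done
```

Nodes with coredumps would print the dump filenames under their `== node ==` header, making it easy to spot where a crash occurred.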