Description of problem (please be as detailed as possible and provide log snippets):

Ceph health goes into an error state after running the test_selinux_relabel_for_existing_pvc[5] test case.

sh-5.1$ ceph health
HEALTH_ERR 2 MDSs report slow metadata IOs; 1 MDSs report slow requests; Module 'devicehealth' has failed: unknown operation; 4/64333 objects unfound (0.006%); 2 osds down; 2 hosts (2 osds) down; Reduced data availability: 201 pgs inactive; Possible data damage: 4 pgs recovery_unfound; Degraded data redundancy: 128670/192999 objects degraded (66.669%), 117 pgs degraded, 201 pgs undersized; 627 daemons have recently crashed; 1 mgr modules have recently crashed
sh-5.1$

Version of all relevant components (if applicable):
ODF version - 4.14.7
OCP version - 4.14.25

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Can this issue be reproduced?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Deploy ODF 4.14.7 and execute the tier1 test suite.
2. During execution of the test_selinux_relabel_for_existing_pvc[5] test case, Ceph health goes into an error state (the toolbox health check used is sketched below, after the must-gather link).

Actual results:
Ceph health reports HEALTH_ERR.

Expected results:
Ceph health remains HEALTH_OK.

Additional info:
Must-gather logs - https://drive.google.com/file/d/1kwGQbGLmwZ6BMAo5Fdl5dNlWyTr2DDd4/view?usp=drive_link
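For reference, the health output above was collected from the Ceph toolbox pod. A minimal sketch of that check, assuming the rook-ceph-tools deployment is running in the openshift-storage namespace:

# Open a shell in the toolbox pod (the pod name differs per cluster)
oc -n openshift-storage rsh deploy/rook-ceph-tools

# Inside the toolbox, inspect overall and detailed cluster health
ceph health
ceph health detail
ceph -s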
As we have completed the development freeze for ODF 4.16, we are moving this non-blocker BZ out of the release. If this is a blocker, feel free to propose it as a blocker with a justification note.
In fact, this issue is seen in both ODF 4.14.7 and ODF 4.15.2, and we have filed a BZ for ODF 4.15.2 as well: https://bugzilla.redhat.com/show_bug.cgi?id=2277603
This issue is seen on ODF 4.14.7 and ODF 4.15.2; we have tried multiple clusters and it reproduces consistently.
We have not seen this issue on ODF 4.14.6 and earlier. For ODF 4.15, the issue is not seen on ODF 4.15.1 and earlier. We have also not seen any related issue on ODF 4.16.0.
A custom 4.14.7 build `bz-2280973` with rhceph version 6.1z4 is available for testing
> lemme know once you have the build for `4.14.7 with rhceph version 6.1z4` and a build for `4.15.{2,3}` with version 6.1z4 on IBM P systems

We got the build for 4.14.7 with rhceph version 6.1z4 and are testing it, but we are still waiting for the 4.15.2 and 4.15.3 builds with rhceph version 6.1z4.
We tested ODF 4.14.7 with the new build containing rhceph version 6.1z4 and did not face any issues while running the test cases; Ceph health is OK and all pods are running.
I reran tier1 on ODF 4.14.7 after setting the debug log level to 20 (a sketch of raising the debug level from the toolbox follows the cluster status output below). Ceph health went into an error state.

sh-5.1$ ceph health detail
HEALTH_ERR 1 MDSs report slow metadata IOs; 1 MDSs report slow requests; 1/25670 objects unfound (0.004%); Possible data damage: 1 pg recovery_unfound; Degraded data redundancy: 3/77010 objects degraded (0.004%), 1 pg degraded; 2 slow ops, oldest one blocked for 12486 sec, osd.0 has slow ops
[WRN] MDS_SLOW_METADATA_IO: 1 MDSs report slow metadata IOs
    mds.ocs-storagecluster-cephfilesystem-b(mds.0): 100+ slow metadata IOs are blocked > 30 secs, oldest blocked for 12484 secs
[WRN] MDS_SLOW_REQUEST: 1 MDSs report slow requests
    mds.ocs-storagecluster-cephfilesystem-b(mds.0): 24 slow requests are blocked > 30 secs
[WRN] OBJECT_UNFOUND: 1/25670 objects unfound (0.004%)
    pg 10.6 has 1 unfound objects
[ERR] PG_DAMAGED: Possible data damage: 1 pg recovery_unfound
    pg 10.6 is active+recovery_unfound+degraded, acting [0,2,1], 1 unfound
[WRN] PG_DEGRADED: Degraded data redundancy: 3/77010 objects degraded (0.004%), 1 pg degraded
    pg 10.6 is active+recovery_unfound+degraded, acting [0,2,1], 1 unfound
[WRN] SLOW_OPS: 2 slow ops, oldest one blocked for 12486 sec, osd.0 has slow ops
sh-5.1$

ODF details -

[root@4147-63b9-bastion-0 scripts]# oc get pods -n openshift-storage
NAME   READY   STATUS   RESTARTS   AGE
csi-addons-controller-manager-79cb669559-2fhnl   2/2   Running   0   7m45s
csi-cephfsplugin-42j82   2/2   Running   0   3h10m
csi-cephfsplugin-5vskv   2/2   Running   0   3h10m
csi-cephfsplugin-gn9m6   2/2   Running   0   3h10m
csi-cephfsplugin-provisioner-5d7bf56669-2pnnh   5/5   Running   0   3h10m
csi-cephfsplugin-provisioner-5d7bf56669-tl6dt   5/5   Running   0   3h10m
csi-nfsplugin-dszl7   2/2   Running   0   78m
csi-nfsplugin-jc29j   2/2   Running   0   78m
csi-nfsplugin-provisioner-6c874556f8-89ctt   5/5   Running   0   78m
csi-nfsplugin-provisioner-6c874556f8-cmtgq   5/5   Running   0   78m
csi-nfsplugin-vcpvw   2/2   Running   0   78m
csi-rbdplugin-2589h   3/3   Running   0   3h10m
csi-rbdplugin-75plr   3/3   Running   0   3h10m
csi-rbdplugin-mdzj4   3/3   Running   0   3h10m
csi-rbdplugin-provisioner-58b4d778f4-cdbc8   6/6   Running   0   3h10m
csi-rbdplugin-provisioner-58b4d778f4-kqcfd   6/6   Running   0   3h10m
noobaa-core-0   1/1   Running   0   3h7m
noobaa-db-pg-0   1/1   Running   0   3h7m
noobaa-endpoint-7458dd6f4d-sc9tz   1/1   Running   0   3h5m
noobaa-operator-5c8c964858-72kgn   2/2   Running   0   3h11m
ocs-metrics-exporter-7ffffb7c9d-4tpzl   1/1   Running   0   3h11m
ocs-operator-5bff7bdf4c-cs9wf   1/1   Running   0   3h11m
odf-console-6f7998946b-bwg56   1/1   Running   0   3h11m
odf-operator-controller-manager-5568dd9487-26hl2   2/2   Running   0   3h11m
rook-ceph-crashcollector-worker-0-648f4b8788-vwrdv   1/1   Running   0   3h7m
rook-ceph-crashcollector-worker-1-646b58c45f-4qlzf   1/1   Running   0   3h7m
rook-ceph-crashcollector-worker-2-69f5449956-xk6c2   1/1   Running   0   3h8m
rook-ceph-exporter-worker-0-6b555f4675-cr6qt   1/1   Running   0   3h7m
rook-ceph-exporter-worker-1-6f9d4c69c6-x8fm5   1/1   Running   0   3h7m
rook-ceph-exporter-worker-2-c45956ff5-tgr6x   1/1   Running   0   3h8m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-696bd4668jzvp   2/2   Running   0   3h7m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-6d5ddfd8cdfrt   2/2   Running   0   3h7m
rook-ceph-mgr-a-54bfbc8c97-r9p8h   2/2   Running   0   3h8m
rook-ceph-mon-a-8596bdf8f5-9xnwm   2/2   Running   0   3h9m
rook-ceph-mon-b-78dccc67ff-kxg56   2/2   Running   0   3h8m
rook-ceph-mon-c-f685957cc-zm9fq   2/2   Running   0   3h8m
rook-ceph-operator-5fd7f59d9-56pgt   1/1   Running   0   78m
rook-ceph-osd-0-6cd77b9f49-4dnbx   2/2   Running   0   3h8m
rook-ceph-osd-1-54898769b4-76n67   2/2   Running   0   3h8m
rook-ceph-osd-2-6c9477fd7c-hphh2   2/2   Running   0   3h8m
rook-ceph-osd-prepare-21204917fbfd08a6e447ee561bd006f2-nk9lq   0/1   Completed   0   3h8m
rook-ceph-osd-prepare-386c0b59249f47cb659730da72766997-4q8k6   0/1   Completed   0   3h8m
rook-ceph-osd-prepare-874523f6970b8f4e8d48f12f057a3742-9xq6w   0/1   Completed   0   3h8m
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-66d6ccbq8nhq   2/2   Running   0   3h7m
rook-ceph-tools-6cb655c7d-rj8ff   1/1   Running   0   3h1m
ux-backend-server-8fd45d994-kf8gm   2/2   Running   0   3h11m

[root@4147-63b9-bastion-0 scripts]# oc get cephcluster -n openshift-storage
NAME   DATADIRHOSTPATH   MONCOUNT   AGE   PHASE   MESSAGE   HEALTH   EXTERNAL   FSID
ocs-storagecluster-cephcluster   /var/lib/rook   3   3h10m   Ready   Cluster created successfully   HEALTH_ERR   3abc68f2-f6c7-481e-b640-e2fa6e5a4e8b

[root@4147-63b9-bastion-0 scripts]# oc get storagecluster -n openshift-storage
NAME   AGE   PHASE   EXTERNAL   CREATED AT   VERSION
ocs-storagecluster   3h11m   Progressing   2024-05-29T09:34:03Z   4.14.7

[root@4147-63b9-bastion-0 scripts]# oc get csv -A
NAMESPACE   NAME   DISPLAY   VERSION   REPLACES   PHASE
openshift-local-storage   local-storage-operator.v4.14.0-202404030309   Local Storage   4.14.0-202404030309   Succeeded
openshift-operator-lifecycle-manager   packageserver   Package Server   0.0.1-snapshot   Succeeded
openshift-storage   mcg-operator.v4.14.7-rhodf   NooBaa Operator   4.14.7-rhodf   mcg-operator.v4.14.6-rhodf   Succeeded
openshift-storage   ocs-operator.v4.14.7-rhodf   OpenShift Container Storage   4.14.7-rhodf   ocs-operator.v4.14.6-rhodf   Succeeded
openshift-storage   odf-csi-addons-operator.v4.14.7-rhodf   CSI Addons   4.14.7-rhodf   odf-csi-addons-operator.v4.14.6-rhodf   Succeeded
openshift-storage   odf-operator.v4.14.7-rhodf   OpenShift Data Foundation   4.14.7-rhodf   odf-operator.v4.14.6-rhodf   Succeeded
[root@4147-63b9-bastion-0 scripts]#

Here are the must-gather logs for the same - https://drive.google.com/file/d/16ZBrkvzU8pnNk0wLTmGY2hxd72dQct_c/view?usp=drive_link
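For context on the "debug log level to 20" step in this comment: the exact subsystems raised in this run are not recorded here, but a sketch of one common way to do it from the toolbox (debug_osd/debug_mds are assumed examples), together with a way to inspect the unfound object reported for pg 10.6, would be:

# Raise debug logging for selected daemons (subsystems chosen here are assumptions; revert after debugging)
ceph config set osd debug_osd 20/20
ceph config set mds debug_mds 20/20

# Inspect the PG flagged as recovery_unfound in the health output above
ceph pg 10.6 query
ceph pg 10.6 list_unfound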
In Pooja's setup, coredumps are missing; in her must-gather, the coredump directory is empty. This may be because the daemon crash is not happening in her setup.

[root@4147-63b9-bastion-0 odf-debug20-selinux]# cd quay-io-rhceph-dev-ocs-must-gather-sha256-41894d86060275bc9094bb4819f9b38cd2ca8beec15a58f6c34ccc12d9deb588/ceph/
[root@4147-63b9-bastion-0 ceph]# ls
ceph_daemon_log_worker-0   event-filter.html   journal_worker-2   kernel_worker-2        must_gather_commands_json_output
ceph_daemon_log_worker-1   journal_worker-0    kernel_worker-0    logs                   namespaces
ceph_daemon_log_worker-2   journal_worker-1    kernel_worker-1    must_gather_commands   timestamp
[root@4147-63b9-bastion-0 ceph]#

[root@4147-63b9-bastion-0 ~]# oc debug node/worker-0
Starting pod/worker-0-debug-wn578 ...
To use host binaries, run `chroot /host`
Pod IP: 9.114.99.13
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-5.1# ls /var/lib/systemd/coredump/
sh-5.1# exit
exit
sh-4.4# exit
exit
Removing debug pod ...

[root@4147-63b9-bastion-0 ~]# oc debug node/worker-1
Starting pod/worker-1-debug-tqhgh ...
To use host binaries, run `chroot /host`
Pod IP: 9.114.99.6
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-5.1# ls /var/lib/systemd/coredump/
sh-5.1# exit
exit
sh-4.4# exit
exit
Removing debug pod ...

[root@4147-63b9-bastion-0 ~]# oc debug node/worker-2
Starting pod/worker-2-debug-tf6dt ...
To use host binaries, run `chroot /host`
Pod IP: 9.114.99.11
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-5.1# ls /var/lib/systemd/coredump/
sh-5.1# exit
exit
sh-4.4# exit
exit
Removing debug pod ...
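Since the on-node coredump directories are empty, the Ceph crash module is another place to check for crash metadata. A minimal sketch from the toolbox, assuming crash reports were captured by the crash module:

# List crash reports known to the cluster, then show details for one of them
ceph crash ls
ceph crash info <crash-id>    # <crash-id> taken from the listing above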
*** This bug has been marked as a duplicate of bug 2277603 ***