Bug 2280973
| Summary: | Ceph health is going to Error state on ODF4.14.7 on IBM Power | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Pooja Soni <posoni> |
| Component: | rook | Assignee: | Parth Arora <paarora> |
| Status: | CLOSED DUPLICATE | QA Contact: | Neha Berry <nberry> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.14 | CC: | aaaggarw, bhubbard, brgardne, dkhandel, lithomas, odf-bz-bot, paarora, radoslaw.zak, rzarzyns, tnielsen |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | ppc64le | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2024-06-06 00:43:51 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Pooja Soni
2024-05-17 13:42:31 UTC
Must-gather logs: https://drive.google.com/file/d/1kwGQbGLmwZ6BMAo5Fdl5dNlWyTr2DDd4/view?usp=drive_link

As we have completed the development freeze for ODF 4.16, this non-blocker BZ is being moved out of the release. If this is a blocker, feel free to propose it as one with a justification note.

In fact, this issue is seen on both ODF 4.15.2 and ODF 4.14.7, and we created a BZ for ODF 4.15.2 as well: https://bugzilla.redhat.com/show_bug.cgi?id=2277603

This issue is seen on ODF 4.14.7 and ODF 4.15.2; we have tried multiple clusters and it is consistent. We have not seen this issue on ODF 4.14.6 and earlier, and for ODF 4.15 it is not seen on ODF 4.15.1 and earlier. Also, no related issue is seen on ODF 4.16.0.

A custom 4.14.7 build `bz-2280973` with rhceph version 6.1z4 is available for testing.

> Let me know once you have the build for `4.14.7 with rhceph version 6.1z4` and a build for `4.15.{2,3}` with version 6.1z4 on IBM Power systems.
We got the build for 4.14.7 with rhceph version 6.1z4 and are testing it, but we are still waiting for the 4.15.2 and 4.15.3 builds with rhceph version 6.1z4.
We tested ODF 4.14.7 with the new build containing rhceph version 6.1z4 and did not face any issue while running the test cases; Ceph health is OK and all pods are running.

I reran tier1 on ODF 4.14.7 after setting the debug log level to 20, and Ceph health went into an error state:
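For context, one common way to raise Ceph daemon verbosity on a Rook-managed cluster (the "debug log level to 20" mentioned above) is Rook's `rook-config-override` ConfigMap, which injects ceph.conf settings into the daemons. A minimal sketch, assuming the default `openshift-storage` namespace and that OSD/MDS debug logging is the target (the exact subsystems used in this test are not recorded here):

```yaml
# Hedged sketch: ceph.conf overrides via Rook's config override ConfigMap.
# Daemons pick these up on restart; values shown are illustrative.
apiVersion: v1
kind: ConfigMap
metadata:
  name: rook-config-override
  namespace: openshift-storage
data:
  config: |
    [global]
    debug osd = 20
    debug mds = 20
```

The same options can also be set live from the toolbox pod with `ceph config set global debug_osd 20`.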
```
sh-5.1$ ceph health detail
HEALTH_ERR 1 MDSs report slow metadata IOs; 1 MDSs report slow requests; 1/25670 objects unfound (0.004%); Possible data damage: 1 pg recovery_unfound; Degraded data redundancy: 3/77010 objects degraded (0.004%), 1 pg degraded; 2 slow ops, oldest one blocked for 12486 sec, osd.0 has slow ops
[WRN] MDS_SLOW_METADATA_IO: 1 MDSs report slow metadata IOs
    mds.ocs-storagecluster-cephfilesystem-b(mds.0): 100+ slow metadata IOs are blocked > 30 secs, oldest blocked for 12484 secs
[WRN] MDS_SLOW_REQUEST: 1 MDSs report slow requests
    mds.ocs-storagecluster-cephfilesystem-b(mds.0): 24 slow requests are blocked > 30 secs
[WRN] OBJECT_UNFOUND: 1/25670 objects unfound (0.004%)
    pg 10.6 has 1 unfound objects
[ERR] PG_DAMAGED: Possible data damage: 1 pg recovery_unfound
    pg 10.6 is active+recovery_unfound+degraded, acting [0,2,1], 1 unfound
[WRN] PG_DEGRADED: Degraded data redundancy: 3/77010 objects degraded (0.004%), 1 pg degraded
    pg 10.6 is active+recovery_unfound+degraded, acting [0,2,1], 1 unfound
[WRN] SLOW_OPS: 2 slow ops, oldest one blocked for 12486 sec, osd.0 has slow ops
sh-5.1$
```
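For triage, a `recovery_unfound` PG like the one above would typically be inspected from the rook-ceph toolbox pod with `ceph pg <pgid> query` and `ceph pg <pgid> list_unfound` (and, only as a last resort once recovery is truly impossible, `ceph pg <pgid> mark_unfound_lost revert|delete`). A minimal sketch that pulls the damaged PG id out of a saved `ceph health detail` excerpt, using the two lines quoted above as sample input:

```shell
# Extract the damaged PG id from a ceph-health-detail excerpt, then print
# the follow-up commands one would run in the toolbox against that PG.
pg=$(awk '/recovery_unfound/ && $1 == "pg" {print $2; exit}' <<'EOF'
[ERR] PG_DAMAGED: Possible data damage: 1 pg recovery_unfound
    pg 10.6 is active+recovery_unfound+degraded, acting [0,2,1], 1 unfound
EOF
)
echo "damaged pg: ${pg}"
echo "inspect with: ceph pg ${pg} query ; ceph pg ${pg} list_unfound"
```

In a live cluster the input would come from `ceph health detail` directly rather than a here-document.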
ODF details:

```
[root@4147-63b9-bastion-0 scripts]# oc get pods -n openshift-storage
NAME READY STATUS RESTARTS AGE
csi-addons-controller-manager-79cb669559-2fhnl 2/2 Running 0 7m45s
csi-cephfsplugin-42j82 2/2 Running 0 3h10m
csi-cephfsplugin-5vskv 2/2 Running 0 3h10m
csi-cephfsplugin-gn9m6 2/2 Running 0 3h10m
csi-cephfsplugin-provisioner-5d7bf56669-2pnnh 5/5 Running 0 3h10m
csi-cephfsplugin-provisioner-5d7bf56669-tl6dt 5/5 Running 0 3h10m
csi-nfsplugin-dszl7 2/2 Running 0 78m
csi-nfsplugin-jc29j 2/2 Running 0 78m
csi-nfsplugin-provisioner-6c874556f8-89ctt 5/5 Running 0 78m
csi-nfsplugin-provisioner-6c874556f8-cmtgq 5/5 Running 0 78m
csi-nfsplugin-vcpvw 2/2 Running 0 78m
csi-rbdplugin-2589h 3/3 Running 0 3h10m
csi-rbdplugin-75plr 3/3 Running 0 3h10m
csi-rbdplugin-mdzj4 3/3 Running 0 3h10m
csi-rbdplugin-provisioner-58b4d778f4-cdbc8 6/6 Running 0 3h10m
csi-rbdplugin-provisioner-58b4d778f4-kqcfd 6/6 Running 0 3h10m
noobaa-core-0 1/1 Running 0 3h7m
noobaa-db-pg-0 1/1 Running 0 3h7m
noobaa-endpoint-7458dd6f4d-sc9tz 1/1 Running 0 3h5m
noobaa-operator-5c8c964858-72kgn 2/2 Running 0 3h11m
ocs-metrics-exporter-7ffffb7c9d-4tpzl 1/1 Running 0 3h11m
ocs-operator-5bff7bdf4c-cs9wf 1/1 Running 0 3h11m
odf-console-6f7998946b-bwg56 1/1 Running 0 3h11m
odf-operator-controller-manager-5568dd9487-26hl2 2/2 Running 0 3h11m
rook-ceph-crashcollector-worker-0-648f4b8788-vwrdv 1/1 Running 0 3h7m
rook-ceph-crashcollector-worker-1-646b58c45f-4qlzf 1/1 Running 0 3h7m
rook-ceph-crashcollector-worker-2-69f5449956-xk6c2 1/1 Running 0 3h8m
rook-ceph-exporter-worker-0-6b555f4675-cr6qt 1/1 Running 0 3h7m
rook-ceph-exporter-worker-1-6f9d4c69c6-x8fm5 1/1 Running 0 3h7m
rook-ceph-exporter-worker-2-c45956ff5-tgr6x 1/1 Running 0 3h8m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-696bd4668jzvp 2/2 Running 0 3h7m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-6d5ddfd8cdfrt 2/2 Running 0 3h7m
rook-ceph-mgr-a-54bfbc8c97-r9p8h 2/2 Running 0 3h8m
rook-ceph-mon-a-8596bdf8f5-9xnwm 2/2 Running 0 3h9m
rook-ceph-mon-b-78dccc67ff-kxg56 2/2 Running 0 3h8m
rook-ceph-mon-c-f685957cc-zm9fq 2/2 Running 0 3h8m
rook-ceph-operator-5fd7f59d9-56pgt 1/1 Running 0 78m
rook-ceph-osd-0-6cd77b9f49-4dnbx 2/2 Running 0 3h8m
rook-ceph-osd-1-54898769b4-76n67 2/2 Running 0 3h8m
rook-ceph-osd-2-6c9477fd7c-hphh2 2/2 Running 0 3h8m
rook-ceph-osd-prepare-21204917fbfd08a6e447ee561bd006f2-nk9lq 0/1 Completed 0 3h8m
rook-ceph-osd-prepare-386c0b59249f47cb659730da72766997-4q8k6 0/1 Completed 0 3h8m
rook-ceph-osd-prepare-874523f6970b8f4e8d48f12f057a3742-9xq6w 0/1 Completed 0 3h8m
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-66d6ccbq8nhq 2/2 Running 0 3h7m
rook-ceph-tools-6cb655c7d-rj8ff 1/1 Running 0 3h1m
ux-backend-server-8fd45d994-kf8gm 2/2 Running 0 3h11m
[root@4147-63b9-bastion-0 scripts]# oc get cephcluster -n openshift-storage
NAME DATADIRHOSTPATH MONCOUNT AGE PHASE MESSAGE HEALTH EXTERNAL FSID
ocs-storagecluster-cephcluster /var/lib/rook 3 3h10m Ready Cluster created successfully HEALTH_ERR   3abc68f2-f6c7-481e-b640-e2fa6e5a4e8b
[root@4147-63b9-bastion-0 scripts]# oc get storagecluster -n openshift-storage
NAME AGE PHASE EXTERNAL CREATED AT VERSION
ocs-storagecluster 3h11m Progressing 2024-05-29T09:34:03Z 4.14.7
[root@4147-63b9-bastion-0 scripts]# oc get csv -A
NAMESPACE NAME DISPLAY VERSION REPLACES PHASE
openshift-local-storage local-storage-operator.v4.14.0-202404030309 Local Storage 4.14.0-202404030309 Succeeded
openshift-operator-lifecycle-manager packageserver Package Server 0.0.1-snapshot Succeeded
openshift-storage mcg-operator.v4.14.7-rhodf NooBaa Operator 4.14.7-rhodf mcg-operator.v4.14.6-rhodf Succeeded
openshift-storage ocs-operator.v4.14.7-rhodf OpenShift Container Storage 4.14.7-rhodf ocs-operator.v4.14.6-rhodf Succeeded
openshift-storage odf-csi-addons-operator.v4.14.7-rhodf CSI Addons 4.14.7-rhodf odf-csi-addons-operator.v4.14.6-rhodf Succeeded
openshift-storage odf-operator.v4.14.7-rhodf OpenShift Data Foundation 4.14.7-rhodf odf-operator.v4.14.6-rhodf Succeeded
[root@4147-63b9-bastion-0 scripts]#
```
Here are the must-gather logs for the same: https://drive.google.com/file/d/16ZBrkvzU8pnNk0wLTmGY2hxd72dQct_c/view?usp=drive_link
In Pooja's setup, coredumps are missing: in her must-gather, the coredump directory is empty, possibly because no daemon crash is happening in her setup.

```
[root@4147-63b9-bastion-0 odf-debug20-selinux]# cd quay-io-rhceph-dev-ocs-must-gather-sha256-41894d86060275bc9094bb4819f9b38cd2ca8beec15a58f6c34ccc12d9deb588/ceph/
[root@4147-63b9-bastion-0 ceph]# ls
ceph_daemon_log_worker-0  event-filter.html   journal_worker-2  kernel_worker-2       must_gather_commands_json_output
ceph_daemon_log_worker-1  journal_worker-0    kernel_worker-0   logs                  namespaces
ceph_daemon_log_worker-2  journal_worker-1    kernel_worker-1   must_gather_commands  timestamp
[root@4147-63b9-bastion-0 ceph]#
```

The coredump directory is also empty on all three worker nodes:

```
[root@4147-63b9-bastion-0 ~]# oc debug node/worker-0
Starting pod/worker-0-debug-wn578 ...
To use host binaries, run `chroot /host`
Pod IP: 9.114.99.13
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-5.1# ls /var/lib/systemd/coredump/
sh-5.1# exit
exit
sh-4.4# exit
exit
Removing debug pod ...
[root@4147-63b9-bastion-0 ~]# oc debug node/worker-1
Starting pod/worker-1-debug-tqhgh ...
To use host binaries, run `chroot /host`
Pod IP: 9.114.99.6
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-5.1# ls /var/lib/systemd/coredump/
sh-5.1# exit
exit
sh-4.4# exit
exit
Removing debug pod ...
[root@4147-63b9-bastion-0 ~]# oc debug node/worker-2
Starting pod/worker-2-debug-tf6dt ...
To use host binaries, run `chroot /host`
Pod IP: 9.114.99.11
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-5.1# ls /var/lib/systemd/coredump/
sh-5.1# exit
exit
sh-4.4# exit
exit
Removing debug pod ...
```

*** This bug has been marked as a duplicate of bug 2277603 ***
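As a side note, the per-node coredump check performed interactively above can be scripted. A minimal sketch, assuming the same three worker nodes and a bastion host with a logged-in `oc` session (it degrades to a note when `oc` is not on PATH):

```shell
# Check each worker node for systemd coredumps via a one-shot debug pod.
# An empty listing means no daemon on that node has crashed.
NODES="worker-0 worker-1 worker-2"
seen=""
for node in $NODES; do
  echo "== ${node} =="
  seen="${seen} ${node}"
  if command -v oc >/dev/null 2>&1; then
    # --quiet suppresses the debug-pod startup banner.
    oc debug node/"${node}" --quiet -- chroot /host ls /var/lib/systemd/coredump/
  else
    echo "(oc not available; skipping)"
  fi
done
```

Nodes with coredumps would print the dump filenames under their `== node ==` header, making it easy to spot where a crash occurred.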