Bug 2225176
| Summary: | After upgrade of ODF OSD pods are in Init:CrashLoopBackOff | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Pratik Surve <prsurve> |
| Component: | rook | Assignee: | Santosh Pillai <sapillai> |
| Status: | CLOSED ERRATA | QA Contact: | Vijay Avuthu <vavuthu> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.14 | CC: | akupczyk, kdreyer, kramdoss, muagarwa, nojha, odf-bz-bot, sapillai, tserlin, vavuthu |
| Target Milestone: | --- | | |
| Target Release: | ODF 4.14.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | 4.14.0-90 | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-11-08 18:52:55 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 2226657 | | |
| Bug Blocks: | | | |
Description
Pratik Surve, 2023-07-24 13:22:18 UTC
The `expand-bluefs` init container is failing because `ceph-bluestore-tool` is not able to read metadata from the provided path (/var/lib/ceph/osd/ceph-0). Still looking into it.

---

Tried checking the OSD prepare logs, because the `type` file is created during `mkfs` in `ceph-volume prepare`. Not able to get the prepare pod logs due to the following error:

```
oc logs rook-ceph-osd-prepare-ocs-deviceset-2-data-0bcbkd-fwcgj
Defaulted container "provision" out of: provision, copy-bins (init), blkdevmapper (init)
unable to retrieve container logs for cri-o://919ffbf854bf3e31dfa26a83d8d65dbfd66cf37e44f0f6de05c14b8943528b1a%
```

Have asked Pratik to retry this scenario so that we can confirm this is not an environment issue.

---

This is not an environment issue: we are hitting it in every CI test for the nightly builds. There is a cluster if you want to look: https://jenkins.ceph.redhat.com/job/ocs-ci/2356/

---

Just a guess: it might be related to the new bluestore changes, as those were merged in this Ceph build, but by default those changes should be disabled.

---

OSD works after removing the `expand-bluefs` init container.

```
sh-5.1$ ceph status
  cluster:
    id:     2e285ed3-d727-4445-aed4-d8fa245c92d9
    health: HEALTH_WARN
            2 MDSs report slow metadata IOs
            2 osds down
            2 hosts (2 osds) down
            2 zones (2 osds) down
            Reduced data availability: 113 pgs inactive, 113 pgs stale
            Degraded data redundancy: 1862/2793 objects degraded (66.667%), 81 pgs degraded, 113 pgs undersized

  services:
    mon: 3 daemons, quorum a,b,c (age 19h)
    mgr: a(active, since 19h)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 1 up (since 2m), 3 in (since 24h)

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 113 pgs
    objects: 931 objects, 2.8 GiB
    usage:   7.8 GiB used, 292 GiB / 300 GiB avail
    pgs:     100.000% pgs not active
             1862/2793 objects degraded (66.667%)
             81 stale+undersized+degraded+peered
             32 stale+undersized+peered

sh-5.1$ exit
```

Tried running `ceph-bluestore-tool` inside the OSD container. Getting the same error, although the `type` file says `bluestore`:

```
sh-5.1# ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-0 --log-level 30
inferring bluefs devices from bluestore path
expected bluestore, but type is

sh-5.1# cat /var/lib/ceph/osd/ceph-0/type
bluestore
sh-5.1#
```

So it looks like `ceph-bluestore-tool` is not reading the metadata correctly.
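A minimal debugging sketch along these lines (the `block` symlink path assumes the standard OSD data directory layout; `show-label` is a stock `ceph-bluestore-tool` subcommand):

```
# Cross-check the two places the OSD "type" metadata can come from.
# Assumes /var/lib/ceph/osd/ceph-0/block is a symlink to the underlying device.

# 1. The plain metadata file on disk (this already looks correct):
cat /var/lib/ceph/osd/ceph-0/type

# 2. The bluestore device label; if the tool now prefers the bdev label
#    over the meta files, an empty or missing "type" key here would
#    explain "expected bluestore, but type is":
ceph-bluestore-tool show-label --dev /var/lib/ceph/osd/ceph-0/block

# 3. Re-run the failing command with verbose logging:
ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-0 --log-level 30
```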
---

(In reply to Santosh Pillai from comment #6)
> Tried running `ceph-bluestore-tool` inside the OSD container. Getting the
> same error, although the `type` file says `bluestore`.
> So it looks like `ceph-bluestore-tool` is not reading the metadata correctly.

Hi Adam, has there been any change in the way the metadata is read by the COT that could have caused this issue?

---

Adam found the root cause in Ceph.

---

Please provide QE ack.

---

Verified upgrade on the vSphere platform: upgraded the cluster from ODF 4.13.1 to 4.14.0-93.

```
$ oc get csv
NAME                                        DISPLAY                       VERSION            REPLACES                                PHASE
mcg-operator.v4.14.0-93.stable              NooBaa Operator               4.14.0-93.stable   mcg-operator.v4.13.1-rhodf              Succeeded
ocs-operator.v4.14.0-93.stable              OpenShift Container Storage   4.14.0-93.stable   ocs-operator.v4.13.1-rhodf              Succeeded
odf-csi-addons-operator.v4.14.0-93.stable   CSI Addons                    4.14.0-93.stable   odf-csi-addons-operator.v4.13.1-rhodf   Succeeded
odf-operator.v4.14.0-93.stable              OpenShift Data Foundation     4.14.0-93.stable   odf-operator.v4.13.1-rhodf              Succeeded
```

All pods are up and running:

```
$ oc get pods
NAME                                                              READY   STATUS      RESTARTS       AGE
csi-addons-controller-manager-5c8fd7b449-tfjw5                    2/2     Running     0              94m
csi-cephfsplugin-h2zzs                                            2/2     Running     0              105m
csi-cephfsplugin-hd2nr                                            2/2     Running     0              105m
csi-cephfsplugin-kw9nh                                            2/2     Running     0              105m
csi-cephfsplugin-provisioner-689c768444-drrxm                     5/5     Running     0              105m
csi-cephfsplugin-provisioner-689c768444-s7pj9                     5/5     Running     0              105m
csi-rbdplugin-krx8s                                               3/3     Running     0              105m
csi-rbdplugin-nwwkq                                               3/3     Running     0              105m
csi-rbdplugin-provisioner-6bb5f9f996-tjmtw                        6/6     Running     0              105m
csi-rbdplugin-provisioner-6bb5f9f996-xr6mm                        6/6     Running     0              105m
csi-rbdplugin-qlcqf                                               3/3     Running     0              105m
noobaa-core-0                                                     1/1     Running     0              105m
noobaa-db-pg-0                                                    1/1     Running     0              128m
noobaa-endpoint-74fd8699d5-4svkl                                  1/1     Running     0              105m
noobaa-operator-6bd6985d8-9kjxs                                   2/2     Running     0              107m
ocs-metrics-exporter-756f64cdbc-jb68v                             1/1     Running     0              107m
ocs-operator-c8f5b6b46-wm4rz                                      1/1     Running     1 (106m ago)   107m
odf-console-544f747cdf-cqxh6                                      1/1     Running     3 (108m ago)   109m
odf-operator-controller-manager-dc4d55f78-vzx85                   2/2     Running     0              109m
rook-ceph-crashcollector-compute-0-66998b7976-5ktgt               1/1     Running     0              105m
rook-ceph-crashcollector-compute-1-7666849b5-vlb28                1/1     Running     0              103m
rook-ceph-crashcollector-compute-2-799cf594c8-q89jk               1/1     Running     0              105m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-7b95676cp66w7   2/2     Running     0              104m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-77f67664b2qcd   2/2     Running     0              104m
rook-ceph-mgr-a-6d9dc4d5bc-jlwx8                                  2/2     Running     0              103m
rook-ceph-mon-a-767f9b8575-dr9vh                                  2/2     Running     0              105m
rook-ceph-mon-b-b689b6665-sj96f                                   2/2     Running     0              103m
rook-ceph-mon-c-7774d87c75-vrffn                                  2/2     Running     0              105m
rook-ceph-operator-57887b7c4-7wljh                                1/1     Running     0              106m
rook-ceph-osd-0-88955f6d8-gtfkq                                   2/2     Running     0              102m
rook-ceph-osd-1-9cfc6f76d-6kh68                                   2/2     Running     0              102m
rook-ceph-osd-2-6f54db8869-bmsds                                  2/2     Running     0              102m
rook-ceph-osd-prepare-ocs-deviceset-0-data-06mnxk-zltw2           0/1     Completed   0              129m
rook-ceph-osd-prepare-ocs-deviceset-1-data-04rx7r-pvfmb           0/1     Completed   0              129m
rook-ceph-osd-prepare-ocs-deviceset-2-data-06dw7p-qxqdc           0/1     Completed   0              129m
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-574c554574bq   2/2     Running     0              104m
rook-ceph-tools-779ff74f75-7k875                                  1/1     Running     0              106m
```

upgrade job: https://url.corp.redhat.com/20f84c2
logs: https://url.corp.redhat.com/1db1d4a
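For anyone re-checking an upgraded cluster, a minimal verification sketch (assuming the default openshift-storage namespace and the standard Rook pod labels):

```
# OSD pods should be Running, not stuck in Init:CrashLoopBackOff:
oc -n openshift-storage get pods -l app=rook-ceph-osd

# Init container states for the OSD pods (the failing init container in
# this bug was expand-bluefs):
oc -n openshift-storage describe pod -l app=rook-ceph-osd | grep -A 3 expand-bluefs

# Overall Ceph health via the toolbox deployment:
oc -n openshift-storage rsh deploy/rook-ceph-tools ceph status
```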
---

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.14.0 security, enhancement & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:6832

---

*** Bug 2226662 has been marked as a duplicate of this bug. ***

---

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.