Description of problem (please be as detailed as possible and provide log snippets):

After the ODF upgrade, the OSD pods are in Init:CrashLoopBackOff.

Version of all relevant components (if applicable):
OCP version: 4.14.0-0.nightly-2023-07-20-215234
ODF version: 4.14.0-77
Platform: IBM Cloud IPI

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Can this issue be reproduced?
Yes

Can this issue reproduce from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Deploy ODF with 4.14.0-67.
2. Upgrade to the latest stable build, i.e. 4.14.0-77.
3. Check the pod list.

Actual results:

$ oc get pods -l app=rook-ceph-osd
NAME                               READY   STATUS                  RESTARTS        AGE
rook-ceph-osd-0-5cb48c4cf8-s6vvn   0/2     Init:CrashLoopBackOff   8 (3m54s ago)   19m
rook-ceph-osd-1-65c5c75dc5-l6tss   0/2     Init:CrashLoopBackOff   8 (3m25s ago)   19m
rook-ceph-osd-2-5fb876b874-f8qpf   0/2     Init:CrashLoopBackOff   8 (48s ago)     16m

$ oc logs rook-ceph-osd-0-5cb48c4cf8-s6vvn --all-containers
+ PVC_SOURCE=/ocs-deviceset-1-data-0fd5p9
+ PVC_DEST=/var/lib/ceph/osd/ceph-0/block
+ CP_ARGS=(--archive --dereference --verbose)
+ '[' -b /var/lib/ceph/osd/ceph-0/block ']'
++ stat --format %t%T /ocs-deviceset-1-data-0fd5p9
+ PVC_SOURCE_MAJ_MIN=fc40
++ stat --format %t%T /var/lib/ceph/osd/ceph-0/block
PVC /var/lib/ceph/osd/ceph-0/block already exists and has the same major and minor as /ocs-deviceset-1-data-0fd5p9: fc40
+ PVC_DEST_MAJ_MIN=fc40
+ [[ fc40 == \f\c\4\0 ]]
+ echo 'PVC /var/lib/ceph/osd/ceph-0/block already exists and has the same major and minor as /ocs-deviceset-1-data-0fd5p9: fc40'
+ exit 0
inferring bluefs devices from bluestore path
expected bluestore, but type is
Error from server (BadRequest): container "chown-container-data-dir" in pod "rook-ceph-osd-0-5cb48c4cf8-s6vvn" is waiting to start: PodInitializing

$ oc logs rook-ceph-osd-1-65c5c75dc5-l6tss --all-containers
inferring bluefs devices from bluestore path
expected bluestore, but type is
Error from server (BadRequest): container "chown-container-data-dir" in pod "rook-ceph-osd-1-65c5c75dc5-l6tss" is waiting to start: PodInitializing

Expected results:

Additional info:
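The `set -x` trace at the top of the OSD pod log belongs to the block-device-mapping init container, which exits successfully (`exit 0`); the `expected bluestore, but type is` lines come from a later init container. Below is a minimal sketch of the check visible in the trace, reconstructed from the log above rather than taken from the actual Rook init-container script:

```
#!/usr/bin/env bash
# Minimal sketch of the device check seen in the trace above; reconstructed
# from the log, not the real Rook script. Paths match the trace.
set -ex

PVC_SOURCE=/ocs-deviceset-1-data-0fd5p9
PVC_DEST=/var/lib/ceph/osd/ceph-0/block
CP_ARGS=(--archive --dereference --verbose)

if [ -b "$PVC_DEST" ]; then
  # Compare hex major/minor numbers of the PVC device and the existing
  # destination block node.
  PVC_SOURCE_MAJ_MIN=$(stat --format %t%T "$PVC_SOURCE")
  PVC_DEST_MAJ_MIN=$(stat --format %t%T "$PVC_DEST")
  if [[ "$PVC_SOURCE_MAJ_MIN" == "$PVC_DEST_MAJ_MIN" ]]; then
    echo "PVC $PVC_DEST already exists and has the same major and minor as $PVC_SOURCE: $PVC_DEST_MAJ_MIN"
    exit 0  # same device node already in place; nothing to do (matches the log)
  fi
fi

# Otherwise the device node is copied into place.
cp "${CP_ARGS[@]}" "$PVC_SOURCE" "$PVC_DEST"
```

Since this check exits 0, the CrashLoopBackOff must come from a later init container; the analysis below narrows it down to expand-bluefs.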
The `expand-bluefs` init container is failing because `ceph-bluestore-tool` is not able to read the metadata from the provided path (/var/lib/ceph/osd/ceph-0). Still looking into it.
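For reference, the failing step can be looked at in isolation instead of through `--all-containers`. A minimal sketch, assuming the usual ODF namespace (openshift-storage) and Rook's `ceph-osd-id` pod label; the manual command mirrors the one the `expand-bluefs` container runs (see the reproduction further down in this bug):

```
# Assumptions: namespace openshift-storage, ceph-osd-id label on the OSD pods.
NS=openshift-storage
POD=$(oc -n "$NS" get pods -l app=rook-ceph-osd,ceph-osd-id=0 -o name | head -n1)

# Logs of just the expand-bluefs init container.
oc -n "$NS" logs "$POD" -c expand-bluefs

# The operation that container performs, runnable by hand from an OSD container:
#   ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-0
```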
Tried checking the OSD prepare logs because the `type` file is created during `mkfs` in `ceph-volume prepare`. Not able to get the prepare pod logs due to the following error:

```
oc logs rook-ceph-osd-prepare-ocs-deviceset-2-data-0bcbkd-fwcgj
Defaulted container "provision" out of: provision, copy-bins (init), blkdevmapper (init)
unable to retrieve container logs for cri-o://919ffbf854bf3e31dfa26a83d8d65dbfd66cf37e44f0f6de05c14b8943528b1a
```

Have asked Pratik to retry this scenario so that we can confirm this is not an environment issue.
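When `oc logs` cannot retrieve the container logs, a couple of generic fallbacks sometimes still work. A sketch, assuming the openshift-storage namespace and using the container ID from the error above (these may return nothing if CRI-O has already garbage-collected the container):

```
NS=openshift-storage
POD=rook-ceph-osd-prepare-ocs-deviceset-2-data-0bcbkd-fwcgj

# Check container statuses and events first.
oc -n "$NS" describe pod "$POD"

# If the container restarted, the previous instance's logs may still be around.
oc -n "$NS" logs "$POD" -c provision --previous

# Last resort: read the CRI-O log directly on the node that ran the pod.
NODE=$(oc -n "$NS" get pod "$POD" -o jsonpath='{.spec.nodeName}')
oc debug node/"$NODE" -- chroot /host crictl logs 919ffbf854bf3e31dfa26a83d8d65dbfd66cf37e44f0f6de05c14b8943528b1a
```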
This is not an env issue because we are hitting it in every CI test for the nightly builds. There is a cluster available if you want to take a look: https://jenkins.ceph.redhat.com/job/ocs-ci/2356/

Just a guess: it might be related to the new bluestore changes, as those were merged in this Ceph build, but by default those changes should be disabled.
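To check that guess, the effective bluestore configuration can be inspected from the toolbox while the mons are still up. A rough sketch (the specific new option names are not identified in this report, so it only greps for bluestore-related overrides; the toolbox deployment name and namespace are the usual ODF defaults):

```
NS=openshift-storage

# Confirm which Ceph build the cluster is running.
oc -n "$NS" rsh deploy/rook-ceph-tools ceph versions

# List any bluestore-related options set at the cluster level; the new options
# should only appear here if someone enabled them explicitly.
oc -n "$NS" rsh deploy/rook-ceph-tools sh -c 'ceph config dump | grep -i bluestore'
```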
OSD works after removing the `expand-bluefs` init container.

```
sh-5.1$ ceph status
  cluster:
    id:     2e285ed3-d727-4445-aed4-d8fa245c92d9
    health: HEALTH_WARN
            2 MDSs report slow metadata IOs
            2 osds down
            2 hosts (2 osds) down
            2 zones (2 osds) down
            Reduced data availability: 113 pgs inactive, 113 pgs stale
            Degraded data redundancy: 1862/2793 objects degraded (66.667%), 81 pgs degraded, 113 pgs undersized

  services:
    mon: 3 daemons, quorum a,b,c (age 19h)
    mgr: a(active, since 19h)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 1 up (since 2m), 3 in (since 24h)

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 113 pgs
    objects: 931 objects, 2.8 GiB
    usage:   7.8 GiB used, 292 GiB / 300 GiB avail
    pgs:     100.000% pgs not active
             1862/2793 objects degraded (66.667%)
             81 stale+undersized+degraded+peered
             32 stale+undersized+peered

sh-5.1$ exit
```

Tried running ceph-bluestore-tool inside the OSD container. Getting the same error, although the type is `bluestore`:

```
sh-5.1# ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-0 --log-level 30
inferring bluefs devices from bluestore path
expected bluestore, but type is

sh-5.1# cat /var/lib/ceph/osd/ceph-0/type
bluestore
sh-5.1#
```

So it looks like `ceph-bluestore-tool` is not reading the metadata correctly.
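A couple of extra checks that may help tell whether the tool trips over the directory metadata or over the device label itself. A sketch to run inside the same OSD container; the show-label output format varies by Ceph version:

```
# What ceph-bluestore-tool itself reads from the block device label.
ceph-bluestore-tool show-label --dev /var/lib/ceph/osd/ceph-0/block

# The plain-text OSD directory metadata the error appears to relate to.
cat /var/lib/ceph/osd/ceph-0/type
ls -l /var/lib/ceph/osd/ceph-0/
```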
(In reply to Santosh Pillai from comment #6)
> Tried running ceph-bluestore-tool inside the OSD container. Getting the same
> error, although the type is `bluestore`:
> [...]
> So it looks like `ceph-bluestore-tool` is not reading the metadata correctly.

Hi Adam, has there been any change in the way the metadata is read by the COT that could have caused this issue?
Adam found the root cause in Ceph.
Please provide QE ack.
Verified the upgrade on the vSphere platform: upgraded the cluster from ODF 4.13.1 to 4.14.0-93.

$ oc get csv
NAME                                        DISPLAY                       VERSION            REPLACES                                PHASE
mcg-operator.v4.14.0-93.stable              NooBaa Operator               4.14.0-93.stable   mcg-operator.v4.13.1-rhodf              Succeeded
ocs-operator.v4.14.0-93.stable              OpenShift Container Storage   4.14.0-93.stable   ocs-operator.v4.13.1-rhodf              Succeeded
odf-csi-addons-operator.v4.14.0-93.stable   CSI Addons                    4.14.0-93.stable   odf-csi-addons-operator.v4.13.1-rhodf   Succeeded
odf-operator.v4.14.0-93.stable              OpenShift Data Foundation     4.14.0-93.stable   odf-operator.v4.13.1-rhodf              Succeeded

> All pods are up and running

$ oc get pods
NAME                                                              READY   STATUS      RESTARTS       AGE
csi-addons-controller-manager-5c8fd7b449-tfjw5                    2/2     Running     0              94m
csi-cephfsplugin-h2zzs                                            2/2     Running     0              105m
csi-cephfsplugin-hd2nr                                            2/2     Running     0              105m
csi-cephfsplugin-kw9nh                                            2/2     Running     0              105m
csi-cephfsplugin-provisioner-689c768444-drrxm                     5/5     Running     0              105m
csi-cephfsplugin-provisioner-689c768444-s7pj9                     5/5     Running     0              105m
csi-rbdplugin-krx8s                                               3/3     Running     0              105m
csi-rbdplugin-nwwkq                                               3/3     Running     0              105m
csi-rbdplugin-provisioner-6bb5f9f996-tjmtw                        6/6     Running     0              105m
csi-rbdplugin-provisioner-6bb5f9f996-xr6mm                        6/6     Running     0              105m
csi-rbdplugin-qlcqf                                               3/3     Running     0              105m
noobaa-core-0                                                     1/1     Running     0              105m
noobaa-db-pg-0                                                    1/1     Running     0              128m
noobaa-endpoint-74fd8699d5-4svkl                                  1/1     Running     0              105m
noobaa-operator-6bd6985d8-9kjxs                                   2/2     Running     0              107m
ocs-metrics-exporter-756f64cdbc-jb68v                             1/1     Running     0              107m
ocs-operator-c8f5b6b46-wm4rz                                      1/1     Running     1 (106m ago)   107m
odf-console-544f747cdf-cqxh6                                      1/1     Running     3 (108m ago)   109m
odf-operator-controller-manager-dc4d55f78-vzx85                   2/2     Running     0              109m
rook-ceph-crashcollector-compute-0-66998b7976-5ktgt               1/1     Running     0              105m
rook-ceph-crashcollector-compute-1-7666849b5-vlb28                1/1     Running     0              103m
rook-ceph-crashcollector-compute-2-799cf594c8-q89jk               1/1     Running     0              105m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-7b95676cp66w7   2/2     Running     0              104m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-77f67664b2qcd   2/2     Running     0              104m
rook-ceph-mgr-a-6d9dc4d5bc-jlwx8                                  2/2     Running     0              103m
rook-ceph-mon-a-767f9b8575-dr9vh                                  2/2     Running     0              105m
rook-ceph-mon-b-b689b6665-sj96f                                   2/2     Running     0              103m
rook-ceph-mon-c-7774d87c75-vrffn                                  2/2     Running     0              105m
rook-ceph-operator-57887b7c4-7wljh                                1/1     Running     0              106m
rook-ceph-osd-0-88955f6d8-gtfkq                                   2/2     Running     0              102m
rook-ceph-osd-1-9cfc6f76d-6kh68                                   2/2     Running     0              102m
rook-ceph-osd-2-6f54db8869-bmsds                                  2/2     Running     0              102m
rook-ceph-osd-prepare-ocs-deviceset-0-data-06mnxk-zltw2           0/1     Completed   0              129m
rook-ceph-osd-prepare-ocs-deviceset-1-data-04rx7r-pvfmb           0/1     Completed   0              129m
rook-ceph-osd-prepare-ocs-deviceset-2-data-06dw7p-qxqdc           0/1     Completed   0              129m
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-574c554574bq   2/2     Running     0              104m
rook-ceph-tools-779ff74f75-7k875                                  1/1     Running     0              106m

upgrade job: https://url.corp.redhat.com/20f84c2
logs: https://url.corp.redhat.com/1db1d4a
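For anyone repeating this verification, the checks above boil down to a few commands. A sketch, assuming the default openshift-storage namespace and the rook-ceph-tools toolbox shown in the pod list:

```
NS=openshift-storage

# Operators should be at the target version and in the Succeeded phase.
oc -n "$NS" get csv

# OSD pods in particular must be 2/2 Running, with no Init:CrashLoopBackOff.
oc -n "$NS" get pods -l app=rook-ceph-osd

# Ceph health from the toolbox.
oc -n "$NS" rsh deploy/rook-ceph-tools ceph -s
oc -n "$NS" rsh deploy/rook-ceph-tools ceph health detail
```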
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.14.0 security, enhancement & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2023:6832
*** Bug 2226662 has been marked as a duplicate of this bug. ***
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.