Description of problem (please be as detailed as possible and provide log snippets):

After the ODF upgrade, the OSD pods are in Init:CrashLoopBackOff.

Version of all relevant components (if applicable):
OCP version: 4.14.0-0.nightly-2023-07-20-215234
ODF version: 4.14.0-77
Platform: IBM Cloud IPI

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Can this issue be reproduced?
Yes

Can this issue reproduce from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Deploy ODF with 4.14.0-67.
2. Upgrade to the latest stable build, i.e. 4.14.0-77.
3. Check the pod list.

Actual results:

$ oc get pods -l app=rook-ceph-osd
NAME                               READY   STATUS                  RESTARTS        AGE
rook-ceph-osd-0-5cb48c4cf8-s6vvn   0/2     Init:CrashLoopBackOff   8 (3m54s ago)   19m
rook-ceph-osd-1-65c5c75dc5-l6tss   0/2     Init:CrashLoopBackOff   8 (3m25s ago)   19m
rook-ceph-osd-2-5fb876b874-f8qpf   0/2     Init:CrashLoopBackOff   8 (48s ago)     16m

$ oc logs rook-ceph-osd-0-5cb48c4cf8-s6vvn --all-containers
+ PVC_SOURCE=/ocs-deviceset-1-data-0fd5p9
+ PVC_DEST=/var/lib/ceph/osd/ceph-0/block
+ CP_ARGS=(--archive --dereference --verbose)
+ '[' -b /var/lib/ceph/osd/ceph-0/block ']'
++ stat --format %t%T /ocs-deviceset-1-data-0fd5p9
+ PVC_SOURCE_MAJ_MIN=fc40
++ stat --format %t%T /var/lib/ceph/osd/ceph-0/block
PVC /var/lib/ceph/osd/ceph-0/block already exists and has the same major and minor as /ocs-deviceset-1-data-0fd5p9: fc40
+ PVC_DEST_MAJ_MIN=fc40
+ [[ fc40 == \f\c\4\0 ]]
+ echo 'PVC /var/lib/ceph/osd/ceph-0/block already exists and has the same major and minor as /ocs-deviceset-1-data-0fd5p9: fc40'
+ exit 0
inferring bluefs devices from bluestore path
expected bluestore, but type is
Error from server (BadRequest): container "chown-container-data-dir" in pod "rook-ceph-osd-0-5cb48c4cf8-s6vvn" is waiting to start: PodInitializing

$ oc logs rook-ceph-osd-1-65c5c75dc5-l6tss --all-containers
inferring bluefs devices from bluestore path
expected bluestore, but type is
Error from server (BadRequest): container "chown-container-data-dir" in pod "rook-ceph-osd-1-65c5c75dc5-l6tss" is waiting to start: PodInitializing

Expected results:

Additional info:
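The `set -x` trace at the top of the OSD pod log belongs to the block-device-mapping init container, which exits successfully (`exit 0`); the `expected bluestore, but type is` lines come from a later init container. Below is a minimal sketch of the check visible in the trace, reconstructed from the log above rather than taken from the actual Rook init-container script:

```
#!/usr/bin/env bash
# Minimal sketch of the device check seen in the trace above; reconstructed
# from the log, not the real Rook script. Paths match the trace.
set -ex

PVC_SOURCE=/ocs-deviceset-1-data-0fd5p9
PVC_DEST=/var/lib/ceph/osd/ceph-0/block
CP_ARGS=(--archive --dereference --verbose)

if [ -b "$PVC_DEST" ]; then
  # Compare hex major/minor numbers of the PVC device and the existing
  # destination block node.
  PVC_SOURCE_MAJ_MIN=$(stat --format %t%T "$PVC_SOURCE")
  PVC_DEST_MAJ_MIN=$(stat --format %t%T "$PVC_DEST")
  if [[ "$PVC_SOURCE_MAJ_MIN" == "$PVC_DEST_MAJ_MIN" ]]; then
    echo "PVC $PVC_DEST already exists and has the same major and minor as $PVC_SOURCE: $PVC_DEST_MAJ_MIN"
    exit 0  # same device node already in place; nothing to do (matches the log)
  fi
fi

# Otherwise the device node is copied into place.
cp "${CP_ARGS[@]}" "$PVC_SOURCE" "$PVC_DEST"
```

Since this check exits 0, the CrashLoopBackOff must come from a later init container; the analysis below narrows it down to expand-bluefs.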
The `expand-bluefs` init container is failing because `ceph-bluestore-tool` is not able to read the metadata from the provided path (/var/lib/ceph/osd/ceph-0). Still looking into it.
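For reference, the failing step can be looked at in isolation instead of through `--all-containers`. A minimal sketch, assuming the usual ODF namespace (openshift-storage) and Rook's `ceph-osd-id` pod label; the manual command mirrors the one the `expand-bluefs` container runs (see the reproduction further down in this bug):

```
# Assumptions: namespace openshift-storage, ceph-osd-id label on the OSD pods.
NS=openshift-storage
POD=$(oc -n "$NS" get pods -l app=rook-ceph-osd,ceph-osd-id=0 -o name | head -n1)

# Logs of just the expand-bluefs init container.
oc -n "$NS" logs "$POD" -c expand-bluefs

# The operation that container performs, runnable by hand from an OSD container:
#   ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-0
```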
Tried checking the OSD prepare logs because the `type` file is created during `mkfs` in `ceph-volume prepare`. Not able to get the prepare pod logs due to the following error:

```
oc logs rook-ceph-osd-prepare-ocs-deviceset-2-data-0bcbkd-fwcgj
Defaulted container "provision" out of: provision, copy-bins (init), blkdevmapper (init)
unable to retrieve container logs for cri-o://919ffbf854bf3e31dfa26a83d8d65dbfd66cf37e44f0f6de05c14b8943528b1a
```

Have asked Pratik to retry this scenario so that we can confirm this is not an environment issue.
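When `oc logs` cannot retrieve the container logs, a couple of generic fallbacks sometimes still work. A sketch, assuming the openshift-storage namespace and using the container ID from the error above (these may return nothing if CRI-O has already garbage-collected the container):

```
NS=openshift-storage
POD=rook-ceph-osd-prepare-ocs-deviceset-2-data-0bcbkd-fwcgj

# Check container statuses and events first.
oc -n "$NS" describe pod "$POD"

# If the container restarted, the previous instance's logs may still be around.
oc -n "$NS" logs "$POD" -c provision --previous

# Last resort: read the CRI-O log directly on the node that ran the pod.
NODE=$(oc -n "$NS" get pod "$POD" -o jsonpath='{.spec.nodeName}')
oc debug node/"$NODE" -- chroot /host crictl logs 919ffbf854bf3e31dfa26a83d8d65dbfd66cf37e44f0f6de05c14b8943528b1a
```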
This is not an env issue because we are hitting it in every CI test for the nightly builds. There is a cluster available if you want to take a look: https://jenkins.ceph.redhat.com/job/ocs-ci/2356/

Just a guess: it might be related to the new bluestore changes, as those were merged in this Ceph build, but by default those changes should be disabled.
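To check that guess, the effective bluestore configuration can be inspected from the toolbox while the mons are still up. A rough sketch (the specific new option names are not identified in this report, so it only greps for bluestore-related overrides; the toolbox deployment name and namespace are the usual ODF defaults):

```
NS=openshift-storage

# Confirm which Ceph build the cluster is running.
oc -n "$NS" rsh deploy/rook-ceph-tools ceph versions

# List any bluestore-related options set at the cluster level; the new options
# should only appear here if someone enabled them explicitly.
oc -n "$NS" rsh deploy/rook-ceph-tools sh -c 'ceph config dump | grep -i bluestore'
```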
OSD works after removing the `expand-bluefs` init container.

```
sh-5.1$ ceph status
  cluster:
    id:     2e285ed3-d727-4445-aed4-d8fa245c92d9
    health: HEALTH_WARN
            2 MDSs report slow metadata IOs
            2 osds down
            2 hosts (2 osds) down
            2 zones (2 osds) down
            Reduced data availability: 113 pgs inactive, 113 pgs stale
            Degraded data redundancy: 1862/2793 objects degraded (66.667%), 81 pgs degraded, 113 pgs undersized

  services:
    mon: 3 daemons, quorum a,b,c (age 19h)
    mgr: a(active, since 19h)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 1 up (since 2m), 3 in (since 24h)

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 113 pgs
    objects: 931 objects, 2.8 GiB
    usage:   7.8 GiB used, 292 GiB / 300 GiB avail
    pgs:     100.000% pgs not active
             1862/2793 objects degraded (66.667%)
             81 stale+undersized+degraded+peered
             32 stale+undersized+peered

sh-5.1$ exit
```

Tried running ceph-bluestore-tool inside the OSD container. Getting the same error, although the type is `bluestore`:

```
sh-5.1# ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-0 --log-level 30
inferring bluefs devices from bluestore path
expected bluestore, but type is

sh-5.1# cat /var/lib/ceph/osd/ceph-0/type
bluestore
sh-5.1#
```

So it looks like `ceph-bluestore-tool` is not reading the metadata correctly.
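A couple of extra checks that may help tell whether the tool trips over the directory metadata or over the device label itself. A sketch to run inside the same OSD container; the show-label output format varies by Ceph version:

```
# What ceph-bluestore-tool itself reads from the block device label.
ceph-bluestore-tool show-label --dev /var/lib/ceph/osd/ceph-0/block

# The plain-text OSD directory metadata the error appears to relate to.
cat /var/lib/ceph/osd/ceph-0/type
ls -l /var/lib/ceph/osd/ceph-0/
```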
(In reply to Santosh Pillai from comment #6)
> Tried running ceph-bluestore-tool inside the OSD container. Getting the same
> error, although the type is `bluestore`:
> [...]
> So it looks like `ceph-bluestore-tool` is not reading the metadata correctly.

Hi Adam, has there been any change in the way the metadata is read by the COT that could have caused this issue?
Adam found the root cause in Ceph.
Please provide QE ack.
Verified the upgrade on the vSphere platform: upgraded the cluster from ODF 4.13.1 to 4.14.0-93.

$ oc get csv
NAME                                        DISPLAY                       VERSION            REPLACES                                PHASE
mcg-operator.v4.14.0-93.stable              NooBaa Operator               4.14.0-93.stable   mcg-operator.v4.13.1-rhodf              Succeeded
ocs-operator.v4.14.0-93.stable              OpenShift Container Storage   4.14.0-93.stable   ocs-operator.v4.13.1-rhodf              Succeeded
odf-csi-addons-operator.v4.14.0-93.stable   CSI Addons                    4.14.0-93.stable   odf-csi-addons-operator.v4.13.1-rhodf   Succeeded
odf-operator.v4.14.0-93.stable              OpenShift Data Foundation     4.14.0-93.stable   odf-operator.v4.13.1-rhodf              Succeeded

> All pods are up and running

$ oc get pods
NAME                                                              READY   STATUS      RESTARTS       AGE
csi-addons-controller-manager-5c8fd7b449-tfjw5                    2/2     Running     0              94m
csi-cephfsplugin-h2zzs                                            2/2     Running     0              105m
csi-cephfsplugin-hd2nr                                            2/2     Running     0              105m
csi-cephfsplugin-kw9nh                                            2/2     Running     0              105m
csi-cephfsplugin-provisioner-689c768444-drrxm                     5/5     Running     0              105m
csi-cephfsplugin-provisioner-689c768444-s7pj9                     5/5     Running     0              105m
csi-rbdplugin-krx8s                                               3/3     Running     0              105m
csi-rbdplugin-nwwkq                                               3/3     Running     0              105m
csi-rbdplugin-provisioner-6bb5f9f996-tjmtw                        6/6     Running     0              105m
csi-rbdplugin-provisioner-6bb5f9f996-xr6mm                        6/6     Running     0              105m
csi-rbdplugin-qlcqf                                               3/3     Running     0              105m
noobaa-core-0                                                     1/1     Running     0              105m
noobaa-db-pg-0                                                    1/1     Running     0              128m
noobaa-endpoint-74fd8699d5-4svkl                                  1/1     Running     0              105m
noobaa-operator-6bd6985d8-9kjxs                                   2/2     Running     0              107m
ocs-metrics-exporter-756f64cdbc-jb68v                             1/1     Running     0              107m
ocs-operator-c8f5b6b46-wm4rz                                      1/1     Running     1 (106m ago)   107m
odf-console-544f747cdf-cqxh6                                      1/1     Running     3 (108m ago)   109m
odf-operator-controller-manager-dc4d55f78-vzx85                   2/2     Running     0              109m
rook-ceph-crashcollector-compute-0-66998b7976-5ktgt               1/1     Running     0              105m
rook-ceph-crashcollector-compute-1-7666849b5-vlb28                1/1     Running     0              103m
rook-ceph-crashcollector-compute-2-799cf594c8-q89jk               1/1     Running     0              105m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-7b95676cp66w7   2/2     Running     0              104m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-77f67664b2qcd   2/2     Running     0              104m
rook-ceph-mgr-a-6d9dc4d5bc-jlwx8                                  2/2     Running     0              103m
rook-ceph-mon-a-767f9b8575-dr9vh                                  2/2     Running     0              105m
rook-ceph-mon-b-b689b6665-sj96f                                   2/2     Running     0              103m
rook-ceph-mon-c-7774d87c75-vrffn                                  2/2     Running     0              105m
rook-ceph-operator-57887b7c4-7wljh                                1/1     Running     0              106m
rook-ceph-osd-0-88955f6d8-gtfkq                                   2/2     Running     0              102m
rook-ceph-osd-1-9cfc6f76d-6kh68                                   2/2     Running     0              102m
rook-ceph-osd-2-6f54db8869-bmsds                                  2/2     Running     0              102m
rook-ceph-osd-prepare-ocs-deviceset-0-data-06mnxk-zltw2           0/1     Completed   0              129m
rook-ceph-osd-prepare-ocs-deviceset-1-data-04rx7r-pvfmb           0/1     Completed   0              129m
rook-ceph-osd-prepare-ocs-deviceset-2-data-06dw7p-qxqdc           0/1     Completed   0              129m
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-574c554574bq   2/2     Running     0              104m
rook-ceph-tools-779ff74f75-7k875                                  1/1     Running     0              106m

upgrade job: https://url.corp.redhat.com/20f84c2
logs: https://url.corp.redhat.com/1db1d4a
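For anyone repeating this verification, the checks above boil down to a few commands. A sketch, assuming the default openshift-storage namespace and the rook-ceph-tools toolbox shown in the pod list:

```
NS=openshift-storage

# Operators should be at the target version and in the Succeeded phase.
oc -n "$NS" get csv

# OSD pods in particular must be 2/2 Running, with no Init:CrashLoopBackOff.
oc -n "$NS" get pods -l app=rook-ceph-osd

# Ceph health from the toolbox.
oc -n "$NS" rsh deploy/rook-ceph-tools ceph -s
oc -n "$NS" rsh deploy/rook-ceph-tools ceph health detail
```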
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.14.0 security, enhancement & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2023:6832
*** Bug 2226662 has been marked as a duplicate of this bug. ***
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.