Bug 2305874 - [GSS][ODF 4.13 backport] Legacy LVM-based OSDs are in crashloop state
Summary: [GSS][ODF 4.13 backport] Legacy LVM-based OSDs are in crashloop state
Keywords:
Status: NEW
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.13
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Travis Nielsen
QA Contact: Neha Berry
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2024-08-19 18:42 UTC by Paul Gozart
Modified: 2024-09-06 11:31 UTC (History)
3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:




Links
System ID: Red Hat Issue Tracker OCSBZM-9070
Private: 0
Priority: None
Status: None
Summary: None
Last Updated: 2024-09-06 11:31:53 UTC

Description Paul Gozart 2024-08-19 18:42:17 UTC
This bug was initially created as a copy of Bug #2276533

I am copying this bug because: 

My TAM customer ran into the issue described in KCS 7041554 when upgrading their dev cluster from 4.12.40 to 4.13.45.  They require a backport to 4.13 so they can upgrade their prod clusters.


Description of problem (please be as detailed as possible and provide logs):

The customer upgraded from 4.12.40 to 4.13.45, and we have noticed that all OSDs are in a crash loop, with the expand-bluefs container showing errors about devices that cannot be found.

Version of all relevant components (if applicable):
ODF 4.13

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes, all OSDs are down.

Is there any workaround available to the best of your knowledge?
Yes.  https://access.redhat.com/solutions/7041554

Is this issue reproducible?
Yes, in the customer environment.

Comment 4 Travis Nielsen 2024-08-19 19:37:33 UTC
Actually, removing from 4.13.z proposal while still investigating...

The original issue that this was cloned from does not apply to 4.13.
In that code path, legacy lvm-based OSDs were failing to expand because the method
c.getExpandPVCInitContainer() was being called for the lvm-based OSDs.
However, in this code snippet [1], line 556 already applies only to raw-based OSDs:
  initContainers = append(initContainers, c.getExpandPVCInitContainer(osdProps, osdID))

This code path does not apply to lvm-based OSDs since it's already in the "else" block
for "raw" OSDs. [2]
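
For illustration, here is a minimal, self-contained Go sketch of the branching described above. It is not the actual rook source: the identifiers OSDInfo, CVMode, Container, and expandPVCInitContainer are illustrative stand-ins for the real types and for c.getExpandPVCInitContainer() in spec.go.

  // Sketch only: mirrors the shape of the code path in [1], where the
  // expand-bluefs init container is appended only in the "raw" (else) branch.
  package main

  import "fmt"

  // Container is a simplified stand-in for a pod init container spec.
  type Container struct {
      Name string
  }

  // OSDInfo is a simplified stand-in for rook's per-OSD metadata.
  type OSDInfo struct {
      ID     int
      CVMode string // "lvm" for legacy ceph-volume lvm OSDs, "raw" otherwise
  }

  // expandPVCInitContainer stands in for c.getExpandPVCInitContainer(osdProps, osdID).
  func expandPVCInitContainer(osd OSDInfo) Container {
      return Container{Name: fmt.Sprintf("expand-bluefs-osd-%d", osd.ID)}
  }

  // buildInitContainers shows why the originally cloned bug should not apply here:
  // the lvm branch never adds the resize container, only the raw branch does.
  func buildInitContainers(osd OSDInfo) []Container {
      var initContainers []Container
      if osd.CVMode == "lvm" {
          // Legacy lvm-based OSDs: no expand-bluefs init container is added.
      } else {
          // Raw-mode OSDs: the resize container is re-added on every reconcile,
          // so a failing resize would recur on each upgrade.
          initContainers = append(initContainers, expandPVCInitContainer(osd))
      }
      return initContainers
  }

  func main() {
      for _, osd := range []OSDInfo{{ID: 0, CVMode: "lvm"}, {ID: 1, CVMode: "raw"}} {
          fmt.Printf("osd.%d (%s): %d init container(s)\n",
              osd.ID, osd.CVMode, len(buildInitContainers(osd)))
      }
  }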

So these must not be legacy lvm-based OSDs as in the original issue. There must be
some other issue causing these raw-based OSDs to fail during the resize call. 
If this is the case, these OSDs will likely continue to fail each time the 
cluster is upgraded and the OSDs are reconciled to add the resize container back.


[1] https://github.com/red-hat-storage/rook/blob/release-4.13/pkg/operator/ceph/cluster/osd/spec.go#L529-L557
[2] https://github.com/red-hat-storage/rook/commit/b489d7ae47a628497be8695ffb70606d246a578c

Comment 5 Travis Nielsen 2024-08-19 19:39:50 UTC
If we can't find the root cause, these OSDs may need to be serially wiped and replaced to avoid future issues.

