Bug 2305874

Summary: [GSS][ODF 4.13 backport] Legacy LVM-based OSDs are in crashloop state
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Paul Gozart <pgozart>
Component: rook
Assignee: Travis Nielsen <tnielsen>
Status: NEW
QA Contact: Neha Berry <nberry>
Severity: urgent
Priority: unspecified
Version: 4.13
CC: odf-bz-bot, sheggodu, tnielsen
Hardware: Unspecified   
OS: Unspecified   

Description Paul Gozart 2024-08-19 18:42:17 UTC
This bug was initially created as a copy of Bug #2276533

I am copying this bug because: 

My TAM customer ran into the issue described in KCS 7041554 when upgrading their dev cluster from 4.12.40 to 4.13.45.  They require a backport to 4.13 so they can upgrade their prod clusters.


Description of problem (please be as detailed as possible and provide logs):

After the customer upgraded from 4.12.40 to 4.13.45, we noticed that all OSDs are in a crash loop, with the expand-bluefs init container showing errors about devices that cannot be found.

Version of all relevant components (if applicable):
ODF 4.13

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes, all OSDs are down.

Is there any workaround available to the best of your knowledge?
Yes.  https://access.redhat.com/solutions/7041554

Is this issue reproducible?
Yes, in the customer environment.

Comment 4 Travis Nielsen 2024-08-19 19:37:33 UTC
Actually, removing from 4.13.z proposal while still investigating...

The original issue that this was cloned from does not apply to 4.13.
In that code path, legacy lvm-based OSDs were failing to expand because
the method c.getExpandPVCInitContainer() was being called for the
lvm-based OSDs. However, in this code snippet [1], line 556 already
applies only to raw-based OSDs:
  initContainers = append(initContainers, c.getExpandPVCInitContainer(osdProps, osdID))

This code path does not apply to lvm-based OSDs, since it is already in the "else" block
for "raw" OSDs. [2]

So these must not be legacy lvm-based OSDs as in the original issue. There must be
some other issue causing these raw-based OSDs to fail during the resize call. 
If this is the case, these OSDs will likely continue to fail each time the 
cluster is upgraded and the OSDs are reconciled to add the resize container back.


[1] https://github.com/red-hat-storage/rook/blob/release-4.13/pkg/operator/ceph/cluster/osd/spec.go#L529-L557
[2] https://github.com/red-hat-storage/rook/commit/b489d7ae47a628497be8695ffb70606d246a578c
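The branching described above can be sketched as follows. This is a hypothetical simplification, not the actual Rook code from spec.go: the type and container names (osdInfo, "expand-bluefs") are illustrative, and only the shape of the raw-vs-lvm "else" branch matches the analysis.

```go
package main

import "fmt"

// osdInfo is an illustrative stand-in for Rook's OSD properties.
type osdInfo struct {
	id       int
	lvmBased bool // true for legacy lvm-based OSDs
}

// buildInitContainers mirrors the branch discussed above: the expand
// init container is appended only on the raw-mode ("else") path, so
// legacy lvm-based OSDs never get it.
func buildInitContainers(osd osdInfo) []string {
	var initContainers []string
	if osd.lvmBased {
		// legacy lvm path: no expand container is added here
		initContainers = append(initContainers, "activate-lvm")
	} else {
		// raw path: analogous to line 556 in [1], which appends
		// c.getExpandPVCInitContainer(osdProps, osdID)
		initContainers = append(initContainers,
			fmt.Sprintf("expand-bluefs-osd-%d", osd.id))
	}
	return initContainers
}

func main() {
	fmt.Println(buildInitContainers(osdInfo{id: 0, lvmBased: true}))
	fmt.Println(buildInitContainers(osdInfo{id: 1, lvmBased: false}))
}
```

Under this reading, if the customer's OSDs are crashing in expand-bluefs, they took the raw path, which is why they cannot be the legacy lvm-based OSDs from the original bug.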

Comment 5 Travis Nielsen 2024-08-19 19:39:50 UTC
If we can't find the root cause, these OSDs may need to be wiped and replaced serially to avoid future issues.