This bug was initially created as a copy of Bug #2276533

I am copying this bug because:

My TAM customer ran into the issue described in KCS 7041554 when upgrading their dev cluster from 4.12.40 to 4.13.45. They require a backport to 4.13 so they can upgrade their prod clusters.

Description of problem (please be detailed as possible and provide log snippets):

The customer upgraded from 4.12.40 to 4.13.45. After the upgrade, all OSDs are in a crash loop, with the expand-bluefs init container reporting errors about devices that cannot be found.

Version of all relevant components (if applicable):

ODF 4.13

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Yes, all OSDs are down.

Is there any workaround available to the best of your knowledge?

Yes: https://access.redhat.com/solutions/7041554

Is this issue reproducible?

Yes, in the customer environment.
Actually, removing from the 4.13.z proposal while still investigating...

The original issue this was cloned from does not apply to 4.13. That code path involved legacy lvm-based OSDs failing to expand, and it required c.getExpandPVCInitContainer() to be called for the lvm-based OSDs. However, in this code snippet [1], line 556 already applies only to raw-based OSDs:

    initContainers = append(initContainers, c.getExpandPVCInitContainer(osdProps, osdID))

This code path does not apply to lvm-based OSDs, since it is already in the "else" block for "raw" OSDs. [2]

So these must not be legacy lvm-based OSDs as in the original issue. There must be some other issue causing these raw-based OSDs to fail during the resize call. If that is the case, these OSDs will likely continue to fail each time the cluster is upgraded and the OSDs are reconciled to add the resize container back.

[1] https://github.com/red-hat-storage/rook/blob/release-4.13/pkg/operator/ceph/cluster/osd/spec.go#L529-L557
[2] https://github.com/red-hat-storage/rook/commit/b489d7ae47a628497be8695ffb70606d246a578c
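To make the branch concrete, here is a minimal, self-contained Go sketch of the decision around line 556 of [1]. Only getExpandPVCInitContainer and the lvm-vs-raw split come from the linked source; the types, helper names, and container labels below are simplified stand-ins for illustration, not Rook's actual API:

    package main

    import "fmt"

    // OSDInfo stands in for Rook's OSD metadata. CVMode is "lvm" for
    // legacy ceph-volume lvm OSDs and "raw" for raw-mode OSDs.
    type OSDInfo struct {
        ID     int
        CVMode string
    }

    // initContainersFor mirrors the shape of the code in [1]: the
    // expand-bluefs container (added via getExpandPVCInitContainer in the
    // real code) is appended only in the raw ("else") branch, so legacy
    // lvm-based OSDs never reach it on 4.13.
    func initContainersFor(osd OSDInfo) []string {
        var initContainers []string
        if osd.CVMode == "lvm" {
            // Legacy lvm-based OSD: no expand container on this path.
            initContainers = append(initContainers, "config-init", "activate-lvm")
        } else {
            // Raw-mode OSD: the only path that adds expand-bluefs.
            initContainers = append(initContainers, "activate-raw", "expand-bluefs")
        }
        return initContainers
    }

    func main() {
        for _, osd := range []OSDInfo{{ID: 0, CVMode: "lvm"}, {ID: 1, CVMode: "raw"}} {
            fmt.Printf("osd.%d (%s): %v\n", osd.ID, osd.CVMode, initContainersFor(osd))
        }
    }

Running this prints the init-container list per mode, which is why crash-looping in expand-bluefs implies these OSDs are on the raw path rather than the legacy lvm path the original fix addressed.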
If we can't find the root cause, these OSDs may need to be serially wiped and replaced to avoid future issues.