Description of problem (please be as detailed as possible and provide log snippets):
After one of the storage nodes rebooted, the rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a pod is stuck in CrashLoopBackOff. ceph health detail reports:
HEALTH_WARN 1 filesystem is degraded; 1 MDSs report slow metadata IOs; Reduced data availability: 106 pgs inactive, 46 pgs peering; 258 slow ops, oldest one blocked for 141331 sec, daemons [osd.2,mon.a] have slow ops

Version of all relevant components (if applicable):
ODF 4.14.0-139.stable
OCP 4.14.0-rc.4

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
This is a lab doing pre-GA testing.

Is there any workaround available to the best of your knowledge?
The problem sounds very similar to this: https://access.redhat.com/solutions/6972994. The workaround posted there is to delete the OSD pod, but we want to identify the root cause so that no manual intervention is required.

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?
Unsure

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
I will add must-gather and ceph logs in the comments.
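For reference, a minimal sketch of how the reported symptoms can be confirmed, assuming a default ODF deployment in the openshift-storage namespace with the rook-ceph-tools toolbox enabled:

    # Check the RGW pod state and its recent logs.
    oc -n openshift-storage get pods -l app=rook-ceph-rgw
    oc -n openshift-storage logs -l app=rook-ceph-rgw --tail=100

    # Query Ceph health from the toolbox pod.
    TOOLS=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name | head -n 1)
    oc -n openshift-storage exec "$TOOLS" -- ceph health detail
    oc -n openshift-storage exec "$TOOLS" -- ceph -s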
Parth is on PTO this week; I will provide an update from the Engineering side.
Since this also happens post fresh deployment, according to comment #9, setting the bug severity to high and proposing it as a blocker for 4.15.0.
(In reply to Elad from comment #16)
> Since this happens also post fresh deployment, according to comment #9,
> setting the bug severity to high and proposing as a blocker for 4.15.0

Elad, this happens intermittently and the workaround is to just restart the Rook operator. The fix is merged upstream - https://github.com/rook/rook/pull/12817
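For anyone hitting this in the meantime, a sketch of the operator restart mentioned above, assuming the default openshift-storage namespace:

    # Restart the Rook operator deployment and wait for it to come back.
    oc -n openshift-storage rollout restart deploy/rook-ceph-operator
    oc -n openshift-storage rollout status deploy/rook-ceph-operator

    # Alternative: delete the operator pod and let the Deployment recreate it.
    oc -n openshift-storage delete pod -l app=rook-ceph-operator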
(In reply to Santosh Pillai from comment #17)
> (In reply to Elad from comment #16)
> > Since this happens also post fresh deployment, according to comment #9,
> > setting the bug severity to high and proposing as a blocker for 4.15.0
>
> Elad, this happens intermittently and workaround is to just restart the rook
> operator. The fix is merged upstream -
> https://github.com/rook/rook/pull/12817

Spoke too soon. The above PR is not the fix.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:1383