Bug 2245004

Summary: rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a does not come up after node reboot
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Logan McNaughton <lmcnaugh>
Component: rook
Assignee: Parth Arora <paarora>
Status: CLOSED ERRATA
QA Contact: Vishakha Kathole <vkathole>
Severity: high
Priority: unspecified
Version: 4.14
CC: akandath, ebenahar, jthottan, kjosy, muagarwa, nberry, odf-bz-bot, paarora, sapillai, srai, tdesala, thottanjiffin, tnielsen
Target Milestone: ---
Target Release: ODF 4.15.0
Hardware: Unspecified
OS: Unspecified
Fixed In Version: 4.15.0-123
Doc Type: No Doc Update
Last Closed: 2024-03-19 15:27:53 UTC
Type: Bug
Bug Blocks: 2254475

Description Logan McNaughton 2023-10-19 08:48:23 UTC
Description of problem (please be as detailed as possible and provide log snippets):

After one of the storage nodes rebooted, the rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a pod is stuck in CrashLoopBackOff.

ceph health detail reports HEALTH_WARN 1 filesystem is degraded; 1 MDSs report slow metadata IOs; Reduced data availability: 106 pgs inactive, 46 pgs peering; 258 slow ops, oldest one blocked for 141331 sec, daemons [osd.2,mon.a] have slow ops.
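For reference, a health summary like the one above can be pulled with the Rook toolbox; a minimal sketch, assuming the default openshift-storage namespace and that the rook-ceph-tools deployment is enabled:

```shell
# Open a shell in the Rook toolbox pod and query cluster health.
# Namespace and deployment name are the ODF defaults; adjust as needed.
oc -n openshift-storage rsh deploy/rook-ceph-tools ceph health detail

# List placement groups stuck in the inactive state (e.g. peering).
oc -n openshift-storage rsh deploy/rook-ceph-tools ceph pg dump_stuck inactive
```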

Version of all relevant components (if applicable):

ODF 4.14.0-139.stable, OCP 4.14.0-rc.4

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

This is a lab doing pre-GA testing.

Is there any workaround available to the best of your knowledge?

The problem sounds very similar to this: https://access.redhat.com/solutions/6972994. The workaround posted there is to delete the OSD pod; however, we want to identify the root cause so that no manual intervention is required.
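A sketch of that posted workaround, assuming the default openshift-storage namespace; the pod name below is a placeholder, and the linked solution (not this sketch) is the authoritative procedure:

```shell
# Find the OSD pod that is stuck (CrashLoopBackOff / not Ready).
oc -n openshift-storage get pods -l app=rook-ceph-osd

# Delete it; the owning deployment recreates the pod automatically.
# <osd-pod-name> is a placeholder for the pod identified above.
oc -n openshift-storage delete pod <osd-pod-name>
```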


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?

unsure


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1.
2.
3.


Actual results:


Expected results:


Additional info:

I will add must-gather and ceph logs in the comments

Comment 13 Jiffin 2023-10-30 10:00:53 UTC
Parth is on PTO this week; I will provide an update from the Engineering side.

Comment 16 Elad 2023-12-13 12:48:50 UTC
Since this also happens after a fresh deployment, according to comment #9, setting the bug severity to high and proposing this as a blocker for 4.15.0.

Comment 17 Santosh Pillai 2023-12-14 03:53:51 UTC
(In reply to Elad from comment #16)
> Since this happens also post fresh deployment, according to comment #9,
> setting the bug severity to high and proposing as a blocker for 4.15.0

Elad, this happens intermittently, and the workaround is just to restart the Rook operator. The fix is merged upstream: https://github.com/rook/rook/pull/12817
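A sketch of that operator restart, assuming the default openshift-storage namespace and the default operator deployment name:

```shell
# Restart the Rook operator; Kubernetes rolls out a fresh operator pod,
# which then re-reconciles the CephObjectStore. Names are the ODF defaults.
oc -n openshift-storage rollout restart deployment/rook-ceph-operator

# Confirm the new operator pod comes up.
oc -n openshift-storage get pods -l app=rook-ceph-operator
```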

Comment 20 Santosh Pillai 2023-12-18 11:41:58 UTC
(In reply to Santosh Pillai from comment #17)
> (In reply to Elad from comment #16)
> > Since this happens also post fresh deployment, according to comment #9,
> > setting the bug severity to high and proposing as a blocker for 4.15.0
> 
> Elad, this happens intermittently and workaround is to just restart the rook
> operator. The fix is merged upstream -
> https://github.com/rook/rook/pull/12817


Spoke too soon. The above PR is not the fix.

Comment 27 errata-xmlrpc 2024-03-19 15:27:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:1383