Bug 2245004

Summary: rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a does not come up after node reboot
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Logan McNaughton <lmcnaugh>
Component: rook
Assignee: Parth Arora <paarora>
Status: CLOSED ERRATA
QA Contact: Vishakha Kathole <vkathole>
Severity: high
Priority: unspecified
Version: 4.14
CC: akandath, ebenahar, jthottan, kjosy, muagarwa, nberry, odf-bz-bot, paarora, sapillai, srai, tdesala, thottanjiffin, tnielsen
Target Milestone: ---
Target Release: ODF 4.15.0
Hardware: Unspecified
OS: Unspecified
Fixed In Version: 4.15.0-123
Doc Type: No Doc Update
Last Closed: 2024-03-19 15:27:53 UTC
Type: Bug
Bug Blocks: 2254475

Description Logan McNaughton 2023-10-19 08:48:23 UTC
Description of problem (please be as detailed as possible and provide log snippets):

After one of the storage nodes rebooted, the rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a pod is stuck in CrashLoopBackOff.

ceph health detail reports HEALTH_WARN 1 filesystem is degraded; 1 MDSs report slow metadata IOs; Reduced data availability: 106 pgs inactive, 46 pgs peering; 258 slow ops, oldest one blocked for 141331 sec, daemons [osd.2,mon.a] have slow ops.
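For reference, a health summary like the one above can be pulled with the Rook toolbox; a minimal sketch, assuming the default openshift-storage namespace and that the rook-ceph-tools deployment is enabled:

```shell
# Open a shell in the Rook toolbox pod and query cluster health.
# Namespace and deployment name are the ODF defaults; adjust as needed.
oc -n openshift-storage rsh deploy/rook-ceph-tools ceph health detail

# List placement groups stuck in the inactive state (e.g. peering).
oc -n openshift-storage rsh deploy/rook-ceph-tools ceph pg dump_stuck inactive
```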

Version of all relevant components (if applicable):

ODF 4.14.0-139.stable, OCP 4.14.0-rc.4

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

This is a lab doing pre-GA testing.

Is there any workaround available to the best of your knowledge?

The problem sounds very similar to this: https://access.redhat.com/solutions/6972994. The workaround posted there is to delete the OSD pod; however, we want to identify the root cause so that no manual intervention is required.
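A sketch of that posted workaround, assuming the default openshift-storage namespace; the pod name below is a placeholder, and the linked solution (not this sketch) is the authoritative procedure:

```shell
# Find the OSD pod that is stuck (CrashLoopBackOff / not Ready).
oc -n openshift-storage get pods -l app=rook-ceph-osd

# Delete it; the owning deployment recreates the pod automatically.
# <osd-pod-name> is a placeholder for the pod identified above.
oc -n openshift-storage delete pod <osd-pod-name>
```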


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?

unsure


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1.
2.
3.


Actual results:


Expected results:


Additional info:

I will add must-gather and ceph logs in the comments

Comment 13 Jiffin 2023-10-30 10:00:53 UTC
Parth is on PTO this week; I will provide an update from the Engineering side.

Comment 16 Elad 2023-12-13 12:48:50 UTC
Since this also happens after a fresh deployment, according to comment #9, setting the bug severity to high and proposing this as a blocker for 4.15.0.

Comment 17 Santosh Pillai 2023-12-14 03:53:51 UTC
(In reply to Elad from comment #16)
> Since this happens also post fresh deployment, according to comment #9,
> setting the bug severity to high and proposing as a blocker for 4.15.0

Elad, this happens intermittently, and the workaround is just to restart the Rook operator. The fix is merged upstream: https://github.com/rook/rook/pull/12817
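A sketch of that operator restart, assuming the default openshift-storage namespace and the default operator deployment name:

```shell
# Restart the Rook operator; Kubernetes rolls out a fresh operator pod,
# which then re-reconciles the CephObjectStore. Names are the ODF defaults.
oc -n openshift-storage rollout restart deployment/rook-ceph-operator

# Confirm the new operator pod comes up.
oc -n openshift-storage get pods -l app=rook-ceph-operator
```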

Comment 20 Santosh Pillai 2023-12-18 11:41:58 UTC
(In reply to Santosh Pillai from comment #17)
> (In reply to Elad from comment #16)
> > Since this happens also post fresh deployment, according to comment #9,
> > setting the bug severity to high and proposing as a blocker for 4.15.0
> 
> Elad, this happens intermittently and workaround is to just restart the rook
> operator. The fix is merged upstream -
> https://github.com/rook/rook/pull/12817


Spoke too soon. The above PR is not the fix.

Comment 27 errata-xmlrpc 2024-03-19 15:27:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:1383