Description of problem (please be as detailed as possible and provide log snippets):
After one of the storage nodes rebooted, the rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a pod is stuck in CrashLoopBackOff. ceph health detail reports:
HEALTH_WARN 1 filesystem is degraded; 1 MDSs report slow metadata IOs; Reduced data availability: 106 pgs inactive, 46 pgs peering; 258 slow ops, oldest one blocked for 141331 sec, daemons [osd.2,mon.a] have slow ops

Version of all relevant components (if applicable):
ODF 4.14.0-139.stable
OCP 4.14.0-rc.4

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
This is a lab doing pre-GA testing.

Is there any workaround available to the best of your knowledge?
The problem sounds very similar to this: https://access.redhat.com/solutions/6972994. The workaround posted there is to delete the OSD pod, but we want to identify the root cause so that no manual intervention is required.

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?
Unsure

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
I will add must-gather and ceph logs in the comments.
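For reference, a minimal sketch of how the reported symptoms can be confirmed, assuming a default ODF deployment in the openshift-storage namespace with the rook-ceph-tools toolbox enabled:

    # Check the RGW pod state and its recent logs.
    oc -n openshift-storage get pods -l app=rook-ceph-rgw
    oc -n openshift-storage logs -l app=rook-ceph-rgw --tail=100

    # Query Ceph health from the toolbox pod.
    TOOLS=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name | head -n 1)
    oc -n openshift-storage exec "$TOOLS" -- ceph health detail
    oc -n openshift-storage exec "$TOOLS" -- ceph -s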
Parth is on PTO this week; I will provide an update from the Engineering side.
Since this also happens post fresh deployment, according to comment #9, setting the bug severity to high and proposing it as a blocker for 4.15.0.
(In reply to Elad from comment #16)
> Since this happens also post fresh deployment, according to comment #9,
> setting the bug severity to high and proposing as a blocker for 4.15.0

Elad, this happens intermittently and the workaround is to just restart the Rook operator. The fix is merged upstream - https://github.com/rook/rook/pull/12817
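For anyone hitting this in the meantime, a sketch of the operator restart mentioned above, assuming the default openshift-storage namespace:

    # Restart the Rook operator deployment and wait for it to come back.
    oc -n openshift-storage rollout restart deploy/rook-ceph-operator
    oc -n openshift-storage rollout status deploy/rook-ceph-operator

    # Alternative: delete the operator pod and let the Deployment recreate it.
    oc -n openshift-storage delete pod -l app=rook-ceph-operator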
(In reply to Santosh Pillai from comment #17)
> (In reply to Elad from comment #16)
> > Since this happens also post fresh deployment, according to comment #9,
> > setting the bug severity to high and proposing as a blocker for 4.15.0
>
> Elad, this happens intermittently and workaround is to just restart the rook
> operator. The fix is merged upstream -
> https://github.com/rook/rook/pull/12817

Spoke too soon. The above PR is not the fix.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:1383