Bug 2245004 - rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a does not come up after node reboot
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.14
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.15.0
Assignee: Parth Arora
QA Contact: Vishakha Kathole
URL:
Whiteboard:
Depends On:
Blocks: 2254475
 
Reported: 2023-10-19 08:48 UTC by Logan McNaughton
Modified: 2024-03-19 15:28 UTC (History)
13 users

Fixed In Version: 4.15.0-123
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 2254475 (view as bug list)
Environment:
Last Closed: 2024-03-19 15:27:53 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github rook rook pull 12817 0 None Merged object: improve the error handling for multisite objs 2024-01-03 07:53:10 UTC
Red Hat Knowledge Base (Solution) 7049176 0 None None None 2023-12-14 09:56:18 UTC
Red Hat Product Errata RHSA-2024:1383 0 None None None 2024-03-19 15:28:02 UTC

Description Logan McNaughton 2023-10-19 08:48:23 UTC
Description of problem (please be as detailed as possible and provide log snippets):

After one of the storage nodes rebooted, the rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a is stuck in CrashLoopBackOff.

ceph health detail reports HEALTH_WARN 1 filesystem is degraded; 1 MDSs report slow metadata IOs; Reduced data availability: 106 pgs inactive, 46 pgs peering; 258 slow ops, oldest one blocked for 141331 sec, daemons [osd.2,mon.a] have slow ops.
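For reference, the state above can be gathered with commands like the following. This is a sketch: it assumes the default `openshift-storage` namespace and the standard Rook labels (`app=rook-ceph-rgw`, `app=rook-ceph-tools`); adjust for your deployment.

```shell
# Check the RGW pod status (this report shows it stuck in CrashLoopBackOff)
oc -n openshift-storage get pods -l app=rook-ceph-rgw

# Inspect logs from the previous (crashed) RGW container instance
oc -n openshift-storage logs -l app=rook-ceph-rgw --previous --tail=100

# Query cluster health from the toolbox pod
TOOLS=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name | head -n1)
oc -n openshift-storage exec "$TOOLS" -- ceph health detail
```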

Version of all relevant components (if applicable):

ODF 4.14.0-139.stable. OCP 4.14.0-rc.4

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

This is a lab doing preGA testing

Is there any workaround available to the best of your knowledge?

The problem sounds very similar to this: https://access.redhat.com/solutions/6972994. The workaround posted there is to delete the OSD pod; however, we want to identify the root cause so that no manual intervention is required.
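For illustration, the workaround from the linked solution would look roughly like this. This is a hypothetical sketch, not a recommendation from this bug: the OSD id `2` is taken from the `ceph health detail` output above (`osd.2` has slow ops), and the `ceph-osd-id` label is the one Rook applies to OSD pods.

```shell
# Delete the affected OSD pod so its deployment recreates it
oc -n openshift-storage delete pod -l app=rook-ceph-osd,ceph-osd-id=2

# Watch the replacement pod come back up
oc -n openshift-storage get pods -l app=rook-ceph-osd,ceph-osd-id=2 -w
```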


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?

unsure


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1.
2.
3.


Actual results:


Expected results:


Additional info:

I will add must-gather and ceph logs in the comments

Comment 13 Jiffin 2023-10-30 10:00:53 UTC
Parth is on PTO this week; I will provide an update from the Engineering side.

Comment 16 Elad 2023-12-13 12:48:50 UTC
Since this also happens after a fresh deployment, according to comment #9, I am setting the bug severity to high and proposing it as a blocker for 4.15.0.

Comment 17 Santosh Pillai 2023-12-14 03:53:51 UTC
(In reply to Elad from comment #16)
> Since this happens also post fresh deployment, according to comment #9,
> setting the bug severity to high and proposing as a blocker for 4.15.0

Elad, this happens intermittently, and the workaround is to just restart the Rook operator. The fix is merged upstream: https://github.com/rook/rook/pull/12817
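The operator restart mentioned here can be done by bouncing the `rook-ceph-operator` deployment so it re-reconciles the object store. A minimal sketch, assuming the default `openshift-storage` namespace:

```shell
# Restart the Rook operator and wait for it to come back
oc -n openshift-storage rollout restart deployment/rook-ceph-operator
oc -n openshift-storage rollout status deployment/rook-ceph-operator --timeout=120s
```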

Comment 20 Santosh Pillai 2023-12-18 11:41:58 UTC
(In reply to Santosh Pillai from comment #17)
> (In reply to Elad from comment #16)
> > Since this happens also post fresh deployment, according to comment #9,
> > setting the bug severity to high and proposing as a blocker for 4.15.0
> 
> Elad, this happens intermittently and workaround is to just restart the rook
> operator. The fix is merged upstream -
> https://github.com/rook/rook/pull/12817


Spoke too soon. The above PR is not the fix.

Comment 27 errata-xmlrpc 2024-03-19 15:27:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:1383

