Bug 2276824

Summary: [4.15.z clone] Expose the upgrade setting for a longer timeout waiting for healthy OSDs before continuing
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Nikhil Ladha <nladha>
Component: ocs-operator
Assignee: Nikhil Ladha <nladha>
Status: CLOSED ERRATA
QA Contact: Petr Balogh <pbalogh>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 4.11
CC: asriram, bkunal, etamir, kramdoss, nberry, odf-bz-bot, pbalogh, tnielsen
Target Milestone: ---
Flags: kramdoss: needinfo+
Target Release: ODF 4.15.3
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: 4.15.3-9
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 2276694
Environment:
Last Closed: 2024-06-11 16:41:47 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 2276694
Bug Blocks:

Description Nikhil Ladha 2024-04-24 06:16:02 UTC
+++ This bug was initially created as a clone of Bug #2276694 +++

Description of problem (please be as detailed as possible and provide log
snippets):

During upgrades, ODF currently waits up to 10 minutes after upgrading each OSD to verify that the OSD and its placement groups (PGs) are healthy before continuing. If the PGs are all healthy within 10 minutes, the upgrade continues without issue. If the PGs are still unhealthy 10 minutes after an OSD is upgraded, Rook proceeds with the upgrade of the next OSD anyway. If multiple OSDs end up being down at the same time, this can temporarily cause data availability issues.

Rook has an option, waitTimeoutForHealthyOSDInMinutes, to increase this 10-minute timeout. We need to expose it in the StorageCluster CR to give the customer flexibility over this sensitivity.

For more discussion see https://issues.redhat.com/browse/RHSTOR-5734.
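
For illustration, the underlying Rook setting lives on the CephCluster CR. A minimal sketch of what it looks like is below (resource name and the 30-minute value are examples only; in an ODF cluster the CephCluster CR is normally owned and reconciled by ocs-operator, which is why the setting needs to be exposed through the StorageCluster):

  # Sketch of the underlying Rook setting; the 30-minute value is an
  # arbitrary example (Rook's default is 10 minutes).
  apiVersion: ceph.rook.io/v1
  kind: CephCluster
  metadata:
    name: ocs-storagecluster-cephcluster   # typical ODF-managed CephCluster name
    namespace: openshift-storage
  spec:
    # Minutes Rook waits for PGs to become healthy after upgrading an OSD
    # before it moves on to the next OSD.
    waitTimeoutForHealthyOSDInMinutes: 30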


Version of all relevant components (if applicable):

All ODF versions currently have this 10-minute timeout.


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

No

Is there any workaround available to the best of your knowledge?

No, except by editing the CephCluster directly and disabling the cluster reconcile (a very heavy-handed approach with other side effects).


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?

Yes

Can this issue be reproduced from the UI?

NA

If this is a regression, please provide more details to justify this:

NA

Steps to Reproduce:
1. Install ODF
2. Upgrade ODF while PGs are unhealthy
3. Observe an I/O pause while multiple OSDs are down at the same time.


Actual results:

An I/O pause occurs during the upgrade if the OSDs do not become healthy within the timeout.


Expected results:

Full data availability during upgrades.


Additional info:
N.A

Comment 3 krishnaram Karthick 2024-05-02 11:33:11 UTC
Moving the bug to 4.15.4 as we have reached the limit on bug intake for 4.15.3.

Comment 17 Petr Balogh 2024-06-04 15:31:13 UTC
Based on the validation I did here: https://bugzilla.redhat.com/show_bug.cgi?id=2276694#c25

The upgrade from 4.15 to 4.16 (both latest builds, with the latest fix) passed, so I am marking this as verified.

However, there is still an issue: the value is not propagated on a fresh deployment when it is defined in the initially created StorageCluster CR.

This is not a blocker, since the user can patch the value on an existing cluster.
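
A minimal sketch of what patching an existing cluster might look like, assuming the setting is exposed under spec.managedResources.cephCluster in the StorageCluster CR (the exact field path and resource name are assumptions; verify against the StorageCluster CRD shipped with 4.15.3):

  apiVersion: ocs.openshift.io/v1
  kind: StorageCluster
  metadata:
    name: ocs-storagecluster          # typical default name, assumed
    namespace: openshift-storage
  spec:
    managedResources:
      cephCluster:
        # Assumed field location; intended to be propagated by ocs-operator
        # to the CephCluster's waitTimeoutForHealthyOSDInMinutes setting.
        waitTimeoutForHealthyOSDInMinutes: 30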

Comment 22 errata-xmlrpc 2024-06-11 16:41:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Data Foundation 4.15.3 Bug Fix Update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2024:3806