Bug 1833153

Summary:	add a variable for sleep time of rook operator between checks of downed OSD+Node.
Product:	[Red Hat Storage] Red Hat OpenShift Container Storage	Reporter:	svolkov
Component:	rook	Assignee:	Sébastien Han <shan>
Status:	CLOSED ERRATA	QA Contact:	Shrivaibavi Raghaventhiran <sraghave>
Severity:	medium	Docs Contact:
Priority:	unspecified
Version:	4.4	CC:	hnallurv, madam, muagarwa, nberry, ocs-bugs, ratamir, shan, tnielsen
Target Milestone:	---
Target Release:	OCS 4.6.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	No Doc Update
Doc Text:	If this bug requires documentation, please select an appropriate Doc Type value.	Story Points:	---
Clone Of:		Environment:
Last Closed:	2020-12-17 06:22:30 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1830015
Bug Blocks:

Description svolkov 2020-05-07 22:34:14 UTC

Description of problem (please be as detailed as possible and provide log
snippets):

adding to https://bugzilla.redhat.com/show_bug.cgi?id=1830015

It would be best if the interval at which Rook checks for a down OSD plus a down node were a variable in the storage cluster CR.

That way, if there are special cases that need a longer or shorter interval between checks for these events, we or a customer will be able to control it.

Version of all relevant components (if applicable):
OCS 4.2, 4.3, 4.4

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
No

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
3

Is this issue reproducible?
yes

Can this issue be reproduced from the UI?
yes

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
same steps as https://bugzilla.redhat.com/show_bug.cgi?id=1830015

Actual results:


Expected results:


Additional info:

Comment 2 Travis Nielsen 2020-05-08 23:58:50 UTC

Moving to 4.5. The timeout of 5 minutes will do for 4.4.

Comment 3 Travis Nielsen 2020-05-28 21:48:57 UTC

With https://github.com/rook/rook/pull/5556 I'm proposing setting the time check interval to 60s. This will follow the pattern of checking mon health in about the same interval (45s).

@Sagy, do you still see a need to make this configurable?

Comment 4 Travis Nielsen 2020-06-01 19:35:01 UTC

The change to query osd status every 60s was merged downstream for 4.5 with https://github.com/openshift/rook/pull/65.

Comment 5 svolkov 2020-06-22 04:38:53 UTC

(In reply to Travis Nielsen from comment #3)
> With https://github.com/rook/rook/pull/5556 I'm proposing setting the time
> check interval to 60s. This will follow the pattern of checking mon health
> in about the same interval (45s).
> 
> @Sagy, do you still see a need to make this configurable?

I would make this configurable. It will also help in QE testing and in POCs,
not to mention that it actually gives the customer the ability to control the failure.

Comment 6 Elad 2020-06-25 09:47:43 UTC

Hi Travis,

Which variable will be used for this purpose?

Comment 9 Travis Nielsen 2020-06-25 21:16:53 UTC

Moving back to ASSIGNED to add the variable instead of simply leaving it at the constant of 60s.

Comment 10 Travis Nielsen 2020-06-30 14:28:29 UTC

Moving to 4.6 since it's not blocking.

Comment 13 Sébastien Han 2020-07-22 09:35:23 UTC

Done in https://github.com/rook/rook/pull/5789 and resynced with https://github.com/openshift/rook/pull/85

Comment 14 Neha Berry 2020-09-24 13:20:53 UTC

Hi Sagy,

Since this was also a special ask for a POC, would you like to confirm on the latest 4.6 that the fix is what you had asked for?

Comment 16 svolkov 2020-10-01 04:11:52 UTC

Neha,

This was not a request for a POC; it is something I'm sure many customers will use. I will test this and reply.

Comment 17 Shrivaibavi Raghaventhiran 2020-11-03 09:11:46 UTC

@svolkov Any updates on the BZ? Did you get a chance to test this?

Comment 19 Shrivaibavi Raghaventhiran 2020-12-08 12:45:11 UTC

Tested versions:
---------------
OCS - 4.6.0-rc5
OCP - 4.6

I did not get proper steps to verify this BZ, so
I followed the steps to reproduce mentioned in BZ https://bugzilla.redhat.com/show_bug.cgi?id=1830015.
We also did not hit any issue during automation runs on tier4.


Based on the above explanation moving this BZ to "SANITY VERIFIED".

Comment 20 Travis Nielsen 2020-12-08 15:47:54 UTC

In PR [1], the interval for checking OSD health was made configurable in the CephCluster CR, with this default:

healthCheck:
  daemonHealth:
    osd:
      disabled: false
      interval: 60s

See the documentation [2].

However, this setting is not exposed for OCS yet. I'd suggest a new BZ for that. 

[1] https://github.com/rook/rook/pull/5789
[2] https://rook.github.io/docs/rook/v1.5/ceph-cluster-crd.html#health-settings
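
For context, here is a minimal sketch of where the health check setting sits inside a CephCluster CR on upstream Rook. The metadata name and namespace and the 120s value are illustrative assumptions and are not taken from this BZ. The fragment leaves out the other spec fields a real cluster needs. Since OCS does not expose the setting yet, this only applies to a CephCluster that is managed directly through Rook.

```yaml
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph          # assumed name, for illustration only
  namespace: rook-ceph     # assumed namespace, for illustration only
spec:
  # Other required spec fields are omitted from this fragment.
  healthCheck:
    daemonHealth:
      osd:
        disabled: false    # keep the periodic OSD health check enabled
        interval: 120s     # illustrative override; the default is 60s
```

A shorter interval should detect a down OSD sooner at the cost of more frequent status queries, while a longer interval does the opposite.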

Comment 22 errata-xmlrpc 2020-12-17 06:22:30 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.6.0 security, bug fix, enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5605