Bug 1833153 - add a variable for sleep time of rook operator between checks of downed OSD+Node.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: rook
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: OCS 4.6.0
Assignee: Sébastien Han
QA Contact: Shrivaibavi Raghaventhiran
URL:
Whiteboard:
Depends On: 1830015
Blocks:
 
Reported: 2020-05-07 22:34 UTC by svolkov
Modified: 2020-12-17 06:22 UTC
CC: 8 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
If this bug requires documentation, please select an appropriate Doc Type value.
Clone Of:
Environment:
Last Closed: 2020-12-17 06:22:30 UTC
Embargoed:




Links:
GitHub openshift/rook pull 85 (closed): resync 4.6 with rook master (last updated 2020-12-11 08:08:11 UTC)
GitHub rook/rook pull 5789 (closed): ceph: configurable status checks and livenessprobe (last updated 2020-12-11 08:08:12 UTC)
Red Hat Product Errata RHSA-2020:5605 (last updated 2020-12-17 06:22:47 UTC)

Description svolkov 2020-05-07 22:34:14 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

Adding to https://bugzilla.redhat.com/show_bug.cgi?id=1830015:

It would be best if the interval at which Rook checks for a down OSD plus a down node were a variable in the storage cluster CR.

That way, for any special cases where we need a longer or shorter period between checks for these events, we or a customer would be able to control it.
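For illustration only, a minimal sketch of what such a knob could look like in a StorageCluster CR. The field name osdStatusCheckInterval and its placement are hypothetical, not a shipped API, and the names assume a default OCS install; the setting that was eventually implemented lives in the CephCluster CR healthCheck section (see comment 20):

apiVersion: ocs.openshift.io/v1
kind: StorageCluster
metadata:
  name: ocs-storagecluster
spec:
  # Hypothetical field: how long the rook operator waits between checks
  # for a down OSD on a down node before taking action.
  osdStatusCheckInterval: 60s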

Version of all relevant components (if applicable):
OCS 4.2, 4.3, 4.4

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
No

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
3

Is this issue reproducible?
yes

Can this issue be reproduced from the UI?
yes

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
same steps as https://bugzilla.redhat.com/show_bug.cgi?id=1830015

Actual results:


Expected results:


Additional info:

Comment 2 Travis Nielsen 2020-05-08 23:58:50 UTC
Moving to 4.5. The timeout of 5 minutes will do for 4.4.

Comment 3 Travis Nielsen 2020-05-28 21:48:57 UTC
With https://github.com/rook/rook/pull/5556 I'm proposing setting the status check interval to 60s. This follows the pattern of checking mon health at about the same interval (45s).

@Sagy, do you still see a need to make this configurable?

Comment 4 Travis Nielsen 2020-06-01 19:35:01 UTC
The change to query osd status every 60s was merged downstream for 4.5 with https://github.com/openshift/rook/pull/65.

Comment 5 svolkov 2020-06-22 04:38:53 UTC
(In reply to Travis Nielsen from comment #3)
> With https://github.com/rook/rook/pull/5556 I'm proposing setting the status
> check interval to 60s. This follows the pattern of checking mon health
> at about the same interval (45s).
> 
> @Sagy, do you still see a need to make this configurable?

I would make this configurable. It will also help in QE testing and in POCs,
not to mention it gives the customer the ability to control the failure handling.

Comment 6 Elad 2020-06-25 09:47:43 UTC
Hi Travis,

Which variable will be used for this purpose?

Comment 9 Travis Nielsen 2020-06-25 21:16:53 UTC
Moving back to ASSIGNED to add the variable instead of simply leaving it at the constant 60s.

Comment 10 Travis Nielsen 2020-06-30 14:28:29 UTC
Moving to 4.6 since it's not blocking.

Comment 13 Sébastien Han 2020-07-22 09:35:23 UTC
Done in https://github.com/rook/rook/pull/5789 and resynced with https://github.com/openshift/rook/pull/85

Comment 14 Neha Berry 2020-09-24 13:20:53 UTC
Hi Sagy,

Since this was a special ask for a POC as well, would you like to confirm on the latest 4.6 whether the fix is what you had asked for?

Comment 16 svolkov 2020-10-01 04:11:52 UTC
Neha,

This was not a request for a POC; it is something I'm sure many customers will use. I will test this and reply.

Comment 17 Shrivaibavi Raghaventhiran 2020-11-03 09:11:46 UTC
@svolkov Any updates on the BZ? Did you get a chance to test this?

Comment 19 Shrivaibavi Raghaventhiran 2020-12-08 12:45:11 UTC
Tested versions:
---------------
OCS - 4.6.0-rc5
OCP - 4.6

Did not get proper steps to verify this BZ,
so I followed the steps to reproduce mentioned in BZ https://bugzilla.redhat.com/show_bug.cgi?id=1830015,
and we also did not hit any issues during automation runs on tier4.


Based on the above, moving this BZ to "SANITY VERIFIED".

Comment 20 Travis Nielsen 2020-12-08 15:47:54 UTC
In PR [1] the interval at which OSD health is checked was made configurable in the CephCluster CR, with this default:

healthCheck:
  daemonHealth:
    osd:
      disabled: false
      interval: 60s

See the documentation [2].

However, this setting is not exposed for OCS yet. I'd suggest a new BZ for that. 

[1] https://github.com/rook/rook/pull/5789
[2] https://rook.github.io/docs/rook/v1.5/ceph-cluster-crd.html#health-settings
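
For reference, a sketch of the same healthCheck section with the interval overridden. The 10m value is only an example, and since the CephCluster CR on OCS is managed by the operator, a direct edit may be reconciled away, which is why exposing this through OCS needs the separate BZ mentioned above:

healthCheck:
  daemonHealth:
    osd:
      disabled: false
      # Example override: check OSD/node status every 10 minutes instead of the 60s default.
      interval: 10m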

Comment 22 errata-xmlrpc 2020-12-17 06:22:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.6.0 security, bug fix, enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5605

