Description of problem (please be as detailed as possible and provide log snippets):

Adding to https://bugzilla.redhat.com/show_bug.cgi?id=1830015: it would be best if the loop interval at which Rook checks for a down OSD or a down node were a variable in the storage cluster CR. That way, in any special case where a longer or shorter check period is needed, we or a customer would be able to control it.

Version of all relevant components (if applicable):
OCS 4.2, 4.3, 4.4

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
No

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
3

Can this issue be reproduced?
Yes

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
Same steps as https://bugzilla.redhat.com/show_bug.cgi?id=1830015

Actual results:

Expected results:

Additional info:
Moving to 4.5. The timeout of 5 minutes will do for 4.4.
With https://github.com/rook/rook/pull/5556 I'm proposing setting the time check interval to 60s. This will follow the pattern of checking mon health in about the same interval (45s). @Sagy, do you still see a need to make this configurable?
The change to query osd status every 60s was merged downstream for 4.5 with https://github.com/openshift/rook/pull/65.
(In reply to Travis Nielsen from comment #3)
> With https://github.com/rook/rook/pull/5556 I'm proposing setting the time
> check interval to 60s. This will follow the pattern of checking mon health
> in about the same interval (45s).
>
> @Sagy, do you still see a need to make this configurable?

I would make this configurable. It will also help in QE testing and in POCs, not to mention it actually gives the customer the ability to control the failure handling.
Hi Travis,

Which variable will be used for this purpose?
Moving back to assigned to add the variable instead of simply leaving it at the constant of 60s.
Moving to 4.6 since it's not blocking.
Done in https://github.com/rook/rook/pull/5789 and resynced with https://github.com/openshift/rook/pull/85
Hi Sagy,

Since this was a special ask for a POC as well, would you like to confirm on the latest 4.6 whether the fix is what you had asked for?
Neha,

This was not a request for a POC; it is something I'm sure many customers will use, but I will test this and reply.
@svolkov Any updates on the BZ? Did you get a chance to test this?
Tested versions:
---------------
OCS - 4.6.0-rc5
OCP - 4.6

Did not get the proper steps to verify this BZ. I followed the steps to reproduce mentioned in BZ https://bugzilla.redhat.com/show_bug.cgi?id=1830015, and we did not hit any issues during automation runs on tier4.

Based on the above, moving this BZ to "SANITY VERIFIED".
In PR [1] the interval at which OSD health is checked was made configurable in the CephCluster CR, with this default:

  healthCheck:
    daemonHealth:
      osd:
        disabled: false
        interval: 60s

See the documentation [2]. However, this setting is not exposed for OCS yet; I'd suggest a new BZ for that.

[1] https://github.com/rook/rook/pull/5789
[2] https://rook.github.io/docs/rook/v1.5/ceph-cluster-crd.html#health-settings
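For reference, a minimal sketch of how these health settings could look in a CephCluster CR, based on the Rook v1.5 health settings documented in [2]. The cluster name/namespace are illustrative, and the mon entry is included only to show the related 45s mon check mentioned in comment #3; the osd interval is the setting this BZ is about.

  apiVersion: ceph.rook.io/v1
  kind: CephCluster
  metadata:
    name: rook-ceph        # illustrative name
    namespace: rook-ceph   # illustrative namespace
  spec:
    # How often the operator re-checks daemon health.
    healthCheck:
      daemonHealth:
        mon:
          disabled: false
          interval: 45s    # mon health check cadence noted in comment #3
        osd:
          disabled: false
          interval: 60s    # default; raise or lower for special cases (QE testing, POCs, etc.)

Lowering the osd interval should make the operator notice a down OSD sooner, at the cost of more frequent status queries; please verify the exact field names and defaults against the documentation in [2].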
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.6.0 security, bug fix, enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5605