Description of problem (please be as detailed as possible and provide log snippets):

[DR] When mirroring is enabled, the rbd-mirror daemon restart config should be set automatically.

Version of all relevant components (if applicable):
OCP version: 4.10.0-0.nightly-2022-05-26-102501
ODF version: 4.10.3-4
Ceph version: ceph version 16.2.7-112.el8cp (e18db2ff03ac60c64a18f3315c032b9d5a0a3b8f) pacific (stable)

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1.
2.
3.

Actual results:
Currently, the user has to run 'ceph config set global rbd_mirror_die_after_seconds 3600' via the toolbox. That is not acceptable for customers, as they don't have toolbox access.

Expected results:
When mirroring is enabled for RDR, 'rbd_mirror_die_after_seconds' should be set to 3600 ('ceph config set global rbd_mirror_die_after_seconds 3600') without any user action.

Additional info:
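For reference, a minimal sketch of the current toolbox workaround, assuming the default rook-ceph-tools deployment in the openshift-storage namespace:

    # Open a shell in the Ceph toolbox pod (deployment name/namespace assumed)
    oc rsh -n openshift-storage deploy/rook-ceph-tools

    # Make rbd-mirror daemons exit (and get restarted) every hour
    ceph config set global rbd_mirror_die_after_seconds 3600

    # Verify the setting is present in the mon config database
    ceph config dump | grep rbd_mirror_die_after_seconds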
Does this variable need to be set in the rook ceph configmap which we maintain in ocs-operator? If yes, then why can't we ask the customer to update the configmap manually instead of using the toolbox?
(In reply to Mudit Agarwal from comment #2)
> Does this variable need to be set in the rook ceph configmap which we
> maintain in ocs-operator?
> If yes, then why can't we ask the customer to update the configmap manually
> instead of using the toolbox?

We can document the same, i.e. edit the configmap and set the value accordingly. Are we editing the rook configmap [1] directly, or is there a section in the RH docs for ocs-operator?

[1] https://github.com/rook/rook/blob/master/Documentation/Storage-Configuration/Advanced/ceph-configuration.md#custom-cephconf-settings
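For illustration, a sketch of the configmap route described in [1], assuming the override configmap is named rook-config-override and lives in the openshift-storage namespace (editing the existing configmap with 'oc edit' is equivalent):

    # Rook merges the 'config' key into each daemon's ceph.conf;
    # daemons pick it up on restart
    oc apply -n openshift-storage -f - <<EOF
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: rook-config-override
    data:
      config: |
        [global]
        rbd_mirror_die_after_seconds = 3600
    EOF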
Not a 4.11 blocker
> Are we editing the rook configmap [1] directly, or is there a section in the RH docs for ocs-operator?

AFAIK, we are editing it directly. Some customers and support folks have done that in the past, but I believe that is not a recommended process and we don't document it anywhere. Eran or Jose can keep me honest here.
@vkolli @aclewett can you provide your input?
This is IMHO important for the RDR TP. Checked with Ilya that this workaround is still needed with Ceph 5.1z2.
@prsurve Can you please share any more details you have on this, and can you also provide the steps to reproduce it?
@mparida automatic RBD mirror daemon restart is meant to be a temporary workaround for some RBD mirroring related bugs, so no steps to reproduce are needed. The task is to run 'ceph config set global rbd_mirror_die_after_seconds 3600' on the ODF side when mirroring for RDR is enabled; no user interaction should be needed. The effect is that the RBD mirror pod is then restarted automatically, which works around some of the known issues.
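A quick way to observe the effect once the option is set; the pod label below is an assumption based on rook's usual naming conventions:

    # Watch the rbd-mirror pod; its RESTARTS count should increase
    # roughly every 3600 seconds once the option is in effect
    oc get pods -n openshift-storage -l app=rook-ceph-rbd-mirror -w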
We discussed this with PM, QE and Eng and want to add debugging enablement to this as well, just for rbd-mirror. Namely, for all rbd-mirror daemons set:

debug_ms 1
debug_rbd 20
debug_rbd_mirror 30
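A sketch of how these could be applied from the toolbox; the client.rbd-mirror.a and client.rbd-mirror-peer entity names follow from comment 13 below and may differ per cluster:

    # Scope the debug settings to the rbd-mirror entities only
    for who in client.rbd-mirror.a client.rbd-mirror-peer; do
        ceph config set "$who" debug_ms 1
        ceph config set "$who" debug_rbd 20
        ceph config set "$who" debug_rbd_mirror 30
    done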
Josh, are these parameters only for DR clusters? If these are enabled on a non-DR cluster, then what will be the impact? Right now, there is no way by which ocs-operator can distinguish between DR and non-DR clusters while applying these changes. Moving it back to 4.11 and tagging it as a blocker.
(In reply to Ilya Dryomov from comment #13)
> (In reply to Mudit Agarwal from comment #11)
> > Josh, are these parameters only for DR clusters?
> > If these are enabled on a non-DR cluster, then what will be the impact?
>
> It should be fine to apply my version on non-DR clusters as well (subject to
> QE verification, of course).
>
> There would be no client.rbd-mirror.a and client.rbd-mirror-peer entities on
> non-DR clusters, so the debug_ms, debug_rbd and debug_rbd_mirror settings
> would effectively be canceled out.
>
> mgr/rbd_support/log_level would take effect, but it shouldn't be too verbose.
> Do pay attention to the ceph-mgr log file size when verifying this change,
> just in case.

Rook rotates all daemon logs, so log size should not be an issue. As Ilya says, we don't need to make a distinction between DR/non-DR here, since these options only affect components used during DR, and the rbd_support mgr debugging has no impact on performance.
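For completeness, a sketch of setting the mgr module option mentioned above from the toolbox; the 'debug' value and the mgr deployment name are assumptions:

    # Raise the rbd_support mgr module log level
    ceph config set mgr mgr/rbd_support/log_level debug

    # Watch the active mgr's log output while verifying verbosity
    oc logs -n openshift-storage deploy/rook-ceph-mgr-a -f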
Thanks for clarifying these doubts. Though I have one more concern: if you look at the existing parameters in the list https://github.com/red-hat-storage/ocs-operator/blob/f6ef1482dc4b3a3fd6e5d624ee7c05dacb598f8d/controllers/storagecluster/cephconfig.go#L30 we have always added simple, straightforward parameters, while the config parameters mentioned in the comments above look complex. Not sure if those can be added seamlessly; Malay needs to investigate. This will take some more time than expected.
To be clear, are all the settings from comment 12 expected to be in production, or are they just for QE environments? If it's just for QE, then those settings should simply be applied in the toolbox during testing. For example, setting such a high log level sounds like something for a QE environment only.
- Anything added to the configmap [1] is permanent, and the OCS operator will enforce it during reconcile.
- Logging to file should already be enabled for the daemon, so there is no need to set the log_file.

[1] https://github.com/red-hat-storage/ocs-operator/blob/f6ef1482dc4b3a3fd6e5d624ee7c05dacb598f8d/controllers/storagecluster/cephconfig.go#L30
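If these do end up as QE-only settings, a sketch of cleaning them up from the toolbox after testing (entity names as in comment 13):

    # Remove the temporary debug settings once testing is done
    for who in client.rbd-mirror.a client.rbd-mirror-peer; do
        ceph config rm "$who" debug_ms
        ceph config rm "$who" debug_rbd
        ceph config rm "$who" debug_rbd_mirror
    done
    ceph config rm mgr mgr/rbd_support/log_level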
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.11.0 security, enhancement, & bugfix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:6156