Description of problem (please be as detailed as possible and provide log snippets):

[DR] When mirroring is enabled, the rbd-mirror daemon restart config should be set automatically.

Version of all relevant components (if applicable):
OCP version: 4.10.0-0.nightly-2022-05-26-102501
ODF version: 4.10.3-4
Ceph version: ceph version 16.2.7-112.el8cp (e18db2ff03ac60c64a18f3315c032b9d5a0a3b8f) pacific (stable)

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1.
2.
3.

Actual results:
Currently, the user has to run 'ceph config set global rbd_mirror_die_after_seconds 3600' via the toolbox. That is not acceptable for customers, as they don't have toolbox access.

Expected results:
When mirroring is enabled for RDR, 'rbd_mirror_die_after_seconds' should be set to 3600 ('ceph config set global rbd_mirror_die_after_seconds 3600') without any user action.

Additional info:
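For reference, a minimal sketch of the current toolbox workaround, assuming the default rook-ceph-tools deployment in the openshift-storage namespace:

    # Open a shell in the Ceph toolbox pod (deployment name/namespace assumed)
    oc rsh -n openshift-storage deploy/rook-ceph-tools

    # Make rbd-mirror daemons exit (and get restarted) every hour
    ceph config set global rbd_mirror_die_after_seconds 3600

    # Verify the setting is present in the mon config database
    ceph config dump | grep rbd_mirror_die_after_seconds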
Does this variable need to be set in the rook ceph configmap which we maintain in ocs-operator? If yes, then why can't we ask the customer to update the configmap manually instead of using the toolbox?
(In reply to Mudit Agarwal from comment #2)
> Does this variable need to be set in the rook ceph configmap which we
> maintain in ocs-operator?
> If yes, then why can't we ask the customer to update the configmap manually
> instead of using the toolbox?

We can document the same, i.e. edit the configmap and set the value accordingly. Are we editing the rook configmap [1] directly, or is there a section in the RH docs for ocs-operator?

[1] https://github.com/rook/rook/blob/master/Documentation/Storage-Configuration/Advanced/ceph-configuration.md#custom-cephconf-settings
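For illustration, a sketch of the configmap route described in [1], assuming the override configmap is named rook-config-override and lives in the openshift-storage namespace (editing the existing configmap with 'oc edit' is equivalent):

    # Rook merges the 'config' key into each daemon's ceph.conf;
    # daemons pick it up on restart
    oc apply -n openshift-storage -f - <<EOF
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: rook-config-override
    data:
      config: |
        [global]
        rbd_mirror_die_after_seconds = 3600
    EOF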
Not a 4.11 blocker
> Are we editing the rook configmap [1] directly, or is there a section in the RH docs for ocs-operator?

AFAIK, we are editing it directly. Some customers and support folks have done that in the past, but I believe that is not a recommended process and we don't document it anywhere. Eran or Jose can keep me honest here.
@vkolli @aclewett can you provide your input?
This is IMHO important for the RDR TP. Checked with Ilya that this workaround is still needed with Ceph 5.1z2.
@prsurve Can you please share any more details you have on this, and can you also provide the steps to reproduce it?
@mparida automatic RBD mirror daemon restart is meant to be a temporary workaround for some RBD mirroring related bugs, so no steps to reproduce are needed. The task is to run 'ceph config set global rbd_mirror_die_after_seconds 3600' on the ODF side when mirroring for RDR is enabled; no user interaction should be needed. The effect is that the RBD mirror pod is then restarted automatically, which works around some of the known issues.
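A quick way to observe the effect once the option is set; the pod label below is an assumption based on rook's usual naming conventions:

    # Watch the rbd-mirror pod; its RESTARTS count should increase
    # roughly every 3600 seconds once the option is in effect
    oc get pods -n openshift-storage -l app=rook-ceph-rbd-mirror -w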
We discussed this with PM, QE and Eng and want to add debugging enablement to this as well, just for rbd-mirror. Namely, for all rbd-mirror daemons set:

debug_ms 1
debug_rbd 20
debug_rbd_mirror 30
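A sketch of how these could be applied from the toolbox; the client.rbd-mirror.a and client.rbd-mirror-peer entity names follow from comment 13 below and may differ per cluster:

    # Scope the debug settings to the rbd-mirror entities only
    for who in client.rbd-mirror.a client.rbd-mirror-peer; do
        ceph config set "$who" debug_ms 1
        ceph config set "$who" debug_rbd 20
        ceph config set "$who" debug_rbd_mirror 30
    done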
Josh, are these parameters only for DR clusters? If these are enabled on a non-DR cluster, then what will be the impact? Right now, there is no way by which ocs-operator can distinguish between DR and non-DR clusters while applying these changes. Moving it back to 4.11 and tagging it as a blocker.
(In reply to Ilya Dryomov from comment #13)
> (In reply to Mudit Agarwal from comment #11)
> > Josh, are these parameters only for DR clusters?
> > If these are enabled on a non-DR cluster, then what will be the impact?
>
> It should be fine to apply my version on non-DR clusters as well (subject to
> QE verification, of course).
>
> There would be no client.rbd-mirror.a and client.rbd-mirror-peer entities on
> non-DR clusters, so the debug_ms, debug_rbd and debug_rbd_mirror settings
> would effectively be canceled out.
>
> mgr/rbd_support/log_level would take effect, but it shouldn't be too verbose.
> Do pay attention to the ceph-mgr log file size when verifying this change,
> just in case.

Rook rotates all daemon logs, so log size should not be an issue. As Ilya says, we don't need to make a distinction between DR/non-DR here, since these options only affect components used during DR, and the rbd_support mgr debugging has no impact on performance.
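For completeness, a sketch of setting the mgr module option mentioned above from the toolbox; the 'debug' value and the mgr deployment name are assumptions:

    # Raise the rbd_support mgr module log level
    ceph config set mgr mgr/rbd_support/log_level debug

    # Watch the active mgr's log output while verifying verbosity
    oc logs -n openshift-storage deploy/rook-ceph-mgr-a -f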
Thanks for clarifying these doubts. Though I have one more concern: if you look at the existing parameters in the list https://github.com/red-hat-storage/ocs-operator/blob/f6ef1482dc4b3a3fd6e5d624ee7c05dacb598f8d/controllers/storagecluster/cephconfig.go#L30 we have always added simple, straightforward parameters, while the config parameters mentioned in the comments above look complex. Not sure if those can be added seamlessly; Malay needs to investigate. This will take some more time than expected.
To be clear, are all the settings from comment 12 expected to be in production, or are they just for QE environments? If it's just for QE, then those settings should simply be applied in the toolbox during testing. For example, setting such a high log level sounds like something for a QE environment only.
- Anything added to the configmap [1] is permanent, and the OCS operator will enforce it during reconcile.
- Logging to file should already be enabled for the daemon, so there is no need to set the log_file.

[1] https://github.com/red-hat-storage/ocs-operator/blob/f6ef1482dc4b3a3fd6e5d624ee7c05dacb598f8d/controllers/storagecluster/cephconfig.go#L30
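If these do end up as QE-only settings, a sketch of cleaning them up from the toolbox after testing (entity names as in comment 13):

    # Remove the temporary debug settings once testing is done
    for who in client.rbd-mirror.a client.rbd-mirror-peer; do
        ceph config rm "$who" debug_ms
        ceph config rm "$who" debug_rbd
        ceph config rm "$who" debug_rbd_mirror
    done
    ceph config rm mgr mgr/rbd_support/log_level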
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.11.0 security, enhancement, & bugfix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:6156