Bug 2127186
| Summary: | [RDR] Enable RBD mirroring debugging by default | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Karolin Seeger <kseeger> |
| Component: | ocs-operator | Assignee: | Malay Kumar parida <mparida> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Pratik Surve <prsurve> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.11 | CC: | amagrawa, idryomov, jdurgin, kramdoss, mmuench, mparida, muagarwa, ocs-bugs, odf-bz-bot, prsurve, sostapov, tnielsen |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | ODF 4.12.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | 4.12.0-114 | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-02-08 14:06:28 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 2142763 | | |
| Bug Blocks: | | | |
Description
Karolin Seeger
2022-09-15 15:16:36 UTC
Details tracked in https://bugzilla.redhat.com/show_bug.cgi?id=2093266.

<---- snip ---->

Expanding on Josh's comment, an equivalent of the following is needed:

```
$ ceph config set client.rbd-mirror.a debug_ms 1
$ ceph config set client.rbd-mirror.a debug_rbd 20
$ ceph config set client.rbd-mirror.a debug_rbd_mirror 30
$ ceph config set client.rbd-mirror.a log_file /var/log/ceph/\$cluster-\$name.log
$ ceph config set client.rbd-mirror-peer debug_ms 1
$ ceph config set client.rbd-mirror-peer debug_rbd 20
$ ceph config set client.rbd-mirror-peer debug_rbd_mirror 30
$ ceph config set client.rbd-mirror-peer log_file /var/log/ceph/\$cluster-\$name.log
$ ceph config set mgr mgr/rbd_support/log_level debug
```

(This is basically what Pratik, Sidhant, and others have been doing for a while in QE testing, with minor tweaks.)

<---- snap ---->

Re-assigning to Malay as discussed with Mudit.

From the discussion between @tnielsen & @idryomov on https://bugzilla.redhat.com/show_bug.cgi?id=2093266, the last setting, mgr/rbd_support/log_level, can't be set this way (by adding it to the configmap in ocs-operator): "ceph config set" (or the lower-level "ceph config-key set") is the only way to set ceph-mgr module configuration settings, because they are stored on the monitors. Rook doesn't currently use assimilate-conf; it uses /etc/ceph/ceph.conf instead. As Travis said, this could be done in the Rook downstream: https://github.com/red-hat-storage/rook/blob/1ae867049b49079b76696e68ee9b8f30216528bd/pkg/operator/ceph/cluster/cluster.go#L497. For the rest of the settings, I have a PR up on ocs-operator, which is linked above.

Another approach to consider: a job template could be created, similar to the OSD removal job template [1]. The template could run the needed `ceph config set` commands, use assimilate-conf, or use whatever Ceph mechanism the settings require. The template would not run on every cluster automatically; instead, the customer would run the job template on their cluster whenever they are testing DR. When they are done testing DR, the settings could be reverted by running the template again with an option indicating that. The strong advantage of this approach is that the customer is aware of the increased logging and can choose to enable or disable it at any time, and the job template could remain even after GA. The template could also be expanded with additional options if anything else is needed for DR. It would be very flexible and useful, compared to the ceph.conf overrides.

[1] https://github.com/red-hat-storage/ocs-operator/blob/f06c3e5c27ed309e76fbb85a416e9ecf1fa6dd6b/controllers/storagecluster/job_templates.go#L68

@idryomov, what do you say? As I saw from your comment on the PR, you are not in favor of the configmap change. So what should we do now? Should we consider what Travis has suggested, or do nothing and keep things as they are?

The OCS operator could technically run the template while the feature is in beta to automatically enable the logging. Perhaps it could be run once, and we could add an instruction to the mirroring docs on how to disable it if needed. Although, if there are any manual steps for configuring mirroring, I'd still vote for this being one of those steps, unless we expect a high percentage of the beta users to ask us to troubleshoot the feature.

I am not in favor of enabling this via code; the initial agreement was to enable this only for TP, but now we are in 4.12, which might be the GA version for RDR. So, I am not sure whether this should be done or not. I don't think we have an agreeable approach here to fix this issue. Moreover, the plan was to do this for the RDR TP, which is already in place. I am closing this as WONTFIX; we should try to find a way to enable this manually rather than enabling it in the code. Please reopen if someone thinks otherwise.
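For reference, the settings above can be applied manually from the Rook-Ceph toolbox. A minimal sketch, assuming the usual openshift-storage namespace and rook-ceph-tools deployment name (neither is stated in this bug; adjust to your cluster):

```
# Shell into the toolbox pod (namespace and deployment name are assumptions)
$ oc -n openshift-storage rsh deploy/rook-ceph-tools

# Apply the debug settings from the description, e.g.:
sh-4.4$ ceph config set client.rbd-mirror.a debug_rbd 20
sh-4.4$ ceph config set client.rbd-mirror-peer debug_rbd 20
sh-4.4$ ceph config set mgr mgr/rbd_support/log_level debug

# Spot-check that a setting took effect
sh-4.4$ ceph config get mgr mgr/rbd_support/log_level
debug
```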
Hi Ilya, I absolutely have no problem with fixing this, but for something this important I expect all the stakeholders to agree on a fix, and there hasn't been any movement (neither on the BZ nor on the PR) for almost a month now, while we are 2 weeks away from the 4.12 dev freeze. Can we close the loop on the fix quickly so that it can land before the freeze?

To implement this as a job run by the ocs-operator, see this example: https://github.com/travisn/rook/blob/mirror-logging/deploy/examples/mirror-logging.yaml. This job could be defined and run by the ocs-operator, or it could be wrapped by the ocs-operator in a template, leaving it up to the user to run it.

Reopening the bug and taking a look on a priority basis for 4.12.

Hello Travis & Ilya, thanks Travis for providing the YAML of the job needed. I have raised a new PR: we now create a job when the Spec.Mirroring.Enabled field on the StorageCluster CR is true, and the job runs the required `ceph config` commands. Here is the output of `ceph config dump` in the toolbox after the job completes:

```
sh-4.4$ ceph config dump
WHO                                      MASK  LEVEL     OPTION                                 VALUE                              RO
global                                         basic     log_to_file                            true
global                                         advanced  mon_allow_pool_delete                  true
global                                         advanced  mon_allow_pool_size_one                true
global                                         advanced  mon_cluster_log_file
global                                         advanced  mon_pg_warn_min_per_osd                0
mon                                            advanced  auth_allow_insecure_global_id_reclaim  false
mgr                                            advanced  mgr/balancer/mode                      upmap
mgr                                            advanced  mgr/prometheus/rbd_stats_pools         ocs-storagecluster-cephblockpool   *
mgr                                            advanced  mgr/rbd_support/log_level              debug
osd.0                                          basic     osd_mclock_max_capacity_iops_ssd       16840.108369
osd.1                                          basic     osd_mclock_max_capacity_iops_ssd       17091.355128
osd.2                                          basic     osd_mclock_max_capacity_iops_ssd       17071.832952
mds.ocs-storagecluster-cephfilesystem-a        basic     mds_cache_memory_limit                 4294967296
mds.ocs-storagecluster-cephfilesystem-a        basic     mds_join_fs                            ocs-storagecluster-cephfilesystem
mds.ocs-storagecluster-cephfilesystem-b        basic     mds_cache_memory_limit                 4294967296
mds.ocs-storagecluster-cephfilesystem-b        basic     mds_join_fs                            ocs-storagecluster-cephfilesystem
client.rbd-mirror-peer                         advanced  debug_ms                               1/1
client.rbd-mirror-peer                         advanced  debug_rbd                              20/20
client.rbd-mirror-peer                         advanced  debug_rbd_mirror                       30/30
client.rbd-mirror-peer                         basic     log_file                               /var/log/ceph/$cluster-$name.log   *
client.rbd-mirror.a                            advanced  debug_ms                               1/1
client.rbd-mirror.a                            advanced  debug_rbd                              20/20
client.rbd-mirror.a                            advanced  debug_rbd_mirror                       30/30
client.rbd-mirror.a                            basic     log_file                               /var/log/ceph/$cluster-$name.log   *
```

You can verify that all the required fields are set. I request both of you to take a look at the PR: https://github.com/red-hat-storage/ocs-operator/pull/1875. Thanks @tnielsen @idryomov

The output looks good to me, thanks Malay.

Will move this to ON_QA once https://bugzilla.redhat.com/show_bug.cgi?id=2142763 is ON_QA.
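For completeness, the increased logging can later be reverted from the toolbox with `ceph config rm`, which drops an option back to its default. A sketch using the client names from the `ceph config dump` output above (the exact command list is illustrative, not taken from this bug):

```
$ ceph config rm client.rbd-mirror.a debug_ms
$ ceph config rm client.rbd-mirror.a debug_rbd
$ ceph config rm client.rbd-mirror.a debug_rbd_mirror
$ ceph config rm client.rbd-mirror.a log_file
$ ceph config rm client.rbd-mirror-peer debug_ms
$ ceph config rm client.rbd-mirror-peer debug_rbd
$ ceph config rm client.rbd-mirror-peer debug_rbd_mirror
$ ceph config rm client.rbd-mirror-peer log_file
$ ceph config rm mgr mgr/rbd_support/log_level
```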