Bug 2127186
| Summary: | [RDR] Enable RBD mirroring debugging by default | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Karolin Seeger <kseeger> |
| Component: | ocs-operator | Assignee: | Malay Kumar parida <mparida> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Pratik Surve <prsurve> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.11 | CC: | amagrawa, idryomov, jdurgin, kramdoss, mmuench, mparida, muagarwa, ocs-bugs, odf-bz-bot, prsurve, sostapov, tnielsen |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | ODF 4.12.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | 4.12.0-114 | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-02-08 14:06:28 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 2142763 | | |
| Bug Blocks: | | | |
Description
Karolin Seeger
2022-09-15 15:16:36 UTC
Details tracked in https://bugzilla.redhat.com/show_bug.cgi?id=2093266.

<---- snip ---->

Expanding on Josh's comment, an equivalent of the following is needed:

```
$ ceph config set client.rbd-mirror.a debug_ms 1
$ ceph config set client.rbd-mirror.a debug_rbd 20
$ ceph config set client.rbd-mirror.a debug_rbd_mirror 30
$ ceph config set client.rbd-mirror.a log_file /var/log/ceph/\$cluster-\$name.log
$ ceph config set client.rbd-mirror-peer debug_ms 1
$ ceph config set client.rbd-mirror-peer debug_rbd 20
$ ceph config set client.rbd-mirror-peer debug_rbd_mirror 30
$ ceph config set client.rbd-mirror-peer log_file /var/log/ceph/\$cluster-\$name.log
$ ceph config set mgr mgr/rbd_support/log_level debug
```

(This is basically what Pratik, Sidhant, and others have been doing for a while in QE testing, with minor tweaks.)

<---- snap ---->

Re-assigning to Malay as discussed with Mudit.

From the discussion between @tnielsen & @idryomov on https://bugzilla.redhat.com/show_bug.cgi?id=2093266, the last setting, mgr/rbd_support/log_level, can't be set this way (by adding it to the configmap in ocs-operator): "ceph config set" (or the lower-level "ceph config-key set") is the only way to set ceph-mgr module configuration settings, because they are stored on the monitors. Rook doesn't currently use assimilate-conf; it uses /etc/ceph/ceph.conf instead. As Travis said, this could be done in the Rook downstream: https://github.com/red-hat-storage/rook/blob/1ae867049b49079b76696e68ee9b8f30216528bd/pkg/operator/ceph/cluster/cluster.go#L497. For the rest of the settings, I have a PR up on ocs-operator, which is linked above.

Another approach to consider: a job template could be created, similar to the OSD removal job template [1]. The template could run the needed `ceph config set` commands, use assimilate-conf, or use whatever Ceph mechanism the settings require. The template would not run on every cluster automatically; instead, the customer would run the job template on their cluster whenever they are testing DR. When they are done testing DR, the settings could be reverted by running the template again with an option indicating that. The strong advantage of this approach is that the customer is aware of the increased logging and can choose to enable or disable it at any time, and the job template could remain even after GA. The template could also be expanded with additional options if anything else is needed for DR. It would be very flexible and useful, compared to the ceph.conf overrides.

[1] https://github.com/red-hat-storage/ocs-operator/blob/f06c3e5c27ed309e76fbb85a416e9ecf1fa6dd6b/controllers/storagecluster/job_templates.go#L68

@idryomov, what do you say? As I saw from your comment on the PR, you are not in favor of the configmap change. So what should we do now? Should we consider what Travis has suggested, or do nothing and keep things as they are?

The OCS operator could technically run the template while the feature is in beta to automatically enable the logging. Perhaps it could be run once, and we could add an instruction to the mirroring docs on how to disable it if needed. Although, if there are any manual steps for configuring mirroring, I'd still vote for this being one of those steps, unless we expect a high percentage of the beta users to ask us to troubleshoot the feature.

I am not in favor of enabling this via code; the initial agreement was to enable this only for TP, but now we are in 4.12, which might be the GA version for RDR. So, I am not sure whether this should be done or not. I don't think we have an agreeable approach here to fix this issue. Moreover, the plan was to do this for the RDR TP, which is already in place. I am closing this as WONTFIX; we should try to find a way to enable this manually rather than enabling it in the code. Please reopen if someone thinks otherwise.
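For reference, the settings above can be applied manually from the Rook-Ceph toolbox. A minimal sketch, assuming the usual openshift-storage namespace and rook-ceph-tools deployment name (neither is stated in this bug; adjust to your cluster):

```
# Shell into the toolbox pod (namespace and deployment name are assumptions)
$ oc -n openshift-storage rsh deploy/rook-ceph-tools

# Apply the debug settings from the description, e.g.:
sh-4.4$ ceph config set client.rbd-mirror.a debug_rbd 20
sh-4.4$ ceph config set client.rbd-mirror-peer debug_rbd 20
sh-4.4$ ceph config set mgr mgr/rbd_support/log_level debug

# Spot-check that a setting took effect
sh-4.4$ ceph config get mgr mgr/rbd_support/log_level
debug
```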
Hi Ilya, I absolutely have no problem with fixing this, but for something this important I expect all the stakeholders to agree on a fix, and there hasn't been any movement (neither on the BZ nor on the PR) for almost a month now, while we are 2 weeks away from the 4.12 dev freeze. Can we close the loop on the fix quickly so that it can land before the freeze?

To implement this as a job run by the ocs-operator, see this example: https://github.com/travisn/rook/blob/mirror-logging/deploy/examples/mirror-logging.yaml. This job could be defined and run by the ocs-operator, or it could be wrapped by the ocs-operator in a template, leaving it up to the user to run it.

Reopening the bug and taking a look on a priority basis for 4.12.

Hello Travis & Ilya, thanks Travis for providing the YAML of the job needed. I have raised a new PR: we now create a job when the Spec.Mirroring.Enabled field on the StorageCluster CR is true, and the job runs the required `ceph config` commands. Here is the output of `ceph config dump` in the toolbox after the job completes:

```
sh-4.4$ ceph config dump
WHO                                      MASK  LEVEL     OPTION                                 VALUE                              RO
global                                         basic     log_to_file                            true
global                                         advanced  mon_allow_pool_delete                  true
global                                         advanced  mon_allow_pool_size_one                true
global                                         advanced  mon_cluster_log_file
global                                         advanced  mon_pg_warn_min_per_osd                0
mon                                            advanced  auth_allow_insecure_global_id_reclaim  false
mgr                                            advanced  mgr/balancer/mode                      upmap
mgr                                            advanced  mgr/prometheus/rbd_stats_pools         ocs-storagecluster-cephblockpool   *
mgr                                            advanced  mgr/rbd_support/log_level              debug
osd.0                                          basic     osd_mclock_max_capacity_iops_ssd       16840.108369
osd.1                                          basic     osd_mclock_max_capacity_iops_ssd       17091.355128
osd.2                                          basic     osd_mclock_max_capacity_iops_ssd       17071.832952
mds.ocs-storagecluster-cephfilesystem-a        basic     mds_cache_memory_limit                 4294967296
mds.ocs-storagecluster-cephfilesystem-a        basic     mds_join_fs                            ocs-storagecluster-cephfilesystem
mds.ocs-storagecluster-cephfilesystem-b        basic     mds_cache_memory_limit                 4294967296
mds.ocs-storagecluster-cephfilesystem-b        basic     mds_join_fs                            ocs-storagecluster-cephfilesystem
client.rbd-mirror-peer                         advanced  debug_ms                               1/1
client.rbd-mirror-peer                         advanced  debug_rbd                              20/20
client.rbd-mirror-peer                         advanced  debug_rbd_mirror                       30/30
client.rbd-mirror-peer                         basic     log_file                               /var/log/ceph/$cluster-$name.log   *
client.rbd-mirror.a                            advanced  debug_ms                               1/1
client.rbd-mirror.a                            advanced  debug_rbd                              20/20
client.rbd-mirror.a                            advanced  debug_rbd_mirror                       30/30
client.rbd-mirror.a                            basic     log_file                               /var/log/ceph/$cluster-$name.log   *
```

You can verify that all the required fields are set. I request both of you to take a look at the PR: https://github.com/red-hat-storage/ocs-operator/pull/1875. Thanks @tnielsen @idryomov

The output looks good to me, thanks Malay.

Will move this to ON_QA once https://bugzilla.redhat.com/show_bug.cgi?id=2142763 is ON_QA.
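For completeness, the increased logging can later be reverted from the toolbox with `ceph config rm`, which drops an option back to its default. A sketch using the client names from the `ceph config dump` output above (the exact command list is illustrative, not taken from this bug):

```
$ ceph config rm client.rbd-mirror.a debug_ms
$ ceph config rm client.rbd-mirror.a debug_rbd
$ ceph config rm client.rbd-mirror.a debug_rbd_mirror
$ ceph config rm client.rbd-mirror.a log_file
$ ceph config rm client.rbd-mirror-peer debug_ms
$ ceph config rm client.rbd-mirror-peer debug_rbd
$ ceph config rm client.rbd-mirror-peer debug_rbd_mirror
$ ceph config rm client.rbd-mirror-peer log_file
$ ceph config rm mgr mgr/rbd_support/log_level
```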