Bug 2146546

Summary: [RFE] Refactor RBD mirror metrics to use new labeled performance counter
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Juan Miguel Olmo <jolmomar>
Component: RBD-Mirror
Assignee: Divyansh Kamboj <dkamboj>
Status: CLOSED ERRATA
QA Contact: Sunil Angadi <sangadi>
Severity: high
Docs Contact: Akash Raj <akraj>
Priority: high
Version: 6.0
CC: akraj, ceph-eng-bugs, cephqe-warriors, dkamboj, idryomov, kdreyer, ocs-bugs, rmandyam, sangadi
Target Milestone: ---
Keywords: FutureFeature
Target Release: 6.1
Flags: sangadi: needinfo+
sangadi: needinfo+
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: Enhancement
Doc Text:
.Renaming and adding performance counters is supported
With this release, you can rename and add performance counters in the `rbd-mirror` daemon to improve clarity and provide detailed information about snapshot synchronization between source and destination clusters. The existing snapshot-based and journal-based performance counters were renamed, new performance counters were added, and counter names now use labels instead of embedding the image specification.
Story Points: ---
Clone Of: Environment:
Last Closed: 2023-06-15 09:16:25 UTC Type: Bug
Bug Blocks: 2025815, 2145593, 2146544, 2192813    

Description Juan Miguel Olmo 2022-11-23 17:52:35 UTC
As a monitoring user of the RBD mirror daemon, I need perf counters whose context attributes are not part of the performance counter name. This will allow these performance counters to be transformed into Prometheus metrics using the usual format.

Example:

rbd_mirror_snapshot_image_<image_spec>snapshots*

to:

ceph_rbd_mirror_snapshot_image_snapshot_*(pool: px, image: imgx)
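A minimal Python sketch of the transformation requested above, for illustration only. It assumes a labeled counter entry is a dict with "labels" and "counters" keys; the function names, the `ceph_` prefix argument, and the choice to drop empty label values are assumptions of this sketch, not the exporter's actual behavior.

```python
def escape_label_value(value):
    """Escape backslashes, quotes and newlines for the Prometheus text format."""
    return str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def to_prometheus_lines(group, entry, prefix="ceph_"):
    """Render one labeled counter entry as Prometheus text-format sample lines.

    `group` is the counter group name, e.g. "rbd_mirror_snapshot_image".
    `entry` is a dict with "labels" and "counters" keys (assumed layout).
    """
    # Drop empty label values (e.g. the default namespace) in this sketch.
    labels = {k: v for k, v in entry.get("labels", {}).items() if v != ""}
    label_str = ",".join(
        f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
    )
    lines = []
    for name, value in entry.get("counters", {}).items():
        if isinstance(value, dict):
            # Composite counters (avgcount/sum/avgtime) are skipped here.
            continue
        metric = f"{prefix}{group}_{name}"
        lines.append(f"{metric}{{{label_str}}} {value}" if label_str else f"{metric} {value}")
    return lines


# Hypothetical entry using the pool and image names from the example above.
entry = {
    "labels": {"image": "imgx", "namespace": "", "pool": "px"},
    "counters": {"snapshots": 106},
}
print("\n".join(to_prometheus_lines("rbd_mirror_snapshot_image", entry)))
```

Because the pool and image travel as labels, the metric name stays fixed and one Prometheus query can select or aggregate across images.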

Comment 1 RHEL Program Management 2022-11-23 17:52:47 UTC
Please specify the severity of this bug. Severity is defined here:
https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity.

Comment 13 Divyansh Kamboj 2023-04-18 10:25:30 UTC
Can you try running the admin socket command from inside the socket's directory, instead of providing the full path? It looks like your path is longer than the system-defined limit.
Also make sure that you're running the command on the secondary cluster to see the values.
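As a rough illustration of the path limit mentioned here: on Linux, a Unix domain socket path must fit in the 108-byte `sun_path` field of `sockaddr_un`, including the trailing NUL. The Python sketch below checks a path against that size. The helper name and the example paths are hypothetical, and other operating systems may use a different size.

```python
import os

# Size of sockaddr_un.sun_path on Linux, including the trailing NUL byte.
# Assumption: other operating systems may define a different size.
LINUX_SUN_PATH_SIZE = 108


def socket_path_fits(path, sun_path_size=LINUX_SUN_PATH_SIZE):
    """Return True if `path` plus its terminating NUL fits in sun_path."""
    return len(os.fsencode(path)) < sun_path_size


# Hypothetical paths: a bare file name used from inside the socket's
# directory, and a deliberately long absolute path.
print(socket_path_fits("rbd-mirror.asok"))
print(socket_path_fits("/" + "x" * 150 + "/rbd-mirror.asok"))
```

Running the command from inside the socket's directory shortens the path passed to the kernel to just the file name, which is why it can sidestep the limit.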

Comment 15 Divyansh Kamboj 2023-04-18 11:04:09 UTC
I think the socket file path length is an OS limit; I don't think we need to file a BZ for that.

Comment 18 Divyansh Kamboj 2023-04-26 09:13:02 UTC
> Is this overall pool (Total number of pools) mirror usage for snapshot based?

These values are, afaik, agnostic of pools, as this is the global data for all the snapshots that the daemon has handled. `perf dump` will not show any labels.

> If we get these metrics using pool wise, wouldn't it be more helpful?

Right now there are only two levels at which snapshot-based mirroring reports perf counters: one is the global daemon level, the other is per image (with the labels). If we want pool-based data, we can compile it from the per-image counters.
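A minimal Python sketch of compiling pool-level data from the per-image counters. It assumes the per-image entries have already been collected into a list of dicts with "labels" and "counters" keys; the function name and the sample values are hypothetical.

```python
from collections import defaultdict


def snapshots_per_pool(image_entries):
    """Sum the per-image "snapshots" counter for each pool label."""
    totals = defaultdict(int)
    for entry in image_entries:
        pool = entry["labels"]["pool"]
        totals[pool] += entry["counters"]["snapshots"]
    return dict(totals)


# Hypothetical per-image entries; real values come from the daemon.
entries = [
    {"labels": {"pool": "data", "namespace": "", "image": "image1"},
     "counters": {"snapshots": 106}},
    {"labels": {"pool": "data", "namespace": "", "image": "image2"},
     "counters": {"snapshots": 10}},
    {"labels": {"pool": "backup", "namespace": "", "image": "image3"},
     "counters": {"snapshots": 4}},
]
print(snapshots_per_pool(entries))
```

The same grouping can be done on the monitoring side once the counters are exported as labeled metrics, by aggregating over the pool label.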

> why these metrics are not available from primary?

These are not available on the primary, as the snapshots on the primary are handled by the rbd-mirror daemon on the secondary cluster. Thus we are only able to generate counters on the secondary side. This data is also written to the mirror image status, so we can observe it on the primary cluster using the mirror image status command.

> Labels field is empty here, what value we should expect here and which scenario?

The counter dump command you've posted is for `rbd_mirror_snapshot`, which holds the counters for all the snapshots being processed by that daemon (global). The labels are on the per-image counter `rbd_mirror_snapshot_image`, which will look something like this:

```
    "rbd_mirror_snapshot_image": {
        "labels": {
            "image": "image1",
            "namespace": "",
            "pool": "data"
        },
        "counters": {
            "snapshots": 106,
            "sync_time": {
                "avgcount": 106,
                "sum": 8.157710400,
                "avgtime": 0.076959532
            },
            "sync_bytes": 524288000,
            "remote_timestamp": 1682500200.777647685,
            "local_timestamp": 1682500200.777647685,
            "last_sync_time": 0.003713542,
            "last_sync_bytes": 0
        }
    }
```

Can you check and see if that counter is working for you? You can verify that the values are correct for the images by comparing them against the output of the mirror image status command.
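Before comparing against the mirror image status output, a quick internal sanity check is possible: in the sample above, `avgtime` matches `sum / avgcount` for `sync_time`. The Python sketch below checks that relation. The function name and the tolerance are assumptions of this sketch.

```python
import math


def avgtime_is_consistent(sync_time, rel_tol=1e-6):
    """Check that avgtime is approximately sum / avgcount."""
    count = sync_time["avgcount"]
    if count == 0:
        # Nothing synced yet: expect a zero average.
        return sync_time["avgtime"] == 0
    return math.isclose(sync_time["sum"] / count, sync_time["avgtime"], rel_tol=rel_tol)


# Values copied from the sample output above.
sample_sync_time = {"avgcount": 106, "sum": 8.157710400, "avgtime": 0.076959532}
print(avgtime_is_consistent(sample_sync_time))
```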

> both are having same values, looks like an issue for me.

I'm not sure I see the same values in the output you've posted. Can you point out where the values are the same? I might've missed it.

Comment 22 Ilya Dryomov 2023-05-16 16:06:36 UTC
*** Bug 2145593 has been marked as a duplicate of this bug. ***

Comment 24 errata-xmlrpc 2023-06-15 09:16:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Ceph Storage 6.1 security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:3623

Comment 25 Red Hat Bugzilla 2023-10-20 04:25:04 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days