Bug 1993511 - ceph.rules: many-to-many matching not allowed: matching labels must be unique on one side
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph-monitoring
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: arun kumar mohan
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-08-13 16:42 UTC by German Parente
Modified: 2023-08-09 16:37 UTC
CC List: 13 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-10-13 14:51:36 UTC
Embargoed:



Description German Parente 2021-08-13 16:42:23 UTC
Description of problem:

Typical case where the rule should be modified. I am reporting this incident against the monitoring component, but I guess it belongs more to Ceph (OCS).

In this case the rule uses the device as part of the label tuple it matches on to calculate the metric. So we have this warning and the following duplicate series:

level=warn ts=2021-08-11T07:12:49.450Z caller=manager.go:598 component="rule manager" group=ceph.rules msg="Evaluating rule failed" rule="record: cluster:ceph_disk_latency:join_ceph_node_disk_irate1m\nexpr: avg(max by(instance) (label_replace(label_replace(ceph_disk_occupation{job=\"rook-ceph-mgr\"}, \"instance\", \"$1\", \"exported_instance\", \"(.*)\"), \"device\", \"$1\", \"device\", \"/dev/(.*)\") * on(instance, device) group_right() (irate(node_disk_read_time_seconds_total[1m]) + irate(node_disk_write_time_seconds_total[1m]) / (clamp_min(irate(node_disk_reads_completed_total[1m]), 1) + irate(node_disk_writes_completed_total[1m])))))\n" err="found duplicate series for the match group {device=\"sdc\", instance=\"coll-ocs-westeurope-3\"} on the left hand-side of the operation: [{__name__=\"ceph_disk_occupation\", ceph_daemon=\"osd.4\", container=\"mgr\", device=\"sdc\", endpoint=\"http-metrics\", exported_instance=\"coll-ocs-westeurope-3\", instance=\"coll-ocs-westeurope-3\", job=\"rook-ceph-mgr\", namespace=\"openshift-storage\", pod=\"rook-ceph-mgr-a-77cd89576b-9d4nl\", service=\"rook-ceph-mgr\"}, {__name__=\"ceph_disk_occupation\", ceph_daemon=\"osd.1\", container=\"mgr\", device=\"sdc\", endpoint=\"http-metrics\", exported_instance=\"coll-ocs-westeurope-3\", instance=\"coll-ocs-westeurope-3\", job=\"rook-ceph-mgr\", namespace=\"openshift-storage\", pod=\"rook-ceph-mgr-a-77cd89576b-9d4nl\", service=\"rook-ceph-mgr\"}];many-to-many matching not allowed: matching labels must be unique on one side"


{__name__=\"ceph_disk_occupation\", ceph_daemon=\"osd.4\", container=\"mgr\", device=\"sdc\", endpoint=\"http-metrics\", exported_instance=\"coll-ocs-westeurope-3\", instance=\"coll-ocs-westeurope-3\", job=\"rook-ceph-mgr\", namespace=\"openshift-storage\", pod=\"rook-ceph-mgr-a-77cd89576b-9d4nl\", service=\"rook-ceph-mgr\"}
{__name__=\"ceph_disk_occupation\", ceph_daemon=\"osd.1\", container=\"mgr\", device=\"sdc\", endpoint=\"http-metrics\", exported_instance=\"coll-ocs-westeurope-3\", instance=\"coll-ocs-westeurope-3\", job=\"rook-ceph-mgr\", namespace=\"openshift-storage\", pod=\"rook-ceph-mgr-a-77cd89576b-9d4nl\", service=\"rook-ceph-mgr\"}

As both series report the same values for the match-group labels, in particular the same device on the same instance, the rule fails to evaluate.
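
For reference, the error is a consequence of Prometheus vector matching: with "* on(instance, device)" the (instance, device) pairs have to be unique on the "one" side of the match. Purely as an illustration of one possible workaround (not necessarily the change that should go into ceph.rules), collapsing the left-hand side over the join labels before the multiplication would keep the match group unique, along these lines:

avg(max by (instance) (
  max by (instance, device) (
    label_replace(label_replace(ceph_disk_occupation{job="rook-ceph-mgr"},
      "instance", "$1", "exported_instance", "(.*)"),
      "device", "$1", "device", "/dev/(.*)")
  )
  * on(instance, device) group_right()
  (irate(node_disk_read_time_seconds_total[1m]) + irate(node_disk_write_time_seconds_total[1m])
    / (clamp_min(irate(node_disk_reads_completed_total[1m]), 1) + irate(node_disk_writes_completed_total[1m])))
))

With the duplicates collapsed by max by (instance, device), each match group has at most one series on the left-hand side, so the many-to-many error cannot occur (at the cost of keeping only the max occupation value when two OSDs report the same device).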


Version-Release number of selected component (if applicable):  4.7

Comment 1 Jayapriya Pai 2021-08-16 06:47:53 UTC
This rule appears to be from ocs-operator and would need a ceph cluster to reproduce. Updating component to Storage/Ceph

Comment 2 Jan Safranek 2021-08-16 13:09:16 UTC
Moving to OpenShift Container Storage; I guessed the ceph-monitoring component.

Comment 11 arun kumar mohan 2021-10-13 07:18:46 UTC
Hi,
I was trying to reproduce the issue, but I was unable to do so on a new cluster setup.
In German's result we can see the same device ("sdc") mapped to two different OSDs (osd.1 and osd.4) on the same host, which should not happen in a common scenario.
German, I would like to know how we can reproduce the issue.

@gparente
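
For whoever has access to the affected cluster: a query along these lines against the cluster Prometheus should show whether the duplicate mapping is present (metric, job, and label names taken from the log in the description):

count by (instance, device) (
  label_replace(ceph_disk_occupation{job="rook-ceph-mgr"}, "instance", "$1", "exported_instance", "(.*)")
) > 1

Any result returned means two or more ceph_disk_occupation series share the same (instance, device) pair, which is exactly what makes the rule evaluation fail.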

Comment 13 arun kumar mohan 2021-10-13 08:39:59 UTC
Thanks for the quick reply German.
In that case, can you check with the customer what steps they took to get to this issue, and whether it is always reproducible?
Without reproducing the issue, I'm unable to move forward =(

