Bug 1993511

Summary: ceph.rules: many-to-many matching not allowed: matching labels must be unique on one side
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: German Parente <gparente>
Component: ceph-monitoring
Assignee: arun kumar mohan <amohan>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Elad <ebenahar>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 4.7
CC: amohan, amuller, anpicker, aos-bugs, erooth, hnallurv, janantha, jfajersk, jsafrane, muagarwa, nthomas, ocs-bugs, odf-bz-bot
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-10-13 14:51:36 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description German Parente 2021-08-13 16:42:23 UTC
Description of problem:

Typical issue where the rule should be modified. I am reporting this incident against the monitoring component, but I suspect it belongs more to Ceph (OCS).

In this case the rule uses the device as part of the label tuple it matches on when calculating the metric. So we get the warning below, together with the conflicting label tuples:

level=warn ts=2021-08-11T07:12:49.450Z caller=manager.go:598 component="rule manager" group=ceph.rules msg="Evaluating rule failed" rule="record: cluster:ceph_disk_latency:join_ceph_node_disk_irate1m\nexpr: avg(max by(instance) (label_replace(label_replace(ceph_disk_occupation{job=\"rook-ceph-mgr\"}, \"instance\", \"$1\", \"exported_instance\", \"(.*)\"), \"device\", \"$1\", \"device\", \"/dev/(.*)\") * on(instance, device) group_right() (irate(node_disk_read_time_seconds_total[1m]) + irate(node_disk_write_time_seconds_total[1m]) / (clamp_min(irate(node_disk_reads_completed_total[1m]), 1) + irate(node_disk_writes_completed_total[1m])))))\n" err="found duplicate series for the match group {device=\"sdc\", instance=\"coll-ocs-westeurope-3\"} on the left hand-side of the operation: [{__name__=\"ceph_disk_occupation\", ceph_daemon=\"osd.4\", container=\"mgr\", device=\"sdc\", endpoint=\"http-metrics\", exported_instance=\"coll-ocs-westeurope-3\", instance=\"coll-ocs-westeurope-3\", job=\"rook-ceph-mgr\", namespace=\"openshift-storage\", pod=\"rook-ceph-mgr-a-77cd89576b-9d4nl\", service=\"rook-ceph-mgr\"}, {__name__=\"ceph_disk_occupation\", ceph_daemon=\"osd.1\", container=\"mgr\", device=\"sdc\", endpoint=\"http-metrics\", exported_instance=\"coll-ocs-westeurope-3\", instance=\"coll-ocs-westeurope-3\", job=\"rook-ceph-mgr\", namespace=\"openshift-storage\", pod=\"rook-ceph-mgr-a-77cd89576b-9d4nl\", service=\"rook-ceph-mgr\"}];many-to-many matching not allowed: matching labels must be unique on one side"


{__name__=\"ceph_disk_occupation\", ceph_daemon=\"osd.4\", container=\"mgr\", device=\"sdc\", endpoint=\"http-metrics\", exported_instance=\"coll-ocs-westeurope-3\", instance=\"coll-ocs-westeurope-3\", job=\"rook-ceph-mgr\", namespace=\"openshift-storage\", pod=\"rook-ceph-mgr-a-77cd89576b-9d4nl\", service=\"rook-ceph-mgr\"}
{__name__=\"ceph_disk_occupation\", ceph_daemon=\"osd.1\", container=\"mgr\", device=\"sdc\", endpoint=\"http-metrics\", exported_instance=\"coll-ocs-westeurope-3\", instance=\"coll-ocs-westeurope-3\", job=\"rook-ceph-mgr\", namespace=\"openshift-storage\", pod=\"rook-ceph-mgr-a-77cd89576b-9d4nl\", service=\"rook-ceph-mgr\"}

Since both series carry the same identifying labels, in particular the same device, the rule fails to evaluate.
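
A possible direction (a minimal sketch only, not a verified fix) would be to collapse the duplicate ceph_disk_occupation series on the left-hand side of the join, so that each (instance, device) pair is unique before the group_right() matching; ceph_daemon is the label that differs between the duplicates. This assumes ceph_disk_occupation is effectively an info-style metric with a constant value, so aggregating the duplicate OSD series with max does not change the computed latency. The node_disk_* part of the expression is kept unchanged from the original rule:

  # Sketch: aggregate away ceph_daemon so that (instance, device) is unique
  # on the left-hand side before the group_right() join.
  avg(max by (instance) (
    max by (instance, device) (
      label_replace(
        label_replace(ceph_disk_occupation{job="rook-ceph-mgr"},
          "instance", "$1", "exported_instance", "(.*)"),
        "device", "$1", "device", "/dev/(.*)")
    )
    * on (instance, device) group_right()
    (
        irate(node_disk_read_time_seconds_total[1m])
      + irate(node_disk_write_time_seconds_total[1m])
      / (clamp_min(irate(node_disk_reads_completed_total[1m]), 1)
         + irate(node_disk_writes_completed_total[1m]))
    )
  ))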


Version-Release number of selected component (if applicable):  4.7

Comment 1 Jayapriya Pai 2021-08-16 06:47:53 UTC
This rule appears to be from ocs-operator and would need a ceph cluster to reproduce. Updating component to Storage/Ceph

Comment 2 Jan Safranek 2021-08-16 13:09:16 UTC
Moving to OpenShift Container Storage. I had guessed the ceph-monitoring component.

Comment 11 arun kumar mohan 2021-10-13 07:18:46 UTC
Hi,
I tried to reproduce the issue, but I was unable to do so on a new cluster setup.
In German's output we can see the same device ("sdc") reported by two different OSDs (osd.1 & osd.4) on the same host, which should not happen in a common scenario.
German, could you let me know how we can reproduce the issue?

@gparente

Comment 13 arun kumar mohan 2021-10-13 08:39:59 UTC
Thanks for the quick reply, German.
In that case, can you check with the customer what steps they took to get into this state, and whether it is always reproducible?
Without reproducing the issue, I'm unable to move forward =(
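
If it helps, one way to confirm whether the condition is still present on the customer cluster (a hypothetical diagnostic query, not part of the shipped rules) would be to look for instance/device pairs exported by more than one ceph_disk_occupation series:

  # Lists any (exported_instance, device) pair reported by more than one
  # ceph_disk_occupation series, e.g. two OSDs claiming the same device,
  # which is what breaks the join in the recording rule.
  count by (exported_instance, device) (
    ceph_disk_occupation{job="rook-ceph-mgr"}
  ) > 1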