Description of problem:
A typical issue where the rule should be modified. I am reporting this incident against the monitoring component, but I suspect it really belongs to Ceph (OCS).
In this case the rule uses the device as part of the label tuple for the metric calculation, so we get the following warning, caused by the two conflicting series listed after it:
level=warn ts=2021-08-11T07:12:49.450Z caller=manager.go:598 component="rule manager" group=ceph.rules msg="Evaluating rule failed" rule="record: cluster:ceph_disk_latency:join_ceph_node_disk_irate1m\nexpr: avg(max by(instance) (label_replace(label_replace(ceph_disk_occupation{job=\"rook-ceph-mgr\"}, \"instance\", \"$1\", \"exported_instance\", \"(.*)\"), \"device\", \"$1\", \"device\", \"/dev/(.*)\") * on(instance, device) group_right() (irate(node_disk_read_time_seconds_total[1m]) + irate(node_disk_write_time_seconds_total[1m]) / (clamp_min(irate(node_disk_reads_completed_total[1m]), 1) + irate(node_disk_writes_completed_total[1m])))))\n" err="found duplicate series for the match group {device=\"sdc\", instance=\"coll-ocs-westeurope-3\"} on the left hand-side of the operation: [{__name__=\"ceph_disk_occupation\", ceph_daemon=\"osd.4\", container=\"mgr\", device=\"sdc\", endpoint=\"http-metrics\", exported_instance=\"coll-ocs-westeurope-3\", instance=\"coll-ocs-westeurope-3\", job=\"rook-ceph-mgr\", namespace=\"openshift-storage\", pod=\"rook-ceph-mgr-a-77cd89576b-9d4nl\", service=\"rook-ceph-mgr\"}, {__name__=\"ceph_disk_occupation\", ceph_daemon=\"osd.1\", container=\"mgr\", device=\"sdc\", endpoint=\"http-metrics\", exported_instance=\"coll-ocs-westeurope-3\", instance=\"coll-ocs-westeurope-3\", job=\"rook-ceph-mgr\", namespace=\"openshift-storage\", pod=\"rook-ceph-mgr-a-77cd89576b-9d4nl\", service=\"rook-ceph-mgr\"}];many-to-many matching not allowed: matching labels must be unique on one side"
{__name__=\"ceph_disk_occupation\", ceph_daemon=\"osd.4\", container=\"mgr\", device=\"sdc\", endpoint=\"http-metrics\", exported_instance=\"coll-ocs-westeurope-3\", instance=\"coll-ocs-westeurope-3\", job=\"rook-ceph-mgr\", namespace=\"openshift-storage\", pod=\"rook-ceph-mgr-a-77cd89576b-9d4nl\", service=\"rook-ceph-mgr\"}
{__name__=\"ceph_disk_occupation\", ceph_daemon=\"osd.1\", container=\"mgr\", device=\"sdc\", endpoint=\"http-metrics\", exported_instance=\"coll-ocs-westeurope-3\", instance=\"coll-ocs-westeurope-3\", job=\"rook-ceph-mgr\", namespace=\"openshift-storage\", pod=\"rook-ceph-mgr-a-77cd89576b-9d4nl\", service=\"rook-ceph-mgr\"}
Because both series carry the same label values, in particular the same device ("sdc"), the rule fails to evaluate.
Version-Release number of selected component (if applicable): 4.7
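The report notes that the rule should be modified. One possible shape for such a modification (a sketch only, not a confirmed or shipped fix) is to collapse the ceph_disk_occupation side of the join to a single series per (instance, device), for example with an extra max by (instance, device) aggregation; this drops the ceph_daemon label, so two OSDs reporting the same device on the same host can no longer produce duplicate series in the match group:

record: cluster:ceph_disk_latency:join_ceph_node_disk_irate1m
expr: |
  avg(
    max by (instance) (
      # collapse duplicates (e.g. osd.1 and osd.4 on the same device) to one series
      max by (instance, device) (
        label_replace(
          label_replace(ceph_disk_occupation{job="rook-ceph-mgr"},
            "instance", "$1", "exported_instance", "(.*)"),
          "device", "$1", "device", "/dev/(.*)")
      )
      * on (instance, device) group_right()
      (
        irate(node_disk_read_time_seconds_total[1m])
        + irate(node_disk_write_time_seconds_total[1m])
          / (clamp_min(irate(node_disk_reads_completed_total[1m]), 1)
             + irate(node_disk_writes_completed_total[1m]))
      )
    )
  )

Everything else in the expression is kept exactly as it appears in the failing rule; only the inner max by (instance, device) aggregation is new.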
Moving to OpenShift Container Storage; I had guessed the ceph-monitoring component.
Comment 11 - arun kumar mohan
2021-10-13 07:18:46 UTC
Hi,
I tried to reproduce the issue, but I was unable to do so on a new cluster setup.
In German's result, we can see the same device ("sdc") reported by two different OSDs (osd.1 and osd.4) on the same host, which should not happen in a common scenario (a query to check for this condition is sketched below, after this comment).
German, could you let me know how we can reproduce the issue?
@gparente
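To check whether a cluster is actually in the state shown above (more than one OSD claiming the same device on one host), a query of roughly this shape could be run against the cluster Prometheus. This is only a suggestion for verification, not a step that was taken in this report:

count by (exported_instance, device) (
  ceph_disk_occupation{job="rook-ceph-mgr"}
) > 1

Any series returned means that ceph_disk_occupation reports the same device from more than one ceph_daemon on that host, which is exactly the condition that makes the match group non-unique on the left-hand side of the join.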
Comment 13 - arun kumar mohan
2021-10-13 08:39:59 UTC
Thanks for the quick reply, German.
In that case, can you check with the customer what steps they took to hit this issue, and whether it is always reproducible?
Without reproducing the issue, I'm unable to move forward. =(