Bug 1806640

Summary: PodDisruptionBudgetAtLimit and PodDisruptionBudgetLimit alerts may trigger evaluation errors
Product: OpenShift Container Platform
Component: kube-controller-manager
Version: 4.3.z
Target Release: 4.5.0
Reporter: Simon Pasquier <spasquie>
Assignee: Maciej Szulik <maszulik>
QA Contact: zhou ying <yinzhou>
CC: aos-bugs, lcosic, mfojtik, nmoraiti
Status: CLOSED ERRATA
Severity: low
Priority: medium
Type: Bug
Doc Type: No Doc Update
Last Closed: 2020-07-13 17:20:43 UTC

Description Simon Pasquier 2020-02-24 16:42:24 UTC
Description of problem:
The PodDisruptionBudgetAtLimit and PodDisruptionBudgetLimit alert rules may trigger evaluation errors when kube-state-metrics runs with more than one replica.

Version-Release number of selected component (if applicable):
4.3.0

How reproducible:
Always

Steps to Reproduce:
1. Scale down the cluster-version-operator (CVO) deployment from 1 to 0 replicas (openshift-cluster-version namespace).
2. Scale down the cluster-monitoring-operator (CMO) deployment from 1 to 0 replicas (openshift-monitoring namespace).
3. Scale the kube-state-metrics deployment from 1 to 2 replicas (openshift-monitoring namespace).
4. Open the Prometheus UI (linked from the Monitoring section of the OpenShift console).
5. Click Status > Rules and look for "PodDisruptionBudgetAtLimit" and "PodDisruptionBudgetLimit".

Actual results:
The alert evaluations fail with the following errors:

found duplicate series for the match group {namespace="openshift-machine-config-operator", poddisruptionbudget="etcd-quorum-guard", service="kube-state-metrics"} on the right hand-side of the operation: [{__name__="kube_poddisruptionbudget_status_desired_healthy", endpoint="https-main", instance="10.131.0.3:8443", job="kube-state-metrics", namespace="openshift-machine-config-operator", pod="kube-state-metrics-777f6bf798-kq7tj", poddisruptionbudget="etcd-quorum-guard", service="kube-state-metrics"}, {__name__="kube_poddisruptionbudget_status_desired_healthy", endpoint="https-main", instance="10.129.2.11:8443", job="kube-state-metrics", namespace="openshift-machine-config-operator", pod="kube-state-metrics-777f6bf798-bzmnt", poddisruptionbudget="etcd-quorum-guard", service="kube-state-metrics"}];many-to-many matching not allowed: matching labels must be unique on one side

found duplicate series for the match group {namespace="openshift-machine-config-operator", poddisruptionbudget="etcd-quorum-guard", service="kube-state-metrics"} on the right hand-side of the operation: [{__name__="kube_poddisruptionbudget_status_desired_healthy", endpoint="https-main", instance="10.131.0.3:8443", job="kube-state-metrics", namespace="openshift-machine-config-operator", pod="kube-state-metrics-777f6bf798-kq7tj", poddisruptionbudget="etcd-quorum-guard", service="kube-state-metrics"}, {__name__="kube_poddisruptionbudget_status_desired_healthy", endpoint="https-main", instance="10.129.2.11:8443", job="kube-state-metrics", namespace="openshift-machine-config-operator", pod="kube-state-metrics-777f6bf798-bzmnt", poddisruptionbudget="etcd-quorum-guard", service="kube-state-metrics"}];many-to-many matching not allowed: matching labels must be unique on one side

Expected results:
No alert evaluation error.

Additional info:
The 'on (namespace, poddisruptionbudget, service)' matching modifier could be omitted. For instance, PodDisruptionBudgetLimit can be rewritten as:

kube_poddisruptionbudget_status_expected_pods < kube_poddisruptionbudget_status_desired_healthy

If you really want to drop the kube-state-metrics instance labels, the expression can be wrapped in the max aggregation operator. For instance:

max by(namespace, poddisruptionbudget, service) (kube_poddisruptionbudget_status_expected_pods < kube_poddisruptionbudget_status_desired_healthy)
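The failure mode and the fix above can be sketched outside PromQL. The following is a minimal Python illustration (not the real Prometheus evaluation engine; labels and values are simplified, hypothetical pod names) of one-to-one vector matching: two kube-state-metrics replicas produce duplicate series per match group, which makes the comparison fail, while aggregating with max first collapses the duplicates so the comparison evaluates.

```python
from collections import defaultdict

MATCH_LABELS = ("namespace", "poddisruptionbudget")

def group_key(series):
    return tuple(series[l] for l in MATCH_LABELS)

def one_to_one_compare(lhs, rhs, op):
    """Mimic `lhs <op> on(...) rhs`: each match group must be unique on both sides."""
    rhs_groups = defaultdict(list)
    for s in rhs:
        rhs_groups[group_key(s)].append(s)
    out = []
    for s in lhs:
        matches = rhs_groups[group_key(s)]
        if len(matches) > 1:
            # This is the situation Prometheus reports as
            # "found duplicate series for the match group ...".
            raise ValueError("duplicate series for match group %r" % (group_key(s),))
        if matches and op(s["value"], matches[0]["value"]):
            out.append(s)
    return out

def max_by(series_list):
    """Mimic `max by(namespace, poddisruptionbudget) (...)`: keep one series per group."""
    best = {}
    for s in series_list:
        k = group_key(s)
        if k not in best or s["value"] > best[k]["value"]:
            best[k] = {**{l: s[l] for l in MATCH_LABELS}, "value": s["value"]}
    return list(best.values())

# Two kube-state-metrics pods each export the same PDB metric (hypothetical pods).
expected = [
    {"namespace": "openshift-machine-config-operator", "poddisruptionbudget": "etcd-quorum-guard", "pod": "ksm-a", "value": 3},
    {"namespace": "openshift-machine-config-operator", "poddisruptionbudget": "etcd-quorum-guard", "pod": "ksm-b", "value": 3},
]
desired = [
    {"namespace": "openshift-machine-config-operator", "poddisruptionbudget": "etcd-quorum-guard", "pod": "ksm-a", "value": 3},
    {"namespace": "openshift-machine-config-operator", "poddisruptionbudget": "etcd-quorum-guard", "pod": "ksm-b", "value": 3},
]

try:
    one_to_one_compare(expected, desired, lambda a, b: a == b)
except ValueError as e:
    print("evaluation error:", e)

# Aggregating both sides first leaves one series per group, so it evaluates cleanly.
result = one_to_one_compare(max_by(expected), max_by(desired), lambda a, b: a == b)
print(len(result))  # one firing series per PDB
```

Note that the sketch aggregates both sides before comparing, which is exactly what `max by(...) (lhs == rhs)` achieves in PromQL: duplicate series from multiple kube-state-metrics replicas collapse into one series per pod disruption budget.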

Comment 1 Simon Pasquier 2020-03-10 13:48:23 UTC
*** Bug 1810947 has been marked as a duplicate of this bug. ***

Comment 2 Michal Fojtik 2020-05-12 10:32:23 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet.

As such, we're marking this bug as "LifecycleStale" and decreasing severity from "medium" to "low".

If you have further information on the current state of the bug, please update it, otherwise this bug will be automatically closed in 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.

Comment 3 Simon Pasquier 2020-05-12 12:08:58 UTC
The issue still exists. Since the fix is trivial and described in the ticket, I've sent a PR.

Comment 4 Michal Fojtik 2020-05-14 11:51:18 UTC
(In reply to Simon Pasquier from comment #3)
> The issue still exists. Since the fix is trivial and described in the
> ticket, I've sent a PR.

Thanks! Moving this back to backlog.

Comment 7 zhou ying 2020-05-18 13:47:41 UTC
Can't reproduce the issue now with payload: 4.5.0-0.nightly-2020-05-18-012833

[root@dhcp-140-138 ~]# oc get deployment
NAME                          READY   UP-TO-DATE   AVAILABLE   AGE
cluster-monitoring-operator   0/0     0            0           7h55m
grafana                       1/1     1            1           7h34m
kube-state-metrics            2/2     2            2           7h44m


[root@dhcp-140-138 ~]# oc get deployment -n openshift-cluster-version
NAME                       READY   UP-TO-DATE   AVAILABLE   AGE
cluster-version-operator   0/0     0            0           8h


alert: PodDisruptionBudgetAtLimit
expr: max
  by(namespace, poddisruptionbudget) (kube_poddisruptionbudget_status_expected_pods
  == kube_poddisruptionbudget_status_desired_healthy)
for: 15m
labels:
  severity: warning
annotations:
  message: The pod disruption budget is preventing further disruption to pods because
    it is at the minimum allowed level.
OK		19.057s ago	345.8us
alert: PodDisruptionBudgetLimit
expr: max
  by(namespace, poddisruptionbudget) (kube_poddisruptionbudget_status_expected_pods
  < kube_poddisruptionbudget_status_desired_healthy)
for: 15m
labels:
  severity: critical
annotations:
  message: The pod disruption budget is below the minimum number allowed pods.

Comment 9 errata-xmlrpc 2020-07-13 17:20:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

Comment 10 Jan Chaloupka 2021-03-22 08:53:57 UTC
*** Bug 1940392 has been marked as a duplicate of this bug. ***