Bug 1614509 - KubeQuotaExceeded alert in default prometheus rules extremely noisy
Summary: KubeQuotaExceeded alert in default prometheus rules extremely noisy
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 3.10.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 3.11.0
Assignee: Frederic Branczyk
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-08-09 18:40 UTC by Justin Pierce
Modified: 2018-12-21 15:23 UTC (History)
0 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-12-21 15:23:06 UTC
Target Upstream Version:
Embargoed:



Description Justin Pierce 2018-08-09 18:40:57 UTC
Description of problem:
In the publicly accessible multi-tenant starter environments, users are intentionally restricted to 1 PVC, 2 CPUs, etc.

The current KubeQuotaExceeded alert expression is:

  100 * kube_resourcequota{job="kube-state-metrics",type="used"}
    / ignoring(instance, job, type) kube_resourcequota{job="kube-state-metrics",type="hard"}
    > 90

It will invariably fire once the user consumes their single PVC, because 1 of 1 is 100% of the quota. This results in potentially thousands of warnings like the following:
alertname="KubeQuotaExceeded" endpoint="https-main" namespace="jmp-test15" pod="kube-state-metrics-b44488686-p54bf" resource="persistentvolumeclaims" resourcequota="object-counts" service="kube-state-metrics" severity="warning"

CPU seems to be another culprit since in a project limited to 2 CPUs, a user will frequently explicitly allocate the CPUs among different pods.

Version-Release number of selected component (if applicable):
v3.10

How reproducible:
100%

Steps to Reproduce:
1. Set up an object limit of 1 for PVCs and use that PVC in a project

Actual results:
The alert is impractical for low integer values.

Expected results:
This 90% alert should only be a threshold for directly measured / large float values. Restrict to actual resourcequota="object-counts" with hard limits > 10 ?
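To illustrate why the 90% threshold behaves poorly for small integer quotas, here is a minimal Python sketch of the same arithmetic as the alert expression. The function names are hypothetical and exist only for this illustration; the sketch does not query Prometheus, it just reproduces the `100 * used / hard > 90` comparison.

```python
from typing import List


def quota_percent(used: float, hard: float) -> float:
    """Mirror the arithmetic of the alert expression: 100 * used / hard."""
    return 100 * used / hard


def fires(used: float, hard: float, threshold: float = 90) -> bool:
    """True when the usage percentage is strictly above the threshold."""
    return quota_percent(used, hard) > threshold


def early_warning_levels(hard: int) -> List[int]:
    """Usage counts below the hard limit that already trigger the alert."""
    return [used for used in range(hard) if fires(used, hard)]


if __name__ == "__main__":
    # Hard limits chosen for illustration only.
    for hard in (1, 2, 10, 11, 100):
        print(hard, fires(hard, hard), early_warning_levels(hard))
```

For any integer hard limit of 10 or less, there is no usage count below the limit that exceeds 90%, so the alert can only fire when the quota is already fully consumed. With a limit of 1 PVC, that means it fires as soon as the user creates their one allowed claim.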

Additional info:
I realize there are ways to quiet this alert in the starter environment, but since this alert is general purpose, I wanted to suggest that it is not generally applicable in its current form.

Comment 1 Frederic Branczyk 2018-08-17 13:20:27 UTC
By default, this alert should apply only to OpenShift components; in those cases we do want to know when we are approaching 100% of quota (if any quotas are set in the first place).

We need to evaluate whether we can still accomplish this for 3.11. I'm putting this into 3.11 for now, but it has low priority compared to other issues, since Alertmanager routes can be chosen to avoid the noise (not great, but a solution for the time being).

Comment 2 Frederic Branczyk 2018-09-06 08:27:14 UTC
This has been fixed in this PR: https://github.com/openshift/cluster-monitoring-operator/pull/88

Comment 3 Frederic Branczyk 2018-09-07 09:01:32 UTC
The PR has been merged. Please verify.

Comment 5 Junqi Zhao 2018-09-20 05:47:13 UTC
Tested with ose-cluster-monitoring-operator:v3.11.7. No KubeQuotaExceeded alert fires for user projects in the OSO starter environment; the alert now applies only to the openshift.*|kube.*|default|logging projects.
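As a rough illustration of the namespace scoping described in the verification comment above, the following Python sketch checks which project names the pattern would cover. The pattern is copied from that comment; treating it as a fully anchored match (the way Prometheus regex label matchers behave) is an assumption of this sketch, not a statement about how the merged rule is actually written. The function name is hypothetical.

```python
import re

# Pattern copied from the verification comment. Whether the shipped rule uses
# exactly this pattern as a namespace label matcher is an assumption.
SYSTEM_NAMESPACES = re.compile(r"openshift.*|kube.*|default|logging")


def alert_applies(namespace: str) -> bool:
    """True when the whole namespace name matches the pattern (fully anchored)."""
    return SYSTEM_NAMESPACES.fullmatch(namespace) is not None


if __name__ == "__main__":
    # "jmp-test15" is the user project from the original report; the other
    # names are illustrative examples.
    for ns in ("jmp-test15", "openshift-monitoring", "kube-system", "default", "logging"):
        print(ns, alert_applies(ns))
```

Under this assumption, a user project such as jmp-test15 falls outside the pattern, which matches the tested behavior that user projects no longer trigger the alert.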

Comment 6 Luke Meyer 2018-12-21 15:23:06 UTC
Closing bugs that were verified and targeted for GA but for some reason were not picked up by errata. This bug fix should be present in current 3.11 release content.

