
Bug 1614509

Summary: KubeQuotaExceeded alert in default prometheus rules extremely noisy
Product: OpenShift Container Platform
Reporter: Justin Pierce <jupierce>
Component: Monitoring
Assignee: Frederic Branczyk <fbranczy>
Status: CLOSED CURRENTRELEASE
QA Contact: Junqi Zhao <juzhao>
Severity: unspecified
Priority: unspecified
Version: 3.10.0
Target Release: 3.11.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2018-12-21 15:23:06 UTC
Type: Bug

Description Justin Pierce 2018-08-09 18:40:57 UTC
Description of problem:
In the publicly accessible multi-tenant starter environments, users are intentionally restricted to 1 PVC, 2 CPUs, etc.

The current KubeQuotaExceeded alert expression,

  100 * kube_resourcequota{job="kube-state-metrics",type="used"}
    / ignoring(instance, job, type)
  kube_resourcequota{job="kube-state-metrics",type="hard"}
    > 90

will invariably fire once the user consumes their single PVC, because 1 of 1 is 100% usage. This results in potentially thousands of warnings like the following:
alertname="KubeQuotaExceeded" endpoint="https-main" namespace="jmp-test15" pod="kube-state-metrics-b44488686-p54bf" resource="persistentvolumeclaims" resourcequota="object-counts" service="kube-state-metrics" severity="warning"

CPU seems to be another culprit: in a project limited to 2 CPUs, a user will frequently allocate the full 2 CPUs explicitly among different pods, which also puts usage at 100% of the quota.
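
To make the two cases above concrete, here is a minimal Python sketch of the arithmetic the alert expression performs. The function names and the sample quota values are illustrative assumptions, not taken from kube-state-metrics or the actual rule file; only the `100 * used / hard > 90` comparison comes from the expression quoted in this report.

```python
def quota_percent(used: float, hard: float) -> float:
    """Percentage of a quota in use, mirroring 100 * used / hard."""
    return 100 * used / hard


def would_alert(used: float, hard: float, threshold: float = 90.0) -> bool:
    """True when usage is strictly above the threshold, as in '> 90'."""
    return quota_percent(used, hard) > threshold


# Hypothetical starter-tier quotas: (resource, used, hard)
samples = [
    ("persistentvolumeclaims", 1, 1),  # the only PVC is in use
    ("cpu", 2.0, 2.0),                 # two pods with 1 CPU each
    ("cpu", 1.5, 2.0),                 # some CPU left unallocated
]

for resource, used, hard in samples:
    state = "ALERT" if would_alert(used, hard) else "ok"
    print(f"{resource}: {quota_percent(used, hard):.0f}% -> {state}")
```

Under these assumed values, both the fully used PVC quota and the fully allocated CPU quota cross the threshold, even though the user is doing exactly what the quota is meant to allow.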

Version-Release number of selected component (if applicable):
v3.10

How reproducible:
100%

Steps to Reproduce:
1. Set up an object limit of 1 for PVCs and use that PVC in a project

Actual results:
The alert is impractical for low integer values.

Expected results:
This 90% threshold should only apply to directly measured / large float values. Perhaps restrict it to actual resourcequota="object-counts" with hard limits > 10?
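
The "hard limits > 10" suggestion can be motivated with integer arithmetic: for an object-count quota, the alert can only serve as an early warning if some usage below the limit already exceeds 90%. The sketch below is an illustration under that reading; the helper name `min_used_to_alert` is hypothetical.

```python
def min_used_to_alert(hard: int, threshold_percent: int = 90) -> int:
    """Smallest integer usage u such that 100 * u / hard > threshold_percent.

    Integer arithmetic is used to avoid floating-point rounding at the boundary.
    """
    return (threshold_percent * hard) // 100 + 1


for hard in (1, 2, 5, 10, 11, 20, 100):
    used = min_used_to_alert(hard)
    # headroom: objects that can still be created after the alert first fires
    print(f"hard={hard:>3}  alerts at used={used:>3}  headroom={hard - used}")
```

For every hard limit from 1 through 10 the alert first fires only when the quota is completely used, so it gives no advance warning; from 11 upward at least one object of headroom remains when it fires.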

Additional info:
I realize there are ways to quiet this alert in the starter environment, but since this alert is general purpose, I wanted to point out that it is not generally applicable in its current form.

Comment 1 Frederic Branczyk 2018-08-17 13:20:27 UTC
We should only make this alert apply to OpenShift components by default. In those cases we do want to know when we are approaching 100% quota (if any are set in the first place).

We need to evaluate whether we can still accomplish this for 3.11. I'm putting this into 3.11 for now, but it has low priority compared to other issues, since Alertmanager routes can be chosen to avoid the noise (not great, but a solution for the time being).

Comment 2 Frederic Branczyk 2018-09-06 08:27:14 UTC
This has been fixed in this PR: https://github.com/openshift/cluster-monitoring-operator/pull/88

Comment 3 Frederic Branczyk 2018-09-07 09:01:32 UTC
The PR has been merged. Please verify.

Comment 5 Junqi Zhao 2018-09-20 05:47:13 UTC
Tested with ose-cluster-monitoring-operator:v3.11.7. No KubeQuotaExceeded alert fires for user projects in the OSO starter env; the alert now applies only to the openshift.*|kube.*|default|logging projects.
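
As a rough way to check which projects fall under that pattern, the Python sketch below applies it with full-string matching, which is how Prometheus treats regex label matchers. The exact matcher used in the merged rule is not quoted in this bug, so the pattern is taken from the verification comment above and the function name is hypothetical.

```python
import re

# Pattern as reported in the verification comment; assumed to be applied
# to the namespace label in the shipped rule.
MONITORED = re.compile(r"openshift.*|kube.*|default|logging")


def alert_applies(namespace: str) -> bool:
    """True if the whole namespace name matches the pattern."""
    return MONITORED.fullmatch(namespace) is not None


for ns in ("jmp-test15", "openshift-monitoring", "kube-system", "default", "my-logging"):
    print(f"{ns}: {'covered' if alert_applies(ns) else 'ignored'}")
```

With full-string matching, a user project such as jmp-test15 from the original report is ignored, and so is a name that merely contains one of the alternatives, such as my-logging.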

Comment 6 Luke Meyer 2018-12-21 15:23:06 UTC
Closing bugs that were verified and targeted for GA but for some reason were not picked up by errata. This bug fix should be present in current 3.11 release content.