Bug 1850717 - CPUThrottlingHigh and other alerts lack namespace restrictions
Summary: CPUThrottlingHigh and other alerts lack namespace restrictions
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.1.z
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.6.0
Assignee: Lili Cosic
QA Contact: hongyan li
URL:
Whiteboard:
Depends On:
Blocks: 1851873
 
Reported: 2020-06-24 19:31 UTC by W. Trevor King
Modified: 2020-10-27 16:10 UTC
CC: 10 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 1851873 (view as bug list)
Environment:
Last Closed: 2020-10-27 16:09:42 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-monitoring-operator pull 825 0 None closed Bug 1850717: Add namespace selector to CPUThrottlingHigh 2021-02-02 02:25:58 UTC
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 16:10:26 UTC

Internal Links: 1843346

Description W. Trevor King 2020-06-24 19:31:51 UTC
Since it landed (in master/4.1) CPUThrottlingHigh has lacked the namespace=~" filters that its sibling alerts have [1].  It should probably grow something like:

  namespace=~"(openshift-.*|kube-.*|default|logging)",

which seems popular.  That would avoid pestering the cluster admin when a pod in a user namespace is pegging its CPU limit.  I'm assuming there are separate alerts that cluster admins would receive if they forgot to set quotas on their users and a user either set a high limit or forgot to limit CPU, leaving a given compute pool starved for CPU.  Spot-checking master:

  $ git --no-pager log --oneline -1
  d8c6e775 (HEAD -> master, origin/release-4.7, origin/release-4.6, origin/master, origin/HEAD) Merge pull request #820 from simonpasquier/fix-alertmanagerconfiginconsistent-alert
  $ yaml2json <assets/prometheus-k8s/rules.yaml | jq -r '.spec.groups[].rules[] | select(.alert != null and (.expr | contains("namespace=~") | not)) | .alert + " " + .expr'
  ClusterMonitoringOperatorReconciliationErrors rate(cluster_monitoring_operator_reconcile_errors_total[15m]) * 100 / rate(cluster_monitoring_operator_reconcile_attempts_total[15m]) > 10
  AlertmanagerReceiversNotConfigured cluster:alertmanager_routing_enabled:max == 0
  MultipleContainersOOMKilled sum(max by(namespace, container, pod) (increase(kube_pod_container_status_restarts_total[12m])) and max by(namespace, container, pod) (kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}) == 1) > 5
  KubeStateMetricsListErrors (sum(rate(kube_state_metrics_list_total{job="kube-state-metrics",result="error"}[5m]))
    /
  sum(rate(kube_state_metrics_list_total{job="kube-state-metrics"}[5m])))
  > 0.01
  
  KubeStateMetricsWatchErrors (sum(rate(kube_state_metrics_watch_total{job="kube-state-metrics",result="error"}[5m]))
    /
  sum(rate(kube_state_metrics_watch_total{job="kube-state-metrics"}[5m])))
  > 0.01
  ...many more...

so there are many more of these that seem to be relying on "nobody outside of the core components will name a job kube-state-metrics" and similarly brittle expressions.  It's probably worth auditing the existing alerts and adding namespace selectors to all of them that we possibly can.
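
Going back to the suggested selector, something like the following is what I have in mind for CPUThrottlingHigh.  This is only a sketch based on the upstream kubernetes-mixin definition; the exact metric names, threshold, and severity in the shipped cluster-monitoring-operator asset may differ:

  # Sketch only: adds the namespace matcher to both cAdvisor metrics so the
  # alert stays quiet for pods outside the platform namespaces.
  - alert: CPUThrottlingHigh
    annotations:
      message: '{{ $value | humanizePercentage }} throttling of CPU in namespace {{ $labels.namespace }} for container {{ $labels.container }} in pod {{ $labels.pod }}.'
    expr: |
      sum(increase(container_cpu_cfs_throttled_periods_total{container!="", namespace=~"(openshift-.*|kube-.*|default|logging)"}[5m])) by (container, pod, namespace)
        /
      sum(increase(container_cpu_cfs_periods_total{namespace=~"(openshift-.*|kube-.*|default|logging)"}[5m])) by (container, pod, namespace)
        > ( 25 / 100 )
    for: 15m
    labels:
      severity: warning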

This will make life better for cluster admins.  It's not clear to me where it leaves other users.  Are they now on the hook to define their own alerts if they want to hear about CPUThrottlingHigh, etc.?  Or am I just missing a way to define a generic alert and then distinguish between "admins will want to hear about this" and "only users need to hear about this; admins can ignore"?
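
If the answer is "users define their own", I'd guess it ends up looking like a PrometheusRule in the user's own namespace once user workload monitoring is enabled.  A rough sketch; all names below are made up:

  apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
    name: cpu-throttling            # hypothetical
    namespace: my-app               # hypothetical user namespace
  spec:
    groups:
    - name: cpu-throttling.rules
      rules:
      - alert: UserCPUThrottlingHigh
        expr: |
          sum(increase(container_cpu_cfs_throttled_periods_total{container!="", namespace="my-app"}[5m])) by (container, pod, namespace)
            /
          sum(increase(container_cpu_cfs_periods_total{namespace="my-app"}[5m])) by (container, pod, namespace)
            > ( 25 / 100 )
        for: 15m
        labels:
          severity: warning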

[1]: https://github.com/openshift/cluster-monitoring-operator/commit/5b77682a1baaa29519c53a80835c889b8cb15be5#diff-1dca87d186c04a487d72e52ab0b4dde5R824-R826

Comment 3 Maciej Szulik 2020-06-25 15:21:04 UTC
From a discussion I've had with Lili on Slack: for PodDisruptionBudgetAtLimit in particular,
we can't introduce that namespace limitation. The alert was added in response to an upgrade
problem described in https://bugzilla.redhat.com/show_bug.cgi?id=1762888, where a faulty PDB
was preventing the cluster from evicting pods and thus blocking the upgrade. That's why
PodDisruptionBudgetAtLimit is a safety measure for the entire cluster, and the admin should be
notified of any misconfigured PDB, since it will cause problems during upgrades.
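
For context, the misconfiguration in question can be as simple as a PDB whose minAvailable equals the workload's replica count, so disruptionsAllowed stays at 0 and evictions during a node drain never succeed.  An illustrative sketch only; the names are made up:

  apiVersion: policy/v1beta1        # policy/v1 on newer clusters
  kind: PodDisruptionBudget
  metadata:
    name: my-app-pdb                # hypothetical
    namespace: my-app               # hypothetical user namespace
  spec:
    minAvailable: 3                 # equal to the Deployment's replicas, so the
                                    # eviction API can never remove a pod
    selector:
      matchLabels:
        app: my-app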

Comment 4 W. Trevor King 2020-06-25 16:29:42 UTC
About the 4.1 version: I set that because that's when the unscoped CPUThrottlingHigh alert landed.  Deciding how far any fixes get backported will depend on the bug's severity and the maintenance phase of the z-stream in question.  E.g. 4.1 is end-of-life, so there is no need to backport there (even though the bug exists in 4.1).  4.2 and 4.3 are both currently in the maintenance phase [1], where we are only committed to backporting "Urgent and Selected High Priority" fixes, so at the current "medium" severity (conflating severity with priority?), we would not need to backport this bug to those releases either.

[1]: https://access.redhat.com/support/policy/updates/openshift#dates

Comment 9 hongyan li 2020-06-29 06:12:15 UTC
Blocked by Bug 1851675 - "open pkg/graphql/schema.graphql: no such file or directory"; can't set up the test environment.

Comment 12 errata-xmlrpc 2020-10-27 16:09:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

