Bug 1850717

Summary: CPUThrottlingHigh and other alerts lack namespace restrictions
Product: OpenShift Container Platform
Component: Monitoring
Version: 4.1.z
Target Release: 4.6.0
Reporter: W. Trevor King <wking>
Assignee: Lili Cosic <lcosic>
QA Contact: hongyan li <hongyli>
CC: aabhishe, alegrand, anpicker, erooth, kakkoyun, lcosic, maszulik, mloibl, pkrupa, surbania
Status: CLOSED ERRATA
Severity: medium
Priority: unspecified
Hardware: Unspecified
OS: Unspecified
Type: Bug
Doc Type: No Doc Update
Bug Blocks: 1851873
Last Closed: 2020-10-27 16:09:42 UTC

Description W. Trevor King 2020-06-24 19:31:51 UTC
Since it landed (in master/4.1), CPUThrottlingHigh has lacked the namespace=~"..." filters that its sibling alerts have [1].  It should probably grow something like:

  namespace=~"(openshift-.*|kube-.*|default|logging)",

which seems to be the popular pattern.  That would avoid pestering the cluster admin when a pod in a user namespace is pegging its CPU limit.  I'm assuming there are separate alerts that cluster admins would receive if they forgot to set quotas on their users and a user either set a high limit or forgot to limit CPU at all, leaving a given compute pool starved for CPU.  Spot-checking master:

  $ git --no-pager log --oneline -1
  d8c6e775 (HEAD -> master, origin/release-4.7, origin/release-4.6, origin/master, origin/HEAD) Merge pull request #820 from simonpasquier/fix-alertmanagerconfiginconsistent-alert
  $ yaml2json <assets/prometheus-k8s/rules.yaml | jq -r '.spec.groups[].rules[] | select(.alert != null and (.expr | contains("namespace=~") | not)) | .alert + " " + .expr'
  ClusterMonitoringOperatorReconciliationErrors rate(cluster_monitoring_operator_reconcile_errors_total[15m]) * 100 / rate(cluster_monitoring_operator_reconcile_attempts_total[15m]) > 10
  AlertmanagerReceiversNotConfigured cluster:alertmanager_routing_enabled:max == 0
  MultipleContainersOOMKilled sum(max by(namespace, container, pod) (increase(kube_pod_container_status_restarts_total[12m])) and max by(namespace, container, pod) (kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}) == 1) > 5
  KubeStateMetricsListErrors (sum(rate(kube_state_metrics_list_total{job="kube-state-metrics",result="error"}[5m]))
    /
  sum(rate(kube_state_metrics_list_total{job="kube-state-metrics"}[5m])))
  > 0.01
  
  KubeStateMetricsWatchErrors (sum(rate(kube_state_metrics_watch_total{job="kube-state-metrics",result="error"}[5m]))
    /
  sum(rate(kube_state_metrics_watch_total{job="kube-state-metrics"}[5m])))
  > 0.01
  ...many more...

so there are many more of these that seem to rely on assumptions like "nobody outside of the core components will name a job kube-state-metrics" and similarly brittle expressions.  It's probably worth auditing the existing alerts and adding namespace filters to as many of them as we can.
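
To make the proposal concrete, here is a sketch of what a filtered CPUThrottlingHigh could look like, assuming the rule is still built on the cAdvisor CFS throttling counters; the exact expression and threshold in rules.yaml may differ:

  - alert: CPUThrottlingHigh
    expr: |
      # namespace matcher added to both halves of the ratio; the regexp is
      # the same one the sibling alerts already use
      sum(increase(container_cpu_cfs_throttled_periods_total{container!="", namespace=~"(openshift-.*|kube-.*|default|logging)"}[5m])) by (container, pod, namespace)
        /
      sum(increase(container_cpu_cfs_periods_total{namespace=~"(openshift-.*|kube-.*|default|logging)"}[5m])) by (container, pod, namespace)
        > ( 25 / 100 )
    for: 15m
    labels:
      severity: warning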

This will make life better for cluster admins.  It's not clear to me where this leaves other users.  Are they now on the hook to define their own alerts if they want to hear about CPUThrottlingHigh, etc.?  Or am I just missing a way to define a generic alert and then distinguish between "admins will want to hear about this" and "only users need to hear about this; admins can ignore it"?

[1]: https://github.com/openshift/cluster-monitoring-operator/commit/5b77682a1baaa29519c53a80835c889b8cb15be5#diff-1dca87d186c04a487d72e52ab0b4dde5R824-R826
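
If the answer to the question above is "users define their own", a PrometheusRule in the user's namespace could be a rough starting point.  This is only a sketch: the names are made up, and it assumes the evaluating Prometheus can actually see the cAdvisor container metrics, which may not hold for user workload monitoring.

  apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
    name: cpu-throttling        # hypothetical name
    namespace: my-app           # hypothetical user namespace
  spec:
    groups:
    - name: my-app.cpu
      rules:
      - alert: UserCPUThrottlingHigh
        expr: |
          sum(increase(container_cpu_cfs_throttled_periods_total{container!="", namespace="my-app"}[5m])) by (container, pod, namespace)
            /
          sum(increase(container_cpu_cfs_periods_total{namespace="my-app"}[5m])) by (container, pod, namespace)
            > ( 25 / 100 )
        for: 15m
        labels:
          severity: warning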

Comment 3 Maciej Szulik 2020-06-25 15:21:04 UTC
From a discussion I've had with Lili on Slack: for PodDisruptionBudgetAtLimit in particular,
we can't introduce that namespace restriction.  The alert was added in response to an upgrade
problem described in https://bugzilla.redhat.com/show_bug.cgi?id=1762888, where a faulty PDB
was preventing the cluster from evicting pods and thus blocking the upgrade.  That's why
PodDisruptionBudgetAtLimit is a safety measure for the entire cluster, and the admin should be
notified of any misconfigured PDB, since it will cause problems during upgrades.
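
For context, the cluster-wide check being defended here can be sketched roughly as follows, built on the kube-state-metrics PDB metrics; the expression actually shipped may differ in detail:

  - alert: PodDisruptionBudgetAtLimit
    expr: |
      # deliberately no namespace filter: a PDB at its limit in any
      # namespace allows zero further disruptions, which can block node
      # drains and therefore upgrades
      max by(namespace, poddisruptionbudget) (
        kube_poddisruptionbudget_status_current_healthy == kube_poddisruptionbudget_status_desired_healthy
      )
    for: 15m
    labels:
      severity: warning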

Comment 4 W. Trevor King 2020-06-25 16:29:42 UTC
About the 4.1 version: I set that because that's when the CPUThrottlingHigh alert landed.  Deciding how far any fixes get backported will depend on the bug's severity and the maintenance phase of the z stream in question.  E.g. 4.1 is end-of-life, so there is no need to backport there (even though the bug exists in 4.1).  4.2 and 4.3 are both currently in the maintenance phase [1], where we are only committed to backporting "Urgent and Selected High Priority" fixes, so at the current "medium" severity (conflating severity with priority?), we would not need to backport this bug to those releases either.

[1]: https://access.redhat.com/support/policy/updates/openshift#dates

Comment 9 hongyan li 2020-06-29 06:12:15 UTC
Blocked by Bug 1851675 - "open pkg/graphql/schema.graphql: no such file or directory"; can't install the environment.

Comment 12 errata-xmlrpc 2020-10-27 16:09:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to
find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196