Since it landed (in master/4.1), CPUThrottlingHigh has lacked the namespace=~ filters that its sibling alerts have [1]. It should probably grow something like:

  namespace=~"(openshift-.*|kube-.*|default|logging)"

which seems to be the popular pattern. That would avoid pestering the cluster admin when a pod in a user namespace is pegging its CPU limit. I'm assuming there are separate alerts that cluster admins would receive if they forgot to set quotas on their users, and a user either set a high limit or forgot to limit CPU and ended up leaving a given compute pool starved for CPU.

Spot-checking master:

  $ git --no-pager log --oneline -1
  d8c6e775 (HEAD -> master, origin/release-4.7, origin/release-4.6, origin/master, origin/HEAD) Merge pull request #820 from simonpasquier/fix-alertmanagerconfiginconsistent-alert
  $ yaml2json <assets/prometheus-k8s/rules.yaml | jq -r '.spec.groups[].rules[] | select(.alert != null and (.expr | contains("namespace=~") | not)) | .alert + " " + .expr'
  ClusterMonitoringOperatorReconciliationErrors rate(cluster_monitoring_operator_reconcile_errors_total[15m]) * 100 / rate(cluster_monitoring_operator_reconcile_attempts_total[15m]) > 10
  AlertmanagerReceiversNotConfigured cluster:alertmanager_routing_enabled:max == 0
  MultipleContainersOOMKilled sum(max by(namespace, container, pod) (increase(kube_pod_container_status_restarts_total[12m])) and max by(namespace, container, pod) (kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}) == 1) > 5
  KubeStateMetricsListErrors (sum(rate(kube_state_metrics_list_total{job="kube-state-metrics",result="error"}[5m])) / sum(rate(kube_state_metrics_list_total{job="kube-state-metrics"}[5m]))) > 0.01
  KubeStateMetricsWatchErrors (sum(rate(kube_state_metrics_watch_total{job="kube-state-metrics",result="error"}[5m])) / sum(rate(kube_state_metrics_watch_total{job="kube-state-metrics"}[5m]))) > 0.01
  ...many more...

So there are many more alerts that seem to rely on "nobody outside of the core components will name a job kube-state-metrics" and similarly brittle expressions. It's probably worth auditing the existing alerts and adding namespace filters to every one where we can. That will make life better for cluster admins.

It's not clear to me where that leaves other users. Are they now on the hook to define their own alerts if they want to hear about CPUThrottlingHigh, etc.? Or am I just missing a way to define a generic alert and then distinguish between "admins will want to hear about this" and "only users need to hear about this; admins can ignore"?

[1]: https://github.com/openshift/cluster-monitoring-operator/commit/5b77682a1baaa29519c53a80835c889b8cb15be5#diff-1dca87d186c04a487d72e52ab0b4dde5R824-R826
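For anyone who wants to repeat the audit without yaml2json and jq, the spot check in the comment above can be sketched in Python. This is only a rough equivalent of the jq filter, not the tooling actually used here. The function name alerts_without_namespace_filter and the SAMPLE document are hypothetical; SAMPLE is a made-up miniature of the rules file (only AlertmanagerReceiversNotConfigured is taken from the output above), already parsed into a dict.

```python
def alerts_without_namespace_filter(rule_doc):
    """Return (alert, expr) pairs whose expression has no namespace=~ matcher.

    Mirrors the jq filter: keep rules that define an alert and whose
    expression text does not contain the substring 'namespace=~'.
    """
    found = []
    for group in rule_doc["spec"]["groups"]:
        for rule in group["rules"]:
            # Recording rules carry "record" instead of "alert"; skip them.
            if rule.get("alert") is None:
                continue
            if "namespace=~" not in rule["expr"]:
                found.append((rule["alert"], rule["expr"]))
    return found


# Hypothetical miniature of the parsed rules file (assumption, for illustration only).
SAMPLE = {
    "spec": {
        "groups": [
            {
                "name": "example-group",
                "rules": [
                    # Taken from the spot-check output: no namespace filter.
                    {
                        "alert": "AlertmanagerReceiversNotConfigured",
                        "expr": "cluster:alertmanager_routing_enabled:max == 0",
                    },
                    # Made-up alert that already carries the popular filter.
                    {
                        "alert": "ExampleFilteredAlert",
                        "expr": 'kube_pod_info{namespace=~"(openshift-.*|kube-.*|default|logging)"} == 0',
                    },
                    # Made-up recording rule; never reported.
                    {"record": "example:recording:rule", "expr": "sum(up)"},
                ],
            }
        ]
    }
}

if __name__ == "__main__":
    for alert, expr in alerts_without_namespace_filter(SAMPLE):
        print(alert, expr)
```

Like the jq version, this is a plain substring check, so it would also miss alerts that restrict namespaces some other way (for example with namespace= or namespace!~). It is a first pass for the audit, not a definitive list.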
From a discussion I had with Lili on Slack: for PodDisruptionBudgetAtLimit in particular, we can't introduce that namespace limitation. The alert was added in response to an upgrade problem described in https://bugzilla.redhat.com/show_bug.cgi?id=1762888, where a faulty PDB was preventing the cluster from evicting pods and thus blocking the upgrade. PodDisruptionBudgetAtLimit is therefore a safety measure for the entire cluster, and the admin should be notified of any misconfigured PDB, since it will cause problems during upgrades.
About the 4.1 version: I set that because that's when the open CPUThrottlingHigh landed. Deciding how far any fixes get backported will depend on the bug's severity and the maintenance phase of the z stream in question. For example, 4.1 is end-of-life, so there is no need to backport there (even though the bug exists in 4.1). 4.2 and 4.3 are both currently in the maintenance phase [1], where we are only committed to backporting "Urgent and Selected High Priority" fixes, so at the current "medium" severity (conflating severity with priority?), we would not need to backport this bug to those releases either.

[1]: https://access.redhat.com/support/policy/updates/openshift#dates
Blocked by Bug 1851675 - open pkg/graphql/schema.graphql: no such file or directory, can't install ENV.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196