Description of problem:
The monitoring workloads use system-critical priority classes, which causes problems when monitoring consumes excessive memory: the nodes cannot evict the pods. The monitoring priority will be dropped to give the scheduler more flexibility to move these heavy workloads around and keep critical nodes alive.

Version-Release number of selected component (if applicable):
4.7 (probably affects older versions as well)

How reproducible:
Relatively easily, during upgrades

Steps to Reproduce:
1. Create a 4.6 cluster
2. Upgrade it to 4.7
3. If Prometheus uses excessive memory during the upgrade (due to the WAL re-read and excessive time-series creation), the nodes will struggle to evict the Prometheus workload, causing node unready failures.

Additional info:
The fix for this bug is a mitigation for:
https://bugzilla.redhat.com/show_bug.cgi?id=1925061
https://bugzilla.redhat.com/show_bug.cgi?id=1913532
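For illustration only (not part of the original report): a rough sketch of what the mitigation amounts to, i.e. defining a priority class below system-cluster-critical (value 2000000000) and pointing the monitoring pods at it via priorityClassName. The class name and value match what later landed in the payload (openshift-user-critical, 1000000000, see the verification comment below); the standalone Pod and its image are purely hypothetical, since the real change is made by the cluster-monitoring-operator.

  # Hypothetical sketch: a priority class lower than system-cluster-critical,
  # so the kubelet/scheduler can evict or reschedule heavy monitoring pods first.
  cat <<'EOF' | oc apply -f -
  apiVersion: scheduling.k8s.io/v1
  kind: PriorityClass
  metadata:
    name: openshift-user-critical
  value: 1000000000
  globalDefault: false
  description: "Important user workloads, but evictable before node-critical pods (illustrative)"
  ---
  apiVersion: v1
  kind: Pod
  metadata:
    name: example-monitoring-pod      # hypothetical pod, for illustration only
  spec:
    priorityClassName: openshift-user-critical
    containers:
    - name: prometheus
      image: example.com/prometheus:latest   # placeholder image
  EOF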
*** Bug 1929748 has been marked as a duplicate of this bug. ***
@ben this is effectively blocked by the API server change. We cannot merge until the API server allows us to tweak those priority classes.
Not in any payload yet.
Tested with payload 4.8.0-0.nightly-2021-03-03-192757. The prometheus pods under openshift-monitoring have priorityClassName: openshift-user-critical.

# for i in prometheus-k8s-0 prometheus-k8s-1; do echo $i; oc -n openshift-monitoring get pod $i -oyaml | grep priorityClassName; done
prometheus-k8s-0
  priorityClassName: openshift-user-critical
prometheus-k8s-1
  priorityClassName: openshift-user-critical

# oc get priorityClass
NAME                      VALUE        GLOBAL-DEFAULT   AGE
openshift-user-critical   1000000000   false            3h23m
system-cluster-critical   2000000000   false            3h24m
system-node-critical      2000001000   false            3h24m
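As a side note (not part of the original verification), the same check can be done without grep by reading the fields directly with jsonpath; .spec.priorityClassName and the admission-resolved .spec.priority are standard pod spec fields:

  # Alternative check: print the priority class name and resolved priority value per pod.
  for i in prometheus-k8s-0 prometheus-k8s-1; do
    oc -n openshift-monitoring get pod $i \
      -o jsonpath='{.metadata.name}{"\t"}{.spec.priorityClassName}{"\t"}{.spec.priority}{"\n"}'
  done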
During an upgrade from 4.7 to 4.8.0-0.nightly-2021-03-03-192757, Prometheus used at most 7G of memory (see attachment) and no node went to unready status.
Created attachment 1760669 [details]
Prometheus memory usage during the upgrade from 4.7 to 4.8
Created attachment 1760895 [details]
Prometheus memory during the 4.7 to 4.8 upgrade

The memory spike at 7G is an artifact of the query and doesn't reflect reality. After Prometheus has been upgraded, the container_memory_working_set_bytes series exist for both the old and the new instances. This can be seen by querying 'container_memory_working_set_bytes{pod=~"prometheus-k8s.*",container="prometheus"}' without summing the series (see the attached screenshot). It happens because Prometheus didn't mark the old series as stale on shutdown (as expected), so they continue to "live" for 5 minutes (i.e. the default lookback interval).

The "correct" query is 'sum by(pod) (max by(pod) (container_memory_working_set_bytes{pod=~"prometheus-k8s.*",container="prometheus"}))'.
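Not part of the original comment, but for anyone wanting to reproduce the check: a rough sketch of running the corrected query against the Prometheus HTTP API through a port-forward. It assumes the Prometheus web port (9090) is reachable via the port-forward; in a hardened cluster the query would normally go through the authenticated Thanos querier route instead.

  # Hypothetical spot-check, assuming local access to the Prometheus API on port 9090.
  oc -n openshift-monitoring port-forward pod/prometheus-k8s-0 9090:9090 &
  sleep 2
  curl -sG 'http://localhost:9090/api/v1/query' \
    --data-urlencode 'query=sum by(pod) (max by(pod) (container_memory_working_set_bytes{pod=~"prometheus-k8s.*",container="prometheus"}))'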
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438