Bug 1929277 - Monitoring workloads using too high a priorityclass
Summary: Monitoring workloads using too high a priorityclass
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Sergiusz Urbaniak
QA Contact: hongyan li
URL:
Whiteboard:
Duplicates: 1929748 (view as bug list)
Depends On:
Blocks: 1929278
 
Reported: 2021-02-16 15:56 UTC by Ben Parees
Modified: 2021-07-27 22:45 UTC
CC List: 12 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 1929278 (view as bug list)
Environment:
Last Closed: 2021-07-27 22:44:46 UTC
Target Upstream Version:
Embargoed:


Attachments
prometheus used memory when upgrade from 4.7 to 4.8 (118.90 KB, image/png) - 2021-03-04 11:49 UTC, hongyan li
prometheus memory during 4.7 to 4.8 upgrade (699.02 KB, image/png) - 2021-03-05 12:48 UTC, Simon Pasquier


Links
Github openshift cluster-monitoring-operator pull 1063 (closed): Bug 1929277: [master] jsonnet/prometheus.jsonnet: Apply openshift-user-critical class to cluster Prometheus - 2021-03-03 08:52:47 UTC
Red Hat Product Errata RHSA-2021:2438 - 2021-07-27 22:45:22 UTC

Description Ben Parees 2021-02-16 15:56:18 UTC
Description of problem:

The monitoring workloads use a system-critical priority class, which causes problems when monitoring consumes excessive memory: the nodes cannot evict the pods.

Monitoring's priority will be lowered to give the scheduler more flexibility to move these heavy workloads around and keep critical nodes alive.
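
To see which priority class the monitoring pods currently run with, and how the priority class values compare, something like the following works (a sketch, not from the original report; it assumes a default openshift-monitoring deployment):

# show the priority class assigned to each pod in openshift-monitoring
oc -n openshift-monitoring get pods -o custom-columns=NAME:.metadata.name,PRIORITY_CLASS:.spec.priorityClassName

# compare the numeric values behind the priority classes
oc get priorityclass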


Version-Release number of selected component (if applicable):
4.7 (probably affects older versions also)

How reproducible:
Relatively easily, during upgrades

Steps to Reproduce:
1. Create a 4.6 cluster
2. Upgrade it to 4.7
3. If Prometheus uses excessive memory during the upgrade (due to the WAL re-read and excessive time-series creation), the nodes will struggle to evict the Prometheus workload, causing nodes to go NotReady (see the sketch after these steps for one way to observe this).
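
A rough way to watch for this failure mode during the upgrade (a sketch, not from the original report; it assumes the metrics API backing oc adm top is available):

# watch container memory for the monitoring pods
oc adm top pods -n openshift-monitoring --containers

# watch for nodes flapping to NotReady
oc get nodes -w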


Additional info:
The fix for this bug is a mitigation for:

https://bugzilla.redhat.com/show_bug.cgi?id=1925061
https://bugzilla.redhat.com/show_bug.cgi?id=1913532

Comment 1 Sergiusz Urbaniak 2021-03-02 08:25:49 UTC
*** Bug 1929748 has been marked as a duplicate of this bug. ***

Comment 2 Sergiusz Urbaniak 2021-03-02 12:31:47 UTC
@ben this is effectively blocked by the API server change. We cannot merge until the API server allows tweaking those priority classes.

Comment 6 hongyan li 2021-03-04 02:05:33 UTC
Not in any payload yet.

Comment 7 hongyan li 2021-03-04 02:33:25 UTC
Tested with payload 4.8.0-0.nightly-2021-03-03-192757.

The prometheus pods under openshift-monitoring have priorityClassName: openshift-user-critical.

# for i in prometheus-k8s-0 prometheus-k8s-1; do echo $i; oc -n openshift-monitoring get pod $i -oyaml | grep priorityClassName; done
prometheus-k8s-0
  priorityClassName: openshift-user-critical
prometheus-k8s-1
  priorityClassName: openshift-user-critical

oc get priorityClass
NAME                      VALUE        GLOBAL-DEFAULT   AGE
openshift-user-critical   1000000000   false            3h23m
system-cluster-critical   2000000000   false            3h24m
system-node-critical      2000001000   false            3h24m

Comment 8 hongyan li 2021-03-04 11:48:16 UTC
Upgraded from 4.7 to 4.8.0-0.nightly-2021-03-03-192757. Prometheus used at most 7G of memory (see attachment), and no node went NotReady.

Comment 9 hongyan li 2021-03-04 11:49:19 UTC
Created attachment 1760669 [details]
prometheus used memory when upgrade from 4.7 to 4.8

Comment 10 Simon Pasquier 2021-03-05 12:48:01 UTC
Created attachment 1760895 [details]
prometheus memory during 4.7 to 4.8 upgrade

The memory spike at 7G is an artifact of the query and doesn't reflect reality: after Prometheus has been upgraded, container_memory_working_set_bytes series exist for both the old and the new instances. This can be seen by querying 'container_memory_working_set_bytes{pod=~"prometheus-k8s.*",container="prometheus"}' without summing the series (see attached screenshot). The reason is that Prometheus didn't mark the old series as stale on shutdown (as expected), so they continue to "live" for 5 minutes (i.e. the default lookback interval).

The "correct" query is 'sum by(pod) (max by(pod) (container_memory_working_set_bytes{pod=~"prometheus-k8s.*",container="prometheus"}))'.

Comment 13 errata-xmlrpc 2021-07-27 22:44:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

