Bug 1929277

Summary: Monitoring workloads using too high a priorityclass
Product: OpenShift Container Platform
Reporter: Ben Parees <bparees>
Component: Monitoring
Assignee: Sergiusz Urbaniak <surbania>
Status: CLOSED ERRATA
QA Contact: hongyan li <hongyli>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.7
CC: alegrand, anpicker, aos-bugs, erooth, fpaoline, hongyli, kakkoyun, lcosic, mfojtik, pkrupa, spasquie, xxia
Target Milestone: ---
Target Release: 4.8.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
: 1929278 (view as bug list)
Environment:
Last Closed: 2021-07-27 22:44:46 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1929278
Attachments:
  prometheus used memory when upgrade from 4.7 to 4.8 (flags: none)
  prometheus memory during 4.7 to 4.8 upgrade (flags: none)

Description Ben Parees 2021-02-16 15:56:18 UTC
Description of problem:

The monitoring workloads use a system-critical priority class, which causes problems when monitoring uses excessive memory and the nodes can't evict those workloads.

The monitoring priority will be lowered to give the scheduler more flexibility to move these heavy workloads around and keep critical nodes alive.
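
For reference, a minimal sketch of what such a lower-priority class looks like (name and value mirror the `oc get priorityClass` output later in comment 7, well below system-cluster-critical at 2000000000; the real object is created by the platform, so this is illustration only, not something to apply by hand):

cat <<'EOF' | oc apply -f -
# Sketch only: in a real cluster this object is managed by the platform.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: openshift-user-critical
value: 1000000000
globalDefault: false
EOF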


Version-Release number of selected component (if applicable):
4.7 (probably affects older versions also)

How reproducible:
Relatively easily, during upgrades

Steps to Reproduce:
1. Create a 4.6 cluster
2. Upgrade it to 4.7
3. If Prometheus uses excessive memory during the upgrade (due to the WAL re-read and excessive time-series creation), the nodes will struggle to evict the Prometheus workload, causing node unready (NotReady) failures (see the watch loop sketched below).
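
Not part of the original reproduction steps, just a convenience sketch for step 3: a small shell loop to watch Prometheus memory and node readiness while the upgrade runs (assumes `oc adm top` can reach the cluster metrics API):

# sample Prometheus container memory and node readiness every 30s
while true; do
  date
  oc adm top pods -n openshift-monitoring | grep prometheus-k8s   # kubelet-reported memory usage
  oc get nodes | grep NotReady                                    # prints nothing while all nodes are Ready
  sleep 30
done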


Additional info:
The fix for this bug is a mitigation for:

https://bugzilla.redhat.com/show_bug.cgi?id=1925061
https://bugzilla.redhat.com/show_bug.cgi?id=1913532

Comment 1 Sergiusz Urbaniak 2021-03-02 08:25:49 UTC
*** Bug 1929748 has been marked as a duplicate of this bug. ***

Comment 2 Sergiusz Urbaniak 2021-03-02 12:31:47 UTC
@ben this is effectively blocked by the API server change. We cannot merge until the API server allows tweaking those priority classes.

Comment 6 hongyan li 2021-03-04 02:05:33 UTC
Not in any payload yet.

Comment 7 hongyan li 2021-03-04 02:33:25 UTC
Tested with payload 4.8.0-0.nightly-2021-03-03-192757.

The prometheus pods under openshift-monitoring have priorityClassName: openshift-user-critical.

#for i in prometheus-k8s-0 prometheus-k8s-1; do echo $i; oc -n openshift-monitoring get pod $i -oyaml|grep priorityClassName; done 
prometheus-k8s-0
  priorityClassName: openshift-user-critical
prometheus-k8s-1
  priorityClassName: openshift-user-critical

oc get priorityClass
NAME                      VALUE        GLOBAL-DEFAULT   AGE
openshift-user-critical   1000000000   false            3h23m
system-cluster-critical   2000000000   false            3h24m
system-node-critical      2000001000   false            3h24m

Comment 8 hongyan li 2021-03-04 11:48:16 UTC
Upgraded from 4.7 to 4.8.0-0.nightly-2021-03-03-192757: Prometheus used at most 7G of memory (see attachment) and no node went to NotReady status.

Comment 9 hongyan li 2021-03-04 11:49:19 UTC
Created attachment 1760669 [details]
prometheus used memory when upgrade from 4.7 to 4.8

Comment 10 Simon Pasquier 2021-03-05 12:48:01 UTC
Created attachment 1760895 [details]
prometheus memory during 4.7 to 4.8 upgrade

The memory spike at 7G is an artifact of the query and doesn't reflect reality. After Prometheus has been upgraded, the container_memory_working_set_bytes series exist for both the old and the new instances. This can be seen by querying 'container_memory_working_set_bytes{pod=~"prometheus-k8s.*",container="prometheus"}' without summing the series (see the attached screenshot). The explanation is that Prometheus didn't mark the old series as stale on shutdown (as expected), so they continue to "live" for 5 minutes (i.e. the default lookback interval).

The "correct" query is 'sum by(pod) (max by(pod) (container_memory_working_set_bytes{pod=~"prometheus-k8s.*",container="prometheus"}))'.

Comment 13 errata-xmlrpc 2021-07-27 22:44:46 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438