Bug 1929277
| Field | Value |
| --- | --- |
| Summary | Monitoring workloads using too high a priorityclass |
| Product | OpenShift Container Platform |
| Reporter | Ben Parees <bparees> |
| Component | Monitoring |
| Assignee | Sergiusz Urbaniak <surbania> |
| Status | CLOSED ERRATA |
| QA Contact | hongyan li <hongyli> |
| Severity | high |
| Priority | unspecified |
| Version | 4.7 |
| CC | alegrand, anpicker, aos-bugs, erooth, fpaoline, hongyli, kakkoyun, lcosic, mfojtik, pkrupa, spasquie, xxia |
| Target Milestone | --- |
| Target Release | 4.8.0 |
| Hardware | Unspecified |
| OS | Unspecified |
| Doc Type | No Doc Update |
| Story Points | --- |
| Clones | 1929278 (view as bug list) |
| Last Closed | 2021-07-27 22:44:46 UTC |
| Type | Bug |
| Regression | --- |
| Bug Blocks | 1929278 |
Description

Ben Parees, 2021-02-16 15:56:18 UTC

*** Bug 1929748 has been marked as a duplicate of this bug. ***

@ben this is effectively blocked by the API server change. We cannot merge until the API server allows tweaking those priority classes.

Not in any payload as of now.

Tested with payload 4.8.0-0.nightly-2021-03-03-192757: the prometheus pods under openshift-monitoring have priorityClassName: openshift-user-critical.

```
# for i in prometheus-k8s-0 prometheus-k8s-1; do echo $i; oc -n openshift-monitoring get pod $i -oyaml | grep priorityClassName; done
prometheus-k8s-0
  priorityClassName: openshift-user-critical
prometheus-k8s-1
  priorityClassName: openshift-user-critical

# oc get priorityClass
NAME                      VALUE        GLOBAL-DEFAULT   AGE
openshift-user-critical   1000000000   false            3h23m
system-cluster-critical   2000000000   false            3h24m
system-node-critical      2000001000   false            3h24m
```

During an upgrade from 4.7 to 4.8.0-0.nightly-2021-03-03-192757, Prometheus used at most 7G of memory (see attachment), and no node went to an unready status.

Created attachment 1760669 [details]
prometheus memory used during the 4.7 to 4.8 upgrade
Created attachment 1760895 [details]
prometheus memory during 4.7 to 4.8 upgrade
The memory spike at 7G is an artifact of the query and doesn't reflect reality. After Prometheus was upgraded, container_memory_working_set_bytes series existed for both the old and the new instances. This can be seen by querying 'container_memory_working_set_bytes{pod=~"prometheus-k8s.*",container="prometheus"}' without summing the series (see the attached screenshot). It is explained by the fact that Prometheus didn't mark the old series as stale on shutdown (as expected), so they continue to "live" for 5 minutes (the default lookback interval), and summing over that window counts the old and new instances together.
The "correct" query is 'sum by(pod) (max by(pod) (container_memory_working_set_bytes{pod=~"prometheus-k8s.*",container="prometheus"}))'.
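The double counting during the lookback window can be sketched outside of PromQL with made-up numbers (the pod names match the bug, but the byte values are hypothetical, chosen only to illustrate the effect):

```python
# Hypothetical illustration, not OpenShift code: during the 5-minute
# lookback window, samples from both the old (pre-upgrade) and new
# (post-upgrade) Prometheus containers are returned for each pod,
# so a plain sum() double-counts each pod's memory.
samples = [
    # (pod, instance_generation, working_set_bytes) -- values made up
    ("prometheus-k8s-0", "old", 3.5e9),
    ("prometheus-k8s-0", "new", 3.6e9),
    ("prometheus-k8s-1", "old", 3.4e9),
    ("prometheus-k8s-1", "new", 3.5e9),
]

# Naive aggregation, like sum(...) over all series: old + new together.
naive_total = sum(v for _, _, v in samples)

# Mimic 'sum by(pod) (max by(pod) (...))': keep only the largest sample
# per pod (collapsing old/new duplicates), then sum across pods.
per_pod = {}
for pod, _, v in samples:
    per_pod[pod] = max(per_pod.get(pod, 0.0), v)
corrected_total = sum(per_pod.values())

print(naive_total)      # 14000000000.0 -- the misleading spike
print(corrected_total)  # 7100000000.0  -- actual per-cluster usage
```

The `max by(pod)` inner aggregation is what collapses the overlapping old/new series onto a single value per pod before the outer sum.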
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438