Bug 1929354 - Monitoring workloads using too high a priorityclass
Summary: Monitoring workloads using too high a priorityclass
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.6.z
Assignee: Simon Pasquier
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On: 1929278
Blocks:
Reported: 2021-02-16 17:51 UTC by Ben Parees
Modified: 2021-04-09 12:57 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of: 1929278
Environment:
Last Closed: 2021-03-30 17:03:12 UTC
Target Upstream Version:



Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-monitoring-operator pull 1071 | 0 | None | open | Bug 1929354: [4.6] jsonnet/prometheus.jsonnet: Apply openshift-user-critical class to cluster Prometheus | 2021-03-03 12:12:14 UTC
Red Hat Product Errata | RHBA-2021:0952 | 0 | None | None | None | 2021-03-30 17:03:25 UTC

Description Ben Parees 2021-02-16 17:51:15 UTC
+++ This bug was initially created as a clone of Bug #1929278 +++

+++ This bug was initially created as a clone of Bug #1929277 +++

Description of problem:

The monitoring workloads use system-critical as their priority, which causes problems when monitoring uses excessive memory and the nodes can't evict them.

Monitoring priority will be dropped to give the scheduler more flexibility to move these heavy workloads around and keep critical nodes alive.


Version-Release number of selected component (if applicable):
4.7 (probably affects older versions also)

How reproducible:
Relatively easily, during upgrades

Steps to Reproduce:
1. Create a 4.6 cluster
2. Upgrade it to 4.7
3. If Prometheus uses excessive memory during the upgrade (due to WAL re-read and excessive time-series creation), the nodes will struggle to evict the Prometheus workload, causing node unready failures.


Additional info:
The fix for this bug is a mitigation for:

https://bugzilla.redhat.com/show_bug.cgi?id=1925061
https://bugzilla.redhat.com/show_bug.cgi?id=1913532

Comment 1 Ben Parees 2021-02-16 21:08:27 UTC
I think we're going to want this change in 4.6, though it is not needed for 4.7 GA.

Comment 2 Lili Cosic 2021-02-17 10:54:46 UTC
Reassigning to the CVO team as we are blocked by lack of ability to change the existing value of the class.

Comment 3 W. Trevor King 2021-02-24 19:00:37 UTC
CVO already has bug 1929741 in this space, and that's blocked on upstream work (although it sounds like CVO could delete/recreate as a temporary workaround if necessary).  CVO doesn't need two bugs in this space, so I'm sending this back to monitoring.  If you don't need this bug either, close it as a dup of bug 1929741, or explain the distinction I'm missing?

Comment 6 Sergiusz Urbaniak 2021-03-03 08:52:31 UTC
Reassigning to kube-apiserver for the time being, as we can't merge the PR as-is.

Comment 9 Junqi Zhao 2021-03-22 02:12:57 UTC
Tested with 4.6.0-0.nightly-2021-03-21-131139:
# oc -n openshift-monitoring get sts prometheus-k8s -oyaml | grep priorityClassName
      priorityClassName: openshift-user-critical
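
The same check can be scripted so that it prints only the class name. This is a minimal sketch, not part of the original verification: the helper name priority_class_from_yaml is made up for illustration, and it assumes the StatefulSet YAML arrives on stdin in the shape shown above.

```shell
# Hypothetical helper: read StatefulSet YAML on stdin and print the value
# of the first "priorityClassName:" key it finds.
priority_class_from_yaml() {
  # awk strips the leading indentation; $1 is the key, $2 is the value.
  awk '$1 == "priorityClassName:" { print $2; exit }'
}

# Against a live cluster (requires oc and access to the namespace):
#   oc -n openshift-monitoring get sts prometheus-k8s -o yaml | priority_class_from_yaml
#
# Alternative without a helper, using oc's jsonpath output:
#   oc -n openshift-monitoring get sts prometheus-k8s \
#     -o jsonpath='{.spec.template.spec.priorityClassName}'
```

Matching on the exact key avoids picking up unrelated lines that merely contain the string, which a plain grep would also print.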

Comment 12 errata-xmlrpc 2021-03-30 17:03:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6.23 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0952

