Bug 1906081 - Rules in kube-apiserver.rules are taking too long and consuming too much memory for Prometheus to evaluate them
Summary: Rules in kube-apiserver.rules are taking too long and consuming too much memory for Prometheus to evaluate them
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: 4.5.z
Assignee: Damien Grisonnet
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On: 1905903
Blocks:
 
Reported: 2020-12-09 16:12 UTC by Damien Grisonnet
Modified: 2021-04-23 13:42 UTC
CC: 13 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-01-20 05:49:28 UTC
Target Upstream Version:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-monitoring-operator pull 1016 0 None closed [release-4.5] Bug 1906081: jsonnet: remove apiserver_request:availability30d 2021-02-09 09:23:03 UTC
Red Hat Product Errata RHBA-2021:0033 0 None None None 2021-01-20 05:49:54 UTC

Description Damien Grisonnet 2020-12-09 16:12:03 UTC
This bug was initially created as a copy of Bug #1872786

I am copying this bug because: 

This bug still exists in 4.5.

Description of problem:

The following recording rule:
==========================================
    - interval: 3m
      name: kube-apiserver-availability.rules
      rules:
      - expr: |
          1 - (
            (
              # write too slow
              sum(increase(apiserver_request_duration_seconds_count{verb=~"POST|PUT|PATCH|DELETE"}[30d]))
              -
              sum(increase(apiserver_request_duration_seconds_bucket{verb=~"POST|PUT|PATCH|DELETE",le="1"}[30d]))
            ) +
            (
              # read too slow
              sum(increase(apiserver_request_duration_seconds_count{verb=~"LIST|GET"}[30d]))
              -
              (
                sum(increase(apiserver_request_duration_seconds_bucket{verb=~"LIST|GET",scope=~"resource|",le="0.1"}[30d])) +
                sum(increase(apiserver_request_duration_seconds_bucket{verb=~"LIST|GET",scope="namespace",le="0.5"}[30d])) +
                sum(increase(apiserver_request_duration_seconds_bucket{verb=~"LIST|GET",scope="cluster",le="5"}[30d]))
              )
            ) +
            # errors
            sum(code:apiserver_request_total:increase30d{code=~"5.."} or vector(0))
          )
          /
          sum(code:apiserver_request_total:increase30d)
        labels:
          verb: all
        record: apiserver_request:availability30d
===========================================
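In plain terms, the rule computes a 30-day availability ratio: 1 minus the fraction of requests that were either too slow (per-scope latency thresholds) or answered with a 5xx error. A minimal Python sketch of that arithmetic, using entirely hypothetical request counts (the real values come from the apiserver_request_duration_seconds_* and code:apiserver_request_total:increase30d series):

```python
# Hypothetical 30-day request counts for illustration only.
total_requests = 1_000_000
slow_writes = 2_000   # POST/PUT/PATCH/DELETE slower than 1s
slow_reads = 3_000    # LIST/GET slower than their per-scope threshold
errors_5xx = 1_000    # responses with a 5xx status code

# Same shape as the PromQL expression above:
# 1 - (write_too_slow + read_too_slow + errors) / total
unavailable = slow_writes + slow_reads + errors_5xx
availability_30d = 1 - unavailable / total_requests
print(availability_30d)  # 0.994
```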

loads too many samples, since the query spans 30 days.
The customer sees this error message constantly:

"query processing would load too many samples into memory in query execution"

This has been mentioned already upstream here:

https://github.com/prometheus/prometheus/issues/7281

and here:

https://github.com/kubernetes-monitoring/kubernetes-mixin/issues/411

The current limit is 500m, set by query.max-samples.

This cannot be changed in OpenShift (it is managed by the operator), so what probably needs to change is the rule itself, so that the query does not span the full latest 30 days.
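To see why a raw [30d] selector hits this limit: a range query must load roughly one point per series per scrape interval over the whole window, so samples loaded ≈ series × (range / scrape interval). A back-of-the-envelope sketch, where the series count and scrape interval are assumptions for illustration, not measured values:

```python
# Rough estimate of samples Prometheus must load to evaluate a [30d]
# range selector over the apiserver histogram metrics.
series = 10_000            # assumed histogram series on a large cluster
scrape_interval_s = 30     # assumed scrape interval
range_s = 30 * 24 * 3600   # the [30d] window in seconds

samples_loaded = series * (range_s // scrape_interval_s)
print(samples_loaded)      # 864_000_000 samples

# Compare with the limit cited above (500m, read as 500 million):
max_samples = 500_000_000
print(samples_loaded > max_samples)  # exceeds the limit
```

Even with conservative assumptions, the estimate lands well past the limit, which is why the query fails rather than the rule merely being slow.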

Comment 2 Scott Dodson 2020-12-14 14:23:45 UTC
Both the 4.6.z nor 4.7.0 bug have blocker- set on them and this is not a regression as far as anyone knows so therefore it's not valid to set it blocker+ here.

Comment 3 Scott Dodson 2020-12-14 14:25:48 UTC
Clarifying my poorly worded previous comment,

Both the 4.6.z and 4.7.0 bugs have blocker- set on them, and this is not a regression as far as anyone knows, so it's not valid to set blocker+ here. Please use blocker? to indicate that you'd like engineering to evaluate whether this should block a z-stream release.

Comment 4 Damien Grisonnet 2020-12-14 14:56:00 UTC
I agree with @sdodson here. Even though this bug has a huge impact on our clients and OSD, it's not a regression, so we should not set the blocker+ flag on its z-stream fix. That said, it is still critical and very urgent.

Comment 10 errata-xmlrpc 2021-01-20 05:49:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.5.27 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0033

