Bug 1872786 - Rules in kube-apiserver.rules are taking too long and consuming too much memory for Prometheus to evaluate them
Summary: Rules in kube-apiserver.rules are taking too long and consuming too much memory for Prometheus to evaluate them
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 4.7.0
Assignee: Damien Grisonnet
QA Contact: Junqi Zhao
URL:
Whiteboard:
Duplicates: 1888549 (view as bug list)
Depends On:
Blocks: 1905903
 
Reported: 2020-08-26 15:45 UTC by German Parente
Modified: 2023-12-15 19:02 UTC
CC List: 29 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-24 15:16:22 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github kubernetes-monitoring kubernetes-mixin issues 411 0 None closed `code_verb:apiserver_request_total:increase30d` loads (too) many samples 2021-02-15 19:55:20 UTC
Github openshift cluster-monitoring-operator pull 980 0 None closed Bug 1872786: jsonnet: remove apiserver_request:availability30d 2021-02-15 19:55:21 UTC
Red Hat Knowledge Base (Solution) 5491081 0 None None None 2020-10-15 08:28:52 UTC
Red Hat Product Errata RHSA-2020:5633 0 None None None 2021-02-24 15:17:05 UTC

Description German Parente 2020-08-26 15:45:44 UTC
Description of problem:

the rule:
==========================================
    - interval: 3m
      name: kube-apiserver-availability.rules
      rules:
      - expr: |
          1 - (
            (
              # write too slow
              sum(increase(apiserver_request_duration_seconds_count{verb=~"POST|PUT|PATCH|DELETE"}[30d]))
              -
              sum(increase(apiserver_request_duration_seconds_bucket{verb=~"POST|PUT|PATCH|DELETE",le="1"}[30d]))
            ) +
            (
              # read too slow
              sum(increase(apiserver_request_duration_seconds_count{verb=~"LIST|GET"}[30d]))
              -
              (
                sum(increase(apiserver_request_duration_seconds_bucket{verb=~"LIST|GET",scope=~"resource|",le="0.1"}[30d])) +
                sum(increase(apiserver_request_duration_seconds_bucket{verb=~"LIST|GET",scope="namespace",le="0.5"}[30d])) +
                sum(increase(apiserver_request_duration_seconds_bucket{verb=~"LIST|GET",scope="cluster",le="5"}[30d]))
              )
            ) +
            # errors
            sum(code:apiserver_request_total:increase30d{code=~"5.."} or vector(0))
          )
          /
          sum(code:apiserver_request_total:increase30d)
        labels:
          verb: all
        record: apiserver_request:availability30d
===========================================
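In plain terms, the rule computes 30-day availability as one minus the fraction of "bad" requests (too-slow writes, too-slow reads, and 5xx errors) over all requests. With made-up counts for illustration:

```python
# Illustrative arithmetic for the availability formula above.
# All counts are invented for the example, not taken from a real cluster.
slow_writes = 1_000       # write requests slower than 1s over 30d
slow_reads  = 4_000       # read requests slower than their scope's threshold
errors_5xx  = 5_000       # requests answered with a 5xx status code
total       = 10_000_000  # all apiserver requests over 30d

availability = 1 - (slow_writes + slow_reads + errors_5xx) / total
print(availability)  # 0.999
```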

is loading too much data, since each range selector spans 30 days of samples.
The customer is hitting this error message constantly:

"query processing would load too many samples into memory in query execution"

This has been mentioned already upstream here:

https://github.com/prometheus/prometheus/issues/7281

and here:

https://github.com/kubernetes-monitoring/kubernetes-mixin/issues/411
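
The direction eventually taken upstream was to precompute cheap short-window increases and average those over 30 days, rather than evaluating a raw [30d] selector over raw samples. A sketch of that pattern (rule names and grouping labels are illustrative, not the exact rules that shipped):

```yaml
# Sketch only: precompute a 1h increase, then average the precomputed
# series over 30d and scale back up, so the 30d query reads far fewer samples.
- record: code:apiserver_request_total:increase1h
  expr: sum by (code) (increase(apiserver_request_total[1h]))
- record: code:apiserver_request_total:increase30d
  expr: avg_over_time(code:apiserver_request_total:increase1h[30d]) * 24 * 30
```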

The current limit is 500m, set by query.max-samples.

This flag cannot be changed in OpenShift (Prometheus is managed by the operator), so what probably needs to change is the rule itself, so that the query does not have to read the full 30 days of raw samples.
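
A rough back-of-envelope shows why a 30-day range selector can exceed such a limit. The series count below is an assumption for illustration, and 50 million is the documented Prometheus default for --query.max-samples:

```python
# Rough estimate of how many samples a [30d] range selector must load.
# SERIES is a made-up, illustrative number of matching time series.
SCRAPE_INTERVAL_S = 15       # typical scrape interval
RANGE_S = 30 * 24 * 3600     # the [30d] range selector
SERIES = 2_000               # assumed apiserver_request_duration_seconds_* series
MAX_SAMPLES = 50_000_000     # Prometheus default for --query.max-samples

samples_per_series = RANGE_S // SCRAPE_INTERVAL_S
total = samples_per_series * SERIES

print(samples_per_series)    # 172800 samples per series over 30d
print(total)                 # 345600000
print(total > MAX_SAMPLES)   # True: the query would be rejected
```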

Comment 4 Sergiusz Urbaniak 2020-09-30 08:51:39 UTC
reassigning to Damien

Comment 5 Sergiusz Urbaniak 2020-10-02 13:28:55 UTC
No capacity to work on this currently.

Comment 7 Simon Pasquier 2020-10-15 07:49:05 UTC
*** Bug 1888549 has been marked as a duplicate of this bug. ***

Comment 17 Damien Grisonnet 2020-11-13 17:02:40 UTC
This issue will continue to be worked on during the upcoming sprint (193).

Comment 18 Damien Grisonnet 2020-11-16 09:13:59 UTC
This issue will continue to be addressed in the upcoming sprint (193). The upstream PR has been opened, but we still need to follow up on some technical discussions.

Comment 24 Junqi Zhao 2020-12-10 01:53:33 UTC
Tested with 4.7.0-0.nightly-2020-12-09-112139; the apiserver_request:availability30d record rule has been removed:
# oc -n openshift-monitoring exec -c cluster-monitoring-operator cluster-monitoring-operator-849f4db66d-744tm -- grep -ri apiserver_request:availability30d /assets/prometheus-k8s
command terminated with exit code 1

Comment 38 errata-xmlrpc 2021-02-24 15:16:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

Comment 40 Damien Grisonnet 2021-03-05 17:31:12 UTC
Hello Palash,

The fix has already been backported and is available in OCP 4.6.9 and 4.5.27.

Kind regards,
Damien

