Bug 1872786

Summary:	Rules in kube-apiserver.rules are taking too long and consuming too much memory for Prometheus to evaluate them
Product:	OpenShift Container Platform	Reporter:	German Parente <gparente>
Component:	Monitoring	Assignee:	Damien Grisonnet <dgrisonn>
Status:	CLOSED ERRATA	QA Contact:	Junqi Zhao <juzhao>
Severity:	medium	Docs Contact:
Priority:	high
Version:	4.5	CC:	achakrat, adeshpan, a.grimmeissen, aivaraslaimikis, akhaire, alegrand, anpicker, cblecker, christopher.obrien, dgrisonn, erooth, jnordell, kakkoyun, ksathe, lcosic, mbargenq, mf.flip, michele.sandro.emma, mrobson, ocasalsa, pkhaire, pkrupa, rabdulra, rpalathi, sagopina, spasquie, sreber, surbania, wking
Target Milestone:	---	Keywords:	ServiceDeliveryBlocker
Target Release:	4.7.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	No Doc Update
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2021-02-24 15:16:22 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1905903

Description German Parente 2020-08-26 15:45:44 UTC

Description of problem:

the rule:
==========================================
    - interval: 3m
      name: kube-apiserver-availability.rules
      rules:
      - expr: |
          1 - (
            (
              # write too slow
              sum(increase(apiserver_request_duration_seconds_count{verb=~"POST|PUT|PATCH|DELETE"}[30d]))
              -
              sum(increase(apiserver_request_duration_seconds_bucket{verb=~"POST|PUT|PATCH|DELETE",le="1"}[30d]))
            ) +
            (
              # read too slow
              sum(increase(apiserver_request_duration_seconds_count{verb=~"LIST|GET"}[30d]))
              -
              (
                sum(increase(apiserver_request_duration_seconds_bucket{verb=~"LIST|GET",scope=~"resource|",le="0.1"}[30d])) +
                sum(increase(apiserver_request_duration_seconds_bucket{verb=~"LIST|GET",scope="namespace",le="0.5"}[30d])) +
                sum(increase(apiserver_request_duration_seconds_bucket{verb=~"LIST|GET",scope="cluster",le="5"}[30d]))
              )
            ) +
            # errors
            sum(code:apiserver_request_total:increase30d{code=~"5.."} or vector(0))
          )
          /
          sum(code:apiserver_request_total:increase30d)
        labels:
          verb: all
        record: apiserver_request:availability30d
===========================================

is loading too much data, since the query covers 30 days.
The customer is seeing this error message all the time:

"query processing would load too many samples into memory in query execution"

This has been mentioned already upstream here:

https://github.com/prometheus/prometheus/issues/7281

and here:

https://github.com/kubernetes-monitoring/kubernetes-mixin/issues/411

The current limit is 500m set by query.max-samples.
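
For a rough sense of scale, here is a minimal sketch of how many samples a single [30d] range selector may have to load. The series count and scrape interval are hypothetical values chosen for illustration, not measurements from the customer cluster, and the formula is only an approximation of how Prometheus counts samples against query.max-samples.

```python
# Back-of-the-envelope estimate of samples loaded by one range selector.
# Assumption: one sample per series per scrape over the whole window.

THIRTY_DAYS_SECONDS = 30 * 24 * 60 * 60  # 2,592,000 seconds


def samples_loaded(series: int, window_seconds: int, scrape_interval_seconds: int) -> int:
    """Approximate number of samples touched by one range selector."""
    return series * (window_seconds // scrape_interval_seconds)


def fits_limit(samples: int, max_samples: int) -> bool:
    """True if the estimate stays within the configured query.max-samples value."""
    return samples <= max_samples


# Hypothetical inputs: 1,000 matching series scraped every 30 seconds.
estimate = samples_loaded(series=1000,
                          window_seconds=THIRTY_DAYS_SECONDS,
                          scrape_interval_seconds=30)
print(estimate)  # 86400000
```

The rule above contains several such selectors, so the total for one evaluation is the sum of the estimates for each of them. Pass the cluster's actual query.max-samples value to fits_limit to compare.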

This limit cannot be changed in OpenShift because Prometheus is managed by the operator. What we probably need to change is the rule itself, so that the query is not run over the latest 30 days.

Comment 4 Sergiusz Urbaniak 2020-09-30 08:51:39 UTC

reassigning to Damien

Comment 5 Sergiusz Urbaniak 2020-10-02 13:28:55 UTC

No capacity to work on this currently.

Comment 7 Simon Pasquier 2020-10-15 07:49:05 UTC

*** Bug 1888549 has been marked as a duplicate of this bug. ***

Comment 17 Damien Grisonnet 2020-11-13 17:02:40 UTC

This issue will continue to be worked on during the upcoming sprint (193).

Comment 18 Damien Grisonnet 2020-11-16 09:13:59 UTC

This issue will continue to be addressed in the upcoming sprint (193). The upstream PR is open, but we still need to follow up on some technical discussions.

Comment 24 Junqi Zhao 2020-12-10 01:53:33 UTC

Tested with 4.7.0-0.nightly-2020-12-09-112139; the apiserver_request:availability30d recording rule has been removed:
# oc -n openshift-monitoring exec -c cluster-monitoring-operator cluster-monitoring-operator-849f4db66d-744tm -- grep -ri apiserver_request:availability30d /assets/prometheus-k8s
command terminated with exit code 1
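
Note that grep exits with status 1 when no line matches, so the output above means the rule name no longer appears anywhere under the assets directory. The same check can be sketched in Python for use outside the pod; the function name and the idea of running it against a local copy of the assets are assumptions for illustration, not part of the verification performed here.

```python
import os


def rule_mentioned(root: str, needle: str) -> bool:
    """Case-insensitive recursive search, roughly equivalent to `grep -ri needle root`."""
    needle = needle.lower()
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            try:
                # Ignore undecodable bytes so binary files do not abort the scan.
                with open(path, "r", errors="ignore") as handle:
                    if any(needle in line.lower() for line in handle):
                        return True
            except OSError:
                # Unreadable files are skipped rather than treated as a match.
                continue
    return False
```

A call such as rule_mentioned("/path/to/assets", "apiserver_request:availability30d") returning False corresponds to the exit code 1 seen above.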

Comment 38 errata-xmlrpc 2021-02-24 15:16:22 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

Comment 40 Damien Grisonnet 2021-03-05 17:31:12 UTC

Hello Palash,

The fix has already been backported and is available in OCP 4.6.9 and 4.5.27.

Kind regards,
Damien