Bug 1906081

Summary:	Rules in kube-apiserver.rules are taking too long and consuming too much memory for Prometheus to evaluate them
Product:	OpenShift Container Platform
Reporter:	Damien Grisonnet <dgrisonn>
Component:	Monitoring
Assignee:	Damien Grisonnet <dgrisonn>
Status:	CLOSED ERRATA
QA Contact:	Junqi Zhao <juzhao>
Severity:	high
Docs Contact:
Priority:	urgent
Version:	4.5
CC:	achakrat, alegrand, anpicker, cblecker, erooth, kakkoyun, lcosic, mloibl, ocasalsa, pkrupa, rrackow, sdodson, surbania
Target Milestone:	---
Keywords:	ServiceDeliveryBlocker
Target Release:	4.5.z
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Story Points:	---
Clone Of:
Environment:
Last Closed:	2021-01-20 05:49:28 UTC
Type:	---
Regression:	---
Mount Type:	---
Documentation:	---
CRM:
Verified Versions:
Category:	---
oVirt Team:	---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---
Target Upstream Version:
Embargoed:
Bug Depends On:	1905903
Bug Blocks:

Description Damien Grisonnet 2020-12-09 16:12:03 UTC

This bug was initially created as a copy of Bug #1872786

I am copying this bug because: 

This bug still exists in 4.5.

Description of problem:

the rule:
==========================================
    - interval: 3m
      name: kube-apiserver-availability.rules
      rules:
      - expr: |
          1 - (
            (
              # write too slow
              sum(increase(apiserver_request_duration_seconds_count{verb=~"POST|PUT|PATCH|DELETE"}[30d]))
              -
              sum(increase(apiserver_request_duration_seconds_bucket{verb=~"POST|PUT|PATCH|DELETE",le="1"}[30d]))
            ) +
            (
              # read too slow
              sum(increase(apiserver_request_duration_seconds_count{verb=~"LIST|GET"}[30d]))
              -
              (
                sum(increase(apiserver_request_duration_seconds_bucket{verb=~"LIST|GET",scope=~"resource|",le="0.1"}[30d])) +
                sum(increase(apiserver_request_duration_seconds_bucket{verb=~"LIST|GET",scope="namespace",le="0.5"}[30d])) +
                sum(increase(apiserver_request_duration_seconds_bucket{verb=~"LIST|GET",scope="cluster",le="5"}[30d]))
              )
            ) +
            # errors
            sum(code:apiserver_request_total:increase30d{code=~"5.."} or vector(0))
          )
          /
          sum(code:apiserver_request_total:increase30d)
        labels:
          verb: all
        record: apiserver_request:availability30d
===========================================

is loading too much data, since the query covers 30 days.
The customer is seeing this error message all the time:

"query processing would load too many samples into memory in query execution"

This has been mentioned already upstream here:

https://github.com/prometheus/prometheus/issues/7281

and here:

https://github.com/kubernetes-monitoring/kubernetes-mixin/issues/411

The current limit is 500m set by query.max-samples.

This limit cannot be changed in OpenShift (it is managed by the operator), so what we probably need to change is the rule itself, so that this query no longer has to cover the full last 30 days.
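
One possible way to restructure the rule, shown here only as a sketch: record the increase over a short window first, then derive the 30-day value from that recorded series instead of from the raw samples. The two recording-rule names below are hypothetical, and only the write request count term of the original expression is shown. The result is an approximation of the 30d increase, and it covers less than 30 days until the 1h series has accumulated a full 30 days of history.

```yaml
# Sketch only: the recording-rule names are hypothetical, and only the
# "write" request count term of the original rule is rewritten.
- interval: 3m
  name: kube-apiserver-availability.rules
  rules:
  # Step 1: record the increase over a short 1h window, so each
  # evaluation only loads one hour of raw samples per series.
  - expr: |
      sum by (verb) (increase(apiserver_request_duration_seconds_count{verb=~"POST|PUT|PATCH|DELETE"}[1h]))
    record: verb:apiserver_request_duration_seconds_count:increase1h
  # Step 2: approximate the 30d increase from the recorded 1h series.
  # avg_over_time yields the mean hourly increase; 24 * 30 scales it to 30 days.
  - expr: |
      avg_over_time(verb:apiserver_request_duration_seconds_count:increase1h[30d]) * 24 * 30
    record: verb:apiserver_request_duration_seconds_count:increase30d
```

With this layout, the 30d step reads one aggregated series per verb (roughly 14,400 samples per series at a 3m interval) rather than 30 days of samples for every raw series. The availability expression would then use the recorded 30d series in place of the corresponding sum(increase(...[30d])) term, and the bucket terms would need the same treatment.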

Comment 2 Scott Dodson 2020-12-14 14:23:45 UTC

Both the 4.6.z nor 4.7.0 bug have blocker- set on them and this is not a regression as far as anyone knows so therefore it's not valid to set it blocker+ here.

Comment 3 Scott Dodson 2020-12-14 14:25:48 UTC

Clarifying my poorly worded previous comment,

Both the 4.6.z and 4.7.0 bugs have blocker- set on them, and this is not a regression as far as anyone knows, so it's not valid to set blocker+ here. Please use blocker? to indicate that you'd like engineering to evaluate whether this should block a z-stream release.

Comment 4 Damien Grisonnet 2020-12-14 14:56:00 UTC

I agree with @sdodson here. Even if this bug has a huge impact on our clients and OSD, it's not a regression, so we should not set the blocker+ flag on its z-stream fix. That said, it is still critical and very urgent.

Comment 10 errata-xmlrpc 2021-01-20 05:49:28 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.5.27 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0033