Bug 1906081

Summary: Rules in kube-apiserver.rules are taking too long and consuming too much memory for Prometheus to evaluate them
Product: OpenShift Container Platform
Reporter: Damien Grisonnet <dgrisonn>
Component: Monitoring
Assignee: Damien Grisonnet <dgrisonn>
Status: CLOSED ERRATA
QA Contact: Junqi Zhao <juzhao>
Severity: high
Docs Contact:
Priority: urgent
Version: 4.5
CC: achakrat, alegrand, anpicker, cblecker, erooth, kakkoyun, lcosic, mloibl, ocasalsa, pkrupa, rrackow, sdodson, surbania
Target Milestone: ---
Keywords: ServiceDeliveryBlocker
Target Release: 4.5.z
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-01-20 05:49:28 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1905903    
Bug Blocks:    

Description Damien Grisonnet 2020-12-09 16:12:03 UTC
This bug was initially created as a copy of Bug #1872786

I am copying this bug because: 

This bug still exists in 4.5.

Description of problem:

the rule:
==========================================
    - interval: 3m
      name: kube-apiserver-availability.rules
      rules:
      - expr: |
          1 - (
            (
              # write too slow
              sum(increase(apiserver_request_duration_seconds_count{verb=~"POST|PUT|PATCH|DELETE"}[30d]))
              -
              sum(increase(apiserver_request_duration_seconds_bucket{verb=~"POST|PUT|PATCH|DELETE",le="1"}[30d]))
            ) +
            (
              # read too slow
              sum(increase(apiserver_request_duration_seconds_count{verb=~"LIST|GET"}[30d]))
              -
              (
                sum(increase(apiserver_request_duration_seconds_bucket{verb=~"LIST|GET",scope=~"resource|",le="0.1"}[30d])) +
                sum(increase(apiserver_request_duration_seconds_bucket{verb=~"LIST|GET",scope="namespace",le="0.5"}[30d])) +
                sum(increase(apiserver_request_duration_seconds_bucket{verb=~"LIST|GET",scope="cluster",le="5"}[30d]))
              )
            ) +
            # errors
            sum(code:apiserver_request_total:increase30d{code=~"5.."} or vector(0))
          )
          /
          sum(code:apiserver_request_total:increase30d)
        labels:
          verb: all
        record: apiserver_request:availability30d
===========================================

is loading too much data, since the query covers a 30-day range.
The customer is getting this error message all the time:

"query processing would load too many samples into memory in query execution"

This has been mentioned already upstream here:

https://github.com/prometheus/prometheus/issues/7281

and here:

https://github.com/kubernetes-monitoring/kubernetes-mixin/issues/411

The current limit is 500m set by query.max-samples.

This cannot be changed in OpenShift (it is managed by the operator), but what we probably need to change is the rule, so that this query is not evaluated over the latest 30 days.
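
To give a feel for why a 30d range selector can hit the sample limit, here is a rough back-of-the-envelope sketch. It assumes each matching series contributes one sample per scrape interval over the whole window. The series count and the 30s scrape interval are hypothetical values, not taken from the affected cluster, and the limit is this report's "500m" read as 500 million (check the actual --query.max-samples value of your Prometheus). It is only an estimate for a single selector, not how Prometheus itself accounts for samples.

```python
from datetime import timedelta


def estimated_samples(series: int, window: timedelta, scrape_interval: timedelta) -> int:
    """Rough number of raw samples one range selector has to load.

    Assumes every series has one sample per scrape interval for the
    whole window (no gaps, no churn).
    """
    return series * (window // scrape_interval)


def exceeds_limit(series: int, window: timedelta,
                  scrape_interval: timedelta, max_samples: int) -> bool:
    """True if the estimate is above the configured sample limit."""
    return estimated_samples(series, window, scrape_interval) > max_samples


# Hypothetical numbers, for illustration only.
SCRAPE_INTERVAL = timedelta(seconds=30)   # assumed scrape interval
MATCHING_SERIES = 10_000                  # assumed series matched by one selector
MAX_SAMPLES = 500_000_000                 # "500m" from this report, read as 500 million

if __name__ == "__main__":
    for label, window in (("30d", timedelta(days=30)), ("1h", timedelta(hours=1))):
        samples = estimated_samples(MATCHING_SERIES, window, SCRAPE_INTERVAL)
        over = exceeds_limit(MATCHING_SERIES, window, SCRAPE_INTERVAL, MAX_SAMPLES)
        print(f"[{label}] ~{samples:,} samples, over limit: {over}")
```

With these assumed numbers the 30d window is already over the limit for one selector, while a 1h window over the same series stays far below it. The rule above contains several 30d selectors, so the real load is higher still.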

Comment 2 Scott Dodson 2020-12-14 14:23:45 UTC
Both the 4.6.z nor 4.7.0 bug have blocker- set on them and this is not a regression as far as anyone knows so therefore it's not valid to set it blocker+ here.

Comment 3 Scott Dodson 2020-12-14 14:25:48 UTC
Clarifying my poorly worded previous comment,

Both the 4.6.z and 4.7.0 bugs have blocker- set on them, and this is not a regression as far as anyone knows, so it's not valid to set it blocker+ here. Please use blocker? to indicate that you'd like engineering to evaluate whether this should block a z-stream release.

Comment 4 Damien Grisonnet 2020-12-14 14:56:00 UTC
I agree with @sdodson here. Even though this bug has a huge impact on our clients and OSD, it's not a regression, so we should not set the blocker+ flag on its z-stream fix. That said, this is still critical and very urgent.

Comment 10 errata-xmlrpc 2021-01-20 05:49:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.5.27 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0033