Bug 1906081 - Rules in kube-apiserver.rules are taking too long and consuming too much memory for Prometheus to evaluate them
Summary: Rules in kube-apiserver.rules are taking too long and consuming too much memory for Prometheus to evaluate them
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: 4.5.z
Assignee: Damien Grisonnet
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On: 1905903
Blocks:
 
Reported: 2020-12-09 16:12 UTC by Damien Grisonnet
Modified: 2021-04-23 13:42 UTC
CC: 13 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-01-20 05:49:28 UTC
Target Upstream Version:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-monitoring-operator pull 1016 0 None closed [release-4.5] Bug 1906081: jsonnet: remove apiserver_request:availability30d 2021-02-09 09:23:03 UTC
Red Hat Product Errata RHBA-2021:0033 0 None None None 2021-01-20 05:49:54 UTC

Description Damien Grisonnet 2020-12-09 16:12:03 UTC
This bug was initially created as a copy of Bug #1872786

I am copying this bug because: 

This bug still exists in 4.5.

Description of problem:

The following recording rule:
==========================================
    - interval: 3m
      name: kube-apiserver-availability.rules
      rules:
      - expr: |
          1 - (
            (
              # write too slow
              sum(increase(apiserver_request_duration_seconds_count{verb=~"POST|PUT|PATCH|DELETE"}[30d]))
              -
              sum(increase(apiserver_request_duration_seconds_bucket{verb=~"POST|PUT|PATCH|DELETE",le="1"}[30d]))
            ) +
            (
              # read too slow
              sum(increase(apiserver_request_duration_seconds_count{verb=~"LIST|GET"}[30d]))
              -
              (
                sum(increase(apiserver_request_duration_seconds_bucket{verb=~"LIST|GET",scope=~"resource|",le="0.1"}[30d])) +
                sum(increase(apiserver_request_duration_seconds_bucket{verb=~"LIST|GET",scope="namespace",le="0.5"}[30d])) +
                sum(increase(apiserver_request_duration_seconds_bucket{verb=~"LIST|GET",scope="cluster",le="5"}[30d]))
              )
            ) +
            # errors
            sum(code:apiserver_request_total:increase30d{code=~"5.."} or vector(0))
          )
          /
          sum(code:apiserver_request_total:increase30d)
        labels:
          verb: all
        record: apiserver_request:availability30d
===========================================
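In plain terms, the rule computes a 30-day availability ratio: 1 minus the fraction of requests that were either too slow (per-scope latency thresholds) or answered with a 5xx error. A minimal Python sketch of that arithmetic, using entirely hypothetical request counts (the real values come from the apiserver_request_duration_seconds_* and code:apiserver_request_total:increase30d series):

```python
# Hypothetical 30-day request counts for illustration only.
total_requests = 1_000_000
slow_writes = 2_000   # POST/PUT/PATCH/DELETE slower than 1s
slow_reads = 3_000    # LIST/GET slower than their per-scope threshold
errors_5xx = 1_000    # responses with a 5xx status code

# Same shape as the PromQL expression above:
# 1 - (write_too_slow + read_too_slow + errors) / total
unavailable = slow_writes + slow_reads + errors_5xx
availability_30d = 1 - unavailable / total_requests
print(availability_30d)  # 0.994
```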

loads too many samples, since the query spans 30 days.
The customer sees this error message constantly:

"query processing would load too many samples into memory in query execution"

This has been mentioned already upstream here:

https://github.com/prometheus/prometheus/issues/7281

and here:

https://github.com/kubernetes-monitoring/kubernetes-mixin/issues/411

The current limit is 500m, set by query.max-samples.

This cannot be changed in OpenShift (it is managed by the operator), so what probably needs to change is the rule itself, so that the query does not span the full latest 30 days.
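To see why a raw [30d] selector hits this limit: a range query must load roughly one point per series per scrape interval over the whole window, so samples loaded ≈ series × (range / scrape interval). A back-of-the-envelope sketch, where the series count and scrape interval are assumptions for illustration, not measured values:

```python
# Rough estimate of samples Prometheus must load to evaluate a [30d]
# range selector over the apiserver histogram metrics.
series = 10_000            # assumed histogram series on a large cluster
scrape_interval_s = 30     # assumed scrape interval
range_s = 30 * 24 * 3600   # the [30d] window in seconds

samples_loaded = series * (range_s // scrape_interval_s)
print(samples_loaded)      # 864_000_000 samples

# Compare with the limit cited above (500m, read as 500 million):
max_samples = 500_000_000
print(samples_loaded > max_samples)  # exceeds the limit
```

Even with conservative assumptions, the estimate lands well past the limit, which is why the query fails rather than the rule merely being slow.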

Comment 2 Scott Dodson 2020-12-14 14:23:45 UTC
Both the 4.6.z nor 4.7.0 bug have blocker- set on them and this is not a regression as far as anyone knows so therefore it's not valid to set it blocker+ here.

Comment 3 Scott Dodson 2020-12-14 14:25:48 UTC
Clarifying my poorly worded previous comment,

Both the 4.6.z and 4.7.0 bugs have blocker- set on them, and this is not a regression as far as anyone knows, so it's not valid to set blocker+ here. Please use blocker? to indicate that you'd like engineering to evaluate whether this should block a z-stream release.

Comment 4 Damien Grisonnet 2020-12-14 14:56:00 UTC
I agree with @sdodson here. Even though this bug has a huge impact on our clients and OSD, it's not a regression, so we should not set the blocker+ flag on its z-stream fix. That said, it is still critical and very urgent.

Comment 10 errata-xmlrpc 2021-01-20 05:49:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.5.27 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0033

