Bug 1872786 - Rules in kube-apiserver.rules are taking too long and consuming too much memory for Prometheus to evaluate them
Summary: Rules in kube-apiserver.rules are taking too long and consuming too much memory for Prometheus to evaluate them
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 4.7.0
Assignee: Damien Grisonnet
QA Contact: Junqi Zhao
URL:
Whiteboard:
Duplicates: 1888549 (view as bug list)
Depends On:
Blocks: 1905903
 
Reported: 2020-08-26 15:45 UTC by German Parente
Modified: 2023-12-15 19:02 UTC
CC List: 29 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-24 15:16:22 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github kubernetes-monitoring kubernetes-mixin issues 411 0 None closed `code_verb:apiserver_request_total:increase30d` loads (too) many samples 2021-02-15 19:55:20 UTC
Github openshift cluster-monitoring-operator pull 980 0 None closed Bug 1872786: jsonnet: remove apiserver_request:availability30d 2021-02-15 19:55:21 UTC
Red Hat Knowledge Base (Solution) 5491081 0 None None None 2020-10-15 08:28:52 UTC
Red Hat Product Errata RHSA-2020:5633 0 None None None 2021-02-24 15:17:05 UTC

Description German Parente 2020-08-26 15:45:44 UTC
Description of problem:

the rule:
==========================================
    - interval: 3m
      name: kube-apiserver-availability.rules
      rules:
      - expr: |
          1 - (
            (
              # write too slow
              sum(increase(apiserver_request_duration_seconds_count{verb=~"POST|PUT|PATCH|DELETE"}[30d]))
              -
              sum(increase(apiserver_request_duration_seconds_bucket{verb=~"POST|PUT|PATCH|DELETE",le="1"}[30d]))
            ) +
            (
              # read too slow
              sum(increase(apiserver_request_duration_seconds_count{verb=~"LIST|GET"}[30d]))
              -
              (
                sum(increase(apiserver_request_duration_seconds_bucket{verb=~"LIST|GET",scope=~"resource|",le="0.1"}[30d])) +
                sum(increase(apiserver_request_duration_seconds_bucket{verb=~"LIST|GET",scope="namespace",le="0.5"}[30d])) +
                sum(increase(apiserver_request_duration_seconds_bucket{verb=~"LIST|GET",scope="cluster",le="5"}[30d]))
              )
            ) +
            # errors
            sum(code:apiserver_request_total:increase30d{code=~"5.."} or vector(0))
          )
          /
          sum(code:apiserver_request_total:increase30d)
        labels:
          verb: all
        record: apiserver_request:availability30d
===========================================
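In plain terms, the rule computes 30-day availability as one minus the fraction of "bad" requests (too-slow writes, too-slow reads, and 5xx errors) over all requests. With made-up counts for illustration:

```python
# Illustrative arithmetic for the availability formula above.
# All counts are invented for the example, not taken from a real cluster.
slow_writes = 1_000       # write requests slower than 1s over 30d
slow_reads  = 4_000       # read requests slower than their scope's threshold
errors_5xx  = 5_000       # requests answered with a 5xx status code
total       = 10_000_000  # all apiserver requests over 30d

availability = 1 - (slow_writes + slow_reads + errors_5xx) / total
print(availability)  # 0.999
```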

is loading too much data, since each range selector spans 30 days of samples.
The customer is hitting this error message constantly:

"query processing would load too many samples into memory in query execution"

This has been mentioned already upstream here:

https://github.com/prometheus/prometheus/issues/7281

and here:

https://github.com/kubernetes-monitoring/kubernetes-mixin/issues/411
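
The direction eventually taken upstream was to precompute cheap short-window increases and average those over 30 days, rather than evaluating a raw [30d] selector over raw samples. A sketch of that pattern (rule names and grouping labels are illustrative, not the exact rules that shipped):

```yaml
# Sketch only: precompute a 1h increase, then average the precomputed
# series over 30d and scale back up, so the 30d query reads far fewer samples.
- record: code:apiserver_request_total:increase1h
  expr: sum by (code) (increase(apiserver_request_total[1h]))
- record: code:apiserver_request_total:increase30d
  expr: avg_over_time(code:apiserver_request_total:increase1h[30d]) * 24 * 30
```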

The current limit is 500m, set by query.max-samples.

This flag cannot be changed in OpenShift (Prometheus is managed by the operator), so what probably needs to change is the rule itself, so that the query does not have to read the full 30 days of raw samples.
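
A rough back-of-envelope shows why a 30-day range selector can exceed such a limit. The series count below is an assumption for illustration, and 50 million is the documented Prometheus default for --query.max-samples:

```python
# Rough estimate of how many samples a [30d] range selector must load.
# SERIES is a made-up, illustrative number of matching time series.
SCRAPE_INTERVAL_S = 15       # typical scrape interval
RANGE_S = 30 * 24 * 3600     # the [30d] range selector
SERIES = 2_000               # assumed apiserver_request_duration_seconds_* series
MAX_SAMPLES = 50_000_000     # Prometheus default for --query.max-samples

samples_per_series = RANGE_S // SCRAPE_INTERVAL_S
total = samples_per_series * SERIES

print(samples_per_series)    # 172800 samples per series over 30d
print(total)                 # 345600000
print(total > MAX_SAMPLES)   # True: the query would be rejected
```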

Comment 4 Sergiusz Urbaniak 2020-09-30 08:51:39 UTC
reassigning to Damien

Comment 5 Sergiusz Urbaniak 2020-10-02 13:28:55 UTC
No capacity to work on this currently.

Comment 7 Simon Pasquier 2020-10-15 07:49:05 UTC
*** Bug 1888549 has been marked as a duplicate of this bug. ***

Comment 17 Damien Grisonnet 2020-11-13 17:02:40 UTC
This issue will continue to be worked on during the upcoming sprint (193).

Comment 18 Damien Grisonnet 2020-11-16 09:13:59 UTC
This issue will continue to be addressed in the upcoming sprint (193). The upstream PR has been opened, but we still need to follow up on some technical discussions.

Comment 24 Junqi Zhao 2020-12-10 01:53:33 UTC
Tested with 4.7.0-0.nightly-2020-12-09-112139; the apiserver_request:availability30d record rule has been removed:
# oc -n openshift-monitoring exec -c cluster-monitoring-operator cluster-monitoring-operator-849f4db66d-744tm -- grep -ri apiserver_request:availability30d /assets/prometheus-k8s
command terminated with exit code 1

Comment 38 errata-xmlrpc 2021-02-24 15:16:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

Comment 40 Damien Grisonnet 2021-03-05 17:31:12 UTC
Hello Palash,

The fix has already been backported and is available in OCP 4.6.9 and 4.5.27.

Kind regards,
Damien

