Bug 1872786

Summary: Rules in kube-apiserver.rules are taking too long and consuming too much memory for Prometheus to evaluate them
Product: OpenShift Container Platform
Reporter: German Parente <gparente>
Component: Monitoring
Assignee: Damien Grisonnet <dgrisonn>
Status: CLOSED ERRATA
QA Contact: Junqi Zhao <juzhao>
Severity: medium
Docs Contact:
Priority: high
Version: 4.5
CC: achakrat, adeshpan, a.grimmeissen, aivaraslaimikis, akhaire, alegrand, anpicker, cblecker, christopher.obrien, dgrisonn, erooth, jnordell, kakkoyun, ksathe, lcosic, mbargenq, mf.flip, michele.sandro.emma, mrobson, ocasalsa, pkhaire, pkrupa, rabdulra, rpalathi, sagopina, spasquie, sreber, surbania, wking
Target Milestone: ---
Keywords: ServiceDeliveryBlocker
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-02-24 15:16:22 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1905903

Description German Parente 2020-08-26 15:45:44 UTC
Description of problem:

the rule:
==========================================
    - interval: 3m
      name: kube-apiserver-availability.rules
      rules:
      - expr: |
          1 - (
            (
              # write too slow
              sum(increase(apiserver_request_duration_seconds_count{verb=~"POST|PUT|PATCH|DELETE"}[30d]))
              -
              sum(increase(apiserver_request_duration_seconds_bucket{verb=~"POST|PUT|PATCH|DELETE",le="1"}[30d]))
            ) +
            (
              # read too slow
              sum(increase(apiserver_request_duration_seconds_count{verb=~"LIST|GET"}[30d]))
              -
              (
                sum(increase(apiserver_request_duration_seconds_bucket{verb=~"LIST|GET",scope=~"resource|",le="0.1"}[30d])) +
                sum(increase(apiserver_request_duration_seconds_bucket{verb=~"LIST|GET",scope="namespace",le="0.5"}[30d])) +
                sum(increase(apiserver_request_duration_seconds_bucket{verb=~"LIST|GET",scope="cluster",le="5"}[30d]))
              )
            ) +
            # errors
            sum(code:apiserver_request_total:increase30d{code=~"5.."} or vector(0))
          )
          /
          sum(code:apiserver_request_total:increase30d)
        labels:
          verb: all
        record: apiserver_request:availability30d
===========================================

This rule loads too many samples because the query covers a 30-day range.
The customer is seeing this error message all the time:

"query processing would load too many samples into memory in query execution"
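
To make the scale of the problem concrete, here is a rough back-of-the-envelope sketch of how many raw samples a 30-day range selector can pull in. It is only an estimate under stated assumptions: the 30-second scrape interval, the series count of 1000, and the limit of 50,000,000 are illustrative placeholders, not values taken from the affected cluster, and the function names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def samples_per_series(range_days, scrape_interval_s):
    """Raw samples one series contributes to a range selector of range_days."""
    return (range_days * SECONDS_PER_DAY) // scrape_interval_s


def estimated_samples(series_count, range_days, scrape_interval_s):
    """Rough total of samples loaded for one range selector."""
    return series_count * samples_per_series(range_days, scrape_interval_s)


def exceeds_limit(series_count, range_days, scrape_interval_s, max_samples):
    """True if the estimate is above the configured sample limit."""
    total = estimated_samples(series_count, range_days, scrape_interval_s)
    return total > max_samples


if __name__ == "__main__":
    # Assumed values for illustration only; substitute the real
    # series count, scrape interval, and query.max-samples setting.
    total = estimated_samples(series_count=1000, range_days=30, scrape_interval_s=30)
    print(total, exceeds_limit(1000, 30, 30, max_samples=50_000_000))
```

Each `[30d]` selector in the rule matches its own set of series, so the real footprint depends on how many `apiserver_request_duration_seconds_*` series the cluster exposes.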

This has been mentioned already upstream here:

https://github.com/prometheus/prometheus/issues/7281

and here:

https://github.com/kubernetes-monitoring/kubernetes-mixin/issues/411

The current limit is 500m set by query.max-samples.

This cannot be changed in OpenShift (it is managed by the operator), but what we probably need to change is the rule, so that the query does not have to load the full last 30 days of samples.
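
One possible direction, sketched here purely as an illustration and not necessarily what was eventually changed, is to precompute short-window increases and derive the 30-day figure from them, so that no single query has to read 30 days of raw samples. The sketch below shows the arithmetic only. The hourly window, the helper names, and the series counts are assumptions; the 3-minute evaluation interval is taken from the rule group above.

```python
HOURS_IN_30D = 24 * 30
SECONDS_IN_30D = HOURS_IN_30D * 60 * 60


def increase_30d_from_hourly(hourly_increases):
    """Approximate a 30-day increase as the mean hourly increase times 720 hours."""
    if not hourly_increases:
        return 0.0
    mean_hourly = sum(hourly_increases) / len(hourly_increases)
    return mean_hourly * HOURS_IN_30D


def points_read(series_count, interval_s):
    """Points a 30-day range selector reads for series sampled every interval_s."""
    return series_count * (SECONDS_IN_30D // interval_s)


if __name__ == "__main__":
    # Hypothetical: 1000 raw series scraped every 30s versus one
    # pre-aggregated recorded series evaluated every 3 minutes.
    print(points_read(1000, 30), points_read(1, 180))
    print(increase_30d_from_hourly([10.0] * HOURS_IN_30D))
```

The saving comes mostly from aggregating before the long window is applied: the 30-day query then reads a handful of recorded series instead of every raw bucket series.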

Comment 4 Sergiusz Urbaniak 2020-09-30 08:51:39 UTC
reassigning to Damien

Comment 5 Sergiusz Urbaniak 2020-10-02 13:28:55 UTC
No capacity to work on this currently.

Comment 7 Simon Pasquier 2020-10-15 07:49:05 UTC
*** Bug 1888549 has been marked as a duplicate of this bug. ***

Comment 17 Damien Grisonnet 2020-11-13 17:02:40 UTC
This issue will continue to be worked on during the upcoming sprint (193).

Comment 18 Damien Grisonnet 2020-11-16 09:13:59 UTC
This issue will continue to be addressed in the upcoming sprint (193). The upstream PR is opened, but we still need to follow-up on some technical discussions.

Comment 24 Junqi Zhao 2020-12-10 01:53:33 UTC
Tested with 4.7.0-0.nightly-2020-12-09-112139. The apiserver_request:availability30d recording rule has been removed; grep finds no match in the operator assets:
# oc -n openshift-monitoring exec -c cluster-monitoring-operator cluster-monitoring-operator-849f4db66d-744tm -- grep -ri apiserver_request:availability30d /assets/prometheus-k8s
command terminated with exit code 1

Comment 38 errata-xmlrpc 2021-02-24 15:16:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

Comment 40 Damien Grisonnet 2021-03-05 17:31:12 UTC
Hello Palash,

The fix has already been backported and is available in OCP 4.6.9 and 4.5.27.

Kind regards,
Damien