Description of problem:

The rule:

==========================================
- interval: 3m
  name: kube-apiserver-availability.rules
  rules:
  - expr: |
      1 - (
        (
          # write too slow
          sum(increase(apiserver_request_duration_seconds_count{verb=~"POST|PUT|PATCH|DELETE"}[30d]))
          -
          sum(increase(apiserver_request_duration_seconds_bucket{verb=~"POST|PUT|PATCH|DELETE",le="1"}[30d]))
        )
        +
        (
          # read too slow
          sum(increase(apiserver_request_duration_seconds_count{verb=~"LIST|GET"}[30d]))
          -
          (
            sum(increase(apiserver_request_duration_seconds_bucket{verb=~"LIST|GET",scope=~"resource|",le="0.1"}[30d]))
            +
            sum(increase(apiserver_request_duration_seconds_bucket{verb=~"LIST|GET",scope="namespace",le="0.5"}[30d]))
            +
            sum(increase(apiserver_request_duration_seconds_bucket{verb=~"LIST|GET",scope="cluster",le="5"}[30d]))
          )
        )
        +
        # errors
        sum(code:apiserver_request_total:increase30d{code=~"5.."} or vector(0))
      )
      /
      sum(code:apiserver_request_total:increase30d)
    labels:
      verb: all
    record: apiserver_request:availability30d
==========================================

loads too many samples, since every term in the query scans 30 days of raw series. The customer is hitting this error all the time:

"query processing would load too many samples into memory in query execution"

This has already been reported upstream here:
https://github.com/prometheus/prometheus/issues/7281
and here:
https://github.com/kubernetes-monitoring/kubernetes-mixin/issues/411

The current limit, set by query.max-samples, is 500M. It cannot be changed in OpenShift (the flag is managed by the operator), so what probably needs to change is the rule itself, so that a single evaluation no longer has to read the raw data for the full 30 days.
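One way to cut the number of samples a single evaluation loads (similar in spirit to what was discussed in the kubernetes-mixin issue above) is to pre-aggregate the counters over a short window with intermediate recording rules and then average the pre-computed series over 30 days. A minimal sketch, not the exact upstream change; the rule names below follow the mixin's naming convention but are illustrative here:

==========================================
# Scans only 1h of raw samples per evaluation.
- record: code_verb:apiserver_request_total:increase1h
  expr: sum by (code, verb) (increase(apiserver_request_total[1h]))

# Derives the 30d increase from the pre-computed 1h series instead of
# the raw counters; 24 * 30 turns the average hourly increase into a
# 30-day total.
- record: code_verb:apiserver_request_total:increase30d
  expr: avg_over_time(code_verb:apiserver_request_total:increase1h[30d]) * 24 * 30
==========================================

The same pattern would apply to the apiserver_request_duration_seconds bucket terms, so that apiserver_request:availability30d only ever reads pre-aggregated series.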
Reassigning to Damien.
No capacity to work on this currently.
*** Bug 1888549 has been marked as a duplicate of this bug. ***
This issue will continue to be worked on during the upcoming sprint (193).
This issue will continue to be addressed in the upcoming sprint (193). The upstream PR is open, but we still need to follow up on some technical discussions.
Tested with 4.7.0-0.nightly-2020-12-09-112139; the apiserver_request:availability30d recording rule has been removed:

# oc -n openshift-monitoring exec -c cluster-monitoring-operator cluster-monitoring-operator-849f4db66d-744tm -- grep -ri apiserver_request:availability30d /assets/prometheus-k8s
command terminated with exit code 1
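As a complementary check (a sketch, assuming the rules are shipped as PrometheusRule objects in the openshift-monitoring namespace, which is how the cluster-monitoring-operator deploys them), the rendered rules on the cluster can be searched directly; no output means the rule is gone:

# oc -n openshift-monitoring get prometheusrules -o yaml | grep apiserver_request:availability30d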
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633
Hello Palash,

The fix has already been backported and is available in OCP 4.6.9 and 4.5.27.

Kind regards,
Damien