Bug 1633436 - [free-stg] use of histogram makes apiserver latency difficult to clear
Summary: [free-stg] use of histogram makes apiserver latency difficult to clear
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 3.11.z
Assignee: Frederic Branczyk
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-09-27 02:28 UTC by Justin Pierce
Modified: 2019-02-22 16:48 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-02-22 16:48:31 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
latencies which cannot be cleared (136.69 KB, image/png)
2018-09-27 02:28 UTC, Justin Pierce

Description Justin Pierce 2018-09-27 02:28:40 UTC
Created attachment 1487567 [details]
latencies which cannot be cleared

Description of problem:
KubeAPILatencyHigh is a warning rule:

```
cluster_quantile:apiserver_request_latencies:histogram_quantile{job="apiserver",quantile="0.99",subresource!="log",verb!~"^(?:LIST|WATCH|WATCHLIST|PROXY|CONNECT)$"} > 1
```

As I understand histograms in Prometheus, the buckets are cumulative and do not necessarily reset. This alert will therefore continue to fire for an indeterminate period, with no way to 'clear' it.
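The cumulative-bucket behavior described above can be illustrated with a sketch of the Prometheus exposition format (metric name taken from the rule above; the label values and counts are made up for illustration):

```
# Each bucket counts all observations <= its "le" bound since process start.
# The counters only ever increase, so a past latency spike stays in the counts.
apiserver_request_latencies_bucket{verb="GET",le="250000"} 9800
apiserver_request_latencies_bucket{verb="GET",le="500000"} 9950
apiserver_request_latencies_bucket{verb="GET",le="+Inf"}   10000
```

A quantile computed directly over these raw counters would reflect the entire history of the process, not just recent requests.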

Version-Release number of selected component (if applicable):
v3.11.16

How reproducible:
by design

Actual results:
If there is a single high-latency request (e.g. a master is temporarily troubled), Prometheus will raise this warning after 15m, and the operations team has no way of clearing it -- even after the master is healthy again.

Expected results:
I would like to know if there are *recent* high-latency requests. A full history is useful information, but it should not trigger a warning.

Comment 1 Frederic Branczyk 2018-09-28 09:51:52 UTC
The use of colons in the name indicates that this is a recording rule. The underlying PromQL query of this particular rule is:

```
histogram_quantile(0.99,
  sum without(instance, pod) (rate(apiserver_request_latencies_bucket{job="apiserver"}[5m])))
  / 1e+06
```

Because of the use of rate here, the reported latency should recover within a few minutes once the latency actually improves. The rule only reflects latency spikes over the last five minutes.
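The difference can be made concrete with two PromQL sketches over the same metric as the recording rule above (the first query is hypothetical and shown only for contrast; it is not what the alert evaluates):

```
# Over raw cumulative buckets: reflects all observations since process start,
# so a past spike would never age out of the result.
histogram_quantile(0.99,
  sum without(instance, pod) (
    apiserver_request_latencies_bucket{job="apiserver"}))

# Over rate(): reflects only bucket increments in the last 5 minutes,
# so a one-off spike drops out of the result once the window passes.
histogram_quantile(0.99,
  sum without(instance, pod) (
    rate(apiserver_request_latencies_bucket{job="apiserver"}[5m])))
```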

