Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1633436

Summary: [free-stg] use of histogram makes apiserver latency difficult to clear
Product: OpenShift Container Platform
Reporter: Justin Pierce <jupierce>
Component: Monitoring
Assignee: Frederic Branczyk <fbranczy>
Status: CLOSED NOTABUG
QA Contact: Junqi Zhao <juzhao>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 3.11.0
CC: minden
Target Milestone: ---   
Target Release: 3.11.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-02-22 16:48:31 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
latencies which cannot be cleared none

Description Justin Pierce 2018-09-27 02:28:40 UTC
Created attachment 1487567
latencies which cannot be cleared

Description of problem:
KubeAPILatencyHigh is a warning rule with the following expression:

cluster_quantile:apiserver_request_latencies:histogram_quantile{job="apiserver",quantile="0.99",subresource!="log",verb!~"^(?:LIST|WATCH|WATCHLIST|PROXY|CONNECT)$"} > 1

As I understand histograms in Prometheus, the buckets are cumulative and do not necessarily reset. This alert will therefore continue to warn for an indeterminate period, with no way to 'clear' it.

Version-Release number of selected component (if applicable):
v3.11.16

How reproducible:
by design

Actual results:
If there is a single high-latency request (e.g. a master is temporarily troubled), Prometheus will raise this warning after 15m, and an operations team has no way of clearing it, even if the master is healthy again.

Expected results:
I would like to know if there are *recent* high latency requests. A full history is useful information, but should not be a warning.

Comment 1 Frederic Branczyk 2018-09-28 09:51:52 UTC
The use of colons in the name indicates that this is a recording rule. The underlying PromQL query of this particular rule is:

```
histogram_quantile(0.99,
  sum without(instance, pod) (rate(apiserver_request_latencies_bucket{job="apiserver"}[5m])))
  / 1e+06
```

Because the query applies rate() over a five-minute window, the reported latency should recover within a few minutes once the actual latency improves. The rule should only reflect latency spikes from the last five minutes.
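
To illustrate why the windowed rate matters, here is a rough Python sketch, not the Prometheus implementation. The bucket bounds and counter snapshots are made-up values, `rate` and `histogram_quantile` are simplified stand-ins for the PromQL functions of the same name, and the `sum without(instance, pod)` aggregation is omitted because the sketch models a single series.

```python
import math

# Hypothetical bucket upper bounds in microseconds; the real metric's buckets may differ.
BOUNDS = [125_000, 250_000, 500_000, 1_000_000, 2_000_000, 4_000_000, 8_000_000, math.inf]
WINDOW = 300  # seconds, matching the [5m] range in the query


def rate(old, new, seconds=WINDOW):
    """Per-second increase of each cumulative bucket counter over the window."""
    return [(n - o) / seconds for o, n in zip(old, new)]


def histogram_quantile(q, bounds, cumulative):
    """Simplified estimate for 0 < q <= 1: interpolate linearly inside the bucket holding rank q."""
    total = cumulative[-1]
    if total == 0:
        return math.nan
    rank = q * total
    for i, count in enumerate(cumulative):
        if count >= rank:
            if math.isinf(bounds[i]):
                return bounds[i - 1]  # fall back to the highest finite bound
            lower = bounds[i - 1] if i > 0 else 0.0
            prev = cumulative[i - 1] if i > 0 else 0.0
            return lower + (bounds[i] - lower) * (rank - prev) / (count - prev)
    return math.nan


# Made-up cumulative bucket counters, sampled five minutes apart.
t0 = [1000] * 8                  # 1000 fast requests so far
t5 = [1900] * 5 + [2000] * 3     # +900 fast, +100 slow (between 2s and 4s)
t10 = [2900] * 5 + [3000] * 3    # +1000 fast, no new slow requests

during_spike = histogram_quantile(0.99, BOUNDS, rate(t0, t5)) / 1e6
after_recovery = histogram_quantile(0.99, BOUNDS, rate(t5, t10)) / 1e6
raw_counters = histogram_quantile(0.99, BOUNDS, t10) / 1e6  # no rate(), full history

print(f"p99 during spike:    {during_spike:.3f}s")
print(f"p99 after recovery:  {after_recovery:.3f}s")
print(f"p99 on raw counters: {raw_counters:.3f}s")
```

With these sample numbers, the windowed p99 is well above the 1 second threshold during the spike and drops back below it in the next window. Only the quantile computed on the raw cumulative counters stays high, which is the behaviour the description was worried about and which the recording rule avoids by using rate().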