Bug 1633436
| Summary: | [free-stg] use of histogram makes apiserver latency difficult to clear | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Justin Pierce <jupierce> |
| Component: | Monitoring | Assignee: | Frederic Branczyk <fbranczy> |
| Status: | CLOSED NOTABUG | QA Contact: | Junqi Zhao <juzhao> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.11.0 | CC: | minden |
| Target Milestone: | --- | | |
| Target Release: | 3.11.z | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-02-22 16:48:31 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
The colons in the metric name indicate that this is a recording rule. The underlying PromQL query of this particular rule is:
```
histogram_quantile(0.99,
sum without(instance, pod) (rate(apiserver_request_latencies_bucket{job="apiserver"}[5m])))
/ 1e+06
```
Because rate is used here, the reported request latency should recover within a few minutes, provided the latency has actually improved. This rule should only surface latency spikes from the last five minutes.
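As a rough illustration of why a rate-based window clears on its own, here is a minimal Python sketch (the `rate` function and the sample data are hypothetical stand-ins, not actual Prometheus code): a counter retains a latency spike in its cumulative value forever, but a rate computed over a sliding 5-minute window returns to baseline once the spike falls out of the window.

```python
# Hypothetical illustration: a counter only ever increases, so a spike
# stays in the raw cumulative value forever, but rate() over a 5m
# window "forgets" it once the window slides past the spike.

def rate(samples, window, now):
    """Per-second increase of a counter over [now - window, now].
    samples: list of (timestamp_seconds, cumulative_value)."""
    in_window = [(t, v) for t, v in samples if now - window <= t <= now]
    if len(in_window) < 2:
        return 0.0
    (t0, v0), (t1, v1) = in_window[0], in_window[-1]
    return (v1 - v0) / (t1 - t0)

# One sample every 15s; a burst of slow requests around t=300s.
samples = []
total = 0
for t in range(0, 901, 15):
    total += 10 if 300 <= t < 360 else 0  # spike only between 300s and 360s
    samples.append((t, total))

print(rate(samples, 300, 360))  # window still contains the spike: positive
print(rate(samples, 300, 900))  # spike has left the window: 0.0
```

The cumulative counter itself never decreases after the spike, which is the behaviour the original alert expression exposes; only the windowed rate decays back to zero.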
Created attachment 1487567 [details] latencies which cannot be cleared

Description of problem:

KubeAPILatencyHigh is a warning rule:

```
cluster_quantile:apiserver_request_latencies:histogram_quantile{job="apiserver",quantile="0.99",subresource!="log",verb!~"^(?:LIST|WATCH|WATCHLIST|PROXY|CONNECT)$"} > 1
```

As I understand histograms in Prometheus, the buckets are cumulative and do not necessarily reset. This alert will therefore continue to warn for an indeterminate period without there being a way to 'clear' the alert.

Version-Release number of selected component (if applicable):
v3.11.16

How reproducible:
By design

Actual results:
If there is a single high-latency request (e.g. a master is temporarily troubled), Prometheus will raise this warning after 15m and an operations team has no way of clearing it, even if the master is healthy again.

Expected results:
I would like to know if there are *recent* high-latency requests. A full history is useful information, but should not be a warning.
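To illustrate the cumulative-bucket behaviour described above, here is a small Python sketch of a Prometheus-style quantile estimate (the `histogram_quantile` helper and the bucket values are illustrative assumptions, not the actual Prometheus implementation): each `le` bucket counts every observation at or below its bound, so a couple of slow requests raise the counts of all larger buckets and keep the estimated quantile high.

```python
# Hypothetical sketch of estimating a quantile from cumulative buckets,
# in the style of Prometheus's histogram_quantile(). Bucket counts are
# cumulative: each bucket includes every observation below its upper
# bound ("le"), so a few slow requests inflate all larger buckets.

def histogram_quantile(q, buckets):
    """buckets: sorted list of (upper_bound, cumulative_count)."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if count == prev_count:
                return bound
            # Linear interpolation inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# 98 fast requests under 0.1s plus 2 slow ~10s requests:
buckets = [(0.1, 98), (1.0, 98), (10.0, 100)]
print(histogram_quantile(0.99, buckets))  # 5.5: two slow requests push p99 well over 1s
```

With only two slow requests out of a hundred, the estimated p99 lands mid-way into the largest bucket and would trip a `> 1` threshold, matching the behaviour reported here.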