Bug 1743911 - KubeAPILatencyHigh alert flapping
Summary: KubeAPILatencyHigh alert flapping
Keywords:
Status: CLOSED DUPLICATE of bug 1825001
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.3.z
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: Sam Batschelet
QA Contact: ge liu
URL:
Whiteboard:
Duplicates: 1771342 1821697
Depends On: 1745566
Blocks: 1771342
 
Reported: 2019-08-21 00:16 UTC by Naveen Malik
Modified: 2023-10-06 18:30 UTC
CC List: 12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1745566 1771342
Environment:
Last Closed: 2020-05-29 19:16:26 UTC
Target Upstream Version:
Embargoed:


Attachments
cluster_quantile:apiserver_request_latencies:histogram_quantile{job="apiserver",quantile="0.99",subresource!="log",verb!~"^(?:LIST|WATCH|WATCHLIST|PROXY|CONNECT)$"} > 4 (365.23 KB, image/png), 2019-08-21 00:16 UTC, Naveen Malik
ALERTS{alertstate="firing"} (446.94 KB, image/png), 2019-08-21 00:18 UTC, Naveen Malik
etcd log (3.61 MB, text/plain), 2019-09-02 16:22 UTC, Rick Rackow

Description Naveen Malik 2019-08-21 00:16:14 UTC
Created attachment 1606309 [details]
cluster_quantile:apiserver_request_latencies:histogram_quantile{job="apiserver",quantile="0.99",subresource!="log",verb!~"^(?:LIST|WATCH|WATCHLIST|PROXY|CONNECT)$"} > 4

Description of problem:
In OSD v4 we have noticed the KubeAPILatencyHigh alert trigger and then typically resolve within 2 to 10 minutes of being raised.  To date, the SRE team has not had to take any action to fix this issue on any cluster, yet gets paged 24/7 for these alerts because they carry severity "critical".


Version-Release number of selected component (if applicable):
4.1.x


How reproducible:
The alert resolves very consistently; I can show data from specific clusters where resolution occurs within 10 minutes of the alert triggering.


Steps to Reproduce:
1. Unknown.
2.
3.


Actual results:
Alert is raised in Prometheus and triggers a page to SRE via the PagerDuty integration in Alertmanager.  The alert automatically resolves within 10 minutes.

Expected results:
No critical alert is raised unless it is truly a critical problem that requires human intervention.


Additional info:
The attached screenshot is a 1-hour slice from a production OSD v4.1.9 cluster showing a period when this alert triggered and automatically resolved.
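
One rough way to quantify the flapping over a day is the ALERTS meta-series that Prometheus exposes; this is only a sketch, and the result counts evaluation samples spent in the firing state, so it depends on the rule evaluation interval:

count_over_time(ALERTS{alertname="KubeAPILatencyHigh",alertstate="firing"}[24h])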

Comment 1 Naveen Malik 2019-08-21 00:18:17 UTC
Created attachment 1606310 [details]
ALERTS{alertstate="firing"}

Alerts firing at the time of the high API latency, over the same range and end time.

Comment 2 Stefan Schimanski 2019-08-21 08:21:35 UTC
Do you see warnings about long-running etcd requests?

Comment 3 Stefan Schimanski 2019-08-21 08:22:25 UTC
Can you provide us with kube-apiserver logs from the time the alert fires and the minutes before?
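
A rough way to gather both of those (a sketch; the pod name is a placeholder and flag details may differ by release):

# scan an etcd member log for slow-request warnings
grep -i "took too long" etcd.log

# list the kube-apiserver pods and capture logs around the alert window
oc get pods -n openshift-kube-apiserver
oc logs -n openshift-kube-apiserver <kube-apiserver-pod> --since=30m > kube-apiserver.log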

Comment 5 Rick Rackow 2019-09-02 16:22:46 UTC
Created attachment 1610835 [details]
etcd log

This etcd log is from the master node that seemed to have issues.

Comment 8 Frederic Branczyk 2019-10-18 09:04:17 UTC
I think there are two things to do here:

1) Make the alert less flaky with a multi-window burn-rate approach, so that it only fires when error budget is being consumed fast enough that we would not make our SLA.
2) This seems to be an actual problem: during these periods a lot of requests take a very long time, so it is a real issue that the apiserver team needs to investigate.

The alert fix is probably something the monitoring team can take care of; nevertheless, the alert is correctly surfacing an issue, so there does seem to be an underlying problem in the apiserver (and/or other components).
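
For illustration, a multi-window burn-rate condition of the kind described above could look roughly like the following; the metric names, the 1s latency threshold, the 1% error budget, and the 14.4 burn-rate factor are assumptions for the sketch, not the rule that eventually shipped:

(
  (
    sum(rate(apiserver_request_duration_seconds_count{job="apiserver"}[1h]))
    - sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="1"}[1h]))
  )
  / sum(rate(apiserver_request_duration_seconds_count{job="apiserver"}[1h]))
) > (14.4 * 0.01)
and
(
  (
    sum(rate(apiserver_request_duration_seconds_count{job="apiserver"}[5m]))
    - sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="1"}[5m]))
  )
  / sum(rate(apiserver_request_duration_seconds_count{job="apiserver"}[5m]))
) > (14.4 * 0.01)

Both the long (1h) and short (5m) windows must exceed the burn-rate threshold before the alert fires, which filters out brief spikes like the ones shown in the attachments.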

Comment 17 Mitchell Rollinson 2019-11-20 00:17:59 UTC
Hi folks,

I have a customer who has reported the KubeAPILatencyHigh alert flapping when taking etcd backups.

OCP 3.11

Please let me know if you require additional information, i.e. etcd dumps for each peer or kube-apiserver logs, in order to gauge general etcd performance.

Regards

Mitch

Comment 18 Pawel Krupa 2020-01-08 15:20:50 UTC
*** Bug 1771342 has been marked as a duplicate of this bug. ***

Comment 21 Rick Rackow 2020-05-20 11:06:02 UTC
This bug is still valid and we're seeing it in 4.3.18.

Can you let us know what information we can provide to help troubleshoot this?

Comment 22 Sam Batschelet 2020-05-20 12:03:25 UTC
This is not a bug against etcd; it is a manifestation of the overhead the snapshotting action places on the server. To help us better understand the impact, screenshots of the following metrics over the alerted time range would be useful. We could consider adjusting the alert's tolerance to better accommodate the snapshotting overhead. Another idea is to use a conditional silence of the alert during the backup schedule. Moving to 4.6.

histogram_quantile(0.99, irate(etcd_disk_backend_commit_duration_seconds_bucket[5m]))

histogram_quantile(0.99, irate(etcd_network_peer_round_trip_time_seconds_bucket[5m]))

histogram_quantile(0.99, irate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))
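
One way the conditional-silence idea could be wired up is for the backup job to create a short silence around its run; this is only a sketch, and the Alertmanager URL, duration, and author are placeholders (the in-cluster Alertmanager normally sits behind auth, so a route or token would be needed in practice):

amtool silence add alertname=KubeAPILatencyHigh \
  --alertmanager.url=http://alertmanager-main.openshift-monitoring.svc:9093 \
  --duration=30m \
  --author="sre-backup-job" \
  --comment="expected apiserver latency spike during etcd snapshot/backup"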

Comment 23 Dan Mace 2020-05-20 18:52:29 UTC
*** Bug 1821697 has been marked as a duplicate of this bug. ***

Comment 24 Sam Batschelet 2020-05-26 17:21:08 UTC
> This bug is still valid and we're seeing it in 4.3.18.

> Can you let us know what information we can provide to help troubleshoot this?

@Rick please see comment 22[1].

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1743911#c22

Comment 25 Sam Batschelet 2020-05-29 19:16:26 UTC

*** This bug has been marked as a duplicate of bug 1825001 ***

