Bug 1743911

Summary: KubeAPILatencyHigh alert flapping
Product: OpenShift Container Platform
Reporter: Naveen Malik <nmalik>
Component: Etcd
Assignee: Sam Batschelet <sbatsche>
Status: CLOSED DUPLICATE
QA Contact: ge liu <geliu>
Severity: high
Docs Contact:
Priority: high
Version: 4.3.z
CC: aos-bugs, fbranczy, haowang, jeder, kelly.brown1, mfojtik, mirollin, mmariyan, mwoodson, pkrupa, rrackow, skumari
Target Milestone: ---
Keywords: OpsBlocker, Reopened, UpcomingSprint
Target Release: 4.6.0
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 1745566, 1771342
Environment:
Last Closed: 2020-05-29 19:16:26 UTC
Type: Bug
Bug Depends On: 1745566    
Bug Blocks: 1771342    
Attachments (flags: none):
cluster_quantile:apiserver_request_latencies:histogram_quantile{job="apiserver",quantile="0.99",subresource!="log",verb!~"^(?:LIST|WATCH|WATCHLIST|PROXY|CONNECT)$"} > 4
ALERTS{alertstate="firing"}
etcd log

Description Naveen Malik 2019-08-21 00:16:14 UTC
Created attachment 1606309 [details]
cluster_quantile:apiserver_request_latencies:histogram_quantile{job="apiserver",quantile="0.99",subresource!="log",verb!~"^(?:LIST|WATCH|WATCHLIST|PROXY|CONNECT)$"} > 4

Description of problem:
In OSD v4 we have noticed the KubeAPILatencyHigh alert trigger and then typically resolve within 2 to 10 minutes of being raised. To date, the SRE team has not needed to take any action to fix this issue on any cluster, yet is paged 24/7 for these alerts because the severity is "critical".


Version-Release number of selected component (if applicable):
4.1.x


How reproducible:
Resolution of the alert is very consistent; I can show data from specific clusters where the alert resolves within 10 minutes of triggering.


Steps to Reproduce:
1. Unknown.


Actual results:
The alert is raised in Prometheus and pages SRE via the PagerDuty integration in Alertmanager. The alert then automatically resolves itself within 10 minutes.

Expected results:
No critical alert is raised unless it is truly a critical problem that requires human intervention.


Additional info:
The attached screenshot is a 1-hour slice from a production OSD v4.1.9 cluster showing a period when this alert triggered and automatically resolved.
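
For reference, the firing windows can be inspected directly against the in-cluster Prometheus with queries along these lines (illustrative sketch only; nothing here beyond the alertname is taken from this bug):

# Raw firing windows for the alert; each contiguous run of samples is one page.
ALERTS{alertname="KubeAPILatencyHigh", alertstate="firing"}

# Rough firing duration per hour, counted in rule-evaluation samples.
count_over_time(ALERTS{alertname="KubeAPILatencyHigh", alertstate="firing"}[1h])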

Comment 1 Naveen Malik 2019-08-21 00:18:17 UTC
Created attachment 1606310 [details]
ALERTS{alertstate="firing"}

Alerts firing at the time of high API latency, same range and end time.

Comment 2 Stefan Schimanski 2019-08-21 08:21:35 UTC
Do you see warnings about long-running etcd requests?

Comment 3 Stefan Schimanski 2019-08-21 08:22:25 UTC
Can you provide us with kube-apiserver logs from the time the alert fires and the minutes before?

Comment 5 Rick Rackow 2019-09-02 16:22:46 UTC
Created attachment 1610835 [details]
etcd log

This etcd member was running on the master node that seemed to have issues.

Comment 8 Frederic Branczyk 2019-10-18 09:04:17 UTC
I think there are two things to do here:

1) Make the alert less flaky with a multi-window burn-rate approach, so that it only fires when the error budget is being consumed fast enough that we would miss our SLA.
2) This seems to be an actual problem: during these periods a lot of requests take a very long time, so it is a real issue that the apiserver team needs to investigate.

The alert fix is probably something the monitoring team can take care of; nevertheless, this alert does correctly surface an issue, so there does seem to be an underlying problem in the apiserver (and/or other components). A rough sketch of the multi-window burn-rate shape is below.
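
A minimal sketch of such a multi-window burn-rate condition, assuming an illustrative 99.9% "requests faster than 1s" objective and the apiserver_request_duration_seconds histogram (the 4.x recording rule quoted in this bug is still based on the older apiserver_request_latencies metric). The 5m/1h window pair and the 14.4x burn factor follow the common SRE multi-window pattern and are not taken from this bug:

# Short window: is more than 14.4x the budgeted 0.1% of requests slower than 1s right now?
(
    sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb!~"LIST|WATCH|WATCHLIST|PROXY|CONNECT"}[5m]))
  -
    sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="1",verb!~"LIST|WATCH|WATCHLIST|PROXY|CONNECT"}[5m]))
)
  /
  sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb!~"LIST|WATCH|WATCHLIST|PROXY|CONNECT"}[5m]))
> (14.4 * 0.001)
and
# Long window: has the same burn been sustained, so that short blips alone do not page?
(
    sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb!~"LIST|WATCH|WATCHLIST|PROXY|CONNECT"}[1h]))
  -
    sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="1",verb!~"LIST|WATCH|WATCHLIST|PROXY|CONNECT"}[1h]))
)
  /
  sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb!~"LIST|WATCH|WATCHLIST|PROXY|CONNECT"}[1h]))
> (14.4 * 0.001)

In practice this would be paired with additional window pairs and severities rather than a single pair, so that slow sustained burn still warns without paging.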

Comment 17 Mitchell Rollinson 2019-11-20 00:17:59 UTC
Hi folks,

IHAC who has reported the KubeAPILatencyHigh alert flapping when taking etcd backups.

OCP 3.11

Please let me know if you require additional information, e.g. etcd dumps for each peer or kube-apiserver logs, in order to gauge general etcd performance.

Regards

Mitch

Comment 18 Pawel Krupa 2020-01-08 15:20:50 UTC
*** Bug 1771342 has been marked as a duplicate of this bug. ***

Comment 21 Rick Rackow 2020-05-20 11:06:02 UTC
This bug is still valid and we're seeing it in 4.3.18.

Can you let us know what information we can provide to help troubleshoot this?

Comment 22 Sam Batschelet 2020-05-20 12:03:25 UTC
This is not a bug against etcd; it is a manifestation of the overhead that the snapshotting action places on the server. To help us better understand the impact, screenshots of the following metrics over the alerted time range would be useful. We could consider adjusting the alert's tolerances to better accommodate the snapshotting overhead; another idea is to use a conditional silence of the alert during the backup schedule. Moving to 4.6.

histogram_quantile(0.99, irate(etcd_disk_backend_commit_duration_seconds_bucket[5m]))

histogram_quantile(0.99, irate(etcd_network_peer_round_trip_time_seconds_bucket[5m]))

histogram_quantile(0.99, irate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))
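
As a hedged sketch, each of those p99 queries can be restricted to the windows in which the alert was actually firing by joining against the ALERTS series, for example (the job="etcd" selector and instance grouping are assumptions and may differ per cluster):

# etcd backend commit p99 per member, but only at timestamps where
# KubeAPILatencyHigh was firing, so the overlap with backups is easy to see.
histogram_quantile(0.99,
  sum by (instance, le) (irate(etcd_disk_backend_commit_duration_seconds_bucket{job="etcd"}[5m]))
)
and on ()
ALERTS{alertname="KubeAPILatencyHigh", alertstate="firing"}

If the firing windows line up with the backup schedule, that supports the conditional-silence idea above, which would only need a matcher on this alertname for the duration of the backup.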

Comment 23 Dan Mace 2020-05-20 18:52:29 UTC
*** Bug 1821697 has been marked as a duplicate of this bug. ***

Comment 24 Sam Batschelet 2020-05-26 17:21:08 UTC
> This bug is still valid and we're seeing it in 4.3.18.

> Can you let us know what information we can provide to help troubleshoot this?

@Rick please see comment 22[1].

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1743911#c22

Comment 25 Sam Batschelet 2020-05-29 19:16:26 UTC

*** This bug has been marked as a duplicate of bug 1825001 ***