Created attachment 1606309 [details]
cluster_quantile:apiserver_request_latencies:histogram_quantile{job="apiserver",quantile="0.99",subresource!="log",verb!~"^(?:LIST|WATCH|WATCHLIST|PROXY|CONNECT)$"} > 4

Description of problem:
In OSD v4 we have noticed the KubeAPILatencyHigh alert trigger and then typically resolve within 2 to 10 minutes of being raised. To date, the SRE team has had no actions specifically required to fix this issue on any cluster, yet gets paged 24/7 for these alerts because the severity is "critical".

Version-Release number of selected component (if applicable):
4.1.x

How reproducible:
Resolution of the alert is very consistent, and I can show data from specific clusters where resolution occurs within 10 minutes of the alert triggering.

Steps to Reproduce:
1. Unknown.

Actual results:
Alert is raised in Prometheus and triggers paging SRE via the PagerDuty integration in Alertmanager. The alert automatically resolves itself within 10 minutes.

Expected results:
No critical alert is raised unless it is truly a critical problem that requires human intervention.

Additional info:
Attached screenshot is a 1 hr slice from a production OSD v4.1.9 cluster showing a period when this alert triggered and automatically resolved.
Created attachment 1606310 [details]
ALERTS{alertstate="firing"}

Alerts firing at the time of high API latency, same range and end time.
Do you see warnings of long etcd requests?
Can you provide us with kube-apiserver logs from the time the alert fires and the minutes before?
Created attachment 1610835 [details]
etcd log

This etcd member was on the master node that seemed to have issues.
I think there are two things to do here:

1) Make the alert less flaky with a multi-window burn-rate approach, so it only fires when error budget is being consumed so fast that we will not meet our SLA (see the sketch below).

2) This seems to be an actual problem: during this time a lot of requests take a very long time, so it is a real issue that the apiserver team needs to investigate.

The alert fix is probably something the monitoring team can take care of; nevertheless, this alert does report an issue correctly, so there does seem to be an underlying problem in the apiserver (and/or other components).
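For illustration only, here is a minimal sketch of what a multi-window burn-rate rule could look like for apiserver latency. It assumes hypothetical recording rules (apiserver_request:burnrate1h and apiserver_request:burnrate5m) that return the fraction of requests slower than the latency SLO threshold over each window; the alert name, windows, thresholds, and the 99% SLO are assumptions, not the rule that actually shipped in any release:

# Sketch of a multi-window burn-rate alert (Prometheus rule file).
# Assumes recording rules apiserver_request:burnrate1h and
# apiserver_request:burnrate5m exist and return the fraction of requests
# slower than the SLO threshold over a 1h and a 5m window respectively.
groups:
- name: kube-apiserver-latency-burnrate-example
  rules:
  - alert: KubeAPILatencyBudgetBurn
    # For a 99% latency SLO the error budget is 1%; a burn-rate factor of
    # 14.4 pages when roughly 2% of a 30-day budget would be consumed in
    # one hour. Requiring both the long (1h) and short (5m) windows to
    # exceed the threshold keeps the alert from flapping on brief spikes.
    expr: |
      apiserver_request:burnrate1h > (14.4 * 0.01)
      and
      apiserver_request:burnrate5m > (14.4 * 0.01)
    for: 2m
    labels:
      severity: critical

The short window is what lets the alert resolve quickly once the spike passes, while the long window prevents a two-minute blip from paging anyone in the first place.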
Hi folks,

I have a customer (IHAC) who has reported the KubeAPILatencyHigh alert flapping when taking etcd backups on OCP 3.11.

Please let me know if you require additional information, i.e. etcd dumps for each peer or kube-apiserver logs, in order to gauge general etcd performance.

Regards,
Mitch
*** Bug 1771342 has been marked as a duplicate of this bug. ***
This bug is still valid and we're seeing it in 4.3.18. Can you let us know what information we can provide to help troubleshoot this?
This is not a bug in etcd; it is a manifestation of the overhead the snapshotting action puts on the server. To help us better understand the impact, screenshots of the following metrics over the alerted time range would be useful. We could consider adjusting the alert's tolerances to better accommodate the snapshotting overhead. Another idea is to use a conditional silence of the alert during the backup schedule. Moving to 4.6.

histogram_quantile(0.99, irate(etcd_disk_backend_commit_duration_seconds_bucket[5m]))
histogram_quantile(0.99, irate(etcd_network_peer_round_trip_time_seconds_bucket[5m]))
histogram_quantile(0.99, irate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))
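As a rough sketch of the conditional-silence idea (not a tested configuration), a newer Alertmanager (v0.22+) can mute a route during a defined time interval; the backup window times, the interval name, and the matcher below are illustrative assumptions:

# Illustrative Alertmanager snippet: mute KubeAPILatencyHigh during an
# assumed nightly etcd backup window (02:00-02:30 UTC is a placeholder).
route:
  routes:
  - matchers:
    - alertname = "KubeAPILatencyHigh"
    mute_time_intervals:
    - etcd-backup-window
    # receiver is inherited from the parent route in this sketch
time_intervals:
- name: etcd-backup-window
  time_intervals:
  - times:
    - start_time: "02:00"
      end_time: "02:30"

This only makes sense if the backup schedule is fixed and known to the cluster operator; otherwise adjusting the alert itself, as discussed above, is the more robust option.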
*** Bug 1821697 has been marked as a duplicate of this bug. ***
> This bug is still valid and we're seeing it in 4.3.18. > Can you let us know what information we can provide to help troubleshoot this? @Rick please see comment 22[1]. [1] https://bugzilla.redhat.com/show_bug.cgi?id=1743911#c22
*** This bug has been marked as a duplicate of bug 1825001 ***