Bug 1825001

Summary: [4.3 upgrade][alert] KubeAPILatencyHigh: The API server has an abnormal latency of 0.3257691935407682 seconds for POST pods.
Product: OpenShift Container Platform
Reporter: Hongkai Liu <hongkliu>
Component: Etcd
Assignee: Sam Batschelet <sbatsche>
Status: CLOSED DUPLICATE
QA Contact: ge liu <geliu>
Severity: high
Priority: low
Version: 4.3.0
CC: aos-bugs, ccoleman, dmace, mfojtik, nmalik, skolicha, wking
Keywords: Reopened, Upgrades
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2020-06-15 21:08:42 UTC
Type: Bug
Bug Blocks: 1771342

Description Hongkai Liu 2020-04-16 19:55:56 UTC
During an upgrade of a cluster in the CI build farm, we saw a sequence of alerts and failure messages from clusterversion.

oc --context build01 adm upgrade --allow-explicit-upgrade --to-image registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2020-04-13-190424 --force=true

Eventually the upgrade completed successfully (which is nice), but those alerts and messages are alarming.

I would like to file a bug for each of them so we can feel better going into the next upgrade.

https://coreos.slack.com/archives/CHY2E1BL4/p1587060620464700
[FIRING:1] KubeAPILatencyHigh apiserver (apiserver https default openshift-monitoring/k8s pods namespace kubernetes warning POST v1)
The API server has an abnormal latency of 0.3257691935407682 seconds for POST pods.

must-gather after upgrade:
http://file.rdu.redhat.com/~hongkliu/test_result/upgrade/upgrade0416/aaa/

Comment 1 Michal Fojtik 2020-05-18 07:29:37 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason, or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it; otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.

Comment 2 Stefan Schimanski 2020-05-19 11:29:07 UTC
Moved to etcd to investigate why the write requests are so slow.

Comment 3 Sam Batschelet 2020-05-19 12:26:42 UTC
A performance assessment would involve analyzing Prometheus metrics over the time series covering the upgrade. Could you provide screenshots of the following queries over that window?

histogram_quantile(0.99, irate(etcd_disk_backend_commit_duration_seconds_bucket[5m]))

histogram_quantile(0.99, irate(etcd_network_peer_round_trip_time_seconds_bucket[5m]))

histogram_quantile(0.99, irate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))
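
(For reference, one way to run these queries without console access is against the Prometheus HTTP API through the openshift-monitoring route. This is only a sketch: the prometheus-k8s route name and bearer-token access through the oauth proxy are assumptions about a typical 4.x monitoring stack, so adjust for your cluster.)

# Assumes the prometheus-k8s route exists in openshift-monitoring and that
# your token is accepted by the oauth proxy in front of Prometheus.
PROM_HOST=$(oc -n openshift-monitoring get route prometheus-k8s -o jsonpath='{.spec.host}')
curl -skG -H "Authorization: Bearer $(oc whoami -t)" \
  "https://${PROM_HOST}/api/v1/query" \
  --data-urlencode 'query=histogram_quantile(0.99, irate(etcd_disk_backend_commit_duration_seconds_bucket[5m]))'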

Comment 4 Hongkai Liu 2020-05-19 21:00:22 UTC
Our Prometheus stores data for 15 days (I will attach screenshots of the data that is still available).
That is the default setting, so I no longer have the data from that day.
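
The 15-day window is the stock retention of the platform Prometheus. If longer history is wanted before the next upgrade, retention can be raised through the cluster monitoring ConfigMap; a minimal sketch, assuming the standard openshift-monitoring configuration keys (the 30d value is only an example):

# Sketch: raise Prometheus retention via the cluster monitoring ConfigMap.
# Verify the keys against the monitoring docs for your exact release; higher
# retention also needs more storage for the Prometheus PVs.
cat <<'EOF' | oc apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      retention: 30d
EOF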

Comment 7 Sam Batschelet 2020-05-19 22:02:05 UTC
Thanks for the screenshots.

> I do not have the data for that day.

But even in the data you do have, I see the backend commit p99 spike to 300ms; normal running is probably between 2-25ms. I know CI has fast disks, but in a case such as depletion of burst credits, it is possible that you simply had too much I/O at that time. That manifests as etcd being slow to respond to the API server, which is essentially the root cause of KubeAPILatencyHigh. During an upgrade this can be compounded by heavy event traffic, etc. I am not sure what etcd or the apiserver can do about this: you need to triage the workloads causing the spike, and if the load can't be reduced you may need to look into faster disks.

I will think about next steps to isolate the workload.
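
As a starting point for that triage, queries along these lines can show which resources and verbs were driving writes, and how the apiserver-side etcd calls behaved, during the window (a sketch in the same spirit as the queries above; metric and label names can vary between releases, so treat them as assumptions to verify):

topk(10, sum by (resource, verb) (rate(apiserver_request_total{verb=~"POST|PUT|PATCH|DELETE"}[5m])))

histogram_quantile(0.99, sum by (le, operation, type) (rate(etcd_request_duration_seconds_bucket[5m])))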

Comment 9 Dan Mace 2020-05-20 18:33:23 UTC
We want to improve the alerting here, but we can't do it in 4.5. Moving to 4.6.

Comment 10 Michal Fojtik 2020-05-27 00:02:14 UTC
This bug hasn't had any activity in the 7 days since it was marked as LifecycleStale, so we are closing it as WONTFIX. If you consider this bug still valuable, please reopen it or create a new bug.

Comment 11 W. Trevor King 2020-05-27 03:01:35 UTC
Hongkai beat me to reopening this, but didn't provide motivation.  My motivation was going to be "why should you expect weekly bumps on a bug that has UpcomingSprint and which has been punted to 4.6?".

Comment 12 Sam Batschelet 2020-05-29 19:16:26 UTC
*** Bug 1743911 has been marked as a duplicate of this bug. ***

Comment 13 W. Trevor King 2020-06-15 21:08:42 UTC
Closing as a duplicate of bug 1845446.  As [1] points out, the alert was only intended to fire for latency over 1s, and 0.32... doesn't qualify.  The fix is being ported back to 4.3.z with bug 1845443.

[1]: https://github.com/kubernetes-monitoring/kubernetes-mixin/issues/434
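
(For illustration only: the intended behavior described above amounts to a simple floor on the p99 latency before the alert fires. A sketch in the mixin's recording-rule style, not the exact upstream expression:)

cluster_quantile:apiserver_request_duration_seconds:histogram_quantile{job="apiserver", quantile="0.99", verb!~"LIST|WATCH|WATCHLIST|PROXY|CONNECT"} > 1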

*** This bug has been marked as a duplicate of bug 1845446 ***