During an upgrade of a cluster in the CI build farm, we saw a sequence of alerts and failure messages from clusterversion.

oc --context build01 adm upgrade --allow-explicit-upgrade --to-image registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2020-04-13-190424 --force=true

The upgrade eventually completed successfully (which is nice), but those alerts and messages are frightening. I would like to create a bug for each of them and feel better about the next upgrade.

https://coreos.slack.com/archives/CHY2E1BL4/p1587060620464700

[FIRING:1] KubeAPILatencyHigh apiserver (apiserver https default openshift-monitoring/k8s pods namespace kubernetes warning POST v1)
The API server has an abnormal latency of 0.3257691935407682 seconds for POST pods.

must-gather after upgrade: http://file.rdu.redhat.com/~hongkliu/test_result/upgrade/upgrade0416/aaa/
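For context, a hedged way to see what ClusterVersion and the cluster operators were reporting while those alerts fired (standard oc commands; nothing cluster-specific beyond the context name already used above):

# ClusterVersion conditions and upgrade history
oc --context build01 get clusterversion version -o yaml
# Which operators were Progressing or Degraded at the time
oc --context build01 get clusteroperators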
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.
Moved to etcd to investigate why the write requests are so slow.
Performance assessment would involve analysis of the Prometheus metrics during the upgrade. Could you provide screenshots of the following queries over that time window?

histogram_quantile(0.99, irate(etcd_disk_backend_commit_duration_seconds_bucket[5m]))
histogram_quantile(0.99, irate(etcd_network_peer_round_trip_time_seconds_bucket[5m]))
histogram_quantile(0.99, irate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))
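If screenshots are awkward to capture, one possible way to pull the same series is the Prometheus HTTP API through a port-forward; the pod name, time window, and step below are illustrative assumptions, and the in-cluster Prometheus may additionally require a bearer token depending on how it is fronted:

# Assumed pod name and window; repeat the curl for each of the three queries above.
oc -n openshift-monitoring port-forward prometheus-k8s-0 9090 &
curl -sG 'http://localhost:9090/api/v1/query_range' \
  --data-urlencode 'query=histogram_quantile(0.99, irate(etcd_disk_backend_commit_duration_seconds_bucket[5m]))' \
  --data-urlencode 'start=2020-04-16T00:00:00Z' \
  --data-urlencode 'end=2020-04-16T12:00:00Z' \
  --data-urlencode 'step=60s'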
Our Prometheus stores data for 15 days, which is the default setting, so I do not have the data for that day. I will attach screenshots covering the data that still exists.
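For future upgrades, retention of the platform Prometheus can be raised via the cluster-monitoring-config ConfigMap; a sketch (the 30d value is only an example, and longer retention needs correspondingly more storage):

cat <<'EOF' | oc apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      retention: 30d
EOF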
Thanks for the screenshots.

> I do not have the data for that day.

Even given what you have, I see the backend commit p99 spike to 300ms; normal running is probably between 2-25ms. I know CI has fast disks, but in the case of, for example, depletion of burst credits, it is possible that you simply had too much I/O at that time. That would manifest as etcd being slow to respond to the API server, which is basically the root cause of KubeAPILatencyHigh. During an upgrade this could be compounded by heavy event traffic, etc. I am not sure what etcd or the apiserver can do about this: you need to triage the workloads causing the spike (a starting point is sketched below), and if the load can't be reduced you might need to look into faster disks. I will think about next steps to isolate the workload.
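As a starting point for that triage, two hedged queries (metric names as of roughly this release; they may differ on other versions) that can show which resources and verbs were driving write traffic, and which resource types dominate etcd:

topk(10, sum by (resource, verb) (rate(apiserver_request_total{job="apiserver", verb=~"POST|PUT|PATCH|DELETE"}[5m])))
topk(10, etcd_object_counts)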
We want to improve the alerting here, but we can't do it in 4.5. Moving to 4.6.
This bug hasn't had any activity in the 7 days since it was marked as LifecycleStale, so we are closing it as WONTFIX. If you consider this bug still valuable, please reopen it or create a new bug.
Hongkai beat me to reopening this, but didn't provide motivation. My motivation was going to be "why should you expect weekly bumps on a bug that has UpcomingSprint and which has been punted to 4.6?".
*** Bug 1743911 has been marked as a duplicate of this bug. ***
Closing as a dup of bug 1845446. As [1] points out, the intention for the alert was to only fire for latency over 1s, and 0.32... doesn't qualify. This is being ported back to 4.3.z with bug 1845443. [1]: https://github.com/kubernetes-monitoring/kubernetes-mixin/issues/434 *** This bug has been marked as a duplicate of bug 1845446 ***
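For reference, the intent described in [1] is an absolute threshold; a rough sketch of a query consistent with that intent (not the actual kubernetes-mixin expression, and the metric/label names are assumptions) would be:

histogram_quantile(0.99, sum by (le, verb, resource) (rate(apiserver_request_duration_seconds_bucket{job="apiserver", verb!~"WATCH|WATCHLIST|CONNECT|PROXY"}[5m]))) > 1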