Description of problem:

During the 4.10 development cycle, we started gathering more information from the CI jobs in order to detect performance regressions between versions. After looking at the graphs for the `periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-upgrade` job, we noticed a major performance regression, namely:

- 3x kube-apiserver read/write request average latency compared to 4.9 [1][2]
- 2x etcd read/write request average latency compared to 4.9 [3][4]

It is worth noting that the number of requests made to the kube-apiserver during the period where we started gathering latency data hasn't changed [5], and that this latency regression only occurs during upgrade jobs [6]. Also, after looking at the metrics from one of the jobs from the 8th of January [7], I was able to confirm that the latency regression was already present pre-rebase.

[1] https://search.ci.openshift.org/graph/metrics?metric=cluster%3Aapi%3Aread%3Arequests%3Alatency%3Atotal%3Aavg&job=periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-upgrade&job=periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-azure-upgrade
[2] https://search.ci.openshift.org/graph/metrics?metric=cluster%3Aapi%3Awrite%3Arequests%3Alatency%3Atotal%3Aavg&job=periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-upgrade&job=periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-azure-upgrade
[3] https://search.ci.openshift.org/graph/metrics?metric=cluster%3Aetcd%3Aread%3Arequests%3Alatency%3Atotal%3Aavg&job=periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-upgrade&job=periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-azure-upgrade
[4] https://search.ci.openshift.org/graph/metrics?metric=cluster%3Aetcd%3Awrite%3Arequests%3Alatency%3Atotal%3Aavg&job=periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-upgrade&job=periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-azure-upgrade
[5] https://search.ci.openshift.org/graph/metrics?metric=cluster%3Aapi%3Atotal%3Arequests&job=periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-upgrade&job=periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-azure-upgrade
[6] https://search.ci.openshift.org/graph/metrics?metric=cluster%3Aapi%3Aread%3Arequests%3Alatency%3Atotal%3Aavg&job=periodic-ci-openshift-release-master-ci-4.10-e2e-aws&job=periodic-ci-openshift-release-master-ci-4.9-e2e-azure
[7] https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-upgrade/1479777578087616512

How to reproduce:

Look at the search.ci links shared above, or execute the following PromQL queries (a sketch of running one of them against Prometheus follows the list):

- kube-apiserver read latency avg:
  sum(rate(apiserver_request_duration_seconds_sum{job="apiserver",scope!="",verb=~"GET|LIST"}[${d_all}])) by (le,scope) / sum(rate(apiserver_request_duration_seconds_count{job="apiserver",scope!="",verb=~"GET|LIST"}[${d_all}])) by (le,scope)

- kube-apiserver write latency avg:
  sum(rate(apiserver_request_duration_seconds_sum{job="apiserver",scope!="",verb=~"POST|PUT|PATCH|DELETE"}[${d_all}])) by (le,scope) / sum(rate(apiserver_request_duration_seconds_count{job="apiserver",scope!="",verb=~"POST|PUT|PATCH|DELETE"}[${d_all}])) by (le,scope)

- etcd read latency avg:
  sum(rate(etcd_request_duration_seconds_sum{operation=~"get|list|listWithCount"}[${d_all}])) by (le,scope) / sum(rate(etcd_request_duration_seconds_count{operation=~"get|list|listWithCount"}[${d_all}])) by (le,scope)

- etcd write latency avg:
  sum(rate(etcd_request_duration_seconds_sum{operation=~"create|update|delete"}[${d_all}])) by (le,scope) / sum(rate(etcd_request_duration_seconds_count{operation=~"create|update|delete"}[${d_all}])) by (le,scope)
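For reference, here is a minimal sketch of evaluating the first query above against a cluster's Prometheus via its HTTP API. The PROM_URL/PROM_TOKEN environment variables and the fixed 1h window substituted for ${d_all} are assumptions for illustration, not part of the original report; point it at the in-cluster Prometheus route (or a Prometheus loaded with the job's metrics) and adjust as needed.

#!/usr/bin/env python3
# Sketch only: run the kube-apiserver read-latency-avg query via the
# Prometheus instant-query API and print one value per scope.
import os
import requests

# Assumed environment; replace with the actual Prometheus route and a token
# (e.g. `oc whoami -t` on a live cluster).
PROM_URL = os.environ.get("PROM_URL", "https://prometheus-k8s-openshift-monitoring.example.com")
TOKEN = os.environ.get("PROM_TOKEN", "")

# Same expression as in the report, with ${d_all} replaced by a fixed 1h window.
QUERY = (
    'sum(rate(apiserver_request_duration_seconds_sum{job="apiserver",scope!="",verb=~"GET|LIST"}[1h])) by (le,scope)'
    ' / '
    'sum(rate(apiserver_request_duration_seconds_count{job="apiserver",scope!="",verb=~"GET|LIST"}[1h])) by (le,scope)'
)

resp = requests.get(
    f"{PROM_URL}/api/v1/query",
    params={"query": QUERY},
    headers={"Authorization": f"Bearer {TOKEN}"} if TOKEN else {},
    verify=False,  # CI clusters typically serve self-signed certificates
)
resp.raise_for_status()

for result in resp.json()["data"]["result"]:
    scope = result["metric"].get("scope", "<none>")
    value = float(result["value"][1])  # instant vector: [timestamp, value]
    print(f"scope={scope}: avg read latency {value:.3f}s")

The other three queries can be run the same way by swapping in the corresponding expression.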
Dear reporter, we greatly appreciate the bug you have reported here. Unfortunately, due to the migration to a new issue-tracking system (https://issues.redhat.com/), we cannot continue triaging bugs reported in Bugzilla. Since this bug has been stale for multiple days, we have decided to close it. If you think this is a mistake, or this bug deserves a higher priority or severity than currently set, please feel free to reopen it and tell us why. We are going to move every re-opened bug to https://issues.redhat.com. Thank you for your patience and understanding.