Bug 2047828 - Major increase in kube-apiserver latency between 4.9 and 4.10 during azure upgrade CI jobs
Summary: Major increase in kube-apiserver latency between 4.9 and 4.10 during azure upgrade CI jobs
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Damien Grisonnet
QA Contact: Ke Wang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-01-28 16:17 UTC by Damien Grisonnet
Modified: 2023-01-16 11:33 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-01-16 11:33:33 UTC
Target Upstream Version:
Embargoed:



Description Damien Grisonnet 2022-01-28 16:17:44 UTC
Description of problem:

During the 4.10 development cycle, we started gathering more information from CI jobs in order to detect performance regressions between versions. After looking at the graphs for the `periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-upgrade` job, we noticed a major performance regression, namely:

- 3x kube-apiserver read/write requests average latency compared to 4.9 [1][2]
- 2x etcd read/write requests average latency compared to 4.9 [3][4]

It is worth noting that the number of requests made to the kube-apiserver over the period during which we gathered latency data has not changed [5], and that this latency regression only occurs during upgrade jobs [6].

Also, after looking at the metrics from one of the jobs from the 8th of January [7], I was able to confirm that the latency regression was already present pre-rebase.

[1] https://search.ci.openshift.org/graph/metrics?metric=cluster%3Aapi%3Aread%3Arequests%3Alatency%3Atotal%3Aavg&job=periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-upgrade&job=periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-azure-upgrade
[2] https://search.ci.openshift.org/graph/metrics?metric=cluster%3Aapi%3Awrite%3Arequests%3Alatency%3Atotal%3Aavg&job=periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-upgrade&job=periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-azure-upgrade
[3] https://search.ci.openshift.org/graph/metrics?metric=cluster%3Aetcd%3Aread%3Arequests%3Alatency%3Atotal%3Aavg&job=periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-upgrade&job=periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-azure-upgrade
[4] https://search.ci.openshift.org/graph/metrics?metric=cluster%3Aetcd%3Awrite%3Arequests%3Alatency%3Atotal%3Aavg&job=periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-upgrade&job=periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-azure-upgrade
[5] https://search.ci.openshift.org/graph/metrics?metric=cluster%3Aapi%3Atotal%3Arequests&job=periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-upgrade&job=periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-azure-upgrade
[6] https://search.ci.openshift.org/graph/metrics?metric=cluster%3Aapi%3Aread%3Arequests%3Alatency%3Atotal%3Aavg&job=periodic-ci-openshift-release-master-ci-4.10-e2e-aws&job=periodic-ci-openshift-release-master-ci-4.9-e2e-azure
[7] https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-upgrade/1479777578087616512

How to reproduce:

Look at the search.ci links shared above, or execute the following PromQL queries against the cluster's Prometheus:

- kube-apiserver read latency avg: sum(rate(apiserver_request_duration_seconds_sum{job="apiserver",scope!="",verb=~"GET|LIST"}[${d_all}])) by (le,scope) / sum(rate(apiserver_request_duration_seconds_count{job="apiserver",scope!="",verb=~"GET|LIST"}[${d_all}])) by (le,scope)
- kube-apiserver write latency avg: sum(rate(apiserver_request_duration_seconds_sum{job="apiserver",scope!="",verb=~"POST|PUT|PATCH|DELETE"}[${d_all}])) by (le,scope) / sum(rate(apiserver_request_duration_seconds_count{job="apiserver",scope!="",verb=~"POST|PUT|PATCH|DELETE"}[${d_all}])) by (le,scope)
- etcd read latency avg: sum(rate(etcd_request_duration_seconds_sum{operation=~"get|list|listWithCount"}[${d_all}])) by (le,scope) / sum(rate(etcd_request_duration_seconds_count{operation=~"get|list|listWithCount"}[${d_all}])) by (le,scope)
- etcd write latency avg: sum(rate(etcd_request_duration_seconds_sum{operation=~"create|update|delete"}[${d_all}])) by (le,scope) / sum(rate(etcd_request_duration_seconds_count{operation=~"create|update|delete"}[${d_all}])) by (le,scope)
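
For convenience, here is a minimal sketch (not part of the original report) of running the first of these queries against a cluster's Prometheus/Thanos querier over the standard /api/v1/query HTTP API. The endpoint URL, the bearer token, and the 1h window substituted for ${d_all} are assumptions; adjust them to your environment.

# Hypothetical sketch: evaluate the kube-apiserver read latency average
# from this report against a cluster's Thanos querier.
import requests

PROM_URL = "https://thanos-querier-openshift-monitoring.apps.example.com"  # assumption
TOKEN = "<serviceaccount token with cluster-monitoring-view>"  # assumption

# Query copied from the report, with ${d_all} replaced by an assumed 1h window.
QUERY = (
    'sum(rate(apiserver_request_duration_seconds_sum'
    '{job="apiserver",scope!="",verb=~"GET|LIST"}[1h])) by (le,scope)'
    ' / '
    'sum(rate(apiserver_request_duration_seconds_count'
    '{job="apiserver",scope!="",verb=~"GET|LIST"}[1h])) by (le,scope)'
)

resp = requests.get(
    f"{PROM_URL}/api/v1/query",
    params={"query": QUERY},
    headers={"Authorization": f"Bearer {TOKEN}"},
    verify=False,  # assumption: self-signed certs on a CI cluster
)
resp.raise_for_status()

# Print the average read latency per scope (resource/namespace/cluster).
for result in resp.json()["data"]["result"]:
    scope = result["metric"].get("scope", "none")
    avg_latency = float(result["value"][1])
    print(f"scope={scope} avg read latency={avg_latency:.3f}s")

The same script can be pointed at the other three queries by swapping the verb/operation matchers shown above; comparing the results between a 4.9 and a 4.10 upgrade run should reproduce the 2-3x gap described in this report.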

Comment 2 Michal Fojtik 2023-01-16 11:33:33 UTC
Dear reporter, we greatly appreciate the bug you have reported here. Unfortunately, due to the migration to a new issue-tracking system (https://issues.redhat.com/), we cannot continue triaging bugs reported in Bugzilla. Since this bug has been stale for multiple days, we have therefore decided to close it.

If you think this is a mistake, or this bug should have a higher priority or severity than is set today, please feel free to reopen it and tell us why. We are going to move every re-opened bug to https://issues.redhat.com.

Thank you for your patience and understanding.

