Description of problem:
As of today we are seeing e2e-azure and azure-serial pass relatively consistently for 4.2.0-0.nightly runs. But a ~20% failure rate still leaves a lot to be desired for release. See the latest runs at: https://openshift-release.svc.ci.openshift.org/.
For failing runs, the situation generally boils down to two cases:
1. The cluster never initialized. This is generally due to initialization hitting the 30min timeout. This happens less often than the second case but is still a cause for concern (though not directly related to this BZ).
2. The cluster initialized, but some tests failed due to timeouts. If we take a look at failing tests at:
We can see that there's a cascading bar of red, indicating that on every run a different subset of tests fails. Those failures are generally timeouts.
About a month ago we did some triaging, which led us to discover that disk/network latency was very high, often causing etcd to become overloaded: https://bugzilla.redhat.com/show_bug.cgi?id=1737660
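For reference, the disk side of that triage can be reproduced without the full cluster: etcd issues a small write followed by a sync on every WAL commit, and its docs suggest that sync latency needs to stay in the low single-digit milliseconds for etcd to keep up. Below is a minimal, hypothetical Python sketch of that access pattern (the function name and parameters are mine, not from the bug):

```python
import os
import statistics
import tempfile
import time

def fsync_latencies(path, writes=50, block=2048):
    """Time small append+fsync cycles, roughly the pattern etcd's WAL
    performs on every commit. (etcd actually uses fdatasync; on Linux,
    os.fdatasync can be substituted to match it exactly.)"""
    lat = []
    with open(path, "wb") as f:
        for _ in range(writes):
            f.write(os.urandom(block))  # small sequential append, like a WAL entry
            f.flush()
            start = time.perf_counter()
            os.fsync(f.fileno())        # the sync is what dominates on slow disks
            lat.append(time.perf_counter() - start)
    return lat

with tempfile.NamedTemporaryFile() as tmp:
    lat = fsync_latencies(tmp.name)
    print(f"median: {statistics.median(lat) * 1e3:.2f} ms, "
          f"max: {max(lat) * 1e3:.2f} ms")
```

Run against the etcd data directory's disk on an affected node, consistently high numbers here would point at storage latency rather than etcd itself.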
I took a look at some recent failures and it seems that etcd "overloaded"/"took too long" messages are even more prevalent than before. See "Test job failure etcd logs" at the bottom of this description for examples.
I'm not 100% certain that this is the root cause of the flakiness of running e2e/serial on Azure, but it is certainly concerning that etcd is reporting so many overload warnings. I think this is a good place to start debugging overall platform performance.
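To quantify "more prevalent" across runs, a quick sketch like the following could tally the overload-related warnings in gathered etcd pod logs. The substrings are the ones I'd expect from etcd's log messages; treat them as assumptions and adjust to whatever the actual logs contain:

```python
from collections import Counter

# Warning substrings etcd emits when it falls behind on disk/network.
# Assumed patterns; extend after inspecting real failure logs.
PATTERNS = ("took too long", "server is likely overloaded", "lost leader")

def count_overload_warnings(log_text):
    """Count occurrences of each overload-related warning in an etcd log."""
    counts = Counter({p: 0 for p in PATTERNS})
    for line in log_text.splitlines():
        for p in PATTERNS:
            if p in line:
                counts[p] += 1
    return counts

sample = (
    "W | etcdserver: read-only range request ... took too long (1.2s) to execute\n"
    "W | etcdserver: server is likely overloaded\n"
    "I | raft: elected leader abc123 at term 4\n"
)
print(count_overload_warnings(sample))
```

Running this over the etcd logs from each failing CI job would make it easy to compare warning counts between passing and failing runs.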
Version-Release number of selected component (if applicable):
Test job failure etcd logs:
I think we can close this bug per the comments above; feel free to reopen it if there are further issues. Thanks.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.