1737660 – e2e-azure - etcd server overloaded

Summary: e2e-azure - etcd server overloaded

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Etcd
Sub Component:
Version:	4.2.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	4.2.0
Assignee:	Sam Batschelet
QA Contact:	ge liu
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2019-08-06 01:22 UTC by Kirsten Garrison
Modified:	2019-10-16 06:35 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-10-16 06:34:48 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift installer pull 2177	None	closed	WIP: Bug 1737660: pkg/asset/machines/worker: Bump Azure default to 256 GiB (P15) disks	2020-11-24 12:44:23 UTC
Github	openshift installer pull 2186	None	closed	Bug 1737660: data/azure/master: use ReadOnly caching for OSDisk	2020-11-24 12:43:58 UTC
Red Hat Product Errata	RHBA-2019:2922	None	None	None	2019-10-16 06:35:05 UTC

Description Kirsten Garrison 2019-08-06 01:22:52 UTC

Description of problem:

When running the new e2e-azure tests, I frequently see
"etcdserver: server is likely overloaded" messages in the etcd logs.

Version-Release number of selected component (if applicable):
Current masters in installer & release

How reproducible:
- Run e2e-azure in master.
- If the e2e test fails, check the etcd-member logs; you will often see a significant number (~30-140) of `etcdserver: server is likely overloaded` messages.

Actual results:
Many `etcdserver: server is likely overloaded` in logs

Expected results:
0 to a small handful (5-7) of `etcdserver: server is likely overloaded` in logs
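
A minimal sketch of how the message count in a downloaded etcd-member log could be checked against the expectation above. The function names and the threshold constant are hypothetical, and treating "a small handful (5-7)" as an upper bound of 7 is an assumption, not something defined by the test suite.

```python
# Marker text taken verbatim from the etcd log messages quoted in this report.
OVERLOAD_MARKER = "etcdserver: server is likely overloaded"

# Assumption: "a small handful (5-7)" is interpreted as at most 7 messages.
EXPECTED_MAX = 7


def count_overload_messages(lines):
    """Return the number of log lines that contain the overload marker."""
    return sum(1 for line in lines if OVERLOAD_MARKER in line)


def within_expected(count, limit=EXPECTED_MAX):
    """Return True if the count is at or below the assumed limit."""
    return count <= limit


def summarize_log(path):
    """Read a log file and return (count, ok) for the overload messages."""
    with open(path, encoding="utf-8", errors="replace") as log_file:
        count = count_overload_messages(log_file)
    return count, within_expected(count)
```

For example, after saving one of the etcd-member.log artifacts locally, `summarize_log("etcd-member.log")` returns the message count and whether it falls within the assumed limit.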

Additional info:
example failed runs: 
https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_release/4582/rehearse-4582-pull-ci-openshift-cluster-ingress-operator-master-e2e-azure/2/artifacts/e2e-azure/pods/openshift-etcd_etcd-member-ci-op-3zqg2wdp-cb8da-5kszg-master-0_etcd-member.log

https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_release/4582/rehearse-4582-pull-ci-openshift-cluster-ingress-operator-master-e2e-azure/3/artifacts/e2e-azure/pods/openshift-etcd_etcd-member-ci-op-9i9cd5vv-cb8da-cmm9v-master-0_etcd-member.log

For open PRs with e2e-azure runs see: 
https://github.com/openshift/release/pull/4582 
https://github.com/openshift/installer/pull/2123

Comment 5 Sam Batschelet 2019-08-08 20:24:56 UTC

After further investigation, the team found that setting the Azure OS disk cache to ReadOnly[1] considerably improved disk I/O.

[1] https://github.com/openshift/installer/pull/2186

Comment 7 ge liu 2019-09-03 07:04:30 UTC

Hi Sam, do you have any suggestions for how to verify this? Thanks in advance!

Comment 8 ge liu 2019-09-11 02:41:43 UTC

I can't open the e2e test URL, so I could not check how often the message is reported, but I assume it is fixed, since it has already passed many rounds of regression testing.

Comment 9 errata-xmlrpc 2019-10-16 06:34:48 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922
