Bug 1737660 - e2e-azure - etcd server overloaded
Summary: e2e-azure - etcd server overloaded
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.2.0
Assignee: Sam Batschelet
QA Contact: ge liu
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-08-06 01:22 UTC by Kirsten Garrison
Modified: 2019-10-16 06:35 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-16 06:34:48 UTC
Target Upstream Version:




Links
System ID Priority Status Summary Last Updated
Github openshift installer pull 2177 None closed WIP: Bug 1737660: pkg/asset/machines/worker: Bump Azure default to 256 GiB (P15) disks 2020-08-19 20:32:09 UTC
Github openshift installer pull 2186 None closed Bug 1737660: data/azure/master: use ReadOnly caching for OSDisk 2020-08-19 20:32:08 UTC
Red Hat Product Errata RHBA-2019:2922 None None None 2019-10-16 06:35:05 UTC

Description Kirsten Garrison 2019-08-06 01:22:52 UTC
Description of problem:

When running the new e2e-azure tests, I frequently see "etcdserver: server is likely overloaded" messages in the logs.

Version-Release number of selected component (if applicable):
Current masters in installer & release

How reproducible:
- Run e2e-azure against master.
- If an e2e test fails, check the etcd-member.log files; you will often see a significant number (~30-140) of `etcdserver: server is likely overloaded` messages.

Actual results:
Many `etcdserver: server is likely overloaded` in logs

Expected results:
0 to a small handful (5-7) of `etcdserver: server is likely overloaded` in logs
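
The comparison between actual and expected results could be checked with a small script. This is only a minimal sketch, assuming the etcd-member.log has already been downloaded and read into a list of lines; the function names and the default threshold of 7 (the upper end of the expected range above) are illustrative and not part of any CI tooling.

```python
# Marker string as it appears in the etcd-member logs quoted in this report.
OVERLOAD_MARKER = "etcdserver: server is likely overloaded"


def count_overload_warnings(lines):
    """Count log lines that contain the overload warning."""
    return sum(1 for line in lines if OVERLOAD_MARKER in line)


def is_excessive(count, threshold=7):
    """Return True when the count exceeds the 'small handful' expected (assumed limit: 7)."""
    return count > threshold


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python count_overload.py etcd-member.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log_file:
        total = count_overload_warnings(log_file)
    print(f"{total} overload warnings; excessive: {is_excessive(total)}")
```

A run that reports a count in the ~30-140 range would match the failing behavior described here, while a count at or below the threshold would match the expected results.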

Additional info:
example failed runs: 
https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_release/4582/rehearse-4582-pull-ci-openshift-cluster-ingress-operator-master-e2e-azure/2/artifacts/e2e-azure/pods/openshift-etcd_etcd-member-ci-op-3zqg2wdp-cb8da-5kszg-master-0_etcd-member.log

https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_release/4582/rehearse-4582-pull-ci-openshift-cluster-ingress-operator-master-e2e-azure/3/artifacts/e2e-azure/pods/openshift-etcd_etcd-member-ci-op-9i9cd5vv-cb8da-cmm9v-master-0_etcd-member.log

For open PRs with e2e-azure runs see: 
https://github.com/openshift/release/pull/4582 
https://github.com/openshift/installer/pull/2123

Comment 5 Sam Batschelet 2019-08-08 20:24:56 UTC
After further investigation by the team, we found that setting ReadOnly caching[1] on the Azure OS disk considerably improved disk I/O.

[1] https://github.com/openshift/installer/pull/2186

Comment 7 ge liu 2019-09-03 07:04:30 UTC
Hi Sam, do you have any suggestions for how to verify this? Thanks in advance!

Comment 8 ge liu 2019-09-11 02:41:43 UTC
I can't open the e2e test URL, so I could not check whether the message is still being reported, but I assume it is fixed, since it has already passed many rounds of regression testing.

Comment 9 errata-xmlrpc 2019-10-16 06:34:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922
