Bug 1798785 - etcd sometimes struggles with slow Azure disks with lots of: etcdserver: leader changed
Keywords:
Status: CLOSED DUPLICATE of bug 1806700
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 4.4.0
Assignee: Sam Batschelet
QA Contact: ge liu
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-02-06 01:07 UTC by W. Trevor King
Modified: 2020-04-02 21:16 UTC (History)

Fixed In Version:
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-04-02 21:16:33 UTC
Target Upstream Version:
Embargoed:


Description W. Trevor King 2020-02-06 01:07:33 UTC
For example, in 4.1.18 -> 4.3.1 CI [1]:

  $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade/39/build-log.txt | sort | uniq | grep -c 'etcdserver: leader changed'
  6
  $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade/39/build-log.txt | sort | uniq | grep -c 'etcdserver: request timed out'
  46

Slow Azure disks were previously discussed in [2].  Bug 1775878 covers the sorts of things that can go wrong when etcd cannot keep up.

Additional examples: 4.3 [3]:

  $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.3/891/build-log.txt | sort | uniq | grep -c 'etcdserver: leader changed'
  15

and 4.4 [4]:

  $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.4/702/build-log.txt | sort | uniq | grep -c 'etcdserver: leader changed'
  4
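The counts above all use the same sort | uniq | grep -c pipeline. A minimal sketch of a reusable helper follows; the function name count_unique_matches is hypothetical and not part of any OpenShift tooling, and it assumes the build log arrives on stdin:

```shell
# Hypothetical helper (an illustration, not existing tooling):
# count the unique lines on stdin that contain the fixed string "$1".
count_unique_matches() {
  # sort -u collapses repeated lines, like `sort | uniq` above;
  # grep -F treats the pattern as a literal string, -c prints the match count.
  sort -u | grep -cF -- "$1"
}

# Example use against one of the build logs cited above:
#   curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade/39/build-log.txt \
#     | count_unique_matches 'etcdserver: leader changed'
```

Note that grep exits non-zero when nothing matches, although it still prints 0, so callers running under `set -e` may want to append `|| true`.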

Query to search for these:

  $ curl -s 'https://search.svc.ci.openshift.org/search?search=etcdserver%3A+leader+changed&maxAge=336h&context=0&type=build-log' | jq -r '[. | to_entries[] | .hits = ([.value["etcdserver: leader changed"][].context[]] | unique | length)] |sort_by(.hits)[] | (.hits | tostring) + " " + .key' | tail
  5 https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/16355
  5 https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_installer/3036/pull-ci-openshift-installer-release-4.3-e2e-azure/113
  6 https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade/39
  6 https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/24405/pull-ci-openshift-origin-release-4.3-e2e-azure/196
  8 https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/16457
  10 https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-compact-4.3/105
  10 https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-kube-apiserver-operator/746/pull-ci-openshift-cluster-kube-apiserver-operator-master-e2e-aws-upgrade/1245
  10 https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_installer/2433/pull-ci-openshift-installer-master-e2e-aws-upgrade/4676
  12 https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-ovn-4.3/385
  13 https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.3/891

This is not unique to Azure (four of the ten jobs listed are AWS), but Azure is over-represented (six of ten).
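To make the platform split easier to read off, the "<hits> <job URL>" lines from the search query can be grouped by platform. The sketch below is only an illustration: the function name tally_by_platform is hypothetical, and it assumes the platform can be inferred from "azure" or "aws" appearing in the job URL, which holds for the jobs listed here but is not guaranteed in general.

```shell
# Hypothetical helper (an illustration, not existing tooling):
# read "<hits> <job URL>" lines on stdin and print, per platform,
# "<platform> <number of jobs> <total hits>".
tally_by_platform() {
  awk '
    {
      # Assumption: the platform name appears in the job URL.
      platform = "other"
      if ($2 ~ /azure/) platform = "azure"
      else if ($2 ~ /aws/) platform = "aws"
      jobs[platform]++
      hits[platform] += $1
    }
    END {
      for (p in jobs) print p, jobs[p], hits[p]
    }
  ' | sort   # awk iteration order is unspecified, so sort for stable output
}
```

Fed the ten lines listed above, this should print "aws 4 33" and "azure 6 52". Because the input is only the `tail` of the sorted results, the tally reflects the ten worst jobs rather than the whole search window.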

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade/39#1:build-log.txt%3A1148
[2]: https://github.com/openshift/installer/pull/2186
[3]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.3/891
[4]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.4/702

Comment 2 W. Trevor King 2020-02-06 21:27:48 UTC
Also in this space is work to alert cluster admins when their hardware underperforms: bug 1793183.  That would help with diagnosing problems like this, but would obviously not magically make Azure's disks faster, so distinct from this bug ;).

Comment 4 Sam Batschelet 2020-04-02 21:16:33 UTC

*** This bug has been marked as a duplicate of bug 1806700 ***

