Bug 1798785

Summary: etcd sometimes struggles with slow Azure disks with lots of: etcdserver: leader changed
Product: OpenShift Container Platform Reporter: W. Trevor King <wking>
Component: Etcd Assignee: Sam Batschelet <sbatsche>
Status: CLOSED DUPLICATE QA Contact: ge liu <geliu>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 4.2.0 CC: jerzhang, mfojtik
Target Milestone: ---   
Target Release: 4.4.0   
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2020-04-02 21:16:33 UTC Type: Bug

Description W. Trevor King 2020-02-06 01:07:33 UTC
For example, in 4.1.18 -> 4.3.1 CI [1]:

  $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade/39/build-log.txt | sort | uniq | grep -c 'etcdserver: leader changed'
  6
  $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade/39/build-log.txt | sort | uniq | grep -c 'etcdserver: request timed out'
  46

Slow Azure disks were discussed previously in [2].  Bug 1775878 covers the sorts of things that can go wrong when etcd cannot keep up.

Additional examples: 4.3 [3]:

  $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.3/891/build-log.txt | sort | uniq | grep -c 'etcdserver: leader changed'
  15

and 4.4 [4]:

  $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.4/702/build-log.txt | sort | uniq | grep -c 'etcdserver: leader changed'
  4
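
The counts above all use the same pipeline, so it can be wrapped in a small helper.  This is only a sketch: the name count_unique_matches is made up for illustration and is not part of any OpenShift tooling.  It reads the log from stdin, so it works with curl or with a local copy of build-log.txt, and it uses 'sort -u' in place of 'sort | uniq' and 'grep -F' so the message is matched as a literal string.

```shell
# Hypothetical helper (name invented for this sketch): count the distinct
# log lines on stdin that contain the literal string given as "$1".
count_unique_matches() {
  # sort -u collapses duplicate lines (same as 'sort | uniq');
  # grep -F matches a fixed string, -c prints the number of matching lines.
  sort -u | grep -cF -- "$1"
}

# Example usage (needs network access to the CI log bucket):
#   curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade/39/build-log.txt |
#     count_unique_matches 'etcdserver: leader changed'
```

Note that grep still prints 0 when nothing matches, but exits non-zero, which matters if the helper is called under 'set -e'.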

Query to search for these:

$ curl -s 'https://search.svc.ci.openshift.org/search?search=etcdserver%3A+leader+changed&maxAge=336h&context=0&type=build-log' | jq -r '[. | to_entries[] | .hits = ([.value["etcdserver: leader changed"][].context[]] | unique | length)] |sort_by(.hits)[] | (.hits | tostring) + " " + .key' | tail
5 https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/16355
5 https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_installer/3036/pull-ci-openshift-installer-release-4.3-e2e-azure/113
6 https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade/39
6 https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/24405/pull-ci-openshift-origin-release-4.3-e2e-azure/196
8 https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/16457
10 https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-compact-4.3/105
10 https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-kube-apiserver-operator/746/pull-ci-openshift-cluster-kube-apiserver-operator-master-e2e-aws-upgrade/1245
10 https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_installer/2433/pull-ci-openshift-installer-master-e2e-aws-upgrade/4676
12 https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-ovn-4.3/385
13 https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.3/891

You can see that this is not unique to Azure, but Azure is over-represented.
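
To put rough numbers on that, the "<hits> <job URL>" lines printed by the query can be tallied per platform.  The following is a sketch under stated assumptions: tally_by_platform is a made-up name, and it classifies a job only by whether its URL contains "azure" or "aws", which is a guess at the job naming convention and not an official mapping.

```shell
# Hypothetical helper (name invented for this sketch): read "<hits> <url>"
# lines on stdin and print "<platform> <jobs> <total hits>" per platform.
tally_by_platform() {
  awk '
    {
      # Assumption: the platform can be inferred from the job URL.
      platform = "other"
      if ($2 ~ /azure/) platform = "azure"
      else if ($2 ~ /aws/) platform = "aws"
      jobs[platform]++
      hits[platform] += $1
    }
    END {
      for (p in jobs) print p, jobs[p], hits[p]
    }
  ' | sort   # awk iterates arrays in unspecified order, so sort the result
}
```

Fed the ten lines above, this reports 6 Azure jobs with 52 hits and 4 AWS jobs with 33 hits.  That only covers the tail of the list and does not account for how many jobs run on each platform, so treat it as a rough indication and not a rate.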

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade/39#1:build-log.txt%3A1148
[2]: https://github.com/openshift/installer/pull/2186
[3]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.3/891
[4]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.4/702

Comment 2 W. Trevor King 2020-02-06 21:27:48 UTC
Also in this space is work to alert cluster admins when their hardware underperforms: bug 1793183.  That would help with diagnosing problems like this, but would obviously not magically make Azure's disks faster, so it is distinct from this bug ;).

Comment 4 Sam Batschelet 2020-04-02 21:16:33 UTC

*** This bug has been marked as a duplicate of bug 1806700 ***