Bug 1389804

Summary: etcd3 cluster keeps electing new leaders during OpenShift cluster load to 1K namespaces
Product: Red Hat Enterprise Linux 7
Reporter: Mike Fiedler <mifiedle>
Component: etcd3
Assignee: Timothy St. Clair <tstclair>
Status: CLOSED NOTABUG
QA Contact: Martin Jenner <mjenner>
Severity: medium
Priority: medium
Version: 7.3
CC: jeder, mifiedle, sjr, tstclair, vlaad
Target Milestone: rc
Keywords: Extras
Hardware: x86_64
OS: Linux
Whiteboard: aos-scalability-34
Last Closed: 2017-01-25 13:56:43 UTC
Type: Bug

Description Mike Fiedler 2016-10-28 17:07:13 UTC
Description of problem:

During our standard OpenShift cluster-load horizontal scale test to 1K projects (2K deployments, 4K running pods across 300 nodes), the etcd3 3.0.12-3 cluster frequently called for new elections and changed leaders. The same workload on etcd 2.3.7 normally results in no leader changes. During this scale-up the leader changed 28 times.


Version-Release number of selected component (if applicable):  OpenShift 3.4.0.16 with etcd3 3.0.12-3


How reproducible: Unknown; it happened frequently on this run.


Steps to Reproduce:
1.  Build an HA cluster with 3 masters, 3 etcd nodes, 2 infra nodes, and 300 application nodes
2.  Run the https://github.com/openshift/svt/blob/master/openshift_scalability/config/pyconfigMasterVirtScalePause.yaml workload configured for 1000 projects
3.  Watch the etcd3 logs for leader changes (one way to count them is sketched below)
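
For step 3, a minimal sketch of one way to count leader changes is shown here. It assumes the etcd members log through journald under a unit named "etcd" and that leadership transitions show up as raft lines containing "elected leader", "lost leader", or "became leader"; the exact wording varies between etcd versions, so adjust the unit name and patterns to whatever your logs actually contain.

```python
#!/usr/bin/env python
"""Rough sketch: count etcd leader-change events in the journal.

Assumes the etcd members log through journald under a unit named 'etcd'
and that leadership transitions appear as raft lines containing
'elected leader', 'lost leader', or 'became leader'; adjust the unit
name and the patterns to match what your logs actually contain.
"""
import re
import subprocess

LEADER_EVENT = re.compile(r"elected leader|lost leader|became leader")


def leader_change_lines(unit="etcd", since="1 hour ago"):
    # Pull the journal for the etcd unit and keep only leadership lines.
    log = subprocess.check_output(
        ["journalctl", "-u", unit, "--since", since, "--no-pager"],
        universal_newlines=True,
    )
    return [line for line in log.splitlines() if LEADER_EVENT.search(line)]


if __name__ == "__main__":
    events = leader_change_lines()
    print("leader-change events: %d" % len(events))
    for line in events:
        print(line)
```

Running it on each etcd member during the cluster-load run gives a rough per-member count to compare against the burst pattern described below.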

Actual results:

Frequent leader changes that seem to come in bursts. Occasional oc command failures from the cluster-loader script because the etcd cluster is temporarily leaderless.

Expected results:

No unnecessary etcd3 cluster churn.

Comment 3 Timothy St. Clair 2016-10-28 18:28:38 UTC
Issue logged upstream: https://github.com/coreos/etcd/issues/6753

Comment 4 Timothy St. Clair 2016-11-03 20:32:53 UTC
The only thing I can think of is write contention on the VMs. Could you check to make certain that the etcd nodes are landing on different hypervisors? An easy way to do this in our environment is to make the instance sizes so large that they each eat a whole host.
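
To check the placement automatically rather than by eye, something along the lines of the sketch below could work. It is only a rough outline: it assumes an OpenStack environment, admin access to the compute API (the hypervisor hostname is an admin-only attribute), the openstacksdk client, and that the etcd instances are named with an "etcd" prefix; the cloud name "mycloud" is hypothetical.

```python
#!/usr/bin/env python
"""Rough sketch: check whether the etcd VMs share a hypervisor.

Assumes an OpenStack cloud, admin access to the compute API, and the
openstacksdk client. The instance-name prefix 'etcd' and cloud name
'mycloud' are hypothetical; substitute your own.
"""
from collections import Counter

import openstack

conn = openstack.connect(cloud="mycloud")

# Map each etcd instance to the compute host it landed on.
placement = {
    server.name: server.hypervisor_hostname
    for server in conn.compute.servers(details=True)
    if server.name.startswith("etcd")
}

for name, host in sorted(placement.items()):
    print("%s -> %s" % (name, host))

# Any hypervisor hosting more than one etcd member defeats anti-affinity.
shared = [h for h, n in Counter(placement.values()).items() if n > 1]
if shared:
    print("WARNING: multiple etcd members on: %s" % ", ".join(shared))
```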

Comment 9 Vikas Laad 2016-11-23 01:00:06 UTC
I am running into this problem with a 1000-node cluster when trying to run conformance tests.

Comment 10 Vikas Laad 2016-11-28 18:07:03 UTC
(In reply to Vikas Laad from comment #9)
> I am running into this problem with a 1000-node cluster when trying to run
> conformance tests.

There are other issues (networking, etc.) in this environment; please ignore this comment.

Comment 13 Timothy St. Clair 2017-01-25 13:56:43 UTC
Closing this issue; we root-caused it to a couple of conditions involving storage-subsystem write latency in OpenStack environments:

1. Host anti-affinity is needed when using local storage.
2. Write-latency spikes occur during fsyncs on a shared Ceph cluster.

Once etcd was moved to dedicated storage, the issues were resolved.
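
If fsync latency is suspected, etcd's own Prometheus metrics give a quick read before reaching for dedicated hardware. The sketch below is one rough way to check it; it assumes a member serving metrics at http://127.0.0.1:2379/metrics without TLS client auth (a secured cluster needs the proper URL and certificates) and reports the average WAL fsync duration from the etcd_disk_wal_fsync_duration_seconds histogram.

```python
#!/usr/bin/env python
"""Rough sketch: report etcd's average WAL fsync latency.

Assumes a member exposing Prometheus metrics at METRICS_URL without TLS
client auth; for a secured cluster, point this at the right URL and
supply certificates instead.
"""
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen         # Python 2

METRICS_URL = "http://127.0.0.1:2379/metrics"

fsync_sum = fsync_count = None
for line in urlopen(METRICS_URL).read().decode().splitlines():
    if line.startswith("etcd_disk_wal_fsync_duration_seconds_sum"):
        fsync_sum = float(line.split()[-1])
    elif line.startswith("etcd_disk_wal_fsync_duration_seconds_count"):
        fsync_count = float(line.split()[-1])

if fsync_count and fsync_sum is not None:
    # Multi-millisecond averages point at a backend that is too slow for etcd.
    print("average WAL fsync: %.2f ms" % (1000.0 * fsync_sum / fsync_count))
else:
    print("fsync metrics not found at %s" % METRICS_URL)
```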

Please reference upstream guidelines on deployment: https://github.com/coreos/etcd/blob/master/Documentation/op-guide/hardware.md#hardware-recommendations