Description of problem:
During our standard OpenShift cluster-load horizontal scale test to 1K nodes, 2K deployments, 4K running pods on 300 nodes, the etcd3 3.0.12-3 cluster frequently called for new elections and changed leaders. The same workload on etcd 2.3.7 normally results in no leader changes. On this scale up it happened 28 times.

Version-Release number of selected component (if applicable):
OpenShift 3.4.0.16 with etcd3 3.0.12-3

How reproducible:
Unknown. Happened frequently on this run.

Steps to Reproduce:
1. HA cluster with 3 masters, 3 etcd, 2 infra nodes and 300 application nodes
2. Run the https://github.com/openshift/svt/blob/master/openshift_scalability/config/pyconfigMasterVirtScalePause.yaml workload configured for 1000 projects
3. Watch the etcd3 logs for leader changes

Actual results:
Frequent leader changes that seem to come in bursts. Occasional oc command failures from the cluster-loader script due to a temporarily leaderless etcd cluster.

Expected results:
No unnecessary etcd3 cluster churn.
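For anyone repeating step 3, a small script along these lines could tally leader events from a saved etcd log instead of counting by eye. This is only a sketch: the message shapes in the patterns ("elected leader", "changed leader from ... to ...", "lost leader") are assumptions about how the raft layer logs these events and should be checked against the actual log output, and the function name and command-line usage are made up for illustration.

```python
import re
import sys
from collections import Counter

# Assumed log message shapes; verify against the real etcd log before relying on them.
PATTERNS = {
    "elected": re.compile(
        r"raft\.node: [0-9a-f]+ elected leader [0-9a-f]+ at term (\d+)"),
    "changed": re.compile(
        r"raft\.node: [0-9a-f]+ changed leader from [0-9a-f]+ to [0-9a-f]+ at term (\d+)"),
    "lost": re.compile(
        r"raft\.node: [0-9a-f]+ lost leader [0-9a-f]+ at term (\d+)"),
}


def count_leader_events(lines):
    """Count leader-related log lines by kind and collect the raft terms seen."""
    counts = Counter()
    terms = set()
    for line in lines:
        for kind, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                counts[kind] += 1
                terms.add(int(match.group(1)))
                break  # a line matches at most one kind
    return counts, terms


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python leader_events.py /path/to/etcd.log
    with open(sys.argv[1], errors="replace") as log:
        counts, terms = count_leader_events(log)
    for kind in PATTERNS:
        print(f"{kind}: {counts[kind]}")
    print(f"distinct terms seen: {len(terms)}")
```

Running it against the log of each etcd member separately is probably the safer choice, since every member logs its own view of the same election.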
Issue logged upstream: https://github.com/coreos/etcd/issues/6753
The only thing I can think of is write contention on the VMs. Could you check to make certain that the etcd nodes are landing on different hypervisors? An easy way to do this in our environment is to make the instance sizes so large that they each eat a whole host.
I am running into this problem with 1000 nodes cluster when trying to run conformance tests.
(In reply to Vikas Laad from comment #9)
> I am running into this problem with 1000 nodes cluster when trying to run
> conformance tests.

There are other issues in this environment (network, etc.); please ignore this comment.
Closing this issue. We traced the root cause to storage subsystem write latency in OpenStack environments, under two conditions:
1. Host anti-affinity is needed if using local storage.
2. On a shared Ceph cluster, write latency issues occur during fsyncs.

Once etcd was put on dedicated storage, the issues were resolved. Please reference the upstream guidelines on deployment: https://github.com/coreos/etcd/blob/master/Documentation/op-guide/hardware.md#hardware-recommendations
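Since the root cause was fsync write latency, a rough check like the following may help compare candidate storage before placing etcd on it. This is a minimal sketch, not etcd's own benchmark or an upstream-recommended tool: the function names, the 2048-byte write size, and the sample count are arbitrary assumptions, and it only times small appends followed by fsync in the given directory.

```python
import os
import sys
import tempfile
import time


def measure_fsync_latency(directory, writes=200, block_size=2048):
    """Append small blocks to a temp file in `directory`, fsync after each one,
    and return the per-fsync latencies in seconds, sorted ascending."""
    latencies = []
    payload = b"\0" * block_size
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for _ in range(writes):
            os.write(fd, payload)
            start = time.perf_counter()
            os.fsync(fd)  # force the write to stable storage
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
        os.unlink(path)
    return sorted(latencies)


def percentile(sorted_values, fraction):
    """Nearest-rank style percentile on an already sorted list."""
    index = min(len(sorted_values) - 1, int(fraction * len(sorted_values)))
    return sorted_values[index]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python fsync_check.py /path/on/candidate/storage
    samples = measure_fsync_latency(sys.argv[1])
    print(f"p50 fsync: {percentile(samples, 0.50) * 1000:.2f} ms")
    print(f"p99 fsync: {percentile(samples, 0.99) * 1000:.2f} ms")
    print(f"max fsync: {samples[-1] * 1000:.2f} ms")
```

The tail values (p99, max) are likely the more telling numbers here, since bursts of slow fsyncs rather than the average are what would delay the leader enough to trigger elections.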