Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1426733

Summary: master HA failover sometimes does not work when one member in the etcd cluster is stopped.
Product: OpenShift Container Platform
Reporter: Eric Paris <eparis>
Component: Node
Assignee: Andy Goldstein <agoldste>
Status: CLOSED ERRATA
QA Contact: Johnny Liu <jialiu>
Severity: high
Docs Contact:
Priority: medium
Version: 3.5.0
CC: agoldste, aos-bugs, dma, gpei, jialiu, jokerman, mmccomas, tdawson
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: When attempting to connect to etcd to acquire the leader lease, the master controllers process only tries to reach a single etcd cluster member, even if multiple members are specified. Consequence: If the selected etcd cluster member is unavailable, the master controllers process cannot acquire the leader lease, which means it will not start up and run properly. Fix: Attempt to connect to each of the specified etcd cluster members until a successful connection is made (see the sketch after the metadata fields below). Result: The master controllers process can acquire the leader lease and start up properly.
Story Points: ---
Clone Of: 1426183
Environment:
Last Closed: 2017-04-12 19:13:45 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1426183    
Bug Blocks:    

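The fix described in the Doc Text amounts to trying every configured etcd cluster member instead of giving up after the first one fails. The following is a minimal, hypothetical Go sketch of that pattern using only the standard library; it is not the actual master-controllers code, and the helper name firstReachableEndpoint and the endpoint addresses are illustrative assumptions (192.168.2.122:2379 mirrors the stopped member seen in the logs in comment 3).

package main

import (
	"fmt"
	"net"
	"time"
)

// firstReachableEndpoint is a hypothetical helper illustrating the fix:
// it probes every etcd cluster member in turn instead of only the first,
// and returns the first endpoint that accepts a TCP connection.
func firstReachableEndpoint(endpoints []string, timeout time.Duration) (string, error) {
	var lastErr error
	for _, ep := range endpoints {
		conn, err := net.DialTimeout("tcp", ep, timeout)
		if err != nil {
			lastErr = err // remember the failure and try the next member
			continue
		}
		conn.Close()
		return ep, nil
	}
	return "", fmt.Errorf("no etcd cluster member reachable: %v", lastErr)
}

func main() {
	// Endpoint values are assumptions for illustration only.
	endpoints := []string{"192.168.2.122:2379", "192.168.2.123:2379", "192.168.2.124:2379"}
	if ep, err := firstReachableEndpoint(endpoints, 2*time.Second); err != nil {
		fmt.Println("lease acquisition would fail:", err)
	} else {
		fmt.Println("acquiring leader lease via", ep)
	}
}

With this approach, stopping the member at 192.168.2.122:2379 no longer prevents lease acquisition, because the loop falls through to the remaining members.
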
Comment 1 Troy Dawson 2017-02-27 17:49:52 UTC
This has been merged into OCP and is in OCP v3.5.0.35 or newer.

Comment 3 Johnny Liu 2017-02-28 03:09:39 UTC
Verified this bug with atomic-openshift-3.5.0.35-1.git.0.b806d03.el7.x86_64, and it passes.

After stopping one etcd service in the cluster and restarting the controller services one by one, the failover action can be seen in the controller logs:


# journalctl -f -u atomic-openshift-master-controllers
-- Logs begin at Mon 2017-02-27 21:45:59 EST. --
Feb 27 21:58:02 openshift-103.lab.sjc.redhat.com atomic-openshift-master-controllers[4946]: E0227 21:58:02.642867    4946 leaderlease.go:95] unable to check lease openshift.io/leases/controllers: dial tcp 192.168.2.122:2379: getsockopt: connection refused
Feb 27 21:58:03 openshift-103.lab.sjc.redhat.com atomic-openshift-master-controllers[4946]: E0227 21:58:03.643865    4946 leaderlease.go:95] unable to check lease openshift.io/leases/controllers: dial tcp 192.168.2.122:2379: getsockopt: connection refused
Feb 27 21:58:04 openshift-103.lab.sjc.redhat.com atomic-openshift-master-controllers[4946]: E0227 21:58:04.645016    4946 leaderlease.go:95] unable to check lease openshift.io/leases/controllers: dial tcp 192.168.2.122:2379: getsockopt: connection refused
Feb 27 21:58:05 openshift-103.lab.sjc.redhat.com atomic-openshift-master-controllers[4946]: E0227 21:58:05.645906    4946 leaderlease.go:95] unable to check lease openshift.io/leases/controllers: dial tcp 192.168.2.122:2379: getsockopt: connection refused
Feb 27 21:58:06 openshift-103.lab.sjc.redhat.com atomic-openshift-master-controllers[4946]: I0227 21:58:06.650700    4946 leaderlease.go:154] Lease openshift.io/leases/controllers owned by master-vwvx282d at 60503 ttl 10 seconds, waiting for expiration
Feb 27 21:58:06 openshift-103.lab.sjc.redhat.com atomic-openshift-master-controllers[4946]: I0227 21:58:06.650762    4946 leaderlease.go:290] watching for expiration of lease openshift.io/leases/controllers from 60504
Feb 27 21:58:13 openshift-103.lab.sjc.redhat.com atomic-openshift-master-controllers[4946]: I0227 21:58:13.364032    4946 leaderlease.go:290] watching for expiration of lease openshift.io/leases/controllers from 60511
Feb 27 21:58:20 openshift-103.lab.sjc.redhat.com atomic-openshift-master-controllers[4946]: I0227 21:58:20.307972    4946 leaderlease.go:290] watching for expiration of lease openshift.io/leases/controllers from 60519

Comment 6 errata-xmlrpc 2017-04-12 19:13:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0884