Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1426733

Summary: master HA failover sometimes does not work when one member in the etcd cluster is stopped.
Product: OpenShift Container Platform
Reporter: Eric Paris <eparis>
Component: Node
Assignee: Andy Goldstein <agoldste>
Status: CLOSED ERRATA
QA Contact: Johnny Liu <jialiu>
Severity: high
Docs Contact:
Priority: medium
Version: 3.5.0
CC: agoldste, aos-bugs, dma, gpei, jialiu, jokerman, mmccomas, tdawson
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: When attempting to connect to etcd to acquire the leader lease, the master controllers process only tries to reach a single etcd cluster member, even if multiple members are specified. Consequence: If the selected etcd cluster member is unavailable, the master controllers process cannot acquire the leader lease, which means it will not start up and run properly. Fix: Attempt to connect to each of the specified etcd cluster members until a successful connection is made (see the sketch after the metadata fields below). Result: The master controllers process can acquire the leader lease and start up properly.
Story Points: ---
Clone Of: 1426183
Environment:
Last Closed: 2017-04-12 19:13:45 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1426183    
Bug Blocks:    

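The fix described in the Doc Text amounts to trying every configured etcd cluster member instead of giving up after the first one fails. The following is a minimal, hypothetical Go sketch of that pattern using only the standard library; it is not the actual master-controllers code, and the helper name firstReachableEndpoint and the endpoint addresses are illustrative assumptions (192.168.2.122:2379 mirrors the stopped member seen in the logs in comment 3).

package main

import (
	"fmt"
	"net"
	"time"
)

// firstReachableEndpoint is a hypothetical helper illustrating the fix:
// it probes every etcd cluster member in turn instead of only the first,
// and returns the first endpoint that accepts a TCP connection.
func firstReachableEndpoint(endpoints []string, timeout time.Duration) (string, error) {
	var lastErr error
	for _, ep := range endpoints {
		conn, err := net.DialTimeout("tcp", ep, timeout)
		if err != nil {
			lastErr = err // remember the failure and try the next member
			continue
		}
		conn.Close()
		return ep, nil
	}
	return "", fmt.Errorf("no etcd cluster member reachable: %v", lastErr)
}

func main() {
	// Endpoint values are assumptions for illustration only.
	endpoints := []string{"192.168.2.122:2379", "192.168.2.123:2379", "192.168.2.124:2379"}
	if ep, err := firstReachableEndpoint(endpoints, 2*time.Second); err != nil {
		fmt.Println("lease acquisition would fail:", err)
	} else {
		fmt.Println("acquiring leader lease via", ep)
	}
}

With this approach, stopping the member at 192.168.2.122:2379 no longer prevents lease acquisition, because the loop falls through to the remaining members.
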
Comment 1 Troy Dawson 2017-02-27 17:49:52 UTC
This has been merged into OCP and is in OCP v3.5.0.35 or newer.

Comment 3 Johnny Liu 2017-02-28 03:09:39 UTC
Verified this bug with atomic-openshift-3.5.0.35-1.git.0.b806d03.el7.x86_64, and it passes.

After stopping one etcd service in the cluster and restarting the controller services one by one, the failover action can be seen in the controller logs:


# journalctl -f -u atomic-openshift-master-controllers
-- Logs begin at Mon 2017-02-27 21:45:59 EST. --
Feb 27 21:58:02 openshift-103.lab.sjc.redhat.com atomic-openshift-master-controllers[4946]: E0227 21:58:02.642867    4946 leaderlease.go:95] unable to check lease openshift.io/leases/controllers: dial tcp 192.168.2.122:2379: getsockopt: connection refused
Feb 27 21:58:03 openshift-103.lab.sjc.redhat.com atomic-openshift-master-controllers[4946]: E0227 21:58:03.643865    4946 leaderlease.go:95] unable to check lease openshift.io/leases/controllers: dial tcp 192.168.2.122:2379: getsockopt: connection refused
Feb 27 21:58:04 openshift-103.lab.sjc.redhat.com atomic-openshift-master-controllers[4946]: E0227 21:58:04.645016    4946 leaderlease.go:95] unable to check lease openshift.io/leases/controllers: dial tcp 192.168.2.122:2379: getsockopt: connection refused
Feb 27 21:58:05 openshift-103.lab.sjc.redhat.com atomic-openshift-master-controllers[4946]: E0227 21:58:05.645906    4946 leaderlease.go:95] unable to check lease openshift.io/leases/controllers: dial tcp 192.168.2.122:2379: getsockopt: connection refused
Feb 27 21:58:06 openshift-103.lab.sjc.redhat.com atomic-openshift-master-controllers[4946]: I0227 21:58:06.650700    4946 leaderlease.go:154] Lease openshift.io/leases/controllers owned by master-vwvx282d at 60503 ttl 10 seconds, waiting for expiration
Feb 27 21:58:06 openshift-103.lab.sjc.redhat.com atomic-openshift-master-controllers[4946]: I0227 21:58:06.650762    4946 leaderlease.go:290] watching for expiration of lease openshift.io/leases/controllers from 60504
Feb 27 21:58:13 openshift-103.lab.sjc.redhat.com atomic-openshift-master-controllers[4946]: I0227 21:58:13.364032    4946 leaderlease.go:290] watching for expiration of lease openshift.io/leases/controllers from 60511
Feb 27 21:58:20 openshift-103.lab.sjc.redhat.com atomic-openshift-master-controllers[4946]: I0227 21:58:20.307972    4946 leaderlease.go:290] watching for expiration of lease openshift.io/leases/controllers from 60519

Comment 6 errata-xmlrpc 2017-04-12 19:13:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0884