Bug 1426733 - master ha failover sometimes does not work when one member in etcd cluster is stopped.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.5.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Andy Goldstein
QA Contact: Johnny Liu
URL:
Whiteboard:
Depends On: 1426183
Blocks:
Reported: 2017-02-24 17:25 UTC by Eric Paris
Modified: 2017-07-24 14:11 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: When attempting to connect to etcd to acquire a leader lease, the master controllers process tried to reach only a single etcd cluster member, even if multiple members were specified.
Consequence: If the selected etcd cluster member was unavailable, the master controllers process could not acquire the leader lease, so it did not start up and run properly.
Fix: The master controllers process now attempts to connect to each of the specified etcd cluster members until a connection succeeds.
Result: The master controllers process can acquire the leader lease and start up properly.
Clone Of: 1426183
Environment:
Last Closed: 2017-04-12 19:13:45 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift ose pull 627 | 0 | None | None | None | 2017-02-24 17:35:24 UTC
Red Hat Product Errata | RHBA-2017:0884 | 0 | normal | SHIPPED_LIVE | Red Hat OpenShift Container Platform 3.5 RPM Release Advisory | 2017-04-12 22:50:07 UTC

Comment 1 Troy Dawson 2017-02-27 17:49:52 UTC
This has been merged into ocp and is in OCP v3.5.0.35 or newer.

Comment 3 Johnny Liu 2017-02-28 03:09:39 UTC
Verified this bug with atomic-openshift-3.5.0.35-1.git.0.b806d03.el7.x86_64. Result: PASS.

After stopping one etcd service in the cluster and starting the controller services one by one, the failover behavior is visible in the controller logs:


# journalctl -f -u atomic-openshift-master-controllers
-- Logs begin at Mon 2017-02-27 21:45:59 EST. --
Feb 27 21:58:02 openshift-103.lab.sjc.redhat.com atomic-openshift-master-controllers[4946]: E0227 21:58:02.642867    4946 leaderlease.go:95] unable to check lease openshift.io/leases/controllers: dial tcp 192.168.2.122:2379: getsockopt: connection refused
Feb 27 21:58:03 openshift-103.lab.sjc.redhat.com atomic-openshift-master-controllers[4946]: E0227 21:58:03.643865    4946 leaderlease.go:95] unable to check lease openshift.io/leases/controllers: dial tcp 192.168.2.122:2379: getsockopt: connection refused
Feb 27 21:58:04 openshift-103.lab.sjc.redhat.com atomic-openshift-master-controllers[4946]: E0227 21:58:04.645016    4946 leaderlease.go:95] unable to check lease openshift.io/leases/controllers: dial tcp 192.168.2.122:2379: getsockopt: connection refused
Feb 27 21:58:05 openshift-103.lab.sjc.redhat.com atomic-openshift-master-controllers[4946]: E0227 21:58:05.645906    4946 leaderlease.go:95] unable to check lease openshift.io/leases/controllers: dial tcp 192.168.2.122:2379: getsockopt: connection refused
Feb 27 21:58:06 openshift-103.lab.sjc.redhat.com atomic-openshift-master-controllers[4946]: I0227 21:58:06.650700    4946 leaderlease.go:154] Lease openshift.io/leases/controllers owned by master-vwvx282d at 60503 ttl 10 seconds, waiting for expiration
Feb 27 21:58:06 openshift-103.lab.sjc.redhat.com atomic-openshift-master-controllers[4946]: I0227 21:58:06.650762    4946 leaderlease.go:290] watching for expiration of lease openshift.io/leases/controllers from 60504
Feb 27 21:58:13 openshift-103.lab.sjc.redhat.com atomic-openshift-master-controllers[4946]: I0227 21:58:13.364032    4946 leaderlease.go:290] watching for expiration of lease openshift.io/leases/controllers from 60511
Feb 27 21:58:20 openshift-103.lab.sjc.redhat.com atomic-openshift-master-controllers[4946]: I0227 21:58:20.307972    4946 leaderlease.go:290] watching for expiration of lease openshift.io/leases/controllers from 60519

Comment 6 errata-xmlrpc 2017-04-12 19:13:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0884

