Bug 1490427

Summary: [3.6.z] Stopping etcd leader can cause authentication failures due to OpenShift Master caching and using etcd leader IP address.
Product: OpenShift Container Platform Reporter: Eric Rich <erich>
Component: Master Assignee: Robert Rati <rrati>
Status: CLOSED ERRATA QA Contact: ge liu <geliu>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 3.5.1 CC: aos-bugs, byeo, ccoleman, decarr, dma, eparis, erich, fcami, jliggitt, jokerman, knakai, mfojtik, misalunk, mmccomas, pdwyer, rhowe, rrati, tkimura
Target Milestone: --- Keywords: Unconfirmed
Target Release: 3.6.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: In some failure cases, the etcd client used by OpenShift does not rotate through all available etcd cluster members. The client repeatedly tries the same server, and if that server is down, requests fail for an extended time until the client determines that the server is invalid.
Consequence: If the etcd leader goes away while it is being contacted for an operation such as authentication, the authentication fails and the etcd client remains stuck trying to talk to the etcd member that no longer exists. User authentication fails for an extended period of time.
Fix: The etcd client now rotates to other cluster members even on failure.
Result: If the etcd leader goes away, the worst that should happen is a failure of that one authentication attempt. The next attempt succeeds because a different etcd member is used.
Story Points: ---
Clone Of: 1475184 Environment:
Last Closed: 2017-10-25 13:06:40 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
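
The rotate-on-failure behaviour described in the Doc Text can be illustrated with a minimal sketch. This is not the actual etcd client code or the patch shipped in the erratum; the class name RotatingClient, the send callback, and the endpoint URLs are all hypothetical and exist only for illustration. The sketch assumes a failed request surfaces as a ConnectionError.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


class RotatingClient:
    """Hypothetical client that moves to the next endpoint after a failure."""

    def __init__(self, endpoints: Sequence[str]) -> None:
        if not endpoints:
            raise ValueError("at least one endpoint is required")
        self._endpoints = list(endpoints)
        self._index = 0  # position of the endpoint used for the next request

    @property
    def current(self) -> str:
        """Endpoint that the next request will be sent to."""
        return self._endpoints[self._index]

    def request(self, send: Callable[[str], T]) -> T:
        """Send one request to the current endpoint.

        On a connection failure, advance to the next endpoint before
        re-raising, so the following call does not hit the same dead member.
        """
        endpoint = self.current
        try:
            return send(endpoint)
        except ConnectionError:
            # Rotate even though this attempt failed (assumed behaviour of the fix).
            self._index = (self._index + 1) % len(self._endpoints)
            raise
```

With three hypothetical members and the first one down, the first call to request() raises ConnectionError and the second call goes to the second member, which matches the Result described above: one failed attempt, then success. Without the rotation in the except branch, every call would keep hitting the dead member.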

Comment 5 ge liu 2017-09-28 06:55:13 UTC
Verified in OCP version:

openshift v3.6.173.0.37
kubernetes v1.6.1+5115d708d7
etcd 3.2.1


Steps:

1). Set up an HA environment: 3 masters, 4 nodes, 1 load balancer.

2). Run oc login/logout in a loop.

3). Turn off a master service at random.

4). Login/logout continues to work without interruption.

5). Turn off the remaining master services one by one.

6). Login/logout keeps working as long as at least one master service is running, and reports errors once all master services are turned off:

error: EOF
error: You must have a token in order to logout.
error: EOF
error: You must have a token in order to logout.
error: EOF
error: You must have a token in order to logout.
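
The login/logout loop from step 2 could be scripted roughly as follows. This is a sketch only, not the script used for verification: the function names, the server URL, and the credentials are hypothetical, and it assumes the oc binary is on PATH and that password login with -u/-p is enabled on the cluster.

```python
import subprocess
from typing import Callable, List, Sequence

Runner = Callable[[Sequence[str]], int]


def run_command(argv: Sequence[str]) -> int:
    """Run a command and return its exit code."""
    return subprocess.run(list(argv)).returncode


def login_logout_loop(
    server: str,
    user: str,
    password: str,
    iterations: int,
    runner: Runner = run_command,
) -> List[int]:
    """Run oc login followed by oc logout repeatedly.

    Returns the indexes of the iterations in which either command
    exited with a non-zero status.
    """
    failures: List[int] = []
    for i in range(iterations):
        login_rc = runner(["oc", "login", server, "-u", user, "-p", password])
        logout_rc = runner(["oc", "logout"])
        if login_rc != 0 or logout_rc != 0:
            failures.append(i)
    return failures
```

For example, login_logout_loop("https://lb.example.com:8443", "testuser", "secret", 100) would return an empty list if no attempt failed while masters were being stopped. The runner parameter is there so the loop can be exercised without a cluster.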

Comment 7 errata-xmlrpc 2017-10-25 13:06:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:3049