Created attachment 1304623 [details]
script

Description of problem:
Master HA does not fail over cleanly when one master is shut down while an operation (e.g. login) is in flight. Requests fail for an unpredictable period (observed ~4 minutes in one run, ~11 minutes in another) before things start working again.

Version-Release number of selected component (if applicable):
OCP 3.5

How reproducible:
Always

Steps to Reproduce:
1. Install a 3-master / 4-node setup.
2. Run the attached script (login.txt).
3. Turn off one master at random.
4. The script fails (e.g. on the login API call), presumably because the master (API server) the call was routed to is the one we turned off in step 3.
5. After some time, maybe 4 minutes or 11 minutes, things start working again.

Actual results:

Expected results:

Additional info:
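For reference, a hypothetical stand-in for the attached script (this is not the actual login.txt; the LB hostname and credentials are placeholders) that exercises the same login/list/logout path through the load balancer:

#!/bin/bash
# Hypothetical reproducer sketch, not the attached login.txt:
# repeatedly log in through the LB, list resources, and log out.
while true; do
    oc login https://lb.example.com:8443 -u demo -p demo \
        --insecure-skip-tls-verify=true || echo "$(date) login failed"
    oc get pods --all-namespaces > /dev/null || echo "$(date) get pods failed"
    oc get nodes > /dev/null || echo "$(date) get nodes failed"
    oc logout || echo "$(date) logout failed"
    sleep 1
done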
You can see errors like:

error: You must have a token in order to logout.
I0711 15:37:49.113254    4095 helpers.go:172] server response object: [{
  "metadata": {},
  "status": "Failure",
  "message": "Internal error occurred: unexpected response: 504",
  "reason": "InternalError",
  "details": {
    "causes": [
      {
        "message": "unexpected response: 504"
      }
    ]
  },
  "code": 500
}]
F0711 15:37:49.113462    4095 helpers.go:107] Error from server: Internal error occurred: unexpected response: 504
Error from server: User "system:anonymous" cannot list all pods in the cluster
Error from server: User "system:anonymous" cannot list all nodes in the cluster

So presumably, if the master we shut down was the one serving the oc login API call at that moment, you will see the above errors with their script.
I don't expect this to be login specific. What load balancer is configured? How is the load balancer noticing one of the backing servers disappeared and taking it out of rotation?
(In reply to Jordan Liggitt from comment #4)
> I don't expect this to be login specific. What load balancer is configured?
> How is the load balancer noticing one of the backing servers disappeared and
> taking it out of rotation?

Yes, you are correct. AFAIK they are using the native HAProxy load balancer with the default configuration. I will confirm this with them again. Adding their on-site consultant to this BZ ticket.
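For reference, the haproxy.cfg that openshift-ansible lays down for native HA looks roughly like the sketch below (server names and IPs are placeholders; the actual file on their LB host should be checked). The plain TCP "check" option, at HAProxy's defaults (inter 2s, fall 3), should pull a dead master out of rotation within seconds:

frontend atomic-openshift-api
    bind *:8443
    default_backend atomic-openshift-api
    mode tcp
    option tcplog

backend atomic-openshift-api
    balance source
    mode tcp
    # "check" enables HAProxy's default TCP health check (inter 2s, fall 3),
    # so a powered-off master should leave rotation in roughly 6 seconds
    server master0 10.0.0.10:8443 check
    server master1 10.0.0.11:8443 check
    server master2 10.0.0.12:8443 check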
I tested this with 3.5 latest, 3 master/etcd hosts with 1 haproxy host.

When I power off 1 of the masters:

- "oc get all" hangs
- "oc get node" is ok
- "oc get pod" is ok
- "oc get buildconfig" hangs
- Etcd direct access getting buildconfig is ok:

sudo curl --cacert /etc/etcd/ca.crt --cert /etc/etcd/peer.crt --key /etc/etcd/peer.key https://$(hostname):2379/v2/keys/openshift.io/buildconfigs/test/hello-sinatra

Based on the facts that the Kubernetes API is ok and etcd is ok while the OpenShift API hangs, it looks like the OpenShift etcd access is trying to reach the dead etcd host.
HAProxy quickly detected that the master was down and dispatched requests to the live masters correctly; I confirmed this by enabling HAProxy syslog with the access log (tcplog).
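For anyone else reproducing this, the tcplog output can be enabled with roughly the following haproxy.cfg stanzas (this assumes a local rsyslog accepting UDP on 127.0.0.1 and routing the local2 facility; adjust as needed):

global
    log 127.0.0.1 local2

defaults
    log global
    mode tcp
    # log one line per TCP connection, including which backend server was chosen
    option tcplog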
I tried to recreate this on 3.6 but could not. The downed backend master was removed from the LB within seconds, everything normalized in < 5s, and oc commands involving both kube and OpenShift resources worked fine. I'll deploy a 3.5 cluster and try again.
(In reply to Takayoshi Kimura from comment #8)
> I tested this with 3.5 latest, 3 master/etcd hosts with 1 haproxy host.
>
> When I poweroff 1 of the master:
>
> - "oc get all" hangs
> - "oc get node" is ok
> - "oc get pod" is ok
> - "oc get buildconfig" hangs
> - Etcd direct access getting buildconfig is ok: sudo curl --cacert
> /etc/etcd/ca.crt --cert /etc/etcd/peer.crt --key /etc/etcd/peer.key
> https://$(hostname):2379/v2/keys/openshift.io/buildconfigs/test/hello-sinatra
>
> Based on these fact that Kubernetes API is ok, Etcd is ok, OpenShift API
> hangs, it looks like OpenShift etcd access is trying to access to dead etcd
> host.

I'd be interested in client and master logs from this recreate. I could not reproduce on Origin 1.5.
The issue here looks to be that we cache one of the etcd endpoints. This is only an issue when we make a request for a token. I think it's linked to this bit of code:

https://github.com/openshift/origin/blob/master/pkg/oauth/server/osinserver/osinserver.go#L45

I can reproduce it 80% of the time: when the etcd leader is stopped, we fail to obtain an oauthtoken.

STEPS:

Find the etcd leader and stop the service:

[root@master-1 cloud-user]# etcdctl --cert-file=$ETCD_PEER_CERT_FILE --key-file=$ETCD_PEER_KEY_FILE --ca-file=$ETCD_CA_FILE --peers=$ETCD_LISTEN_CLIENT_URLS cluster-health
2017-08-29 17:55:20.977368 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
2017-08-29 17:55:20.977974 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
member 696d1250543b574 is healthy: got healthy result from https://10.10.92.35:2379
member 392f19cd961179d5 is healthy: got healthy result from https://10.10.92.39:2379
member ada8affc10a95041 is healthy: got healthy result from https://10.10.92.37:2379
cluster is healthy

[root@master-1 cloud-user]# etcdctl --cert-file=$ETCD_PEER_CERT_FILE --key-file=$ETCD_PEER_KEY_FILE --ca-file=$ETCD_CA_FILE --peers=$ETCD_LISTEN_CLIENT_URLS member list
2017-08-29 17:55:23.920833 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
696d1250543b574: name=master-1.rhowe35.quicklab.rdu2.cee.redhat.com peerURLs=https://10.10.92.35:2380 clientURLs=https://10.10.92.35:2379 isLeader=false
392f19cd961179d5: name=master-2.rhowe35.quicklab.rdu2.cee.redhat.com peerURLs=https://10.10.92.39:2380 clientURLs=https://10.10.92.39:2379 isLeader=false
ada8affc10a95041: name=master-0.rhowe35.quicklab.rdu2.cee.redhat.com peerURLs=https://10.10.92.37:2380 clientURLs=https://10.10.92.37:2379 isLeader=true

[root@master-1 cloud-user]# ssh master-0 systemctl stop etcd

[root@master-1 cloud-user]# etcdctl --cert-file=$ETCD_PEER_CERT_FILE --key-file=$ETCD_PEER_KEY_FILE --ca-file=$ETCD_CA_FILE --peers=$ETCD_LISTEN_CLIENT_URLS cluster-health
2017-08-29 17:55:40.033951 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
2017-08-29 17:55:40.034502 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
member 696d1250543b574 is healthy: got healthy result from https://10.10.92.35:2379
member 392f19cd961179d5 is healthy: got healthy result from https://10.10.92.39:2379
failed to check the health of member ada8affc10a95041 on https://10.10.92.37:2379: Get https://10.10.92.37:2379/health: dial tcp 10.10.92.37:2379: getsockopt: connection refused
member ada8affc10a95041 is unreachable: [https://10.10.92.37:2379] are all unreachable
cluster is healthy

[root@master-1 cloud-user]# etcdctl --cert-file=$ETCD_PEER_CERT_FILE --key-file=$ETCD_PEER_KEY_FILE --ca-file=$ETCD_CA_FILE --peers=$ETCD_LISTEN_CLIENT_URLS member list
2017-08-29 17:55:54.680782 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
696d1250543b574: name=master-1.rhowe35.quicklab.rdu2.cee.redhat.com peerURLs=https://10.10.92.35:2380 clientURLs=https://10.10.92.35:2379 isLeader=true
392f19cd961179d5: name=master-2.rhowe35.quicklab.rdu2.cee.redhat.com peerURLs=https://10.10.92.39:2380 clientURLs=https://10.10.92.39:2379 isLeader=false
ada8affc10a95041: name=master-0.rhowe35.quicklab.rdu2.cee.redhat.com peerURLs=https://10.10.92.37:2380 clientURLs=https://10.10.92.37:2379 isLeader=false

Make a request for a new token:

[root@master-1 cloud-user]# curl -sLk "https://openshift.internal.rhowe35.quicklab.rdu2.cee.redhat.com/oauth/authorize?response_type=token&client_id=openshift-challenging-client" --user $USER:$PASSWORD -I
HTTP/1.1 302 Found
Cache-Control: no-cache, no-store, max-age=0, must-revalidate
Expires: Fri, 01 Jan 1990 00:00:00 GMT
Location: https://openshift.rhowe35.quicklab.rdu2.cee.redhat.com/oauth/token/implicit#error=server_error&error_description=The+authorization+server+encountered+an+unexpected+condition+that+prevented+it+from+fulfilling+the+request.
Pragma: no-cache
Set-Cookie: ssn=MTUwNDA0Mzc3OXxEeTVMci1lQVlxU0lLdDVyWElTTGZCM0ZUdFgzemt1UjltLWRoY3BsVDRELW5kX1RLZ0pUVFozUDhJTWxzZlFtVXdVRGxpb2hWWUhXQlhMd3c0SmtUT3lhd0c5VnRodzlKOEczV0xKczlLUDQtcFhUWVlVUE5lNGZPVkxUUnk2TG1BPT18iUrYmpWB9IM-0463gM5G8f1ZGMTq-ok8FkDjHgkmBBQ=; Path=/; Expires=Tue, 29 Aug 2017 22:56:19 GMT; Max-Age=3600; HttpOnly; Secure
Date: Tue, 29 Aug 2017 21:56:19 GMT
Content-Type: text/plain; charset=utf-8

HTTP/1.1 200 OK
Content-Type: text/plain
Date: Tue, 29 Aug 2017 21:56:19 GMT
Content-Length: 756

After this we can see panics and stack traces in the master API logs. After some time, or after restarting one API service, we are able to request a token again. Master API logs attached.
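A quick way to pull those panics out of the journal on an affected master (assuming the atomic-openshift-master-api unit name used by 3.5 native-HA installs):

# grep panics/stack traces out of the master API journal
journalctl -u atomic-openshift-master-api --since "1 hour ago" | grep -iA5 panic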
Created attachment 1319791 [details]
Atomic-openshift-master-api logs during reproducer
Tested with OCP 3.5, atomic-openshift-3.5.5.31-1.git.0.b6f55a2.el7.x86_64.
The masters could probably start setting AutoSyncInterval once people migrate to etcd3. It would have to be opt-in.
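For context, AutoSyncInterval is an option on the etcd v3 Go client that periodically refreshes the endpoint list from the cluster membership, so the client stops pinning an endpoint whose member has gone away. A minimal sketch of what opting in could look like (the endpoint addresses and the 30s interval are illustrative, not what the masters actually use):

package main

import (
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints: []string{
			"https://10.10.92.35:2379",
			"https://10.10.92.37:2379",
			"https://10.10.92.39:2379",
		},
		// Re-sync the endpoint list from cluster membership every 30s,
		// so dead members are dropped from rotation. The default of 0
		// disables auto-sync, which is why a cached dead endpoint can
		// linger. (TLS config omitted for brevity; real masters would
		// authenticate with client certs.)
		AutoSyncInterval: 30 * time.Second,
		DialTimeout:      5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
}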
It seems the workaround, until they migrate to OCP 3.6, is to restart the masters, i.e. the controller and API services. Should they do this at some interval, e.g. by writing a script?
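If they go that route, the crudest form would be a periodic restart of the HA master units, along the lines of the sketch below (unit names are the standard 3.5 native-HA ones; the 6-hour cadence is an arbitrary placeholder, and this is a stopgap, not a fix):

# /etc/cron.d/restart-masters -- hypothetical stopgap, installed on each master;
# stagger the schedule per host so all API servers are never down at once
0 */6 * * * root systemctl restart atomic-openshift-master-api atomic-openshift-master-controllers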