Bug 1475184 - Stopping etcd leader can cause authentication failures due to OpenShift Master caching and using etcd leader IP address.
Stopping etcd leader can cause authentication failures due to OpenShift Maste...
Status: CLOSED NOTABUG
Product: OpenShift Container Platform
Classification: Red Hat
Component: Master (Show other bugs)
3.5.1
Unspecified Unspecified
urgent Severity urgent
: ---
: 3.7.0
Assigned To: Robert Rati
DeShuai Ma
: Unconfirmed
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2017-07-26 03:46 EDT by Miheer Salunke
Modified: 2017-10-29 03:12 EDT (History)
17 users (show)

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1490427 1490428 (view as bug list)
Environment:
Last Closed: 2017-09-15 15:07:37 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
Atomic-openshift-master-api logs during reproducer (22.78 KB, text/plain)
2017-08-29 18:13 EDT, Ryan Howe
no flags Details


External Trackers
Tracker ID Priority Status Summary Last Updated
Github coreos/etcd/issues/8515 None None None 2017-09-07 10:51 EDT
Github coreos/etcd/pull/8519 None None None 2017-09-08 10:25 EDT

  None (edit)
Description Miheer Salunke 2017-07-26 03:46:31 EDT
Created attachment 1304623 [details]
script

Description of problem:

Master HA not working properly when one master shut down while performing any operation eg- login and after indefinite time period eg- 4 mins or 11mins it starts working.

Version-Release number of selected component (if applicable):
OCP 3.5

How reproducible:
Always

Steps to Reproduce:
1.Install 3 master 4 node setup
2.Run script attached login.txt
3.Then turn off one master randomly.
4. The script fails (eg-login api call) as may be the master(api server), it(API call for login) tried to reach was turned off by us in step 3.
5. After sometime maybe 4 mins or 11 mins things again start working.


Actual results:


Expected results:


Additional info:
Comment 3 Miheer Salunke 2017-07-26 04:22:32 EDT
You can see errors like Eg->

error: You must have a token in order to logout.
I0711 15:37:49.113254    4095 helpers.go:172] server response object: [{
  "metadata": {},
  "status": "Failure",
  "message": "Internal error occurred: unexpected response: 504",
  "reason": "InternalError",
  "details": {
    "causes": [
      {
        "message": "unexpected response: 504"
      }
    ]
  },
  "code": 500
}]
F0711 15:37:49.113462    4095 helpers.go:107] Error from server: Internal error occurred: unexpected response: 504
Error from server: User "system:anonymous" cannot list all pods in the cluster
Error from server: User "system:anonymous" cannot list all nodes in the cluster

So may be if we had shut down the master at the time when oc login API was connected to it you will see the above error with their script.
Comment 4 Jordan Liggitt 2017-07-26 12:03:13 EDT
I don't expect this to be login specific. What load balancer is configured? How is the load balancer noticing one of the backing servers disappeared and taking it out of rotation?
Comment 5 Miheer Salunke 2017-07-26 21:48:21 EDT
(In reply to Jordan Liggitt from comment #4)
> I don't expect this to be login specific. What load balancer is configured?
> How is the load balancer noticing one of the backing servers disappeared and
> taking it out of rotation?

Yes you are correct. They are using the native ha-proxy load balancer AFAIK.

They are using the default configuration AFAIK. I will still again confirm from their side. Adding their consultant on site to this BZ ticket.
Comment 8 Takayoshi Kimura 2017-08-03 22:38:15 EDT
I tested this with 3.5 latest, 3 master/etcd hosts with 1 haproxy host.

When I poweroff 1 of the master:

- "oc get all" hangs
- "oc get node" is ok
- "oc get pod" is ok
- "oc get buildconfig" hangs
- Etcd direct access getting buildconfig is ok: sudo curl --cacert /etc/etcd/ca.crt --cert /etc/etcd/peer.crt --key /etc/etcd/peer.key https://$(hostname):2379/v2/keys/openshift.io/buildconfigs/test/hello-sinatra

Based on these fact that Kubernetes API is ok, Etcd is ok, OpenShift API hangs, it looks like OpenShift etcd access is trying to access to dead etcd host.
Comment 9 Takayoshi Kimura 2017-08-03 22:41:26 EDT
HAProxy quickly detected that the master down and dispatch requests to live masters correctly, I confirmed it by enabling HAProxy syslog with access log (tcplog).
Comment 11 Seth Jennings 2017-08-08 16:49:42 EDT
I tried to recreate on 3.6 but could not.  Downed backend master was removed from LB within seconds and everything normalized in < 5s with oc commands involving both kube and Openshift resources working fine.

I'll deploy a 3.5 cluster and try again.
Comment 12 Seth Jennings 2017-08-08 18:11:41 EDT
(In reply to Takayoshi Kimura from comment #8)
> I tested this with 3.5 latest, 3 master/etcd hosts with 1 haproxy host.
> 
> When I poweroff 1 of the master:
> 
> - "oc get all" hangs
> - "oc get node" is ok
> - "oc get pod" is ok
> - "oc get buildconfig" hangs
> - Etcd direct access getting buildconfig is ok: sudo curl --cacert
> /etc/etcd/ca.crt --cert /etc/etcd/peer.crt --key /etc/etcd/peer.key
> https://$(hostname):2379/v2/keys/openshift.io/buildconfigs/test/hello-sinatra
> 
> Based on these fact that Kubernetes API is ok, Etcd is ok, OpenShift API
> hangs, it looks like OpenShift etcd access is trying to access to dead etcd
> host.

I'd be interested in client and master logs from this recreate.

I could not reproduce on Origin 1.5.
Comment 25 Ryan Howe 2017-08-29 18:11:16 EDT
The issue here looks to be with us caching one of the etcd endpoints. 

This is only and issue when we make a request for a token. 

I think its linked to this bit of code: 
https://github.com/openshift/origin/blob/master/pkg/oauth/server/osinserver/osinserver.go#L45


I can reproduce it 80% of the time. When the etcd leader is stopped we fail to obtain a oauthtoken. 



STEPS: 

Find the etcd leader and stop the service


[root@master-1 cloud-user]# etcdctl --cert-file=$ETCD_PEER_CERT_FILE --key-file=$ETCD_PEER_KEY_FILE --ca-file=$ETCD_CA_FILE --peers=$ETCD_LISTEN_CLIENT_URLS cluster-health
2017-08-29 17:55:20.977368 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
2017-08-29 17:55:20.977974 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
member 696d1250543b574 is healthy: got healthy result from https://10.10.92.35:2379
member 392f19cd961179d5 is healthy: got healthy result from https://10.10.92.39:2379
member ada8affc10a95041 is healthy: got healthy result from https://10.10.92.37:2379
cluster is healthy

[root@master-1 cloud-user]# etcdctl --cert-file=$ETCD_PEER_CERT_FILE --key-file=$ETCD_PEER_KEY_FILE --ca-file=$ETCD_CA_FILE --peers=$ETCD_LISTEN_CLIENT_URLS member list
2017-08-29 17:55:23.920833 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
696d1250543b574: name=master-1.rhowe35.quicklab.rdu2.cee.redhat.com peerURLs=https://10.10.92.35:2380 clientURLs=https://10.10.92.35:2379 isLeader=false
392f19cd961179d5: name=master-2.rhowe35.quicklab.rdu2.cee.redhat.com peerURLs=https://10.10.92.39:2380 clientURLs=https://10.10.92.39:2379 isLeader=false
ada8affc10a95041: name=master-0.rhowe35.quicklab.rdu2.cee.redhat.com peerURLs=https://10.10.92.37:2380 clientURLs=https://10.10.92.37:2379 isLeader=true

[root@master-1 cloud-user]# ssh master-0 systemctl stop etcd 

[root@master-1 cloud-user]# etcdctl --cert-file=$ETCD_PEER_CERT_FILE --key-file=$ETCD_PEER_KEY_FILE --ca-file=$ETCD_CA_FILE --peers=$ETCD_LISTEN_CLIENT_URLS cluster-health
2017-08-29 17:55:40.033951 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
2017-08-29 17:55:40.034502 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
member 696d1250543b574 is healthy: got healthy result from https://10.10.92.35:2379
member 392f19cd961179d5 is healthy: got healthy result from https://10.10.92.39:2379
failed to check the health of member ada8affc10a95041 on https://10.10.92.37:2379: Get https://10.10.92.37:2379/health: dial tcp 10.10.92.37:2379: getsockopt: connection refused
member ada8affc10a95041 is unreachable: [https://10.10.92.37:2379] are all unreachable
cluster is healthy

[root@master-1 cloud-user]# etcdctl --cert-file=$ETCD_PEER_CERT_FILE --key-file=$ETCD_PEER_KEY_FILE --ca-file=$ETCD_CA_FILE --peers=$ETCD_LISTEN_CLIENT_URLS member list
2017-08-29 17:55:54.680782 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
696d1250543b574: name=master-1.rhowe35.quicklab.rdu2.cee.redhat.com peerURLs=https://10.10.92.35:2380 clientURLs=https://10.10.92.35:2379 isLeader=true
392f19cd961179d5: name=master-2.rhowe35.quicklab.rdu2.cee.redhat.com peerURLs=https://10.10.92.39:2380 clientURLs=https://10.10.92.39:2379 isLeader=false
ada8affc10a95041: name=master-0.rhowe35.quicklab.rdu2.cee.redhat.com peerURLs=https://10.10.92.37:2380 clientURLs=https://10.10.92.37:2379 isLeader=false


Make a request for a new token: 

[root@master-1 cloud-user]# curl -sLk "https://openshift.internal.rhowe35.quicklab.rdu2.cee.redhat.com/oauth/authorize?response_type=token&client_id=openshift-challenging-client" --user $USER:$PASSWORD  -I
HTTP/1.1 302 Found
Cache-Control: no-cache, no-store, max-age=0, must-revalidate
Expires: Fri, 01 Jan 1990 00:00:00 GMT
Location: https://openshift.rhowe35.quicklab.rdu2.cee.redhat.com/oauth/token/implicit#error=server_error&error_description=The+authorization+server+encountered+an+unexpected+condition+that+prevented+it+from+fulfilling+the+request.
Pragma: no-cache
Set-Cookie: ssn=MTUwNDA0Mzc3OXxEeTVMci1lQVlxU0lLdDVyWElTTGZCM0ZUdFgzemt1UjltLWRoY3BsVDRELW5kX1RLZ0pUVFozUDhJTWxzZlFtVXdVRGxpb2hWWUhXQlhMd3c0SmtUT3lhd0c5VnRodzlKOEczV0xKczlLUDQtcFhUWVlVUE5lNGZPVkxUUnk2TG1BPT18iUrYmpWB9IM-0463gM5G8f1ZGMTq-ok8FkDjHgkmBBQ=; Path=/; Expires=Tue, 29 Aug 2017 22:56:19 GMT; Max-Age=3600; HttpOnly; Secure
Date: Tue, 29 Aug 2017 21:56:19 GMT
Content-Type: text/plain; charset=utf-8

HTTP/1.1 200 OK
Content-Type: text/plain
Date: Tue, 29 Aug 2017 21:56:19 GMT
Content-Length: 756


After we can see panics and stack traces in the master api logs. 
After some time or after restarting one api service we are able to request a token again. 

Master API logs attached.
Comment 26 Ryan Howe 2017-08-29 18:13 EDT
Created attachment 1319791 [details]
Atomic-openshift-master-api logs during reproducer
Comment 27 Ryan Howe 2017-08-29 18:14:34 EDT
Tested with OCP 3.5 
 atomic-openshift-3.5.5.31-1.git.0.b6f55a2.el7.x86_6
Comment 28 Clayton Coleman 2017-08-29 20:41:28 EDT
The masters could probably start setting AutoSyncInterval once people migrate to etcd3.  It would have to be opt in.
Comment 29 Miheer Salunke 2017-08-29 21:58:22 EDT
It seems the workaround till they migrate to OCP 3.6 is restart the masters i.e the controller and api services so shall they do it some intervals ? by writing a script ?

Note You need to log in before you can comment on or make changes to this bug.