Created attachment 1304623 [details]
script

Description of problem:
Master HA does not fail over cleanly when one master is shut down while an operation (e.g. login) is in flight. Requests fail for an unpredictable period (observed ~4 minutes in one run, ~11 minutes in another) before things start working again.

Version-Release number of selected component (if applicable):
OCP 3.5

How reproducible:
Always

Steps to Reproduce:
1. Install a 3-master / 4-node setup.
2. Run the attached script (login.txt).
3. Turn off one master at random.
4. The script fails (e.g. on the login API call), presumably because the master (API server) the call was routed to is the one we turned off in step 3.
5. After some time, maybe 4 minutes or 11 minutes, things start working again.

Actual results:

Expected results:

Additional info:
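For reference, a hypothetical stand-in for the attached script (this is not the actual login.txt; the LB hostname and credentials are placeholders) that exercises the same login/list/logout path through the load balancer:

#!/bin/bash
# Hypothetical reproducer sketch, not the attached login.txt:
# repeatedly log in through the LB, list resources, and log out.
while true; do
    oc login https://lb.example.com:8443 -u demo -p demo \
        --insecure-skip-tls-verify=true || echo "$(date) login failed"
    oc get pods --all-namespaces > /dev/null || echo "$(date) get pods failed"
    oc get nodes > /dev/null || echo "$(date) get nodes failed"
    oc logout || echo "$(date) logout failed"
    sleep 1
done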
You can see errors like:

error: You must have a token in order to logout.
I0711 15:37:49.113254    4095 helpers.go:172] server response object: [{
  "metadata": {},
  "status": "Failure",
  "message": "Internal error occurred: unexpected response: 504",
  "reason": "InternalError",
  "details": {
    "causes": [
      {
        "message": "unexpected response: 504"
      }
    ]
  },
  "code": 500
}]
F0711 15:37:49.113462    4095 helpers.go:107] Error from server: Internal error occurred: unexpected response: 504
Error from server: User "system:anonymous" cannot list all pods in the cluster
Error from server: User "system:anonymous" cannot list all nodes in the cluster

So presumably, if the master we shut down was the one serving the oc login API call at that moment, you will see the above errors with their script.
I don't expect this to be login specific. What load balancer is configured? How is the load balancer noticing one of the backing servers disappeared and taking it out of rotation?
(In reply to Jordan Liggitt from comment #4)
> I don't expect this to be login specific. What load balancer is configured?
> How is the load balancer noticing one of the backing servers disappeared and
> taking it out of rotation?

Yes, you are correct. AFAIK they are using the native HAProxy load balancer with the default configuration. I will confirm this with them again. Adding their on-site consultant to this BZ ticket.
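For reference, the haproxy.cfg that openshift-ansible lays down for native HA looks roughly like the sketch below (server names and IPs are placeholders; the actual file on their LB host should be checked). The plain TCP "check" option, at HAProxy's defaults (inter 2s, fall 3), should pull a dead master out of rotation within seconds:

frontend atomic-openshift-api
    bind *:8443
    default_backend atomic-openshift-api
    mode tcp
    option tcplog

backend atomic-openshift-api
    balance source
    mode tcp
    # "check" enables HAProxy's default TCP health check (inter 2s, fall 3),
    # so a powered-off master should leave rotation in roughly 6 seconds
    server master0 10.0.0.10:8443 check
    server master1 10.0.0.11:8443 check
    server master2 10.0.0.12:8443 check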
I tested this with 3.5 latest, 3 master/etcd hosts with 1 haproxy host.

When I power off 1 of the masters:

- "oc get all" hangs
- "oc get node" is ok
- "oc get pod" is ok
- "oc get buildconfig" hangs
- Etcd direct access getting buildconfig is ok:

sudo curl --cacert /etc/etcd/ca.crt --cert /etc/etcd/peer.crt --key /etc/etcd/peer.key https://$(hostname):2379/v2/keys/openshift.io/buildconfigs/test/hello-sinatra

Based on the facts that the Kubernetes API is ok and etcd is ok while the OpenShift API hangs, it looks like the OpenShift etcd access is trying to reach the dead etcd host.
HAProxy quickly detected that the master was down and dispatched requests to the live masters correctly; I confirmed this by enabling HAProxy syslog with the access log (tcplog).
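For anyone else reproducing this, the tcplog output can be enabled with roughly the following haproxy.cfg stanzas (this assumes a local rsyslog accepting UDP on 127.0.0.1 and routing the local2 facility; adjust as needed):

global
    log 127.0.0.1 local2

defaults
    log global
    mode tcp
    # log one line per TCP connection, including which backend server was chosen
    option tcplog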
I tried to recreate this on 3.6 but could not. The downed backend master was removed from the LB within seconds, everything normalized in < 5s, and oc commands involving both kube and OpenShift resources worked fine. I'll deploy a 3.5 cluster and try again.
(In reply to Takayoshi Kimura from comment #8)
> I tested this with 3.5 latest, 3 master/etcd hosts with 1 haproxy host.
>
> When I poweroff 1 of the master:
>
> - "oc get all" hangs
> - "oc get node" is ok
> - "oc get pod" is ok
> - "oc get buildconfig" hangs
> - Etcd direct access getting buildconfig is ok: sudo curl --cacert
> /etc/etcd/ca.crt --cert /etc/etcd/peer.crt --key /etc/etcd/peer.key
> https://$(hostname):2379/v2/keys/openshift.io/buildconfigs/test/hello-sinatra
>
> Based on these fact that Kubernetes API is ok, Etcd is ok, OpenShift API
> hangs, it looks like OpenShift etcd access is trying to access to dead etcd
> host.

I'd be interested in client and master logs from this recreate. I could not reproduce on Origin 1.5.
The issue here looks to be that we cache one of the etcd endpoints. This is only an issue when we make a request for a token. I think it's linked to this bit of code:

https://github.com/openshift/origin/blob/master/pkg/oauth/server/osinserver/osinserver.go#L45

I can reproduce it 80% of the time: when the etcd leader is stopped, we fail to obtain an oauthtoken.

STEPS:

Find the etcd leader and stop the service:

[root@master-1 cloud-user]# etcdctl --cert-file=$ETCD_PEER_CERT_FILE --key-file=$ETCD_PEER_KEY_FILE --ca-file=$ETCD_CA_FILE --peers=$ETCD_LISTEN_CLIENT_URLS cluster-health
2017-08-29 17:55:20.977368 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
2017-08-29 17:55:20.977974 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
member 696d1250543b574 is healthy: got healthy result from https://10.10.92.35:2379
member 392f19cd961179d5 is healthy: got healthy result from https://10.10.92.39:2379
member ada8affc10a95041 is healthy: got healthy result from https://10.10.92.37:2379
cluster is healthy

[root@master-1 cloud-user]# etcdctl --cert-file=$ETCD_PEER_CERT_FILE --key-file=$ETCD_PEER_KEY_FILE --ca-file=$ETCD_CA_FILE --peers=$ETCD_LISTEN_CLIENT_URLS member list
2017-08-29 17:55:23.920833 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
696d1250543b574: name=master-1.rhowe35.quicklab.rdu2.cee.redhat.com peerURLs=https://10.10.92.35:2380 clientURLs=https://10.10.92.35:2379 isLeader=false
392f19cd961179d5: name=master-2.rhowe35.quicklab.rdu2.cee.redhat.com peerURLs=https://10.10.92.39:2380 clientURLs=https://10.10.92.39:2379 isLeader=false
ada8affc10a95041: name=master-0.rhowe35.quicklab.rdu2.cee.redhat.com peerURLs=https://10.10.92.37:2380 clientURLs=https://10.10.92.37:2379 isLeader=true

[root@master-1 cloud-user]# ssh master-0 systemctl stop etcd

[root@master-1 cloud-user]# etcdctl --cert-file=$ETCD_PEER_CERT_FILE --key-file=$ETCD_PEER_KEY_FILE --ca-file=$ETCD_CA_FILE --peers=$ETCD_LISTEN_CLIENT_URLS cluster-health
2017-08-29 17:55:40.033951 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
2017-08-29 17:55:40.034502 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
member 696d1250543b574 is healthy: got healthy result from https://10.10.92.35:2379
member 392f19cd961179d5 is healthy: got healthy result from https://10.10.92.39:2379
failed to check the health of member ada8affc10a95041 on https://10.10.92.37:2379: Get https://10.10.92.37:2379/health: dial tcp 10.10.92.37:2379: getsockopt: connection refused
member ada8affc10a95041 is unreachable: [https://10.10.92.37:2379] are all unreachable
cluster is healthy

[root@master-1 cloud-user]# etcdctl --cert-file=$ETCD_PEER_CERT_FILE --key-file=$ETCD_PEER_KEY_FILE --ca-file=$ETCD_CA_FILE --peers=$ETCD_LISTEN_CLIENT_URLS member list
2017-08-29 17:55:54.680782 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
696d1250543b574: name=master-1.rhowe35.quicklab.rdu2.cee.redhat.com peerURLs=https://10.10.92.35:2380 clientURLs=https://10.10.92.35:2379 isLeader=true
392f19cd961179d5: name=master-2.rhowe35.quicklab.rdu2.cee.redhat.com peerURLs=https://10.10.92.39:2380 clientURLs=https://10.10.92.39:2379 isLeader=false
ada8affc10a95041: name=master-0.rhowe35.quicklab.rdu2.cee.redhat.com peerURLs=https://10.10.92.37:2380 clientURLs=https://10.10.92.37:2379 isLeader=false

Make a request for a new token:

[root@master-1 cloud-user]# curl -sLk "https://openshift.internal.rhowe35.quicklab.rdu2.cee.redhat.com/oauth/authorize?response_type=token&client_id=openshift-challenging-client" --user $USER:$PASSWORD -I
HTTP/1.1 302 Found
Cache-Control: no-cache, no-store, max-age=0, must-revalidate
Expires: Fri, 01 Jan 1990 00:00:00 GMT
Location: https://openshift.rhowe35.quicklab.rdu2.cee.redhat.com/oauth/token/implicit#error=server_error&error_description=The+authorization+server+encountered+an+unexpected+condition+that+prevented+it+from+fulfilling+the+request.
Pragma: no-cache
Set-Cookie: ssn=MTUwNDA0Mzc3OXxEeTVMci1lQVlxU0lLdDVyWElTTGZCM0ZUdFgzemt1UjltLWRoY3BsVDRELW5kX1RLZ0pUVFozUDhJTWxzZlFtVXdVRGxpb2hWWUhXQlhMd3c0SmtUT3lhd0c5VnRodzlKOEczV0xKczlLUDQtcFhUWVlVUE5lNGZPVkxUUnk2TG1BPT18iUrYmpWB9IM-0463gM5G8f1ZGMTq-ok8FkDjHgkmBBQ=; Path=/; Expires=Tue, 29 Aug 2017 22:56:19 GMT; Max-Age=3600; HttpOnly; Secure
Date: Tue, 29 Aug 2017 21:56:19 GMT
Content-Type: text/plain; charset=utf-8

HTTP/1.1 200 OK
Content-Type: text/plain
Date: Tue, 29 Aug 2017 21:56:19 GMT
Content-Length: 756

After this we can see panics and stack traces in the master API logs. After some time, or after restarting one API service, we are able to request a token again. Master API logs attached.
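A quick way to pull those panics out of the journal on an affected master (assuming the atomic-openshift-master-api unit name used by 3.5 native-HA installs):

# grep panics/stack traces out of the master API journal
journalctl -u atomic-openshift-master-api --since "1 hour ago" | grep -iA5 panic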
Created attachment 1319791 [details]
Atomic-openshift-master-api logs during reproducer
Tested with OCP 3.5, atomic-openshift-3.5.5.31-1.git.0.b6f55a2.el7.x86_64.
The masters could probably start setting AutoSyncInterval once people migrate to etcd3. It would have to be opt-in.
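For context, AutoSyncInterval is an option on the etcd v3 Go client that periodically refreshes the endpoint list from the cluster membership, so the client stops pinning an endpoint whose member has gone away. A minimal sketch of what opting in could look like (the endpoint addresses and the 30s interval are illustrative, not what the masters actually use):

package main

import (
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints: []string{
			"https://10.10.92.35:2379",
			"https://10.10.92.37:2379",
			"https://10.10.92.39:2379",
		},
		// Re-sync the endpoint list from cluster membership every 30s,
		// so dead members are dropped from rotation. The default of 0
		// disables auto-sync, which is why a cached dead endpoint can
		// linger. (TLS config omitted for brevity; real masters would
		// authenticate with client certs.)
		AutoSyncInterval: 30 * time.Second,
		DialTimeout:      5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
}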
It seems the workaround, until they migrate to OCP 3.6, is to restart the masters, i.e. the controller and API services. Should they do this at some interval, e.g. by writing a script?
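If they go that route, the crudest form would be a periodic restart of the HA master units, along the lines of the sketch below (unit names are the standard 3.5 native-HA ones; the 6-hour cadence is an arbitrary placeholder, and this is a stopgap, not a fix):

# /etc/cron.d/restart-masters -- hypothetical stopgap, installed on each master;
# stagger the schedule per host so all API servers are never down at once
0 */6 * * * root systemctl restart atomic-openshift-master-api atomic-openshift-master-controllers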