Description of problem:
See the following for more detailed info.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Set up a multi-master native HA env; here I am using the Google network LB as the external LB (https://cloud.google.com/compute/docs/load-balancing/network/).
2. Set current-context to use the LB cluster entry in /etc/origin/master/openshift-master.kubeconfig to work around #1342049.
3. Currently in this env there are two masters, master-1 and master-2; the active controller service is running on master-1, and master-2's controller service is in passive mode.
4. Stop the api service running on master-1 to simulate an api outage; master-2's api is still working well.
5. Wait a moment; the Google LB's api health check finds master-1's api unhealthy, so the LB stops forwarding api traffic to master-1, and master-1 also becomes unable to reach the LB api (LB-ip:8443), while other instances in the cluster can still reach the LB api (LB-ip:8443).
6. Check master-1's controller logs.
Found that master-1's controller is still the active one and master-2's is still passive.
master-1's controller log is as follows:
Jun 02 07:07:45 qe-jialiu-master-etcd-1 atomic-openshift-master-controllers: E0602 07:07:45.532705 26885 reflector.go:271] /usr/lib/golang/src/runtime/asm_amd64.s:2232: Failed to watch *api.Service: Get https://18.104.22.168:8443/api/v1/watch/services?resourceVersion=2434&timeoutSeconds=433: dial tcp 22.214.171.124:8443: connection refused
Jun 02 07:07:45 qe-jialiu-master-etcd-1 atomic-openshift-master-controllers: E0602 07:07:45.532877 26885 reflector.go:271] pkg/build/controller/factory/factory.go:125: Failed to watch *api.Build: Get https://126.96.36.199:8443/oapi/v1/watch/builds?resourceVersion=1900&timeoutSeconds=160: dial tcp 188.8.131.52:8443: connection refused
Jun 02 07:07:45 qe-jialiu-master-etcd-1 atomic-openshift-master-controllers: E0602 07:07:45.533017 26885 reflector.go:271] pkg/admission/serviceaccount/admission.go:102: Failed to watch *api.ServiceAccount: Get https://184.108.40.206:8443/api/v1/watch/serviceaccounts?resourceVersion=4455&timeoutSeconds=587: dial tcp 220.127.116.11:8443: connection refused
That means the controller cannot reach the API endpoint, which causes all deployments to stop working.
After waiting several minutes, controller failover never happened, which means the whole cluster goes into a stalled state.
The controllers' failover functionality should be smarter: when the active controller cannot reach the API endpoint, failover should happen.
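To illustrate the expected behavior, here is a rough Python sketch (not the actual origin controller code; all names here are made up) of a lease holder that stops renewing its lease after repeated failed API health checks, so that a passive controller can take over:

```python
import urllib.error
import urllib.request


def api_healthy(api_url, timeout=5):
    """Return True if the API endpoint answers its health check.

    /healthz is the usual Kubernetes health endpoint; adjust as needed.
    """
    try:
        with urllib.request.urlopen(api_url + "/healthz", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


class LeaseHolder:
    """Toy model of an active controller holding a lease in etcd."""

    def __init__(self, api_url, max_failures=3):
        self.api_url = api_url
        self.max_failures = max_failures
        self.failures = 0
        self.active = True

    def renew(self, healthy=None):
        """Renew the lease only while the API endpoint stays reachable.

        After max_failures consecutive failed health checks, stop renewing
        so the lease expires and failover can happen.
        """
        if healthy is None:
            healthy = api_healthy(self.api_url)
        if healthy:
            self.failures = 0
            return True
        self.failures += 1
        if self.failures >= self.max_failures:
            self.active = False  # let the lease expire -> failover
        return self.active
```

The real fix would have to live in the controller's lease-renewal loop; this only illustrates the idea that lease renewal should be coupled to API reachability rather than to etcd reachability alone.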
What is 18.104.22.168? Is that master-1, master-2, or the LB?
We're waiting for an answer from QE. I've spoken with Andy about possible issues; I'll let him own it :)
(In reply to Andy Goldstein from comment #1)
> What is 22.214.171.124? Is that master-1, master-2, or the LB?
22.214.171.124 is the LB.
Personally I think we need some kind of health-check mechanism inside the controller itself.
There are 2 parts to this issue.
1. The controller talks directly to etcd, so even if the local api-server is not present, fail-over will not be triggered because the lease is still renewed.
2. The controller traffic defaults to loopback. This is a known issue and can be configured otherwise: https://github.com/openshift/openshift-ansible/issues/1563
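As a concrete example of point 2: the workaround described in this bug was to switch the controller's kubeconfig context away from the loopback entry. Something along these lines (the context name below is a hypothetical placeholder, not taken from this environment):

```yaml
# /etc/origin/master/openshift-master.kubeconfig (illustrative excerpt)
# Point current-context at the LB cluster entry instead of the local
# loopback entry, so controller traffic goes through the load balancer.
current-context: default/lb-example-com:8443/system:openshift-master
```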
Why exactly is the local controller unable to connect to the load-balancer?
Johnny, it's entirely appropriate for the controller on master-1 to keep the lease even if the apiserver on master-1 is down. That's Tim's point #1 above.
Assuming you have the controller configured to talk to the LB (which it sounds like you do, based on comment 3), then it should reestablish connectivity to the LB after you bring the apiserver on master-1 down.
Did the controller on master-1 ever reestablish connectivity to the LB and resume operations?
(In reply to Andy Goldstein from comment #7)
> Johnny, it's entirely appropriate for the controller on master-1 to keep the
> lease even if the apiserver on master-1 is down. That's Tim's point #1 above.
Based on the current controller HA implementation, yes, the behavior described in this bug is expected, but I do not think it is reasonable behavior from a customer's point of view. When the controller loses its connection to the api, it no longer works properly as a controller, yet it still holds the "active" role.
> Assuming you have the controller configured to talk to the LB (which it
> sounds like you do, based on comment 3), then it should reestablish
> connectivity to the LB after you bring the apiserver on master-1 down.
> Did the controller on master-1 ever reestablish connectivity to the LB and
> resume operations?
Yes, I already have the controller configured to talk to the LB, not to itself (#1342049).
In theory, it should re-establish connectivity to the LB after the apiserver on master-1 goes down, but in this bug I am using the Google network LB, and unfortunately its behavior does not match our expectation (step 5).
When using the Google network LB, once the api server on master-1 goes down, the LB treats master-1 as unhealthy: not only does the LB stop forwarding api traffic to master-1, but master-1 also becomes unable to connect to the LB. (It sounds a little weird, but my test results show the Google network LB really works this way; it seems consistent with GCE's documented behavior that traffic sent from a backend VM to the LB's IP is handled by that VM itself rather than forwarded to another backend.) That is why the controller on master-1 cannot re-establish connectivity to the LB when the api on master-1 is down.
BTW, this test request came from the OPS team, because they are planning to set up a dedicated env on GCE using the Google network LB.
> When using the Google network LB, once the api server on master-1 goes
> down, the LB treats master-1 as unhealthy: not only does the LB stop
> forwarding api traffic to master-1, but master-1 also becomes unable to
> connect to the LB. That is why the controller on master-1 cannot
> re-establish connectivity to the LB when the api on master-1 is down.
This seems like a bug in the load balancer configuration; e.g., it almost seems like it's trying to "fence" the node.
If the load balancer cannot be reconfigured, it will hasten
https://github.com/openshift/origin/issues/6642 to become a priority.
Avesh is testing the GCE LB setup (1 LB, 2 VMs running httpd; stop httpd on vm1, then see if vm1 can curl the LB and get a response with content, i.e. served by vm2).
I have done some testing on GCE regarding this LB issue; here are the steps I performed:
1. Create 2 instances on GCE: test-aka-1, test-aka-2
Both instances are running https: (apachectl start).
2. Add the same tag to both VMs:
gcloud compute instances add-tags test-aka-1 --tags test-aka-lb
gcloud compute instances add-tags test-aka-2 --tags test-aka-lb
3. Add a firewall rule:
gcloud compute firewall-rules create www-firewall-aka --target-tags test-aka-lb --allow tcp:80
4. Create target pool:
gcloud compute target-pools create test-pool-aka
5. Add VMs to this target pool:
gcloud compute target-pools add-instances test-pool-aka --instances test-aka-1,test-aka-2
6. Create a forwarding rule:
gcloud compute forwarding-rules create test-fw-aka --target-pool test-pool-aka
7. Create a health check:
gcloud compute http-health-checks create test-hc-aka
8. Add health check for the target pool:
gcloud compute target-pools add-health-checks test-pool-aka --http-health-check test-hc-aka
Here is my observation:
When I stop httpd (apachectl stop) on VM1, I can curl the LB from outside and from VM2, but not from VM1.
Sorry, in step 1 I meant httpd, not https.
Also, to be precise, when I curl the load balancer IP from outside while VM1's apache is stopped: one moment it gives the error "curl: (7) Failed to connect to 126.96.36.199 port 80: Connection refused", but the next moment it works.
I tested the load balancing with the HTTP load balancer, and that setup is working now. The observation is this:
If I stop VM1's httpd (apachectl stop) and then curl http://load_balancer_ip from both VM1 and VM2, I am able to access the instance running on VM2 from both.
Their health checks are very weird in that they specifically expect an index.html in /var/www/html, otherwise the check fails; without that requirement I was still able to access the default httpd page without any issue.
I did the network load balancer test again with successful/correct health checks, and the observed behavior is the same as what I already reported in comments 11, 12, and 13.
So the summary is that, in my tests, the network load balancer and the HTTP load balancer behave differently when httpd is stopped on VM1.
(In reply to Timothy St. Clair from comment #9)
> This seems like a bug in the load balancer configuration; e.g., it almost
> seems like it's trying to "fence" the node.
Based on comment 15, it sounds like someone else's test result (comment 11) is the same as mine (comment 8) for the network load balancer. It seems I was configuring the load balancer correctly.
> If the load balancer cannot be reconfigured, it will hasten
> https://github.com/openshift/origin/issues/6642 to become a priority.
Can you switch to using the http load balancer?
(In reply to Andy Goldstein from comment #17)
> Can you switch to using the http load balancer?
Per the OPS team's request, the "network load balancer" is their preferred plan.
@twiest, do you have any suggestion here?
Actually I also tested it with an haproxy HTTP load balancer, and everything works well. So I guess everything will go well once we switch to the Google HTTP load balancer.
The important thing is that I opened this bug mainly to track the need for a better active-passive leader election (seemingly the same request as https://github.com/openshift/origin/issues/6642). Using the Google network LB is simply the only way I found to reproduce it, so I raised it for everyone. Switching to the HTTP load balancer is only a workaround.
I thought about it a little more, and I am not sure whether I am confused or missing something, so let me clarify my doubt:
Why is the controller on the failed master api server talking to the load balancer (LB) in the first place?
My understanding is that the LB is for external traffic.
When all instances have internal and external IPs, why can't the processes on them talk to each other over their internal/external networks instead of going through the LB?
I just checked that the failed instance can still curl directly to the internal or external IP of the other instances.
Or is it because the LB is load balancing internal cluster traffic to multiple masters too?
I read more about it, and it seems going through the LB is expected in a multi-master HA setup, so maybe just ignore my previous questions.
Given that the HTTP-based LB works as expected, this is either a limitation of the GCE network LB, possibly a bug, or there may be some configuration setting we are not aware of that would mitigate this.
Johnny, could you please update the controllers to talk to kubernetes.default.svc.cluster.local instead of the GCE LB and retest? We know DNS will have some issues with one of the masters down (slower lookup times; we're working on a fix for that), but our current thinking is that we want to make this configuration change and see if it works.
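Mechanically, that change amounts to rewriting the `server:` entries in the controllers' kubeconfig to point at the cluster service DNS name. A rough sketch of that edit (the port and the text-level approach are assumptions; in practice you would edit only the cluster entry your current-context points at, then restart the atomic-openshift-master-controllers service):

```python
import re

# Assumed service DNS endpoint; the in-cluster API service usually
# listens on 443, but verify for your environment.
DEFAULT_SERVER = "https://kubernetes.default.svc.cluster.local:443"


def point_kubeconfig_at_service_dns(kubeconfig_text, new_server=DEFAULT_SERVER):
    """Rewrite every 'server:' entry in a kubeconfig to the cluster DNS name.

    This is a text-level sketch operating on the YAML as plain text; it
    preserves indentation and replaces only the URL after 'server:'.
    """
    return re.sub(
        r"(?m)^(\s*server:\s*).*$",
        lambda m: m.group(1) + new_server,
        kubeconfig_text,
    )
```

For example, a cluster entry like `server: https://188.8.131.52:8443` would become `server: https://kubernetes.default.svc.cluster.local:443` while the rest of the file is left untouched.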
This is in 3.2.1.x now. This scenario was verified as reported in https://bugzilla.redhat.com/show_bug.cgi?id=1351645#c16. I'm moving this to ON_QA. Please either retest in GCE or move to verified. Thanks!
Verified it in a native-HA env (2 masters and 4 nodes) using the Google network LB.
# openshift version
After installation, I updated the controllers to talk to kubernetes.default.svc.cluster.local on both masters. Then I stopped master-1's api service (the controllers service was active on master-1) and created apps successfully 20 times with "oc new-app cakephp-example -n install-test".
@Jason, though this bug has been moved to verified, users have to update the master/node kubeconfig files to make them talk to kubernetes.default.svc.cluster.local, which is really inconvenient.
Is it possible to improve the installer so it configures all master/node kubeconfig files to talk to kubernetes.default.svc.cluster.local by default?