Bug 1342061 - master controller does not fail over to another one when master api is not reachable in native HA env. [NEEDINFO]
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.2.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.2.1
Assignee: Andy Goldstein
QA Contact: DeShuai Ma
URL:
Whiteboard:
Depends On:
Blocks: OSOPS_V3
 
Reported: 2016-06-02 11:17 UTC by Johnny Liu
Modified: 2016-09-07 22:20 UTC (History)
11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-09-07 22:20:55 UTC
Target Upstream Version:
jialiu: needinfo? (jdetiber)



Description Johnny Liu 2016-06-02 11:17:26 UTC
Description of problem:
See the following for more detailed info.

Version-Release number of selected component (if applicable):
openshift-ansible-3.0.94-1.git.0.67a822a.el7.noarch.rpm
openshift-ansible-docs-3.0.94-1.git.0.67a822a.el7.noarch.rpm
openshift-ansible-filter-plugins-3.0.94-1.git.0.67a822a.el7.noarch.rpm
openshift-ansible-lookup-plugins-3.0.94-1.git.0.67a822a.el7.noarch.rpm
openshift-ansible-playbooks-3.0.94-1.git.0.67a822a.el7.noarch.rpm
openshift-ansible-roles-3.0.94-1.git.0.67a822a.el7.noarch.rpm
atomic-openshift-3.2.0.45-1.git.0.a2ee9db.el7.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Set up a multi-master native HA env; here I am using a google network LB as the external LB (https://cloud.google.com/compute/docs/load-balancing/network/).
2. Set current-context to use the LB cluster entry in /etc/origin/master/openshift-master.kubeconfig to work around #1342049.
3. Currently in this env there are two masters, master-1 and master-2; the active controller service is running on master-1, and master-2's controller service is in passive mode.
4. Stop the api service running on master-1 to simulate an api outage; master-2's api keeps working well.
5. Wait for a moment; google LB's api health check finds master-1's api unhealthy, so the LB will not forward any api traffic to master-1, and master-1 is also unable to reach the LB api (LB-ip:8443), while other instances in the cluster can still reach the LB api (LB-ip:8443).
6. Check master-1's controller logs.
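For step 6, a sketch of checking the controller logs on master-1 (the systemd unit name is taken from the journal excerpt under "Actual results"; adjust for your install):

```shell
# Follow the controllers unit and surface only the watch/connection errors
# against the LB api endpoint.
journalctl -u atomic-openshift-master-controllers -f | grep -E 'Failed to watch|connection refused'
```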


Actual results:
Master-1's controller is still the active one, and master-2's is still the passive one.

master-1's controller log is as follows:
<--snip-->
Jun 02 07:07:45 qe-jialiu-master-etcd-1 atomic-openshift-master-controllers[26885]: E0602 07:07:45.532705   26885 reflector.go:271] /usr/lib/golang/src/runtime/asm_amd64.s:2232: Failed to watch *api.Service: Get https://104.197.218.186:8443/api/v1/watch/services?resourceVersion=2434&timeoutSeconds=433: dial tcp 104.197.218.186:8443: connection refused
Jun 02 07:07:45 qe-jialiu-master-etcd-1 atomic-openshift-master-controllers[26885]: E0602 07:07:45.532877   26885 reflector.go:271] pkg/build/controller/factory/factory.go:125: Failed to watch *api.Build: Get https://104.197.218.186:8443/oapi/v1/watch/builds?resourceVersion=1900&timeoutSeconds=160: dial tcp 104.197.218.186:8443: connection refused
Jun 02 07:07:45 qe-jialiu-master-etcd-1 atomic-openshift-master-controllers[26885]: E0602 07:07:45.533017   26885 reflector.go:271] pkg/admission/serviceaccount/admission.go:102: Failed to watch *api.ServiceAccount: Get https://104.197.218.186:8443/api/v1/watch/serviceaccounts?resourceVersion=4455&timeoutSeconds=587: dial tcp 104.197.218.186:8443: connection refused
<--snip-->

That means the controller cannot reach the API endpoint, which causes all deployments to stop working.

Even after waiting several minutes, controller failover never happened, which means the whole cluster goes into a stalled state.

Expected results:
The controllers' failover functionality should be smarter: when the active controller cannot reach the API endpoint, failover should happen.

Additional info:

Comment 1 Andy Goldstein 2016-06-02 17:17:20 UTC
What is 104.197.218.186? Is that master-1, master-2, or the LB?

Comment 2 Maciej Szulik 2016-06-02 21:01:32 UTC
We're waiting for an answer from QE. I've spoken with Andy about possible issues; I'll let him own it :)

Comment 3 Johnny Liu 2016-06-03 02:11:39 UTC
(In reply to Andy Goldstein from comment #1)
> What is 104.197.218.186? Is that master-1, master-2, or the LB?
104.197.218.186 is LB.

Comment 4 Johnny Liu 2016-06-03 02:13:47 UTC
Personally I think we need some kind of health check mechanism inside controller itself.

Comment 5 Timothy St. Clair 2016-06-03 14:02:57 UTC
There are 2 parts to this issue.

1. The controller talks directly to etcd, so even if the local api-server is not present, fail-over will not be triggered because the lease is still renewed.

2. The controller traffic defaults to the loopback address. This is a known issue and can be configured otherwise: https://github.com/openshift/openshift-ansible/issues/1563

Comment 6 Timothy St. Clair 2016-06-03 14:13:45 UTC
Why exactly is the local controller unable to connect to the load-balancer?

Comment 7 Andy Goldstein 2016-06-03 14:16:28 UTC
Johnny, it's entirely appropriate for the controller on master-1 to keep the lease even if the apiserver on master-1 is down. That's Tim's point #1 above.

Assuming you have the controller configured to talk to the LB (which it sounds like you do, based on comment 3), then it should reestablish connectivity to the LB after you bring the apiserver on master-1 down.

Did the controller on master-1 ever reestablish connectivity to the LB and resume operations?

Comment 8 Johnny Liu 2016-06-06 03:48:02 UTC
(In reply to Andy Goldstein from comment #7)
> Johnny, it's entirely appropriate for the controller on master-1 to keep the
> lease even if the apiserver on master-1 is down. That's Tim's point #1 above.

Based on the current controller HA implementation, yeah, the behavior described in this bug is expected, but I do not think it is reasonable behavior from a customer's view. When the controller loses its connection to the api, it no longer works well as a controller, yet it still holds the role of "active" controller.

> 
> Assuming you have the controller configured to talk to the LB (which it
> sounds like you do, based on comment 3), then it should reestablish
> connectivity to the LB after you bring the apiserver on master-1 down.
> 
> Did the controller on master-1 ever reestablish connectivity to the LB and
> resume operations?

Yes, I already have the controller configured to talk to the LB, not itself (#1342049).

In theory, it should re-establish connectivity to the LB after the apiserver on master-1 goes down, but in this bug I am using the google network LB, and unfortunately its behavior does not match our expectation (step 5).

When using the google network LB, once the api server on master-1 goes down, the google LB treats master-1 as unhealthy: not only does the LB stop forwarding api traffic to master-1, but master-1 also becomes unable to connect to the LB. (It sounds a little weird, but my test results really proved the google network LB works this way.) That is why the controller on master-1 cannot re-establish connectivity to the LB when the api on master-1 goes down.
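A quick way to confirm this asymmetry is to probe the LB VIP from master-1 and from another instance and compare (a sketch; the IP is the LB address seen in the controller logs above, and /healthz is the standard apiserver health endpoint):

```shell
# Run on master-1 (the unhealthy backend) and on any other instance, and compare.
# -k skips cert verification for the VIP; --max-time keeps a blackholed VIP from
# hanging the check.
LB_IP=104.197.218.186
curl -k --max-time 5 -sS "https://${LB_IP}:8443/healthz" \
  && echo "VIP reachable from $(hostname)" \
  || echo "VIP NOT reachable from $(hostname)"
```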

BTW, this test request came from the OPS team, because they are planning to set up a dedicated env on GCE using the google network LB.

Comment 9 Timothy St. Clair 2016-06-06 14:55:26 UTC
> 
> When using google network LB, once api server on master-1 get down, google
> LB will treat master-1 as unhealthy, not only LB will not transfer any api
> traffic to master-1, but also master-1 will unable to connect LB. (Sound
> like it is a little weird, but my test result really proved google network
> LB is working in such way.) That is why controller on master-1 can not
> re-establish connectivity to the LB when api on master-1 get down.


This seems like a bug in the load balancer configuration, e.g. it almost seems like it's trying to "fence" the node.

If the load balancer cannot be reconfigured, it will hasten https://github.com/openshift/origin/issues/6642 to become a priority.

Comment 10 Andy Goldstein 2016-06-06 15:09:09 UTC
Avesh is testing the GCE LB setup (1 LB, 2 VMs running httpd; stop httpd on vm1, then see if vm1 can curl the LB and get a response with content, i.e. from vm2).

Comment 11 Avesh Agarwal 2016-06-06 16:44:28 UTC
Hi,

I have done some testing on gce regarding this lb issue; here are the steps I performed:

1. Create 2 instances on gce: test-aka-1, test-aka-2.
Both instances are running httpd (apachectl start).


2. Create the same tag for both VMs:
gcloud compute instances add-tags test-aka-1 --tags test-aka-lb
gcloud compute instances add-tags test-aka-2 --tags test-aka-lb

3. Add a firewall rule:
gcloud compute firewall-rules create www-firewall-aka --target-tags test-aka-lb --allow tcp:80
 
4. Create target pool:
gcloud compute target-pools create test-pool-aka

5. Add VMs to this target pool:
gcloud compute target-pools add-instances test-pool-aka --instances test-aka-1,test-aka-2

6. Create a forwarding rule:
gcloud compute forwarding-rules create test-fw-aka --target-pool test-pool-aka

7. Create a health check:
gcloud compute http-health-checks create test-hc-aka

8. Add health check for the target pool:
gcloud compute target-pools add-health-checks test-pool-aka --http-health-check test-hc-aka

Here is my observation:

When I stop httpd (apachectl stop) on VM1: I can curl the LB from outside but not from VM1; curl from VM2 works.
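The observation above can be reproduced with a quick curl from each vantage point (a sketch; the LB IP is the forwarding-rule address from comment 13):

```shell
# Stop httpd on VM1 first: apachectl stop
LB_IP=104.196.98.228
# From VM1 (hairpinning through the VIP) -- fails once VM1 is marked unhealthy:
curl --max-time 5 -sS "http://${LB_IP}/" || echo "VM1 cannot reach the VIP"
# From VM2 or an external host -- should return VM2's page:
curl --max-time 5 -sS "http://${LB_IP}/"
```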

Comment 12 Avesh Agarwal 2016-06-06 16:45:48 UTC
Sorry I meant httpd not https in the step 1.

Comment 13 Avesh Agarwal 2016-06-06 16:48:43 UTC
Also, to be precise, when I curl the load balancer IP from outside while VM1's apache is stopped: one moment it gives the error "curl: (7) Failed to connect to 104.196.98.228 port 80: Connection refused", but the next moment it works.

Comment 14 Avesh Agarwal 2016-06-06 20:37:50 UTC
I tested the same setup with http load balancing, and it is working now. The observation is this:

If I stop VM1's httpd (apachectl stop) and then curl http://load_balancer_ip from both VM1 and VM2, I am able to access the instance running on VM2.

Their health checks are weird in that they specifically expect an index.html in /var/www/html, otherwise the check fails; apart from that I was able to access the default httpd page without any issue.

Comment 15 Avesh Agarwal 2016-06-06 21:05:28 UTC
I did the network load balancer test again with correct, passing health checks, and the observed behavior is the same as what I already reported in comments 11, 12, and 13.

So the summary is that the network load balancer and the http load balancer behave differently in my tests when httpd is stopped on VM1.

Comment 16 Johnny Liu 2016-06-07 02:51:58 UTC
(In reply to Timothy St. Clair from comment #9)
> This seems like a bug in a the load balancer configuration, e.g. it almost
> seems like it's trying to "fence" the node. 
Based on comment 15, it sounds like someone else's test result (comment 11) is the same as mine (comment 8) for the network load balancer. Seems like I was configuring the load balancer correctly.
> 
> If the load balancer can-not be reconfigured it will hasten: 
> https://github.com/openshift/origin/issues/6642 to become a priority.

Comment 17 Andy Goldstein 2016-06-07 10:25:50 UTC
Can you switch to using the http load balancer?

Comment 18 Johnny Liu 2016-06-07 11:09:33 UTC
(In reply to Andy Goldstein from comment #17)
> Can you switch to using the http load balancer?
As per the OPS team's request, "network load balancer" is their preferred plan.
@twiest, do you have any suggestion here?

Actually I also tested it with an haproxy http load balancer, and everything is working well. So I guess everything will go well once we switch to the google http load balancer.

The important thing is that I opened this bug mainly to track the issue that we need better active-passive leader election (seems like the same request mentioned in https://github.com/openshift/origin/issues/6642). Using the google network LB is just the only way I have to reproduce it, so I am raising it to everyone. So I think switching to the http load balancer is only a workaround.

Comment 19 Avesh Agarwal 2016-06-07 17:16:32 UTC
I thought about it a little more, and I am not sure whether I am getting confused or missing something, so let me clarify my doubt:

Why is the controller on the failed master api server talking to the load balancer (lb) in the first place?

My understanding is that the lb is for external traffic.

When all instances have their internal and external IPs, why can't all processes on them talk to each other over their internal/external networks instead of going through the lb?

Because I just checked that the failed instance can still curl directly to the internal or external IP of the other instances.

Comment 20 Avesh Agarwal 2016-06-07 17:45:13 UTC
Or is it because the lb is load balancing internal cluster traffic to the multiple masters too?

Comment 21 Avesh Agarwal 2016-06-07 18:24:05 UTC
I read more about it, and it seems that going through the lb in a multiple-masters HA setup is expected, so maybe just ignore my previous questions.

And given that the http-based lb works as expected, either it is a limitation of the gce network lb, or maybe a bug, or maybe there is some configuration setting we are not aware of that might mitigate this.

Comment 22 Andy Goldstein 2016-06-20 19:41:11 UTC
Johnny, could you please update the controllers to talk to kubernetes.default.svc.cluster.local instead of the GCE LB and retest? We know that DNS will have some issues with one of the masters down (slower lookup times; we're working on a fix for that), but our current thinking is that we want to make this configuration change and see if it works.
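A sketch of that change, assuming the controllers read /etc/origin/master/openshift-master.kubeconfig (the path from step 2 in the report) and that `oc config` mirrors `kubectl config`; the cluster-entry lookup and port are illustrative:

```shell
KUBECONFIG=/etc/origin/master/openshift-master.kubeconfig
# Look up the name of the (assumed first) cluster entry in the kubeconfig.
CLUSTER=$(oc config view --kubeconfig="$KUBECONFIG" -o jsonpath='{.clusters[0].name}')
# Repoint that cluster entry at the in-cluster service name instead of the LB,
# then restart the controllers so they pick up the change.
oc config set-cluster "$CLUSTER" \
  --kubeconfig="$KUBECONFIG" \
  --server=https://kubernetes.default.svc.cluster.local:443
systemctl restart atomic-openshift-master-controllers
```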

Comment 25 Andy Goldstein 2016-07-20 14:43:19 UTC
This is in 3.2.1.x now. This scenario was verified as reported in https://bugzilla.redhat.com/show_bug.cgi?id=1351645#c16. I'm moving this to ON_QA. Please either retest in GCE or move to verified. Thanks!

Comment 26 Gan Huang 2016-07-26 08:06:25 UTC
Verified it in a native-HA env (2 masters and 4 nodes) using the google network LB:
# openshift version
openshift v3.2.1.10-1-g668ed0a
kubernetes v1.2.0-36-g4a3f9c5
etcd 2.2.5

After installation, update the controllers to talk to kubernetes.default.svc.cluster.local on the two masters. Then stop the master-1 api service (the controllers service is active on master-1) and create apps successfully 20 times with "oc new-app cakephp-example -n install-test".
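The 20 creations can be scripted (a sketch; project and template names are from the comment above, and the app label set by `oc new-app` is assumed):

```shell
# Repeatedly create and tear down the example app to exercise deployments
# while master-1's api is down.
for i in $(seq 1 20); do
  oc new-app cakephp-example -n install-test || { echo "attempt $i failed"; exit 1; }
  oc delete all -l app=cakephp-example -n install-test
done
echo "all 20 attempts succeeded"
```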

Comment 34 Johnny Liu 2016-08-04 10:45:14 UTC
@Jason, though this bug is moved to verified, users have to update the master/node kubeconfig files to make them talk to kubernetes.default.svc.cluster.local, which is really inconvenient.

Is it possible to improve the installer to set all the master/node kubeconfig files to talk to kubernetes.default.svc.cluster.local by default?

