Bug 1351645
Summary: | SkyDNS resolution intermittently fails when at least 1 master is down in an HA setup | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Andy Goldstein <agoldste> |
Component: | Node | Assignee: | Andy Goldstein <agoldste> |
Status: | CLOSED ERRATA | QA Contact: | DeShuai Ma <dma> |
Severity: | urgent | Docs Contact: | |
Priority: | urgent | ||
Version: | 3.1.0 | CC: | agoldste, akokshar, aos-bugs, bbennett, ccoleman, chezhang, danw, decarr, dma, erich, jkaur, jokerman, knakayam, marc.jadoul, mbarrett, misalunk, mmccomas, pep, rhowe, sdodson, steven, stwalter, twiest, whearn, xtian |
Target Milestone: | --- | ||
Target Release: | 3.2.1 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: |
Cause: In an HA environment with multiple masters, one or more of the masters goes down.
Consequence: DNS requests sent to the cluster nameserver running at kubernetes.default.svc.cluster.local can become slow, which can result in things such as builds taking significantly longer than usual if they perform several DNS lookups.
Fix: All the masters now coordinate to maintain an up-to-date list of endpoints for kubernetes.default.svc.cluster.local. If a master goes down, its endpoint is removed from the list; note that removal may take up to 20 seconds. When a master comes back up, its endpoint is reinserted into the list.
Result: DNS resolution returns to normal once the endpoints list is updated to remove the down master.
|
Story Points: | --- |
Clone Of: | 1300028 | Environment: | |
Last Closed: | 2016-07-20 19:37:27 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | |
Bug Depends On: | 1300028 | ||
Bug Blocks: | 1303130, 1267746, 1286513 |
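The fix described in the Doc Text can be pictured as a lease-style reconciliation loop. The following is a toy model only, with assumed names (`EndpointReconciler`, `heartbeat`, `endpoints`) and not the actual OpenShift implementation; it just illustrates why a dead master's endpoint can linger for up to the ~20-second window before being pruned:

```python
class EndpointReconciler:
    """Toy model of lease-style endpoint reconciliation for
    kubernetes.default.svc.cluster.local -- NOT the actual OpenShift code.
    Each live master periodically re-registers ("heartbeats") its endpoint;
    endpoints whose heartbeat is older than the TTL (~20s per the fix notes)
    are pruned from the list."""

    def __init__(self, ttl=20.0):
        self.ttl = ttl
        self.heartbeats = {}  # master endpoint IP -> last heartbeat time

    def heartbeat(self, ip, now):
        # A live master refreshes its own entry on every reconciliation pass.
        self.heartbeats[ip] = now

    def endpoints(self, now):
        # Only masters that heartbeated within the TTL stay in the list.
        return sorted(ip for ip, t in self.heartbeats.items()
                      if now - t <= self.ttl)


r = EndpointReconciler()
r.heartbeat("10.0.0.1", now=0.0)
r.heartbeat("10.0.0.2", now=0.0)
print(r.endpoints(now=5.0))        # ['10.0.0.1', '10.0.0.2'] -- both alive
r.heartbeat("10.0.0.1", now=15.0)  # 10.0.0.2 is down and stops heartbeating
print(r.endpoints(now=25.0))       # ['10.0.0.1'] -- dead master pruned once its lease expires
```

The model shows the trade-off directly: DNS answers can briefly include a dead master, but only for at most one TTL window after it stops heartbeating.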
Comment 6
Zhang Cheng
2016-07-15 10:41:20 UTC
Could you please retest, and if it fails again, capture the information we need to debug:

- Where did you test (e.g. AWS, local, ...)?
- How many masters?
- How many nodes?
- What load balancer did you use? If haproxy, include its logs
- Logs from all atomic-openshift-master-api services
- Logs from all atomic-openshift-master-controller services
- Logs from all atomic-openshift-node services
- oc describe buildconfig/<name of build config>
- oc get events --all-namespaces
- Also, the master and node config files

Thanks!

@Andy Goldstein MTV2 is OpenStack, so you may not be able to access it. The attachment excludes master1-controller-service-log, which is about 93 MB and exceeds the attachment size limit. I will attach it to a mail and send it to you.

@Andy Goldstein Because master1-controller-service-log exceeds the email size limit, I put it on my Google Drive and shared it with you.

There are a few items to point out:

1) Until https://github.com/openshift/openshift-ansible/issues/1563 is resolved, you will have to manually configure /etc/origin/master/openshift-master.kubeconfig to point either to the load balancer or to kubernetes.default.svc.cluster.local. The controllers use this file to learn the URL and credentials for the masters, and out of the box it is not configured to talk to an HA endpoint. Given that the fix for this bug updates the endpoints for kubernetes.default.svc.cluster.local, I would recommend updating the config to point to that URL.

2) The controllers talk directly to etcd to attempt to acquire the lease to become the active controller. As long as the active controller can still talk to etcd, it remains active. If the active controller is configured to talk only to its colocated master, and not to the load balancer or the kubernetes service, it will happily continue being the active controller even after that master goes down.

3) As mentioned before, it may take 10 to 20 seconds before a now-dead master's endpoint is removed from the list of endpoints for the kubernetes service.

@Andy Goldstein Thanks for your clarification. Triggering deployments and builds succeeds after manually configuring /etc/origin/master/openshift-master.kubeconfig to point to the load balancer. I will mark the status as VERIFIED per the discussion above; https://github.com/openshift/openshift-ansible/issues/1563 tracks the remaining scenario.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1466

*** Bug 1370610 has been marked as a duplicate of this bug. ***
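As a sketch of the manual workaround for openshift-master.kubeconfig described in item 1 above (all values here are illustrative assumptions; the cluster name, CA placeholder, and your existing context/user entries depend on the installation), the relevant cluster stanza would look roughly like:

```yaml
# Illustrative kubeconfig fragment only -- names and the CA placeholder are
# assumptions; keep your installation's existing credentials and contexts.
clusters:
- name: ha-masters
  cluster:
    certificate-authority-data: <existing base64 CA bundle>
    # Point at the clustered service (or the load balancer) instead of a
    # single master's URL, so the controllers survive that master going down:
    server: https://kubernetes.default.svc.cluster.local:443
```

The key change is only the `server:` line: any single-master URL there ties the controllers to one master, which is exactly the failure mode items 1 and 2 describe.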