Bug 1351645 - SkyDNS resolution intermittently fails when at least 1 master is down in an HA setup
Summary: SkyDNS resolution intermittently fails when at least 1 master is down in an HA setup
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.1.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 3.2.1
Assignee: Andy Goldstein
QA Contact: DeShuai Ma
URL:
Whiteboard:
Duplicates: 1370610
Depends On: 1300028
Blocks: OSOPS_V3 1267746 1286513
 
Reported: 2016-06-30 13:23 UTC by Andy Goldstein
Modified: 2020-01-17 15:48 UTC
CC List: 25 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: In an HA environment with multiple masters, one or more of the masters goes down.
Consequence: DNS requests sent to the cluster nameserver running at kubernetes.default.svc.cluster.local can become slow, which can make builds take significantly longer than usual if they perform several DNS lookups.
Fix: All the masters now coordinate to maintain an up-to-date list of endpoints for kubernetes.default.svc.cluster.local. If a master goes down, its endpoint is removed from the list; note that removal may take up to 20 seconds. When a master comes back up, its endpoint is reinserted into the list.
Result: DNS resolution returns to normal once the endpoints list is updated to remove the down master.
Clone Of: 1300028
Environment:
Last Closed: 2016-07-20 19:37:27 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 2216911 0 None None None 2016-06-30 13:23:04 UTC
Red Hat Product Errata RHBA-2016:1466 0 normal SHIPPED_LIVE Red Hat OpenShift Enterprise 3.2.1.9 security and bug fix update 2016-07-20 23:37:14 UTC
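For reference, the behavior described in the Doc Text above can be exercised by querying the cluster nameserver directly. This is only an illustrative sketch, not part of the original report: it assumes dig is available on a node and uses the kubernetes service cluster IP shown in Comment 6 below.

# dig +short kubernetes.default.svc.cluster.local @172.31.0.1
172.31.0.1

With the fix, the query should keep answering promptly while one master is down, since the dead master's endpoint is dropped from the kubernetes service within roughly 20 seconds.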

Comment 6 Zhang Cheng 2016-07-15 10:41:20 UTC
Although this fix works in OSE 3.3, it doesn't work in OSE 3.2. QE tested on openshift v3.2.1.9-1-g2265530, kubernetes v1.2.0-36-g4a3f9c5, etcd 2.2.5. The endpoints are updated to follow the api servers' status, but deployments and builds cannot be triggered while the api server on one master is stopped.

Reproduce steps:
1. Check current endpoints status
# oc describe svc kubernetes -n default
Name:			kubernetes
Namespace:		default
Labels:			component=apiserver,provider=kubernetes
Selector:		<none>
Type:			ClusterIP
IP:			172.31.0.1
Port:			https	443/TCP
Endpoints:		192.168.1.118:443,192.168.1.119:443
Port:			dns	53/UDP
Endpoints:		192.168.1.118:8053,192.168.1.119:8053
Port:			dns-tcp	53/TCP
Endpoints:		192.168.1.118:8053,192.168.1.119:8053
Session Affinity:	None
No events.
2. Stop the api server on one master:
# systemctl stop atomic-openshift-master-api
3. Check endpoint status:
# oc describe svc kubernetes -n default
Name:			kubernetes
Namespace:		default
Labels:			component=apiserver,provider=kubernetes
Selector:		<none>
Type:			ClusterIP
IP:			172.31.0.1
Port:			https	443/TCP
Endpoints:		192.168.1.118:443
Port:			dns	53/UDP
Endpoints:		192.168.1.118:8053
Port:			dns-tcp	53/TCP
Endpoints:		192.168.1.118:8053
Session Affinity:	None
No events.
4. Create an app from the dancer-example template
# oc new-app --template=dancer-example -n cheng
5. Check build and pod status

Actual results: Deployment and build are not triggered
5. Check build and pod status
# oc get build
# oc get pod

Expected results: Deployment and build are triggered
5. Check build and pod status
# oc get build
NAME               TYPE      FROM          STATUS    STARTED          DURATION
dancer-example-1   Source    Git@11c93c3   Running   56 seconds ago   56s
# oc get pod
NAME                     READY     STATUS    RESTARTS   AGE
dancer-example-1-build   1/1       Running   0          1m

Additional info: 
Step 6: The build is triggered once the api server that was stopped in step 2 is started again.
# systemctl start atomic-openshift-master-api
# oc get build
NAME               TYPE      FROM          STATUS    STARTED          DURATION
dancer-example-1   Source    Git@11c93c3   Running   56 seconds ago   56s
# oc get pod
NAME                     READY     STATUS    RESTARTS   AGE
dancer-example-1-build   1/1       Running   0          1m
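A note on debugging step 5 when it shows the failing behavior: the two commands below are a hedged sketch based on the checks requested in Comment 7; the buildconfig name dancer-example is assumed from the template name.

# oc get events -n cheng
# oc describe buildconfig dancer-example -n cheng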

Comment 7 Andy Goldstein 2016-07-15 13:36:26 UTC
Could you please retest, and if it fails again, capture the information we need to be able to debug:

- where did you test (e.g. AWS, local, ...)?
- how many masters?
- how many nodes?
- what load balancer did you use? if haproxy, logs from it
- logs from all atomic-openshift-master-api services
- logs from all atomic-openshift-master-controller services
- logs from all atomic-openshift-node services
- oc describe buildconfig/<name of build config>
- oc get events --all-namespaces
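As a hedged sketch, one way to capture the requested service logs (the systemd unit names are assumed to match the service names above and may differ per install):

# journalctl -u atomic-openshift-master-api --no-pager > master-api.log
# journalctl -u atomic-openshift-master-controllers --no-pager > master-controllers.log
# journalctl -u atomic-openshift-node --no-pager > node.log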

Thanks!

Comment 8 Andy Goldstein 2016-07-15 14:09:28 UTC
Also, please include the master and node config files.

Comment 10 Zhang Cheng 2016-07-18 07:33:33 UTC
@Andy Goldstein
MTV2 is on OpenStack, so you may not be able to access it.
The attachments exclude master1-controller-service-log, which is about 93 MB and exceeds the attachment size limit. I will attach it to an email and send it to you.

Comment 12 Zhang Cheng 2016-07-18 08:39:37 UTC
@Andy Goldstein
Because master1-controller-service-log is larger than the email size limit, I have put it on my Google Drive and shared it with you.

Comment 15 Andy Goldstein 2016-07-18 17:21:00 UTC
There are a few items to point out:

1) Until https://github.com/openshift/openshift-ansible/issues/1563 is resolved, you will have to manually configure /etc/origin/master/openshift-master.kubeconfig to point either to the load balancer or to kubernetes.default.svc.cluster.local. This is used by the controllers to know the URL and credentials for the masters. Out of the box, this file is not configured to talk to an HA endpoint. Given that the fix for this bug updates the endpoints for kubernetes.default.svc.cluster.local, I would recommend updating the config to point to that URL.

2) The controllers talk directly to etcd to attempt to acquire the lease to become the active controller. As long as the active controller is still able to talk to etcd, it will remain active. In the event that the active controller is configured to talk only to its colocated master, and not to the load balancer or kubernetes service, it will happily continue being the active controller, even after the master goes down.

3) As mentioned before, it may take 10 to 20 seconds before a now-dead master's endpoint is removed from the list of endpoints for the kubernetes service.
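For point 1, a minimal sketch of the relevant cluster entry in /etc/origin/master/openshift-master.kubeconfig after repointing it at the service URL (only the server line is the point here; the name and certificate values are placeholders, and the real file also contains user credentials):

clusters:
- cluster:
    certificate-authority-data: <ca-data>
    server: https://kubernetes.default.svc.cluster.local:443
  name: kubernetes-default

For point 3, the endpoint removal can be observed with a watch on the same service shown in Comment 6:

# oc get endpoints kubernetes -n default --watch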

Comment 16 Zhang Cheng 2016-07-19 02:15:00 UTC
@Andy Goldstein Thanks for your clarification.

Deployment and build are triggered successfully after manually configuring /etc/origin/master/openshift-master.kubeconfig to point to the load balancer.

I will mark the status as verified per the discussion above; we already have https://github.com/openshift/openshift-ansible/issues/1563 to track that specific scenario.

Comment 18 errata-xmlrpc 2016-07-20 19:37:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1466

Comment 19 Derek Carr 2016-09-30 14:08:30 UTC
*** Bug 1370610 has been marked as a duplicate of this bug. ***

