1300028 – SkyDNS resolution intermittently fails when at least 1 master is down in an HA setup

Bug 1300028 - SkyDNS resolution intermittently fails when at least 1 master is down in an HA setup

Summary: SkyDNS resolution intermittently fails when at least 1 master is down in an H...

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Node
Sub Component:
Version:	3.1.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Andy Goldstein
QA Contact:	DeShuai Ma
Docs Contact:
URL:
Whiteboard:
Duplicates (2):	1321093 1342933 (view as bug list)
Depends On:
Blocks:	OSOPS_V3 1267746 1286513 1351645
TreeView+	depends on / blocked

Reported:	2016-01-19 19:23 UTC by Wesley Hearn
Modified:	2020-01-17 15:38 UTC (History)
CC List:	26 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Clones:	1351645 (view as bug list)
Environment:
Last Closed:	2016-09-27 09:34:59 UTC
Target Upstream Version:
Embargoed:
Flags:	decarr: needinfo-

Attachments	(Terms of Use)
Build log (5.63 KB, application/x-xz) 2016-01-19 19:23 UTC, Wesley Hearn	no flags	Details
iostat output (16.98 KB, application/x-xz) 2016-01-19 19:23 UTC, Wesley Hearn	no flags	Details
View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Knowledge Base (Solution)	2216911	0	None	None	None	2016-04-01 14:32:15 UTC
Red Hat Product Errata	RHBA-2016:1933	0	normal	SHIPPED_LIVE	Red Hat OpenShift Container Platform 3.3 Release Advisory	2016-09-27 13:24:36 UTC

Description Wesley Hearn 2016-01-19 19:23:20 UTC

Created attachment 1116337 [details]
Build log

Description of problem:
On a new cluster a build of the sample dancer application takes about 1 hour 

Version-Release number of selected component (if applicable):
3.1.0.4-1.git.15.5e061c3.el7aos.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Spin up a new native ha cluster on AWS (m4.xlarge)
2. Setup  docker registry to use S3
3. Build the sample dancer application

Actual results:
[0]Pluto:~ $ oc get builds -n wesleyhearn
NAME               TYPE      FROM      STATUS     STARTED       DURATION
dancer-example-1   Source    Git       Complete   3 hours ago   56m25s


Expected results:
[0]Pluto:~ $ oc get builds -n wesleyhearn
NAME               TYPE      FROM      STATUS     STARTED       DURATION
dancer-example-1   Source    Git       Complete   3 hours ago   3m25s


Additional info:
Attached is the build log and iostats from during the time window

Comment 1 Wesley Hearn 2016-01-19 19:23:37 UTC

Created attachment 1116338 [details]
iostat output

Comment 2 Cesar Wong 2016-01-19 20:35:09 UTC

Both the build and push seem to be taking an extremely long time. I wonder if it's got something to do with the network.

Unfortunately there's nothing in the assemble script to indicate how long each step is taking.

The assemble script is started at 11:18, and the container exits at 11:44 (26 min)

Then it starts pushing the image at 11:44 and it finishes at 12:13 (29 min)

Ben, sending your way

Comment 3 Ben Parees 2016-01-19 20:45:27 UTC

likely both have the same root cause but i'd suggest starting by analyzing the push side of the equation since we can at least look at the docker registry logs to see if there's anything interesting there.

and i agree, network seems like the probable cause since all the assemble script is doing is pulling down dependencies.

Andy can you suggest some debug steps for why the push would be taking 30 mins?

Comment 4 Andy Goldstein 2016-01-19 20:47:30 UTC

What is the MTU setting (https://docs.openshift.com/enterprise/3.1/install_config/configuring_sdn.html#configuring-the-pod-network-on-nodes)? How does it compare to the NIC's MTU?

Comment 5 Wesley Hearn 2016-01-19 20:51:49 UTC

A difference of 50.

[root@ip-172-31-0-58 ~]# grep mtu /etc/origin/node/node-config.yaml 
   mtu: 8951
[root@ip-172-31-0-58 ~]# ip a l eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc pfifo_fast state UP qlen 1000
    link/ether 0a:20:b0:f1:d7:23 brd ff:ff:ff:ff:ff:ff
    inet 172.31.0.58/20 brd 172.31.15.255 scope global dynamic eth0
       valid_lft 2850sec preferred_lft 2850sec
    inet6 fe80::820:b0ff:fef1:d723/64 scope link 
       valid_lft forever preferred_lft forever
[root@ip-172-31-0-58 ~]#

Comment 6 Ben Parees 2016-01-19 21:21:58 UTC

Would like the networking team to look into whether the network is performing correctly since it seems like both of these have a common denominator of being networking ops (pulling dependencies from the perl dep server, and pushing an image to the docker registry).

Comment 7 Dan Winship 2016-01-19 21:47:08 UTC

Is that with log level 5? If not, could you get logs with that?

Also, some tcpdump output from during the pull and/or push ops would help narrow things down.

Comment 8 Ben Bennett 2016-01-20 14:20:04 UTC

Running iperf3 server inside the registry pod and doing tests to it from a remote pod and node showed throughput of between 580Mbps and 1Gps depending on where I tested from... so the raw connectivity seems correct.  I'll try a build next and see what happens from those pods...

Comment 9 Ben Bennett 2016-01-20 15:06:41 UTC

Those builds are really slow.  The download portion is the fastest step below:

# oc logs dancer-mysql-example-1-build -n bb-test -f | awk '{now=strftime("%F %T%z\t");sub(/^/, now);print}'
...
2016-01-20 09:31:40-0500	--> Working on Plack
2016-01-20 09:31:51-0500	Fetching http://www.cpan.org/authors/id/M/MI/MIYAGAWA/Plack-1.0039.tar.gz ... OK
2016-01-20 09:31:51-0500	Configuring Plack-1.0039 ... OK
2016-01-20 09:31:51-0500	==> Found dependencies: Cookie::Baker, Filesys::Notify::Simple, Hash::MultiValue, Apache::LogFormat::Compiler, Devel::StackTrace::AsHTML, HTTP::Message, Stream::Buffered, HTTP::Body, File::ShareDir, Test::TCP
2016-01-20 09:31:56-0500	--> Working on Cookie::Baker
2016-01-20 09:32:06-0500	Fetching http://www.cpan.org/authors/id/K/KA/KAZEBURO/Cookie-Baker-0.06.tar.gz ... OK
2016-01-20 09:32:07-0500	Configuring Cookie-Baker-0.06 ... OK
2016-01-20 09:32:07-0500	Building Cookie-Baker-0.06 ... OK
2016-01-20 09:32:07-0500	Successfully installed Cookie-Baker-0.06
2016-01-20 09:32:17-0500	--> Working on Filesys::Notify::Simple
2016-01-20 09:32:22-0500	Fetching http://www.cpan.org/authors/id/M/MI/MIYAGAWA/Filesys-Notify-Simple-0.12.tar.gz ... OK
2016-01-20 09:32:22-0500	Configuring Filesys-Notify-Simple-0.12 ... OK
2016-01-20 09:32:23-0500	Building Filesys-Notify-Simple-0.12 ... OK
2016-01-20 09:32:23-0500	Successfully installed Filesys-Notify-Simple-0.12
2016-01-20 09:32:33-0500	--> Working on Hash::MultiValue
2016-01-20 09:32:43-0500	Fetching http://www.cpan.org/authors/id/A/AR/ARISTOTLE/Hash-MultiValue-0.16.tar.gz ... OK
2016-01-20 09:32:43-0500	Configuring Hash-MultiValue-0.16 ... OK
2016-01-20 09:32:44-0500	Building Hash-MultiValue-0.16 ... OK
2016-01-20 09:32:44-0500	Successfully installed Hash-MultiValue-0.16
2016-01-20 09:32:49-0500	--> Working on Apache::LogFormat::Compiler
2016-01-20 09:32:59-0500	Fetching http://www.cpan.org/authors/id/K/KA/KAZEBURO/Apache-LogFormat-Compiler-0.32.tar.gz ... OK
2016-01-20 09:32:59-0500	Configuring Apache-LogFormat-Compiler-0.32 ... OK
2016-01-20 09:32:59-0500	==> Found dependencies: POSIX::strftime::Compiler
2016-01-20 09:33:09-0500	--> Working on POSIX::strftime::Compiler
2016-01-20 09:33:15-0500	Fetching http://www.cpan.org/authors/id/K/KA/KAZEBURO/POSIX-strftime-Compiler-0.41.tar.gz ... OK
2016-01-20 09:33:15-0500	Configuring POSIX-strftime-Compiler-0.41 ... OK
2016-01-20 09:33:15-0500	Building POSIX-strftime-Compiler-0.41 ... OK
2016-01-20 09:33:15-0500	Successfully installed POSIX-strftime-Compiler-0.41
2016-01-20 09:33:16-0500	Building Apache-LogFormat-Compiler-0.32 ... OK
2016-01-20 09:33:16-0500	Successfully installed Apache-LogFormat-Compiler-0.32

Comment 10 Ben Bennett 2016-01-20 19:12:46 UTC

Ok.  The root cause turned out to be terribly slow dns performance on the pulls.

Then I noticed that only one in three was slow... and then we found that an API server had stopped responding.

That was also why the pulls were slow, because one in three api calls was timing out.

Comment 11 Andy Goldstein 2016-02-01 16:45:53 UTC

More details:

- pods are configured to use the `kubernetes` service IP address for DNS
- in this case, the `kubernetes` service is backed by 3 endpoints: the 3 masters
- there is no health check on the masters, so there will always be 3 endpoints in the list, regardless of up/down status
- the userspace proxy round robins each request
- the proxy tries 4 times to connect to a backend endpoint, waiting .25, .5, 1, and 2 seconds respectively before trying the next endpoint.

The question at hand is how can we avoid these delays? Also, what does the behavior look like with the iptables proxy?

Comment 12 Derek Carr 2016-02-03 16:23:14 UTC

Solly - can you investigate while Andy is out?

Comment 13 Solly Ross 2016-02-03 19:13:27 UTC

It seems like we should have a health check run on other masters by the active controller-manager master.  A control loop could check on the other apiservers, and remove the relevant endpoints from the "default/kubernetes" service if the health checks fail.

Comment 14 Solly Ross 2016-02-03 21:36:49 UTC

Alternatively, an external monitoring tool could be taught how to remove the endpoints for the `default/kubernetes` service, and then if/when it restarted the master, the master would re-add itself.

Comment 15 Clayton Coleman 2016-02-04 19:56:40 UTC

I would prefer the masters check each other IF we can guarantee that masters have reachability to them.

Comment 16 Andy Goldstein 2016-02-09 15:49:47 UTC

Solly is working on this. Setting to assigned.

Comment 17 Solly Ross 2016-02-10 15:22:46 UTC

I've opened a PR against upstream kube (which can then be backported to Origin) which adds an optional controller loop to periodically check the health of the API servers: https://github.com/kubernetes/kubernetes/pull/20975

Comment 18 Kenjiro Nakayama 2016-03-24 04:45:49 UTC

Hi Solly and Wesley,

Can you change the subject of this bugzilla? 
The "Steps to Reproduce" on first comment is different from the current working issue, so it is really difficult to find this bug ticket...

Comment 19 Andy Goldstein 2016-03-24 16:12:08 UTC

*** Bug 1321093 has been marked as a duplicate of this bug. ***

Comment 23 Andy Goldstein 2016-04-11 15:53:43 UTC

The current workaround is to ensure that you have a monitoring tool (e.g. monit or equivalent) that can constantly watch the master api processes and ensure they are restarted if they ever exit.

Comment 26 Marc Jadoul 2016-04-23 08:06:03 UTC

I do not think it should be only solved through monitoring or solving kubernete bugs... Actually all resolv.conf for infra and pods should point on all available DNS instance and not to the service.
It might not fully solve the issue but at least applications will continue to work.
Currently this bug make Openshift unusable in prod: you simply can't maintain the masters.

Comment 37 Ben Bennett 2016-06-06 12:30:02 UTC

*** Bug 1342933 has been marked as a duplicate of this bug. ***

Comment 38 Mike Barrett 2016-06-07 18:41:55 UTC

We have had success getting movement on a kube 1.3 PR that will be backported to OSE 3.2.1.z between July and September.

Comment 39 Andy Goldstein 2016-06-21 17:13:47 UTC

PR in progress: https://github.com/openshift/origin/pull/9469

Comment 40 Andy Goldstein 2016-06-30 13:15:18 UTC

PR merged

Comment 42 Qixuan Wang 2016-07-13 09:37:56 UTC

The PR is in OSE3.3 now. We QE tested on openshift v3.3.0.4, kubernetes v1.3.0+57fb9ac, etcd 2.3.0+git. The bug has been fixed. Once it is ON_QA, we can verify it.

Comment 43 Qixuan Wang 2016-07-14 06:05:05 UTC

Here are test record:

1. [root@ip-172-18-15-105 ~]# oc describe endpoints kubernetes -n default
Name:        kubernetes
Namespace:    default
Labels:        <none>
Subsets:
  Addresses:        172.18.15.104,172.18.15.105
  NotReadyAddresses:    <none>
  Ports:
    Name    Port    Protocol
    ----    ----    --------
    https    443    TCP
    dns-tcp    8053    TCP
    dns        8053    UDP

No events.

2. [root@ip-172-18-15-105 ~]# service atomic-openshift-master-api stop
Redirecting to /bin/systemctl stop  atomic-openshift-master-api.service

3. One of endpoint is removed after one api server is down. 
[root@ip-172-18-15-105 ~]# oc describe endpoints kubernetes -n default
Name:        kubernetes
Namespace:    default
Labels:        <none>
Subsets:
  Addresses:        172.18.15.104
  NotReadyAddresses:    <none>
  Ports:
    Name    Port    Protocol
    ----    ----    --------
    https    443    TCP
    dns-tcp    8053    TCP
    dns        8053    UDP

No events.

4. One master is down won't affect pod build/deploy.(Even stop/start master during pod build/deploy)
[root@ip-172-18-15-105 ~]# oc new-app --template=dancer-example -n qwang1

[root@ip-172-18-15-105 ~]# oc get build -n qwang1
NAME               TYPE      FROM          STATUS     STARTED         DURATION
dancer-example-1   Source    Git@11c93c3   Complete   4 minutes ago   4m51s

[root@ip-172-18-15-105 ~]# oc get pod -n qwang1
NAME                     READY     STATUS      RESTARTS   AGE
dancer-example-1-build   0/1       Completed   0          6m
dancer-example-1-oddpt   1/1       Running     0          28s

Comment 44 Scott Dodson 2016-08-16 19:18:32 UTC

*** Bug 1286513 has been marked as a duplicate of this bug. ***

Comment 48 errata-xmlrpc 2016-09-27 09:34:59 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1933

Note You need to log in before you can comment on or make changes to this bug.

agoldste
akokshar
aos-bugs
bbennett
ccoleman
chezhang
danw
decarr
erich
erjones
jialiu
jkaur
jokerman
knakayam
marc.jadoul
mbarrett
misalunk
mmccomas
pep
qixuan.wang
rhowe
sdodson
steven
tdawson
twiest
xtian