Bug 1300028 - SkyDNS resolution intermittently fails when at least 1 master is down in an HA setup
Summary: SkyDNS resolution intermittently fails when at least 1 master is down in an H...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.1.0
Hardware: Unspecified
OS: Unspecified
urgent
high
Target Milestone: ---
: ---
Assignee: Andy Goldstein
QA Contact: DeShuai Ma
URL:
Whiteboard:
: 1321093 1342933 (view as bug list)
Depends On:
Blocks: OSOPS_V3 1267746 1286513 1351645
TreeView+ depends on / blocked
 
Reported: 2016-01-19 19:23 UTC by Wesley Hearn
Modified: 2019-11-14 07:20 UTC (History)
26 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1351645 (view as bug list)
Environment:
Last Closed: 2016-09-27 09:34:59 UTC
Target Upstream Version:
decarr: needinfo-


Attachments (Terms of Use)
Build log (5.63 KB, application/x-xz)
2016-01-19 19:23 UTC, Wesley Hearn
no flags Details
iostat output (16.98 KB, application/x-xz)
2016-01-19 19:23 UTC, Wesley Hearn
no flags Details


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2016:1933 normal SHIPPED_LIVE Red Hat OpenShift Container Platform 3.3 Release Advisory 2016-09-27 13:24:36 UTC
Red Hat Knowledge Base (Solution) 2216911 None None None 2016-04-01 14:32:15 UTC

Description Wesley Hearn 2016-01-19 19:23:20 UTC
Created attachment 1116337 [details]
Build log

Description of problem:
On a new cluster a build of the sample dancer application takes about 1 hour 

Version-Release number of selected component (if applicable):
3.1.0.4-1.git.15.5e061c3.el7aos.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Spin up a new native ha cluster on AWS (m4.xlarge)
2. Setup  docker registry to use S3
3. Build the sample dancer application

Actual results:
[0]Pluto:~ $ oc get builds -n wesleyhearn
NAME               TYPE      FROM      STATUS     STARTED       DURATION
dancer-example-1   Source    Git       Complete   3 hours ago   56m25s


Expected results:
[0]Pluto:~ $ oc get builds -n wesleyhearn
NAME               TYPE      FROM      STATUS     STARTED       DURATION
dancer-example-1   Source    Git       Complete   3 hours ago   3m25s


Additional info:
Attached is the build log and iostats from during the time window

Comment 1 Wesley Hearn 2016-01-19 19:23:37 UTC
Created attachment 1116338 [details]
iostat output

Comment 2 Cesar Wong 2016-01-19 20:35:09 UTC
Both the build and push seem to be taking an extremely long time. I wonder if it's got something to do with the network.

Unfortunately there's nothing in the assemble script to indicate how long each step is taking.

The assemble script is started at 11:18, and the container exits at 11:44 (26 min)

Then it starts pushing the image at 11:44 and it finishes at 12:13 (29 min)

Ben, sending your way

Comment 3 Ben Parees 2016-01-19 20:45:27 UTC
likely both have the same root cause but i'd suggest starting by analyzing the push side of the equation since we can at least look at the docker registry logs to see if there's anything interesting there.

and i agree, network seems like the probable cause since all the assemble script is doing is pulling down dependencies.

Andy can you suggest some debug steps for why the push would be taking 30 mins?

Comment 4 Andy Goldstein 2016-01-19 20:47:30 UTC
What is the MTU setting (https://docs.openshift.com/enterprise/3.1/install_config/configuring_sdn.html#configuring-the-pod-network-on-nodes)? How does it compare to the NIC's MTU?

Comment 5 Wesley Hearn 2016-01-19 20:51:49 UTC
A difference of 50.

[root@ip-172-31-0-58 ~]# grep mtu /etc/origin/node/node-config.yaml 
   mtu: 8951
[root@ip-172-31-0-58 ~]# ip a l eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc pfifo_fast state UP qlen 1000
    link/ether 0a:20:b0:f1:d7:23 brd ff:ff:ff:ff:ff:ff
    inet 172.31.0.58/20 brd 172.31.15.255 scope global dynamic eth0
       valid_lft 2850sec preferred_lft 2850sec
    inet6 fe80::820:b0ff:fef1:d723/64 scope link 
       valid_lft forever preferred_lft forever
[root@ip-172-31-0-58 ~]#

Comment 6 Ben Parees 2016-01-19 21:21:58 UTC
Would like the networking team to look into whether the network is performing correctly since it seems like both of these have a common denominator of being networking ops (pulling dependencies from the perl dep server, and pushing an image to the docker registry).

Comment 7 Dan Winship 2016-01-19 21:47:08 UTC
Is that with log level 5? If not, could you get logs with that?

Also, some tcpdump output from during the pull and/or push ops would help narrow things down.

Comment 8 Ben Bennett 2016-01-20 14:20:04 UTC
Running iperf3 server inside the registry pod and doing tests to it from a remote pod and node showed throughput of between 580Mbps and 1Gps depending on where I tested from... so the raw connectivity seems correct.  I'll try a build next and see what happens from those pods...

Comment 9 Ben Bennett 2016-01-20 15:06:41 UTC
Those builds are really slow.  The download portion is the fastest step below:

# oc logs dancer-mysql-example-1-build -n bb-test -f | awk '{now=strftime("%F %T%z\t");sub(/^/, now);print}'
...
2016-01-20 09:31:40-0500	--> Working on Plack
2016-01-20 09:31:51-0500	Fetching http://www.cpan.org/authors/id/M/MI/MIYAGAWA/Plack-1.0039.tar.gz ... OK
2016-01-20 09:31:51-0500	Configuring Plack-1.0039 ... OK
2016-01-20 09:31:51-0500	==> Found dependencies: Cookie::Baker, Filesys::Notify::Simple, Hash::MultiValue, Apache::LogFormat::Compiler, Devel::StackTrace::AsHTML, HTTP::Message, Stream::Buffered, HTTP::Body, File::ShareDir, Test::TCP
2016-01-20 09:31:56-0500	--> Working on Cookie::Baker
2016-01-20 09:32:06-0500	Fetching http://www.cpan.org/authors/id/K/KA/KAZEBURO/Cookie-Baker-0.06.tar.gz ... OK
2016-01-20 09:32:07-0500	Configuring Cookie-Baker-0.06 ... OK
2016-01-20 09:32:07-0500	Building Cookie-Baker-0.06 ... OK
2016-01-20 09:32:07-0500	Successfully installed Cookie-Baker-0.06
2016-01-20 09:32:17-0500	--> Working on Filesys::Notify::Simple
2016-01-20 09:32:22-0500	Fetching http://www.cpan.org/authors/id/M/MI/MIYAGAWA/Filesys-Notify-Simple-0.12.tar.gz ... OK
2016-01-20 09:32:22-0500	Configuring Filesys-Notify-Simple-0.12 ... OK
2016-01-20 09:32:23-0500	Building Filesys-Notify-Simple-0.12 ... OK
2016-01-20 09:32:23-0500	Successfully installed Filesys-Notify-Simple-0.12
2016-01-20 09:32:33-0500	--> Working on Hash::MultiValue
2016-01-20 09:32:43-0500	Fetching http://www.cpan.org/authors/id/A/AR/ARISTOTLE/Hash-MultiValue-0.16.tar.gz ... OK
2016-01-20 09:32:43-0500	Configuring Hash-MultiValue-0.16 ... OK
2016-01-20 09:32:44-0500	Building Hash-MultiValue-0.16 ... OK
2016-01-20 09:32:44-0500	Successfully installed Hash-MultiValue-0.16
2016-01-20 09:32:49-0500	--> Working on Apache::LogFormat::Compiler
2016-01-20 09:32:59-0500	Fetching http://www.cpan.org/authors/id/K/KA/KAZEBURO/Apache-LogFormat-Compiler-0.32.tar.gz ... OK
2016-01-20 09:32:59-0500	Configuring Apache-LogFormat-Compiler-0.32 ... OK
2016-01-20 09:32:59-0500	==> Found dependencies: POSIX::strftime::Compiler
2016-01-20 09:33:09-0500	--> Working on POSIX::strftime::Compiler
2016-01-20 09:33:15-0500	Fetching http://www.cpan.org/authors/id/K/KA/KAZEBURO/POSIX-strftime-Compiler-0.41.tar.gz ... OK
2016-01-20 09:33:15-0500	Configuring POSIX-strftime-Compiler-0.41 ... OK
2016-01-20 09:33:15-0500	Building POSIX-strftime-Compiler-0.41 ... OK
2016-01-20 09:33:15-0500	Successfully installed POSIX-strftime-Compiler-0.41
2016-01-20 09:33:16-0500	Building Apache-LogFormat-Compiler-0.32 ... OK
2016-01-20 09:33:16-0500	Successfully installed Apache-LogFormat-Compiler-0.32

Comment 10 Ben Bennett 2016-01-20 19:12:46 UTC
Ok.  The root cause turned out to be terribly slow dns performance on the pulls.

Then I noticed that only one in three was slow... and then we found that an API server had stopped responding.

That was also why the pulls were slow, because one in three api calls was timing out.

Comment 11 Andy Goldstein 2016-02-01 16:45:53 UTC
More details:

- pods are configured to use the `kubernetes` service IP address for DNS
- in this case, the `kubernetes` service is backed by 3 endpoints: the 3 masters
- there is no health check on the masters, so there will always be 3 endpoints in the list, regardless of up/down status
- the userspace proxy round robins each request
- the proxy tries 4 times to connect to a backend endpoint, waiting .25, .5, 1, and 2 seconds respectively before trying the next endpoint.

The question at hand is how can we avoid these delays? Also, what does the behavior look like with the iptables proxy?

Comment 12 Derek Carr 2016-02-03 16:23:14 UTC
Solly - can you investigate while Andy is out?

Comment 13 Solly Ross 2016-02-03 19:13:27 UTC
It seems like we should have a health check run on other masters by the active controller-manager master.  A control loop could check on the other apiservers, and remove the relevant endpoints from the "default/kubernetes" service if the health checks fail.

Comment 14 Solly Ross 2016-02-03 21:36:49 UTC
Alternatively, an external monitoring tool could be taught how to remove the endpoints for the `default/kubernetes` service, and then if/when it restarted the master, the master would re-add itself.

Comment 15 Clayton Coleman 2016-02-04 19:56:40 UTC
I would prefer the masters check each other IF we can guarantee that masters have reachability to them.

Comment 16 Andy Goldstein 2016-02-09 15:49:47 UTC
Solly is working on this. Setting to assigned.

Comment 17 Solly Ross 2016-02-10 15:22:46 UTC
I've opened a PR against upstream kube (which can then be backported to Origin) which adds an optional controller loop to periodically check the health of the API servers: https://github.com/kubernetes/kubernetes/pull/20975

Comment 18 Kenjiro Nakayama 2016-03-24 04:45:49 UTC
Hi Solly and Wesley,

Can you change the subject of this bugzilla? 
The "Steps to Reproduce" on first comment is different from the current working issue, so it is really difficult to find this bug ticket...

Comment 19 Andy Goldstein 2016-03-24 16:12:08 UTC
*** Bug 1321093 has been marked as a duplicate of this bug. ***

Comment 23 Andy Goldstein 2016-04-11 15:53:43 UTC
The current workaround is to ensure that you have a monitoring tool (e.g. monit or equivalent) that can constantly watch the master api processes and ensure they are restarted if they ever exit.

Comment 26 Marc Jadoul 2016-04-23 08:06:03 UTC
I do not think it should be only solved through monitoring or solving kubernete bugs... Actually all resolv.conf for infra and pods should point on all available DNS instance and not to the service.
It might not fully solve the issue but at least applications will continue to work.
Currently this bug make Openshift unusable in prod: you simply can't maintain the masters.

Comment 37 Ben Bennett 2016-06-06 12:30:02 UTC
*** Bug 1342933 has been marked as a duplicate of this bug. ***

Comment 38 Mike Barrett 2016-06-07 18:41:55 UTC
We have had success getting movement on a kube 1.3 PR that will be backported to OSE 3.2.1.z between July and September.

Comment 39 Andy Goldstein 2016-06-21 17:13:47 UTC
PR in progress: https://github.com/openshift/origin/pull/9469

Comment 40 Andy Goldstein 2016-06-30 13:15:18 UTC
PR merged

Comment 42 Qixuan Wang 2016-07-13 09:37:56 UTC
The PR is in OSE3.3 now. We QE tested on openshift v3.3.0.4, kubernetes v1.3.0+57fb9ac, etcd 2.3.0+git. The bug has been fixed. Once it is ON_QA, we can verify it.

Comment 43 Qixuan Wang 2016-07-14 06:05:05 UTC
Here are test record:

1. [root@ip-172-18-15-105 ~]# oc describe endpoints kubernetes -n default
Name:        kubernetes
Namespace:    default
Labels:        <none>
Subsets:
  Addresses:        172.18.15.104,172.18.15.105
  NotReadyAddresses:    <none>
  Ports:
    Name    Port    Protocol
    ----    ----    --------
    https    443    TCP
    dns-tcp    8053    TCP
    dns        8053    UDP

No events.

2. [root@ip-172-18-15-105 ~]# service atomic-openshift-master-api stop
Redirecting to /bin/systemctl stop  atomic-openshift-master-api.service

3. One of endpoint is removed after one api server is down. 
[root@ip-172-18-15-105 ~]# oc describe endpoints kubernetes -n default
Name:        kubernetes
Namespace:    default
Labels:        <none>
Subsets:
  Addresses:        172.18.15.104
  NotReadyAddresses:    <none>
  Ports:
    Name    Port    Protocol
    ----    ----    --------
    https    443    TCP
    dns-tcp    8053    TCP
    dns        8053    UDP

No events.

4. One master is down won't affect pod build/deploy.(Even stop/start master during pod build/deploy)
[root@ip-172-18-15-105 ~]# oc new-app --template=dancer-example -n qwang1

[root@ip-172-18-15-105 ~]# oc get build -n qwang1
NAME               TYPE      FROM          STATUS     STARTED         DURATION
dancer-example-1   Source    Git@11c93c3   Complete   4 minutes ago   4m51s

[root@ip-172-18-15-105 ~]# oc get pod -n qwang1
NAME                     READY     STATUS      RESTARTS   AGE
dancer-example-1-build   0/1       Completed   0          6m
dancer-example-1-oddpt   1/1       Running     0          28s

Comment 44 Scott Dodson 2016-08-16 19:18:32 UTC
*** Bug 1286513 has been marked as a duplicate of this bug. ***

Comment 48 errata-xmlrpc 2016-09-27 09:34:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1933


Note You need to log in before you can comment on or make changes to this bug.