Bug 1382793

Summary: [OCP3.2] slow DNS lookups in dockerregistry process
Product: OpenShift Container Platform Reporter: Scott Dodson <sdodson>
Component: Image Registry Assignee: Michal Fojtik <mfojtik>
Status: CLOSED NEXTRELEASE QA Contact: Mike Fiedler <mifiedle>
Severity: medium Docs Contact:
Priority: medium    
Version: 3.2.1 CC: agoldste, aos-bugs, jgoulding, maszulik, mifiedle, pweil, twiest, yinzhou
Target Milestone: ---   
Target Release: 3.2.1   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: 1321968 Environment:
Last Closed: 2016-12-14 19:40:39 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1321968    
Bug Blocks: 1303130    

Description Scott Dodson 2016-10-07 18:26:24 UTC
+++ This bug was initially created as a clone of Bug #1321968 +++

We are using an S3-backed image registry, and whenever I am debugging performance issues with the registry, I notice this DNS lookup issue.  There are two DNS servers in the container's resolv.conf -- SkyDNS (172.30.0.1 in the logs below), and the normal Amazon DNS IP (172.31.0.2).

The search list in resolv.conf is: default.svc.cluster.local svc.cluster.local cluster.local ec2.internal, and ndots is 5
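
For reference, the container's /etc/resolv.conf as described would look roughly like this (reconstructed from the description above; the real file may differ in ordering or carry additional options):

nameserver 172.30.0.1
nameserver 172.31.0.2
search default.svc.cluster.local svc.cluster.local cluster.local ec2.internal
options ndots:5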

The DNS lookup sequence always looks like this:

16:05:08.291530 IP 10.1.9.5.38873 > 172.30.0.1.domain: 43023+ A? s3.amazonaws.com.default.svc.cluster.local. (60)
16:05:08.293001 IP 172.30.0.1.domain > 10.1.9.5.38873: 43023 NXDomain* 0/1/0 (114)
16:05:08.293088 IP 10.1.9.5.38873 > 172.30.0.1.domain: 57106+ AAAA? s3.amazonaws.com.default.svc.cluster.local. (60)
16:05:08.294491 IP 172.30.0.1.domain > 10.1.9.5.38873: 57106 NXDomain* 0/1/0 (114)
16:05:08.294614 IP 10.1.9.5.51218 > 172.30.0.1.domain: 36084+ A? s3.amazonaws.com.svc.cluster.local. (52)
16:05:08.297123 IP 172.30.0.1.domain > 10.1.9.5.51218: 36084 NXDomain* 0/1/0 (106)
16:05:08.297185 IP 10.1.9.5.51218 > 172.30.0.1.domain: 64990+ AAAA? s3.amazonaws.com.svc.cluster.local. (52)
16:05:08.298806 IP 172.30.0.1.domain > 10.1.9.5.51218: 64990 NXDomain* 0/1/0 (106)
16:05:08.298899 IP 10.1.9.5.55007 > 172.30.0.1.domain: 14232+ A? s3.amazonaws.com.cluster.local. (48)
16:05:08.300209 IP 172.30.0.1.domain > 10.1.9.5.55007: 14232 NXDomain* 0/1/0 (102)
16:05:08.300272 IP 10.1.9.5.55007 > 172.30.0.1.domain: 62807+ AAAA? s3.amazonaws.com.cluster.local. (48)
16:05:08.301311 IP 172.30.0.1.domain > 10.1.9.5.55007: 62807 NXDomain* 0/1/0 (102)
16:05:08.301402 IP 10.1.9.5.46588 > 172.30.0.1.domain: 10188+ A? s3.amazonaws.com.ec2.internal. (47)
16:05:08.302004 IP 172.30.0.1.domain > 10.1.9.5.46588: 10188 ServFail- 0/0/0 (47)
16:05:13.307102 IP 10.1.9.5.42971 > 172.31.0.2.domain: 10188+ A? s3.amazonaws.com.ec2.internal. (47)
16:05:13.310267 IP 172.31.0.2.domain > 10.1.9.5.42971: 10188 NXDomain 0/1/0 (108)
16:05:13.310332 IP 10.1.9.5.42971 > 172.31.0.2.domain: 62022+ AAAA? s3.amazonaws.com.ec2.internal. (47)
16:05:13.310526 IP 172.31.0.2.domain > 10.1.9.5.42971: 62022 NXDomain 0/1/0 (108)
16:05:13.310606 IP 10.1.9.5.50545 > 172.30.0.1.domain: 50172+ A? s3.amazonaws.com. (34)
16:05:13.311230 IP 172.30.0.1.domain > 10.1.9.5.50545: 50172 ServFail- 0/0/0 (34)
16:05:18.314720 IP 10.1.9.5.45503 > 172.31.0.2.domain: 50172+ A? s3.amazonaws.com. (34)
16:05:18.315647 IP 172.31.0.2.domain > 10.1.9.5.45503: 50172 2/0/0 CNAME s3-1.amazonaws.com., A 54.231.49.164 (69)

That's a pretty significant number of lookup attempts to have to make every time, but that's just an artifact of the kubernetes architecture, and I'm willing to look past that.  The really frustrating part here is the 5-second delay between the ServFail responses from SkyDNS for s3.amazonaws.com.ec2.internal and s3.amazonaws.com and the fallback to Amazon's DNS.  This is a ten-second performance penalty that becomes very painful if you're making lots of connections and doing a new lookup each time.
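
For illustration, here is a minimal Go sketch of the name expansion a stub resolver performs with the search list and ndots:5 shown above (illustrative only, not the actual glibc or Go resolver code): because "s3.amazonaws.com" has fewer than five dots, every search suffix is tried (for both A and AAAA) before the bare name is queried, which is exactly the sequence in the tcpdump.

package main

import (
	"fmt"
	"strings"
)

// candidates approximates how a stub resolver expands a name whose dot count
// is below ndots: each search-list suffix is appended first, and the name is
// tried as-is only at the end.
func candidates(name string, search []string, ndots int) []string {
	var out []string
	if strings.Count(name, ".") < ndots {
		for _, s := range search {
			out = append(out, name+"."+s)
		}
	}
	return append(out, name)
}

func main() {
	search := []string{"default.svc.cluster.local", "svc.cluster.local", "cluster.local", "ec2.internal"}
	for _, q := range candidates("s3.amazonaws.com", search, 5) {
		// Each candidate is queried for A and AAAA, falling back to the
		// second nameserver when the first fails.
		fmt.Println(q)
	}
}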

The problem does not reproduce with "getent hosts s3.amazonaws.com" in the same container.  The very same sequence of lookups happens, but there are no 5-second delays, which makes me think this is specific to DNS resolution in the Go libraries used by the docker registry.
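
One way to compare the two resolution paths directly inside the container is a timing sketch like the one below. This is only a sketch: it uses the PreferGo field on net.Resolver from current Go, which may be newer than the toolchain the registry was built with; GODEBUG=netdns=go or GODEBUG=netdns=cgo can be used instead to switch the resolver for an entire binary.

package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// timedLookup resolves the host with the given resolver and prints the elapsed time.
func timedLookup(label string, r *net.Resolver) {
	start := time.Now()
	addrs, err := r.LookupIPAddr(context.Background(), "s3.amazonaws.com")
	fmt.Printf("%s %v addrs=%d err=%v\n", label, time.Since(start), len(addrs), err)
}

func main() {
	// Pure Go resolver, which a statically linked Go binary such as the registry typically uses.
	timedLookup("go resolver:", &net.Resolver{PreferGo: true})
	// Default resolver; with cgo available this can take the glibc path, like getent does.
	timedLookup("default resolver:", net.DefaultResolver)
}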

--- Additional comment from Andy Grimm on 2016-03-29 10:31:52 EDT ---

This bug just got weirder.  I have always seen this issue when debugging pushes, and I just tried to reproduce with a query instead.  I logged into a master and ran:

docker login
docker images

And I verified that in this case the registry process made the DNS queries all within 0.05 seconds.  Somehow it looks specific to the upload code path.

--- Additional comment from Maciej Szulik on 2016-03-30 07:23:46 EDT ---

I wonder if this is connected with the other s3 issues we have:
https://bugzilla.redhat.com/show_bug.cgi?id=1314381
https://bugzilla.redhat.com/show_bug.cgi?id=1318939

--- Additional comment from Paul Weil on 2016-03-31 20:56:06 EDT ---

Andy,

I haven't been able to make much progress reproducing this locally.  I was able to look through the s3 code and (as I think Maciej mentioned this morning in IRC) it doesn't look like it has any special resolver logic.  It appears to just be using the net dialer to set up http requests.

I'm curious whether this affects any Go program.  Could you try running a tester like the following:

package main

import (
	"fmt"
	"net"
)

func main() {
	// Resolve the S3 endpoint the same way the registry's HTTP client would.
	addr, err := net.LookupIP("s3.amazonaws.com")
	if err != nil {
		fmt.Printf("err: %v\n", err)
	}
	fmt.Printf("addr: %v\n", addr)
}


and see if you see the same pauses in the tcpdump?

I'm also curious if adding "options timeout:1" to resolv.conf offers any relief as a possible workaround while we continue troubleshooting, since this may be the root cause of other delays/timeouts.
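
For reference, that would mean changing the options line in the container's /etc/resolv.conf to something like the following (illustrative; glibc's syntax is timeout:n):

options ndots:5 timeout:1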

--- Additional comment from Paul Weil on 2016-04-01 08:35:18 EDT ---

Trying to reproduce locally, I'm not seeing any pauses in my resolver: https://gist.github.com/pweil-/4e01eb8838e018a91998a1a5b764350c

--- Additional comment from Paul Weil on 2016-04-01 09:48:35 EDT ---

Unable to reproduce on my own AWS instance either.  Tcpdumps of both the test program and a manual docker push to the internal registry: https://gist.github.com/pweil-/b69a3c7462754ab12546022cc439bb8b

--- Additional comment from Paul Weil on 2016-04-01 11:22:49 EDT ---

<agrimm> pweil, the cluster where I was reproducing the slow pushes earlier in the week is no longer having that issue.  You can just leave that bug in my court until I find a place to reproduce it again.

Removing from blocker list for now

--- Additional comment from Andy Goldstein on 2016-04-01 12:38:34 EDT ---

I ran a similar go program in a container in my local environment (not AWS) and I can't reproduce. If this happens again, let us know and we'll hop on your VM and see what's going on.

--- Additional comment from Maciej Szulik on 2016-06-28 05:40:09 EDT ---

With the new docker registry [1] comes a different S3 driver; specifically, it uses the AWS driver instead of the external one. According to tests done by Michal, performance should be greatly improved. I'm moving this bug to QA, similarly to bug 1314381, comment 24.

[1] https://github.com/openshift/origin/pull/8938

--- Additional comment from Mike Fiedler on 2016-06-28 21:30:07 EDT ---

Moving to MODIFIED until it merges in OSE

--- Additional comment from Maciej Szulik on 2016-07-20 08:32:58 EDT ---

Based on https://bugzilla.redhat.com/show_bug.cgi?id=1314381, moving to ON_QA as well.

--- Additional comment from Mike Fiedler on 2016-08-22 07:43:20 EDT ---

No longer seeing this issue in 3.3 with the new registry and new S3 driver.

Comment 2 zhou ying 2016-10-09 03:04:22 UTC
Will confirm this issue when the latest puddle syncs.

Comment 3 zhou ying 2016-10-13 10:09:51 UTC
Mike Fiedler:
   Could you please help verify this bug? We need to verify this issue with OpenShift v3.2.1.16. Thanks.

Comment 4 Mike Fiedler 2016-10-14 00:43:03 UTC
Verification in progress.  Waiting for results of an overnight run of continuous concurrent builds.

Comment 5 Mike Fiedler 2016-10-14 15:13:50 UTC
Verified on 3.2.1.16.  Over 4900 builds were run without a recurrence of the slow push to S3 issue.

Comment 7 Scott Dodson 2016-12-14 19:40:39 UTC
This bug has been fixed in OCP 3.3; however, the fix will not be backported to OSE 3.2.