Bug 1321968 - slow DNS lookups in dockerregistry process
Summary: slow DNS lookups in dockerregistry process
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Online
Classification: Red Hat
Component: Image Registry
Version: 3.x
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Maciej Szulik
QA Contact: Mike Fiedler
URL:
Whiteboard:
Depends On:
Blocks: OSOPS_V3 1382793
 
Reported: 2016-03-29 13:36 UTC by Andy Grimm
Modified: 2016-11-08 03:48 UTC
CC List: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Cloned to: 1382793
Environment:
Last Closed: 2016-10-04 13:07:25 UTC
Target Upstream Version:
Embargoed:



Description Andy Grimm 2016-03-29 13:36:38 UTC
We are using an S3-backed image registry, and whenever I am debugging performance issues with the registry, I notice this DNS lookup issue.  There are two DNS servers in the container's resolv.conf -- SkyDNS (172.30.0.1 in the logs below), and the normal Amazon DNS IP (172.31.0.2).

The search list in resolv.conf is: default.svc.cluster.local svc.cluster.local cluster.local ec2.internal, and ndots is 5
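Put together, the relevant part of resolv.conf looks roughly like this (a sketch reconstructed from the values above; the ordering and any other options are assumptions):

nameserver 172.30.0.1
nameserver 172.31.0.2
search default.svc.cluster.local svc.cluster.local cluster.local ec2.internal
options ndots:5

Because ndots is 5, a name like s3.amazonaws.com (two dots) is tried against every search suffix before it is ever queried as an absolute name, which is why the bare query only shows up at the end of the trace below.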

The DNS lookup sequence always looks like this:

16:05:08.291530 IP 10.1.9.5.38873 > 172.30.0.1.domain: 43023+ A? s3.amazonaws.com.default.svc.cluster.local. (60)
16:05:08.293001 IP 172.30.0.1.domain > 10.1.9.5.38873: 43023 NXDomain* 0/1/0 (114)
16:05:08.293088 IP 10.1.9.5.38873 > 172.30.0.1.domain: 57106+ AAAA? s3.amazonaws.com.default.svc.cluster.local. (60)
16:05:08.294491 IP 172.30.0.1.domain > 10.1.9.5.38873: 57106 NXDomain* 0/1/0 (114)
16:05:08.294614 IP 10.1.9.5.51218 > 172.30.0.1.domain: 36084+ A? s3.amazonaws.com.svc.cluster.local. (52)
16:05:08.297123 IP 172.30.0.1.domain > 10.1.9.5.51218: 36084 NXDomain* 0/1/0 (106)
16:05:08.297185 IP 10.1.9.5.51218 > 172.30.0.1.domain: 64990+ AAAA? s3.amazonaws.com.svc.cluster.local. (52)
16:05:08.298806 IP 172.30.0.1.domain > 10.1.9.5.51218: 64990 NXDomain* 0/1/0 (106)
16:05:08.298899 IP 10.1.9.5.55007 > 172.30.0.1.domain: 14232+ A? s3.amazonaws.com.cluster.local. (48)
16:05:08.300209 IP 172.30.0.1.domain > 10.1.9.5.55007: 14232 NXDomain* 0/1/0 (102)
16:05:08.300272 IP 10.1.9.5.55007 > 172.30.0.1.domain: 62807+ AAAA? s3.amazonaws.com.cluster.local. (48)
16:05:08.301311 IP 172.30.0.1.domain > 10.1.9.5.55007: 62807 NXDomain* 0/1/0 (102)
16:05:08.301402 IP 10.1.9.5.46588 > 172.30.0.1.domain: 10188+ A? s3.amazonaws.com.ec2.internal. (47)
16:05:08.302004 IP 172.30.0.1.domain > 10.1.9.5.46588: 10188 ServFail- 0/0/0 (47)
16:05:13.307102 IP 10.1.9.5.42971 > 172.31.0.2.domain: 10188+ A? s3.amazonaws.com.ec2.internal. (47)
16:05:13.310267 IP 172.31.0.2.domain > 10.1.9.5.42971: 10188 NXDomain 0/1/0 (108)
16:05:13.310332 IP 10.1.9.5.42971 > 172.31.0.2.domain: 62022+ AAAA? s3.amazonaws.com.ec2.internal. (47)
16:05:13.310526 IP 172.31.0.2.domain > 10.1.9.5.42971: 62022 NXDomain 0/1/0 (108)
16:05:13.310606 IP 10.1.9.5.50545 > 172.30.0.1.domain: 50172+ A? s3.amazonaws.com. (34)
16:05:13.311230 IP 172.30.0.1.domain > 10.1.9.5.50545: 50172 ServFail- 0/0/0 (34)
16:05:18.314720 IP 10.1.9.5.45503 > 172.31.0.2.domain: 50172+ A? s3.amazonaws.com. (34)
16:05:18.315647 IP 172.31.0.2.domain > 10.1.9.5.45503: 50172 2/0/0 CNAME s3-1.amazonaws.com., A 54.231.49.164 (69)

That's a pretty significant number of lookup attempts to have to make every time, but that's just an artifact of the kubernetes architecture, and I'm willing to look past that.  The really frustrating part here is the 5-second delay between the ServFail responses from SkyDNS for s3.amazonaws.com.ec2.internal and s3.amazonaws.com and the fallback to Amazon's DNS.  This is a ten-second performance penalty that becomes very painful if you're making lots of connections and doing a new lookup each time.
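(A trace like the one above can be captured on the node with something along these lines; the exact tcpdump invocation here is an assumption, not the one actually used:

tcpdump -nn -i any udp port 53
)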

The problem does not reproduce with "getent hosts s3.amazonaws.com" in the same container.  The very same sequence of lookups happens, but there are no 5-second delays, which makes me think that this is specific to DNS resolution in the Go libraries used by the docker registry.
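To isolate whether the delay really comes from Go's built-in resolver, a small timing program along these lines could be run inside the registry container (a minimal sketch, not part of the original report; running it once with GODEBUG=netdns=go and once with GODEBUG=netdns=cgo compares Go's pure resolver against the libc resolver that getent uses):

package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// Time repeated lookups of the S3 endpoint, the same name the registry resolves.
	for i := 0; i < 5; i++ {
		start := time.Now()
		addrs, err := net.LookupIP("s3.amazonaws.com")
		elapsed := time.Since(start)
		if err != nil {
			fmt.Printf("lookup %d failed after %v: %v\n", i, elapsed, err)
			continue
		}
		fmt.Printf("lookup %d took %v: %v\n", i, elapsed, addrs)
	}
}

If the multi-second pauses only show up with the pure-Go resolver, that would support the theory that the issue is in the Go resolution path rather than in SkyDNS itself.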

Comment 1 Andy Grimm 2016-03-29 14:31:52 UTC
This bug just got weirder.  I have always seen this issue when debugging pushes, and I just tried to reproduce with a query instead.  I logged into a master and ran:

docker login
docker images

I then verified that in this case the registry process made all of its DNS queries within 0.05 seconds.  Somehow it looks specific to the upload code path.

Comment 2 Maciej Szulik 2016-03-30 11:23:46 UTC
I wonder if this is connected with the other s3 issues we have:
https://bugzilla.redhat.com/show_bug.cgi?id=1314381
https://bugzilla.redhat.com/show_bug.cgi?id=1318939

Comment 3 Paul Weil 2016-04-01 00:56:06 UTC
Andy,

I haven't been able to make much progress reproducing this locally.  I was able to look through the s3 code and (as I think Maciej mentioned this morning in IRC) it doesn't look like it has any special resolver logic.  It appears to just be using the net dialer to set up http requests.

I'm curious if this is affecting any type of go program.  Could you try running a tester like:

package main

import (
	"fmt"
	"net"
)

func main() {
	// Resolve the S3 endpoint the same way the registry's HTTP client would.
	addr, err := net.LookupIP("s3.amazonaws.com")
	if err != nil {
		fmt.Printf("err: %v\n", err)
	}
	fmt.Printf("addr: %v\n", addr)
}


and see if you see the same pauses in the tcpdump?

I'm also curious whether adding "options timeout:1" to resolv.conf offers any relief as a possible workaround while we continue troubleshooting, since this may be the root cause of other delays/timeouts.
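For reference, the resolver options go on a single line in resolv.conf, e.g. (combining the existing ndots setting with the proposed timeout; the exact set of existing options is an assumption):

options ndots:5 timeout:1

timeout:1 caps the per-nameserver wait at one second instead of the default of five, so each of the two stalls visible in the trace above should shrink from roughly 5 seconds to roughly 1 second.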

Comment 4 Paul Weil 2016-04-01 12:35:18 UTC
Trying to reproduce locally, I'm not seeing any pausing in my resolver: https://gist.github.com/pweil-/4e01eb8838e018a91998a1a5b764350c

Comment 5 Paul Weil 2016-04-01 13:48:35 UTC
Unable to reproduce on my own AWS instance either.  Tcpdumps of both the test program and a manual docker push to the internal registry: https://gist.github.com/pweil-/b69a3c7462754ab12546022cc439bb8b

Comment 6 Paul Weil 2016-04-01 15:22:49 UTC
<agrimm> pweil, the cluster where I was reproducing the slow pushes earlier in the week is no longer having that issue.  You can just leave that bug in my court until I find a place to reproduce it again.

Removing from blocker list for now

Comment 7 Andy Goldstein 2016-04-01 16:38:34 UTC
I ran a similar go program in a container in my local environment (not AWS) and I can't reproduce. If this happens again, let us know and we'll hop on your VM and see what's going on.

Comment 8 Maciej Szulik 2016-06-28 09:40:09 UTC
With the new docker registry [1] comes a different s3 driver; specifically, it uses the official AWS driver instead of an external one.  According to tests done by Michal, the performance should be greatly improved.  I'm moving this bug to QA, similarly to bug 1314381, comment 24.

[1] https://github.com/openshift/origin/pull/8938
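For reference, the registry reads its S3 settings from the storage section of its config.yml; with the bundled driver that section looks roughly like this (a sketch with placeholder values, not taken from this bug):

storage:
  s3:
    accesskey: EXAMPLE_ACCESS_KEY
    secretkey: EXAMPLE_SECRET_KEY
    region: us-east-1
    bucket: example-registry-bucket
    encrypt: false
    secure: true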

Comment 9 Mike Fiedler 2016-06-29 01:30:07 UTC
Moving to MODIFIED until it merges in OSE

Comment 10 Maciej Szulik 2016-07-20 12:32:58 UTC
Based on https://bugzilla.redhat.com/show_bug.cgi?id=1314381, moving to ON_QA as well.

Comment 11 Mike Fiedler 2016-08-22 11:43:20 UTC
No longer seeing this issue in 3.3 with the new registry and new S3 driver.

Comment 12 zhou ying 2016-10-09 03:04:06 UTC
Will confirm this issue when the latest puddle syncs.

