| Summary: | slow DNS lookups in dockerregistry process | |||
|---|---|---|---|---|
| Product: | OpenShift Online | Reporter: | Andy Grimm <agrimm> | |
| Component: | Image Registry | Assignee: | Maciej Szulik <maszulik> | |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Mike Fiedler <mifiedle> | |
| Severity: | medium | Docs Contact: | ||
| Priority: | medium | |||
| Version: | 3.x | CC: | agoldste, aos-bugs, jgoulding, maszulik, mifiedle, pweil, twiest, yinzhou | |
| Target Milestone: | --- | |||
| Target Release: | --- | |||
| Hardware: | Unspecified | |||
| OS: | Unspecified | |||
| Whiteboard: | ||||
| Fixed In Version: | Doc Type: | Bug Fix | ||
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 1382793 (view as bug list) | Environment: | ||
| Last Closed: | 2016-10-04 13:07:25 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Bug Depends On: | ||||
| Bug Blocks: | 1303130, 1382793 | |||
|
Description
Andy Grimm
2016-03-29 13:36:38 UTC
This bug just got weirder. I have always seen this issue when debugging pushes, and I just tried to reproduce with a query instead. I logged into a master and ran:

    docker login
    docker images

and I verified that in this case the registry process made the DNS queries all within 0.05 seconds. Somehow it looks specific to the upload code path. I wonder if this is connected with the other s3 issues we have:

https://bugzilla.redhat.com/show_bug.cgi?id=1314381
https://bugzilla.redhat.com/show_bug.cgi?id=1318939

Andy,
I haven't been able to make much progress reproducing this locally. I was able to look through the s3 code and (as I think Maciej mentioned this morning in IRC) it doesn't look like it has any special resolver logic. It appears to just be using the net dialer to set up http requests.
I'm curious if this is affecting any type of go program. Could you try running a tester like:
```go
package main

import (
	"fmt"
	"net"
)

func main() {
	addr, err := net.LookupIP("s3.amazonaws.com")
	if err != nil {
		fmt.Printf("err: %v\n", err)
	}
	fmt.Printf("addr: %v\n", addr)
}
```
and see if you see the same pauses in the tcpdump?
I'm also curious whether adding "options timeout:1" to resolv.conf offers any relief as a possible workaround while we continue troubleshooting, since this may be the root cause of other delays/timeouts.
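For clarity, the workaround being suggested would look something like this in /etc/resolv.conf (the nameserver address here is a placeholder):

```
nameserver 10.0.0.2
options timeout:1 attempts:2
```

With `timeout:1`, the resolver waits at most one second per nameserver before retrying, which would cap the length of each pause even if it doesn't address the underlying cause.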
Trying to reproduce locally, I'm not seeing any pausing in my resolver: https://gist.github.com/pweil-/4e01eb8838e018a91998a1a5b764350c

Unable to reproduce on my own AWS instance as well. tcpdumps of both the test program and a manual docker push to the internal registry: https://gist.github.com/pweil-/b69a3c7462754ab12546022cc439bb8b

<agrimm> pweil, the cluster where I was reproducing the slow pushes earlier in the week is no longer having that issue. You can just leave that bug in my court until I find a place to reproduce it again.

Removing from blocker list for now.

I ran a similar Go program in a container in my local environment (not AWS) and I can't reproduce. If this happens again, let us know and we'll hop on your VM and see what's going on.

With the new docker registry [1] comes a different s3 driver; specifically, it uses the AWS driver instead of an external one. According to tests done by Michal, performance should be greatly improved. I'm moving this bug to QA, similarly to bug 1314381, comment 24.

[1] https://github.com/openshift/origin/pull/8938

Moving to MODIFIED until it merges in OSE.

Based on https://bugzilla.redhat.com/show_bug.cgi?id=1314381, moving to ON_QA as well.

No longer seeing this issue in 3.3 with the new registry and new S3 driver. Will confirm this issue when the latest puddle syncs.
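For context, the registry selects its S3 storage driver through the registry configuration file. A minimal sketch of such a configuration is shown below; the bucket name, region, and credentials are placeholders, not values from this bug.

```yaml
version: 0.1
storage:
  s3:
    accesskey: AKIAEXAMPLE      # placeholder credentials
    secretkey: changeme
    region: us-east-1           # placeholder region
    bucket: my-registry-bucket  # placeholder bucket
    encrypt: true
http:
  addr: :5000
```

Because the driver handles all communication with S3 (including the DNS lookups for the S3 endpoint), swapping drivers as described above changes exactly the code path where the slow lookups were observed.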