Bug 1750956
| Summary: | Latest bind-utils makes AAAA queries leading to DNS timeouts | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | emahoney |
| Component: | Networking | Assignee: | Dan Mace <dmace> |
| Networking sub component: | DNS | QA Contact: | Hongan Li <hongli> |
| Status: | CLOSED NOTABUG | Docs Contact: | |
| Severity: | unspecified | ||
| Priority: | unspecified | CC: | aos-bugs, bbennett, dmace, gscott, mfisher, ndavids |
| Version: | 3.11.0 | ||
| Target Milestone: | --- | ||
| Target Release: | 3.11.z | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2019-10-11 01:54:25 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description
emahoney
2019-09-10 20:27:13 UTC

If the problem is with running the nslookup command itself to validate that the hostname is correct, you can tell it what kind of query to issue:

    nslookup -query=A blah.example.com

That way it will not issue an AAAA query at all. If the problem is that we are passing AAAA queries for unknown hosts upstream, I think that is the expected behavior and I don't think we can change that. But it sounds like your upstream server is not correctly telling dnsmasq that the hostname does not exist, leading to the long timeout.

So another possible workaround is, in their app, to query specifically for A records instead of depending on an apparently changed dnsmasq default behavior?

Yes. We can't change the behavior in dnsmasq because we don't know what queries containers will want to make. The upstream resolver is at fault because it is not saying NXDOMAIN when it gets an AAAA query. I would urge them to find out why the upstream resolver times out instead of reporting that the AAAA query has no associated IPv6 address. But if they don't issue AAAA queries from nslookup (or any other resolver in their app), then that will solve their immediate problem.

Still, the big deal is the change in behavior. With bind-utils 9.9, the trace shows dnsmasq returns when it gets an IPv4 address. With 9.11, the trace shows it queries again looking for an IPv6 address. Is there a dnsmasq.conf setting to revert to the old behavior?

I feel as though the nslookup behavioral change is now well understood. However, the original problem as I understand it is that a curl command from a pod was reporting a >5 second DNS resolution time. The nslookup change seems important but orthogonal. Do you agree? If so, what would be useful to me is a fresh set of data around the original test case[1], including verbose curl output, UDP packet captures from the pod and host network interfaces, and dnsmasq logs for the duration of the test window. If not, can you help me understand the significance of nslookup's behavior in relation to the case?

[1] https://access.redhat.com/support/cases/#/case/02426516?commentId=a0a2K00000R1iugQAB

Absolutely, we understand the behavior change. I can attach tcpdumps from the client (OpenShift node) and the upstream DNS server. However, we have already reviewed these and I can tell you what you will see. The client makes an AAAA query upstream and after 15 seconds concludes ';; connection timed out; no servers could be reached'. The upstream DNS server waits 30 seconds before responding; by that time the socket on the client side is closed and we see 'icmp-host-prohibited'. Again, I understand this change and have explained to the end user that the proper fix is to change the upstream nameserver behavior to respond with NXDOMAIN in less than 15s. What the end user is asking for clarification on is the change in behavior in bind-utils (or glibc, whatever leads to the additional AAAA query compared with the previous behavior). Downgrading glibc/bind-utils to revert to the old behavior isn't an option.
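For the workaround discussed above (having the application ask only for A records), a minimal Python sketch follows. It assumes the application resolves names through the system resolver. The helper names `resolve_ipv4_only` and `timed_lookup` are hypothetical and do not come from this bug. Restricting the lookup to `AF_INET` asks only for IPv4 results; on glibc-based systems this typically means no AAAA query is sent, though that is worth confirming with a packet capture in the affected environment.

```python
import socket
import time


def resolve_ipv4_only(hostname, port=None):
    """Return the unique IPv4 addresses for hostname, in resolver order.

    Hypothetical helper: family=AF_INET restricts the lookup to IPv4
    results, so the caller does not depend on how the upstream server
    handles AAAA queries.
    """
    infos = socket.getaddrinfo(
        hostname, port, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    addresses = []
    for _family, _type, _proto, _canonname, sockaddr in infos:
        ip = sockaddr[0]  # sockaddr is (address, port) for AF_INET
        if ip not in addresses:
            addresses.append(ip)
    return addresses


def timed_lookup(hostname):
    """Return (addresses, elapsed_seconds) for an IPv4-only lookup."""
    start = time.monotonic()
    addresses = resolve_ipv4_only(hostname)
    return addresses, time.monotonic() - start
```

Calling `timed_lookup` with the hostname under test from inside the pod gives a rough resolution time to compare against the >5 second figure reported for curl. A failed lookup raises `socket.gaierror`, which the caller would need to handle.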