Bug 824201
Summary: | getaddrinfo DNS referral response returns host not found when A and AAAA questions are sent and one response is a referral | | |
---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Gino LV. Ledesma <gledesma> |
Component: | glibc | Assignee: | Jeff Law <law> |
Status: | CLOSED UPSTREAM | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
Severity: | low | Priority: | unspecified |
Version: | 16 | CC: | fweimer, gledesma, jakub, law, pfrankli, schwab |
Hardware: | All | OS: | Linux |
Doc Type: | Bug Fix | Type: | Bug |
: | 845218 (view as bug list) | | |
Description
Gino LV. Ledesma
2012-05-23 00:58:58 UTC
Created attachment 586223
Wireshark Packet Capture
I took a packet capture of the DNS request-response scenario (dns.pcap) and found the following:

Case 1: Response is served fresh (not from cache)
Frame 01 Client: A? active-mrepo.me.com
Frame 02 Client: AAAA? active-mrepo.me.com
Frame 03 VIP: A 1/1/1 active-mrepo.me.com. 17.172.194.16
Frame 04 VIP: AAAA 0/1/0 hostmaster...

Case 2: Response is served from cache
Frame 05 Client: A? active-mrepo.me.com
Frame 06 Client: AAAA? active-mrepo.me.com
Frame 07 VIP: A 1/0/0 active-mrepo.me.com. 17.172.194.16
Frame 08 VIP: AAAA 0/0/0 hostmaster...
# Try 2 (this is done automatically by glibc)
Frame 09 Client: A? active-mrepo.me.com
Frame 10 Client: AAAA? active-mrepo.me.com
Frame 11 VIP: A 1/0/0 active-mrepo.me.com. 17.172.194.16
Frame 12 VIP: AAAA 0/0/0 hostmaster...

As best as I can tell, there are two things happening here:
1) This DNS server is serving a DNS response that falls under NODATA type 3 (RFC 2308 section 2.2)
2) glibc's (mis-?)interpretation of a referral response and short-circuiting its logic

glibc interprets DNS responses as "referral" if the following conditions are met (see glibc 2.15, resolv/res_send.c lines 1301-1303):
a) rcode == NOERROR
b) ancount == 0
c) aa == 0
d) ra == 0
e) arcount == 0

I see that this change was introduced in glibc 2.9 and is still present in 2.15. In the above situation, when two responses come in (A response = authoritative, AAAA = referral), send_dg() immediately returns 0, causing __libc_res_nsend to try the next nameserver and repeat the query, ignoring any valid responses that may have come in.

Here is a debug call-trace of glibc with RES_DEBUG enabled:

looking up: active-mrepo
;; res_setoptions(" timeout:600 debug ", "conf")..
;; debug dots=0, statp->ndots=1, trailing_dot=0, name=active-mrepo
;; res_nquerydomain(active-mrepo, me.com, 1, 62321)
;; res_query(active-mrepo.me.com, 1, 62321)
;; res_nmkquery(QUERY, active-mrepo.me.com, IN, A)
;; res_nmkquery(QUERY, active-mrepo.me.com, IN, AAAA)
;; res_send()
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 49477
;; flags: rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;; active-mrepo.me.com, type = A, class = IN
;; Querying server (# 1) address = 17.230.128.24
referred query:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 8813
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;; active-mrepo.me.com, type = AAAA, class = IN
;; got answer:
;; ns_initparse: Message too long
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 8813
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;; active-mrepo.me.com, type = AAAA, class = IN

The call trace won't show the A response because the AAAA referral response makes send_dg() return 0 immediately.

Moving the next_ns: goto label above the if (buf2 != NULL) check under the SERVFAIL/NOTIMP/REFUSED code checks seems to work around the problem, but is most likely incorrect (due to the subsequent goto wait call).

On a side note, I can only make this happen with the Citrix Netscaler acting as a DNS cache. Other caching resolvers (bind, dnsmasq, dnscache, pdns-recursor, etc.) do not expose this behavior in glibc because they will have either the "aa" or the "ra" bit set to 1. The Netscaler seems to be unique in serving the following combination of flags for type=AAAA:
rcode == NOERROR
ancount == 0
ra == 0
arcount == 0
nscount == 0

I've filed a bug with Citrix for this problem: Citrix SR 60783234

Possibly related bugs:
https://bugzilla.redhat.com/show_bug.cgi?id=459756
https://bugs.launchpad.net/ubuntu/+source/apt/+bug/326718

Work-arounds:
1. Disable Netscaler caching (set dns param -cache NO)
2. Switch service type from DNS/DNS_TCP to UDP/TCP
3. Make the Netscaler authoritative for all zones that it is proxying DNS for (add ns soaRecord ...)
4. Disable IPv6 on the client side

Some notes / tidbits:
1. This behavior does NOT affect older glibc hosts (e.g. EL 5.x), presumably because gethostbyname3_r does two separate calls for A/AAAA
2. The order of the responses (whether A or AAAA comes first) doesn't seem to matter

I'd be real interested to know if the F17 glibc fixes this problem. There were a number of fixes in these paths through the resolver that might help.

I just installed Fedora 17 i386 on a VM to test this. The problem still persists there (glibc-2.15-37.fc17.i686). I've also confirmed that the problem exists with upstream, stock glibc 2.15.
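
For anyone following along, the referral test described in the report (conditions a through e) can be sketched as a standalone predicate. This is only an illustration of the stated conditions, not the glibc source: struct hdr_view and looks_like_referral() are hypothetical names made up for this example, and the struct is a simplified stand-in for the real DNS header that glibc parses.

```c
#include <stdbool.h>

/* NOERROR is rcode value 0 in the DNS header (RFC 1035). */
#define RCODE_NOERROR 0u

/* Hypothetical, simplified view of the header fields involved.
   Not the structure glibc uses internally. */
struct hdr_view {
    unsigned rcode;    /* response code */
    unsigned aa;       /* authoritative answer bit */
    unsigned ra;       /* recursion available bit */
    unsigned ancount;  /* answer record count */
    unsigned nscount;  /* authority record count (not part of the test) */
    unsigned arcount;  /* additional record count */
};

/* True when a response matches conditions a) through e) from the report. */
static bool looks_like_referral(const struct hdr_view *h)
{
    return h->rcode == RCODE_NOERROR
        && h->ancount == 0
        && h->aa == 0
        && h->ra == 0
        && h->arcount == 0;
}
```

With the header values the Netscaler serves from cache for AAAA (NOERROR, flags qr rd, all counts 0) the predicate is true, so the response is treated as a referral. Setting either aa or ra to 1, as the other caching resolvers mentioned in the report do, makes it false.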