Created attachment 573236 [details]
Very detailed bind log during lookups

Description of problem:

The bind97 package is currently unable to look up the name mbs.partners.extranet.microsoft.com. In /var/named/data/named.run, the following gets output:

DNS format error from 65.55.31.17#53 resolving mbs.partners.extranet.microsoft.com/A for client 127.0.0.1#36194: invalid response
error (FORMERR) resolving 'mbs.partners.extranet.microsoft.com/A/IN': 65.55.31.17#53

Doing a "dig @65.55.31.17 mbs.partners.extranet.microsoft.com" on the box itself will return the proper record, though.

This problem does not occur when using bind 9.3 (via the "bind" package), and I have not been able to reproduce it on CentOS 6.2 (bind 9.7.3), via my home ISP's nameserver, or via my VPS's nameserver. Only the RHEL5 bind97 package seems to fail on this zone. Anecdotally, this has been confirmed by a couple of other users in #centos on Freenode as well - "bind" will resolve it, "bind97" will not.

Version-Release number of selected component (if applicable):
bind97-9.7.0-6.P2.el5_7.4

Steps to Reproduce:
1. Install bind97, using the default configuration
2. /etc/init.d/named start
3. dig @localhost mbs.partners.extranet.microsoft.com

Actual results:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 18602
...
and the following in /var/named/data/named.run:

DNS format error from 131.107.125.65#53 resolving mbs.partners.extranet.microsoft.com/A for client 127.0.0.1#41058: invalid response
error (FORMERR) resolving 'mbs.partners.extranet.microsoft.com/A/IN': 131.107.125.65#53
DNS format error from 94.245.124.49#53 resolving mbs.partners.extranet.microsoft.com/A for client 127.0.0.1#41058: invalid response
error (FORMERR) resolving 'mbs.partners.extranet.microsoft.com/A/IN': 94.245.124.49#53
DNS format error from 207.46.55.10#53 resolving mbs.partners.extranet.microsoft.com/A for client 127.0.0.1#41058: invalid response
error (FORMERR) resolving 'mbs.partners.extranet.microsoft.com/A/IN': 207.46.55.10#53
DNS format error from 65.55.31.17#53 resolving mbs.partners.extranet.microsoft.com/A for client 127.0.0.1#41058: invalid response
error (FORMERR) resolving 'mbs.partners.extranet.microsoft.com/A/IN': 65.55.31.17#53

Expected results:

;; ANSWER SECTION:
mbs.partners.extranet.microsoft.com. 2750 IN A 131.107.96.163

Additional info:

This doesn't appear to be network-related. partners.extranet.microsoft.com uses the following NS records:

dns10.one.microsoft.com. - 131.107.125.65
dns11.one.microsoft.com. - 94.245.124.49
dns12.one.microsoft.com. - 207.46.55.10
dns13.one.microsoft.com. - 65.55.31.17

I can query any of those directly, from the same box running bind, with "dig @dns10.one.microsoft.com mbs.partners.extranet.microsoft.com" for example, to retrieve the proper DNS record. I've run a "tcpdump" against those four nameservers while attempting name lookups, and the packets look fine to me.

I'll attach a pcap of the queries that the nameserver attempts, taken on the box itself (via "tcpdump -s 0"). I'll also attach a very detailed log from named.run with the trace level jacked up very high.

Let me know if there's more information which would be useful. Thanks!
Created attachment 573237 [details] PCAP of the DNS traffic between the bind97 server and the remote authorities
Oh, I neglected to mention that by default bind97 will send out an EDNS option with its queries saying that it supports receiving DNSSEC RRs and the like, and the Microsoft servers will reply with a similar option section as well. I had been thinking that perhaps that was related somehow (since the digs directly to the remote nameservers worked fine and *didn't* include those options), so I had tried adding this to my named.conf:

server 131.107.125.65 { edns no; };
server 94.245.124.49 { edns no; };
server 207.46.55.10 { edns no; };
server 65.55.31.17 { edns no; };

In fact, I'm pretty sure the PCAP I attached is from after I had tried adding those, so you'll not see any EDNS info in those DNS packets. At any rate, that turned out not to be the issue.

I had also played around with the edns-udp-size parameter for a bit, though that was definitely looking in the wrong direction, since the packets in question never got close to 512 bytes, let alone a larger, possibly-fragmented reply which could have caused problems for a firewall.

Anyway, just some more data. I was planning on digging into the code a little bit today...
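For reference, the edns-udp-size experiment mentioned above would look something like this as a named.conf fragment (a sketch only - as noted, this turned out not to be related to the problem):

```
options {
    // ... existing options ...

    // Advertise a smaller EDNS UDP buffer so replies stay at or under
    // the classic 512-byte DNS message size, avoiding fragmented UDP
    // replies that some firewalls mishandle.
    edns-udp-size 512;
};
```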
A bit more investigation here, btw. The root of the problem seems to be that the advertised nameservers for that zone don't actually set the "authoritative answer" (AA) flag on the DNS packet, so bind thinks that it still has to recurse further. Because there's nowhere else to recurse into, though, we end up failing out.

The relevant block begins at lib/dns/resolver.c:6856 - bind sees that it has answers in the packet and tries to deal with them, first checking for the authoritative answer flag (or ISFORWARDER()), then for some CNAME logic, and finally falling back to one of two ways of calling noanswer_response(), where it'll eventually bomb out (on line 5575).

I have yet to see exactly how other bind versions deal with this; I assume that something must be setting the FCTX_ADDRINFO_FORWARDER flag on query->addrinfo so that the packet passes that first check and goes into answer_response()...
Created attachment 573471 [details]
Patch to fix the problem, taken from RHEL bind-9.7.3-8.P3.el6_2.2

I took a look at the bind package in RHEL6 to see how this was being handled there, and it turns out that there's another clause in the "if" statement to handle this kind of "lame" response, along with a helper function called "betterreferral" to deal with it. I've copied those little bits onto my testing machine, and it does take care of the issue for me. I'd guess this code should be pretty safe to put in, as it's not referenced anywhere else.
*** Bug 747863 has been marked as a duplicate of this bug. ***
The patch is fine, thanks for it.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux release for currently deployed products. This request is not yet committed for inclusion in a release.
*** This bug has been marked as a duplicate of bug 717610 ***
Are you sure that bug 807540 is the same problem as this one? The zones in question appear to have an SOA - microsoft.com redirects the lookup over to partners.extranet.microsoft.com, and both of those have an SOA:

$ dig +short microsoft.com soa
ns1.msft.net. msnhst.microsoft.com. 2012061903 300 600 2419200 3600
$ dig +short partners.extranet.microsoft.com soa
tk5-ptnr-dc-02.partners.extranet.microsoft.com. msdns.microsoft.com. 309795 900 600 86400 3600

The SOA for partners.extranet.microsoft.com does look like it might be somewhat problematic - the MNAME hostname in there resolves to an internal IP - but the SOA record is, at least, present.
Er, bug 717610, rather.
(In reply to comment #15)
> Er, bug 717610, rather.

Those bugs are really the same, although they might look different. Both are about the handling of lame servers, and the same patch fixes both of them.
(In reply to comment #16)
> Those bugs are really same, although they might look differently. Both are
> about handling of lame servers and same patch fixes both of them.

Oh, doy - right you are. I should have actually looked at the patch on the other one instead of just reading the summary. Thanks!