Bug 807540 - bind97 "DNS format error... invalid response" from mbs.partners.extranet.microsoft.com
Keywords:
Status: CLOSED DUPLICATE of bug 717610
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: bind97
Version: 5.8
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Adam Tkac
QA Contact: qe-baseos-daemons
URL:
Whiteboard:
Duplicates: 747863
Depends On:
Blocks: 798457 743405
 
Reported: 2012-03-28 05:43 UTC by CJ Kucera
Modified: 2018-11-14 10:33 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-06-20 13:02:59 UTC
Target Upstream Version:
Embargoed:


Attachments:
Very detailed bind log during lookups (61.53 KB, text/plain)
2012-03-28 05:43 UTC, CJ Kucera
PCAP of the DNS traffic between the bind97 server and the remote authorities (976 bytes, application/x-pcap)
2012-03-28 05:44 UTC, CJ Kucera
Patch to fix the problem, taken from RHEL bind-9.7.3-8.P3.el6_2.2 (1.63 KB, patch)
2012-03-28 21:18 UTC, CJ Kucera

Description CJ Kucera 2012-03-28 05:43:30 UTC
Created attachment 573236 [details]
Very detailed bind log during lookups

Description of problem:

The bind97 package is currently unable to look up the name mbs.partners.extranet.microsoft.com.  In /var/named/data/named.run, the following gets output:

DNS format error from 65.55.31.17#53 resolving mbs.partners.extranet.microsoft.com/A for client 127.0.0.1#36194: invalid response
error (FORMERR) resolving 'mbs.partners.extranet.microsoft.com/A/IN': 65.55.31.17#53

Doing a "dig @65.55.31.17 mbs.partners.extranet.microsoft.com" on the box itself will result in the proper record, though.

This problem does not occur when using bind 9.3 (via the "bind" package), and I have not been able to reproduce it on CentOS 6.2 (bind 9.7.3), via my home ISP's nameserver, or via my VPS's nameserver.  Only the RHEL5 bind97 package seems to fail on this zone.  Anecdotally, this has been confirmed by a couple of other users in #centos on Freenode as well: "bind" will resolve it, "bind97" will not.

Version-Release number of selected component (if applicable):

bind97-9.7.0-6.P2.el5_7.4

Steps to Reproduce:
1. Install bind97, use default configuration
2. /etc/init.d/named start
3. dig @localhost mbs.partners.extranet.microsoft.com
  
Actual results:

;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 18602

... and the following in /var/named/data/named.run:

DNS format error from 131.107.125.65#53 resolving mbs.partners.extranet.microsoft.com/A for client 127.0.0.1#41058: invalid response
error (FORMERR) resolving 'mbs.partners.extranet.microsoft.com/A/IN': 131.107.125.65#53
DNS format error from 94.245.124.49#53 resolving mbs.partners.extranet.microsoft.com/A for client 127.0.0.1#41058: invalid response
error (FORMERR) resolving 'mbs.partners.extranet.microsoft.com/A/IN': 94.245.124.49#53
DNS format error from 207.46.55.10#53 resolving mbs.partners.extranet.microsoft.com/A for client 127.0.0.1#41058: invalid response
error (FORMERR) resolving 'mbs.partners.extranet.microsoft.com/A/IN': 207.46.55.10#53
DNS format error from 65.55.31.17#53 resolving mbs.partners.extranet.microsoft.com/A for client 127.0.0.1#41058: invalid response
error (FORMERR) resolving 'mbs.partners.extranet.microsoft.com/A/IN': 65.55.31.17#53
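
For anyone triaging a larger named.run, the failing servers can be tallied with a short script. This is a minimal sketch: the regular expression is an assumption derived only from the sample lines above, and `failing_servers` is a hypothetical helper name, not part of BIND.

```python
import re
from collections import Counter

# Matches the "DNS format error" lines shown above. The pattern is an
# assumption based only on those samples, not on BIND's logging code.
FORMERR_LINE = re.compile(
    r"DNS format error from (?P<server>[\d.]+)#(?P<port>\d+) "
    r"resolving (?P<name>\S+)/(?P<rtype>\w+) "
    r"for client (?P<client>[\d.]+)#\d+: (?P<reason>.+)$"
)

def failing_servers(lines):
    """Count 'DNS format error' lines per remote server address."""
    counts = Counter()
    for line in lines:
        match = FORMERR_LINE.search(line)
        if match:
            counts[match.group("server")] += 1
    return counts

sample = [
    "DNS format error from 131.107.125.65#53 resolving "
    "mbs.partners.extranet.microsoft.com/A for client 127.0.0.1#41058: invalid response",
    "error (FORMERR) resolving 'mbs.partners.extranet.microsoft.com/A/IN': 131.107.125.65#53",
    "DNS format error from 65.55.31.17#53 resolving "
    "mbs.partners.extranet.microsoft.com/A for client 127.0.0.1#41058: invalid response",
]
print(failing_servers(sample))
```

The follow-up "error (FORMERR)" lines are skipped on purpose so each failure is counted once.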

Expected results:

;; ANSWER SECTION:
mbs.partners.extranet.microsoft.com. 2750 IN A  131.107.96.163

Additional info:

This doesn't appear to be network-related.  partners.extranet.microsoft.com uses the following NS records:

dns10.one.microsoft.com. - 131.107.125.65
dns11.one.microsoft.com. - 94.245.124.49
dns12.one.microsoft.com. - 207.46.55.10
dns13.one.microsoft.com. - 65.55.31.17

I can query any of those directly, from the same box running bind, with "dig @dns10.one.microsoft.com mbs.partners.extranet.microsoft.com" for example, to retrieve the proper DNS record.
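
The direct query can also be reproduced without dig. The following is a minimal sketch of a plain A query with no EDNS OPT record, sent over UDP; `build_query` and `ask` are hypothetical helper names, and the fixed query ID is just a placeholder.

```python
import socket
import struct

def build_query(name, query_id=0x1234, qtype=1, qclass=1):
    """Build a plain DNS query (no EDNS OPT record) for `name`."""
    # Header: ID, flags (only RD set = 0x0100), QDCOUNT=1, other counts 0.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: length-prefixed labels, terminated by a zero byte.
    labels = name.rstrip(".").split(".")
    qname = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"
    # QTYPE 1 = A, QCLASS 1 = IN.
    return header + qname + struct.pack("!HH", qtype, qclass)

def ask(server, name, timeout=3.0):
    """Send the query over UDP to `server` and return the raw reply bytes."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name), (server, 53))
        reply, _ = sock.recvfrom(4096)
        return reply

# Example (needs network access):
# reply = ask("131.107.125.65", "mbs.partners.extranet.microsoft.com")
```

The raw reply bytes are what the resolver sees, so the header flags can be inspected directly instead of relying on dig's rendering.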

I've run a "tcpdump" against those four nameservers while attempting name lookups and the packets look fine to me.  I'll attach a pcap of the queries that the nameserver attempts, taken on the box itself (via "tcpdump -s 0").  I'll also attach a very detailed log from named.run with the trace level jacked up very high.

Let me know if there's more information which would be useful.  Thanks!

Comment 1 CJ Kucera 2012-03-28 05:44:37 UTC
Created attachment 573237 [details]
PCAP of the DNS traffic between the bind97 server and the remote authorities

Comment 2 CJ Kucera 2012-03-28 13:41:16 UTC
Oh, I neglected to mention that by default bind97 will send out an EDNS option with its queries saying that it supports receiving DNSSEC RRs or the like, and the Microsoft servers will reply with a similar option section as well, if so.  I had been thinking that perhaps that was related somehow (since the digs directly to the remote nameservers worked fine and *didn't* include those options), so I had tried adding this to my named.conf:

server 131.107.125.65 { edns no; };
server 94.245.124.49 { edns no; };
server 207.46.55.10 { edns no; };
server 65.55.31.17 { edns no; };

In fact, I'm pretty sure the PCAP I attached is from after I had tried adding those, so you'll not see any EDNS info in those DNS packets.

At any rate, that turned out not to be the issue.  I had also played around with the edns-udp-size parameter for a bit, though that was definitely looking in the wrong direction since the packets in question never got close to 512 bytes, let alone a larger possibly-fragmented reply which could have caused problems for a firewall.

Anyway, just some more data.  I was planning on digging into code a little bit today...

Comment 3 CJ Kucera 2012-03-28 20:13:00 UTC
A bit more investigation here, btw.  The root of the problem seems to be that the advertised nameservers for that zone don't actually set the "authoritative" flag on the DNS packet, so bind thinks that it still has to recurse further.  Because there's nowhere else to recurse into, though, we end up failing out.
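
The flag in question is the AA bit (mask 0x0400) in the second 16-bit word of the DNS header. A minimal sketch of checking it on raw packet bytes follows; `header_summary` is a hypothetical helper, and the sample header is hand-built for illustration, not taken from the attached pcap.

```python
import struct

AA_MASK = 0x0400  # Authoritative Answer bit in the DNS header flags word

def header_summary(packet):
    """Decode the fixed 12-byte DNS header of a raw message."""
    ident, flags, qd, an, ns, ar = struct.unpack("!HHHHHH", packet[:12])
    return {
        "id": ident,
        "is_response": bool(flags & 0x8000),   # QR bit
        "authoritative": bool(flags & AA_MASK),
        "rcode": flags & 0x000F,
        "questions": qd,
        "answers": an,
        "authority": ns,
        "additional": ar,
    }

# Hand-built header: a response (QR=1) with RD set, AA clear, NOERROR,
# one question and one answer.
lame_looking = struct.pack("!HHHHHH", 0x4A2A, 0x8100, 1, 1, 0, 0)
info = header_summary(lame_looking)
print(info["authoritative"], info["answers"])  # prints: False 1
```

A reply shaped like this one carries an answer but does not claim authority, which is the combination described above.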

The relevant block begins at lib/dns/resolver.c:6856 - bind sees that it has answers in the packet and tries to deal with them, first checking for the Authoritative Answer Flag (or ISFORWARDER()), then for some CNAME logic, and finally falling back to one of two ways of calling noanswer_response, where it'll eventually bomb out (on line 5575).

I have yet to see exactly how other bind versions deal with this; I assume that something must be setting the FCTX_ADDRINFO_FORWARDER flag on query->addrinfo so that the packet passes that first check and goes into answer_response() ...

Comment 4 CJ Kucera 2012-03-28 21:18:00 UTC
Created attachment 573471 [details]
Patch to fix the problem, taken from RHEL bind-9.7.3-8.P3.el6_2.2

I took a look into the bind package in RHEL6 to see how it was being handled in there, and it turns out that there's another clause to the "if" statement to handle this kind of "Lame" response, and a helper function called "betterreferral" to deal with it.  I've copied those little bits in on my testing machine and it does take care of the issue for me.  I'd guess this code should be pretty safe to put in, as it's not referenced anywhere else.
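
A rough model of the check described in comments 3 and 4 is sketched below. It is not the actual resolver.c code: the CNAME branch is left out, `Reply` and `classify` are hypothetical names, and the meaning given to `better_referral` (the reply also carries a more useful referral that should be followed instead) is an assumption about what the "betterreferral" helper decides.

```python
from dataclasses import dataclass

@dataclass
class Reply:
    answer_count: int
    authoritative: bool            # AA flag set in the reply
    from_forwarder: bool = False   # stands in for ISFORWARDER()
    better_referral: bool = False  # assumed result of betterreferral()

def classify(reply, lame_clause=False):
    """Return the handler a reply would be routed to in this simplified model.

    lame_clause=False models the behaviour reported for the RHEL5 bind97
    package; lame_clause=True models the extra "if" clause from the patch.
    CNAME handling is intentionally omitted.
    """
    if reply.answer_count == 0:
        return "noanswer_response"
    if reply.authoritative or reply.from_forwarder:
        return "answer_response"
    if lame_clause and not reply.better_referral:
        # Extra clause: accept the answer even though AA is not set.
        return "answer_response"
    return "noanswer_response"

lame = Reply(answer_count=1, authoritative=False)
print(classify(lame))                    # prints: noanswer_response
print(classify(lame, lame_clause=True))  # prints: answer_response
```

In this model the only behavioural change is for replies that contain answers but lack the AA flag, which fits the observation that the extra code is not referenced anywhere else.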

Comment 5 Adam Tkac 2012-03-30 08:39:54 UTC
*** Bug 747863 has been marked as a duplicate of this bug. ***

Comment 6 Adam Tkac 2012-03-30 08:40:35 UTC
The patch is fine, thanks for it.

Comment 11 RHEL Program Management 2012-04-19 10:27:14 UTC
This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux release.  Product Management has
requested further review of this request by Red Hat Engineering, for
potential inclusion in a Red Hat Enterprise Linux release for currently
deployed products.  This request is not yet committed for inclusion in
a release.

Comment 13 Adam Tkac 2012-06-20 13:02:59 UTC

*** This bug has been marked as a duplicate of bug 717610 ***

Comment 14 CJ Kucera 2012-06-20 14:02:24 UTC
Are you sure that bug 807540 is the same problem as this one?  The zones in question appear to have an SOA - microsoft.com redirects the lookup over to partners.extranet.microsoft.com, and both of those have an SOA:

$ dig +short microsoft.com soa
ns1.msft.net. msnhst.microsoft.com. 2012061903 300 600 2419200 3600

$ dig +short partners.extranet.microsoft.com soa
tk5-ptnr-dc-02.partners.extranet.microsoft.com. msdns.microsoft.com. 309795 900 600 86400 3600

The SOA for partners.extranet.microsoft.com does look like it might be somewhat problematic - the MNAME hostname in there resolves to an internal IP - but the SOA record is, at least, present.
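
For reference, the seven fields in that `dig +short` output are MNAME, RNAME, serial, refresh, retry, expire and minimum, in that order. A minimal sketch for splitting them out, with `SOA` and `parse_soa` as hypothetical names:

```python
from typing import NamedTuple

class SOA(NamedTuple):
    mname: str    # primary nameserver for the zone
    rname: str    # responsible mailbox, with "." in place of "@"
    serial: int
    refresh: int
    retry: int
    expire: int
    minimum: int

def parse_soa(short_line):
    """Parse one line of `dig +short <zone> soa` output into named fields."""
    mname, rname, *numbers = short_line.split()
    if len(numbers) != 5:
        raise ValueError(f"expected 7 fields, got {len(numbers) + 2}")
    return SOA(mname, rname, *(int(n) for n in numbers))

soa = parse_soa(
    "tk5-ptnr-dc-02.partners.extranet.microsoft.com. msdns.microsoft.com. "
    "309795 900 600 86400 3600"
)
print(soa.mname, soa.serial)
```

The `mname` value is the hostname that resolves to an internal IP in this case.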

Comment 15 CJ Kucera 2012-06-20 14:03:22 UTC
Er, bug 717610, rather.

Comment 16 Adam Tkac 2012-06-20 15:41:50 UTC
(In reply to comment #15)
> Er, bug 717610, rather.

Those bugs are really the same, although they might look different. Both are about handling of lame servers, and the same patch fixes both of them.

Comment 17 CJ Kucera 2012-06-20 16:33:28 UTC
(In reply to comment #16)
> Those bugs are really the same, although they might look different. Both are
> about handling of lame servers, and the same patch fixes both of them.

Oh, doy - right you are.  I should have actually looked at the patch on the other one instead of just reading the summary.  Thanks!

