Bug 677093

Summary: glibc resolver breaks if the reply is too short (lacks authority section?)
Product: [Fedora] Fedora
Component: glibc
Version: 14
Hardware: All
OS: Linux
Status: CLOSED DUPLICATE
Severity: high
Priority: unspecified
Reporter: Pierre Ossman <ossman>
Assignee: Jeff Law <law>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: aaron, fweimer, jakub, schwab
Doc Type: Bug Fix
Last Closed: 2012-01-25 18:01:59 UTC
Attachments:
Packet trace (flags: none)

Description Pierre Ossman 2011-02-13 10:07:56 UTC
Description of problem:
Fedora is currently unable to use the recursive resolvers from Turk Telekom. Many lookups just result in "Name or service not known". Other resolver implementations, such as the ones used by "host" and "dig" and the resolver in Windows 7 (I haven't tried other versions of Windows), work just fine.


Version-Release number of selected component (if applicable):
glibc-2.13-1.x86_64


How reproducible:
Always (but see below).


Steps to Reproduce:
1. telnet www.slashdot.org http
  

Actual results:
telnet: www.slashdot.org: Name or service not known
www.slashdot.org: Host name lookup failure



Expected results:
Trying 216.34.181.48...
Connected to www.slashdot.org.
Escape character is '^]'.
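
A minimal sketch for reproducing this without telnet, assuming Python is available on the affected machine. Python's socket.getaddrinfo() calls the C library's getaddrinfo(), and the "Name or service not known" text is glibc's message for EAI_NONAME, so this should exercise the same lookup path. The function name lookup() and its return shape are illustrative only.

```python
import socket

def lookup(host, port=80, family=socket.AF_UNSPEC):
    """Resolve host via the C library's getaddrinfo().

    Returns (sorted unique addresses, None) on success, or
    ([], error text) if the resolver reports a failure.
    """
    try:
        infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
    except socket.gaierror as exc:
        # exc.errno holds an EAI_* code and exc.strerror the matching text,
        # e.g. "Name or service not known" for EAI_NONAME on glibc.
        return [], exc.strerror
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the textual address.
    return sorted({info[4][0] for info in infos}), None
```

Calling lookup("www.slashdot.org") with the Turk Telekom resolver configured in /etc/resolv.conf should show the failure if the problem is in getaddrinfo() rather than in telnet. Passing family=socket.AF_INET should restrict the lookup to IPv4, which may help separate the A-only case from the combined A and AAAA case.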


Additional info:
I've been doing a lot of digging, and the problem seems to be related to the contents of the resolver response. The weird thing about these resolvers is that they will not include an authority section (except when there's a cache miss, which makes things work now and then).

The lack of an authority section doesn't seem to be the key issue though, as other lookups work fine. What does seem to be key is that the failing lookups are the ones that return only a single A record in the response. If there are CNAMEs or multiple A records, then glibc gladly accepts the response. It would therefore seem to have something to do with the size of the response. A poorly implemented sanity check perhaps?

One thing that confuses the picture, though, is that domains that have AAAA records seem to work. The strange thing here is that glibc sends out two requests, one for A and one for AAAA. The A response is a short one, just like the ones that won't work. But the AAAA response coming just after it has something in it (unlike the broken lookups), and that for some reason makes glibc cope with both records. Some internal merging before validation?

I'm just here for this week, so if you want further testing you'll have to request it in the coming days.

Comment 1 Pierre Ossman 2011-02-13 10:14:42 UTC
I've also verified several times that pointing my machine towards another recursive resolver fixes things.

My resolver at home (via VPN):

~
[drzeus@mjolnir]$ dig www.slashdot.org

; <<>> DiG 9.7.2-P3-RedHat-9.7.2-5.P3.fc14 <<>> www.slashdot.org
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 50199
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 3, ADDITIONAL: 3

;; QUESTION SECTION:
;www.slashdot.org.		IN	A

;; ANSWER SECTION:
www.slashdot.org.	3591	IN	A	216.34.181.48

;; AUTHORITY SECTION:
slashdot.org.		84842	IN	NS	ns-1.ch3.sourceforge.com.
slashdot.org.		84842	IN	NS	ns-2.ch3.sourceforge.com.
slashdot.org.		84842	IN	NS	ns-1.sourceforge.com.

;; ADDITIONAL SECTION:
ns-1.ch3.sourceforge.com. 2042	IN	A	216.34.181.21
ns-1.sourceforge.com.	2042	IN	A	208.122.22.23
ns-2.ch3.sourceforge.com. 2042	IN	A	216.34.181.22

;; Query time: 127 msec
;; SERVER: 10.8.0.1#53(10.8.0.1)
;; WHEN: Sun Feb 13 11:13:13 2011
;; MSG SIZE  rcvd: 174


The Turk Telekom resolver:

~
[drzeus@mjolnir]$ dig www.slashdot.org @195.175.39.39

; <<>> DiG 9.7.2-P3-RedHat-9.7.2-5.P3.fc14 <<>> www.slashdot.org @195.175.39.39
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 23993
;; flags: qr rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;www.slashdot.org.		IN	A

;; ANSWER SECTION:
www.slashdot.org.	2490	IN	A	216.34.181.48

;; Query time: 17 msec
;; SERVER: 195.175.39.39#53(195.175.39.39)
;; WHEN: Sun Feb 13 11:13:31 2011
;; MSG SIZE  rcvd: 50
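
For reference, the 50-byte size of the short reply can be accounted for by hand. The sketch below assumes the standard DNS wire format with the answer's owner name sent as a 2-byte compression pointer back to the question; the function names are illustrative.

```python
HEADER_SIZE = 12  # fixed-size DNS header

def encoded_name_length(name):
    """Bytes for an uncompressed DNS name: a length byte per label plus the root byte."""
    labels = [label for label in name.rstrip(".").split(".") if label]
    return sum(1 + len(label) for label in labels) + 1

def short_reply_size(qname):
    """Size of a reply holding one question and a single A record, nothing else."""
    question = encoded_name_length(qname) + 2 + 2   # name + QTYPE + QCLASS
    # Assumption: the owner name is a 2-byte compression pointer.
    answer = 2 + 2 + 2 + 4 + 2 + 4  # pointer, TYPE, CLASS, TTL, RDLENGTH, IPv4 address
    return HEADER_SIZE + question + answer
```

short_reply_size("www.slashdot.org") gives 12 + 22 + 16 = 50, matching the "MSG SIZE rcvd: 50" line above, so the reply is about as small as a valid single-A-record answer can be.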

Comment 2 Pierre Ossman 2011-02-13 10:19:41 UTC
~
[drzeus@mjolnir]$ dig  version.bind chaos txt @195.175.39.39

; <<>> DiG 9.7.2-P3-RedHat-9.7.2-5.P3.fc14 <<>> version.bind chaos txt @195.175.39.39
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 1101
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 1, ADDITIONAL: 0
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;version.bind.			CH	TXT

;; ANSWER SECTION:
version.bind.		0	CH	TXT	"9.6-ESV-R3"

;; AUTHORITY SECTION:
version.bind.		0	CH	NS	version.bind.

;; Query time: 19 msec
;; SERVER: 195.175.39.39#53(195.175.39.39)
;; WHEN: Sun Feb 13 11:18:48 2011
;; MSG SIZE  rcvd: 67

Comment 3 Pierre Ossman 2011-02-13 10:20:49 UTC
Created attachment 478447
Packet trace

Comment 4 Pierre Ossman 2011-02-13 10:27:26 UTC
More confusion. We tried using Google's DNS instead (8.8.8.8). That gives the same kind of short replies as Turk Telekom's resolvers do. But the problems do not appear there.

The only obvious difference is the lack of the RA flag from Turk Telekom's servers. Seems a bit stupid to omit that as it is quite clearly a recursive lookup. Perhaps that's what's triggering something?
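
To check the flag difference directly in a captured reply (for example a payload taken from the packet trace), here is a small sketch that decodes the header flag word using the bit positions of the standard DNS header. The names FLAG_BITS and header_flags() are illustrative.

```python
import struct

# Header flag bits, listed in the order dig prints them.
FLAG_BITS = [
    ("qr", 0x8000),  # this message is a response
    ("aa", 0x0400),  # authoritative answer
    ("tc", 0x0200),  # truncated
    ("rd", 0x0100),  # recursion desired (copied from the query)
    ("ra", 0x0080),  # recursion available
    ("ad", 0x0020),  # authentic data
    ("cd", 0x0010),  # checking disabled
]

def header_flags(message):
    """Return the names of the flags set in a raw DNS message."""
    if len(message) < 12:
        raise ValueError("shorter than a DNS header")
    _query_id, flags = struct.unpack("!HH", message[:4])
    return [name for name, bit in FLAG_BITS if flags & bit]
```

A reply that dig shows as "flags: qr rd" has the flag word 0x8100, while one shown as "flags: qr rd ra" has 0x8180. The two differ only in the RA bit.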

Comment 5 Andreas Schwab 2011-02-14 12:27:05 UTC
Can't reproduce.

$ dig www.slashdot.org @195.175.39.39

; <<>> DiG 9.7.2-P3-RedHat-9.7.2-5.P3.fc14 <<>> www.slashdot.org @195.175.39.39
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 42925
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 3, ADDITIONAL: 3

;; QUESTION SECTION:
;www.slashdot.org.              IN      A

;; ANSWER SECTION:
www.slashdot.org.       3600    IN      A       216.34.181.48

;; AUTHORITY SECTION:
slashdot.org.           10246   IN      NS      ns-2.ch3.sourceforge.com.
slashdot.org.           10246   IN      NS      ns-1.ch3.sourceforge.com.
slashdot.org.           10246   IN      NS      ns-1.sourceforge.com.

;; ADDITIONAL SECTION:
ns-1.ch3.sourceforge.com. 1546  IN      A       216.34.181.21
ns-1.sourceforge.com.   1546    IN      A       208.122.22.23
ns-2.ch3.sourceforge.com. 1546  IN      A       216.34.181.22

;; Query time: 219 msec
;; SERVER: 195.175.39.39#53(195.175.39.39)
;; WHEN: Mon Feb 14 13:24:22 2011
;; MSG SIZE  rcvd: 174

Comment 6 Pierre Ossman 2011-02-14 19:52:29 UTC
Did you try multiple times? It seems there are several machines hiding behind that IP (5 from what I can tell). I can reproduce it here today from Turkey, as well as from a server in Sweden. Not seeing it from a machine in Boston though... Odd :/
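
Since the answer seems to depend on which machine behind the IP replies, a rough way to sample it is to send the same A query many times and tally the reply shapes. This is only a sketch using raw UDP and the standard library; the function names, attempt count and timeout are arbitrary choices.

```python
import collections
import random
import socket
import struct

def build_a_query(name, query_id):
    """Build a minimal A/IN query with the RD bit set."""
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN

def classify(reply):
    """Summarise a reply as (RA state, answer, authority, additional counts)."""
    _id, flags, _qd, an, ns, ar = struct.unpack("!HHHHHH", reply[:12])
    return ("ra" if flags & 0x0080 else "no-ra", an, ns, ar)

def survey(server, name, attempts=20, timeout=2.0):
    """Send the same query repeatedly and count each kind of reply."""
    tally = collections.Counter()
    for _ in range(attempts):
        query_id = random.randrange(0x10000)
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout)
            sock.sendto(build_a_query(name, query_id), (server, 53))
            try:
                reply, _addr = sock.recvfrom(4096)
            except socket.timeout:
                tally["timeout"] += 1
                continue
        # Ignore anything too short or not matching our query ID.
        if len(reply) < 12 or struct.unpack("!H", reply[:2])[0] != query_id:
            tally["bad reply"] += 1
            continue
        tally[classify(reply)] += 1
    return tally
```

Running survey("195.175.39.39", "www.slashdot.org") from different locations should show whether both reply shapes seen in this report, ("no-ra", 1, 0, 0) and ("ra", 1, 3, 3), come back from the same address.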

Comment 7 Andreas Schwab 2011-02-16 15:29:33 UTC
A nameserver that does not allow recursion is not usable at all.

Comment 8 Pierre Ossman 2011-02-16 17:59:46 UTC
It allows recursion; it just doesn't set that bit in the reply. And since Windows handles these replies just fine, I don't think the whole high horse routine is going to fly. As long as Linux is a niche player, we sometimes need to be bug-for-bug compatible with Redmond.

Comment 9 Fedora Admin XMLRPC Client 2011-11-14 19:43:35 UTC
This package has changed ownership in the Fedora Package Database.  Reassigning to the new owner of this component.

Comment 10 Jeff Law 2012-01-25 18:01:59 UTC

*** This bug has been marked as a duplicate of bug 505505 ***