Bug 516179

Summary:	strange ping address resolving behavior
Product:	[Fedora] Fedora	Reporter:	J. Randall Owens <jrowens.fedora>
Component:	iputils	Assignee:	Jiri Skala <jskala>
Status:	CLOSED NEXTRELEASE	QA Contact:	Fedora Extras Quality Assurance <extras-qa>
Severity:	medium	Docs Contact:
Priority:	low
Version:	11	CC:	aglotov, jskala
Target Milestone:	---
Target Release:	---
Hardware:	i386
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2009-09-09 09:33:05 UTC	Type:	---
Bug Depends On:
Bug Blocks:	195271

Description J. Randall Owens 2009-08-07 08:13:53 UTC

Description of problem:
This has been affecting me off and on for quite some time now, but it's mild enough that I didn't bother to collect enough info for a decent bug report until now.
Fairly often, when I try to ping by hostname, the lookup seems to return the loopback address instead.  For instance:

$ ping linksys0.ghiapet.net.
PING linksys0.ghiapet.net (127.0.0.1) 56(84) bytes of data.
64 bytes from minerva.ghiapet.net (127.0.0.1): icmp_seq=1 ttl=64 time=0.058 ms

The times suggest that it is, indeed, pinging the loopback (actual time to the linksys is typically just over 1 ms).

So, this time, I did a tcpdump on the loopback's port 53 (X's where I don't want you seeing my network addresses):

# tcpdump -i lo -nnvvX -s 0 port 53
tcpdump: listening on lo, link-type EN10MB (Ethernet), capture size 65535 bytes
00:43:52.467480 IP (tos 0x0, ttl 64, id 40065, offset 0, flags [DF], proto UDP (17), length 66) 127.0.0.1.48940 > 127.0.0.1.53: [bad udp cksum 273f!] 58693+ AAAA? linksys0.ghiapet.net. (38)
        0x0000:  4500 0042 9c81 4000 4011 a027 7f00 0001  E..B..@.@..'....
        0x0010:  7f00 0001 bf2c 0035 002e fe41 e545 0100  .....,.5...A.E..
        0x0020:  0001 0000 0000 0000 086c 696e 6b73 7973  .........linksys
        0x0030:  3007 6768 6961 7065 7403 6e65 7400 001c  0.ghiapet.net...
        0x0040:  0001                                     ..
00:43:52.467732 IP (tos 0x0, ttl 64, id 45882, offset 0, flags [none], proto UDP (17), length 110) 127.0.0.1.53 > 127.0.0.1.48940: [bad udp cksum 75a3!] 58693* q: AAAA? linksys0.ghiapet.net. 0/1/0 ns: ghiapet.net. SOA ns1.ghiapet.net. noc.ghiapet.net. 2009080601 3600 1200 604800 600 (82)
        0x0000:  4500 006e b33a 0000 4011 c942 7f00 0001  E..n.:..@..B....
        0x0010:  7f00 0001 0035 bf2c 005a fe6d e545 8580  .....5.,.Z.m.E..
        0x0020:  0001 0000 0001 0000 086c 696e 6b73 7973  .........linksys
        0x0030:  3007 6768 6961 7065 7403 6e65 7400 001c  0.ghiapet.net...
        0x0040:  0001 c015 0006 0001 0000 0258 0020 036e  ...........X...n
        0x0050:  7331 c015 036e 6f63 c015 77c0 2319 0000  s1...noc..w.#...
        0x0060:  0e10 0000 04b0 0009 3a80 0000 0258       ........:....X
00:43:52.469323 IP (tos 0x0, ttl 64, id 40067, offset 0, flags [DF], proto UDP (17), length 66) 127.0.0.1.35911 > 127.0.0.1.53: [bad udp cksum 8715!] 16870+ A? linksys0.ghiapet.net. (38)
        0x0000:  4500 0042 9c83 4000 4011 a025 7f00 0001  E..B..@.@..%....
        0x0010:  7f00 0001 8c47 0035 002e fe41 41e6 0100  .....G.5...AA...
        0x0020:  0001 0000 0000 0000 086c 696e 6b73 7973  .........linksys
        0x0030:  3007 6768 6961 7065 7403 6e65 7400 0001  0.ghiapet.net...
        0x0040:  0001                                     ..
00:43:52.469572 IP (tos 0x0, ttl 64, id 45883, offset 0, flags [none], proto UDP (17), length 206) 127.0.0.1.53 > 127.0.0.1.35911: [bad udp cksum cf6d!] 16870* q: A? linksys0.ghiapet.net. 1/2/4 linksys0.ghiapet.net. A 10.XXX.XXX.2 ns: ghiapet.net. NS ns1.ghiapet.net., ghiapet.net. NS ns2.ghiapet.net. ar: ns1.ghiapet.net. A 127.0.0.1, ns1.ghiapet.net. AAAA ::1, ns2.ghiapet.net. A 127.0.0.1, ns2.ghiapet.net. AAAA ::1 (178)
        0x0000:  4500 00ce b33b 0000 4011 c8e1 7f00 0001  E....;..@.......
        0x0010:  7f00 0001 0035 8c47 00ba fecd 41e6 8580  .....5.G....A...
        0x0020:  0001 0001 0002 0004 086c 696e 6b73 7973  .........linksys
        0x0030:  3007 6768 6961 7065 7403 6e65 7400 0001  0.ghiapet.net...
        0x0040:  0001 c00c 0001 0001 0000 0e10 0004 0aXX  ................
        0x0050:  XX02 c015 0002 0001 0001 5180 0006 036e  ..........Q....n
        0x0060:  7331 c015 c015 0002 0001 0001 5180 0006  s1..........Q...
        0x0070:  036e 7332 c015 c042 0001 0001 0000 0e10  .ns2...B........
        0x0080:  0004 7f00 0001 c042 001c 0001 0000 0e10  .......B........
        0x0090:  0010 0000 0000 0000 0000 0000 0000 0000  ................
        0x00a0:  0001 c054 0001 0001 0000 0e10 0004 7f00  ...T............
        0x00b0:  0001 c054 001c 0001 0000 0e10 0010 0000  ...T............
        0x00c0:  0000 0000 0000 0000 0000 0000 0001       ..............
^C
4 packets captured
8 packets received by filter
0 packets dropped by kernel

Perhaps ping is taking the A record(s) from the additional section and using that, for some reason?  I don't know.  Oh, the redacted bits are the correct address.

Version-Release number of selected component (if applicable):
iputils-20071127-8.fc11.i586
bind-9.6.1-4.P1.fc11.i586
plus -chroot, -devel, -libs, -utils all with same VR
bind-dyndb-ldap-0.1.0-0.2.a1.fc11.i586
glibc-2.10.1-2.i686

Additional info:
As I mentioned, this has been going on occasionally for a while now, since at least F9, maybe F8.  But back then, it was even worse, because Firefox would occasionally try to go to 127.0.0.1 instead of some usual external website.  That hasn't happened in quite a while, though.

I'm quite open to the possibility that this is an underlying issue in the glibc resolver (assuming that's what ping uses), or something messed up with my own BIND setup.  But from what I know of DNS format, those datagrams seem to be conveying the correct information in the correct manner.

Comment 1 J. Randall Owens 2009-08-07 08:21:42 UTC

Oh, I forgot to include my resolv.conf (minus comments):

search ghiapet.net dyn.ghiapet.net
sortlist 10.XXX.XXX.0/22
options timeout:3 inet6
nameserver 127.0.0.1


As you can see, I put the dot at the end so the search domains shouldn't enter into it at all (indeed, earlier I used just 'linksys0', and got a six-packet exchange; I used FQDN to shorten the output to paste here).  And this behavior long precedes my use of the sortlist.

Comment 2 J. Randall Owens 2009-08-12 23:07:52 UTC

My memory's been refreshed now, that this also happens with ncftp and traceroute, but not telnet (I think).

I have another interesting example. When I `ping moon.linux-ipv6.org.`, a host with both IPv4 and IPv6 addresses, it starts pinging 32.1.2.0. I discovered that the first four bytes of the host's IPv6 address in the response packet, 2001:200:0:1003:207:e9ff:fe04:9924, come to 32.1.2.0 when written as decimal octets (x20, x01, x02, x00).
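As a quick check of that arithmetic, here is a minimal Python sketch (Python is only my choice for illustration; the function name is made up). It packs the IPv6 address quoted above and renders its first four bytes as if they were an IPv4 address.

```python
import socket

# IPv6 address of moon.linux-ipv6.org as quoted in this comment.
IPV6_TEXT = "2001:200:0:1003:207:e9ff:fe04:9924"


def first_four_bytes_as_ipv4(ipv6_text):
    """Pack an IPv6 address and render its first four bytes as a dotted quad."""
    packed = socket.inet_pton(socket.AF_INET6, ipv6_text)  # 16 bytes, network order
    return socket.inet_ntoa(packed[:4])  # what a blind 4-byte copy would see


if __name__ == "__main__":
    print(first_four_bytes_as_ipv4(IPV6_TEXT))
```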

I also notice in the example above that even though it's an IPv4-only ping, the first packet is requesting an AAAA record instead of an A record, as its second-to-last pair of octets, x001c, shows (and then it acts shocked when it gets one!). The second packet, in reply, contains only SOA information and no answer. The third packet is another query, this time for an A record.
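To double-check that reading of the first packet, here is a rough Python sketch that decodes the question section from the DNS payload in the hex dump (the bytes after the 20-byte IP header and 8-byte UDP header). The function and table names are my own, and it assumes an uncompressed question name, which is what a query carries.

```python
import struct

# DNS payload of the first captured packet, copied from the hex dump.
QUERY = bytes.fromhex(
    "e545 0100 0001 0000 0000 0000 086c 696e 6b73 7973"
    "3007 6768 6961 7065 7403 6e65 7400 001c 0001"
)

QTYPE_NAMES = {1: "A", 28: "AAAA"}  # only the two types seen in this capture


def parse_question(msg):
    """Return (id, name, qtype, qclass) for the first question in a DNS message."""
    ident = struct.unpack_from("!H", msg, 0)[0]
    offset = 12  # the fixed DNS header is 12 bytes long
    labels = []
    while msg[offset] != 0:  # a zero-length label ends the name
        length = msg[offset]
        labels.append(msg[offset + 1:offset + 1 + length].decode("ascii"))
        offset += 1 + length
    offset += 1  # skip the terminating zero byte
    qtype, qclass = struct.unpack_from("!2H", msg, offset)
    return ident, ".".join(labels) + ".", qtype, qclass


if __name__ == "__main__":
    ident, name, qtype, qclass = parse_question(QUERY)
    print(ident, name, QTYPE_NAMES.get(qtype, qtype), qclass)
```

The decoded ID matches the 58693 that tcpdump printed for this query, and query type 28 (x001c) is AAAA.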

The fourth packet, the should-be-final response, returns an A record (the part with the Xed out octets), class IN, using message compression to represent linksys0.ghiapet.net. with that xc00c. Then there are a pair of NS records, also using the compression, starting with xc015 to represent the shorter part of the domain, and also ending with the same, to shorten the nameservers' names. The xc042 represents ns1, and xc054 represents ns2. It returns an A and an AAAA record for each of these (and there's absolutely no reason not to; just because the ping is IPv4 only doesn't mean its resolver is), seen by the x0001 and x001c after the shortened forms. Then, after a bit of other stuff (x0e10 is a TTL of 1 hour, x0004 is the IPv4 record length, x0010 is the IPv6 record length), come the additional NS addresses themselves in each of those RRs.
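For anyone who wants to follow the compression pointers by hand, here is a small Python sketch. It uses only the first 40 bytes of the fourth packet's DNS payload (header, question, and the answer's xc00c name pointer), so none of the redacted octets are needed. The helper name and the jump limit are my own choices, not anything from the resolver.

```python
import struct

# First 40 bytes of the fourth packet's DNS payload, copied from the hex dump:
# 12-byte header, the question, then the answer's name pointer xc00c.
RESPONSE_PREFIX = bytes.fromhex(
    "41e6 8580 0001 0001 0002 0004 086c 696e 6b73 7973"
    "3007 6768 6961 7065 7403 6e65 7400 0001 0001 c00c"
)


def read_name(msg, offset, max_jumps=16):
    """Decode a possibly compressed domain name starting at offset."""
    labels = []
    jumps = 0
    while True:
        length = msg[offset]
        if length & 0xC0 == 0xC0:  # top two bits set: compression pointer
            offset = struct.unpack_from("!H", msg, offset)[0] & 0x3FFF
            jumps += 1
            if jumps > max_jumps:  # guard against pointer loops
                raise ValueError("too many compression pointers")
            continue
        if length == 0:  # root label ends the name
            break
        labels.append(msg[offset + 1:offset + 1 + length].decode("ascii"))
        offset += 1 + length
    return ".".join(labels) + "."


if __name__ == "__main__":
    print(read_name(RESPONSE_PREFIX, 38))               # name at the xc00c pointer
    print(read_name(RESPONSE_PREFIX, 0xC015 & 0x3FFF))  # target of xc015
```

Masking off the top two bits gives the target offset: xc00c points at byte 12, the start of the question name, and xc015 points at byte 21, where the ghiapet label begins.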

So apparently, in that case, it grabs an IPv4 address from one of the additional records, rather than treating the first four octets of an AAAA record as the IPv4 address it wants, as it did with moon.linux-ipv6.org. These may actually be two separate bugs, neither of which is really in ping.

Comment 3 J. Randall Owens 2009-08-12 23:13:47 UTC

Actually, no, traceroute isn't one that this happens to.  So, so far, ping and ncftp definitely get it, traceroute and telnet seem clean.  Firefox too.

Comment 4 J. Randall Owens 2009-08-12 23:17:09 UTC

$ foreach i ( /bin/ping /usr/bin/ncftp /bin/traceroute /usr/bin/telnet )
foreach? echo $i
foreach? ldd $i | sort
foreach? end
/bin/ping
        libc.so.6 => /lib/libc.so.6 (0x00c06000)
        libidn.so.11 => /lib/libidn.so.11 (0x04162000)
        /lib/ld-linux.so.2 (0x00be2000)
        linux-gate.so.1 =>  (0x00e27000)
/usr/bin/ncftp
        libc.so.6 => /lib/libc.so.6 (0x00c06000)
        /lib/ld-linux.so.2 (0x00be2000)
        libresolv.so.2 => /lib/libresolv.so.2 (0x00631000)
        linux-gate.so.1 =>  (0x005c1000)
/bin/traceroute
        libc.so.6 => /lib/libc.so.6 (0x00c06000)
        /lib/ld-linux.so.2 (0x00be2000)
        libm.so.6 => /lib/libm.so.6 (0x00d79000)
        linux-gate.so.1 =>  (0x0014c000)
/usr/bin/telnet
        libc.so.6 => /lib/libc.so.6 (0x00322000)
        libdl.so.2 => /lib/libdl.so.2 (0x001b1000)
        /lib/ld-linux.so.2 (0x00c77000)
        libncurses.so.5 => /lib/libncurses.so.5 (0x00843000)
        libtinfo.so.5 => /lib/libtinfo.so.5 (0x001e6000)
        libutil.so.1 => /lib/libutil.so.1 (0x00ddd000)
        linux-gate.so.1 =>  (0x00321000)

So, ping uses libidn and ncftp uses libresolv; traceroute seems to do the resolution itself, or else finds something in libc it can use; and telnet, I guess, uses one of libdl, libtinfo, or libutil, or else does the resolution itself.

Comment 5 J. Randall Owens 2009-08-13 13:31:29 UTC

OK, I don't really know my C very well, but digging around in source and man pages, it looks like ping (but not ping6) is using the deprecated gethostbyname() function in ping.c line 261, and then doing a memcpy() of the first four bytes of the h_addr member of the returned struct, which is basically an alias for the first item in h_addr_list.  This definitely explains the 32.1.2.0 behaviour.

I don't think this quite entirely explains the additional-records 127.0.0.1 bug.  But I bet if gethostbyname() were replaced by something more current, it'd go away.  (And should they be using memcpy() that way, without validation?  Like I said, I don't know my C.)
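On the validation question, here is a rough Python sketch of the kind of check I have in mind. HostEntry is a hypothetical stand-in for the h_addrtype, h_length, and h_addr_list members of C's struct hostent; it is not a real Python API, and this is not the code in ping.c.

```python
import socket
from dataclasses import dataclass


@dataclass
class HostEntry:
    """Hypothetical stand-in for the struct hostent fields used here."""
    addrtype: int    # like h_addrtype
    length: int      # like h_length
    addr_list: list  # like h_addr_list, packed binary addresses


def checked_ipv4(entry):
    """Return the first address as a dotted quad, refusing non-IPv4 results."""
    if entry.addrtype != socket.AF_INET or entry.length != 4:
        raise ValueError("resolver did not return a 4-byte AF_INET address")
    if not entry.addr_list:
        raise ValueError("resolver returned no addresses")
    return socket.inet_ntoa(entry.addr_list[0])


if __name__ == "__main__":
    v4 = HostEntry(socket.AF_INET, 4, [socket.inet_aton("127.0.0.1")])
    print(checked_ipv4(v4))
    v6 = HostEntry(socket.AF_INET6, 16, [socket.inet_pton(socket.AF_INET6, "::1")])
    try:
        checked_ipv4(v6)
    except ValueError as exc:
        print("rejected:", exc)
```

With a check like this, a 16-byte AF_INET6 result would be rejected instead of being silently truncated to its first four bytes.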

I also notice that a quick grep of the iputils source shows that arping, clockdiff, rarpd, tracepath, and traceroute6 (but not traceroute, a totally separate package) all use gethostbyname(), too, and probably have much the same issues.

Comment 6 J. Randall Owens 2009-08-15 23:23:30 UTC

Seems like it was the "options inet6" in resolv.conf that caused this.  I'm not sure what the status or component should be at this point.  With that option removed, pinging those hosts seems to work as expected now.  (ncftp working normally again, too.)

I do think these utilities could handle a case of RES_USE_INET6 more gracefully, though.
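For anyone wanting to check whether their own setup has this option, here is a minimal Python sketch that scans resolv.conf text for it. The function names are made up, the sample text is the resolv.conf from comment 1, and it assumes comment lines start with # or ;.

```python
# resolv.conf contents from comment 1 (addresses redacted as in the original).
RESOLV_CONF = """\
search ghiapet.net dyn.ghiapet.net
sortlist 10.XXX.XXX.0/22
options timeout:3 inet6
nameserver 127.0.0.1
"""


def resolver_options(text):
    """Collect every word that follows an 'options' keyword."""
    options = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped[0] in "#;":  # skip blanks and comments
            continue
        fields = stripped.split()
        if fields[0] == "options":
            options.extend(fields[1:])
    return options


def uses_inet6(text):
    """True if the inet6 resolver option is set in the given text."""
    return "inet6" in resolver_options(text)


if __name__ == "__main__":
    print(resolver_options(RESOLV_CONF))
    print(uses_inet6(RESOLV_CONF))
```

To run it against a live system, pass in the contents of /etc/resolv.conf instead of the sample text.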

Comment 7 Jiri Skala 2009-09-09 09:33:05 UTC

Hi,
you are right. Replacing gethostbyname by getaddrinfo fixes the problem.

Jiri
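To illustrate the idea behind that fix (this is not the actual iputils patch, which is in C), here is a minimal Python sketch of a getaddrinfo()-style lookup that asks for IPv4 results only. The function name is hypothetical; the point is that the caller states the address family it wants and checks it on each result, rather than copying four bytes from whatever comes back.

```python
import socket


def resolve_ipv4(host):
    """Return the IPv4 addresses for host, requesting AF_INET explicitly."""
    infos = socket.getaddrinfo(
        host, None, family=socket.AF_INET, type=socket.SOCK_DGRAM
    )
    addresses = []
    for family, _socktype, _proto, _canonname, sockaddr in infos:
        # Each result carries its own family, so it can be verified before use.
        if family == socket.AF_INET and sockaddr[0] not in addresses:
            addresses.append(sockaddr[0])
    return addresses


if __name__ == "__main__":
    print(resolve_ipv4("127.0.0.1"))
```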