Bug 841787
Summary: rotate option in resolv.conf causes lookup failures
Product: Red Hat Enterprise Linux 6
Reporter: Dennis Holtlund <dennis>
Component: glibc
Assignee: Jeff Law <law>
Status: CLOSED ERRATA
QA Contact: qe-baseos-tools-bugs
Severity: high
Priority: urgent
Version: 6.3
CC: arnaud.gomes, benjamin.parmentier, bugzilla.redhat.com.dev, cowan.ml, ddevaraj, dev, don, fweimer, gdzien, igeorgex, inode0, james.brown, jan.iven, john.horne, joshua, jrhett, jwest, kevin, klaus.steinberger, kyle, mfranc, mishu, ml, mpolacek, nalayil, n.beernink, nenad, nick, pasteur, pfrankli, rdassen, redhat-bugzilla, redhat-bugzilla, redhat, t.h.amundsen, thomas.oulevey, toddr, toracat, ubellavance, wnefal+redhatbugzilla
Target Milestone: rc
Keywords: ZStream
Hardware: x86_64
OS: Linux
Doc Type: Bug Fix
Doc Text:
Prior to this update, glibc incorrectly handled the "options rotate" option in
the /etc/resolv.conf file when this file also contained one or more IPv6 name
servers. Consequently, DNS queries could unexpectedly fail, particularly when
multiple queries were issued by a single process. This update fixes
internalization of the listed servers from /etc/resolv.conf into glibc's
internal structures, as well as the sorting and rotation of those structures to
implement the "options rotate" capability. Now, DNS names are resolved correctly
in glibc in the described scenario.
Last Closed: 2013-02-21 07:05:23 UTC
Type: Bug
Bug Blocks: 843571
Description
Dennis Holtlund
2012-07-20 08:59:31 UTC
Can you please send me a copy of your resolv.conf?

We're also seeing this on the rhel6 hosts in fedora infrastructure.

    search phx2.fedoraproject.org vpn.fedoraproject.org fedoraproject.org
    nameserver 10.5.126.21
    nameserver 10.5.126.22
    options rotate timeout:1

If I remove the last line, it works.

Easy replicator case:

    python
    import socket
    socket.gethostbyname_ex('somehostname_that_should_resolve')

will traceback with: socket.herror

Strange failure mode... rather than all kinds of random things failing we consistently experienced problems with just a couple different things, like a couple specific pages of a web app being unable to connect to the database, while the rest of the pages, calling the same db connect subroutine, were fine, and one specific nagios check (check_smtp).

Doing any one of:
- commenting the options line
- adding a static hosts file entry
- yum downgrade-ing the updated glibc packages
got things working again.

The code which implements this stuff in glibc is a bit of a mess (to put it mildly). It seems that anytime someone changes it, something breaks. I'm under a mountain of time critical things right now, but expect to be able to dive into this early next week.

(In reply to comment #5)
> strange failure mode... rather than all kinds of random things failing we
> consistently experienced problems with just a couple different things, like
> a couple specific pages of a web app being unable to connect to the
> database, while the rest of the pages, calling the same db connect
> subroutine, were fine, and one specific nagios check (check_smtp).

It only affects processes that do more than two name lookups. You won't see it with, say, "getent hosts $domain".

A simple test to repeatedly look up a particular domain:

    $ perl -e 'print defined gethostbyname("example.com") ? "Y" : "." foreach 1..20; print "\n"'

This prints a "Y" if the lookup was successful, a "." otherwise.

With one nameserver defined in resolv.conf, the output is:

    YY..................

With two nameservers:

    YYY.YY.YY.YY.YY.YY.Y

With three nameservers:

    YYYYYYYYYYYYYYYYYYYY

There's a pretty good chance I know what this is and either Patsy or myself should have something ready to go Monday.

Created attachment 599551 [details]
Fix handling of nameserver addresses
I thought I'd have a quick look into the problem. Please consider the attached patch. I have tested it against Fedora's glibc-2.15-51.fc17. It also applies cleanly on RHEL's glibc-1.80.el6_3.3.
This patch makes a few small changes. When parsing resolv.conf, the IPv4 nameserver addresses are packed to the front of statp->nsaddr_list, rather than gaps being left when IPv6 addresses are read. The logic in __libc_res_nsend is reverted back to how it was originally, except that EXT(statp).nscount tracks the number of IPv4 nameservers only. Since the addresses are now packed in statp->nsaddr_list, the values in the map array are always 0 through to one less than the number of IPv4 addresses, so this index is used when copying each address from statp->nsaddr_list to EXT(statp).nsaddrs.
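For illustration only, the difference between leaving gaps and packing the IPv4 entries can be sketched with a toy Python model. This is not the glibc code: the function names, the sample addresses (taken from the documentation ranges) and the use of None for an unused slot are all assumptions made for the example. Only MAXNS = 3 mirrors the limit in glibc's resolv.h.

```python
import ipaddress

MAXNS = 3  # glibc's resolv.h caps the nameserver list at three entries


def is_ipv4(addr):
    """Return True when addr parses as an IPv4 address."""
    return ipaddress.ip_address(addr).version == 4


def layout_with_gaps(servers):
    """Toy model: IPv4 entries sit at their overall position in resolv.conf,
    so every position taken by an IPv6 server leaves an empty slot."""
    slots = [None] * MAXNS
    for pos, addr in enumerate(servers[:MAXNS]):
        if is_ipv4(addr):
            slots[pos] = addr
    return slots


def layout_packed(servers):
    """Toy model: IPv4 entries are packed to the front of the list."""
    v4 = [addr for addr in servers[:MAXNS] if is_ipv4(addr)]
    return v4 + [None] * (MAXNS - len(v4))


def first_n_ipv4(slots, servers):
    """Read slots 0..n-1, where n is the number of IPv4 servers, as a loop
    bounded by an IPv4-only count would do."""
    n = sum(1 for addr in servers[:MAXNS] if is_ipv4(addr))
    return slots[:n]


servers = ["2001:db8::1", "192.0.2.1", "192.0.2.2"]
print(first_n_ipv4(layout_with_gaps(servers), servers))  # [None, '192.0.2.1']
print(first_n_ipv4(layout_packed(servers), servers))     # ['192.0.2.1', '192.0.2.2']
```

With gaps, a loop that walks the first two slots reads an empty slot and never reaches the second IPv4 server. With packing, the same loop sees exactly the IPv4 servers.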
I have tested this with a single IPv4 or IPv6 nameserver, with multiple nameservers, with a mix of IPv4 and IPv6 nameservers, and done the whole lot both with and without "options rotate" enabled. It seems to work correctly in all cases now.
(In reply to comment #9)
> It also applies
> cleanly on RHEL's glibc-1.80.el6_3.3.

That would be glibc-2.12-1.80.el6_3.3, of course. :-)

Michael, the last hunk in res_send.c is really the one that's important. It just barely missed the cut for RHEL 6.3. I need to do some further testing, particularly with the other changes that have occurred in this code since we originally looked at this problem.

*** Bug 771204 has been marked as a duplicate of this bug. ***

Hi, same problem on our side. For information, if NSCD is running, the problem does not appear:

    [root@test1 ~]# cat /etc/resolv.conf
    nameserver 192.168.22.100
    nameserver 192.168.22.110
    search priv.truc.fr
    domain priv.truc.fr
    options timeout:1
    options rotate
    [root@test1 ~]# /etc/init.d/nscd status
    nscd is stopped
    [root@test1 ~]# perl -e 'print defined gethostbyname("example.com") ? "Y" : "." foreach 1..20; print "\n"'
    YYY.Y.Y.Y.Y.Y.Y.Y.Y.
    [root@test1 ~]# /etc/init.d/nscd start
    Starting nscd: [ OK ]
    [root@test1 ~]# perl -e 'print defined gethostbyname("example.com") ? "Y" : "." foreach 1..20; print "\n"'
    YYYYYYYYYYYYYYYYYYYY

As above, removing "option rotate" also works.

Michael, I can't see how your patch can be correct, particularly WRT packing the V4 addresses at the front of the nsaddr_list array. Given a resolv.conf with nameservers in the following order

    V6
    V6
    V4

It seems to me the code in res_init.c will start by placing the V6 servers into slots nsaddr_list[0] and nsaddr_list[1]. However, NSERV will still be zero. Thus when we encounter the V4 address, we overwrite the info at nsaddr_list[0]. The net result is yes, the V4 addresses are packed first, but we've lost a V6 address in the process. Am I missing something?

We have a problem with the new resolver too which I am pretty sure is the same piece of code. Logging into systems configured to use hesiod results in hesiod resolution failing in pam, preventing logins. Single hesiod lookups still work but multiple lookups fail.

The easiest way to see the error is probably
* Add /etc/hesiod.conf pointing to some hesiod server
* Add relevant hesiod bits to /etc/nsswitch.conf

    $ getent passwd userA userB

fails with the new glibc but works correctly with previous versions of glibc.

A further note, the current code (ie my changes) can't be right either in that it will ignore some entries from resolv.conf. As I mentioned in c#6, this code is a mess. I'm still looking at it.

(In reply to comment #15)
> Given a resolv.conf with nameservers in the following order
> V6
> V6
> V4
>
> It seems to me the code in res_init.c will start by placing the V6 servers
> into slots nsaddr_list[0] and nsaddr_list[1].

The IPv6 codepath in res_init.c places IPv6 addresses directly into statp->_u._ext.nsaddrs, bypassing statp->nsaddr_list altogether:

        if ((*cp != '\0') && (*cp != '\n')
            && __inet_aton(cp, &a)) {
            statp->nsaddr_list[nserv].sin_addr = a;
            statp->nsaddr_list[nserv].sin_family = AF_INET;
            statp->nsaddr_list[nserv].sin_port =
                htons(NAMESERVER_PORT);
            nserv++;
    #ifdef _LIBC
            nservall++;
        } else {
            struct in6_addr a6;
            ...
            if ((*cp != '\0') &&
                (inet_pton(AF_INET6, cp, &a6) > 0)) {
                struct sockaddr_in6 *sa6;
                ...
                statp->_u._ext.nsaddrs[nservall] = sa6;
                statp->_u._ext.nssocks[nservall] = -1;
                statp->_u._ext.nsmap[nservall] = MAXNS + 1;
                nservall++;
            }
        }
    #endif
    }

The severity and priority show as unspecified, but this bug clearly breaks systems with options rotate on. I'm unsure of the protocol on this. Should these be set?

At this point it won't really make a difference; it's already OK'd for inclusion into 6.4 and it's just a matter of wrapping up what the final patch should look like. In terms of priorities, it's #1 on the list of glibc issues.

Should I open a new bug for the hesiod issue then? I don't want logins to be broken until 6.4 very much.

It's most likely the same underlying problem and once confirmed would be closed as a duplicate and linked to this bug. As to whether or not a fix will be released prior to RHEL 6.4, that's something that is primarily driven by customer reports. Thus, if you are a customer, please engage your support contacts to start the accelerated bugfix process if that's something you need.

Ok, thanks. I have a ticket open with GSS about this already. I'm happy to test when there is something promising.

Excellent. Do you have a ticket #? If so I can link it to this BZ.

#00681024 - I have already added a link to this bz inside the ticket. If you have some other way to link things go right ahead please.

Linking from BZ to the ticket has some value. Typically it's done by the GSS folks either when they identify an existing BZ or when they open one. I'll save them the step.

Michael,

Yea, my bad, the IPV6 servers aren't stored in the same place as the IPV4 servers.

I'm currently looking at pulling out all the recent changes and just fixing the memcpy call site. I really thought I had tested that for 804630 and found it insufficient and had concluded that the state of MAP/NSMAP was bogus after the second loop. However, retesting that tonight shows otherwise. I've got some more tests to run and want to look at the state of those arrays again under the debugger before going forward with that approach.

Sorry for all the problems. I know it's been a bit of a nightmare for everyone.

Created attachment 600399 [details]
Trivial tests for a few nameserver issues
(In reply to comment #27)
> I'm currently looking at pulling out all the recent changes and just fixing
> the memcpy call site.

I haven't tested it to be sure, but I'm still not sure that this is sufficient. Here is my reasoning:

If statp->nsaddr_list[ns] is being copied into an EXT(statp).nsaddrs slot, then statp->nsaddr_list[ns] must be a valid v4 address. That third loop iterates ns from 0 to EXT(statp).nscount - 1, which means the first EXT(statp).nscount elements of statp->nsaddr_list must be v4 addresses. But res_init doesn't guarantee this: if you have a mix of v6 and v4 addresses, there may be "gaps" in statp->nsaddr_list.

Michael, I see your point. I actually cobbled up another test by iterating through the combinations of IPV6 and IPV4 servers at different positions in /etc/resolv.conf and can still trigger some failures. So I'm still "on it" :-)

In addition to lookup failures, it seems that "option rotate" causes almost random appends of domain names written in "search" clause to actual DNS lookup names.

That's the bug which started us down this path :-)

Created attachment 600680 [details]
Tests for various options rotate issues
Add many more resolv.conf variants to prior tests.
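The attachment itself is not reproduced in this report. As a rough sketch of the idea of iterating IPv4 and IPv6 servers through every position, with and without "options rotate", the variants could be generated along these lines. The function name, the labels and the addresses (documentation ranges from RFC 5737 and RFC 3849) are hypothetical and are not taken from the attached tests.

```python
import itertools

# Documentation-range addresses, chosen purely for illustration.
V4 = ["192.0.2.1", "192.0.2.2", "192.0.2.3"]
V6 = ["2001:db8::1", "2001:db8::2", "2001:db8::3"]


def resolv_conf_variants(max_servers=3):
    """Yield (label, text) for every IPv4/IPv6 ordering of one to
    max_servers nameservers, each with and without "options rotate"."""
    for count in range(1, max_servers + 1):
        for families in itertools.product("46", repeat=count):
            for rotate in (False, True):
                v4, v6 = iter(V4), iter(V6)
                lines = [
                    "nameserver " + (next(v4) if fam == "4" else next(v6))
                    for fam in families
                ]
                if rotate:
                    lines.append("options rotate")
                label = "".join(families) + ("-rotate" if rotate else "")
                yield label, "\n".join(lines) + "\n"


variants = dict(resolv_conf_variants())
print(len(variants))                      # 28 variants in total
print(variants["664-rotate"], end="")     # the V6, V6, V4 ordering with rotate
```

Each generated text could be written to a scratch resolv.conf and followed by a repeated-lookup check such as the perl one-liner earlier in this report.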
Michael,

Your change to pack the IPV4 servers at the front of the array and change how the count is tracked resolved the single issue that remained after backing out all the recent changes to this code and fixing the memcpy argument.

Fresh builds are spinning with that update.

(In reply to comment #41)
> Fresh builds are spinning with that update.

Does that include new Fedora packages as well? I can see a few recent glibc packages in Koji, but I don't think they include this last update.

For rawhide, glibc-2.16-6.fc18. Patsy has an update in progress for F17 (glibc-2.15-54.fc17). Leave karma to get it moving :-)

https://admin.fedoraproject.org/updates/glibc-2.15-54.fc17

(In reply to comment #43)
> Patsy has an update in progress for F17 (glibc-2.15-54.fc17). Leave karma
> to get it moving :-)

I would, except that package doesn't have the complete fix. :-)

Since this BZ is specifically for RHEL, should I open a new bug to track the problem in Fedora?

Fedora has the complete fix now (-55 build) :-) In the mad rush to take care of things before taking a few days off, I forgot to pull the updated fix into Fedora. No need to open a new report.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-0279.html