Bug 841787 - rotate option in resolv.conf causes lookup failures
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: glibc
Version: 6.3
Hardware: x86_64 Linux
Priority: urgent
Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Jeff Law
QA Contact: qe-baseos-tools
Keywords: ZStream
Duplicates: 771204
Depends On:
Blocks: 843571
Reported: 2012-07-20 04:59 EDT by Dennis Holtlund
Modified: 2016-01-31 07:04 EST (History)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Prior to this update, glibc incorrectly handled the "options rotate" option in the /etc/resolv.conf file when this file also contained one or more IPv6 name servers. Consequently, DNS queries could unexpectedly fail, particularly when multiple queries were issued by a single process. This update fixes internalization of the listed servers from /etc/resolv.conf into glibc's internal structures, as well as the sorting and rotation of those structures to implement the "options rotate" capability. Now, DNS names are resolved correctly in glibc in the described scenario.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-02-21 02:05:23 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:


Attachments
Fix handling of nameserver addresses (2.76 KB, patch)
2012-07-21 19:11 EDT, Michael Chapman
Trivial tests for a few nameserver issues (2.40 KB, application/x-gzip)
2012-07-25 17:18 EDT, Jeff Law
Tests for various options rotate issues (8.62 KB, application/x-gzip)
2012-07-27 00:12 EDT, Jeff Law


External Trackers
Tracker ID Priority Status Summary Last Updated
Sourceware 13028 None None None 2016-01-31 07:04 EST
Red Hat Knowledge Base (Solution) 171483 None None None 2012-07-31 10:02:12 EDT

Description Dennis Holtlund 2012-07-20 04:59:31 EDT
Description of problem:

Since updating glibc to glibc-2.12-1.80.el6_3.3.x86_64 on some servers yesterday we have had problems with name resolution. All servers use IPv4 only.

The first indication of the problem was cfengine mailing about failures.
cf bhost: Couldn't look up address v6 for : Temporary failure in name resolution
cf bhost: Id-authentication for bhost.bar.baz failed

Yum can't resolve any repos, neither local nor from RHN:
Loaded plugins: rhnplugin
There was an error communicating with RHN.
RHN channel support will be disabled.
Error communicating with server. The message was:
Unable to connect to the host and port specified
http://foo.bar.baz/dists/XYZ/6Server/x86_64/repodata/repomd.xml: [Errno 14] PYCURL ERROR 6 - "Couldn't resolve host 'foo.bar.baz'"
Trying other mirror.
Error: Cannot retrieve repository metadata (repomd.xml) for repository: XYZ. Please verify its path and try again


A simple way to confirm the problem is that "w" shows only the IP address in the FROM field instead of the FQDN. This example was taken after removing the rotate option from resolv.conf.
USER     TTY      FROM              LOGIN@   IDLE   JCPU   PCPU WHAT
root     pts/1    XXX.YYY.XYX.YX    10:44    1:36  27.45s 27.24s /usr/bin/python
root     pts/0    foo.bar.baz       11:02    1.00s  0.28s  0.23s w

OpenSSH also notices a problem. From /var/log/messages:
Jul 20 10:44:05 bhost sshd[13598]: reverse mapping checking getaddrinfo for foo.bar.baz [XXX.YYY.XYX.YX] failed - POSSIBLE BREAK-IN ATTEMPT!

Version-Release number of selected component (if applicable):
glibc-2.12-1.80.el6_3.3.x86_64

How reproducible:
Some commands always fail with the rotate option configured in resolv.conf.

Steps to Reproduce:
1. add "options rotate" to /etc/resolv.conf
2. run "yum clean all; yum check-update"
  
Actual results:
Checking for updates fails because the hostnames can't be resolved.

Expected results:
Yum should report what, if any, updates are available.

Additional info:
The workaround is simple, but having a working rotate option can be very important in some cases. Without it, a failing DNS server will always be queried first.
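
To illustrate why a working rotate option matters, here is a toy Python model (not glibc code) of which server gets tried first. The function name query_order and the placeholder addresses are hypothetical; the round-robin behaviour is taken from the resolv.conf(5) description of "rotate", and the way the starting point advances here is a simplifying assumption.

```python
def query_order(servers, query_index, rotate):
    """Return the order in which servers would be tried for one query.

    Toy model only: without rotation every query starts at the first
    listed server; with rotation the starting point advances round-robin.
    """
    if not rotate or not servers:
        return list(servers)
    start = query_index % len(servers)
    return list(servers[start:]) + list(servers[:start])


servers = ["192.0.2.1", "192.0.2.2"]  # placeholder documentation addresses
for i in range(3):
    # Left: order without rotate. Right: order with rotate.
    print(i, query_order(servers, i, rotate=False), query_order(servers, i, rotate=True))
```

Without rotation the first listed server heads every list, so if it is the failing one, each query pays its timeout before moving on.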
Comment 2 Jeff Law 2012-07-20 12:17:17 EDT
Can you please send me a copy of your resolv.conf?
Comment 3 seth vidal 2012-07-20 12:42:39 EDT
We're also seeing this on the rhel6 hosts in fedora infrastructure.
Comment 4 seth vidal 2012-07-20 12:54:41 EDT
search phx2.fedoraproject.org vpn.fedoraproject.org fedoraproject.org
nameserver 10.5.126.21
nameserver 10.5.126.22
options rotate timeout:1


If I remove the last line, it works.

Easy replicator case:

python
import socket
socket.gethostbyname_ex('somehostname_that_should_resolve')

will traceback with:

socket.herror
Comment 5 cowan.ml 2012-07-20 16:00:03 EDT
strange failure mode... rather than all kinds of random things failing we consistently experienced problems with just a couple different things, like a couple specific pages of a web app being unable to connect to the database, while the rest of the pages, calling the same db connect subroutine, were fine, and one specific nagios check (check_smtp).

doing any one of:
 - commenting the options line
 - adding a static hosts file entry
 - yum downgrade-ing the updated glibc packages
got things working again
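
For anyone who wants to script a variant of the first workaround, the following is a minimal Python sketch, not a tested tool. It assumes it is acceptable to drop only the "rotate" keyword and keep other options such as timeout:1, which is narrower than commenting out the whole options line. The function name drop_rotate is hypothetical, and the sketch works on text only; writing the result back to /etc/resolv.conf is left out on purpose.

```python
def drop_rotate(resolv_conf_text):
    """Return resolv.conf text with the `rotate` option removed.

    Other options on the same line (for example timeout:1) are kept;
    an options line that only contained `rotate` is dropped entirely.
    """
    out = []
    for line in resolv_conf_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "options":
            kept = [f for f in fields[1:] if f != "rotate"]
            if not kept:
                continue  # nothing left on this options line
            line = "options " + " ".join(kept)
        out.append(line)
    return "".join(entry + "\n" for entry in out)


sample = "nameserver 10.5.126.21\nnameserver 10.5.126.22\noptions rotate timeout:1\n"
print(drop_rotate(sample), end="")
```

The sample text is the resolv.conf quoted in comment 4; nameserver, search and comment lines pass through unchanged.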
Comment 6 Jeff Law 2012-07-20 16:03:04 EDT
The code which implements this stuff in glibc is a bit of a mess (to put it mildly).  It seems that anytime someone changes it, something breaks.

I'm under a mountain of time critical things right now, but expect to be able to dive into this early next week.
Comment 7 Michael Chapman 2012-07-21 06:36:23 EDT
(In reply to comment #5)
> strange failure mode... rather than all kinds of random things failing we
> consistently experienced problems with just a couple different things, like
> a couple specific pages of a web app being unable to connect to the
> database, while the rest of the pages, calling the same db connect
> subroutine, were fine, and one specific nagios check (check_smtp).

It only affects processes that do more than two name lookups. You won't see it with, say, "getent hosts $domain".

A simple test to repeatedly look up a particular domain:

$ perl -e 'print defined gethostbyname("example.com") ? "Y" : "." foreach 1..20; print "\n"'

This prints a "Y" if the lookup was successful, a "." otherwise.

With one nameserver defined in resolv.conf, the output is:

YY..................

With two nameservers:

YYY.YY.YY.YY.YY.YY.Y

With three nameservers:

YYYYYYYYYYYYYYYYYYYY
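
The same kind of probe can be sketched in Python. This is only an illustration of the test above, assuming Python 3; the function name probe and its resolver parameter are hypothetical, the latter added so the loop can be exercised without touching the network.

```python
import socket


def probe(hostname, attempts=20, resolver=socket.gethostbyname):
    """Resolve `hostname` repeatedly within this one process.

    Returns a string with "Y" per successful lookup and "." per failure,
    matching the output format of the perl one-liner.
    """
    marks = []
    for _ in range(attempts):
        try:
            resolver(hostname)
            marks.append("Y")
        except OSError:  # socket.gaierror and socket.herror are subclasses
            marks.append(".")
    return "".join(marks)
```

Running print(probe("example.com")) repeats the lookup inside a single process, which is the condition under which the failures were reported; a separate process per lookup would not show them.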
Comment 8 Jeff Law 2012-07-21 14:04:17 EDT
There's a pretty good chance I know what this is and either Patsy or myself should have something ready to go Monday.
Comment 9 Michael Chapman 2012-07-21 19:11:48 EDT
Created attachment 599551 [details]
Fix handling of nameserver addresses

I thought I'd have a quick look into the problem. Please consider the attached patch. I have tested it against Fedora's glibc-2.15-51.fc17. It also applies cleanly on RHEL's glibc-1.80.el6_3.3.

This patch makes a few small changes. When parsing resolv.conf, the IPv4 nameserver addresses are packed to the front of statp->nsaddr_list, rather than gaps being left when IPv6 addresses are read. The logic in __libc_res_nsend is reverted back to how it was originally, except that EXT(statp).nscount tracks the number of IPv4 nameservers only. Since the addresses are now packed in statp->nsaddr_list, the values in the map array are always 0 through to one less than the number of IPv4 addresses, so this index is used when copying each address from statp->nsaddr_list to EXT(statp).nsaddrs.

I have tested this with a single IPv4 or IPv6 nameserver, with multiple nameservers, with a mix of IPv4 and IPv6 nameservers, and done the whole lot both with and without "options rotate" enabled. It seems to work correctly in all cases now.
Comment 10 Michael Chapman 2012-07-21 19:13:35 EDT
(In reply to comment #9)
>     It also applies
> cleanly on RHEL's glibc-1.80.el6_3.3.

That would be glibc-2.12-1.80.el6_3.3, of course. :-)
Comment 12 Jeff Law 2012-07-23 13:57:14 EDT
Michael, the last hunk in res_send.c is really the one that's important.  It just barely missed the cut for RHEL 6.3.  I need to do some further testing, particularly with the other changes that have occurred in this code since we originally looked at this problem.
Comment 13 Jeff Law 2012-07-23 15:44:13 EDT
*** Bug 771204 has been marked as a duplicate of this bug. ***
Comment 14 Jérôme Loyet 2012-07-24 09:56:29 EDT
Hi,

Same problem on our side. For information, if nscd is running, the problem does not appear:

[root@test1 ~]# cat /etc/resolv.conf
nameserver 192.168.22.100
nameserver 192.168.22.110
search priv.truc.fr
domain priv.truc.fr
options timeout:1
options rotate

[root@test1 ~]# /etc/init.d/nscd status
nscd is stopped

[root@test1 ~]# perl -e 'print defined gethostbyname("example.com") ? "Y" : "." foreach 1..20; print "\n"'
YYY.Y.Y.Y.Y.Y.Y.Y.Y.

[root@test1 ~]# /etc/init.d/nscd start
Starting nscd: [  OK  ]

[root@test1 ~]# perl -e 'print defined gethostbyname("example.com") ? "Y" : "." foreach 1..20; print "\n"'
YYYYYYYYYYYYYYYYYYYY


As above, removing "options rotate" also works.
Comment 15 Jeff Law 2012-07-24 13:26:00 EDT
Michael,

I can't see how your patch can be correct, particularly WRT packing the V4 addresses at the front of the nsaddr_list array.

Given a resolv.conf with nameservers in the following order
V6
V6
V4

It seems to me the code in res_init.c will start by placing the V6 servers into slots nsaddr_list[0] and nsaddr_list[1].  However, NSERV will still be zero.  Thus when we encounter the V4 address, we overwrite the info at nsaddr_list[0].  The net result is yes, the V4 addresses are packed first, but we've lost a V6 address in the process.

Am I missing something?
Comment 16 John T. Rose 2012-07-24 13:31:22 EDT
We have a problem with the new resolver too which I am pretty sure is the same piece of code. Logging into systems configured to use hesiod results in hesiod resolution failing in pam preventing logins.

Single hesiod lookups still work but multiple lookups fail. The easiest way to see the error is probably

* Add /etc/hesiod.conf pointing to some hesiod server
* Add relevant hesiod bits to /etc/nsswitch.conf

$ getent passwd userA userB

fails with the new glibc but works correctly with previous versions of glibc.
Comment 17 Jeff Law 2012-07-24 16:21:24 EDT
A further note, the current code (ie my changes) can't be right either in that it will ignore some entries from resolv.conf.

As I mentioned in c#6, this code is a mess.  I'm still looking at it.
Comment 18 Michael Chapman 2012-07-24 18:54:08 EDT
(In reply to comment #15)
> Given a resolv.conf with nameservers in the following order
> V6
> V6
> V4
> 
> It seems to me the code in res_init.c will start by placing the V6 servers
> into slots nsaddr_list[0] and nsaddr_list[1].

The IPv6 codepath in res_init.c places IPv6 addresses directly into statp->_u._ext.nsaddrs, bypassing statp->nsaddr_list altogether:

    if ((*cp != '\0') && (*cp != '\n')
        && __inet_aton(cp, &a)) {
        statp->nsaddr_list[nserv].sin_addr = a;
        statp->nsaddr_list[nserv].sin_family = AF_INET;
        statp->nsaddr_list[nserv].sin_port =
                htons(NAMESERVER_PORT);
        nserv++;
#ifdef _LIBC
        nservall++;
    } else {
        struct in6_addr a6;
    ...
        if ((*cp != '\0') &&
            (inet_pton(AF_INET6, cp, &a6) > 0)) {
            struct sockaddr_in6 *sa6;
    ...
                statp->_u._ext.nsaddrs[nservall] = sa6;
                statp->_u._ext.nssocks[nservall] = -1;
                statp->_u._ext.nsmap[nservall] = MAXNS + 1;
                nservall++;
            }
        }
#endif
    }
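
The bookkeeping in the quoted excerpt can be mimicked with a small Python model. It is a sketch of the quoted lines only, not of the whole parser: the elided parts are not reconstructed, the cap on the total number of servers is an assumption, and the function name store_nameservers and the placeholder addresses are hypothetical. MAXNS is 3 in glibc's <resolv.h>.

```python
import ipaddress

MAXNS = 3  # value of MAXNS in glibc's <resolv.h>


def store_nameservers(addresses):
    """Mimic the two-counter bookkeeping of the quoted excerpt (toy model).

    IPv4 addresses go to nsaddr_list[nserv]; IPv6 addresses go to the
    extended nsaddrs[nservall] with nsmap[nservall] = MAXNS + 1.
    """
    nsaddr_list = [None] * MAXNS  # IPv4-only array
    ext_nsaddrs = [None] * MAXNS  # extended array, filled here for IPv6 only
    ext_nsmap = [None] * MAXNS
    nserv = nservall = 0
    for text in addresses:
        if nservall >= MAXNS:  # assumed cap, not shown in the excerpt
            break
        if ipaddress.ip_address(text).version == 4:
            nsaddr_list[nserv] = text
            nserv += 1
        else:
            ext_nsaddrs[nservall] = text
            ext_nsmap[nservall] = MAXNS + 1
        nservall += 1
    return nsaddr_list, ext_nsaddrs, ext_nsmap, nserv, nservall


# The V6, V6, V4 order from comment 15.
print(store_nameservers(["2001:db8::1", "2001:db8::2", "192.0.2.1"]))
```

In this model the V4 address lands in nsaddr_list[0] while both V6 addresses stay in the extended array, so nothing is overwritten.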
Comment 19 Todd Rinaldo 2012-07-24 19:50:25 EDT
The severity and priority show as unspecified, but this bug clearly breaks systems with options rotate on. I'm unsure of the protocol on this. Should these be set?
Comment 20 Jeff Law 2012-07-24 22:32:02 EDT
At this point it won't really make a difference; it's already OK'd for inclusion into 6.4 and it's just a matter of wrapping up what the final patch should look like.  In terms of priorities, it's #1 on the list of glibc issues.
Comment 21 John T. Rose 2012-07-24 22:35:28 EDT
Should I open a new bug for the hesiod issue then? I very much don't want logins to be broken until 6.4.
Comment 22 Jeff Law 2012-07-24 22:43:19 EDT
It's most likely the same underlying problem and once confirmed would be closed as a duplicate and linked to this bug.

As to whether or not a fix will be released prior to RHEL 6.4, that's something that is primarily driven by customer reports.  Thus, if you are a customer, please engage your support contacts to start the accelerated bugfix process if that's something you need.
Comment 23 John T. Rose 2012-07-24 22:45:58 EDT
Ok, thanks. I have a ticket open with GSS about this already. I'm happy to test when there is something promising.
Comment 24 Jeff Law 2012-07-24 22:47:02 EDT
Excellent.  Do you have a ticket #?  If so I can link it to this BZ.
Comment 25 John T. Rose 2012-07-24 22:57:52 EDT
#00681024 - I have already added a link to this bz inside the ticket. If you have some other way to link things go right ahead please.
Comment 26 Jeff Law 2012-07-25 02:43:52 EDT
Linking from BZ to the ticket has some value.  Typically it's done by the GSS folks either when they identify an existing BZ or when they open one.  I'll save them the step.
Comment 27 Jeff Law 2012-07-25 03:04:48 EDT
Michael,
Yea, my bad, the IPV6 servers aren't stored in the same place as the IPV4 servers.

I'm currently looking at pulling out all the recent changes and just fixing the memcpy call site.  I really thought I had tested that for 804630 and found it insufficient and had concluded that the state of MAP/NSMAP was bogus after the second loop.   However, retesting that tonight shows otherwise.  I've got some more tests to run and want to look at the state of those arrays again under the debugger before going forward with that approach.

Sorry for all the problems.  I know it's been a bit of a nightmare for everyone.
Comment 31 Jeff Law 2012-07-25 17:18:06 EDT
Created attachment 600399 [details]
Trivial tests for a few nameserver issues
Comment 34 Michael Chapman 2012-07-26 03:28:06 EDT
(In reply to comment #27)
> I'm currently looking at pulling out all the recent changes and just fixing
> the memcpy call site.

I haven't tested it, but I'm still not convinced that this is sufficient.

Here is my reasoning:

If statp->nsaddr_list[ns] is being copied into an EXT(statp).nsaddrs slot, then statp->nsaddr_list[ns] must be a valid v4 address. That third loop iterates ns from 0 to EXT(statp).nscount - 1, which means the first EXT(statp).nscount elements of statp->nsaddr_list must be v4 addresses.

But res_init doesn't guarantee this: if you have a mix of v6 and v4 addresses, there may be "gaps" in statp->nsaddr_list.
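
The reasoning above can be checked with a toy Python model. It is a sketch under stated assumptions, not glibc code: the "gapped" layout is inferred from the wording about gaps, nscount is assumed to count IPv4 servers only (as in the patch description in comment 9), and all function names are hypothetical.

```python
def nsaddr_list_with_gaps(kinds):
    """IPv4 entries stay at their resolv.conf position; IPv6 leaves a gap.

    This layout is an assumption inferred from the "gaps" wording above.
    """
    return [kind if kind == "v4" else None for kind in kinds]


def nsaddr_list_packed(kinds):
    """IPv4 entries are packed to the front, as the attached patch describes."""
    return [kind for kind in kinds if kind == "v4"]


def third_loop_is_safe(nsaddr_list, nscount):
    """True if slots 0 .. nscount-1 all hold valid IPv4 entries."""
    return all(
        ns < len(nsaddr_list) and nsaddr_list[ns] == "v4"
        for ns in range(nscount)
    )


order = ["v6", "v4"]         # nameserver order in a hypothetical resolv.conf
nscount = order.count("v4")  # assumed: nscount tracks IPv4 servers only
print(third_loop_is_safe(nsaddr_list_with_gaps(order), nscount))
print(third_loop_is_safe(nsaddr_list_packed(order), nscount))
```

For this order the gapped layout prints False, because slot 0 is a gap, and the packed layout prints True.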
Comment 37 Jeff Law 2012-07-26 17:09:33 EDT
Michael,

I see your point.   I actually cobbled up another test by iterating through the combinations of IPV6 and IPV4 servers at different positions in /etc/resolv.conf and can still trigger some failures.  So I'm still "on it" :-)
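
A rough idea of what iterating through such combinations could look like, as a Python sketch. This is not the attached test suite: the generator name resolv_conf_variants, the labels, and the placeholder addresses (documentation ranges) are all hypothetical, and the limit of three servers is an assumption based on MAXNS.

```python
from itertools import product

V4 = ["192.0.2.1", "192.0.2.2", "192.0.2.3"]        # placeholder addresses
V6 = ["2001:db8::1", "2001:db8::2", "2001:db8::3"]  # placeholder addresses


def resolv_conf_variants(max_servers=3):
    """Yield (label, text) for every v4/v6 ordering, with and without rotate."""
    for count in range(1, max_servers + 1):
        for kinds in product("46", repeat=count):
            lines = []
            for position, kind in enumerate(kinds):
                pool = V4 if kind == "4" else V6
                lines.append("nameserver " + pool[position])
            for rotate in (False, True):
                extra = ["options rotate"] if rotate else []
                text = "\n".join(lines + extra) + "\n"
                label = "".join(kinds) + ("-rotate" if rotate else "")
                yield label, text


variants = dict(resolv_conf_variants())
print(len(variants))
print(variants["64-rotate"], end="")
```

Each generated text could then be installed on a test machine and probed with repeated lookups from a single process; that step is deliberately left out here.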
Comment 38 Nenad Opsenica 2012-07-26 17:47:39 EDT
In addition to lookup failures, it seems that "options rotate" causes domain names from the "search" clause to be appended almost at random to the names actually being looked up.
Comment 39 Jeff Law 2012-07-26 17:50:20 EDT
That's the bug which started us down this path  :-)
Comment 40 Jeff Law 2012-07-27 00:12:05 EDT
Created attachment 600680 [details]
Tests for various options rotate issues

Add many more resolv.conf variants to prior tests.
Comment 41 Jeff Law 2012-07-27 00:15:12 EDT
Michael,

Your change to pack the IPV4 servers at the front of the array and change how the count is tracked resolved the single issue that remained after backing out all the recent changes to this code and fixing the memcpy argument.   Fresh builds are spinning with that update.
Comment 42 Michael Chapman 2012-08-01 02:39:41 EDT
(In reply to comment #41)
>    Fresh builds are spinning with that update.

Does that include new Fedora packages as well?

I can see a few recent glibc packages in Koji, but I don't think they include this last update.
Comment 43 Jeff Law 2012-08-03 12:40:50 EDT
For rawhide, glibc-2.16-6.fc18.

Patsy has an update in progress for F17 (glibc-2.15-54.fc17).  Leave karma to get it moving :-)

https://admin.fedoraproject.org/updates/glibc-2.15-54.fc17
Comment 44 Michael Chapman 2012-08-05 04:29:39 EDT
(In reply to comment #43)
> Patsy has an update in progress for F17 (glibc-2.15-54.fc17).  Leave karma
> to get it moving :-)

I would, except that package doesn't have the complete fix. :-)

Since this BZ is specifically for RHEL, should I open a new bug to track the problem in Fedora?
Comment 45 Jeff Law 2012-08-06 13:57:36 EDT
Fedora has the complete fix now (-55 build) :-)  In the mad rush to take care of things before taking a few days off, I forgot to pull the updated fix into Fedora.

No need to open a new report.
Comment 52 errata-xmlrpc 2013-02-21 02:05:23 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-0279.html
