841787 – rotate option in resolv.conf causes lookup failures

Summary: rotate option in resolv.conf causes lookup failures

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 6
Classification:	Red Hat
Component:	glibc
Sub Component:
Version:	6.3
Hardware:	x86_64
OS:	Linux
Priority:	urgent
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Jeff Law
QA Contact:	qe-baseos-tools-bugs
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	771204
Depends On:
Blocks:	843571

Reported:	2012-07-20 08:59 UTC by Dennis Holtlund
Modified:	2018-12-01 18:18 UTC
CC List:	40 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Prior to this update, glibc incorrectly handled the "options rotate" option in the /etc/resolv.conf file when this file also contained one or more IPv6 name servers. Consequently, DNS queries could unexpectedly fail, particularly when multiple queries were issued by a single process. This update fixes internalization of the listed servers from /etc/resolv.conf into glibc's internal structures, as well as the sorting and rotation of those structures to implement the "options rotate" capability. Now, DNS names are resolved correctly in glibc in the described scenario.
Clone Of:
Environment:
Last Closed:	2013-02-21 07:05:23 UTC
Target Upstream Version:
Embargoed:

Attachments
Fix handling of nameserver addresses (2.76 KB, patch) 2012-07-21 23:11 UTC, Michael Chapman
Trivial tests for a few nameserver issues (2.40 KB, application/x-gzip) 2012-07-25 21:18 UTC, Jeff Law
Tests for various options rotate issues (8.62 KB, application/x-gzip) 2012-07-27 04:12 UTC, Jeff Law

Links
System	ID	Priority	Status	Summary	Last Updated
Red Hat Knowledge Base (Solution)	171483	None	None	None	2012-07-31 14:02:12 UTC
Red Hat Product Errata	RHBA-2013:0279	normal	SHIPPED_LIVE	glibc bug fix and enhancement update	2013-02-20 20:37:06 UTC
Sourceware	13028	None	None	None	2019-04-01 14:49:31 UTC

Description Dennis Holtlund 2012-07-20 08:59:31 UTC

Description of problem:

Since updating glibc to glibc-2.12-1.80.el6_3.3.x86_64 on some servers yesterday we have had problems with name resolution. All servers use IPv4 only.

The first indication of the problem was cfengine mailing about failures.
cf bhost: Couldn't look up address v6 for : Temporary failure in name resolution
cf bhost: Id-authentication for bhost.bar.baz failed

Yum can't resolve any repos, neither local nor from RHN:
Loaded plugins: rhnplugin
There was an error communicating with RHN.
RHN channel support will be disabled.
Error communicating with server. The message was:
Unable to connect to the host and port specified
http://foo.bar.baz/dists/XYZ/6Server/x86_64/repodata/repomd.xml: [Errno 14] PYCURL ERROR 6 - "Couldn't resolve host 'foo.bar.baz'"
Trying other mirror.
Error: Cannot retrieve repository metadata (repomd.xml) for repository: XYZ. Please verify its path and try again

A simple way to confirm the problem is that "w" shows only the IP address in the FROM field instead of the FQDN. This example was taken after removing the rotate option from resolv.conf.
USER     TTY      FROM              LOGIN@   IDLE    JCPU    PCPU  WHAT
root     pts/1    XXX.YYY.XYX.YX    10:44    1:36    27.45s  27.24s /usr/bin/python
root     pts/0    foo.bar.baz       11:02    1.00s   0.28s   0.23s  w

OpenSSH also notices a problem. From /var/log/messages:
Jul 20 10:44:05 bhost sshd[13598]: reverse mapping checking getaddrinfo for foo.bar.baz [XXX.YYY.XYX.YX] failed - POSSIBLE BREAK-IN ATTEMPT!

Version-Release number of selected component (if applicable):
glibc-2.12-1.80.el6_3.3.x86_64

How reproducible:
Some commands always fail with the rotate option configured in resolv.conf.

Steps to Reproduce:
1. add "options rotate" to /etc/resolv.conf
2. run "yum clean all; yum check-update"

Actual results:
Checking for updates fails because the hostnames can't be resolved.

Expected results:
Yum should report what, if any, updates are available.

Additional info:
The workaround is simple, but having a working rotate option can be very important in some cases. Without it, a failing DNS server will always be queried first.
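
To illustrate why a working rotate option matters, here is a minimal Python sketch of the round-robin idea behind "options rotate". It is a simplified model and not glibc's implementation: the function name query_order, the placeholder addresses, and the assumption that the starting server simply advances by one per query are all made up for illustration.

```python
def query_order(servers, query_number, rotate):
    """Return the order in which servers would be tried for one query.

    Simplified model (an assumption, not glibc code): without rotation
    every query starts at the first server; with rotation the starting
    point advances by one for each successive query.
    """
    if not servers:
        return []
    start = query_number % len(servers) if rotate else 0
    return servers[start:] + servers[:start]


# Placeholder addresses from the documentation range.
servers = ["192.0.2.1", "192.0.2.2"]
for n in range(3):
    print(n, query_order(servers, n, rotate=False), query_order(servers, n, rotate=True))
```

In this model, a dead first server is tried first on every query when rotate is off, but only on every other query when rotate is on and two servers are configured.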

Comment 2 Jeff Law 2012-07-20 16:17:17 UTC

Can you please send me a copy of your resolv.conf?

Comment 3 seth vidal 2012-07-20 16:42:39 UTC

We're also seeing this on the rhel6 hosts in fedora infrastructure.

Comment 4 seth vidal 2012-07-20 16:54:41 UTC

search phx2.fedoraproject.org vpn.fedoraproject.org fedoraproject.org
nameserver 10.5.126.21
nameserver 10.5.126.22
options rotate timeout:1


If I remove the last line, it works.

Easy replicator case:

python
import socket
socket.gethostbyname_ex('somehostname_that_should_resolve')

will traceback with:

socket.herror

Comment 5 cowan.ml 2012-07-20 20:00:03 UTC

Strange failure mode: rather than all kinds of random things failing, we consistently experienced problems with just a couple of different things, such as a couple of specific pages of a web app being unable to connect to the database (while the rest of the pages, calling the same db connect subroutine, were fine) and one specific nagios check (check_smtp).

Doing any one of the following got things working again:
 - commenting out the options line
 - adding a static hosts file entry
 - downgrading the updated glibc packages with yum downgrade

Comment 6 Jeff Law 2012-07-20 20:03:04 UTC

The code which implements this stuff in glibc is a bit of a mess (to put it mildly).  It seems that anytime someone changes it, something breaks.

I'm under a mountain of time critical things right now, but expect to be able to dive into this early next week.

Comment 7 Michael Chapman 2012-07-21 10:36:23 UTC

(In reply to comment #5)
> strange failure mode... rather than all kinds of random things failing we
> consistently experienced problems with just a couple different things, like
> a couple specific pages of a web app being unable to connect to the
> database, while the rest of the pages, calling the same db connect
> subroutine, were fine, and one specific nagios check (check_smtp).

It only affects processes that do more than two name lookups. You won't see it with, say, "getent hosts $domain".

A simple test to repeatedly look up a particular domain:

$ perl -e 'print defined gethostbyname("example.com") ? "Y" : "." foreach 1..20; print "\n"'

This prints a "Y" if the lookup was successful, a "." otherwise.

With one nameserver defined in resolv.conf, the output is:

YY..................

With two nameservers:

YYY.YY.YY.YY.YY.YY.Y

With three nameservers:

YYYYYYYYYYYYYYYYYYYY
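
For hosts without Perl, the same check can be sketched in Python. This is an illustrative equivalent of the one-liner above, not something posted in this report; the function name lookup_pattern and its resolver parameter are assumptions, added so the loop can be exercised with a stand-in resolver.

```python
import socket


def lookup_pattern(hostname, attempts=20, resolver=socket.gethostbyname):
    """Resolve hostname repeatedly within a single process.

    Returns a string with "Y" for each successful lookup and "." for
    each failed one, mirroring the Perl one-liner.
    """
    marks = []
    for _ in range(attempts):
        try:
            resolver(hostname)
            marks.append("Y")
        except OSError:
            # socket.gaierror and socket.herror are subclasses of OSError.
            marks.append(".")
    return "".join(marks)
```

Running print(lookup_pattern("example.com")) should produce a row of "Y" and "." characters comparable to the Perl output. All lookups must happen in one process, since the failure only shows up after repeated lookups.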

Comment 8 Jeff Law 2012-07-21 18:04:17 UTC

There's a pretty good chance I know what this is and either Patsy or myself should have something ready to go Monday.

Comment 9 Michael Chapman 2012-07-21 23:11:48 UTC

Created attachment 599551 [details]
Fix handling of nameserver addresses

I thought I'd have a quick look into the problem. Please consider the attached patch. I have tested it against Fedora's glibc-2.15-51.fc17. It also applies cleanly on RHEL's glibc-1.80.el6_3.3.

This patch makes a few small changes. When parsing resolv.conf, the IPv4 nameserver addresses are packed to the front of statp->nsaddr_list, rather than gaps being left when IPv6 addresses are read. The logic in __libc_res_nsend is reverted back to how it was originally, except that EXT(statp).nscount tracks the number of IPv4 nameservers only. Since the addresses are now packed in statp->nsaddr_list, the values in the map array are always 0 through to one less than the number of IPv4 addresses, so this index is used when copying each address from statp->nsaddr_list to EXT(statp).nsaddrs.

I have tested this with a single IPv4 or IPv6 nameserver, with multiple nameservers, with a mix of IPv4 and IPv6 nameservers, and done the whole lot both with and without "options rotate" enabled. It seems to work correctly in all cases now.

Comment 10 Michael Chapman 2012-07-21 23:13:35 UTC

(In reply to comment #9)
>     It also applies
> cleanly on RHEL's glibc-1.80.el6_3.3.

That would be glibc-2.12-1.80.el6_3.3, of course. :-)

Comment 12 Jeff Law 2012-07-23 17:57:14 UTC

Michael, the last hunk in res_send.c is really the one that's important.  It just barely missed the cut for RHEL 6.3.  I need to do some further testing, particularly with the other changes that have occurred in this code since we originally looked at this problem.

Comment 13 Jeff Law 2012-07-23 19:44:13 UTC

*** Bug 771204 has been marked as a duplicate of this bug. ***

Comment 14 Jérôme Loyet 2012-07-24 13:56:29 UTC

Hi,

Same problem on our side. For information, if nscd is running, the problem does not appear:

[root@test1 ~]# cat /etc/resolv.conf
nameserver 192.168.22.100
nameserver 192.168.22.110
search priv.truc.fr
domain priv.truc.fr
options timeout:1
options rotate

[root@test1 ~]# /etc/init.d/nscd status
nscd is stopped

[root@test1 ~]# perl -e 'print defined gethostbyname("example.com") ? "Y" : "." foreach 1..20; print "\n"'
YYY.Y.Y.Y.Y.Y.Y.Y.Y.

[root@test1 ~]# /etc/init.d/nscd start
Starting nscd: [  OK  ]

[root@test1 ~]# perl -e 'print defined gethostbyname("example.com") ? "Y" : "." foreach 1..20; print "\n"'
YYYYYYYYYYYYYYYYYYYY


As above, removing "options rotate" also works.

Comment 15 Jeff Law 2012-07-24 17:26:00 UTC

Michael,

I can't see how your patch can be correct, particularly WRT packing the V4 addresses at the front of the nsaddr_list array.

Given a resolv.conf with nameservers in the following order
V6
V6
V4

It seems to me the code in res_init.c will start by placing the V6 servers into slots nsaddr_list[0] and nsaddr_list[1].  However, NSERV will still be zero.  Thus when we encounter the V4 address, we overwrite the info at nsaddr_list[0].  The net result is yes, the V4 addresses are packed first, but we've lost a V6 address in the process.

Am I missing something?

Comment 16 John T. Rose 2012-07-24 17:31:22 UTC

We have a problem with the new resolver too which I am pretty sure is the same piece of code. Logging into systems configured to use hesiod results in hesiod resolution failing in pam preventing logins.

Single hesiod lookups still work but multiple lookups fail. The easiest way to see the error is probably

* Add /etc/hesiod.conf pointing to some hesiod server
* Add relevant hesiod bits to /etc/nsswitch.conf

$ getent passwd userA userB

fails with the new glibc but works correctly with previous versions of glibc.

Comment 17 Jeff Law 2012-07-24 20:21:24 UTC

A further note, the current code (ie my changes) can't be right either in that it will ignore some entries from resolv.conf.

As I mentioned in c#6, this code is a mess.  I'm still looking at it.

Comment 18 Michael Chapman 2012-07-24 22:54:08 UTC

(In reply to comment #15)
> Given a resolv.conf with nameservers in the following order
> V6
> V6
> V4
> 
> It seems to me the code in res_init.c will start by placing the V6 servers
> into slots nsaddr_list[0] and nsaddr_list[1].

The IPv6 codepath in res_init.c places IPv6 addresses directly into statp->_u._ext.nsaddrs, bypassing statp->nsaddr_list altogether:

    if ((*cp != '\0') && (*cp != '\n')
        && __inet_aton(cp, &a)) {
        statp->nsaddr_list[nserv].sin_addr = a;
        statp->nsaddr_list[nserv].sin_family = AF_INET;
        statp->nsaddr_list[nserv].sin_port =
                htons(NAMESERVER_PORT);
        nserv++;
#ifdef _LIBC
        nservall++;
    } else {
        struct in6_addr a6;
    ...
        if ((*cp != '\0') &&
            (inet_pton(AF_INET6, cp, &a6) > 0)) {
            struct sockaddr_in6 *sa6;
    ...
                statp->_u._ext.nsaddrs[nservall] = sa6;
                statp->_u._ext.nssocks[nservall] = -1;
                statp->_u._ext.nsmap[nservall] = MAXNS + 1;
                nservall++;
            }
        }
#endif
    }

Comment 19 Todd Rinaldo 2012-07-24 23:50:25 UTC

The severity and priority show as unspecified, but this bug clearly breaks systems with options rotate on. I'm unsure of the protocol on this. Should these be set?

Comment 20 Jeff Law 2012-07-25 02:32:02 UTC

At this point it won't really make a difference; it's already OK'd for inclusion into 6.4 and it's just a matter of wrapping up what the final patch should look like.  In terms of priorities, it's #1 on the list of glibc issues.

Comment 21 John T. Rose 2012-07-25 02:35:28 UTC

Should I open a new bug for the hesiod issue then? I very much don't want logins to be broken until 6.4.

Comment 22 Jeff Law 2012-07-25 02:43:19 UTC

It's most likely the same underlying problem and once confirmed would be closed as a duplicate and linked to this bug.

As to whether or not a fix will be released prior to RHEL 6.4, that's something that is primarily driven by customer reports.  Thus, if you are a customer, please engage your support contacts to start the accelerated bugfix process if that's something you need.

Comment 23 John T. Rose 2012-07-25 02:45:58 UTC

Ok, thanks. I have a ticket open with GSS about this already. I'm happy to test when there is something promising.

Comment 24 Jeff Law 2012-07-25 02:47:02 UTC

Excellent.  Do you have a ticket #?  If so I can link it to this BZ.

Comment 25 John T. Rose 2012-07-25 02:57:52 UTC

#00681024 - I have already added a link to this bz inside the ticket. If you have some other way to link things go right ahead please.

Comment 26 Jeff Law 2012-07-25 06:43:52 UTC

Linking from BZ to the ticket has some value.  Typically it's done by the GSS folks either when they identify an existing BZ or when they open one.  I'll save them the step.

Comment 27 Jeff Law 2012-07-25 07:04:48 UTC

Michael,
Yea, my bad, the IPV6 servers aren't stored in the same place as the IPV4 servers.

I'm currently looking at pulling out all the recent changes and just fixing the memcpy call site.  I really thought I had tested that for 804630 and found it insufficient and had concluded that the state of MAP/NSMAP was bogus after the second loop.   However, retesting that tonight shows otherwise.  I've got some more tests to run and want to look at the state of those arrays again under the debugger before going forward with that approach.

Sorry for all the problems.  I know it's been a bit of a nightmare for everyone.

Comment 31 Jeff Law 2012-07-25 21:18:06 UTC

Created attachment 600399 [details]
Trivial tests for a few nameserver issues

Comment 34 Michael Chapman 2012-07-26 07:28:06 UTC

(In reply to comment #27)
> I'm currently looking at pulling out all the recent changes and just fixing
> the memcpy call site.

I haven't tested it to be sure, but I'm still not sure that this is sufficient.

Here is my reasoning:

If statp->nsaddr_list[ns] is being copied into an EXT(statp).nsaddrs slot, then statp->nsaddr_list[ns] must be a valid v4 address. That third loop iterates ns from 0 to EXT(statp).nscount - 1, which means the first EXT(statp).nscount elements of statp->nsaddr_list must be v4 addresses.

But res_init doesn't guarantee this: if you have a mix of v6 and v4 addresses, there may be "gaps" in statp->nsaddr_list.
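
The gap problem can be made concrete with a small Python model. This is only a sketch of the array layout as described in this thread, not glibc source: the function names are hypothetical, the addresses are placeholders, and MAXNS = 3 is stated here as an assumption about the resolver limit.

```python
import ipaddress

MAXNS = 3  # assumed resolver limit on the number of nameservers


def build_nsaddr_list(nameservers, packed):
    """Model where IPv4 servers land in an nsaddr_list-like array.

    packed=False: each IPv4 server is stored at its overall position in
                  resolv.conf, so IPv6 entries leave gaps (None).
    packed=True:  IPv4 servers are stored contiguously from slot 0.
    Returns (slots, ipv4_count).
    """
    slots = [None] * MAXNS
    ipv4_count = 0
    for position, text in enumerate(nameservers[:MAXNS]):
        if ipaddress.ip_address(text).version == 4:
            slots[ipv4_count if packed else position] = text
            ipv4_count += 1
    return slots, ipv4_count


def first_n_are_ipv4(slots, count):
    """Check that the first `count` slots all hold an IPv4 address."""
    return all(slot is not None for slot in slots[:count])


servers = ["2001:db8::1", "192.0.2.1", "192.0.2.2"]
for packed in (False, True):
    slots, count = build_nsaddr_list(servers, packed)
    print(packed, slots, first_n_are_ipv4(slots, count))
```

With an IPv6 server listed first, the unpacked layout leaves slot 0 empty, so a loop over the first ipv4_count slots would read an invalid entry. The packed layout keeps that loop's assumption true.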

Comment 37 Jeff Law 2012-07-26 21:09:33 UTC

Michael,

I see your point.   I actually cobbled up another test by iterating through the combinations of IPV6 and IPV4 servers at different positions in /etc/resolv.conf and can still trigger some failures.  So I'm still "on it" :-)
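
As an illustration of that kind of test matrix, the following Python sketch enumerates resolv.conf variants with IPv4 and IPv6 servers in every position. It is not the attached test suite; the function name resolv_conf_variants and the placeholder addresses are assumptions chosen for the example.

```python
import itertools

# Placeholder addresses from the documentation ranges; purely illustrative.
IPV4 = ["192.0.2.1", "192.0.2.2", "192.0.2.3"]
IPV6 = ["2001:db8::1", "2001:db8::2", "2001:db8::3"]


def resolv_conf_variants(max_servers=3):
    """Yield resolv.conf texts covering every IPv4/IPv6 ordering.

    For each server count from 1 to max_servers, every sequence of
    address families is generated, once without and once with
    "options rotate".
    """
    for count in range(1, max_servers + 1):
        for families in itertools.product("46", repeat=count):
            pools = {"4": iter(IPV4), "6": iter(IPV6)}
            lines = ["nameserver " + next(pools[f]) for f in families]
            for rotate in (False, True):
                extra = ["options rotate"] if rotate else []
                yield "\n".join(lines + extra) + "\n"


variants = list(resolv_conf_variants())
print(len(variants))
print(variants[-1])
```

Each generated text could be written to a scratch resolv.conf and followed by a repeated-lookup check in a single process, which is the pattern that exposed the failures reported here.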

Comment 38 Nenad Opsenica 2012-07-26 21:47:39 UTC

In addition to lookup failures, it seems that "options rotate" causes domain names from the "search" clause to be appended, almost at random, to the names actually being looked up.

Comment 39 Jeff Law 2012-07-26 21:50:20 UTC

That's the bug which started us down this path  :-)

Comment 40 Jeff Law 2012-07-27 04:12:05 UTC

Created attachment 600680 [details]
Tests for various options rotate issues

Add many more resolv.conf variants to prior tests.

Comment 41 Jeff Law 2012-07-27 04:15:12 UTC

Michael,

Your change to pack the IPV4 servers at the front of the array and change how the count is tracked resolved the single issue that remained after backing out all the recent changes to this code and fixing the memcpy argument.   Fresh builds are spinning with that update.

Comment 42 Michael Chapman 2012-08-01 06:39:41 UTC

(In reply to comment #41)
>    Fresh builds are spinning with that update.

Does that include new Fedora packages as well?

I can see a few recent glibc packages in Koji, but I don't think they include this last update.

Comment 43 Jeff Law 2012-08-03 16:40:50 UTC

For rawhide, glibc-2.16-6.fc18.

Patsy has an update in progress for F17 (glibc-2.15-54.fc17).  Leave karma to get it moving :-)

https://admin.fedoraproject.org/updates/glibc-2.15-54.fc17

Comment 44 Michael Chapman 2012-08-05 08:29:39 UTC

(In reply to comment #43)
> Patsy has an update in progress for F17 (glibc-2.15-54.fc17).  Leave karma
> to get it moving :-)

I would, except that package doesn't have the complete fix. :-)

Since this BZ is specifically for RHEL, should I open a new bug to track the problem in Fedora?

Comment 45 Jeff Law 2012-08-06 17:57:36 UTC

Fedora has the complete fix now (-55 build) :-)  In the mad rush to take care of things before taking a few days off, I forgot to pull the updated fix into Fedora.

No need to open a new report.

Comment 52 errata-xmlrpc 2013-02-21 07:05:23 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-0279.html
