Bug 730856 - assertion error in res_query.c: `hp != hp2' failed
Status: CLOSED ERRATA
Product: Fedora
Classification: Fedora
Component: glibc
Version: 16
Hardware: Unspecified Unspecified
Priority: medium, Severity: high
Assigned To: Jeff Law
QA Contact: Fedora Extras Quality Assurance
Keywords: Patch, Reopened
Duplicates: 732857 734018 768549
Blocks: F16Blocker/F16FinalBlocker 789968 1138348
Reported: 2011-08-15 20:22 EDT by Adam Williamson
Modified: 2014-09-04 10:36 EDT
Fixed In Version: glibc-2.14.90-8
Doc Type: Bug Fix
Last Closed: 2012-02-17 01:04:24 EST
External Trackers: Sourceware 13013

Description Adam Williamson 2011-08-15 20:22:26 EDT
A bug in glibc sometimes causes name resolution to crash applications. Firefox can be very crash-y due to this bug; I've had days where it's crashed 10-20 times.

Upstream bug is:

http://sourceware.org/bugzilla/show_bug.cgi?id=13013

There's a patch there. Can we pull the patch into F16's glibc, if it looks sane? It'd be nice not to have this biting too many testers. Thanks!
Comment 1 Adam Williamson 2011-08-15 20:22:53 EDT
Nominating as a final blocker, as live images would be permanently buggy if we shipped this way...
Comment 2 Shawn Starr 2011-08-15 21:33:29 EDT
I can also reproduce this quite easily with Firefox, using both the fc16 and the rawhide glibc packages.
Comment 3 Kevin Fenzi 2011-08-20 18:09:38 EDT
I see this all the time with midori. I suspect anything that does a lot of DNS lookups would hit it.

I've made a scratch build with the patch from the upstream bug: 

http://koji.fedoraproject.org/koji/taskinfo?taskID=3289243

So far I have not hit this bug after updating to that package. :)
Comment 4 Adam Williamson 2011-08-23 19:41:26 EDT
*** Bug 732857 has been marked as a duplicate of this bug. ***
Comment 5 Michael Schwendt 2011-08-28 17:12:06 EDT
For me, firefox crashes reproducibly when visiting http://bikemap.net
Comment 6 Thomas Spura 2011-08-31 06:11:51 EDT
*** Bug 710697 has been marked as a duplicate of this bug. ***
Comment 7 Jens Petersen 2011-08-31 20:16:57 EDT
*** Bug 734018 has been marked as a duplicate of this bug. ***
Comment 8 Jens Petersen 2011-08-31 20:37:33 EDT
Kevin,

(In reply to comment #3)
> I've made a scratch build with the patch from the upstream bug: 
> http://koji.fedoraproject.org/koji/taskinfo?taskID=3289243

Unfortunately the rpms are already gone from koji and
glibc is still not updated...
(Is there any way to keep scratch builds around longer?)

Is it possible you could post the srpm somewhere? :)
Comment 10 Adam Williamson 2011-09-01 15:05:15 EDT
For the record, Andreas is working to reproduce and fix this bug, according to this post:

https://lists.fedoraproject.org/pipermail/devel/2011-August/156296.html
Comment 11 Fedora Update System 2011-09-02 03:18:45 EDT
glibc-2.14.90-7 has been submitted as an update for Fedora 16.
https://admin.fedoraproject.org/updates/glibc-2.14.90-7
Comment 12 Adam Williamson 2011-09-02 16:47:01 EDT
Note the fix for this in glibc-2.14.90-7 is not the same fix that was proposed in http://sourceware.org/bugzilla/show_bug.cgi?id=13013 . The fix in -7 seems to be this:

--- glibc-2.14-213-g3ba5751/resolv/res_query.c
+++ glibc-2.14.90-6/resolv/res_query.c
@@ -248,7 +248,7 @@ __libc_res_nquery(res_state statp,
 	    && *resplen2 > (int) sizeof (HEADER))
 	  {
 	    /* Special case of partial answer.  */
-	    assert (hp != hp2);
+	    assert (n == 0 || hp != hp2);
 	    hp = hp2;
 	  }
 	else if (answerp2 != NULL && *resplen2 < (int) sizeof (HEADER)

I'm not sure if this is something Andreas came up with, or if this is a better upstream fix.
Comment 13 Fedora Update System 2011-09-06 14:09:13 EDT
Package glibc-2.14.90-7:
* should fix your issue,
* was pushed to the Fedora 16 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing glibc-2.14.90-7'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/glibc-2.14.90-7
then log in and leave karma (feedback).
Comment 14 Fedora Update System 2011-09-08 16:48:18 EDT
Package glibc-2.14.90-8:
* should fix your issue,
* was pushed to the Fedora 16 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing glibc-2.14.90-8'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/glibc-2.14.90-8
then log in and leave karma (feedback).
Comment 15 Fedora Update System 2011-09-13 02:10:56 EDT
glibc-2.14.90-8 has been pushed to the Fedora 16 stable repository.  If problems still persist, please make note of it in this bug report.
Comment 16 Mathieu Bridon 2011-10-24 10:46:11 EDT
I think I still have this bug on a just distro-synced Fedora 16.

Transmission-gtk crashed, and gave me the following:
     res_query.c:258: __libc_res_nquery: Assertion `hp != hp2' failed.

That's with:
    glibc-2.14.90-13.x86_64

I'm trying to reproduce it now to get a full backtrace (I didn't have ABRT running), but even the original torrent (which made transmission-gtk crash twice) doesn't reproduce the issue any more. :-/

Andreas, Adam mentioned in comment 10 that you were trying to reproduce the bug. Did you ever manage to find a reproducer, so that I could try it and hopefully get more information?
Comment 17 Adam Williamson 2011-10-24 12:00:21 EDT
The initial bug happened when a DNS query returned one of two or three failure codes which are unusual and tend to be transient, so reproducing this can be a pain. One way would be to set up an intentionally broken DNS server which always returns one of those failure codes, of course.



-- 
Fedora Bugzappers volunteer triage team
https://fedoraproject.org/wiki/BugZappers
Comment 18 Mathieu Bridon 2011-10-24 12:28:45 EDT
(In reply to comment #17)
> the initial bug happened when a DNS query returned one of two or three failures
> which are unusual and tend to be transient, so reproducing this can be a PITA.
> one way would be to set up an intentionally broken DNS server which always
> returned one of those failure codes, of course.

Well, the DNS from my ISP sucks so much that I've just had another crash with transmission-gtk. (Perhaps they read your comment and thought they'd help us reproduce it? :)

This time I got the full traceback from ABRT, and it seems the bug was already reported, so ABRT just added this:
    https://bugzilla.redhat.com/show_bug.cgi?id=744501#c4

Perhaps this bug should be reopened?
Comment 19 Fernando Herrera 2011-12-18 22:01:12 EST
The bug is still present. I can reproduce it with glibc-2.14.90-21 using any version of firefox visiting URLs from rae.es like this:

http://buscon.rae.es/draeI/SrvltGUIBusUsual?LEMA=XYZ

A simple and isolated testcase crashing on glibc:

#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>
#include <string.h>     /* memset */

int main (void)
{
        struct addrinfo *res = NULL, hints;

        memset(&hints, 0, sizeof(hints));
        hints.ai_flags |= AI_ADDRCONFIG;
        hints.ai_socktype = SOCK_STREAM;

        int rc = getaddrinfo ("buscon.rae.es", NULL, &hints, &res);  /* aborts here on affected glibc */
        if (rc == 0)
                freeaddrinfo (res);     /* release the result list on success */
        return 0;
}


a: res_query.c:258: __libc_res_nquery: Assertion `hp != hp2' failed.

Program received signal SIGABRT, Aborted.
0x00110416 in __kernel_vsyscall ()
(gdb) bt
#0  0x00110416 in __kernel_vsyscall ()
#1  0x4fdee98f in raise () from /lib/libc.so.6
#2  0x4fdf02d5 in abort () from /lib/libc.so.6
#3  0x4fde76a5 in __assert_fail_base () from /lib/libc.so.6
#4  0x4fde7757 in __assert_fail () from /lib/libc.so.6
#5  0x4114f52c in __libc_res_nquery () from /lib/libresolv.so.2
#6  0x4114f71e in __libc_res_nquerydomain () from /lib/libresolv.so.2
#7  0x4114f9d3 in __libc_res_nsearch () from /lib/libresolv.so.2
#8  0x00124f6f in _nss_dns_gethostbyname4_r () from /lib/libnss_dns.so.2
#9  0x4fe9682b in gaih_inet () from /lib/libc.so.6
#10 0x4fe99d1d in getaddrinfo () from /lib/libc.so.6
#11 0x0804842c in main () at a.c:14
Comment 20 Fernando Herrera 2011-12-18 22:17:37 EST
Using well-behaved nameservers does not seem to trigger the bug.

I can definitely trigger it using these nameservers:

nameserver 87.216.1.65
nameserver 87.216.1.66

(gdb) bt
#0  0x00110416 in __kernel_vsyscall ()
#1  0x4fdee98f in __GI_raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#2  0x4fdf02d5 in __GI_abort () at abort.c:91
#3  0x4fde76a5 in __assert_fail_base (fmt=0x4ff27be8 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x4115ac18 "hp != hp2", file=0x4115ac02 "res_query.c", line=258, function=0x4115ac22 "__libc_res_nquery") at assert.c:94
#4  0x4fde7757 in __GI___assert_fail (assertion=0x4115ac18 "hp != hp2", file=0x4115ac02 "res_query.c", line=258, function=0x4115ac22 "__libc_res_nquery") at assert.c:103
#5  0x4114f52c in __libc_res_nquery (statp=0x4ff69c00, name=0xbfffd89b "buscon.rae.es.", class=1, type=62321, answer=0xbfffe190 "\260\275\201\200", anslen=2048, answerp=0xbfffe9b0, answerp2=0xbfffe9b4, nanswerp2=0xbfffe9b8, resplen2=0xbfffe9bc) at res_query.c:258
#6  0x4114f71e in __libc_res_nquerydomain (statp=0x4ff69c00, name=<optimized out>, domain=0x4ff69c60 "", class=1, type=62321, answer=0xbfffe190 "\260\275\201\200", anslen=2048, answerp=0xbfffe9b0, answerp2=0xbfffe9b4, nanswerp2=0xbfffe9b8, resplen2=0xbfffe9bc) at res_query.c:578
#7  0x4114f9d3 in __libc_res_nsearch (statp=0x4ff69c00, name=0x8048514 "buscon.rae.es", class=1, type=62321, answer=0xbfffe190 "\260\275\201\200", anslen=2048, answerp=0xbfffe9b0, answerp2=0xbfffe9b4, nanswerp2=0xbfffe9b8, resplen2=0xbfffe9bc) at res_query.c:416
#8  0x00124f6f in _nss_dns_gethostbyname4_r (name=0x8048514 "buscon.rae.es", pat=0xbfffef2c, buffer=0xbfffea20 "\300\250\001\t", buflen=1024, errnop=0xbfffef30, herrnop=0xbfffef3c, ttlp=0x0) at nss_dns/dns-host.c:314
#9  0x4fe9682b in gaih_inet (name=0x8048514 "buscon.rae.es", service=<optimized out>, req=0xbffff0dc, pai=0xbffff084, naddrs=0xbffff094) at ../sysdeps/posix/getaddrinfo.c:842
#10 0x4fe99d1d in __GI_getaddrinfo (name=0x8048514 "buscon.rae.es", service=<optimized out>, hints=<optimized out>, pai=0xbffff0fc) at ../sysdeps/posix/getaddrinfo.c:2356
#11 0x0804842c in main () at a.c:14
Comment 21 Jeff Law 2011-12-20 14:25:34 EST
*** Bug 768549 has been marked as a duplicate of this bug. ***
Comment 22 Klaus Pedersen 2011-12-20 18:39:09 EST
This assert is also back for me in glibc-2.14.90-18.x86_64. (mainly transmission)

I saw the problem around the same time as original reporter (August). The issue back then was fixed by updating to the -7 or -8 version from updates-testing (Comment 13/14).

What puzzles me is that if the fix was to change the assert:

-     assert (hp != hp2);
+     assert (n == 0 || hp != hp2);

Then how come the assertion message looks like this:

 "__libc_res_nquery: Assertion `hp != hp2' failed"


I assume what happened was that Andreas' (working) fix was reverted and the upstream patch was applied instead.
Comment 23 Klaus Pedersen 2011-12-21 19:01:15 EST
This assert is also back for me in glibc-2.14.90-18.x86_64. (mainly transmission)

I saw the problem around the same time as original reporter (August). The issue back then was fixed by updating to the -7 or -8 version from updates-testing (Comment 13/14).

What puzzles me is that if the fix was to change the assert:

-     assert (hp != hp2);
+     assert (n == 0 || hp != hp2);

Then how come the assertion message looks like this:

 "__libc_res_nquery: Assertion `hp != hp2' failed"


I assume what happened was that Andreas' (working) fix was reverted and the upstream patch was applied instead.
Comment 24 Jeff Law 2011-12-21 23:55:09 EST
In response to c#22/23, the reason the assert text isn't what you expect is because we're hitting a different assert.

There are two places in res_query.c which (prior to Andreas's change) assert (hp != hp2).  Andreas's change only modified one of those asserts to be assert (n == 0 || hp != hp2).  The new failures are the other assert (which Andreas didn't change).

You can see this by matching up the line # in res_query.c with the backtrace provided in c20.  

I haven't managed to get this to fail, but I've got the testcase running in a loop while I read the code in the hopes that it'll fail.
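Running the testcase in a loop can be sketched in C as below. This is only an illustrative harness, not what was actually run: the helper name lookup_aborts and the default of 100 attempts are made up. Each lookup happens in a forked child, so a failed assert (which raises SIGABRT) kills only the child and the parent can keep counting.

```c
#include <netdb.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Resolve host in a child process.
   Returns 1 if the child died from SIGABRT (what a failed assert raises),
   0 if it ended any other way, and -1 if fork or waitpid failed. */
static int lookup_aborts(const char *host)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: same hints as the testcase in comment 19. */
        struct addrinfo hints, *res = NULL;
        memset(&hints, 0, sizeof hints);
        hints.ai_flags = AI_ADDRCONFIG;
        hints.ai_socktype = SOCK_STREAM;
        if (getaddrinfo(host, NULL, &hints, &res) == 0)
            freeaddrinfo(res);
        _exit(0);
    }

    /* Parent: find out how the child ended. */
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFSIGNALED(status) && WTERMSIG(status) == SIGABRT;
}

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s HOST [ATTEMPTS]\n", argv[0]);
        return 2;
    }
    int attempts = argc > 2 ? atoi(argv[2]) : 100;
    int aborts = 0;
    for (int i = 0; i < attempts; i++) {
        int r = lookup_aborts(argv[1]);
        if (r < 0) {
            perror("fork/waitpid");
            return 1;
        }
        aborts += r;
    }
    printf("%d of %d lookups aborted\n", aborts, attempts);
    return aborts > 0;
}
```

Whether it ever reports an abort still depends on the nameserver misbehaving in the right way, so a run with zero aborts proves little.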
Comment 25 Jeff Law 2011-12-22 00:44:43 EST
Fernando, if you can trigger this with your testcase and get me the value of *resplen2 in frame #5 it would be helpful.
Comment 26 Fernando Herrera 2011-12-22 00:48:21 EST
sure:
(gdb) p *resplen2
$6 = 1


I have a debugging session open right now with the aborted testcase, so if you need anything, just ask here or on #fedora-devel IRC.
Comment 27 Jeff Law 2011-12-22 01:18:43 EST
What's your nick on IRC?

Presumably you're using the 87.216.1.65/66 nameservers?  I can't reach either of them unfortunately.

Also, in frame #5, please print:
*answer, anslen, *answerp, *answerp2, *nanswerp2
Comment 28 Fernando Herrera 2011-12-22 01:23:41 EST
(gdb)  p *answer
$5 = 231 '\347'
(gdb)  p anslen
$6 = 2048
(gdb)  p *answerp
$7 = (u_char *) 0xbfffe1e0 "\347\016\201\200"
(gdb)  p *answerp2
$8 = (u_char *) 0xbfffe1e0 "\347\016\201\200"
(gdb) p *nanswerp2
$9 = 2048
Comment 29 Eugene Kanter 2012-01-10 18:58:19 EST
update from bug 768549

same exact error for www.estamos.de
could it be related to a DNS query timeout?

host www.estamos.de
www.estamos.de has address 78.46.77.246
;; connection timed out; no servers could be reached
;; connection timed out; no servers could be reached

host ocsp.entrust.net
ocsp.entrust.net has address 216.191.247.203
;; connection timed out; no servers could be reached
;; connection timed out; no servers could be reached
Comment 30 Jeff Law 2012-01-10 23:51:32 EST
I spent some time remotely debugging with Fernando (thanks, Fernando) just before Christmas.  It's definitely a problem with how the server responds and glibc's use of the multiple result buffers.  There were one or two more bits of state I needed to track down to fully understand the paths through the code.  I haven't caught Fernando on IRC (we're offset by ~8hrs) so I haven't been able to dive back into it yet.

Eugene, if the DNS resolver you're using is publicly accessible, that would be a big help, as I could debug it here without having to coordinate with Fernando to get access to his box (which uses a private DNS server which often triggers this problem).  Can you please send me the contents of your /etc/resolv.conf file via private message (law@redhat.com)?
Comment 31 Clemens Eisserer 2012-02-13 11:36:03 EST
I see the problem when running azureus - my wlan cable/modem seems to get a bit angry and drops connections.

This happens with glibc-2.14.90-24.fc16.4.i686 when browsing with Firefox 10:

Program received signal SIGABRT, Aborted.
[Switching to Thread 0x9ebffb40 (LWP 8473)]
0x008c8416 in __kernel_vsyscall ()
Missing separate debuginfos, use: debuginfo-install libthai-0.1.14-4.fc15.i686
(gdb) bt
#0  0x008c8416 in __kernel_vsyscall ()
#1  0x49d4998f in __GI_raise (sig=6)
    at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#2  0x49d4b2d5 in __GI_abort () at abort.c:91
#3  0x49d426a5 in __assert_fail_base (fmt=
    0x49e82c48 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=
    0x4a215c18 "hp != hp2", file=0x4a215c02 "res_query.c", line=258, function=
    0x4a215c22 "__libc_res_nquery") at assert.c:94
#4  0x49d42757 in __GI___assert_fail (assertion=0x4a215c18 "hp != hp2", file=
    0x4a215c02 "res_query.c", line=258, function=
    0x4a215c22 "__libc_res_nquery") at assert.c:103
#5  0x4a20a52c in __libc_res_nquery (statp=0x9ebffdc4, name=
    0x9ebfda1b "ecosia-de.tumblr.com.localdomain", class=1, type=62321, answer=
    0x9ebfe310 "ݗ\201\203", anslen=2048, answerp=0x9ebfeb30, answerp2=
    0x9ebfeb34, nanswerp2=0x9ebfeb38, resplen2=0x9ebfeb3c) at res_query.c:258
#6  0x4a20a71e in __libc_res_nquerydomain (statp=0x9ebffdc4, 
    name=<optimized out>, domain=0x9ebffe24 "localdomain", class=1, type=
    62321, answer=0x9ebfe310 "ݗ\201\203", anslen=2048, answerp=0x9ebfeb30, 
    answerp2=0x9ebfeb34, nanswerp2=0x9ebfeb38, resplen2=0x9ebfeb3c)
    at res_query.c:578
#7  0x4a20a9d3 in __libc_res_nsearch (statp=0x9ebffdc4, name=
    0xa3b325bc "ecosia-de.tumblr.com", class=1, type=62321, answer=
    0x9ebfe310 "ݗ\201\203", anslen=2048, answerp=0x9ebfeb30, answerp2=
    0x9ebfeb34, nanswerp2=0x9ebfeb38, resplen2=0x9ebfeb3c) at res_query.c:416
#8  0x005f2f6f in _nss_dns_gethostbyname4_r (name=
    0xa3b325bc "ecosia-de.tumblr.com", pat=0x9ebff0ac, buffer=0x9ebfeba0 "", 
    buflen=1024, errnop=0x9ebff0b0, herrnop=0x9ebff0bc, ttlp=0x0)
    at nss_dns/dns-host.c:314
#9  0x49df189b in gaih_inet (name=0xa3b325bc "ecosia-de.tumblr.com", 
    service=<optimized out>, req=0x9ebff26c, pai=0x9ebff204, naddrs=0x9ebff214)
    at ../sysdeps/posix/getaddrinfo.c:842
#10 0x49df4d8d in __GI_getaddrinfo (name=0xa3b325bc "ecosia-de.tumblr.com", 
    service=<optimized out>, hints=<optimized out>, pai=0x9ebff28c)
    at ../sysdeps/posix/getaddrinfo.c:2356
#11 0x0059f0cb in PR_GetAddrInfoByName ()
Comment 32 Clemens Eisserer 2012-02-13 11:46:49 EST
The same happens for azureus. I always thought Java had this specific problem, but it seems it's deeper down in glibc.
Comment 33 Jeff Law 2012-02-13 11:49:32 EST
Clemens, what are the contents of your resolv.conf?  I'm still looking for a public dns server that exhibits the problem behaviour so I can debug locally rather than use Fernando's machine on the other side of the pond.  Ideally I'll have time to look at this again later this week.
Comment 34 Clemens Eisserer 2012-02-13 12:01:24 EST
Hi Jeff,

/etc/resolv.conf has the following two entries:
nameserver 212.186.211.21
nameserver 195.34.133.21

I also have a packet-dump which was created while firefox crashed showing lots of ugly stuff going on (tcp dups, retransmits, ..), when running azureus. If you are interested, I could upload it somewhere?
Comment 35 Clemens Eisserer 2012-02-16 12:21:31 EST
most likely you will not be able to reproduce the problem by just using the nameserver provided above.

I *only* get the crashes when running Azureus, which causes an awful lot of TCP retransmits and duplicates and causes even the wlan-driver to get a bit "angry" from time to time.

If required I could provide you with my machine, and create a remote account for you.
Comment 36 Jeff Law 2012-02-17 01:04:24 EST
I pulled a patch from Debian which resolves this issue into rawhide & f17. 

The patch was submitted upstream some time ago, but Uli hasn't merged it into the upstream sources yet.
Comment 37 Clemens Eisserer 2012-02-17 03:17:37 EST
Thanks :)

Any chance this will end up in Fedora 16 too? (as it's only a bug fix and not a new feature)
Comment 38 Jeff Law 2012-02-17 12:09:15 EST
Not planning to right now.  My time is quite limited and my focus is turning towards F17 issues.
Comment 39 Clemens Eisserer 2012-02-17 12:14:32 EST
That's unfortunate, as it means I have to live with Firefox crashing a dozen times a day while running azureus.

If only glibc weren't maintained the way it is :(
Comment 40 Jeff Law 2012-02-17 12:40:32 EST
You could always update to the F17 glibc which includes the fix. 

Upstream glibc maintenance is a serious issue which increases the maintenance burden for all the downstream consumers including Fedora & Debian.  I don't know what the long term solution for upstream glibc will be; however, I have been in contact with some downstream consumers to see where we can work together to avoid duplicated efforts.
Comment 41 cfzeitler 2012-03-23 21:54:48 EDT
hmmm...

i've had Vuze 4.7.0.2 crashing @ random intervals with
"java: res_query.c:258: __libc_res_nquery: Assertion `hp != hp2' failed."

could be related? Fedora 16, glibc 2.14.90-24.fc16.6 64 bit
Comment 42 Clemens Eisserer 2012-03-24 06:42:52 EDT
Yep, it's exactly the same bug. It is hopefully fixed in Fedora 17, so it should be solvable by installing a glibc shipped with the Alpha releases of Fedora 17.

It's really sad that there will be no backport of the fix to Fedora 16.
Comment 43 Jeff Law 2012-03-26 12:24:10 EDT
It's definitely fixed in F17.  Unfortunately I simply don't have the time to backport changes into F16 and issue new updates.

Unfortunately, it may not be possible any more to use the F17 glibc on F16 because of the changes to move everything into /usr.  I haven't tried an rpm --force.
Comment 44 Nicola Soranzo 2012-04-27 09:53:57 EDT
(In reply to comment #42)
> Its really sad that will be no backport of the fix to Fedora 16.

+1, a backport for F16 (which is supposed to be supported for another 8 months!) would be very much appreciated; my Firefox crashes 2-3 times per day!
Comment 45 Fedora Update System 2012-05-14 16:49:24 EDT
glibc-2.14.90-24.fc16.7 has been submitted as an update for Fedora 16.
https://admin.fedoraproject.org/updates/glibc-2.14.90-24.fc16.7
Comment 46 Fedora Update System 2012-05-26 21:57:48 EDT
glibc-2.14.90-24.fc16.7 has been pushed to the Fedora 16 stable repository.  If problems still persist, please make note of it in this bug report.