Bug 677395 - gethostbyaddr() hangs with signals blocked if nameserver
Summary: gethostbyaddr() hangs with signals blocked if nameserver
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Fedora
Classification: Fedora
Component: glibc
Version: 14
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Jeff Law
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2011-02-14 15:59 UTC by Bernie Innocenti
Modified: 2016-11-24 15:43 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-02-17 18:41:39 UTC
Type: ---



Description Bernie Innocenti 2011-02-14 15:59:25 UTC
Description of problem:
Programs calling gethostbyaddr() can block for a long time, during which they can't be aborted with ctrl-c, ctrl-\ or kill.

NOTE: I'm running nscd, which may be part of the problem.


Version-Release number of selected component (if applicable):
glibc-2.13-1.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Kill your network connection so that the local nameserver cannot perform recursive queries.
2. Run "ping some_host".
3. Try to kill ping.
  
Actual results:
Ping gets stuck in gethostbyaddr() and can't be killed


Expected results:
Ping should be killable at all times.

Additional info:

Backtrace of ping obtained while the process is stuck.

0x00000030004d7248 in __poll (fds=0x7fff30ef7870, nfds=1, timeout=5000) at ../sysdeps/unix/sysv/linux/poll.c:83
83	    return INLINE_SYSCALL (poll, 3, CHECK_N (fds, nfds), nfds, timeout);
Missing separate debuginfos, use: debuginfo-install iputils-20100418-3.fc14.x86_64
(gdb) bt
#0  0x00000030004d7248 in __poll (fds=0x7fff30ef7870, nfds=1, timeout=5000) at ../sysdeps/unix/sysv/linux/poll.c:83
#1  0x0000003003c0b5cb in send_dg (statp=0x3000799b80, buf=0x7fff30ef7910 "\244p\001", buflen=45, buf2=0x0, buflen2=0, ans=0x7fff30ef7ae0 "", anssiz=1024, 
    ansp=0x7fff30ef8370, ansp2=0x0, nansp2=0x0, resplen2=0x0) at res_send.c:1058
#2  __libc_res_nsend (statp=0x3000799b80, buf=0x7fff30ef7910 "\244p\001", buflen=45, buf2=0x0, buflen2=0, ans=0x7fff30ef7ae0 "", anssiz=1024, ansp=0x7fff30ef8370, 
    ansp2=0x0, nansp2=0x0, resplen2=0x0) at res_send.c:556
#3  0x0000003003c091b1 in __libc_res_nquery (statp=0x3000799b80, name=0x7fff30ef7f60 "161.76.232.199.in-addr.arpa", class=1, type=12, answer=0x7fff30ef7ae0 "", 
    anslen=1024, answerp=0x7fff30ef8370, answerp2=0x0, nanswerp2=0x0, resplen2=0x0) at res_query.c:225
#4  0x00007fe069d46c00 in _nss_dns_gethostbyaddr2_r (addr=0x7fff30ef84dc, len=<value optimized out>, af=<value optimized out>, result=0x3000799e60, buffer=0xd16700 "\177", 
    buflen=1024, errnop=0x7fe07024c6a0, h_errnop=0x7fff30ef8480, ttlp=0x0) at nss_dns/dns-host.c:471
#5  0x00000030004faab8 in __gethostbyaddr_r (addr=<value optimized out>, len=4, type=2, resbuf=0x3000799e60, buffer=0xd16700 "\177", buflen=1024, result=0x7fff30ef8470, 
    h_errnop=0x7fff30ef8480) at ../nss/getXXbyYY_r.c:256
#6  0x00000030004fa84c in gethostbyaddr (addr=0x7fff30ef84dc, len=4, type=2) at ../nss/getXXbyYY.c:117
#7  0x00000000004021da in ?? ()
#8  0x00000000004044aa in ?? ()
#9  0x0000000000406ba6 in ?? ()
#10 0x000000000040385a in ?? ()
#11 0x000000300041ee5d in __libc_start_main (main=0x402c40, argc=2, ubp_av=0x7fff30ef9d08, init=<value optimized out>, fini=<value optimized out>, 
    rtld_fini=<value optimized out>, stack_end=0x7fff30ef9cf8) at libc-start.c:226
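
For readers following the backtrace: frame #3 queries a PTR record (type=12) for "161.76.232.199.in-addr.arpa", which is the reverse-lookup name for the IPv4 address 199.232.76.161. The sketch below only illustrates that octet reversal; the helper name ipv4_to_ptr_name is hypothetical and this is not glibc's own code.

```c
#include <arpa/inet.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write the in-addr.arpa name for a dotted-quad
   IPv4 address into out. Returns 0 on success, -1 if the input is not
   a valid IPv4 address or out is too small. */
int ipv4_to_ptr_name(const char *dotted, char *out, size_t outlen)
{
    unsigned char a[4];
    struct in_addr addr;

    if (inet_pton(AF_INET, dotted, &addr) != 1)
        return -1;

    /* in_addr is in network byte order, so a[0] is the first octet. */
    memcpy(a, &addr, sizeof a);

    /* The PTR name lists the octets in reverse order. */
    int n = snprintf(out, outlen, "%u.%u.%u.%u.in-addr.arpa",
                     (unsigned)a[3], (unsigned)a[2],
                     (unsigned)a[1], (unsigned)a[0]);
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

Calling it with "199.232.76.161" fills the buffer with the name seen in the backtrace.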

Comment 1 Andreas Schwab 2011-02-15 15:12:26 UTC
Works fine here.  Make sure you didn't block the signals.

Comment 2 Bernie Innocenti 2011-02-18 00:18:52 UTC
Ok, it's pretty hard to trigger, but I can definitely trigger it *sometimes*:

130!bernie@giskard:~/src/fdo/xserver$ ping google.com
PING google.com (74.125.226.113) 56(84) bytes of data.
^C^C^C^C^C^C^C^C^C^C^C^C
^C^C^C
64 bytes from 74.125.226.113: icmp_req=1 ttl=55 time=16.8 ms

--- google.com ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 16.859/16.859/16.859/0.000 ms

Do you have nscd enabled?

Comment 3 Andreas Schwab 2011-02-21 15:28:09 UTC
The backtrace has no reference to nscd at all.

Comment 4 Bernie Innocenti 2011-02-21 21:18:20 UTC
(In reply to comment #3)
> The backtrace has no reference to nscd at all.

Is it just possible that my first ^C made the nscd codepath abort and return an error, so that __gethostbyaddr_r() continued with the in-process resolver?

Comment 5 Bernie Innocenti 2011-02-21 21:28:42 UTC
Ok, now I can reproduce it at will:

 sudo killall -STOP nscd
 sudo ltrace -S  ping google.com

And this is what I get from ping:

[...]
getopt(2, 0x7fff5009bc28, "h?VQ:I:M:aUc:dfi:w:l:S:np:qrs:vL"...) = -1
inet_aton("google.com", 0x61a7c4)                = 0
idna_to_ascii_lz(0x7fff5009d7f4, 0x7fff5009b640, 0, 0, 0) = 0
gethostbyname("google.com" <unfinished ...>
SYS_getpid()                                     = 30330
SYS_open("/etc/resolv.conf", 0, 0666)            = 4
SYS_fstat(4, 0x7fff500992e0, 0x7fff500992e0, 2, 1) = 0
SYS_mmap(0, 4096, 3, 34, 0xffffffff)             = 0x7f1a56b17000
SYS_read(4, "search localnet office.fsf.org f"..., 4096) = 71
SYS_read(4, "", 4096)                            = 0
SYS_close(4)                                     = 0
SYS_munmap(0x7f1a56b17000, 4096)                 = 0
SYS_socket(1, 526337, 0, 0x3000481310, 0)        = 4
SYS_connect(4, 0x7fff5009b010, 110, 0x3000481310, 0) = 0
SYS_sendto(4, 0x7fff5009afd0, 18, 16384, 0)      = 18
SYS_poll(0x7fff5009b280, 1, 5000, 16384, 4

The process hangs right here, while talking to nscd, and becomes unkillable.

Comment 6 Bernie Innocenti 2011-02-21 23:46:37 UTC
I was partially incorrect: in the above testcase, the process that would hang and become unresponsive to ^C was actually sudo, not ping.

At least now we know that it's not a problem specific to the resolver. Perhaps something odd happens while processes are talking with nscd?

Comment 7 Andreas Schwab 2011-02-22 15:07:54 UTC
Works fine here.

Comment 8 Bernie Innocenti 2011-02-28 16:49:54 UTC
I've seen the same symptom today on an Ubuntu 8.04 LTS server machine which isn't running nscd.

This time I could background the ping process and trace what happens when I hit CTRL-C:

---------------------------------------------------------------
root@monolith:~# ping servent.gnu.org
PING servent.gnu.org (199.232.41.14) 56(84) bytes of data.

***** HERE I'VE BEEN HITTING CTRL^C REPEATEDLY *****

[1]+  Stopped                 ping servent.gnu.org
root@monolith:~# strace -p 27318
Process 27318 attached - interrupt to quit

Process 27318 detached

[1]+  Stopped                 ping servent.gnu.org
root@monolith:~# strace -p 27318 &
[2] 27325
root@monolith:~# Process 27318 attached - interrupt to quit

root@monolith:~# fg
ping servent.gnu.org
restart_syscall(<... resuming interrupted call ...>) = 0
poll([{fd=5, events=POLLOUT, revents=POLLOUT}], 1, 0) = 1
sendto(5, "\315\355\1\0\0\1\0\0\0\0\0\0\00214\00241\003232\003199"..., 44, MSG_NOSIGNAL, NULL, 0) = 44
poll(
[{fd=5, events=POLLIN}], 1, 5000)  = ? ERESTART_RESTARTBLOCK (To be restarted)


***** HERE I STARTED TO HIT CTRL^C AGAIN *****


--- SIGINT (Interrupt) @ 0 (0) ---
rt_sigreturn(0x2)                       = -1 EINTR (Interrupted system call)
poll([{fd=5, events=POLLIN}], 1, 2323)  = ? ERESTART_RESTARTBLOCK (To be restarted)
--- SIGINT (Interrupt) @ 0 (0) ---
rt_sigreturn(0x2)                       = -1 EINTR (Interrupted system call)
poll([{fd=5, events=POLLIN}], 1, 1196)  = ? ERESTART_RESTARTBLOCK (To be restarted)
--- SIGINT (Interrupt) @ 0 (0) ---
rt_sigreturn(0x2)                       = -1 EINTR (Interrupted system call)
poll([{fd=5, events=POLLIN}], 1, 620)   = ? ERESTART_RESTARTBLOCK (To be restarted)
--- SIGINT (Interrupt) @ 0 (0) ---
rt_sigreturn(0x2)                       = -1 EINTR (Interrupted system call)
poll([{fd=5, events=POLLIN}], 1, 203)   = 0
close(4)                                = 0
close(5)                                = 0
write(1, "64 bytes from 199.232.41.14: icm"..., 6064 bytes from 199.232.41.14: icmp_seq=1 ttl=53 time=37.0 ms
) = 60
write(1, "\n", 1
)                       = 1
write(1, "--- servent.gnu.org ping statist"..., 40--- servent.gnu.org ping statistics ---
) = 40
write(1, "1 packets transmitted, 1 receive"..., 601 packets transmitted, 1 received, 0% packet loss, time 0ms
) = 60
write(1, "rtt min/avg/max/mdev = 37.052/37"..., 53rtt min/avg/max/mdev = 37.052/37.052/37.052/0.000 ms
) = 53
exit_group(0)                           = ?
Process 27318 detached
[2]-  Done                    strace -p 27318
---------------------------------------------------------------

Unfortunately, the network problem got fixed and I couldn't reproduce it again. It would have been good to have a backtrace of the process.

Now we know the following things:

 * the bug is not Fedora specific
 * the bug is not nscd specific
 * the bug has not been introduced in libc recently (glibc 2.7 had it)

Comment 9 Fedora Admin XMLRPC Client 2011-11-14 19:43:39 UTC
This package has changed ownership in the Fedora Package Database.  Reassigning to the new owner of this component.

Comment 10 Jeff Law 2012-02-17 18:41:39 UTC
I can't see any reason why gethostbyaddr or its relatives would be ignoring signals.  I guess the one additional test you could try would be ping -n which disables the reverse lookup.

I really think you need to look at your tty settings and other aspects of your system to ensure that the requested signals are actually being delivered.  I'm closing this as WORKSFORME as nobody else has been able to reproduce this problem.

Comment 11 Bernie Innocenti 2012-02-17 19:45:42 UTC
(In reply to comment #10)
> I can't see any reason why gethostbyaddr or its relatives would be ignoring
> signals.  I guess the one additional test you could try would be ping -n which
> disables the reverse lookup.
> 
> I really think you need to look at your tty settings and other aspects of your
> system to ensure that the requested signals are actually being delivered.  I'm
> closing this as WORKSFORME as nobody else has been able to reproduce this
> problem.

Then this bug might have been fixed in recent versions of Fedora.

I'm still seeing the same behavior on my work laptop which runs Ubuntu Lucid: on network outages, ping and other programs get stuck on reverse lookups and can't be killed with ^C.

