Bug 677395

Summary:	gethostbyaddr() hangs with signals blocked if nameserver
Product:	[Fedora] Fedora	Reporter:	Bernie Innocenti <bernie+fedora>
Component:	glibc	Assignee:	Jeff Law <law>
Status:	CLOSED WORKSFORME	QA Contact:	Fedora Extras Quality Assurance <extras-qa>
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	14	CC:	fweimer, jakub, schwab
Fixed In Version:		Doc Type:	Bug Fix
Last Closed:	2012-02-17 18:41:39 UTC	Type:	---

Description Bernie Innocenti 2011-02-14 15:59:25 UTC

Description of problem:
Programs calling gethostbyaddr() can block for a long time, during which they can't be aborted with ctrl-c, ctrl-\ or kill.

NOTE: I'm running nscd, which may be part of the problem.


Version-Release number of selected component (if applicable):
glibc-2.13-1.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Kill your network connection so that the local nameserver cannot perform recursive queries.
2. Run "ping some_host".
3. Try to kill ping.
  
Actual results:
Ping gets stuck in gethostbyaddr() and can't be killed


Expected results:
Ping should be killable at all times.

Additional info:

Backtrace of ping obtained while the process is stuck.

0x00000030004d7248 in __poll (fds=0x7fff30ef7870, nfds=1, timeout=5000) at ../sysdeps/unix/sysv/linux/poll.c:83
83	    return INLINE_SYSCALL (poll, 3, CHECK_N (fds, nfds), nfds, timeout);
Missing separate debuginfos, use: debuginfo-install iputils-20100418-3.fc14.x86_64
(gdb) bt
#0  0x00000030004d7248 in __poll (fds=0x7fff30ef7870, nfds=1, timeout=5000) at ../sysdeps/unix/sysv/linux/poll.c:83
#1  0x0000003003c0b5cb in send_dg (statp=0x3000799b80, buf=0x7fff30ef7910 "\244p\001", buflen=45, buf2=0x0, buflen2=0, ans=0x7fff30ef7ae0 "", anssiz=1024, 
    ansp=0x7fff30ef8370, ansp2=0x0, nansp2=0x0, resplen2=0x0) at res_send.c:1058
#2  __libc_res_nsend (statp=0x3000799b80, buf=0x7fff30ef7910 "\244p\001", buflen=45, buf2=0x0, buflen2=0, ans=0x7fff30ef7ae0 "", anssiz=1024, ansp=0x7fff30ef8370, 
    ansp2=0x0, nansp2=0x0, resplen2=0x0) at res_send.c:556
#3  0x0000003003c091b1 in __libc_res_nquery (statp=0x3000799b80, name=0x7fff30ef7f60 "161.76.232.199.in-addr.arpa", class=1, type=12, answer=0x7fff30ef7ae0 "", 
    anslen=1024, answerp=0x7fff30ef8370, answerp2=0x0, nanswerp2=0x0, resplen2=0x0) at res_query.c:225
#4  0x00007fe069d46c00 in _nss_dns_gethostbyaddr2_r (addr=0x7fff30ef84dc, len=<value optimized out>, af=<value optimized out>, result=0x3000799e60, buffer=0xd16700 "\177", 
    buflen=1024, errnop=0x7fe07024c6a0, h_errnop=0x7fff30ef8480, ttlp=0x0) at nss_dns/dns-host.c:471
#5  0x00000030004faab8 in __gethostbyaddr_r (addr=<value optimized out>, len=4, type=2, resbuf=0x3000799e60, buffer=0xd16700 "\177", buflen=1024, result=0x7fff30ef8470, 
    h_errnop=0x7fff30ef8480) at ../nss/getXXbyYY_r.c:256
#6  0x00000030004fa84c in gethostbyaddr (addr=0x7fff30ef84dc, len=4, type=2) at ../nss/getXXbyYY.c:117
#7  0x00000000004021da in ?? ()
#8  0x00000000004044aa in ?? ()
#9  0x0000000000406ba6 in ?? ()
#10 0x000000000040385a in ?? ()
#11 0x000000300041ee5d in __libc_start_main (main=0x402c40, argc=2, ubp_av=0x7fff30ef9d08, init=<value optimized out>, fini=<value optimized out>, 
    rtld_fini=<value optimized out>, stack_end=0x7fff30ef9cf8) at libc-start.c:226
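
For readers decoding frame #3: the query name is the IPv4 address with its octets reversed under in-addr.arpa, and type=12 / class=1 are the DNS codes for PTR / IN. In frame #6, len=4 and type=2 correspond to a 4-byte AF_INET address. A minimal sketch of the name mapping follows; reverse_name_ipv4 is a hypothetical helper written for illustration, not a glibc function.

```c
#include <arpa/inet.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of glibc): build the PTR query name for a
 * dotted-quad IPv4 address. Returns 0 on success, -1 on bad input or if
 * the output buffer is too small. */
int reverse_name_ipv4(const char *dotted, char *out, size_t outlen)
{
    struct in_addr addr;
    unsigned char a[4];

    if (inet_pton(AF_INET, dotted, &addr) != 1)
        return -1;
    memcpy(a, &addr, sizeof a);   /* network byte order: a[0] is the first octet */

    int n = snprintf(out, outlen, "%u.%u.%u.%u.in-addr.arpa",
                     (unsigned)a[3], (unsigned)a[2],
                     (unsigned)a[1], (unsigned)a[0]);
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

Passing "199.232.76.161" yields the name seen in the backtrace, so that is the address ping was trying to reverse-resolve.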

Comment 1 Andreas Schwab 2011-02-15 15:12:26 UTC

Works fine here.  Make sure you didn't block the signals.
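
One way to check this from inside a program is to query the signal mask and the signal disposition without changing either. This is only a sketch: signal_is_blocked and signal_is_ignored are hypothetical helper names, and they report on the calling process, not on an already running ping.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/* Returns 1 if signo is in the caller's blocked mask, 0 if it is not,
 * -1 on error. Passing NULL as the new set only reads the mask. */
int signal_is_blocked(int signo)
{
    sigset_t cur;

    if (sigprocmask(SIG_BLOCK, NULL, &cur) != 0)
        return -1;
    return sigismember(&cur, signo);
}

/* Returns 1 if the disposition of signo is SIG_IGN, 0 otherwise,
 * -1 on error. Passing NULL as the new action only reads it. */
int signal_is_ignored(int signo)
{
    struct sigaction sa;

    if (sigaction(signo, NULL, &sa) != 0)
        return -1;
    return sa.sa_handler == SIG_IGN;
}
```

For a process that is already stuck, the SigBlk, SigIgn and SigCgt lines in /proc/<pid>/status on Linux give the same information from the outside.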

Comment 2 Bernie Innocenti 2011-02-18 00:18:52 UTC

Ok, it's pretty hard to trigger, but I can definitely trigger it *sometimes*:

130!bernie@giskard:~/src/fdo/xserver$ ping google.com
PING google.com (74.125.226.113) 56(84) bytes of data.
^C^C^C^C^C^C^C^C^C^C^C^C
^C^C^C
64 bytes from 74.125.226.113: icmp_req=1 ttl=55 time=16.8 ms

--- google.com ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 16.859/16.859/16.859/0.000 ms

Do you have nscd enabled?

Comment 3 Andreas Schwab 2011-02-21 15:28:09 UTC

The backtrace has no reference to nscd at all.

Comment 4 Bernie Innocenti 2011-02-21 21:18:20 UTC

(In reply to comment #3)
> The backtrace has no reference to nscd at all.

Is it possible that my first ^C made the nscd code path abort and return an error, so that __gethostbyaddr_r() continued with the in-process resolver?

Comment 5 Bernie Innocenti 2011-02-21 21:28:42 UTC

Ok, now I can reproduce it at will:

 sudo killall -STOP nscd
 sudo ltrace -S  ping google.com

And this is what I get from ping:

[...]
getopt(2, 0x7fff5009bc28, "h?VQ:I:M:aUc:dfi:w:l:S:np:qrs:vL"...) = -1
inet_aton("google.com", 0x61a7c4)                = 0
idna_to_ascii_lz(0x7fff5009d7f4, 0x7fff5009b640, 0, 0, 0) = 0
gethostbyname("google.com" <unfinished ...>
SYS_getpid()                                     = 30330
SYS_open("/etc/resolv.conf", 0, 0666)            = 4
SYS_fstat(4, 0x7fff500992e0, 0x7fff500992e0, 2, 1) = 0
SYS_mmap(0, 4096, 3, 34, 0xffffffff)             = 0x7f1a56b17000
SYS_read(4, "search localnet office.fsf.org f"..., 4096) = 71
SYS_read(4, "", 4096)                            = 0
SYS_close(4)                                     = 0
SYS_munmap(0x7f1a56b17000, 4096)                 = 0
SYS_socket(1, 526337, 0, 0x3000481310, 0)        = 4
SYS_connect(4, 0x7fff5009b010, 110, 0x3000481310, 0) = 0
SYS_sendto(4, 0x7fff5009afd0, 18, 16384, 0)      = 18
SYS_poll(0x7fff5009b280, 1, 5000, 16384, 4

The process hangs right here, while talking to nscd, and becomes unkillable.
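
For reference, the raw numbers in the SYS_socket and SYS_connect lines can be decoded. On Linux, domain 1 is AF_UNIX, type 526337 is SOCK_STREAM with the SOCK_NONBLOCK and SOCK_CLOEXEC flags, and the connect length of 110 matches sizeof(struct sockaddr_un), which is consistent with a local Unix socket such as the one nscd listens on. The sketch below assumes Linux; describe_socket_type is a hypothetical helper name.

```c
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

/* Hypothetical helper: render a Linux socket(2) type argument as text.
 * Only SOCK_STREAM and SOCK_DGRAM are named; anything else is "other".
 * Returns 0 on success, -1 if the buffer is too small. */
int describe_socket_type(int type, char *out, size_t outlen)
{
    int base = type & ~(SOCK_NONBLOCK | SOCK_CLOEXEC);
    const char *name = base == SOCK_STREAM ? "SOCK_STREAM"
                     : base == SOCK_DGRAM  ? "SOCK_DGRAM"
                     : "other";

    int n = snprintf(out, outlen, "%s%s%s", name,
                     (type & SOCK_NONBLOCK) ? "|SOCK_NONBLOCK" : "",
                     (type & SOCK_CLOEXEC)  ? "|SOCK_CLOEXEC"  : "");
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```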

Comment 6 Bernie Innocenti 2011-02-21 23:46:37 UTC

I was partially incorrect: in the above testcase, the process that would hang and become unresponsive to ^C was actually sudo, not ping.

At least now we know that it's not a problem specific to the resolver. Perhaps something odd happens while processes are talking with nscd?

Comment 7 Andreas Schwab 2011-02-22 15:07:54 UTC

Works fine here.

Comment 8 Bernie Innocenti 2011-02-28 16:49:54 UTC

I've seen the same symptom today on an Ubuntu 8.04 LTS server machine which isn't running nscd.

This time I could background the ping process and trace what happens when I hit CTRL-C:

---------------------------------------------------------------
root@monolith:~# ping servent.gnu.org
PING servent.gnu.org (199.232.41.14) 56(84) bytes of data.

***** HERE I'VE BEEN HITTING CTRL^C REPEATEDLY *****

[1]+  Stopped                 ping servent.gnu.org
root@monolith:~# strace -p 27318
Process 27318 attached - interrupt to quit

Process 27318 detached

[1]+  Stopped                 ping servent.gnu.org
root@monolith:~# strace -p 27318 &
[2] 27325
root@monolith:~# Process 27318 attached - interrupt to quit

root@monolith:~# fg
ping servent.gnu.org
restart_syscall(<... resuming interrupted call ...>) = 0
poll([{fd=5, events=POLLOUT, revents=POLLOUT}], 1, 0) = 1
sendto(5, "\315\355\1\0\0\1\0\0\0\0\0\0\00214\00241\003232\003199"..., 44, MSG_NOSIGNAL, NULL, 0) = 44
poll(
[{fd=5, events=POLLIN}], 1, 5000)  = ? ERESTART_RESTARTBLOCK (To be restarted)


***** HERE I STARTED TO HIT CTRL^C AGAIN *****


--- SIGINT (Interrupt) @ 0 (0) ---
rt_sigreturn(0x2)                       = -1 EINTR (Interrupted system call)
poll([{fd=5, events=POLLIN}], 1, 2323)  = ? ERESTART_RESTARTBLOCK (To be restarted)
--- SIGINT (Interrupt) @ 0 (0) ---
rt_sigreturn(0x2)                       = -1 EINTR (Interrupted system call)
poll([{fd=5, events=POLLIN}], 1, 1196)  = ? ERESTART_RESTARTBLOCK (To be restarted)
--- SIGINT (Interrupt) @ 0 (0) ---
rt_sigreturn(0x2)                       = -1 EINTR (Interrupted system call)
poll([{fd=5, events=POLLIN}], 1, 620)   = ? ERESTART_RESTARTBLOCK (To be restarted)
--- SIGINT (Interrupt) @ 0 (0) ---
rt_sigreturn(0x2)                       = -1 EINTR (Interrupted system call)
poll([{fd=5, events=POLLIN}], 1, 203)   = 0
close(4)                                = 0
close(5)                                = 0
write(1, "64 bytes from 199.232.41.14: icm"..., 6064 bytes from 199.232.41.14: icmp_seq=1 ttl=53 time=37.0 ms
) = 60
write(1, "\n", 1
)                       = 1
write(1, "--- servent.gnu.org ping statist"..., 40--- servent.gnu.org ping statistics ---
) = 40
write(1, "1 packets transmitted, 1 receive"..., 601 packets transmitted, 1 received, 0% packet loss, time 0ms
) = 60
write(1, "rtt min/avg/max/mdev = 37.052/37"..., 53rtt min/avg/max/mdev = 37.052/37.052/37.052/0.000 ms
) = 53
exit_group(0)                           = ?
Process 27318 detached
[2]-  Done                    strace -p 27318
---------------------------------------------------------------

Unfortunately, the network problem got fixed and I couldn't reproduce it again. It would have been good to have a backtrace of the process.

Now we know the following things:

 * the bug is not Fedora-specific
 * the bug is not nscd-specific
 * the bug was not introduced in glibc recently (glibc 2.7 already had it)
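
What the strace output above shows is that each SIGINT is delivered and handled (rt_sigreturn), after which poll() is entered again with whatever is left of the 5000 ms timeout (2323, 1196, 620, 203). The process only moves on once the timeout finally expires. The pattern can be sketched as below. This is an illustration of the observed behavior and not the resolver's actual code; poll_until_deadline and now_ms are hypothetical names.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <poll.h>
#include <time.h>

/* Current monotonic time in milliseconds. */
static long long now_ms(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/* Hypothetical helper: wait on fds for up to timeout_ms in total.
 * If poll() is interrupted by a signal (EINTR), wait again for the time
 * that remains instead of returning to the caller. Returns what poll()
 * returns: >0 ready descriptors, 0 on timeout, -1 on other errors. */
int poll_until_deadline(struct pollfd *fds, nfds_t nfds, int timeout_ms)
{
    long long deadline = now_ms() + timeout_ms;

    for (;;) {
        long long left = deadline - now_ms();
        if (left < 0)
            left = 0;

        int rc = poll(fds, nfds, (int)left);
        if (rc >= 0 || errno != EINTR)
            return rc;
        /* A signal handler ran and returned; retry with the remainder. */
    }
}
```

With a loop like this, a handler that only sets a flag cannot cut the wait short: the caller does not get to look at the flag until the full timeout has elapsed or data arrives.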

Comment 9 Fedora Admin XMLRPC Client 2011-11-14 19:43:39 UTC

This package has changed ownership in the Fedora Package Database.  Reassigning to the new owner of this component.

Comment 10 Jeff Law 2012-02-17 18:41:39 UTC

I can't see any reason why gethostbyaddr or its relatives would be ignoring signals.  I guess the one additional test you could try would be ping -n which disables the reverse lookup.

I really think you need to look at your tty settings and other aspects of your system to ensure that the requested signals are actually being delivered.  I'm closing this as WORKSFORME as nobody else has been able to reproduce this problem.
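
As a sketch of the tty check suggested here: the terminal only turns ^C into SIGINT when the ISIG flag is set and the VINTR character is the one being typed. The helper below reads both with tcgetattr(); tty_sends_sigint is a hypothetical name used for illustration.

```c
#include <stddef.h>
#include <termios.h>

/* Hypothetical helper: returns 1 if the terminal on fd has ISIG set
 * (so the interrupt character generates SIGINT), 0 if ISIG is cleared,
 * and -1 if fd is not a terminal or tcgetattr() fails. If intr_char is
 * not NULL, it receives the configured interrupt character (VINTR). */
int tty_sends_sigint(int fd, cc_t *intr_char)
{
    struct termios t;

    if (tcgetattr(fd, &t) != 0)
        return -1;
    if (intr_char != NULL)
        *intr_char = t.c_cc[VINTR];
    return (t.c_lflag & ISIG) != 0;
}
```

Calling it with fd 0 from the affected shell session reports the same settings that "stty -a" prints as isig and intr.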

Comment 11 Bernie Innocenti 2012-02-17 19:45:42 UTC

(In reply to comment #10)
> I can't see any reason why gethostbyaddr or its relatives would be ignoring
> signals.  I guess the one additional test you could try would be ping -n which
> disables the reverse lookup.
> 
> I really think you need to look at your tty settings and other aspects of your
> system to ensure that the requested signals are actually being delivered.  I'm
> closing this as WORKSFORME as nobody else has been able to reproduce this
> problem.

Then this bug might have been fixed in recent versions of Fedora.

I'm still seeing the same behavior on my work laptop, which runs Ubuntu Lucid: during network outages, ping and other programs get stuck on reverse lookups and can't be killed with ^C.