| Summary: | gethostbyaddr() hangs with signals blocked if nameserver | ||
|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | Bernie Innocenti <bernie+fedora> |
| Component: | glibc | Assignee: | Jeff Law <law> |
| Status: | CLOSED WORKSFORME | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 14 | CC: | fweimer, jakub, schwab |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Bug Fix | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2012-02-17 18:41:39 UTC | Type: | --- |
Works fine here. Make sure you didn't block the signals.

Ok, it's pretty hard to trigger, but I can definitely trigger it *sometimes*:
130!bernie@giskard:~/src/fdo/xserver$ ping google.com
PING google.com (74.125.226.113) 56(84) bytes of data.
^C^C^C^C^C^C^C^C^C^C^C^C ^C^C^C
64 bytes from 74.125.226.113: icmp_req=1 ttl=55 time=16.8 ms
--- google.com ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 16.859/16.859/16.859/0.000 ms

Do you have nscd enabled?

The backtrace has no reference to nscd at all.

(In reply to comment #3)
> The backtrace has no reference to nscd at all.
Is it just possible that my first ^C made the nscd codepath abort and return an error, so that __gethostbyaddr_r() continued with the in-process resolver?

Ok, now I can reproduce it at will:
sudo killall -STOP nscd
sudo ltrace -S ping google.com
And this is what I get from ping:
[...]
getopt(2, 0x7fff5009bc28, "h?VQ:I:M:aUc:dfi:w:l:S:np:qrs:vL"...) = -1
inet_aton("google.com", 0x61a7c4) = 0
idna_to_ascii_lz(0x7fff5009d7f4, 0x7fff5009b640, 0, 0, 0) = 0
gethostbyname("google.com" <unfinished ...>
SYS_getpid() = 30330
SYS_open("/etc/resolv.conf", 0, 0666) = 4
SYS_fstat(4, 0x7fff500992e0, 0x7fff500992e0, 2, 1) = 0
SYS_mmap(0, 4096, 3, 34, 0xffffffff) = 0x7f1a56b17000
SYS_read(4, "search localnet office.fsf.org f"..., 4096) = 71
SYS_read(4, "", 4096) = 0
SYS_close(4) = 0
SYS_munmap(0x7f1a56b17000, 4096) = 0
SYS_socket(1, 526337, 0, 0x3000481310, 0) = 4
SYS_connect(4, 0x7fff5009b010, 110, 0x3000481310, 0) = 0
SYS_sendto(4, 0x7fff5009afd0, 18, 16384, 0) = 18
SYS_poll(0x7fff5009b280, 1, 5000, 16384, 4
The process hangs right here, while talking to nscd, and becomes unkillable.
I was partially incorrect: in the above testcase, the process that would hang and become unresponsive to ^C was actually sudo, not ping. At least now we know that it's not a problem specific to the resolver. Perhaps something odd happens while processes are talking with nscd?

Works fine here.

I've seen the same symptom today on an Ubuntu 8.04 LTS server machine which isn't running nscd.
This time I could background the ping process and trace what happens when I hit CTRL-C:
---------------------------------------------------------------
root@monolith:~# ping servent.gnu.org
PING servent.gnu.org (199.232.41.14) 56(84) bytes of data.
***** HERE I'VE BEEN HITTING CTRL^C REPEATEDLY *****
[1]+ Stopped ping servent.gnu.org
root@monolith:~# strace -p 27318
Process 27318 attached - interrupt to quit
Process 27318 detached
[1]+ Stopped ping servent.gnu.org
root@monolith:~# strace -p 27318 &
[2] 27325
root@monolith:~# Process 27318 attached - interrupt to quit
root@monolith:~# fg
ping servent.gnu.org
restart_syscall(<... resuming interrupted call ...>) = 0
poll([{fd=5, events=POLLOUT, revents=POLLOUT}], 1, 0) = 1
sendto(5, "\315\355\1\0\0\1\0\0\0\0\0\0\00214\00241\003232\003199"..., 44, MSG_NOSIGNAL, NULL, 0) = 44
poll(
[{fd=5, events=POLLIN}], 1, 5000) = ? ERESTART_RESTARTBLOCK (To be restarted)
***** HERE I STARTED TO HIT CTRL^C AGAIN *****
--- SIGINT (Interrupt) @ 0 (0) ---
rt_sigreturn(0x2) = -1 EINTR (Interrupted system call)
poll([{fd=5, events=POLLIN}], 1, 2323) = ? ERESTART_RESTARTBLOCK (To be restarted)
--- SIGINT (Interrupt) @ 0 (0) ---
rt_sigreturn(0x2) = -1 EINTR (Interrupted system call)
poll([{fd=5, events=POLLIN}], 1, 1196) = ? ERESTART_RESTARTBLOCK (To be restarted)
--- SIGINT (Interrupt) @ 0 (0) ---
rt_sigreturn(0x2) = -1 EINTR (Interrupted system call)
poll([{fd=5, events=POLLIN}], 1, 620) = ? ERESTART_RESTARTBLOCK (To be restarted)
--- SIGINT (Interrupt) @ 0 (0) ---
rt_sigreturn(0x2) = -1 EINTR (Interrupted system call)
poll([{fd=5, events=POLLIN}], 1, 203) = 0
close(4) = 0
close(5) = 0
write(1, "64 bytes from 199.232.41.14: icm"..., 6064 bytes from 199.232.41.14: icmp_seq=1 ttl=53 time=37.0 ms
) = 60
write(1, "\n", 1
) = 1
write(1, "--- servent.gnu.org ping statist"..., 40--- servent.gnu.org ping statistics ---
) = 40
write(1, "1 packets transmitted, 1 receive"..., 601 packets transmitted, 1 received, 0% packet loss, time 0ms
) = 60
write(1, "rtt min/avg/max/mdev = 37.052/37"..., 53rtt min/avg/max/mdev = 37.052/37.052/37.052/0.000 ms
) = 53
exit_group(0) = ?
Process 27318 detached
[2]- Done strace -p 27318
---------------------------------------------------------------
Unfortunately, the network problem got fixed and I couldn't reproduce it again. It would have been good to have a backtrace of the process.
Now we know the following things:
* the bug is not Fedora specific
* the bug is not nscd specific
* the bug has not been introduced in libc recently (glibc 2.7 had it)
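In the strace output above, each SIGINT interrupts poll(), after which poll() is issued again with a smaller timeout (5000, 2323, 1196, 620, 203 ms) until the deadline expires. That pattern is consistent with a wait loop that treats EINTR as "keep waiting" rather than as a reason to give up. The following is a minimal sketch of such a loop, not the actual glibc resolver code; got_signal, on_signal, install_handler, wait_readable_retrying and demo_interrupted_wait are hypothetical names, and SIGALRM stands in for ^C.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <signal.h>
#include <sys/time.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical flag: set by the handler, never looked at by the wait loop. */
static volatile sig_atomic_t got_signal = 0;

static void on_signal(int signo)
{
    (void)signo;
    got_signal = 1;
}

/* Install a handler that only records the signal, as a program that
 * wants to print statistics before exiting might do. */
static int install_handler(int signo)
{
    struct sigaction sa;
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return sigaction(signo, &sa, NULL);
}

static long long now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000LL + ts.tv_nsec / 1000000LL;
}

/* Wait until fd is readable or timeout_ms has elapsed. On EINTR the loop
 * recomputes the remaining time and polls again, so a caught signal never
 * shortens the wait. Returns 1 if readable, 0 on timeout, -1 on error. */
static int wait_readable_retrying(int fd, int timeout_ms)
{
    assert(timeout_ms >= 0);
    long long deadline = now_ms() + timeout_ms;
    for (;;) {
        long long remaining = deadline - now_ms();
        if (remaining < 0)
            remaining = 0;
        struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
        int n = poll(&pfd, 1, (int)remaining);
        if (n >= 0)
            return n;
        if (errno != EINTR)
            return -1;
        /* Interrupted by a handled signal: go back to waiting. */
    }
}

/* Demo: the signal arrives 20 ms into a 100 ms wait on a pipe that never
 * becomes readable. The wait still runs to its timeout and returns 0. */
static int demo_interrupted_wait(void)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;
    int result = -1;
    struct itimerval it = { .it_interval = { 0, 0 }, .it_value = { 0, 20000 } };
    if (install_handler(SIGALRM) == 0 && setitimer(ITIMER_REAL, &it, NULL) == 0)
        result = wait_readable_retrying(fds[0], 100);
    close(fds[0]);
    close(fds[1]);
    return result;
}
```

With a loop of this shape, the signal is delivered and the handler runs, but the caller only regains control once the full timeout (times the number of retries and nameservers, in a resolver) has passed, which from the terminal looks like ^C being ignored.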
This package has changed ownership in the Fedora Package Database. Reassigning to the new owner of this component.

I can't see any reason why gethostbyaddr or its relatives would be ignoring signals. I guess the one additional test you could try would be ping -n which disables the reverse lookup.

I really think you need to look at your tty settings and other aspects of your system to ensure that the requested signals are actually being delivered. I'm closing this as WORKSFORME as nobody else has been able to reproduce this problem.

(In reply to comment #10)
> I can't see any reason why gethostbyaddr or its relatives would be ignoring
> signals. I guess the one additional test you could try would be ping -n which
> disables the reverse lookup.
>
> I really think you need to look at your tty settings and other aspects of your
> system to ensure that the requested signals are actually being delivered. I'm
> closing this as WORKSFORME as nobody else has been able to reproduce this
> problem.

Then this bug might have been fixed in recent versions of Fedora. I'm still seeing the same behavior on my work laptop which runs Ubuntu Lucid: on network outages, ping and other programs get stuck on reverse lookups and can't be killed with ^C.
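The advice in this thread to make sure the signals are not blocked, and that they are actually being delivered, can be checked from inside a process. The sketch below is one possible way to do it under POSIX; signal_is_blocked and signal_is_ignored are hypothetical helper names and are not part of ping or glibc.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* Returns 1 if signo is in the caller's blocked mask, 0 if it is not,
 * and -1 on error. Passing NULL as the new set only queries the mask. */
static int signal_is_blocked(int signo)
{
    assert(signo > 0);
    sigset_t current;
    if (sigprocmask(SIG_BLOCK, NULL, &current) != 0)
        return -1;
    return sigismember(&current, signo);
}

/* Returns 1 if the disposition of signo is SIG_IGN, 0 otherwise,
 * and -1 on error. Passing NULL as the new action only queries it. */
static int signal_is_ignored(int signo)
{
    assert(signo > 0);
    struct sigaction sa;
    if (sigaction(signo, NULL, &sa) != 0)
        return -1;
    return sa.sa_handler == SIG_IGN;
}
```

Calling signal_is_blocked(SIGINT) and signal_is_ignored(SIGINT) early in a test program would show whether it inherited a blocked or ignored SIGINT from its parent. Both returning 0 would rule out those two causes, though not a terminal that never generates the signal in the first place.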
Description of problem:
Programs calling gethostbyaddr() can block for a long time, during which they can't be aborted with ctrl-c, ctrl-\ or kill.

NOTE: I'm running nscd, which may be part of the problem.

Version-Release number of selected component (if applicable):
glibc-2.13-1.x86_64

How reproducible:
Always

Steps to Reproduce:
1. kill your network connection so that the local nameserver cannot perform recursive queries
2. run "ping some_host"
3. try to kill ping

Actual results:
Ping gets stuck in gethostbyaddr() and can't be killed

Expected results:
Ping should be killable at all times.

Additional info:
Backtrace of ping obtained while the process is stuck.

0x00000030004d7248 in __poll (fds=0x7fff30ef7870, nfds=1, timeout=5000) at ../sysdeps/unix/sysv/linux/poll.c:83
83 return INLINE_SYSCALL (poll, 3, CHECK_N (fds, nfds), nfds, timeout);
Missing separate debuginfos, use: debuginfo-install iputils-20100418-3.fc14.x86_64
(gdb) bt
#0 0x00000030004d7248 in __poll (fds=0x7fff30ef7870, nfds=1, timeout=5000) at ../sysdeps/unix/sysv/linux/poll.c:83
#1 0x0000003003c0b5cb in send_dg (statp=0x3000799b80, buf=0x7fff30ef7910 "\244p\001", buflen=45, buf2=0x0, buflen2=0, ans=0x7fff30ef7ae0 "", anssiz=1024, ansp=0x7fff30ef8370, ansp2=0x0, nansp2=0x0, resplen2=0x0) at res_send.c:1058
#2 __libc_res_nsend (statp=0x3000799b80, buf=0x7fff30ef7910 "\244p\001", buflen=45, buf2=0x0, buflen2=0, ans=0x7fff30ef7ae0 "", anssiz=1024, ansp=0x7fff30ef8370, ansp2=0x0, nansp2=0x0, resplen2=0x0) at res_send.c:556
#3 0x0000003003c091b1 in __libc_res_nquery (statp=0x3000799b80, name=0x7fff30ef7f60 "161.76.232.199.in-addr.arpa", class=1, type=12, answer=0x7fff30ef7ae0 "", anslen=1024, answerp=0x7fff30ef8370, answerp2=0x0, nanswerp2=0x0, resplen2=0x0) at res_query.c:225
#4 0x00007fe069d46c00 in _nss_dns_gethostbyaddr2_r (addr=0x7fff30ef84dc, len=<value optimized out>, af=<value optimized out>, result=0x3000799e60, buffer=0xd16700 "\177", buflen=1024, errnop=0x7fe07024c6a0, h_errnop=0x7fff30ef8480, ttlp=0x0) at nss_dns/dns-host.c:471
#5 0x00000030004faab8 in __gethostbyaddr_r (addr=<value optimized out>, len=4, type=2, resbuf=0x3000799e60, buffer=0xd16700 "\177", buflen=1024, result=0x7fff30ef8470, h_errnop=0x7fff30ef8480) at ../nss/getXXbyYY_r.c:256
#6 0x00000030004fa84c in gethostbyaddr (addr=0x7fff30ef84dc, len=4, type=2) at ../nss/getXXbyYY.c:117
#7 0x00000000004021da in ?? ()
#8 0x00000000004044aa in ?? ()
#9 0x0000000000406ba6 in ?? ()
#10 0x000000000040385a in ?? ()
#11 0x000000300041ee5d in __libc_start_main (main=0x402c40, argc=2, ubp_av=0x7fff30ef9d08, init=<value optimized out>, fini=<value optimized out>, rtld_fini=<value optimized out>, stack_end=0x7fff30ef9cf8) at libc-start.c:226
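
Frame #6 of the backtrace shows the blocking call as gethostbyaddr(addr, len=4, type=2), which is an IPv4 reverse lookup (AF_INET has the value 2 on Linux). For anyone wanting to exercise the same code path outside of ping, a minimal sketch of such a call could look like the following; reverse_lookup is a hypothetical helper and is not taken from iputils.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netdb.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>

/* Reverse-resolve a dotted-quad IPv4 string into a host name.
 * Returns 0 and fills name on success, -1 if ip is not a valid IPv4
 * address or no name could be found. */
static int reverse_lookup(const char *ip, char *name, size_t name_len)
{
    assert(name != NULL && name_len > 0);
    struct in_addr addr;
    if (inet_pton(AF_INET, ip, &addr) != 1)
        return -1;
    /* Blocking call: does not return until the resolver has an answer
     * or has used up its timeouts and retries. */
    struct hostent *he = gethostbyaddr(&addr, sizeof addr, AF_INET);
    if (he == NULL)
        return -1;
    snprintf(name, name_len, "%s", he->h_name);
    return 0;
}
```

Calling reverse_lookup("199.232.76.161", buf, sizeof buf) would issue a PTR query for 161.76.232.199.in-addr.arpa, the name visible in frame #3. With the nameserver unreachable, the call is expected to sit in poll() as in frame #0.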