Description of problem:
Sometimes I get these errors in the named log:

Aug 23 12:04:22.294 running
Aug 23 12:04:22.294 client.c:1334: unexpected error:
Aug 23 12:04:22.294 failed to get request's destination: failure

Afterwards, named stops responding to queries for some time. I do not know yet if it stops responding to all queries, or only to queries for certain records. Named seems to recover after several minutes, but during that time it will not answer certain queries.

Version-Release number of selected component (if applicable):
bind-9.2.4-30.el4_8.4

How reproducible:
I have no idea yet what triggers this bug.

Additional info:
It appears that CentOS users have seen this bug, too:
http://bugs.centos.org/view.php?id=2534
The Debian bug report and an email on bind-users have some more info:
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=310772
https://lists.isc.org/pipermail/bind-users/2007-January/065503.html
Netstat output seems to show some stuck sockets:

# netstat -unl | grep :53
udp        0      0 172.17.80.5:53          0.0.0.0:*
udp        0      0 10.1.10.2:53            0.0.0.0:*
udp        0      0 10.0.13.1:53            0.0.0.0:*
udp   110384      0 74.92.59.65:53          0.0.0.0:*
udp        0      0 127.0.0.1:53            0.0.0.0:*
udp   110496      0 :::53                   :::*
Moving the listen-on line in named.conf to below the listen-on-v6 line seems to result in bind having just one socket. So far nothing seems to be stuck.
I spoke too soon. The socket is empty, but bind is still stuck:

# netstat -tnl | grep :53
tcp        0      0 :::53                   :::*                    LISTEN

Aug 23 13:04:56.260 running
Aug 23 13:04:56.261 client.c:1334: unexpected error:
Aug 23 13:04:56.261 failed to get request's destination: failure

# dig @mithlond.surriel.com. hosted.wikiwall.org

; <<>> DiG 9.2.4 <<>> @mithlond.surriel.com. hosted.wikiwall.org
; (2 servers found)
;; global options:  printcmd
;; connection timed out; no servers could be reached

I guess "full sockets" are a symptom and not a cause of this problem.
Another update: it seems that sometimes named does not recover and stays unresponsive.
(In reply to comment #2)
> Netstat output seems to show some stuck sockets:
>
> # netstat -unl | grep :53
> udp        0      0 172.17.80.5:53          0.0.0.0:*
> udp        0      0 10.1.10.2:53            0.0.0.0:*
> udp        0      0 10.0.13.1:53            0.0.0.0:*
> udp   110384      0 74.92.59.65:53          0.0.0.0:*
> udp        0      0 127.0.0.1:53            0.0.0.0:*
> udp   110496      0 :::53                   :::*

Those sockets are not stuck; "named" deals differently with IPv4 and IPv6 interfaces. It creates per-interface IPv4 sockets (one listening socket for each interface) and a single global IPv6 socket.
Created attachment 358860 [details]
proposed patch

I'm not able to reproduce this issue, but there is a bug in the code. Would it be possible to test this patch, please? Let me know if you would like an SRPM or binary RPM instead of the patch. Thanks.
It looks like named can still get into a state where it answers only a small portion of the incoming queries. It no longer seems to get completely stuck, though.

# netstat -unl | grep :53
udp        0      0 10.1.10.2:53            0.0.0.0:*
udp        0      0 10.0.13.1:53            0.0.0.0:*
udp   110384      0 74.92.59.65:53          0.0.0.0:*
udp        0      0 127.0.0.1:53            0.0.0.0:*
udp        0      0 :::53                   :::*

10:06:31.901471 IP 195.38.32.3.domain > 74.92.59.65.domain: 25096% [1au] A? ns-nlo.surriel.com. (47)
10:06:31.901532 IP 195.38.32.3.domain > 74.92.59.65.domain: 22843% [1au] AAAA? ns-nlo.surriel.com. (47)
10:06:31.901618 IP 195.38.32.3.domain > 74.92.59.65.domain: 7509% [1au] A? imladris.surriel.com. (49)
10:06:31.901682 IP 195.38.32.3.domain > 74.92.59.65.domain: 51690% [1au] AAAA? imladris.surriel.com. (49)
10:06:31.919099 IP 58.253.81.77.41517 > 74.92.59.65.domain: 23875 [1au] A? 114.132.166.89.cart00ney.surriel.com. (65)
10:06:31.939701 IP 202.14.67.14.3758 > 74.92.59.65.domain: 55549 A? 96.160.175.85.cart00ney.surriel.com. (53)
10:06:31.957780 IP 202.126.109.231.62708 > 74.92.59.65.domain: 9005 A? rbldnsd.surriel.com. (37)
10:06:31.999357 IP 193.153.230.25.1205 > 74.92.59.65.domain: 2346+ MX? members.offshoreexecutive.com. (47)
10:06:32.004902 IP 168.95.192.134.34023 > 74.92.59.65.domain: 24538 [1au] MX? quilch.net. (39)
Created attachment 358880 [details]
strace of the watcher thread

File descriptor 23 seems to be the socket that got stuck. For some reason it appears to have been dropped from the socket set queried in select()...

# lsof -p 7925
COMMAND PID  USER  FD  TYPE DEVICE    SIZE NODE    NAME
named   7925 named cwd DIR  3,2       4096 278018  /var/named/chroot/var/named
named   7925 named rtd DIR  3,2       4096 278009  /var/named/chroot
named   7925 named txt REG  3,2     287080 745072  /usr/sbin/named
named   7925 named mem REG  3,2    1059856 1211670 /usr/lib/libdns.so.16.0.0
named   7925 named mem REG  3,2     101748 919235  /lib/libnsl-2.3.4.so
named   7925 named mem REG  3,2     415188 747358  /usr/lib/libkrb5.so.3.2
named   7925 named mem REG  3,2      63624 743055  /usr/lib/libz.so.1.2.1.2
named   7925 named mem REG  3,2      47420 917562  /lib/libnss_files-2.3.4.so
named   7925 named mem REG  3,2     136016 747334  /usr/lib/libk5crypto.so.3.0
named   7925 named mem REG  3,2      43036 915950  /lib/libnss_nis-2.3.4.so
named   7925 named mem REG  3,2       7004 918805  /lib/libcom_err.so.2.1
named   7925 named mem REG  3,2      81140 919230  /lib/libresolv-2.3.4.so
named   7925 named mem REG  3,2      25460 1177681 /usr/lib/libisccc.so.0.1.0
named   7925 named mem REG  3,2      16748 917251  /lib/libdl-2.3.4.so
named   7925 named mem REG  3,2    1547588 916656  /lib/tls/libc-2.3.4.so
named   7925 named mem REG  3,2     112212 916646  /lib/ld-2.3.4.so
named   7925 named mem REG  3,2     945120 919231  /lib/libcrypto.so.0.9.7a
named   7925 named mem REG  3,2      57252 1177754 /usr/lib/libisccfg.so.0.0.11
named   7925 named mem REG  3,2      59684 1177796 /usr/lib/liblwres.so.1.1.2
named   7925 named mem REG  3,2     107928 919234  /lib/tls/libpthread-2.3.4.so
named   7925 named mem REG  3,2     224764 1177485 /usr/lib/libisc.so.7.1.5
named   7925 named mem REG  3,2      82944 748468  /usr/lib/libgssapi_krb5.so.2.2
named   7925 named 0u  CHR  1,3            1978    /dev/null
named   7925 named 1u  CHR  1,3            1978    /dev/null
named   7925 named 2u  CHR  1,3            1978    /dev/null
named   7925 named 3u  unix 0xef683dc0     53675066 socket
named   7925 named 4u  CHR  1,3            1978    /dev/null
named   7925 named 5r  FIFO 0,7            53675074 pipe
named   7925 named 6w  REG  3,2    1012631 541458  /var/named/chroot/var/log/named/lame_log
named   7925 named 7w  FIFO 0,7            53675074 pipe
named   7925 named 8r  CHR  1,8            281775  /var/named/chroot/dev/random
named   7925 named 9w  REG  3,2     160571 278071  /var/named/chroot/var/log/named/named_log
named   7925 named 20u IPv6            53675077 UDP *:domain
named   7925 named 21u IPv6            53675078 TCP *:domain (LISTEN)
named   7925 named 22u IPv4            53675080 UDP localhost.localdomain:domain
named   7925 named 23u IPv4            53675082 UDP mithlond.surriel.com:domain
named   7925 named 24u IPv4            53675084 UDP imladris.home.surriel.com:domain
named   7925 named 25u IPv4            53675086 UDP 10.1.10.2:domain
named   7925 named 26u IPv4            53675088 UDP *:38116
named   7925 named 27u IPv6            53675089 UDP *:34923
named   7925 named 28u IPv4            53675090 TCP localhost.localdomain:rndc (LISTEN)
named   7925 named 31u IPv4            53689610 UDP *:58741

[root@mithlond ~]# netstat -unl | grep :53
udp        0      0 10.1.10.2:53            0.0.0.0:*
udp        0      0 10.0.13.1:53            0.0.0.0:*
udp   110384      0 74.92.59.65:53          0.0.0.0:*
udp        0      0 127.0.0.1:53            0.0.0.0:*
udp        0      0 :::53                   :::*
<atkac> riel: hm, it really looks that the socket is "lost"
<riel> atkac: yeah
<riel> I am _guessing_ (have not read all the code yet) that things work something like this:
<riel> 1) waiter thread gets a select event on the socket
<riel> 2) waiter thread takes socket out of its select set
<riel> 3) waiter thread wakes up a worker thread
<riel> 4) worker thread does work on the socket, until it is empty
<riel> 5) worker thread signals waiter thread to put the socket back in the select set
<riel> 6) waiter thread puts the socket back into its select set
<riel> if there is a race condition somewhere, it is possible the socket stays permanently removed from the select set, without the worker thread trying to do work on the socket
<atkac> riel: if I remember the code correctly you are quite right
<atkac> riel: but it's interesting that no one else hits this problem
<atkac> riel: do you know if the lost socket is bound to an ipv4 or ipv6 address?
<riel> atkac: could be because named uses ~20% of the CPU here when answering all the queries it gets :)
<riel> ipv4 - but I bet ipv6 would show the same problem if it was -this- busy
I tried another debugging approach: log all the queries and see what the last non-local IPv4 query is that was received before a slowdown. It turns out that during every recursive query that times out, named does not handle authoritative queries. After the recursive query has been handled, named continues to process authoritative queries from the same socket. The problem is that my spam-analyzing scripts (for PSBL) make several queries a second that end up timing out, which shuts out most other DNS requests from being processed.
I confirmed that it is indeed recursive queries by having the spam-processing virtual machine send its queries to other IP addresses (other named UDP sockets) on this name server. Lo and behold, now I can sometimes see multiple "stuck" sockets:

# netstat -unl | grep :53
udp        0      0 172.17.64.2:53          0.0.0.0:*
udp     1520      0 10.1.10.2:53            0.0.0.0:*
udp     7904      0 10.0.13.1:53            0.0.0.0:*
udp   110384      0 74.92.59.65:53          0.0.0.0:*
udp        0      0 127.0.0.1:53            0.0.0.0:*
udp        0      0 :::53                   :::*
It gets even more interesting. I can see periods of several seconds where named only answers the recursive queries from the spam-handling virtual machine, excluding all authoritative queries from external clients during that time. When that happens, the socket on 74.92.59.65:53 is "full"...
Would it be possible to check if everything works fine with the patch (comment #7) and /proc/sys/net/ipv6/bindv6only set to "1", please? Thanks.
Yes, with bindv6only set to "1" everything works fine. With bindv6only set to "0", named can get stuck.
Hi, are there any new rpms in errata dealing with this problem?
(In reply to comment #16)
> Hi,
> are there any new rpm in errata dealing with this probleme ?

Not yet. If you would like to get updated rpms soon, would it be possible to open a ticket in the issue tracker, please? Such bugs are treated with higher priority. By the way, you can set /proc/sys/net/ipv6/bindv6only to "1"; it should be a sufficient workaround.
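For reference, a sketch of how that workaround is typically applied on a system like this (requires root; restart named afterwards so it re-creates its listening sockets):

```shell
# Take effect immediately (lost on reboot):
echo 1 > /proc/sys/net/ipv6/bindv6only
# or equivalently:
sysctl -w net.ipv6.bindv6only=1

# To persist across reboots, add this line to /etc/sysctl.conf:
#   net.ipv6.bindv6only = 1

# Then restart named:
service named restart
```

With bindv6only=1, new AF_INET6 sockets default to IPV6_V6ONLY, so IPv4 traffic never reaches named's IPv6 socket and the broken IPv4-mapped code path is avoided.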
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
* When named received an IPv4-mapped request on an IPv6 interface, an error in named caused it to stop responding with the following log message:

client.c:1334: unexpected error:
failed to get request's destination: failure

This error has been corrected: named no longer freezes, and no longer accepts IPv4-mapped requests on IPv6 interfaces.
Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1 @@
-* when named received an IPv4-mapped request on an IPv6 interface an error in named caused it to stop responding with the following log message: client.c:1334: unexpected error: failed to get request's destination: failure This error has been corrected and named no longer freezes and no longer accepts IPv4-mapped requests on IPv6 interfaces.
+When named received an IPv4-mapped request on an IPv6 interface an error in named caused it to stop responding with the following log message: client.c:1334: unexpected error: failed to get request's destination: failure This error has been corrected and named no longer freezes and no longer accepts IPv4-mapped requests on IPv6 interfaces.
=> Verified by Customer
=> VERIFIED
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0223.html