Bug 518866 - named stops responding after an unexpected error in client.c
named stops responding after an unexpected error in client.c
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: bind (Show other bugs)
4.9
All Linux
medium Severity high
: rc
: ---
Assigned To: Adam Tkac
Martin Cermak
:
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2009-08-23 12:32 EDT by Rik van Riel
Modified: 2016-06-17 17:10 EDT (History)
5 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
When named received an IPv4-mapped request on an IPv6 interface an error in named caused it to stop responding with the following log message: client.c:1334: unexpected error: failed to get request's destination: failure This error has been corrected and named no longer freezes and no longer accepts IPv4-mapped requests on IPv6 interfaces.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2011-02-16 09:05:58 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:


Attachments (Terms of Use)
proposed patch (462 bytes, patch)
2009-08-27 08:46 EDT, Adam Tkac
no flags Details | Diff
strace of the watcher thread (46.17 KB, text/plain)
2009-08-27 10:33 EDT, Rik van Riel
no flags Details

  None (edit)
Description Rik van Riel 2009-08-23 12:32:55 EDT
Description of problem:

Sometimes I get these errors in the named log:

Aug 23 12:04:22.294 running
Aug 23 12:04:22.294 client.c:1334: unexpected error:
Aug 23 12:04:22.294 failed to get request's destination: failure

Afterwards, named stops responding to queries for some time.  I do not know yet if it stops responding to all queries, or only to queries for certain records. Named seems to recover after several minutes, but during that time it will not answer certain queries.

Version-Release number of selected component (if applicable):

bind-9.2.4-30.el4_8.4

How reproducible:

I have no idea yet what triggers this bug.

Additional info:

It appears that CentOS users have seen this bug, too:

http://bugs.centos.org/view.php?id=2534
Comment 1 Rik van Riel 2009-08-23 12:42:17 EDT
The Debian bug report and an email on bind-users have some more info:

http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=310772

https://lists.isc.org/pipermail/bind-users/2007-January/065503.html
Comment 2 Rik van Riel 2009-08-23 12:44:05 EDT
Netstat output seems to show some stuck sockets:

# netstat -unl | grep :53
udp        0      0 172.17.80.5:53             0.0.0.0:*                               
udp        0      0 10.1.10.2:53                0.0.0.0:*                               
udp        0      0 10.0.13.1:53                0.0.0.0:*                               
udp   110384      0 74.92.59.65:53              0.0.0.0:*                               
udp        0      0 127.0.0.1:53                0.0.0.0:*                               
udp   110496      0 :::53                       :::*
Comment 3 Rik van Riel 2009-08-23 13:06:58 EDT
Moving the listen-on line in named.conf to below the listen-on-v6 line seems to result in bind having just one socket. So far nothing seems to be stuck.
Comment 4 Rik van Riel 2009-08-23 13:11:59 EDT
I spoke too soon.  The socket is empty, but bind is still stuck:

# netstat -tnl | grep :53
tcp        0      0 :::53                       :::*                        LISTEN      


Aug 23 13:04:56.260 running
Aug 23 13:04:56.261 client.c:1334: unexpected error:
Aug 23 13:04:56.261 failed to get request's destination: failure


# dig @mithlond.surriel.com. hosted.wikiwall.org

; <<>> DiG 9.2.4 <<>> @mithlond.surriel.com. hosted.wikiwall.org
; (2 servers found)
;; global options:  printcmd
;; connection timed out; no servers could be reached


I guess "full sockets" are a symptom and not a cause of this problem.
Comment 5 Rik van Riel 2009-08-23 13:16:14 EDT
Another update: it seems that sometimes named does not recover and stays unresponsive.
Comment 6 Adam Tkac 2009-08-27 08:43:27 EDT
(In reply to comment #2)
> Netstat output seems to show some stuck sockets:
> 
> # netstat -unl | grep :53
> udp        0      0 172.17.80.5:53             0.0.0.0:*                        
> udp        0      0 10.1.10.2:53                0.0.0.0:*                       
> udp        0      0 10.0.13.1:53                0.0.0.0:*                       
> udp   110384      0 74.92.59.65:53              0.0.0.0:*                       
> udp        0      0 127.0.0.1:53                0.0.0.0:*                       
> udp   110496      0 :::53                       :::*  

Those sockets are not stuck, "named" deals differently with IPv4 and IPv6 interfaces. It creates per-interface IPv4 sockets (one listenning socket for each interface) and a global IPv6 socket.
Comment 7 Adam Tkac 2009-08-27 08:46:19 EDT
Created attachment 358860 [details]
proposed patch

I'm not able to reproduce this issue but there is a bug in code. Would it be possible to test this patch, please? Let me know if you would like SRPM or binary RPM instead of patch. Thanks.
Comment 8 Rik van Riel 2009-08-27 10:14:02 EDT
It looks like named can still get into a state where it answers only a small portion of the incoming queries.  It no longer seems to be completely stuck, though.

# netstat -unl | grep :53
udp        0      0 10.1.10.2:53                0.0.0.0:*                               
udp        0      0 10.0.13.1:53                0.0.0.0:*                               
udp   110384      0 74.92.59.65:53              0.0.0.0:*                               
udp        0      0 127.0.0.1:53                0.0.0.0:*                               
udp        0      0 :::53                       :::*     

10:06:31.901471 IP 195.38.32.3.domain > 74.92.59.65.domain:  25096% [1au] A? ns-nlo.surriel.com. (47)
10:06:31.901532 IP 195.38.32.3.domain > 74.92.59.65.domain:  22843% [1au] AAAA? ns-nlo.surriel.com. (47)
10:06:31.901618 IP 195.38.32.3.domain > 74.92.59.65.domain:  7509% [1au] A? imladris.surriel.com. (49)
10:06:31.901682 IP 195.38.32.3.domain > 74.92.59.65.domain:  51690% [1au] AAAA? imladris.surriel.com. (49)
10:06:31.919099 IP 58.253.81.77.41517 > 74.92.59.65.domain:  23875 [1au] A? 114.132.166.89.cart00ney.surriel.com. (65)
10:06:31.939701 IP 202.14.67.14.3758 > 74.92.59.65.domain:  55549 A? 96.160.175.85.cart00ney.surriel.com. (53)
10:06:31.957780 IP 202.126.109.231.62708 > 74.92.59.65.domain:  9005 A? rbldnsd.surriel.com. (37)
10:06:31.999357 IP 193.153.230.25.1205 > 74.92.59.65.domain:  2346+ MX? members.offshoreexecutive.com. (47)
10:06:32.004902 IP 168.95.192.134.34023 > 74.92.59.65.domain:  24538 [1au] MX? quilch.net. (39)
Comment 9 Rik van Riel 2009-08-27 10:33:32 EDT
Created attachment 358880 [details]
strace of the watcher thread

File descriptor 23 seems to be the socket that got stuck.  For some reason it appears to have been dropped from the socket set to query in select()...

# lsof -p 7925
COMMAND  PID  USER   FD   TYPE     DEVICE    SIZE     NODE NAME
named   7925 named  cwd    DIR        3,2    4096   278018 /var/named/chroot/var/named
named   7925 named  rtd    DIR        3,2    4096   278009 /var/named/chroot
named   7925 named  txt    REG        3,2  287080   745072 /usr/sbin/named
named   7925 named  mem    REG        3,2 1059856  1211670 /usr/lib/libdns.so.16.0.0
named   7925 named  mem    REG        3,2  101748   919235 /lib/libnsl-2.3.4.so
named   7925 named  mem    REG        3,2  415188   747358 /usr/lib/libkrb5.so.3.2
named   7925 named  mem    REG        3,2   63624   743055 /usr/lib/libz.so.1.2.1.2
named   7925 named  mem    REG        3,2   47420   917562 /lib/libnss_files-2.3.4.so
named   7925 named  mem    REG        3,2  136016   747334 /usr/lib/libk5crypto.so.3.0
named   7925 named  mem    REG        3,2   43036   915950 /lib/libnss_nis-2.3.4.so
named   7925 named  mem    REG        3,2    7004   918805 /lib/libcom_err.so.2.1
named   7925 named  mem    REG        3,2   81140   919230 /lib/libresolv-2.3.4.so
named   7925 named  mem    REG        3,2   25460  1177681 /usr/lib/libisccc.so.0.1.0
named   7925 named  mem    REG        3,2   16748   917251 /lib/libdl-2.3.4.so
named   7925 named  mem    REG        3,2 1547588   916656 /lib/tls/libc-2.3.4.so
named   7925 named  mem    REG        3,2  112212   916646 /lib/ld-2.3.4.so
named   7925 named  mem    REG        3,2  945120   919231 /lib/libcrypto.so.0.9.7a
named   7925 named  mem    REG        3,2   57252  1177754 /usr/lib/libisccfg.so.0.0.11
named   7925 named  mem    REG        3,2   59684  1177796 /usr/lib/liblwres.so.1.1.2
named   7925 named  mem    REG        3,2  107928   919234 /lib/tls/libpthread-2.3.4.so
named   7925 named  mem    REG        3,2  224764  1177485 /usr/lib/libisc.so.7.1.5
named   7925 named  mem    REG        3,2   82944   748468 /usr/lib/libgssapi_krb5.so.2.2
named   7925 named    0u   CHR        1,3             1978 /dev/null
named   7925 named    1u   CHR        1,3             1978 /dev/null
named   7925 named    2u   CHR        1,3             1978 /dev/null
named   7925 named    3u  unix 0xef683dc0         53675066 socket
named   7925 named    4u   CHR        1,3             1978 /dev/null
named   7925 named    5r  FIFO        0,7         53675074 pipe
named   7925 named    6w   REG        3,2 1012631   541458 /var/named/chroot/var/log/named/lame_log
named   7925 named    7w  FIFO        0,7         53675074 pipe
named   7925 named    8r   CHR        1,8           281775 /var/named/chroot/dev/random
named   7925 named    9w   REG        3,2  160571   278071 /var/named/chroot/var/log/named/named_log
named   7925 named   20u  IPv6   53675077              UDP *:domain 
named   7925 named   21u  IPv6   53675078              TCP *:domain (LISTEN)
named   7925 named   22u  IPv4   53675080              UDP localhost.localdomain:domain 
named   7925 named   23u  IPv4   53675082              UDP mithlond.surriel.com:domain 
named   7925 named   24u  IPv4   53675084              UDP imladris.home.surriel.com:domain 
named   7925 named   25u  IPv4   53675086              UDP 10.1.10.2:domain 
named   7925 named   26u  IPv4   53675088              UDP *:38116 
named   7925 named   27u  IPv6   53675089              UDP *:34923 
named   7925 named   28u  IPv4   53675090              TCP localhost.localdomain:rndc (LISTEN)
named   7925 named   31u  IPv4   53689610              UDP *:58741 


[root@mithlond ~]# netstat -unl | grep :53
udp        0      0 10.1.10.2:53                0.0.0.0:*                               
udp        0      0 10.0.13.1:53                0.0.0.0:*                               
udp   110384      0 74.92.59.65:53              0.0.0.0:*                               
udp        0      0 127.0.0.1:53                0.0.0.0:*                               
udp        0      0 :::53                       :::*
Comment 10 Rik van Riel 2009-08-27 11:22:27 EDT
<atkac> riel: hm, it really looks that the socket is "lost"
<riel> atkac: yeah
<riel> I am _guessing_ (have not read all the code yet) that things work something like this:
<riel> 1) waiter thread gets a select event on the socket
<riel> 2) waiter thread takes socket out of its select set
<riel> 3) waiter thread wakes up a worker thread
<riel> 4) worker thread does work on the socket, until it is empty
<riel> 5) worker thread signals waiter thread to put the socket back in the select set
<riel> 6) waiter thread puts the socket back into its select set
<riel> if there is a race condition somewhere, it is possible the socket stays permanently removed from the select set, without the worker thread trying to do work on the socket
<atkac> riel: if I remember code correctly you are quite right
<atkac> riel: but it's interesting that noone hits this problem as well
<atkac> riel: do you know if the lost socket is bound to ipv4 or ipv6 address?
<riel> atkac: could be because named uses ~20% of the CPU here when answering all the queries it gets :)
<riel> ipv4 - but I bet ipv6 would show the same problem if it was -this- busy
Comment 11 Rik van Riel 2009-08-28 10:06:22 EDT
I tried another debugging approach: log all the queries and see what the last non-local ipv4 query is that was received before a slowdown. Turns out that during every recursive query that times out, named does not handle authoritive queries.  

After the recursive query has been handled, named continues to process authoritive queries from the same socket.  The problem is that my spam analyzing scripts (for PSBL) make several queries a second that end up timing out, which excludes most other DNS requests from being processed.
Comment 12 Rik van Riel 2009-08-28 10:08:44 EDT
I confirmed that it indeed seem to be recursive queries by having the spam processing virtual machine send its queries to other IP addresses (other named UDP sockets) on this name server.  Lo and behold, now I can sometimes see multiple "stuck" sockets:

# netstat -unl | grep :53
udp        0      0 172.17.64.2:53              0.0.0.0:*                               
udp     1520      0 10.1.10.2:53                0.0.0.0:*                               
udp     7904      0 10.0.13.1:53                0.0.0.0:*                               
udp   110384      0 74.92.59.65:53              0.0.0.0:*                               
udp        0      0 127.0.0.1:53                0.0.0.0:*                               
udp        0      0 :::53                       :::*
Comment 13 Rik van Riel 2009-08-28 10:14:21 EDT
It gets even more interesting.  I can see periods of several seconds where named only answers the recursive queries from the spam handling virtual machine, excluding all authoritive queries from external clients during that time.  When that happens, the socket on 74.92.59.65:53 is "full"...
Comment 14 Adam Tkac 2009-09-07 10:46:12 EDT
Would it be possible to check if everything works fine with the patch (comment #7) and /proc/sys/net/ipv6/bindv6only set to "1", please? Thanks.
Comment 15 Rik van Riel 2009-09-07 11:10:55 EDT
Yes, with bind6only set to "1" everything works fine.

With bind6only set to "0", named can get stuck.
Comment 16 malek shabou 2009-09-16 10:58:20 EDT
Hi,
are there any new rpm in errata dealing with this probleme ?
Comment 17 Adam Tkac 2009-09-16 11:16:14 EDT
(In reply to comment #16)
> Hi,
> are there any new rpm in errata dealing with this probleme ?  

Not yet. If you would like to get updated rpms soon would it be possible to open a ticket in issue tracker, please? Such bugs are treated with higher priority.

Btw you can set /proc/sys/net/ipv6/bindv6only to "1", it should be sufficient workaround.
Comment 20 Martin Prpic 2010-06-11 08:31:37 EDT
Technical note added. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
* when named received an IPv4-mapped request on an IPv6 interface an error in named caused it to stop responding with the following log message: client.c:1334: unexpected error: failed to get request's destination: failure This error has been corrected and named no longer freezes and no longer accepts IPv4-mapped requests on IPv6 interfaces.
Comment 21 Douglas Silas 2010-06-14 03:06:36 EDT
Technical note updated. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1 @@
-* when named received an IPv4-mapped request on an IPv6 interface an error in named caused it to stop responding with the following log message: client.c:1334: unexpected error: failed to get request's destination: failure This error has been corrected and named no longer freezes and no longer accepts IPv4-mapped requests on IPv6 interfaces.+When named received an IPv4-mapped request on an IPv6 interface an error in named caused it to stop responding with the following log message: client.c:1334: unexpected error: failed to get request's destination: failure This error has been corrected and named no longer freezes and no longer accepts IPv4-mapped requests on IPv6 interfaces.
Comment 23 Martin Cermak 2010-06-21 06:22:31 EDT
=> Verified by Customer
Comment 27 Martin Cermak 2010-06-22 06:36:50 EDT
=> VERIFIED
Comment 29 errata-xmlrpc 2011-02-16 09:05:58 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0223.html

Note You need to log in before you can comment on or make changes to this bug.