Red Hat Bugzilla – Bug 427629
named stops responding if it cannot communicate over IPSEC
Last modified: 2008-05-30 07:49:35 EDT
Description of problem:
Primary and secondary DNS servers are connected via IPSEC (VPN). If the link is
not established when the secondary tries to transfer the zone(s) from the
primary, it stops responding to queries and also to rndc and must be terminated
with kill -9.
Version-Release number of selected component (if applicable):
How reproducible:
always or often (I could not experiment on a production server)
Steps to Reproduce:
1. configure named as described above
2. service named restart (zone transfer occurs during startup)
3. e.g. rndc status
Actual results:
named hangs, rndc waits until interrupted
Expected results:
named starts normally, rndc shows its status. An unreachable primary server is a
normal condition for DNS operation and should not cause such problems.
Would it be possible to get stack traces when named stops responding? When this
happens again, please attach gdb to named, run "info threads", and then print a
stack trace in each thread (bt command). You can switch threads with the
thread <number> command. This looks like a kernel problem to me.
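The traces requested above can also be collected in one non-interactive gdb run instead of switching threads by hand. This is only a minimal sketch: it assumes gdb is installed and that you pass it the PID of the hung named process. `named_backtrace` is a hypothetical helper name used for illustration, not part of bind or gdb.

```shell
#!/bin/sh
# Sketch: dump the thread list and a backtrace of every thread of a process.
# named_backtrace is a hypothetical helper name, not part of bind or gdb.
named_backtrace() {
    pid="$1"
    # -p attaches to the running process, -batch exits after the -ex commands
    # have run, and "thread apply all bt" prints a backtrace for each thread.
    gdb -p "$pid" -batch -ex "info threads" -ex "thread apply all bt"
}

# Usage (as root, while named is unresponsive):
#   named_backtrace "$(pidof named)" > named-gdb.txt 2>&1
```

The output is far more readable with the matching debuginfo package installed, because otherwise most frames show up without symbol names.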
I'm sorry for the delay. I'm going to attach the output of gdb in a short while.
I did the following:
- checked that the DNS is running fine
- simulated IPSEC failure:
- blocked the remote server (DNS master) in iptables
- # setkey -F (flush the IPSEC tables)
- # service named reload
(force a zone transfer from the unreachable remote server)
- checked the DNS, the server was unresponsive
- obtained the gdb stack traces
- restored the iptables, IPSEC link has recovered immediately
- checked the DNS server again, it was OK, it has recovered too!
Created attachment 292515 [details]
output of gdb
(In reply to comment #3)
> Created an attachment (id=292515) 
> output of gdb
Hm, you have to install the bind-debuginfo package (same version as the bind
packages). Without it the output is of very little use :( ('?' characters
etc...). Would it be possible to attach the traces again with debuginfo
installed, please?
Created attachment 293281 [details]
gdb output with debuginfo
Updated to the bind-9.4.2-3.fc7 version in the meantime.
Thank you very much. This is exactly what I need.
It really looks like a kernel problem. The squid behaves the same. If a website
connected via an IPSEC link being currently down is requested, the squid proxy
This bug is capable of bringing important services to a halt.
Interesting. As written in comment #7, reassigning to kernel for inspection. When
sendmsg is called (on a UDP socket) and the IPSEC link is down, the sendmsg call
simply blocks. I'm not sure if this is expected behavior (I think not).
I'm actually seeing this under Ubuntu (no offense guys) and the 18.104.22.168 kernel,
so it's not likely RH specific. Nevertheless, a fix would be amazingly helpful. :)
I've confirmed that this is still an issue in 2.6.25.
I forwarded this issue to the lkml, and David Miller quickly suggested the
following:
echo "1" >/proc/sys/net/core/xfrm_larval_drop
So far it looks good to me. Can anyone confirm?
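For anyone who wants to check the setting before and after applying the workaround, here is a small sketch. As far as I understand it, with the value set to 1 the kernel drops packets that match an IPSEC policy with no established SA instead of making the sender wait. `larval_drop_status` and `LARVAL_DROP_FILE` are illustrative names of my own; only the /proc path comes from the suggestion above.

```shell
#!/bin/sh
# Sketch: report whether xfrm_larval_drop is enabled on this machine.
# LARVAL_DROP_FILE and larval_drop_status are illustrative names only;
# the variable exists so the path can be overridden.
LARVAL_DROP_FILE="${LARVAL_DROP_FILE:-/proc/sys/net/core/xfrm_larval_drop}"

larval_drop_status() {
    # Kernels without this sysctl do not provide the file at all.
    if [ ! -r "$LARVAL_DROP_FILE" ]; then
        echo "unavailable"
        return 1
    fi
    if [ "$(cat "$LARVAL_DROP_FILE")" = "1" ]; then
        echo "enabled"
    else
        echo "disabled"
    fi
}

# Usage:
#   larval_drop_status
# To enable it until the next reboot (as root):
#   sysctl -w net.core.xfrm_larval_drop=1
# To keep it across reboots, add this line to /etc/sysctl.conf:
#   net.core.xfrm_larval_drop = 1
```

The echo into /proc and `sysctl -w` are equivalent and both are lost on reboot, so the sysctl.conf entry is needed to make the setting permanent.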
It is amazingly helpful (-> comment#9). Named does not hang. Thank you.
I have found a related discussion about whether xfrm_larval_drop = 1 should be
the default setting, but there was no clear conclusion:
http://lkml.org/lkml/2007/12/4/260
As of May 04, 2008, this seems to be present in Fedora 9 (rawhide) too.
If confirmed on F9, please change this bug's "version" field to rawhide, since I
can't do it.
This message is a reminder that Fedora 7 is nearing the end of life. Approximately 30 (thirty) days from now Fedora will stop maintaining and issuing updates for Fedora 7. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '7'.
Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 7's end of life.
Bug Reporter: Thank you for reporting this issue and we are sorry that we may not be able to fix it before Fedora 7 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora please change the 'version' of this bug. If you are unable to change the version, please add a comment here and someone will do it for you.
Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete. If possible, it is recommended that you try the newest available Fedora distribution to see if your bug still exists.
Please read the Release Notes for the newest Fedora distribution to make sure it will meet your needs:
The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Replying to the Bug Zapper standard message:
The reported problem very probably exists in Fedora 8 and 9, but I (the
reporter) am not able to check and confirm that.
Furthermore, I'm not sure how or where to fix it. Is it a bug, is it some kind
of misconfiguration that comment #11 has already fixed, or is it just a lack of
documentation for xfrm_larval_drop?
For these reasons I would prefer that somebody with more knowledge and
information decide whether this bug should expire with F7-EOL or not.
Hm, from lkml it seems that the network maintainer thinks the current state is
correct (http://lkml.org/lkml/2007/12/4/260). But I (and also the upstream
developers) think that if a socket is marked as non-blocking, it should not block.
In my opinion xfrm_larval_drop should be dropped, because if someone marks a
socket as non-blocking he has reasons for it. Or at least it should be set to
"1" by default. Other kernels (FreeBSD, I think) don't suffer from this feature.
I can create a patch which sets xfrm_larval_drop to "1" when bind starts, but it
is a nasty hack and only hides the original problem. Could someone from our
kernel guys try to persuade Dave Miller to set xfrm_larval_drop to 1 by default,
please? Thanks
(In reply to comment #17)
> I can create patch which sets xfrm_larval_drop to "1" when bind starts but it is
> nasty hack and only hides original problem. Could someone from our kernel guys
> try persuade Dave Miller to set xfrm_larval_drop to 1 by default, please? Thanks
I don't think that's going to happen...
(In reply to comment #18)
> I don't think that's going to happen...
Ok, I passed this information on to BIND upstream. They should document it in
the FAQ or somewhere else in the documentation.