Bug 427629

Summary: named stops responding if it cannot communicate over IPSEC
Product: [Fedora] Fedora
Reporter: Vlado Potisk <reg.bugs>
Component: kernel
Assignee: Herbert Xu <herbert.xu>
Status: CLOSED NOTABUG
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high
Priority: low
Version: rawhide
CC: atkac, bugzilla, kernel-maint, k.georgiou, rsandu2004, triage
Target Milestone: ---
Target Release: ---
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2008-05-30 04:05:58 UTC
Attachments:
Description                Flags
output of gdb              none
gdb output with debuginfo  none

Description Vlado Potisk 2008-01-05 19:01:17 UTC
Description of problem:
Primary and secondary DNS servers are connected via IPSEC (VPN). If the link is
not established when the secondary tries to transfer the zone(s) from the
primary, named on the secondary stops responding to queries and also to rndc,
and must be terminated with kill -9.

Version-Release number of selected component (if applicable):
bind-9.4.2-2.fc7
bind-chroot-9.4.2-2.fc7

How reproducible:
always or often (I could not experiment on a production server)

Steps to Reproduce:
1. configure named as described above
2. service named restart (zone transfer occurs during startup)
3. e.g. rndc status
  
Actual results:
named hangs, rndc waits until interrupted

Expected results:
named starts normally, rndc shows its status. Unreachable primary server is a
normal condition for DNS operation and should not cause such problems.

Comment 1 Adam Tkac 2008-01-09 17:34:57 UTC
Would it be possible to get stack traces when named stops responding? When this
happens again, please attach gdb to named, run "info threads", and then print a
stack trace in every thread (bt command). You can switch threads with the
thread <number> command. This looks like a kernel problem to me.

Thanks
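
The stack traces requested above can also be collected in one non-interactive gdb run. This is a minimal sketch, assuming gdb and pidof are installed and that named runs as a single process; the helper name dump_threads and the GDB override variable are made up for illustration:

```shell
#!/bin/sh
# dump_threads is a hypothetical helper, not part of gdb or BIND.
# It attaches gdb to the given PID, lists the threads, prints a
# backtrace for every thread, and then detaches (-batch).
# Setting GDB lets you substitute another binary, e.g. for a dry run.
dump_threads() {
    "${GDB:-gdb}" -p "$1" -batch -ex "info threads" -ex "thread apply all bt"
}

# Usage (as root, while named is unresponsive):
#   dump_threads "$(pidof named)" > named-threads.txt 2>&1
```

"thread apply all bt" does the same job as switching to each thread by hand and running bt. The matching bind-debuginfo package still has to be installed for the symbols to resolve.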

Comment 2 Vlado Potisk 2008-01-22 14:30:49 UTC
I'm sorry for the delay. I'm going to attach the output of gdb in a short while.
I did the following:
- checked that the DNS is running fine
- simulated IPSEC failure:
    - blocked the remote server (DNS master) in iptables
    - # setkey -F (flush the IPSEC tables)
- # service named reload
 (force a zone transfer from the unreachable remote server)
- checked the DNS, the server was unresponsive
- obtained the gdb stack traces
- restored the iptables rules; the IPSEC link recovered immediately
- checked the DNS server again; it was OK, it had recovered too!
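
The steps above could be scripted roughly as follows. This is only a sketch under assumptions: the exact iptables rule used is not given in this report, MASTER is a placeholder (192.0.2.1 is a documentation address), and the helper names are invented. By default the commands are only printed, not executed.

```shell
#!/bin/sh
# MASTER is a placeholder for the DNS master's address (assumption).
MASTER=${MASTER:-192.0.2.1}

# With DRY_RUN=1 (the default) each command is printed instead of executed.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical helper: take the IPSEC link down and poke named.
simulate_ipsec_outage() {
    run iptables -I OUTPUT -d "$MASTER" -j DROP   # block the DNS master (assumed rule)
    run setkey -F                                 # flush the IPSEC SAD entries
    run service named reload                      # force a zone transfer attempt
    run rndc status                               # reported to hang at this point
}

# Hypothetical helper: remove the blocking rule again.
restore_link() {
    run iptables -D OUTPUT -d "$MASTER" -j DROP
}
```

Run it with DRY_RUN=0 as root only on a test machine; on an affected system the final rndc status call is expected to hang until the link is restored.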

Comment 3 Vlado Potisk 2008-01-22 14:32:09 UTC
Created attachment 292515
output of gdb

Comment 4 Adam Tkac 2008-01-22 17:27:38 UTC
(In reply to comment #3)
> Created an attachment (id=292515)
> output of gdb

Hm, you have to install the bind-debuginfo package (same version as the bind
packages). Without it the output is barely usable :( ('?' characters etc.).
Would it be possible to attach the traces again with debuginfo installed, please?

Comment 5 Vlado Potisk 2008-01-29 13:54:42 UTC
Created attachment 293281
gdb output with debuginfo

Updated to the bind-9.4.2-3.fc7 version in the meantime.

Comment 6 Adam Tkac 2008-01-29 14:32:27 UTC
Thank you very much. This is exactly what I need.

Comment 7 Vlado Potisk 2008-03-08 22:42:28 UTC
It really looks like a kernel problem. Squid behaves the same way. If a website
reachable only via an IPSEC link that is currently down is requested, the squid
proxy stops responding.

This bug is capable of bringing important services to a halt.

Comment 8 Adam Tkac 2008-03-12 15:54:42 UTC
Interesting. Based on comment #7, reassigning to kernel for inspection. When
sendmsg is called (on a UDP socket) and the IPSEC link is down, the sendmsg call
simply blocks. I'm not sure whether this is expected behavior (I think not).
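
The blocking itself happens inside the kernel, but its symptom can be detected from outside by running a client command under a time limit. A minimal sketch, assuming coreutils timeout is available; the helper name blocks_longer_than is invented for illustration:

```shell
#!/bin/sh
# Return 0 if the given command is still running after LIMIT seconds,
# i.e. it appears to be blocked. timeout(1) exits with status 124 when
# it had to kill the command because the limit expired.
blocks_longer_than() {
    limit=$1
    shift
    status=0
    timeout "$limit" "$@" >/dev/null 2>&1 || status=$?
    [ "$status" -eq 124 ]
}

# Example (assumption: named runs locally and rndc is configured):
#   if blocks_longer_than 5 rndc status; then
#       echo "named did not answer within 5 seconds"
#   fi
```

This only shows that the command did not finish in time; it does not prove that sendmsg is the call that is stuck. For that, the gdb traces are still needed.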

Comment 9 M. LaPlante 2008-04-09 02:28:17 UTC
I'm actually seeing this under Ubuntu (no offense guys) and the 2.6.24.4 kernel,
so it's not likely RH specific.  Nevertheless, a fix would be amazingly helpful. :)

Comment 10 M. LaPlante 2008-04-18 00:24:12 UTC
I've confirmed that this is still an issue in 2.6.25.

Comment 11 M. LaPlante 2008-04-18 02:04:31 UTC
I forwarded this issue to the lkml, and David Miller quickly suggested the
following:

echo "1" >/proc/sys/net/core/xfrm_larval_drop

[http://lkml.org/lkml/2008/4/17/478]

So far it looks good to me.  Can anyone confirm?
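
To check the current setting and keep it across reboots, something like the following may help. It is a sketch, assuming the usual mapping of /proc/sys paths to sysctl keys (net.core.xfrm_larval_drop); the function name larval_drop_status and its optional file argument are made up for illustration.

```shell
#!/bin/sh
# Print "enabled", "disabled", "unknown" or "unavailable" for the
# larval-drop setting. The optional argument only exists so the function
# can be pointed at another file; by default it reads the path used above.
larval_drop_status() {
    file=${1:-/proc/sys/net/core/xfrm_larval_drop}
    if [ ! -r "$file" ]; then
        echo "unavailable"
        return 1
    fi
    case $(cat "$file") in
        1) echo "enabled" ;;
        0) echo "disabled" ;;
        *) echo "unknown" ;;
    esac
}

# To set it for the running kernel (as root), equivalent to the echo above:
#   sysctl -w net.core.xfrm_larval_drop=1
# To keep it across reboots, add this line to /etc/sysctl.conf:
#   net.core.xfrm_larval_drop = 1
```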

Comment 12 Vlado Potisk 2008-04-18 07:46:51 UTC
It is amazingly helpful (-> comment#9). Named does not hang. Thank you. 

I have found a related discussion about whether xfrm_larval_drop = 1 should be
the default setting, but there was no clear conclusion:
http://lkml.org/lkml/2007/12/4/260


Comment 13 Răzvan Sandu 2008-05-04 16:12:25 UTC
Hello,

As of May 04, 2008, this may be present in Fedora 9 (rawhide) too.

Răzvan


Comment 14 Răzvan Sandu 2008-05-04 16:16:21 UTC
If confirmed on F9, please change this bug's "version" field to rawhide, since I
can't do it.

Răzvan
 

Comment 15 Bug Zapper 2008-05-14 15:14:02 UTC
This message is a reminder that Fedora 7 is nearing the end of life. Approximately 30 (thirty) days from now Fedora will stop maintaining and issuing updates for Fedora 7. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '7'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 7's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that we may not be able to fix it before Fedora 7 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora please change the 'version' of this bug. If you are unable to change the version, please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete. If possible, it is recommended that you try the newest available Fedora distribution to see if your bug still exists.

Please read the Release Notes for the newest Fedora distribution to make sure it will meet your needs:
http://docs.fedoraproject.org/release-notes/

The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 16 Vlado Potisk 2008-05-18 19:43:33 UTC
Replying to the Bug Zapper standard message:

The reported problem very probably exists in Fedora 8 and 9, but I (the
reporter) am not able to check and confirm that.

Further, I'm not sure how or where to fix it. Is it a bug, is it some kind of
misconfiguration that comment #11 has already fixed, or is it just a lack of
documentation for xfrm_larval_drop?

For these reasons I would prefer if somebody else having more knowledge and
information decides whether this bug should expire with F7-EOL or not.

Comment 17 Adam Tkac 2008-05-19 12:04:07 UTC
Hm, from lkml it seems that the network maintainer thinks the current state is
correct (http://lkml.org/lkml/2007/12/4/260). But I (and also the upstream
developers) think that if a socket is marked as non-blocking, it should not
block.

In my opinion xfrm_larval_drop should be dropped, because if someone marks a
socket as non-blocking, he has reasons for it. Or it should at least be set to
"1" by default. Other kernels (FreeBSD, I think) don't suffer from this feature.

I can create a patch which sets xfrm_larval_drop to "1" when bind starts, but it
is a nasty hack and only hides the original problem. Could someone from our
kernel guys try to persuade Dave Miller to set xfrm_larval_drop to 1 by default,
please? Thanks

Comment 18 Chuck Ebbert 2008-05-30 04:05:58 UTC
(In reply to comment #17)
> I can create a patch which sets xfrm_larval_drop to "1" when bind starts, but it
> is a nasty hack and only hides the original problem. Could someone from our
> kernel guys try to persuade Dave Miller to set xfrm_larval_drop to 1 by default,
> please? Thanks

I don't think that's going to happen...


Comment 19 Adam Tkac 2008-05-30 11:49:35 UTC
(In reply to comment #18)
> I don't think that's going to happen...

OK, I passed this information to BIND upstream. They should mention it in the
FAQ or somewhere in the documentation.