Bug 155422

Summary: NetworkManager loses DNS after a short while
Product: [Fedora] Fedora
Reporter: Christian Schaller <uraeus>
Component: NetworkManager
Assignee: Christopher Aillon <caillon>
Status: CLOSED RAWHIDE
Severity: medium
Priority: medium
Version: 4
CC: jan.mynarik, jvdias
Target Milestone: ---
Target Release: ---
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2006-02-20 18:52:06 UTC
Bug Depends On:    
Bug Blocks: 136451    
Attachments:
Backtrace of named (flags: none)

Description Christian Schaller 2005-04-20 07:33:59 UTC
After I upgraded to FC4 test2, NetworkManager has not worked well for me. After
a period of about 10-15 minutes DNS stops working. I haven't figured out exactly
what is happening yet, but my theory is that the caching DNS server that
NetworkManager seems to employ crashes or stops, and after that the system
doesn't know what to do anymore.

Are there any log files I can look at, or programs I can check are running, to
find out what happens?

Comment 1 Dan Williams 2005-05-05 19:08:59 UTC
Can you post the output of /var/log/messages right after it stops working?

Can you also try to "killall -HUP named" right after it seems to stop, and see
if that makes it work again?

Comment 2 Dan Williams 2005-05-09 14:05:25 UTC
I think I might have just run into this...  If you know how, can you attach to
the 'named' process using gdb and run a 't a a bt' (thread apply all backtrace)
and post the output in here?

eh...
1) "ps aux | grep named"
2) note the PID of named (should be the second column, right after 'root')
3) "gdb attach <pid from step 2>"
4) "t a a bt"
5) paste the output in here
6) "detach"
7) "quit"
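
The interactive steps above can also be run in one go. This is only a sketch: it assumes pgrep and gdb are installed and that you run it as root, and the function name named_backtrace is made up for illustration.

```shell
#!/bin/sh
# Sketch: capture a backtrace of every thread in a running named.
# Assumes pgrep and gdb are installed; named_backtrace is a made-up name.
named_backtrace() {
    # -o picks the oldest matching process, -x matches the name exactly
    pid=$(pgrep -o -x named) || {
        echo "named is not running" >&2
        return 1
    }
    # -batch runs the -ex command, then exits; gdb detaches on exit
    gdb -p "$pid" -batch -ex "thread apply all bt"
}

# Usage: named_backtrace > /tmp/named-backtrace.txt 2>&1
```

Batch mode avoids leaving named stopped under the debugger if you forget the "detach" step.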

Thanks!
Dan

Comment 3 Jason Vas Dias 2005-05-09 18:15:36 UTC
It could also be the case that named was not given the options
    forwarders { IP; IP; ... };
    forward only;
in the global "options { ... };" section and is timing out trying
to contact the root nameservers (by default, the resolver timeouts are
very long, about 5 minutes).
Are you running behind a firewall?
Please also attach your named.conf configuration file (the file passed
with the '-c' command-line option by NetworkManager).
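
A quick way to see whether a given named.conf sets both options is a plain text search. The sketch below is a rough check, not a parser: it ignores comments and nesting, and the function name check_forwarding is made up for illustration. Pass it whichever file NetworkManager hands to named with '-c'.

```shell
# Sketch: report whether a named.conf sets both forwarding options.
# Plain grep only; it does not understand comments or nested blocks.
check_forwarding() {
    conf=$1
    status=0
    if ! grep -Eq '^[[:space:]]*forwarders[[:space:]]*\{' "$conf"; then
        echo "missing: forwarders { ... };"
        status=1
    fi
    if ! grep -Eq '^[[:space:]]*forward[[:space:]]+only[[:space:]]*;' "$conf"; then
        echo "missing: forward only;"
        status=1
    fi
    return $status
}

# Usage: check_forwarding /path/to/named.conf && echo "forwarding configured"
```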



Comment 4 Christian Schaller 2005-05-10 19:05:02 UTC
Here is the content of my named.conf file. I will attach a bt as soon as my DNS
stops working again. Regarding a firewall: I am not behind one as such, but as
far as I know the combined ADSL/wireless router from 3Com functions as a
firewall to some degree.

// Named configuration, generated by NetworkManager

options {
        directory "/";
        query-source address * port *;
        forward only;
        forwarders {  192.168.2.1; };
        listen-on  { 127.0.0.1; };
        pid-file "/var/named/data/NetworkManager-pid-named";
};

// Disable rndc
controls { };

Comment 5 Dan Williams 2005-05-10 19:13:20 UTC
This happened to me again this morning after leaving the box running all night.
If you kill -9 the named process, NetworkManager will spawn a new named.
However, that new named will be stuck in the same situation and doesn't resolve
at all. Picking a new access point or going to wired seems to kick named enough
that it starts resolving again...

Comment 6 Dan Williams 2005-05-10 19:16:53 UTC
hmm, false alarm on my previous comment.  vpnc had been running more than 8
hours (and therefore past the rekeying threshold), and therefore the VPN was not
technically "running" and vpnc couldn't contact the nameservers listed in the
conf file anyway.  That would be why multiple kicks of named didn't work this
morning for me.

Dan

Comment 7 Christian Schaller 2005-05-10 19:21:33 UTC
I will attach a bt, but I have to point out that the process did not crash, so I
guess it will be of little use. I also noticed while testing after this happened
that I was able to ping a few named addresses, but I am not sure what the
pattern behind those addresses is, except that I didn't access any of them
before DNS 'disappeared'. Still, it seems to be only about 1 in 10 addresses
that I am able to reach.

Comment 8 Christian Schaller 2005-05-10 19:22:25 UTC
Created attachment 114219 [details]
Backtrace of named

Comment 9 Jason Vas Dias 2005-05-12 00:22:43 UTC
The backtrace shows named in its normal state - its main thread is in 
sigsuspend waiting for a SIGHUP (reload) or SIGTERM (shutdown) signal.
All the work is done by the other threads, which seem to be in normal 
states.

Can you, for instance, look up the localhost address, i.e. do:
  # dig -x 127.0.0.1 @127.0.0.1
  1.0.0.127.in-addr.arpa domain name pointer localhost.
(You should have the rfc1912 localhost zones in your named.conf,
 i.e. 'zone "0.0.127.in-addr.arpa" { type master; file "localhost.zone"; };' -
 if not, this could itself be a problem, since named will try to look up
 127.0.0.1 and fail for every query - install the 'caching-nameserver'
 package or remove all your named configuration files and run
 'system-config-bind').

If this does not succeed, and you do have the $ROOTDIR/var/named/localhost.zone
file (where ROOTDIR is defined in /etc/sysconfig/named), then we definitely do
have a named problem. Before reproducing, put named in debug mode - as root:
   # chmod g+w $ROOTDIR/var/named
   # rndc trace 99
named will create a $ROOTDIR/var/named/named.run file. Please compress this
file and append it to this bug / send it in the mail to us.

If the localhost lookup does succeed, then this is not a named problem - it
could be a firewall problem. Do you have a firewall enabled?
When you have reproduced the problem, please gather tcpdump information - as
root, do:
   # tcpdump -nl -i any -vvv -s 2048 port domain >/tmp/tcpdump.log 2>&1 &
then do some queries which fail, and (if possible) some which succeed, e.g.
using the 'dig' command. Then:
   # pkill tcpdump
and attach the /tmp/tcpdump.log file to this bug or send it to us.
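
The decision described above (local lookup fails: look at named; local lookup works: look at the firewall or upstream) could be scripted roughly as follows. This is a sketch that assumes dig is installed; the function name triage_dns and its messages are made up for illustration.

```shell
# Sketch: use a local reverse lookup to separate a broken named from a
# network or firewall problem. Assumes dig; triage_dns is a made-up name.
triage_dns() {
    # +time and +tries keep the check short when named is not answering
    answer=$(dig +short +time=2 +tries=1 -x 127.0.0.1 @127.0.0.1 2>/dev/null)
    case $answer in
        localhost*)
            echo "local lookup works: suspect firewall or upstream, capture tcpdump"
            ;;
        *)
            echo "local lookup failed: suspect named, collect named.run"
            return 1
            ;;
    esac
}

# Usage: run triage_dns right after DNS stops working.
```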




Comment 10 Dan Williams 2005-05-16 20:46:09 UTC
jason: NetworkManager starts named from its main thread, and it runs a total of
1 main thread + 1 thread per device.

Comment 11 Christian Schaller 2005-07-27 10:24:25 UTC
I have been using NetworkManager as my primary connection method for quite a
while with FC4 and I have noticed one thing: this problem only happens at home,
never at work. That makes me believe the problem lies somewhere in the
interaction between named and the DNS system inside my 3Com ADSL router. I am
not sure what my router actually does, but when I get an IP from it my DNS
server is set to the router's address, so it might contain its own caching DNS
server. And as mentioned earlier, a few addresses seem to keep working when this
happens. My current guess is that named and the DNS system inside my router
somehow do something together that one of the two doesn't fully understand, so
the reply I get back is that the DNS name is unknown instead of the router
sending the request further upstream for the real reply.

Comment 12 Jan Mynarik 2005-08-10 22:23:45 UTC
(In reply to comment #11)

This is exactly my experience. I'd like to confirm this behaviour too, with an
SMC 7804WBR-B EU ADSL router. It has its own DNS caching too, just like in
Christian's case.

Tested on Ubuntu Breezy with the NetworkManager CVS version from 2005-07-27
(with named, resolvconf disabled for those curious :-)). I read about this bug
on Christian's blog and just wanted to say it's not Red Hat specific ;-)

Comment 13 Christian Schaller 2005-08-19 15:52:02 UTC
Tested using nslookup at home, as I assumed it might be the same issue as
reported in bug 165588. Unfortunately the behaviour at home is different from
165588, as nslookup wasn't even able to connect to the router's DNS server
directly when this issue occurs.
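
To tell a local named failure apart from a router failure, the same name can be sent to both servers. The sketch below assumes dig is installed and uses 192.168.2.1, the forwarder from the named.conf in comment 4, as the default router address. The function name compare_resolvers is made up for illustration, and it only checks for IPv4 (A) answers.

```shell
# Sketch: ask the local caching named and the router the same question.
# 192.168.2.1 is the forwarder from comment 4; override it as argument 2.
compare_resolvers() {
    name=$1
    router=${2:-192.168.2.1}
    for server in 127.0.0.1 "$router"; do
        # An answer line starting with a digit is taken to be an IPv4 address
        if dig +short +time=2 +tries=1 "$name" A @"$server" 2>/dev/null |
            grep -q '^[0-9]'; then
            echo "$server: ok"
        else
            echo "$server: FAILED"
        fi
    done
}

# Usage: compare_resolvers www.example.com
```

If both servers fail, the router is the likelier culprit; if only 127.0.0.1 fails, the local named is.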

Comment 14 John Hunt 2005-10-07 07:00:11 UTC
I too am suffering from the same problem. However, I was previously using
wpa_supplicant to connect to my wireless network and dhcpcd to do all the dhcp
stuff. Using dhcpcd as a DHCP client seemed to work flawlessly on any wireless
network. Surely there's a way of scripting a workaround to perhaps use that
instead?

Restarting bind seems to fix my problem, so perhaps it differs slightly from the
above?

Comment 15 Dan Williams 2005-10-07 10:26:21 UTC
FYI, at least one case of this problem should be fixed in CVS HEAD since we talk
to named directly now with dbus, and don't spawn our own copy of named.  This
fixes the issue where forwarders don't get cleared when updating DNS
information.  This, however, was most often seen when switching connections
and/or using VPN (ie, situations in which your nameservers would change at least
once).

Comment 16 Christian Schaller 2005-11-09 11:33:26 UTC
Ok, tested with 0.5.1 of NetworkManager and problem persists for me. That said I
guess this bug is a duplicate of my 165588 bug, I will test some more tonight to
try to figure out if these two problems are in fact one and the same issue.

Comment 17 Christian Schaller 2005-11-15 20:04:27 UTC
After some more testing it seems 0.5.1 has solved this issue for me. So I am now
closing this bug report. Thanks for the good work.

Comment 18 Todd Gee 2006-02-20 18:48:51 UTC
bug report still open -- last comment said it would be closed?

Comment 19 Christopher Aillon 2006-02-20 18:52:06 UTC
closing