Bug 146682 - named locks up after several hours
Status: CLOSED CANTFIX
Product: Fedora
Classification: Fedora
Component: bind
Version: 4
Hardware: i386 Linux
Priority: medium  Severity: medium
Assigned To: Adam Tkac
QA Contact: Ben Levenson
Duplicates: 168829
Reported: 2005-01-31 13:25 EST by P Fudd
Modified: 2013-04-30 19:33 EDT (History)
Doc Type: Bug Fix
Last Closed: 2007-09-20 08:14:10 EDT
Description P Fudd 2005-01-31 13:25:26 EST
Description of problem:
I've set up sendmail, mimedefang, spamassassin, clamav and bind. 
Everything works great for a few hours, but then suddenly every single
 incoming email connection is rejected, with messages like:

Jan 31 10:01:41 jtmmx sendmail[11897]: j0VI1fNS011897: rejecting
commands from [193.140.183.189] [193.140.183.189] due to pre-greeting
traffic

If I telnet to port 25 at this time, the connection hangs for 20
seconds, then disconnects.  When things are operating correctly,
FEATURE(`greet_pause', `5000') makes the connection delay for 5000
milliseconds, to detect web-proxy spam; if the client sends anything
during that delay, they are deemed to be spammers.

Also, at this time if I type 'nslookup www.google.com localhost', it
times out; this happens for any domain name.

After much experimentation, I found that typing 'service named
restart' causes the problem to go away, and mail starts working again.
 Restarting the other daemons involved doesn't help.  I'm tempted to
create a cron job that restarts named every 15 minutes just to deal
with this.
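
A blind restart every 15 minutes would also interrupt a healthy named. A minimal sketch of a gentler workaround, assuming dig is installed: probe the local server and restart only when the probe gets no reply. The probe name, the 5-second limit, and the function names are illustrative assumptions, not anything shipped with bind.

```shell
#!/bin/sh
# Hypothetical watchdog: restart named only when a test query gets no reply.
# PROBE_NAME and the 5-second limit are illustrative assumptions.
PROBE_NAME="${PROBE_NAME:-www.google.com}"

# Returns 0 if the local server replies within the time limit.
# dig exits non-zero when it receives no reply at all.
named_responds() {
    dig +time=5 +tries=1 @127.0.0.1 "$PROBE_NAME" >/dev/null 2>&1
}

# Runs the check command in $1; if it fails, runs the restart command in $2.
check_and_restart() {
    if $1; then
        echo "named ok"
    else
        echo "named not answering, restarting"
        $2
    fi
}

# To use from cron, end the script with:
#   check_and_restart named_responds "service named restart"
# and schedule it, for example every 5 minutes (path is hypothetical):
#   */5 * * * * /usr/local/sbin/named-watchdog
```

This only hides the symptom; the debugging output is still needed to find the cause.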

In addition, I've found that I can't do 'strace -fp `pidof named`'. 
Strace just sits there, whether named is working or locked up at the
time. 

Version-Release number of selected component (if applicable):
bind-9.3.0-2
bind-9.2.4-2
sendmail-8.13.3-1
sendmail-8.13.1-2
mimedefang-2.49-2t
clamav-0.80-4
spamassassin-3.0.1-0.FC3

How reproducible:
Always (after 5 hours or so).


Steps to Reproduce:
1. Transfer email normally in a spam-filled world
2. Watch logs
3. Restart named

  
Actual results:
All email rejected, named locked up.

Expected results:
Spam email rejected, named not locked up.

Additional info:
This server is a brand new 3Ghz Pentium 4, with a fresh install of fc3
'everything', 2 gigs of ram, 15 gigs of hard drive, load average less
than .1 all the time, that has been dedicated to running just email.
Comment 1 P Fudd 2005-01-31 13:32:08 EST
Note: I haven't made any changes to the configuration for named; it is
just the way it was when I unpackaged it.
ftp://mirror.hiwaay.net/redhat/fedora/linux/core/development/i386/SRPMS/bind-9.3.0-2.src.rpm

I created a new rpm for sendmail by taking the old srpm, putting in
the latest source, and tweaking the spec file version numbers.  Didn't
matter, the new version and the old version still reject email.
Comment 2 Jason Vas Dias 2005-01-31 14:06:51 EST
These comments were made before your last comment - they still apply:
---
What version of bind are you running ? ( run named-checkconf -v ).
I see you have bind-9.3.0-2 installed - is this the binary version
from FC4 ? Or did you build it from the source rpm ?
You should not expect a binary RPM from FC4 to work correctly on an
FC3 system. I've re-built bind-9.3.0-2 for FC3 and you can download
the rpms from:
  http://people.redhat.com/~jvdias/BIND/9.3.0/FC3 
Please do:
  # rpm -e --nodeps bind-9.2.4-2 bind-9.3.0-2
  # rpm -ivh bind-*-9.3.0-2.i386.rpm

It would be useful in debugging this problem to see your named 
config files - please append the 
     $ROOTDIR/etc/named.conf $ROOTDIR/var/named/*
files to this bug or send them to jvdias@redhat.com .

Since the problem is reproducible so quickly, please turn on 
debugging for the named process:

  # chown named:named $ROOTDIR/var/named
  # rndc trace 99
  
And when you have reproduced the problem, please gzip the file:
  $ROOTDIR/var/named/named.run
and append named.run.gz to this bug or send it to me. 

Please show the latest log messages from named when this problem
was reproduced by appending the output of this command to this bug
or sending it to me:
  # tail -n +"$(grep -n 'named startup succeeded' /var/log/messages | tail -n 1 | sed 's/:.*$//')" /var/log/messages | grep named

A tcpdump gathered from when the problem was reproduced would also be
useful:
  # tcpdump -nl -vvv -s 2048 port 53 2>&1 | tee /tmp/tcpdump.log
---

Seeing your last comment, it appears that you do not have a valid
named configuration.  The bind package itself only installs the
bare minimum configuration necessary to allow named to run; it is
not even configured as a caching nameserver. To get a caching
nameserver, you need to install the caching-nameserver package:

ftp://download.fedora.redhat.com/pub/fedora/linux/core/3/i386/os/Fedora/RPMS/caching-nameserver-7.3.3-noarch.rpm

If the problem still occurs with this package installed, please 
follow the above steps and append the requested information to this
bug - thanks.


Comment 3 Jason Vas Dias 2005-06-01 19:39:12 EDT
Trying to clear out old bugs here. 
As there was no response to the previous comment, I'm assuming the
comment helped resolve the problem - if not, please let me know.
There are now bind-9.3.1 RPMs for FC-3 at:
 http://people.redhat.com/~jvdias/bind/FC3
 
Comment 4 Ian Donaldson 2005-10-09 07:21:32 EDT
I've seen the strace and hang problem on newly installed fc4 today (with all yum
updates; bind-9.3.1-10_FC4; kernel 2.6.13-1.1526_FC4smp on 3.06GHz P4 with HT
enabled).

strace of named returns no output; named isn't spinning according to top but
refuses to return results; netstat shows it bound to the correct ports. 
Nothing useful in the syslogs.  tcpdump interestingly shows it making requests
to outsiders corresponding to the requests made of it and receiving responses
but not returning the responses to the requestor.  Firewalling (iptables) not
showing any hits.

Restarting named cleared it up though... something pretty flakey here.
Comment 5 Jason Vas Dias 2005-10-09 15:08:29 EDT
In response to Comment #4:
This is very strange. I too am running latest FC-4 - bind-9.3.1-10_FC4 and
kernel 2.6.13-1.1526_FC4 , and cannot reproduce this problem. 
You say you've enabled firewalling - did you unblock UDP and TCP for 
port domain (53) ? If so, are you using the named.conf statement
   'options { ...query-source port 53;... } '
to stop named from using an arbitrary port as its query source port ?
If not, this could be the problem. You should also ensure that there are
no firewall rules for the localhost loopback (lo) device.

What tcpdump command did you use ? Please could you append the output of this
tcpdump command from when the problem occurs:
  # tcpdump -nl -vvv -i any -s 4096 port domain 2>&1 | tee /tmp/tcpdump.log
Please reproduce the problem and append the /tmp/tcpdump.log to this bug report.

It would also be most helpful to enable named debugging:
  # . /etc/sysconfig/named
  # chown named:named ${ROOTDIR}/var/named
  # rndc trace 99
Then reproduce the problem and append the ${ROOTDIR}/var/named/named.run
file to this bug report.

What ethernet hardware are you using ? 

What is the chkconfig startup order for named -
  # chkconfig --list named
For the first runlevel $L for which named is running, please do
  # echo /etc/rc.d/rc${L}.d/S*
and append the output to this bug report.
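
The runlevel lookup above can be scripted. A small sketch, assuming the usual `chkconfig --list` output format of one `N:on` or `N:off` field per runlevel; the function name is made up for illustration.

```shell
# Print the first runlevel marked "on" in a `chkconfig --list <service>` line
# read from standard input. Prints nothing if the service is off everywhere.
first_on_runlevel() {
    awk '{
        for (i = 2; i <= NF; i++)
            if ($i ~ /^[0-9]:on$/) { sub(/:on$/, "", $i); print $i; exit }
    }'
}

# Usage on the affected host:
#   L=$(chkconfig --list named | first_on_runlevel)
#   echo /etc/rc.d/rc${L}.d/S*
```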

It would also be most helpful to see your named.conf and zone configuration
files - please tar them up 
   # tar -cpf /tmp/named.tar /etc/{named,rndc}* /var/named
and append the /tmp/named.tar file to this bug report or send it to me:
jvdias@redhat.com .
Comment 6 Ian Donaldson 2005-10-10 00:08:46 EDT
In response to your posting, here is some info...
(admittedly not all of what you asked for but a start at least; lmk if
you want more detail)

Server was running RH9 for the last couple of years until we upgraded
to FC4 yesterday.  Firewalling (iptables) hasn't changed during that upgrade.

Running a chrooted named setup (see dir list below; only the domain name
has been changed to MYDOMAIN for this posting; the chroot layout differs
slightly from the rpm-supplied one because it was originally set up under RH9)

As mentioned, this was a P4 with HT enabled, and I'm running the smp kernel,
which could be a factor (seen strange hangs/slowness with smp kernels before).

    eth0: internet side (broadband ADSL via Cisco)
    eth1: unused
    eth2: LAN side (net 10.0.0.3/24)

The tcpdump used was:

        tcpdump -s1500 -vn -ieth0 port 53

Firewalling lets out all ICMP/TCP/UDP on internet side with stateful return...
(relevant excerpt below)

iptables -A INPUT   -i eth0 -p tcp  -m state --state ESTABLISHED,RELATED  -j ACCEPT
iptables -A INPUT   -i eth0 -p udp  -m state --state ESTABLISHED,RELATED  -j ACCEPT
iptables -A INPUT   -i eth0 -p icmp -m state --state ESTABLISHED,RELATED  -j ACCEPT
iptables -A OUTPUT  -o eth0 -p tcp  -m state --state NEW,ESTABLISHED      -j ACCEPT
iptables -A OUTPUT  -o eth0 -p udp  -m state --state NEW,ESTABLISHED      -j ACCEPT
iptables -A OUTPUT  -o eth0 -p icmp -m state --state NEW,ESTABLISHED      -j ACCEPT

and there are no rules for any of udp/tcp/53/5353 allow or block specifically,
and no iptables rules for lo0.

As mentioned, strace returns no output for named ('strace -p NAMED-PID')
even when it's running correctly.

I've not included a tcpdump output; it shows normal DNS lookup activity on 
the internet side.  (both requests and corresponding responses coming in; 
although the request rate is much lower than when named is working properly)

One thing I noticed was that when named was misbehaving, dig would return
instant results for locally mastered domains and cached externally resolved 
results but not return newly resolved external ones.  (dig would just time out)

ie:     dig HOST.MYDOMAIN @127.1
        dig HOST.MYDOMAIN @10.0.0.3 
etc
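
The three cases described above (locally mastered, cached, newly resolved) can be probed separately the next time the hang occurs. This is only a sketch: the server address, the probe names, and the 3-second limit are placeholders, and `probe` and `report` are hypothetical helper names.

```shell
# Sketch: probe each case separately to see which path is stuck.
SERVER="${SERVER:-127.0.0.1}"
DIG="${DIG:-dig}"

# Prints "ok" if the query gets any reply within 3 seconds, else "TIMEOUT".
probe() {
    if $DIG +time=3 +tries=1 "@$SERVER" "$1" >/dev/null 2>&1; then
        echo "ok"
    else
        echo "TIMEOUT"
    fi
}

# $1 = name in a locally mastered zone, $2 = name expected to be cached,
# $3 = external domain under which a never-seen label is queried.
report() {
    echo "authoritative ($1): $(probe "$1")"
    echo "likely cached ($2): $(probe "$2")"
    # A unique label forces a fresh recursive lookup; the answer will be
    # NXDOMAIN, which still counts as a reply.
    fresh="probe-$$-$(date +%s).$3"
    echo "uncached ($fresh): $(probe "$fresh")"
}

# Usage: report HOST.MYDOMAIN.com.au www.google.com example.com
```

If only the last line reports TIMEOUT, the recursion path is what is stuck, which matches the symptom described here.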


Ian D
--

# grep -v '^#' /etc/sysconfig/named
ROOTDIR="/var/chroot/bind"

# ls -l /etc/named.conf
lrwxrwxrwx  1 root root 31 Oct  9 14:09 /etc/named.conf -> /var/chroot/bind/etc/named.conf


# find /var/chroot/bind -ls 2>/dev/null | egrep -v '/var/chroot/bind/proc/|/RCS/'
 96266    4 drwxr-xr-x   6 root     root         4096 Oct  9 15:03 /var/chroot/bind
 96267    4 drwxr-xr-x   3 root     root         4096 Oct  9 17:27 /var/chroot/bind/etc
 96269    4 drwxr-xr-x   7 root     root         4096 Oct 10 11:21 /var/chroot/bind/etc/namedb
 96258    4 -r--r--r--   1 root     root         1430 Oct 10 11:21 /var/chroot/bind/etc/namedb/named.conf
 96271    4 -rw-r--r--   1 root     wheel        2769 Nov  6  1999 /var/chroot/bind/etc/namedb/named.root
 96272    4 drwxr-xr-x   2 named    wheel        4096 Sep 17  2003 /var/chroot/bind/etc/namedb/rev.bak
 96283    4 drwxr-xr-x   2 named    wheel        4096 Sep 17  2003 /var/chroot/bind/etc/namedb/zone.bak
 96292    4 -rw-r--r--   1 root     wheel         781 Nov  6  1999 /var/chroot/bind/etc/namedb/make-localhost
 96293    4 -rw-r--r--   1 root     wheel         416 Nov  6  1999 /var/chroot/bind/etc/namedb/PROTO.localhost.rev
 96294    4 drwxr-xr-x   3 root     wheel        4096 Sep 20 13:01 /var/chroot/bind/etc/namedb/zone
 96306    4 -r--r--r--   1 root     root          253 Aug 18  2003 /var/chroot/bind/etc/namedb/zone/localhost.zone
 97338    4 -r--r--r--   1 root     root         2676 Sep 19 18:08 /var/chroot/bind/etc/namedb/zone/MYDOMAIN.com.au
 96301    4 drwxr-xr-x   3 root     wheel        4096 Sep 19 18:10 /var/chroot/bind/etc/namedb/rev
 96275    4 -r--r--r--   1 root     root          332 Sep 17  2003 /var/chroot/bind/etc/namedb/rev/127.0.0
 97357    4 -r--r--r--   1 root     root         2911 Sep 19 18:10 /var/chroot/bind/etc/namedb/rev/10.0.0
 96307    0 lrwxrwxrwx   1 root     root           17 Aug 18  2003 /var/chroot/bind/etc/named.conf -> namedb/named.conf
 96308    4 -rw-r--r--   1 root     root          785 Jun 25  2003 /var/chroot/bind/etc/localtime
 96309    4 -rw-r--r--   1 root     root          119 Sep  5  2003 /var/chroot/bind/etc/resolv.conf
 96310    4 -rw-r-----   1 root     named         132 Mar 13  2003 /var/chroot/bind/etc/rndc.key
 97354    8 -rw-r--r--   1 root     root         1323 Aug 26  2004 /var/chroot/bind/etc/named.conf.rpmnew
 96311    4 drwxr-xr-x   3 root     root         4096 Mar 16  2003 /var/chroot/bind/var
 96312    4 drwxr-xr-x   3 root     root         4096 Mar 16  2003 /var/chroot/bind/var/run
 96313    4 drwxr-xr-x   2 named    root         4096 Oct 10 11:21 /var/chroot/bind/var/run/named
 96261    4 -rw-r--r--   1 named    named           6 Oct 10 11:21 /var/chroot/bind/var/run/named/named.pid
 96315    4 drwxr-xr-x   2 root     root         4096 Mar 16  2003 /var/chroot/bind/dev
 96316    0 crw-rw-rw-   1 root     root              Jan 30  2003 /var/chroot/bind/dev/null
 96317    0 crw-rw-rw-   1 root     root              Jan 30  2003 /var/chroot/bind/dev/zero
 96318    0 crw-r--r--   1 root     root              Jan 30  2003 /var/chroot/bind/dev/random
     1    0 dr-xr-xr-x 185 root     root            0 Oct  9 19:57 /var/chroot/bind/proc


named.conf

-----
// $Id: named.conf,v 1.2 2005/10/10 01:21:43 blah Exp $
// $Source: /var/chroot/bind/etc/namedb/RCS/named.conf,v $
//
// Refer to the named(8) man page for details.  If you are ever going
// to setup a primary server, make sure you've understood the hairy
// details of how DNS is working.  Even with simple mistakes, you can
// break connectivity for affected parties, or cause huge amount of
// useless Internet traffic.

options {
        directory "/etc/namedb";

        dump-file "/var/run/named/named_dump.db";

        query-source address * port 5353;

// do direct resolution to the internet; don't need to depend 
// on Telstra for this
//      forwarders {
//              // telstra ADSL service name resolvers
//              139.130.4.4;
//              203.50.2.71;
//      };
//      forward only;

        // don't let outsiders use us as a DNS cache for stuff that
        // isn't ours, but do allow insiders to do recursion
        allow-recursion {
                10.0.0.0/24;
                127.0.0.1/32;
        };
};
controls {
        inet 127.0.0.1 allow { localhost; } keys { rndckey; };
};
include "/etc/rndc.key";


zone "." {
        type hint;
        file "named.root";
};

zone "localhost" IN {
        type master;
        file "zone/localhost.zone";
        allow-update { none; };
};

zone "0.0.127.IN-ADDR.ARPA" {
        type master;
        file "rev/127.0.0";
        allow-update { none; };
};

zone "MYDOMAIN.com.au" {
        type master;
        file "zone/MYDOMAIN.com.au";
        allow-update { none; };
};

zone "0.0.10.IN-ADDR.ARPA" {
        type master;
        file "rev/10.0.0";
        allow-update { none; };
};
-----


# cat /proc/cpuinfo 
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 2
model name      : Intel(R) Pentium(R) 4 CPU 3.06GHz
stepping        : 7
cpu MHz         : 3081.818
cache size      : 512 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 1
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid xtpr
bogomips        : 6169.40

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 15
model           : 2
model name      : Intel(R) Pentium(R) 4 CPU 3.06GHz
stepping        : 7
cpu MHz         : 3081.818
cache size      : 512 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 1
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid xtpr
bogomips        : 6163.48


Related startup stuff about the ethernet:

Oct  9 19:58:19 magic kernel: e100: eth0: e100_probe: addr 0xe3061000, irq 185, MAC addr 00:20:ED:85:41:1C
Oct  9 19:58:19 magic kernel: ne2k-pci.c:v1.03 9/22/2003 D. Becker/P. Gortmaker
Oct  9 19:58:19 magic kernel:   http://www.scyld.com/network/ne2k-pci.html
Oct  9 19:58:19 magic kernel: ACPI: PCI Interrupt 0000:02:01.0[A] -> GSI 21 (level, low) -> IRQ 193
Oct  9 19:58:19 magic kernel: eth1: RealTek RTL-8029 found at 0xc000, IRQ 193, 00:40:05:5F:92:20.
Oct  9 19:58:20 magic kernel: Intel(R) PRO/1000 Network Driver - version 6.0.60-k2-NAPI
Oct  9 19:58:20 magic kernel: Copyright (c) 1999-2005 Intel Corporation.
Oct  9 19:58:20 magic kernel: ACPI: PCI Interrupt 0000:02:02.0[A] -> GSI 22 (level, low) -> IRQ 201
Oct  9 19:58:20 magic kernel: e1000: eth2: e1000_probe: Intel(R) PRO/1000 Network Connection
Oct  9 19:58:20 magic kernel: ACPI: PCI Interrupt 0000:00:1f.5[B] -> GSI 17 (level, low) -> IRQ 209
Oct  9 19:58:20 magic kernel: PCI: Setting latency timer of device 0000:00:1f.5 to 64


# cat /etc/modprobe.conf
# Note: for use under 2.4, changes must also be made to modules.conf!
alias eth0 e100
#alias scsi_hostadapter dpt_i2o
alias scsi_hostadapter i2o_block
alias usb-controller ehci-hcd
alias snd-card-0 snd-intel8x0
alias ieee1394-controller ohci1394
alias eth1 ne2k-pci
alias eth2 e1000
install sound-slot-0 /sbin/modprobe --first-time --ignore-install sound-slot-0 && { /bin/aumix-minimal -f /etc/.aumixrc -L >/dev/null 2>&1 || :; }
remove sound-slot-0 { /bin/aumix-minimal -f /etc/.aumixrc -S >/dev/null 2>&1 || :; } ; /sbin/modprobe -r --first-time --ignore-remove sound-slot-0
options snd-card-0 index=0
options i810_audio index=0
remove i810_audio { /usr/sbin/alsactl store 0 >/dev/null 2>&1 || : ; }; /sbin/modprobe -r --ignore-remove i810_audio


runlevel 5 is the default

# chkconfig --list named
named           0:off   1:off   2:off   3:on    4:off   5:on    6:off

# ls /etc/rc.d/rc5.d/S*
/etc/rc.d/rc5.d/S01sysstat
/etc/rc.d/rc5.d/S04readahead_early
/etc/rc.d/rc5.d/S05kudzu
/etc/rc.d/rc5.d/S06cpuspeed
/etc/rc.d/rc5.d/S08ip6tables
/etc/rc.d/rc5.d/S08ipchains
/etc/rc.d/rc5.d/S09isdn
/etc/rc.d/rc5.d/S09pcmcia
/etc/rc.d/rc5.d/S10network
/etc/rc.d/rc5.d/S10psacct
/etc/rc.d/rc5.d/S11iptables
/etc/rc.d/rc5.d/S11named
/etc/rc.d/rc5.d/S11ntpd
/etc/rc.d/rc5.d/S12syslog
/etc/rc.d/rc5.d/S13irqbalance
/etc/rc.d/rc5.d/S13portmap
/etc/rc.d/rc5.d/S14nfslock
/etc/rc.d/rc5.d/S18rpcidmapd
/etc/rc.d/rc5.d/S19rpcgssd
/etc/rc.d/rc5.d/S25bluetooth
/etc/rc.d/rc5.d/S25netfs
/etc/rc.d/rc5.d/S26apmd
/etc/rc.d/rc5.d/S26lm_sensors
/etc/rc.d/rc5.d/S28autofs
/etc/rc.d/rc5.d/S33nifd
/etc/rc.d/rc5.d/S34mDNSResponder
/etc/rc.d/rc5.d/S40smartd
/etc/rc.d/rc5.d/S55sshd
/etc/rc.d/rc5.d/S55sshd_rsa_only
/etc/rc.d/rc5.d/S56xinetd
/etc/rc.d/rc5.d/S59hpoj
/etc/rc.d/rc5.d/S65dhcpd
/etc/rc.d/rc5.d/S80httpd
/etc/rc.d/rc5.d/S80postfix
/etc/rc.d/rc5.d/S80spamassassin
/etc/rc.d/rc5.d/S83smb
/etc/rc.d/rc5.d/S85gpm
/etc/rc.d/rc5.d/S87iiim
/etc/rc.d/rc5.d/S90crond
/etc/rc.d/rc5.d/S90cups
/etc/rc.d/rc5.d/S90saslauthd
/etc/rc.d/rc5.d/S90squid
/etc/rc.d/rc5.d/S90xfs
/etc/rc.d/rc5.d/S95anacron
/etc/rc.d/rc5.d/S95atd
/etc/rc.d/rc5.d/S96readahead
/etc/rc.d/rc5.d/S97messagebus
/etc/rc.d/rc5.d/S97rhnsd
/etc/rc.d/rc5.d/S98cups-config-daemon
/etc/rc.d/rc5.d/S98haldaemon
/etc/rc.d/rc5.d/S99cyrus
/etc/rc.d/rc5.d/S99local
/etc/rc.d/rc5.d/S99mdmonitor
/etc/rc.d/rc5.d/S99stunnel

$ rpm -q bind
bind-9.3.1-10_FC4

# rpm -q --verify bind
..?.....  c /etc/rndc.conf
S.?.....  c /etc/rndc.key
SM5....T  c /etc/sysconfig/named
missing     /var/named/data
missing     /var/named/slaves
Comment 7 Jason Vas Dias 2005-10-10 12:55:14 EDT
Thanks for the data you sent - several issues are suggested by it:

1. Were the firewall rules copied from your RHL-9 installation ? 
   If so, they may not be appropriate for the FC-4 firewall.
   Does the problem occur if iptables is disabled ? 

   Try 'chkconfig --del iptables' and rebooting. Does the problem still occur?
    
   I'm no firewalling expert, but it seems to me, reviewing the iptables 
   man-page, that they could explain why no responses get through to queriers:
   From the iptables man-page documentation on '--state':
   "ESTABLISHED meaning that the packet is associated with a connection
                which  has seen packets in both directions ...
    NEW meaning that the packet has started a new connection ... 
    RELATED meaning that the packet is starting a new connection,
              but is associated with an existing connection ...
   "
You have the rules:
iptables -A INPUT   -i eth0 -p tcp  -m state --state ESTABLISHED,RELATED  -j ACCEPT
iptables -A INPUT   -i eth0 -p udp  -m state --state ESTABLISHED,RELATED  -j ACCEPT

Not "NEW" ? If you only accept incoming packets associated with existing
connections, how is the nameserver to accept new clients?

Please try disabling iptables and see if the problem occurs - if not, it
would appear to be a firewall problem.

2. The tcpdump output - 

   tcpdump -s1500 -vn -ieth0 port 53

If you are making requests from the localhost, this will not show them.
Nor will it show requests the server makes on port 5353 .

Also named will by default use packets of up to 4096 bytes unless the 
edns-udp-size option is supplied, regardless of the MTU - it relies
on IP fragmentation - I've heard that this can be a problem with some
firewalls / modems - does setting 'options { ... edns-udp-size 1500; ...}'
in named.conf make any difference ?

Please try using this tcpdump command:
  # tcpdump -vvv -i any -nl -s 4096 port domain or port 5353

The output of this command when the problem occurs would be most useful.


3. You said:

"One thing I noticed was that when named was misbehaving, dig would return
instant results for locally mastered domains and cached externally resolved 
results but not return newly resolved external ones.  (dig would just time out)
"

It sounds like named is unable to contact the root nameservers in this case.

Are you sure that your ADSL service provider does not block access to the root
nameservers ? Many ISPs do. Does the problem persist when you uncomment the
forwarding :

// do direct resolution to the internet; don't need to depend 
// on Telstra for this
//      forwarders {
//              // telstra ADSL service name resolvers
//              139.130.4.4;
//              203.50.2.71;
//      };
//      forward only;

Use 'forward first' so that you can continue to serve your authoritative zones.


4. Do you have SELinux in Enforcing mode ? If so, it could be seriously 
   confused by your use of the non-standard /etc/namedb "Directory" option -
   it will assign correct security contexts only to files under 
   ${ROOTDIR}/var/named, where ${ROOTDIR} is set in /etc/sysconfig/named .
   If so, does the problem occur with SELinux in Permissive mode ? 
   # setenforce 0
   or boot with the 'enforcing=0' boot argument. 

5. Have you updated your root cache file recently ? If you are using an 
   installation from RHL-9, it could be out of date, resulting in problems
   with querying the root nameservers .
   Try updating this file with either of these commands:
   # dig . ns @198.41.0.4 > /var/chroot/bind/etc/namedb/named.root
   # curl ftp://ftp.rs.internic.net/domain/named.root \
     > /var/chroot/bind/etc/namedb/named.root
    
6. If none of the above resolves the problem, please enable named debugging:
   Put 'OPTIONS=-d99' in /etc/sysconfig/named, and ensure ownership of 
   /var/chroot/bind/etc/namedb is named:named. Then reproduce the problem, 
   and please gzip the /var/chroot/bind/etc/namedb/named.run file and append
   it to this bug report or send it to me - it will tell us exactly why named
   is not responding.
Comment 8 Ian Donaldson 2005-10-10 21:36:40 EDT
In answer to the above
1. not prepared to disable iptables to test this, as it's an
   internet-facing production host.

   Shouldn't be a factor anyway as I've mentioned that when named is having
   troubles, *just* restarting named makes it better.
 
   As to the firewalling, it's correct for the internet side.  The named
   is not a delegated name server; it's just a server for internal
   domains and acting as a cache/proxy for internet DNS.  It only gets
   requests from eth2 (internal LAN) and lo0 (current machine)

   Thus the only packets allowed in on eth0 (internet side) are those
   related to requests made out that side (statefully)

2.  tcpdump command is correct for sniffing DNS traffic on eth0.  
    'port 53' means either SRC or DST port, and TCP or UDP (check the man page)
    The use of  -s1500 is also fine as eth0 is 100M and 1500 is the 
    maximum packet size of 100M ethernet.

    Haven't played with edns-udp-size.  Shouldn't be required; the requests
    made out to the internet will all be pretty small; the responses
    returned by other sites may be larger, but the cisco router
    and ISP will fragment the response packets if required (unless path
    mtu discovery is on, in which case the sender should retransmit it smaller)

3.  the root servers are contactable; the ADSL provider doesn't block access.
    Indeed if it did, restarting named to fix it wouldn't have worked.
    
    (the ADSL provider has an awful DNS cache -- often slow so this is
    why we don't use it)

4.  My selinux is not in enforcing mode.

5.  root cache is up to date.

6.  will get a trace next time the problem occurs (hasn't occurred in last 24 
    hours, but did occur twice since the system boot the day prior; first
    time right after boot; 2nd time about 15 hours later)
Comment 9 Jason Vas Dias 2005-10-11 12:51:37 EDT
RE: > tcpdump command is correct for sniffing DNS traffic on eth0.  
    > 'port 53' means either SRC or DST port, and TCP or UDP (check the man page)
    > The use of  -s1500 is also fine as eth0 is 100M and 1500 is the 
    > maximum packet size of 100M ethernet.

The tcpdump command you use: 'tcpdump -s1500 -vn -ieth0 port 53' - will only
show information for packets to/from port 53 on eth0. There are several problems
with this:
  1. You use 'query-source port 5353;' in your named.conf, so packets to/from
     this port will not be shown.
  2. Responses sent to interfaces eth1, eth2, lo0 will not be shown.
  3. Packets may by default be up to 4096 bytes unless the 'edns-udp-size' 
     option is given in named.conf, so tcpdump output for some packets will
     be truncated. 
     Named uses up to 4096 byte packets by default, regardless of the media
     MTU - IP fragmentation will be used for packets of length greater than
     the MTU. 
     Some cable modems / firewalls / routers are known to have
     problems with the IP fragmentation packets that result when the MTU is
     less than the edns-udp-size (default: 4096). Try putting 
     'edns-udp-size 1500;' in named.conf 'options{...}' OR use '-s 4096' in
     the tcpdump command.
  4. The '-v' option does not give the maximum amount of information. 

So when the problem occurs, please try the following:

1. Get tcpdump output:
   # tcpdump -nl -vvv -i any -s 4096 port 53 or port 5353 2>&1 |\
     tee /tmp/tcpdump.log &

2. Obtain a dump of the DNS cache:
   # rndc dumpdb

3. Turn on named debugging:
   # chown named:named /var/chroot/bind/etc/namedb
   # rndc trace 99

And then reproduce the problem with several queries:
   # dig . ns
   # host www.google.com
   # host www.redhat.com

Then please tar and compress these files:
   # tar -cpf - /tmp/tcpdump.log /var/chroot/bind/etc/namedb/named.run   \
                                 /var/chroot/bind/var/run/named_dump.db |\
     gzip > /tmp/named_debug.tar.gz
and append this file to this bug report or send it to me: jvdias@redhat.com -
as this problem cannot be reproduced here, this is the only way I can assist
in resolving it - thank you.
Comment 10 Jason Vas Dias 2005-10-11 18:14:24 EDT
*** Bug 168829 has been marked as a duplicate of this bug. ***
Comment 11 Ian Donaldson 2005-10-11 20:30:14 EDT
This is getting a bit off topic but regarding tcpdump usage...

1.  you are partly correct ... packets to/from 5353 won't be shown
    by my 'tcpdump port 53' command... unless they are also to/from port 53.

    However all DNS requests  from named use from-port 5353 and to-port 53; and 
    responses to named use from-port 53 and to-port 5353  so my use of
    just 'port 53' in tcpdump *will* pick them all up.    True if
    named has gone nuts and is sending to some port other than 53 then 
    my tcpdump won't see that. 

2.  true about eth2, lo0... didn't realize the '-i any' option existed
    (I'm a 15 year tcpdump user... those options are only relatively recent)
    but for my purposes I was only interested in traffic on eth0 (internet
    side).  However will try -i any

3.  The 4096 byte thing.  named may very well generate UDP messages
    of that size.  They will be fragmented to the MTU of the sending interface
    and thus 'tcpdump -s1500' will pick them up in their *entirety* on 
    a 10M or 100M interface (and usually configured 1000M interfaces too; 
    although 1000M can do larger; 16k I think).

    tcpdump doesn't  do fragmentation reassembly;
    the argument to -s is the size of the *physical* packet to be captured,
    not the size of the reassembled message.
    Recent (last few years) tcpdump's seem to accept '-s0' also to accept
    packets of any size too; although I'm still using some
    tcpdump's around here that don't have that ability.
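
The fragmentation arithmetic in point 3 can be checked with a few lines of shell. This is a sketch under stated assumptions: IPv4 with a 20-byte header (no options), an 8-byte UDP header, and a 14-byte Ethernet header without VLAN tag; the function names are made up for illustration.

```shell
# Number of IPv4 fragments for a UDP payload of $1 bytes on a link with MTU $2.
# Assumes a 20-byte IPv4 header (no options) and an 8-byte UDP header.
frag_count() {
    total=$(( $1 + 8 ))                  # UDP header plus payload, carried by IP
    per_frag=$(( ($2 - 20) / 8 * 8 ))    # fragment data must be a multiple of 8
    echo $(( (total + per_frag - 1) / per_frag ))
}

# Largest Ethernet frame (without FCS) that a capture sees for a given MTU.
max_frame() {
    echo $(( $1 + 14 ))                  # 14-byte Ethernet header
}

frag_count 4096 1500   # 3
max_frame 1500         # 1514
```

So a 4096-byte response arrives as three physical packets on a 1500-byte MTU link. One detail worth checking: the snaplen counts from the start of the link-layer frame, so on Ethernet a full-size frame is 1514 bytes and '-s1500' would cut its last 14 bytes; '-s 1514' or '-s0' (where supported) avoids that.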

As for the traces requested; will do that next time the problem occurs.
Comment 12 Penelope Fudd 2005-10-11 23:56:36 EDT
May I recommend setting up a tcpdump to record DNS traffic constantly, rotating
whenever the log gets too large.  That way, the traffic that leads to the
lock-up might be found.  You can set a cron job to delete anything older than x
days (or hours) to prevent a full drive.

# tcpdump -i any -s 0 -C 1 -w dnslog.pcap. port 53 or port 5353

This will
  listen on any interface,
  read the whole packet,
  rotate the log file every 1 million bytes,
  write to files called dnslog.pcap.###,
  and capture TCP or UDP traffic on port 53 or port 5353.

These features are present in fc2, and should still be there in fc4.
Comment 13 Penelope Fudd 2005-10-11 23:59:32 EDT
find /path/to/logdir -name 'dnslog.pcap.*' -mtime +3 -exec rm -f {} \;

This will
  find anything in /path/to/logdir
  with a name like 'dnslog.pcap.*'
  that's older than 3 days
  and run 'rm -f #######' on it.
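
The capture and the cleanup can be wrapped in one small script. A rough sketch only: the directory handling and the function names are assumptions, and the tcpdump options are the ones from the capture command in Comment 12.

```shell
# Start the rotating DNS capture in directory $1 (needs root; not called here).
start_capture() {
    cd "$1" && tcpdump -i any -s 0 -C 1 -w dnslog.pcap. port 53 or port 5353
}

# Delete rotated capture files in directory $1 that are older than $2 days.
prune_captures() {
    find "$1" -name 'dnslog.pcap.*' -type f -mtime +"$2" -exec rm -f {} \;
}

# Example: keep three days of captures in a hypothetical log directory,
# calling this from an hourly cron job:
#   prune_captures /var/log/dnscap 3
```

Adding -type f keeps the cleanup from touching a directory that happens to match the name pattern.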
Comment 14 Christian Iseli 2007-01-22 06:46:42 EST
This report targets the FC3 or FC4 products, which have now been EOL'd.

Could you please check that it still applies to a current Fedora release, and
either update the target product or close it ?

Thanks.
Comment 15 Adam Tkac 2007-09-20 08:14:10 EDT
Very old bug, I believe this is fixed in current releases, closing
