Red Hat Bugzilla – Bug 38228
Bind times out intermittently on multiprocessor
Last modified: 2007-04-18 12:32:54 EDT
I had bind 9.1.1 compiled and running fine under RH 7.0, but the upgrade to
7.1 broke it. The symptom was that name lookups would take longer than
usual and would usually fail with a timed out error, some consistently
(like my own domain name, which is hosted elsewhere in the DNS hierarchy),
some not. I tried installing the RH-distributed build of bind 9.1.0 to see
if that would help, but it made no difference. I rebuilt 9.1.1, with the
same options i used the last time, but to no avail. Finally, i'm 90%
positive that i figured it out: i'm on a quad PPro box running an SMP
kernel, and bind is built with thread support by default. As soon as i
built it with the --disable-threads option, it worked fine. I guess i might
suspect a race condition in the SMP kernel, but i'm no expert, so i'll let
You guys figure that out.
You're all going to think i'm crazy, and i'm doubting my own sanity at this
point. It seems bind is not the only problem. The single-threaded version that i
built started acting unstable, too, so i ceased using bind on my machine and
pointed /etc/resolv.conf to my ISP's DNS servers. I still get the same problem
sometimes. To make things worse, when i try browsing the Internet, i am
sometimes not able to connect to big sites like Yahoo! and www.kernel.org (but i
connect to others instantly, and later i can connect to Yahoo! and kernel.org
just fine, but i can't connect to Bugzilla, etc.). Name resolution may work
fine, but it will then fail to connect. I can even telnet to port 80 and i still
get no response. Next, You will probably think that my Internet connection is
flaky. Well, it is, BUT, it flashes alarm lights when it's being flaky. I've had
this connection for about a year, and i know when it's acting up and when it
isn't. It's been solid as a rock for the past few days (during the time i've
been having trouble). Could it be that the TCP/IP stack in the 2.4 kernel is not
SMP safe? I don't know, but i've been beating my head against this one for more
than a day now.
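One way to tell the two symptoms above apart (lookups timing out versus connections hanging after a successful lookup) is to time each step separately. The following is only a minimal sketch, not anything used in this report: the function name `probe`, its parameters, and the result keys are hypothetical, and it relies solely on Python's standard `socket` module.

```python
import socket
import time


def probe(host, port=80, timeout=5.0,
          resolve=socket.getaddrinfo, connect=socket.create_connection):
    """Time name resolution and the TCP connect as two separate steps.

    `resolve` and `connect` default to the standard library calls; they are
    parameters only so the function can be exercised without a network.
    """
    result = {"host": host, "stage": None, "dns_s": None, "connect_s": None}

    # Step 1: name resolution (IPv4 only, to mirror "telnet host 80").
    start = time.monotonic()
    try:
        infos = resolve(host, port, socket.AF_INET, socket.SOCK_STREAM)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        result.update(stage="dns-failed",
                      dns_s=time.monotonic() - start, error=str(exc))
        return result
    result["dns_s"] = time.monotonic() - start

    # Step 2: TCP connect to the first address returned.
    address = infos[0][4]  # (ip, port) tuple
    start = time.monotonic()
    try:
        sock = connect(address, timeout)
    except OSError as exc:  # covers timeouts and refused connections
        result.update(stage="connect-failed",
                      connect_s=time.monotonic() - start, error=str(exc))
        return result
    result["connect_s"] = time.monotonic() - start
    sock.close()
    result["stage"] = "ok"
    return result
```

Calling `probe("www.kernel.org")` during a bad spell and again once things recover would show whether the delay sits in the lookup (`dns_s`) or in the connection attempt (`connect_s`).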
Arjan, could this be a kernel problem? I can't reproduce this, at least not on
a UP box or a 2-CPU box.
1) Do you have any sort of firewall running?
2) Are you using our stock kernels or did you compile your own?
I have a separate machine with an ipchains-based firewall that has not changed
for months and works very reliably. I am very familiar with the ruleset i set
up, and it would not be causing what i am seeing. I am using the
RedHat-distributed 2.4.2-2smp kernel. It was installed by the RedHat 7.1
This problem truly is intermittent. If at all possible, spend a day or two
working on a multiprocessor system and see if it turns up. I will keep working
on my system, and if a day or two goes by without event, i will commit myself to
an asylum, close this case, and say "i guess it's just one of those things".
Okay, it happened again today. All of yesterday life was good, and it was still
good when i started browsing today, but then i hit a "bad block". All of a
sudden, i couldn't connect to any Web servers (including via telnet to port 80),
including ones that i had visited mere minutes ago, sometimes name resolution
was failing or taking much longer than usual, and then Netscape crashed. I am
using Netscape Communicator 4.77, and it is normally stable under my usage
patterns. All i was doing was waiting for a connection/timeout, and the browser
crashed. Could it possibly be that the TCP/IP stack (or SMP - TCP/IP
interaction) is corrupting something and passing that on to Netscape, which then
fails to deal well with unexpected input? I deleted the core file, which was
probably dumb. Anyway, then i was able to connect to O'Reilly's Web site just
fine, but i still wasn't getting to other sites (including bugzilla.redhat.com).
So, i went out for ice cream, and when i got back, life was good again. No
problems with my physical Internet connection during the entire time (i could
see the traffic lights flickering when i tried to connect, etc.). I sincerely
doubt that it will help at all, but i am attaching the output of strace when run
on a telnet attempt to a Web server i couldn't reach.
I know that, if You even believe me, this is the worst kind of problem to try to
debug. I will try anything that will help. I could possibly boot into the
uniprocessor kernel and use the machine that way for a few days to see if
anything breaks... Let me know.
Created attachment 17006 [details]
strace of "telnet www.futurekids-mannheim.de 80"
Let me add just one more thing. Is it possible that it's an ethernet driver
problem? During the time that i cannot connect to other machines, i can still
connect to the loopback interface just fine. I am using a ThunderLan card (the
output from lspci -n is "01:07.0 Class 0280: 0e11:ae43 (rev 10)"), and after
installing 7.1, kudzu claimed that the card had been removed from the system. I
told it to leave everything as it is and i had no further problems.
Dave: could you take a peek at this?
Any interesting messages in your kernel logs?
I bet the thunderlan driver is crapping out in 2.4.x
In fact, if you have another kind of ethernet card handy
(say an eepro100 or a 3c59x), you can prove my theory
by putting that other card in your machine and seeing if the
I don't think this is a networking/SMP/whatever problem
You can do other kinds of experiments, BTW, to help narrow
the problem down. When your machine enters this state, and
you can't connect anywhere, use tcpdump on another machine
on the same subnet to see if your machine is sending out any
packets at all.
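If the capture is saved as text, the check can be automated. The following is a rough sketch under stated assumptions, not part of the original advice: it assumes the one-line-per-packet format that current `tcpdump -n` prints (`IP src > dst: Flags [S], ...`), which differs from what tcpdump printed in 2001, and the name `unanswered_syns` is hypothetical.

```python
import re

# Assumed input format (modern "tcpdump -n" text output), for example:
#   12:00:01.000000 IP 10.0.0.5.40000 > 192.0.2.10.80: Flags [S], seq 1, length 0
LINE = re.compile(
    r"IP (?P<src>\S+) > (?P<dst>\S+): Flags \[(?P<flags>[^\]]+)\]"
)


def unanswered_syns(lines):
    """Return (src, dst) pairs that sent a SYN but never got a SYN-ACK back."""
    pending = []
    for line in lines:
        m = LINE.search(line)
        if not m:
            continue  # not a TCP line in the assumed format
        src, dst, flags = m["src"], m["dst"], m["flags"]
        if flags == "S":
            # Retransmitted SYNs for the same pair are recorded only once.
            if (src, dst) not in pending:
                pending.append((src, dst))
        elif flags == "S.":
            # A SYN-ACK travels in the reverse direction of the SYN it answers.
            if (dst, src) in pending:
                pending.remove((dst, src))
    return pending
```

An empty result means every SYN in the capture was answered. Pairs left in the list are connections where packets went out but nothing came back, which points away from the local driver and toward something upstream.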
I've had no 2.4 problems reported with the thunderlan in recent kernels. There
are three obvious possibilities:
1. Some kind of cabling/link funny that is tripping up the driver
2. Actual network layer problems (e.g. another box using the same IP)
3. Early PPro steppings. Which PPro steppings are the CPUs in the 4-way box?
I'm guessing #1 or #2 right now.
I did a tcpdump on some failed connection attempts as soon as i hit what i call
a "bad block". The tcpdump was run from another machine on the same network that
sits inside my firewall, so it's proof positive of what i'm sending. I'm sending
SYNs to these sites and getting nothing back. So, it turns out that it probably
is my fault (or rather my ISP's fault -- stupid ISP). Even if it isn't, i don't
have the time anymore to track it down. I have to find a job in Germany as
quickly as possible and move there. Sorry for wasting Your time.