Bug 38228 - Bind times out intermittently on multiprocessor
Alias: None
Product: Red Hat Linux
Classification: Retired
Component: kernel
Version: 7.1
Hardware: i686
OS: Linux
Target Milestone: ---
Assignee: David Miller
QA Contact: David Lawrence
Depends On:
Reported: 2001-04-29 04:11 UTC by Andrew Rucker Jones
Modified: 2007-04-18 16:32 UTC (History)

Clone Of:
Last Closed: 2001-05-02 11:30:34 UTC

Attachments
strace of "telnet www.futurekids-mannheim.de 80" (7.89 KB, text/plain)
2001-05-01 23:01 UTC, Andrew Rucker Jones

Description Andrew Rucker Jones 2001-04-29 04:11:57 UTC
I had bind 9.1.1 compiled and running fine under RH 7.0, but the upgrade to
7.1 broke it. The symptom was that name lookups would take longer than
usual and would usually fail with a timed out error, some consistently
(like my own domain name, which is hosted elsewhere in the DNS hierarchy),
some not. I tried installing the RH-distributed build of bind 9.1.0 to see
if that would help, but it made no difference. I rebuilt 9.1.1, with the
same options i used the last time, but to no avail. Finally, i'm 90%
positive that i figured it out: i'm on a quad PPro box running an SMP
kernel, and bind is built with thread support by default. As soon as i
built it with the --disable-threads option, it worked fine. I guess i might
suspect a race condition in the SMP kernel, but i'm no expert, so i'll let
You guys figure that out.

Comment 1 Andrew Rucker Jones 2001-04-29 22:37:16 UTC
You're all going to think i'm crazy, and i'm doubting my own sanity at this
point. It seems bind is not the only problem. The single-threaded version that i
built started acting unstable, too, so i ceased using bind on my machine and
pointed /etc/resolv.conf to my ISP's DNS servers. I still get the same problem
sometimes. To make things worse, when i try browsing the Internet, i am
sometimes not able to connect to big sites like Yahoo! and www.kernel.org (but i
connect to others instantly, and later i can connect to Yahoo! and kernel.org
just fine, but i can't connect to Bugzilla, etc.). Name resolution may work
fine, but it will then fail to connect. I can even telnet to port 80 and i still
get no response. Next, You will probably think that my Internet connection is
flaky. Well, it is, BUT, it flashes alarm lights when it's being flaky. I've had
this connection for about a year, and i know when it's acting up and when it
isn't. It's been solid as a rock for the past few days (during the time i've
been having trouble). Could it be that the TCP/IP stack in the 2.4 kernel is not
SMP safe? I don't know, but i've been beating my head against this one for more
than a day now.

Comment 2 Bernhard Rosenkraenzer 2001-04-30 13:12:56 UTC
Arjan, could this be a kernel problem? I can't reproduce this; at least not on 
a UP box and a 2 CPU box.

Comment 3 Arjan van de Ven 2001-04-30 13:16:37 UTC
1) Do you have any sort of firewall running ?
2) Are you using our stock kernels or did you compile your own ?

Comment 4 Andrew Rucker Jones 2001-04-30 18:51:04 UTC
I have a separate machine with an ipchains-based firewall that has not changed
for months and works very reliably. I am very familiar with the ruleset i set
up, and it would not be causing what i am seeing. I am using the
RedHat-distributed 2.4.2-2smp kernel. It was installed by the RedHat 7.1

This problem truly is intermittent. If at all possible, spend a day or two
working on a multiprocessor system and see if it turns up. I will keep working
on my system, and if a day or two goes by without event, i will commit myself to
an asylum, close this case, and say "i guess it's just one of those things".

Comment 5 Andrew Rucker Jones 2001-05-01 23:00:32 UTC
Okay, it happened again today. All of yesterday life was good, and it was still
good when i started browsing today, but then i hit a "bad block". All of a
sudden, i couldn't connect to any Web servers (including via telnet to port 80),
including ones that i had visited mere minutes ago, sometimes name resolution
was failing or taking much longer than usual, and then Netscape crashed. I am
using Netscape Communicator 4.77, and it is normally stable under my usage
patterns. All i was doing was waiting for a connection/timeout, and the browser
crashed. Could it possibly be that the TCP/IP stack (or SMP - TCP/IP
interaction) is corrupting something and passing that on to Netscape, which then
fails to deal well with unexpected input? I deleted the core file, which was
probably dumb. Anyway, then i was able to connect to O'Reilly's Web site just
fine, but i still wasn't getting to other sites (including bugzilla.redhat.com).
So, i went out for ice cream, and when i got back, life was good again. No
problems with my physical Internet connection during the entire time (i could
see the traffic lights flickering when i tried to connect, etc.). I sincerely
doubt that it will help at all, but i am attaching the output of strace when run
on a telnet attempt to a Web server i couldn't reach.

I know that, if You even believe me, this is the worst kind of problem to try to
debug. I will try anything that will help. I could possibly boot into the
uniprocessor kernel and use the machine that way for a few days to see if
anything breaks... Let me know.

Comment 6 Andrew Rucker Jones 2001-05-01 23:01:53 UTC
Created attachment 17006
strace of "telnet www.futurekids-mannheim.de 80"

Comment 7 Andrew Rucker Jones 2001-05-01 23:17:51 UTC
Let me add just one more thing. Is it possible that it's an ethernet driver
problem? During the time that i cannot connect to other machines, i can still
connect to the loopback interface just fine. I am using a ThunderLan card (the
output from lspci -n is "01:07.0 Class 0280: 0e11:ae43 (rev 10)"), and after
installing 7.1, kudzu claimed that the card had been removed from the system. I
told it to leave everything as it is and i had no further problems.

Comment 8 Arjan van de Ven 2001-05-02 10:55:36 UTC
Dave: could you take a peek at this ?

Comment 9 David Miller 2001-05-02 11:16:13 UTC
Any interesting messages in your kernel logs?
I bet the thunderlan driver is crapping out in 2.4.x.
In fact, if you have another kind of ethernet card handy
(say an eepro100 or a 3c59x), you can prove my theory
by putting that other card in your machine and seeing if the
problem persists.

I don't think this is a networking/SMP/whatever problem
at all.

You can do other kinds of experiments, BTW, to help narrow
the problem down.  When your machine enters this state, and
you can't connect anywhere, use tcpdump on another machine
on the same subnet to see if your machine is sending out any
packets at all.
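
A minimal sketch of one more experiment in the same spirit, not taken from
this thread: a small Python probe that separates "name resolution failed" from
"connection attempt went unanswered" from "connection refused". It only
classifies what the local machine observes; it does not replace tcpdump on a
second box. The function name probe, the default port 80 and the 5 second
timeout are assumptions chosen for illustration.

```python
import socket


def probe(host, port=80, timeout=5.0):
    """Classify a single TCP connection attempt.

    Returns one of: "dns_failure", "timeout", "refused", "error", "connected".
    """
    try:
        # Resolve first so a resolver failure is reported separately from a
        # connect failure. Note: the timeout below does not bound this lookup.
        infos = socket.getaddrinfo(host, port, socket.AF_INET, socket.SOCK_STREAM)
    except socket.gaierror:
        return "dns_failure"

    addr = infos[0][4]  # (ip, port) of the first IPv4 result
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect(addr)
    except socket.timeout:
        # No answer to the connection attempt within the timeout.
        return "timeout"
    except ConnectionRefusedError:
        # The peer (or something on the path) answered with a reset.
        return "refused"
    except OSError:
        # Anything else, e.g. network or host unreachable.
        return "error"
    finally:
        sock.close()
    return "connected"
```

Run while the machine is in the bad state, once against 127.0.0.1 and once
against a remote Web server: if loopback reports "connected" and the remote
host reports "timeout", that is at least consistent with connection attempts
leaving the box and getting no reply, which tcpdump on another machine can then
confirm or refute.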

Comment 10 Alan Cox 2001-05-02 11:30:30 UTC
I've had no 2.4 problems reported with the thunderlan in recent kernels. There
are three obvious possibilities:

1. Some kind of cabling/link funny that is tripping up the driver
2. Actual network layer problems (eg another box using the same IP)
3. Early PPro steppings. What stepping are the CPUs in the 4-way box?

I'm guessing #1 or #2 right now.

Comment 11 Andrew Rucker Jones 2001-05-05 13:31:09 UTC
I did a tcpdump on some failed connection attempts as soon as i hit what i call
a "bad block". The tcpdump was run from another machine on the same network that
sits inside my firewall, so it's proof positive of what i'm sending. I'm sending
SYNs to these sites and getting nothing back. So, it turns out that it probably
is my fault (or rather my ISP's fault -- stupid ISP). Even if it isn't, i don't
have the time anymore to track it down. I have to find a job in Germany as
quickly as possible and move there. Sorry for wasting Your time.
