|Summary:||Bind times out intermittently on multiprocessor|
|Product:||[Retired] Red Hat Linux||Reporter:||Andrew Rucker Jones <arjones>|
|Component:||kernel||Assignee:||David Miller <davem>|
|Status:||CLOSED NOTABUG||QA Contact:||David Lawrence <dkl>|
|Fixed In Version:||Doc Type:||Bug Fix|
|Last Closed:||2001-05-02 11:30:34 UTC||Type:||---|
Description Andrew Rucker Jones 2001-04-29 04:11:57 UTC
I had bind 9.1.1 compiled and running fine under RH 7.0, but the upgrade to 7.1 broke it. The symptom was that name lookups would take longer than usual and would usually fail with a timed out error, some consistently (like my own domain name, which is hosted elsewhere in the DNS hierarchy), some not. I tried installing the RH-distributed build of bind 9.1.0 to see if that would help, but it made no difference. I rebuilt 9.1.1 with the same options i used the last time, but to no avail. Finally, i'm 90% positive that i figured it out: i'm on a quad PPro box running an SMP kernel, and bind is built with thread support by default. As soon as i built it with the --disable-threads option, it worked fine. I guess i might suspect a race condition in the SMP kernel, but i'm no expert, so i'll let You guys figure that out.
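The slow and failing lookups described above could be recorded over time with something like the following. This is only a minimal sketch, not a tool anyone in this report used: the function name `time_lookup` and its `resolver` parameter are assumptions made for illustration, and it goes through the system resolver via `socket.getaddrinfo` rather than querying bind directly.

```python
import socket
import time


def time_lookup(hostname, resolver=socket.getaddrinfo):
    """Resolve hostname once; report how long it took and whether it failed.

    `resolver` is a hypothetical hook so the function can be exercised
    without a network. By default it uses the system resolver.
    """
    start = time.monotonic()
    try:
        resolver(hostname, None)
        error = None
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        error = str(exc)
    elapsed = time.monotonic() - start
    return {"host": hostname, "seconds": elapsed, "error": error}
```

Calling `time_lookup` on the affected names at intervals and logging the results would show whether the failures cluster in time, which is hard to judge from memory with an intermittent problem.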
Comment 1 Andrew Rucker Jones 2001-04-29 22:37:16 UTC
You're all going to think i'm crazy, and i'm doubting my own sanity at this point. It seems bind is not the only problem. The single-threaded version that i built started acting unstable, too, so i ceased using bind on my machine and pointed /etc/resolv.conf to my ISP's DNS servers. I still get the same problem sometimes. To make things worse, when i try browsing the Internet, i am sometimes not able to connect to big sites like Yahoo! and www.kernel.org (but i connect to others instantly, and later i can connect to Yahoo! and kernel.org just fine, but i can't connect to Bugzilla, etc.). Name resolution may work fine, but it will then fail to connect. I can even telnet to port 80 and i still get no response. Next, You will probably think that my Internet connection is flaky. Well, it is, BUT, it flashes alarm lights when it's being flaky. I've had this connection for about a year, and i know when it's acting up and when it isn't. It's been solid as a rock for the past few days (during the time i've been having trouble). Could it be that the TCP/IP stack in the 2.4 kernel is not SMP safe? I don't know, but i've been beating my head against this one for more than a day now.
Comment 2 Bernhard Rosenkraenzer 2001-04-30 13:12:56 UTC
Arjan, could this be a kernel problem? I can't reproduce this; at least not on a UP box and a 2 CPU box.
Comment 3 Arjan van de Ven 2001-04-30 13:16:37 UTC
1) Do you have any sort of firewall running? 2) Are you using our stock kernels or did you compile your own?
Comment 4 Andrew Rucker Jones 2001-04-30 18:51:04 UTC
I have a separate machine with an ipchains-based firewall that has not changed for months and works very reliably. I am very familiar with the ruleset i set up, and it would not be causing what i am seeing. I am using the RedHat-distributed 2.4.2-2smp kernel. It was installed by the RedHat 7.1 installation. This problem truly is intermittent. If at all possible, spend a day or two working on a multiprocessor system and see if it turns up. I will keep working on my system, and if a day or two goes by without event, i will commit myself to an asylum, close this case, and say "i guess it's just one of those things".
Comment 5 Andrew Rucker Jones 2001-05-01 23:00:32 UTC
Okay, it happened again today. All of yesterday life was good, and it was still good when i started browsing today, but then i hit a "bad block". All of a sudden, i couldn't connect to any Web servers (including via telnet to port 80), even ones that i had visited mere minutes ago; sometimes name resolution was failing or taking much longer than usual; and then Netscape crashed. I am using Netscape Communicator 4.77, and it is normally stable under my usage patterns. All i was doing was waiting for a connection/timeout, and the browser crashed. Could it possibly be that the TCP/IP stack (or SMP - TCP/IP interaction) is corrupting something and passing that on to Netscape, which then fails to deal well with unexpected input? I deleted the core file, which was probably dumb. Anyway, then i was able to connect to O'Reilly's Web site just fine, but i still wasn't getting to other sites (including bugzilla.redhat.com). So, i went out for ice cream, and when i got back, life was good again. No problems with my physical Internet connection during the entire time (i could see the traffic lights flickering when i tried to connect, etc.). I sincerely doubt that it will help at all, but i am attaching the output of strace when run on a telnet attempt to a Web server i couldn't reach. I know that, if You even believe me, this is the worst kind of problem to try to debug. I will try anything that will help. I could possibly boot into the uniprocessor kernel and use the machine that way for a few days to see if anything breaks... Let me know.
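The manual "telnet to port 80" check described above could be scripted so that each attempt is timestamped and classified. The following is a rough sketch under assumptions: the name `probe`, the 5-second default timeout, and the outcome labels are all invented for illustration and were not part of this report.

```python
import socket
import time


def probe(host, port=80, timeout=5.0):
    """Try one TCP connection, much like `telnet host 80`, and classify it."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            outcome = "connected"
    except socket.timeout:
        outcome = "timeout"  # typically: nothing came back before the deadline
    except ConnectionRefusedError:
        outcome = "refused"  # the peer actively rejected the connection
    except OSError as exc:
        outcome = f"error: {exc}"  # e.g. name resolution or routing failure
    return {
        "host": host,
        "port": port,
        "outcome": outcome,
        "seconds": round(time.monotonic() - start, 3),
        "when": time.strftime("%Y-%m-%d %H:%M:%S"),
    }
```

Running `probe` against a handful of sites plus 127.0.0.1 every minute or so would give a log that shows exactly when a "bad block" starts and ends, and whether loopback stays reachable throughout.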
Comment 6 Andrew Rucker Jones 2001-05-01 23:01:53 UTC
Created attachment 17006: strace of "telnet www.futurekids-mannheim.de 80"
Comment 7 Andrew Rucker Jones 2001-05-01 23:17:51 UTC
Let me add just one more thing. Is it possible that it's an ethernet driver problem? During the time that i cannot connect to other machines, i can still connect to the loopback interface just fine. I am using a ThunderLan card (the output from lspci -n is "01:07.0 Class 0280: 0e11:ae43 (rev 10)"), and after installing 7.1, kudzu claimed that the card had been removed from the system. I told it to leave everything as it is and i had no further problems.
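For reference, the `lspci -n` line quoted above can be split into its numeric fields as follows. This is a small sketch only: the function name `parse_lspci_n` and the dictionary keys are made up for illustration, and the word "Class" is treated as optional on the assumption that some versions of the tool omit it.

```python
import re

# Pattern for one line of `lspci -n` output, e.g.
# "01:07.0 Class 0280: 0e11:ae43 (rev 10)"
_LSPCI_N = re.compile(
    r"^(?P<slot>\S+)\s+(?:Class\s+)?(?P<pci_class>[0-9a-fA-F]{4}):\s+"
    r"(?P<vendor>[0-9a-fA-F]{4}):(?P<device>[0-9a-fA-F]{4})"
    r"(?:\s+\(rev\s+(?P<rev>[0-9a-fA-F]+)\))?"
)


def parse_lspci_n(line):
    """Return slot, class, vendor, device and revision as strings, or None."""
    match = _LSPCI_N.match(line.strip())
    return match.groupdict() if match else None
```

The vendor and device IDs are what hardware-detection tools and drivers match against, so they are the useful part to quote when asking whether a given card is supported.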
Comment 8 Arjan van de Ven 2001-05-02 10:55:36 UTC
Dave: could you take a peek at this?
Comment 9 David Miller 2001-05-02 11:16:13 UTC
Any interesting messages in your kernel logs? I bet the thunderlan driver is crapping out in 2.4.x. In fact, if you have another kind of ethernet card handy (say an eepro100 or a 3c59x), you can prove my theory by putting that other card in your machine and seeing if the problem persists. I don't think this is a networking/SMP/whatever problem at all. You can do other kinds of experiments, BTW, to help narrow the problem down. When your machine enters this state, and you can't connect anywhere, use tcpdump on another machine on the same subnet to see if your machine is sending out any packets at all.
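The reasoning behind the tcpdump experiment can be sketched in code. Everything here is an assumption for illustration: the `Packet` record is a hypothetical, simplified stand-in for one captured TCP packet (a real capture would have to be parsed into this shape first), and the flag strings "S" and "SA" are just labels chosen for this sketch.

```python
from collections import namedtuple

# Hypothetical, simplified record of one captured TCP packet.
Packet = namedtuple("Packet", "src dst flags")


def unanswered_syns(packets, local_ip):
    """Return destinations that got a SYN from local_ip but never sent anything back."""
    syn_sent = []
    replied = set()
    for p in packets:
        if p.src == local_ip and p.flags == "S":
            if p.dst not in syn_sent:
                syn_sent.append(p.dst)
        elif p.dst == local_ip:
            replied.add(p.src)  # any packet back counts as a reply
    return [dst for dst in syn_sent if dst not in replied]
```

If the capture contains no packets from the affected machine at all, suspicion falls on that machine or its driver. If SYNs are visible on the wire but some destinations never answer, the fault more likely lies somewhere beyond the local segment.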
Comment 10 Alan Cox 2001-05-02 11:30:30 UTC
I've had no 2.4 problems reported with the thunderlan in recent kernels. There are three obvious possibilities:
1. Some kind of cabling/link funny that is tripping up the driver
2. Actual network layer problems (e.g. another box using the same IP)
3. Early PPro steppings. What PPro steppings are the CPUs in the 4-way box?
I'm guessing #1 or #2 right now.
Comment 11 Andrew Rucker Jones 2001-05-05 13:31:09 UTC
I did a tcpdump on some failed connection attempts as soon as i hit what i call a "bad block". The tcpdump was run from another machine on the same network that sits inside my firewall, so it's proof positive of what i'm sending. I'm sending SYNs to these sites and getting nothing back. So, it turns out that it probably is my fault (or rather my ISP's fault -- stupid ISP). Even if it isn't, i don't have the time anymore to track it down. I have to find a job in Germany as quickly as possible and move there. Sorry for wasting Your time.