When I keep the machine on my network for long periods
of time, then usually within a week there will be
at least one period of 1-4 hours where the Linux box
disappears completely off the network!
During the failure, I can log into the Linux box from
the console; Linux never crashes. I cannot ping, telnet,
rsh, or rlogin into the machine. It's like it's not there.
I also cannot telnet, ping, etc. out from the machine that
is affected. After 1-4 hours, the problem goes away
by itself without any intervention from me!
A friend of mine has the same problem; he runs 5.1.
We use different NICs, and each of us has 2 NICs in our
Linux box. Both run Samba also, though this is a
kernel issue. I have checked Usenet, and 3 other people
that run either Redhat 5.1 or 5.2 are experiencing the
identical problems. Most of those people however only
have 1 NIC in their machine. This bug is impossible
to reproduce (I wish I knew how), as it emerges once or
twice a week.
BTW both my NICs are tier 1 supported.
What NICs are you using exactly? What role does your Linux
machine play on your network? What version of the kernel are you
running?
Are you sure it's a software problem? That is, when the network goes
dead, can other machines on the same hub ping each other? Can you
quickly plug another machine into the network drop the broken machine
is using and ping with it? Are there any kernel messages in the system
logs or the "dmesg" listing?
I've been having network problems too. In one console session, do a
'netstat -ic'. Then from another session, do a 'ping w.x.y.z' (with a
note whether the correct device counter increments. My guess is that
the 'lo' device will increment.
(please forward a copy of your results to me as well as RedHat)
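The counter check suggested above can also be scripted. This is only a rough sketch, assuming a Linux system that exposes the modern 16-column /proc/net/dev layout (older kernels print fewer columns and would be skipped here). The function names are made up for illustration; it shows the same per-interface packet counters that 'netstat -i' reports, compared before and after a ping.

```python
def parse_proc_net_dev(text):
    """Map interface name -> (rx_packets, tx_packets) from /proc/net/dev text."""
    counters = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # the two header lines contain no colon
        name, data = line.split(":", 1)
        fields = data.split()
        if len(fields) < 16:
            continue  # assumption: modern layout with 8 RX + 8 TX columns
        # Column 1 is RX packets, column 9 is TX packets.
        counters[name.strip()] = (int(fields[1]), int(fields[9]))
    return counters


def changed_interfaces(before, after):
    """Return {iface: (rx_delta, tx_delta)} for interfaces whose counters moved."""
    deltas = {}
    for name, (rx1, tx1) in after.items():
        rx0, tx0 = before.get(name, (0, 0))
        if (rx1, tx1) != (rx0, tx0):
            deltas[name] = (rx1 - rx0, tx1 - tx0)
    return deltas


def read_counters(path="/proc/net/dev"):
    """Take one snapshot of the live counters."""
    with open(path) as f:
        return parse_proc_net_dev(f.read())
```

To use it, call read_counters() once, run the ping from another session, call read_counters() again, and pass both snapshots to changed_interfaces(). If only 'lo' shows up in the result, the packets never reached the ethernet device.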
I don't know how to reproduce this problem. Please
reopen with conditions to reproduce.
I cannot reproduce this bug on demand; it's very difficult. I have seen this
behaviour in other machines at work running normal RedHat 5.1
distributions also. The common thread between all cases that I have
seen is that the machines are not used or hardly used. So here comes
the first challenge, how do you notice this? I just notice it by
chance once in a while when I try unsuccessfully to telnet into the
machine. Another time I had remotely logged in, run an X app from the
linux machine, left it there for like 3-4 days, and came back to it.
By chance it was "that time of the week" when it crapped out and
the app was frozen. Didn't take long to see what was going on...
That machine in particular had NFS drives mounted, also password
system went through NIS, however I've seen the same behaviour
elsewhere with no mounted NFS drives and no NIS. Could it be a driver
bug? If a driver crashes, is linux smart enough to re-load it again?
In my machine I have a tulip card and an EtherExpress 100. I can
check on the other two machines at work. When checking the logs,
there is nothing to hint at the problem. For the one machine with NFS
drives and NIS on it, you see messages about it not being able to
reach its servers, just as if I had yanked the ethernet cable out of
the computer, minus all the heart-beat missing complaints.
A way you can eventually reproduce the problem is to set up 2
linux machines with 5.2 (normal setup, nothing fancy) with a
cross-over cable. Let one machine sit idle and literally do nothing,
while the other one just pings it once every 5-10 mins. Let it sit
for a week or two, pipe the pings to a file and look at it after a
week. If you see packet loss you know something is up!!!
No guarantees this will reproduce the bug, this is a very subtle
thing. If your linux box gets used frequently (sorry can't quantify)
you'll never experience this problem. Ever since I started using my
linux box as a server, I haven't seen this behaviour for a little over
3 months now, I suspect because it's doing things more or less around
the clock, as opposed to sitting around doing jack for 3-4 days
at a time.
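The idle-machine ping test described above could be scripted roughly like this. It is a sketch, not a tested reproduction recipe: it assumes Python 3 and a system 'ping' that accepts '-c 1', and the function names, the 300 second default interval, and the log format are all my own choices. It pings once per interval, appends a timestamped ok/LOST line to a file, and summarize() later reports how many pings were lost and when each outage started and ended.

```python
import subprocess
import time
from datetime import datetime


def ping_once(host, timeout_s=10):
    """Send one echo request with the system ping; True if a reply came back."""
    try:
        result = subprocess.run(
            ["ping", "-c", "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # give up if ping itself hangs
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def format_record(timestamp, ok):
    """One log line: ISO timestamp followed by 'ok' or 'LOST'."""
    return f"{timestamp.isoformat(timespec='seconds')} {'ok' if ok else 'LOST'}"


def summarize(lines):
    """Return (total, lost, outages); each outage is (first_lost, last_lost)."""
    total = lost = 0
    outages = []
    start = last = None
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # ignore blank or malformed lines
        stamp, status = parts
        total += 1
        if status == "LOST":
            lost += 1
            if start is None:
                start = stamp
            last = stamp
        elif start is not None:
            outages.append((start, last))
            start = last = None
    if start is not None:
        outages.append((start, last))  # log ended during an outage
    return total, lost, outages


def monitor(host, logfile, interval_s=300, probe=ping_once):
    """Ping forever, appending one record per interval to logfile."""
    with open(logfile, "a") as log:
        while True:
            log.write(format_record(datetime.now(), probe(host)) + "\n")
            log.flush()
            time.sleep(interval_s)
```

Run monitor("w.x.y.z", "pings.log") on the busy machine, leave it for a week or two, then feed the file's lines to summarize(). A run of LOST entries spanning 1-4 hours would match the symptom described in this report.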
Perhaps a quick manual inspection of the code would show what the
kernel may do if no network traffic goes through it? Sorry, I wish I
could say more... It may not be the kernel, but if you can't even ping
the machine, ICMP is handled by the kernel, right? So what else could
This smells like Advanced Power Management powering down
the machine and not powering it back up correctly.
Meanwhile, I don't know how to fix this problem (short of
suggesting that you run on a sparc :-)
Oops sorry about that, I hit re-load and re-posted the data by
accident. I don't think it's a power management issue because I
disabled it in my BIOS. I use an ASUS TX97 motherboard.
We cannot reproduce this problem.