Description of problem:
This morning I got disconnected from my machine. When I got into the office I went through the logs and found some kernel errors logged at about the same time that a network interface broke. The machine and the interface were still up, but that interface no longer had the correct network information attached to it, so I used ifdown and ifup to get it running again. Another network interface was still working properly. I am attaching part of /var/log/messages that includes the errors.

Version-Release number of selected component (if applicable):
2.6.25-0.172.rc7.git4.fc9.i686
The machine has a single Pentium III CPU.

How reproducible:
I am not sure. I have seen these symptoms before, but not for several weeks, and that was many kernel updates ago. When it happened I was syncing my copy of rawhide using lftp. At least once in the past I was doing the same thing when I got a similar kind of failure.
Created attachment 299889: /var/log/messages extract
Created attachment 301387: /var/log/messages extract

It happened again with kernel 2.6.25-0.195.rc8.git1.fc9.i686. I was wrong about the interface losing its address information; I probably just looked at the wrong device. I have attached another log extract that covers the period from when I lost access until I restarted the network device (eth4).
I brought this up on the linux-kernel mailing list last week: http://lkml.org/lkml/2008/4/1/428 Discussion is ongoing.
I looked through that thread and would like to note that this is not without negative effects. Something bad happens to the network interface that was under load when this happened. It doesn't seem to die all at once. (My ssh session died in the middle of an lftp run, but when I got to the box I found the lftp run had completed, even though it would have needed to have run for another several minutes past the point where the ssh session locked up.) Eventually no inbound or outbound connection attempts (or pings) work until I reset that interface.
I saw this again with 2.6.25-0.204.rc8.git4.fc9.i686. I noticed it when my network connection failed. Running ifdown followed by ifup got things working again before any of my ssh connections timed out.
It also happened with 2.6.25-0.212.rc8.git6.fc9.i686, again while using lftp to mirror the x86 rawhide tree.
Is there something I can do to help track this down? It is annoying to get locked out of the machine when working remotely (though I do have a cron job that resets the network interface to limit how long I stay locked out), and I have another machine I want to upgrade to F9 where this would be even more of a problem. So I have some extra incentive to help get this fixed. One piece of information that may hint at which changes affected this: I think bug 433594 is very likely the same problem. It stopped happening for long enough that we closed that bug.
I think it might help if you disable TSO and/or LRO and/or GSO on the adapter.
I don't think the built-in device does offloading; ethtool didn't show any offloading turned on. This is the ethtool -i output:
driver: e100
version: 3.5.23-k4-NAPI
firmware-version: N/A
bus-info: 0000:01:08.0
And from lspci:
01:08.0 Ethernet controller: Intel Corporation 82801BA/BAM/CA/CAM Ethernet Controller (rev 01)
I do have some other cheap cards in that box that might not have this problem, so I can try swapping which one is used for my external link.
I am still seeing this with the 2.6.25-0.218.rc8.git7.fc9.i686 kernel.
Since I haven't noticed this happen on the other interfaces, there is a reasonable chance that this is a bug specific to the e100 driver. That driver won't be used on the other machine (where a hang would cause more of a problem). The other interfaces on the affected machine are different hardware, but they also don't get stressed as often. None of the network devices are common between the two machines. So I'll do some minimal testing and then just risk the upgrade.
The e100 driver was still having this issue with 2.6.25-1.fc9.i686. I am now using a different card, with a different driver, for the connection that was causing problems. Since I couldn't reliably reproduce the problem, it may take a while before it either happens again or I can be reasonably confident that the network hang part of the issue is driver specific.
I haven't noticed this problem since switching my outside link to use a different network card. I have also not seen that issue on another machine of similar size that also does not use the e100 driver. While it hasn't been long enough to be sure (and I have upgraded the kernel to 2.6.25-14), this does point to the e100 driver having a defect.
I still haven't seen this problem recur since I stopped using the e100 NIC. I think it is very likely this is an e100 driver problem.
Changing version to '9' as part of upcoming Fedora 9 GA. More information and reason for this action is here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping
I retired the machine (at work) that had the hardware with the problem and I haven't seen it happen on any of the other NICs I have. So going forward I probably won't be able to help test any fixes.
This message is a reminder that Fedora 9 is nearing its end of life. Approximately 30 (thirty) days from now Fedora will stop maintaining and issuing updates for Fedora 9. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '9'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 9's end of life. Bug Reporter: Thank you for reporting this issue and we are sorry that we may not be able to fix it before Fedora 9 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora please change the 'version' of this bug to the applicable version. If you are unable to change the version, please add a comment here and someone will do it for you. Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete. The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Fedora 9 changed to end-of-life (EOL) status on 2009-07-10. Fedora 9 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. Thank you for reporting this bug and we are sorry it could not be fixed.