Red Hat Bugzilla – Bug 440028
multi-page atomic allocations fail under memory pressure
Last modified: 2009-07-14 11:43:40 EDT
Description of problem:
This morning I got disconnected from my machine and when I got into the office I
went through the logs and found some kernel errors logged at about the same time
that the network interface broke. I found the machine and the interface still up,
but that interface no longer had the correct network information attached to it
and I used ifdown and ifup to get it running again. Another network interface
was still working properly.
I am attaching part of /var/log/messages that includes the errors.
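For anyone triaging a similar log, here is a minimal Python sketch that picks out multi-page (order 1 or higher) allocation failures. The message shape in the regular expression is an assumption based on how kernels of this era usually report these failures; it is not copied from the attachments on this bug. The function name is hypothetical.

```python
import re

# ASSUMPTION: failures are logged in roughly this shape:
#   "<process>: page allocation failure. order:<n>, mode:<hex>"
# Adjust the pattern if the log in question differs.
FAILURE_RE = re.compile(
    r"(?P<proc>\S+): page allocation failure\. "
    r"order:(?P<order>\d+), mode:(?P<mode>0x[0-9a-fA-F]+)"
)


def multi_page_failures(lines):
    """Return (process, order, mode) for each failure with order >= 1.

    An order-n request asks for 2**n contiguous pages, so order 0 is a
    single page and anything higher is a multi-page allocation.
    """
    hits = []
    for line in lines:
        match = FAILURE_RE.search(line)
        if match and int(match.group("order")) >= 1:
            hits.append(
                (match.group("proc"), int(match.group("order")), match.group("mode"))
            )
    return hits
```

It accepts any iterable of lines, so an open file handle on a log extract can be passed in directly.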
Version-Release number of selected component (if applicable):
The machine has a single Pentium III CPU.
I am not sure. I have seen these symptoms previously, but not for several weeks
and that was a lot of kernel updates ago.
At the time it happened I was syncing up (using lftp) my copy of rawhide. At
least one time in the past I was doing the same thing when I got a similar kind
Steps to Reproduce:
Created attachment 299889
Created attachment 301387
It happened again with kernel 2.6.25-0.195.rc8.git1.fc9.i686.
I was wrong about the interface losing address information. I probably just
looked at the wrong device.
I have attached another log extract that covers from when I lost access until I
restarted the network device (eth4).
I brought this up on Linux-kernel last week. http://lkml.org/lkml/2008/4/1/428
Discussion is ongoing.
I looked through that thread and would like to note that this is not without
negative effects. Something bad happens to the network interface that was under
load when this happened. It doesn't seem to die all at once. (My ssh session
died in the middle of an lftp run, but when I got to the box I found the lftp
run had completed, even though it would have needed to have run for another
several minutes past the point where the ssh session locked up.) Eventually no
inbound or outbound connection attempts (or pings) work until I reset that
I saw this again with 2.6.25-0.204.rc8.git4.fc9.i686. I noticed when my network
connection failed. ifdown followed by ifup got things working before any of my
ssh connections timed out.
It also happened with 2.6.25-0.212.rc8.git6.fc9.i686. Again while using lftp to
mirror the x86 rawhide tree.
Is there something I can do to help track this down? It is annoying to get
locked out of the machine (though I do have a cron job resetting the network
interface to limit how long I get locked out) when doing stuff remotely and I
have another machine I want to upgrade to F9 for which this would be even more
of a problem. So I have some extra incentive to help get this fixed.
Also, one piece of info that may give a hint as to what changes affected this
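The cron job mentioned above is not shown in this report. As a rough sketch of the idea, the Python below bounces an interface only when a ping probe fails. The function names, the probe host, and the `run` parameter are all hypothetical; `ping -c 1 -W 5` assumes the Linux iputils ping.

```python
import subprocess


def link_is_up(probe_host, run=subprocess.call):
    """Return True if one ping to probe_host exits with status 0."""
    # -c 1: send a single echo request; -W 5: wait at most 5 seconds.
    return run(["ping", "-c", "1", "-W", "5", probe_host]) == 0


def reset_if_down(iface, probe_host, run=subprocess.call):
    """Run ifdown then ifup on iface if the probe fails.

    Returns True when a reset was attempted, False when the link looked fine.
    The `run` argument exists only so the logic can be exercised without
    touching a real interface.
    """
    if link_is_up(probe_host, run):
        return False
    run(["ifdown", iface])
    run(["ifup", iface])
    return True
```

Called from cron as root with something like `reset_if_down("eth4", "<some reachable host>")`, this limits the lockout to one cron interval without resetting a healthy link.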
is that I think bug 433594 is very likely the same problem. It stopped happening
for long enough that we closed that bug.
I think it might help if you disable TSO and/or LRO and/or GSO on the adapter.
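A minimal sketch of that suggestion, assuming a reasonably recent ethtool that accepts the `tso`, `gso` and `lro` keywords with `-K` (older versions may not know all three). The helper name is hypothetical, and a driver that does not support a feature will make ethtool report an error for it.

```python
import subprocess


def offload_off_command(iface, features=("tso", "gso", "lro")):
    """Build the ethtool argument list that turns the given offloads off."""
    cmd = ["ethtool", "-K", iface]
    for feature in features:
        cmd += [feature, "off"]
    return cmd


def disable_offloads(iface, features=("tso", "gso", "lro")):
    """Run the command (needs root); return ethtool's exit status."""
    return subprocess.call(offload_off_command(iface, features))
```

Running `ethtool -k <iface>` (lower-case k) beforehand shows which offloads are currently enabled, so it is easy to tell whether there is anything to disable.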
I don't think the built-in device does offloading. ethtool didn't show any
offloading turned on.
This is the ethtool -i output:
And from lspci:
01:08.0 Ethernet controller: Intel Corporation 82801BA/BAM/CA/CAM Ethernet
Controller (rev 01)
I do have some other cheap cards in that box and maybe they wouldn't have this
problem so I can try swapping which one is used for my external link.
I am still seeing this with the 2.6.25-0.218.rc8.git7.fc9.i686 kernel.
Since I haven't noticed this happen on the other interfaces, there is a
reasonable chance that this is a bug specific to the e100 driver. That driver
won't be used on another machine (where it would cause more of a problem). I
haven't seen a lockup on the other interfaces on the machine where the problem
has been occurring. They are different hardware, but also don't get stressed as
often. None of the network devices are common between the two machines. So I'll
do some minimal testing and then just risk the upgrade.
The e100 driver was still having this issue with 2.6.25-1.fc9.i686. I am now
using a different card using a different driver for the connection that was
causing problems. Since I couldn't reliably get the problem to occur it may take
a bit for it to happen again or to have some confidence that the network hang
part of the issue is driver specific.
I haven't noticed this problem since switching my outside link to use a
different network card. I have also not seen that issue on another machine of
similar size that also does not use the e100 driver. While it hasn't been long
enough to be sure (and I have upgraded the kernel to 2.6.25-14), this does point
to the e100 driver having a defect.
I still haven't seen this problem recur since I stopped using the e100 NIC.
I think it is very likely this is an e100 driver problem.
Changing version to '9' as part of upcoming Fedora 9 GA.
More information and reason for this action is here:
I retired the machine (at work) that had the hardware with the problem and I haven't seen it happen on any of the other NICs I have. So going forward I probably won't be able to help test any fixes.
This message is a reminder that Fedora 9 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 9. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora
'version' of '9'.
Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version'
to a later Fedora version prior to Fedora 9's end of life.
Bug Reporter: Thank you for reporting this issue and we are sorry that
we may not be able to fix it before Fedora 9 is end of life. If you
would still like to see this bug fixed and are able to reproduce it
against a later version of Fedora please change the 'version' of this
bug to the applicable version. If you are unable to change the version,
please add a comment here and someone will do it for you.
Although we aim to fix as many bugs as possible during every release's
lifetime, sometimes those efforts are overtaken by events. Often a
more recent Fedora release includes newer upstream software that fixes
bugs or makes them obsolete.
The process we are following is described here:
Fedora 9 changed to end-of-life (EOL) status on 2009-07-10. Fedora 9 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.
If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version.
Thank you for reporting this bug and we are sorry it could not be fixed.