440028 – multi-page atomic allocations fail under memory pressure

Summary: multi-page atomic allocations fail under memory pressure

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	9
Hardware:	All
OS:	Linux
Priority:	low
Severity:	low
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2008-04-01 13:34 UTC by Bruno Wolff III
Modified:	2009-07-14 15:43 UTC (History)
CC List:	0 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2009-07-14 15:43:40 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
/var/log/messages extract (310.44 KB, text/plain) 2008-04-01 13:34 UTC, Bruno Wolff III
/var/log/messages extract (200.77 KB, text/plain) 2008-04-05 19:07 UTC, Bruno Wolff III

Description Bruno Wolff III 2008-04-01 13:34:53 UTC

Description of problem:
This morning I got disconnected from my machine. When I got into the office I
went through the logs and found some kernel errors logged at about the same time
that the network interface broke. I found the machine and the interface still up,
but that interface no longer had the correct network information attached to it,
so I used ifdown and ifup to get it running again. Another network interface
was still working properly.
I am attaching part of /var/log/messages that includes the errors.
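For anyone checking their own logs for the same symptom, here is a minimal sketch of a scan for multi-page (order greater than 0) allocation failures. It assumes the kernel warning has the form "<comm>: page allocation failure. order:N, mode:0x..." as printed by kernels of this era; the function name is hypothetical, and the attached extracts are not reproduced here.

```python
import re

# Assumed message format: "<comm>: page allocation failure. order:N, mode:0xHEX"
FAILURE_RE = re.compile(
    r"(?P<comm>\S+): page allocation failure\. "
    r"order:(?P<order>\d+), mode:0x(?P<mode>[0-9a-fA-F]+)"
)


def multi_page_failures(lines):
    """Return (comm, order, mode) for each failure with order > 0.

    An order-N request asks for 2**N contiguous pages, so order > 0
    means a multi-page allocation.
    """
    hits = []
    for line in lines:
        match = FAILURE_RE.search(line)
        if match and int(match.group("order")) > 0:
            hits.append((
                match.group("comm"),
                int(match.group("order")),
                int(match.group("mode"), 16),
            ))
    return hits


# Usage (path as named in this report):
#   with open("/var/log/messages") as log:
#       for comm, order, mode in multi_page_failures(log):
#           print(comm, order, hex(mode))
```

The mode value is returned undecoded, since the meaning of the individual flag bits depends on the kernel version.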

Version-Release number of selected component (if applicable):
2.6.25-0.172.rc7.git4.fc9.i686
The machine has a single Pentium III CPU.

How reproducible:
I am not sure. I have seen these symptoms previously, but not for several weeks
and that was a lot of kernel updates ago.
At the time it happened I was syncing up (using lftp) my copy of rawhide. At
least one time in the past I was doing the same thing when I got a similar kind
of failure.

Comment 1 Bruno Wolff III 2008-04-01 13:34:53 UTC

Created attachment 299889
/var/log/messages extract

Comment 2 Bruno Wolff III 2008-04-05 19:07:15 UTC

Created attachment 301387
/var/log/messages extract

It happened again with kernel 2.6.25-0.195.rc8.git1.fc9.i686.
I was wrong about the interface losing address information. I probably just
looked at the wrong device.
I have attached another log extract that covers from when I lost access until I
restarted the network device (eth4).

Comment 3 Dave Jones 2008-04-05 19:40:52 UTC

I brought this up on Linux-kernel last week.  http://lkml.org/lkml/2008/4/1/428
Discussion is ongoing.

Comment 4 Bruno Wolff III 2008-04-06 05:07:08 UTC

I looked through that thread and would like to note that this is not without
negative effects. Something bad happens to the network interface that was under
load when this happened. It doesn't seem to die all at once. (My ssh session
died in the middle of an lftp run, but when I got to the box I found the lftp
run had completed, even though it would have needed to have run for another
several minutes past the point where the ssh session locked up.) Eventually no
inbound or outbound connection attempts (or pings) work until I reset that
interface.

Comment 5 Bruno Wolff III 2008-04-08 15:16:36 UTC

I saw this again with 2.6.25-0.204.rc8.git4.fc9.i686. I noticed when my network
connection failed. ifdown followed by ifup got things working before any of my
ssh connections timed out.

Comment 6 Bruno Wolff III 2008-04-11 15:20:22 UTC

It also happened with 2.6.25-0.212.rc8.git6.fc9.i686. Again while using lftp to
mirror the x86 rawhide tree.

Comment 7 Bruno Wolff III 2008-04-11 18:00:20 UTC

Is there something I can do to help track this down? It is annoying to get
locked out of the machine when doing stuff remotely (though I do have a cron job
resetting the network interface to limit how long I get locked out), and I
have another machine I want to upgrade to F9 for which this would be even more
of a problem. So I have some extra incentive to help get this fixed.
Also, one piece of info that may give a hint as to what changes affected this
is that I think bug 433594 is very likely the same problem. It stopped happening
for long enough that we closed that bug.
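The cron job mentioned above is not shown in this report. As a rough sketch of one way such a workaround could look, the script below pings a probe host once and bounces the interface with ifdown and ifup only if the ping fails. The function names and the probe address are hypothetical, and the actual cron job may simply reset the interface unconditionally.

```python
import subprocess


def reset_commands(iface):
    """Commands that bounce the interface, as done by hand in this report."""
    return [["ifdown", iface], ["ifup", iface]]


def check_and_reset(iface, probe_host, run=subprocess.run):
    """Ping probe_host once; if the ping fails, bounce iface.

    Returns True if a reset was attempted. `run` is injectable so the
    logic can be exercised without touching the network.
    """
    ping = run(
        ["ping", "-c", "1", "-W", "5", probe_host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    if ping.returncode == 0:
        return False
    for cmd in reset_commands(iface):
        run(cmd)
    return True


# Usage from cron (needs root; 192.0.2.1 is a placeholder address):
#   check_and_reset("eth4", "192.0.2.1")
```

Probing before resetting avoids dropping working connections on every cron run, at the cost of depending on the probe host being reachable.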

Comment 8 Chuck Ebbert 2008-04-11 22:31:46 UTC

I think it might help if you disable TSO and/or LRO and/or GSO on the adapter.
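A minimal sketch of the suggestion above, assuming the interface is eth4 (the device named earlier in this report) and that the installed ethtool accepts the tso, gso and lro keywords for -K. The helper name is hypothetical. One command is built per feature so that a feature the driver does not support does not prevent the others from being changed.

```python
OFFLOADS = ("tso", "gso", "lro")


def offload_disable_commands(iface, features=OFFLOADS):
    """Build one `ethtool -K <iface> <feature> off` command per feature."""
    return [["ethtool", "-K", iface, feature, "off"] for feature in features]


# Usage (needs root):
#   import subprocess
#   for cmd in offload_disable_commands("eth4"):
#       subprocess.run(cmd, check=False)  # ignore unsupported features
```

Running `ethtool -k eth4` before and after shows the current offload settings.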

Comment 9 Bruno Wolff III 2008-04-12 22:02:39 UTC

I don't think the built-in device does offloading. ethtool didn't show any
offloading turned on.
This is the ethtool -i output:
driver: e100
version: 3.5.23-k4-NAPI
firmware-version: N/A
bus-info: 0000:01:08.0
And from lspci:
01:08.0 Ethernet controller: Intel Corporation 82801BA/BAM/CA/CAM Ethernet
Controller (rev 01)
I do have some other cheap cards in that box and maybe they wouldn't have this
problem so I can try swapping which one is used for my external link.
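For comparing machines, the `ethtool -i` output quoted above can be turned into a dictionary to check which driver an interface uses. This is only a sketch: the function name is hypothetical, and the sample text is copied from this comment.

```python
def parse_ethtool_info(text):
    """Parse `ethtool -i` output ("key: value" lines) into a dict."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only; bus-info values contain colons.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


SAMPLE = """driver: e100
version: 3.5.23-k4-NAPI
firmware-version: N/A
bus-info: 0000:01:08.0"""

info = parse_ethtool_info(SAMPLE)
uses_e100 = info.get("driver") == "e100"
```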

Comment 10 Bruno Wolff III 2008-04-17 16:23:14 UTC

I am still seeing this with the 2.6.25-0.218.rc8.git7.fc9.i686 kernel.

Comment 11 Bruno Wolff III 2008-04-18 19:55:11 UTC

Since I haven't noticed this happen on the other interfaces, there is a
reasonable chance that this is a bug specific to the e100 driver. That driver
won't be used on another machine (where it would cause more of a problem). I
haven't seen a lockup on the other interfaces on the machine where the problem
has been occurring. They are different hardware, but also don't get stressed as
often. None of the network devices are common between the two machines. So I'll
do some minimal testing and then just risk the upgrade.

Comment 12 Bruno Wolff III 2008-04-23 20:27:29 UTC

The e100 driver was still having this issue with 2.6.25-1.fc9.i686. I am now
using a different card using a different driver for the connection that was
causing problems. Since I couldn't reliably get the problem to occur it may take
a bit for it to happen again or to have some confidence that the network hang
part of the issue is driver specific.

Comment 13 Bruno Wolff III 2008-05-03 17:21:19 UTC

I haven't noticed this problem since switching my outside link to use a
different network card. I have also not seen that issue on another machine of
similar size that also does not use the e100 driver. While it hasn't been long
enough to be sure (and I have upgraded the kernel to 2.6.25-14), this does point
to the e100 driver having a defect.

Comment 14 Bruno Wolff III 2008-05-12 05:26:38 UTC

I still haven't seen this problem recur since I stopped using the e100 NIC.
I think it is very likely this is an e100 driver problem.

Comment 15 Bug Zapper 2008-05-14 08:31:27 UTC

Changing version to '9' as part of upcoming Fedora 9 GA.
More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 16 Bruno Wolff III 2008-08-23 15:32:27 UTC

I retired the machine (at work) that had the hardware with the problem and I haven't seen it happen on any of the other NICs I have. So going forward I probably won't be able to help test any fixes.

Comment 17 Bug Zapper 2009-06-09 23:57:36 UTC

This message is a reminder that Fedora 9 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 9.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '9'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 9's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 9 is end of life.  If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 18 Bug Zapper 2009-07-14 15:43:40 UTC

Fedora 9 changed to end-of-life (EOL) status on 2009-07-10. Fedora 9 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.
