Red Hat Bugzilla – Bug 154680
Kernel panic on 8GB machines under stress running e1000 diagnostics
Last modified: 2008-08-02 19:40:32 EDT
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.6) Gecko/20050323 Firefox/1.0.2 Fedora/1.0.2-1.3.1
Description of problem:
Test box in question is running ctcs (http://sourceforge.net/projects/va-ctcs/) on a dual 3.2 GHz Xeon box with 8 GB of DRAM and 4 e1000 ports. The box runs stable until a simple script repeatedly calls "ethtool -t <interface> online" and "ethtool -t <interface> offline". I don't know if it matters, but this script randomizes the order in which the ethtool tests are called; it uses a lock file, so only one test runs at a time.
The ethtool script runs without error as long as ctcs is not running, and ctcs runs without error as long as the ethtool script is not running.
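For reference, the stress script described above can be sketched roughly like this (a hypothetical reconstruction, not the original script; the interface names, lock path, and lock mechanism are assumptions):

```shell
#!/bin/sh
# Hypothetical reconstruction of the stress script described above:
# run the e1000 self-tests in random order, serialized by a lock file.
LOCKFILE=${LOCKFILE:-/tmp/ethtool-stress.lock}

run_ethtool_tests() {
    # $@ = interfaces to test, e.g. eth0 eth1 eth2 eth3
    # sort -R (GNU coreutils) supplies the randomized order.
    for ifc in $(printf '%s\n' "$@" | sort -R); do
        # mkdir is atomic, so it doubles as a simple lock file:
        # only one self-test runs at a time.
        while ! mkdir "$LOCKFILE" 2>/dev/null; do
            sleep 1
        done
        ethtool -t "$ifc" offline
        ethtool -t "$ifc" online
        rmdir "$LOCKFILE"
    done
}
```

The original script apparently loops like this indefinitely; the sketch does a single pass so it can be driven from an outer loop.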
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Run ctcs
2. Run script to run ethtool -t on all interfaces
Actual Results: After between 17 minutes and 2 hours, the kernel panics.
Expected Results: The tests should run without error.
Created attachment 113094 [details]
Sample oops output
Created attachment 113095 [details]
Sample oops output
Created attachment 113096 [details]
Sample oops output
Created attachment 113097 [details]
Sample oops output - tainted
Yes, it's tainted, but it might be useful.
Created attachment 113098 [details]
Sample oops output - tainted
Yet another tainted oops output.
Created attachment 113099 [details]
Oops output with kernel patched for bugzilla 151920.
Running ctcs on this box sometimes fails due to bugzilla 151920, so this oops is
from a kernel that has the patch for that bug applied.
I have test kernels built with an updated e1000 driver available here:
Please try to recreate the problem using these kernels, and post the results
(including any oopses). Thanks!
Created attachment 114017 [details]
Oops with latest test kernel.
The problem still occurs. This took 4 minutes to happen.
Created attachment 114051 [details]
Another oops with latest test kernel.
This one took 11 hours to occur.
Created attachment 114063 [details]
Yet another oops
This took 1h49m to fail.
David, thanks for your cooperation so far!
Those oopses look somewhat like a problem we are seeing elsewhere. I have
included a patch for the problem in the test kernels at the location
referenced in comment 7. Could you give the new kernels a try to see if you
can recreate this problem with them? Thanks!
Created attachment 114154 [details]
Oops with latest kernel (2.4.21-32.3.EL.jwltest.22smp)
This failure took 8h34m to occur.
Well, there is a slightly later version of that patch available in my new test
kernels (2.4.21-32.3.EL.jwltest.24). It might be worth a shot to try that as
well unless/until I get a better idea... :-)
Created attachment 114243 [details]
oops with an even newer kernel (2.4.21-32.3.EL.jwltest.24smp)
This took 5h46m to fail.
David, have you tried testing with the config changes from bug 151054 comment
26? If so, does it change the situation?
Created attachment 115404 [details]
oops with 2.4.21-32.6.EL.jwltest.29smp
Problem still occurs, even with requested change to ifup. This run took 5h26m
to fail on a box with 16 e1000 interfaces.
Have you tested it with memtest86+ in order to be 100% sure you don't have bad memory?
I will run memtest86+. This problem has been observed on multiple servers and
memtest has been run on them in the past, but it never hurts to rule things out.
I also set up netdump and it captured an 8 GB vmcore. Let me know if this is
useful to you in any way.
Apparently you can upload it to enterprise.redhat.com through anonymous ftp.
Please do so, and let me know the filename you use...thanks!
I have been unable to upload to enterprise.redhat.com. Eventually the ftp
times out. I do not know if it is at my end or not. I would recommend
cleaning out all of the files in incoming that start with vmcore.tklc*. I will
continue to try to get this to work.
David, were you ever able to upload successfully?
Also, I have yet another e1000 update in the kernels at the location from
comment 7. Would you mind testing with them "just in case"? Thanks!
Funny you should ask. I finally got the files to push up today. I had some
problems, so I had to split the file up. Here are the md5sums for the files, in
case something got corrupted:
cat them all together in order and you should get the single original file.
I will also see about testing with the other kernel. At the moment I am testing
with Intel's latest e1000 driver.
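The split/checksum/reassemble procedure described in the comment above can be sketched as a pair of shell helpers (split_with_checksum and reassemble_and_verify are hypothetical names; the actual file names and chunk size used for the upload are not recorded here):

```shell
# Sketch of the split/checksum/reassemble procedure described above.
# Function names, file names, and chunk sizes are illustrative.

split_with_checksum() {   # $1 = file, $2 = chunk size (e.g. 512m)
    md5sum "$1" > "$1.md5"          # record checksum of the whole file
    split -b "$2" "$1" "$1.part."   # -> file.part.aa, file.part.ab, ...
}

reassemble_and_verify() { # $1 = file
    cat "$1.part."* > "$1"          # concatenate the parts in order
    md5sum -c "$1.md5"              # verify nothing was corrupted
}
```

For example, split_with_checksum vmcore 512m before uploading, then, once all parts are downloaded, reassemble_and_verify vmcore on the receiving side.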
I have tested with 2.4.21-37.8.EL.jwltest.71smp/i686. It failed as well:
Unable to handle kernel NULL pointer dereference at virtual address 00000077
*pde = 24929001
*pte = 1c723067
netconsole sr_mod sg autofs4 iptable_filter ip_tables e1000 microcode ide-scsi
ide-cd cdrom usb-storage loop keybdev mousedev hid input ehci-hcd usb-uhci usbcore
EIP: 0060:[<f89eddc7>] Not tainted
EIP is at e1000_free_desc_rings [e1000] 0xb7 (2.4.21-37.8.EL.jwltest.71smp/i686)
eax: e0ffc000 ebx: 00000001 ecx: 00000100 edx: ffffffff
esi: cc1baf58 edi: cc1baf30 ebp: 00000014 esp: e6539e58
ds: 0068 es: 0068 ss: 0068
Process ethtool (pid: 1093, stackpage=e6539000)
Stack: cc3af400 00000000 c0159c1c c0113f1a 00000246 00000000 00000014 cdaa8400
cc1ba9c0 cc1baf30 00000000 00000000 f89edf04 cc1ba9c0 00001000 cc1baf34
00000002 7949a9c0 cdaa8400 cc1baf58 d5935318 cc1ba9c0 e6539f00 00000001
Call Trace: [<c0159c1c>] __get_free_pages [kernel] 0x1c (0xe6539e60)
[<c0113f1a>] pci_alloc_consistent [kernel] 0x4a (0xe6539e64)
[<f89edf04>] e1000_setup_desc_rings [e1000] 0x74 (0xe6539e88)
[<f89eec7b>] e1000_loopback_test [e1000] 0x1b (0xe6539eb8)
[<f89eef21>] e1000_diag_test [e1000] 0x131 (0xe6539ec8)
[<f89f6700>] e1000_ethtool_ops [e1000] 0x0 (0xe6539edc)
[<c0233849>] ethtool_self_test [kernel] 0xa9 (0xe6539eec)
[<c0233da3>] dev_ethtool [kernel] 0x263 (0xe6539f20)
[<c0231cb4>] dev_ioctl [kernel] 0x124 (0xe6539f40)
[<c0226370>] sock_ioctl [kernel] 0x40 (0xe6539f80)
[<c0178ff6>] sys_ioctl [kernel] 0xf6 (0xe6539f94)
[<c02af06f>] no_timing [kernel] 0x7 (0xe6539fc0)
Code: 8b 42 78 48 74 0b f0 ff 4a 78 0f 94 c0 84 c0 74 08 89 14 24
CPU#0 is frozen.
CPU#1 is frozen.
CPU#2 is executing netdump.
CPU#3 is frozen.
< netdump activated - performing handshake with the server. >
Created attachment 120791 [details]
It looks like there could be a memory leak in e1000_loopback_test if
New test kernels w/ the above patch are available at the same location as in
comment 7. Please give them a try and post the results here...thanks!
Created attachment 120816 [details]
Inspiration struck... I think the test rings are not getting totally
cleaned up, leading to a possible oops if subsequent tests fail.
Even newer test kernels are available at the same location as in comment 7.
Please give them a try and post the results here... thanks!
I ran kernel 2.4.21-37.8.EL.jwltest.73smp for the last 20 hours and 25 minutes
without error. Looks like you have found it! I'll set up a longer test run
just to be sure.
Created attachment 120860 [details]
David, that is great news! Can I prevail upon you a bit more? That patch is a
bit blunt. I need something sharper to send upstream. Would you mind testing
the latest kernels at the same location as comment 7? Thanks!
I have retested with kernel version 2.4.21-37.8.EL.jwltest.74smp. It ran over
the weekend without failure: 2 days, 16 hours. Do you feel this is your final patch?
Yes, this is the patch I'm pushing upstream.
Thank you so much for the patience and the test results!
A fix for this problem has just been committed to the RHEL3 U8
patch pool this evening (in kernel version 2.4.21-40.3.EL).
A kernel has been released that contains a patch for this problem. Please
verify if your problem is fixed with the latest available kernel from the RHEL3
public beta channel at rhn.redhat.com and post your results to this bugzilla.
Reverting to ON_QA.
I have retested with kernel 2.4.21-44.ELsmp. It ran without errors for 2 days
and 22.5 hours. I would say it is fixed. Thanks.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.