From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.6) Gecko/20050323 Firefox/1.0.2 Fedora/1.0.2-1.3.1

Description of problem:
The test box in question is running ctcs (http://sourceforge.net/projects/va-ctcs/) on a dual 3.2 GHz Xeon box with 8GB of DRAM and 4 e1000 ports. This box runs stably until a simple script that repeatedly calls "ethtool -t <interface> online" and "ethtool -t <interface> offline" is started. I don't know if it matters, but the script randomizes the order in which the ethtool tests are called; it uses a lock file, so only one test runs at a time. The ethtool -t script runs without error as long as ctcs is not running, and ctcs runs without error if the ethtool script is not running.

Version-Release number of selected component (if applicable):
2.4.21-27.0.2.ELsmp

How reproducible:
Always

Steps to Reproduce:
1. Run ctcs
2. Run the script that runs ethtool -t on all interfaces
3. Wait

Actual Results:
After between 17 minutes and 2 hours, the kernel panics.

Expected Results:
The tests should run without error.

Additional info:
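For context, "ethtool -t <interface> offline" boils down to an ETHTOOL_TEST ioctl on the interface, which is the code path the oopses below show (sys_ioctl -> dev_ioctl -> dev_ethtool -> ethtool_self_test -> e1000_diag_test). The following is a minimal user-space sketch of that call, not the actual test script: it assumes a generous fixed-size result buffer instead of querying ETHTOOL_GDRVINFO for the driver's exact test count, and it needs root (CAP_NET_ADMIN) to run.

/* Minimal sketch of what "ethtool -t eth0 offline" does at the kernel
 * interface: issue an ETHTOOL_TEST ioctl through SIOCETHTOOL.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/types.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

#define MAX_TEST_RESULTS 32  /* assumption: enough room for any driver's self-test results */

int main(int argc, char **argv)
{
	const char *ifname = (argc > 1) ? argv[1] : "eth0";
	struct ethtool_test *test;
	struct ifreq ifr;
	unsigned int i;
	int fd;

	fd = socket(AF_INET, SOCK_DGRAM, 0);
	if (fd < 0) {
		perror("socket");
		return 1;
	}

	test = calloc(1, sizeof(*test) + MAX_TEST_RESULTS * sizeof(__u64));
	if (!test)
		return 1;
	test->cmd = ETHTOOL_TEST;
	test->flags = ETH_TEST_FL_OFFLINE;  /* omit this flag for the online variant */

	memset(&ifr, 0, sizeof(ifr));
	strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
	ifr.ifr_data = (char *)test;

	if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) {
		perror("SIOCETHTOOL/ETHTOOL_TEST");
		return 1;
	}

	printf("%s self-test %s\n", ifname,
	       (test->flags & ETH_TEST_FL_FAILED) ? "FAILED" : "passed");
	for (i = 0; i < test->len && i < MAX_TEST_RESULTS; i++)
		printf("  result[%u] = %llu\n", i, (unsigned long long)test->data[i]);

	free(test);
	close(fd);
	return 0;
}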
Created attachment 113094 [details] Sample oops output
Created attachment 113095 [details] Sample oops output
Created attachment 113096 [details] Sample oops output
Created attachment 113097 [details] Sample oops output - tainted Yes, it's tainted, but it might be useful.
Created attachment 113098 [details] Sample oops output - tainted Yet another tainted oops output.
Created attachment 113099 [details] Oops output with kernel patched for bugzilla 151920. Running ctcs on this box sometimes fails due to bug 151920. This oops is from a kernel with the patch for that bug applied.
I have test kernels built with an updated e1000 driver available here: http://people.redhat.com/linville/kernels/rhel3/ Please try to recreate the problem using these kernels, and post the results (including any oopses). Thanks!
Created attachment 114017 [details] Oops with latest test kernel. The problem still occurs. This took 4 minutes to happen.
Created attachment 114051 [details] Another oops with latest test kernel. This one took 11 hours to occur.
Created attachment 114063 [details] Yet another oops This took 1h49m to fail.
David, thanks for your cooperation so far! Those oopses look somewhat like a problem we are seeing elsewhere. I have included a patch for the problem in the test kernels at the location referenced in comment 7. Could you give the new kernels a try to see if you can recreate this problem with them? Thanks!
Created attachment 114154 [details] Oops with latest kernel (2.4.21-32.3.EL.jwltest.22smp) This failure took 8h34m to occur.
Well, there is a slightly later version of that patch available in my new test kernels (2.4.21-32.3.EL.jwltest.24). It might be worth a shot to try that as well unless/until I get a better idea... :-)
Created attachment 114243 [details] oops with an even newer kernel (2.4.21-32.3.EL.jwltest.24smp) This took 5h46m to fail.
David, have you tried testing w/ the config changes from bug 151054 comment 26? If so, does it change the situation?
Created attachment 115404 [details] oops with 2.4.21-32.6.EL.jwltest.29smp Problem still occurs, even with requested change to ifup. This run took 5h26m to fail on a box with 16 e1000 interfaces.
Have you tested it with memtest86+ in order to be 100% sure you don't have bad RAM? http://www.memtest.org/
I will run memtest86+. This problem has been observed on multiple servers, and memtest has been run on them in the past, but it never hurts to rule things out. I also set up netdump, and it captured an 8GB vmcore. Let me know if this is useful to you in any way.
Apparently you can upload it to enterprise.redhat.com through anonymous ftp. Please do so, and let me know the filename you use...thanks!
I have been unable to upload to enterprise.redhat.com. Eventually the ftp times out. I do not know if it is at my end or not. I would recommend cleaning out all of the files in incoming that start with vmcore.tklc*. I will continue to try to get this to work.
David, were you ever able to upload successfully? Also, I have yet another e1000 update in the kernels at the location from comment 7. Would you mind testing with them "just in case"? Thanks!
Funny you should ask. I finally got the files pushed up today. I had some problems, so I had to split the file up. Here are the md5sums for the files, in case something got corrupted:

3f91b3c4e20915c311f3472e7eebb957  vmcore.tklc.part1a.bz2
bfa2dd9d20e546046b622bd38f1e3d40  vmcore.tklc.part2.bz2
027d52b1608e461f19d022b89a3166ca  vmcore.tklc.part3.bz2
01548ab8d6441b7755201db9736a50d8  vmcore.tklc.part4.bz2
b9b138a3d196413992cbfc3004c276b5  vmcore.tklc.part5.bz2
fc95cf5dcf3d926e636dfd7e9a435acc  vmcore.tklc.part6.bz2
14cef11d99b4726a1d72db2f86d75b5a  vmcore.tklc.part7.bz2
4259b3f593cf83210635bf3fcb568f81  vmcore.tklc.part8.bz2

cat them all together in order and you should get a single file with the following checksum:

66fcfcec90885b2fe107d466c58c6a96  vmcore.tklc.bz2

I will also see about testing with the other kernel. At the moment I am testing with Intel's latest e1000 driver.
I have tested with 2.4.21-37.8.EL.jwltest.71smp/i686. It failed as well:

Unable to handle kernel NULL pointer dereference at virtual address 00000077
*pde = 24929001
*pte = 1c723067
Oops: 0000
netconsole sr_mod sg autofs4 iptable_filter ip_tables e1000 microcode ide-scsi ide-cd cdrom usb-storage loop keybdev mousedev hid input ehci-hcd usb-uhci usbc
CPU:    2
EIP:    0060:[<f89eddc7>]    Not tainted
EFLAGS: 00010286
EIP is at e1000_free_desc_rings [e1000] 0xb7 (2.4.21-37.8.EL.jwltest.71smp/i686)
eax: e0ffc000   ebx: 00000001   ecx: 00000100   edx: ffffffff
esi: cc1baf58   edi: cc1baf30   ebp: 00000014   esp: e6539e58
ds: 0068   es: 0068   ss: 0068
Process ethtool (pid: 1093, stackpage=e6539000)
Stack: cc3af400 00000000 c0159c1c c0113f1a 00000246 00000000 00000014 cdaa8400
       cc1ba9c0 cc1baf30 00000000 00000000 f89edf04 cc1ba9c0 00001000 cc1baf34
       00000002 7949a9c0 cdaa8400 cc1baf58 d5935318 cc1ba9c0 e6539f00 00000001
Call Trace:   [<c0159c1c>] __get_free_pages [kernel] 0x1c (0xe6539e60)
  [<c0113f1a>] pci_alloc_consistent [kernel] 0x4a (0xe6539e64)
  [<f89edf04>] e1000_setup_desc_rings [e1000] 0x74 (0xe6539e88)
  [<f89eec7b>] e1000_loopback_test [e1000] 0x1b (0xe6539eb8)
  [<f89eef21>] e1000_diag_test [e1000] 0x131 (0xe6539ec8)
  [<f89f6700>] e1000_ethtool_ops [e1000] 0x0 (0xe6539edc)
  [<c0233849>] ethtool_self_test [kernel] 0xa9 (0xe6539eec)
  [<c0233da3>] dev_ethtool [kernel] 0x263 (0xe6539f20)
  [<c0231cb4>] dev_ioctl [kernel] 0x124 (0xe6539f40)
  [<c0226370>] sock_ioctl [kernel] 0x40 (0xe6539f80)
  [<c0178ff6>] sys_ioctl [kernel] 0xf6 (0xe6539f94)
  [<c02af06f>] no_timing [kernel] 0x7 (0xe6539fc0)
Code: 8b 42 78 48 74 0b f0 ff 4a 78 0f 94 c0 84 c0 74 08 89 14 24

CPU#0 is frozen.
CPU#1 is frozen.
CPU#2 is executing netdump.
CPU#3 is frozen.
< netdump activated - performing handshake with the server. >
Created attachment 120791 [details] jwltest-e1000-loopback.patch It looks like there could be a memory leak in e1000_loopback_test if e1000_setup_loopback_test fails.
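To make the suspected leak concrete, here is a toy user-space model of that control flow. The helper names echo the symbols in the oops above, but the bodies (malloc/free stand-ins) and the exact structure are assumptions for illustration, not the RHEL3 e1000 source:

/* Toy model of the suspected e1000_loopback_test control flow: if the
 * loopback setup step fails after the descriptor rings were allocated,
 * the early return skips the ring teardown and leaks that memory.
 */
#include <stdio.h>
#include <stdlib.h>

struct adapter { void *tx_ring; void *rx_ring; };

static int setup_desc_rings(struct adapter *a)   /* stands in for e1000_setup_desc_rings */
{
	a->tx_ring = malloc(4096);
	a->rx_ring = malloc(4096);
	return (a->tx_ring && a->rx_ring) ? 0 : 1;
}

static void free_desc_rings(struct adapter *a)   /* stands in for e1000_free_desc_rings */
{
	free(a->tx_ring);
	free(a->rx_ring);
	a->tx_ring = a->rx_ring = NULL;
}

static int setup_loopback(struct adapter *a)     /* stands in for e1000_setup_loopback_test */
{
	(void)a;
	return 1;                                /* pretend the loopback setup failed */
}

static int loopback_test(struct adapter *a)
{
	int err;

	if ((err = setup_desc_rings(a)))
		return err;
	if ((err = setup_loopback(a)))
		return err;          /* leak: the rings allocated above are never freed */

	/* ... run the loopback traffic test here ... */
	free_desc_rings(a);          /* only reached on the success path */
	return 0;
}

int main(void)
{
	struct adapter a = { 0 };

	printf("loopback_test returned %d; tx_ring %s freed\n",
	       loopback_test(&a), a.tx_ring ? "was NOT" : "was");
	free_desc_rings(&a);         /* clean up after the demonstration */
	return 0;
}

Presumably the attached patch makes sure the rings are torn down on that error path as well.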
New test kernels w/ the above patch are available at the same location as in comment 7. Please give them a try and post the results here...thanks!
Created attachment 120816 [details] jwltest-e1000-loopback-2.patch
Inspiration struck...I think the test rings are not getting totally cleaned up, leading to a possible oops if subsequent tests fail. Even newer test kernels are available at the same location as in comment 7. Please give them a try and post the results here...thanks!
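As a hedged illustration of the "not totally cleaned up" idea (the names and layout here are assumptions, not the actual patch): if the teardown frees the test ring's buffers but leaves the counters and pointers behind, a later test whose setup fails partway can re-enter the teardown and walk stale pointers, which would fit an oops inside e1000_free_desc_rings. Resetting the bookkeeping at the end of the teardown removes that hazard:

/* Toy model: the first test run frees its buffers; without the final reset,
 * a second run's error path would walk and free the same stale pointers.
 */
#include <stdlib.h>
#include <string.h>

struct test_ring {
	char        **buffers;   /* stands in for per-descriptor skb pointers */
	unsigned int  count;
};

static void free_test_ring(struct test_ring *ring)
{
	unsigned int i;

	if (ring->buffers) {
		for (i = 0; i < ring->count; i++)
			free(ring->buffers[i]);
		free(ring->buffers);
	}

	/* The important part: forget everything, so re-entering this teardown
	 * (e.g. from a subsequent failed setup) touches nothing stale. */
	memset(ring, 0, sizeof(*ring));
}

int main(void)
{
	struct test_ring ring = { 0 };
	unsigned int i;

	ring.count = 4;
	ring.buffers = calloc(ring.count, sizeof(*ring.buffers));
	for (i = 0; ring.buffers && i < ring.count; i++)
		ring.buffers[i] = malloc(64);

	free_test_ring(&ring);   /* first, successful test run tears down its ring */
	free_test_ring(&ring);   /* a later error path: harmless only because of the reset */
	return 0;
}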
I ran kernel 2.4.21-37.8.EL.jwltest.73smp for the last 20 hours and 25 minutes without error. Looks like you have found it! I'll set up a test to run longer just to be sure. Thanks, David
Created attachment 120860 [details] jwltest-e1000-loopback-2.patch David, that is great news! Can I prevail upon you a bit more? That patch is a bit blunt. I need something sharper to send upstream. Would you mind testing the latest kernels at the same location as comment 7? Thanks!
I have retested with kernel version 2.4.21-37.8.EL.jwltest.74smp. It ran over the weekend (2 days, 16 hours) without failure. Do you feel this is your final patch?
Yes, this is the patch I'm pushing upstream. Thank you so much for the patience and the test results!
A fix for this problem has just been committed to the RHEL3 U8 patch pool this evening (in kernel version 2.4.21-40.3.EL).
A kernel has been released that contains a patch for this problem. Please verify if your problem is fixed with the latest available kernel from the RHEL3 public beta channel at rhn.redhat.com and post your results to this bugzilla.
Reverting to ON_QA.
I have retested with kernel 2.4.21-44.ELsmp. It ran without errors for 2 days and 22.5 hours. I would say it is fixed. Thanks.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2006-0437.html