From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.6) Gecko/20050323 Firefox/1.0.2 Fedora/1.0.2-1.3.1

Description of problem:
The test box in question is running ctcs (http://sourceforge.net/projects/va-ctcs/) on a dual 3.2 GHz Xeon box with 8GB of DRAM and 4 e1000 ports. This box runs stably until a simple script that repeatedly calls "ethtool -t <interface> online" and "ethtool -t <interface> offline" is started. I don't know if it matters, but the script randomizes the order in which the ethtool tests are called; it uses a lock file, so only one test runs at a time. The ethtool -t script runs without error as long as ctcs is not running, and ctcs runs without error if the ethtool script is not running.

Version-Release number of selected component (if applicable):
2.4.21-27.0.2.ELsmp

How reproducible:
Always

Steps to Reproduce:
1. Run ctcs
2. Run the script that runs ethtool -t on all interfaces
3. Wait

Actual Results:
After between 17 minutes and 2 hours, the kernel panics.

Expected Results:
The tests should run without error.

Additional info:
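For context, "ethtool -t <interface> offline" boils down to an ETHTOOL_TEST ioctl on the interface, which is the code path the oopses below show (sys_ioctl -> dev_ioctl -> dev_ethtool -> ethtool_self_test -> e1000_diag_test). The following is a minimal user-space sketch of that call, not the actual test script: it assumes a generous fixed-size result buffer instead of querying ETHTOOL_GDRVINFO for the driver's exact test count, and it needs root (CAP_NET_ADMIN) to run.

/* Minimal sketch of what "ethtool -t eth0 offline" does at the kernel
 * interface: issue an ETHTOOL_TEST ioctl through SIOCETHTOOL.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/types.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

#define MAX_TEST_RESULTS 32  /* assumption: enough room for any driver's self-test results */

int main(int argc, char **argv)
{
	const char *ifname = (argc > 1) ? argv[1] : "eth0";
	struct ethtool_test *test;
	struct ifreq ifr;
	unsigned int i;
	int fd;

	fd = socket(AF_INET, SOCK_DGRAM, 0);
	if (fd < 0) {
		perror("socket");
		return 1;
	}

	test = calloc(1, sizeof(*test) + MAX_TEST_RESULTS * sizeof(__u64));
	if (!test)
		return 1;
	test->cmd = ETHTOOL_TEST;
	test->flags = ETH_TEST_FL_OFFLINE;  /* omit this flag for the online variant */

	memset(&ifr, 0, sizeof(ifr));
	strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
	ifr.ifr_data = (char *)test;

	if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) {
		perror("SIOCETHTOOL/ETHTOOL_TEST");
		return 1;
	}

	printf("%s self-test %s\n", ifname,
	       (test->flags & ETH_TEST_FL_FAILED) ? "FAILED" : "passed");
	for (i = 0; i < test->len && i < MAX_TEST_RESULTS; i++)
		printf("  result[%u] = %llu\n", i, (unsigned long long)test->data[i]);

	free(test);
	close(fd);
	return 0;
}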
Created attachment 113094 [details] Sample oops output
Created attachment 113095 [details] Sample oops output
Created attachment 113096 [details] Sample oops output
Created attachment 113097 [details] Sample oops output - tainted Yes, it's tainted, but it might be useful.
Created attachment 113098 [details] Sample oops output - tainted Yet another tainted oops output.
Created attachment 113099 [details] Oops output with kernel patched for bugzilla 151920. Running ctcs on this box sometimes fails due to bug 151920. This oops is from a kernel with the patch for that bug applied.
I have test kernels built with an updated e1000 driver available here: http://people.redhat.com/linville/kernels/rhel3/ Please try to recreate the problem using these kernels, and post the results (including any oopses). Thanks!
Created attachment 114017 [details] Oops with latest test kernel. The problem still occurs. This took 4 minutes to happen.
Created attachment 114051 [details] Another oops with latest test kernel. This one took 11 hours to occur.
Created attachment 114063 [details] Yet another oops This took 1h49m to fail.
David, thanks for your cooperation so far! Those oopses look somewhat like a problem we are seeing elsewhere. I have included a patch for the problem in the test kernels at the location referenced in comment 7. Could you give the new kernels a try to see if you can recreate this problem with them? Thanks!
Created attachment 114154 [details] Oops with latest kernel (2.4.21-32.3.EL.jwltest.22smp) This failure took 8h34m to occur.
Well, there is a slightly later version of that patch available in my new test kernels (2.4.21-32.3.EL.jwltest.24). It might be worth a shot to try that as well unless/until I get a better idea... :-)
Created attachment 114243 [details] oops with an even newer kernel (2.4.21-32.3.EL.jwltest.24smp) This took 5h46m to fail.
David, have you tried testing w/ the config changes from bug 151054 comment 26? If so, does it change the situation?
Created attachment 115404 [details] oops with 2.4.21-32.6.EL.jwltest.29smp Problem still occurs, even with requested change to ifup. This run took 5h26m to fail on a box with 16 e1000 interfaces.
Have you tested it with memtest86+ in order to be 100% sure you don't have bad RAM? http://www.memtest.org/
I will run memtest86+. This problem has been observed on multiple servers, and memtest has been run on them in the past, but it never hurts to rule things out. I also set up netdump, and it captured an 8GB vmcore. Let me know if this is useful to you in any way.
Apparently you can upload it to enterprise.redhat.com through anonymous ftp. Please do so, and let me know the filename you use...thanks!
I have been unable to upload to enterprise.redhat.com. Eventually the ftp times out. I do not know if it is at my end or not. I would recommend cleaning out all of the files in incoming that start with vmcore.tklc*. I will continue to try to get this to work.
David, were you ever able to upload successfully? Also, I have yet another e1000 update in the kernels at the location from comment 7. Would you mind testing with them "just in case"? Thanks!
Funny you should ask. I finally got the files pushed up today. I had some problems, so I had to split the file up. Here are the md5sums for the files, in case something got corrupted:

3f91b3c4e20915c311f3472e7eebb957  vmcore.tklc.part1a.bz2
bfa2dd9d20e546046b622bd38f1e3d40  vmcore.tklc.part2.bz2
027d52b1608e461f19d022b89a3166ca  vmcore.tklc.part3.bz2
01548ab8d6441b7755201db9736a50d8  vmcore.tklc.part4.bz2
b9b138a3d196413992cbfc3004c276b5  vmcore.tklc.part5.bz2
fc95cf5dcf3d926e636dfd7e9a435acc  vmcore.tklc.part6.bz2
14cef11d99b4726a1d72db2f86d75b5a  vmcore.tklc.part7.bz2
4259b3f593cf83210635bf3fcb568f81  vmcore.tklc.part8.bz2

cat them all together in order and you should get a single file with the following checksum:

66fcfcec90885b2fe107d466c58c6a96  vmcore.tklc.bz2

I will also see about testing with the other kernel. At the moment I am testing with Intel's latest e1000 driver.
I have tested with 2.4.21-37.8.EL.jwltest.71smp/i686. It failed as well:

Unable to handle kernel NULL pointer dereference at virtual address 00000077
*pde = 24929001
*pte = 1c723067
Oops: 0000
netconsole sr_mod sg autofs4 iptable_filter ip_tables e1000 microcode ide-scsi ide-cd cdrom usb-storage loop keybdev mousedev hid input ehci-hcd usb-uhci usbc
CPU:    2
EIP:    0060:[<f89eddc7>]    Not tainted
EFLAGS: 00010286
EIP is at e1000_free_desc_rings [e1000] 0xb7 (2.4.21-37.8.EL.jwltest.71smp/i686)
eax: e0ffc000   ebx: 00000001   ecx: 00000100   edx: ffffffff
esi: cc1baf58   edi: cc1baf30   ebp: 00000014   esp: e6539e58
ds: 0068   es: 0068   ss: 0068
Process ethtool (pid: 1093, stackpage=e6539000)
Stack: cc3af400 00000000 c0159c1c c0113f1a 00000246 00000000 00000014 cdaa8400
       cc1ba9c0 cc1baf30 00000000 00000000 f89edf04 cc1ba9c0 00001000 cc1baf34
       00000002 7949a9c0 cdaa8400 cc1baf58 d5935318 cc1ba9c0 e6539f00 00000001
Call Trace:   [<c0159c1c>] __get_free_pages [kernel] 0x1c (0xe6539e60)
  [<c0113f1a>] pci_alloc_consistent [kernel] 0x4a (0xe6539e64)
  [<f89edf04>] e1000_setup_desc_rings [e1000] 0x74 (0xe6539e88)
  [<f89eec7b>] e1000_loopback_test [e1000] 0x1b (0xe6539eb8)
  [<f89eef21>] e1000_diag_test [e1000] 0x131 (0xe6539ec8)
  [<f89f6700>] e1000_ethtool_ops [e1000] 0x0 (0xe6539edc)
  [<c0233849>] ethtool_self_test [kernel] 0xa9 (0xe6539eec)
  [<c0233da3>] dev_ethtool [kernel] 0x263 (0xe6539f20)
  [<c0231cb4>] dev_ioctl [kernel] 0x124 (0xe6539f40)
  [<c0226370>] sock_ioctl [kernel] 0x40 (0xe6539f80)
  [<c0178ff6>] sys_ioctl [kernel] 0xf6 (0xe6539f94)
  [<c02af06f>] no_timing [kernel] 0x7 (0xe6539fc0)
Code: 8b 42 78 48 74 0b f0 ff 4a 78 0f 94 c0 84 c0 74 08 89 14 24

CPU#0 is frozen.
CPU#1 is frozen.
CPU#2 is executing netdump.
CPU#3 is frozen.
< netdump activated - performing handshake with the server. >
Created attachment 120791 [details] jwltest-e1000-loopback.patch It looks like there could be a memory leak in e1000_loopback_test if e1000_setup_loopback_test fails.
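To make the suspected leak concrete, here is a toy user-space model of that control flow. The helper names echo the symbols in the oops above, but the bodies (malloc/free stand-ins) and the exact structure are assumptions for illustration, not the RHEL3 e1000 source:

/* Toy model of the suspected e1000_loopback_test control flow: if the
 * loopback setup step fails after the descriptor rings were allocated,
 * the early return skips the ring teardown and leaks that memory.
 */
#include <stdio.h>
#include <stdlib.h>

struct adapter { void *tx_ring; void *rx_ring; };

static int setup_desc_rings(struct adapter *a)   /* stands in for e1000_setup_desc_rings */
{
	a->tx_ring = malloc(4096);
	a->rx_ring = malloc(4096);
	return (a->tx_ring && a->rx_ring) ? 0 : 1;
}

static void free_desc_rings(struct adapter *a)   /* stands in for e1000_free_desc_rings */
{
	free(a->tx_ring);
	free(a->rx_ring);
	a->tx_ring = a->rx_ring = NULL;
}

static int setup_loopback(struct adapter *a)     /* stands in for e1000_setup_loopback_test */
{
	(void)a;
	return 1;                                /* pretend the loopback setup failed */
}

static int loopback_test(struct adapter *a)
{
	int err;

	if ((err = setup_desc_rings(a)))
		return err;
	if ((err = setup_loopback(a)))
		return err;          /* leak: the rings allocated above are never freed */

	/* ... run the loopback traffic test here ... */
	free_desc_rings(a);          /* only reached on the success path */
	return 0;
}

int main(void)
{
	struct adapter a = { 0 };

	printf("loopback_test returned %d; tx_ring %s freed\n",
	       loopback_test(&a), a.tx_ring ? "was NOT" : "was");
	free_desc_rings(&a);         /* clean up after the demonstration */
	return 0;
}

Presumably the attached patch makes sure the rings are torn down on that error path as well.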
New test kernels w/ the above patch are available at the same location as in comment 7. Please give them a try and post the results here...thanks!
Created attachment 120816 [details] jwltest-e1000-loopback-2.patch
Inspiration struck...I think the test rings are not getting totally cleaned up, leading to a possible oops if subsequent tests fail. Even newer test kernels are available at the same location as in comment 7. Please give them a try and post the results here...thanks!
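As a hedged illustration of the "not totally cleaned up" idea (the names and layout here are assumptions, not the actual patch): if the teardown frees the test ring's buffers but leaves the counters and pointers behind, a later test whose setup fails partway can re-enter the teardown and walk stale pointers, which would fit an oops inside e1000_free_desc_rings. Resetting the bookkeeping at the end of the teardown removes that hazard:

/* Toy model: the first test run frees its buffers; without the final reset,
 * a second run's error path would walk and free the same stale pointers.
 */
#include <stdlib.h>
#include <string.h>

struct test_ring {
	char        **buffers;   /* stands in for per-descriptor skb pointers */
	unsigned int  count;
};

static void free_test_ring(struct test_ring *ring)
{
	unsigned int i;

	if (ring->buffers) {
		for (i = 0; i < ring->count; i++)
			free(ring->buffers[i]);
		free(ring->buffers);
	}

	/* The important part: forget everything, so re-entering this teardown
	 * (e.g. from a subsequent failed setup) touches nothing stale. */
	memset(ring, 0, sizeof(*ring));
}

int main(void)
{
	struct test_ring ring = { 0 };
	unsigned int i;

	ring.count = 4;
	ring.buffers = calloc(ring.count, sizeof(*ring.buffers));
	for (i = 0; ring.buffers && i < ring.count; i++)
		ring.buffers[i] = malloc(64);

	free_test_ring(&ring);   /* first, successful test run tears down its ring */
	free_test_ring(&ring);   /* a later error path: harmless only because of the reset */
	return 0;
}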
I ran kernel 2.4.21-37.8.EL.jwltest.73smp for the last 20 hours and 25 minutes without error. Looks like you have found it! I'll set up a test to run longer just to be sure. Thanks, David
Created attachment 120860 [details] jwltest-e1000-loopback-2.patch David, that is great news! Can I prevail upon you a bit more? That patch is a bit blunt. I need something sharper to send upstream. Would you mind testing the latest kernels at the same location as comment 7? Thanks!
I have retested with kernel version 2.4.21-37.8.EL.jwltest.74smp. It ran over the weekend (2 days, 16 hours) without failure. Do you feel this is your final patch?
Yes, this is the patch I'm pushing upstream. Thank you so much for the patience and the test results!
A fix for this problem has just been committed to the RHEL3 U8 patch pool this evening (in kernel version 2.4.21-40.3.EL).
A kernel has been released that contains a patch for this problem. Please verify if your problem is fixed with the latest available kernel from the RHEL3 public beta channel at rhn.redhat.com and post your results to this bugzilla.
Reverting to ON_QA.
I have retested with kernel 2.4.21-44.ELsmp. It ran without errors for 2 days and 22.5 hours. I would say it is fixed. Thanks.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2006-0437.html