Red Hat Bugzilla – Bug 150135
Kernel OOPS in jbd While Running Network Stress
Last modified: 2007-11-30 17:07:16 EST
Description of problem:
While running network stress through Samba to an ext3 disk partition,
a kernel oops has been repeatedly observed (see attachments).
Version-Release number of selected component (if applicable):
Takes on average about 6 hours to reproduce with 8 gigabit clients.
PE750, 1 x 3.0GHz CPU, HT enabled
Red Hat Enterprise Linux 4, kernel 2.6.9-5.ELhugemem
LOM0: inactive/not connected
LOM1: e1000-184.108.40.206, 8021q (2 VLANs)
Samba share: /share on ext3 partition
Stress: 8 Nettack clients, 4 per VLAN, reading/writing to /share over
the network.
NOTE: Nettack is Dell's home-grown network stress tool, which performs
file read/write/compare over the network in a loop. It has been used
reliably at Dell for several years.
Steps to Reproduce:
1. Install Red Hat Enterprise Linux 4 (everything, kernel-2.6.9-
5.ELhugemem) on an ext3 partition (/boot - 100MB - ext3, / - 10GB -
ext3, swap - 1GB).
2. Create a Samba share at /share.
3. Install the Intel gigabit NIC driver, e1000-220.127.116.11
4. Load the native 8021q module and create 2 VLANs on eth1.
5. Configure a switch (Dell 5012 used in original test) with two
VLANs to match the server.
6. Connect and run 4 Network Stress clients with gigabit NICs to each
VLAN against the samba share.
7. Wait for panic/oops to occur.
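The share and VLAN setup in steps 2-4 can be sketched as below. This is a hypothetical reconstruction for a RHEL 4-era system (share path and VLAN IDs are illustrative; the original VLAN IDs are not given in the report), not the exact configuration used:

```shell
# Create the share directory and export it via Samba (smb.conf fragment).
mkdir -p /share
cat >> /etc/samba/smb.conf <<'EOF'
[share]
   path = /share
   read only = no
   guest ok = yes
EOF
service smb restart

# Create two VLANs on eth1 with the native 8021q module
# (vconfig was the standard tool on 2.6.9-era kernels; VLAN IDs 2 and 3
# are placeholders -- the report does not state the IDs used).
modprobe 8021q
vconfig add eth1 2
vconfig add eth1 3
ifconfig eth1.2 up
ifconfig eth1.3 up
```

The switch-side VLAN configuration (step 5) must match the IDs chosen here.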
The issue has been reproduced about 6 times so far, including on
alternate platforms such as the SC1425. The 802.1q module has also been
ruled out, since one of the failing configurations did not have VLANs
set up.
Created attachment 111591 [details]
This is an OOPS trace from the last failure that did not have any sort of
special VLAN setup.
Created attachment 111593 [details]
This is the SysRq output from the same failure from the previous comment.
Created attachment 111614 [details]
sysreport from afflicted system
This is a bug I've seen reported elsewhere, but so far I have not been able to
reproduce it nor to get to the bottom of it. I do have a couple of debugging
patches which implement extensive ext3 buffer tracing and a few extra
consistency checks; it would be very helpful if we could get the results of
reproducing this problem with these patches in place.
Created attachment 111616 [details]
Debug patch 1 of 2: core ext3 buffer tracing
Run "make oldconfig" and enable CONFIG_BUFFER_DEBUG when using this patch.
ext3 may need to be built in, not modular.
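Putting the instructions above together, the debug-kernel build would look roughly like this. Paths and patch file names are illustrative, matching the two attachments:

```shell
# Hypothetical build sequence for the debug kernel (paths illustrative).
cd /usr/src/linux-2.6.9-5.EL
patch -p1 < ext3-debug.patch      # patch 1 of 2 (attachment 111616)
patch -p1 < ext3-assert.patch     # patch 2 of 2 (attachment 111617)
make oldconfig                    # answer Y to CONFIG_BUFFER_DEBUG
# Build ext3 in (CONFIG_EXT3_FS=y) rather than as a module, per the
# note above, so the tracing covers the root filesystem as well.
make bzImage modules && make modules_install install
```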
Created attachment 111617 [details]
Debug patch 2 of 2: targeted assert tests for kjournald t_locked_list oops
This patch depends on the previous ext3-debug.patch.
Created attachment 111782 [details]
Fix for destroying in-use journal_head
The preceding patch has been committed for U1 to fix this problem.
Please report the results of testing with it applied.
Awesome! I am building a kernel with the patch from your previous comment
right now. We will test as soon as possible.
For what it's worth, I will provide the output from the failure on the
debug kernel (which incorporated the patches from comments #5 and #6) by
the end of the day, since that test finally kicked off just this morning.
We ran the debug kernel until Friday afternoon without a failure, so it
seems the debug prints have masked the issue.
On Friday we were able to setup another test platform that exhibited the
failure before the end of the day. (using the stock RHEL kernel)
We then switched both machines to the fixed kernel mentioned in comment
#7. Both boxes ran the entire weekend without failure. That bodes
well... but it is disconcerting that the debug kernel also masked the
issue.
Robert has also done a code-review and confirms that the fix is good; he's also
run with it successfully for over 1 week.
Confirmed that "ext3-release-race.patch" is now in U1 Beta. Closing.