Description of problem:
While running network stress through Samba to an ext3 disk partition,
a kernel oops has been repeatedly observed (see attachments).
Version-Release number of selected component (if applicable):
Takes about 6 hours on average to reproduce with 8GB clients.
PE750, 1 x 3.0GHz CPU, HT enabled
Red Hat Enterprise Linux 4, kernel 2.6.9-5.ELhugemem
LOM0: inactive/not connected
LOM1: e1000-220.127.116.11, 8021q (2 VLANs)
Samba share: /share on ext3 partition
Stress: 8 Nettack clients, 4 per VLAN, reading/writing to /share over
NOTE: Nettack is Dell's home-grown network stress tool which does
File Read/Write/Compare over the network in a loop. This has been
reliably used for several years at Dell now.
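For readers without access to Nettack, the read/write/compare loop described above can be approximated with a short script. This is only a rough sketch and not Nettack's actual implementation: the function name, file naming, payload size, and iteration count are all assumptions, and the target directory is assumed to be the Samba share mounted on the client.

```python
import os
import tempfile


def write_read_compare(directory, size_bytes=1 << 20, iterations=3):
    """Write random data to a file, read it back, and compare.

    Returns the number of iterations whose read-back data differed
    from what was written. All names and defaults are illustrative.
    """
    mismatches = 0
    for i in range(iterations):
        payload = os.urandom(size_bytes)
        path = os.path.join(directory, f"stress_{os.getpid()}_{i}.bin")
        # Write the payload and push it out of the client-side cache.
        with open(path, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())
        # Read the file back and compare it byte for byte.
        with open(path, "rb") as f:
            data = f.read()
        if data != payload:
            mismatches += 1
        os.remove(path)
    return mismatches


if __name__ == "__main__":
    # A temporary directory stands in for the mounted share here.
    with tempfile.TemporaryDirectory() as d:
        print("mismatches:", write_read_compare(d))
```

Running several copies of such a loop in parallel from each client, against the mounted share rather than a temporary directory, would roughly mimic the load pattern described here.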
Steps to Reproduce:
1. Install Red Hat Enterprise Linux 4 (everything,
kernel-2.6.9-5.ELhugemem) on an ext3 partition (/boot - 100MB - ext3,
/ - 10GB - ext3, swap - 1GB).
2. Create a Samba share at /share.
3. Install the Intel gigabit NIC driver, e1000-18.104.22.168
4. Load the native 8021q module and create 2 VLANs on eth1.
5. Configure a switch (Dell 5012 used in original test) with two
VLANs to match the server.
6. Connect 4 network stress clients with gigabit NICs to each VLAN and
run them against the Samba share.
7. Wait for panic/oops to occur.
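Steps 2 and 4 above might look roughly like the following on the server. This is a hedged sketch that only builds and prints the commands and the smb.conf stanza rather than executing anything. The VLAN IDs 2 and 3 and the share options are assumptions (the report does not give them), and bringing up and addressing the VLAN interfaces is left out.

```python
def vlan_setup_commands(interface, vlan_ids):
    """Return argument lists for loading 8021q and adding VLANs with vconfig."""
    commands = [["modprobe", "8021q"]]
    for vlan_id in vlan_ids:
        # "vconfig add <interface> <vlan-id>" creates one VLAN device.
        commands.append(["vconfig", "add", interface, str(vlan_id)])
    return commands


def samba_share_stanza(name, path):
    """Return a minimal smb.conf share section (options are assumptions)."""
    return f"[{name}]\n    path = {path}\n    read only = no\n"


if __name__ == "__main__":
    # VLAN IDs 2 and 3 are hypothetical placeholders.
    for cmd in vlan_setup_commands("eth1", [2, 3]):
        print(" ".join(cmd))
    print(samba_share_stanza("share", "/share"))
```

The stanza would be appended to smb.conf and Samba restarted. The VLAN IDs configured on the switch must match whatever IDs are used on the server.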
The issue has been reproduced about 6 times so far. It has also been
reproduced on alternate platforms such as the SC1425. The 802.1q
module has been ruled out, since one of the failing configurations did
not have VLANs set up.
Created attachment 111591 [details]
This is an OOPS trace from the last failure that did not have any sort of
special VLAN setup.
Created attachment 111593 [details]
This is the SysRq output from the same failure as in the previous comment.
Created attachment 111614 [details]
sysreport from afflicted system
This is a bug I've seen reported elsewhere, but so far I have not been able to
reproduce it nor to get to the bottom of it. I do have a couple of debugging
patches which implement extensive ext3 buffer tracing and a few extra
consistency checks; it would be very helpful if we could get the results of
reproducing this problem with these patches in place.
Created attachment 111616 [details]
Debug patch 1 of 2: core ext3 buffer tracing
Run "make oldconfig" and enable CONFIG_BUFFER_DEBUG when using this patch.
ext3 may need to be built in, not modular.
Created attachment 111617 [details]
Debug patch 2 of 2: targeted assert tests for kjournald t_locked_list oops
This patch depends on the previous ext3-debug.patch.
Created attachment 111782 [details]
Fix for destroying in-use journal_head
The following patch has been committed for U1 to fix this problem. Please
report test results with it applied.
Awesome! I am building a kernel with the patch from your previous comment
right now. We will test as soon as possible.
For what it's worth, I will provide the output from the failure on the
debug kernel that incorporates the patches from comments #5 and #6 by the end
of the day, since that test was only kicked off this morning.
We ran the debug kernel until Friday afternoon without a failure. So it seems
that the debug prints have masked the issue.
On Friday we were able to set up another test platform, using the stock RHEL
kernel, that exhibited the failure before the end of the day.
We then switched both machines to the fixed kernel mentioned in comment #7. Both boxes
ran the entire weekend without failure. That bodes well... but it is
disconcerting that the debug kernel also masked the issue.
Robert has also done a code review and confirms that the fix is good; he has
also run with it successfully for over a week.
Confirmed that "ext3-release-race.patch" is now in U1 Beta. Closing.