Red Hat Bugzilla – Bug 150135
Kernel OOPS in jbd While Running Network Stress
Last modified: 2007-11-30 17:07:16 EST
Description of problem:
While running network stress through Samba to an ext3 disk partition,
a kernel oops has been repeatedly observed (see attachments).
Version-Release number of selected component (if applicable):
Takes on average about 6 hours to reproduce with 8 gigabit clients.
PE750, 1 x 3.0GHz CPU, HT enabled
Red Hat Enterprise Linux 4, kernel 2.6.9-5.ELhugemem
LOM0: inactive/not connected
LOM1: e1000-184.108.40.206, 8021q (2 VLANs)
Samba share: /share on ext3 partition
Stress: 8 Nettack clients, 4 per VLAN, reading/writing to /share over
the network.
NOTE: Nettack is Dell's home-grown network stress tool, which performs
file read/write/compare over the network in a loop. It has been used
reliably at Dell for several years.
Steps to Reproduce:
1. Install Red Hat Enterprise Linux 4 (everything, kernel-2.6.9-
5.ELhugemem) on an ext3 partition (/boot - 100MB - ext3, / - 10GB -
ext3, swap - 1GB).
2. Create a Samba share at /share.
3. Install the Intel gigabit NIC driver, e1000-220.127.116.11
4. Load the native 8021q module and create 2 VLANs on eth1.
5. Configure a switch (Dell 5012 used in original test) with two
VLANs to match the server.
6. Connect and run 4 Network Stress clients with gigabit NICs to each
VLAN against the samba share.
7. Wait for panic/oops to occur.
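The share and VLAN setup in steps 2-4 can be sketched as below. This is a hypothetical reconstruction for a RHEL 4-era system (share path and VLAN IDs are illustrative; the original VLAN IDs are not given in the report), not the exact configuration used:

```shell
# Create the share directory and export it via Samba (smb.conf fragment).
mkdir -p /share
cat >> /etc/samba/smb.conf <<'EOF'
[share]
   path = /share
   read only = no
   guest ok = yes
EOF
service smb restart

# Create two VLANs on eth1 with the native 8021q module
# (vconfig was the standard tool on 2.6.9-era kernels; VLAN IDs 2 and 3
# are placeholders -- the report does not state the IDs used).
modprobe 8021q
vconfig add eth1 2
vconfig add eth1 3
ifconfig eth1.2 up
ifconfig eth1.3 up
```

The switch-side VLAN configuration (step 5) must match the IDs chosen here.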
The issue has been reproduced about 6 times so far, including on
alternate platforms such as the SC1425. The 802.1q module has also been
ruled out, since one of the failing configurations did not have VLANs
set up.
Created attachment 111591 [details]
This is an OOPS trace from the last failure that did not have any sort of
special VLAN setup.
Created attachment 111593 [details]
This is the SysRq output from the same failure from the previous comment.
Created attachment 111614 [details]
sysreport from afflicted system
This is a bug I've seen reported elsewhere, but so far I have not been able to
reproduce it nor to get to the bottom of it. I do have a couple of debugging
patches which implement extensive ext3 buffer tracing and a few extra
consistency checks; it would be very helpful if we could get the results of
reproducing this problem with these patches in place.
Created attachment 111616 [details]
Debug patch 1 of 2: core ext3 buffer tracing
Run "make oldconfig" and enable CONFIG_BUFFER_DEBUG when using this patch.
ext3 may need to be built in, not modular.
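Putting the instructions above together, the debug-kernel build would look roughly like this. Paths and patch file names are illustrative, matching the two attachments:

```shell
# Hypothetical build sequence for the debug kernel (paths illustrative).
cd /usr/src/linux-2.6.9-5.EL
patch -p1 < ext3-debug.patch      # patch 1 of 2 (attachment 111616)
patch -p1 < ext3-assert.patch     # patch 2 of 2 (attachment 111617)
make oldconfig                    # answer Y to CONFIG_BUFFER_DEBUG
# Build ext3 in (CONFIG_EXT3_FS=y) rather than as a module, per the
# note above, so the tracing covers the root filesystem as well.
make bzImage modules && make modules_install install
```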
Created attachment 111617 [details]
Debug patch 2 of 2: targeted assert tests for kjournald t_locked_list oops
This patch depends on the previous ext3-debug.patch.
Created attachment 111782 [details]
Fix for destroying in-use journal_head
The preceding patch has been committed for U1 to fix this problem.
Please report the results of testing with it applied.
Awesome! I am building a kernel with the patch from your previous comment
right now. We will test as soon as possible.
For what it's worth, I will provide the output from the failure on the
debug kernel (which incorporated the patches from comments #5 and #6) by
the end of the day, since that test finally kicked off just this morning.
We ran the debug kernel until Friday afternoon without a failure, so it
seems the debug prints have masked the issue.
On Friday we were able to setup another test platform that exhibited the
failure before the end of the day. (using the stock RHEL kernel)
We then switched both machines to the fixed kernel mentioned in comment
#7. Both boxes ran the entire weekend without failure. That bodes
well... but it is disconcerting that the debug kernel also masked the
issue.
Robert has also done a code-review and confirms that the fix is good; he's also
run with it successfully for over 1 week.
Confirmed that "ext3-release-race.patch" is now in U1 Beta. Closing.