49348 – Kernel panic caused by heavy disk activity on ext3 filesystem

Bug 49348 - Kernel panic caused by heavy disk activity on ext3 filesystem

Summary: Kernel panic caused by heavy disk activity on ext3 filesystem

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Linux
Classification:	Retired
Component:	kernel
Sub Component:
Version:	7.3
Hardware:	i386
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Stephen Tweedie
QA Contact:	Brock Organ
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2001-07-18 14:31 UTC by Clay Cooper
Modified:	2007-04-18 16:34 UTC (History)
CC List:	4 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2001-07-24 16:10:46 UTC
Embargoed:

Attachments	(Terms of Use)
Output of ksymoops (11.62 KB, text/plain) 2001-07-18 16:34 UTC, Clay Cooper	no flags	Details
View All

Description Clay Cooper 2001-07-18 14:31:56 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux 2.4.2-2 i686; en-US; m18)
Gecko/20010131 Netscape6/6.01

Description of problem:
On a pe6400 w/ 5GB ram, 2 GB swap, perc3dc raid 1 container, bios
A08,enterprise kernel, ext3 filesystems.  After 10-15 minutes of 5
concurrently running streams of copy and compare (a script which copies a
650MB iso image to two localdisk locations then compares them), the kernel
panics.  I have reproduced this twice.
Here is the screen output:

ext3_reserve_inode_write: failed
Assertion failure in journal_dirty_metadata_Rsmp_38f0f21a() at
transaction.c:1094: "jh->b_next_transaction == transaction"
invalid operand: 0000
CPU:    2
EIP:    0010:[<f8853451>]
EFLAGS: 00010282
eax: 00000022   ebx: ea50e454   ecx: c02b4080   edx: 000046e2
esi: eea01544   edi: f7f2f800   ebp: f7d0326c   esp: efa1bd14
ds: 0018   es: 0018   ss: 0018
Process cp (pid: 1141, stackpage=efa1b000)
Stack: f885c520 00000446 001c90fe f6ab0000 d6a8b000 00000000 f8862518 f7d0326c
       d1665274 00001000 00000000 f6a48400 f6097720 000010fe 00000039 d1665274
       f7fd0784 00000008 f6ba1400 00000000 00000001 f7d0326c 00001000 001c90ed
Call Trace: [<c014af52>] [<c014b7d1>] [<c014c1e2>] [<c0136798>] [<c0134480>]
   [<c0146b06>] [<c0114959>] [<c010772b>]

Code: 0f 0b 59 5b eb 4a 8b 53 0c 85 d2 74 34 68 06 cc 85 f8 68 52
 <0>Assertion failure in journal_start_Rsmp_5c4f038c() at
transaction.c:219: "handle->h_transaction->t_journal == journal"
kernel BUG at transaction.c:219!
invalid operand: 0000
CPU:    2
EIP:    0010:[<f885132a>]
EFLAGS: 00010286
eax: 00000021   ebx: f7d0326c   ecx: c02b4080   edx: 00004a6f
esi: efa1a000   edi: f60a32b0   ebp: f6b8a800   esp: efa1baac
ds: 0018   es: 0018   ss: 0018
Process cp (pid: 1141, stackpage=efa1b000)
Stack: f885c520 000000db f7d0326c f60a32b0 f7d0326c f60a32b0 efa1bb70 f8866cb2
       f6b8a800 00000001 f60a32b0 00000001 f609b800 c016247a f60a32b0 efa1a000
       fffff000 00000000 c0136637 f60a32b0 00000001 c6a50000 00000000 00000006
Call Trace: [<c016247a>] [<c0136637>] [<c01d6412>] [<c0125ec9>] [<c0125fb8>]
   [<c0121f2b>] [<c0250018>] [<c0107d0b>] [<c0107f50>] [<c0107fd1>]
[<c019786d>]
   [<c019cf62>] [<c019cf88>] [<c02471f9>] [<c011e5fb>] [<c011e99b>]
[<c0107828>]
   [<c014af52>] [<c014b7d1>] [<c014c1e2>] [<c0136798>] [<c0134480>]
[<c0146b06>]
   [<c0114959>] [<c010772b>]

Code: 0f 0b 5d 58 ff 43 08 eb 5f 68 f0 00 00 00 6a 10 e8 01 92 8e



How reproducible:
Always

Steps to Reproduce:
1.Install beta2 on described configuration w/ ext3 filesystems
2.Generate fairly heavy disk IO while running enterprise kernel
3.Look for eventual kernel panic
	

Actual Results:  Kernel panic		

Expected Results:  No kernel panic

Additional info:

Problem seen first on enterprise kernel w/ AMI raid.
As we move through our test plan, we will look for this error on UP and SMP
kernels, and on Adaptec Raid and straight scsi.

Comment 1 Stephen Tweedie 2001-07-18 15:24:25 UTC

Could you please run this through ksymoops?  The error contains a stack trace
which references loaded modules, and it's not possible to know from the trace
exactly where the modules were in memory (and hence what code the trace
represents.)  The most reliable way to get the module trace is to run ksymoops
on the machine which just generated the oops, if that is possible.

I'll also try to reproduce this here --- are the source and destination copies
of the .iso images you use all on the same filesystem in your tests?

Thanks.

Comment 2 Clay Cooper 2001-07-18 16:34:37 UTC

Created attachment 24009 [details]
Output of ksymoops

Comment 3 Clay Cooper 2001-07-18 16:36:00 UTC

Yes, the iso image source and destinations are all on local ext3 filesystems.

Comment 4 Stephen Tweedie 2001-07-18 17:12:58 UTC

But are they on the *same* filesystem?  The oops seems to indicate that a
transaction that started on one ext3 filesystem recursed into a different one. 
To reproduce, I'd like to know if you were doing copies between different
partitions or just doing a heavy load on a single partition.  Thanks.

Comment 5 Clay Cooper 2001-07-18 18:11:47 UTC

Yes, the iso image source and destinations are all on local ext3 filesystems.

Comment 6 Clay Cooper 2001-07-18 18:17:37 UTC

The source iso and the destination directories were all on a separate /home
partition.

Comment 7 Glen Foster 2001-07-18 19:59:46 UTC

This defect should be fixed before next release.

Comment 8 Stephen Tweedie 2001-07-20 16:10:32 UTC

The beta2 kernel had a slightly older ext3.  We are now building a kernel with
all the most recent ext3 fixes included, and with better diagnostic ability for
ext3 bugs (ext3 is in the main kernel, not in a module, specifically for this
next build).  We'll do a little prelim testing then make it available for wider
testing.

I'll try reproducing the scenario you describe on this new kernel immediately.

Comment 9 Bulent Abali 2001-07-22 18:27:38 UTC

I see the same bug under low memory conditions.  Different hardware, 2GB
memory.  Sorry I don't have the ksymoops output.  I will produce one
when I trigger the bug again.

Bulent (abali.com)

====================

JBD: out of memory for journal_head
ext3_reserve_inode_write: failed
Assertion failure in journal_dirty_metadata_Rsmp_93e5c062() at 
transaction.c:1094: "jh->b_next_transaction == transaction"
invalid operand: 0000
CPU:    1
EIP:    0010:[<f8847451>]
EFLAGS: 00010282
eax: 00000022   ebx: f0ae1454   ecx: c02b2b00   edx: 00025bdd
esi: f5f377ac   edi: f730fe00   ebp: f580d94c   esp: cd6efe60
ds: 0018   es: 0018   ss: 0018
Process wafgen99 (pid: 9703, stackpage=cd6ef000)
Stack: f88504c0 00000446 c3c397f0 e959d70c f4627228 f580d94c f885b8a3 f580d94c 
       c3c397f0 f4627218 cd6efe98 f747ac00 00000010 00000208 00000000 cd6efeb0 
       f885ab9e f580d94c ebf97a00 ebf97a00 f580d94c e959d704 e959d6a8 f885b9f9 
Call Trace: [<c0156b69>] [<c0156836>] [<c0156d57>] [<c013999e>] [<c01453b6>] 
   [<c01456e7>] [<c010771b>] 

Code: 0f 0b 59 5b eb 4a 8b 53 0c 85 d2 74 34 68 a6 0b 85 f8 68 52 


Here is another one
===================

ext3_reserve_inode_write: failed
ext3_reserve_inode_write: failed
Assertion failure in journal_dirty_metadata_Rsmp_93e5c062() at 
transaction.c:1094: "jh->b_next_transaction == transaction"
invalid operand: 0000
CPU:    0
EIP:    0010:[<f8847451>]
EFLAGS: 00010282
eax: 00000022   ebx: c3da417c   ecx: c02b2b00   edx: 00003bff
esi: f6c6fb1c   edi: f730ce00   ebp: f65bff8c   esp: f6629e70
ds: 0018   es: 0018   ss: 0018
Process wafgen99 (pid: 1006, stackpage=f6629000)
Stack: f88504c0 00000446 f7478e40 00008000 f65bff8c f72d1400 f88579cc f65bff8c 
       f72d4d6c 00003fc0 001c69f8 f730ce00 f619fd78 f7479400 f7478e40 f619fd78 
       00000072 f72d4d6c f65bff8c f619f2b0 f65bff8c 000081a4 f6ecbe3c f885b9a4 
Call Trace: [<c0156b69>] [<c0156836>] [<c0156d57>] [<c0160779>] [<c015f93c>] 
   [<c01453b6>] [<c01456e7>] [<c010771b>] 

Code: 0f 0b 59 5b eb 4a 8b 53 0c 85 d2 74 34 68 a6 0b 85 f8 68 52

Comment 10 Bulent Abali 2001-07-22 18:31:14 UTC

In case if it helps with debugging.  When panic occurs, there is large amount 
of dirty pages in memory.  The process is creating thousands of files totaling 
few gigabytes of data.

Comment 11 Stephen Tweedie 2001-07-23 21:13:35 UTC

There is a newer ext3 with a number of fixes which just missed the beta2
deadline.  A couple of extra fixes have gone in since then and we have a build
tonight which fixes a number of ext3 bugs similar to yours.  We'll let you know
when that build is available and will make it available for testing.

I've been unable to reproduce your problem locally with the newer kernels, but
my setup may not mirror yours exactly, so I'd still appreciate testing of the
new build once it is out.

Comment 12 Clay Cooper 2001-07-23 21:27:41 UTC

I tried enterprise kernel 2.4.6-2.3 per Arjan van de Ven's request and went 2
hours without error.  The oops was occurring within 15 minutes previously.  I'm
going to let it run overnight, but looks like the newer kernel fixes my problem.

Comment 13 Clay Cooper 2001-07-24 16:10:41 UTC

After letting the disk activity continue overnight with enterprise kernel
2.4.6-2.3, copy and compare finished successfully but with several warnings in
/var/log/messages:

Jul 23 18:47:12 localhost cpcmp[2507]: glob 1, stream 1, pass 4, sev 6 Starting
4 ...
Jul 23 18:48:02 localhost kernel: journal_write_metadata_buffer: ENOMEM at
get_unused_buffer_head, trying again.
Jul 23 18:48:05 localhost last message repeated 13 times
Jul 23 18:48:05 localhost kernel: : ENOMEM at get_unused_buffer_head, trying again.
Jul 23 18:48:05 localhost kernel: journal_write_metadata_buffer: ENOMEM at
get_unused_buffer_head, trying again.
Jul 23 18:48:44 localhost last message repeated 399 times
Jul 23 18:48:47 localhost last message repeated 25 times
Jul 23 18:48:47 localhost kernel: : ENOMEM at get_unused_buffer_head, trying again.
Jul 23 18:48:47 localhost kernel: journal_write_metadata_buffer: ENOMEM at
get_unused_buffer_head, trying again.
Jul 23 18:48:50 localhost last message repeated 398 times
Jul 23 18:49:21 localhost cpcmp[2516]: glob 1, stream 2, pass 3, sev 6 ... Ending 3
Jul 23 18:49:22 localhost cpcmp[2521]: glob 1, stream 2, pass 4, sev 6 Starting
4 ...
Jul 23 18:51:00 localhost cpcmp[2535]: glob 1, stream 4, pass 3, sev 6 ... Ending 3
Jul 23 18:51:00 localhost cpcmp[2540]: glob 1, stream 4, pass 4, sev 6 Starting
4 ...
Jul 23 18:52:44 localhost kernel: journal_write_metadata_buffer: ENOMEM at
get_unused_buffer_head, trying again.
Jul 23 18:54:19 localhost last message repeated 181 times
Jul 23 18:54:22 localhost last message repeated 19 times
Jul 23 18:54:22 localhost kernel: : ENOMEM at get_unused_buffer_head, trying again.
Jul 23 18:54:22 localhost kernel: journal_write_metadata_buffer: ENOMEM at
get_unused_buffer_head, trying again.
Jul 23 18:54:34 localhost last message repeated 398 times
Jul 23 18:55:02 localhost kernel: VM: killing process mrtg
Jul 23 18:56:53 localhost syslogd: select: Cannot allocate memory
Jul 23 18:57:19 localhost kernel: journal_write_metadata_buffer: ENOMEM at
get_unused_buffer_head, trying again.
Jul 23 18:57:19 localhost last message repeated 12 times
Jul 23 18:57:24 localhost kernel: : ENOMEM at get_unused_buffer_head, trying again.
Jul 23 18:57:27 localhost kernel: journal_write_metadata_buffer: ENOMEM at
get_unused_buffer_head, trying again.
Jul 23 18:57:27 localhost last message repeated 398 times
Jul 23 18:57:55 localhost syslogd: select: Cannot allocate memory
Jul 23 18:57:59 localhost kernel: journal_write_metadata_buffer: ENOMEM at
get_unused_buffer_head, trying again.
Jul 23 18:58:05 localhost last message repeated 65 times
Jul 23 18:59:58 localhost cpcmp[2572]: glob 1, stream 5, pass 4, sev 6 ... Ending 4

Comment 14 Stephen Tweedie 2001-07-24 17:10:39 UTC

OK, I'm glad it fixed your problem.

The VM warnings are side-effects of ongoing VM tuning.  A number of core VM
problems have been identified and fixed in the main Linux kernel recently and we
are tracking those, but unfortunately they seem to have the ability to interact
badly with large memory machines.  We are actively trying to fix that, but it's
definitely a separate problem.

For now I'll add in a rate-limiter to the low-memory warnings from ext3 to avoid
flooding the syslogs.  That won't fix the problem of other tasks being hurt by
the low memory conditions, of course.  If you want to track the VM problems in
bugzilla, it's probably best to open that as a separate bug.

Comment 15 Bulent Abali 2001-07-25 21:07:26 UTC

Beta3 works for me too.  no panics anymore.

Note You need to log in before you can comment on or make changes to this bug.