From Bugzilla Helper: User-Agent: Mozilla/5.0 (X11; U; Linux 2.4.2-2 i686; en-US; m18) Gecko/20010131 Netscape6/6.01 Description of problem: On a pe6400 w/ 5GB ram, 2 GB swap, perc3dc raid 1 container, bios A08,enterprise kernel, ext3 filesystems. After 10-15 minutes of 5 concurrently running streams of copy and compare (a script which copies a 650MB iso image to two localdisk locations then compares them), the kernel panics. I have reproduced this twice. Here is the screen output: ext3_reserve_inode_write: failed Assertion failure in journal_dirty_metadata_Rsmp_38f0f21a() at transaction.c:1094: "jh->b_next_transaction == transaction" invalid operand: 0000 CPU: 2 EIP: 0010:[<f8853451>] EFLAGS: 00010282 eax: 00000022 ebx: ea50e454 ecx: c02b4080 edx: 000046e2 esi: eea01544 edi: f7f2f800 ebp: f7d0326c esp: efa1bd14 ds: 0018 es: 0018 ss: 0018 Process cp (pid: 1141, stackpage=efa1b000) Stack: f885c520 00000446 001c90fe f6ab0000 d6a8b000 00000000 f8862518 f7d0326c d1665274 00001000 00000000 f6a48400 f6097720 000010fe 00000039 d1665274 f7fd0784 00000008 f6ba1400 00000000 00000001 f7d0326c 00001000 001c90ed Call Trace: [<c014af52>] [<c014b7d1>] [<c014c1e2>] [<c0136798>] [<c0134480>] [<c0146b06>] [<c0114959>] [<c010772b>] Code: 0f 0b 59 5b eb 4a 8b 53 0c 85 d2 74 34 68 06 cc 85 f8 68 52 <0>Assertion failure in journal_start_Rsmp_5c4f038c() at transaction.c:219: "handle->h_transaction->t_journal == journal" kernel BUG at transaction.c:219! invalid operand: 0000 CPU: 2 EIP: 0010:[<f885132a>] EFLAGS: 00010286 eax: 00000021 ebx: f7d0326c ecx: c02b4080 edx: 00004a6f esi: efa1a000 edi: f60a32b0 ebp: f6b8a800 esp: efa1baac ds: 0018 es: 0018 ss: 0018 Process cp (pid: 1141, stackpage=efa1b000) Stack: f885c520 000000db f7d0326c f60a32b0 f7d0326c f60a32b0 efa1bb70 f8866cb2 f6b8a800 00000001 f60a32b0 00000001 f609b800 c016247a f60a32b0 efa1a000 fffff000 00000000 c0136637 f60a32b0 00000001 c6a50000 00000000 00000006 Call Trace: [<c016247a>] [<c0136637>] [<c01d6412>] [<c0125ec9>] [<c0125fb8>] [<c0121f2b>] [<c0250018>] [<c0107d0b>] [<c0107f50>] [<c0107fd1>] [<c019786d>] [<c019cf62>] [<c019cf88>] [<c02471f9>] [<c011e5fb>] [<c011e99b>] [<c0107828>] [<c014af52>] [<c014b7d1>] [<c014c1e2>] [<c0136798>] [<c0134480>] [<c0146b06>] [<c0114959>] [<c010772b>] Code: 0f 0b 5d 58 ff 43 08 eb 5f 68 f0 00 00 00 6a 10 e8 01 92 8e How reproducible: Always Steps to Reproduce: 1.Install beta2 on described configuration w/ ext3 filesystems 2.Generate fairly heavy disk IO while running enterprise kernel 3.Look for eventual kernel panic Actual Results: Kernel panic Expected Results: No kernel panic Additional info: Problem seen first on enterprise kernel w/ AMI raid. As we move through our test plan, we will look for this error on UP and SMP kernels, and on Adaptec Raid and straight scsi.
Could you please run this through ksymoops? The error contains a stack trace which references loaded modules, and it's not possible to know from the trace exactly where the modules were in memory (and hence what code the trace represents.) The most reliable way to get the module trace is to run ksymoops on the machine which just generated the oops, if that is possible. I'll also try to reproduce this here --- are the source and destination copies of the .iso images you use all on the same filesystem in your tests? Thanks.
Created attachment 24009 [details] Output of ksymoops
Yes, the iso image source and destinations are all on local ext3 filesystems.
But are they on the *same* filesystem? The oops seems to indicate that a transaction that started on one ext3 filesystem recursed into a different one. To reproduce, I'd like to know if you were doing copies between different partitions or just doing a heavy load on a single partition. Thanks.
The source iso and the destination directories were all on a separate /home partition.
This defect should be fixed before next release.
The beta2 kernel had a slightly older ext3. We are now building a kernel with all the most recent ext3 fixes included, and with better diagnostic ability for ext3 bugs (ext3 is in the main kernel, not in a module, specifically for this next build). We'll do a little prelim testing then make it available for wider testing. I'll try reproducing the scenario you describe on this new kernel immediately.
I see the same bug under low memory conditions. Different hardware, 2GB memory. Sorry I don't have the ksymoops output. I will produce one when I trigger the bug again. Bulent (abali.com) ==================== JBD: out of memory for journal_head ext3_reserve_inode_write: failed Assertion failure in journal_dirty_metadata_Rsmp_93e5c062() at transaction.c:1094: "jh->b_next_transaction == transaction" invalid operand: 0000 CPU: 1 EIP: 0010:[<f8847451>] EFLAGS: 00010282 eax: 00000022 ebx: f0ae1454 ecx: c02b2b00 edx: 00025bdd esi: f5f377ac edi: f730fe00 ebp: f580d94c esp: cd6efe60 ds: 0018 es: 0018 ss: 0018 Process wafgen99 (pid: 9703, stackpage=cd6ef000) Stack: f88504c0 00000446 c3c397f0 e959d70c f4627228 f580d94c f885b8a3 f580d94c c3c397f0 f4627218 cd6efe98 f747ac00 00000010 00000208 00000000 cd6efeb0 f885ab9e f580d94c ebf97a00 ebf97a00 f580d94c e959d704 e959d6a8 f885b9f9 Call Trace: [<c0156b69>] [<c0156836>] [<c0156d57>] [<c013999e>] [<c01453b6>] [<c01456e7>] [<c010771b>] Code: 0f 0b 59 5b eb 4a 8b 53 0c 85 d2 74 34 68 a6 0b 85 f8 68 52 Here is another one =================== ext3_reserve_inode_write: failed ext3_reserve_inode_write: failed Assertion failure in journal_dirty_metadata_Rsmp_93e5c062() at transaction.c:1094: "jh->b_next_transaction == transaction" invalid operand: 0000 CPU: 0 EIP: 0010:[<f8847451>] EFLAGS: 00010282 eax: 00000022 ebx: c3da417c ecx: c02b2b00 edx: 00003bff esi: f6c6fb1c edi: f730ce00 ebp: f65bff8c esp: f6629e70 ds: 0018 es: 0018 ss: 0018 Process wafgen99 (pid: 1006, stackpage=f6629000) Stack: f88504c0 00000446 f7478e40 00008000 f65bff8c f72d1400 f88579cc f65bff8c f72d4d6c 00003fc0 001c69f8 f730ce00 f619fd78 f7479400 f7478e40 f619fd78 00000072 f72d4d6c f65bff8c f619f2b0 f65bff8c 000081a4 f6ecbe3c f885b9a4 Call Trace: [<c0156b69>] [<c0156836>] [<c0156d57>] [<c0160779>] [<c015f93c>] [<c01453b6>] [<c01456e7>] [<c010771b>] Code: 0f 0b 59 5b eb 4a 8b 53 0c 85 d2 74 34 68 a6 0b 85 f8 68 52
In case if it helps with debugging. When panic occurs, there is large amount of dirty pages in memory. The process is creating thousands of files totaling few gigabytes of data.
There is a newer ext3 with a number of fixes which just missed the beta2 deadline. A couple of extra fixes have gone in since then and we have a build tonight which fixes a number of ext3 bugs similar to yours. We'll let you know when that build is available and will make it available for testing. I've been unable to reproduce your problem locally with the newer kernels, but my setup may not mirror yours exactly, so I'd still appreciate testing of the new build once it is out.
I tried enterprise kernel 2.4.6-2.3 per Arjan van de Ven's request and went 2 hours without error. The oops was occurring within 15 minutes previously. I'm going to let it run overnight, but looks like the newer kernel fixes my problem.
After letting the disk activity continue overnight with enterprise kernel 2.4.6-2.3, copy and compare finished successfully but with several warnings in /var/log/messages: Jul 23 18:47:12 localhost cpcmp[2507]: glob 1, stream 1, pass 4, sev 6 Starting 4 ... Jul 23 18:48:02 localhost kernel: journal_write_metadata_buffer: ENOMEM at get_unused_buffer_head, trying again. Jul 23 18:48:05 localhost last message repeated 13 times Jul 23 18:48:05 localhost kernel: : ENOMEM at get_unused_buffer_head, trying again. Jul 23 18:48:05 localhost kernel: journal_write_metadata_buffer: ENOMEM at get_unused_buffer_head, trying again. Jul 23 18:48:44 localhost last message repeated 399 times Jul 23 18:48:47 localhost last message repeated 25 times Jul 23 18:48:47 localhost kernel: : ENOMEM at get_unused_buffer_head, trying again. Jul 23 18:48:47 localhost kernel: journal_write_metadata_buffer: ENOMEM at get_unused_buffer_head, trying again. Jul 23 18:48:50 localhost last message repeated 398 times Jul 23 18:49:21 localhost cpcmp[2516]: glob 1, stream 2, pass 3, sev 6 ... Ending 3 Jul 23 18:49:22 localhost cpcmp[2521]: glob 1, stream 2, pass 4, sev 6 Starting 4 ... Jul 23 18:51:00 localhost cpcmp[2535]: glob 1, stream 4, pass 3, sev 6 ... Ending 3 Jul 23 18:51:00 localhost cpcmp[2540]: glob 1, stream 4, pass 4, sev 6 Starting 4 ... Jul 23 18:52:44 localhost kernel: journal_write_metadata_buffer: ENOMEM at get_unused_buffer_head, trying again. Jul 23 18:54:19 localhost last message repeated 181 times Jul 23 18:54:22 localhost last message repeated 19 times Jul 23 18:54:22 localhost kernel: : ENOMEM at get_unused_buffer_head, trying again. Jul 23 18:54:22 localhost kernel: journal_write_metadata_buffer: ENOMEM at get_unused_buffer_head, trying again. Jul 23 18:54:34 localhost last message repeated 398 times Jul 23 18:55:02 localhost kernel: VM: killing process mrtg Jul 23 18:56:53 localhost syslogd: select: Cannot allocate memory Jul 23 18:57:19 localhost kernel: journal_write_metadata_buffer: ENOMEM at get_unused_buffer_head, trying again. Jul 23 18:57:19 localhost last message repeated 12 times Jul 23 18:57:24 localhost kernel: : ENOMEM at get_unused_buffer_head, trying again. Jul 23 18:57:27 localhost kernel: journal_write_metadata_buffer: ENOMEM at get_unused_buffer_head, trying again. Jul 23 18:57:27 localhost last message repeated 398 times Jul 23 18:57:55 localhost syslogd: select: Cannot allocate memory Jul 23 18:57:59 localhost kernel: journal_write_metadata_buffer: ENOMEM at get_unused_buffer_head, trying again. Jul 23 18:58:05 localhost last message repeated 65 times Jul 23 18:59:58 localhost cpcmp[2572]: glob 1, stream 5, pass 4, sev 6 ... Ending 4
OK, I'm glad it fixed your problem. The VM warnings are side-effects of ongoing VM tuning. A number of core VM problems have been identified and fixed in the main Linux kernel recently and we are tracking those, but unfortunately they seem to have the ability to interact badly with large memory machines. We are actively trying to fix that, but it's definitely a separate problem. For now I'll add in a rate-limiter to the low-memory warnings from ext3 to avoid flooding the syslogs. That won't fix the problem of other tasks being hurt by the low memory conditions, of course. If you want to track the VM problems in bugzilla, it's probably best to open that as a separate bug.
Beta3 works for me too. no panics anymore.