From Bugzilla Helper: User-Agent: Mozilla/5.0 (X11; U; Linux i686; sk-SK; rv:1.2.1) Gecko/20030225 Description of problem: Kernel freezes under heavy disk I/O load (every day around 5:00 am after cron.daily execution). I expect it is an error in ext3 journal implementation. Probably it is same bug as 114740. The only difference is that console is not completely dead. After pressing enter it requests login, but I am not able to login. But Ctrl+Alt+Del does not respond. Sometimes I find in logs: Feb 16 04:32:07 mail kernel: Assertion failure in __journal_drop_transaction() a t checkpoint.c:597: "transaction->t_ilist == NULL" Feb 16 04:32:07 mail kernel: ------------[ cut here ]------------ Feb 16 04:32:07 mail kernel: kernel BUG at checkpoint.c:597! Feb 16 04:32:07 mail kernel: invalid operand: 0000 Feb 16 04:32:07 mail kernel: n_hdlc ppp_synctty ppp_async ppp_generic slhc 8139too mii sis900 ipt_TOS ipt_MARK iptable_mangle ipt_LOG ipt_limit ipt_state ipt_RE JECT ipt_REDIRECT iptable_f Feb 16 04:32:07 mail kernel: CPU: 0 Feb 16 04:32:07 mail kernel: EIP: 0060:[<cf811e6c>] Not tainted Feb 16 04:32:07 mail kernel: EFLAGS: 00010286 Feb 16 04:32:07 mail kernel: Feb 16 04:32:07 mail kernel: EIP is at __journal_drop_transaction [jbd] 0x36c (2 .4.22-1.2166.nptl) Feb 16 04:32:07 mail kernel: eax: 00000069 ebx: cde1e840 ecx: 00000001 edx : cdbbbf7c Feb 16 04:32:07 mail kernel: esi: c12e5380 edi: 00000000 ebp: ce768000 esp : ce769e40 Feb 16 04:32:07 mail kernel: ds: 0068 es: 0068 ss: 0068 Feb 16 04:32:07 mail kernel: Process kjournald (pid: 12, stackpage=ce769000) Feb 16 04:32:07 mail kernel: Stack: cf815fa0 cf814d30 cf814cc4 00000255 cf814d85 cde1e840 cf8119aa c12e5380 Feb 16 04:32:07 mail kernel: cde1e840 c379d900 cdf0d3c0 cf80fd80 c379d900 cdf0d3c0 00000004 000017f1 Feb 16 04:32:07 mail kernel: 00000000 c12e5440 ce768000 c12e53f4 00000000 00000fbc c1f98044 00000000 Feb 16 04:32:07 mail kernel: Call Trace: [<cf815fa0>] .rodata.str1.32 [jbd] 0x e80 (0xce769e40) Feb 16 04:32:07 mail kernel: [<cf814d30>] .rodata.str1.1 [jbd] 0x3c8 (0xce769e44 ) Feb 16 04:32:07 mail kernel: [<cf814cc4>] .rodata.str1.1 [jbd] 0x35c (0xce769e48 ) Feb 16 04:32:07 mail kernel: [<cf814d85>] .rodata.str1.1 [jbd] 0x41d (0xce769e50 ) Feb 16 04:32:07 mail kernel: [<cf8119aa>] __journal_remove_checkpoint [jbd] 0x4a (0xce769e58) Feb 16 04:32:07 mail kernel: [<cf80fd80>] journal_commit_transaction [jbd] 0x6b0 (0xce769e6c) Feb 16 04:32:07 mail kernel: [<c01190e8>] schedule [kernel] 0x2b8 (0xce769f64) Feb 16 04:32:07 mail kernel: [<cf812a14>] kjournald [jbd] 0x134 (0xce769fbc) Feb 16 04:32:07 mail kernel: [<cf8128c0>] commit_timeout [jbd] 0x0 (0xce769fd8)Feb 16 04:32:07 mail kernel: [<cf8128e0>] kjournald [jbd] 0x0 (0xce769fe8) Feb 16 04:32:07 mail kernel: [<c010719d>] kernel_thread_helper [kernel] 0x5 (0xc e769ff0) Feb 16 04:32:07 mail kernel: Feb 16 04:32:07 mail kernel: Feb 16 04:32:07 mail kernel: Code: 0f 0b 55 02 c4 4c 81 cf e9 ba fc ff ff 8b 53 44 31 c0 39 da Version-Release number of selected component (if applicable): kernel-2.4.22-1.2166.nptl.athlon How reproducible: Always Steps to Reproduce: 1. Boot kernel. 2. Leave it running. 3. At 7:00 am it is dead. Sometimes corresponding logs are present. (There are also some on the console even if they are not in syslog.) Actual Results: Kernel freezes. Expected Results: Kernel does not freeze. Additional info:
OK, we have a problem here. That particular field really should be zero, because ext3 never, ever, ever, ever, ever touches it. It is never even mentioned in the kernel outside comments, except for that one sanity check to see if it's zero. So ext3 cannot set it to a non-zero value. Hmm. So, something *else* is corrupting ext3's memory. It might be bad hardware. It might be software. We'd need to see more oops messages to see if there's a pattern here. So, if you still have other failure reports in your logs, please attach them to this report so that we can hunt through and look for common patterns. I'd also recommend a run of memtest86 to check for possible memory problems.
The machine did not pass memtest86. I will change RAM tomorrow. Sorry. You can close the bug. Just for case it will not help there is another log. Feb 19 02:13:27 mail kernel: Assertion failure in __journal_drop_transaction() a t checkpoint.c:597: "transaction->t_ilist == NULL" Feb 19 02:13:27 mail kernel: ------------[ cut here ]------------ Feb 19 02:13:27 mail kernel: kernel BUG at checkpoint.c:597! Feb 19 02:13:27 mail kernel: invalid operand: 0000 Feb 19 02:13:27 mail kernel: n_hdlc ppp_synctty ppp_async ppp_generic slhc 8139t oo mii sis900 ipt_TOS ipt_MARK iptable_mangle ipt_LOG ipt_limit ipt_state ipt_RE JECT ipt_REDIRECT iptable_f Feb 19 02:13:27 mail kernel: CPU: 0 Feb 19 02:13:27 mail kernel: EIP: 0060:[<cf811e6c>] Not tainted Feb 19 02:13:27 mail kernel: EFLAGS: 00010282 Feb 19 02:13:27 mail kernel: Feb 19 02:13:27 mail kernel: EIP is at __journal_drop_transaction [jbd] 0x36c (2 .4.22-1.2166.nptl) Feb 19 02:13:27 mail kernel: eax: 00000069 ebx: cd0f5140 ecx: 00000001 edx : cd199f7c Feb 19 02:13:27 mail kernel: esi: c12e5380 edi: cd0f5140 ebp: cd91a000 esp : cd91be08 Feb 19 02:13:27 mail kernel: ds: 0068 es: 0068 ss: 0068 Feb 19 02:13:27 mail kernel: Process kjournald (pid: 12, stackpage=cd91b000) Feb 19 02:13:27 mail kernel: Stack: cf815fa0 cf814d30 cf814cc4 00000255 cf814d85 cd0f5140 cf8119aa c12e5380 Feb 19 02:13:27 mail kernel: cd0f5140 cdf79140 cebc63c0 cf81132f cebc63c0 cebc63c0 cf81192d cebc63c0 Feb 19 02:13:27 mail kernel: cebc63c0 00000003 00000001 ccc28840 cd0f5a40 cd0f5cc0 c12e5380 00000000 Feb 19 02:13:27 mail kernel: Call Trace: [<cf815fa0>] .rodata.str1.32 [jbd] 0x e80 (0xcd91be08) Feb 19 02:13:27 mail kernel: [<cf814d30>] .rodata.str1.1 [jbd] 0x3c8 (0xcd91be0c ) Feb 19 02:13:27 mail kernel: [<cf814cc4>] .rodata.str1.1 [jbd] 0x35c (0xcd91be10 ) Feb 19 02:13:27 mail kernel: [<cf814d85>] .rodata.str1.1 [jbd] 0x41d (0xcd91be18 ) Feb 19 02:13:27 mail kernel: [<cf8119aa>] __journal_remove_checkpoint [jbd] 0x4a (0xcd91be20) Feb 19 02:13:27 mail kernel: [<cf81132f>] __try_to_free_cp_buf [jbd] 0x2f (0xcd9 1be34) Feb 19 02:13:27 mail kernel: [<cf81192d>] __journal_clean_checkpoint_list [jbd] 0x6d (0xcd91be40) Feb 19 02:13:27 mail kernel: [<cf80f7df>] journal_commit_transaction [jbd] 0x10f (0xcd91be6c) Feb 19 02:13:27 mail kernel: [<c01190e8>] schedule [kernel] 0x2b8 (0xcd91bf64) Feb 19 02:13:27 mail kernel: [<cf812a14>] kjournald [jbd] 0x134 (0xcd91bfbc) Feb 19 02:13:27 mail kernel: [<cf8128c0>] commit_timeout [jbd] 0x0 (0xcd91bfd8) Feb 19 02:13:27 mail kernel: [<cf8128e0>] kjournald [jbd] 0x0 (0xcd91bfe8) Feb 19 02:13:27 mail kernel: [<c010719d>] kernel_thread_helper [kernel] 0x5 (0xc d91bff0) Feb 19 02:13:27 mail kernel: Feb 19 02:13:27 mail kernel: Feb 19 02:13:27 mail kernel: Code: 0f 0b 55 02 c4 4c 81 cf e9 ba fc ff ff 8b 53 44 31 c0 39 da
After changing all components except CPU the problem remains. Also after upgrading to kernel 2.4.22-1.2174.nptl. Memtest passes. I also reinstalled the system and reformatted partitions. I am going to change the motherboard to a one with other chipset tomorrow. I discovered it is not related to heavy disk I/O load. The lockup time varies between two a clock and seven a clock in the morning and it is every second or third day after upgrade to the new kernel. I haven't noticed anything in the log now, but I can send a fotography of the console dump. Kernel reports some problem at __journal_drop_transaction. I suspect it could be related to some bug in USB code I heard about.
"Kernel reports some problem at __journal_drop_transaction" We really need to see the exact error. This looks very, very much like random memory corruption hitting a certain offset in memory; it could be a CPU cache problem, just as easily. Without seeing the exact error, it's impossible to start eliminating possibilities.
After changing the CPU problem remains. The only not changed component is CD-ROM drive. I tested also kernel parameter acpi=off. Nothing helps. Red Hat kernel 2.4.20-30.9 works. There does not appear anything in the logs but there is log on the console. I can send photographs of the screen (Shift Page-Up works)in Jpeg. To what e-mail address can I send it?
We are running pppoe over DSL on that machine. It could be related to bug 110941
Any debugging files or console logs can be attached to the bugzilla report via the "create a new attachment" link. If you've got another box, the easiest way to capture it is usually to set up serial console and use ttywatch to record the logs.
Created attachment 98916 [details] Screenshot of the console log.
OK, thanks. We're still seeing non-zero in a field that ext3 simply does not use. I suspect what's happening is that the pppoe is using a field after the struct has been freed, and that is resulting in ext3 corruption simply because ext3 happens to allocate transactions from the same kmalloc bucket size. Can you rebuild a kernel with CONFIG_DEBUG_SLAB set and see if you can reproduce?
Sorry, I am unable to compile kernel. I copied /boot/config-2.4.22-1.2174.nptl to .config edited CONFIG_DEBUG_SLAB=y, ran make oldconfig; make dep; make bzImage.... and I get following error. What am I doing wrong? How did you compile the kernel? make[2]: Entering directory `/usr/src/linux-2.4.22-1.2174.nptl/drivers/acpi' gcc32 -D__KERNEL__ -I/usr/src/linux-2.4.22-1.2174.nptl/include -Wall -Wstrict-pr ototypes -Wno-trigraphs -O2 -fomit-frame-pointer -fno-strict-aliasing -fno-commo n -fomit-frame-pointer -pipe -mpreferred-stack-boundary=2 -march=athlon -DMODULE -DMODVERSIONS -include /usr/src/linux-2.4.22-1.2174.nptl/include/linux/modversi ons.h -Os -nostdinc -iwithprefix include -DKBUILD_BASENAME=ac -c -o ac.o ac.c In file included from ac.c:26: /usr/src/linux-2.4.22-1.2174.nptl/include/linux/kernel.h:60: invalid suffix on i nteger constant /usr/src/linux-2.4.22-1.2174.nptl/include/linux/kernel.h:60: syntax error before numeric constant /usr/src/linux-2.4.22-1.2174.nptl/include/linux/kernel.h:60: warning: function d eclaration isn't a prototype ... and so on.
Line 60 of /usr/src/linux-2.4.22-1.2174.nptl/include/linux/kernel.h looks like: extern struct notifier_block *panic_notifier_list; so I can't see where those errors would possibly be coming from. Could you check the package is still intact, with "rpm -V kernel-source"?
I tested compiling also on another box, fresh kernel-source install and I reproduce the results. Running make bzImage works, but make modules fails on ac.c. Maybe it depends on .config file. The line 60 of /usr/src/linux-2.4.22-1.2174.nptl/include/linux/kernel.h looks the same as yours. # LANG=C rpm --checksig kernel-source-2.4.22-1.2174.nptl.i386.rpm kernel-source-2.4.22-1.2174.nptl.i386.rpm: (SHA1) DSA sha1 md5 (GPG) NOT OK (MISSING KEYS: GPG#4f2a6fd2) rpm -V kernel-source nothing suspicious Could you, please, send me right .config file with CONFIG_DEBUG_SLAB enabled which works for you? Thanks.
Try "make mrproper" before doing the configure/build, that clears the kernel tree ready for a new build.
OK, works. Sorry, I didn't expect the tree is not clean after fresh install.
I tested the kernel with CONFIG_DEBUG_SLAB enabled. I got no other information than the previous. Or I can't see the difference. There is nothing in the log and I have got screenshots looking similar like the previous. Will it help you?
I doublechecked the /var/log messages and I found log today. Apr 4 04:25:03 mail kernel: Assertion failure in __journal_drop_transaction() a t checkpoint.c:597: "transaction->t_ilist == NULL" Apr 4 04:25:03 mail kernel: ------------[ cut here ]------------ Apr 4 04:25:03 mail kernel: kernel BUG at checkpoint.c:597! Apr 4 04:25:03 mail kernel: invalid operand: 0000 Apr 4 04:25:03 mail kernel: n_hdlc ppp_synctty ppp_async ppp_generic slhc 8139t oo mii sis900 ipt_TOS ipt_MARK iptable_mangle ipt_LOG ipt_limit ipt_state ipt_RE JECT ipt_MASQUERADE ipt_RED Apr 4 04:25:03 mail kernel: CPU: 0 Apr 4 04:25:03 mail kernel: EIP: 0060:[<ce811e6c>] Not tainted Apr 4 04:25:03 mail kernel: EFLAGS: 00010282 Apr 4 04:25:03 mail kernel: Apr 4 04:25:03 mail kernel: EIP is at __journal_drop_transaction [jbd] 0x36c (2 .4.22-1.2174.nptlcustom) Apr 4 04:25:03 mail kernel: eax: 00000069 ebx: ccb5da54 ecx: 00000001 edx : ccb71f7c Apr 4 04:25:03 mail kernel: esi: cdf3639c edi: ccb5da54 ebp: cd7a8000 esp : cd7a9e08 Apr 4 04:25:03 mail kernel: ds: 0068 es: 0068 ss: 0068 Apr 4 04:25:03 mail kernel: Process kjournald (pid: 12, stackpage=cd7a9000) Apr 4 04:25:04 mail kernel: Stack: ce815fa0 ce814d30 ce814cc4 00000255 ce814d85 ccb5da54 ce8119aa cdf3639c Apr 4 04:25:04 mail kernel: ccb5da54 cbe719c0 c1c0b79c ce81132f c1c0b79c c1c0b79c ce81192d c1c0b79c Apr 4 04:25:04 mail kernel: c1c0b79c 00000003 00000059 cc093f1c cc093f1c cd20625c cdf3639c 00000000 Apr 4 04:25:04 mail kernel: Call Trace: [<ce815fa0>] .rodata.str1.32 [jbd] 0x e80 (0xcd7a9e08) Apr 4 04:25:04 mail kernel: [<ce814d30>] .rodata.str1.1 [jbd] 0x3c8 (0xcd7a9e0c ) Apr 4 04:25:04 mail kernel: [<ce814cc4>] .rodata.str1.1 [jbd] 0x35c (0xcd7a9e10 ) Apr 4 04:25:04 mail kernel: [<ce814d85>] .rodata.str1.1 [jbd] 0x41d (0xcd7a9e18 ) Apr 4 04:25:04 mail kernel: [<ce8119aa>] __journal_remove_checkpoint [jbd] 0x4a (0xcd7a9e20) Apr 4 04:25:04 mail kernel: [<ce81132f>] __try_to_free_cp_buf [jbd] 0x2f (0xcd7 a9e34) Apr 4 04:25:04 mail kernel: [<ce81192d>] __journal_clean_checkpoint_list [jbd] 0x6d (0xcd7a9e40) Apr 4 04:25:04 mail kernel: [<ce80f7df>] journal_commit_transaction [jbd] 0x10f (0xcd7a9e6c) Apr 4 04:25:04 mail kernel: [<c01190a8>] schedule [kernel] 0x2b8 (0xcd7a9f64) Apr 4 04:25:04 mail kernel: [<ce812a14>] kjournald [jbd] 0x134 (0xcd7a9fbc) Apr 4 04:25:04 mail kernel: [<ce8128c0>] commit_timeout [jbd] 0x0 (0xcd7a9fd8) Apr 4 04:25:04 mail kernel: [<ce8128e0>] kjournald [jbd] 0x0 (0xcd7a9fe8) Apr 4 04:25:04 mail kernel: [<c010719d>] kernel_thread_helper [kernel] 0x5 (0xc d7a9ff0) Apr 4 04:25:04 mail kernel: Apr 4 04:25:04 mail kernel: Apr 4 04:25:04 mail kernel: Code: 0f 0b 55 02 c4 4c 81 ce e9 ba fc ff ff 8b 53 44 31 c0 39 da This was in /var/log/debug (*.=debug /var/log/debug): Apr 2 11:29:36 mail kernel: divert: allocating divert_blk for eth0 Apr 2 11:29:36 mail kernel: divert: allocating divert_blk for eth1 Apr 2 11:29:36 mail kernel: eth1: Identified 8139 chip type 'RTL-8139B' Apr 2 11:29:36 mail kernel: divert: freeing divert_blk for eth0 Apr 2 11:29:36 mail kernel: divert: freeing divert_blk for eth1 Apr 2 11:29:36 mail kernel: divert: allocating divert_blk for eth0 Apr 2 11:29:37 mail kernel: divert: allocating divert_blk for eth1 Apr 2 11:29:37 mail kernel: eth1: Identified 8139 chip type 'RTL-8139B' Apr 2 11:29:37 mail kernel: divert: not allocating divert_blk for non-ethernet device ppp0 I expected mode debug info on CONFIG_DEBUG_SLAB. Maybe I missed something...
CONFIG_DEBUG_SLAB doesn't change the existing debugging, it simply adds additional debugging checks on top of them. If there's slab corruption happening (as appears to be the case here), then the slab debugging adds new checks which may pinpoint the problem more accurately. But it's still going to be pretty much random whether the new checks trap the fault first or the old ones. We may have to wait for a few instances of the fault before the new checks come up with anything.
I notices this problem on another machine. It was also dead at 4:02 am. The only difference is that the machine was completely dead. I was not able to scroll the console, nothing in the logs. The only difference is the lower frequency of crashes. This was the first after upgrading to Fedora Core 1 (two or three weeks). The kernel is 2.4.22-1.2188.nptlsmp.i686. The one who saw the console said there was something about process chown. I expect it died during some syscall from the chown after sarg. The log of sarg was too short. It usually logs megabytes not kilobytes. But this may be because of buffered log. I cannot understand how it is possible that nobody else encounters this problem, only me, and on the two machines... this one does not run PPPoE, so it is not related to it. It is definitely related to some cron jobs run in the early morning (4:00 am). Maybe the hot candidates should be sarg and possible prelink.
The one-off error on a different box sounds completely unrelated. It's the kind of thing that happens sometimes, most often due to bad harwdare in my experience. There's really nothing I can add to that without an oops log.
The second machine was running Red Hat 9 for 5 months without any problem. The hardware is completely different than the first one. It is P4 on Abit IC7-G board. Adaptec SCSI adapter and 3 SCSI disks. So I don't expect it could be a hardware bug. I am not able to catch the oops log, because the machine is completely dead. Unable to scroll the console.
Serial console is invaluable for such cases. Without that, the second box's bug report is just "an oops happened", which is not debuggable.
OK. It was definitely random. Kernel does not freeze any more. Also the first machine is functioning properly for several days. Kernel 2188 solved the problem. Probably it was happening due to removed low latency patch.