115935 – Kernel freezes on disk I/O

Bug 115935 - Kernel freezes on disk I/O

Summary: Kernel freezes on disk I/O

Keywords:
Status:	CLOSED WORKSFORME
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	1
Hardware:	athlon
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	---
Assignee:	Stephen Tweedie
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2004-02-17 09:28 UTC by Marek Greško
Modified:	2007-11-30 22:10 UTC (History)
CC List:	0 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2004-06-22 11:58:14 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
Screenshot of the console log. (470.00 KB, application/x-tar) 2004-03-29 06:29 UTC, Marek Greško	no flags	Details
View All

Description Marek Greško 2004-02-17 09:28:02 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; sk-SK; rv:1.2.1)
Gecko/20030225

Description of problem:
Kernel freezes under heavy disk I/O load (every day around 5:00 am
after cron.daily execution).

I expect it is an error in ext3 journal implementation.

Probably it is same bug as 114740. The only difference is that console
 is not completely dead. After pressing enter it requests login, but I
am not able to login. But Ctrl+Alt+Del does not respond.

Sometimes I find in logs:

Feb 16 04:32:07 mail kernel: Assertion failure in
__journal_drop_transaction() a
t checkpoint.c:597: "transaction->t_ilist == NULL"
Feb 16 04:32:07 mail kernel: ------------[ cut here ]------------
Feb 16 04:32:07 mail kernel: kernel BUG at checkpoint.c:597!
Feb 16 04:32:07 mail kernel: invalid operand: 0000
Feb 16 04:32:07 mail kernel: n_hdlc ppp_synctty ppp_async ppp_generic
slhc 8139too mii sis900 ipt_TOS ipt_MARK iptable_mangle ipt_LOG
ipt_limit ipt_state ipt_RE
JECT ipt_REDIRECT iptable_f
Feb 16 04:32:07 mail kernel: CPU:    0
Feb 16 04:32:07 mail kernel: EIP:    0060:[<cf811e6c>]    Not tainted
Feb 16 04:32:07 mail kernel: EFLAGS: 00010286
Feb 16 04:32:07 mail kernel: 
Feb 16 04:32:07 mail kernel: EIP is at __journal_drop_transaction
[jbd] 0x36c (2
.4.22-1.2166.nptl)
Feb 16 04:32:07 mail kernel: eax: 00000069   ebx: cde1e840   ecx:
00000001   edx
: cdbbbf7c
Feb 16 04:32:07 mail kernel: esi: c12e5380   edi: 00000000   ebp:
ce768000   esp
: ce769e40
Feb 16 04:32:07 mail kernel: ds: 0068   es: 0068   ss: 0068
Feb 16 04:32:07 mail kernel: Process kjournald (pid: 12,
stackpage=ce769000)
Feb 16 04:32:07 mail kernel: Stack: cf815fa0 cf814d30 cf814cc4
00000255 cf814d85
 cde1e840 cf8119aa c12e5380 
Feb 16 04:32:07 mail kernel:        cde1e840 c379d900 cdf0d3c0
cf80fd80 c379d900
 cdf0d3c0 00000004 000017f1 
Feb 16 04:32:07 mail kernel:        00000000 c12e5440 ce768000
c12e53f4 00000000
 00000fbc c1f98044 00000000 
Feb 16 04:32:07 mail kernel: Call Trace:   [<cf815fa0>]
.rodata.str1.32 [jbd] 0x
e80 (0xce769e40)
Feb 16 04:32:07 mail kernel: [<cf814d30>] .rodata.str1.1 [jbd] 0x3c8
(0xce769e44
)
Feb 16 04:32:07 mail kernel: [<cf814cc4>] .rodata.str1.1 [jbd] 0x35c
(0xce769e48
)
Feb 16 04:32:07 mail kernel: [<cf814d85>] .rodata.str1.1 [jbd] 0x41d
(0xce769e50
)
Feb 16 04:32:07 mail kernel: [<cf8119aa>] __journal_remove_checkpoint
[jbd] 0x4a
 (0xce769e58)
Feb 16 04:32:07 mail kernel: [<cf80fd80>] journal_commit_transaction
[jbd] 0x6b0
 (0xce769e6c)
Feb 16 04:32:07 mail kernel: [<c01190e8>] schedule [kernel] 0x2b8
(0xce769f64)
Feb 16 04:32:07 mail kernel: [<cf812a14>] kjournald [jbd] 0x134
(0xce769fbc)
Feb 16 04:32:07 mail kernel: [<cf8128c0>] commit_timeout [jbd] 0x0
(0xce769fd8)Feb 16 04:32:07 mail kernel: [<cf8128e0>] kjournald [jbd]
0x0 (0xce769fe8)
Feb 16 04:32:07 mail kernel: [<c010719d>] kernel_thread_helper
[kernel] 0x5 (0xc
e769ff0)
Feb 16 04:32:07 mail kernel: 
Feb 16 04:32:07 mail kernel: 
Feb 16 04:32:07 mail kernel: Code: 0f 0b 55 02 c4 4c 81 cf e9 ba fc ff
ff 8b 53 
44 31 c0 39 da 



Version-Release number of selected component (if applicable):
kernel-2.4.22-1.2166.nptl.athlon

How reproducible:
Always

Steps to Reproduce:
1. Boot kernel.
2. Leave it running.
3. At 7:00 am it is dead. Sometimes corresponding logs are present.
(There are also some on the console even if they are not in syslog.)
    

Actual Results:  Kernel freezes.

Expected Results:  Kernel does not freeze.

Additional info:

Comment 1 Stephen Tweedie 2004-02-17 15:30:44 UTC

OK, we have a problem here.  That particular field really should be
zero, because ext3 never, ever, ever, ever, ever touches it.  It is
never even mentioned in the kernel outside comments, except for that
one sanity check to see if it's zero.

So ext3 cannot set it to a non-zero value.

Hmm.

So, something *else* is corrupting ext3's memory.  It might be bad
hardware.  It might be software.  We'd need to see more oops messages
to see if there's a pattern here.

So, if you still have other failure reports in your logs, please
attach them to this report so that we can hunt through and look for
common patterns.  I'd also recommend a run of memtest86 to check for
possible memory problems.

Comment 2 Marek Greško 2004-02-19 09:17:27 UTC

The machine did not pass memtest86. I will change RAM tomorrow. Sorry.
You can close the bug. Just for case it will not help there is another
log.

Feb 19 02:13:27 mail kernel: Assertion failure in
__journal_drop_transaction() a
t checkpoint.c:597: "transaction->t_ilist == NULL"
Feb 19 02:13:27 mail kernel: ------------[ cut here ]------------
Feb 19 02:13:27 mail kernel: kernel BUG at checkpoint.c:597!
Feb 19 02:13:27 mail kernel: invalid operand: 0000
Feb 19 02:13:27 mail kernel: n_hdlc ppp_synctty ppp_async ppp_generic
slhc 8139t
oo mii sis900 ipt_TOS ipt_MARK iptable_mangle ipt_LOG ipt_limit
ipt_state ipt_RE
JECT ipt_REDIRECT iptable_f
Feb 19 02:13:27 mail kernel: CPU:    0
Feb 19 02:13:27 mail kernel: EIP:    0060:[<cf811e6c>]    Not tainted
Feb 19 02:13:27 mail kernel: EFLAGS: 00010282
Feb 19 02:13:27 mail kernel:
Feb 19 02:13:27 mail kernel: EIP is at __journal_drop_transaction
[jbd] 0x36c (2
.4.22-1.2166.nptl)
Feb 19 02:13:27 mail kernel: eax: 00000069   ebx: cd0f5140   ecx:
00000001   edx
: cd199f7c
Feb 19 02:13:27 mail kernel: esi: c12e5380   edi: cd0f5140   ebp:
cd91a000   esp
: cd91be08
Feb 19 02:13:27 mail kernel: ds: 0068   es: 0068   ss: 0068
Feb 19 02:13:27 mail kernel: Process kjournald (pid: 12,
stackpage=cd91b000)
Feb 19 02:13:27 mail kernel: Stack: cf815fa0 cf814d30 cf814cc4
00000255 cf814d85
 cd0f5140 cf8119aa c12e5380
Feb 19 02:13:27 mail kernel:        cd0f5140 cdf79140 cebc63c0
cf81132f cebc63c0
 cebc63c0 cf81192d cebc63c0
Feb 19 02:13:27 mail kernel:        cebc63c0 00000003 00000001
ccc28840 cd0f5a40
 cd0f5cc0 c12e5380 00000000
Feb 19 02:13:27 mail kernel: Call Trace:   [<cf815fa0>]
.rodata.str1.32 [jbd] 0x
e80 (0xcd91be08)
Feb 19 02:13:27 mail kernel: [<cf814d30>] .rodata.str1.1 [jbd] 0x3c8
(0xcd91be0c
)
Feb 19 02:13:27 mail kernel: [<cf814cc4>] .rodata.str1.1 [jbd] 0x35c
(0xcd91be10
)
Feb 19 02:13:27 mail kernel: [<cf814d85>] .rodata.str1.1 [jbd] 0x41d
(0xcd91be18
)
Feb 19 02:13:27 mail kernel: [<cf8119aa>] __journal_remove_checkpoint
[jbd] 0x4a
 (0xcd91be20)
Feb 19 02:13:27 mail kernel: [<cf81132f>] __try_to_free_cp_buf [jbd]
0x2f (0xcd9
1be34)
Feb 19 02:13:27 mail kernel: [<cf81192d>]
__journal_clean_checkpoint_list [jbd]
0x6d (0xcd91be40)
Feb 19 02:13:27 mail kernel: [<cf80f7df>] journal_commit_transaction
[jbd] 0x10f
 (0xcd91be6c)
Feb 19 02:13:27 mail kernel: [<c01190e8>] schedule [kernel] 0x2b8
(0xcd91bf64)
Feb 19 02:13:27 mail kernel: [<cf812a14>] kjournald [jbd] 0x134
(0xcd91bfbc)
Feb 19 02:13:27 mail kernel: [<cf8128c0>] commit_timeout [jbd] 0x0
(0xcd91bfd8)
Feb 19 02:13:27 mail kernel: [<cf8128e0>] kjournald [jbd] 0x0 (0xcd91bfe8)
Feb 19 02:13:27 mail kernel: [<c010719d>] kernel_thread_helper
[kernel] 0x5 (0xc
d91bff0)
Feb 19 02:13:27 mail kernel:
Feb 19 02:13:27 mail kernel:
Feb 19 02:13:27 mail kernel: Code: 0f 0b 55 02 c4 4c 81 cf e9 ba fc ff
ff 8b 53
44 31 c0 39 da

Comment 3 Marek Greško 2004-03-15 18:41:14 UTC

After changing all components except CPU the problem remains. Also
after upgrading to kernel 2.4.22-1.2174.nptl. Memtest passes. I also
reinstalled the system and reformatted partitions. I am going to
change the motherboard to a one with other chipset tomorrow.

I discovered it is not related to heavy disk I/O load. The lockup time
varies between two a clock and seven a clock in the morning and it is
every second or third day after upgrade to the new kernel. I haven't
noticed anything in the log now, but I can send a fotography of the
console dump.

Kernel reports some problem at __journal_drop_transaction.

I suspect it could be related to some bug in USB code I heard about.

Comment 4 Stephen Tweedie 2004-03-15 20:00:01 UTC

"Kernel reports some problem at __journal_drop_transaction"

We really need to see the exact error.  This looks very, very much
like random memory corruption hitting a certain offset in memory; it
could be a CPU cache problem, just as easily.  Without seeing the
exact error, it's impossible to start eliminating possibilities.

Comment 5 Marek Greško 2004-03-26 09:58:03 UTC

After changing the CPU problem remains. The only not changed component
is CD-ROM drive. I tested also kernel parameter acpi=off. Nothing
helps.   Red Hat kernel 2.4.20-30.9 works. There does not appear
anything in the logs but there is log on the console. I can send
photographs of the screen (Shift Page-Up works)in Jpeg. To what e-mail
address can I send it?

Comment 6 Marek Greško 2004-03-26 10:03:39 UTC

We are running pppoe over DSL on that machine. It could be related to
bug 110941

Comment 7 Stephen Tweedie 2004-03-26 18:00:27 UTC

Any debugging files or console logs can be attached to the bugzilla
report via the "create a new attachment" link.  If you've got another
box, the easiest way to capture it is usually to set up serial console
and use ttywatch to record the logs.

Comment 8 Marek Greško 2004-03-29 06:29:52 UTC

Created attachment 98916 [details]
Screenshot of the console log.

Comment 9 Stephen Tweedie 2004-03-29 12:33:33 UTC

OK, thanks.  We're still seeing non-zero in a field that ext3 simply
does not use.  I suspect what's happening is that the pppoe is using a
field after the struct has been freed, and that is resulting in ext3
corruption simply because ext3 happens to allocate transactions from
the same kmalloc bucket size.

Can you rebuild a kernel with CONFIG_DEBUG_SLAB set and see if you can
reproduce?

Comment 10 Marek Greško 2004-04-01 14:46:39 UTC

Sorry, I am unable to compile kernel. I copied
/boot/config-2.4.22-1.2174.nptl to .config edited CONFIG_DEBUG_SLAB=y,
ran make oldconfig; make dep; make bzImage.... and I get following
error. What am I doing wrong? How did you compile the kernel?

make[2]: Entering directory
`/usr/src/linux-2.4.22-1.2174.nptl/drivers/acpi'
gcc32 -D__KERNEL__ -I/usr/src/linux-2.4.22-1.2174.nptl/include -Wall
-Wstrict-pr
ototypes -Wno-trigraphs -O2 -fomit-frame-pointer -fno-strict-aliasing
-fno-commo
n -fomit-frame-pointer -pipe -mpreferred-stack-boundary=2
-march=athlon -DMODULE
 -DMODVERSIONS -include
/usr/src/linux-2.4.22-1.2174.nptl/include/linux/modversi
ons.h -Os  -nostdinc -iwithprefix include -DKBUILD_BASENAME=ac  -c -o
ac.o ac.c
In file included from ac.c:26:
/usr/src/linux-2.4.22-1.2174.nptl/include/linux/kernel.h:60: invalid
suffix on i
nteger constant
/usr/src/linux-2.4.22-1.2174.nptl/include/linux/kernel.h:60: syntax
error before
 numeric constant
/usr/src/linux-2.4.22-1.2174.nptl/include/linux/kernel.h:60: warning:
function d
eclaration isn't a prototype
... and so on.

Comment 11 Stephen Tweedie 2004-04-01 16:06:14 UTC

Line 60 of /usr/src/linux-2.4.22-1.2174.nptl/include/linux/kernel.h
looks like:

extern struct notifier_block *panic_notifier_list;

so I can't see where those errors would possibly be coming from. 
Could you check the package is still intact, with "rpm -V kernel-source"?

Comment 12 Marek Greško 2004-04-01 17:38:59 UTC

I tested compiling also on another box, fresh kernel-source install
and I reproduce the results. Running make bzImage works, but make
modules fails on ac.c. Maybe it depends on .config file. The line 60
of /usr/src/linux-2.4.22-1.2174.nptl/include/linux/kernel.h looks the
same as yours.

# LANG=C rpm --checksig kernel-source-2.4.22-1.2174.nptl.i386.rpm 
kernel-source-2.4.22-1.2174.nptl.i386.rpm: (SHA1) DSA sha1 md5 (GPG)
NOT OK (MISSING KEYS: GPG#4f2a6fd2) 

rpm -V kernel-source
nothing suspicious

Could you, please, send me right .config file with CONFIG_DEBUG_SLAB
enabled which works for you? Thanks.

Comment 13 Stephen Tweedie 2004-04-01 20:53:09 UTC

Try "make mrproper" before doing the configure/build, that clears the
kernel tree ready for a new build.

Comment 14 Marek Greško 2004-04-02 12:04:15 UTC

OK, works. Sorry, I didn't expect the tree is not clean after fresh
install.

Comment 15 Marek Greško 2004-04-05 08:03:08 UTC

I tested the kernel with CONFIG_DEBUG_SLAB enabled. I got no other
information than the previous. Or I can't see the difference. There is
nothing in the log and I have got screenshots looking similar like the
previous. Will it help you?

Comment 16 Marek Greško 2004-04-05 10:53:39 UTC

I doublechecked the /var/log messages and I found log today.

Apr  4 04:25:03 mail kernel: Assertion failure in
__journal_drop_transaction() a
t checkpoint.c:597: "transaction->t_ilist == NULL"
Apr  4 04:25:03 mail kernel: ------------[ cut here ]------------
Apr  4 04:25:03 mail kernel: kernel BUG at checkpoint.c:597!
Apr  4 04:25:03 mail kernel: invalid operand: 0000
Apr  4 04:25:03 mail kernel: n_hdlc ppp_synctty ppp_async ppp_generic
slhc 8139t
oo mii sis900 ipt_TOS ipt_MARK iptable_mangle ipt_LOG ipt_limit
ipt_state ipt_RE
JECT ipt_MASQUERADE ipt_RED
Apr  4 04:25:03 mail kernel: CPU:    0
Apr  4 04:25:03 mail kernel: EIP:    0060:[<ce811e6c>]    Not tainted
Apr  4 04:25:03 mail kernel: EFLAGS: 00010282
Apr  4 04:25:03 mail kernel: 
Apr  4 04:25:03 mail kernel: EIP is at __journal_drop_transaction
[jbd] 0x36c (2
.4.22-1.2174.nptlcustom)
Apr  4 04:25:03 mail kernel: eax: 00000069   ebx: ccb5da54   ecx:
00000001   edx
: ccb71f7c
Apr  4 04:25:03 mail kernel: esi: cdf3639c   edi: ccb5da54   ebp:
cd7a8000   esp
: cd7a9e08
Apr  4 04:25:03 mail kernel: ds: 0068   es: 0068   ss: 0068
Apr  4 04:25:03 mail kernel: Process kjournald (pid: 12,
stackpage=cd7a9000)
Apr  4 04:25:04 mail kernel: Stack: ce815fa0 ce814d30 ce814cc4
00000255 ce814d85
 ccb5da54 ce8119aa cdf3639c 
Apr  4 04:25:04 mail kernel:        ccb5da54 cbe719c0 c1c0b79c
ce81132f c1c0b79c
 c1c0b79c ce81192d c1c0b79c 
Apr  4 04:25:04 mail kernel:        c1c0b79c 00000003 00000059
cc093f1c cc093f1c
 cd20625c cdf3639c 00000000 
Apr  4 04:25:04 mail kernel: Call Trace:   [<ce815fa0>]
.rodata.str1.32 [jbd] 0x
e80 (0xcd7a9e08)
Apr  4 04:25:04 mail kernel: [<ce814d30>] .rodata.str1.1 [jbd] 0x3c8
(0xcd7a9e0c
)
Apr  4 04:25:04 mail kernel: [<ce814cc4>] .rodata.str1.1 [jbd] 0x35c
(0xcd7a9e10
)
Apr  4 04:25:04 mail kernel: [<ce814d85>] .rodata.str1.1 [jbd] 0x41d
(0xcd7a9e18
)
Apr  4 04:25:04 mail kernel: [<ce8119aa>] __journal_remove_checkpoint
[jbd] 0x4a
 (0xcd7a9e20)
Apr  4 04:25:04 mail kernel: [<ce81132f>] __try_to_free_cp_buf [jbd]
0x2f (0xcd7
a9e34)
Apr  4 04:25:04 mail kernel: [<ce81192d>]
__journal_clean_checkpoint_list [jbd] 
0x6d (0xcd7a9e40)
Apr  4 04:25:04 mail kernel: [<ce80f7df>] journal_commit_transaction
[jbd] 0x10f
 (0xcd7a9e6c)
Apr  4 04:25:04 mail kernel: [<c01190a8>] schedule [kernel] 0x2b8
(0xcd7a9f64)
Apr  4 04:25:04 mail kernel: [<ce812a14>] kjournald [jbd] 0x134
(0xcd7a9fbc)
Apr  4 04:25:04 mail kernel: [<ce8128c0>] commit_timeout [jbd] 0x0
(0xcd7a9fd8)
Apr  4 04:25:04 mail kernel: [<ce8128e0>] kjournald [jbd] 0x0 (0xcd7a9fe8)
Apr  4 04:25:04 mail kernel: [<c010719d>] kernel_thread_helper
[kernel] 0x5 (0xc
d7a9ff0)
Apr  4 04:25:04 mail kernel: 
Apr  4 04:25:04 mail kernel: 
Apr  4 04:25:04 mail kernel: Code: 0f 0b 55 02 c4 4c 81 ce e9 ba fc ff
ff 8b 53 
44 31 c0 39 da 



This was in /var/log/debug (*.=debug /var/log/debug):
Apr  2 11:29:36 mail kernel: divert: allocating divert_blk for eth0
Apr  2 11:29:36 mail kernel: divert: allocating divert_blk for eth1
Apr  2 11:29:36 mail kernel: eth1:  Identified 8139 chip type 'RTL-8139B'
Apr  2 11:29:36 mail kernel: divert: freeing divert_blk for eth0
Apr  2 11:29:36 mail kernel: divert: freeing divert_blk for eth1
Apr  2 11:29:36 mail kernel: divert: allocating divert_blk for eth0
Apr  2 11:29:37 mail kernel: divert: allocating divert_blk for eth1
Apr  2 11:29:37 mail kernel: eth1:  Identified 8139 chip type 'RTL-8139B'
Apr  2 11:29:37 mail kernel: divert: not allocating divert_blk for
non-ethernet 
device ppp0

I expected mode debug info on CONFIG_DEBUG_SLAB. Maybe I missed
something...

Comment 17 Stephen Tweedie 2004-04-05 13:22:41 UTC

CONFIG_DEBUG_SLAB doesn't change the existing debugging, it simply
adds additional debugging checks on top of them.  If there's slab
corruption happening (as appears to be the case here), then the slab
debugging adds new checks which may pinpoint the problem more accurately.

But it's still going to be pretty much random whether the new checks
trap the fault first or the old ones.  We may have to wait for a few
instances of the fault before the new checks come up with anything.

Comment 18 Marek Greško 2004-05-11 08:33:05 UTC

I notices this problem on another machine. It was also dead at 4:02
am. The only difference is that the machine was completely dead. I was
not able to scroll the console, nothing in the logs. The only
difference is the lower frequency of crashes. This was the first after
upgrading to Fedora Core 1 (two or three weeks). The kernel is
2.4.22-1.2188.nptlsmp.i686.

The one who saw the console said there was something about process
chown. I expect it died during some syscall from the chown after sarg.
The log of sarg was too short. It usually logs megabytes not
kilobytes. But this may be because of buffered log.

I cannot understand how it is possible that nobody else encounters
this problem, only me, and on the two machines... this one does not
run PPPoE, so it is not related to it. It is definitely related to
some cron jobs run in the early morning (4:00 am). Maybe the hot
candidates should be sarg and possible prelink.

Comment 19 Stephen Tweedie 2004-05-11 08:53:17 UTC

The one-off error on a different box sounds completely unrelated. 
It's the kind of thing that happens sometimes, most often due to bad
harwdare in my experience.  There's really nothing I can add to that
without an oops log.

Comment 20 Marek Greško 2004-05-11 09:16:25 UTC

The second machine was running Red Hat 9 for 5 months without any
problem. The hardware is completely different than the first one. It
is P4 on Abit IC7-G board. Adaptec SCSI adapter and 3 SCSI disks. So I
don't expect it could be a hardware bug.

I am not able to catch the oops log, because the machine is completely
dead. Unable to scroll the console.

Comment 21 Stephen Tweedie 2004-05-11 09:39:06 UTC

Serial console is invaluable for such cases.  Without that, the second
box's bug report is just "an oops happened", which is not debuggable.

Comment 22 Marek Greško 2004-06-22 11:58:14 UTC

OK. It was definitely random. Kernel does not freeze any more. Also 
the first machine is functioning properly for several days.

Kernel 2188 solved the problem. Probably it was happening due to 
removed low latency patch.

Note You need to log in before you can comment on or make changes to this bug.