Bug 99447 - kernel crashed when access files hard with 3ware 6410, reiserfs, and RAID 5
Summary: kernel crashed when access files hard with 3ware 6410, reiserfs, and RAID 5
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Linux
Classification: Retired
Component: kernel
Version: 9
Hardware: i386
OS: Linux
medium
high
Target Milestone: ---
Assignee: Arjan van de Ven
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2003-07-19 19:50 UTC by Kazushi Marukawa
Modified: 2007-04-18 16:55 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2004-09-30 15:41:19 UTC
Embargoed:


Description Kazushi Marukawa 2003-07-19 19:50:23 UTC
From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 
1.1.4322)

Description of problem:
When I built a software RAID 5 array from four WD800AB hard drives and put it 
under heavy I/O load, I got error messages and a kernel panic.  The error 
messages are attached at the bottom.

The RAID 5 array and the reiserfs filesystem were both created by the RH9 
installer, with a 64k chunk size and no special options.  When I ran 
"iozone -s 16G -r 4096" on this filesystem to measure read/write speed on a 
16GB data set, Linux crashed.

I reinitialized reiserfs and did this test again.  The same crash happened.

If I use ext3j as the filesystem (on RAID 5 and the Escalade 6410), or run the 
test on the reiserfs "/" partition (which uses neither RAID 5 nor the Escalade 
6410), no crash occurs.  So I guess this happens only with the combination of 
reiserfs and RAID 5 on the Escalade 6410.

I would appreciate it if someone could look into this problem in depth.  Thanks.

Version-Release number of selected component (if applicable):
kernel-smp-2.4.20-8, reiserfs-utils-3.6.4-5

How reproducible:
Always

Steps to Reproduce:
1. Attach 4 hard drives (preferably WD800AB) to an Escalade 6410
2. Make RAID 5 on them.  Create reiserfs on RAID 5.
3. Run iozone -s 16G -r 4096
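
To give a sense of the load in step 3, the arithmetic below is a rough sketch of how the iozone parameters map onto the array geometry described in this report. It assumes all four drives are active members (no spare) and that iozone reads the bare "-r 4096" as a record size in KB; the variable names are illustrative only.

```python
# Sizes taken from the report: iozone -s 16G -r 4096, four drives, 64k chunks.
file_size_kib = 16 * 1024 * 1024   # -s 16G, expressed in KiB
record_size_kib = 4096             # -r 4096 (assumed to mean KB, i.e. 4 MiB)
chunk_kib = 64                     # RAID chunk size from the report
drives = 4                         # assumption: all four drives are members

# RAID 5 stores one parity chunk per stripe, so n drives hold n-1 data chunks.
data_per_stripe_kib = (drives - 1) * chunk_kib

records_per_pass = file_size_kib // record_size_kib
stripes_per_record = record_size_kib / data_per_stripe_kib

print("data per stripe (KiB):", data_per_stripe_kib)
print("records per pass:", records_per_pass)
print("stripes touched per record: %.2f" % stripes_per_record)
```

Under these assumptions each record spans many full stripes, so every drive in the array is kept busy for the whole run.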


Actual Results:  System crash.

Expected Results:  Benchmark numbers from iozone instead of an immediate 
crash.

Additional info:
Log messages are here.

Jul 11 10:40:16 xing kernel: 3w-xxxx: scsi0: Command failed: status = 0xc7, 
flags = 0x1b, unit #0.
Jul 11 10:40:16 xing last message repeated 30 times
Jul 11 10:40:16 xing kernel: 3w-xxxx: scsi0: AEN: WARNING: ATA port timeout: 
Port #0.
Jul 11 10:40:16 xing last message repeated 18 times
Jul 11 10:40:16 xing kernel: 3w-xxxx: scsi0: AEN: INFO: AEN queue overflow.
Jul 11 10:40:42 xing kernel: 3w-xxxx: scsi0: Command failed: status = 0xc7, 
flags = 0x1b, unit #0.
Jul 11 10:40:42 xing last message repeated 13 times
Jul 11 10:40:43 xing kernel: 3w-xxxx: scsi0: AEN: WARNING: ATA port timeout: 
Port #0.
Jul 11 10:40:43 xing last message repeated 13 times
Jul 11 10:40:43 xing kernel: 3w-xxxx: scsi0: Reset succeeded.
Jul 11 10:40:53 xing kernel: 3w-xxxx: scsi0: Command failed: status = 0xcf, 
flags = 0x0, unit #0.
Jul 11 10:40:53 xing kernel: 3w-xxxx: scsi0: AEN: ERROR: Drive error: Port #0.
Jul 11 10:40:53 xing kernel: scsi: device set offline - not ready or command 
retry failed after host reset: host 0 channel 0 id 0 lun 0
Jul 11 10:40:53 xing kernel: 3w-xxxx: scsi0: Command failed: status = 0xcf, 
flags = 0x0, unit #0.
Jul 11 10:40:53 xing kernel: scsi: device set offline - not ready or command 
retry failed after host reset: host 0 channel 0 id 0 lun 0
Jul 11 10:40:53 xing kernel: 3w-xxxx: scsi0: Command failed: status = 0xcf, 
flags = 0x0, unit #0.
   <a few lines cut here>
Jul 11 10:40:53 xing kernel: scsi: device set offline - not ready or command 
retry failed after host reset: host 0 channel 0 id 0 lun 0
Jul 11 10:40:53 xing kernel: SCSI disk error : host 0 channel 0 id 0 lun 0 
return code = 2
Jul 11 10:40:53 xing kernel:  I/O error: dev 08:01, sector 92736168
Jul 11 10:40:53 xing kernel:  I/O error: dev 08:01, sector 92736176
Jul 11 10:40:53 xing kernel: SCSI disk error : host 0 channel 0 id 0 lun 0 
return code = 2
Jul 11 10:40:53 xing kernel:  I/O error: dev 08:01, sector 92741288
Jul 11 10:40:53 xing kernel:  I/O error: dev 08:01, sector 92741296
   <cut here again>
Jul 11 10:40:54 xing kernel: SCSI disk error : host 0 channel 0 id 0 lun 0 
return code = 2
Jul 11 10:40:54 xing kernel:  I/O error: dev 08:01, sector 92736680
Jul 11 10:40:54 xing kernel:  I/O error: dev 08:01, sector 92736688
Jul 11 10:40:54 xing kernel:  I/O error: dev 08:01, sector 14352
Jul 11 10:40:54 xing kernel: journal-601, buffer write failed
Jul 11 10:40:54 xing kernel: ------------[ cut here ]------------
Jul 11 10:40:54 xing kernel: kernel BUG at prints.c:334!
Jul 11 10:40:54 xing kernel: invalid operand: 0000
Jul 11 10:40:54 xing kernel: nfs lockd sunrpc autofs 3c59x ext3 jbd keybdev 
mousedev hid input usb-uhci usbcore reiserfs raid0 3w-xxxx sd_mod scsi_mod
Jul 11 10:40:54 xing kernel: CPU:    0
Jul 11 10:40:55 xing kernel: EIP:    0060:[<e884d7e8>]    Not tainted
Jul 11 10:40:55 xing kernel: EFLAGS: 00010286
Jul 11 10:40:55 xing kernel:
Jul 11 10:40:55 xing kernel: EIP is at reiserfs_panic [reiserfs] 0x38 (2.4.20-8)
Jul 11 10:40:55 xing kernel: eax: 00000024   ebx: e70c5800   ecx: 00000001   
edx: 00000001
Jul 11 10:40:55 xing kernel: esi: 00000000   edi: 00000049   ebp: e70c5800   
esp: e7fe5e94
Jul 11 10:40:55 xing kernel: ds: 0068   es: 0068   ss: 0068
Jul 11 10:40:55 xing kernel: Process kupdated (pid: 10, stackpage=e7fe5000)
Jul 11 10:40:55 xing kernel: Stack: e88662f0 e886aa00 00000049 e8a9c6d8 
e8858bca e70c5800 e8863aa0 00001000
Jul 11 10:40:55 xing kernel:        00000000 0000004c e7fe4000 0000004a 
00000000 e3c8f900 e70c5800 00000000
Jul 11 10:40:55 xing kernel:        e7fe5f74 00000017 e885c076 e70c5800 
e8a9c6d8 00000001 00000001 00000017
Jul 11 10:40:55 xing kernel: Call Trace:   [<e88662f0>] .rodata.str1.1 
[reiserfs] 0x4ee (0xe7fe5e94))
Jul 11 10:40:55 xing kernel: [<e886aa00>] error_buf [reiserfs] 0x0 
(0xe7fe5e98))
Jul 11 10:40:55 xing kernel: [<e8858bca>] flush_commit_list [reiserfs] 0x2ea 
(0xe7fe5ea4))
Jul 11 10:40:55 xing kernel: [<e8863aa0>] .rodata.str1.32 [reiserfs] 0x3440 
(0xe7fe5eac))
Jul 11 10:40:55 xing kernel: [<e885c076>] check_journal_end [reiserfs] 0x106 
(0xe7fe5edc))
Jul 11 10:40:55 xing kernel: [<e885c764>] do_journal_end [reiserfs] 0xe4 
(0xe7fe5f08))
Jul 11 10:40:55 xing kernel: [<c012595f>] update_process_times [kernel] 0x3f 
(0xe7fe5f1c))
Jul 11 10:40:55 xing kernel: [<e885bee5>] flush_old_commits [reiserfs] 0x125 
(0xe7fe5f60))
Jul 11 10:40:55 xing kernel: [<e88665cc>] .rodata.str1.1 [reiserfs] 0x7ca 
(0xe7fe5f74))
Jul 11 10:40:55 xing kernel: [<e884a5f0>] reiserfs_write_super [reiserfs] 0x30 
(0xe7fe5fa4))
Jul 11 10:40:55 xing kernel: [<c014c38b>] sync_supers [kernel] 0xbb 
(0xe7fe5fb4))
Jul 11 10:40:55 xing kernel: [<c014b6cc>] sync_old_buffers [kernel] 0x1c 
(0xe7fe5fc8))
Jul 11 10:40:55 xing kernel: [<c014ba54>] kupdate [kernel] 0xa4 (0xe7fe5fd4))
Jul 11 10:40:55 xing kernel: [<c014b9b0>] kupdate [kernel] 0x0 (0xe7fe5fe4))
Jul 11 10:40:55 xing kernel: [<c010742d>] kernel_thread_helper [kernel] 0x5 
(0xe7fe5ff0))
Jul 11 10:40:55 xing kernel:
Jul 11 10:40:55 xing kernel:
Jul 11 10:40:55 xing kernel: Code: 0f 0b 4e 01 f6 62 86 e8 85 db 74 0e 0f b7 43
08 89 04 24 e8
Jul 11 10:40:55 xing kernel:   I/O error: dev 08:01, sector 92741960
Jul 11 10:40:55 xing kernel:  I/O error: dev 08:01, sector 92742216
Jul 11 10:40:55 xing kernel:  I/O error: dev 08:01, sector 92742472
   <cut here again; only I/O error reports until the end>
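
For anyone reading the log above, the following is a minimal sketch of how the "I/O error: dev 08:01, sector N" lines can be decoded. It assumes the device is printed as hexadecimal major:minor, that major 8 is the SCSI disk driver with 16 minors per disk, and that sectors are 512 bytes; the function names are made up for this illustration and the sample lines are copied from the log.

```python
import re

# Matches 2.4-style block-layer messages such as
#   "I/O error: dev 08:01, sector 92736168"
IO_ERROR = re.compile(r"I/O error: dev ([0-9a-f]{2}):([0-9a-f]{2}), sector (\d+)")

SECTOR_BYTES = 512  # assumption: sector numbers are in 512-byte units


def parse_io_errors(log_text):
    """Return (major, minor, sector) tuples for every I/O error line."""
    return [
        (int(m.group(1), 16), int(m.group(2), 16), int(m.group(3)))
        for m in IO_ERROR.finditer(log_text)
    ]


def scsi_disk_name(major, minor):
    """Map a major-8 SCSI disk device number to its conventional name."""
    if major != 8:
        raise ValueError("only SCSI disk major 8 is handled in this sketch")
    disk, part = divmod(minor, 16)  # 16 minors per disk: whole disk + 15 partitions
    name = "sd" + chr(ord("a") + disk)
    return name + (str(part) if part else "")


sample = """\
Jul 11 10:40:53 xing kernel:  I/O error: dev 08:01, sector 92736168
Jul 11 10:40:53 xing kernel:  I/O error: dev 08:01, sector 92736176
Jul 11 10:40:54 xing kernel:  I/O error: dev 08:01, sector 14352
"""

errors = parse_io_errors(sample)
for major, minor, sector in errors:
    # Print the device name, the sector, and the byte offset of that sector.
    print(scsi_disk_name(major, minor), sector, sector * SECTOR_BYTES)
```

With these assumptions, every failing request in the log maps to the first partition of the first SCSI disk, which is consistent with the "host 0 channel 0 id 0 lun 0" messages.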

Comment 1 Hrunting Johnson 2003-08-09 00:15:01 UTC
We are having similar problems with software RAID5 and ext3 on a 3ware 7500 
12-port card with WD 250GB drives.  We see the same 3ware errors as listed in the 
above log output, except in our case, the kernel doesn't panic.  Instead, the 
filesystem becomes corrupt and the box just stops working.  Upon reboot, a lot 
of data is lost/corrupted.

The errors from the 3ware driver happen sporadically for hours before either 
the drive is marked bad, the system notices the corrupted filesystem, or the 
system dies (or all three, as just happened).  It sure looks like the 3ware 
card is getting notices about problems at the drive level and it's not making 
it back to the kernel so it can flag the drive as failed and stop writing to 
it.  I see drive errors, warnings, and resets and yet, the drive is not removed 
from the RAID.
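
One way to check whether the md layer has actually flagged a member is to look at /proc/mdstat. The sketch below parses text in that format and reports members marked "(F)" along with the expected and active device counts. The sample text is hypothetical (it is not taken from either system in this report), and the function name is illustrative only.

```python
import re


def md_status(mdstat_text):
    """Return {array: {"failed": [...], "expected": n, "active": m}}."""
    arrays = {}
    current = None
    for line in mdstat_text.splitlines():
        head = re.match(r"(md\d+) : (.*)", line)
        if head:
            current = head.group(1)
            # Member tokens look like "sda1[0]" or "sda1[0](F)".
            members = [tok for tok in head.group(2).split() if "[" in tok]
            failed = [tok.split("[")[0] for tok in members if "(F)" in tok]
            arrays[current] = {"failed": failed, "expected": None, "active": None}
            continue
        counts = re.search(r"\[(\d+)/(\d+)\]", line)
        if counts and current:
            arrays[current]["expected"] = int(counts.group(1))
            arrays[current]["active"] = int(counts.group(2))
    return arrays


# Hypothetical /proc/mdstat contents for a degraded four-drive RAID 5 array.
sample_mdstat = """\
Personalities : [raid5]
md0 : active raid5 sdd1[3] sdc1[2] sdb1[1] sda1[0](F)
      234436224 blocks level 5, 64k chunk, algorithm 0 [4/3] [_UUU]
unused devices: <none>
"""

status = md_status(sample_mdstat)
for name, info in status.items():
    print(name, "failed:", info["failed"], "active:", info["active"], "of", info["expected"])
```

On a real system the text would come from reading /proc/mdstat; if the controller logs drive errors while this still shows every member as up, that would match the behavior described above.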

Comment 2 Hrunting Johnson 2003-08-09 00:16:43 UTC
To add to this, the same kernel, same driver, same RAID setup on 3ware 7500-8 
port cards with WD-120GB drives don't have this problem.  We do occasionally 
drop a drive on these systems, but we've never had filesystem corruption.  On 
the bad systems, the firmware and drivers are both updated to the latest from 
3ware.

Comment 3 Kazushi Marukawa 2003-08-09 04:55:14 UTC
I was getting a kernel panic whenever the drive on channel 0 hung.  So I 
swapped the drives on channels 0 and 1.  Then I stressed the RAID5 array again 
and got an ordinary drive failure on the drive on channel 1.  This time the 
kernel did not panic.  I did not check the filesystem, so I am not sure whether 
it was corrupted.

I replaced the drive that kept hanging under heavy stress with a new one.  
After that, everything worked very well.

However, I now know that RH9 software RAID5 may crash or corrupt the filesystem 
if one of the drives hangs.  I would really appreciate it if this problem could 
be investigated and fixed in RH9.  BTW, Johnson, if I were you, I would switch 
to hardware RAID5, since it should be more stable.  I'm thinking of buying a 7506.

Comment 4 Bugzilla owner 2004-09-30 15:41:19 UTC
Thanks for the bug report. However, Red Hat no longer maintains this version of
the product. Please upgrade to the latest version and open a new bug if the problem
persists.

The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, 
and if you believe this bug is interesting to them, please report the problem in
the bug tracker at: http://bugzilla.fedora.us/


