Bug 243878

Summary: kernel BUG when rebuilding raid5 array with hardware failures
Product: Red Hat Enterprise Linux 4
Version: 4.5
Component: kernel
Status: CLOSED WONTFIX
Severity: low
Priority: low
Reporter: Orion Poplawski <orion>
Assignee: Doug Ledford <dledford>
QA Contact: Martin Jenner <mjenner>
CC: jbaron, mgahagan
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2009-04-07 13:31:02 UTC
Attachments: Full logs

Description Orion Poplawski 2007-06-12 14:55:26 UTC
Description of problem:

Reporting here mainly for the record - I'm running CentOS, not the official RHEL.

With kernel 2.6.9-55.plus.c4smp, I was rebuilding a raid5 array that encountered
hardware failures during the rebuild.  This ended up triggering a kernel bug
when the sync completed:

Jun 12 07:08:36 alexandria kernel: md: md2: sync done.
Jun 12 07:08:36 alexandria kernel: eip: c011e913
Jun 12 07:08:36 alexandria kernel: ------------[ cut here ]------------
Jun 12 07:08:36 alexandria kernel: kernel BUG at include/asm/spinlock.h:146!
Jun 12 07:08:36 alexandria kernel: invalid operand: 0000 [#1]
Jun 12 07:08:36 alexandria kernel: SMP
Jun 12 07:08:36 alexandria kernel: Modules linked in: loop nfs nfsd exportfs lockd nfs_acl md5 ipv6 parport_pc lp parport autofs4 w83627hf w83781d adm1021 i2c_sensor i2c_isa i2c_i801 i2c_dev i2c_core sunrpc jfs button battery ac raid5 xor uhci_hcd hw_random eepro100 e1000 e100 mii sata_mv dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod mv_sata(U) ata_piix libata sd_mod scsi_mod
Jun 12 07:08:36 alexandria kernel: CPU:    2
Jun 12 07:08:36 alexandria kernel: EIP:    0060:[<c02e174b>]    Not tainted VLI
Jun 12 07:08:36 alexandria kernel: EFLAGS: 00010046   (2.6.9-55.plus.c4smp)
Jun 12 07:08:36 alexandria kernel: EIP is at _spin_lock_irqsave+0x20/0x45
Jun 12 07:08:36 alexandria kernel: eax: c011e913   ebx: 00000246   ecx: c02f58b6   edx: c02f58b6
Jun 12 07:08:36 alexandria kernel: esi: f1057f54   edi: d04ce000   ebp: d04cef9c   esp: d04cef84
Jun 12 07:08:36 alexandria kernel: ds: 007b   es: 007b   ss: 0068
Jun 12 07:08:36 alexandria kernel: Process md2_resync (pid: 17435, threadinfo=d04ce000 task=e04198b0)
Jun 12 07:08:36 alexandria kernel: Stack: f1057f50 f1057f54 c011e913 da1e18c0 00000000 d04ce000 00000000 c02804b2
Jun 12 07:08:36 alexandria kernel:        00000000 e04198b0 c0120561 d04cefd0 d04cefd0 f1057ee4 c02e2ac6 f736d730
Jun 12 07:08:36 alexandria kernel:        00000000 e04198b0 c0120561 d04cefd0 d04cefd0 00000000 00000000 0000007b
Jun 12 07:08:36 alexandria kernel: Call Trace:
Jun 12 07:08:36 alexandria kernel:  [<c011e913>] complete+0x12/0x3d
Jun 12 07:08:36 alexandria kernel:  [<c02804b2>] md_thread+0x15f/0x168
Jun 12 07:08:36 alexandria kernel:  [<c0120561>] autoremove_wake_function+0x0/0x2d
Jun 12 07:08:36 alexandria kernel:  [<c02e2ac6>] ret_from_fork+0x6/0x14
Jun 12 07:08:36 alexandria kernel:  [<c0120561>] autoremove_wake_function+0x0/0x2d
Jun 12 07:08:36 alexandria kernel:  [<c0280353>] md_thread+0x0/0x168
Jun 12 07:08:36 alexandria kernel:  [<c01041f5>] kernel_thread_helper+0x5/0xb
Jun 12 07:08:36 alexandria kernel: Code: 81 00 00 00 00 01 c3 f0 ff 00 c3 56 89 c6 53 9c 5b fa 81 78 04 ad 4e ad de 74 18 ff 74 24 08 68 b6 58 2f c0 e8 dd 11 e4 ff 59 58 <0f> 0b 92 00 21 49 2f c0 f0 fe 0e 79 13 f7 c3 00 02 00 00 74 01
Jun 12 07:08:36 alexandria kernel:  <0>Fatal exception: panic in 5 seconds

I've attached the full logs so you can see all of the md output.

Google turned up a couple of similar reports:

http://lists.centos.org/pipermail/centos/2005-March/003547.html
http://bugzilla.kernel.org/show_bug.cgi?id=6866

Comment 1 Orion Poplawski 2007-06-12 14:55:26 UTC
Created attachment 156799 [details]
Full logs

Comment 2 RHEL Program Management 2008-10-03 16:15:08 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 3 Doug Ledford 2008-12-16 22:31:57 UTC
I know this was quite a while ago, but can you possibly expand on what you mean by "I was rebuilding a raid5 array that encountered hardware failures during the rebuild"?  This report indicates that the oops happened at the end of the rebuild, when the resync thread was shutting down.  But raid5 tolerates only a single failure, so if the array experienced hardware failures during the rebuild, how did the rebuild ever complete in the first place?

Comment 4 Orion Poplawski 2008-12-16 23:21:31 UTC
What might be more relevant is the period starting at Jun 11 17:40:42, when a sync of md2 completes and then immediately several disk failures occur on that array.  This seems to send the md driver into some kind of loop attempting to sync the array, even though it doesn't have enough devices.
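(Editor's note: that kind of "resyncing without enough members" state shows up in the /proc/mdstat status field, e.g. [3/1] meaning 1 of 3 members active. A small illustrative check follows; the mdstat sample in it is fabricated for the example, not taken from the attached logs.)

```shell
#!/bin/sh
# Detect an array resyncing without enough devices by parsing the
# [total/active] field of a /proc/mdstat entry.  The sample text below
# is made up for illustration; on a live system read /proc/mdstat.
mdstat='md2 : active raid5 sdd1[3] sdc1[1] sdb1[0]
      489856 blocks level 5, 64k chunk, algorithm 2 [3/1] [U__]
      [==>..................]  recovery = 12.5%'

# Extract the first "[digits/digits]" field, e.g. "3/1".
status=$(printf '%s\n' "$mdstat" | sed -n 's@.*\[\([0-9]*/[0-9]*\)\].*@\1@p' | head -n1)
total=${status%/*}    # members the array should have
active=${status#*/}   # members currently active

if [ "$active" -lt "$total" ]; then
    echo "md2 degraded: $active of $total members active"
fi
```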

Otherwise, I'm not sure I can be of more help, and I haven't reproduced it (thankfully).  I believe I eventually got things back up by using badblocks -w to remap the bad sectors on the various drives before re-assembling the array.
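(Editor's note: that recovery route can be sketched roughly as below. The device names are placeholders, since the report doesn't name the array members; badblocks -w is a destructive write test that overwrites every sector, and a DRY_RUN guard keeps the sketch from touching anything unless explicitly disabled.)

```shell
#!/bin/sh
# Sketch of the recovery path described above: destructive write test to
# force the drive firmware to remap pending bad sectors, then re-assemble
# the degraded array.  Device names are placeholders, not from the report.
DRY_RUN=${DRY_RUN:-1}

# With DRY_RUN=1 (the default) commands are printed, never executed.
run() {
    if [ "$DRY_RUN" -eq 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# DESTRUCTIVE: badblocks -w overwrites all data on each device.
for dev in /dev/sdb1 /dev/sdc1 /dev/sdd1; do
    run badblocks -w -s "$dev"
done

# Re-assemble the raid5 array from the scrubbed members.
run mdadm --assemble --force /dev/md2 /dev/sdb1 /dev/sdc1 /dev/sdd1
```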

Comment 5 Doug Ledford 2009-04-07 13:31:02 UTC
I don't have the ability to reproduce this, and as it involves multiple drive failures in a raid level that's only tolerant of a single drive failure, it doesn't rank very high on the "needs fixing" list.  I'm closing this bug out as WONTFIX.  Please reopen if you think this needs further attention.