Bug 243878 - kernel BUG when rebuilding raid5 array with hardware failures
Status: CLOSED WONTFIX
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.5
Hardware: All
OS: Linux
Priority: low
Severity: low
Assigned To: Doug Ledford
QA Contact: Martin Jenner
Depends On:
Blocks:
Reported: 2007-06-12 10:55 EDT by Orion Poplawski
Modified: 2009-04-07 09:31 EDT
CC: 2 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2009-04-07 09:31:02 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:


Attachments
Full logs (363.17 KB, application/x-tar)
2007-06-12 10:55 EDT, Orion Poplawski

Description Orion Poplawski 2007-06-12 10:55:26 EDT
Description of problem:

Reporting here mainly for the record - I'm running CentOS and not the official RHEL.

With kernel 2.6.9-55.plus.c4smp, I was rebuilding a raid5 array that encountered
hardware failures during the rebuild.  This ended up triggering a kernel bug
when the sync completed:

Jun 12 07:08:36 alexandria kernel: md: md2: sync done.
Jun 12 07:08:36 alexandria kernel: eip: c011e913
Jun 12 07:08:36 alexandria kernel: ------------[ cut here ]------------
Jun 12 07:08:36 alexandria kernel: kernel BUG at include/asm/spinlock.h:146!
Jun 12 07:08:36 alexandria kernel: invalid operand: 0000 [#1]
Jun 12 07:08:36 alexandria kernel: SMP
Jun 12 07:08:36 alexandria kernel: Modules linked in: loop nfs nfsd exportfs lockd nfs_acl md5 ipv6 parport_pc lp parport autofs4 w83627hf w83781d adm1021 i2c_sensor i2c_isa i2c_i801 i2c_dev i2c_core sunrpc jfs button battery ac raid5 xor uhci_hcd hw_random eepro100 e1000 e100 mii sata_mv dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod mv_sata(U) ata_piix libata sd_mod scsi_mod
Jun 12 07:08:36 alexandria kernel: CPU:    2
Jun 12 07:08:36 alexandria kernel: EIP:    0060:[<c02e174b>]    Not tainted VLI
Jun 12 07:08:36 alexandria kernel: EFLAGS: 00010046   (2.6.9-55.plus.c4smp)
Jun 12 07:08:36 alexandria kernel: EIP is at _spin_lock_irqsave+0x20/0x45
Jun 12 07:08:36 alexandria kernel: eax: c011e913   ebx: 00000246   ecx: c02f58b6   edx: c02f58b6
Jun 12 07:08:36 alexandria kernel: esi: f1057f54   edi: d04ce000   ebp: d04cef9c   esp: d04cef84
Jun 12 07:08:36 alexandria kernel: ds: 007b   es: 007b   ss: 0068
Jun 12 07:08:36 alexandria kernel: Process md2_resync (pid: 17435, threadinfo=d04ce000 task=e04198b0)
Jun 12 07:08:36 alexandria kernel: Stack: f1057f50 f1057f54 c011e913 da1e18c0 00000000 d04ce000 00000000 c02804b2
Jun 12 07:08:36 alexandria kernel:        00000000 e04198b0 c0120561 d04cefd0 d04cefd0 f1057ee4 c02e2ac6 f736d730
Jun 12 07:08:36 alexandria kernel:        00000000 e04198b0 c0120561 d04cefd0 d04cefd0 00000000 00000000 0000007b
Jun 12 07:08:36 alexandria kernel: Call Trace:
Jun 12 07:08:36 alexandria kernel:  [<c011e913>] complete+0x12/0x3d
Jun 12 07:08:36 alexandria kernel:  [<c02804b2>] md_thread+0x15f/0x168
Jun 12 07:08:36 alexandria kernel:  [<c0120561>] autoremove_wake_function+0x0/0x2d
Jun 12 07:08:36 alexandria kernel:  [<c02e2ac6>] ret_from_fork+0x6/0x14
Jun 12 07:08:36 alexandria kernel:  [<c0120561>] autoremove_wake_function+0x0/0x2d
Jun 12 07:08:36 alexandria kernel:  [<c0280353>] md_thread+0x0/0x168
Jun 12 07:08:36 alexandria kernel:  [<c01041f5>] kernel_thread_helper+0x5/0xb
Jun 12 07:08:36 alexandria kernel: Code: 81 00 00 00 00 01 c3 f0 ff 00 c3 56 89 c6 53 9c 5b fa 81 78 04 ad 4e ad de 74 18 ff 74 24 08 68 b6 58 2f c0 e8 dd 11 e4 ff 59 58 <0f> 0b 92 00 21 49 2f c0 f0 fe 0e 79 13 f7 c3 00 02 00 00 74 01
Jun 12 07:08:36 alexandria kernel:  <0>Fatal exception: panic in 5 seconds
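
For anyone decoding the trace: the BUG at include/asm/spinlock.h appears to be the debug-spinlock magic check, since the Code bytes include a compare against 0xdead4ead ("ad 4e ad de" little-endian) right before the trapping instruction. That would mean the completion that md_thread handed to complete() was already freed or scribbled over by the time the resync thread exited. Below is a minimal userspace sketch of that kind of check, not the kernel's actual code; the names demo_spinlock and demo_spin_lock are made up for illustration.

/* Sketch of a debug-spinlock "magic" check (illustrative only). */
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define SPINLOCK_MAGIC 0xdead4eadU

struct demo_spinlock {
    volatile unsigned int lock;   /* 1 = unlocked, 0 = held (i386 convention) */
    unsigned int magic;           /* stamped at init, checked on every lock */
};

static void demo_spin_lock(struct demo_spinlock *lp)
{
    /* A check like this is what seems to have fired here: the lock embedded
       in the completion no longer had a valid magic when complete() took it. */
    if (lp->magic != SPINLOCK_MAGIC) {
        fprintf(stderr, "simulated BUG: bad spinlock magic 0x%x\n", lp->magic);
        assert(0);                 /* stands in for the kernel's BUG() */
    }
    lp->lock = 0;
}

int main(void)
{
    struct demo_spinlock lk = { .lock = 1, .magic = SPINLOCK_MAGIC };
    demo_spin_lock(&lk);           /* fine: magic intact */

    memset(&lk, 0x6b, sizeof(lk)); /* simulate use-after-free style corruption */
    demo_spin_lock(&lk);           /* trips the simulated BUG */
    return 0;
}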

I've attached the full logs so you can see all of the md output.

Google turned up a couple of similar reports:

http://lists.centos.org/pipermail/centos/2005-March/003547.html
http://bugzilla.kernel.org/show_bug.cgi?id=6866
Comment 1 Orion Poplawski 2007-06-12 10:55:26 EDT
Created attachment 156799 [details]
Full logs
Comment 2 RHEL Product and Program Management 2008-10-03 12:15:08 EDT
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.
Comment 3 Doug Ledford 2008-12-16 17:31:57 EST
I know this was quite a while ago, but can you possibly expand on what you mean by "I was rebuilding a raid5 array that encountered hardware failures during the rebuild"? This report indicates that the oops happened at the end of the rebuild, when the resync thread was shutting down, but raid5 is a single-failure-only raid subsystem, so if the array experienced hardware failures during the rebuild, how did the rebuild ever complete in the first place?
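
(To spell out the single-failure point: raid5 parity is a plain XOR across the data blocks, so any one missing block can be rebuilt from the survivors, but a second missing block cannot. A toy illustration, not md's code, with made-up block values:)

#include <stdio.h>

int main(void)
{
    unsigned char d0 = 0xA5, d1 = 0x3C, d2 = 0x7E;  /* three data "blocks" */
    unsigned char p  = d0 ^ d1 ^ d2;                /* parity block */

    /* One device lost (say d1): XOR of everything that survives recovers it. */
    unsigned char rebuilt = d0 ^ d2 ^ p;
    printf("lost d1=0x%02x, rebuilt=0x%02x\n", d1, rebuilt);

    /* Two devices lost (d1 and d2): the survivors give one equation with two
       unknowns (d0 ^ p equals d1 ^ d2), so neither block is determined, which
       is why a second failure during a raid5 rebuild is unrecoverable. */
    return 0;
}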
Comment 4 Orion Poplawski 2008-12-16 18:21:31 EST
What might be more relevant is the sequence starting at Jun 11 17:40:42, when a sync of md2 completes and then immediately there are several disk failures on that array. This seems to send the md driver into some kind of loop attempting to sync the array, even though it doesn't have enough devices.

Otherwise, I'm not sure I can be of more help, and I haven't reproduced it (thankfully). I believe I eventually got things back up by running badblocks -w to remap the bad sectors on the various drives before re-assembling the array.
Comment 5 Doug Ledford 2009-04-07 09:31:02 EDT
I don't have the ability to reproduce this, and as it involves multiple drive failures in a raid level that's only tolerant of a single drive failure, it doesn't rank very high on the "needs fixing" list.  I'm closing this bug out as WONTFIX.  Please reopen if you think this needs further attention.
