Bug 466355 - 64-bit RHEL 4.7 crashes during boot if LSI mirror is resyncing
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: mkinitrd
Version: 4.7
Hardware: All
OS: Linux
Priority: medium
Severity: high
Target Milestone: rc
Assignee: Brian Lane
QA Contact: Release Test Team
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2008-10-09 20:43 UTC by John Caruso
Modified: 2011-07-27 18:01 UTC (History)
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-07-27 18:01:04 UTC
Target Upstream Version:
Embargoed:


Description John Caruso 2008-10-09 20:43:37 UTC
Description of problem:
64-bit RHEL 4.7 won't boot if LSI mirror is resyncing

Version-Release number of selected component (if applicable):
mkinitrd-4.2.1.13-1
(or possibly: kernel-smp-2.6.9-78.0.1.EL.x86_64)
(and maybe: MPTBIOS-IME-5.10.02.04)

How reproducible:
Try to boot a 64-bit RHEL 4.7 server while its LSI mirror is resyncing.

Steps to Reproduce:
1. Start an LSI mirror resynchronization (e.g. via the "Synchronize Whole Mirror" link in the LSI BIOS utility)
2. Try to boot Red Hat Enterprise Linux

Actual results:
Kernel crash (see output below).

Expected results:
Successful boot.

Additional info:
On 64-bit servers with two disks managed by an LSI RAID controller and configured as a single mirrored volume, RHEL 4.7 won't boot while the mirror is resynchronizing: the kernel oopses right after nash starts (full output below). I believe we saw this on RHEL 4.6 as well, though I can't be certain. This is a fairly serious issue, since the server is effectively unusable until the array has finished resyncing.

We've only been able to reproduce this on 64-bit servers so far (in particular, two Sun v40z servers and an IBM HS20-8843 blade).  When I tried to reproduce it on a 32-bit blade server (an IBM HS20-8678), the blade was able to boot with no problems even when the disk mirror was resyncing.

Also, we tried booting a Sun v40z with just one disk and then hot-inserting the second disk; the system booted fine and kept running fine after the disk was inserted and the array started resyncing. So Red Hat doesn't have a problem running while an LSI RAID volume is resyncing in general; the problem only occurs during the boot sequence.

Here's the output of the kernel crash (there's no netdump output or any other output, since the netdump service hasn't started when this crash occurs):

GRUB loading, please wait...
Press any key to continue.
Press any key to continue.
Press any key to continue.
Press any key to continue.
Press any key to continue.
Red Hat nash version 4.2.1.13 starting
Unable to handle kernel NULL pointer dereference at 0000000000000008 RIP:
<ffffffffa0034d80>{:mptbase:mpt_base_reply+1624}
PML4 4000ca067 PGD 4000ce067 PMD 0
Oops: 0000 [1] SMP
CPU 7
Modules linked in: mptspi mptscsi mptbase sd_mod scsi_mod
Pid: 0, comm: swapper Not tainted 2.6.9-78.0.1.ELsmp
RIP: 0010:[<ffffffffa0034d80>] <ffffffffa0034d80>{:mptbase:mpt_base_reply+1624}
RSP: 0000:00000101f8f9fe38  EFLAGS: 00010282
RAX: 00000000ffffffff RBX: 0000000000000001 RCX: 0000000000000246
RDX: 0000000000000000 RSI: 00000104000e0008 RDI: ffffffff803f64c0
RBP: 0000000000000003 R08: 00000000fffffffb R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000000 R12: 00000100fbe806fc
R13: 00000000000000ff R14: 00000104000e0000 R15: 0000000000000001
FS:  0000000000000000(0000) GS:ffffffff8050d600(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000008 CR3: 00000007f8fbe000 CR4: 00000000000006e0
Process swapper (pid: 0, threadinfo 00000105f8f08000, task 00000100fbf55030)
Stack: 0000000000000000 000000000b000001 0000000100032853 00000100fbe806e0
       00000100fbe82800 0000000000000007 7461726765746e49 3a64696152206465
       20656d756c6f5620 4320737574617453
Call Trace:<IRQ> <ffffffffa002d638>{:mptbase:mpt_interrupt+1211} <ffffffff80112ff2>{handle_IRQ_event+41}
       <ffffffff8011326c>{do_IRQ+197} <ffffffff801108bf>{ret_from_intr+0}
        <EOI> <ffffffff8010e789>{default_idle+0} <ffffffff8010e7a9>{default_idle+32}
       <ffffffff8010e81c>{cpu_idle+26}

Code: 0f b7 42 08 77 05 80 cc 01 eb 03 80 e4 fe 41 f6 c7 04 66 89
RIP <ffffffffa0034d80>{:mptbase:mpt_base_reply+1624} RSP <00000101f8f9fe38>
CR2: 0000000000000008
 <0>Kernel panic - not syncing: Oops
 Badness in panic at kernel/panic.c:118

Call Trace:<IRQ> <ffffffff8013871e>{panic+527} <ffffffff80110c81>{apic_timer_interrupt+133}
       <ffffffff80111b7c>{oops_end+38} <ffffffff80111b97>{oops_end+65}
       <ffffffff80124aed>{do_page_fault+1125} <ffffffff80138f0e>{release_console_sem+369}
       <ffffffff8013913c>{vprintk+498} <ffffffff80110e1d>{error_exit+0}
       <ffffffffa0034d80>{:mptbase:mpt_base_reply+1624} <ffffffffa0034d2c>{:mptbase:mpt_base_reply+1540}
       <ffffffffa002d638>{:mptbase:mpt_interrupt+1211} <ffffffff80112ff2>{handle_IRQ_event+41}
       <ffffffff8011326c>{do_IRQ+197} <ffffffff801108bf>{ret_from_intr+0}
        <EOI> <ffffffff8010e789>{default_idle+0} <ffffffff8010e7a9>{default_idle+32}
       <ffffffff8010e81c>{cpu_idle+26}
Badness in i8042_panic_blink at drivers/input/serio/i8042.c:987

Call Trace:<IRQ> <ffffffff802481d3>{i8042_panic_blink+238} <ffffffff801386cc>{panic+445}
       <ffffffff80110c81>{apic_timer_interrupt+133} <ffffffff80111b7c>{oops_end+38}
       <ffffffff80111b97>{oops_end+65} <ffffffff80124aed>{do_page_fault+1125}
       <ffffffff80138f0e>{release_console_sem+369} <ffffffff8013913c>{vprintk+498}
       <ffffffff80110e1d>{error_exit+0} <ffffffffa0034d80>{:mptbase:mpt_base_reply+1624}
       <ffffffffa0034d2c>{:mptbase:mpt_base_reply+1540} <ffffffffa002d638>{:mptbase:mpt_interrupt+1211}
       <ffffffff80112ff2>{handle_IRQ_event+41} <ffffffff8011326c>{do_IRQ+197}
       <ffffffff801108bf>{ret_from_intr+0}  <EOI> <ffffffff8010e789>{default_idle+0}
       <ffffffff8010e7a9>{default_idle+32} <ffffffff8010e81c>{cpu_idle+26}

Badness in i8042_panic_blink at drivers/input/serio/i8042.c:990

Call Trace:<IRQ> <ffffffff80248265>{i8042_panic_blink+384} <ffffffff801386cc>{panic+445}
       <ffffffff80110c81>{apic_timer_interrupt+133} <ffffffff80111b7c>{oops_end+38}
       <ffffffff80111b97>{oops_end+65} <ffffffff80124aed>{do_page_fault+1125}
       <ffffffff80138f0e>{release_console_sem+369} <ffffffff8013913c>{vprintk+498}
       <ffffffff80110e1d>{error_exit+0} <ffffffffa0034d80>{:mptbase:mpt_base_reply+1624}
       <ffffffffa0034d2c>{:mptbase:mpt_base_reply+1540} <ffffffffa002d638>{:mptbase:mpt_interrupt+1211}
       <ffffffff80112ff2>{handle_IRQ_event+41} <ffffffff8011326c>{do_IRQ+197}
       <ffffffff801108bf>{ret_from_intr+0}  <EOI> <ffffffff8010e789>{default_idle+0}
       <ffffffff8010e7a9>{default_idle+32} <ffffffff8010e81c>{cpu_idle+26}

Badness in i8042_panic_blink at drivers/input/serio/i8042.c:992

Call Trace:<IRQ> <ffffffff802482ca>{i8042_panic_blink+485} <ffffffff801386cc>{panic+445}
       <ffffffff80110c81>{apic_timer_interrupt+133} <ffffffff80111b7c>{oops_end+38}
       <ffffffff80111b97>{oops_end+65} <ffffffff80124aed>{do_page_fault+1125}
       <ffffffff80138f0e>{release_console_sem+369} <ffffffff8013913c>{vprintk+498}
       <ffffffff80110e1d>{error_exit+0} <ffffffffa0034d80>{:mptbase:mpt_base_reply+1624}
       <ffffffffa0034d2c>{:mptbase:mpt_base_reply+1540} <ffffffffa002d638>{:mptbase:mpt_interrupt+1211}
       <ffffffff80112ff2>{handle_IRQ_event+41} <ffffffff8011326c>{do_IRQ+197}
       <ffffffff801108bf>{ret_from_intr+0}  <EOI> <ffffffff8010e789>{default_idle+0}
       <ffffffff8010e7a9>{default_idle+32} <ffffffff8010e81c>{cpu_idle+26}
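
A note on reading the oops above, offered as a rough sketch rather than a diagnosis. The last four words of the "Stack:" dump look like ASCII text stored little-endian, and the first bytes of the "Code:" line (0f b7 42 08) appear to be a 16-bit load from offset 8 off RDX, which is zero in the register dump. The Python below decodes both. The function names are made up for illustration, and the address check assumes the "Code:" dump starts at the faulting RIP, which is not confirmed anywhere in this report.

```python
import struct

# Last four 64-bit words of the "Stack:" dump in the oops above.
STACK_WORDS = [
    "7461726765746e49",
    "3a64696152206465",
    "20656d756c6f5620",
    "4320737574617453",
]


def decode_stack_words(words):
    """Pack each hex word as a little-endian 64-bit value and read the bytes as ASCII."""
    raw = b"".join(struct.pack("<Q", int(w, 16)) for w in words)
    return raw.decode("ascii", errors="replace")


def fault_address(code_bytes, rdx):
    """Effective address of `movzwl disp8(%rdx),%eax` (bytes 0f b7 42 <disp8>).

    Assumption: the first bytes of the "Code:" line are the faulting instruction.
    """
    if code_bytes[:3] != bytes([0x0F, 0xB7, 0x42]):
        raise ValueError("not the instruction encoding this sketch expects")
    disp8 = int.from_bytes(code_bytes[3:4], "little", signed=True)
    return rdx + disp8


if __name__ == "__main__":
    # Text fragment left on the stack by the driver.
    print(repr(decode_stack_words(STACK_WORDS)))
    # RDX is 0 in the register dump; compare the result with CR2.
    print(hex(fault_address(bytes.fromhex("0fb74208"), rdx=0x0)))
```

If the assumptions hold, the decoded stack text is a fragment of an Integrated RAID volume status message, and the computed address matches the CR2 value in the oops. That would be consistent with the crash happening while mptbase handles a RAID volume status event with a NULL pointer in RDX, but the driver source would need to be checked to confirm it.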

Comment 1 John Caruso 2008-10-09 20:56:27 UTC
Clarification: When I said "we tried booting a Sun v40z with just one disk and then hot-inserting the second disk" I meant that we hot-inserted the second disk *after* the system had finished booting.

