Bug 161160 - Reproducible panic in mdadm multipathing
Summary: Reproducible panic in mdadm multipathing
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel
Version: 3.0
Hardware: All
OS: Linux
medium
high
Target Milestone: ---
Assignee: Doug Ledford
QA Contact:
URL:
Whiteboard:
Keywords:
Depends On:
Blocks: 168424
 
Reported: 2005-06-20 21:20 UTC by Wendy Cheng
Modified: 2007-11-30 22:07 UTC (History)
Clone Of:
Last Closed: 2006-03-15 16:07:10 UTC


Attachments
patch submitted by IBM (3.12 KB, patch)
2005-07-21 20:09 UTC, Wendy Cheng


External Trackers
Tracker: Red Hat Product Errata
ID: RHSA-2006:0144
Priority: qe-ready
Status: SHIPPED_LIVE
Summary: Moderate: Updated kernel packages available for Red Hat Enterprise Linux 3 Update 7
Last Updated: 2006-03-15 05:00:00 UTC

Description Wendy Cheng 2005-06-20 21:20:08 UTC
Description of problem:

Two reproducible kernel oopses have been reported with mdadm multipathing: one on
i686 and one on IPF machines. With the 2.4.21-32.0.1.ELsmp kernel, the panic trace is:

md0: former device sdi is unavailable, removing from array!
Unable to handle kernel NULL pointer dereference at virtual address 00000040
printing eip:
f8b8859a
*pde = 35779001
*pte = 3c01c067
Oops: 0000
multipath netconsole usbserial lp parport autofs4 audit pool e1000 floppy sg
microcode loop lvm-mod keybdev mousedev hid input usb-uhci usbcore ext3 jbd qla
23
CPU:    1
EIP:    0060:[<f8b8859a>]    Not tainted
EFLAGS: 00010246

EIP is at multipath_run [multipath] 0x1ea (2.4.21-32.0.1.ELsmp/i686)
eax: d1210000   ebx: 00000000   ecx: 00000000   edx: f7caa294
esi: 00000000   edi: f7caa294   ebp: f7caa294   esp: f57cbd94
ds: 0068   es: 0068   ss: 0068
Process mdadm (pid: 4381, stackpage=f57cb000)
Stack: d1210000 00000000 000002c4 cf940000 c043fc80 c0440054 f7caa294 c043fc80
      f57cbde8 00000086 00000000 00000000 cf940000 f57ca000 f5c43000 00000001
      0000000a d1210000 00000000 c048135f 00007ca3 c0129553 00000282 00007ca3
Call Trace:   [<c0129553>] call_console_drivers [kernel] 0x63 (0xf57cbde8)
[<c0129883>] printk [kernel] 0x153 (0xf57cbe20)
[<c0217594>] device_size_calculation [kernel] 0x154 (0xf57cbe40)
[<c021786d>] do_md_run [kernel] 0x1dd (0xf57cbe6c)
[<c0129883>] printk [kernel] 0x153 (0xf57cbe88)
[<c0215a45>] bind_rdev_to_array [kernel] 0xa5 (0xf57cbea8)
[<c02186ed>] add_new_disk [kernel] 0x24d (0xf57cbec8)
[<c021928c>] md_ioctl [kernel] 0x38c (0xf57cbeec)
[<c0126154>] context_switch [kernel] 0xa4 (0xf57cbf60)
[<c01b2a3f>] tty_write [kernel] 0x14f (0xf57cbf68)
[<c016dbfe>] blkdev_ioctl [kernel] 0x3e (0xf57cbf80)
[<c0178756>] sys_ioctl [kernel] 0xf6 (0xf57cbf94)

Code: 8b 49 40 85 c9 0f 85 5f 02 00 00 8b 44 24 38 bf 01 00 00 00
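
One way to read the trace above, offered as a sketch rather than a confirmed diagnosis: assuming the "Code:" bytes begin at the faulting instruction, the first three bytes (8b 49 40) decode to a load from offset 0x40 of the pointer held in %ecx. The register dump shows ecx as 00000000, which would account for the reported fault address 00000040. The helper name decode_mov_disp8 is made up for this illustration, and it handles only this one instruction form.

```python
# Hedged sketch: decode only the first instruction of the oops "Code:" line.
# Handles just "8b /r" (MOV r32, r/m32) with an 8-bit displacement and no
# SIB byte. It is not a general x86 disassembler.

REGS32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def decode_mov_disp8(code: bytes):
    """Return (dest, base, disp) for a 3-byte 'mov disp8(%base),%dest'."""
    opcode, modrm, disp = code[0], code[1], code[2]
    if opcode != 0x8B:
        raise ValueError("only opcode 0x8b (MOV r32, r/m32) is handled")
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 0b01 or rm == 0b100:
        raise ValueError("only the [base + disp8] form without SIB is handled")
    if disp >= 0x80:  # disp8 is sign-extended
        disp -= 0x100
    return REGS32[reg], REGS32[rm], disp


# First three bytes of the "Code:" line, and ecx from the register dump.
dest, base, disp = decode_mov_disp8(bytes.fromhex("8b 49 40"))
registers = {"ecx": 0x00000000}
fault_address = (registers[base] + disp) & 0xFFFFFFFF

print(f"mov {disp:#x}(%{base}),%{dest} -> faulting address {fault_address:08x}")
```

If that reading is right, multipath_run dereferenced a NULL structure pointer at member offset 0x40. Which structure that is cannot be determined from the trace alone.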


Version-Release number of selected component (if applicable):
All versions of RHEL 3 kernels up to the current RHN distribution
(2.4.21-32.0.1.ELsmp).

How reproducible:
Every time

Steps to Reproduce:
1. Connect the Linux box to SAN storage with multipath.
2. Create a LUN on the SAN storage, and boot from it (SAN boot).
3. Create two more LUNs on the SAN storage, then reboot.

e.g.
/dev/sda:  50GB (including /, /boot, swap partition)
/dev/sdb:  12GB
/dev/sdc:  1GB
/dev/sdd:  multipath device for /dev/sda
/dev/sde:  multipath device for /dev/sdb
/dev/sdf:  multipath device for /dev/sdc

4. Create a partition on /dev/sdc (also visible as multipath device /dev/sdf)
with parted, then assign both partitions to /dev/md0.
5. At the shell prompt, run: mdadm -C -lmp -n2 /dev/md0 /dev/sdc1 /dev/sdf1
6. Remove /dev/sdb and /dev/sde on the SAN storage, then reboot.

now the device names have changed:
previous /dev/sdc becomes /dev/sdb, and previous /dev/sdf becomes /dev/sdd.
/dev/sda:  50GB (including /, /boot, swap partition)
/dev/sdb:  1GB (previous /dev/sdc)
/dev/sdd:  multipath device for /dev/sda
/dev/sde:  multipath device for /dev/sdb (previous /dev/sdf)

7. After editing /etc/mdadm.conf, run "mdadm -As /dev/md0".

Actual result:
kernel oops.

Expected result:
no oops.

Additional Info:

--- /etc/mdadm.conf ---
DEVICE /dev/sd[abcdef][0-9]
ARRAY /dev/md0 devices=/dev/sdb1,/dev/sdd1
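
As a sanity check before running "mdadm -As", the configuration above can be verified for internal consistency: every device named in an ARRAY line should be covered by a DEVICE pattern. The sketch below is only an illustration. The function check_mdadm_conf is hypothetical, it uses Python's fnmatch as an approximation of the wildcard matching mdadm performs, and it understands only the DEVICE keyword and the devices= option. It does not address the kernel oops itself.

```python
import fnmatch


def check_mdadm_conf(text):
    """Return {array: [listed devices not matched by any DEVICE pattern]}."""
    patterns, arrays = [], {}
    for line in text.splitlines():
        words = line.split()
        if not words or words[0].startswith("#"):
            continue  # skip blank lines and comments
        keyword = words[0].upper()
        if keyword == "DEVICE":
            patterns.extend(words[1:])
        elif keyword == "ARRAY" and len(words) > 1:
            devices = arrays.setdefault(words[1], [])
            for word in words[2:]:
                key, _, value = word.partition("=")
                if key.lower() == "devices":
                    devices.extend(value.split(","))
    return {
        name: [d for d in devs
               if not any(fnmatch.fnmatchcase(d, p) for p in patterns)]
        for name, devs in arrays.items()
    }


# The configuration from this report.
conf = """\
DEVICE /dev/sd[abcdef][0-9]
ARRAY /dev/md0 devices=/dev/sdb1,/dev/sdd1
"""
uncovered = check_mdadm_conf(conf)
print(uncovered)
```

For the file shown in this report both member devices match the DEVICE pattern, so the configuration is consistent with the renamed devices and the oops is not explained by a pattern mismatch.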

Comment 1 Wendy Cheng 2005-06-20 21:53:01 UTC
Sorry, there was a typo in the "device names have changed" lines. They should read:

now the device names have changed:
previous /dev/sdc becomes /dev/sdb, and previous /dev/sdf becomes /dev/sdd.
/dev/sda:  50GB (including /, /boot, swap partition)
/dev/sdb:  1GB (previous /dev/sdc)
/dev/sdc:  multipath device for /dev/sda
/dev/sdd:  multipath device for /dev/sdb (previous /dev/sdf)
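
The corrected listing is what one would expect if sd names are handed out sequentially in discovery order, which is an assumption of this sketch and not something the report states. The path labels ("lunN/pathM") and the function assign_sd_names are made up for illustration.

```python
import string


def assign_sd_names(paths):
    """Map each discovered path to sda, sdb, ... in discovery order."""
    return {path: "sd" + string.ascii_lowercase[i]
            for i, path in enumerate(paths)}


# Hypothetical discovery order: all LUNs on the first path, then the second.
before = ["lun0/path0", "lun1/path0", "lun2/path0",
          "lun0/path1", "lun1/path1", "lun2/path1"]

# Step 6 removes the 12GB LUN (lun1 here), which disappears on both paths.
after = [p for p in before if not p.startswith("lun1/")]

old_names = assign_sd_names(before)
new_names = assign_sd_names(after)

# Old name -> new name for every path that survives the removal.
renames = {old_names[p]: new_names[p] for p in after}
for old, new in renames.items():
    print(f"/dev/{old} -> /dev/{new}")
```

Under that assumption the two members of md0 move from sdc and sdf to sdb and sdd, matching the devices= entry in /etc/mdadm.conf.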

Comment 11 Rene Klootwijk 2005-09-26 14:25:53 UTC
The same problem occurs when a multipath device is created on one system and
then activated on another system that has assigned different device names to
these LUNs. We require several multipath devices to be activated on multiple
systems for an Oracle10g RAC environment.

Comment 12 Doug Ledford 2005-09-26 21:28:00 UTC
This patch has passed my internal testing and has been submitted
internally for review and possible inclusion in the next RHEL3 update release.
I've also built a test kernel that has this patch included.  RPMs can be found
at http://people.redhat.com/dledford/st_tape_test/ and the kernel version that
includes this patch is 2.4.21-37.1.EL_st_tape_test3.

Comment 13 Rene Klootwijk 2005-09-27 07:16:08 UTC
Can you compile a hugemem version of the kernel?

Comment 14 Doug Ledford 2005-09-27 14:24:01 UTC
One is already present in the i686 directory.

Comment 17 Ernie Petrides 2005-10-08 02:12:57 UTC
A fix for this problem has just been committed to the RHEL3 U7
patch pool this evening (in kernel version 2.4.21-37.5.EL).


Comment 27 Red Hat Bugzilla 2006-03-15 16:07:11 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2006-0144.html


