Bug 441832 - mptscsi race between hotremove and mptscsih_bus_reset
Check for NULL pointers when retrieving vdevice / vdevice->vtarget in mptscsih_bus_reset (638 bytes, patch)
2008-04-10 09:54 EDT, Bryn M. Reeves
Description Bryn M. Reeves 2008-04-10 09:54:37 EDT
Description of problem:
If a device is hot-removed from an mptscsi controller while an error handler
that will attempt a bus reset is outstanding, the call to mptscsih_bus_reset may
end up accessing the vdevice structure (stored in SCpnt->device->hostdata) after
it has been freed, leading to an oops:

mptscsih: ioc0: attempting task abort! (sc=e000000115651a80)
sd 1:0:1:0:
mptscsih: ioc0: task abort: SUCCESS (sc=e000000115651a80)
mptscsih: ioc0: attempting task abort! (sc=e000000115651a80)
sd 1:0:1:0:
mptscsih: ioc0: task abort: SUCCESS (sc=e000000115651a80)
mptscsih: ioc0: attempting target reset! (sc=e000000115651a80)
sd 1:0:1:0:
mptscsih: ioc0: target reset: SUCCESS (sc=e000000115651a80)
mptscsih: ioc0: attempting bus reset! (sc=e000000115651a80)                  <--##
scsi 1:0:1:0:
Unable to handle kernel NULL pointer dereference (address 0000000000000000)  <--##
scsi_eh_1[660]: Oops 8813272891392 [1]
Modules linked in: ipmi_watchdog e1000(U) tg3 e100 ipv6 autofs4 hidp l2cap
bluetooth sunrpc fefpcl(U) panicforpcl(U) mptctl ipmi_devintf ipmi_si
ipmi_msghandler vfat fat dm_mirror dm_multipath dm_mod button parport_pc lp
parport sr_mod shpchp cdrom mii sg ext3 jbd lpfc(U) scsi_transport_fc mptspi
scsi_transport_spi mptsas scsi_transport_sas mptscsih mptbase sd_mod usb_storage
ehci_hcd ohci_hcd uhci_hcd scsi_mod

Pid: 660, CPU 0, comm:            scsi_eh_1
psr : 00001010085a6010 ifs : 800000000000030d ip  : [<a000000207b65540>]    Not
ip is at mptscsih_bus_reset+0xe0/0x220 [mptscsih]
unat: 0000000000000000 pfs : 000000000000030d rsc : 0000000000000003
rnat: 0000000000000000 bsps: a0000001002c50c0 pr  : 0000000000006a41
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70433f
csd : 0000000000000000 ssd : 0000000000000000
b0  : a000000207b654e0 b6  : a0000001002c50c0 b7  : a0000001002c50c0
f6  : 0fffafffffffff0000000 f7  : 0ffdb8000000000000000
f8  : 0ffff8000000000000000 f9  : 100038000000000000000
f10 : 0fffafffffffff0000000 f11 : 1003e0000000000000000
r1  : a000000207b750d8 r2  : a0000001009f6b40 r3  : a0000001009e0438
r8  : 0000000000000001 r9  : 0000000000000002 r10 : e00000010f972888
r11 : e0000001149a1000 r12 : e00000011565fd10 r13 : e000000115658000
r14 : a0000001009f6b40 r15 : e0000001149a1040 r16 : 0000000000000000
r17 : e00000010f972800 r18 : e000000114b39590 r19 : 0000000000000003
r20 : 0000000000000000 r21 : a0000001009df7e8 r22 : a0000001009f6ee0
r23 : a000000100829280 r24 : a0000001009df7e8 r25 : a0000001009f6b48
r26 : a0000001009f6b48 r27 : 0000000000000000 r28 : 000000000000000a
r29 : 0000000000000000 r30 : 0000000000000000 r31 : a0000001009f6ebc

Call Trace:
[<a000000100013b20>] show_stack+0x40/0xa0
                               sp=e00000011565f8a0 bsp=e000000115659330
[<a000000100014420>] show_regs+0x840/0x880
                               sp=e00000011565fa70 bsp=e0000001156592d8
[<a000000100037740>] die+0x1c0/0x2c0
                               sp=e00000011565fa70 bsp=e000000115659290
[<a00000010062ae00>] ia64_do_page_fault+0x8a0/0x9e0
                               sp=e00000011565fa90 bsp=e000000115659240
[<a00000010000c020>] __ia64_leave_kernel+0x0/0x280
                               sp=e00000011565fb40 bsp=e000000115659240
[<a000000207b65540>] mptscsih_bus_reset+0xe0/0x220 [mptscsih]
                               sp=e00000011565fd10 bsp=e0000001156591d0
[<a0000002079c04d0>] scsi_try_bus_reset+0xf0/0x240 [scsi_mod]
                               sp=e00000011565fd10 bsp=e0000001156591a0
[<a0000002079c24d0>] scsi_eh_ready_devs+0x710/0xbe0 [scsi_mod]
                               sp=e00000011565fd10 bsp=e000000115659158
[<a0000002079c34c0>] scsi_error_handler+0x840/0xc60 [scsi_mod]
                               sp=e00000011565fd10 bsp=e000000115659110
[<a0000001000abeb0>] kthread+0x230/0x2c0
                               sp=e00000011565fd50 bsp=e0000001156590c8
[<a000000100012090>] kernel_thread_helper+0x30/0x60
                               sp=e00000011565fe30 bsp=e0000001156590a0
[<a0000001000090c0>] start_kernel_thread+0x20/0x40
                               sp=e00000011565fe30 bsp=e0000001156590a0
<0>Kernel panic - not syncing: Fatal exception

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Offline a device while a bus reset is occurring

Actual results:
Above oops

Expected results:
Error handler does not panic.

Additional info:
Comment 1 Bryn M. Reeves 2008-04-10 09:54:37 EDT
Created attachment 301986
Check for NULL pointers when retrieving vdevice / vdevice->vtarget in mptscsih_bus_reset
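
The attachment itself is not reproduced in this report. As a rough illustration of the kind of check its title describes, here is a minimal userspace sketch. All of the mock_* types, the MOCK_* return codes and the function name are hypothetical stand-ins, not the driver's real definitions; only the hostdata -> vdevice -> vtarget chain is taken from the description above.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the driver's per-target and per-device data. */
struct mock_vtarget { int channel; };
struct mock_vdevice { struct mock_vtarget *vtarget; };

/* Hypothetical stand-ins for the SCSI midlayer device and command. */
struct mock_scsi_device { void *hostdata; };
struct mock_scsi_cmnd { struct mock_scsi_device *device; };

/* Illustrative return codes only; not the kernel's constants. */
enum { MOCK_SUCCESS = 0, MOCK_FAILED = 1 };

/*
 * Sketch of a bus reset handler that refuses to continue when the
 * per-device data, or the target it points to, is no longer there.
 */
static int mock_bus_reset(struct mock_scsi_cmnd *SCpnt)
{
    struct mock_vdevice *vdevice;

    if (SCpnt == NULL || SCpnt->device == NULL)
        return MOCK_FAILED;

    /* The device may have been removed while error handling was pending. */
    vdevice = SCpnt->device->hostdata;
    if (vdevice == NULL || vdevice->vtarget == NULL)
        return MOCK_FAILED;

    /* A real handler would issue the reset for this channel here. */
    (void)vdevice->vtarget->channel;
    return MOCK_SUCCESS;
}
```

Note that a check like this only helps if the removal path leaves hostdata (or vtarget) set to NULL. It does not by itself protect against a stale, non-NULL pointer to freed memory.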
Comment 3 Bryn M. Reeves 2008-04-10 09:57:42 EDT
Comment #0 should read "hot-removed or offlined" - you don't actually need to
disconnect the drive, just have it offlined so that the kernel removes it.
Comment 4 Bryn M. Reeves 2008-04-22 12:38:24 EDT
Eric posted a patch containing this change (along with one or two others :) to
linux-scsi last year:


$ diffstat /tmp/mpt-linux-scsi.patch
mptscsih.c | 1498 +++++++++++++++++++++++++++----------------------------------
mptscsih.h |    8
2 files changed, 666 insertions(+), 840 deletions(-)
$ diffstat /tmp/mpt_bus_reset.patch
mptscsih.c |    3 +++
1 file changed, 3 insertions(+)
Comment 6 Jan Kratochvil 2008-07-03 04:37:44 EDT
Just seen now on F9 RHTS x86_64 ibm-taroko.rhts.bos.redhat.com, Job 24616,
kernel-2.6.25-14.fc9.x86_64, during startup, no device removal/plugging:


BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
IP: [<ffffffff880a37d6>] :mptscsih:mptscsih_bus_reset+0xa6/0x109
PGD 3e172067 PUD 3edcf067 PMD 3e0ca067 PTE 0
Oops: 0000 [1] SMP
Modules linked in: nfs lockd nfs_acl bridge bnep rfcomm l2cap bluetooth sunrpc
ipv6 cpufreq_ondemand acpi_cpufreq freq_table loop dm_multipath sr_mod cdrom
pata_acpi ata_generic ppdev snd_hda_intel parport_pc snd_seq_dummy parport
snd_seq_oss floppy snd_seq_midi_event snd_seq snd_seq_device firewire_ohci
firewire_core pcspkr serio_raw snd_pcm_oss snd_mixer_oss crc_itu_t snd_pcm
i2c_i801 snd_timer snd_page_alloc ahci i2c_core iTCO_wdt ata_piix
iTCO_vendor_support snd_hwdep libata button i82975x_edac snd edac_core tg3
soundcore sg dm_snapshot dm_zero dm_mirror dm_mod shpchp mptsas mptscsih mptbase
scsi_transport_sas sd_mod scsi_mod ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd
[last unloaded: microcode]
Pid: 462, comm: scsi_eh_0 Not tainted 2.6.25-14.fc9.x86_64 #1
RIP: 0010:[<ffffffff880a37d6>]  [<ffffffff880a37d6>]
RSP: 0018:ffff81003e061dd0  EFLAGS: 00010246
RAX: ffff81003f3e3802 RBX: ffff81003f3e2c80 RCX: 000000000000000a
RDX: 0000000000000000 RSI: 0000000000000046 RDI: ffffffff814e64e4
RBP: ffff81003e061e00 R08: 0000000000000002 R09: 0000000000000000
R10: ffffffff8806027f R11: ffffffff814e6900 R12: ffff81003c94c3c0
R13: ffff81003e4e7000 R14: ffff81003e4e7008 R15: ffff81003f3e2800
FS:  0000000000000000(0000) GS:ffff81003f802680(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 000000003edf8000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process scsi_eh_0 (pid: 462, threadinfo ffff81003e060000, task ffff81003e096000)
Stack:  ffff81003f3e3810 0000000000000000 ffff81003c94c3c0 0000000000002003
 0000000000000000 ffff81003e061ee0 ffff81003e061e20 ffffffff880541e1
 ffff81003c94c3c0 0000000000000000 ffff81003e061e60 ffffffff88054f53
Call Trace:
 [<ffffffff880541e1>] :scsi_mod:scsi_try_bus_reset+0x52/0xde
 [<ffffffff88054f53>] :scsi_mod:scsi_eh_ready_devs+0x2d3/0x4af
 [<ffffffff8805562f>] :scsi_mod:scsi_error_handler+0x352/0x4f1
 [<ffffffff81026ae5>] ? __wake_up_common+0x46/0x75
 [<ffffffff880552dd>] ? :scsi_mod:scsi_error_handler+0x0/0x4f1
 [<ffffffff810477e3>] kthread+0x49/0x76
 [<ffffffff8100ccf8>] child_rip+0xa/0x12
 [<ffffffff8104779a>] ? kthread+0x0/0x76
 [<ffffffff8100ccee>] ? child_rip+0x0/0x12

Code: 00 00 49 8b 04 24 b9 28 00 00 00 48 8b 90 88 00 00 00 41 8a 85 98 00 00 00
84 c0 74 0e 31 c9 3c 02 0f 94 c1 8d 0c cd 02 00 00 00 <48> 8b 02 45 31 c9 45 31
c0 48 89 df be 04 00 00 00 0f b6 50 0b
RIP  [<ffffffff880a37d6>] :mptscsih:mptscsih_bus_reset+0xa6/0x109
 RSP <ffff81003e061dd0>
CR2: 0000000000000000
---[ end trace 0e0ecc73240609da ]---
Comment 7 Don Zickus 2008-09-02 23:39:20 EDT
The patch is in kernel-2.6.18-107.el5.
You can download this test kernel from http://people.redhat.com/dzickus/el5
Comment 12 errata-xmlrpc 2009-01-20 14:49:38 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.