441832 – mptscsi race between hotremove and mptscsih_bus_reset

Summary: mptscsi race between hotremove and mptscsih_bus_reset

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	5.2
Hardware:	All
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Doug Ledford
QA Contact:	Martin Jenner
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	391501 409971 KernelPrio5.3

Reported:	2008-04-10 13:54 UTC by Bryn M. Reeves
Modified:	2018-12-06 14:32 UTC
CC List:	5 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2009-01-20 19:49:38 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
Check for NULL pointers when retrieving vdevice / vdevice->vtarget in mptscsih_bus_reset (638 bytes, patch) 2008-04-10 13:54 UTC, Bryn M. Reeves	no flags

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2009:0225	0	normal	SHIPPED_LIVE	Important: Red Hat Enterprise Linux 5.3 kernel security and bug fix update	2009-01-20 16:06:24 UTC

Description Bryn M. Reeves 2008-04-10 13:54:37 UTC

Description of problem:
If a device is hot-removed from an mptscsi controller while an error handler
that will attempt a bus reset is outstanding, the call to mptscsih_bus_reset may
end up accessing the vdevice structure (stored in SCpnt->device->hostdata) after
it has been freed, leading to an oops:

mptscsih: ioc0: attempting task abort! (sc=e000000115651a80)
sd 1:0:1:0:
mptscsih: ioc0: task abort: SUCCESS (sc=e000000115651a80)
mptscsih: ioc0: attempting task abort! (sc=e000000115651a80)
sd 1:0:1:0:
mptscsih: ioc0: task abort: SUCCESS (sc=e000000115651a80)
mptscsih: ioc0: attempting target reset! (sc=e000000115651a80)
sd 1:0:1:0:
mptscsih: ioc0: target reset: SUCCESS (sc=e000000115651a80)
mptscsih: ioc0: attempting bus reset! (sc=e000000115651a80)                  <--##
scsi 1:0:1:0:
Unable to handle kernel NULL pointer dereference (address 0000000000000000)  <--##
scsi_eh_1[660]: Oops 8813272891392 [1]
Modules linked in: ipmi_watchdog e1000(U) tg3 e100 ipv6 autofs4 hidp l2cap
bluetooth sunrpc fefpcl(U) panicforpcl(U) mptctl ipmi_devintf ipmi_si
ipmi_msghandler vfat fat dm_mirror dm_multipath dm_mod button parport_pc lp
parport sr_mod shpchp cdrom mii sg ext3 jbd lpfc(U) scsi_transport_fc mptspi
scsi_transport_spi mptsas scsi_transport_sas mptscsih mptbase sd_mod usb_storage
ehci_hcd ohci_hcd uhci_hcd scsi_mod

Pid: 660, CPU 0, comm:            scsi_eh_1
psr : 00001010085a6010 ifs : 800000000000030d ip  : [<a000000207b65540>]    Not tainted
ip is at mptscsih_bus_reset+0xe0/0x220 [mptscsih]
unat: 0000000000000000 pfs : 000000000000030d rsc : 0000000000000003
rnat: 0000000000000000 bsps: a0000001002c50c0 pr  : 0000000000006a41
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70433f
csd : 0000000000000000 ssd : 0000000000000000
b0  : a000000207b654e0 b6  : a0000001002c50c0 b7  : a0000001002c50c0
f6  : 0fffafffffffff0000000 f7  : 0ffdb8000000000000000
f8  : 0ffff8000000000000000 f9  : 100038000000000000000
f10 : 0fffafffffffff0000000 f11 : 1003e0000000000000000
r1  : a000000207b750d8 r2  : a0000001009f6b40 r3  : a0000001009e0438
r8  : 0000000000000001 r9  : 0000000000000002 r10 : e00000010f972888
r11 : e0000001149a1000 r12 : e00000011565fd10 r13 : e000000115658000
r14 : a0000001009f6b40 r15 : e0000001149a1040 r16 : 0000000000000000
r17 : e00000010f972800 r18 : e000000114b39590 r19 : 0000000000000003
r20 : 0000000000000000 r21 : a0000001009df7e8 r22 : a0000001009f6ee0
r23 : a000000100829280 r24 : a0000001009df7e8 r25 : a0000001009f6b48
r26 : a0000001009f6b48 r27 : 0000000000000000 r28 : 000000000000000a
r29 : 0000000000000000 r30 : 0000000000000000 r31 : a0000001009f6ebc

Call Trace:
[<a000000100013b20>] show_stack+0x40/0xa0
                               sp=e00000011565f8a0 bsp=e000000115659330
[<a000000100014420>] show_regs+0x840/0x880
                               sp=e00000011565fa70 bsp=e0000001156592d8
[<a000000100037740>] die+0x1c0/0x2c0
                               sp=e00000011565fa70 bsp=e000000115659290
[<a00000010062ae00>] ia64_do_page_fault+0x8a0/0x9e0
                               sp=e00000011565fa90 bsp=e000000115659240
[<a00000010000c020>] __ia64_leave_kernel+0x0/0x280
                               sp=e00000011565fb40 bsp=e000000115659240
[<a000000207b65540>] mptscsih_bus_reset+0xe0/0x220 [mptscsih]
                               sp=e00000011565fd10 bsp=e0000001156591d0
[<a0000002079c04d0>] scsi_try_bus_reset+0xf0/0x240 [scsi_mod]
                               sp=e00000011565fd10 bsp=e0000001156591a0
[<a0000002079c24d0>] scsi_eh_ready_devs+0x710/0xbe0 [scsi_mod]
                               sp=e00000011565fd10 bsp=e000000115659158
[<a0000002079c34c0>] scsi_error_handler+0x840/0xc60 [scsi_mod]
                               sp=e00000011565fd10 bsp=e000000115659110
[<a0000001000abeb0>] kthread+0x230/0x2c0
                               sp=e00000011565fd50 bsp=e0000001156590c8
[<a000000100012090>] kernel_thread_helper+0x30/0x60
                               sp=e00000011565fe30 bsp=e0000001156590a0
[<a0000001000090c0>] start_kernel_thread+0x20/0x40
                               sp=e00000011565fe30 bsp=e0000001156590a0
<0>Kernel panic - not syncing: Fatal exception

Version-Release number of selected component (if applicable):
kernel-2.6.18-53.el5

How reproducible:
Always

Steps to Reproduce:
1. Offline a device while a bus reset is occurring

  
Actual results:
Above oops

Expected results:
Error handler does not panic.

Additional info:

Comment 1 Bryn M. Reeves 2008-04-10 13:54:37 UTC

Created attachment 301986
Check for NULL pointers when retrieving vdevice / vdevice->vtarget in mptscsih_bus_reset
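
For illustration, here is a minimal user-space sketch of the kind of guard the attachment title describes. It is not the attached patch: the struct names, the function mock_bus_reset and the return codes are hypothetical stand-ins for the driver's real definitions. It also only covers the case where the pointer reads back as NULL, as in the oops above; a NULL check cannot detect a dangling pointer to freed memory.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the driver's per-target and per-device data. */
struct mock_vtarget {
    int channel;
};

struct mock_vdevice {
    struct mock_vtarget *vtarget;
};

/* Hypothetical stand-ins for the SCSI midlayer device and command. */
struct mock_scsi_device {
    void *hostdata;             /* holds the vdevice, or NULL once removed */
};

struct mock_scsi_cmnd {
    struct mock_scsi_device *device;
};

/* Hypothetical return codes, not the kernel's SUCCESS/FAILED values. */
enum {
    MOCK_RESET_ISSUED = 0,
    MOCK_RESET_SKIPPED = 1
};

/*
 * Look up the channel to reset, bailing out before any dereference if
 * the device data has already been torn down by a hot-remove or offline.
 */
int mock_bus_reset(const struct mock_scsi_cmnd *cmd, int *channel_out)
{
    const struct mock_vdevice *vdevice;

    if (cmd == NULL || cmd->device == NULL)
        return MOCK_RESET_SKIPPED;

    vdevice = cmd->device->hostdata;
    if (vdevice == NULL || vdevice->vtarget == NULL)
        return MOCK_RESET_SKIPPED;

    /* Only reached when both pointers are valid. */
    *channel_out = vdevice->vtarget->channel;
    return MOCK_RESET_ISSUED;
}
```

With a fully populated command the function reports the channel; once hostdata (or vtarget) is NULL it returns MOCK_RESET_SKIPPED and leaves channel_out untouched instead of faulting.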

Comment 3 Bryn M. Reeves 2008-04-10 13:57:42 UTC

Comment #0 should read "hot-removed or offlined" - you don't actually need to
disconnect the drive, just have it offlined so that the kernel removes it.

Comment 4 Bryn M. Reeves 2008-04-22 16:38:24 UTC

Eric posted a patch containing this change (along with one or two others :) to
linux-scsi last year:

http://marc.info/?l=linux-scsi&m=119008142831206&w=2

$ diffstat /tmp/mpt-linux-scsi.patch
mptscsih.c | 1498 +++++++++++++++++++++++++++----------------------------------
mptscsih.h |    8
2 files changed, 666 insertions(+), 840 deletions(-)
$ diffstat /tmp/mpt_bus_reset.patch
mptscsih.c |    3 +++
1 file changed, 3 insertions(+)

Comment 6 Jan Kratochvil 2008-07-03 08:37:44 UTC

Just seen now on F9 RHTS x86_64 ibm-taroko.rhts.bos.redhat.com, Job 24616,
kernel-2.6.25-14.fc9.x86_64, during startup, no device removal/plugging:

http://rhts.redhat.com/testlogs/24616/89682/749716/3448895-test_log--tools-gdb-gdb-any-EXTERNALWATCHDOG.log

BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
IP: [<ffffffff880a37d6>] :mptscsih:mptscsih_bus_reset+0xa6/0x109
PGD 3e172067 PUD 3edcf067 PMD 3e0ca067 PTE 0
Oops: 0000 [1] SMP
CPU 1
Modules linked in: nfs lockd nfs_acl bridge bnep rfcomm l2cap bluetooth sunrpc
ipv6 cpufreq_ondemand acpi_cpufreq freq_table loop dm_multipath sr_mod cdrom
pata_acpi ata_generic ppdev snd_hda_intel parport_pc snd_seq_dummy parport
snd_seq_oss floppy snd_seq_midi_event snd_seq snd_seq_device firewire_ohci
firewire_core pcspkr serio_raw snd_pcm_oss snd_mixer_oss crc_itu_t snd_pcm
i2c_i801 snd_timer snd_page_alloc ahci i2c_core iTCO_wdt ata_piix
iTCO_vendor_support snd_hwdep libata button i82975x_edac snd edac_core tg3
soundcore sg dm_snapshot dm_zero dm_mirror dm_mod shpchp mptsas mptscsih mptbase
scsi_transport_sas sd_mod scsi_mod ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd
[last unloaded: microcode]
Pid: 462, comm: scsi_eh_0 Not tainted 2.6.25-14.fc9.x86_64 #1
RIP: 0010:[<ffffffff880a37d6>]  [<ffffffff880a37d6>] :mptscsih:mptscsih_bus_reset+0xa6/0x109
RSP: 0018:ffff81003e061dd0  EFLAGS: 00010246
RAX: ffff81003f3e3802 RBX: ffff81003f3e2c80 RCX: 000000000000000a
RDX: 0000000000000000 RSI: 0000000000000046 RDI: ffffffff814e64e4
RBP: ffff81003e061e00 R08: 0000000000000002 R09: 0000000000000000
R10: ffffffff8806027f R11: ffffffff814e6900 R12: ffff81003c94c3c0
R13: ffff81003e4e7000 R14: ffff81003e4e7008 R15: ffff81003f3e2800
FS:  0000000000000000(0000) GS:ffff81003f802680(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 000000003edf8000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process scsi_eh_0 (pid: 462, threadinfo ffff81003e060000, task ffff81003e096000)
Stack:  ffff81003f3e3810 0000000000000000 ffff81003c94c3c0 0000000000002003
 0000000000000000 ffff81003e061ee0 ffff81003e061e20 ffffffff880541e1
 ffff81003c94c3c0 0000000000000000 ffff81003e061e60 ffffffff88054f53
Call Trace:
 [<ffffffff880541e1>] :scsi_mod:scsi_try_bus_reset+0x52/0xde
 [<ffffffff88054f53>] :scsi_mod:scsi_eh_ready_devs+0x2d3/0x4af
 [<ffffffff8805562f>] :scsi_mod:scsi_error_handler+0x352/0x4f1
 [<ffffffff81026ae5>] ? __wake_up_common+0x46/0x75
 [<ffffffff880552dd>] ? :scsi_mod:scsi_error_handler+0x0/0x4f1
 [<ffffffff810477e3>] kthread+0x49/0x76
 [<ffffffff8100ccf8>] child_rip+0xa/0x12
 [<ffffffff8104779a>] ? kthread+0x0/0x76
 [<ffffffff8100ccee>] ? child_rip+0x0/0x12


Code: 00 00 49 8b 04 24 b9 28 00 00 00 48 8b 90 88 00 00 00 41 8a 85 98 00 00 00
84 c0 74 0e 31 c9 3c 02 0f 94 c1 8d 0c cd 02 00 00 00 <48> 8b 02 45 31 c9 45 31
c0 48 89 df be 04 00 00 00 0f b6 50 0b
RIP  [<ffffffff880a37d6>] :mptscsih:mptscsih_bus_reset+0xa6/0x109
 RSP <ffff81003e061dd0>
CR2: 0000000000000000
---[ end trace 0e0ecc73240609da ]---
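
As a rough cross-check of the trace above, the faulting bytes marked in the Code: line (<48> 8b 02) can be decoded by hand. The sketch below is a hypothetical helper, not a full disassembler: it only splits a ModRM byte into its fields, taking the REX.R and REX.B extension bits into account, and the names decode_modrm and reg64_name are made up for this example.

```c
#include <stdint.h>

/* Fields of an x86-64 ModRM byte, widened by the REX prefix bits. */
struct modrm_fields {
    unsigned mod;   /* addressing mode, 0..3 */
    unsigned reg;   /* register operand, 0..15 */
    unsigned rm;    /* register or memory-base operand, 0..15 */
};

/* Split a ModRM byte; pass rex = 0 when the instruction has no REX prefix. */
struct modrm_fields decode_modrm(uint8_t rex, uint8_t modrm)
{
    struct modrm_fields f;

    f.mod = modrm >> 6;
    f.reg = ((modrm >> 3) & 7u) | ((rex & 0x04u) ? 8u : 0u); /* REX.R */
    f.rm  = (modrm & 7u) | ((rex & 0x01u) ? 8u : 0u);        /* REX.B */
    return f;
}

/* Map a 4-bit register number to its 64-bit register name. */
const char *reg64_name(unsigned n)
{
    static const char *const names[16] = {
        "rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
        "r8",  "r9",  "r10", "r11", "r12", "r13", "r14", "r15"
    };

    return n < 16 ? names[n] : "?";
}
```

For the bytes 48 8b 02, decode_modrm(0x48, 0x02) gives mod 0, reg 0 (rax) and rm 2 (rdx). Opcode 8b with mod 0 is a 64-bit load from memory, so the instruction is mov (%rdx),%rax. The register dump shows RDX as 0000000000000000, which matches the reported NULL dereference and CR2 value.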

Comment 7 Don Zickus 2008-09-03 03:39:20 UTC

The fix is in kernel-2.6.18-107.el5.
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 12 errata-xmlrpc 2009-01-20 19:49:38 UTC

An advisory has been issued which should resolve the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html
