Description of problem:
If a device is hot-removed from an mptscsi controller while an error handler that will attempt a bus reset is outstanding, the call to mptscsih_bus_reset may end up accessing the vdevice structure (stored in SCpnt->device->hostdata) after it has been freed, leading to an oops:

mptscsih: ioc0: attempting task abort! (sc=e000000115651a80)
sd 1:0:1:0:
mptscsih: ioc0: task abort: SUCCESS (sc=e000000115651a80)
mptscsih: ioc0: attempting task abort! (sc=e000000115651a80)
sd 1:0:1:0:
mptscsih: ioc0: task abort: SUCCESS (sc=e000000115651a80)
mptscsih: ioc0: attempting target reset! (sc=e000000115651a80)
sd 1:0:1:0:
mptscsih: ioc0: target reset: SUCCESS (sc=e000000115651a80)
mptscsih: ioc0: attempting bus reset! (sc=e000000115651a80) <--##
scsi 1:0:1:0:
Unable to handle kernel NULL pointer dereference (address 0000000000000000) <--##
scsi_eh_1[660]: Oops 8813272891392 [1]
Modules linked in: ipmi_watchdog e1000(U) tg3 e100 ipv6 autofs4 hidp l2cap bluetooth sunrpc fefpcl(U) panicforpcl(U) mptctl ipmi_devintf ipmi_si ipmi_msghandler vfat fat dm_mirror dm_multipath dm_mod button parport_pc lp parport sr_mod shpchp cdrom mii sg ext3 jbd lpfc(U) scsi_transport_fc mptspi scsi_transport_spi mptsas scsi_transport_sas mptscsih mptbase sd_mod usb_storage ehci_hcd ohci_hcd uhci_hcd scsi_mod

Pid: 660, CPU 0, comm: scsi_eh_1
psr : 00001010085a6010 ifs : 800000000000030d ip : [<a000000207b65540>] Not tainted
ip is at mptscsih_bus_reset+0xe0/0x220 [mptscsih]
unat: 0000000000000000 pfs : 000000000000030d rsc : 0000000000000003
rnat: 0000000000000000 bsps: a0000001002c50c0 pr : 0000000000006a41
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70433f
csd : 0000000000000000 ssd : 0000000000000000
b0 : a000000207b654e0 b6 : a0000001002c50c0 b7 : a0000001002c50c0
f6 : 0fffafffffffff0000000 f7 : 0ffdb8000000000000000
f8 : 0ffff8000000000000000 f9 : 100038000000000000000
f10 : 0fffafffffffff0000000 f11 : 1003e0000000000000000
r1 : a000000207b750d8 r2 : a0000001009f6b40 r3 : a0000001009e0438
r8 : 0000000000000001 r9 : 0000000000000002 r10 : e00000010f972888
r11 : e0000001149a1000 r12 : e00000011565fd10 r13 : e000000115658000
r14 : a0000001009f6b40 r15 : e0000001149a1040 r16 : 0000000000000000
r17 : e00000010f972800 r18 : e000000114b39590 r19 : 0000000000000003
r20 : 0000000000000000 r21 : a0000001009df7e8 r22 : a0000001009f6ee0
r23 : a000000100829280 r24 : a0000001009df7e8 r25 : a0000001009f6b48
r26 : a0000001009f6b48 r27 : 0000000000000000 r28 : 000000000000000a
r29 : 0000000000000000 r30 : 0000000000000000 r31 : a0000001009f6ebc

Call Trace:
[<a000000100013b20>] show_stack+0x40/0xa0 sp=e00000011565f8a0 bsp=e000000115659330
[<a000000100014420>] show_regs+0x840/0x880 sp=e00000011565fa70 bsp=e0000001156592d8
[<a000000100037740>] die+0x1c0/0x2c0 sp=e00000011565fa70 bsp=e000000115659290
[<a00000010062ae00>] ia64_do_page_fault+0x8a0/0x9e0 sp=e00000011565fa90 bsp=e000000115659240
[<a00000010000c020>] __ia64_leave_kernel+0x0/0x280 sp=e00000011565fb40 bsp=e000000115659240
[<a000000207b65540>] mptscsih_bus_reset+0xe0/0x220 [mptscsih] sp=e00000011565fd10 bsp=e0000001156591d0
[<a0000002079c04d0>] scsi_try_bus_reset+0xf0/0x240 [scsi_mod] sp=e00000011565fd10 bsp=e0000001156591a0
[<a0000002079c24d0>] scsi_eh_ready_devs+0x710/0xbe0 [scsi_mod] sp=e00000011565fd10 bsp=e000000115659158
[<a0000002079c34c0>] scsi_error_handler+0x840/0xc60 [scsi_mod] sp=e00000011565fd10 bsp=e000000115659110
[<a0000001000abeb0>] kthread+0x230/0x2c0 sp=e00000011565fd50 bsp=e0000001156590c8
[<a000000100012090>] kernel_thread_helper+0x30/0x60 sp=e00000011565fe30 bsp=e0000001156590a0
[<a0000001000090c0>] start_kernel_thread+0x20/0x40 sp=e00000011565fe30 bsp=e0000001156590a0
<0>Kernel panic - not syncing: Fatal exception

Version-Release number of selected component (if applicable):
kernel-2.6.18-53.el5

How reproducible:
Always

Steps to Reproduce:
1. Offline a device while a bus reset is occurring

Actual results:
Above oops

Expected results:
Error handler does not panic.

Additional info:
Created attachment 301986
Check for NULL pointers when retrieving vdevice / vdevice->vtarget in mptscsih_bus_reset
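For readers without access to the attachment, the kind of check it describes can be sketched in user-space C. This is not the attached patch: the mock_* structures, the reset_result values, and mock_bus_reset are hypothetical stand-ins for the kernel's scsi_cmnd, scsi_device, the driver's vdevice/vtarget and mptscsih_bus_reset, and skipping the reset when the pointers are gone is an assumption about how the handler should react.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel/driver structures. */
struct mock_vtarget { int channel; };
struct mock_vdevice { struct mock_vtarget *vtarget; };
struct mock_sdev    { void *hostdata; };          /* like SCpnt->device */
struct mock_cmnd    { struct mock_sdev *device; };/* like SCpnt */

/* Hypothetical outcome codes, not the kernel's SUCCESS/FAILED values. */
enum reset_result { RESET_SKIPPED, RESET_ISSUED };

/*
 * Look up the vdevice behind a command. Returns NULL when hostdata has
 * already been cleared (device removed or offlined) or when the vdevice
 * no longer has a vtarget.
 */
static const struct mock_vdevice *lookup_vdevice(const struct mock_cmnd *sc)
{
    const struct mock_vdevice *vdevice;

    if (sc->device == NULL)
        return NULL;
    vdevice = sc->device->hostdata;
    if (vdevice == NULL || vdevice->vtarget == NULL)
        return NULL;
    return vdevice;
}

/*
 * Mock bus reset: only dereference vdevice->vtarget after both pointers
 * have been checked. On success the channel that would be reset is
 * stored in *channel.
 */
static enum reset_result mock_bus_reset(const struct mock_cmnd *sc,
                                        int *channel)
{
    const struct mock_vdevice *vdevice;

    assert(sc != NULL && channel != NULL);

    vdevice = lookup_vdevice(sc);
    if (vdevice == NULL)
        return RESET_SKIPPED;   /* assumption: nothing left to reset */

    *channel = vdevice->vtarget->channel;
    return RESET_ISSUED;
}
```

With a live device, mock_bus_reset reports RESET_ISSUED and fills in the channel. Once hostdata or vtarget is NULL it returns RESET_SKIPPED instead of dereferencing a NULL pointer. Note that a NULL check only helps if teardown clears the pointer; it does not protect against a stale, non-NULL pointer to freed memory.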
Comment #0 should read "hot-removed or offlined" - you don't actually need to disconnect the drive, just have it offlined so that the kernel removes it.
Eric posted a patch containing this change (along with one or two others :) to linux-scsi last year:

http://marc.info/?l=linux-scsi&m=119008142831206&w=2

$ diffstat /tmp/mpt-linux-scsi.patch
mptscsih.c | 1498 +++++++++++++++++++++++++++---------------------------------
mptscsih.h | 8
2 files changed, 666 insertions(+), 840 deletions(-)

$ diffstat /tmp/mpt_bus_reset.patch
mptscsih.c | 3 +++
1 file changed, 3 insertion
Just seen now on F9 RHTS x86_64 ibm-taroko.rhts.bos.redhat.com, Job 24616, kernel-2.6.25-14.fc9.x86_64, during startup, no device removal/plugging:
http://rhts.redhat.com/testlogs/24616/89682/749716/3448895-test_log--tools-gdb-gdb-any-EXTERNALWATCHDOG.log

BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
IP: [<ffffffff880a37d6>] :mptscsih:mptscsih_bus_reset+0xa6/0x109
PGD 3e172067 PUD 3edcf067 PMD 3e0ca067 PTE 0
Oops: 0000 [1] SMP
CPU 1
Modules linked in: nfs lockd nfs_acl bridge bnep rfcomm l2cap bluetooth sunrpc ipv6 cpufreq_ondemand acpi_cpufreq freq_table loop dm_multipath sr_mod cdrom pata_acpi ata_generic ppdev snd_hda_intel parport_pc snd_seq_dummy parport snd_seq_oss floppy snd_seq_midi_event snd_seq snd_seq_device firewire_ohci firewire_core pcspkr serio_raw snd_pcm_oss snd_mixer_oss crc_itu_t snd_pcm i2c_i801 snd_timer snd_page_alloc ahci i2c_core iTCO_wdt ata_piix iTCO_vendor_support snd_hwdep libata button i82975x_edac snd edac_core tg3 soundcore sg dm_snapshot dm_zero dm_mirror dm_mod shpchp mptsas mptscsih mptbase scsi_transport_sas sd_mod scsi_mod ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd [last unloaded: microcode]
Pid: 462, comm: scsi_eh_0 Not tainted 2.6.25-14.fc9.x86_64 #1
RIP: 0010:[<ffffffff880a37d6>] [<ffffffff880a37d6>] :mptscsih:mptscsih_bus_reset+0xa6/0x109
RSP: 0018:ffff81003e061dd0 EFLAGS: 00010246
RAX: ffff81003f3e3802 RBX: ffff81003f3e2c80 RCX: 000000000000000a
RDX: 0000000000000000 RSI: 0000000000000046 RDI: ffffffff814e64e4
RBP: ffff81003e061e00 R08: 0000000000000002 R09: 0000000000000000
R10: ffffffff8806027f R11: ffffffff814e6900 R12: ffff81003c94c3c0
R13: ffff81003e4e7000 R14: ffff81003e4e7008 R15: ffff81003f3e2800
FS: 0000000000000000(0000) GS:ffff81003f802680(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 000000003edf8000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process scsi_eh_0 (pid: 462, threadinfo ffff81003e060000, task ffff81003e096000)
Stack: ffff81003f3e3810 0000000000000000 ffff81003c94c3c0 0000000000002003
0000000000000000 ffff81003e061ee0 ffff81003e061e20 ffffffff880541e1
ffff81003c94c3c0 0000000000000000 ffff81003e061e60 ffffffff88054f53
Call Trace:
[<ffffffff880541e1>] :scsi_mod:scsi_try_bus_reset+0x52/0xde
[<ffffffff88054f53>] :scsi_mod:scsi_eh_ready_devs+0x2d3/0x4af
[<ffffffff8805562f>] :scsi_mod:scsi_error_handler+0x352/0x4f1
[<ffffffff81026ae5>] ? __wake_up_common+0x46/0x75
[<ffffffff880552dd>] ? :scsi_mod:scsi_error_handler+0x0/0x4f1
[<ffffffff810477e3>] kthread+0x49/0x76
[<ffffffff8100ccf8>] child_rip+0xa/0x12
[<ffffffff8104779a>] ? kthread+0x0/0x76
[<ffffffff8100ccee>] ? child_rip+0x0/0x12

Code: 00 00 49 8b 04 24 b9 28 00 00 00 48 8b 90 88 00 00 00 41 8a 85 98 00 00 00 84 c0 74 0e 31 c9 3c 02 0f 94 c1 8d 0c cd 02 00 00 00 <48> 8b 02 45 31 c9 45 31 c0 48 89 df be 04 00 00 00 0f b6 50 0b
RIP [<ffffffff880a37d6>] :mptscsih:mptscsih_bus_reset+0xa6/0x109
RSP <ffff81003e061dd0>
CR2: 0000000000000000
---[ end trace 0e0ecc73240609da ]---
in kernel-2.6.18-107.el5

You can download this test kernel from http://people.redhat.com/dzickus/el5
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html