Bug 179057 - SCSI LLDD's oops on rmmod if devices scan w/ PQ=3
Summary: SCSI LLDD's oops on rmmod if devices scan w/ PQ=3
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.0
Hardware: All
OS: Linux
medium
high
Target Milestone: ---
: ---
Assignee: Mike Christie
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks: 168430
TreeView+ depends on / blocked
 
Reported: 2006-01-26 21:26 UTC by James Smart
Modified: 2007-11-30 22:07 UTC (History)
2 users (show)

Fixed In Version: RHSA-2006-0132
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2006-03-07 21:21:11 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2006:0132 0 qe-ready SHIPPED_LIVE Moderate: Updated kernel packages available for Red Hat Enterprise Linux 4 Update 3 2006-03-09 16:31:00 UTC

Description James Smart 2006-01-26 21:26:56 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.7.12) Gecko/20050915 Firefox/1.0.7

Description of problem:
scsi_scan.c:scsi_sequential_lun_scan(), when it encounters a LUN whose Inquiry
returns with the PQ bits set to 0x3, will leave the partially initialized sdev
structure around. If a LLDD then attempts to unload, an oops with a stack trace
similar to the following:

[<a000000100019980>] show_stack+0x80/0xa0
                  sp=e000000071fc7810 bsp=e000000071fc1420
[<a00000010003d1e0>] die+0x1c0/0x2e0
                  sp=e000000071fc79e0 bsp=e000000071fc13e0
[<a00000010003d470>] ia64_fault+0x110/0x1060
                  sp=e000000071fc79e0 bsp=e000000071fc1388
[<a000000100010180>] ia64_leave_kernel+0x0/0x260
                  sp=e000000071fc7bf0 bsp=e000000071fc1388
[<a00000010020de90>] sysfs_hash_and_remove+0x10/0x2a0
                  sp=e000000071fc7dc0 bsp=e000000071fc1348
[<a000000100210830>] sysfs_remove_link+0x30/0x60
                  sp=e000000071fc7dc0 bsp=e000000071fc1320
[<a0000001003d7e20>] class_device_del+0x340/0x380
                  sp=e000000071fc7dc0 bsp=e000000071fc12d8
[<a0000001003d7e80>] class_device_unregister+0x20/0x60
                  sp=e000000071fc7dc0 bsp=e000000071fc12b8
[<a0000002008a84b0>] scsi_remove_device+0xb0/0x2e0 [scsi_mod]
                  sp=e000000071fc7dc0 bsp=e000000071fc1288
[<a0000002008a42a0>] scsi_forget_host+0xe0/0x1e0 [scsi_mod]
                  sp=e000000071fc7dc0 bsp=e000000071fc1258
[<a000000200896210>] scsi_remove_host+0x30/0xe0 [scsi_mod]
                  sp=e000000071fc7dc0 bsp=e000000071fc1238
[<a00000020099e8d0>] lpfc_pci_remove_one+0x410/0x860 [lpfc]
                  sp=e000000071fc7dc0 bsp=e000000071fc11c8
[<a000000100331720>] pci_device_remove+0xc0/0xe0
                  sp=e000000071fc7dc0 bsp=e000000071fc11a0
[<a0000001003d6080>] device_release_driver+0x120/0x140
                  sp=e000000071fc7dc0 bsp=e000000071fc1170
[<a0000001003d6160>] bus_remove_driver+0xc0/0x1c0
                  sp=e000000071fc7dc0 bsp=e000000071fc1138
[<a0000001003d6a50>] driver_unregister+0x30/0xc0
                  sp=e000000071fc7dc0 bsp=e000000071fc1118
[<a000000100331c50>] pci_unregister_driver+0x30/0x160
                  sp=e000000071fc7dc0 bsp=e000000071fc10d8
[<a0000002009a7610>] lpfc_exit+0x30/0x70 [lpfc]
                  sp=e000000071fc7dc0 bsp=e000000071fc10c0
[<a0000001000f7290>] sys_delete_module+0x370/0x620
                  sp=e000000071fc7dc0 bsp=e000000071fc1040
[<a000000100010000>] ia64_ret_from_syscall+0x0/0x20
                  sp=e000000071fc7e30 bsp=e000000071fc1040
[<a000000000010640>] 0xa000000000010640
                  sp=e000000071fc8000 bsp=e000000071fc1040




The following patch (which I'm sure won't match your sources) corrects the issue:

--- a/drivers/scsi/scsi_scan.c   2005-12-12
18:49:19.000000000 -0500
+++ ./scsi_scan.c       2006-01-26 11:28:55.000000000 -0500
@@ -816,6 +816,8 @@ static void scsi_sequential_lun_scan(str
                uint id, int bflags, int lun0_res, int scsi_level, int rescan)
 {
        unsigned int sparse_lun, lun, max_dev_lun;
+       struct scsi_device *sdev;
+       int res;

        SCSI_LOG_SCAN_BUS(3, printk(KERN_INFO "scsi scan: Sequential scan of"
                        " host %d channel %d id %d\n", shost->host_no,
@@ -880,11 +882,14 @@ static void scsi_sequential_lun_scan(str
         * until we reach the max, or no LUN is found and we are not
         * sparse_lun.
         */
-       for (lun = 1; lun < max_dev_lun; ++lun)
-               if ((scsi_probe_and_add_lun(shost, channel, id, lun,
-                     NULL, NULL, rescan) != SCSI_SCAN_LUN_PRESENT) &&
-                   !sparse_lun)
+       for (lun = 1; lun < max_dev_lun; ++lun) {
+               res = scsi_probe_and_add_lun(shost, channel, id, lun,
+                       NULL, &sdev, rescan);
+               if (res == SCSI_SCAN_LUN_IGNORED)
+                       scsi_scan_remove(sdev);
+               if ((res != SCSI_SCAN_LUN_PRESENT) && !sparse_lun)
                        return;
+       }
 }

 /**

Version-Release number of selected component (if applicable):
kernel 2.6.9-24 and higher

How reproducible:
Always

Steps to Reproduce:
1. Must have a device in which Inquiries to unconfigured luns return w/ PQ bits = 0x03
2.after scanning and seeing luns, rmmod the LLDD
3.
  

Actual Results:  oops!!

Expected Results:  no oops :)

Additional info:

Comment 2 Mike Christie 2006-01-27 18:01:30 UTC
Hi James,

I think we have this fixed in the newest kernel. Could you try this kernel real
quick http://people.redhat.com/~jbaron/rhel4/

I just looked at the source and I am pretty sure it is fixed. For pq=3 and
sdevp=NULL we remove the device for the caller. I do not have a box that returns
pq=3, but I will download pat's scsi_debug patch again, and try out the
sequential scan path to verify this.

Comment 3 James Smart 2006-01-30 21:03:25 UTC
I've looked at the patch, and believe it should be ok.

However, it seems messy. If nothing else, the code could be much clearer. Right
now, you have to remember that if lun0 scan by scsi_scan_target(), sdevp is set
and you need to leave the sdev, while if called by a report_luns or sequential
scan, sdevp is Null. Also, it's leaving the "if (res == SCSI_SCAN_LUN_IGNORED)
scsi_scan_remove(sdev)" snippets in scsi_scan_target() and __scsi_add_device().
So there's essentially 2 pieces of logic doing the same thing.

As it should work - I'm trying to have some test put together to ensure it.
However, it may take a couple of days.

Also - I take it that this fix is currently planned to be part of U3, right ?

Thanks

-- james

Comment 4 Mike Christie 2006-01-30 22:45:11 UTC
In the current code if you the caller needs the sdev pointer then it removes the
device when it is done with it. This is why the second call to scsi_scan_target
and __scsi_add_device do this. And if the caller does not need the sdev pointer
so it passes NULL then the removal is handled for it.

We should have done the starget stuff, like upstream but did not have time. I
guess we could have just made the sdev pointer a mandatory argument to
scsi_probe_and_add_lun but it seemed like a waste to have all the callers check
this and remove it if necessary. Did you have something else in mind?

Comment 5 James Smart 2006-01-31 19:26:01 UTC
We can keep it as is.

We put it through some tests and everything looks good.

Thank You.

-- james

Comment 11 Red Hat Bugzilla 2006-03-07 21:21:11 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2006-0132.html



Note You need to log in before you can comment on or make changes to this bug.