Bug 179057

Summary: SCSI LLDD's oops on rmmod if devices scan w/ PQ=3
Product: Red Hat Enterprise Linux 4 Reporter: James Smart <james.smart>
Component: kernelAssignee: Mike Christie <mchristi>
Status: CLOSED ERRATA QA Contact: Brian Brock <bbrock>
Severity: high Docs Contact:
Priority: medium    
Version: 4.0CC: coughlan, jbaron
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: RHSA-2006-0132 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2006-03-07 21:21:11 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 168430    

Description James Smart 2006-01-26 21:26:56 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.7.12) Gecko/20050915 Firefox/1.0.7

Description of problem:
scsi_scan.c:scsi_sequential_lun_scan(), when it encounters a LUN whose Inquiry
returns with the PQ bits set to 0x3, will leave the partially initialized sdev
structure around. If a LLDD then attempts to unload, an oops with a stack trace
similar to the following:

[<a000000100019980>] show_stack+0x80/0xa0
                  sp=e000000071fc7810 bsp=e000000071fc1420
[<a00000010003d1e0>] die+0x1c0/0x2e0
                  sp=e000000071fc79e0 bsp=e000000071fc13e0
[<a00000010003d470>] ia64_fault+0x110/0x1060
                  sp=e000000071fc79e0 bsp=e000000071fc1388
[<a000000100010180>] ia64_leave_kernel+0x0/0x260
                  sp=e000000071fc7bf0 bsp=e000000071fc1388
[<a00000010020de90>] sysfs_hash_and_remove+0x10/0x2a0
                  sp=e000000071fc7dc0 bsp=e000000071fc1348
[<a000000100210830>] sysfs_remove_link+0x30/0x60
                  sp=e000000071fc7dc0 bsp=e000000071fc1320
[<a0000001003d7e20>] class_device_del+0x340/0x380
                  sp=e000000071fc7dc0 bsp=e000000071fc12d8
[<a0000001003d7e80>] class_device_unregister+0x20/0x60
                  sp=e000000071fc7dc0 bsp=e000000071fc12b8
[<a0000002008a84b0>] scsi_remove_device+0xb0/0x2e0 [scsi_mod]
                  sp=e000000071fc7dc0 bsp=e000000071fc1288
[<a0000002008a42a0>] scsi_forget_host+0xe0/0x1e0 [scsi_mod]
                  sp=e000000071fc7dc0 bsp=e000000071fc1258
[<a000000200896210>] scsi_remove_host+0x30/0xe0 [scsi_mod]
                  sp=e000000071fc7dc0 bsp=e000000071fc1238
[<a00000020099e8d0>] lpfc_pci_remove_one+0x410/0x860 [lpfc]
                  sp=e000000071fc7dc0 bsp=e000000071fc11c8
[<a000000100331720>] pci_device_remove+0xc0/0xe0
                  sp=e000000071fc7dc0 bsp=e000000071fc11a0
[<a0000001003d6080>] device_release_driver+0x120/0x140
                  sp=e000000071fc7dc0 bsp=e000000071fc1170
[<a0000001003d6160>] bus_remove_driver+0xc0/0x1c0
                  sp=e000000071fc7dc0 bsp=e000000071fc1138
[<a0000001003d6a50>] driver_unregister+0x30/0xc0
                  sp=e000000071fc7dc0 bsp=e000000071fc1118
[<a000000100331c50>] pci_unregister_driver+0x30/0x160
                  sp=e000000071fc7dc0 bsp=e000000071fc10d8
[<a0000002009a7610>] lpfc_exit+0x30/0x70 [lpfc]
                  sp=e000000071fc7dc0 bsp=e000000071fc10c0
[<a0000001000f7290>] sys_delete_module+0x370/0x620
                  sp=e000000071fc7dc0 bsp=e000000071fc1040
[<a000000100010000>] ia64_ret_from_syscall+0x0/0x20
                  sp=e000000071fc7e30 bsp=e000000071fc1040
[<a000000000010640>] 0xa000000000010640
                  sp=e000000071fc8000 bsp=e000000071fc1040




The following patch (which I'm sure won't match your sources) corrects the issue:

--- a/drivers/scsi/scsi_scan.c   2005-12-12
18:49:19.000000000 -0500
+++ ./scsi_scan.c       2006-01-26 11:28:55.000000000 -0500
@@ -816,6 +816,8 @@ static void scsi_sequential_lun_scan(str
                uint id, int bflags, int lun0_res, int scsi_level, int rescan)
 {
        unsigned int sparse_lun, lun, max_dev_lun;
+       struct scsi_device *sdev;
+       int res;

        SCSI_LOG_SCAN_BUS(3, printk(KERN_INFO "scsi scan: Sequential scan of"
                        " host %d channel %d id %d\n", shost->host_no,
@@ -880,11 +882,14 @@ static void scsi_sequential_lun_scan(str
         * until we reach the max, or no LUN is found and we are not
         * sparse_lun.
         */
-       for (lun = 1; lun < max_dev_lun; ++lun)
-               if ((scsi_probe_and_add_lun(shost, channel, id, lun,
-                     NULL, NULL, rescan) != SCSI_SCAN_LUN_PRESENT) &&
-                   !sparse_lun)
+       for (lun = 1; lun < max_dev_lun; ++lun) {
+               res = scsi_probe_and_add_lun(shost, channel, id, lun,
+                       NULL, &sdev, rescan);
+               if (res == SCSI_SCAN_LUN_IGNORED)
+                       scsi_scan_remove(sdev);
+               if ((res != SCSI_SCAN_LUN_PRESENT) && !sparse_lun)
                        return;
+       }
 }

 /**

Version-Release number of selected component (if applicable):
kernel 2.6.9-24 and higher

How reproducible:
Always

Steps to Reproduce:
1. Must have a device in which Inquiries to unconfigured luns return w/ PQ bits = 0x03
2.after scanning and seeing luns, rmmod the LLDD
3.
  

Actual Results:  oops!!

Expected Results:  no oops :)

Additional info:

Comment 2 Mike Christie 2006-01-27 18:01:30 UTC
Hi James,

I think we have this fixed in the newest kernel. Could you try this kernel real
quick http://people.redhat.com/~jbaron/rhel4/

I just looked at the source and I am pretty sure it is fixed. For pq=3 and
sdevp=NULL we remove the device for the caller. I do not have a box that returns
pq=3, but I will download pat's scsi_debug patch again, and try out the
sequential scan path to verify this.

Comment 3 James Smart 2006-01-30 21:03:25 UTC
I've looked at the patch, and believe it should be ok.

However, it seems messy. If nothing else, the code could be much clearer. Right
now, you have to remember that if lun0 scan by scsi_scan_target(), sdevp is set
and you need to leave the sdev, while if called by a report_luns or sequential
scan, sdevp is Null. Also, it's leaving the "if (res == SCSI_SCAN_LUN_IGNORED)
scsi_scan_remove(sdev)" snippets in scsi_scan_target() and __scsi_add_device().
So there's essentially 2 pieces of logic doing the same thing.

As it should work - I'm trying to have some test put together to ensure it.
However, it may take a couple of days.

Also - I take it that this fix is currently planned to be part of U3, right ?

Thanks

-- james

Comment 4 Mike Christie 2006-01-30 22:45:11 UTC
In the current code if you the caller needs the sdev pointer then it removes the
device when it is done with it. This is why the second call to scsi_scan_target
and __scsi_add_device do this. And if the caller does not need the sdev pointer
so it passes NULL then the removal is handled for it.

We should have done the starget stuff, like upstream but did not have time. I
guess we could have just made the sdev pointer a mandatory argument to
scsi_probe_and_add_lun but it seemed like a waste to have all the callers check
this and remove it if necessary. Did you have something else in mind?

Comment 5 James Smart 2006-01-31 19:26:01 UTC
We can keep it as is.

We put it through some tests and everything looks good.

Thank You.

-- james

Comment 11 Red Hat Bugzilla 2006-03-07 21:21:11 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2006-0132.html