Bug 754907

Summary: 2.6.41.1-1.fc15.x86_64: cciss module crash
Product: [Fedora] Fedora
Reporter: Jan ONDREJ <ondrejj>
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED ERRATA
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 15
CC: gansalmon, itamar, jonathan, kernel-maint, madhu.chinakonda, sgruszka, steve.cameron, thenzl, xiaoli
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: kernel-2.6.41.4-1.fc15
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2011-12-10 19:51:25 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
  Description                                                          Flags
  Full dmesg                                                           none
  cciss.patch                                                          none
  Patch to add IRQF_SHARED flag to hpsa for non-msi interrupt handler  none
  Patch to add IRQF_SHARED flag to cciss for non msi(x) interrupts     none

Description Jan ONDREJ 2011-11-18 07:26:22 UTC
Created attachment 534351 [details]
Full dmesg

Description of problem:
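After upgrading the kernel and rebooting, the system doesn't work: the cciss driver fails to get its IRQ and the probe of the controller fails.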

Version-Release number of selected component (if applicable):
kernel-2.6.41.1-1.fc15.x86_64

How reproducible:
always
  
Actual results:
[ 1093.978987] HP CISS Driver (v 3.6.26)
[ 1094.003078] IRQ handler type mismatch for IRQ 16
[ 1094.003228] current handler: uhci_hcd:usb2
[ 1094.003366] Pid: 2921, comm: modprobe Not tainted 2.6.41.1-1.fc15.x86_64 #1
[ 1094.003509] Call Trace:
[ 1094.003659]  [<ffffffff810b1d76>] __setup_irq+0x39e/0x432
[ 1094.003805]  [<ffffffff8111971c>] ? kmem_cache_alloc_trace+0xb3/0xc5
[ 1094.003956]  [<ffffffffa01d5899>] ? process_indexed_cmd+0xa6/0xa6 [cciss]
[ 1094.004001]  [<ffffffff810b1ef4>] request_threaded_irq+0xea/0x116
[ 1094.004001]  [<ffffffffa01d7bca>] cciss_request_irq+0x66/0x98 [cciss]
[ 1094.004001]  [<ffffffffa01d6ddb>] cciss_init_one+0x1123/0x1a2f [cciss]
[ 1094.004001]  [<ffffffff8111a707>] ? kmem_cache_alloc+0x31/0xf8
[ 1094.004001]  [<ffffffff811843a4>] ? sysfs_find_dirent+0x3c/0x55
[ 1094.004001]  [<ffffffff81085d77>] ? arch_local_irq_save+0x15/0x1b
[ 1094.004001]  [<ffffffff81262487>] local_pci_probe+0x44/0x75
[ 1094.004001]  [<ffffffff81262fea>] pci_device_probe+0xd0/0xff
[ 1094.004001]  [<ffffffff81301017>] driver_probe_device+0x131/0x213
[ 1094.004001]  [<ffffffff81301153>] __driver_attach+0x5a/0x7e
[ 1094.004001]  [<ffffffff813010f9>] ? driver_probe_device+0x213/0x213
[ 1094.004001]  [<ffffffff8130009f>] bus_for_each_dev+0x53/0x89
[ 1094.004001]  [<ffffffff81300bf6>] driver_attach+0x1e/0x20
[ 1094.004001]  [<ffffffff8130081a>] bus_add_driver+0xd1/0x224
[ 1094.004001]  [<ffffffffa016e000>] ? 0xffffffffa016dfff
[ 1094.004001]  [<ffffffff813015f7>] driver_register+0x98/0x105
[ 1094.004001]  [<ffffffffa016e000>] ? 0xffffffffa016dfff
[ 1094.004001]  [<ffffffff812638ad>] __pci_register_driver+0x56/0xc1
[ 1094.004001]  [<ffffffffa016e000>] ? 0xffffffffa016dfff
[ 1094.004001]  [<ffffffffa016e07d>] cciss_init+0x7d/0xa1 [cciss]
[ 1094.004001]  [<ffffffff81002099>] do_one_initcall+0x7f/0x136
[ 1094.004001]  [<ffffffff8108a59d>] sys_init_module+0x88/0x1d0
[ 1094.004001]  [<ffffffff814a3102>] system_call_fastpath+0x16/0x1b
[ 1094.011026] cciss 0000:09:02.0: Unable to get irq 16 for cciss0
[ 1094.012661] cciss: probe of 0000:09:02.0 failed with error -1

Additional info:
I can run some tests on this machine if required.

Comment 1 Stephen Cameron 2011-11-22 19:01:05 UTC
What controller and what server is this?
What kernel are you upgrading from?

And 2.6.41?  I thought they stopped at 2.6.38 and then 3.0.

-- steve

Comment 2 Jan ONDREJ 2011-11-22 19:12:52 UTC
(In reply to comment #1)
> What controller and what server is this?

[root@ftp ~]# cciss_vol_status /dev/cciss/c0d0
/dev/cciss/c0d0: (Smart Array 641) RAID 5 Volume 0 status: OK. 
/dev/cciss/c0d0: (Smart Array 641) Enclosure PROLIANT 6L6I (S/N: ) on Bus 0, Physical Port J1 status: OK.

product: ProLiant ML350 G4

> What kernel are you upgrading from?
> 
> And 2.6.41?  I thought they stopped at 2.6.38 and then 3.0.

Last working kernel: 2.6.40.6-0.fc15.x86_64
First bad kernel (no newer fedora kernel yet): kernel-2.6.41.1-1.fc15.x86_64

Comment 3 Josh Boyer 2011-11-22 19:25:22 UTC
(In reply to comment #2)
> > What kernel are you upgrading from?
> > 
> > And 2.6.41?  I thought they stopped at 2.6.38 and then 3.0.
> 
> Last working kernel: 2.6.40.6-0.fc15.x86_64
> First bad kernel (no newer fedora kernel yet): kernel-2.6.41.1-1.fc15.x86_64

2.6.40.6 is 3.0.6 renamed to avoid breaking F15 userspace that wasn't ready for the 3.0 change.

Similarly 2.6.41.1 is 3.1.1.

FYI.
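
The renaming described above can be sketched as a small userspace helper. Only the two mappings stated in this comment (2.6.40.6 is 3.0.6, 2.6.41.1 is 3.1.1) are confirmed; the general "sublevel minus 40" rule, the struct kver type and the f15_to_upstream name are assumptions made for illustration.

```c
#include <stdio.h>

/* Upstream kernel version as major.minor.patch, e.g. 3.1.1. */
struct kver {
    int major;
    int minor;
    int patch;
};

/*
 * Translate a Fedora 15 "2.6.N.M" version string into the upstream 3.x.y
 * version it was renamed from. Only 2.6.40 -> 3.0 and 2.6.41 -> 3.1 are
 * confirmed in this thread; "minor = N - 40" for other N is an assumption.
 * Returns 0 on success, -1 if the string is not 2.6.N.M with N >= 40.
 */
int f15_to_upstream(const char *f15, struct kver *out)
{
    int sublevel, patch;

    if (sscanf(f15, "2.6.%d.%d", &sublevel, &patch) != 2 || sublevel < 40)
        return -1;
    out->major = 3;
    out->minor = sublevel - 40;
    out->patch = patch;
    return 0;
}
```

Trailing package suffixes are ignored by the parse, so passing "2.6.41.1-1.fc15.x86_64" fills in 3.1.1, while a genuine 2.6.38.x string is rejected.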

Comment 4 Stephen Cameron 2011-11-22 19:40:08 UTC
This is the entirety of the difference between the cciss drivers in 3.0.6 and 3.1.1 from kernel.org:

[scameron@localhost fedora-bug]$ for x in linux-3.0.6/drivers/block/cciss*[ch]; do  f=`basename $x`; echo ==== $f ====; diff -u linux-3.0.6/drivers/block/$f linux-3.1.1/drivers/block/$f; done
==== cciss.c ====
--- linux-3.0.6/drivers/block/cciss.c	2011-10-03 15:25:23.000000000 -0500
+++ linux-3.1.1/drivers/block/cciss.c	2011-11-11 14:19:27.000000000 -0600
@@ -4533,6 +4533,13 @@
 		pmcsr &= ~PCI_PM_CTRL_STATE_MASK;
 		pmcsr |= PCI_D0;
 		pci_write_config_word(pdev, pos + PCI_PM_CTRL, pmcsr);
+
+		/*
+		 * The P600 requires a small delay when changing states.
+		 * Otherwise we may think the board did not reset and we bail.
+		 * This for kdump only and is particular to the P600.
+		 */
+		msleep(500);
 	}
 	return 0;
 }
==== cciss_cmd.h ====
==== cciss.h ====
==== cciss_scsi.c ====
--- linux-3.0.6/drivers/block/cciss_scsi.c	2011-10-03 15:25:23.000000000 -0500
+++ linux-3.1.1/drivers/block/cciss_scsi.c	2011-11-11 14:19:27.000000000 -0600
@@ -33,7 +33,7 @@
 #include <linux/slab.h>
 #include <linux/string.h>
 
-#include <asm/atomic.h>
+#include <linux/atomic.h>
 
 #include <scsi/scsi_cmnd.h>
 #include <scsi/scsi_device.h>
==== cciss_scsi.h ====
[scameron@localhost fedora-bug]$

I think whatever broke must reside outside the driver.

-- steve

Comment 5 Stephen Cameron 2011-11-22 19:41:36 UTC
git-bisect would probably pin it down.

Comment 6 Stanislaw Gruszka 2011-11-23 14:58:32 UTC
It seems only the IRQ routing was changed, so cciss now shares its interrupt with another device. Since cciss calls request_irq without the IRQF_SHARED flag, the request fails.

Is there any reason why cciss can not share interrupts?
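
The failure mode described above can be illustrated with a simplified userspace model of the sharing rule. This is not the kernel's code: the real check in __setup_irq also compares trigger types and other flags, and MODEL_IRQF_SHARED and model_can_share are hypothetical names with an illustrative value.

```c
#include <stdbool.h>

/* Local stand-in for the kernel's IRQF_SHARED; the value is illustrative. */
#define MODEL_IRQF_SHARED 0x80u

/*
 * Simplified model of the sharing rule: a second handler may be installed
 * on an IRQ line that already has a handler only if BOTH the existing
 * handler and the new one were registered with the shared flag.
 */
bool model_can_share(unsigned int existing_flags, unsigned int new_flags)
{
    return (existing_flags & new_flags & MODEL_IRQF_SHARED) != 0;
}
```

Assuming the handler already on IRQ 16 (uhci_hcd:usb2 in the log) was registered as shared, a cciss request with flags lacking the shared bit gives model_can_share(MODEL_IRQF_SHARED, 0) == false, which corresponds to the "IRQ handler type mismatch" message in the description.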

Comment 7 Stephen Cameron 2011-11-23 16:16:46 UTC
No.

Most smart arrays use MSI or MSIX these days, so... wouldn't be shared, right?
(I don't need IRQF_SHARED when using MSI/MSIX, correct?  But for non-MSI/MSIX, then I do need IRQF_SHARED, correct?)

This appears to be the commit which removed IRQF_SHARED

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=0c2b39087c900bdb240b50ac95ee9da00d844565

That was more than a year ago, 2010-08-07... if that's it, I'm surprised nobody has complained before.

Maybe something like this?

Author: Stephen M. Cameron <scameron.hp.com>
Date:   Wed Nov 23 10:16:34 2011 -0600

    cciss: Add IRQF_SHARED back in for the non-MSI(X) interrupt handler

diff --git a/drivers/block/cciss.c b/drivers/block/cciss.c
index 486f94e..942ccf8 100644
--- a/drivers/block/cciss.c
+++ b/drivers/block/cciss.c
@@ -4884,7 +4884,7 @@ static int cciss_request_irq(ctlr_info_t *h,
        }
 
        if (!request_irq(h->intr[h->intr_mode], intxhandler,
-                       IRQF_DISABLED, h->devname, h))
+                       IRQF_DISABLED | IRQF_SHARED, h->devname, h))
                return 0;
        dev_err(&h->pdev->dev, "Unable to get irq %d for %s\n",
                h->intr[h->intr_mode], h->devname);

Comment 8 Stephen Cameron 2011-11-23 16:19:41 UTC
If that is correct, probably need similar for hpsa (though I think all boards officially supported by hpsa use MSI, but the "hpsa_allow_any=1" kernel option may expose older boards.)

-- steve

Comment 9 Stanislaw Gruszka 2011-11-24 07:23:34 UTC
Jan, can you apply the patch from comment 7 and test it? Let me know if you are not familiar with kernel compilation; I will launch a kernel build with the patch in http://koji.fedoraproject.org/koji/ .

Comment 10 Jan ONDREJ 2011-11-24 07:26:51 UTC
(In reply to comment #9)
> Jan, can you apply the patch from comment 7 and test it? Let me know if you are
> not familiar with kernel compilation; I will launch a kernel build with the
> patch in http://koji.fedoraproject.org/koji/ .

Hello. I have no time to build a kernel now, but if you can build one for me in koji, I will be happy to test it.

Comment 11 Stanislaw Gruszka 2011-11-24 11:04:16 UTC
Ok, here is the kernel with patch:
http://koji.fedoraproject.org/koji/taskinfo?taskID=3537034

Comment 12 Jan ONDREJ 2011-11-24 11:12:33 UTC
(In reply to comment #11)
> Ok, here is the kernel with patch:
> http://koji.fedoraproject.org/koji/taskinfo?taskID=3537034

Works well. All disks are present.

Comment 13 Stanislaw Gruszka 2011-11-24 11:40:05 UTC
Created attachment 535864 [details]
cciss.patch

This is the exact patch I used in the test kernel. Josh, please apply it. Stephen, please post it :-) Also note that you can get rid of IRQF_DISABLED; according to include/linux/interrupt.h it is a no-op and deprecated.

Comment 14 Stephen Cameron 2011-11-28 15:03:14 UTC
(In reply to comment #13)
> Created attachment 535864 [details]
> cciss.patch
> 
> This is the exact patch I used in the test kernel. Josh, please apply it. Stephen,
> please post it :-) Also note that you can get rid of IRQF_DISABLED; according
> to include/linux/interrupt.h it is a no-op and deprecated.

So I also have IRQF_DISABLED in the msix path as the only flag.  Should I just use 0 for the flags there?  Should I add in IRQF_SAMPLE_RANDOM?  I seem to remember that used to be in there at one time as well.

-- steve

Comment 15 Stephen Cameron 2011-11-28 15:07:19 UTC
(In reply to comment #14)
> (In reply to comment #13)
> > Created attachment 535864 [details]
> > cciss.patch
> > 
> > This is the exact patch I used in the test kernel. Josh, please apply it. Stephen,
> > please post it :-) Also note that you can get rid of IRQF_DISABLED; according
> > to include/linux/interrupt.h it is a no-op and deprecated.
> 
> So I also have IRQF_DISABLED in the msix path as the only flag.  Should I just
> use 0 for the flags there?  Should I add in IRQF_SAMPLE_RANDOM?  I seem to
> remember that used to be in there at one time as well.
> 
> -- steve

Well, digging around, I see there are plenty of uses of request_irq with flags passed as 0, and no scsi drivers use IRQF_SAMPLE_RANDOM and only one block driver, so I guess I shouldn't use IRQF_SAMPLE_RANDOM.

Comment 16 Stanislaw Gruszka 2011-11-28 15:42:55 UTC
Zero should be fine, as long as the device does not generate interrupts in a truly random manner.
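
The flag choice discussed in comments 13 to 16 could be summarized with a sketch like the one below. It is a userspace illustration only, not the patch that was posted: the enum, the MODEL_IRQF_SHARED value and the model_irq_flags name are all hypothetical.

```c
/* Local stand-in for the kernel's IRQF_SHARED; the value is illustrative. */
#define MODEL_IRQF_SHARED 0x80u

/* Interrupt delivery modes a controller might be set up with. */
enum model_intr_mode {
    MODEL_INTX, /* legacy pin-based interrupt, line may be shared */
    MODEL_MSI,  /* message-signaled interrupt */
    MODEL_MSIX  /* MSI-X */
};

/*
 * Pick request flags per the discussion above: the legacy INTx line can be
 * shared with other devices, so it asks for sharing; MSI and MSI-X pass 0.
 * The deprecated no-op IRQF_DISABLED is dropped in both cases.
 */
unsigned int model_irq_flags(enum model_intr_mode mode)
{
    return mode == MODEL_INTX ? MODEL_IRQF_SHARED : 0u;
}
```

Keeping the decision in one helper makes it harder for the INTx and MSI(X) request paths to drift apart again, which is roughly what happened when the shared flag was dropped from the non-MSI path.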

Comment 17 Stephen Cameron 2011-11-28 17:11:41 UTC
Created attachment 537541 [details]
Patch to add IRQF_SHARED flag to hpsa for non-msi interrupt handler

Here is the patch I sent to the lkml for hpsa to add IRQF_SHARED to non msi(x) interrupt request.

Comment 18 Stephen Cameron 2011-11-28 17:12:51 UTC
Created attachment 537543 [details]
Patch to add IRQF_SHARED flag to cciss for non msi(x) interrupts

Here is the patch I sent to lkml for cciss to add IRQF_SHARED to the non msi(x) interrupt request.

Comment 19 Chuck Ebbert 2011-11-28 21:39:12 UTC
Patches added to F15 and F16, will be in the next update.

Comment 20 Fedora Update System 2011-11-29 13:51:50 UTC
kernel-2.6.41.4-1.fc15 has been submitted as an update for Fedora 15.
https://admin.fedoraproject.org/updates/kernel-2.6.41.4-1.fc15

Comment 21 Fedora Update System 2011-11-30 02:03:19 UTC
Package kernel-2.6.41.4-1.fc15:
* should fix your issue,
* was pushed to the Fedora 15 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing kernel-2.6.41.4-1.fc15'
as soon as you are able to, then reboot.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2011-16621/kernel-2.6.41.4-1.fc15
then log in and leave karma (feedback).

Comment 22 Fedora Update System 2011-12-10 19:51:25 UTC
kernel-2.6.41.4-1.fc15 has been pushed to the Fedora 15 stable repository.  If problems still persist, please make note of it in this bug report.