Bug 808880

Summary: ACPI/IRQ assignment regression in kernels > 2.6.40 (Asus M2N-LR + 3Ware-9xxx)
Product: [Fedora] Fedora Reporter: Solomon Peachy <pizza>
Component: kernel  Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED WONTFIX QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high Docs Contact:
Priority: unspecified    
Version: 15  CC: gansalmon, itamar, jonathan, kernel-maint, kevin.hobbs.1, madhu.chinakonda, sassmann
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: ---  Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2012-07-11 17:51:27 UTC Type: ---
Attachments:
Description | Flags
kernel log of successful 2.6.40.6 bootup. | none
output of lspci -v | none
dmidecode output | none
dmesg log of failed 2.6.43.1-2 bootup | none
PRT.patch | none

Description Solomon Peachy 2012-04-01 13:21:36 UTC
Description of problem:

I have a pair of 3Ware-9550SX cards plugged into an ASUS M2N-LR motherboard.  The system is running Fedora 15, kernel 2.6.40.6.

If I boot the system with the 2.6.41 or 2.6.42 kernels, the SCSI partition probes never succeed, and the kernel logs are peppered with scsi command timeout messages.

Eventually systemd kicks me to an emergency shell if I boot in single user mode, udev spits out timeout errors, and the system is completely unusable.  I can't even plug in a USB stick to get the kernel message log.

Version-Release number of selected component (if applicable):

kernel-2.6.40.6-0.fc15.x86_64:  Works
kernel-2.6.42.12-1.fc15.x86_64: Fails

(I no longer have a 2.6.41 kernel installed on this system, but those failed too)

How reproducible:

100%

Additional info:

http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=658670

This appears to be the same problem I'm having.  It's reported against the Debian 3.1.x kernel, and the reporter has the same motherboard, but I can't tell if it's the same controller.

Additionally, I tried the Fedora 16 LiveCD, and that also failed in the same way.  If someone can suggest a way to get a log out of the system, I'm all ears.
I can easily attach a kernel log from a successful 2.6.40 boot, but I don't know how useful that will be.

00:00.0 RAM memory: nVidia Corporation MCP55 Memory Controller (rev a2)
00:01.0 ISA bridge: nVidia Corporation MCP55 LPC Bridge (rev a3)
00:01.1 SMBus: nVidia Corporation MCP55 SMBus (rev a3)
00:02.0 USB Controller: nVidia Corporation MCP55 USB Controller (rev a1)
00:02.1 USB Controller: nVidia Corporation MCP55 USB Controller (rev a2)
00:05.0 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a3)
00:05.1 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a3)
00:05.2 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a3)
00:06.0 PCI bridge: nVidia Corporation MCP55 PCI bridge (rev a2)
00:0a.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a3)
00:0b.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a3)
00:0c.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a3)
00:0d.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a3)
00:0e.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a3)
00:0f.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a3)
00:18.0 Host bridge: Advanced Micro Devices [AMD] Family 10h Processor HyperTransport Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] Family 10h Processor Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] Family 10h Processor DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] Family 10h Processor Miscellaneous Control
00:18.4 Host bridge: Advanced Micro Devices [AMD] Family 10h Processor Link Control
01:05.0 VGA compatible controller: ATI Technologies Inc ES1000 (rev 02)
02:00.0 PCI bridge: Intel Corporation 6702PXH PCI Express-to-PCI Bridge A (rev 09)
03:00.0 RAID bus controller: 3ware Inc 9550SX SATA-II RAID PCI-X
03:04.0 RAID bus controller: 3ware Inc 9550SX SATA-II RAID PCI-X
04:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5721 Gigabit Ethernet PCI Express (rev 21)
05:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5721 Gigabit Ethernet PCI Express (rev 21)

Comment 1 Solomon Peachy 2012-04-01 15:50:02 UTC
Created attachment 574347 [details]
kernel log of successful 2.6.40.6 bootup.

Successful bootup with 2.6.40.6, included for reference.  I'm still trying to get a kernel boot log with the newer (failing) kernels.

Comment 2 Solomon Peachy 2012-04-01 15:51:22 UTC
Created attachment 574348 [details]
output of lspci -v

Comment 3 Solomon Peachy 2012-04-01 15:52:47 UTC
Created attachment 574350 [details]
dmidecode output

Comment 4 Solomon Peachy 2012-04-08 13:19:15 UTC
I grabbed kernel-2.6.43.1-2.fc15.x86_64.rpm out of Koji, and it also failed in the same way.  However, I was finally able to get a kernel log -- I had to wait until the kernel finished trying to probe every LUN on the 3Ware cards.

The basic problem is that, for whatever reason, the SCSI bus probes aren't succeeding.  Further investigation shows that the 3Ware cards' IRQ assignments are all wonky: instead of being routed through ACPI link LNEC to GSI 18, as in 2.6.40, they fall back to legacy ISA IRQ 14.

2.6.40 (good)
3ware 9000 Storage Controller device driver for Linux v2.26.02.014.
ACPI: PCI Interrupt Link [LNEC] enabled at IRQ 18
3w-9xxx 0000:03:00.0: PCI INT A -> Link[LNEC] -> GSI 18 (level, low) -> IRQ 18
scsi2 : 3ware 9000 Storage Controller
3w-9xxx: scsi2: Found a 3ware 9000 Storage Controller at 0xefdff000, IRQ: 18.
3w-9xxx: scsi2: Firmware FE9X 3.08.00.029, BIOS BE9X 3.10.00.003, Ports: 8.
3w-9xxx 0000:03:04.0: PCI INT A -> Link[LNEC] -> GSI 18 (level, low) -> IRQ 18
scsi 2:0:0:0: Direct-Access     AMCC     9550SXU-8L DISK  3.08 PQ: 0 ANSI: 5
scsi 2:0:1:0: Direct-Access     AMCC     9550SXU-8L DISK  3.08 PQ: 0 ANSI: 5
scsi7 : 3ware 9000 Storage Controller
3w-9xxx: scsi7: Found a 3ware 9000 Storage Controller at 0xefdfe000, IRQ: 18.
3w-9xxx: scsi7: Firmware FE9X 3.08.00.029, BIOS BE9X 3.10.00.003, Ports: 4.
scsi 7:0:0:0: Direct-Access     AMCC     9550SX-4LP DISK  3.08 PQ: 0 ANSI: 5

2.6.43 (and 2.6.41/2.6.42) (bad)
3ware 9000 Storage Controller device driver for Linux v2.26.02.014.
3w-9xxx 0000:03:00.0: PCI IRQ 0 -> rerouted to legacy IRQ 16
ACPI: Invalid index 16
3w-9xxx 0000:03:00.0: PCI INT A: no GSI - using ISA IRQ 14
scsi4 : 3ware 9000 Storage Controller
3w-9xxx: scsi4: Found a 3ware 9000 Storage Controller at 0xefdff000, IRQ: 14.
3w-9xxx: scsi4: Firmware FE9X 3.08.00.029, BIOS BE9X 3.10.00.003, Ports: 8.
3w-9xxx 0000:03:04.0: PCI IRQ 0 -> rerouted to legacy IRQ 16
ACPI: Invalid index 16
3w-9xxx 0000:03:04.0: PCI INT A: no GSI - using ISA IRQ 14
scsi8 : 3ware 9000 Storage Controller
3w-9xxx: scsi8: Found a 3ware 9000 Storage Controller at 0xefdfe000, IRQ: 14.
3w-9xxx: scsi8: Firmware FE9X 3.08.00.029, BIOS BE9X 3.10.00.003, Ports: 4.
scsi: waiting for bus probes to complete ...
scsi 4:0:0:0: WARNING: (0x06:0x002C): Command (0x12) timed out, resetting card.
scsi 8:0:0:0: WARNING: (0x06:0x002C): Command (0x12) timed out, resetting card.
scsi 4:0:0:0: WARNING: (0x06:0x002C): Command (0x0) timed out, resetting card.
scsi 8:0:0:0: WARNING: (0x06:0x002C): Command (0x0) timed out, resetting card.
scsi 4:0:0:0: Device offlined - not ready after error recovery
scsi 8:0:0:0: Device offlined - not ready after error recovery
[repeat above six lines fifteen more times, once for each LUN]
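The difference between the two excerpts can also be checked mechanically. The following is a minimal Python sketch, not part of the original report, that extracts each device's interrupt routing from dmesg text. The regular expressions are assumptions written only against the message formats quoted above (other kernel versions may word them differently), and the function name irq_routing is hypothetical.

```python
import re

# Optional "[   12.345678] " printk timestamp prefix.
TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s*")
PCI_DEV = r"(?P<dev>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])"

# Good path (2.6.40), e.g. "PCI INT A -> Link[LNEC] -> GSI 18 (level, low) -> IRQ 18"
ACPI_ROUTE = re.compile(
    r"^\S+ " + PCI_DEV + r": PCI INT [A-D] -> Link\[\w+\] -> GSI \d+ .*-> IRQ (?P<irq>\d+)$"
)
# Bad path (2.6.41+), e.g. "PCI INT A: no GSI - using ISA IRQ 14"
ISA_FALLBACK = re.compile(
    r"^\S+ " + PCI_DEV + r": PCI INT [A-D]: no GSI - using ISA IRQ (?P<irq>\d+)$"
)


def irq_routing(dmesg_text):
    """Return {PCI address: (route kind, IRQ)} for every routing line found."""
    routes = {}
    for raw in dmesg_text.splitlines():
        line = TIMESTAMP.sub("", raw.strip())
        m = ACPI_ROUTE.match(line)
        if m:
            routes[m["dev"]] = ("acpi-gsi", int(m["irq"]))
            continue
        m = ISA_FALLBACK.match(line)
        if m:
            routes[m["dev"]] = ("isa-fallback", int(m["irq"]))
    return routes


# Excerpts copied from the 2.6.40 and 2.6.43 logs above.
GOOD_LOG = """\
3w-9xxx 0000:03:00.0: PCI INT A -> Link[LNEC] -> GSI 18 (level, low) -> IRQ 18
3w-9xxx 0000:03:04.0: PCI INT A -> Link[LNEC] -> GSI 18 (level, low) -> IRQ 18
"""
BAD_LOG = """\
3w-9xxx 0000:03:00.0: PCI IRQ 0 -> rerouted to legacy IRQ 16
ACPI: Invalid index 16
3w-9xxx 0000:03:00.0: PCI INT A: no GSI - using ISA IRQ 14
3w-9xxx 0000:03:04.0: PCI IRQ 0 -> rerouted to legacy IRQ 16
ACPI: Invalid index 16
3w-9xxx 0000:03:04.0: PCI INT A: no GSI - using ISA IRQ 14
"""

if __name__ == "__main__":
    print("2.6.40:", irq_routing(GOOD_LOG))
    print("2.6.43:", irq_routing(BAD_LOG))
```

Run against a full dmesg capture, a result containing "isa-fallback" entries for the 3w-9xxx addresses would indicate the same failed ACPI lookup seen in the bad log.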

Comment 5 Solomon Peachy 2012-04-08 13:20:01 UTC
Created attachment 576032 [details]
dmesg log of failed 2.6.43.1-2 bootup

Comment 6 Stefan Assmann 2012-04-27 11:09:12 UTC
Created attachment 580730 [details]
PRT.patch

Solomon,
here's a patch that should apply on top of a 3.3 kernel. Please provide the dmesg output of a boot with this patch applied. Thanks!

Comment 7 Josh Boyer 2012-07-11 17:51:27 UTC
Fedora 15 has reached its end of life as of June 26, 2012.  As a result, we will not be fixing any remaining bugs found in Fedora 15.

In the event that you have upgraded to a newer release and the bug you reported is still present, please reopen the bug and set the version field to the newest release you have encountered the issue with.  Before doing so, please ensure you are testing the latest kernel update in that release and attach any new and relevant information you may have gathered.

Thank you for taking the time to file a report.  We hope newer versions of Fedora suit your needs.

Comment 8 Kevin Hobbs 2012-09-04 17:04:21 UTC
This problem occurs for me while booting the Fedora 17 netinstall disk.

In order to observe the problem I have to append modprobe.blacklist=3w_9xxx to the kernel command line.

Then, when I modprobe the 3w_9xxx module, the tail of dmesg is:

[ 1036.768034] scsi 4:0:11:0: WARNING: (0x06:0x002C): Command (0x12) timed out, resetting card.
[ 1042.934020] 3w-9xxx: scsi4: AEN: INFO (0x04:0x0029): Verify started:unit=0.
[ 1063.139023] scsi 4:0:11:0: WARNING: (0x06:0x002C): Command (0x0) timed out, resetting card.
[ 1084.304031] scsi 4:0:11:0: Device offlined - not ready after error recovery
[ 1105.760032] scsi 4:0:12:0: WARNING: (0x06:0x002C): Command (0x12) timed out, resetting card.
[ 1116.924023] 3w-9xxx: scsi4: AEN: INFO (0x04:0x0029): Verify started:unit=0.
[ 1137.129013] scsi 4:0:12:0: WARNING: (0x06:0x002C): Command (0x0) timed out, resetting card.
[ 1158.243050] scsi 4:0:12:0: Device offlined - not ready after error recovery
[ 1179.744049] scsi 4:0:13:0: WARNING: (0x06:0x002C): Command (0x12) timed out, resetting card.
[ 1190.959020] 3w-9xxx: scsi4: AEN: INFO (0x04:0x0029): Verify started:unit=0.
[ 1211.164023] scsi 4:0:13:0: WARNING: (0x06:0x002C): Command (0x0) timed out, resetting card.
[ 1227.329039] scsi 4:0:13:0: Device offlined - not ready after error recovery
[ 1248.736029] scsi 4:0:14:0: WARNING: (0x06:0x002C): Command (0x12) timed out, resetting card.
[ 1254.902019] 3w-9xxx: scsi4: AEN: INFO (0x04:0x0029): Verify started:unit=0.
[ 1275.107060] scsi 4:0:14:0: WARNING: (0x06:0x002C): Command (0x0) timed out, resetting card.
[ 1296.272027] scsi 4:0:14:0: Device offlined - not ready after error recovery
[ 1317.728037] scsi 4:0:15:0: WARNING: (0x06:0x002C): Command (0x12) timed out, resetting card.
[ 1328.841022] 3w-9xxx: scsi4: AEN: INFO (0x04:0x0029): Verify started:unit=0.
[ 1349.046058] scsi 4:0:15:0: WARNING: (0x06:0x002C): Command (0x0) timed out, resetting card.
[ 1370.160016] scsi 4:0:15:0: Device offlined - not ready after error recovery
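The timestamps above show how slow the failed probe is. As a minimal Python sketch (not from the original thread), the interval between the first timeout of each LUN can be computed directly from the log text. The line format is assumed from the excerpt above, and the helper names first_timeout_per_lun and probe_intervals are hypothetical.

```python
import re

# Matches lines such as
# "[ 1036.768034] scsi 4:0:11:0: WARNING: (0x06:0x002C): Command (0x12) timed out, ..."
# (format assumed from the excerpt above; "3w-9xxx: scsi4: AEN" lines do not match).
SCSI_LINE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+scsi (?P<lun>\d+:\d+:\d+:\d+): (?P<msg>.*)$"
)


def first_timeout_per_lun(dmesg_text):
    """Return {host:channel:id:lun: timestamp of its first timeout}, in log order."""
    firsts = {}
    for raw in dmesg_text.splitlines():
        m = SCSI_LINE.match(raw.strip())
        if m and "timed out" in m["msg"] and m["lun"] not in firsts:
            firsts[m["lun"]] = float(m["ts"])
    return firsts


def probe_intervals(dmesg_text):
    """Seconds elapsed between the first timeouts of consecutive LUNs."""
    times = list(first_timeout_per_lun(dmesg_text).values())
    return [round(later - earlier, 2) for earlier, later in zip(times, times[1:])]


# Excerpt of the log quoted above (AEN and offline lines are ignored).
PROBE_LOG = """\
[ 1036.768034] scsi 4:0:11:0: WARNING: (0x06:0x002C): Command (0x12) timed out, resetting card.
[ 1042.934020] 3w-9xxx: scsi4: AEN: INFO (0x04:0x0029): Verify started:unit=0.
[ 1063.139023] scsi 4:0:11:0: WARNING: (0x06:0x002C): Command (0x0) timed out, resetting card.
[ 1084.304031] scsi 4:0:11:0: Device offlined - not ready after error recovery
[ 1105.760032] scsi 4:0:12:0: WARNING: (0x06:0x002C): Command (0x12) timed out, resetting card.
[ 1179.744049] scsi 4:0:13:0: WARNING: (0x06:0x002C): Command (0x12) timed out, resetting card.
[ 1248.736029] scsi 4:0:14:0: WARNING: (0x06:0x002C): Command (0x12) timed out, resetting card.
[ 1317.728037] scsi 4:0:15:0: WARNING: (0x06:0x002C): Command (0x12) timed out, resetting card.
"""

if __name__ == "__main__":
    print(probe_intervals(PROBE_LOG))
```

On this excerpt the intervals come out at roughly 69 s, with one gap of about 74 s. If a similar per-LUN cost applied to the sixteen probes per controller in Comment 4, the probe phase alone would take on the order of 16 × 69 s ≈ 18 minutes, which fits having to wait for every LUN before a log could be captured.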