Bug 693542 - bnx2 / BCM5716 on PowerEdge R210 (certified hw) crashes (works on RHEL5.5+)
Summary: bnx2 / BCM5716 on PowerEdge R210 (certified hw) crashes (works on RHEL5.5+)
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 14
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Neil Horman
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks: 714322
Reported: 2011-04-04 22:20 UTC by François Cami
Modified: 2012-05-02 13:49 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of: 693529
Environment:
PowerEdge R210
Last Closed: 2012-05-02 13:49:23 UTC


Attachments
dmesg capture of F15 on IBM x3550 M3 (76.57 KB, text/plain)
2011-07-01 13:37 UTC, Zing
testing patch fix for nmi (42.55 KB, text/plain)
2011-07-01 20:15 UTC, Zing
F15 unknown nmi error capture (77.22 KB, text/plain)
2011-07-05 14:25 UTC, Zing
unknown_nmi_panic on cmdline, no call trace generated though (48.86 KB, text/plain)
2011-07-05 18:49 UTC, Zing
capture of install dvd hang with unknown_nmi_panic=0 (81.09 KB, text/plain)
2011-08-04 16:29 UTC, Zing
3.0.0-4.fc15 (136.54 KB, text/plain)
2011-08-22 15:54 UTC, Zing


Links
System ID Priority Status Summary Last Updated
Red Hat Bugzilla 648005 None None None Never
Red Hat Bugzilla 693529 None None None Never

Internal Links: 648005 693529

Description François Cami 2011-04-04 22:20:47 UTC
+++ This bug was initially created as a clone of Bug #693529 +++

Description of problem:
As soon as anaconda loads the bnx2 kernel module, the server crashes. Note that this is a remote machine that I can reach only through remote management (vKVM).

I tried linux acpi=off noapic as per redhat kb to no avail.
I then added ignore_loglevel and saw that the last driver loaded before the server becomes unresponsive is bnx2.
I then added noprobe so anaconda prompted me to load drivers. I selected bnx2 and the system became unresponsive.
I then tried bnx2=disable_msi=1 but it had no effect.

Version-Release number of selected component (if applicable):
Fedora 14 GA kernel-2.6.35.6-45.fc14.x86_64.rpm

How reproducible:
Always

Steps to Reproduce:
1. Get a PowerEdge R210 with a BCM5716 NIC (or two ? I have two)
2. Boot the RHEL 6.0 DVD (alt. 6.1 Beta, Fedora 14...)
  
Actual results:
Anaconda stops at the "Waiting for hardware to initialize..." stage

Expected results:
hw initialized and anaconda going to stage 2?

Additional info:

RHEL5.5+ works perfectly.

This is from RHEL5.5 on the same hw:

02:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5716 Gigabit Ethernet (rev 20)
	Subsystem: Dell Unknown device 02a5
	Flags: bus master, fast devsel, latency 0, IRQ 169
	Memory at da000000 (64-bit, non-prefetchable) [size=32M]
	Capabilities: [48] Power Management version 3
	Capabilities: [50] Vital Product Data
	Capabilities: [58] Message Signalled Interrupts: 64bit+ Queue=0/4 Enable-
	Capabilities: [a0] MSI-X: Enable+ Mask- TabSize=9
	Capabilities: [ac] Express Endpoint IRQ 0
	Capabilities: [100] Device Serial Number 53-30-d0-fe-ff-5b-30-bc
	Capabilities: [110] Advanced Error Reporting
	Capabilities: [150] Power Budgeting
	Capabilities: [160] Virtual Channel

02:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5716 Gigabit Ethernet (rev 20)
	Subsystem: Dell Unknown device 02a5
	Flags: bus master, fast devsel, latency 0, IRQ 233
	Memory at dc000000 (64-bit, non-prefetchable) [size=32M]
	Capabilities: [48] Power Management version 3
	Capabilities: [50] Vital Product Data
	Capabilities: [58] Message Signalled Interrupts: 64bit+ Queue=0/4 Enable-
	Capabilities: [a0] MSI-X: Enable- Mask- TabSize=9
	Capabilities: [ac] Express Endpoint IRQ 0
	Capabilities: [100] Device Serial Number 54-30-d0-fe-ff-5b-30-bc
	Capabilities: [110] Advanced Error Reporting
	Capabilities: [150] Power Budgeting
	Capabilities: [160] Virtual Channel

I can extract data from RHEL5 and possibly try anaconda images you send my way.

Comment 1 François Cami 2011-04-04 22:24:20 UTC
F14 GA DVD doesn't work (and RHEL 6.0 doesn't either, for that matter).

Comment 2 Neil Horman 2011-04-05 14:30:21 UTC
I'm reserving a PowerEdge 210 here with a bnx2 card, but I'm sure I've seen it install F14 just fine before. Does your KVM have a virtual serial port we can use? If you can add console=ttyS0,<speed>n8 to the kernel install command line, where <speed> is something appropriate for your KVM, that should catch the oops, which you can post here to help us debug this.
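
The console= suggestion above could be assembled like this. It is only a minimal sketch: the helper name serial_console_args and the 115200 baud rate are illustrative assumptions, not values from this thread. console=tty0 is listed as well so that messages keep appearing on the local screen, and ignore_loglevel is the option already used by the reporter.

```shell
# Build the extra kernel arguments for a serial console capture.
# $1 = baud rate (assumed value; use whatever the serial port is set to)
# $2 = serial device name, defaults to ttyS0
serial_console_args() {
    speed=$1
    port=${2:-ttyS0}
    # Kernel messages go to every console= listed on the command line.
    printf 'console=tty0 console=%s,%sn8 ignore_loglevel\n' "$port" "$speed"
}

# Example: arguments to append at the installer boot prompt.
serial_console_args 115200
```

The printed string is meant to be appended to the installer's kernel command line by hand; nothing here touches the boot loader itself.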

Comment 3 François Cami 2011-04-05 20:54:17 UTC
Hi Neil,
The vKVM is the one integrated into a Dell iDRAC6, so no virtual serial port.
I will boot anaconda using ignore_loglevel since there is more output to the console, but I never saw the oops itself.
Is there anything else I can do (netconsole being for obvious reasons out of the picture)? I suppose dmesg and such from RHEL5 won't help...
Thank you

Comment 4 Neil Horman 2011-04-06 10:51:24 UTC
Unfortunately not. I could really use the oops here. I have a few ideas about what may be wrong, but without the oops I'm just guessing. I'll try to get hold of our PowerEdge 210 today to see if I can re-create it, but if you would please continue to try to figure out what's going on here, that would be great. You might try doing a VNC install so as to not require multiple virtual terminals in anaconda on the console, which would obscure your stack trace.

Comment 5 François Cami 2011-04-06 11:00:18 UTC
It's not even reaching stage2 (it crashes when the "Waiting for hardware to initialize..." message is displayed) so the vnc install seems impossible (it comes much later).
I may have a solution to capture the console output but it will take a few days.
I'll keep in touch.

Comment 6 Maxime Thépault 2011-06-06 07:24:56 UTC
Hi, I think we have the same problem on the same hardware:

https://bugzilla.redhat.com/show_bug.cgi?id=710602

I have given screenshots ;)

Comment 7 Neil Horman 2011-06-06 14:10:00 UTC
Can you do 2 things please?

1) Check the bios revision on your system.  The bios I have here is v1.1.4

2) Boot the installer with pci=nobios on the command line. The last line prior to the hang issues a PCI write to the device via the BIOS, and I'd like to ensure that nothing is wrong with the system BIOS handling the write to this device.

Comment 8 Maxime Thépault 2011-06-06 14:52:07 UTC
Thank you for your response. I replied in the new bug report:

https://bugzilla.redhat.com/show_bug.cgi?id=710602

Comment 9 François Cami 2011-06-09 16:37:24 UTC
Sorry for the late reply. The machine is in production using RHEL 5.x and I cannot take it out for testing now. I'll do it ASAP, but that means getting a new machine and this won't happen soon.

Comment 10 Neil Horman 2011-06-09 17:35:50 UTC
OK, François, let me know when you get to it.

Comment 11 Maxime Thépault 2011-06-09 17:52:38 UTC
François: do you perhaps have a dedicated server in France, with Online / Dedibox?

If so: I had a server with them, with business support, and it's a trap... there is no real support, and the Dell iDRAC IP KVM is very unstable with virtual media (ISO).

Comment 12 François Cami 2011-06-09 21:49:29 UTC
My experience with their business support is limited but fine, and the iDrac IP KVM works well if you have enough upload bandwidth.
But yes, this is exactly where the R210 is hosted.

Comment 13 Maxime Thépault 2011-06-10 05:09:50 UTC
I had problems with their iDrac IP and with many DSL connections... and support is absent.

So, bye bye Online, welcome OVH.

I prefer to warn you ;)

Comment 14 Raphaël Gertz 2011-06-21 16:14:15 UTC
Well, I did a lot of debugging and came to the following conclusion:
- the problem is caused by the Intel Xeon L3426 CPU
- it occurs during the init of the bnx2 module


lshw -C CPU output :
  *-cpu                   
       description: CPU
       product: Intel(R) Xeon(R) CPU           L3426  @ 1.87GHz
       vendor: Intel Corp.
       physical id: 400
       bus info: cpu@0
       version: Intel(R) Xeon(R) CPU           L3426  @ 1.87GHz
       slot: CPU1
       size: 1866MHz
       capacity: 3600MHz
       width: 64 bits
       clock: 4266MHz
       capabilities: x86-64 fpu fpu_exception wp vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 popcnt lahf_lm ida tpr_shadow vnmi flexpriority ept vpid
       configuration: cores=4 enabledcores=4 threads=8

I installed Fedora 15 manually (chroot install + GRUB re-setup).

I blacklisted the bnx2 module.

I started the server, and when I run modprobe bnx2 the server freezes and hangs.

The problem is not reproduced with an Intel Core i3 or an Intel Xeon X3450, so I suspect a CPU bug/issue.

Could you try with this CPU?
(Otherwise I can grant you access to a test server to debug where it comes from.)

Best regards.

In general, stop whining about absent support; you rent a low-price server, after all...
And I am part of the support team, and I am here to help fix this issue...

Comment 15 François Cami 2011-06-22 14:41:31 UTC
Neil, can you see with Raphaël how to get more information if needed? I cannot provide you with more data right now, and Raphaël has the exact hardware to test things on.

Comment 16 Raphaël Gertz 2011-06-27 12:42:19 UTC
I tried to update the microcode with the one available at:
http://downloadcenter.intel.com/Detail_Desc.aspx?agr=Y&ProdId=2680&DwnldID=20050

But it still fails with a hang on bnx2 module load.

Comment 17 Zing 2011-06-30 20:31:40 UTC
I believe I have the same problem with an IBM x3550 M3 (model 7944).

Fedora 15 install hangs at "Waiting for hardware to initialize...".  The PCI/RAID/NMI light path LEDs light up and the server reboots.

I will usually also see: "Uhhhhh nmi received for unknown reason 2d on cpu0" 

Using ignore_loglevel, I see the megasas module and the bnx2 module are loaded here.

If I tell anaconda to blacklist bnx2, the installer passes this point.  This is as far as I've gotten, as I need the network to continue.  Any questions let me know.

As an aside, I had installed and was running Fedora 14 on this machine successfully.

Comment 18 Zing 2011-07-01 13:37:01 UTC
Created attachment 510862 [details]
dmesg capture of F15 on IBM x3550 M3

Added dmesg capture of F15 install hang on ibm x3550 M3

Comment 19 Zing 2011-07-01 13:53:30 UTC
pci=nobios does not help, immediate hang when bnx2 module is loaded during the "Waiting for hardware init...".

Bios version: UEFI 1.11 BuildID D6E150C

My cpu:
  *-cpu:0
       description: CPU
       product: Intel(R) Xeon(R) CPU           X5677  @ 3.47GHz
       vendor: Intel Corp.
       physical id: 1
       bus info: cpu@0
       version: Intel(R) Xeon(R) CPU           X5677  @ 3.47GHz
       slot: Node 1 Socket 1
       size: 3470MHz
       width: 64 bits
       clock: 1571MHz
       capabilities: x86-64 fpu fpu_exception wp vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt aes lahf_lm arat tpr_shadow vnmi flexpriority ept vpid
       configuration: cores=4 enabledcores=4 threads=8
  *-cpu:1 DISABLED
       description: CPU [empty]
       physical id: 55
       slot: Node 1 Socket 2

Comment 20 Neil Horman 2011-07-01 15:18:46 UTC
Thank you, that's helpful. The NMI error makes me think this isn't a bnx2 issue at all, but rather a perf NMI gone bad:
https://patchwork.kernel.org/patch/566721/


I've backported that fix to f14 (along with some supporting infrastructure).  The build is here:
http://koji.fedoraproject.org/koji/taskinfo?taskID=3174116

If you could, please try this kernel and see if it fixes the problem. You can either rebuild the installer initramfs, or you can install using a DVD (blacklisting the NIC), then update to this kernel and unblacklist bnx2 to see if the issue stops recurring.
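
On an installed system, the blacklist / unblacklist step could look like the following. It is a minimal sketch under assumptions: the file name blacklist-<module>.conf and both helper names are made up for illustration, and the directory is a parameter so the helpers can be tried against a scratch directory before touching /etc/modprobe.d. Note that a blacklist entry only stops automatic loading; an explicit modprobe bnx2 still loads the driver, which is what makes the manual test possible.

```shell
# Write a modprobe.d blacklist entry for a module.
# $1 = module name, $2 = modprobe.d directory (defaults to /etc/modprobe.d)
blacklist_module() {
    dir=${2:-/etc/modprobe.d}
    # Hypothetical file name; any *.conf file in the directory is read.
    printf 'blacklist %s\n' "$1" > "$dir/blacklist-$1.conf"
}

# Remove the entry again so the module autoloads on the next boot.
unblacklist_module() {
    dir=${2:-/etc/modprobe.d}
    rm -f "$dir/blacklist-$1.conf"
}

# Example against a scratch directory instead of the real /etc/modprobe.d.
scratch=$(mktemp -d)
blacklist_module bnx2 "$scratch"
cat "$scratch/blacklist-bnx2.conf"
unblacklist_module bnx2 "$scratch"
rmdir "$scratch"
```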

Comment 21 Zing 2011-07-01 18:28:46 UTC
Sorry, the f14 kernel didn't work.  I got the hang/nmi when I modprobed the bnx2 module:

[  142.494991] bnx2: Broadcom NetXtreme II Gigabit Ethernet Driver bnx2 v2.0.15 (May 4, 20)
[  142.541360] bnx2 0000:0b:00.0: PCI INT A -> GSI 28 (level, low) -> IRQ 28
[  142.595412] bnx2 0000:0b:00.0: eth1: Broadcom NetXtreme II BCM5709 1000Base-T (C0) PCI 8
[  142.660347] bnx2 0000:0b:00.1: PCI INT B -> GSI 40 (level, low) -> IRQ 40
[  142.707743] bnx2 0000:0b:00.1: eth2: Broadcom NetXtreme II BCM5709 1000Base-T (C0) PCI a
[  142.774584] bnx2 0000:10:00.0: PCI INT A -> GSI 29 (level, low) -> IRQ 29
[  142.823731] bnx2 0000:10:00.0: eth3: Broadcom NetXtreme II BCM5709 1000Base-T (C0) PCI c
[  143.417963] Uhhuh. NMI received for unknown reason 2d on CPU 0.
[  143.417965] Do you have a strange power saving mode enabled?
[  143.417965] Dazed and confused, but trying to continue

Weird, got that same NMI, maybe the patch is missing?

Comment 22 Neil Horman 2011-07-01 18:59:15 UTC
I don't think so, the build log indicated that it was applied:

http://koji.fedoraproject.org/koji/getfile?taskID=3174117&name=build.log
----------------------
+ case "$patch" in
+ patch -p1 -F1 -s
+ ApplyPatch linux-2.6-perf-overflow-handling.patch
+ local patch=linux-2.6-perf-overflow-handling.patch
+ shift
+ '[' '!' -f /builddir/build/SOURCES/linux-2.6-perf-overflow-handling.patch ']'
Patch14000: linux-2.6-perf-overflow-handling.patch
-----------------------

Are you sure you booted the right kernel?  I don't see any log messages above that confirm it's the test kernel that was running.

Comment 23 Zing 2011-07-01 20:15:05 UTC
Created attachment 510932 [details]
testing patch fix for nmi

Just tried the nmi patched kernel twice to make sure I got the right kernel and I still got the hang both times.  Here's the full serial console capture of my last run.

Comment 24 Neil Horman 2011-07-01 20:32:14 UTC
Hmm, well, OK, I'm really not sure what's going on then.  Something else must be causing the unknown NMI code on your system, although I couldn't for the life of me imagine what.  Can you boot with unknown_nmi_panic=0 on the command line to avoid the panic and provide the log of that boot?  That may give us some further idea of what's causing the NMI code.

Comment 25 Zing 2011-07-05 14:25:49 UTC
Created attachment 511324 [details]
F15 unknown nmi error capture

Attached is the dmesg with the NMI call trace.

This one has:
Uhhuh. NMI received for unknown reason 3d on CPU 0.

I've noticed "unknown reason" being 2d or 3d when and if the kernel gets a chance to output anything to the console.

Comment 26 Neil Horman 2011-07-05 16:06:44 UTC
You didn't boot with unknown_nmi_panic=0.

Comment 27 Zing 2011-07-05 18:49:09 UTC
Created attachment 511365 [details]
unknown_nmi_panic on cmdline, no call trace generated though

This is a capture with unknown_nmi_panic=0 passed at boot, but I never got a call trace this way... Then I noticed /proc/sys/kernel/unknown_nmi_panic was still being set to 1 (something was resetting it to 1, or is it a bug? Anyway...).  The capture before this one is me echoing 0 to the unknown_nmi_panic proc control manually and then modprobing bnx2.  Sorry for the confusion.

Comment 28 Don Zickus 2011-07-11 12:57:20 UTC
If you can get the unknown_nmi_panic=0 working (run a 'grep -r unknown_nmi_panic /etc/*' to see if the system is setting it) and can get to a login prompt, then run 'lspci -vvv' and attach the output.

I might be able to figure out which device is sending the NMI from that output.

Don't worry about the 2d or 3d: of the 8 bits, only one is useful/meaningful. A couple of the others just naturally flip back and forth, hence either 2d or 3d.

Cheers,
Don
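
The point about 2d versus 3d can be checked with a little shell arithmetic. This is a sketch, with the helper names nmi_bits and nmi_diff invented for illustration. The bit meanings in the comments follow the conventional PC layout of the port 0x61 status byte, which is an assumption on my part rather than something stated in this thread.

```shell
# Print the bit masks that are set in a reason byte, highest bit first.
nmi_bits() {
    value=$(( $1 ))
    out=""
    for mask in 128 64 32 16 8 4 2 1; do
        if [ $(( value & mask )) -ne 0 ]; then
            out="$out$(printf '0x%02x' "$mask") "
        fi
    done
    printf '%s\n' "${out% }"
}

# Print the bits that differ between two reason bytes.
nmi_diff() {
    printf '0x%02x\n' $(( $1 ^ $2 ))
}

nmi_bits 0x2d       # -> 0x20 0x08 0x04 0x01
nmi_bits 0x3d       # -> 0x20 0x10 0x08 0x04 0x01
nmi_diff 0x2d 0x3d  # -> 0x10
# Assumed layout: 0x80 and 0x40 are the error-source bits (both clear here,
# hence "unknown reason"); 0x10 is a bit that toggles on its own.
```

The two values differ in a single bit, which is consistent with the observation that both codes describe the same event.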

Comment 29 Raphaël Gertz 2011-07-27 11:39:57 UTC
I tried booting with unknown_nmi_panic=0 and it still crashes completely while loading the bnx2 module.

It does this with both the Fedora 15 kernel and the CentOS 6.0 kernel (I tested that as well, just to see).

Comment 30 Neil Horman 2011-07-27 16:14:07 UTC
Zing, so what is that trace showing?  It seems to me that if you disable the nmi panic, nothing is crashing (at least your trace doesn't show an oops).  Is something else happening that makes the system unstable (a hang or some such)?

Comment 31 Zing 2011-08-04 16:29:53 UTC
Created attachment 516746 [details]
capture of install dvd hang with unknown_nmi_panic=0

Sorry, it took me a while to get back to this.  Surprisingly, I wasn't able to hang the server as easily with 2.6.38.8-32.fc15.x86_64 as some weeks back.  I was able to modprobe bnx2 many times and it worked X/.  It will still hang, and if it does, it's immediately after modprobing bnx2; otherwise everything seems OK and we continue.  Not good.  I was changing BIOS settings around back then, along with changing to a legacy boot setting.  I'm wondering if that matters, and whether EFI booting makes a difference.

So I went back to the F15 install dvd again and that hangs it reliably so far.

I passed unknown_nmi_panic=0 to the install dvd and I attached the console capture...

As soon as bnx2 is modprobe'd, the kernel console log output slowed down a lot, to about a few characters a second.  There is a call trace in the logs.

At the point the log ends, the server automatically forced a hard shutdown.

I can attach an lspci, but it seemed like you needed the output from the kernel that hung at the time.

Comment 32 Zing 2011-08-04 17:01:23 UTC
I've successfully booted to the gui in the installer dvd twice using biosdevname=0 now.  That's never happened before in my tests.  Raphaël does that work for you on your hardware?

Comment 33 Zing 2011-08-04 20:53:26 UTC
I've also found that nosmp allows the F15 install dvd to continue past the modprobe'ing of bnx2.

Comment 34 Raphaël Gertz 2011-08-08 14:40:22 UTC
It still hangs with 2.6.32-71.29.1.el6.x86_64 with biosdevname=0...

And just a note: the 2.6.32 boots perfectly fine on the box.

I tried rebuilding several tag sets from the kernel src.rpm, but it seems to fail from the first to the last tag set, for various reasons:
- USB/TPM detection fails
- video init fails

Comment 35 Raphaël Gertz 2011-08-08 15:03:23 UTC
I have no idea how to fix this. If needed, I can grant you access to the box...

Comment 36 Raphaël Gertz 2011-08-10 09:42:35 UTC
I tried booting the 2.6.32-131.6.1.el6.x86_64 with all the previous options.

The log gives me this after loading bnx2:
# modprobe bnx2
bnx2: Broadcom NetXtreme II Gigabit Ethernet Driver bnx2 v2.1.6 (Mar 7, 2011)
bnx2 0000:02:0.0: found PCI INT A -> IRQ 15
bnx2 0000:02:0.0: sharing IRQ 15 with 0000:00:03.0
bnx2 0000:02:0.0: sharing IRQ 15 with 0000:01:00.0
_

And that's all

To me it seems that one of the Red Hat patches is triggering the IRQ conflict.

Comment 37 Raphaël Gertz 2011-08-10 10:08:07 UTC
It seems that the H200 RAID card on the server is sharing the IRQ:
mpt2sas version 08.101.00.00 loaded
scsi0 : Fusion MPT SAS Host
mpt2sas 0000:01:00.0: found PCI INT A -> IRQ 15
mpt2sas 0000:01:00.0: sharing IRQ 15 with 0000:00:03.0
mpt2sas 0000:01:00.0: sharing IRQ 15 with 0000:02:00.0
mpt2sas0: 64 BIT PCI BUS DMA ADDRESSING SUPPORTED, total mem (16459204 kB)
mpt2sas0: IO-APIC enabled: IRQ 15
mpt2sas0: iomem (0x00000000df2b), mapped(0xffffc90002de0000), size(65536)
mpt2sas0: ioport(0x000000000000fc00), size(256)
mpt2sas0: sending diag reset !!
[...usb init...]
mpt2sas0: diag reset: SUCCESS
mpt2sas0: Allocated physical memory: size(7444 kB)
mpt2sas0: Current Controller Queue Depth(3306), Max Controller Queue Depth(3439)
mpt2sas0: Scatter Gather Elements per IO(128)

The devices are (from lspci, before loading the bnx2 module):
00:03.0 PCI bridge: Intel Corporation Core Processor PCI Express Root Port 1 (rev 11)
02:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5716 Gigabit Ethernet (rev 20)

It seems that mpt2sas doesn't honor the noapic boot flag:
vga=0x31b acpi=off noacpi noapic nolapic pci=nobios unknown_nmi_panic=0 nosmp biosdevname=0

When booting without all the options, it seems that IRQ 16 is used instead of IRQ 15, and it freezes just after:
bnx2: Broadcom NetXtreme II Gigabit Ethernet Driver bnx2 v2.1.6 (Mar 7, 2011)
bnx2 0000:02:0.0: found PCI INT A -> GSI 16 (level, low) -> IRQ 16
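
The "sharing IRQ" lines quoted in this comment and the previous one can be summarized mechanically. The sketch below is an illustration only (the function name shared_irqs is made up): it reads kernel log text on stdin and lists the devices that report sharing each IRQ. The sample input is copied from the bnx2 lines above, including the 0000:02:0.0 spelling as it was posted.

```shell
# Summarize which PCI devices share an IRQ, from kernel log text on stdin.
shared_irqs() {
    awk '
        function add(irq, dev) {
            # Remember each device once per IRQ, in first-seen order.
            if (!((irq, dev) in seen)) {
                seen[irq, dev] = 1
                devices[irq] = devices[irq] " " dev
            }
        }
        {
            # Match "<dev>: sharing IRQ <n> with <peer>" anywhere on the line,
            # so an optional dmesg timestamp prefix does not matter.
            for (i = 2; i + 4 <= NF; i++) {
                if ($i == "sharing" && $(i + 1) == "IRQ" && $(i + 3) == "with") {
                    dev = $(i - 1)
                    sub(/:$/, "", dev)
                    add($(i + 2), dev)
                    add($(i + 2), $(i + 4))
                }
            }
        }
        END {
            for (irq in devices) print "IRQ " irq ":" devices[irq]
        }
    '
}

shared_irqs <<'EOF'
bnx2 0000:02:0.0: sharing IRQ 15 with 0000:00:03.0
bnx2 0000:02:0.0: sharing IRQ 15 with 0000:01:00.0
EOF
# -> IRQ 15: 0000:02:0.0 0000:00:03.0 0000:01:00.0
```

On a live system the same function could be fed from dmesg, assuming the kernel prints these lines at all; they appear here only with the legacy interrupt options in effect.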

Comment 38 Zing 2011-08-22 15:54:16 UTC
Created attachment 519309 [details]
3.0.0-4.fc15

This capture is a recompiled 3.0.0-4.fc15 with some debug options enabled.  This one has call traces on each of the cpus.

CPU 2 is in ext4 and maybe the soft lockup is somewhere in this chain of code and the pci vpd reads?

Comment 39 Neil Horman 2011-08-29 14:41:24 UTC
Why are your devices not using MSI-X interrupts?

Comment 40 Raphaël Gertz 2011-08-29 16:04:03 UTC
What are MSI-X interrupts? How do I activate them? Isn't there a patch/config in the RH kernel that disables them?

Comment 41 Neil Horman 2011-10-20 12:56:08 UTC
No, MSI interrupts should be enabled by default.  The only reason you shouldn't be using them is if disable_msi was specified as a bnx2 module option at load time.
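
One way to see whether a NIC is actually using MSI or MSI-X is to read the Enable flags in the lspci -v capability lines, as in the RHEL 5.5 listing in the description. The sketch below is illustrative (the function name msi_state is made up) and works on lspci -v text from stdin; it accepts both the older "Message Signalled Interrupts:" wording and the shorter "MSI:" label that newer lspci versions print.

```shell
# Report MSI / MSI-X enable state per device from "lspci -v" text on stdin.
msi_state() {
    awk '
        function emit() {
            if (dev != "") print dev, "MSI=" msi, "MSI-X=" msix
        }
        # A device header starts in column 1 with a bus address such as 02:00.0.
        /^[0-9a-f:]+\.[0-7] / { emit(); dev = $1; msi = "?"; msix = "?"; next }
        /MSI-X:/ { msix = ($0 ~ /Enable\+/) ? "+" : "-"; next }
        # Older lspci spells the capability out; newer versions print "MSI:".
        /Message Signal+ed Interrupts:|MSI:/ { msi = ($0 ~ /Enable\+/) ? "+" : "-" }
        END { emit() }
    '
}

# Sample lines taken from the RHEL 5.5 output in the description.
msi_state <<'EOF'
02:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5716 Gigabit Ethernet (rev 20)
	Capabilities: [58] Message Signalled Interrupts: 64bit+ Queue=0/4 Enable-
	Capabilities: [a0] MSI-X: Enable+ Mask- TabSize=9
02:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5716 Gigabit Ethernet (rev 20)
	Capabilities: [58] Message Signalled Interrupts: 64bit+ Queue=0/4 Enable-
	Capabilities: [a0] MSI-X: Enable- Mask- TabSize=9
EOF
# -> 02:00.0 MSI=- MSI-X=+
# -> 02:00.1 MSI=- MSI-X=-
```

On a running system, lspci -v piped into msi_state gives the same summary, and grep bnx2 /proc/interrupts shows which interrupt type each queue was registered with. If the driver exposes its parameters in sysfs, /sys/module/bnx2/parameters/disable_msi should show whether the option was set, though I have not verified that path on these kernels.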

Comment 42 John Florian 2011-11-01 15:46:17 UTC
Hi all, I also have a Dell R210 II with dual gigabit NICs that is exhibiting problems very much like described here already.  I may have some new helpful clues due to a different hunch I was following until I found this BZ.  First of all, I had no problems with the F15 installer; my problems arrived rebooting into the fresh install.  My system locks hard whenever the 2nd bnx2 instance is being initialized.

However, I hadn't realized that the NICs were relevant initially.  For some reason I was initially suspicious of the PERC H200 6Gb/s HBA to which I have an Intel 320 series 40 GB SSD attached.  I was originally trying to install a custom spin of Fedora Live on the SSD and with that the system would start to boot then halt with "Cannot find root device.  Sleeping forever." or something very close to that effect.

What I'd found is that if I moved the SSD from the PERC to the mainboard's SATA port A -- before or after the Fedora install; makes no difference -- I could boot and operate just fine, including networking!  Only when the PERC was involved did I have problems.

Now that I've read through this BZ and see that interrupt handling may be suspect instead, I tried a few more tests with surprising results.  If I configure the BIOS such that "Integrated Devices"/"Embedded NIC1 and NIC2" is set to "Disabled (OS)" it will boot fine from the PERC, but of course I have no networking then.  Interestingly enough, I see in "PCI IRQ Assignment" that NIC1 and the SAS Controller both share IRQ 10 (as does USB EHCI Controller 2).  I would find this more compelling if it was NIC2 that shared with the SAS Controller since that's where my boot hangs.  Whenever I change the assignment for any one of those three, all three change together -- I cannot make the SAS and NIC1 different.  So I don't suspect fiddling with assignments is going to help.

I realize that IRQ sharing is possible these days, but am not well enough versed to know how that's accomplished or how "fragile" it might be.  Still, I'm hoping I've brought new light upon this problem.

Please let me know if there is anything I can do to help move this along.

Comment 43 John Florian 2011-11-01 15:54:28 UTC
Got my first successful boot into F15, with usable networking (on NIC1, at least) by changing all (even those seemingly unrelated) of the IRQ assignments to "default".  I'm not sure how "default" differs from the explicit defaults set by Dell or what new problems I may have created in consequence, but so far it seems an improvement.

Comment 44 John Florian 2011-11-01 18:23:26 UTC
(In reply to comment #43)
> Got my first successful boot into F15, with usable networking (on NIC1, at
> least) by changing all (even those seemingly unrelated) of the IRQ assignments
> to "default".

Harrumph!  I cannot repeat this now.  I must have well over a hundred boot tests now (what fun with ~5m just in the POST) and this must have been a freak occurrence.

> I'm not sure how "default" differs from the explicit defaults
> set by Dell or what new problems I may have created in consequence, but so far
> it seems an improvement.

The Dell Insyde BIOS holds the "default" value until the next boot, at which point the BIOS will hold the explicit values again.

In summary, no amount of fiddling with the IRQ assignments seems to help.  The only repeatable methods I've found are:
 * disable NIC1 and NIC2 (no option for disabling singularly)
 * bypass the PERC and attach SSD directly to mainboard's SATA port

Comment 45 Neil Horman 2011-12-20 19:24:23 UTC
None of this answers why legacy interrupts are being used on these systems.  They should support MSI interrupts and use those (which will not be shared).  Is there anything in the logs that indicates why MSI interrupts are disabled on these NICs?

Comment 46 Neil Horman 2012-01-20 14:43:46 UTC
Ping John, has there been any further reproducibility here?  Or should we close this?

Comment 47 John Florian 2012-01-20 16:05:23 UTC
(In reply to comment #46)
> Ping John, has there been any further reproducibility here?  Or should we close
> this?

Neil, first up my apologies for letting this slip.  Must have been the holidays because I completely lost track of this one.

Given the success we saw without the PERC, we went down the road of ordering the 40+ R210s from Dell without the PERCs, so in that sense it's no longer a problem for me.  We were originally told by Dell that we couldn't use an SSD without a PERC, but our experience showed just the opposite.  I doubt there's any clue there, but thought I'd mention it just in case.  (Also IIRC, Dell did say they had an SSD that would work without a PERC, but it was a big expensive monster and we really only need a few gig.)

I do still have the one R210 and the PERC so I could reinstall the PERC and try more tests if that would be helpful here.  Also, we're now targeting F16 instead of F15, so if you want more tests please indicate which one (or both) I should try, to get you the feedback.

Comment 48 nerd65536+redhat 2012-04-27 01:02:49 UTC
We had a similar problem on our Dell R210s that was fixed by upgrading the Broadcom card's firmware.

You can check the current firmware version using "lshw".

On firmware version 6.0.1, loading the bnx2 module would cause the server to hang.
Either blacklisting bnx2, or using the kernel option "nosmp" would work around the problem.

On firmware version 6.4.5, the system works correctly.

Dell's firmware update package: http://www.dell.com/support/drivers/us/en/04/DriverDetails/DriverFileFormats?DriverId=R319248
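
A quick way to compare an installed firmware version against the two data points reported in this comment is sketched below. The helper names version_lt and firmware_status are made up for illustration, the comparison relies on GNU sort -V, and the version string is assumed to be a bare dotted number (the string shown by lshw or by ethtool -i may carry extra text that has to be trimmed first). Only 6.0.1 (hangs) and 6.4.5 (works) are known from this thread; nothing is claimed about versions in between.

```shell
# True when version $1 sorts strictly before version $2 (needs GNU sort -V).
version_lt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Classify a firmware version against the 6.4.5 reported as working.
firmware_status() {
    if version_lt "$1" "6.4.5"; then
        echo "older than 6.4.5 (6.0.1 was reported to hang)"
    else
        echo "6.4.5 or newer (6.4.5 was reported to work)"
    fi
}

firmware_status 6.0.1
firmware_status 6.4.5
```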

Comment 49 Neil Horman 2012-05-02 13:49:23 UTC
Well, I think, given the fact that everyone with this problem seems to have resolved it with alternate hardware, that I'll leave testing up to you.  To solve this, I think comment 45 is still the first question that needs answering.  If anyone has the gumption to go and research that, I'll gladly take a look at it.  For now though, I'll close this bug.  Please re-open it if/when you get around to testing.

