Bug 693542
Summary: bnx2 / BCM5716 on PowerEdge R210 (certified hw) crashes (works on RHEL5.5+)
Product: [Fedora] Fedora
Component: kernel
Version: 14
Hardware: x86_64
OS: Linux
Status: CLOSED INSUFFICIENT_DATA
Severity: high
Priority: unspecified
Reporter: François Cami <fdc>
Assignee: Neil Horman <nhorman>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: dzickus, fcami, gansalmon, itamar, john, jonathan, kernel-maint, madhu.chinakonda, nerd65536+redhat, nhorman, redhat, sites, zing
Doc Type: Bug Fix
Clone Of: 693529
Environment: PowerEdge R210
Bug Blocks: 714322
Last Closed: 2012-05-02 13:49:23 UTC
Description
François Cami
2011-04-04 22:20:47 UTC
F14 GA DVD doesn't work (and RHEL 6.0 doesn't either, for that matter).

I'm reserving a poweredge 210 here with a bnx2 card, but I'm sure I've seen it install f14 before just fine. Does your kvm have a virtual serial port we can use on it? If you can add console=ttyS0,<speed>n8 to the kernel install command line, where speed is something appropriate for your kvm, that should catch the oops, which you can post here to help us debug this.

Hi Neil,
The vKVM is the one integrated to a Dell iDRAC6, so no virtual serial port. I will boot anaconda using ignore_loglevel since there is more output to the console, but I never saw the oops itself. Is there anything else I can do (netconsole being for obvious reasons out of the picture)? I suppose dmesg and such from RHEL5 won't help...
Thank you

Unfortunately not. I could really use the oops here. I have a few ideas about what may be wrong, but without the oops I'm just guessing. I'll try to get hold of our poweredge 210 today to see if I can re-create it, but if you would please continue to try to figure out what's going on here, that would be great. You might try doing a vnc install so as to not require multiple virtual terminals in anaconda on the console, which would obscure your stack trace.

It's not even reaching stage2 (it crashes when the "Waiting for hardware to initialize..." message is displayed) so the vnc install seems impossible (it comes much later). I may have a solution to capture the console output but it will take a few days. I'll keep in touch.

Hi, I think we have the same problem, and same hardware: https://bugzilla.redhat.com/show_bug.cgi?id=710602
I have given screenshots ;)

Can you do 2 things please?
1) Check the bios revision on your system. The bios I have here is v1.1.4
2) Boot the installer with pci=nobios on the command line.
That last line prior to the hang issues a pci write to the device via bios and I'd like to ensure that something isn't wrong with the system bios handling the write to this device.

Thank you for your response, I responded in the new bug report: https://bugzilla.redhat.com/show_bug.cgi?id=710602

Sorry for the late reply. The machine is in production using RHEL 5.x and I cannot take it out for testing now. I'll do it ASAP, but that means getting a new machine and this won't happen soon.

Ok, François, let me know when you get to it.

François: do you maybe have a dedicated server in France, with Online / Dedibox? If yes: I had a server with them, with business support. It's a trap... there is no real support... and the Dell iDrac KVM IP is very unstable with virtual media (ISO).

My experience with their business support is limited but fine, and the iDrac IP KVM works well if you have enough upload bandwidth. But yes, this is exactly where the R210 is hosted.

I had problems with their iDrac IP and with many DSL connections... and support is absent. So, bye bye Online, welcome OVH. I prefer to warn you ;)

Well, I did a lot of debugging and came to the following conclusion:
- the problem is caused by the Intel Xeon CPU L3426
- it occurs during the init of the bnx2 module

lshw -C CPU output:
    *-cpu
        description: CPU
        product: Intel(R) Xeon(R) CPU L3426 @ 1.87GHz
        vendor: Intel Corp.
        physical id: 400
        bus info: cpu@0
        version: Intel(R) Xeon(R) CPU L3426 @ 1.87GHz
        slot: CPU1
        size: 1866MHz
        capacity: 3600MHz
        width: 64 bits
        clock: 4266MHz
        capabilities: x86-64 fpu fpu_exception wp vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 popcnt lahf_lm ida tpr_shadow vnmi flexpriority ept vpid
        configuration: cores=4 enabledcores=4 threads=8

I installed Fedora 15 manually (chroot install + grub re-setup) and blacklisted the bnx2 module. I started the server, and when I run modprobe bnx2 the server freezes and hangs.
The problem is not reproduced with an Intel Core i3 or an Intel Xeon X3450, so I suspect a CPU bug/issue.
Could you try with this CPU? (Otherwise I can grant you access to a test server to debug where it comes from.)
Best regards.

In general, stop whining about absent support; you rent a low-price server after all... And I am part of the support and I am here to help fix this issue...

Neil, can you see with Raphaël how to get more information if needed? I cannot provide you with more data right now, and Raphaël has the exact hardware to test things on.

I tried to update the microcode with the one available at: http://downloadcenter.intel.com/Detail_Desc.aspx?agr=Y&ProdId=2680&DwnldID=20050
But it still fails with a hang on bnx2 module load.

I believe I have the same problem with an IBM x3550 M3 (model 7944). Fedora 15 install hangs at "Waiting for hardware to initialize...". The pci/raid/nmi light path leds will light and the server reboots. I will usually also see:
"Uhhhhh nmi received for unknown reason 2d on cpu0"
Using ignore_loglevel, I see the megasas module and the bnx2 module are loaded here. If I tell anaconda to blacklist bnx2, the installer passes this point. This is as far as I've gotten, as I need the network to continue.
Any questions let me know. As an aside, I had installed and was running Fedora 14 on this machine successfully.

Created attachment 510862 [details]
dmesg capture of F15 on IBM x3550 M3
Added dmesg capture of F15 install hang on ibm x3550 M3
pci=nobios does not help, immediate hang when bnx2 module is loaded during the "Waiting for hardware init...".
Bios version: UEFI 1.11 BuildID D6E150C
My cpu:
    *-cpu:0
        description: CPU
        product: Intel(R) Xeon(R) CPU X5677 @ 3.47GHz
        vendor: Intel Corp.
        physical id: 1
        bus info: cpu@0
        version: Intel(R) Xeon(R) CPU X5677 @ 3.47GHz
        slot: Node 1 Socket 1
        size: 3470MHz
        width: 64 bits
        clock: 1571MHz
        capabilities: x86-64 fpu fpu_exception wp vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt aes lahf_lm arat tpr_shadow vnmi flexpriority ept vpid
        configuration: cores=4 enabledcores=4 threads=8
    *-cpu:1 DISABLED
        description: CPU [empty]
        physical id: 55
        slot: Node 1 Socket 2

Thank you, that's helpful. The nmi error makes me think this isn't a bnx2 issue at all, but rather a perf nmi gone bad:
https://patchwork.kernel.org/patch/566721/
I've backported that fix to f14 (along with some supporting infrastructure). The build is here:
http://koji.fedoraproject.org/koji/taskinfo?taskID=3174116
If you could please, try this kernel and see if it fixes the problem. You can either rebuild the installer initramfs or you can install using a dvd (blacklisting the NIC), and then update with this kernel and unblacklist bnx2 to see if the issue stops recurring.

Sorry, the f14 kernel didn't work.
I got the hang/nmi when I modprobed the bnx2 module:
    [ 142.494991] bnx2: Broadcom NetXtreme II Gigabit Ethernet Driver bnx2 v2.0.15 (May 4, 20)
    [ 142.541360] bnx2 0000:0b:00.0: PCI INT A -> GSI 28 (level, low) -> IRQ 28
    [ 142.595412] bnx2 0000:0b:00.0: eth1: Broadcom NetXtreme II BCM5709 1000Base-T (C0) PCI 8
    [ 142.660347] bnx2 0000:0b:00.1: PCI INT B -> GSI 40 (level, low) -> IRQ 40
    [ 142.707743] bnx2 0000:0b:00.1: eth2: Broadcom NetXtreme II BCM5709 1000Base-T (C0) PCI a
    [ 142.774584] bnx2 0000:10:00.0: PCI INT A -> GSI 29 (level, low) -> IRQ 29
    [ 142.823731] bnx2 0000:10:00.0: eth3: Broadcom NetXtreme II BCM5709 1000Base-T (C0) PCI c
    [ 143.417963] Uhhuh. NMI received for unknown reason 2d on CPU 0.
    [ 143.417965] Do you have a strange power saving mode enabled?
    [ 143.417965] Dazed and confused, but trying to continue

Weird, got that same NMI, maybe the patch is missing?

I don't think so, the build log indicated that it was applied:
http://koji.fedoraproject.org/koji/getfile?taskID=3174117&name=build.log
----------------------
    + case "$patch" in
    + patch -p1 -F1 -s
    + ApplyPatch linux-2.6-perf-overflow-handling.patch
    + local patch=linux-2.6-perf-overflow-handling.patch
    + shift
    + '[' '!' -f /builddir/build/SOURCES/linux-2.6-perf-overflow-handling.patch ']'
    Patch14000: linux-2.6-perf-overflow-handling.patch
-----------------------
Are you sure you booted the right kernel? I don't see any log messages above that confirm it's the test kernel that was running.

Created attachment 510932 [details]
testing patch fix for nmi
Just tried the nmi patched kernel twice to make sure I got the right kernel and I still got the hang both times. Here's the full serial console capture of my last run.
Hmm, well, ok, I'm really not sure what's going on then. Something else must be causing the unknown nmi code on your system, although I couldn't for the life of me imagine what. Can you boot with unknown_nmi_panic=0 on the command line to avoid the panic and provide the log of that boot? That may give us some further idea of what's causing the NMI code.

Created attachment 511324 [details]
F15 unknown nmi error capture
Attached is the dmesg of nmi call trace.
This one has:
Uhhuh. NMI received for unknown reason 3d on CPU 0.
I've noticed "unknown reason" being 2d or 3d when and if the kernel gets a chance to output anything to the console.
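
The two values seen so far, 2d and 3d, can be compared bit by bit. The following is a minimal sketch in Python, assuming the conventional PC/AT layout of the NMI status byte (I/O port 0x61), in which only bits 7 and 6 (SERR and IOCHK) identify a cause. The bit names in the table are that assumption and are not read from this system.

```python
# Assumed PC/AT "NMI status and control" byte layout (I/O port 0x61).
# These names are an assumption for illustration, not read from this machine.
BIT_NAMES = {
    7: "SERR (PCI system error)",
    6: "IOCHK (I/O channel check)",
    5: "timer 2 output",
    4: "refresh toggle",
    3: "IOCHK disabled",
    2: "SERR disabled",
    1: "speaker data enable",
    0: "timer 2 gate",
}
KNOWN_REASON_MASK = 0xC0  # bits 7 and 6: the only ones that name a cause


def set_bits(reason):
    """Return the bit positions set in an 8-bit reason code, high to low."""
    return [bit for bit in range(7, -1, -1) if reason & (1 << bit)]


def differing_bits(a, b):
    """Return the bit positions where two reason codes differ."""
    return set_bits(a ^ b)


def is_unknown(reason):
    """True when neither SERR nor IOCHK is flagged in the reason code."""
    return (reason & KNOWN_REASON_MASK) == 0


if __name__ == "__main__":
    for reason in (0x2D, 0x3D):
        names = [BIT_NAMES[b] for b in set_bits(reason)]
        print(f"{reason:#04x}: unknown={is_unknown(reason)} bits={names}")
    print("differ in bit(s):", differing_bits(0x2D, 0x3D))
```

Under that assumed layout the two codes differ only in bit 4, and neither has bit 6 or bit 7 set, which would be consistent with the kernel reporting both as "unknown reason".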
You didn't boot with unknown_nmi_panic=0.

Created attachment 511365 [details]
unknown_nmi_panic on cmdline, no call trace generated though
This is a capture with unknown_nmi_panic=0 being passed on bootup, but I never got a call trace this way... then I noticed /proc/sys/kernel/unknown_nmi_panic was still being set to 1 (something was resetting this to 1, or a bug? Anyway...). The capture before this one is me echo'ing 0 to the unknown_nmi_panic proc control manually and then modprobing bnx2. Sorry for the confusion.
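
The mismatch described here (parameter passed at boot, but the proc control still reading 1) can be checked with a short script. This is a minimal sketch, assuming Python is available on the installed system; the helper names are hypothetical, and the only paths used are the proc control mentioned in this comment and /proc/cmdline.

```python
from pathlib import Path

SYSCTL_PATH = Path("/proc/sys/kernel/unknown_nmi_panic")
CMDLINE_PATH = Path("/proc/cmdline")


def cmdline_value(cmdline, key):
    """Return the value given for key on a kernel command line.

    Returns "" when key appears as a bare flag and None when it is absent.
    If key appears more than once, the first occurrence is returned.
    """
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        if name == key:
            return value if sep else ""
    return None


def read_text_or_none(path):
    """Read and strip a small file, or return None if it cannot be read."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    cmdline = read_text_or_none(CMDLINE_PATH) or ""
    requested = cmdline_value(cmdline, "unknown_nmi_panic")
    actual = read_text_or_none(SYSCTL_PATH)
    print(f"requested on cmdline: {requested!r}")
    print(f"current sysctl value: {actual!r}")
    if requested is not None and actual is not None and requested != actual:
        print("mismatch: the running value differs from the boot parameter")
```

The script only reports the two values side by side; it does not say what changed the setting.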
If you can get the unknown_nmi_panic=0 working (run a 'grep -r unknown_nmi_panic /etc/*' to see if the system is setting it) and can get to a login prompt, then run 'lspci -vvv' and attach the output. I might be able to figure out which device is sending the NMI from that output.
Don't worry about the 2d or 3d; of the 8 bits only one is useful/meaningful. A couple of others just naturally flip back and forth, hence either 2d or 3d.
Cheers,
Don

I tried booting with unknown_nmi_panic=0 and it still crashes completely while loading the bnx2 module. It does this with both the Fedora 15 kernel and the CentOS 6.0 kernel (I tested it as well just to see).

Zing, so what is that trace showing? It seems to me that if you disable the nmi panic, nothing is crashing (at least your trace doesn't show an oops). Is something else happening that makes the system unstable (a hang or some such)?

Created attachment 516746 [details]
capture of install dvd hang with unknown_nmi_panic=0
Sorry, took me a while to get back to this. Surprisingly I wasn't as easily able to hang the server with 2.6.38.8-32.fc15.x86_64 as some weeks back. I was able to modprobe bnx2 many times and it worked X/. It will still hang, and if it does, it's immediately after modprobing bnx2; otherwise everything seems ok and we continue. Not good. I was changing bios settings around back then, along with changing to a legacy boot setting. I'm wondering if that matters, and whether the difference from EFI booting does too.
So I went back to the F15 install dvd again and that hangs it reliably so far.
I passed unknown_nmi_panic=0 to the install dvd and I attached the console capture...
As soon as bnx2 is modprobe'd, the kernel console log output slowed down a lot, to about a few characters a sec. There is a call trace in the logs.
At the point the log ends, the server automatically forced a hard shutdown.
I can attach an lspci, but it seemed like you needed the output from the kernel that hung at the time.
I've successfully booted to the gui in the installer dvd twice using biosdevname=0 now. That's never happened before in my tests. Raphaël, does that work for you on your hardware?

I've also found that nosmp allows the F15 install dvd to continue past the modprobe'ing of bnx2.

It still hangs with 2.6.32-71.29.1.el6.x86_64 with biosdevname=0...
And just a note, the 2.6.32 boots perfectly fine on the box.
I tried rebuilding several tag sets from the kernel src.rpm, but they all seem to fail, from the first to the last tag set, for various reasons:
- usb/tpm detection fails
- video init fails
I have no idea how to fix this; if you need, I can grant you access to the box.

I tried booting the 2.6.32-131.6.1.el6.x86_64 with all the previous options. The log gives me this after loading bnx2:
    # modprobe bnx2
    bnx2: Broadcom NetXtreme II Gigabit Ethernet Driver bnx2 v2.1.6 (Mar 7, 2011)
    bnx2 0000:02:0.0: found PCI INT A -> IRQ 15
    bnx2 0000:02:0.0: sharing IRQ 15 with 0000:00:03.0
    bnx2 0000:02:0.0: sharing IRQ 15 with 0000:01:00.0
    _
And that's all.
To me it seems that one of the Red Hat patches is triggering the irq conflict.
It seems that the H200 raid card on the server is sharing the irq:
    mpt2sas version 08.101.00.00 loaded
    scsi0 : Fusion MPT SAS Host
    mpt2sas 0000:01:00.0: found PCI INT A -> IRQ 15
    mpt2sas 0000:01:00.0: sharing IRQ 15 with 0000:00:03.0
    mpt2sas 0000:01:00.0: sharing IRQ 15 with 0000:02:00.0
    mpt2sas0: 64 BIT PCI BUS DMA ADDRESSING SUPPORTED, total mem (16459204 kB)
    mpt2sas0: IO-APIC enabled: IRQ 15
    mpt2sas0: iomem (0x00000000df2b), mapped(0xffffc90002de0000), size(65536)
    mpt2sas0: ioport(0x000000000000fc00), size(256)
    mpt2sas0: sending diag reset !!
    [...usb init...]
    mpt2sas0: diag reset: SUCCESS
    mpt2sas0: Allocated physical memory: size(7444 kB)
    mpt2sas0: Current Controller Queue Depth(3306), Max Controller Queue Depth(3439)
    mpt2sas0: Scatter Gather Elements per IO(128)
The devices are (from lspci before loading the bnx2 module):
    00:03.0 PCI bridge: Intel Corporation Core Processor PCI Express Root Port 1 (rev 11)
    02:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5716 Gigabit Ethernet (rev 20)
It seems that mpt2sas doesn't honor the noapic boot flag. The boot flags were:
    vga=0x31b acpi=off noacpi noapic nolapic pci=nobios unknown_nmi_panic=0 nosmp biosdevname=0
When booting without all the options, it seems that irq 16 is used instead of irq 15, and it freezes just after:
    bnx2: Broadcom NetXtreme II Gigabit Ethernet Driver bnx2 v2.1.6 (Mar 7, 2011)
    bnx2 0000:02:0.0: found PCI INT A -> GSI 16 (level, low) -> IRQ 16

Created attachment 519309 [details]
3.0.0-4.fc15
This capture is a recompiled 3.0.0-4.fc15 with some debug options enabled. This one has call traces on each of the cpus.
CPU 2 is in ext4 and maybe the soft lockup is somewhere in this chain of code and the pci vpd reads?
Why are your devices not using MSI-X interrupts?

What are MSI-X interrupts? How do I activate them? Isn't there a patch/config in the RH kernel that disables them?

No, msi interrupts should be enabled by default. The only reason you shouldn't be using them is if disable_msi was specified as a bnx2 module option at load time.

Hi all,
I also have a Dell R210 II with dual gigabit NICs that is exhibiting problems very much like those described here already. I may have some new helpful clues due to a different hunch I was following until I found this BZ.

First of all, I had no problems with the F15 installer; my problems arrived when rebooting into the fresh install. My system locks hard whenever the 2nd bnx2 instance is being initialized. However, I hadn't realized that the NICs were relevant initially. For some reason I was initially suspicious of the PERC H200 6gb/s HBA to which I have an Intel 320 series 40 GB SSD attached. I was originally trying to install a custom spin of Fedora Live on the SSD, and with that the system would start to boot then halt with "Cannot find root device. Sleeping forever." or something very close to that effect.

What I'd found is that if I moved the SSD from the PERC to the mainboard's SATA port A -- before or after the Fedora install; makes no difference -- I could boot and operate just fine, including networking! Only when the PERC was involved did I have problems.

Now that I've read through this BZ and see that interrupt handling may be suspect instead, I tried a few more tests with surprising results. If I configure the BIOS such that "Integrated Devices"/"Embedded NIC1 and NIC2" is set to "Disabled (OS)" it will boot fine from the PERC, but of course I have no networking then. Interestingly enough, I see in "PCI IRQ Assignment" that NIC1 and the SAS Controller both share IRQ 10 (as does USB EHCI Controller 2). I would find this more compelling if it was NIC2 that shared with the SAS Controller, since that's where my boot hangs.
Whenever I change the assignment for any one of those three, all three change together -- I cannot make the SAS and NIC1 different. So I don't suspect fiddling with assignments is going to help. I realize that IRQ sharing is possible these days, but am not well enough versed to know how that's accomplished or how "fragile" it might be. Still, I'm hoping I've brought new light upon this problem. Please let me know if there is anything I can do to help move this along.

Got my first successful boot into F15, with usable networking (on NIC1, at least) by changing all (even those seemingly unrelated) of the IRQ assignments to "default". I'm not sure how "default" differs from the explicit defaults set by Dell or what new problems I may have created in consequence, but so far it seems an improvement.

(In reply to comment #43)
> Got my first successful boot into F15, with usable networking (on NIC1, at
> least) by changing all (even those seemingly unrelated) of the IRQ assignments
> to "default".
Harrumph! I cannot repeat this now. I must have well over a hundred boot tests now (what fun with ~5m just in the POST) and this must have been a freak occurrence.
> I'm not sure how "default" differs from the explicit defaults
> set by Dell or what new problems I may have created in consequence, but so far
> it seems an improvement.
The Dell Insyde BIOS holds the "default" value until the next boot, at which point the BIOS will hold the explicit values again.
In summary, no amount of fiddling with the IRQ assignments seems to help. The only repeatable methods I've found are:
* disable NIC1 and NIC2 (no option for disabling singularly)
* bypass the PERC and attach the SSD directly to the mainboard's SATA port

None of this answers why legacy interrupts are being used on these systems. They should support MSI interrupts and use those (which will not be shared). Is there anything in the logs which indicates why msi interrupts are disabled on these nics?
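
One way to approach that question is to look at the MSI and MSI-X capability lines in `lspci -vvv` output, where `Enable+` means the capability is in use and `Enable-` means it is not. The following is a minimal sketch that parses such text. The sample excerpt is made up for illustration and is not output from any machine in this report, and the function names are hypothetical.

```python
import re

# Hypothetical excerpt in the style of `lspci -vvv`; not taken from this bug.
SAMPLE = """\
02:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5716 Gigabit Ethernet (rev 20)
\tInterrupt: pin A routed to IRQ 16
\tCapabilities: [58] MSI: Enable- Count=1/16 Maskable- 64bit+
\tCapabilities: [a0] MSI-X: Enable- Count=9 Masked-
01:00.0 Serial Attached SCSI controller: Example SAS controller
\tCapabilities: [c0] MSI-X: Enable+ Count=15 Masked-
"""

# MSI-X is listed first so that it is not matched as plain MSI.
CAP_RE = re.compile(r"\b(MSI-X|MSI): Enable([+-])")


def msi_state(lspci_text):
    """Map each PCI address to {"MSI": bool, "MSI-X": bool} as reported."""
    devices = {}
    current = None
    for line in lspci_text.splitlines():
        if line and not line[0].isspace():
            current = line.split()[0]  # device header, e.g. "02:00.0"
            devices[current] = {}
        elif current is not None:
            match = CAP_RE.search(line)
            if match:
                devices[current][match.group(1)] = match.group(2) == "+"
    return devices


def uses_legacy_irq(caps):
    """True when neither MSI nor MSI-X is reported as enabled."""
    return not any(caps.values())


if __name__ == "__main__":
    for address, caps in msi_state(SAMPLE).items():
        mode = "legacy INTx" if uses_legacy_irq(caps) else "MSI/MSI-X"
        print(address, caps, "->", mode)
```

On a real system the text would come from running `lspci -vvv` as root, since the capability list is hidden from unprivileged users. /proc/interrupts offers a cross-check: vectors in use by a device appear there as PCI-MSI rather than IO-APIC.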
Ping John, has there been any further reproducibility here? Or should we close this?

(In reply to comment #46)
> Ping John, has there been any further reproducibility here? Or should we close
> this?
Neil, first up, my apologies for letting this slip. Must have been the holidays because I completely lost track of this one.
Given the success we saw without the PERC, we went down the road of ordering the 40+ R210s from Dell without the PERCs, so in that sense it's no longer a problem for me. We were originally told by Dell that we couldn't use an SSD without a PERC, but our experience showed just the opposite. I doubt there's any clue there, but thought I'd mention it just in case. (Also IIRC, Dell did say they had an SSD that would work without a PERC, but it was a big expensive monster and we really only need a few gig.)
I do still have the one R210 and the PERC, so I could reinstall the PERC and try more tests if that would be helpful here. Also, we're now targeting F16 instead of F15, so if you want more tests please indicate which one, or both, I should try in order to get you the feedback.

We had a similar problem on our Dell R210s that was fixed by upgrading the Broadcom card's firmware. You can check the current firmware version using "lshw".
On firmware version 6.0.1, loading the bnx2 module would cause the server to hang. Either blacklisting bnx2 or using the kernel option "nosmp" would work around the problem.
On firmware version 6.4.5, the system works correctly.
Dell's firmware update package: http://www.dell.com/support/drivers/us/en/04/DriverDetails/DriverFileFormats?DriverId=R319248

Well, I think, given the fact that everyone with this problem seems to have resolved it with alternate hardware, that I'll leave testing up to you. To solve this, I think comment 45 is still the first question that needs answering. If anyone has the gumption to go and research that, I'll gladly take a look at it. For now though, I'll close this bug.
Please re-open it if/when you get around to testing.