Description of problem:
After an upgrade to F12, booting fails on an HP DL380G6 server. The kernel complains "NMI received for unknown reason" but continues; however after a short period the boot will hang. Although the message suggests that this is a hardware problem, the same system works with no problems using the latest F11 kernel.

Version-Release number of selected component (if applicable):
kernel-2.6.31.6-166.fc12.x86_64

How reproducible:
Always.

Steps to Reproduce:
1. Install F12 on an x86_64 HP DL380G6 server.
2. Try to boot.

Actual results:
Boot messages look similar to:

uhci_hcd 0000:00:1d.3: PCI INT D -> GSI 23 (level, low) -> IRQ 23
uhci_hcd 0000:00:1d.3: UHCI Host Controller
uhci_hcd 0000:00:1d.3: new USB bus registered, assigned bus number 5
uhci_hcd 0000:00:1d.3: irq 23, io base 0x00001060
usb usb5: New USB device found, idVendor=1d6b, idProduct=0001
usb usb5: New USB device strings: Mfr=3, Product=2, SerialNumber=1
usb usb5: Product: UHCI Host Controller
usb usb5: Manufacturer: Linux 2.6.31.6-166.fc12.x86_64 uhci_hcd
usb usb5: SerialNumber: 0000:00:1d.3
usb usb5: configuration #1 chosen from 1 choice
hub 5-0:1.0: USB hub found
hub 5-0:1.0: 2 ports detected
uhci_hcd 0000:01:04.4: PCI INT B -> GSI 22 (level, low) -> IRQ 22
uhci_hcd 0000:01:04.4: UHCI Host Controller
uhci_hcd 0000:01:04.4: new USB bus registered, assigned bus number 6
uhci_hcd 0000:01:04.4: port count misdetected? forcing to 2 ports
uhci_hcd 0000:01:04.4: irq 22, io base 0x00003800
usb usb6: New USB device found, idVendor=1d6b, idProduct=0001
Uhhuh. NMI received for unknown reason a1 on CPU 0.
You have some hardware problem, likely on the PCI bus.
Dazed and confused, but trying to continue
usb usb6: New USB device strings: Mfr=3, Product=2, SerialNumber=1
usb usb6: Product: UHCI Host Controller
usb usb6: Manufacturer: Linux 2.6.31.6-166.fc12.x86_64 uhci_hcd
usb usb6: SerialNumber: 0000:01:04.4
usb usb6: configuration #1 chosen from 1 choice
hub 6-0:1.0: USB hub found
hub 6-0:1.0: 2 ports detected

Booting continues and then usually hangs after:

ioc0: LSISAS1068E B3: Capabilities={Initiator}
uhci_hcd 0000:01:04.4: Unlink after no-IRQ? Controller is probably using the wrong IRQ.
scsi0 : ioc0: LSISAS1068E B3, FwRev=01172a00h, Ports=1, MaxQ=163, IRQ=30
mptsas: ioc0: attaching ssp device: fw_channel 0, fw_id 5, phy 4, sas_addr 0x500110a0008db4ba
scsi 0:0:0:0: Sequential-Access HP Ultrium 4-SCSI U28W PQ: 0 ANSI: 5
scsi 0:0:0:0: Attached scsi generic sg0 type 1
scsi 0:0:0:1: Medium Changer HP MSL G3 Series 6.80 PQ: 0 ANSI: 5
scsi 0:0:0:1: Attached scsi generic sg1 type 8
mptsas: ioc0: attaching ssp device: fw_channel 0, fw_id 6, phy 5, sas_addr 0x500110a0008db4bd
scsi 0:0:1:0: Sequential-Access HP Ultrium 4-SCSI U28W PQ: 0 ANSI: 5
scsi 0:0:1:0: Attached scsi generic sg2 type 1

Expected results:
Boot succeeds. Using the F11 kernel (kernel-2.6.30.9-102.fc11.x86_64) the boot messages look similar to:

Dec 16 11:47:06 angklung kernel: uhci_hcd 0000:00:1d.3: PCI INT D -> GSI 23 (level, low) -> IRQ 23
Dec 16 11:47:06 angklung kernel: uhci_hcd 0000:00:1d.3: UHCI Host Controller
Dec 16 11:47:06 angklung kernel: uhci_hcd 0000:00:1d.3: new USB bus registered, assigned bus number 5
Dec 16 11:47:06 angklung kernel: uhci_hcd 0000:00:1d.3: irq 23, io base 0x00001060
Dec 16 11:47:06 angklung kernel: usb usb5: New USB device found, idVendor=1d6b, idProduct=0001
Dec 16 11:47:06 angklung kernel: usb usb5: New USB device strings: Mfr=3, Product=2, SerialNumber=1
Dec 16 11:47:06 angklung kernel: usb usb5: Product: UHCI Host Controller
Dec 16 11:47:06 angklung kernel: usb usb5: Manufacturer: Linux 2.6.30.9-102.fc11.x86_64 uhci_hcd
Dec 16 11:47:06 angklung kernel: usb usb5: SerialNumber: 0000:00:1d.3
Dec 16 11:47:06 angklung kernel: usb usb5: configuration #1 chosen from 1 choice
Dec 16 11:47:06 angklung kernel: hub 5-0:1.0: USB hub found
Dec 16 11:47:06 angklung kernel: hub 5-0:1.0: 2 ports detected
Dec 16 11:47:06 angklung kernel: uhci_hcd 0000:01:04.4: PCI INT B -> GSI 22 (level, low) -> IRQ 22
Dec 16 11:47:06 angklung kernel: uhci_hcd 0000:01:04.4: UHCI Host Controller
Dec 16 11:47:06 angklung kernel: uhci_hcd 0000:01:04.4: new USB bus registered, assigned bus number 6
Dec 16 11:47:06 angklung kernel: uhci_hcd 0000:01:04.4: port count misdetected? forcing to 2 ports
Dec 16 11:47:06 angklung kernel: uhci_hcd 0000:01:04.4: irq 22, io base 0x00003800
Dec 16 11:47:06 angklung kernel: usb usb6: New USB device found, idVendor=1d6b, idProduct=0001
Dec 16 11:47:06 angklung kernel: usb usb6: New USB device strings: Mfr=3, Product=2, SerialNumber=1
Dec 16 11:47:06 angklung kernel: usb usb6: Product: UHCI Host Controller
Dec 16 11:47:06 angklung kernel: usb usb6: Manufacturer: Linux 2.6.30.9-102.fc11.x86_64 uhci_hcd
Dec 16 11:47:06 angklung kernel: usb usb6: SerialNumber: 0000:01:04.4
Dec 16 11:47:06 angklung kernel: usb usb6: configuration #1 chosen from 1 choice
Dec 16 11:47:06 angklung kernel: hub 6-0:1.0: USB hub found
Dec 16 11:47:06 angklung kernel: hub 6-0:1.0: 2 ports detected

Booting continues normally, and the "Unlink after no-IRQ?" message is not seen.

Additional info:
Also fails with the latest F12 update, kernel-2.6.31.9-174.fc12.x86_64. Normal boot with the latest F11 update, kernel-2.6.30.10-105.fc11.x86_64.
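For anyone comparing boot logs across kernels, the unknown-NMI lines quoted in the report can be pulled out of a saved log with a short script. This is only a minimal sketch in Python and not part of the original report: the function names find_unknown_nmis and describe_reason are hypothetical, and the bit meanings in describe_reason assume the conventional PC NMI status port (0x61) layout, which may not reflect how this particular platform raises the error.

```python
import re

# Matches the message format quoted in the report, e.g.
# "Uhhuh. NMI received for unknown reason a1 on CPU 0."
NMI_RE = re.compile(
    r"NMI received for unknown reason ([0-9a-fA-F]{2}) on CPU (\d+)"
)


def find_unknown_nmis(log_text):
    """Return a list of (reason_byte, cpu) tuples found in log_text."""
    return [
        (int(m.group(1), 16), int(m.group(2)))
        for m in NMI_RE.finditer(log_text)
    ]


def describe_reason(reason):
    """Name the source bits set in the reason byte.

    ASSUMPTION: conventional PC NMI status port (0x61) layout, where
    bit 7 signals PCI SERR#/parity and bit 6 signals an I/O channel check.
    """
    flags = []
    if reason & 0x80:
        flags.append("SERR#/parity (bit 7)")
    if reason & 0x40:
        flags.append("IOCHK (bit 6)")
    return flags or ["no source bit set"]


if __name__ == "__main__":
    # Sample lines copied from the boot messages in the report.
    sample = (
        "usb usb6: New USB device found, idVendor=1d6b, idProduct=0001\n"
        "Uhhuh. NMI received for unknown reason a1 on CPU 0.\n"
        "You have some hardware problem, likely on the PCI bus.\n"
    )
    for reason, cpu in find_unknown_nmis(sample):
        print("CPU %d: reason 0x%02x: %s"
              % (cpu, reason, ", ".join(describe_reason(reason))))
```

Under that assumption, reason a1 has bit 7 set, which would be consistent with the kernel's "likely on the PCI bus" hint, but it does not identify which device raised the error.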
As a workaround, try turning off C-states in the BIOS.
I see a bunch of settings related to power management (e.g. tuning HP's dynamic power savings mode) in the BIOS, but nothing that obviously says "turn C-States on/off". Do you happen to know what HP might call this in their BIOS?
Try "Minimum Processor Idle Power State". If you're navigating the BIOS screen, it's under "Power Management Options" => "Advanced Power Management Options".

I've had mixed results with this one: I had a 2.4GHz model that came up without this setting and is still running fine, and a 2.93GHz that worked initially with this setting but then developed problems later. It could be an issue with certain G6 units and power state weirdness; you might want to talk to your vendor about it.
I have the same problem on the same server. After changing:

rbsu> SET CONFIG MINIMUM PROCESSOR IDLE POWER STATE 4
Minimum Processor Idle Power State
1|C6 State
2|C3 State
3|C1E State
4|No C-states <=

the message and the problems still persist. Some services randomly fail to start, and the system freezes completely at a random time (approx. 5 minutes after boot).

Similar problems with the current testing kernel-2.6.32:

Kernel panic - not syncing: An NMI occurred, please see the Integrated Management Log for details.

(but there is nothing in the IML).

The Fedora 11 kernel boots fine, but my server sometimes has very high load and my virtual machines freeze for 1-60 minutes. Any other solutions?
Kernel 2.6.34-0.4.rc0.git2.fc14.x86_64:
- no uhhuh message
- Kernel panic - not syncing: An NMI occurred, please see the Integrated Management Log for details.

Still doesn't work. Looks like everything >=2.6.31 fails. :-(
As a workaround, intel_iommu=off works for me. Maybe there are some disadvantages, but at least I can run my server.
With iommu=pt the server works well, but after a reboot it fails in the BIOS with the message:

NMI - Undetermined Source

(this message is displayed before "Press F9 to setup ...")

The ILO IML log says:

Severity  Class     Description
Critical  Host Bus  Unknown Event (Class 6, Code 4)

Another workaround is to disable VT-d in the BIOS. Then there is no need to add any *iommu* kernel parameter, and the server also works well after a reboot. I don't need to assign PCI devices to guests (I use only virtio drivers in guests), so I don't need VT-d, but it would be good to fix this in future.

More info about VT-d: http://www.linux-kvm.org/page/How_to_assign_devices_with_VT-d_in_KVM
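To confirm which of the two kernel-parameter workarounds mentioned so far (intel_iommu=off or iommu=pt) a given boot actually used, the kernel command line can be inspected. The following is a rough Python sketch under stated assumptions: the helper names iommu_settings and workaround_active are made up for illustration, and the sample command line is invented rather than taken from any of the affected servers.

```python
def iommu_settings(cmdline):
    """Return the IOMMU-related parameters found on a kernel command line.

    Only the two parameters discussed in this bug are collected:
    intel_iommu=... and iommu=...
    """
    settings = {}
    for token in cmdline.split():
        key, _sep, value = token.partition("=")
        if key in ("intel_iommu", "iommu"):
            settings[key] = value
    return settings


def workaround_active(cmdline):
    """True if either workaround from this bug is on the command line."""
    s = iommu_settings(cmdline)
    return s.get("intel_iommu") == "off" or s.get("iommu") == "pt"


if __name__ == "__main__":
    # Hypothetical command line, for illustration only.
    sample = "ro root=/dev/mapper/vg-root rhgb quiet iommu=pt"
    print(iommu_settings(sample))
    print(workaround_active(sample))
```

On a running system the same functions could be fed the contents of /proc/cmdline instead of the sample string. Note that this only shows what was requested at boot; it says nothing about whether VT-d is enabled in the BIOS.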
(In reply to comment #6)
> Kernel 2.6.34-0.4.rc0.git2.fc14.x86_64:
> - no uhhuh message
> - Kernel panic - not syncing: An NMI occurred, please see the Integrated
> Management Log for details.
>
> Still doesn't work. Looks like everything >=2.6.31 fails. :-(

The messages went away because you are using the hpwdt kernel module, which intercepts the unknown NMIs, uses the BIOS to determine the source, and saves them in the ILO. Upon reboot you see what the ILO figured out (which was nothing interesting).

I have a kernel module that can walk the PCI tree at the time of the NMI, but the problem needs to be reproducible after booting so that the module can be loaded (unless you want a kernel patch to apply and build). Of course you would need to disable the hpwdt module to use this one instead. Let me know if anyone is interested.

Cheers,
Don
I don't think that hpwdt is responsible for this problem. It only reports a bug, but does not cause it. I have a similar problem with a DELL PowerEdge R710 server. I can't find a VT-d switch in the BIOS, so I can't disable VT-d. The only workaround is iommu=pt. Is there any chance of a kernel fix that ignores or repairs VT-d support?
Please try updating the firmware on the Smart Array controller to version 3.00 or newer. This should fix the NMI received during boot. The NMI on reboot seems to be a BIOS issue, unless there's anything more we can do to make sure the VT-d hardware is shut down and will not produce faults during the BIOS reboot.
(In reply to comment #10)
> I don't think that hpwdt is responsible for this problem. It only reports a
> bug, but does not cause it.

Sorry for the confusion. I was referring to the change in behaviour of the messages. I understand that hpwdt is not causing the problem. That module just reports the info differently than the normal unknown NMI path. Hopefully a firmware update can fix this problem :-)

Cheers,
Don
I can't experiment on this server now. It's a production server. But I have similar problems on a DELL server with all firmware updates installed, and the same problems with F12 and F13. The DELL does not display a message after reboot, but it is also unable to boot without the iommu=pt parameter. I can't find a VT-d parameter in the BIOS, so I can't disable it. Can I test something on this server? I don't use it in production yet.
For Dell BIOS brokenness; let's ask Dell... Jan, can you confirm that you're seeing the _same_ problem on the Dell box? Sorry to ask, but some people have a very strange idea of what 'similar bug' means :)
Offline we have established that the Dell issue is different -- it manifests itself as a lockup soon after enabling VT-d. Probably due to a missing RMRR for the iDRAC which is being used?
*** Bug 593003 has been marked as a duplicate of this bug. ***
I can confirm that this problem disappeared after these updates:
- BIOS update on the HP to the latest version
- Compaq Smart Array update to the latest version
- Fedora kernel update

On the Dell as well, all BIOSes and the kernel have been updated and it now works well. The problem is also gone after VT-d was re-enabled in the BIOS.

Thank you for the help, closing bug.