Bug 618612
Created attachment 434666 [details]
Fedora 12 dmesg output for comparison
Hi, I managed to create a bootable USB key with a debug kernel. In the hope that the ACPI error is somehow related to the NMI, I used the following command line:
Kernel command line: initrd=/debug/debug_initrd.img log_buf_len=8M text selinux=0 nomodeset BOOT_IMAGE=/debug/vmlinuz-2.6.33.3-85.fc13.i686.debug rescue acpi.debug_layer=0x20 acpi.debug_level=0xffffffff
This produces a large dmesg log (hence the need for log_buf_len=8M).
I should also say that I have tried each of the following kernel parameters (individually) and the NMI and disk hangs still occur:
acpi=off
nolapic
noapic
nolapic_timer
nohz=off
Are there any other things worth trying?
Created attachment 436780 [details]
F13 dmesg log with acpi.debug_layer=0x20 acpi.debug_level=0xffffffff
The latest release kernel is 2.6.33.6-147.2.4. Always try the latest release before reporting kernel bugs. You could also try 2.6.34.2-33 from koji to see if this is fixed in 2.6.34.
Created attachment 437089 [details]
dmesg log from 2.6.34.2-34
Hi, thanks for the response, I had forgotten koji existed.
It was a bit difficult to try updated kernels as I had not been able to install due to the disk lock-ups. That is why I made the bootable USB key. I did manage to get a limited F13 installed via the USB key (I guess timings were a bit different than from a DVD, possibly faster).
Both kernel-2.6.33.6-147.2.4.fc13.i686 and kernel-2.6.34.2-34.fc13.i686.rpm have the same problem: NMI + disk hangs.
I have attached the dmesg log from 2.6.34.2-34.
I noticed a message about trying pci=nocrs but I got the NMI with that setting as well.
I'll see if I can get the 2.6.35-2.fc14 kernel from koji to install.
thanks again
Created attachment 437093 [details]
dmesg from 2.6.35-2.fc14
2.6.35-2.fc14 has the same NMI problem. I will try 2.6.36 next.
Created attachment 437142 [details]
dmesg log from 2.6.36-0.0.rc0.git1.fc15 has NMI problem
Created attachment 437143 [details]
dmesg from 2.6.31.12-174.2.22 (latest 2.6.31 on koji): no NMI, no disk hang
Created attachment 437148 [details]
dmesg from 2.6.32-1 (earliest proper 2.6.32 on koji) with NMI
So the NMI+disk hang occur in 2.6.32-1.fc13
but there is no NMI and no disk hang in 2.6.31.12-174.2.22.fc12. I ran this for a couple of hours with no problems.
I had thought that the ACPI Error and the NMI went together but 2.6.32-1 has the NMI but no ACPI Error. So they may not be related.
I can try some earlier 2.6.32 kernels to try to track down when the NMI error started happening if that would help.
Once again thanks for the pointer to koji, it is a great resource.
Created attachment 437202 [details]
dmesg from 2.6.32-0.14.rc0.git18.fc13 with NMI
The 2.6.32-0.14.rc0.git18.fc13 kernel also has the NMI problem. This seems to be the earliest 2.6.32 kernel available from koji.
If there is anything else I can do to help narrow down the problem let me know.
Thanks
My experience with the Proliant series is usually a buggy iLo firmware. If you can, try to update that firmware; it might help. I think you can get it from the HP website.
Otherwise attach the output of 'lspci -vvv' and 'lspci -t' and I'll try to figure out which device is causing the NMI error.
Cheers,
Don
Also, according to https://bugzilla.redhat.com/show_bug.cgi?id=548198, updating the firmware of the SmartArray seemed to have fixed the NMI problem there too.
Cheers,
Don
Created attachment 440926 [details]
lspci -vvv output from 2.6.33 kernel
Thanks for the response.
I had already updated firmware to the latest versions from HP's website prior to posting my initial report. I have checked again today and there do not seem to be any updates since then.
In particular:-
BIOS A22 2/9/2010 i.e. 2010.02.09
iLo 2 v2.00 Jun 21 2010
Smart Array P410 3.30
I did update NIC firmware today but it made no difference.
I have attached the 2.6.33 lspci -vvv output. lspci -t and lspci -vvv from a working 2.6.31 kernel will follow (lspci -t is identical on 2.6.33 and 2.6.31).
By the way I had the machine running without any problems for 2 weeks with the 2.6.31 kernel. I only rebooted to get the lspci output requested.
Created attachment 440927 [details]
lspci -t from 2.6.33 kernel
Created attachment 440928 [details]
lspci -vvv from working 2.6.31 kernel for comparison
What happens when you blacklist the hpwdt and hpilo modules for your 2.6.31 kernel? Those modules don't seem to be running on your 2.6.33 kernel. The hpwdt module in particular takes all the NMIs and logs them, and as a result you will not see the 'unknown NMI' message you see with 2.6.33. On the other hand, your machine should have either panicked or printed another warning in /var/log/messages (or dmesg).
So you can either do this:
echo "blacklist hpwdt" >> /etc/modprobe.d/blacklist.conf
echo "blacklist hpilo" >> /etc/modprobe.d/blacklist.conf
and boot into the 2.6.31 kernel, or install those two modules for your 2.6.33 kernel and see if the behaviour changes.
Cheers,
Don
Created attachment 440979 [details]
lspci -vvv output from 2.6.31 kernel with hpwdt+hpilo modules blacklisted
Sorry, I should have mentioned that the 2.6.33 output from lspci -vvv supplied previously was obtained by booting into rescue mode from a USB stick. I tried to get the output 3 times by booting normally but the disk kept locking up before I got a chance to run lspci. I think that is the reason the hpilo+hpwdt modules were not in use by the 2.6.33 kernel.
After adding hpilo+hpwdt to the blacklist and booting 2.6.31 I still do NOT get the NMI message and disk does not seem to hang up(only up 15 mins but that is a lot longer than it normally takes to hang).
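One way to confirm that the blacklist actually took effect is to look at the loaded-module list. The following is only a minimal sketch: it assumes /proc/modules lists one loaded module per line with the module name in the first field, and the function names module_loaded and report_modules are made up for this illustration.

```shell
#!/bin/sh
# module_loaded NAME [FILE]
# Succeeds if NAME is the first field of some line in FILE.
# FILE defaults to /proc/modules (assumed format: "name size refcount ...").
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' "${2:-/proc/modules}"
}

# report_modules FILE NAME...
# Prints "NAME: loaded" or "NAME: not loaded" for each module name given.
report_modules() {
    file=$1
    shift
    for mod in "$@"; do
        if module_loaded "$mod" "$file"; then
            echo "$mod: loaded"
        else
            echo "$mod: not loaded"
        fi
    done
}

# Example (on the machine under test):
#   report_modules /proc/modules hpwdt hpilo
```

If both modules report "not loaded" and the 2.6.31 kernel still shows no NMI message, hpwdt swallowing the NMI can be ruled out as the reason the older kernel looks healthy.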
I have discovered the cause of the NMI problem.
Initially I was compiling upstream kernels and they all worked. So I switched to using rpmbuild for kernel-2.6.34.6-47.fc13 with patches commented out of the spec file. In the end I isolated the problem to the following one-line patch:
linux-2.6-defaults-aspm.patch
With this patch commented out of the spec file the kernel works fine, i.e. no NMI.
The patch sets aspm_policy=POLICY_POWERSAVE where previously it was unset (probably 0, which is POLICY_DEFAULT; this gets the setting from the BIOS).
Looking through the kernel docs, there is a command line parameter pcie_aspm=off which disables power saving. When the normal Fedora kernel is booted with this parameter the NMI error does NOT occur.
To me it seems a bit odd to set POWERSAVE by default; I would have thought the BIOS setting would be the correct default. Also, with the patch there is no way to tell the kernel to use the BIOS setting: the only values that can be specified to pcie_aspm are 'off' and 'force'.
The other way to control the ASPM mode is to echo values to /sys/module/pcie_aspm/parameters/policy. The options are "default", "performance" and "powersave". This has all the required options, but I'm doubtful there is any way to be sure that this is set before problems occur (e.g. an NMI).
When I boot with pcie_aspm=off and cat /sys/module/pcie_aspm/parameters/policy it shows:
default performance [powersave]
which indicates that powersave is on! I don't understand how this has happened. But the machine seems to be fine, i.e. no NMI, no disk lock-ups.
ISTR there was a reason for that patch. It didn't work out too well in RHEL-6 either. A lot of strange NMIs were the result of that change. mjg can go into more detail but I am not surprised that patch is the culprit. Nice work!
Cheers,
Don
Hi,
Just wondering what, if anything, is happening to the linux-2.6-defaults-aspm.patch? It still seems to be applied in the latest f14 kernel builds.
If it has been decided to retain the patch, perhaps a mention of pcie_aspm=off could be added to the release notes. Also a mention should be added to https://fedoraproject.org/wiki/Common_kernel_problems#Crashes.2FHangs
I see it does get a mention in the "Can't find installation CD/DVD or hard drives" section of the page but that does not apply in this case.
Regards,
Jeremy
Using an HP DL380 G4 with a SmartArray 6i and the current RHEL-6 beta2 kernel 2.6.32-44.2.el6.x86_64 booted with pcie_aspm=off, the NMI and disk hang still occur. I tried to boot into 2.6.31.12-174.2.22.fc12 but unfortunately the encrypted system couldn't be opened; I guess the standard encryption parameters are not supported by that kernel.
rgds Christoph
OK, that's apparently because aes-xts-plain64 was introduced in the meantime. No idea how to safely convert that back to aes-xts-plain (which would be safe since that volume is a lot smaller than 2TB).
Maybe the disk lock problem was supposed to be solved by these changes?
http://www.kernel.org/pub/linux/kernel/v2.6/testing/ChangeLog-2.6.36-rc4
Dan Carpenter (8):
cciss: handle allocation failure
Stephen M. Cameron (3):
cciss: disable doorbell reset on reset_devices
cciss: fix reporting of max queue depth since init
[SCSI] hpsa: disable doorbell reset on reset_devices
It seems this has also been discussed here: https://partner-bugzilla.redhat.com/show_bug.cgi?id=612486
I installed the latest 2.6.36 rc7 from koji and it felt like the system survived a little bit longer. Only a little bit. An oops occurred within the cciss module; it might have tried to free non-allocated memory. Later on there were oopses within the filesystem module and finally the disk lockup problem again. The HP DL380 G4 loads the cciss module (not hpsa). I tried downgrading to kernel 2.6.32-37.el6.x86_64 since I read somewhere that falling back might help, but no luck.
Yeah :) I think I actually wasted people's time reading this. While playing around with the BIOS settings I thought, yeah, hit the memory check button, and DIMM 2 and 3 were faulty. All of a sudden :/ So excuse me: the error went away when I replaced the two RAM modules.
rgds Christoph
We have a rack of HP DL160 G6 machines that were all suffering similar problems: flashing red status lights and the following syslog errors:
kernel: [ 2976.038899] Uhhuh. NMI received for unknown reason 31 on CPU 0.
kernel: [ 2976.038902] Do you have a strange power saving mode enabled?
kernel: [ 2976.038904] Dazed and confused, but trying to continue
kernel: [ 3437.022203] Uhhuh. NMI received for unknown reason 11 on CPU 0.
kernel: [ 3437.022207] Do you have a strange power saving mode enabled?
kernel: [ 3437.022209] Dazed and confused, but trying to continue
Setting pcie_aspm=off appears to have solved the problem. Good catch Jeremy!
Created attachment 481733 [details]
Disable ASPM if the BIOS doesn't support _OSC
AMD-based G6 servers (eg: DL385 G6) did not get the BIOS change that sets the global ASPM disable bit in the FACP/FADT.
In the meantime, please review/test the attached patch. It hasn't been submitted upstream yet. It basically tries to bring back Matthew Garrett's upstream patch (commit: 852972acff8f10f3a15679be2059bb94916cba5d) that was removed via commit: 28eb5f274a305bf3a13b2c80c4804d4515d05c64.
This shouldn't be necessary 28eb5f274a305bf3a13b2c80c4804d4515d05c64. Are you seeing ASPM enabled anyway?
Created attachment 481743 [details]
v2: Disable ASPM if the BIOS doesn't support _OSC
Am unable to locate a DL385 G6 right away. Cc'ing Tony to see if there is one at Red Hat.
Does F13/F14 diverge from upstream in setting up ASPM? From code perusal it appears that if you boot 2.6.38 upstream with policy set to "powersave" you might run into this issue.
Attached (v2: Disable ASPM if the BIOS doesn't support _OSC) is a simpler patch.
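To answer the "are you seeing ASPM enabled anyway?" question from a saved lspci dump, something like the following could be used. It is a rough sketch under stated assumptions: it expects `lspci -vv` style text in which each device starts with its bus address at column 0 and the link control line reads "LnkCtl: ASPM Disabled; ..." or "LnkCtl: ASPM L0s L1 Enabled; ...". The function name aspm_enabled_links is invented for this example.

```shell
#!/bin/sh
# aspm_enabled_links
# Reads lspci -vv style text on stdin and prints the bus address of every
# device whose LnkCtl line mentions ASPM without saying "ASPM Disabled".
aspm_enabled_links() {
    awk '
        # Device header lines start at column 0 with a hex bus address.
        /^[0-9a-fA-F]/ { dev = $1 }
        # LnkCap also mentions ASPM, so only the LnkCtl line is inspected.
        /LnkCtl:/ && /ASPM/ && !/ASPM Disabled/ { print dev }
    '
}

# Example (as root, so that capabilities are visible):
#   lspci -vv | aspm_enabled_links
```

An empty result would suggest ASPM is off on every link, regardless of what the policy file reports.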
Resetting needinfo until the question in comment 27 is answered.
In case you missed it (comment 18), the patch that seems to be responsible for this bug is:
linux-2.6-defaults-aspm.patch
The patch sets aspm_policy=POLICY_POWERSAVE where previously it was unset (probably 0, which is POLICY_DEFAULT; this gets the setting from the BIOS).
Upstream kernels did NOT have the problem.
The patch is still being applied in the latest Fedora kernels (I just checked kernel-2.6.38-0.rc6.git6.1.fc15.src.rpm).
Resetting needinfo until the question in comment 27 is answered.
Naga, I have access to a dl385g7, but not a g6.
Resetting needinfo until the question in comment 27 is answered.
Results from HP:
Am unable to capture "lspci -vvvxxx" output when the failure occurs on F13/F14. However, it seems related to ASPM. When I boot with "pcie_aspm=off" the problems go away.
Fedora13 x86-64:
• Unable to detect the connected hard drive if OS is booted from OS media. Also, getting NMI error.
• Able to install and boot the OS with "pcie_aspm=off" boot parameter (hard drive is also detected with this boot parameter).
• "PCIe ASPM is disabled" message is displayed in "dmesg" output.
• PCIe_ASPM policy is set to "powersave".
• ASPM is disabled for all the PCI devices, probably because I used "pcie_aspm=off".
Fedora14 x86-64:
• Unable to detect the connected hard drive if OS is booted from OS media. Also, getting NMI error.
• Able to install and boot the OS with "pcie_aspm=off" boot parameter.
• "PCIe ASPM is disabled" message is displayed in "dmesg" output.
• PCIe_ASPM policy is set to "powersave".
• ASPM is disabled for all the PCI devices, probably because I used "pcie_aspm=off".
Created attachment 483015 [details]
Fedora 14: Disable ASPM when BIOS doesn't support _OSC
Our QA team has reproduced this problem and, as seen in comment #34, they are unable to capture the relevant information requested in comment #27.
Fedora 14 expects ASPM to get disabled because of the code snippet below:
In ./drivers/acpi/pci_root.c: acpi_pci_root_add()
…
if (status == AE_NOT_EXIST) {
        printk(KERN_INFO "Unable to assume PCIe control: Disabling ASPM\n");
        pcie_no_aspm();
}
…
That code doesn't pick up the failing case where the BIOS doesn't have the special FADT bit set. A patch (please see attached "f14-aspm-disabled.patch") would catch the failing case described by the original reporter of this BZ.
Created attachment 483535 [details]
Fedora 14 x86_64 dmesg after aspmpatch.txt
Created attachment 483537 [details]
Fedora 14 x86_64 policy after aspmpatch.txt
QA reported that the patch in comment #35 fixed the issue. FYI. Please review the attached log files.
Created attachment 483538 [details]
Fedora 14 x86_64 lspcixxxvv after aspmpatch.txt
QA reported that the patch in comment #35 fixed the issue. FYI. Please review the attached log files.
Matt, will this patch get rolled into Fedora? What can we do to help make this happen?
This message is a reminder that Fedora 13 is nearing its end of life. Approximately 30 (thirty) days from now Fedora will stop maintaining and issuing updates for Fedora 13. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '13'.
Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 13's end of life.
Bug Reporter: Thank you for reporting this issue and we are sorry that we may not be able to fix it before Fedora 13 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora please change the 'version' of this bug to the applicable version. If you are unable to change the version, please add a comment here and someone will do it for you.
Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping
I just booted Fedora 15 in rescue mode on the DL385 G6 and it seems to be fine, i.e. no NMI, no disk hang. I also confirmed that Fedora 13 without pcie_aspm=off still has the problem.
So I think this can be marked as fixed in Fedora 15.
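For anyone checking whether the pcie_aspm=off workaround from this thread is in effect on an older release, the two things reported above can be read back as follows. This is a sketch, not a definitive check: it assumes the policy file uses the "default performance [powersave]" layout quoted earlier, with the active entry in square brackets, and the function names are hypothetical. As noted in the thread, the policy file can still show [powersave] even when ASPM was disabled on the command line, so both values are worth looking at.

```shell
#!/bin/sh
# active_aspm_policy [FILE]
# Prints the bracketed (active) entry from the ASPM policy file.
active_aspm_policy() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p' "${1:-/sys/module/pcie_aspm/parameters/policy}"
}

# aspm_disabled_on_cmdline [FILE]
# Succeeds if the kernel command line contains the exact word pcie_aspm=off.
aspm_disabled_on_cmdline() {
    tr ' ' '\n' < "${1:-/proc/cmdline}" | grep -qx 'pcie_aspm=off'
}

# Example:
#   echo "policy: $(active_aspm_policy)"
#   aspm_disabled_on_cmdline && echo "booted with pcie_aspm=off"
```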
Created attachment 434665 [details]
Fedora 13 dmesg output with hang at end
Description of problem:
When booting Fedora 13 on an HP Proliant DL385 G6 with a P410 SmartArray the system reports an NMI error, and when the disk is accessed it hangs unpredictably, but generally after a few minutes at most.
Version-Release number of selected component (if applicable):
kernel is 2.6.33.3-85.fc13.i686 (the x86_64 version does the same thing)
How reproducible:
Boot into rescue mode (normal install mode also has the NMI) from Fedora 13 DVD or CD.
Steps to Reproduce:
1. Boot into rescue mode from Fedora 13 DVD or CD.
2. dmesg output includes the following message (sometimes a1 rather than b1):
Uhhuh. NMI received for unknown reason b1 on CPU 0.
You have some hardware problem, likely on the PCI bus.
Dazed and confused, but trying to continue
3. To hang the disk do something like:
dd if=/dev/zero of=/dev/cciss/c0d0 count=999999
It is generally necessary to repeat the dd a few times to produce the hang.
Actual results:
dmesg output has an NMI error
disk hangs
Expected results:
no NMI error
disk should not hang
Additional info:
Fedora 12 does NOT have the NMI error and does not seem to hang.
The dd command
dd if=/dev/zero of=/dev/cciss/c0d0 count=999999
takes between 16.5s and 17.1s on Fedora 12, but on Fedora 13, when the disk does not hang, it takes about 30s.
Centos 5.5 also does NOT have the problem and the dd command takes between 12.6s and 13.1s.
The dmesg output of Fedora 13 also contains the following messages, which seem strange to me (and which Fedora 12 does not have):
ACPI Error: Field [CDW3] at 96 exceeds Buffer [NULL] size 64 (bits) (20091214/dsopcode-596)
ACPI Error (psparse-0537): Method parse/execution failed [\_SB_._OSC] (Node f6c112b8), AE_AML_BUFFER_LIMIT
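The reproduction steps above can be scripted roughly as below. This is a sketch under assumptions: nmi_reasons and repeat_dd are names invented here, the NMI pattern is taken from the message text quoted in this report, and timing uses whole seconds from `date +%s`. Note that repeat_dd overwrites its target, so pointing it at /dev/cciss/c0d0 as in the report destroys the data on that disk.

```shell
#!/bin/sh
# nmi_reasons
# Reads dmesg or syslog text on stdin and prints one "reason=XX cpu=N"
# line per "NMI received for unknown reason" message.
nmi_reasons() {
    sed -n 's/.*NMI received for unknown reason \([0-9a-fA-F]*\) on CPU \([0-9]*\).*/reason=\1 cpu=\2/p'
}

# repeat_dd TARGET RUNS [COUNT]
# Writes COUNT 512-byte blocks of zeros to TARGET, RUNS times, printing the
# elapsed whole seconds per run. COUNT defaults to the 999999 used above.
# WARNING: TARGET is overwritten.
repeat_dd() {
    target=$1
    runs=$2
    count=${3:-999999}
    i=1
    while [ "$i" -le "$runs" ]; do
        start=$(date +%s)
        dd if=/dev/zero of="$target" count="$count" 2>/dev/null || return 1
        end=$(date +%s)
        echo "run $i: $((end - start))s"
        i=$((i + 1))
    done
}

# Example (destructive, only on a scratch disk):
#   dmesg | nmi_reasons
#   repeat_dd /dev/cciss/c0d0 5
```

If the disk hangs as described, the loop simply stops producing "run N" lines, which makes the hang easy to spot from a second console.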