Bug 618612

Summary: HP DL385 G6 with P410 smart array gives NMI error on boot and disk hangs/locks up
Product: [Fedora] Fedora
Reporter: Jeremy Faith <j.faithw>
Component: kernel
Assignee: John Feeney <jfeeney>
Status: CLOSED CURRENTRELEASE
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium
Docs Contact:
Priority: low
Version: 13
CC: anton, christoph.sievers, dougsland, dzickus, elliott.forney, gansalmon, itamar, jonathan, kernel-maint, madhu.chinakonda, mishu, nagananda.chumbalkar, tcamuso, trinh.dao
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-06-15 21:45:37 UTC Type: ---
Attachments:
Description Flags
Fedora 13 dmesg output with hang at end
none
Fedora 12 dmesg output for comparison
none
F13 dmesg log with acpi.debug_layer=0x20 acpi.debug_level=0xffffffff
none
dmesg log from 2.6.34.2-34
none
dmesg from 2.6.35-2.fc14
none
dmesg log from 2.6.36-0.0.rc0.git1.fc15 has NMI problem
none
dmesg from 2.6.31.12-174.2.22(latest 2.6.31 on konji) no NMI+no disk hang
none
dmesg from 2.6.32-1(earliest proper 2.6.32 on konji) with NMI
none
dmesg from 2.6.32-0.14.rc0.git18.fc13 with NMI
none
lspci -vvv output from 2.6.33 kernel
none
lspci -t from 2.6.33 kernel
none
lspci -vvv from working 2.6.31 kernel for comparison
none
lspci -vvv output from 2.6.31 kernel with hpwdt+hpilo modules blacklisted
none
Disable ASPM if the BIOS doesn't support _OSC
none
v2: Disable ASPM if the BIOS doesn't support _OSC
none
Fedora 14: Disable ASPM when BIOS doesn't support _OSC
none
Fedora 14 x86_64 dmesg after aspmpatch.txt
none
Fedora 14 x86_64 policy after aspmpatch.txt
none
Fedora 14 x86_64 lspcixxxvv after aspmpatch.txt
none

Description Jeremy Faith 2010-07-27 11:26:11 UTC
Created attachment 434665 [details]
Fedora 13 dmesg output with hang at end

Description of problem:
When booting Fedora 13 on an HP Proliant DL385 G6 with a P410 SmartArray, the system reports an NMI error. When the disk is accessed it hangs unpredictably, generally within a few minutes at most.

Version-Release number of selected component (if applicable):
kernel is 2.6.33.3-85.fc13.i686(the x86_64 version does the same thing)

How reproducible:
Boot into rescue mode(normal install mode also has the NMI) from Fedora 13 DVD or CD.

Steps to Reproduce:
1.Boot into rescue mode from Fedora 13 DVD or CD.
2.dmesg
the output includes the following message (sometimes a1 rather than b1):-
  Uhhuh. NMI received for unknown reason b1 on CPU 0.
  You have some hardware problem, likely on the PCI bus.
  Dazed and confused, but trying to continue
3.to hang the disk do something like:-
  dd if=/dev/zero of=/dev/cciss/c0d0 count=999999
it is generally necessary to repeat the dd a few times to produce the hang 
  
Actual results:
dmesg output has an NMI error
disk hangs

Expected results:
no NMI error
disk should not hang

Additional info:
Fedora 12 does NOT have the NMI error and does not seem to hang.
The dd command
  dd if=/dev/zero of=/dev/cciss/c0d0 count=999999 
takes between 16.5s and 17.1s on Fedora 12, but on Fedora 13, when the disk does not hang, it takes about 30s.
CentOS 5.5 also does NOT have the problem, and the dd command takes between 12.6s and 13.1s.
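
The repeated dd runs and their timings could be scripted roughly as below. This is only a sketch: the function name timed_dd and its arguments are hypothetical, and writing to a real device such as /dev/cciss/c0d0 destroys its contents. If the disk hangs, dd never returns, so the last "run" line printed shows how far the test got.

```shell
# Hypothetical helper: run the dd write test several times and print
# how long each run took (whole seconds).
#   $1 = target to overwrite (the report used /dev/cciss/c0d0)
#   $2 = number of runs
#   $3 = dd block count per run (the report used 999999)
timed_dd() {
    target=$1
    runs=$2
    count=$3
    i=1
    while [ "$i" -le "$runs" ]; do
        start=$(date +%s)
        # dd's own statistics go to stderr; discard them here
        dd if=/dev/zero of="$target" count="$count" 2>/dev/null || return 1
        end=$(date +%s)
        echo "run $i: $((end - start))s"
        i=$((i + 1))
    done
}
```

For example, timed_dd /dev/cciss/c0d0 5 999999 would repeat the test from this report five times.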

The dmesg output of Fedora 13 also contains the following messages, which seem strange to me (and which Fedora 12 does not have):-
ACPI Error: Field [CDW3] at 96 exceeds Buffer [NULL] size 64 (bits) (20091214/dsopcode-596)
ACPI Error (psparse-0537): Method parse/execution failed [\_SB_._OSC] (Node f6c112b8), AE_AML_BUFFER_LIMIT

Comment 1 Jeremy Faith 2010-07-27 11:27:22 UTC
Created attachment 434666 [details]
Fedora 12 dmesg output for comparison

Comment 2 Jeremy Faith 2010-08-05 08:41:40 UTC
Hi,

I managed to create a bootable USB key with a debug kernel. In the hope that the ACPI error is somehow related to the NMI, I used the following command line:

Kernel command line: initrd=/debug/debug_initrd.img log_buf_len=8M text selinux=0 nomodeset BOOT_IMAGE=/debug/vmlinuz-2.6.33.3-85.fc13.i686.debug rescue acpi.debug_layer=0x20 acpi.debug_level=0xffffffff

This produces a large dmesg log(hence the need for log_buf_len=8M).

I should also say that I have tried each of the following kernel parameters(individually) and the NMI and disk hangs still occur:-
  acpi=off
  nolapic
  noapic
  nolapic_timer
  nohz=off

Are there any other things worth trying?

Comment 3 Jeremy Faith 2010-08-05 08:44:27 UTC
Created attachment 436780 [details]
F13 dmesg log with acpi.debug_layer=0x20 acpi.debug_level=0xffffffff

Comment 4 Chuck Ebbert 2010-08-05 18:32:05 UTC
The latest release kernel is 2.6.33.6-147.2.4. Always try the latest release before reporting kernel bugs.

You could also try 2.6.34.2-33 from koji to see if this is fixed in 2.6.34.

Comment 5 Jeremy Faith 2010-08-06 09:53:03 UTC
Created attachment 437089 [details]
dmesg log from 2.6.34.2-34

Hi, thanks for the response, I had forgotten koji existed.

It was a bit difficult to try updated kernels as I had not been able to install due to the disk lockups. That is why I made the bootable USB key. I did manage to get a limited F13 installed via the USB key (I guess timings were a bit different than from a DVD, possibly faster).

Both kernel-2.6.33.6-147.2.4.fc13.i686 and kernel-2.6.34.2-34.fc13.i686.rpm have the same problem: NMI + disk hangs.

I have attached the dmesg log from 2.6.34.2-34.
I noticed a message about trying pci=nocrs but I got the NMI with that setting as well.

I'll see if I can get the 2.6.35-2.fc14 kernel from koji to install.

thanks again

Comment 6 Jeremy Faith 2010-08-06 10:08:49 UTC
Created attachment 437093 [details]
dmesg from 2.6.35-2.fc14

2.6.35-2.fc14 has the same NMI problem. I will try 2.6.36 next.

Comment 7 Jeremy Faith 2010-08-06 13:28:25 UTC
Created attachment 437142 [details]
dmesg log from 2.6.36-0.0.rc0.git1.fc15 has NMI problem

Comment 8 Jeremy Faith 2010-08-06 13:30:38 UTC
Created attachment 437143 [details]
dmesg from 2.6.31.12-174.2.22(latest 2.6.31 on konji) no NMI+no disk hang

Comment 9 Jeremy Faith 2010-08-06 13:41:09 UTC
Created attachment 437148 [details]
dmesg from 2.6.32-1(earliest proper 2.6.32 on konji) with NMI

So the NMI + disk hang occur in 2.6.32-1.fc13,
but there is no NMI and no disk hang in 2.6.31.12-174.2.22.fc12; I ran this for a couple of hours with no problems.
I had thought that the ACPI Error and the NMI went together, but 2.6.32-1 has the NMI and no ACPI Error. So they may not be related.

I can try some earlier 2.6.32 kernels to try to track down when the NMI error started happening if that would help.

Once again thanks for the pointer to koji, it is a great resource.

Comment 10 Jeremy Faith 2010-08-06 16:38:58 UTC
Created attachment 437202 [details]
dmesg from 2.6.32-0.14.rc0.git18.fc13 with NMI

The 2.6.32-0.14.rc0.git18.fc13 kernel also has the NMI problem. This seems to be the earliest 2.6.32 kernel available from koji.

If there is anything else I can do to help narrow down the problem let me know.

Thanks

Comment 11 Don Zickus 2010-08-23 13:38:48 UTC
In my experience, problems like this on the Proliant series are usually caused by buggy iLo firmware.  If you can update that firmware, it might help.  I think you can get it from the HP website.

Otherwise attach the output of 'lspci -vvv' and 'lspci -t' and I'll try to figure out which device is causing the NMI error.

Cheers,
Don

Comment 12 Don Zickus 2010-08-23 13:44:40 UTC
Also according to 
https://bugzilla.redhat.com/show_bug.cgi?id=548198

updating the firmware of the SmartArray seemed to have fixed the NMI problem there too.

Cheers,
Don

Comment 13 Jeremy Faith 2010-08-25 13:21:25 UTC
Created attachment 440926 [details]
lspci -vvv output from 2.6.33 kernel

Thanks for the response.

I had already updated firmware to the latest versions from HP's website prior to posting my initial report. I have checked again today and there do not seem to be any updates since then.
In particular:-
  BIOS             A22 2/9/2010 i.e. 2010.02.09
  iLo              2 v2.00 Jun 21 2010
  Smart Array P410 3.30

I did update NIC firmware today but it made no difference.

I have attached the 2.6.33 lspci -vvv output. lspci -t and lspci -vvv from a working 2.6.31 kernel to follow(lspci -t is identical on 2.6.33 and 2.6.31).

By the way I had the machine running without any problems for 2 weeks with the 2.6.31 kernel. I only rebooted to get the lspci output requested.

Comment 14 Jeremy Faith 2010-08-25 13:22:07 UTC
Created attachment 440927 [details]
lspci -t from 2.6.33 kernel

Comment 15 Jeremy Faith 2010-08-25 13:23:11 UTC
Created attachment 440928 [details]
lspci -vvv from working 2.6.31 kernel for comparison

Comment 16 Don Zickus 2010-08-25 15:01:39 UTC
What happens when you blacklist the hpwdt and hpilo modules for your 2.6.31 kernel?  Those modules don't seem to be running on your 2.6.33 kernel.

The hpwdt module in particular takes all the NMIs and logs them, and as a result you will not see the 'unknown NMI' message you see with 2.6.33.  On the other hand, your machine should have either panicked or printed another warning in /var/log/messages (or dmesg).

So you can either do this:

echo "blacklist hpwdt" >> /etc/modprobe.d/blacklist.conf
echo "blacklist hpilo" >> /etc/modprobe.d/blacklist.conf

and boot into the 2.6.31 kernel

or install those two modules for your 2.6.33 kernel and see if the behaviour changes.

Cheers,
Don

Comment 17 Jeremy Faith 2010-08-25 15:51:26 UTC
Created attachment 440979 [details]
lspci -vvv output from 2.6.31 kernel with hpwdt+hpilo modules blacklisted

Sorry, I should have mentioned that the 2.6.33 output from lspci -vvv supplied previously was obtained by booting into rescue mode from a USB stick. I tried to get the output 3 times by booting normally but the disk kept locking up before I got a chance to run lspci. I think that is the reason the hpilo+hpwdt modules were not in use by the 2.6.33 kernel.

After adding hpilo+hpwdt to the blacklist and booting 2.6.31 I still do NOT get the NMI message and disk does not seem to hang up(only up 15 mins but that is a lot longer than it normally takes to hang).

Comment 18 Jeremy Faith 2010-09-01 14:23:01 UTC
I have discovered the cause of the NMI problem.
Initially I was compiling upstream kernels and they all worked.
So I switched to using rpmbuild for kernel-2.6.34.6-47.fc13 with patches commented out of the spec file.
In the end I isolated the problem to the following one line patch
  linux-2.6-defaults-aspm.patch
With this patch commented out of the spec file the kernel works fine i.e. no NMI.

The patch sets aspm_policy=POLICY_POWERSAVE where previously it was unset (probably 0, which is POLICY_DEFAULT; this takes the setting from the BIOS).

Looking through kernel docs there is a command line parameter
  pcie_aspm=off
which disables power saving. When the normal Fedora kernel is booted with this parameter the NMI error does NOT occur.

To me it seems a bit odd to set POWERSAVE by default; I would have thought the BIOS setting would be the correct default. Also, with the patch there is no way to tell the kernel to use the BIOS setting: the only values that can be specified to pcie_aspm are 'off' and 'force'.

The other way to control the aspm mode is to echo values to 
  /sys/module/pcie_aspm/parameters/policy.
The options are "default", "performance" and "powersave". This has all the required options, but I doubt there is any way to be sure that it is set before problems occur (e.g. an NMI).

When I boot with pcie_aspm=off and cat /sys/module/pcie_aspm/parameters/policy it shows:-
default performance [powersave]

Which indicates that powersave is on! I don't understand how this has happened. But the machine seems to be fine i.e. no NMI, no disk lock ups.
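
Since the policy file still shows powersave even when booted with pcie_aspm=off, it may be worth checking the policy and the kernel command line together. A rough sketch follows; the function names are hypothetical, and the file arguments exist only so the parsing can be tried on saved copies of the two files.

```shell
# Print the active ASPM policy, i.e. the entry shown in [brackets].
#   $1 = policy file (defaults to the sysfs path mentioned above)
active_aspm_policy() {
    policy_file=${1:-/sys/module/pcie_aspm/parameters/policy}
    sed -n 's/.*\[\([^]]*\)].*/\1/p' "$policy_file"
}

# Succeed if the kernel command line contains pcie_aspm=off.
#   $1 = command line file (defaults to /proc/cmdline)
aspm_disabled_on_cmdline() {
    cmdline_file=${1:-/proc/cmdline}
    # one parameter per line, then look for an exact match
    tr ' ' '\n' < "$cmdline_file" | grep -qx 'pcie_aspm=off'
}
```

Running active_aspm_policy with no arguments prints the bracketed entry from the live system; aspm_disabled_on_cmdline && echo "booted with pcie_aspm=off" reports whether the boot parameter was given.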

Comment 19 Don Zickus 2010-09-02 15:12:17 UTC
ISTR there was a reason for that patch.  It didn't work out too well in RHEL-6 either.  A lot of strange NMIs were the result of that change.  mjg can go into more detail but I am not surprised that patch is the culprit.

Nice work!

Cheers,
Don

Comment 20 Jeremy Faith 2010-09-30 15:26:56 UTC
Hi,

Just wondering what if anything is happening to the linux-2.6-defaults-aspm.patch
It still seems to be being applied in the latest f14 kernel builds.
If it has been decided to retain the patch perhaps a mention of pcie_aspm=off could be added to release notes. 

Also a mention should be added to 
  https://fedoraproject.org/wiki/Common_kernel_problems#Crashes.2FHangs
I see it does get a mention in the
  Can't find installation CD/DVD or hard drives 
section of the page but that does not apply in this case.

Regards,
Jeremy

Comment 21 Christoph Sievers 2010-10-09 21:25:25 UTC
Using a HP DL380 G4 with a SmartArray 6i
using current RHEL-6 beta2 Kernel 2.6.32-44.2.el6.x86_64
booted with pcie_aspm=off

NMI and Disk hang occurs.

I tried to boot into 2.6.31.12-174.2.22.fc12 but unfortunately the encrypted System couldn't be opened - guess the standard encryption parameters are not supported by that kernel or so.


rgds
Christoph

Comment 22 Christoph Sievers 2010-10-10 08:50:22 UTC
OK, that's apparently because aes-xts-plain64 was introduced in the meantime. No idea how to safely convert that back to aes-xts-plain (which would be safe since that volume is much smaller than 2TB).

Comment 23 Christoph Sievers 2010-10-10 14:19:00 UTC
Maybe the disk lock problem was supposed to be solved by these changes?

http://www.kernel.org/pub/linux/kernel/v2.6/testing/ChangeLog-2.6.36-rc4
Dan Carpenter (8):
      cciss: handle allocation failure
Stephen M. Cameron (3):
      cciss: disable doorbell reset on reset_devices
      cciss: fix reporting of max queue depth since init
      [SCSI] hpsa: disable doorbell reset on reset_devices

seems this has also been discussed here

https://partner-bugzilla.redhat.com/show_bug.cgi?id=612486

I installed the latest 2.6.36 rc7 from koji and it felt like the system survived a little bit longer.

Only a little bit. An oops occurred within the cciss module, as it might have tried to free unallocated memory. Later on there were oopses within the filesystem module, and finally the disk lockup problem appeared again.

The HP DL380 G4 loads the cciss module (not hpsa).

I tried downgrading to Kernel 2.6.32-37.el6.x86_64 since I read somewhere that falling back might help but no luck.

Comment 24 Christoph Sievers 2010-10-10 20:11:26 UTC
Yeah :)

I think I actually wasted people's time reading this. While playing around with the BIOS settings I thought, yeah, hit the memory check button, and

DIMM 2 and 3 were faulty. All of a sudden :/

So

Excuse me - the error went away when I replaced the two RAM modules.

rgds
Christoph

Comment 25 Elliott Forney 2011-02-02 17:42:04 UTC
We have a rack of HP DL160 G6 machines that were all suffering similar problems:  flashing red status lights and the following syslog errors

kernel: [ 2976.038899] Uhhuh. NMI received for unknown reason 31 on CPU 0.
kernel: [ 2976.038902] Do you have a strange power saving mode enabled?
kernel: [ 2976.038904] Dazed and confused, but trying to continue
kernel: [ 3437.022203] Uhhuh. NMI received for unknown reason 11 on CPU 0.
kernel: [ 3437.022207] Do you have a strange power saving mode enabled?
kernel: [ 3437.022209] Dazed and confused, but trying to continue

Setting pcie_aspm=off appears to have solved the problem.  Good catch Jeremy!
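
To see which "unknown reason" codes a machine is reporting and how often, the log can be summarised with something like the sketch below. The function name nmi_reasons is hypothetical; it only assumes the message wording shown in the log lines above.

```shell
# Read kernel log text on stdin and print each distinct
# "unknown reason" code followed by the number of times it appeared.
nmi_reasons() {
    sed -n 's/.*NMI received for unknown reason \([0-9a-f][0-9a-f]*\) on CPU.*/\1/p' |
        sort | uniq -c | awk '{ print $2, $1 }'
}
```

For example: dmesg | nmi_reasons, or nmi_reasons < /var/log/messages.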

Comment 26 Naga Chumbalkar 2011-03-01 22:50:36 UTC
Created attachment 481733 [details]
Disable ASPM if the BIOS doesn't support _OSC

AMD-based G6 servers (eg: DL385 G6) did not get the BIOS change that sets the global ASPM disable bit in the FACP/FADT. 

In the meantime, please review/test the attached patch. It hasn't been submitted upstream yet. It basically tries to bring back Matthew Garrett's upstream patch (commit: 852972acff8f10f3a15679be2059bb94916cba5d) that was removed via commit: 28eb5f274a305bf3a13b2c80c4804d4515d05c64.

Comment 27 Matthew Garrett 2011-03-01 23:10:21 UTC
This shouldn't be necessary after 28eb5f274a305bf3a13b2c80c4804d4515d05c64. Are you seeing ASPM enabled anyway?

Comment 28 Naga Chumbalkar 2011-03-01 23:46:05 UTC
Created attachment 481743 [details]
v2: Disable ASPM if the BIOS doesn't support _OSC

Am unable to locate a DL385 G6 right away. Cc'ing Tony to see if there is one at Red Hat.

Does F13/F14 diverge from upstream in setting up ASPM? From code perusal it appears that if you boot 2.6.38 upstream with policy set to "powersave" you might run into this issue.

Attached (v2: Disable ASPM if the BIOS doesn't support _OSC) is a simpler patch.

Comment 29 Matthew Garrett 2011-03-01 23:47:28 UTC
Resetting needinfo until the question in comment 27 is answered.

Comment 30 Jeremy Faith 2011-03-02 10:18:12 UTC
In case you missed it (comment 18), the patch that seems to be responsible for this bug is: linux-2.6-defaults-aspm.patch

The patch sets aspm_policy=POLICY_POWERSAVE where previously it was
unset (probably 0, which is POLICY_DEFAULT; this takes the setting from the BIOS).

Upstream kernels did NOT have the problem.

The patch is still being applied in the latest Fedora kernels(I just checked kernel-2.6.38-0.rc6.git6.1.fc15.src.rpm).

Comment 31 Matthew Garrett 2011-03-02 12:42:44 UTC
Resetting needinfo until the question in comment 27 is answered.

Comment 32 Tony Camuso 2011-03-02 13:13:20 UTC
Naga, 

I have access to a dl385g7, but not a g6.

Comment 33 Matthew Garrett 2011-03-02 13:36:25 UTC
Resetting needinfo until the question in comment 27 is answered.

Comment 34 Sandy Garza 2011-03-03 22:30:02 UTC
Results from HP:

Am unable to capture “lspci -vvvxxx” output when the failure occurs on F13/F14. However, it seems related to ASPM. When I boot with “pcie_aspm=off” the problems go away.

Fedora13  x86-64:
•	Unable to detect the connected hard drive if OS is booted from OS media. Also, getting NMI error
•	Able to install and boot the OS with “pcie_aspm=off” boot parameter (Hard drive is also detected with this boot parameter)
•	“PCIe ASPM is disabled” message is displayed in “dmesg” output.
•	PCIe_ASPM policy is set to “powersave”
•	ASPM is disabled for all the PCI devices – probably because I used “pcie_aspm=off”


Fedora14  x86-64:
•	Unable to detect the connected hard drive if OS is booted from OS media. Also, getting NMI error.
•	Able to install and boot the OS with “pcie_aspm=off”  boot parameter
•	“PCIe ASPM is disabled” message is displayed in “dmesg” output.
•	PCIe_ASPM policy is set to “powersave”
•	ASPM is disabled for all the PCI devices – probably because I used “pcie_aspm=off”.
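
A per-device check of whether ASPM is actually enabled could be scripted along the lines below. This is a sketch under assumptions: the function name is hypothetical, and it assumes that lspci -vvv prints a "LnkCtl:" line containing either "ASPM Disabled" or "ASPM ... Enabled" for each PCIe device, which may vary between lspci versions.

```shell
# Read "lspci -vvv" output on stdin and print the header line of every
# device whose LnkCtl line reports ASPM as something other than Disabled.
aspm_enabled_devices() {
    awk '
        /^[0-9a-f]/ { dev = $0 }   # an unindented line starts a new device
        /LnkCtl:/ && /ASPM/ && !/ASPM Disabled/ { print dev }
    '
}
```

For example: lspci -vvv | aspm_enabled_devices. No output would mean no link reports ASPM enabled, under the format assumption above.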

Comment 35 Naga Chumbalkar 2011-03-08 21:03:52 UTC
Created attachment 483015 [details]
Fedora 14: Disable ASPM when BIOS doesn't support _OSC

Our QA team has reproduced this problem and as seen in comment #34 they are unable to capture the relevant information requested in comment #27.

Fedora 14 expects ASPM to get disabled because of the code snippet below:
In ./drivers/acpi/pci_root.c: acpi_pci_root_add()
…
      if (status == AE_NOT_EXIST) {
            printk(KERN_INFO "Unable to assume PCIe control: Disabling ASPM\n");
            pcie_no_aspm();
      }
…

That code doesn’t pick up the failing case where the BIOS doesn’t have the special FADT bit set. A patch (please see attached “f14-aspm-disabled.patch”) would catch the failing case described by the original reporter of this BZ.

Comment 36 Trinh 2011-03-10 18:37:17 UTC
Created attachment 483535 [details]
Fedora 14 x86_64 dmesg after aspmpatch.txt

Comment 37 Trinh 2011-03-10 18:39:07 UTC
Created attachment 483537 [details]
Fedora 14 x86_64 policy after aspmpatch.txt

QA reported that the patch in comment #35 fixed the issue. FYI. Please review the attached log files.

Comment 38 Trinh 2011-03-10 18:40:11 UTC
Created attachment 483538 [details]
Fedora 14 x86_64 lspcixxxvv after aspmpatch.txt

QA reported that the patch in comment #35 fixed the issue. FYI. Please review the attached log files.

Comment 39 Tony Camuso 2011-03-11 15:26:48 UTC
Matt,

Will this patch get rolled into Fedora?

What can we do to help make this happen?

Comment 40 Bug Zapper 2011-06-01 12:44:30 UTC
This message is a reminder that Fedora 13 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 13.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '13'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 13's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 13 is end of life.  If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 41 Jeremy Faith 2011-06-01 14:14:20 UTC
I just booted Fedora 15 in rescue mode on the DL385 G6 and it seems to be fine i.e. no NMI, no disk hang. I also confirmed that Fedora 13 without pcie_aspm=off still has the problem. So I think this can be marked as fixed in Fedora 15.