Bug 681017

Summary: 82576 stuck after PCI AER error
Product: Red Hat Enterprise Linux 6 Reporter: Alex Williamson <alex.williamson>
Component: kernel Assignee: Alex Williamson <alex.williamson>
Status: CLOSED ERRATA QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 6.2 CC: agospoda, ddugger, ddutile, dhoward, dts, jburke, jfeeney, jwest, kzhang, mishu, prarit, sassmann, syeghiay
Target Milestone: rc Keywords: ZStream
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: kernel-2.6.32-128.el6 Doc Type: Bug Fix
Doc Text:
Under some circumstances, faulty logic in the system BIOS could report that ASPM (Active State Power Management) was not supported on the system, but leave ASPM enabled on a device. This could lead to AER (Advanced Error Reporting) errors that the kernel was unable to handle. With this update, the kernel proactively disables ASPM on devices when the BIOS reports that ASPM is not supported, safely eliminating the aforementioned issues.
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-05-19 12:44:19 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 689015, 694073    
Attachments:
Description  Flags
console log  none
lspci -vvv   none
lspci -vvv   none
DSDT         none

Description Alex Williamson 2011-02-28 19:50:09 UTC
Description of problem:
82576 seems to be able to generate errors that AER can't deal with and that cause the system to hang.  On my system I see:

Red Hat Enterprise Linux Server release 6.0 Beta (Santiago)
Kernel 2.6.32-114.0.1.el6.x86_64 on an x86_64

s20 login: pcieport 0000:00:07.0: AER: Corrected error received: id=0038
pcieport 0000:00:07.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer,
 id=0038(Transmitter ID)
pcieport 0000:00:07.0:   device [8086:340e] error status/mask=00001000/00002000
pcieport 0000:00:07.0:    [12] Replay Timer Timeout  
Ebtables v2.0 registered
ip6_tables: (C) 2000-2006 Netfilter Core Team
lo: Disabled Privacy Extensions
Intel(R) Gigabit Ethernet Network Driver - version 2.1.0-k2
Copyright (c) 2007-2009 Intel Corporation.
igb 0000:03:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
pcieport 0000:00:07.0: AER: Corrected error received: id=0038
pcieport 0000:00:07.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer,
 id=0038(Transmitter ID)
pcieport 0000:00:07.0:   device [8086:340e] error status/mask=00001000/00002000
pcieport 0000:00:07.0:    [12] Replay Timer Timeout  
pcieport 0000:00:07.0: AER: Corrected error received: id=0038

These errors seem to follow a dual-port 82576.  The only way I can use the card is to boot with pci=noaer, then the device seems to work correctly.
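
For readers decoding the log above, here is a minimal sketch (not kernel code) of how the "id=0038" and "error status/mask=00001000/00002000" fields can be read. The bit names follow the PCI Express AER Correctable Error Status register layout; the helper names cor_err_name() and format_source_id() are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Bit positions in the PCIe AER Correctable Error Status register,
 * named as in the PCI Express Base Specification. */
struct cor_err_bit {
    unsigned int bit;
    const char *name;
};

static const struct cor_err_bit cor_err_bits[] = {
    {  0, "Receiver Error" },
    {  6, "Bad TLP" },
    {  7, "Bad DLLP" },
    {  8, "REPLAY_NUM Rollover" },
    { 12, "Replay Timer Timeout" },
    { 13, "Advisory Non-Fatal Error" },
};

/* Hypothetical helper: name of the lowest known bit set in 'status',
 * or NULL if no known bit is set. */
static const char *cor_err_name(uint32_t status)
{
    for (size_t i = 0; i < sizeof(cor_err_bits) / sizeof(cor_err_bits[0]); i++) {
        if (status & (UINT32_C(1) << cor_err_bits[i].bit))
            return cor_err_bits[i].name;
    }
    return NULL;
}

/* Hypothetical helper: format a 16-bit source ID (the "id=0038" field)
 * as bus:device.function. */
static void format_source_id(uint16_t id, char *buf, size_t len)
{
    unsigned int bus = id >> 8;          /* bits 15:8 */
    unsigned int dev = (id >> 3) & 0x1f; /* bits 7:3  */
    unsigned int fn  = id & 0x7;         /* bits 2:0  */

    assert(len >= sizeof("00:00.0"));
    snprintf(buf, len, "%02x:%02x.%x", bus, dev, fn);
}
```

With the values from the log, status 00001000 has only bit 12 set (Replay Timer Timeout), mask 00002000 has only bit 13 set (Advisory Non-Fatal errors are masked), and id 0038 formats as 00:07.0, which matches the reporting port 0000:00:07.0.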

Version-Release number of selected component (if applicable):
2.6.32-114.0.1.el6.x86_64

How reproducible:
100%

Steps to Reproduce:
1. ??
2.
3.
  
Actual results:
82576 no longer boots w/o pci=noaer

Expected results:
a) the card shouldn't have died, and if it did, it still works surprisingly well with pci=noaer
b) aer errors shouldn't hang the systems

Additional info:
will upload logs

Comment 1 Alex Williamson 2011-02-28 19:53:05 UTC
Created attachment 481431 [details]
console log

Console log.  First boot is filled with AER errors, but the device still worked.  On the next boots the system would hang when I loaded the igb driver.  Subsequent boots show me moving the card to different slots; the errors follow the card.  Eventually I got the card booted with the pci=noaer command line option.

Comment 2 Alex Williamson 2011-02-28 19:55:07 UTC
Created attachment 481432 [details]
lspci -vvv

This is with the dual-port 82576 back in the original slot

Comment 3 Alex Williamson 2011-02-28 19:56:43 UTC
Created attachment 481433 [details]
lspci -vvv

card back in original slot

Comment 5 Stefan Assmann 2011-03-04 10:10:31 UTC
Alex,

thanks for reporting. Is this a lab machine I can access?

Comment 6 Don Dutile (Red Hat) 2011-03-04 14:30:58 UTC
(In reply to comment #5)
> Alex,
> 
> thanks for reporting. Is this a lab machine I can access?

No, it's Alex's at-home/remote system.

Comment 7 Alex Williamson 2011-03-04 15:24:03 UTC
(In reply to comment #5)
> thanks for reporting. Is this a lab machine I can access?

Yeah, as Don said it's my home devel/test box.  I can set up remote access to it, but it's not quite as convenient as a beaker/lab system.  Let me know if there are more logs to collect or if I can set up remote access for you.

Comment 8 Stefan Assmann 2011-03-07 10:04:35 UTC
Ok, not sure which one to blame, AER or igb. I don't think it's necessary to set up remote access yet.
- Does the problem persist with a recent upstream kernel (with AER enabled)? If yes we might ask upstream to investigate.
- Try pci=nomsi
- modprobe igb with the network cable unplugged
- if possible try to swap the card for another igb card for testing

By examining the console log I suspect a locking issue with PCI/AER causing your system to hang.
INFO: task events/0:35 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
events/0      D 0000000000000000     0    35      2 0x00000000
 ffff8803718a3c10 0000000000000046 0000000000000000 0000000000000060
 ffff8803718a3b80 ffffffff8126aef6 ffff8803718a3bb0 00000000fffd3c2b
 ffff8803718a1af8 ffff8803718a3fd8 000000000000f558 ffff8803718a1af8
Call Trace:
 [<ffffffff8126aef6>] ? __const_udelay+0x46/0x50
 [<ffffffff814d7e95>] schedule_timeout+0x215/0x2e0
 [<ffffffff814d7166>] ? thread_return+0x4e/0x778
 [<ffffffff814d8db2>] __down+0x72/0xb0
 [<ffffffff810935c1>] down+0x41/0x50
 [<ffffffff81286af0>] ? find_device_iter+0x0/0x170
 [<ffffffff81277d36>] pci_walk_bus+0x66/0xd0
 [<ffffffff81286cb4>] find_source_device+0x54/0xa0
 [<ffffffff812874c6>] aer_isr+0x66/0x470
 [<ffffffff81287460>] ? aer_isr+0x0/0x470

Prarit, thoughts?

Comment 9 Prarit Bhargava 2011-03-07 13:16:08 UTC
I'll add this to my list of things to look at -- does the problem persist if you disable AER?

P.

Comment 10 Don Dutile (Red Hat) 2011-03-07 14:56:56 UTC
(In reply to comment #9)
> I'll add this to my list of things to look at -- does the problem persist if
> you disable AER?
> 
> P.

In description (and conversing w/aw about his problem earlier),
it goes away if pci=noaer on kernel cmdline.

Comment 11 Alex Williamson 2011-03-07 17:33:42 UTC
(In reply to comment #8)
> Ok, not sure which one to blame AER or igb. I don't think it's necessary to
> setup remote access yet.
> - Does the problem persist with a recent upstream kernel (with AER enabled)? If
> yes we might ask upstream to investigate.

No, seems I can boot upstream with AER enabled and I get no complaints when loading the igb module.

> - Try pci=nomsi
> - modprobe igb with the network cable unplugged

Yet to try these.

> - if possible try to swap the card for another igb card for testing

No such luck, this is the only card I have.

Comment 12 Alex Williamson 2011-03-07 18:14:32 UTC
(In reply to comment #8)
> - Try pci=nomsi

Works, no errors.

> - modprobe igb with the network cable unplugged

No difference

Comment 13 Alex Williamson 2011-03-07 18:45:49 UTC
I get the same errors on both rhel6 and stock 2.6.32, but it appears to be working on latest linux-2.6.git.  I'll try to bisect this down unless you have some more targeted things to try.

Comment 14 Stefan Assmann 2011-03-07 19:02:40 UTC
Thanks Alex, bisecting will really help tracking it down!

Comment 15 Alex Williamson 2011-03-09 15:22:19 UTC
Ok, git bisect gets us to this commit fixing it:

commit 2f671e2dbff6eb5ef4e2600adbec550c13b8fe72
Author: Matthew Garrett <mjg>
Date:   Mon Dec 6 14:00:56 2010 -0500

    PCI: Disable ASPM if BIOS asks us to
    
    We currently refuse to touch the ASPM registers if the BIOS tells us that
    ASPM isn't supported. This can cause problems if the BIOS has (for any
    reason) enabled ASPM on some devices anyway. Change the code such that we
    explicitly clear ASPM if the FADT indicates that ASPM isn't supported,
    and make sure we tidy up appropriately on device removal in order to deal
    with the hotplug case. If ASPM is disabled because the BIOS doesn't hand
    over control then we won't touch the registers.
    
    Signed-off-by: Matthew Garrett <mjg>
    Signed-off-by: Jesse Barnes <jbarnes>

However, when I backport it to RHEL6, it doesn't work.  The problem is that the default ASPM policy in RHEL is powersave, and indeed if I boot upstream with pcie_aspm.policy=powersave, the fix above stops working.  I've added Matthew, but we should probably start a discussion on upstream lists.  I think there are at least two simple solutions.  We could a) set aspm_policy = POLICY_DEFAULT when we set aspm_clear_state, or b) enable the necessary code path directly:

--- a/drivers/pci/pcie/aspm.c
+++ b/drivers/pci/pcie/aspm.c
@@ -607,7 +607,7 @@ void pcie_aspm_init_link_state(struct pci_dev *pdev)
         * the BIOS's expectation, we'll do so once pci_enable_device() is
         * called.
         */
-       if (aspm_policy != POLICY_POWERSAVE) {
+       if (aspm_policy != POLICY_POWERSAVE || aspm_clear_state) {
                pcie_config_aspm_path(link);
                pcie_set_clkpm(link, policy_to_clkpm_state(link));
        }

Opinions?  BTW, I think there is a correlation of this problem flaring up in proximity to upgrading the BIOS on this system (Lenovo ThinkStation S20).  Stefan, I guess this might as well be assigned to me.
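
To illustrate what option b) changes, here is a minimal standalone sketch of the condition before and after the one-line patch. It only models the boolean test; the enum mirrors the policy names used in drivers/pci/pcie/aspm.c, and the two helper functions are hypothetical names invented for this illustration, not kernel functions.

```c
#include <assert.h>
#include <stdbool.h>

/* Policy names as used in drivers/pci/pcie/aspm.c. */
enum aspm_policy {
    POLICY_DEFAULT,     /* BIOS default setting */
    POLICY_PERFORMANCE, /* high performance */
    POLICY_POWERSAVE,   /* high power saving */
};

/* Hypothetical helper: the test in pcie_aspm_init_link_state()
 * before the patch. Returns true if the link gets (re)configured. */
static bool configures_link_before(enum aspm_policy policy, bool clear_state)
{
    (void)clear_state; /* the original condition ignores this flag */
    assert(policy <= POLICY_POWERSAVE);
    return policy != POLICY_POWERSAVE;
}

/* Hypothetical helper: the same test with the proposed change, so a
 * pending "clear ASPM" request is honored under powersave as well. */
static bool configures_link_after(enum aspm_policy policy, bool clear_state)
{
    assert(policy <= POLICY_POWERSAVE);
    return policy != POLICY_POWERSAVE || clear_state;
}
```

The only case where the two versions differ is policy == POLICY_POWERSAVE with aspm_clear_state set, which is the RHEL default configuration on a system whose BIOS reports ASPM as unsupported.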

Comment 16 Matthew Garrett 2011-03-09 15:30:07 UTC
I think your second approach is correct here.

Comment 17 Stefan Assmann 2011-03-09 15:30:48 UTC
ok, reassigning to you Alex.

Comment 18 Alex Williamson 2011-03-09 17:11:28 UTC
Hmm, latest upstream works with or without policy=powersave.  Bisecting again to figure out if something already fixed this that we can backport.

Comment 19 Alex Williamson 2011-03-10 16:50:56 UTC
Well this is confusing.  Here's the upstream commit that makes it work with pcie_aspm.policy=powersave:

commit 415e12b2379239973feab91850b0dce985c6058a
Author: Rafael J. Wysocki <rjw>
Date:   Fri Jan 7 00:55:09 2011 +0100

    PCI/ACPI: Request _OSC control once for each root bridge (v3)
    
    Move the evaluation of acpi_pci_osc_control_set() (to request control of
    PCI Express native features) into acpi_pci_root_add() to avoid calling
    it many times for the same root complex with the same arguments.
    Additionally, check if all of the requisite _OSC support bits are set
    before calling acpi_pci_osc_control_set() for a given root complex.
    
    References: https://bugzilla.kernel.org/show_bug.cgi?id=20232
    Reported-by: Ozan Caglayan <ozan.tr>
    Tested-by: Ozan Caglayan <ozan.tr>
    Signed-off-by: Rafael J. Wysocki <rjw>
    Signed-off-by: Jesse Barnes <jbarnes>

The behavior is changed on this box because we fail this test:

(flags & ACPI_PCIE_REQ_SUPPORT) == ACPI_PCIE_REQ_SUPPORT

and don't call acpi_pci_osc_control_set().  If I force us into that 'if' block in acpi_pci_root_add(), as the code before the above commit did, the AER errors return.  So it seems that this system's ACPI implementation is doing something bad when we call _OSC.  I'll include the DSDT, hopefully Matthew can shed more light on what it's doing than I can.  It appears to only do anything for OSC_PCI_EXPRESS_NATIVE_HP_CONTROL and OSC_PCI_EXPRESS_PME_CONTROL, but even if I clear those from the flags passed, I get AER errors.  Passing 0x0 for flags works.
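
As a rough illustration of the test quoted above, the following sketch shows an "all required bits set" mask check. The bit values and the composition of ACPI_PCIE_REQ_SUPPORT are assumptions modeled on the upstream headers of that period and should be checked against the actual tree; osc_support_complete() and osc_missing_support() are hypothetical helper names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* _OSC support bits (values assumed, modeled on include/linux/acpi.h). */
#define OSC_EXT_PCI_CONFIG_SUPPORT        0x01
#define OSC_ACTIVE_STATE_PWR_SUPPORT      0x02
#define OSC_CLOCK_PWR_CAPABILITY_SUPPORT  0x04
#define OSC_PCI_SEGMENT_GROUPS_SUPPORT    0x08
#define OSC_MSI_SUPPORT                   0x10

/* Assumed composition of the required-support mask. */
#define ACPI_PCIE_REQ_SUPPORT (OSC_EXT_PCI_CONFIG_SUPPORT \
                               | OSC_ACTIVE_STATE_PWR_SUPPORT \
                               | OSC_CLOCK_PWR_CAPABILITY_SUPPORT \
                               | OSC_MSI_SUPPORT)

/* Hypothetical helper: true only if every required bit is set in flags. */
static bool osc_support_complete(uint32_t flags)
{
    return (flags & ACPI_PCIE_REQ_SUPPORT) == ACPI_PCIE_REQ_SUPPORT;
}

/* Hypothetical helper: which required bits are absent from flags. */
static uint32_t osc_missing_support(uint32_t flags)
{
    uint32_t missing = ACPI_PCIE_REQ_SUPPORT & ~flags;

    /* Only required bits can ever be reported as missing. */
    assert((missing & ~(uint32_t)ACPI_PCIE_REQ_SUPPORT) == 0);
    return missing;
}
```

If the two ASPM-related bits are absent from flags, osc_support_complete() returns false, which corresponds to the case where acpi_pci_osc_control_set() is skipped.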

Comment 20 Alex Williamson 2011-03-10 16:52:11 UTC
Created attachment 483508 [details]
DSDT

Comment 21 Alex Williamson 2011-03-10 16:53:54 UTC
(In reply to comment #19)
> It appears to only do anything for
> OSC_PCI_EXPRESS_NATIVE_HP_CONTROL and OSC_PCI_EXPRESS_PME_CONTROL

s/OSC_PCI_EXPRESS_NATIVE_HP_CONTROL/OSC_SHPC_NATIVE_HP_CONTROL/

Comment 22 Alex Williamson 2011-03-10 18:13:54 UTC
Ok, the effective difference the patch in comment 19 made was that we no longer request AER control via the _OSC.  That means the OSPM never gets notified about AER errors, although they're probably still occurring.  The patch in comment 15 is still effective at eliminating them when we do request AER control.  So, I'll proceed with that patch, and I think there's a new bug upstream: we skip requesting _OSC control because the platform isn't setting the OSC_ACTIVE_STATE_PWR_SUPPORT or OSC_CLOCK_PWR_CAPABILITY_SUPPORT support flags.

Comment 23 Matthew Garrett 2011-03-10 18:38:19 UTC
Those are OS capability flags, not platform capability flags (the _OSC method makes approximately no rational sense at all). We should be setting those all the time, although there's currently a bug upstream where we don't if ASPM is disabled. On the other hand, if the platform doesn't give us full PCIe control (which is defined as all of native hotplug, PME and AER) then we won't enable any PCIe features. This matches the behaviour of Windows and the expectations of some hardware vendors.
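
The all-or-nothing rule described here can be sketched as follows. This is an illustration only, not the kernel's implementation: the control bit values are assumed from the upstream headers, and PCIE_FULL_CONTROL and pcie_usable_control() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* _OSC control bits (values assumed, modeled on include/linux/acpi.h). */
#define OSC_PCI_EXPRESS_NATIVE_HP_CONTROL  0x01
#define OSC_SHPC_NATIVE_HP_CONTROL         0x02
#define OSC_PCI_EXPRESS_PME_CONTROL        0x04
#define OSC_PCI_EXPRESS_AER_CONTROL        0x08

/* Hypothetical name for "full PCIe control": native hotplug, PME and AER. */
#define PCIE_FULL_CONTROL (OSC_PCI_EXPRESS_NATIVE_HP_CONTROL \
                           | OSC_PCI_EXPRESS_PME_CONTROL \
                           | OSC_PCI_EXPRESS_AER_CONTROL)

/* Hypothetical helper: given the control bits the platform granted,
 * return the bits the OS will act on. A partial grant yields nothing. */
static uint32_t pcie_usable_control(uint32_t granted)
{
    uint32_t usable =
        ((granted & PCIE_FULL_CONTROL) == PCIE_FULL_CONTROL) ? granted : 0;

    /* Either nothing is used, or the full set is present. */
    assert(usable == 0 || (usable & PCIE_FULL_CONTROL) == PCIE_FULL_CONTROL);
    return usable;
}
```

Under this model a platform that grants hotplug and PME but withholds AER ends up with no native PCIe features enabled at all.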

Comment 24 Alex Williamson 2011-03-10 21:34:44 UTC
(In reply to comment #23)
> Those are OS capability flags, not platform capability flags (the _OSC method
> makes approximately no rational sense at all). We should be setting those all
> the time, although there's currently a bug upstream where we don't if ASPM is
> disabled.

Right, I found the proposed patch for that here: https://patchwork.kernel.org/patch/612171/  With that, we'll return to making the _OSC call, which will enable AER errors and upstream will fail with policy=powersave just like rhel does.

> On the other hand, if the platform doesn't give us full PCIe control
> (which is defined as all of native hotplug, PME and AER) then we won't enable
> any PCIe features. This matches the behaviour of Windows and the expectations
> of some hardware vendors.

Ugh.

Comment 26 RHEL Program Management 2011-03-18 15:49:57 UTC
This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux maintenance release. Product Management has 
requested further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed 
products. This request is not yet committed for inclusion in an Update release.

Comment 30 Aristeu Rozanski 2011-03-30 14:32:51 UTC
Patch(es) available on kernel-2.6.32-128.el6

Comment 34 Andy Gospodarek 2011-04-05 13:33:29 UTC
*** Bug 619806 has been marked as a duplicate of this bug. ***

Comment 35 Jeremy West 2011-04-05 17:02:42 UTC
*** Bug 689015 has been marked as a duplicate of this bug. ***

Comment 38 Jeremy West 2011-04-05 17:21:46 UTC
*** Bug 647077 has been marked as a duplicate of this bug. ***

Comment 42 Martin Prpič 2011-05-05 08:52:02 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Under some circumstances, faulty logic in the system BIOS could report that ASPM (Active State Power Management) was not supported on the system, but leave ASPM enabled on a device. This could lead to AER (Advanced Error Reporting) errors that the kernel was unable to handle. With this update, the kernel proactively disables ASPM on devices when the BIOS reports that ASPM is not supported, safely eliminating the aforementioned issues.

Comment 43 errata-xmlrpc 2011-05-19 12:44:19 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0542.html