Bug 770308 - Dell Studio 1536 and possibly many others requires pci=nocrs
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 16
Hardware: i386
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2011-12-25 22:51 UTC by John Gotts
Modified: 2012-01-28 03:33 UTC (History)

Fixed In Version: kernel-3.2.2-1.fc16
Clone Of:
Environment:
Last Closed: 2012-01-28 03:33:51 UTC
Type: ---
Embargoed:


Attachments
Output of dmidecode from a Dell Studio 1536 (12.49 KB, text/plain)
2011-12-28 20:10 UTC, John Gotts
dmesg output with Dave Jones' patched 3.1.6-2.fc16 kernel (70.50 KB, application/octet-stream)
2012-01-03 18:24 UTC, John Gotts
PNP quirk for unreported MMCONFIG space (6.70 KB, patch)
2012-01-04 23:23 UTC, Bjorn Helgaas

Description John Gotts 2011-12-25 22:51:34 UTC
Description of problem:
I was unable to boot the Fedora 15 and Fedora 16 installation media. The system would crash with a kernel panic. I was able to distro-sync up to Fedora 16 while continuing to use the Fedora 14 kernel, so I concluded that this is an upstream problem with the kernel.

I was able to get my system working in a very crippled manner using pci=noacpi. The keyboard did not work correctly. Key press and release events would be missed. Keystrokes would appear very slowly or not at all, and every key was at risk of becoming stuck (including the arrow keys, Page Up, Page Down, Control, Shift, and all other keys). Suspend stopped working. There were problems with missing IRQs ("irq n: nobody cared" in the logs, followed by call traces). The system seemed to run more slowly (but I didn't perform any benchmarks to prove this).

Then I happened upon this document:

http://fedoraproject.org/wiki/Common_kernel_problems

Under "Can't find installation CD/DVD or hard drives" there is this line:

"Try the boot option pci=nocrs on 2.6.34 and later kernels."

I'd almost recommend turning this on by default, as it seems to cause no harm whatsoever and without it my hardware becomes unbootable. What I would definitely recommend is suggesting this option if there is some level of success with pci=noacpi in the "Crashes/Hangs" section of that document.
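
When comparing boots with and without the workaround, it helps to confirm which pci= values the running kernel actually received. The following is a minimal sketch, assuming a Linux system that exposes its boot command line at /proc/cmdline; the helper names pci_options and booted_with are made up for illustration.

```python
def pci_options(cmdline):
    """Collect every value passed through pci=... on a kernel command line."""
    options = set()
    for token in cmdline.split():
        if token.startswith("pci="):
            # The pci= parameter accepts a comma-separated list of values,
            # for example pci=nocrs or pci=noacpi,nocrs.
            options.update(v for v in token[len("pci="):].split(",") if v)
    return options


def booted_with(option, cmdline_path="/proc/cmdline"):
    """Report whether the running kernel was booted with the given pci= value."""
    with open(cmdline_path, encoding="ascii", errors="replace") as handle:
        return option in pci_options(handle.read())


if __name__ == "__main__":
    for name in ("nocrs", "noacpi"):
        print("pci=%s active: %s" % (name, booted_with(name)))
```

The parsing is split from the file access so that a command line copied out of a bug report can be checked without booting the machine in question.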

I've done extensive searches over the past week and I've seen many instances of people having problems with Dell keyboard hardware. I suspect that recent Dell keyboard hardware must be initialized correctly in order to work, and that pci=noacpi prevents that initialization.

This lockup issue also seems to be common with Ubuntu. I hope that Ubuntu users with this issue will eventually see my report.

Version-Release number of selected component (if applicable):
All versions shipped with Fedora 16. [At least] current versions shipped with Fedora 15.

How reproducible:
Always.

Steps to Reproduce:
1. Attempt to boot the kernel with installation media.
Actual results:
Failure to boot.

Expected results:
Successful booting and system operation.

Additional info:

Comment 1 Dave Jones 2011-12-28 14:56:15 UTC
please attach output of dmidecode from systems that need this option.

Comment 2 John Gotts 2011-12-28 20:10:16 UTC
Created attachment 549856
Output of dmidecode from a Dell Studio 1536

Here is the output of dmidecode from the Dell Studio 1536, which requires pci=nocrs to boot 3.1.x.

Comment 3 Dave Jones 2011-12-29 18:12:21 UTC
Thanks. I've just added a patch that should make this work without the need to pass pci=nocrs.
When the build at http://koji.fedoraproject.org/koji/taskinfo?taskID=3609574 finishes, give it a try.

Comment 4 John Gotts 2011-12-30 00:18:22 UTC
The patch is good. It would fix the Fedora 15 and 16 installation media for this hardware but more importantly it will fix Fedora 17.

It's no fun to have to boot the installation media with pci=nocrs or any kernel flag.

The machine seems to run better with the patch than with pci=nocrs, but I haven't taken any scientific measurements.

Good work!

Comment 5 Dave Jones 2011-12-30 19:31:35 UTC
thanks for testing, I'll get this sent upstream.

Comment 6 Bjorn Helgaas 2012-01-03 17:50:34 UTC
John, would you mind attaching the complete dmesg log?  I think the log from a boot with the patch or with "pci=nocrs" is sufficient.  This will identify the exact problem and help create a fix that's more generic than the blacklist.

Turning off _CRS is definitely a workaround for this machine, but it does mean that we can't handle multiple PCI host bridges correctly, so we can't turn off _CRS across the board.

This is likely the same problem as https://bugzilla.kernel.org/show_bug.cgi?id=31602, which is on a Dell 1546.

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/647043 is another report (with many Ubuntu duplicates).
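
The limitation of a blacklist mentioned above can be illustrated with a small sketch. This is not the kernel patch itself (that is C code matching DMI data); it is a hypothetical Python illustration that reads the product name from /sys/class/dmi/id/product_name on Linux. The exact strings the firmware reports ("Studio 1536", "Inspiron 1546") are assumptions, not values taken from the attached dmidecode output.

```python
# Hypothetical blacklist: only the Studio 1536 has been confirmed in this
# report, and the exact DMI product string is an assumption.
NOCRS_MODELS = ("Studio 1536",)


def needs_nocrs(product_name, models=NOCRS_MODELS):
    """Return True if the product name matches a blacklisted model."""
    name = product_name.strip()
    return any(model in name for model in models)


def read_product_name(path="/sys/class/dmi/id/product_name"):
    """Read the DMI product name that the firmware exposes through sysfs."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return handle.read().strip()


if __name__ == "__main__":
    product = read_product_name()
    print("%s: blacklisted=%s" % (product, needs_nocrs(product)))
```

A machine with the same BIOS bug but a different product name is not matched until someone adds it by hand, which is why a fix that detects the underlying problem is preferable.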

Comment 7 John Gotts 2012-01-03 18:21:48 UTC
The Dell Studio 1557 will also need a separate blacklist entry (or the more general fix you would like to create):

https://bugzilla.redhat.com/show_bug.cgi?id=769657

kernel.org is not responding right now but beware that the 1546 is a different line of hardware, an Inspiron not a Studio. The Studio line is/was Dell's entry-level hardware. I believe the Inspiron is slightly better than entry level, but the hardware may indeed be similar enough...

As far as the Ubuntu bug report goes, the underlying problem might be the same, but the result is that the Fedora 15 and 16 installation media and all shipped Fedora 15 and 16 kernels crash on my hardware during boot (without pci=nocrs). There is no opportunity to test USB or any kernel functionality for that matter. One data point worth mentioning is that I have 4 GB of system memory and I'm using the PAE kernel. I read something in the bug report about a mapping issue of some sort.

In any case, I am attaching the output of dmesg with Dave Jones' patched kernel. If you need me to boot with pci=nocrs and attach that dmesg let me know. (The machine is unstable in that mode.)

Note that Fedora 14 ran fine on my system. This bug came about one or two kernel revisions before Linus jumped to version 3.x.

Comment 8 John Gotts 2012-01-03 18:24:21 UTC
Created attachment 550500
dmesg output with Dave Jones' patched 3.1.6-2.fc16 kernel

Comment 9 Bjorn Helgaas 2012-01-03 19:41:59 UTC
There are two problems on the 1536, both of which look like BIOS bugs:
  1) the BIOS left USB devices outside all the PCI host bridge apertures
  2) the apertures include an unreported MMCONFIG area
When Linux tries to fix problem 1, it moves the USB devices into the unreported MMCONFIG area, where they don't work.  Windows will also fix problem 1 by moving the USB devices, but it chooses a different area that doesn't fall into the MMCONFIG area.

The chipset is programmed to claim [mem 0xd0000000-0xdfffffff] as MMCONFIG space, but the ACPI MCFG description only tells the OS about [mem 0xd0000000-0xd3ffffff].  The rest of that MMCONFIG region, [mem 0xd4000000-0xdfffffff], is instead reported as part of the PCI host bridge aperture [mem 0xd4000000-0xfebfffff].
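
The range arithmetic behind that statement can be checked with a short sketch. The three address ranges are the ones quoted in this comment; the helper names are made up for illustration, and ranges are treated as inclusive, as in the kernel log format.

```python
def size(rng):
    """Size in bytes of an inclusive (start, end) address range."""
    start, end = rng
    return end - start + 1


def intersect(a, b):
    """Overlap of two inclusive ranges, or None if they are disjoint."""
    start, end = max(a[0], b[0]), min(a[1], b[1])
    return (start, end) if start <= end else None


CHIPSET_MMCONFIG = (0xD0000000, 0xDFFFFFFF)      # what the chipset decodes
MCFG_REPORTED = (0xD0000000, 0xD3FFFFFF)         # what ACPI MCFG tells the OS
HOST_BRIDGE_APERTURE = (0xD4000000, 0xFEBFFFFF)  # window reported by _CRS

# Both MMCONFIG ranges start at the same address, so the unreported part is
# simply everything after the end of the MCFG range.
unreported = (MCFG_REPORTED[1] + 1, CHIPSET_MMCONFIG[1])
conflict = intersect(unreported, HOST_BRIDGE_APERTURE)

MIB = 1024 * 1024
print("reported MMCONFIG:   %d MiB" % (size(MCFG_REPORTED) // MIB))
print("unreported MMCONFIG: [0x%08x-0x%08x]" % unreported)
print("inside the aperture: [0x%08x-0x%08x], %d MiB"
      % (conflict[0], conflict[1], size(conflict) // MIB))
```

Any device placed in that overlapping region looks like free aperture space to the OS but is decoded as MMCONFIG by the chipset.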

Here are the relevant parts of your dmesg log:

    BIOS-e820: 00000000c7e60800 - 00000000d4000000 (reserved)
    Fam 10h mmconf [d0000000, dfffffff]
    PCI: MMCONFIG for domain 0000 [bus 00-3f] at [mem 0xd0000000-0xd3ffffff] (base 0xd0000000)
    pci_root PNP0A03:00: host bridge window [mem 0x000a0000-0x000bffff] (ignored)
    pci_root PNP0A03:00: host bridge window [mem 0x000d0000-0x000dffff] (ignored)
    pci_root PNP0A03:00: host bridge window [mem 0xd4000000-0xfebfffff] (ignored)
    pci_root PNP0A03:00: host bridge window [mem 0xfec10000-0xfecfffff] (ignored)
    pci_root PNP0A03:00: host bridge window [mem 0xfed00400-0xfedfffff] (ignored)
    pci_root PNP0A03:00: host bridge window [mem 0xfee10000-0xff9fffff] (ignored)
    pci_root PNP0A03:00: host bridge window [mem 0xffc00000-0xffdffbff] (ignored)
    pci_root PNP0A03:00: host bridge window [mem 0x130000000-0x327ffffff] (ignored)
    pci 0000:00:12.0: reg 10: [mem 0xffb00000-0xffb00fff]
    pci 0000:00:12.1: reg 10: [mem 0xffb01000-0xffb01fff]
    pci 0000:00:12.2: reg 10: [mem 0xffa80000-0xffa800ff]
    pci 0000:00:13.0: reg 10: [mem 0xffb02000-0xffb02fff]
    pci 0000:00:13.1: reg 10: [mem 0xffb03000-0xffb03fff]
    pci 0000:00:13.2: reg 10: [mem 0xffa80400-0xffa804ff]

Since this kernel ignores _CRS, we don't move the USB devices.  The unpatched F16 & F17 kernels will move them to the area at 0xd4000000, since that appears unused.
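
Problem 1 can be verified directly from the log excerpt above. The sketch below parses the quoted lines and checks whether each device's "reg 10" region lies inside any host bridge window. The regular expressions only target the line format shown in this excerpt and are not a general dmesg parser.

```python
import re

# Log lines copied from the excerpt in this comment.
DMESG = """\
pci_root PNP0A03:00: host bridge window [mem 0x000a0000-0x000bffff] (ignored)
pci_root PNP0A03:00: host bridge window [mem 0x000d0000-0x000dffff] (ignored)
pci_root PNP0A03:00: host bridge window [mem 0xd4000000-0xfebfffff] (ignored)
pci_root PNP0A03:00: host bridge window [mem 0xfec10000-0xfecfffff] (ignored)
pci_root PNP0A03:00: host bridge window [mem 0xfed00400-0xfedfffff] (ignored)
pci_root PNP0A03:00: host bridge window [mem 0xfee10000-0xff9fffff] (ignored)
pci_root PNP0A03:00: host bridge window [mem 0xffc00000-0xffdffbff] (ignored)
pci_root PNP0A03:00: host bridge window [mem 0x130000000-0x327ffffff] (ignored)
pci 0000:00:12.0: reg 10: [mem 0xffb00000-0xffb00fff]
pci 0000:00:12.1: reg 10: [mem 0xffb01000-0xffb01fff]
pci 0000:00:12.2: reg 10: [mem 0xffa80000-0xffa800ff]
pci 0000:00:13.0: reg 10: [mem 0xffb02000-0xffb02fff]
pci 0000:00:13.1: reg 10: [mem 0xffb03000-0xffb03fff]
pci 0000:00:13.2: reg 10: [mem 0xffa80400-0xffa804ff]
"""

WINDOW_RE = re.compile(r"host bridge window \[mem (0x[0-9a-f]+)-(0x[0-9a-f]+)\]")
BAR_RE = re.compile(r"pci (\S+): reg 10: \[mem (0x[0-9a-f]+)-(0x[0-9a-f]+)\]")

windows = [(int(a, 16), int(b, 16)) for a, b in WINDOW_RE.findall(DMESG)]
bars = {dev: (int(a, 16), int(b, 16)) for dev, a, b in BAR_RE.findall(DMESG)}


def inside_any(rng, candidates):
    """True if the inclusive range rng fits entirely inside one candidate."""
    return any(lo <= rng[0] and rng[1] <= hi for lo, hi in candidates)


outside = sorted(dev for dev, rng in bars.items() if not inside_any(rng, windows))
for dev in outside:
    print("%s [0x%08x-0x%08x] is outside every host bridge window"
          % ((dev,) + bars[dev]))
```

All six regions sit in the gap between the windows ending at 0xff9fffff and starting at 0xffc00000, which is why a kernel that honors _CRS tries to move them.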

Comment 10 Bjorn Helgaas 2012-01-03 19:54:50 UTC
> In any case, I am attaching the output of dmesg with Dave Jones' patched
> kernel. If you need me to boot with pci=nocrs and attach that dmesg let me
> know. (The machine is unstable in that mode.)

I don't know what patch is in Dave's kernel, but the patch he posted upstream (https://lkml.org/lkml/2011/12/30/59) should be exactly equivalent to booting with "pci=nocrs".  If they behave differently, I'm missing something or there's another issue.

Comment 11 John Gotts 2012-01-03 19:58:45 UTC
(In reply to comment #10)
> > In any case, I am attaching the output of dmesg with Dave Jones' patched
> > kernel. If you need me to boot with pci=nocrs and attach that dmesg let me
> > know. (The machine is unstable in that mode.)
> 
> I don't know what patch is in Dave's kernel, but the patch he posted upstream
> (https://lkml.org/lkml/2011/12/30/59) should be exactly equivalent to booting
> with "pci=nocrs".  If they behave differently, I'm missing something or there's
> another issue.

After a few days of testing, they do appear to be equivalent.

Comment 12 Dave Jones 2012-01-03 20:54:29 UTC
Yes, same patch that I posted upstream.

Comment 13 Bjorn Helgaas 2012-01-04 23:23:26 UTC
Created attachment 550776
PNP quirk for unreported MMCONFIG space

Dave, John, I'd appreciate it if you could test this patch in a kernel without the Dell blacklist.  If it works as intended, it should fix the issue on all AMD systems, not just ones in the blacklist.

Comment 14 John Gotts 2012-01-04 23:32:18 UTC
Dave, can you easily generate a Fedora 16 kernel package to test this change?

Thanks!

Comment 15 Dave Jones 2012-01-04 23:47:43 UTC
sure, give me a few hours.

Comment 16 Dave Jones 2012-01-04 23:58:22 UTC
build started.
the RPMs will show up at the bottom of this page once they're built.

http://koji.fedoraproject.org/koji/taskinfo?taskID=3620090

Comment 17 John Gotts 2012-01-05 04:13:03 UTC
3.1.7-2 looks good to me. It seems to correctly work around the BIOS bugs.

Comment 18 John Gotts 2012-01-25 19:13:03 UTC
The official 3.1.7-1 and 3.1.9 kernels were good, but 3.2.1-3 now crashes in the same way that all previous unpatched kernels did, and booting with pci=nocrs is required again.

Did Bjorn's change get reverted?

Comment 19 Josh Boyer 2012-01-25 19:36:50 UTC
(In reply to comment #18)
> The official 3.1.7-1 and 3.1.9 were good but 3.2.1-3 now crashes in the same
> way that all previous did and booting with pci=nocrs is required again.
> 
> Did Bjorn's change get reverted?

Sigh.  Sort of.  It was dropped as a patch to the Fedora kernel when we rebased to 3.1.10, because the upstream 3.1.10 release contained that patch already.  Then we rebased to 3.2.1, which apparently doesn't contain said patch.

It looks like 3.2.2 will include it and that should be released later today, so the next Fedora kernel should have this fixed again.  Sorry for the inconvenience.
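
The regression described here comes from the fact that each upstream stable branch picks up a fix at its own point release, so a single "fixed since version X" comparison gives the wrong answer. A minimal sketch of a per-branch check follows, using only the version numbers mentioned in this comment; the FIRST_FIXED table and helper names are hypothetical, and the 3.2.2 entry reflects the expectation stated above rather than a confirmed release.

```python
# First upstream release on each stable branch said to contain the patch.
FIRST_FIXED = {
    (3, 1): (3, 1, 10),
    (3, 2): (3, 2, 2),
}


def parse(version):
    """Turn '3.1.10' into (3, 1, 10) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def upstream_has_fix(version):
    """True/False for known branches, None when the branch is not listed."""
    parsed = parse(version)
    first = FIRST_FIXED.get(parsed[:2])
    if first is None:
        return None
    return parsed >= first


if __name__ == "__main__":
    for candidate in ("3.1.9", "3.1.10", "3.2.1", "3.2.2"):
        naive = parse(candidate) >= parse("3.1.10")
        print("%-7s per-branch=%s naive=%s"
              % (candidate, upstream_has_fix(candidate), naive))
```

The naive comparison reports 3.2.1 as fixed because it sorts after 3.1.10, which matches the assumption that led to the patch being dropped from the Fedora kernel.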

Comment 20 Fedora Update System 2012-01-27 00:34:07 UTC
kernel-3.2.2-1.fc16 has been submitted as an update for Fedora 16.
https://admin.fedoraproject.org/updates/FEDORA-2012-0949/kernel-3.2.2-1.fc16

Comment 21 Fedora Update System 2012-01-28 03:33:51 UTC
kernel-3.2.2-1.fc16 has been pushed to the Fedora 16 stable repository.  If problems still persist, please make note of it in this bug report.

