Bug 620313 - 2.6.34 kernel needs "pci=nocrs" option to boot on Dell Precision T3500
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 13
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
Reported: 2010-08-02 06:20 UTC by Stefan Becker
Modified: 2010-09-10 17:05 UTC (History)

Fixed In Version: kernel-2.6.34.6-54.fc13
Clone Of:
Environment:
Last Closed: 2010-09-09 01:17:38 UTC
Type: ---
Embargoed:


Attachments
allocate from top of window (608 bytes, patch)
2010-09-04 17:28 UTC, Bjorn Helgaas
allocate from top of window (v2) (660 bytes, patch)
2010-09-05 04:19 UTC, Bjorn Helgaas
LSHW output for 2.6.33 on M501R (22.29 KB, text/plain)
2010-09-06 00:35 UTC, Pramod Dematagoda
LSHW output for 2.6.34 on M501R with pci=nocrs (21.99 KB, text/plain)
2010-09-06 00:36 UTC, Pramod Dematagoda
debug patch (3.01 KB, patch)
2010-09-07 16:27 UTC, Bjorn Helgaas
screenshot showing erroneous allocation (1.34 MB, image/jpeg)
2010-09-07 19:44 UTC, Bjorn Helgaas
screenshot from my T3500 with debug patch applied (888.28 KB, image/jpeg)
2010-09-08 12:59 UTC, Stefan Becker
alloc top-down (v3) (8.00 KB, patch)
2010-09-10 04:27 UTC, Bjorn Helgaas


Links
System        ID     Private  Priority  Status  Summary  Last Updated
Linux Kernel  16228  0        None      None    None     Never

Description Stefan Becker 2010-08-02 06:20:06 UTC
Description of problem:

I installed kernel-2.6.34.1-29.fc13.x86_64 from Koji to verify if bug #609764 has been corrected. Boot fails with the message "Boot has failed, sleeping forever.". I didn't update any other package before the test.

Version-Release number of selected component (if applicable):

kernel-2.6.34.1-29.fc13.x86_64
dracut-005-3.fc13.noarch

How reproducible: Always

Additional info:

I removed "rhgb quiet" from the kernel command line and see:

   - KMS gets initialized, console changed
   - USB devices are initialized
   - USB card reader devices are initialized

The message is generated by dracut's "init" script, so the initrd is reached.

Comment 1 Chuck Ebbert 2010-08-02 18:49:09 UTC
Can you try some of the dracut debug options on the command line? Look at the dracut manpage under KERNEL COMMAND LINE.

Comment 2 Stefan Becker 2010-08-03 07:27:08 UTC
From the debugging output I can see that AHCI fails to initialize the controller. Therefore udev can't find the SATA disks, so the 2 physical partitions are never found, LVM is never set up, and /dev/mapper/... for the root device never appears. So the rdinit script bails out after the last iteration of the "initqueue" phase.

AHCI messages in dmesg:

ahci: 0000:00:1f.2: Version 3.0
ahci: 0000:00:1f.2: PCI INT C -> GSI 20 (level, low) -> IRQ 20
ahci: 0000:00:1f.2: irq 53 for MSI/MSI-X
ahci: SSS flag set, parallel bus scan disabled
ahci: 0000:00:1f.2: controller reset failed (0xffffffff)
ahci: 0000:00:1f.2: PCI INT C disabled
ahci: probe of 0000:00:1f.2 failed with error -5

Based on the error message Google found these:

   <http://kerneltrap.org/mailarchive/linux-kernel/2010/4/6/4555916>
   <https://bugzilla.kernel.org/show_bug.cgi?id=15744>

They discuss a Dell Precision T3400. I get the same "no compatible bridge window for []" error message in dmesg. With the suggested "pci=nocrs" the machine boots up fine.

As they talk about a possible BIOS bug: my T3500 has the latest BIOS version A07 (15-Apr-2010) installed.

The kernel bug report indicates that this should be fixed for 2.6.34. Maybe you need to check if all of the mentioned patches actually made it to .34?
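The pattern described above (a device reported as outside every bridge window, followed by an AHCI reset failure on that same device) can be spotted in a saved log with a short script. This is only a sketch: the function name and regular expressions are made up for illustration, and the patterns are modelled on the messages quoted in this report rather than on any documented log format.

```python
import re

# PCI address in domain:bus:device.function form, e.g. 0000:00:1f.2
PCI_ADDR = r"[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]"

# Patterns modelled on the messages quoted in this report (an assumption,
# not a stable interface; kernel message wording can change).
NO_WINDOW = re.compile(
    r"pci (" + PCI_ADDR + r"): no compatible bridge window for \[([^\]]*)\]"
)
RESET_FAILED = re.compile(
    r"ahci:? (" + PCI_ADDR + r"): controller reset failed"
)


def moved_then_failed(log_text):
    """Map each device that had no compatible window AND later failed reset
    to the resource text that the kernel reported for it."""
    moved = {m.group(1): m.group(2) for m in NO_WINDOW.finditer(log_text)}
    failed = {m.group(1) for m in RESET_FAILED.finditer(log_text)}
    return {dev: res for dev, res in moved.items() if dev in failed}


# Two lines in the form quoted in this report.
sample = (
    "pci 0000:00:1f.2: no compatible bridge window for "
    "[mem 0xff970000-0xff9707ff]\n"
    "ahci: 0000:00:1f.2: controller reset failed (0xffffffff)\n"
)
suspects = moved_then_failed(sample)
```

Feeding it the text of a saved dmesg capture should list the affected devices, if the messages match these patterns.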

Comment 3 Stefan Becker 2010-08-03 15:34:43 UTC
The kernel developer pointed out that the bug report I found is not the corrected one, but a related one. I've added the BKO URL.

Comment 4 Charles Butterfield 2010-09-02 05:42:53 UTC
"Me Too"! -- 2.6.34.6-47.fc13.x86_64 just became available via the normal (non-testing) channels and I seem to have the same symptoms as you, namely a brief flash that looks like AHCI, then a repeating identical warning about etc/modprobe.conf, then the message about "boot failed, sleeping forever".

I'm crossing my fingers that when I reboot (after submitting this), that the pci=nocrs option cures things.

Given that this is no longer in "testing", I think the severity and priority should be bumped way up.

I too have a Dell PW T4500, with the latest BIOS.

Comment 5 Charles Butterfield 2010-09-02 05:45:49 UTC
Whoops, that's a Dell T3500, getting way past my bedtime.

Comment 6 Charles Butterfield 2010-09-02 05:55:22 UTC
Well pci=nocrs option worked for me.  Whew :-)

Does anybody know if there are any worrisome consequences of this option? Performance, behavior quirks, whatever.  I am using fakeraid with the ICH10R chipset + md + LVM.

I guess I'm really asking how much risk pci=nocrs introduces vs. the alternative of reverting to the previous kernel. Does it just disable some optimizations that are buggy, or is there an uglier story to using it?

Regards
-- Charlie

Comment 7 Chuck Ebbert 2010-09-02 10:41:33 UTC
pci=nocrs just reverts the default method for finding PCI resources back to what was used in 2.6.33.
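To see which of the two options a running system was booted with, the kernel command line can be inspected. The helper below is a hypothetical sketch, not part of any tool: it only lists the "nocrs" / "use_crs" options it finds in a "pci=" parameter and makes no claim about which one the kernel honours if both are given. An empty result only means neither was passed explicitly; the built-in default depends on the kernel build.

```python
def pci_crs_options(cmdline):
    """Return the _CRS-related options found in 'pci=' parameters,
    in the order they appear on the kernel command line."""
    found = []
    for token in cmdline.split():
        if token.startswith("pci="):
            # 'pci=' takes a comma-separated option list.
            for opt in token[len("pci="):].split(","):
                if opt in ("nocrs", "use_crs"):
                    found.append(opt)
    return found


# Example command line; the root= value is a made-up placeholder.
example = "ro root=/dev/mapper/example-root pci=nocrs ignore_loglevel"
options = pci_crs_options(example)
```

On a live system the input would be the contents of /proc/cmdline, for example `pci_crs_options(open("/proc/cmdline").read())`.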

Comment 8 Bjorn Helgaas 2010-09-04 17:28:44 UTC
Created attachment 443074 [details]
allocate from top of window

I'm pretty sure this is the same as https://bugzilla.kernel.org/show_bug.cgi?id=16228.  This patch is a partial fix for that, but hasn't been tested yet.

Comment 9 Bjorn Helgaas 2010-09-05 04:19:03 UTC
Created attachment 443122 [details]
allocate from top of window (v2)

Previous patch is broken, please try this instead.

Comment 10 Pramod Dematagoda 2010-09-06 00:34:02 UTC
I tried this patch with a vanilla 2.6.34 kernel, but no go.

The error message I'm getting (without the patch) is:

pci_root PNP0A03:00: address space collision: host bridge window [mem 0xbfffffff - 0xdfffffff] conflicts with PCI bus 0000:00 [mem 0xc0000000 - 0xffffffff]

This error occurs before KMS is initialised, so I'm guessing before the initrd is loaded.

I am getting this issue with a Dell Inspiron M501R laptop and need to use the pci=nocrs option to boot as well. This issue is also present in the 2.6.35 and 2.6.36 vanilla kernels that I tried; it started happening in the 2.6.34-rc1 kernel.

My lshw outputs from both 2.6.33 and 2.6.34 are attached, if necessary.

Comment 11 Pramod Dematagoda 2010-09-06 00:35:08 UTC
Created attachment 443202 [details]
LSHW output for 2.6.33 on M501R

Comment 12 Pramod Dematagoda 2010-09-06 00:36:07 UTC
Created attachment 443203 [details]
LSHW output for 2.6.34 on M501R with pci=nocrs

Comment 13 Bjorn Helgaas 2010-09-06 03:37:00 UTC
Pramod, I think you are actually seeing https://bugzilla.kernel.org/show_bug.cgi?id=17011, which is a different issue.  I'd like to keep them separate because the fixes will be quite different.

--------------------------------
In this bug (bug 620313 and https://bugzilla.kernel.org/show_bug.cgi?id=16228),
the first symptom is this message:

  pci 0000:00:1f.2: no compatible bridge window for [mem 0xff970000-0xff9707ff]

This means BIOS left the 1f.2 device (AHCI) somewhere that's outside all
the host bridge windows.  Linux moves it into a window, but the new location
doesn't work.  The result?  AHCI is broken, and all other devices work.

---------------------------------
In https://bugzilla.kernel.org/show_bug.cgi?id=17011, the first symptom
is a message like this:

    pci_root PNP0A03:00: address space collision: host bridge window [mem 0xafffffff-0xdfffffff] conflicts with PCI Bus 0000:00 [mem 0xb0000000-0xffffffff]

This means BIOS reported two host bridge windows that overlap.  In 17011,
and I suspect in your case Pramod, the real problem is that there is a
third window that's completely contained in one of the first ones.  That
causes trouble like this later when we claim the PCI device resources:

    pci 0000:00:01.0: address space collision: [mem 0xff300000-0xff4fffff] conflicts with PCI Bus 0000:00 [mem 0xf0000000-0xffffffff]
    pci 0000:00:06.0: address space collision: [mem 0xff600000-0xff6fffff] conflicts with PCI Bus 0000:00 [mem 0xf0000000-0xffffffff]
    ...

This may result in many broken devices.

----------------------------------

If my assessment doesn't seem right, Pramod, is there any chance you could
collect a console log with "ignore_loglevel" via serial console or netconsole
or video?  I think that would have enough information to tell for sure.
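The two symptoms above come down to two different range checks: in this bug a device resource lies outside every host bridge window, while in 17011 two reported windows overlap each other. A minimal sketch of both checks follows. The helper names are invented for illustration, and the single window used in the first example is a placeholder assumption, since the T3500's full window list is not given in this report. The overlapping pair and the AHCI resource are taken from the messages quoted above.

```python
def contains(window, res):
    """True if resource (start, end) lies entirely inside window (start, end)."""
    return window[0] <= res[0] and res[1] <= window[1]


def overlaps(a, b):
    """True if two inclusive (start, end) ranges share at least one address."""
    return a[0] <= b[1] and b[0] <= a[1]


def classify(windows, res):
    """Return (outside_all_windows, list_of_overlapping_window_pairs)."""
    outside = not any(contains(w, res) for w in windows)
    pairs = [
        (a, b)
        for i, a in enumerate(windows)
        for b in windows[i + 1:]
        if overlaps(a, b)
    ]
    return outside, pairs


# Symptom of this bug: the AHCI resource from the quoted message is not
# inside any window. The window list here is a placeholder assumption.
ahci_res = (0xFF970000, 0xFF9707FF)
outside, pairs = classify([(0xC0000000, 0xDFFFFFFF)], ahci_res)

# Symptom of 17011: the two ranges from the quoted collision message overlap.
win_a = (0xAFFFFFFF, 0xDFFFFFFF)
win_b = (0xB0000000, 0xFFFFFFFF)
outside_17011, pairs_17011 = classify([win_a, win_b], (0xFF300000, 0xFF4FFFFF))
```

In the first case `outside` is true and there are no overlapping pairs; in the second the device resource is inside a window, but the windows themselves collide.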

Comment 14 Stefan Becker 2010-09-06 07:24:00 UTC
Applied patch V2 to kernel-2.6.34.6-54.fc13 and now my T3500 boots up fine without pci=nocrs.

Comment 15 Pramod Dematagoda 2010-09-06 07:57:47 UTC
I think you're right, Bjorn; the Linux kernel bug report looks a lot like mine. I'll move over to that one.

Thanks.

Comment 16 Chuck Ebbert 2010-09-06 14:58:38 UTC
We're going to disable _CRS by default in the next 2.6.34 update for F13, but leave it on in F14 (2.6.35) for now.

Comment 17 Fedora Update System 2010-09-06 20:54:17 UTC
kernel-2.6.34.6-54.fc13 has been submitted as an update for Fedora 13.
https://admin.fedoraproject.org/updates/kernel-2.6.34.6-54.fc13

Comment 18 Stefan Becker 2010-09-07 09:09:38 UTC
(In reply to comment #14)
> Applied patch V2 to kernel-2.6.34.6-54.fc13 and now my T3500 boots up fine
> without pci=nocrs.

Fetched the wrong kernel SRPM from koji. That one had "pci=nocrs" as default and therefore worked fine.

Retested patch V2 with earlier kernel and this kernel (with pci=use_crs). In both cases the kernel *does not* boot. That means the patch doesn't help. BKO seems to have technical problems right now, so I can't update the kernel bug report yet.

Comment 19 Bjorn Helgaas 2010-09-07 16:27:54 UTC
Created attachment 443797 [details]
debug patch

Stefan and Charles, I'm very sorry for the inconvenience this problem is
causing you, and I really appreciate your testing efforts.  The ideal
thing would be if somebody had a serial console or netconsole setup and
could collect the kernel output with this patch and "pci=use_crs
ignore_loglevel".  I know that's a hassle to set up; the next best thing
would be a digital photo of the console with "pci=use_crs ignore_loglevel
vga=0xf07".

Comment 20 Bjorn Helgaas 2010-09-07 19:44:47 UTC
Created attachment 445764 [details]
screenshot showing erroneous allocation

This screenshot contains the information I requested in the previous
comment, specifically:

    find_resource: try [mem 0xbff00000-0xbfffffff] before [mem 0xc0000000-0xdfffffff] PCI Bus 0000:02
    pci 0000:00:1f.2: BAR 5: assigned [mem 0xbffff800-0xbfffffff]

This shows that we allocated from the top of the region, but since
we're still looking at available regions bottom-up, we found a region
that doesn't work (I suspect the region we allocated is really RAM).
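The difference between allocating from the top of a region and choosing the region itself top-down can be shown with the two ranges from the trace above. This is a toy model with invented names, not the kernel's find_resource logic or the attached patch. The 0x800 size is taken from the assigned BAR, and the 0x800 alignment is an assumption (BARs are naturally aligned to their size).

```python
def alloc_from_top(region, size, align):
    """Place a block of `size` bytes at the highest aligned address in
    `region` (inclusive start, end). Return (start, end) or None."""
    start, end = region
    candidate = (end + 1 - size) & ~(align - 1)
    if candidate < start:
        return None
    return (candidate, candidate + size - 1)


def assign(free_regions, size, align, regions_top_down):
    """Try free regions in address order (bottom-up or top-down) and
    allocate from the top of the first one that fits."""
    for region in sorted(free_regions, reverse=regions_top_down):
        res = alloc_from_top(region, size, align)
        if res is not None:
            return res
    return None


# The two candidate ranges named in the find_resource trace above.
free = [(0xBFF00000, 0xBFFFFFFF), (0xC0000000, 0xDFFFFFFF)]

bottom_up = assign(free, 0x800, 0x800, regions_top_down=False)
top_down = assign(free, 0x800, 0x800, regions_top_down=True)
```

With regions visited bottom-up, the toy model lands on 0xbffff800-0xbfffffff, the same range the trace shows being assigned; visiting them top-down places the block at the end of the 0xc0000000-0xdfffffff range instead.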

Comment 21 Fedora Update System 2010-09-08 02:21:42 UTC
kernel-2.6.34.6-54.fc13 has been pushed to the Fedora 13 testing repository.  If problems still persist, please make note of it in this bug report.
If you want to test the update, you can install it with
 su -c 'yum --enablerepo=updates-testing update kernel'
You can provide feedback for this update here: https://admin.fedoraproject.org/updates/kernel-2.6.34.6-54.fc13

Comment 22 Charles Butterfield 2010-09-08 03:01:20 UTC
Doesn't seem to be available yet:

$ uname -a
Linux hpc16.home 2.6.34.6-47.fc13.x86_64 #1 SMP Fri Aug 27 08:56:01 UTC 2010 x86_64 x86_64 x86_64 GNU/Linux

$ yum --enablerepo=updates-testing update kernel
Loaded plugins: presto, priorities, refresh-packagekit
Setting up Update Process
No Packages marked for Update

Comment 23 Stefan Becker 2010-09-08 12:59:21 UTC
Created attachment 445977 [details]
screenshot from my T3500 with debug patch applied

(In reply to comment #19)

I applied the debug patch and took the attached screenshot.

Comment 24 Fedora Update System 2010-09-09 01:16:46 UTC
kernel-2.6.34.6-54.fc13 has been pushed to the Fedora 13 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 25 Bjorn Helgaas 2010-09-10 04:27:36 UTC
Created attachment 446424 [details]
alloc top-down (v3)

Stefan and Charles, here's a more thorough patch that I think should
fix the problem.  I know kernel-2.6.34.6-54.fc13 has papered over the
problem by turning off "pci=use_crs", but that's only a short-term
workaround.

So if anybody has a chance to test this patch (don't forget to use
"pci=use_crs" to make sure we're exercising this path), I'd really
appreciate it.  If it works, please attach the dmesg log so we can
verify that it's doing the right thing.

Comment 26 Kyle McMartin 2010-09-10 14:44:03 UTC
Hi,

I've done a scratch build here: http://koji.fedoraproject.org/koji/taskinfo?taskID=2459617 (well, it should complete soon at least) with your patches added and the nocrs-by-default patch reverted, for the original reporters to try.

Hopefully it will help them out a bit.

regards, Kyle

Comment 27 Stefan Becker 2010-09-10 17:05:56 UTC
Sorry, with this bug closed I reported my successful test of patch V3 directly in the upstream bugzilla entry.

