Red Hat Bugzilla – Bug 620313
2.6.34 kernel needs "pci=nocrs" option to boot on Dell Precision T3500
Last modified: 2010-09-10 13:05:56 EDT
Description of problem:
I installed kernel-184.108.40.206-29.fc13.x86_64 from Koji to verify if bug #609764 has been corrected. Boot fails with the message "Boot has failed, sleeping forever.". I didn't update any other package before the test.
Version-Release number of selected component (if applicable):
How reproducible: Always
I removed "rhgb quiet" from the kernel command line and see:
- KMS gets initialized, console changed
- USB devices are initialized
- USB card reader devices are initialized
The message is generated by dracut's "init" script, so the initrd is reached.
Can you try some of the dracut debug options on the command line? Look at the dracut manpage under KERNEL COMMAND LINE.
From the debugging output I can see that AHCI fails to initialize the controller. As a result, udev can't find the SATA disks, so it never finds the 2 physical partitions, sets up LVM, or creates /dev/mapper/... for the root device. The rdinit script therefore bails out after the last iteration of the "initqueue" phase.
AHCI messages in dmesg:
ahci: 0000:00:1f.2: Version 3.0
ahci: 0000:00:1f.2: PCI INT C -> GSI 20 (level, low) -> IRQ 20
ahci: 0000:00:1f.2: irq 53 for MSI/MSI-X
ahci: SSS flag set, parallel bus scan disabled
ahci: 0000:00:1f.2: controller reset failed (0xffffffff)
ahci: 0000:00:1f.2: PCI INT C disabled
ahci: probe of 0000:00:1f.2 failed with error -5
Based on the error message Google found these:
It talks about a Dell Precision T3400. I get the same "no compatible bridge window for " error message in dmesg. With the suggested "pci=nocrs" the machine boots up fine.
As they talk about a possible BIOS bug: my T3500 has the latest BIOS version A07 (15-Apr-2010) installed.
The kernel bug report indicates that this should be fixed for 2.6.34. Maybe you need to check if all of the mentioned patches actually made it to .34?
The kernel developer pointed out that the bug report I found is not the corrected one, but a related one. I've added the BKO URL.
"Me Too"! -- 220.127.116.11-47.fc13.x86_64 just became available via the normal (non-testing) channels, and I seem to have the same symptoms as you: a brief flash that looks like AHCI, then a repeating identical warning about etc/modprobe.conf, then the message about "boot failed, sleeping forever".
I'm crossing my fingers that when I reboot (after submitting this), that the pci=nocrs option cures things.
Given that this is no longer in "testing", I think the severity and priority should be bumped way up.
I too have a Dell PW T4500, with the latest BIOS.
Whoops, that's a Dell T3500, getting way past my bedtime.
Well pci=nocrs option worked for me. Whew :-)
Does anybody know if there are any worrisome consequences of this option? Performance, behavior quirks, whatever. I am using fakeraid with the ICH10R chipset + md + LVM.
I guess I'm really asking how much risk pci=nocrs introduces versus the alternative of reverting to the previous kernel. Does it just disable some buggy optimizations, or is there an uglier story to using it?
pci=nocrs just reverts the default method for finding PCI resources back to what was used in 2.6.33
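For anyone who wants to confirm which of these settings a given boot actually used, here is a minimal Python sketch that pulls the "pci=" options out of a kernel command line. The helper name pci_options and the sample command line are hypothetical; the only assumption about the kernel is that "pci=" takes a comma-separated list of options.

```python
def pci_options(cmdline: str) -> set:
    """Collect every option passed via pci=... on a kernel command line."""
    opts = set()
    for token in cmdline.split():
        if token.startswith("pci="):
            # pci= accepts a comma-separated list, e.g. pci=nocrs,noacpi
            opts.update(o for o in token[len("pci="):].split(",") if o)
    return opts


# Hypothetical command line for illustration only.
sample = "ro root=/dev/mapper/vg-root pci=nocrs"
print("nocrs" in pci_options(sample))
```

On a running system the same function can be fed the contents of /proc/cmdline instead of the sample string.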
Created attachment 443074 [details]
allocate from top of window
I'm pretty sure this is the same as https://bugzilla.kernel.org/show_bug.cgi?id=16228. This patch is a partial fix for that, but hasn't been tested yet.
Created attachment 443122 [details]
allocate from top of window (v2)
Previous patch is broken, please try this instead.
I tried this patch with a vanilla 2.6.34 kernel, but no go.
The error message I'm getting (without the patch) is:
pci_root PNP0A03:00: address space collision: host bridge window [mem 0xbfffffff - 0xdfffffff] conflicts with PCI bus 0000:00 [mem 0xc0000000 - 0xffffffff]
This error occurs before KMS is initialised, so I'm guessing before the initrd is loaded.
I am getting this issue with a Dell Inspiron M501R laptop and need to use the pci=nocrs option to boot as well. The issue is also present in the 2.6.35 and 2.6.36 vanilla kernels that I tried; it started happening in the 2.6.34-rc1 kernel.
My lshw outputs from both 2.6.33 and 2.6.34 are attached, in case they are needed.
Created attachment 443202 [details]
LSHW output for 2.6.33 on M501R
Created attachment 443203 [details]
LSHW output for 2.6.34 on M501R with pci=nocrs
Pramod, I think you are actually seeing https://bugzilla.kernel.org/show_bug.cgi?id=17011, which is actually a different issue. I'd
like to keep them separate because the fixes will be quite different.
In this bug (bug 620313 and https://bugzilla.kernel.org/show_bug.cgi?id=16228),
the first symptom is this message:
pci 0000:00:1f.2: no compatible bridge window for [mem 0xff970000-0xff9707ff]
This means BIOS left the 1f.2 device (AHCI) somewhere that's outside all
the host bridge windows. Linux moves it into a window, but the new location
doesn't work. The result? AHCI is broken, and all other devices work.
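To make the "no compatible bridge window" condition concrete, here is a minimal sketch in Python (not the kernel's actual C code) of the containment test it implies. The BAR range is taken from the message above; the window list is hypothetical and is not this machine's real _CRS data.

```python
from typing import List, Optional, Tuple

Range = Tuple[int, int]  # inclusive (start, end) addresses


def find_window(bar: Range, windows: List[Range]) -> Optional[Range]:
    """Return the first window that fully contains the BAR, else None."""
    start, end = bar
    for w_start, w_end in windows:
        if w_start <= start and end <= w_end:
            return (w_start, w_end)
    return None


# Hypothetical host bridge windows, for illustration only.
windows = [(0x000A0000, 0x000BFFFF), (0xC0000000, 0xDFFFFFFF)]

# The AHCI BAR location left by BIOS, from the dmesg line above.
ahci_bar = (0xFF970000, 0xFF9707FF)

if find_window(ahci_bar, windows) is None:
    print("no compatible bridge window for [mem %#x-%#x]" % ahci_bar)
```

When no window contains the BAR, the device has to be moved into one, which is the step that goes wrong in this bug.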
In https://bugzilla.kernel.org/show_bug.cgi?id=17011, the first symptom
is a message like this:
pci_root PNP0A03:00: address space collision: host bridge window [mem 0xafffffff-0xdfffffff] conflicts with PCI Bus 0000:00 [mem 0xb0000000-0xffffffff]
This means BIOS reported two host bridge windows that overlap. In 17011,
and I suspect in your case Pramod, the real problem is that there is a
third window that's completely contained in one of the first ones. That
causes trouble like this later when we claim the PCI device resources:
pci 0000:00:01.0: address space collision: [mem 0xff300000-0xff4fffff] conflicts with PCI Bus 0000:00 [mem 0xf0000000-0xffffffff]
pci 0000:00:06.0: address space collision: [mem 0xff600000-0xff6fffff] conflicts with PCI Bus 0000:00 [mem 0xf0000000-0xffffffff]
This may result in many broken devices.
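As a rough illustration of the 17011 pattern, the Python sketch below classifies how pairs of windows relate to each other. The ranges are copied from the sample messages above, which may not all come from the same machine, and the labels A, B and C are made up for this example.

```python
from itertools import combinations


def relation(a, b):
    """Classify two inclusive ranges: disjoint, contains, inside or overlap."""
    (a0, a1), (b0, b1) = a, b
    if a1 < b0 or b1 < a0:
        return "disjoint"
    if a0 <= b0 and b1 <= a1:
        return "contains"
    if b0 <= a0 and a1 <= b1:
        return "inside"
    return "overlap"


# Illustrative labels; ranges taken from the sample messages above.
windows = {
    "A": (0xAFFFFFFF, 0xDFFFFFFF),
    "B": (0xB0000000, 0xFFFFFFFF),
    "C": (0xF0000000, 0xFFFFFFFF),
}

for (name_a, a), (name_b, b) in combinations(windows.items(), 2):
    print(name_a, name_b, relation(a, b))
```

Here A and B partially overlap, which matches the host bridge collision message, while C sits entirely inside B, which is the kind of contained third window that later makes device resources collide.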
If my assessment doesn't seem right, Pramod, is there any chance you could
collect a console log with "ignore_loglevel" via serial console or netconsole
or video? I think that would have enough information to tell for sure.
Applied patch V2 to kernel-18.104.22.168-54.fc13 and now my T3500 boots up fine without pci=nocrs.
I think you're right Bjorn, the linux kernel bug report looks a lot like my one. I'll turn over to that.
We're going to disable _CRS by default in the next 2.6.34 update for F13, but leave it on in F14 (2.6.35) for now.
kernel-22.214.171.124-54.fc13 has been submitted as an update for Fedora 13.
(In reply to comment #14)
> Applied patch V2 to kernel-126.96.36.199-54.fc13 and now my T3500 boots up fine
> without pci=nocrs.
Fetched the wrong kernel SRPM from koji. That one had "pci=nocrs" as default and therefore worked fine.
Retested patch V2 with earlier kernel and this kernel (with pci=use_crs). In both cases the kernel *does not* boot. That means the patch doesn't help. BKO seems to have technical problems right now, so I can't update the kernel bug report yet.
Created attachment 443797 [details]
Stefan and Charles, I'm very sorry for the inconvenience this problem is
causing you, and I really appreciate your testing efforts. The ideal
thing would be if somebody had a serial console or netconsole setup and
could collect the kernel output with this patch and "pci=use_crs
ignore_loglevel". I know that's a hassle to set up; the next best thing
would be a digital photo of the console with "pci=use_crs ignore_loglevel".
Created attachment 445764 [details]
screenshot showing erroneous allocation
This screenshot contains the information I requested in the previous
find_resource: try [mem 0xbff00000-0xbfffffff] before [mem 0xc0000000-0xdfffffff] PCI Bus 0000:02
pci 0000:00:1f.2: BAR 5: assigned [mem 0xbffff800-0xbfffffff]
This shows that we allocated from the top of the region, but since
we're still looking at available regions bottom-up, we found a region
that doesn't work (I suspect the region we allocated is really RAM).
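The arithmetic can be reproduced with a small Python sketch. It is a simplified model, not the kernel's find_resource code, and it rests on three assumptions: the two ranges in the log line are treated as the candidate regions, the BAR is 0x800 bytes (inferred from the assigned range), and it is aligned to its own size.

```python
def alloc_top_down(region, size, align):
    """Place size bytes at the highest aligned address in an inclusive region."""
    start, end = region
    base = (end + 1 - size) & ~(align - 1)
    if base < start:
        return None  # region too small
    return (base, base + size - 1)


def first_fit(regions, size, align):
    """Return the allocation from the first region, in the given order, that fits."""
    for region in regions:
        found = alloc_top_down(region, size, align)
        if found is not None:
            return found
    return None


# Candidate regions from the find_resource line above (assumed interpretation).
regions = [(0xBFF00000, 0xBFFFFFFF), (0xC0000000, 0xDFFFFFFF)]
size = 0x800

# Regions walked bottom-up, allocation top-down inside the first one that fits.
bottom_up = first_fit(regions, size, size)
# Regions walked top-down as well.
top_down = first_fit(list(reversed(regions)), size, size)

print("bottom-up walk: [mem %#x-%#x]" % bottom_up)
print("top-down walk:  [mem %#x-%#x]" % top_down)
```

The bottom-up walk reproduces the assigned range from the log, [mem 0xbffff800-0xbfffffff], while walking the regions from the top lands inside [mem 0xc0000000-0xdfffffff] instead.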
kernel-188.8.131.52-54.fc13 has been pushed to the Fedora 13 testing repository. If problems still persist, please make note of it in this bug report.
If you want to test the update, you can install it with
su -c 'yum --enablerepo=updates-testing update kernel'. You can provide feedback for this update here: https://admin.fedoraproject.org/updates/kernel-184.108.40.206-54.fc13
Doesn't seem to be available yet:
$ uname -a
Linux hpc16.home 220.127.116.11-47.fc13.x86_64 #1 SMP Fri Aug 27 08:56:01 UTC 2010 x86_64 x86_64 x86_64 GNU/Linux
$ yum --enablerepo=updates-testing update kernel
Loaded plugins: presto, priorities, refresh-packagekit
Setting up Update Process
No Packages marked for Update
Created attachment 445977 [details]
screenshot from my T3500 with debug patch applied
(In reply to comment #19)
I applied the debug patch and took the attached screenshot
kernel-18.104.22.168-54.fc13 has been pushed to the Fedora 13 stable repository. If problems still persist, please make note of it in this bug report.
Created attachment 446424 [details]
alloc top-down (v3)
Stefan and Charles, here's a more thorough patch that I think should
fix the problem. I know kernel-22.214.171.124-54.fc13 has papered over the
problem by turning off "pci=use_crs", but that's only a short-term
So if anybody has a chance to test this patch (don't forget to use
"pci=use_crs" to make sure we're exercising this path), I'd really
appreciate it. If it works, please attach the dmesg log so we can
verify that it's doing the right thing.
I've done a scratch build here: http://koji.fedoraproject.org/koji/taskinfo?taskID=2459617 (well, it should complete soon at least.) with your patches added, and the nocrs-by-default patch reverted for the original reporters to try.
Hopefully it will help them out a bit.
Sorry, with this bug closed I reported my successful test of patch V3 directly in the upstream bugzilla entry.