Bug 637647 - pci: BAR can't assign mem
Status: CLOSED RAWHIDE
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: rawhide
Hardware: x86_64
OS: Linux
Priority: low
Severity: high
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
Reported: 2010-09-27 01:47 UTC by Horst H. von Brand
Modified: 2011-02-07 17:59 UTC (History)

Last Closed: 2011-02-07 17:59:47 UTC

Attachments
Boot log for kernel-2.6.36-0.24.rc5.git0.fc15.x86_64 (works) (77.12 KB, text/plain)
2010-10-01 02:25 UTC, Horst H. von Brand
Boot log for kernel-2.6.36-0.27.rc5.git6.fc15.x86_64 (broken) (167.96 KB, text/plain)
2010-10-01 02:26 UTC, Horst H. von Brand
Boot log for kernel-2.6.36-0.28.rc6.git0.fc15.x86_64 (broken) (159.81 KB, text/plain)
2010-10-01 02:27 UTC, Horst H. von Brand
Boot log for kernel-2.6.36-0.30.rc6.git0.fc15.x86_64 (broken) (86.63 KB, application/octet-stream)
2010-10-01 22:23 UTC, Horst H. von Brand
update iomem_resource end (1.04 KB, patch)
2010-10-10 13:02 UTC, Bjorn Helgaas
fix resource 64-bit wrap (1.27 KB, patch)
2010-10-10 13:58 UTC, Bjorn Helgaas

Description Horst H. von Brand 2010-09-27 01:47:27 UTC
Description of problem:
07:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8040T PCI-E Fast Ethernet Controller (rev 12)

kernel-2.6.36-0.27.rc5.git6.fc15.x86_64 says:

Sep 26 21:11:08 laptop14 kernel: sky2: driver version 1.28
Sep 26 21:11:08 laptop14 kernel: sky2 0000:07:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
Sep 26 21:11:08 laptop14 kernel: sky2 0000:07:00.0: unsupported chip type 0xff
Sep 26 21:11:08 laptop14 kernel: sky2 0000:07:00.0: PCI INT A disabled
Sep 26 21:11:08 laptop14 kernel: sky2: probe of 0000:07:00.0 failed with error -95

kernel-2.6.36-0.24.rc5.git0.fc15.x86_64 says:

Sep 26 21:28:28 laptop14 kernel: sky2: driver version 1.28
Sep 26 21:28:28 laptop14 kernel: sky2 0000:07:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
Sep 26 21:28:28 laptop14 kernel: sky2 0000:07:00.0: Yukon-2 FE+ chip revision 0
Sep 26 21:28:28 laptop14 kernel: sky2 0000:07:00.0: eth0: addr 00:1e:68:63:4d:74
Sep 26 21:29:05 laptop14 NetworkManager[1211]: <info> (eth0): new Ethernet device (driver: 'sky2' ifindex: 2)
Sep 26 21:29:05 laptop14 kernel: sky2 0000:07:00.0: eth0: enabling interface
Sep 26 21:29:06 laptop14 kernel: sky2 0000:07:00.0: eth0: Link is up at 100 Mbps, full duplex, flow control both

Version-Release number of selected component (if applicable):
kernel-2.6.36-0.27.rc5.git6.fc15.x86_64

How reproducible:
Tried twice

Steps to Reproduce:
1. Boot...
2.
3.
  
Actual results:
Ethernet not working

Expected results:


Additional info:

Comment 1 Stanislaw Gruszka 2010-09-29 19:13:32 UTC
We do not have any sky2 changes between 2.6.35 and 2.6.36-rc6, so this problem is caused by some other subsystem change - probably PCI.

It's hard to say what is broken. Unfortunately, some work on your side will be needed to fix this problem. You will have to compile the kernel and, if it still does not work, bisect to find the commit that broke the driver.

First, please clone the current Linus git tree:

> git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git

Then you have to build it. To install all the needed tools, run (as root):

> yum-builddep kernel

You can use the Fedora kernel config to compile the kernel. It is better to customise the config and remove unneeded options/drivers to speed up compilation, but remember that the kernel still needs to boot and run on your machine. If you don't want to customise the config, use the Fedora config as in the example below:

> $ cp /boot/config-2.6.36-0.27.rc5.git6.fc15.x86_64 linux-2.6/
> $ cd linux-2.6/
> $ make oldconfig

Then compile. To speed this up, use -j with the number of processors you have, e.g.:

> $ make -j 3

Then install (as root):

> $ make modules_install
> $ make install

Then boot the compiled kernel. If the problem is fixed, that means either some Fedora patches broke the driver or the problem was upstream and is now fixed. If the problem still occurs, bisect with "git bisect" between the last known working commit (e.g. 2.6.36-rc4) and HEAD. Bisection is described here:
http://www.kernel.org/pub/software/scm/git/docs/git-bisect.html
You need to compile and boot the kernel at every step; about 14 steps will be needed.
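
A step count like the one quoted above comes from bisection halving the candidate range on every round. Below is a minimal sketch of that arithmetic. It assumes a hypothetical helper named `bisect_steps` and illustrative commit counts; the real number of commits between the two endpoints is not stated in this report, and git bisect prints its own estimate while it runs.

```python
def bisect_steps(num_commits: int) -> int:
    """Approximate number of build-and-boot rounds needed to isolate one
    bad commit among num_commits candidates by repeated halving.

    Hypothetical helper for illustration only.
    """
    if num_commits < 1:
        raise ValueError("need at least one candidate commit")
    # Smallest k such that 2**k >= num_commits, using integer math only.
    return (num_commits - 1).bit_length()


if __name__ == "__main__":
    # Illustrative commit counts, not measured from the kernel tree.
    for n in (100, 1000, 16384):
        print(n, "commits ->", bisect_steps(n), "steps")
```

Under this model, 14 rounds are enough for a range of up to 16384 commits; a narrower range needs fewer.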

Comment 2 Horst H. von Brand 2010-09-30 13:42:14 UTC
Sorry, didn't see the above response until today. But with yesterday's kernel (kernel-2.6.36-0.28.rc6.git0.fc15.x86_64) both sky2 and iwlagn are broken the same way.

Sep 30 08:59:34 laptop14 kernel: sky2: driver version 1.28
Sep 30 08:59:34 laptop14 kernel: sky2 0000:07:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
Sep 30 08:59:34 laptop14 kernel: sky2 0000:07:00.0: unsupported chip type 0xff
Sep 30 08:59:34 laptop14 kernel: sky2 0000:07:00.0: PCI INT A disabled
Sep 30 08:59:34 laptop14 kernel: sky2: probe of 0000:07:00.0 failed with error -95

Sep 30 08:59:34 laptop14 kernel: iwlagn: Intel(R) Wireless WiFi Link AGN driver for Linux, in-tree:d
Sep 30 08:59:34 laptop14 kernel: iwlagn: Copyright(c) 2003-2010 Intel Corporation
Sep 30 08:59:34 laptop14 kernel: iwlagn 0000:08:00.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
Sep 30 08:59:34 laptop14 kernel: iwlagn 0000:08:00.0: Detected Intel(R) Wireless WiFi Link 4965AGN, REV=0xFFFFFFFF
Sep 30 08:59:34 laptop14 kernel: iwlagn 0000:08:00.0: Unknown hardware type
Sep 30 08:59:34 laptop14 kernel: iwlagn 0000:08:00.0: Unable to init EEPROM
Sep 30 08:59:34 laptop14 kernel: iwlagn 0000:08:00.0: PCI INT A disabled
Sep 30 08:59:34 laptop14 kernel: iwlagn: probe of 0000:08:00.0 failed with error -2

Comment 3 Horst H. von Brand 2010-09-30 13:54:35 UTC
BTW, I need to copy:

$ cp /boot/config-2.6.36-0.27.rc5.git6.fc15.x86_64 linux-2.6/.config

for the above recipe to work (yes, noticed too late ;-)

Comment 4 Horst H. von Brand 2010-09-30 17:29:05 UTC
Got lucky... v2.6.36-rc6-6-g4193d91 works (both sky2 and iwlagn).

Thanks!

Comment 5 Horst H. von Brand 2010-09-30 17:45:13 UTC
Annoyingly, with this kernel now the speaker beeps (for example when a bash completion isn't unique). I suppose _that_ was broken before ;-)

Comment 6 Chuck Ebbert 2010-09-30 20:55:27 UTC
This is probably caused by the PCI patches I added that are queued for 2.6.37. Does booting with "pci=nocrs" fix the problem? Can you post boot logs from older working kernels and the new failing one?

Comment 7 Horst H. von Brand 2010-10-01 02:25:30 UTC
Created attachment 450929 [details]
Boot log for kernel-2.6.36-0.24.rc5.git0.fc15.x86_64 (works)

Comment 8 Horst H. von Brand 2010-10-01 02:26:30 UTC
Created attachment 450930 [details]
Boot log for kernel-2.6.36-0.27.rc5.git6.fc15.x86_64 (broken)

Comment 9 Horst H. von Brand 2010-10-01 02:27:21 UTC
Created attachment 450931 [details]
Boot log for kernel-2.6.36-0.28.rc6.git0.fc15.x86_64 (broken)

Comment 10 Horst H. von Brand 2010-10-01 02:28:28 UTC
Tried booting with pci=nocrs, same result.

Comment 11 Stanislaw Gruszka 2010-10-01 12:57:29 UTC
Below is a link to a rawhide kernel build with the following patches removed:
> pci-v2-1-4-resources-ensure-alignment-callback-doesn-t-allocate-below-available-start.patch
> pci-v2-2-4-x86-PCI-allocate-space-from-the-end-of-a-region-not-the-beginning.patch
> pci-v2-3-4-resources-allocate-space-within-a-region-from-the-top-down.patch
> pci-v2-4-4-PCI-allocate-bus-resources-from-the-top-down.patch

http://koji.fedoraproject.org/koji/taskinfo?taskID=2506125

Does it also work?

Comment 12 Horst H. von Brand 2010-10-01 17:32:31 UTC
That one does work. Just booted it, have eth0 and wlan0.

Comment 13 Horst H. von Brand 2010-10-01 22:23:30 UTC
Created attachment 451138 [details]
Boot log for kernel-2.6.36-0.30.rc6.git0.fc15.x86_64 (broken)

kernel-2.6.36-0.30.rc6.git0.fc15.x86_64 is again broken. Just checked 2.6.36-0.30.rc6.git0.bz637647.fc15.x86_64 again, that one _does_ work (running it right now, in fact).

Comment 14 Stanislaw Gruszka 2010-10-02 17:29:44 UTC
We have not fixed the bug yet. 2.6.36-0.30.rc6.git0.bz637647 was just a test kernel to prove where the problem is. I'm not sure whether we will remove these four broken pci-v2-* patches or try to fix them (we certainly need to report the problem to the patches' author).

Comment 15 Horst H. von Brand 2010-10-03 01:56:51 UTC
OK.

Do the patches make sense each one separately? Where do they come from?

Comment 16 Horst H. von Brand 2010-10-08 13:39:38 UTC
Still the same with kernel-2.6.36-0.35.rc7.git0.fc15.x86_64. I guess it will be vanilla kernels for me from here on...

Am I the *only* one to see this?

This is a Toshiba Satellite Pro U400 notebook. It seems my Samsung N210 netbook is not affected.

Comment 17 Chuck Ebbert 2010-10-08 21:17:20 UTC
The patches that caused the problem are from here:
 https://bugzilla.kernel.org/show_bug.cgi?id=16228#c49

Comment 18 Bjorn Helgaas 2010-10-08 23:22:27 UTC
Thanks for pointing me at this bugzilla, Chuck.

Horst, could you please try a boot with the "pci=use_crs" option and attach
the dmesg log and the contents of /proc/iomem?  (The other logs look like they
came from somewhere else; they're missing the KERN_DEBUG output.)

Apparently the BIOS did configure the sky2 and iwlagn devices because the
broken kernel log shows this:

  pci 0000:07:00.0: BAR 0: trying firmware assignment [mem 0xf0200000-0xf0203fff 64bit]
  pci 0000:08:00.0: BAR 0: trying firmware assignment [mem 0xf0300000-0xf0301fff 64bit]

but left the bridge windows leading to them disabled.

The working 2.6.36-0.24 kernel assigned space for the windows from the
available area at [mem 0xc0000000-0xdfffffff] and then moved the sky2 and
iwlagn devices into the windows:

  pci 0000:00:1c.4: BAR 14: assigned [mem 0xc1000000-0xc11fffff] (a mem window)
  pci 0000:07:00.0: BAR 0: assigned [mem 0xc1000000-0xc1003fff 64bit]
  pci 0000:00:1c.5: BAR 14: assigned [mem 0xc1400000-0xc15fffff] (a mem window)
  pci 0000:08:00.0: BAR 0: assigned [mem 0xc1400000-0xc1401fff 64bit]

The broken 2.6.36-0.35 kernel failed to assign space for the bridge windows
so it left them disabled.  Disabling the windows means we can't allocate space
for the devices behind the bridge either, so we fell back to the original
BIOS assignments, which still don't work because the bridge window is still
disabled.

The question is why we couldn't allocate window space.  There should be
plenty of space available.  Maybe the /proc/iomem will have a clue.
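
The containment rule behind this analysis can be sketched in a few lines: a memory BAR behind a bridge is only reachable when the bridge's memory window is enabled and fully covers the BAR. This is a simplified model, not kernel code; `bar_reachable` is a hypothetical name, and the address ranges are the ones quoted from the logs above.

```python
from typing import Optional, Tuple

Range = Tuple[int, int]  # (start, end), both inclusive


def bar_reachable(bar: Range, window: Optional[Range]) -> bool:
    """Return True if a device BAR can be reached through a bridge window.

    `window` is None when the bridge window is disabled.
    Hypothetical helper; a simplified model of the rule, not kernel code.
    """
    if window is None:
        return False
    return window[0] <= bar[0] and bar[1] <= window[1]


if __name__ == "__main__":
    # Working kernel: window assigned on 00:1c.4, sky2 BAR 0 moved into it.
    print(bar_reachable((0xC1000000, 0xC1003FFF), (0xC1000000, 0xC11FFFFF)))
    # Broken kernel: firmware BAR assignment kept, bridge window disabled.
    print(bar_reachable((0xF0200000, 0xF0203FFF), None))
```

In this model the firmware assignment stays unreachable for as long as the window is disabled, which would be consistent with the all-ones values (chip type 0xff, REV=0xFFFFFFFF) that the drivers reported.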

Comment 19 Bjorn Helgaas 2010-10-10 13:02:54 UTC
Created attachment 452589 [details]
update iomem_resource end

One thing that's wrong is that on x86, we statically initialize iomem_resource
to [mem 0x00000000-0xffffffffffffffff] (the entire 64-bit physical address
space) and never update it based on the CPU capabilities.  My patches make us
allocate from the top-down, but of course no current x86 CPU supports a full
64-bit physical address space, so the end of that range, which we assigned
to a 1c.0 bridge window, is useless:

  pci 0000:00:1c.0: BAR 15: assigned [mem 0xffffffffffe00000-0xffffffffffffffff 64bit pref]

I don't think this patch will fix the sky2 and iwlagn problems, but at least we
shouldn't assign this useless window.
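
To illustrate why a top-down allocator lands at the very top of a 64-bit iomem_resource, here is a simplified sketch. It is not the kernel's allocator: `alloc_top_down` is a hypothetical helper that ignores existing child resources, and the 36-bit limit is only an assumed example of a CPU-imposed bound, not a value taken from this machine.

```python
from typing import Optional, Tuple


def alloc_top_down(region_start: int, region_end: int,
                   size: int, align: int) -> Optional[Tuple[int, int]]:
    """Place `size` bytes as high as possible inside an empty region.

    Bounds are inclusive and `align` must be a power of two.
    Returns (start, end) or None if the block does not fit.
    Hypothetical helper for illustration only.
    """
    if size <= 0 or size > region_end - region_start + 1:
        return None
    # Highest possible start, rounded down to the alignment.
    start = (region_end - size + 1) & ~(align - 1)
    if start < region_start:
        return None
    return (start, start + size - 1)


FULL_64BIT_END = (1 << 64) - 1     # statically initialized end
ASSUMED_36BIT_END = (1 << 36) - 1  # assumption: example CPU limit
WINDOW_SIZE = 0x200000             # 2 MiB, as in the log line above

if __name__ == "__main__":
    for end in (FULL_64BIT_END, ASSUMED_36BIT_END):
        start, last = alloc_top_down(0, end, WINDOW_SIZE, WINDOW_SIZE)
        print(f"[mem {start:#x}-{last:#x}]")
```

With the full 64-bit end, the sketch reproduces the range assigned to the 1c.0 window in the log; with a lower end, the same request lands below that bound.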

Comment 20 Bjorn Helgaas 2010-10-10 13:58:49 UTC
Created attachment 452593 [details]
fix resource 64-bit wrap

I think I see the problem.  The resource allocator doesn't handle the case
where a child ends exactly ~0, because it looks for space after the child
and computes ~0 + 1, which equals 0.  This makes it mistakenly hand out
space that may already be in use.
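
The wrap itself is plain unsigned 64-bit arithmetic. Below is a small sketch, assuming a 64-bit resource type. Python integers do not overflow, so the mask emulates the truncation. Both function names are hypothetical, and neither is the attached patch.

```python
from typing import Optional

U64_MAX = (1 << 64) - 1  # "~0" for an unsigned 64-bit type


def start_after_child(child_end: int) -> int:
    """Naive 'first address after the child', truncated to 64 bits."""
    return (child_end + 1) & U64_MAX


def start_after_child_checked(child_end: int) -> Optional[int]:
    """Same computation with an explicit guard: None means there is
    no space after a child that ends at the top of the address space."""
    if child_end == U64_MAX:
        return None
    return child_end + 1


if __name__ == "__main__":
    print(hex(start_after_child(0xC11FFFFF)))  # ordinary case
    print(hex(start_after_child(U64_MAX)))     # wraps around to 0
    print(start_after_child_checked(U64_MAX))  # guard reports no space
```

A search that resumes from the wrapped value starts again at address 0, which is how space that is already in use can be handed out a second time.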

So the previous patch probably *will* fix sky2 and iwlagn, because it
prevents the case where a resource ends at ~0 by restricting iomem_resource
to end earlier.  But we should also do something like this patch to fix
the allocator in general.

My allocator changes (the ones referenced in comment 17) haven't been merged
upstream yet, so I'll probably incorporate these two patches into the series
and repost it so that upstream never sees this problem.

It would be very helpful if we could test these two fixes on this machine
first to make sure they actually fix the problem.

Comment 21 Horst H. von Brand 2010-10-12 16:39:01 UTC
Exactly which patches should I apply? To the vanilla kernel or the Fedora patched one?

BTW, this can't be an "all 64-bit" problem: I've got an assortment of 64-bit machines and only one of them shows it. Sure, they have other eth/WiFi controllers, but the above discussion sounds like "(almost) all PCI is broken".

[I don't want to waste my/your time here, I'm quite comfortable building my own kernels and fooling around with git]

Comment 22 Bjorn Helgaas 2010-10-12 19:32:27 UTC
If you could apply the patches from comment 19 and comment 20 to the
Fedora kernel, I think that would be what we want.  I'm also going to
send you the complete updated series against upstream via email.  If
it's convenient for you to test that, that would be even better.

It's not really that all PCI is broken.  To hit this, you need these:
  - Machine old enough that we don't turn on "pci=use_crs" automatically
  - A device behind a bridge, where the BIOS left the bridge disabled

Most machines won't have the second situation, so they won't see the
problem.

Comment 23 Horst H. von Brand 2010-10-16 01:54:55 UTC
OK, applied in turn: 189182  189232  189242  189252 as discussed in comment 17, and then the patches in comments 19 and 20 to vanilla 2.6.36-rc8. Compiled clean (a bunch of warnings, unrelated AFAICS), result crashes on boot (somewhere in the read(2) system call, didn't get the whole backtrace on screen in any case).

Currently compiling plain 2.6.36-rc8. My earlier vanilla kernel was v2.6.36-rc7-199-gae42d8d, works fine.

Comment 24 Horst H. von Brand 2010-10-16 13:28:47 UTC
Sorry, Linus sneaked in a commit. The vanilla kernel I'm running now (and which I patched for comment 23) is v2.6.36-rc8-1-g8fd01d6.

Comment 25 Bjorn Helgaas 2010-10-18 20:02:21 UTC
Comment on attachment 452593 [details]
fix resource 64-bit wrap

This fix doesn't work.  Current version of this series starts here: http://marc.info/?l=linux-pci&m=128709830705469&w=2

Comment 26 Chuck Ebbert 2010-10-20 04:25:52 UTC
(In reply to comment #25)
> This fix doesn't work.  Current version of this series starts here:
> http://marc.info/?l=linux-pci&m=128709830705469&w=2

Those patches are now in kernel-2.6.36-0.41.rc8.git5.fc15

Comment 27 Horst H. von Brand 2010-10-20 13:50:52 UTC
kernel-2.6.36-0.40.rc8.git0.fc15.x86_64 does work fine here...

Comment 28 Chuck Ebbert 2010-10-22 02:07:39 UTC
(In reply to comment #27)
> kernel-2.6.36-0.40.rc8.git0.fc15.x86_64 does work fine here...

That version should have the same bug as the previous one. The only change that went in there was to use _CRS by default.

Comment 29 Bjorn Helgaas 2010-10-22 16:20:39 UTC
OK, now I'm confused :-)  From comment 26, I thought
kernel-2.6.36-0.41.rc8.git5.fc15 included the v4 patches from the series at
http://marc.info/?l=linux-pci&m=128709830705469&w=2 .

If that's true, I expect that kernel to work, because it has all the known
issues fixed.

Comment 30 Fabrice Bellet 2010-10-23 21:00:37 UTC
Something changed between 0.40 and 0.41 that broke pcmcia on my old thinkpad 770Z, and adding "resource_alloc_from_bottom" when booting 0.41 makes it work again. I put a partial dmesg diff in bz #646027

Comment 31 Horst H. von Brand 2010-11-08 11:09:39 UTC
It hasn't happened again since comment #27.

