Bug 665109

Summary: e100 problems on old Compaq Proliant DL320
Product: [Fedora] Fedora
Reporter: joshua
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED ERRATA
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium
Priority: low
Version: 14
CC: bjorn.helgaas, dwmw2, gansalmon, itamar, jonathan, kernel-maint, kmcmartin, madhu.chinakonda
Hardware: Unspecified
OS: Linux
Fixed In Version: kernel-2.6.35.14-96.fc14
Doc Type: Bug Fix
Last Closed: 2011-09-06 23:58:40 UTC
Attachments (Description / Flags):
picture of the boottime e100_probe EIP / none
sosreport from the machine / none
dmesg output from the EIP'ing F14 kernel / none
dmesg output from the rawhide kernel / none
ignore broadcom_bus.c if machine supports ACPI / none

Description joshua 2010-12-22 18:49:30 UTC
Created attachment 470284
picture of the boottime e100_probe EIP

Description of problem:

F13 i386 works perfectly fine, but F14 i386 can't initialize the e100 cards on my old Compaq Proliant DL320s.  Screenshot attached.

Version-Release number of selected component (if applicable):

Any F14 i386 kernel

How reproducible:

Try the installer for F14, or install F13 and then install any F14 kernel.

Comment 1 joshua 2010-12-22 18:51:59 UTC
Created attachment 470285
sosreport from the machine

Comment 2 joshua 2010-12-22 18:52:52 UTC
We would love to use F14 and future Fedora releases on these machines... please fix!

Comment 3 Kyle McMartin 2010-12-22 18:59:54 UTC
Try upgrading to the rawhide 2.6.37-rc6 kernel and let us know if it's resolved there?

Comment 4 Kyle McMartin 2010-12-22 19:06:43 UTC
Is there any way you can capture the full oops? The photo attached only shows the tail end of it.

Comment 6 joshua 2010-12-22 22:31:56 UTC
Ok... that didn't work, but didn't cause an EIP.

I have dmesg output from both the F14 and rawhide kernels.
dmesg.f14 contains the full EIP info that you were looking for... not sure what dmesg.rawhide contains, but it does have several lines about how it is unhappy with the PCI device base address for the e100 NICs.

Comment 7 joshua 2010-12-22 22:33:09 UTC
Created attachment 470321
dmesg output from the EIP'ing F14 kernel

Comment 8 joshua 2010-12-22 22:33:49 UTC
Created attachment 470322
dmesg output from the rawhide kernel

Comment 9 Kyle McMartin 2010-12-22 22:39:12 UTC
Thanks, if you boot with "e100.use_io=1" on the kernel cmdline does that help?

I'll pull the debuginfo and try to figure out where it's dying.

Comment 10 joshua 2010-12-23 00:24:15 UTC
e100.use_io=1 doesn't change anything for the rawhide or the F14 kernels

Comment 11 Kyle McMartin 2010-12-23 00:37:09 UTC
Well, that's odd.

Can you attach:

"sudo lspci -vvnn -s 0000:01:03.0"

(which should be the ethernet card.)

It looks like the BARs are not being set up correctly. In 2.6.35 that results in a null pointer dereference when the driver tries to use them, while 2.6.37 somehow handles it correctly...

Comment 12 joshua 2010-12-23 00:51:27 UTC
# sudo lspci -vvnn -s 0000:01:03.0
01:03.0 Ethernet controller [0200]: Intel Corporation 82557/8/9/0/1 Ethernet Pro 100 [8086:1229] (rev 08)
	Subsystem: Compaq Computer Corporation NC3163 Fast Ethernet NIC (embedded, WOL) [0e11:b134]
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 66 (2000ns min, 14000ns max), Cache Line Size: 32 bytes
	Interrupt: pin A routed to IRQ 21
	Region 0: Memory at d0200000 (32-bit, non-prefetchable) [size=4K]
	Region 1: I/O ports at b000 [size=64]
	Region 2: Memory at d0000000 (32-bit, non-prefetchable) [size=1M]
	[virtual] Expansion ROM at 40100000 [disabled] [size=1M]
	Capabilities: [dc] Power Management version 2
		Flags: PMEClk- DSI+ D1+ D2+ AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=2 PME-
	Kernel driver in use: e100
	Kernel modules: e100
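
For comparing BAR assignments between kernels, the Region lines above can be pulled out of saved lspci output mechanically. The following is only a rough sketch: parse_regions and its regular expression are hypothetical helpers written for this output format, not part of lspci or any Fedora tool.

```python
import re

# Matches lines such as:
#   Region 0: Memory at d0200000 (32-bit, non-prefetchable) [size=4K]
#   Region 1: I/O ports at b000 [size=64]
REGION_RE = re.compile(
    r"Region (\d+): (Memory|I/O ports) at ([0-9a-f]+).*\[size=(\d+)([KMG]?)\]"
)
UNITS = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_regions(lspci_text):
    """Return (index, kind, start, size_in_bytes) for each Region line."""
    regions = []
    for line in lspci_text.splitlines():
        m = REGION_RE.search(line)
        if m:
            index, kind, start, num, unit = m.groups()
            kind = "mem" if kind == "Memory" else "io"
            regions.append((int(index), kind, int(start, 16), int(num) * UNITS[unit]))
    return regions


# The three Region lines reported for 01:03.0 in this comment.
sample = """\
    Region 0: Memory at d0200000 (32-bit, non-prefetchable) [size=4K]
    Region 1: I/O ports at b000 [size=64]
    Region 2: Memory at d0000000 (32-bit, non-prefetchable) [size=1M]
"""

for index, kind, start, size in parse_regions(sample):
    print(f"BAR {index}: {kind} {start:#x}-{start + size - 1:#x}")
```

Running the same helper over lspci output captured under each kernel would show whether the BARs themselves differ or only the kernel's handling of them.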

Comment 13 Kyle McMartin 2010-12-23 00:51:57 UTC
Ah, never mind, I can see that in the dmesg from 2.6.37... Looks like it can't claim the resources and then decides to fail. I'll try to fix 2.6.35 so that it doesn't misbehave and at least handles it the way .37 does...

If you boot with pci=use_crs does your e100 work? What about pnpacpi=off?

Comment 14 joshua 2010-12-23 01:49:02 UTC
pnpacpi=off doesn't change anything for either kernel (rawhide or F14), but pci=use_crs does in fact do the trick for both!

Is this a "bug", or a "feature"?  It would be nice to have things just work in F14 without having to specify this... like in F13

Comment 15 Kyle McMartin 2010-12-23 02:40:49 UTC
Yeah, we don't enable CRS on BIOSes older than 2008 because they were buggy (I assume).

I'll point Bjorn at this, hopefully he can shed some light on it.

Comment 16 Bjorn Helgaas 2010-12-23 18:58:23 UTC
I'm on vacation until Jan 4, but will look in more detail then.

Ugh, what a mess.  I think arch/x86/pci/broadcom_bus.c is screwing
things up here.  IIRC, that was added for machines where we don't
have ACPI, so it was the best we could figure out.  But in this
case, we *do* have ACPI, and it's telling us more reliable stuff
than broadcom_bus.c is.

We don't enable pci=use_crs automatically on machines before 2008
just out of paranoia about the reliability of ACPI _CRS.  But it's
only fear, not any real data, behind that date.

We could easily add a quirk (see arch/x86/pci/acpi.c) to turn it
on for this specific machine.  Or, if we felt daring, we could adjust
or remove that 2008 date for enabling pci=use_crs automatically.  It
would be nice to have some data showing that Windows relies on it
on boxes this old (e.g., an Everest report or something).
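
As a rough illustration of the policy described in this comment, here is a simplified model of the decision. It is not the code in arch/x86/pci/acpi.c; the function names, the quirk_match flag, the example BIOS year, and the handling of an unknown BIOS date are all assumptions made for the sketch. pci=use_crs and pci=nocrs are the real kernel parameters.

```python
def pci_options(cmdline):
    """Collect the comma-separated options of every pci= parameter."""
    opts = set()
    for param in cmdline.split():
        if param.startswith("pci="):
            opts.update(param[len("pci="):].split(","))
    return opts


def use_crs(bios_year, cmdline, quirk_match=False):
    """Simplified model: should host bridge windows come from ACPI _CRS?"""
    opts = pci_options(cmdline)
    # Explicit command-line requests win over the heuristics.
    if "nocrs" in opts:
        return False
    if "use_crs" in opts:
        return True
    # A per-machine quirk could turn _CRS on for a known-good old box.
    if quirk_match:
        return True
    # Assumption: an unknown BIOS date (None) is treated like a new BIOS.
    if bios_year is not None and bios_year < 2008:
        return False
    return True


# 2001 is a made-up BIOS year standing in for a pre-2008 machine.
print(use_crs(2001, "ro quiet"))             # default on an old BIOS
print(use_crs(2001, "ro quiet pci=use_crs")) # the workaround from comment 14
```

Adjusting or removing the date cutoff would amount to changing the last comparison; a quirk for this machine would amount to passing quirk_match=True when its DMI data matches.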

Comment 17 joshua 2010-12-24 14:52:08 UTC
Interesting.  I can say that F13, which apparently doesn't have this phobia about old BIOSes, works perfectly right out of the box, though I can't provide over-arching statistics about all/most/other older machines.

Comment 18 Kyle McMartin 2010-12-24 14:58:47 UTC
The broadcom_bus.c file didn't get introduced until after 2.6.34 was released. :/

Comment 19 Bjorn Helgaas 2011-01-04 22:45:28 UTC
Created attachment 471767
ignore broadcom_bus.c if machine supports ACPI

The problem is that broadcom_bus.c discovers bogus windows on this machine (it doesn't know how to discover I/O windows, and it looks like it should ignore the upper 32 bits of some of the mem windows).  This machine is older than 2008, so we don't use ACPI _CRS information, and bus_numa.c uses the faulty information from broadcom_bus.c.
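
To make the effect of a bogus window concrete, here is a small sketch of a containment check. The window values are invented for illustration (the real values are in the attached dmesg output), and contains/claimable are hypothetical helpers, not kernel functions. Only the BAR address and size come from the lspci output in comment 12.

```python
def contains(window, start, size):
    """True if [start, start + size) lies entirely inside the window."""
    lo, hi = window
    return lo <= start and start + size - 1 <= hi


def claimable(windows, start, size):
    """A BAR can be claimed only if some host bridge window contains it."""
    return any(contains(w, start, size) for w in windows)


# Region 0 of the e100 NIC from comment 12: 4K at 0xd0200000.
BAR0 = (0xd0200000, 4 * 1024)

# Hypothetical mem window with garbage in the upper 32 bits.
bogus = [(0x0000_00ff_d000_0000, 0x0000_00ff_dfff_ffff)]

# Ignoring the upper 32 bits yields a window that covers the BAR.
masked = [(lo & 0xffffffff, hi & 0xffffffff) for lo, hi in bogus]

print(claimable(bogus, *BAR0))   # False
print(claimable(masked, *BAR0))  # True
```

With windows like the bogus one, a correctly programmed BAR looks like it sits outside the host bridge's range, which would fit the "can't claim" behavior seen with 2.6.37.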

We're in the gray area of pre-2008 machines with ACPI.  I think the possibilities are:

  1) Fix broadcom_bus.c.  We don't have documentation to do a complete job of this, and there's no reason to think the result would be better than using the generic ACPI driver.

  2) Ignore _CRS and make broadcom_bus.c do nothing.  This gets us back to the working situation of F13.  Since we don't have any host bridge information, things like PCI hotplug and option ROM mapping may not work, but that's the way it's always been on these boxes.

  3) Turn on _CRS, either just for this machine with a DMI quirk or for a whole class of machines.  This should make hotplug work, but changing lots of machines is risky.

I think (2) is the safest, and it will likely fix other CNB20LE-based systems as well.  That's what this patch should do.
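
The idea behind option (2) can be sketched as follows. This is a model of the approach only, not the contents of the attached patch; host_bridge_windows, acpi_available, and native_probe are hypothetical names.

```python
def host_bridge_windows(acpi_available, native_probe):
    """Option (2): skip the native CNB20LE probe when the BIOS offers ACPI.

    Returns the list of windows to record for the host bridge. An empty
    list means no host bridge information is recorded, which matches the
    F13 situation described above.
    """
    if acpi_available:
        # Leave host bridge discovery to ACPI (or to nothing, on a
        # pre-2008 BIOS where _CRS is ignored).
        return []
    # No ACPI: the native probe is the best information available.
    return native_probe()


def fake_native_probe():
    # Stand-in for the chipset-specific probe; the value is made up.
    return [(0xd0000000, 0xdfffffff)]


print(host_bridge_windows(True, fake_native_probe))   # []
print(host_bridge_windows(False, fake_native_probe))  # [(3489660928, 3758096383)]
```

The trade-off stated in the list above still applies: with no host bridge information, features such as PCI hotplug and option ROM mapping may not work on these machines.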

Comment 20 Josh Boyer 2011-08-31 17:44:50 UTC
The patch in comment #19 went upstream in 2.6.38 with commit 30e664afb5cb597dd6f7651e6d116e10b9741084

Joshua, are you still using F14 or have you moved on to f15/f16 at this point?

Comment 21 joshua 2011-08-31 17:58:45 UTC
I'm using Fedora 15 now, which doesn't need the pci=use_crs workaround.

That said, shouldn't F14 pick up this change from F15, since we know F15 works on these older systems, just like F13 did?

Comment 22 Josh Boyer 2011-08-31 18:09:14 UTC
(In reply to comment #21)
> I'm using Fedora 15 now, which doesn't need the pci=use_crs work around.

Excellent.

> That said, shouldn't F14 pick up this change from F15, since we know F15 works
> on these older systems, just like F13 did?

Yes.  I have it prepped to go into the F14 kernel.  I asked to make sure the upstream commit worked for you.  Thanks for letting us know!

Comment 23 Fedora Update System 2011-09-01 15:24:49 UTC
kernel-2.6.35.14-96.fc14 has been submitted as an update for Fedora 14.
https://admin.fedoraproject.org/updates/kernel-2.6.35.14-96.fc14

Comment 24 Fedora Update System 2011-09-02 05:29:43 UTC
Package kernel-2.6.35.14-96.fc14:
* should fix your issue,
* was pushed to the Fedora 14 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing kernel-2.6.35.14-96.fc14'
as soon as you are able to, then reboot.
Please go to the following url:
https://admin.fedoraproject.org/updates/kernel-2.6.35.14-96.fc14
then log in and leave karma (feedback).

Comment 25 joshua 2011-09-06 18:05:35 UTC
Yes, this works on my old servers without the need for pci=use_crs.

Thank you!

Comment 26 Fedora Update System 2011-09-06 23:57:45 UTC
kernel-2.6.35.14-96.fc14 has been pushed to the Fedora 14 stable repository.  If problems still persist, please make note of it in this bug report.