Bug 214050

Summary: /proc/bus/pci/devices missing entry (was Xorg PCI scan misses video card)
Product: [Fedora] Fedora
Reporter: Charles Butterfield <cb20777>
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED CURRENTRELEASE
Severity: urgent
Priority: medium
Version: 6
CC: ajax, jim.cornette
Hardware: i386
OS: Linux
Fixed In Version: 2.6.20-1.2944.fc6
Doc Type: Bug Fix
Last Closed: 2007-03-28 18:22:40 UTC
Attachments (description; flags):
- Zip file of key text files: (xorg.conf, Xorg.0.log, scanpci, Xorg-scanpci, strace, etc); flags: none
- viewing the devices file in /proc/bus/pci; flags: none
- dmesg output after reboot with failing Xorg server; flags: none

Description Charles Butterfield 2006-11-05 05:19:06 UTC
Description of problem: After upgrading from FC5 to FC6, my integrated ATI Rage
XL is no longer detected, with Xorg failing with "no device found".  Some clues:

- worked in FC5, not in FC6
- scanpci shows my video device
- Xorg -scanpci does NOT show my video device
- video is the last one enumerated by scanpci.  Perhaps a new off-by-one error?
Or perhaps the previous device terminated the scan?
- strace indicated a brute force scan of all possible pci devices which stops
abruptly at /proc/bus/pci/03/03.0, while my device is the next (and last) at
/proc/bus/pci/03/0e.0
- See attached zip file containing useful output (xorg.conf, Xorg.0.log, scanpci,
Xorg-scanpci, strace, etc)
- My hardware: Dell PowerEdge 700, with integrated ATI Rage XL

Version-Release number of selected component (if applicable):
xorg-x11-server-Xorg-1.1.1-47.fc
xorg-x11-server-utils-7.1-4.fc6
xorg-x11-drv-ati-6.6.2-4.fc


How reproducible: Totally (on my machine)


Steps to Reproduce: Visit me :-)

Comment 1 Charles Butterfield 2006-11-05 05:19:06 UTC
Created attachment 140379 [details]
Zip file of key text files: (xorg.conf, Xorg.0.log, scanpci, Xorg-scanpci, strace, etc)

Comment 2 Charles Butterfield 2006-11-05 21:16:35 UTC
Also submitted to freedesktop.org as bug number 8894, with more recent clues. 
See: https://bugs.freedesktop.org/show_bug.cgi?id=8894

I could NOT seem to add that reference to this ticket via the "External Bugzilla
References" sub-form herein, so I'm just doing it as a manual comment.

Comment 3 Charles Butterfield 2006-11-07 04:10:01 UTC
Okay, based on a debugging trail (logged in the freedesktop.org bugzilla #8894), it
has become apparent that this is really not an Xorg bug, but a bug in the
/proc/bus/pci logic (kernel?).  So I'm going to try to reassign this. Here is
the AHA entry from my freedesktop.org bug report.

------------------------------------------------------------------

Hmmm.  Looks like an OS issue.  After stepping through the Xorg scan with gdb I
noticed that we are stopping after scanning 14 PCI devices, although my video
card is the 15th (and last).

It turns out that there is a mismatch between the contents of
/proc/bus/pci/devices (14 devices) and the nodes in /proc/bus/pci/xx/* (which
number 15).

The device missing from /proc/bus/pci/devices is /proc/bus/pci/00/06.0.
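
A mismatch like this can be checked mechanically. The following is a minimal sketch, not something taken from the attachments: it assumes a single PCI domain (bus directories named with two hex digits) and that the first column of each line in /proc/bus/pci/devices is the bus number followed by the devfn byte, both in hex. The function names are made up for illustration.

```python
import os
import re


def listed_devices(devices_text):
    """Names ('bb/dd.f') of the devices listed in /proc/bus/pci/devices text."""
    names = set()
    for line in devices_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        busdevfn = int(fields[0], 16)          # high byte: bus, low byte: devfn
        bus, devfn = busdevfn >> 8, busdevfn & 0xFF
        names.add("%02x/%02x.%x" % (bus, devfn >> 3, devfn & 7))
    return names


def node_devices(root):
    """Names ('bb/dd.f') of the per-device nodes found under root/xx/."""
    names = set()
    for bus in os.listdir(root):
        bus_dir = os.path.join(root, bus)
        # Assumption: single-domain layout, so bus directories look like "00".
        if not (re.fullmatch(r"[0-9a-f]{2}", bus) and os.path.isdir(bus_dir)):
            continue
        for node in os.listdir(bus_dir):
            if re.fullmatch(r"[0-9a-f]{2}\.[0-7]", node):
                names.add("%s/%s" % (bus, node))
    return names


def missing_from_devices_file(root="/proc/bus/pci"):
    """Nodes that exist under root/xx/ but have no line in root/devices."""
    with open(os.path.join(root, "devices")) as f:
        listed = listed_devices(f.read())
    return sorted(node_devices(root) - listed)
```

Calling missing_from_devices_file() on an affected machine should name any node that has no matching line; on a healthy system it should return an empty list.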

Xorg is getting a count of PCI devices by counting the lines in
/proc/bus/pci/devices (function xf86OSLinuxGetPciDevs in lnx_pci.c).  Since this
file is missing one PCI device, the subsequent scan stops prematurely, which is
only a problem if your video device is the last one.  Mine is.
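
To make the failure mode concrete, here is a toy model of a scan that is bounded by a device count. It is only a sketch of the behaviour described above, not the actual lnx_pci.c code; the function name and the reduced four-device list are invented for illustration (the real machine has 15 devices).

```python
def count_bounded_scan(candidates, present, expected_count):
    """Probe candidates in order and stop once expected_count devices are found."""
    found = []
    for name in candidates:
        if len(found) >= expected_count:
            break                      # the scan believes it has seen everything
        if name in present:
            found.append(name)
    return found


# Nodes that really exist, in probing order; the video device is last.
nodes = ["00/00.0", "00/06.0", "03/03.0", "03/0e.0"]

# The devices file is one line short (00/06.0 is not listed), so the count is 3.
short = count_bounded_scan(nodes, set(nodes), 3)
print(short)    # ['00/00.0', '00/06.0', '03/03.0']

# With a correct count of 4 the video device is reached.
full = count_bounded_scan(nodes, set(nodes), 4)
print(full)     # ['00/00.0', '00/06.0', '03/03.0', '03/0e.0']
```

Note that the device lost by the scan (03/0e.0) is not the one missing from the list (00/06.0): an undercount anywhere costs whichever device happens to be probed last.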

So there is clearly an OS problem, which I will try to figure out how to submit.
Can anybody suggest where?  For that matter, is anybody reading this stuff?
Maybe a triage team?  Some feedback would make me feel less lonely :-)



Comment 4 Charles Butterfield 2006-11-07 04:16:15 UTC
I'm guessing at the assignment. Changing the category did NOT change the
assignment from the original auto-generated value of X/OpenGL maintenance, which
seems to be quite wrong given the change in component.

Comment 5 Charles Butterfield 2006-11-07 04:19:30 UTC
And I'm trying for a better Summary.  Sorry for not doing this all in one change.

Comment 6 Jim Cornette 2006-11-15 03:35:33 UTC
I don't know if you have mc installed, but you can run lspci |grep VGA to get
the video card's PCI ID. Afterwards you can open the devices file with F4 to
view the file details.

I have two video cards
 lspci |grep VGA
01:00.0 VGA compatible controller: nVidia Corporation NV5M64 [RIVA TNT2 Model
64/Model 64 Pro] (rev 15)
02:00.0 VGA compatible controller: Matrox Graphics, Inc. MGA G200 (rev 01)

See attachment for screenshot. I take it 0100 is the NV and 0200 is the Matrox
card information. 
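
For reference, the leading field of each line in the devices file can be decoded as follows. This is a small sketch assuming the usual layout (first column: bus byte then devfn byte in hex; second column: vendor ID then device ID in hex). The sample line and its device ID are illustrative values, not copied from the attachment, and the function names are made up.

```python
def decode_devices_line(line):
    """Split one /proc/bus/pci/devices line into bus, slot, function and IDs."""
    fields = line.split()
    busdevfn = int(fields[0], 16)
    vendor_device = int(fields[1], 16)
    devfn = busdevfn & 0xFF
    return {
        "bus": busdevfn >> 8,
        "slot": devfn >> 3,          # upper five bits of devfn
        "function": devfn & 0x7,     # lower three bits of devfn
        "vendor": vendor_device >> 16,
        "device": vendor_device & 0xFFFF,
    }


def lspci_style(entry):
    """Format a decoded entry the way lspci prints addresses, e.g. '01:00.0'."""
    return "%02x:%02x.%x" % (entry["bus"], entry["slot"], entry["function"])


sample = decode_devices_line("0100\t10de002d\t10")   # illustrative line
print(lspci_style(sample))                           # 01:00.0
```

So a leading 0100 corresponds to 01:00.0 in lspci output, and 0200 to 02:00.0.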

Sorry for the interruption. Hopefully someone who knows what the heck they are
doing will reply to this bug. Mention on the list should have added visibility.

Comment 7 Jim Cornette 2006-11-15 03:39:19 UTC
Created attachment 141216 [details]
viewing the devices file in /proc/bus/pci

You might find the F4 edit feature for mc useful to read content. Maybe!

Your bug seems to be far out of my knowledge base.

Comment 8 Dave Jones 2006-11-15 04:09:37 UTC
please attach output of dmesg


Comment 9 Charles Butterfield 2006-11-15 05:22:03 UTC
Created attachment 141225 [details]
dmesg output after reboot with failing Xorg server

Here is the dmesg output.

Sorry for the delay.  A few days ago I built a modified Xorg that just bumped
the device count by one as a totally crude workaround for the pci scan problem.
I thought it prudent to roll back to the nominal Xorg prior to generating your
dmesg listing (just in case it affected anything you were looking for).

It's way past bedtime, so goodnight :-)

Comment 10 Charles Butterfield 2006-11-17 05:14:38 UTC
I think the previously attached files clearly indicate a bug in the pci related
processing for the procfs filesystem (as indicated by the fact that
/proc/bus/pci/devices contains a different number of PCI devices than the
/proc/bus/pci/xx/* device entries).

1) Is there any reason not to just report this upstream?  That is, is there any
reason to suspect this is some bug added by Fedora customization of the
associated code?

2) Am I correct in assuming that the distro maintainers are the proper
gatekeepers for submitting kernel bugs?  If not, should I just go ahead and
submit this stuff myself?

Comment 11 Charles Butterfield 2006-11-26 05:56:32 UTC
Significant Update - I finally figured out how to download a vanilla 2.6.18.1
kernel from www.kernel.org and build it.  The big surprise was that in this
vanilla kernel, there is NO mismatch between /proc/bus/pci/devices and
/proc/bus/pci/xx/*.

Wow!  So it seems like the Fedora kernel mods may well be the culprit.  However,
at this point I'm totally unsure of how to proceed.  There are a tremendous
number of differences between a vanilla 2.6.18.1 kernel and the Fedora
2.6.18-1.2849.fc6 kernel.

Any suggestions on next steps?

Comment 12 Charles Butterfield 2006-12-06 00:42:34 UTC
New conclusion:  It appears the bug is associated with the CONFIG_EXPERIMENTAL
flag in the stock 2.6.18 kernel.

Details: I rebuilt the vanilla 2.6.18 kernel two ways:
1) With FC6 .config file (which sets EXPERIMENTAL=y) - this manifests the bug
2) With FC6 .config file, (BUT setting EXPERIMENTAL undefined) - no bug!
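
When comparing two builds like these, it helps to confirm what each .config actually says. The following sketch reads an option from kernel .config text; the function name is invented, and it relies only on the standard convention that enabled options appear as CONFIG_NAME=value and disabled ones as a "# CONFIG_NAME is not set" comment.

```python
def config_value(config_text, option):
    """Return the value of CONFIG_<option> in .config text, or None if not set."""
    key = "CONFIG_" + option + "="
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith(key):
            return line[len(key):]
    return None   # absent, or present only as '# CONFIG_<option> is not set'


build_1 = "CONFIG_EXPERIMENTAL=y\nCONFIG_PCI=y\n"
build_2 = "# CONFIG_EXPERIMENTAL is not set\nCONFIG_PCI=y\n"

print(config_value(build_1, "EXPERIMENTAL"))   # y
print(config_value(build_2, "EXPERIMENTAL"))   # None
```

In practice the text would come from the .config file at the top of each kernel build tree.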

So, it's an upstream bug.  Could somebody please suggest what the correct next
step is?

Comment 13 Adam Jackson 2006-12-18 17:36:27 UTC
Try building with EXPERIMENTAL set but without the 82875 EDAC driver (should be
CONFIG_EDAC_82875P).  I suspect it's doing something unpleasant that ends up
hiding it from /proc/bus/pci/devices.
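
One way to prepare such a build is to turn the single option off in the .config text before rebuilding. This is a rough sketch with an invented helper name; the option name is taken as a parameter because the exact symbol for the driver should be checked against the kernel's own Kconfig rather than assumed. After editing, running make oldconfig lets the kernel build system reconcile any dependent options.

```python
def disable_option(config_text, option):
    """Return .config text with CONFIG_<option> marked as not set."""
    key = "CONFIG_" + option
    out = []
    for line in config_text.splitlines():
        if line.startswith(key + "="):
            # Standard .config spelling for a disabled option.
            out.append("# %s is not set" % key)
        else:
            out.append(line)
    return "\n".join(out) + "\n"


before = "CONFIG_EXPERIMENTAL=y\nCONFIG_EDAC_82875P=m\n"
after = disable_option(before, "EDAC_82875P")   # symbol name as given above
print(after)
```

EXPERIMENTAL stays enabled and only the one driver is dropped, which isolates the driver as the variable between the two builds.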

Comment 14 Adam Jackson 2007-01-05 21:26:52 UTC
Or, build the kernel with this patch:

http://people.freedesktop.org/~ajax/i82875p-edac-fix.patch

Comment 15 Adam Jackson 2007-01-25 02:18:02 UTC
This patch _is_ correct, but it doesn't appear to be in either the FC6 or
rawhide kernels yet.

Comment 16 Adam Jackson 2007-03-28 18:22:40 UTC
It's in rawhide now, I'll poke Chuck to get it into FC6 updates too.

Comment 17 Chuck Ebbert 2007-03-28 18:56:35 UTC
Is somebody going to submit this patch upstream?

Comment 18 Charles Butterfield 2007-04-15 23:39:40 UTC
My problem is resolved by FC6 kernel 2.6.20-1.2944.fc6.  In this release the
contents of /proc/bus/pci/devices and the nodes in /proc/bus/pci/xx/* agree in
the number of devices (both indicate 15).

The previous release (2.6.20-1.2933.fc6) did NOT resolve the problem, so
thank you to whoever fixed the problem between 2933 and 2944.

I have no idea if ALL of the issues discussed on this list have been fixed.  I
suspect not, since there seem to be several different chunks of code that need
to arrive at the same conclusion about what PCI devices exist, which is a recipe
for future problems.  At present, on my particular hardware configuration, the
various code paths seem to be in agreement.

Thanks again to all concerned!