Bug 219288

Summary: RHEL5-B2: /proc/bus/pci/devices speaks LIES
Product: Red Hat Enterprise Linux 5
Reporter: Raghavendra Biligiri <raghavendra_biligiri>
Component: kernel
Assignee: John Feeney <jfeeney>
Status: CLOSED CURRENTRELEASE
QA Contact: Brian Brock <bbrock>
Severity: urgent
Docs Contact:
Priority: high
Version: 5.0
CC: ajax, jfeeney, mmatsuya, wwlinuxengineering
Target Milestone: ---
Target Release: ---
Hardware: i386
OS: Linux
Whiteboard:
Fixed In Version: 5.0.0
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2007-01-15 14:21:49 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 200812    
Attachments:
Attached the xorg.conf,Xorg.log and lspci output. (Flags: none)

Description Raghavendra Biligiri 2006-12-12 14:19:09 UTC
Description of problem:
On RHEL5-B2 (kernel-2.6.18-1.2767.el5), X fails to come up with the ati driver.
It fails both during the installation and after it.

After installation, trying to start X gives the following error message:

(EE) No devices detected.

Fatal server error:
no screens found
XIO:  fatal IO error 104 (Connection reset by peer) on X server ":0.0"
      after 0 requests (0 known processed) with 0 events remaining.


Version-Release number of selected component (if applicable):
RHEL5-B2(kernel-2.6.18-1.2767.el5)
xorg-x11-drv-ati-6.6.3-2.el5

How reproducible:


Steps to Reproduce:
1. Install RHEL5-B2 (kernel-2.6.18-1.2767.el5) on PE700 or PE2650.
2. Try to start X.
  
Actual results:
X fails to come up.

Expected results:
X should start without any errors.

Additional info:
Attached the Xorg.log, xorg.conf and output of lspci.

Comment 1 Raghavendra Biligiri 2006-12-12 14:22:32 UTC
Created attachment 143391 [details]
Attached the xorg.conf,Xorg.log and lspci output.

Comment 2 Adam Jackson 2006-12-12 16:14:44 UTC
*** Bug 219286 has been marked as a duplicate of this bug. ***

Comment 3 John Feeney 2006-12-14 19:05:54 UTC
Per the Release Criteria (section 1 Desktop, point ac, Xorg x11), Dell wants
this to be a blocker for 5.0.0.

Does it need to be an exception?

Comment 5 Adam Jackson 2006-12-14 20:52:10 UTC
Note that the X PCI scan and the lspci scan give different results...

Comment 7 Adam Jackson 2006-12-15 16:25:57 UTC
Analysis from #212030, which appears to be an identical issue:

This appears to be the kernel's fault.  X is counting the number of PCI devices
by inspecting /proc/bus/pci/devices, same as lspci.  But it then scans down the
device trees in /proc/bus/pci/*/, and finds 16 devices!  Since the mach64
happens to be at the end of the list, we stop at 15, and miss the mach64.

Note the following entry in X's PCI scan:

(II) PCI: 00:06:0: chip 8086,257e card 0000,0000 rev 02 class 08,80,00 hdr 00

Which isn't visible in lspci or /proc/bus/pci/devices.  Seems like an odd one to
leave out...

Reassigning to kernel.
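
The miscount described above can be illustrated with a small model. This is
only a sketch, not X's actual scanning code: the device labels, the choice of
which entry is omitted, and the function name scan_with_limit are hypothetical,
picked to mirror the 16-versus-15 situation in the analysis.

```python
def scan_with_limit(tree_entries, listed_count):
    """Return the entries seen by a scan that stops after listed_count devices."""
    return tree_entries[:listed_count]


# Hypothetical layout: 16 per-device nodes, with the video device last.
tree = [f"dev{i:02d}" for i in range(15)] + ["video"]

# The flat listing omits one non-video device, so it reports only 15 entries.
listed = [d for d in tree if d != "dev06"]

seen = scan_with_limit(tree, len(listed))
print(len(tree), len(listed), "video" in seen)
```

The omitted entry is not the video device, yet the video device is the one
that falls off the end of the scan.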

Comment 8 Adam Jackson 2006-12-15 21:15:10 UTC
Related Fedora bug with some hints: bug #214050

Comment 9 Linda Wang 2006-12-18 22:07:20 UTC
Build 2746 has the "sort PCI device list breadth-first" patch, so this issue
is seen with that patch applied.

According to bug 212030, however, the same issue is seen on 2714, so the PCI
ordering patch didn't create the regression. Can someone narrow down the
PCI scan problem?

Comment 10 Adam Jackson 2006-12-19 00:10:34 UTC
As mentioned in bug #214050, this seems to happen only with CONFIG_EXPERIMENTAL.
The device missing from lspci is the one handled by the EDAC driver for that
chipset, and we only build that driver when EXPERIMENTAL=y.  Sounds like a good
place to start looking.

Comment 11 Amit Bhutani 2006-12-20 10:14:56 UTC
Based on analysis from comment #7, it appears that this could break (read: No X)
*any* system where the video device shows up deep enough (>15) in the
/proc/bus/pci/*/ tree.

Bumping the severity of the issue based on that analysis. RH- Please mark as
Blocker for RHEL5.0 if not already marked that way.

Comment 12 Adam Jackson 2006-12-20 17:37:39 UTC
It's not an issue of depth, it's an issue of miscounting.  The list of devices
visible in /proc/bus/pci/devices is not the same set as those visible through
/sys/bus/pci/devices.  One device in particular is consistently missing from
/proc/bus/pci/devices, and it's _not_ the VGA device.

It may be possible to work around this in X, but the correct fix is for the
kernel's filesystems to present a consistent view of the world.
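
One way to spot the inconsistency is to compare the two views directly. The
following is a rough sketch, assuming the usual /proc/bus/pci/devices layout
(first field is four hex digits, bus then devfn) and a single PCI domain; the
function names are made up for illustration.

```python
def parse_proc_devices(text):
    """Return the set of 'bb:ss.f' addresses listed in /proc/bus/pci/devices text.

    The first field of each line is four hex digits: the bus number followed
    by devfn (slot in the upper five bits, function in the lower three).
    """
    addresses = set()
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        busdevfn = int(fields[0], 16)
        bus = busdevfn >> 8
        slot = (busdevfn & 0xFF) >> 3
        func = busdevfn & 0x07
        addresses.add(f"{bus:02x}:{slot:02x}.{func}")
    return addresses


def missing_from_proc(proc_text, sysfs_names):
    """Return the sysfs device names that have no entry in the /proc listing.

    Assumes a single PCI domain, so the 'dddd:' prefix of each sysfs name
    is dropped before comparing.
    """
    listed = parse_proc_devices(proc_text)
    return sorted(n for n in sysfs_names if n.split(":", 1)[1] not in listed)
```

On a live system the inputs would be the contents of /proc/bus/pci/devices and
the directory names under /sys/bus/pci/devices; an empty result means the two
views agree.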

Comment 13 Larry Troan 2006-12-20 21:33:10 UTC
Per John Feeney: is this a DUP of bug 212030? It may also be related to
Fedora bug 214050.

Comment 14 Adam Jackson 2006-12-20 22:33:47 UTC
Probably to both.  All three bugs show exactly the same fault: 8086:257e at PCI
slot 0:6.0 missing from lspci but visible in /sys/bus/pci/devices, and X failing
to start because the device count is wrong.

Comment 15 Ken Reilly 2006-12-22 15:37:02 UTC
After talking with John Feeney, he'll continue trying to isolate the problem(s)
and refine the scope/impact. If there isn't more information available on or
before January 5, 2007, we'll assess the impact of deferring this bug to a later
release.

Comment 16 Stuart Hayes 2007-01-02 21:16:35 UTC
It looks like i82875p_setup_overfl_dev() (in drivers/edac/i82875p_edac.c)
exposes a PCI device (part of the north bridge) that was hidden by the BIOS
(at Intel's recommendation). It calls pci_proc_attach_device(), which creates
the /proc file specific to this device, but never calls pci_bus_add_device(),
which adds the device to the global list pci_devices that
/proc/bus/pci/devices exposes.

I don't yet have a system to check this out on, though... this is all just 
based on my looking at the code--I could be missing something.
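
A toy model of the bookkeeping described above may make the mismatch easier to
see. This is not kernel code: the PciCore class and its methods are
hypothetical stand-ins for the two kernel calls, and the ordinary device
addresses are invented; only 00:06.0 comes from this bug.

```python
class PciCore:
    """Toy stand-in for the kernel's PCI bookkeeping (not real kernel code)."""

    def __init__(self):
        self.proc_nodes = set()   # models per-device files under /proc/bus/pci/<bus>/
        self.global_list = []     # models the list behind /proc/bus/pci/devices

    def proc_attach(self, dev):
        # Models pci_proc_attach_device(): creates only the per-device node.
        self.proc_nodes.add(dev)

    def bus_add(self, dev):
        # Models the effect of pci_bus_add_device() named above: the device
        # joins the global list.
        if dev not in self.global_list:
            self.global_list.append(dev)


core = PciCore()
for dev in ["00:00.0", "00:1f.0"]:   # hypothetical ordinary devices
    core.proc_attach(dev)
    core.bus_add(dev)

# The unhidden device: its node is created, but the list entry is skipped.
core.proc_attach("00:06.0")

inconsistent = sorted(core.proc_nodes - set(core.global_list))
print(inconsistent)
```

In this model, calling bus_add("00:06.0") as well brings the two views back
into agreement.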


Comment 18 Larry Troan 2007-01-03 15:07:38 UTC
BLOCKER: We at least need a workaround that will permit RHEL5 certification.



Comment 19 Stuart Hayes 2007-01-03 20:20:20 UTC
What I said in comment #16 appears to be correct.  This patch fixed the 
problem--/proc/bus/pci/devices now shows device 8086:257e, and X windows 
started up.

--- linux-2.6.18.i386/drivers/edac/i82875p_edac.c	2006-09-19 22:42:06.000000000 -0500
+++ linux-2.6.18.i386_dec18_2006/drivers/edac/i82875p_edac.c	2007-01-03 06:37:22.000000000 -0600
@@ -297,6 +297,8 @@ static int i82875p_setup_overfl_dev(stru
 			       "device\n", __func__);
 		return 1;
 	}
+	pci_bus_add_device(dev);
+
 #endif  /* CONFIG_PROC_FS */
 	if (pci_enable_device(dev)) {
 		i82875p_printk(KERN_ERR, "%s(): Failed to enable overflow "
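
The check described above (device 8086:257e now showing up in
/proc/bus/pci/devices) could be scripted roughly as follows. This is a sketch
that assumes the usual layout of that file, where the second field of each line
is the vendor ID followed by the device ID in hex; the function name is
hypothetical.

```python
def lists_device(proc_text, vendor, device):
    """Return True if /proc/bus/pci/devices text has an entry for vendor:device.

    The second field of each line is eight hex digits: vendor ID, then device ID.
    """
    wanted = f"{vendor:04x}{device:04x}"
    return any(
        len(fields) > 1 and fields[1].lower() == wanted
        for fields in (line.split() for line in proc_text.splitlines())
    )
```

For this bug the call would be lists_device(text, 0x8086, 0x257e), with text
read from /proc/bus/pci/devices on the patched kernel.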


Comment 20 Jay Turner 2007-01-03 20:36:53 UTC
QE ack for RHEL5.

Comment 23 John Feeney 2007-01-05 21:29:17 UTC
An update: The patch provided by Dell works, but how best to implement the
change is still being worked out, since the patch needs to be sent upstream for
approval. A discussion has been started internally to find the best answer for
RHEL-5 and upstream. At this time, the discussion is not expected to prevent the
patch from being submitted to rhkernel in time for RHEL-5. I have provided
Stuart Hayes with details of the discussion and asked for his input. Again, my
thanks to Stuart for finding the solution.


Comment 24 John Feeney 2007-01-08 20:54:13 UTC
The patch was posted on rhkernel list for review and acceptance.

Comment 25 Jay Turner 2007-01-10 15:29:36 UTC
Built into 2.6.18-1.3002.el5.

Comment 26 Amit Bhutani 2007-01-10 18:33:18 UTC
rsync from kernel build page of Don Z has not picked up the 3002 build yet. Dell
will report results once that build has been made available to Dell.

Comment 28 Raghavendra Biligiri 2007-01-12 10:19:35 UTC
This issue is not reproducible on the test kernel (kernel-2.6.18-1.3002.el5).
X comes up fine with that kernel on PE700.



Comment 29 Amit Bhutani 2007-01-12 14:00:08 UTC
Moving to VERIFIED based on previous comment.

Comment 30 Jay Turner 2007-01-15 14:21:49 UTC
kernel-2.6.18-1.3002.el5 included in 20070111.1 and 20070112.3.