Bug 219288
Summary: | RHEL5-B2: /proc/bus/pci/devices speaks LIES | |
---|---|---|---
Product: | Red Hat Enterprise Linux 5 | Reporter: | Raghavendra Biligiri <raghavendra_biligiri>
Component: | kernel | Assignee: | John Feeney <jfeeney>
Status: | CLOSED CURRENTRELEASE | QA Contact: | Brian Brock <bbrock>
Severity: | urgent | Priority: | high
Version: | 5.0 | CC: | ajax, jfeeney, mmatsuya, wwlinuxengineering
Hardware: | i386 | OS: | Linux
Fixed In Version: | 5.0.0 | Doc Type: | Bug Fix
Last Closed: | 2007-01-15 14:21:49 UTC | Bug Blocks: | 200812

Description
Raghavendra Biligiri
2006-12-12 14:19:09 UTC
Created attachment 143391
Attached the xorg.conf, Xorg.log, and lspci output.

*** Bug 219286 has been marked as a duplicate of this bug. ***

Per Release Criteria (section 1 Desktop point ac Xorg x11) Dell wants this to be a blocker of 5.0.0. Does it need to be an exception?

Note that the X PCI scan and the lspci scan give different results...

Analysis from #212030, which appears to be an identical issue: This appears to be the kernel's fault. X is counting the number of PCI devices by inspecting /proc/bus/pci/devices, same as lspci. But it then scans down the device trees in /proc/bus/pci/*/, and finds 16 devices! Since the mach64 happens to be at the end of the list, we stop at 15, and miss the mach64. Note the following entry in X's PCI scan:

    (II) PCI: 00:06:0: chip 8086,257e card 0000,0000 rev 02 class 08,80,00 hdr 00

Which isn't visible in lspci or /proc/bus/pci/devices. Seems like an odd one to leave out... Reassigning to kernel.

Related Fedora bug with some hints: bug #214050

2746 has the "sort PCI device list breadth-first" patch. So, this issue is seen after that patch is applied. According to bug 212030, that issue is seen on 2714. So, the PCI ordering patch didn't create the regression. Can someone narrow down the PCI scan problem?

As mentioned in bug #214050, this seems to only happen with CONFIG_EXPERIMENTAL. The device missing from lspci is the one used by the EDAC driver for that chipset, and we only build that driver when EXPERIMENTAL=y. Sounds like a good place to start looking.

Based on analysis from comment #7, it appears that this could break (read: No X) *any* system where the video device shows up deep enough (>15) in the /proc/bus/pci/*/ tree. Bumping the severity of the issue based on that analysis. RH- Please mark as Blocker for RHEL5.0 if not already marked that way.

It's not an issue of depth, it's an issue of miscounting. The list of devices visible in /proc/bus/pci/devices is not the same set as those visible through /sys/bus/pci/devices.
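
The miscount described above can be illustrated with a toy model. This is a simplified sketch, not X's actual scanning code: the function name `scan_like_x`, the 16-entry layout, and every address except the hidden 00:06.0 device are assumptions made for illustration.

```python
# Toy model of the miscount; not X server code. All names are hypothetical.

def scan_like_x(listed, tree):
    """Take the device count from the global list, then read that many tree entries."""
    limit = len(listed)   # count derived from /proc/bus/pci/devices
    return tree[:limit]   # walk of /proc/bus/pci/*/ stops once the count is reached


HIDDEN = "00:06.0"        # 8086:257e: present in the tree, absent from the list
VIDEO = "01:0f.0"         # hypothetical video device that sorts last in the tree

# 14 ordinary devices at made-up slots 0x10..0x1d, plus the hidden and video devices.
others = ["00:%02x.0" % slot for slot in range(0x10, 0x1E)]
tree = sorted(others + [HIDDEN, VIDEO])               # 16 entries in the per-bus tree
listed_buggy = [dev for dev in tree if dev != HIDDEN]  # 15 entries in the global list

print(VIDEO in scan_like_x(listed_buggy, tree))  # video device is cut off
print(VIDEO in scan_like_x(tree, tree))          # consistent views: video device is seen
```

In this model the video device is lost only because the count and the tree disagree by one, which matches the point that the problem is miscounting rather than depth.
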
One device in particular is consistently missing from /proc/bus/pci/devices, and it's _not_ the VGA device. It may be possible to work around this in X, but the correct fix is for the kernel's filesystems to present a consistent view of the world.

Per John Feeney, question whether this is a DUP of bug 212030? It may also be related to Fedora bug 214050.

Probably to both. All three bugs show exactly the same fault: 8086:257e at PCI slot 0:6.0 missing from lspci but visible in /sys/bus/pci/devices, and X failing to start because the device count is wrong.

After talking with John Feeney, he'll continue trying to isolate the problem(s) and refine the scope/impact. If there isn't more information available on/before January 5, 2007 we'll assess the impact of deferring this bug to a later release.

It looks like i82875p_setup_overfl_dev() (in drivers/edac/i82875p_edac.c) is exposing a PCI device (part of the north bridge) that was hidden by the BIOS (at Intel's recommendation). It calls pci_proc_attach_device(), which creates the /proc file specific to this device, but never calls pci_bus_add_device(), which adds this device to the global list pci_devices, which /proc/bus/pci/devices exposes. I don't yet have a system to check this out on, though... this is all just based on my looking at the code. I could be missing something.

BLOCKER: We at least need a workaround that will permit RHEL5 certification.

What I said in comment #16 appears to be correct. This patch fixed the problem: /proc/bus/pci/devices now shows device 8086:257e, and X windows started up.

    --- linux-2.6.18.i386/drivers/edac/i82875p_edac.c 2006-09-19 22:42:06.000000000 -0500
    +++ linux-2.6.18.i386_dec18_2006/drivers/edac/i82875p_edac.c 2007-01-03 06:37:22.000000000 -0600
    @@ -297,6 +297,8 @@ static int i82875p_setup_overfl_dev(stru
     "device\n", __func__);
     return 1;
     }
    + pci_bus_add_device(dev);
    +
     #endif /* CONFIG_PROC_FS */
     if (pci_enable_device(dev)) {
     i82875p_printk(KERN_ERR, "%s(): Failed to enable overflow "

QE ack for RHEL5.

An update: The patch provided by Dell works but the final solution as to how to implement the change is being worked on given that this patch needs to be sent upstream for approval. A discussion has been initiated with internal personnel to provide the best answer for RHEL-5 and upstream. It is not anticipated at this time that this discussion should prevent this patch from being submitted to rhkernel on time for RHEL-5. I have provided Stuart Hayes with details of the discussion and asked for his input. Again, my thanks to Stuart for finding the solution.

The patch was posted on rhkernel list for review and acceptance.

Built into 2.6.18-1.3002.el5.

rsync from kernel build page of Don Z has not picked up the 3002 build yet. Dell will report results once that build has been made available to Dell.

This issue is not reproducible on the test kernel (kernel-2.6.18-1.3002.el5). X comes up fine with the test kernel (kernel-2.6.18-1.3002.el5) on PE700.

Moving to VERIFIED based on previous comment.

kernel-2.6.18-1.3002.el5 included in 20070111.1 and 20070112.3.
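
For anyone re-checking a kernel for this fault, the comparison made in this bug (devices in /sys/bus/pci/devices versus devices listed in /proc/bus/pci/devices) can be scripted. The following is a minimal sketch, not a tool used in this bug. It assumes the first field of each /proc/bus/pci/devices line is the bus number and devfn packed as four hex digits and the second field is the vendor and device IDs packed as eight hex digits, and it assumes a single PCI domain. The function names and the sample data are hypothetical.

```python
import os


def parse_proc_devices(text):
    """Map (bus, slot, func) -> (vendor, device) from /proc/bus/pci/devices text."""
    found = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        busdevfn = int(fields[0], 16)   # assumed layout: bus << 8 | devfn
        ids = int(fields[1], 16)        # assumed layout: vendor << 16 | device
        bus, devfn = busdevfn >> 8, busdevfn & 0xFF
        found[(bus, devfn >> 3, devfn & 0x7)] = (ids >> 16, ids & 0xFFFF)
    return found


def parse_sysfs_name(name):
    """Split a sysfs entry such as 0000:00:06.0 into (bus, slot, func)."""
    _domain, bus, rest = name.split(":")   # domain ignored: single-domain assumption
    slot, func = rest.split(".")
    return int(bus, 16), int(slot, 16), int(func, 16)


def missing_from_proc(proc_text, sysfs_names):
    """Return devices that sysfs knows about but the /proc list omits."""
    listed = set(parse_proc_devices(proc_text))
    return sorted({parse_sysfs_name(n) for n in sysfs_names} - listed)


def check_live_system():
    """Run the comparison against the running kernel (Linux only)."""
    with open("/proc/bus/pci/devices") as fh:
        proc_text = fh.read()
    return missing_from_proc(proc_text, os.listdir("/sys/bus/pci/devices"))


# Made-up sample: the /proc list omits 00:06.0, as in this bug.
sample_proc = "0000\t80862578\t0\n"
sample_sysfs = ["0000:00:00.0", "0000:00:06.0"]
for bus, slot, func in missing_from_proc(sample_proc, sample_sysfs):
    print("missing from /proc: %02x:%02x.%x" % (bus, slot, func))
```

On a kernel with the fix, `check_live_system()` would be expected to return an empty list if the two views agree.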