Red Hat Bugzilla – Bug 728627
[RHEL6.2] Kernel fails to boot. 2.6.32-179.el6 or higher
Last modified: 2012-02-28 23:29:24 EST
Description of problem:
While testing the RHEL6.2 kernels we ran across an instance where a system no longer boots.
Version-Release number of selected component (if applicable):
This is the last thing you see on the console.
ksign: Installing public key data
- Added public key FD6F1B13E15862F9
- User ID: Red Hat, Inc. (Kernel Module GPG key)
- Added public key D4A26C9CCD09BEDA
- User ID: Red Hat Enterprise Linux Driver Update Program <firstname.lastname@example.org>
Block layer SCSI generic (bsg) driver version 0.4 loaded (major 252)
io scheduler noop registered
io scheduler anticipatory registered
io scheduler deadline registered
io scheduler cfq registered (default)
Expected results:
System should boot.
After spending a couple of days on this, I am going to close it out as broken hardware. I have traced the hangs to various pci_config_read and pci_config_write calls.
According to our PCIe guys, the system should never hang on a read; a hang on a write is possible, but unlikely on the particular write I noticed. Granted, the system always seems to hang on the same PCIe port, but just adding a couple of printks allows the system to boot fine. This leads me to believe it is a system timing issue.
This machine is an Intel whitebox that is no longer supported.
Yeah, it is a regression from 6.1, but to properly enable APEI support, the PCIe initialization had to be redone to accommodate it. This change pokes registers a little differently.
I hit hangs on a pci_config_write in pcibios_set_master.
I hit hangs on a pci_config_read of PCI_COMMAND.
I hit hangs on a couple of pci_config_reads while configuring interrupts that I didn't feel like tracking down.
This is just wasting my time on a broken machine that is no longer supported. Closing it out before I waste more time better spent elsewhere.
Do we know which was the last kernel that worked on the system?
I believe it is -178.el6 and that the APEI changes to the PCIe root bridge probably caused the problem.
I just unchecked the private field of comment 3. Not sure why I did that. There is also a workaround that I forgot to mention: booting with 'pci=noaer' and 'pcie_ports=compat' successfully worked around the problem.
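To confirm that a booted system is actually running with the workaround, the kernel command line can be checked for both parameters. The following is only a minimal sketch: the helper names are hypothetical, and it assumes both parameters are wanted together, since this report does not say whether either one alone is sufficient.

```python
# Workaround parameters named in this report.
WORKAROUND_PARAMS = ("pci=noaer", "pcie_ports=compat")


def missing_workaround_params(cmdline, required=WORKAROUND_PARAMS):
    """Return the required parameters that are absent from a kernel command line.

    `cmdline` is the whitespace-separated kernel command line as a string,
    e.g. the contents of /proc/cmdline. Parameters must match exactly.
    """
    present = set(cmdline.split())
    return [param for param in required if param not in present]


def read_cmdline(path="/proc/cmdline"):
    """Read the running kernel's command line (Linux only)."""
    with open(path) as f:
        return f.read().strip()
```

On the affected machine, missing_workaround_params(read_cmdline()) returning an empty list would indicate that both parameters were passed at boot.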
Thanks Don. Makes sense to me.
>This machine is an Intel whitebox that is no longer supported. (Comment #3)
This problem also happens on the ProLiant DL165 G7. Though the workaround
('pci=noaer' and 'pcie_ports=compat') works well, our customer expects
this issue to be fixed in a future release.
(In reply to comment #7)
> >This machine is an Intel whitebox that is no longer supported. (Comment #3)
> This problem also happens on the ProLiant DL165 G7. Though the workaround
> ('pci=noaer' and 'pcie_ports=compat') works well, our customer expects
> this issue to be fixed in a future release.
You will have to open a new bugzilla for that issue. The issue here seemed to be broken hardware, exposed when software enabled features in that hardware. There isn't much we can do in that case except offer those workarounds.
The ProLiant should be a working box though, so I wouldn't be surprised if that issue is something different.