Bug 728627 - [RHEL6.2] Kernel fails to boot. 2.6.32-179.el6 or higher
Status: CLOSED CANTFIX
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Version: 6.2
Hardware: Unspecified OS: Unspecified
Priority: unspecified Severity: unspecified
Target Milestone: rc
Target Release: ---
Assigned To: Don Zickus
QA Contact: Red Hat Kernel QE team
Depends On:
Blocks: 6.2KnownIssues
Reported: 2011-08-05 16:54 EDT by Jeff Burke
Modified: 2012-02-28 23:29 EST

Doc Type: Bug Fix
Last Closed: 2011-08-10 15:02:12 EDT


Attachments: None
Description Jeff Burke 2011-08-05 16:54:27 EDT
Description of problem:
 While testing the RHEL6.2 kernels we ran across an instance where a system no longer boots.

Version-Release number of selected component (if applicable):
2.6.32-179.el6

How reproducible:
Always
  
Actual results:
This is the last thing you see on the console.

============<snip>============
ksign: Installing public key data 
Loading keyring 
- Added public key FD6F1B13E15862F9 
- User ID: Red Hat, Inc. (Kernel Module GPG key) 
- Added public key D4A26C9CCD09BEDA 
- User ID: Red Hat Enterprise Linux Driver Update Program <secalert@redhat.com> 
Block layer SCSI generic (bsg) driver version 0.4 loaded (major 252) 
io scheduler noop registered 
io scheduler anticipatory registered 
io scheduler deadline registered 
io scheduler cfq registered (default) 

============<\snip>============

Expected results:
System should boot

Additional info:
Comment 3 Don Zickus 2011-08-10 15:02:12 EDT
After spending a couple of days on this, I am going to close it out as broken hardware.  I have traced the hangs to various pci_config_read and pci_config_write calls.

According to our PCI/PCIe guys, the system should never hang on a read; a write could possibly hang, but that is unlikely for the particular write I noticed.  Granted, the system always seems to hang on one particular PCIe port, but just adding a couple of printks allows it to boot fine.  This leads me to believe it is a system timing issue.

This machine is an Intel whitebox that is no longer supported.

Yeah, it is a regression from 6.1, but to properly enable APEI support, the PCIe initialization had to be redone to accommodate it.  This change pokes registers a little differently.

I hit hangs on a pci_config_write in pcibios_set_master
I hit hangs on a pci_config_read in PCI_COMMAND
I hit hangs on a couple of pci_config_reads while configuring interrupts that I didn't feel like tracking down.
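
For readers unfamiliar with the register named above, here is a minimal Python sketch (not part of the original debugging) that decodes the 16-bit PCI_COMMAND register from a raw configuration-space dump. The offset (0x04) and bit masks follow the standard PCI configuration header; the function name and the sample bytes are made up for illustration.

```python
import struct

# Offset of the 16-bit command register in the standard PCI config header.
PCI_COMMAND = 0x04

# Commonly used command-register bits (standard PCI header definitions).
PCI_COMMAND_BITS = {
    0x0001: "IO",            # respond to I/O space accesses
    0x0002: "MEMORY",        # respond to memory space accesses
    0x0004: "MASTER",        # bus mastering enabled
    0x0040: "PARITY",        # parity error response
    0x0100: "SERR",          # SERR# enable
    0x0400: "INTX_DISABLE",  # legacy INTx interrupts disabled
}


def decode_pci_command(config):
    """Return the names of the bits set in PCI_COMMAND.

    `config` is a bytes object holding at least the first 6 bytes of a
    device's configuration space (hypothetical helper, for illustration).
    """
    if len(config) < PCI_COMMAND + 2:
        raise ValueError("config space dump too short")
    # Config space is little-endian; read one unsigned 16-bit value.
    (value,) = struct.unpack_from("<H", config, PCI_COMMAND)
    return [name for mask, name in PCI_COMMAND_BITS.items() if value & mask]


# Made-up header: vendor 0x8086, device 0x1234, command 0x0006.
sample = bytes([0x86, 0x80, 0x34, 0x12, 0x06, 0x00]) + bytes(58)
print(decode_pci_command(sample))
```

On a running Linux system the same bytes can usually be read from the device's sysfs `config` file under /sys/bus/pci/devices/, assuming the machine gets far enough to boot.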

This is just wasting my time on a broken machine that is no longer supported.  Closing it out before I waste more time better spent elsewhere.

Cheers,
Don
Comment 4 John Villalovos 2011-08-10 16:49:10 EDT
Do we know what is the last kernel that did work on the system?
Comment 5 Don Zickus 2011-08-10 17:04:32 EDT
I believe it is -178.el6 and that the APEI changes to the PCIe root bridge probably caused the problem.

I just unchecked the private field of comment 3.  Not sure why I did that.  There is a workaround that I forgot to mention: 'pci=noaer' and 'pcie_ports=compat' successfully worked around the problem.
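
As an illustration of applying that workaround, the following Python sketch appends the two parameters to a boot loader kernel line and checks whether a kernel command line already carries them. The helper names and the sample kernel line are hypothetical; only the two parameters come from this report.

```python
# The two parameters named in the workaround above.
WORKAROUND_PARAMS = ("pci=noaer", "pcie_ports=compat")


def add_boot_params(kernel_line, params=WORKAROUND_PARAMS):
    """Return kernel_line with any missing params appended (idempotent)."""
    present = kernel_line.split()
    missing = [p for p in params if p not in present]
    return " ".join([kernel_line.rstrip()] + missing)


def has_workaround(cmdline, params=WORKAROUND_PARAMS):
    """True if every workaround parameter appears on the command line."""
    tokens = cmdline.split()
    return all(p in tokens for p in params)


# Hypothetical boot loader entry; the root device is an assumption.
line = "kernel /vmlinuz-2.6.32-179.el6.x86_64 ro root=/dev/sda1"
patched = add_boot_params(line)
print(patched)
print(has_workaround(patched))
```

After a reboot, passing the contents of /proc/cmdline to `has_workaround` would confirm that the running kernel was actually booted with both parameters.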

Cheers,
Don
Comment 6 John Villalovos 2011-08-10 17:10:27 EDT
Thanks Don.  Makes sense to me.
Comment 7 masaya.hasegawa.hp 2012-01-25 01:21:09 EST
>This machine is an Intel whitebox that is no longer supported. (Comment #3)

This problem also happens on the ProLiant DL165 G7. Although the workaround
('pci=noaer' and 'pcie_ports=compat') works well, our customer expects
this issue to be fixed in a future release.
Comment 8 Don Zickus 2012-01-25 09:47:59 EST
(In reply to comment #7)
> >This machine is an Intel whitebox that is no longer supported. (Comment #3)
> 
> This problem also happens on the ProLiant DL165 G7. Although the workaround
> ('pci=noaer' and 'pcie_ports=compat') works well, our customer expects
> this issue to be fixed in a future release.

Hi Masaya,

You will have to open a new bugzilla for that issue.  The issue here seemed to be broken hardware, with the hang triggered by software enabling features in that hardware.  There isn't much we can do in that case except for those workarounds.

The ProLiant should be a working box though, so I wouldn't be surprised if that issue is something different.

Cheers,
Don
