Bug 728627

Summary: [RHEL6.2] Kernel fails to boot. 2.6.32-179.el6 or higher
Product: Red Hat Enterprise Linux 6 Reporter: Jeff Burke <jburke>
Component: kernel Assignee: Don Zickus <dzickus>
Status: CLOSED CANTFIX QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 6.2 CC: arozansk, jstancek, jvillalo, kmcmartin, masaya.hasegawa, pbunyan
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-08-10 19:02:12 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 728633    

Description Jeff Burke 2011-08-05 20:54:27 UTC
Description of problem:
 While testing the RHEL6.2 kernels we ran across an instance where a system no longer boots.

Version-Release number of selected component (if applicable):
2.6.32-179.el6

How reproducible:
Always
  
Actual results:
This is the last thing you see on the console.

============<snip>============
ksign: Installing public key data 
Loading keyring 
- Added public key FD6F1B13E15862F9 
- User ID: Red Hat, Inc. (Kernel Module GPG key) 
- Added public key D4A26C9CCD09BEDA 
- User ID: Red Hat Enterprise Linux Driver Update Program <secalert> 
Block layer SCSI generic (bsg) driver version 0.4 loaded (major 252) 
io scheduler noop registered 
io scheduler anticipatory registered 
io scheduler deadline registered 
io scheduler cfq registered (default) 

============<\snip>============

Expected results:
System should boot

Additional info:

Comment 3 Don Zickus 2011-08-10 19:02:12 UTC
After spending a couple of days on this, I am going to close it out as broken hardware.  I have traced the hangs to various pci_config_read and pci_config_write calls.

According to our PCIe guys, the system should never hang on a read; possibly on a write, but that is unlikely for the particular write I noticed.  Granted, the system always seems to hang on the same PCIe port, but just adding in a couple of printks allows the system to boot fine.  This leads me to believe it is a system timing issue.

This machine is an Intel whitebox that is no longer supported.

Yeah, it is a regression from 6.1, but to properly enable APEI support, the PCIe initialization had to be redone.  This change pokes registers a little differently.

I hit hangs on a pci_config_write in pcibios_set_master
I hit hangs on a pci_config_read in PCI_COMMAND
I hit hangs on a couple of pci_config_reads while configuring interrupts that I didn't feel like tracking down.

This is just wasting my time on a broken machine that is no longer supported.  Closing it out before I waste more time better spent elsewhere.

Cheers,
Don

Comment 4 John Villalovos 2011-08-10 20:49:10 UTC
Do we know which kernel was the last one that worked on this system?

Comment 5 Don Zickus 2011-08-10 21:04:32 UTC
I believe it is -178.el6 and that the APEI changes to the PCIe root bridge probably caused the problem.

I just unchecked the private field of comment 3.  Not sure why I did that.  There is also a workaround that I forgot to mention: booting with 'pci=noaer' and 'pcie_ports=compat' successfully worked around the problem.

Cheers,
Don
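
To illustrate the workaround above, here is a minimal sketch (not part of the original report) of how one might check whether a kernel command line carries both parameters. The function name has_workaround is hypothetical, and the sample command line is made up; on a running system the string could be read from /proc/cmdline instead.

```python
def has_workaround(cmdline: str) -> dict:
    """Report which of the two workaround parameters appear in a kernel command line.

    Hypothetical helper for illustration only. It accounts for the fact that
    'pci=' accepts a comma-separated option list (e.g. 'pci=nomsi,noaer').
    """
    found = {"pci=noaer": False, "pcie_ports=compat": False}
    for token in cmdline.split():
        key, _, value = token.partition("=")
        if key == "pci" and "noaer" in value.split(","):
            found["pci=noaer"] = True
        elif key == "pcie_ports" and value == "compat":
            found["pcie_ports=compat"] = True
    return found


# Example with a made-up command line string:
print(has_workaround("ro root=/dev/sda1 pci=noaer pcie_ports=compat"))
```

Both entries need to be True for the workaround as described in this comment to be in effect.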

Comment 6 John Villalovos 2011-08-10 21:10:27 UTC
Thanks Don.  Makes sense to me.

Comment 7 masaya.hasegawa.hp 2012-01-25 06:21:09 UTC
>This machine is an Intel whitebox that is no longer supported. (Comment #3)

This problem also happens on the ProLiant DL165 G7. Although the workaround
('pci=noaer' and 'pcie_ports=compat') works well, our customer expects
this issue to be fixed in a future release.

Comment 8 Don Zickus 2012-01-25 14:47:59 UTC
(In reply to comment #7)
> >This machine is an Intel whitebox that is no longer supported. (Comment #3)
> 
> This problem also happens on the ProLiant DL165 G7. Although the workaround
> ('pci=noaer' and 'pcie_ports=compat') works well, our customer expects
> this issue to be fixed in a future release.

Hi Masaya,

You will have to open a new bugzilla for that issue.  The issue here seemed to be broken hardware, which was caused by software enabling features in the hardware.  There isn't much we can do in that case except for those workarounds.

The ProLiant should be a working box though, so I wouldn't be surprised if that issue is something different.

Cheers,
Don