Bug 192760

Summary: RHAS4 U3 x86_64 largesmp kernels don't support >8 cores on AMD64
Product: Red Hat Enterprise Linux 4
Reporter: Nakul Saraiya <nakul>
Component: kernel
Assignee: Brian Maly <bmaly>
Status: CLOSED ERRATA
QA Contact: Brian Brock <bbrock>
Severity: high
Priority: medium
Version: 4.0
CC: bnagendr, cseshadri, jbaron, konradr
Keywords: Reopened
Hardware: x86_64
OS: Linux
Fixed In Version: RHBA-2007-0304
Doc Type: Bug Fix
Last Closed: 2007-05-08 01:34:19 UTC
Attachments:
Boot hang #1 (flags: none)
Boot hang #2 (flags: none)
Working patch for RHAS4 U1 16-core AMD64 (flags: none)
PROPOSED patch for RHAS4 U3 for 16-core AMD64 (flags: none)
Patch against RHEL4U3 tree to put AMD64 systems into physflat mode (flags: none)

Description Nakul Saraiya 2006-05-22 19:33:17 UTC
Description of problem:

The RHAS4U3 'largesmp' kernel does not support >8 cores on AMD64 systems.
8-socket dual-core systems require 'physical flat' APIC mode, which is what
the kernel.org kernels use.

Version-Release number of selected component (if applicable):

RHEL4 U3 x86_64 'largesmp'

How reproducible:

Boot an 8-socket dual-core (total 16 cores) AMD64 system with the largesmp kernel.

Steps to Reproduce:
1.  See above

Comment 1 Jim Paradis 2006-05-22 22:52:34 UTC
RHEL4 already supports >8 CPUs in Update 3 using clustered APIC mode.  This has
been verified in-house with an 8-way dual-core system.


Comment 2 Nakul Saraiya 2006-05-24 02:39:44 UTC
AMD64 does *not* support clustered APIC mode for >8 cores; Intel EM64T does. 
The boot hangs on our AMD Opteron systems, probably because of lost interrupts.

See the kernel.org source under arch/x86_64/kernel/genapic.c and
.../genapic_flat.c and .../mpparse.c.  

I have patches against both RHEL4 U1 (in use) and RHEL4 U3 (in test) that make
this work.
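
As background to the 8-core limit discussed here, a minimal sketch in C of how the two destination schemes differ. In flat logical mode the destination is an 8-bit mask with one bit per CPU, so a ninth CPU cannot be addressed; in physical mode the destination is the target's APIC ID. The function names and the FLAT_LOGICAL_MAX_CPUS constant are hypothetical and chosen for illustration; this is not code from the kernel or from the attached patches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical constant: flat logical mode has one mask bit per CPU. */
#define FLAT_LOGICAL_MAX_CPUS 8

/*
 * Compute the flat-logical destination mask for a CPU index.
 * Returns 1 and stores the mask on success, or 0 if the CPU index
 * does not fit into the 8-bit mask.
 */
static int flat_logical_dest(unsigned cpu, uint8_t *mask)
{
    assert(mask != NULL);
    if (cpu >= FLAT_LOGICAL_MAX_CPUS)
        return 0;               /* no bit left for this CPU */
    *mask = (uint8_t)(1u << cpu);
    return 1;
}

/*
 * In physical mode the destination is simply the APIC ID of the target,
 * so it is not limited to 8 CPUs. 0xFF is reserved for broadcast.
 */
static uint8_t physical_dest(unsigned apic_id)
{
    assert(apic_id < 0xFF);
    return (uint8_t)apic_id;
}
```

With these helpers, CPU 7 maps to mask 0x80 in flat logical mode, CPU 8 is rejected, and CPU 15 of a 16-core system is reachable through physical_dest(15).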


Comment 3 Nakul Saraiya 2006-05-24 03:35:53 UTC
Created attachment 129895 [details]
Boot hang #1

Comment 4 Nakul Saraiya 2006-05-24 03:36:44 UTC
Created attachment 129896 [details]
Boot hang #2

Comment 5 Nakul Saraiya 2006-05-24 03:46:12 UTC
Created attachment 129897 [details]
Working patch for RHAS4 U1 16-core AMD64

This is slightly non-optimal (should use DM_FIXED delivery mode), and fixes
some other small annoyances on AMD64.  It has been tested and is in use.  It is
against RHEL4 U1 and was submitted to your Eng team through our Business
Development contacts.  It also required that the config be changed to support
16 CPUs (not in patch.)

Comment 6 Nakul Saraiya 2006-05-24 03:48:58 UTC
Created attachment 129898 [details]
PROPOSED patch for RHAS4 U3 for 16-core AMD64

This has been compiled but not yet tested.  I will update the bug report
tomorrow with test results (boot log.)  I also changed the largesmp config to
use 16 CPUs, since that is all that AMD64 supports today, unlike Intel EM64T.

Comment 7 Nakul Saraiya 2006-05-24 03:59:26 UTC
Comment on attachment 129898 [details]
PROPOSED patch for RHAS4 U3 for 16-core AMD64

Oops, I changed the 'flat' APIC mode to fixed, rather than 'physical flat'. 
Will submit an updated patch tomorrow.

Comment 8 Konrad Rzeszutek 2006-05-24 18:33:36 UTC
Nakul, the bug was changed to "CLOSED WONTFIX" - is that the correct state?

Comment 9 Nakul Saraiya 2006-05-24 18:37:52 UTC
Bugzilla didn't allow me to reopen the bug, so I just flagged it as best I
could.  I'm currently testing a patch and should have results for you later today.

Comment 10 Konrad Rzeszutek 2006-05-24 19:03:28 UTC
Nakul,
Reopening the bug.

Comment 11 Nakul Saraiya 2006-05-25 01:23:52 UTC
Quick update - I have managed to get our system working but it seems to go into
clustered mode (both RHAS4U1 with patch and RHAS4U3 with patch.)  While the
right answer is to use physical-flat mode, I believe that I may have been wrong
about AMD64 not supporting clustered mode.  I will check with AMD as to the
impact of this.

In the meantime, I'll see how to coax the system into physflat mode.

Comment 12 Nakul Saraiya 2006-05-26 03:31:27 UTC
An update while I get to the root cause.

1. For some reason, boot_cpu_data.cpu_vendor is not being set by the time that
clustered_apic_check() is called - my stock Tyan AMD64 system thinks it is an
Intel EM64T system with your kernel.  So do my systems.
2. The reason for the original hang on our systems was the missing call to
clustered_apic_check() in mpparse.c.  We don't (currently) supply ACPI info, but
rely on mptables.  So this local change fixes the hang, but still puts us into
clustered APIC mode due to #1.
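
The ordering problem in #1 can be illustrated with a small, self-contained C model. It is only a sketch under stated assumptions: the enum values, struct, and function names are hypothetical, the mode-selection rule is simplified from what this report describes, and the assumption that a zero-initialized vendor field reads as Intel is made for illustration rather than taken from the RHEL4 source.

```c
#include <assert.h>

/* Hypothetical vendor codes; treating 0 as Intel is an assumption. */
enum vendor { VENDOR_INTEL = 0, VENDOR_AMD = 2 };
enum apic_mode { APIC_FLAT, APIC_CLUSTER, APIC_PHYSFLAT };

struct cpu_data { int vendor; };

/* Simplified stand-in for the mode check: AMD with >8 CPUs gets physflat. */
static enum apic_mode pick_mode(const struct cpu_data *cpu, int num_cpus)
{
    assert(num_cpus > 0);
    if (num_cpus <= 8)
        return APIC_FLAT;
    if (cpu->vendor == VENDOR_AMD)
        return APIC_PHYSFLAT;
    return APIC_CLUSTER;
}

/* Stand-in for CPU identification filling in the vendor field. */
static void identify_cpu(struct cpu_data *cpu, int detected_vendor)
{
    cpu->vendor = detected_vendor;
}

/* Problematic order: the mode is chosen while vendor is still zero. */
static enum apic_mode boot_check_before_identify(int detected_vendor, int num_cpus)
{
    struct cpu_data cpu = { 0 };
    enum apic_mode mode = pick_mode(&cpu, num_cpus);
    identify_cpu(&cpu, detected_vendor);   /* too late to affect the mode */
    return mode;
}

/* Intended order: identify the CPU first, then choose the mode. */
static enum apic_mode boot_identify_before_check(int detected_vendor, int num_cpus)
{
    struct cpu_data cpu = { 0 };
    identify_cpu(&cpu, detected_vendor);
    return pick_mode(&cpu, num_cpus);
}
```

In this model a 16-core AMD system ends up in clustered mode when the check runs before identification, and in physical flat mode when the order is reversed, which matches the symptom described above.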



Comment 13 Nakul Saraiya 2006-05-26 20:43:23 UTC
Created attachment 130052 [details]
Patch against RHEL4U3 tree to put AMD64 systems into physflat mode

This successfully boots on a Tyan 2-socket and our 16-socket and puts the
16-socket into physical flat mode.  SQA will test this more thoroughly over the
next week or two.

Comment 15 Jason Baron 2006-10-12 16:15:41 UTC
Committed in stream U5 build 42.18. A test kernel with this patch is available
from http://people.redhat.com/~jbaron/rhel4/


Comment 16 RHEL Program Management 2006-10-13 00:03:31 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 17 Jay Turner 2006-10-17 15:34:15 UTC
QE ack for 4.5.

Comment 22 Red Hat Bugzilla 2007-03-18 22:39:43 UTC
User jparadis's account has been closed

Comment 23 Chitrank Seshadri 2007-03-27 21:33:48 UTC
Is there a patch available for this bug for RHAS4U4?

Comment 25 Mike Gahagan 2007-04-03 15:33:09 UTC
Patch is in the -52 kernel, already working for two customers.


Comment 27 Red Hat Bugzilla 2007-05-08 01:34:19 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0304.html