Red Hat Bugzilla – Bug 192760
RHAS4 U3 x86_64 largesmp kernels don't support >8 cores on AMD64
Last modified: 2009-07-20 05:58:16 EDT
Description of problem:
The RHAS4U3 'largesmp' kernel does not support >8 cores on AMD64 systems. 8-socket dual-core systems require 'physical flat' APIC mode, as in the kernel.org kernels.
Version-Release number of selected component (if applicable):
RHEL4 U3 x86_64 'largesmp'
How reproducible:
Boot an 8-socket dual-core (total 16 cores) AMD64 system with the largesmp kernel.
Steps to Reproduce:
1. See above
Actual results:
Expected results:
Additional info:
RHEL4 already supports >8 CPUs in Update 3 using clustered APIC mode. This has been verified in-house with an 8-way dual-core system.
AMD64 does *not* support clustered APIC mode for >8 cores; Intel EM64T does. The boot hangs on our AMD Opteron systems, probably because of lost interrupts. See the kernel.org source under arch/x86_64/kernel/genapic.c and .../genapic_flat.c and .../mpparse.c. I have patches against both RHEL4 U1 (in use) and RHEL4 U3 (in test) that make this work.
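For readers unfamiliar with the APIC modes under discussion, here is a minimal sketch of why logical flat mode tops out at 8 CPUs. It assumes the usual xAPIC layout, in which a logical flat destination is an 8-bit mask with one bit per CPU, while physical mode addresses a CPU by its APIC ID. The function names are hypothetical and are not taken from the RHEL or kernel.org sources.

```c
#include <stdint.h>

/* Hypothetical helper: in logical flat mode each CPU owns one bit of an
 * 8-bit destination mask, so only CPUs 0..7 can be addressed.
 * Returns 0 if the CPU cannot be represented in the mask. */
uint8_t flat_logical_dest(unsigned int cpu)
{
    if (cpu >= 8)
        return 0;                 /* no bit left for this CPU */
    return (uint8_t)(1u << cpu);
}

/* Hypothetical helper: in physical mode the destination is simply the
 * target CPU's APIC ID, so the mask width is not a limit. */
uint8_t phys_dest(uint8_t apic_id)
{
    return apic_id;
}

/* Returns 1 if every CPU in a system of ncpus CPUs can be targeted in
 * logical flat mode, 0 otherwise. */
int flat_mode_is_sufficient(unsigned int ncpus)
{
    return ncpus <= 8;
}
```

Under these assumptions a 16-core system cannot be covered by the flat mask, which is why some other mode (clustered or physical flat) has to be selected at boot.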
Created attachment 129895 [details] Boot hang #1
Created attachment 129896 [details] Boot hang #2
Created attachment 129897 [details] Working patch for RHAS4 U1 16-core AMD64
This is slightly suboptimal (it should use DM_FIXED delivery mode), and it also fixes some other small annoyances on AMD64. It has been tested and is in use. It is against RHEL4 U1 and was submitted to your Eng team through our Business Development contacts. It also requires that the config be changed to support 16 CPUs (not in the patch).
Created attachment 129898 [details] PROPOSED patch for RHAS4 U3 for 16-core AMD64
This has been compiled but not yet tested. I will update the bug report tomorrow with test results (boot log). I also changed the largesmp config to use 16 CPUs, since that is all that AMD64 supports today, unlike Intel EM64T.
Comment on attachment 129898 [details] PROPOSED patch for RHAS4 U3 for 16-core AMD64
Oops, I changed the 'flat' APIC mode to fixed, rather than 'physical flat'. Will submit an updated patch tomorrow.
Nakul, the bug was changed to "CLOSED WONTFIX" - is that the correct state?
Bugzilla didn't allow me to reopen the bug, so I just flagged it as best I could. I'm currently testing a patch and should have results for you later today.
Nakul, reopening the bug.
Quick update - I have managed to get our system working, but it seems to go into clustered mode (both RHAS4U1 with patch and RHAS4U3 with patch). While the right answer is to use physical-flat mode, I believe that I may have been wrong about whether AMD64 supports clustered mode. I will check with AMD as to the impact of this. In the meantime, I'll see how to coax the system into physflat mode.
An update while I get to the root cause.
1. For some reason, boot_cpu_data.cpu_vendor is not being set by the time that clustered_apic_check() is called - my stock Tyan AMD64 system thinks it is an Intel EM64T system with your kernel. So do my systems.
2. The reason for the original hang on our systems was the missing call to clustered_apic_check() in mpparse.c. We don't (currently) supply ACPI info, but rely on mptables. So this local change fixes the hang, but still puts us into clustered APIC mode due to #1.
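The ordering problem in point 1 can be illustrated with a small sketch. It assumes that the vendor field starts out zeroed and that zero is the code for Intel, which would explain why an AMD system is taken for an EM64T one when the check runs too early. The selection policy shown (flat up to 8 CPUs, physical flat for larger AMD systems, clustered otherwise) is a simplification of what this thread describes, and the struct, enum values, and function name are hypothetical stand-ins, not the RHEL4 code.

```c
/* Hypothetical vendor codes; the assumption is that 0 means Intel,
 * so a not-yet-initialised (zeroed) field reads as Intel. */
enum vendor { VENDOR_INTEL = 0, VENDOR_AMD = 2 };

enum apic_mode { APIC_FLAT, APIC_CLUSTER, APIC_PHYSFLAT };

/* Hypothetical stand-in for the per-CPU data consulted at boot. */
struct cpu_info {
    int vendor;
};

/* Simplified policy: small systems stay in flat mode; larger AMD
 * systems use physical flat; everything else uses clustered mode. */
enum apic_mode pick_apic_mode(const struct cpu_info *c, unsigned int ncpus)
{
    if (ncpus <= 8)
        return APIC_FLAT;
    if (c->vendor == VENDOR_AMD)
        return APIC_PHYSFLAT;
    return APIC_CLUSTER;
}
```

If the mode is picked while the vendor field is still zero, a 16-core AMD system falls through to clustered mode; once the vendor has been filled in, the same call returns physical flat. That matches the symptom reported above, where the hang is fixed but the system still ends up in clustered mode.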
Created attachment 130052 [details] Patch against RHEL4U3 tree to put AMD64 systems into physflat mode
This successfully boots on a Tyan 2-socket and our 16-socket and puts the 16-socket into physical flat mode. SQA will test this more thoroughly over the next week or two.
Committed in stream U5 build 42.18. A test kernel with this patch is available from http://people.redhat.com/~jbaron/rhel4/
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
QE ack for 4.5.
User jparadis@redhat.com's account has been closed
Is there a patch available for this bug for RHAS4U4?
Patch is in the -52 kernel, already working for two customers.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0304.html