Bug 717153 - get_nprocs (getsysstats.c) returns wrong number of CPUs in a very large system
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: glibc
Version: 5.6
Hardware: x86_64
OS: Linux
Priority: medium
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: Jeff Law
QA Contact: qe-baseos-tools-bugs
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2011-06-28 07:43 UTC by Tal Nevo
Modified: 2016-11-24 15:37 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-05-09 14:35:42 UTC
Target Upstream Version:


Description Tal Nevo 2011-06-28 07:43:31 UTC
Description of problem:

Running OpenMP applications compiled with the Intel compiler, I noticed that sporadically the OpenMP library would detect the wrong number of CPUs. When talking to Intel about it, their support indicated that their library uses the following call to determine the number of CPUs: "sysconf(_SC_NPROCESSORS_ONLN)".
When investigating this problem I found that the above function calls a function named 'get_nprocs()' to retrieve the number of CPUs. A small test program that calls 'get_nprocs()' directly was showing the same problem seen with the Intel OpenMP libraries.
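A test along these lines can be sketched as follows. This is not the original test program from this report; the helper name 'count_nprocs_mismatches' is made up for illustration. It calls 'get_nprocs()' repeatedly and counts how often the result differs from the first call. On an unaffected system the count should stay at zero unless CPUs are hot-plugged while it runs.

```c
#include <stdio.h>
#include <sys/sysinfo.h>   /* get_nprocs() (GNU extension) */

/* Hypothetical helper: call get_nprocs() `iterations` times and return
   how many of the later calls disagree with the first one. */
int count_nprocs_mismatches(int iterations)
{
    int first, n, i;
    int mismatches = 0;

    if (iterations <= 0)
        return 0;

    first = get_nprocs();
    for (i = 1; i < iterations; i++) {
        n = get_nprocs();
        if (n != first) {
            mismatches++;
            fprintf(stderr,
                    "call %d: get_nprocs() returned %d, first call returned %d\n",
                    i, n, first);
        }
    }
    return mismatches;
}
```

Calling 'count_nprocs_mismatches(1000)' from 'main' and printing the result is enough to show whether the count is stable on a given machine.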

When searching the net for this problem I discovered that it was reported in March 2010 and fixed in April 2010.

Here is a link detailing the problem, its cause and the solution for it:

http://sourceware.org/bugzilla/show_bug.cgi?id=11432

Here is a link to the solution patch itself:

http://sources.redhat.com/git/gitweb.cgi?p=glibc.git;a=commitdiff;h=3ed8e241229e370cca96650ed727f09838c51d67

And here is a link indicating when the patch was applied to glibc-2.11-311:

http://cygwin.com/ml/glibc-cvs/2010-q2/msg00003.html


HOWEVER, RHEL 5 uses glibc-2.5. In fact, the problem is only seen in later update releases of RHEL 5: It does not exist in RHEL 5.3, but does show up in RHEL 5.5, 5.6 and 5.7 beta.

It appears that a patch to getsysstats.c has been backported to glibc-2.5. The patch file in the source RPM is named 'glibc-expmalloc4.patch'. Unfortunately, this new version of getsysstats.c exhibits the problem described above. The later patch that fixed the problem was not backported along with it, leaving the version of 'get_nprocs()' used by RHEL 5.5 and beyond with this bug (including the current 5.7 beta).


TO SUMMARIZE: This is a known bug that was fixed more than a year ago, but the fix was never ported to the glibc-2.5 used in the latest RHEL 5 versions.



Version-Release number of selected component (if applicable):
glibc-2.5-49, glibc-2.5-58, glibc-2.5-65

How reproducible:
Intermittent. The problem can occur on a system with 160 CPUs or more. I used a system with 768 CPUs and ran into the problem at least 30% of the time with any OpenMP application that was compiled with the Intel compiler on the system.

Steps to Reproduce:
1. Use a system with a large number of CPUs (160 may do, but use at least 192 to be certain).
2. Compile an OpenMP application (e.g. STREAM) with Intel compiler 11.
3. Set KMP_AFFINITY=compact,verbose
4. Run repeatedly. Check the number of CPUs detected as reported by Intel's OpenMP. When it varies from the correct value, it is due to this bug in 'get_nprocs()'.
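
To confirm that a wrong count comes from 'get_nprocs()' rather than from the OpenMP runtime, the value can be compared against an independent count. The sketch below is one possible cross-check, not part of glibc or of the upstream fix: it assumes that /proc/stat lists one "cpuN" line per online CPU, and the helper name 'count_cpu_lines' is made up for illustration. It tracks line boundaries explicitly so that very long lines, which 'fgets()' returns in several chunks, are not miscounted.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: count the lines in `path` that start with "cpu"
   followed by a digit (cpu0, cpu1, ...). The aggregate "cpu " line is
   skipped. Returns -1 if the file cannot be opened. */
int count_cpu_lines(const char *path)
{
    char chunk[4096];
    int count = 0;
    int at_line_start = 1;   /* does the next chunk begin a new line? */
    size_t len;
    FILE *fp = fopen(path, "r");

    if (fp == NULL)
        return -1;

    while (fgets(chunk, sizeof chunk, fp) != NULL) {
        if (at_line_start
            && strncmp(chunk, "cpu", 3) == 0
            && isdigit((unsigned char) chunk[3]))
            count++;

        /* A chunk that does not end in '\n' is only part of a long line,
           so the next chunk continues that same line. */
        len = strlen(chunk);
        at_line_start = (len > 0 && chunk[len - 1] == '\n');
    }

    fclose(fp);
    return count;
}
```

If 'count_cpu_lines("/proc/stat")' stays constant across runs while 'get_nprocs()' varies, the library call is the likely culprit.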
  
Actual results:
This is how Intel's OpenMP may report the number of CPUs:
OMP: Info #156: KMP_AFFINITY: 157 available OS procs
and sometimes it was:
OMP: Info #156: KMP_AFFINITY: 487 available OS procs

Expected results:
This is how Intel's OpenMP should report the number of CPUs detected:
OMP: Info #156: KMP_AFFINITY: 768 available OS procs

Additional info:

Comment 1 RHEL Program Management 2012-04-02 13:09:55 UTC
This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux release.  Product Management has
requested further review of this request by Red Hat Engineering, for
potential inclusion in a Red Hat Enterprise Linux release for currently
deployed products.  This request is not yet committed for inclusion in
a release.

Comment 4 RHEL Program Management 2012-05-09 14:35:42 UTC
Quality Engineering Management has reviewed and declined this request.
You may appeal this decision by reopening this request.

