Bug 193803

Summary: numactl --show reports wrong CPU binding
Product: Red Hat Enterprise Linux 4
Component: numactl
Version: 4.0
Hardware: All
OS: Linux
Status: CLOSED NOTABUG
Severity: high
Priority: medium
Reporter: Mike Stroyan <mike.stroyan>
Assignee: Neil Horman <nhorman>
CC: k.georgiou, tao
Doc Type: Bug Fix
Last Closed: 2006-06-14 11:16:14 UTC

Description Mike Stroyan 2006-06-01 18:25:32 UTC
Description of problem:  numactl --show reports wrong CPU binding


Version-Release number of selected component (if applicable): 0.6.4-1.25


How reproducible:

  The problem is completely reproducible on affected hardware.

Steps to Reproduce:
1. Run 'numactl --show' on a system with more CPUs than NUMA nodes.
   The example below shows actual vs. expected output for an rx8620
   with 16 CPUs in 4 cells.
  
Actual results:

nodebind: 0 1 2 3


Expected results:

nodebind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Additional info:
The "numactl.c:show()" function is getting a list of CPUs from a call to
numa_sched_getaffinity().  It then passes that to util.c:printmask().
But printmask() will only print up to numa_max_node() entries.
On a system with more CPUs than nodes it won't show the binding to any
of the CPUs above the maximum node number.
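
A minimal sketch of the effect (hypothetical code, not the shipped source): when the
printing loop is bounded by the node count instead of the CPU count, every set bit
above numa_max_node() is silently dropped, so on a 4-node machine a full 16-CPU
mask prints as "0 1 2 3".

#include <stdio.h>
#include <numa.h>

/* sketch: printing a CPU mask with a loop bounded by the node count */
static void printmask_sketch(const char *name, unsigned long *mask)
{
        int i;
        printf("%s:", name);
        /* bounded by numa_max_node(); it should be bounded by the CPU count */
        for (i = 0; i <= numa_max_node(); i++)
                if (mask[i / (8 * sizeof(long))] & (1UL << (i % (8 * sizeof(long)))))
                        printf(" %d", i);
        printf("\n");
}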

This problem and several others are fixed in the more current versions
of numactl available from ftp://ftp.suse.com/pub/people/ak/numa/ .
The most complete fix would be to incorporate a newer version of numactl.

Comment 1 Neil Horman 2006-06-14 11:16:14 UTC
This is working as designed.  The nodemask output is meant to show which NUMA
nodes a process is limited to allocating memory from.  Unless you have modified
numactl to only bind to certain nodes, running numactl --show on a system with 4
NUMA nodes should show a nodebind output of:
nodebind: 0 1 2 3
not 0 through 15 as you indicate.

There is a cpubind output available in the latest version of numactl.  That will
be available in FC6 & RHEL5.

Comment 2 Mike Stroyan 2006-06-14 16:01:03 UTC
I chose my example poorly.  The current "numactl --show" output for
"nodebind:" is not showing what nodes a process is bound to.  It is
showing a truncated list of the CPUs that a process is bound to.
Here are three more interesting examples using an rx8620 with 16 CPUs
in 4 cells.  It has nodes 0 to 3, each with 4 CPUs and cell-local memory,
plus node 4 with no CPUs and the system's interleaved memory.

Running
 numactl --cpubind 0 numactl --show
binds to CPUs 0,1,2,3 in node 0 and produces
 nodebind: 0 1 2 3
reporting CPUs instead of nodes.

Running
 numactl --cpubind 1 numactl --show
binds to CPUs 4,5,6,7 in node 1 and produces
 nodebind: 4
reporting CPU 4 instead of node 1.  Node 4 actually has no CPUs.

Running
 numactl --cpubind 2,3 numactl --show
binds to CPUs 8,9,10,11,12,13,14,15 in nodes 2 and 3 and produces
 nodebind:
reporting the process is bound to no nodes.

The numactl.c:show() function is calling numa_sched_getaffinity() and
passing the CPU mask from that to util.c:printmask().  But printmask()
will only print up to numa_max_node() entries.  The 0.9.8 version of
numactl has changed the show function to call
libnuma.c:numa_get_run_node_mask() to get an actual mask of nodes to
pass into printmask().
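
Roughly, the fixed show() amounts to something like this (a sketch against the old
nodemask_t libnuma API; the exact printmask() signature in util.c may differ):

nodemask_t nodes;

/* ask libnuma which nodes the process may run on, and print that node
   mask instead of the raw CPU affinity mask */
nodes = numa_get_run_node_mask();
printmask("nodebind", &nodes);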

The 0.6.4-1.25 version of numa_get_run_node_mask has its own problems.
numa_get_run_node_mask loops through comparing
NUMA_NUM_NODES/BITS_PER_LONG array elements, but only
CPU_WORDS(ncpus) elements of the arrays hold real data, so the loop
can run past the end of the nodecpus and cpus arrays.
The 0.9.8 version of numa_get_run_node_mask loops over
just the number of CPUs.
for (k = 0; k < CPU_LONGS(ncpus); k++) {
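
For contrast, the 0.6.4 loop would look roughly like this (a sketch reconstructed
from the description above, not quoted source):

for (k = 0; k < NUMA_NUM_NODES/BITS_PER_LONG; k++) {    /* can index past nodecpus[] and cpus[] */
        if (nodecpus[k] & cpus[k])
                nodemask_set(&mask, i);
}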

There is also a problem in the 0.6.4-1.25 version of number_of_cpus().
If it can open /proc/cpuinfo then number_of_cpus returns the highest
processor number read from that file.  It should instead return
one higher than the highest processor number, because the CPUs are
numbered from 0 to N-1.
The 0.9.8 version uses
        return maxcpus + 1;
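
For illustration, a minimal sketch of the corrected counting (hypothetical code, not
the shipped source): with processors numbered 0 through 15 in /proc/cpuinfo the
highest number read is 15, and the count is maxcpus + 1 = 16.

#include <stdio.h>

/* sketch: count CPUs by scanning the "processor : N" lines of /proc/cpuinfo */
static int number_of_cpus_sketch(void)
{
        int maxcpus = -1, n;
        char line[256];
        FILE *f = fopen("/proc/cpuinfo", "r");
        if (!f)
                return -1;              /* fallback path not shown here */
        while (fgets(line, sizeof(line), f))
                if (sscanf(line, "processor : %d", &n) == 1 && n > maxcpus)
                        maxcpus = n;
        fclose(f);
        /* CPUs are numbered 0..N-1, so the count is the highest number plus one */
        return maxcpus + 1;
}
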
The code for the case when /proc/cpuinfo is unreadable is still
badly broken in version 0.9.8.  It has been changed to use
maxcpus = i*sizeof(long)+k
which is better, but it still loops over
for (k = 0; k < 8; k++)
bits when it should be looping over
for (k = 0; k < sizeof(long); k++)
I don't know if that code is ever used.