Bug 1238614 - numad restricts memory zones causing numademo to throw errors
Summary: numad restricts memory zones causing numademo to throw errors
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: numad
Version: 7.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: rc
Target Release: 7.3
Assignee: Jan Synacek
QA Contact: Petr Sklenar
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2015-07-02 09:41 UTC by Milos Vyletel
Modified: 2019-12-16 04:48 UTC
CC List: 3 users

Fixed In Version: numad-0.5-17.20150602git.el7
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-11-04 06:09:38 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2016:2414 0 normal SHIPPED_LIVE numad bug fix update 2016-11-03 13:57:50 UTC

Description Milos Vyletel 2015-07-02 09:41:26 UTC
Description of problem:

A customer reported that numad behaves differently on rhel6 and rhel7. This was
observed as errors thrown by numademo while numad was running. At first
I thought that this was not a problem, because a cpuset is created for numademo
and, besides cpus, mems is also set to the numa node that was chosen.

crash> ps | grep numademo
> 38803  13541  25  ffff8800360ba220  RU   0.7  273784 262944  numademo
crash> set 38803
    PID: 38803
COMMAND: "numademo"
   TASK: ffff8800360ba220  [THREAD_INFO: ffff880273f88000]
    CPU: 25
  STATE: TASK_RUNNING (ACTIVE)

at this point the task is in the default cgroup

crash> task -R mems_allowed
PID: 38803  TASK: ffff8800360ba220  CPU: 25  COMMAND: "numademo"
  mems_allowed = {
    bits = {15, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
  }, 

crash> eval -b 15
hexadecimal: f  
    decimal: 15  
      octal: 17
     binary: 0000000000000000000000000000000000000000000000000000000000001111
   bits set: 3 2 1 0 

so nodes 0-3 are allowed

echo $! > /sys/fs/cgroup/cpuset/test/tasks  (to simulate what numad does; see the sketch below)

crash> task -R mems_allowed
PID: 38803  TASK: ffff8800360ba220  CPU: 20  COMMAND: "numademo"
  mems_allowed = {
    bits = {8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
  }, 

crash> eval -b 8
hexadecimal: 8  
    decimal: 8  
      octal: 10
     binary: 0000000000000000000000000000000000000000000000000000000000001000
   bits set: 3 

only node 3 is allowed

[root@ibm-x3850m3-1 ~]# cat /sys/fs/cgroup/cpuset/test/cpuset.mems
3
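
For reference, the restricted "test" group used above can be recreated by hand
like this (a minimal sketch only; the group name and the choice of node 3
simply mirror the setup shown above):

mkdir /sys/fs/cgroup/cpuset/test
# limit the group to the cpus and memory of node 3
cat /sys/devices/system/node/node3/cpulist > /sys/fs/cgroup/cpuset/test/cpuset.cpus
echo 3 > /sys/fs/cgroup/cpuset/test/cpuset.mems
# move the backgrounded numademo into the group (same echo as above)
echo $! > /sys/fs/cgroup/cpuset/test/tasks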

This is why a call to set_mempolicy() or mbind() fails with -EINVAL if
any node other than 3 is requested.
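
The failure is easy to reproduce from inside the restricted group without
numademo (a sketch; numactl is used here only as a convenient wrapper around
set_mempolicy(), and "test" is the group created above):

# move the current shell into the restricted cpuset (cpuset.mems = 3)
echo $$ > /sys/fs/cgroup/cpuset/test/tasks
numactl --membind=3 true   # succeeds: node 3 is in cpuset.mems
numactl --membind=0 true   # expected to fail with "set_mempolicy: Invalid argument"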

Again, this is how it was designed and how it works on rhel7. However, this
does not happen on rhel6, even though the kernel side of the code is the same.
I later found out that numad in rhel6.7 does not work the same way and does not
create a cpuset at all:

[root@ibm-x3850x5-1 ~]# rpm -q numad
numad-0.5-12.20150602git.el6.x86_64

Wed Jul  1 07:15:47 2015: Nodes: 2
Min CPUs free: 619, Max CPUs: 719, Avg CPUs: 669, StdDev: 50
Min MBs free: 10147, Max MBs: 10726, Avg MBs: 10436, StdDev: 289.5
Node 0: MBs_total 16330, MBs_free  10147, CPUs_total 720, CPUs_free  719,  Distance: 10 11  CPUs: 0-5,12-17
Node 1: MBs_total 16384, MBs_free  10726, CPUs_total 720, CPUs_free  619,  Distance: 11 10  CPUs: 6-11,18-23
Wed Jul  1 07:15:47 2015: Processes: 491
Wed Jul  1 07:15:47 2015: Candidates: 1
761038: PID 25069: (numademo), Threads  1, MBs_size  10248, MBs_used  10240, CPUs_used   99, Magnitude 1013760, Nodes: 1
Wed Jul  1 07:15:47 2015: PICK NODES FOR:  PID: 25069,  CPUs 100,  MBs 12047
Wed Jul  1 07:15:47 2015: PROCESS_MBs[0]: 5119
Wed Jul  1 07:15:47 2015: PROCESS_MBs[1]: 5121
Wed Jul  1 07:15:47 2015: Interleaved MBs: 2
Wed Jul  1 07:15:47 2015:     Node[0]: mem: 137602  cpu: 1219
Wed Jul  1 07:15:47 2015:     Node[1]: mem: 137818  cpu: 1218
Wed Jul  1 07:15:47 2015: Totmag[0]: 1677368
Wed Jul  1 07:15:47 2015: Totmag[1]: 1678623
Wed Jul  1 07:15:47 2015: best_node_ix: 1
Wed Jul  1 07:15:47 2015: Node: 1  Dist: 10  Magnitude: 167862324
Wed Jul  1 07:15:47 2015: Node: 0  Dist: 11  Magnitude: 167736838
Wed Jul  1 07:15:47 2015: MBs: 12047,  CPUs: 100
Wed Jul  1 07:15:47 2015: Assigning resources from node 1
Wed Jul  1 07:15:47 2015:     Node[0]: mem: 17348  cpu: 618
Wed Jul  1 07:15:47 2015: Advising pid 25069 (numademo) move from nodes (1) to nodes (1)
Wed Jul  1 07:15:47 2015: Moving memory from node: 0 to node 1
Wed Jul  1 07:16:00 2015: PID 25069 moved to node(s) 1 in 12.93 seconds

whereas the numad version from rhel7.1, rebuilt from

numad-0.5-14.20140620git.el7.src.rpm

does the same thing on this host as it does on rhel7:
Wed Jul  1 07:45:50 2015: Candidates: 1
941329: PID 25069: (numademo), Threads  1, MBs_size  10248, MBs_used  10240, CPUs_used  100, Magnitude 1024000, Nodes: 0-1
Wed Jul  1 07:45:50 2015: PICK NODES FOR:  PID: 25069,  CPUs 100,  MBs 12056
Wed Jul  1 07:45:50 2015: PROCESS_MBs[0]: 5119
Wed Jul  1 07:45:50 2015: PROCESS_MBs[1]: 5121
Wed Jul  1 07:45:50 2015: Interleaved MBs: 2
Wed Jul  1 07:45:50 2015: PROCESS_CPUs[0]: 36
Wed Jul  1 07:45:50 2015:     Node[0]: mem: 130166  cpu: 1420
Wed Jul  1 07:45:50 2015: PROCESS_CPUs[1]: 36
Wed Jul  1 07:45:50 2015:     Node[1]: mem: 131939  cpu: 1359
Wed Jul  1 07:45:50 2015: MBs: 12056,  CPUs: 100
Wed Jul  1 07:45:50 2015: Sorted magnitude[0]: 196387952
Wed Jul  1 07:45:50 2015: Sorted magnitude[1]: 190511669
Wed Jul  1 07:45:50 2015:     Node[0]: mem: 9606  cpu: 620
Wed Jul  1 07:45:50 2015: Advising pid 25069 (numademo) move from nodes (0-1) to nodes (0)
Wed Jul  1 07:45:50 2015: Making new cpuset: /cgroup/cpuset/numad.25069
Wed Jul  1 07:45:50 2015: Writing 0-23 to: /cgroup/cpuset/numad.25069/cpuset.cpus
Wed Jul  1 07:45:50 2015: Writing 1 to: /cgroup/cpuset/numad.25069/cpuset.memory_migrate
Wed Jul  1 07:45:50 2015: Writing 0 to: /cgroup/cpuset/numad.25069/cpuset.mems
Wed Jul  1 07:45:50 2015: Including PID: 25069 in cpuset: /cgroup/cpuset/numad.25069
Wed Jul  1 07:45:56 2015: Writing 0-5,12-17 to: /cgroup/cpuset/numad.25069/cpuset.cpus
Wed Jul  1 07:45:56 2015: PID 25069 moved to node(s) 0 in 6.71 seconds

Note this is still on the same rhel6.7 server, just with the rhel7 numad.
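
The restriction can also be confirmed directly on the cpuset that numad creates
(a sketch; the values in the comments are taken from the log above, and the
numad.25069 directory only exists while the advised process is running):

cat /cgroup/cpuset/numad.25069/cpuset.cpus   # 0-5,12-17 after the final write
cat /cgroup/cpuset/numad.25069/cpuset.mems   # 0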

The change in behaviour in rhel6.7 was introduced by

https://bugzilla.redhat.com/show_bug.cgi?id=1150585

which got rid of the cpuset and used sched_setaffinity() to set the cpumask
instead. This call, however, does not affect memory nodes, so
set_mempolicy()/mbind() will not fail, and the process can allocate from one
memory node while its cpumask keeps pointing to another numa node.

crash> set hex
output radix: 16 (hex)
crash> task -R mems_allowed,cpus_allowed
PID: 25556  TASK: ffff880875930ab0  CPU: 7   COMMAND: "numademo"
  cpus_allowed = {
    bits = {0xfc0fc0, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff}
  }, 
  mems_allowed = {
    bits = {0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}
  }, 

crash> eval -b 0xfc0fc0
hexadecimal: fc0fc0  
    decimal: 16519104  
      octal: 77007700
     binary: 0000000000000000000000000000000000000000111111000000111111000000
   bits set: 23 22 21 20 19 18 11 10 9 8 7 6 
crash> eval -b 0x3
hexadecimal: 3  
    decimal: 3  
      octal: 3
     binary: 0000000000000000000000000000000000000000000000000000000000000011
   bits set: 1 0 

which means that even though we are bound to the CPUs of node 1, we can still
use both memory nodes.
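
The same split can be seen on a live system without crash (a sketch; PID 25556
is the numademo instance from the dump above):

grep -E 'Cpus_allowed_list|Mems_allowed_list' /proc/25556/status
# Cpus_allowed_list:  6-11,18-23   <- affinity mask set by numad via sched_setaffinity()
# Mems_allowed_list:  0-1          <- memory nodes untouched; both are still usable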

To be honest, I'm not really sure which approach is better, as I see benefits
in both. However, I do believe we should keep the behaviour the same between
the releases.

Version-Release number of selected component (if applicable):
numad-0.5-14.20140620git.el7

How reproducible:
always

Steps to Reproduce:
1. service numad start
2. numademo -t 256m ptrchase

Actual results:
output of numademo contains lots of 
      set_mempolicy: Invalid argument
errors

Expected results:
no -EINVAL errors printed

Additional info:

Note: I just realized I did not try the rhel6.6 version, but I suppose it is
the same as rhel7.1. I assume it behaves the same way, although the customer
reports it does not; I think that is because I had to raise the memory usage
much higher on rhel6 before numad would notice the numademo process and take
action.

The system was a completely idle minimal install, with just numad and numademo
running besides the original daemons.

Comment 1 Milos Vyletel 2015-07-02 09:52:45 UTC
Note on rhel6

Thu Jul  2 05:50:11 2015: Advising pid 7221 (numademo) move from nodes (0-1) to nodes (0)
Thu Jul  2 05:50:11 2015: Making new cpuset: /cgroup/cpuset/numad.7221
Thu Jul  2 05:50:12 2015: PID 7221 moved to node(s) 0 in 0.89 seconds
^C
[root@ibm-x3850x5-1 ~]# ls /cgroup/cpuset/numad.7221
cgroup.event_control  cpuset.mem_hardwall        cpuset.mems
cgroup.procs          cpuset.memory_migrate      cpuset.sched_load_balance
cpuset.cpu_exclusive  cpuset.memory_pressure     cpuset.sched_relax_domain_level
cpuset.cpus           cpuset.memory_spread_page  notify_on_release
cpuset.mem_exclusive  cpuset.memory_spread_slab  tasks
[root@ibm-x3850x5-1 ~]# cat /cgroup/cpuset/numad.7221/cpuset.mem
cpuset.mem_exclusive       cpuset.memory_pressure     cpuset.mems
cpuset.mem_hardwall        cpuset.memory_spread_page  
cpuset.memory_migrate      cpuset.memory_spread_slab  
[root@ibm-x3850x5-1 ~]# cat /cgroup/cpuset/numad.7221/cpuset.mems 
0-1
[root@ibm-x3850x5-1 ~]# cat /cgroup/cpuset/numad.7221/cpuset.cpus 
0-5,12-17

in this case we do not seem to be limiting memory nodes, only cpus

Comment 9 errata-xmlrpc 2016-11-04 06:09:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2414.html

