Bug 870326

Summary: migrate_pages() reports success, but pages are not moved to desired node
Product: Red Hat Enterprise Linux 6
Component: kernel
Version: 6.4
Status: CLOSED NOTABUG
Severity: unspecified
Priority: unspecified
Hardware: Unspecified
OS: Unspecified
Target Milestone: rc
Reporter: Jan Stancek <jstancek>
Assignee: Larry Woodman <lwoodman>
QA Contact: Kernel General QE <kernel-general-qe>
CC: aarcange, aquini, atomlin, jburke, lwoodman, nobody+295318, pbunyan, riel, wgomerin
Doc Type: Bug Fix
Type: Bug
Last Closed: 2015-10-14 18:30:41 UTC
Bug Blocks: 1270638
Attachments: reproducer v1

Description Jan Stancek 2012-10-26 07:46:49 UTC
Description of problem:
If the target node is low on memory, migrate_pages() can report success
even though the pages are not moved to the desired node.

For example, on a host with 4 nodes (0-3):

# numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 2 4 6 8 10
node 0 size: 4086 MB
node 0 free: 58 MB
node 1 cpus: 12 14 16 18 20 22
node 1 size: 4096 MB
node 1 free: 2229 MB
node 2 cpus: 13 15 17 19 21 23
node 2 size: 4096 MB
node 2 free: 3974 MB
node 3 cpus: 1 3 5 7 9 11
node 3 size: 4096 MB
node 3 free: 41 MB

mpages.c allocates a shared anonymous page, then calls migrate_pages() to move the current process to each node in turn. After each move it checks whether the page is on the desired node.

# gcc mpages.c -lnuma
# ./a.out 
1. shared mem is on node: 1
2. shared mem is on node: 1
3. shared mem is on node: 2
4. shared mem is on node: 1
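
The attached mpages.c is not reproduced on this page. The following is only a rough sketch of the check it is described as doing, under some assumptions: it uses raw syscall(2) wrappers instead of libnuma so that it builds without -lnuma, and the names page_node(), migrate_self(), check_migration() and SKETCH_MAX_NODE are made up for illustration. It migrates from whichever node the page currently sits on, which may differ from what the real reproducer passes as old_nodes.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Highest node number this sketch accepts (an arbitrary limit, chosen so
 * that a single unsigned long is enough for each node mask). */
#define SKETCH_MAX_NODE 31

/* Node currently backing the page at addr, or a negative errno value. */
int page_node(void *addr)
{
    void *pages[1] = { addr };
    int status[1] = { -ENOENT };

    /* With nodes == NULL, move_pages() moves nothing and only reports
     * the current node of each page in status[]. */
    if (syscall(SYS_move_pages, 0, 1UL, pages, NULL, status, 0) == -1)
        return -errno;
    return status[0];
}

/* Ask the kernel to migrate the calling process's pages from node `from`
 * to node `to`. Returns the raw syscall result, or a negative errno value. */
long migrate_self(int from, int to)
{
    if (from < 0 || from > SKETCH_MAX_NODE || to < 0 || to > SKETCH_MAX_NODE)
        return -EINVAL;

    unsigned long old_nodes = 1UL << from;
    unsigned long new_nodes = 1UL << to;
    long rc = syscall(SYS_migrate_pages, 0, 8 * sizeof(unsigned long),
                      &old_nodes, &new_nodes);
    return rc == -1 ? -errno : rc;
}

/* Walk one shared anonymous page through nodes 0..nr_nodes-1 and count the
 * steps where it did not end up on the requested node. Returns that count,
 * or a negative errno value if the page could not be set up or queried. */
int check_migration(int nr_nodes)
{
    long page_size = sysconf(_SC_PAGESIZE);
    char *page = mmap(NULL, (size_t)page_size, PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED)
        return -errno;
    page[0] = 1;                      /* fault the page in */

    int misplaced = 0;
    for (int to = 0; to < nr_nodes; to++) {
        int cur = page_node(page);
        if (cur < 0) {
            munmap(page, (size_t)page_size);
            return cur;
        }
        long rc = migrate_self(cur, to);
        int now = page_node(page);
        printf("%d. migrate_pages rc=%ld, shared mem is on node: %d\n",
               to + 1, rc, now);
        if (now != to)
            misplaced++;
    }
    munmap(page, (size_t)page_size);
    return misplaced;
}
```

On the 4-node host above this would be driven as check_migration(4) from main(); a non-zero result together with rc=0 on every step corresponds to the behaviour reported here.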

Version-Release number of selected component (if applicable):
kernel-2.6.32-336.el6.x86_64

How reproducible:
90%

Steps to Reproduce:
1. exhaust memory, for example by filling the page cache (the numbers need to be adjusted depending on your node sizes):
dd if=/dev/zero of=/root/GB1 count=1024000 bs=1024
cat /root/GB1 > /dev/null
2. run the reproducer (on a host with 4 nodes: 0-3)
-or-
2. allocate a shared anon page (mmap)
3. migrate the current process to each node and check that the allocated page has been migrated to the desired node
4. optionally also check /proc/pid/numa_maps
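
For the optional numa_maps check in step 4, the per-node page counts appear in each line as N<node>=<pages> tokens. A small sketch of pulling one count out of a line could look like this; the function name numa_maps_pages_on_node() and the sample line in the usage note are made up for illustration, and the parsing assumes whitespace-separated tokens.

```c
#include <stdio.h>

/* Return how many pages of the mapping described by one numa_maps line
 * are on `node`, or 0 if the line has no N<node>= token for it. */
long numa_maps_pages_on_node(const char *line, int node)
{
    const char *p = line;

    while (*p) {
        /* skip whitespace between tokens */
        while (*p == ' ' || *p == '\t' || *p == '\n')
            p++;

        int n;
        long pages;
        if (sscanf(p, "N%d=%ld", &n, &pages) == 2 && n == node)
            return pages;

        /* not the token we want: advance to the end of it */
        while (*p && *p != ' ' && *p != '\t' && *p != '\n')
            p++;
    }
    return 0;
}
```

For a hypothetical line such as "7f2a4c000000 default anon=1 dirty=1 N1=1", asking for node 1 gives 1 and asking for node 0 gives 0, so a page that stayed behind shows up as a count on the wrong node.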

Actual results:
migrate_pages() reports success (return value is 0), but some pages are not moved to the desired node.

Expected results:
If some pages cannot be moved to the desired node, migrate_pages() should return the number of pages that could not be moved.
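
As an illustration of the count meant here, a caller could derive it on its own from the status array that move_pages() fills in when it is called with nodes == NULL (in that mode it only reports the node of each page). This is just a sketch; count_off_target() is a made-up helper name.

```c
#include <stddef.h>

/* Count entries in a move_pages() status array that are not on `target`.
 * Negative entries (per-page errors such as -ENOENT) are counted as
 * misplaced too, since the page is not known to be on the target node. */
size_t count_off_target(const int *status, size_t n, int target)
{
    size_t off = 0;

    for (size_t i = 0; i < n; i++)
        if (status[i] != target)
            off++;
    return off;
}
```

A result of 0 would mean every queried page is on the target node; anything else is the number of pages a caller would have expected migrate_pages() to report.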

Additional info:
Also reproducible with kernel-3.6.0-0.28.el7.x86_64

Comment 1 Jan Stancek 2012-10-26 07:51:18 UTC
Created attachment 633710 [details]
reproducer v1

On host with 4 nodes (0-3):

# gcc mpages.c -lnuma
# cat bigfile > /dev/null
# ./a.out 
1. shared mem is on node: 1
2. shared mem is on node: 1
3. shared mem is on node: 2
4. shared mem is on node: 1

Comment 3 Larry Woodman 2012-12-04 20:08:48 UTC
The problem is that new_node_page() in mm/mempolicy.c calls alloc_pages_exact_node() with just GFP_HIGHUSER_MOVABLE rather than GFP_HIGHUSER_MOVABLE|GFP_THISNODE, as new_page_node() in mm/migrate.c does.

However, when I make this change:

-----------------------------------------------------------------------
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 060437d..93cab05 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -920,7 +920,7 @@ static void migrate_page_add(struct page *page, struct list_head *pagelist,
 
 static struct page *new_node_page(struct page *page, unsigned long node, int **x)
 {
-       return alloc_pages_exact_node(node, GFP_HIGHUSER_MOVABLE, 0);
+       return alloc_pages_exact_node(node, GFP_HIGHUSER_MOVABLE|GFP_THISNODE, 0);
 }
 
 /*
-------------------------------------------------------------------------

the whole migrate_pages() system call fails with -ENOMEM, which isn't what the man page says.

[root@hp-dl580g7-01 lwoodman]# ./a.out 
migrate_pages failed: -1
Cannot allocate memory


Actually the man page says it might move the pages to another node...

----------------------------------------------------------------------------
MIGRATE_PAGES(2)                              Linux Programmer's Manual                              MIGRATE_PAGES(2)

NAME
       migrate_pages - move all pages in a process to another set of nodes

SYNOPSIS
       #include <numaif.h>

       long migrate_pages(int pid, unsigned long maxnode,
                          const unsigned long *old_nodes,
                          const unsigned long *new_nodes);

       Link with -lnuma.

DESCRIPTION
       migrate_pages()  moves  all pages of the process pid that are in memory nodes old_nodes to the memory nodes in
       new_nodes.  Pages not located in any node in old_nodes will not be migrated.  As far as possible,  the  kernel
       maintains the relative topology relationship inside old_nodes during the migration to new_nodes.

       The  old_nodes  and  new_nodes arguments are pointers to bit masks of node numbers, with up to maxnode bits in
       each mask.  These masks are maintained as arrays of unsigned long integers (in the last long integer, the bits
       beyond  those  specified  by maxnode are ignored).  The maxnode argument is the maximum node number in the bit
       mask plus one (this is the same as in mbind(2), but different from select(2)).

       The pid argument is the ID of the process whose pages are to be moved.  To move pages in another process,  the
       caller  must  be  privileged (CAP_SYS_NICE) or the real or effective user ID of the calling process must match
       the real or saved-set user ID of the target process.  If pid is 0, then migrate_pages()  moves  pages  of  the
       calling process.

       Pages shared with another process will only be moved if the initiating process has the CAP_SYS_NICE privilege.

RETURN VALUE
       On success migrate_pages() returns zero.  On error, it returns -1, and sets errno to indicate the error.

ERRORS
       EPERM  Insufficient  privilege  (CAP_SYS_NICE)  to move pages of the process specified by pid, or insufficient
              privilege (CAP_SYS_NICE) to access the specified target nodes.

       ESRCH  No process matching pid could be found.

VERSIONS
       The migrate_pages() system call first appeared on Linux in version 2.6.16.

CONFORMING TO
       This system call is Linux-specific.

NOTES
       For information on library support, see numa(7).

       Use get_mempolicy(2) with the MPOL_F_MEMS_ALLOWED flag to obtain the set of nodes  that  are  allowed  by  the
       calling  process's cpuset.  Note that this information is subject to change at any time by manual or automatic
       reconfiguration of the cpuset.

       Use of migrate_pages() may result in pages whose location (node) violates the memory  policy  established  for
       the  specified  addresses (see mbind(2)) and/or the specified process (see set_mempolicy(2)).  That is, memory
       policy does not constrain the destination nodes used by migrate_pages().

Comment 4 Larry Woodman 2012-12-04 21:24:14 UTC
After further investigation: the move_pages() syscall already calls alloc_pages_exact_node() with GFP_THISNODE, so it returns -ENOMEM if pages cannot be allocated on the desired node.  At this point I'd say migrate_pages() is supposed to silently fail when it can't get memory on the target node, and move_pages() is supposed to actually fail when it can't get memory on the target node.
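
To make the difference concrete, here is a user-space toy model (not kernel code) of the two allocation behaviours. Everything in it is an assumption for illustration: toy_alloc_node() and TOY_NO_NODE are made-up names, and the round-robin fallback order stands in for the kernel's real distance-ordered zonelist.

```c
/* Returned when the toy allocator cannot satisfy the request. */
#define TOY_NO_NODE (-1)

/* free_pages[i] is the number of free pages on node i.
 * this_node_only != 0 models GFP_THISNODE: only `preferred` is tried.
 * this_node_only == 0 models the plain case: fall back to other nodes.
 * Returns the node that supplied the page, or TOY_NO_NODE. */
int toy_alloc_node(long *free_pages, int nr_nodes, int preferred,
                   int this_node_only)
{
    if (preferred < 0 || preferred >= nr_nodes)
        return TOY_NO_NODE;

    for (int i = 0; i < nr_nodes; i++) {
        /* toy fallback order: preferred, then the following nodes */
        int node = (preferred + i) % nr_nodes;

        if (free_pages[node] > 0) {
            free_pages[node]--;
            return node;
        }
        if (this_node_only)
            break;              /* no fallback allowed */
    }
    return TOY_NO_NODE;
}
```

With node 0 exhausted, the fallback variant quietly hands back a page from another node (the migrate_pages() behaviour seen in this bug), while the this-node-only variant reports failure (the move_pages() behaviour, surfacing as -ENOMEM).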

I'll check upstream with Christoph Lameter, the author of both system calls.


Larry

Comment 5 RHEL Program Management 2012-12-14 08:20:10 UTC
This request was not resolved in time for the current release.
Red Hat invites you to ask your support representative to
propose this request, if still desired, for consideration in
the next release of Red Hat Enterprise Linux.

Comment 6 Larry Woodman 2013-03-11 15:35:17 UTC
I checked with Christoph Lameter and he did say migrate_pages() is supposed to silently fail when it can't get memory on the target node, and move_pages() is supposed to actually fail when it can't get memory on the target node.

Are you OK with this, or should we try to convince him otherwise?


Larry Woodman

Comment 7 Jan Stancek 2013-03-11 15:55:13 UTC
(In reply to comment #6)
> I checked with Christoph Lameter and he did say migrate_pages() is supposed
> to silently fail when it cant get memory on the target node and move_pages()
> is supposed actually fail when it cant get memory on the target node.
> 
> Are you OK with this or should we try to convince him otherwise???

I can accept that 'silent fail' is the way it's supposed to work, but it would be nice to mention that in the documentation as well [1].

migrate_pages(2) currently says:
  RETURN VALUE
    ... a return of zero means that all pages were successfully moved

[1] http://git.kernel.org/pub/scm/docs/man-pages/man-pages