Bug 1076990 (qemu-complex-mem)

Summary: Enable complex memory requirements for virtual machines
Product: Red Hat Enterprise Linux 7
Component: qemu-kvm-rhev
Version: 7.0
Reporter: Stephen Gordon <sgordon>
Assignee: Eduardo Habkost <ehabkost>
QA Contact: Virtualization Bugs <virt-bugs>
CC: ehabkost, hhuang, juzhang, knoel, michen, mrezanin, rbalakri, virt-maint, xfu
Status: CLOSED ERRATA
Severity: medium
Priority: high
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Fixed In Version: qemu-kvm-rhev-2.1.2-2.el7
Doc Type: Bug Fix
Type: Bug
Clone Of: libvirt-complex-guest-mem
Last Closed: 2015-03-05 09:44:48 UTC
Bug Depends On: 996259
Bug Blocks: 1076989, 1110708

Description Stephen Gordon 2014-03-17 00:28:20 UTC
+++ This bug was initially created as a clone of Bug #1076989 +++

Description of problem:

Enable the specification of complex memory requirements for virtual machines such as:

* A virtual machine using 2 NUMA nodes, with a different number of huge pages for each NUMA node
* A virtual machine with a specific number of huge pages and an additional amount of memory not backed by huge pages (the latter may be oversubscribed), guaranteeing that all memory comes from the same NUMA node

--- Additional comment from Stephen Gordon on 2014-03-16 20:27:09 EDT ---

In addition (perhaps more generically stated):

It is not currently possible, through libvirt, to guarantee a deterministic allocation of huge pages on specific NUMA nodes.

Comment 2 Eduardo Habkost 2014-07-10 20:08:25 UTC
Moving to qemu-kvm-rhev. Patches were included upstream and will be in QEMU 2.1.0.

Comment 5 FuXiangChun 2014-08-26 06:06:35 UTC
Hi Eduardo,
KVM QE would like to know how to test the two checkpoints below with qemu-kvm-rhev. Would you please provide qemu-kvm-rhev command lines for QE?


1. A virtual machine using 2 NUMA nodes, with a different number of huge pages for each NUMA node
2. A virtual machine with a specific number of huge pages and an additional amount of memory not backed by huge pages (the latter may be oversubscribed), guaranteeing that all memory comes from the same NUMA node

BTW, QE did not find any related patch in qemu-2.1:
#rpm -qpi qemu-kvm-rhev-2.1.0-2.el7.src.rpm --changelog |grep 1076990
nothing
#rpm -qpi qemu-kvm-2.1.0-1.el7.src.rpm --changelog |grep 1076990
nothing

Comment 6 Eduardo Habkost 2014-08-27 17:42:06 UTC
(In reply to FuXiangChun from comment #5)
> 1. A virtual machine using 2 NUMA nodes, with a different number of huge
> pages for each NUMA node

Use a different memdev for each NUMA node, each with its own hugetlbfs mount point and its own "host-nodes" option, e.g.:

  -object memory-backend-file,host-nodes=0,id=mem-0,mem-path=/tmp/hugetlbfs1 \
  -numa node,id=0,memdev=mem-0 \
  -object memory-backend-file,host-nodes=1,id=mem-1,mem-path=/tmp/hugetlbfs2 \
  -numa node,id=1,memdev=mem-1 \
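
For completeness, the two hugetlbfs mount points this example assumes have to exist and have 2MB pages reserved on the intended host nodes beforehand. A minimal sketch (mount points and page counts are illustrative):

  # mkdir -p /tmp/hugetlbfs1 /tmp/hugetlbfs2
  # mount -t hugetlbfs none /tmp/hugetlbfs1
  # mount -t hugetlbfs none /tmp/hugetlbfs2
  # echo 512 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
  # echo 512 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages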
 

> 2. A virtual machine with a specific number of huge pages and an additional
> amount of memory not backed by huge pages (the latter may be
> oversubscribed), guaranteeing that all memory comes from the same NUMA node

Just use a different memdev for each NUMA node. Some nodes may point to memory-backend-file objects, while others point to memory-backend-ram objects.

> 
> BTW, QE did not find any related patch in qemu-2.1:
> #rpm -qpi qemu-kvm-rhev-2.1.0-2.el7.src.rpm --changelog |grep 1076990
> nothing
> #rpm -qpi qemu-kvm-2.1.0-1.el7.src.rpm --changelog |grep 1076990
> nothing

The package was rebased and the patches are already in upstream QEMU version 2.1.0.
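
(As a rough check against the shipped package, one could grep the changelog for the rebase entry rather than individual bug numbers; the exact wording of that entry is an assumption:)

  # rpm -q --changelog qemu-kvm-rhev | grep -i rebase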

Comment 7 Eduardo Habkost 2014-08-27 17:42:53 UTC
(In reply to Eduardo Habkost from comment #6)
> > 2. A virtual machine with a specific number of huge pages and an
> > additional amount of memory not backed by huge pages (the latter may be
> > oversubscribed), guaranteeing that all memory comes from the same NUMA node
> 
> Just use a different memdev for each numa node. Some nodes may point to
> memory-backend-file objects, other nodes may point to memory-backend-ram
> objects.

Example command-line options:

  -object memory-backend-ram,host-nodes=0,id=mem-0 \
  -numa node,id=0,memdev=mem-0 \
  -object memory-backend-file,host-nodes=1,id=mem-1,mem-path=/tmp/hugetlbfs2 \
  -numa node,id=1,memdev=mem-1 \

Comment 8 FuXiangChun 2014-09-01 15:46:45 UTC
Tested qemu-kvm-rhev-2.1.0-3.el7.x86_64 & 3.10.0-148.el7.x86_64
/usr/libexec/qemu-kvm 
-object memory-backend-file,host-nodes=0,id=mem-0,mem-path=/mnt/kvm_hugepage/ -numa node,id=0,memdev=mem-0 
-object memory-backend-file,host-nodes=1,id=mem-1,mem-path=/mnt/kvm_hugepage -numa node,id=1,memdev=mem-1

result:
qemu-kvm: -numa node,id=0,memdev=mem-0: Parameter 'id' expects an identifier

/usr/libexec/qemu-kvm 
-object memory-backend-file,host-nodes=0,id=mem-0,mem-path=/mnt/kvm_hugepage/ -numa node,memdev=mem-0 
-object memory-backend-file,host-nodes=1,id=mem-1,mem-path=/mnt/kvm_hugepage -numa node,memdev=mem-1

result:
qemu-kvm: -object memory-backend-file,host-nodes=0,id=mem-0,mem-path=/mnt/kvm_hugepage/: NUMA node binding are not supported by this QEMU
qemu-kvm: -object memory-backend-file,host-nodes=1,id=mem-1,mem-path=/mnt/kvm_hugepage: NUMA node binding are not supported by this QEMU

Eduardo,
Based on this result, do I need to re-assign this bug?

Comment 9 Eduardo Habkost 2014-09-01 15:52:10 UTC
(In reply to FuXiangChun from comment #8)
> Tested qemu-kvm-rhev-2.1.0-3.el7.x86_64 & 3.10.0-148.el7.x86_64
> /usr/libexec/qemu-kvm 
> -object
> memory-backend-file,host-nodes=0,id=mem-0,mem-path=/mnt/kvm_hugepage/ -numa
> node,id=0,memdev=mem-0 
> -object memory-backend-file,host-nodes=1,id=mem-1,mem-path=/mnt/kvm_hugepage
> -numa node,id=1,memdev=mem-1
> 
> result:
> qemu-kvm: -numa node,id=0,memdev=mem-0: Parameter 'id' expects an identifier

This was my mistake. The proper format is: -numa node,nodeid=X,memdev=Y.

> 
> /usr/libexec/qemu-kvm 
> -object
> memory-backend-file,host-nodes=0,id=mem-0,mem-path=/mnt/kvm_hugepage/ -numa
> node,memdev=mem-0 
> -object memory-backend-file,host-nodes=1,id=mem-1,mem-path=/mnt/kvm_hugepage
> -numa node,memdev=mem-1
> 
> result:
> qemu-kvm: -object
> memory-backend-file,host-nodes=0,id=mem-0,mem-path=/mnt/kvm_hugepage/: NUMA
> node binding are not supported by this QEMU
> qemu-kvm: -object
> memory-backend-file,host-nodes=1,id=mem-1,mem-path=/mnt/kvm_hugepage: NUMA
> node binding are not supported by this QEMU

This is a bug. QEMU should be compiled with --enable-numa. Reopening.
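
(As a quick smoke test once a fixed build is available, assuming the default -m of 128M: an object that sets policy/host-nodes should now be accepted instead of failing as above, e.g.:)

  /usr/libexec/qemu-kvm -display none -S \
    -object memory-backend-ram,id=test,size=128M,policy=bind,host-nodes=0 \
    -numa node,nodeid=0,memdev=test

On a build without --enable-numa this exits immediately with "NUMA node binding are not supported by this QEMU"; on a NUMA-enabled build it starts (paused because of -S).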

Comment 11 FuXiangChun 2014-09-02 07:00:33 UTC
Re-tested this issue with private build qemu-kvm-rhev-2.1.0-3.el7.numa.buildrequires.v1.x86_64. QE tested 2 scenarios.

S1.
/usr/libexec/qemu-kvm -M pc-i440fx-rhel7.0.0 -name RHEL-Server-7.0-64 -m 27G -smp 4,maxcpus=160 \
-object memory-backend-file,host-nodes=0,id=mem-0,policy=bind,prealloc=yes,mem-path=/mnt/kvm_hugepage/,size=1024M \
-numa node,nodeid=0,memdev=mem-0 \
-object memory-backend-file,policy=bind,host-nodes=1,id=mem-1,mem-path=/mnt/kvm_hugepage2,size=1024M \
-numa node,nodeid=1,memdev=mem-1

result:
# grep -2 1048576 smaps
2aaaaac00000-2aaaeac00000 rw-p 00000000 00:26 667856                     /mnt/kvm_hugepage/qemu_back_mem._objects_mem-0.srrTfd (deleted)
Size:            1048576 kB
Rss:                   0 kB
Pss:                   0 kB
--
VmFlags: rd wr mr mw me dc de ht 
2aaaeac00000-2aab2ac00000 rw-p 00000000 00:27 666757                     /mnt/kvm_hugepage2/qemu_back_mem._objects_mem-1.XElhru (deleted)
Size:            1048576 kB
Rss:                   0 kB
Pss:                   0 kB

# cat numa_maps |grep 2aaaaac00000
2aaaaac00000 bind:0 file=/mnt/kvm_hugepage/qemu_back_mem._objects_mem-0.srrTfd\040(deleted) huge anon=512 dirty=512 N0=512

# cat numa_maps |grep 2aaaeac00000
2aaaeac00000 bind:1 file=/mnt/kvm_hugepage2/qemu_back_mem._objects_mem-1.XElhru\040(deleted) huge anon=502 dirty=502 N1=502

S2.
/usr/libexec/qemu-kvm -M pc-i440fx-rhel7.0.0 -name RHEL-Server-7.0-64 -m 2G -smp 4,maxcpus=160 -object memory-backend-file,host-nodes=0,id=mem-0,policy=bind,prealloc=yes,mem-path=/mnt/kvm_hugepage/,size=1536M -numa node,nodeid=0,memdev=mem-0 -object memory-backend-file,policy=bind,host-nodes=1,id=mem-1,mem-path=/mnt/kvm_hugepage2,size=512M -numa node,nodeid=1,memdev=mem-1

result:
# grep -2 1572864 smaps 
2aaaaac00000-2aab0ac00000 rw-p 00000000 00:26 669656                     /mnt/kvm_hugepage/qemu_back_mem._objects_mem-0.DPAbez (deleted)
Size:            1572864 kB
Rss:                   0 kB
Pss:                   0 kB
# grep -2 524288 smaps
VmFlags: rd wr mr mw me dc de ht 
2aab0ac00000-2aab2ac00000 rw-p 00000000 00:27 669657                     /mnt/kvm_hugepage2/qemu_back_mem._objects_mem-1.DrCFi1 (deleted)
Size:             524288 kB
Rss:                   0 kB
Pss:                   0 kB

# cat numa_maps |grep 2aaaaac00000
2aaaaac00000 bind:0 file=/mnt/kvm_hugepage/qemu_back_mem._objects_mem-0.DPAbez\040(deleted) huge anon=768 dirty=768 N0=768

# cat numa_maps |grep 2aab0ac00000
2aab0ac00000 bind:1 file=/mnt/kvm_hugepage2/qemu_back_mem._objects_mem-1.DrCFi1\040(deleted) huge anon=135 dirty=135 N1=135

Eduardo,
Can this result verify this bug?
Also, regarding "an additional amount of memory not backed by huge pages", can you provide a qemu-kvm command-line example to QE? QE does not know how to trigger it.

It is from checkpoint 2: a virtual machine with a specific number of huge pages and an additional amount of memory not backed by huge pages.

Comment 12 Eduardo Habkost 2014-09-03 17:54:27 UTC
(In reply to FuXiangChun from comment #11)
> Re-tested this issue with private build
> qemu-kvm-rhev-2.1.0-3.el7.numa.buildrequires.v1.x86_64. QE tested 2
> scenarios.
> 
> S1.
> /usr/libexec/qemu-kvm -M pc-i440fx-rhel7.0.0 -name RHEL-Server-7.0-64 -m 27G
> -smp 4,maxcpus=160 \
> 
> -object
> memory-backend-file,host-nodes=0,id=mem-0,policy=bind,prealloc=yes,mem-path=/
> mnt/kvm_hugepage/,size=1024M 
> -numa node,nodeid=0,memdev=mem-0 
> -object
> memory-backend-file,policy=bind,host-nodes=1,id=mem-1,mem-path=/mnt/
> kvm_hugepage2,size=1024M 
> -numa node,nodeid=1,memdev=mem-1

I see you didn't use prealloc on mem-1.


> # cat numa_maps |grep 2aaaaac00000
> 2aaaaac00000 bind:0
> file=/mnt/kvm_hugepage/qemu_back_mem._objects_mem-0.srrTfd\040(deleted) huge
> anon=512 dirty=512 N0=512

That means 512 2MB pages, or 1024MB. Looks OK.

> 
> # cat numa_maps |grep 2aaaeac00000
> 2aaaeac00000 bind:1
> file=/mnt/kvm_hugepage2/qemu_back_mem._objects_mem-1.XElhru\040(deleted)
> huge anon=502 dirty=502 N1=502

This one doesn't have all pages allocated because it doesn't have prealloc=yes. But they are all on node 1. Looks OK.

Note that for the above test case, you will need the hugepages to be preallocated on the right nodes. You can do that very early on boot by using:

# echo 512 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
# echo 512 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages

Please also check the contents of those files to ensure enough hugepages were allocated.
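
For example, to confirm the reservation on each node:

  # cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
  # cat /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages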

> 
> S2.
> /usr/libexec/qemu-kvm -M pc-i440fx-rhel7.0.0 -name RHEL-Server-7.0-64 -m 2G
> -smp 4,maxcpus=160 -object
> memory-backend-file,host-nodes=0,id=mem-0,policy=bind,prealloc=yes,mem-path=/
> mnt/kvm_hugepage/,size=1536M -numa node,nodeid=0,memdev=mem-0 -object
> memory-backend-file,policy=bind,host-nodes=1,id=mem-1,mem-path=/mnt/
> kvm_hugepage2,size=512M -numa node,nodeid=1,memdev=mem-1
> 
> result:
> # grep -2 1572864 smaps 
> 2aaaaac00000-2aab0ac00000 rw-p 00000000 00:26 669656                    
> /mnt/kvm_hugepage/qemu_back_mem._objects_mem-0.DPAbez (deleted)
> Size:            1572864 kB
> Rss:                   0 kB
> Pss:                   0 kB
> # grep -2 524288 smaps
> VmFlags: rd wr mr mw me dc de ht 
> 2aab0ac00000-2aab2ac00000 rw-p 00000000 00:27 669657                    
> /mnt/kvm_hugepage2/qemu_back_mem._objects_mem-1.DrCFi1 (deleted)
> Size:             524288 kB
> Rss:                   0 kB
> Pss:                   0 kB
> 
> # cat numa_maps |grep 2aaaaac00000
> 2aaaaac00000 bind:0
> file=/mnt/kvm_hugepage/qemu_back_mem._objects_mem-0.DPAbez\040(deleted) huge
> anon=768 dirty=768 N0=768

768*2MB = 1536MB. All on N0. Looks OK.
> 
> # cat numa_maps |grep 2aab0ac00000
> 2aab0ac00000 bind:1
> file=/mnt/kvm_hugepage2/qemu_back_mem._objects_mem-1.DrCFi1\040(deleted)
> huge anon=135 dirty=135 N1=135

Again, you didn't use prealloc for node 1, so not all pages were allocated. But the ones that were allocated are all on node 1. Looks good.

> 
> Eduardo,
> can this result verify this bug? 
> Another, about "an additional amount of memory not backed by huge pages",
> can you provide a qemu-kvm cli example to QE?  QE don't know how to trigger
> it. 

See comment #7:

  -object memory-backend-ram,host-nodes=0,id=mem-0 \
  -numa node,id=0,memdev=mem-0 \
  -object memory-backend-file,host-nodes=1,id=mem-1,mem-path=/tmp/hugetlbfs2 \
  -numa node,id=1,memdev=mem-1 \

That will use hugepages for guest node 1, and normal pages for guest node 0. In the example above the pages can come from any host node, but you can use policy=bind,host-nodes=X to make sure they come from a specific host node (you can do that for normal pages and for hugepages). I forgot to add the "size" parameters to the memory-backend-* objects, but you can choose reasonable sizes for each one.
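
Putting that together, a complete set of options for this scenario might look like the following (a sketch; sizes and host node numbers are illustrative, and -m must match the total of the two sizes):

  -m 2G \
  -object memory-backend-ram,policy=bind,host-nodes=0,id=mem-0,size=1024M \
  -numa node,nodeid=0,memdev=mem-0 \
  -object memory-backend-file,policy=bind,host-nodes=1,id=mem-1,mem-path=/tmp/hugetlbfs2,size=1024M \
  -numa node,nodeid=1,memdev=mem-1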

Comment 14 Miroslav Rezanina 2014-10-10 07:33:52 UTC
Fix included in qemu-kvm-rhev-2.1.2-2.el7

Comment 16 FuXiangChun 2014-10-30 07:46:29 UTC
Re-verified bug with 3.10.0-195.el7.x86_64 and qemu-kvm-rhev-2.1.2-5.el7.x86_64.

host info:

# cat /proc/buddyinfo 
Node 0, zone      DMA      0      1      1      1      1      1      1      0      1      1      3 
Node 0, zone    DMA32    175     96    139     68     11      5     17     15      7      4      1 
Node 0, zone   Normal    122     87    168     92     36      8      8     20      4      1      0 
Node 1, zone   Normal    890    552    301    136     69     14      6      7      9      5      3 
Node 2, zone   Normal    309    225    195    138    102     54     19     15      9      2      1 
Node 3, zone   Normal    373    303    200    148     54     32     13     10     11      3


S1. With the same number of huge pages for each NUMA node

1. echo 2048 >/sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages

2. echo 2048 >/sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages

3. qemu-kvm cli
/usr/libexec/qemu-kvm -m 8G \
-object memory-backend-file,host-nodes=0,id=mem-0,policy=bind,prealloc=yes,mem-path=/mnt/kvm_hugepage1,size=4096M -numa node,nodeid=0,memdev=mem-0 \
-object memory-backend-file,policy=bind,host-nodes=1,id=mem-1,prealloc=yes,mem-path=/mnt/kvm_hugepage2,size=4096M -numa node,nodeid=1,memdev=mem-1

result:
# grep -2 4194304 smaps 
2aaaaac00000-2aabaac00000 rw-p 00000000 00:26 43472                      /mnt/kvm_hugepage1/qemu_back_mem._objects_mem-0.BCQSYP (deleted)
Size:            4194304 kB
Rss:                   0 kB
Pss:                   0 kB
--
VmFlags: rd wr mr mw me dc de ht 
2aabaac00000-2aacaac00000 rw-p 00000000 00:27 43473                      /mnt/kvm_hugepage2/qemu_back_mem._objects_mem-1.pZ3s4r (deleted)
Size:            4194304 kB
Rss:                   0 kB
Pss:                   0 kB

# grep 2aaaaac00000 numa_maps 
2aaaaac00000 bind:0 file=/mnt/kvm_hugepage1/qemu_back_mem._objects_mem-0.BCQSYP\040(deleted) huge anon=2048 dirty=2048 N0=2048

# grep 2aabaac00000 numa_maps 
2aabaac00000 bind:1 file=/mnt/kvm_hugepage2/qemu_back_mem._objects_mem-1.pZ3s4r\040(deleted) huge anon=2048 dirty=2048 N1=2048


S2. With a different number of huge pages for each NUMA node

1. /usr/libexec/qemu-kvm -m 4.5G \
-object memory-backend-file,host-nodes=0,id=mem-0,policy=bind,prealloc=yes,mem-path=/mnt/kvm_hugepage1,size=4096M -numa node,nodeid=0,memdev=mem-0 \
-object memory-backend-file,policy=bind,host-nodes=1,id=mem-1,prealloc=yes,mem-path=/mnt/kvm_hugepage2,size=512M -numa node,nodeid=1,memdev=mem-1

2. # grep -2 524288 smaps
VmFlags: rd wr mr mw me dc de ht 
2aabaac00000-2aabcac00000 rw-p 00000000 00:27 21992                      /mnt/kvm_hugepage2/qemu_back_mem._objects_mem-1.A1Oe9q (deleted)
Size:             524288 kB
Rss:                   0 kB
Pss:                   0 kB

3. # grep 2aabaac00000 numa_maps
2aabaac00000 bind:1 file=/mnt/kvm_hugepage2/qemu_back_mem._objects_mem-1.A1Oe9q\040(deleted) huge anon=256 dirty=256 N1=256

S3. With huge pages for one NUMA node and normal memory for the other

/usr/libexec/qemu-kvm -m 4.5G \
-object memory-backend-file,host-nodes=0,id=mem-0,policy=bind,prealloc=yes,mem-path=/mnt/kvm_hugepage1,size=4096M -numa node,nodeid=0,memdev=mem-0 \
-object memory-backend-ram,policy=bind,host-nodes=1,id=mem-1,prealloc=yes,size=512M -numa node,nodeid=1,memdev=mem-1

result:

# grep -2 4194304 smaps
2aaaaac00000-2aabaac00000 rw-p 00000000 00:26 61977                      /mnt/kvm_hugepage1/qemu_back_mem._objects_mem-0.HcgDMS (deleted)
Size:            4194304 kB
Rss:                   0 kB
Pss:                   0 kB

# grep 7f29ec200000 numa_maps 
7f29ec200000 bind:1 anon=131072 dirty=131072 active=76800 N1=131072

Additional notes:
1. A RHEL 7.1 guest shows correct NUMA info via numactl -H.
2. A Windows guest works well, but QE did not find a way to check NUMA information inside the guest.

According to the results of these 3 scenarios, this bug is fixed.

Comment 19 errata-xmlrpc 2015-03-05 09:44:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-0624.html