Bug 1076989 (libvirt-complex-guest-mem)
| Summary: | Enable complex memory requirements for virtual machines | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Stephen Gordon <sgordon> |
| Component: | libvirt | Assignee: | Michal Privoznik <mprivozn> |
| Status: | CLOSED ERRATA | QA Contact: | Virtualization Bugs <virt-bugs> |
| Severity: | medium | Docs Contact: | |
| Priority: | high | | |
| Version: | 7.0 | CC: | dyuan, gsun, honzhang, jdenemar, jmiao, jsuchane, knoel, mprivozn, mzhan, rbalakri, sgordon |
| Target Milestone: | rc | Keywords: | TestOnly |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | libvirt-1.2.8-5.el7 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | qemu-complex-mem 1136151 | Environment: | |
| Last Closed: | 2015-03-05 07:32:39 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1076990 | | |
| Bug Blocks: | 1078542, 1113520, 1136151 | | |
Description
Stephen Gordon
2014-03-17 00:25:29 UTC
In addition (perhaps more generically stated): it is not possible to assure through libvirt a deterministic memory allocation of huge pages in NUMA nodes.

(In reply to Stephen Gordon from comment #2)
> In addition (perhaps more generically stated):
>
> It is not possible to assure through libvirt a deterministic memory
> allocation of huge pages in NUMA nodes.

As a result, a virtual machine defined with a strict resource assignment ("strict" NUMA policy) might end up running with a different resource assignment.

Ad-hoc solution

An ad-hoc method for strict resource assignment has been provided by Red Hat. This method should initially be part of libvirt and later on be part of whatever virtual machine management framework is used. The method is the following (a sketch of these steps appears at the end of this comment thread):

* Start the VM in a paused state.
* Check the hugepage backing of guest RAM. An entry like this should appear:

$ sudo cat /proc/$pid-of-qemu/maps | grep huge
7f0dc0000000-7f11c0000000 rw-p 00000000 00:11 36596 /mnt/huge/libvirt/qemu/kvm.mNHb6G (deleted)

* Check the NUMA placement of guest RAM, verifying that all pages are on the desired node. An entry like this should appear:

$ sudo cat /proc/$pid-of-qemu/numa_maps | grep huge
7f0dc0000000 bind:0 file=/mnt/huge/libvirt/qemu/kvm.mNHb6G\040(deleted) huge anon=16 dirty=16 N0=16

where N0 is the number of 1 GiB huge pages allocated on NUMA node 0.

* If any page of the guest RAM is allocated on a different node, error out (the presence of Ny on that line indicates a page on a node other than Nx).
* Check the strict assignment of vCPUs to CPUs:
  * List all running domains.
  * Query vcpupin for all domains.
  * Error out if one physical CPU is pinned to two different vCPUs.
* Override the VM state from "paused" to "running" to let it run.

Can you be more specific, please? What does the qemu API look like? What is the usual use case? I've tried to dig out the qemu patches, but got lost in the primeval forest of qemu sources.

From my notes, the example is that the user specifies NUMA node 0, but node 0 does not have enough free space, so the guest instead starts using space from NUMA node 1. Currently we start the virtual machine, ensure it is using memory from the NUMA node it was strictly assigned to, and if not, move or kill it (very much a reactive approach/hack). The problem is that currently specifying strict doesn't include the NUMA assignment. The suggestion is that qemu-kvm be modified such that strict enforces not only the use of huge pages but also the NUMA node assignment, both for cores and for where the huge pages come from. I believe Karen suggested/discussed the above with us and the customer and might have some more background.

Strict should also enforce use of huge pages and NUMA node placement. Here is the BZ for this:

Bug 996750 - strict NUMA policy on hugetlbfs backed guests
https://bugzilla.redhat.com/show_bug.cgi?id=996750

Everything I see about "strict" is related to memory. And the strict memory placement is enforced by QEMU. I'm not sure about cores. The pinning is done by libvirt, I believe. So, the question is what happens if the vcpu pinning to a cpuset fails? Does the domain fail to start? If so, is it already considered "strict"? Or is there a different meaning of "strict" for vcpus?

(In reply to Karen Noel from comment #7)
> Strict should also enforce use of huge pages and NUMA node placement. Here
> is the BZ for this:
>
> Bug 996750 - strict NUMA policy on hugetlbfs backed guests
> https://bugzilla.redhat.com/show_bug.cgi?id=996750
>
> Everything I see about "strict" is related to memory. And the strict memory
> placement is enforced by QEMU. I'm not sure about cores. The pinning is done
> by libvirt, I believe. So, the question is what happens if the vcpu pinning
> to a cpuset fails? Does the domain fail to start? If so, is it already
> considered "strict"?
>
> Or is there a different meaning of "strict" for vcpus?

Right, if vcpu pinning fails, the domain is killed (we can do the pinning only after qemu has started and spawned the vcpu threads).

What currently happens in the circumstance where I request pinning to a NUMA node (0), set strict, and not enough memory (huge pages if requested) is available on that NUMA node (0)?

* Does the request fail?
* Do I instead get memory from another NUMA node (1) while my cores remain pinned to the NUMA node I specified (0)?
* Do I instead get moved to another NUMA node (1)?

Assume here that when I say "pinned to a NUMA node" I explicitly specified pinning to the cores within the node. From my understanding of customer demands, their expectation is the first scenario: that the request fails. Karen, does that reflect your recollection of our discussions?

(In reply to Stephen Gordon from comment #9)
> What currently happens in the circumstance where I request pinning to a NUMA
> node (0), set strict, and not enough memory (huge pages if requested) is
> available on that NUMA node (0).
>
> * Does the request fail?
> * Do I instead get memory from another NUMA node (1) while my cores remain
> pinned to the NUMA node I specified (0)?
> * Do I instead get moved to another NUMA node (1)?
>
> Assume here that when I say "pinned to a NUMA node" I explicitly specified
> pinning to the cores within the node.
>
> From my understanding of customer demands their expectation is the first
> scenario, that the request fails.

Yes, that is my understanding too. Let's make sure we test all these scenarios and demonstrate that when strict is specified the VM fails to start if the configured memory conditions are not met.

So after my patches, it's still unclear to me what's required here. I mean, is this a test-only bug, or is it a duplicate of another one (e.g. bug 1076725)? The other option is that I say this bug is fixed by the patchset and follow the usual workflow.

The question from my side is still how the implementation works now: does it match one of the scenarios outlined in comment #9 (ideally the first one)?

(In reply to Stephen Gordon from comment #9)
> What currently happens in the circumstance where I request pinning to a NUMA
> node (0), set strict, and not enough memory (huge pages if requested) is
> available on that NUMA node (0).
>
> * Does the request fail?

Yes. It's the memory allocation that will actually throw an error, even though the allocation is done in qemu once it is spawned by libvirt.

> * Do I instead get memory from another NUMA node (1) while my cores remain
> pinned to the NUMA node I specified (0)?

No, as long as you set 'strict' mode. If you set 'preferred', then qemu may find another suitable NUMA node if the preferred one doesn't have enough memory.

> * Do I instead get moved to another NUMA node (1)?

Again, in strict mode everything either works as configured (vCPUs / memory are pinned) or the domain fails to start. To get moved you would need to relax the mode to 'preferred'.

> Assume here that when I say "pinned to a NUMA node" I explicitly specified
> pinning to the cores within the node.
>
> From my understanding of customer demands their expectation is the first
> scenario, that the request fails.

Yep. That's how it works.
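For reference, the ad-hoc procedure described above (start paused, inspect numa_maps, inspect vcpupin, then resume) can be strung together in a few lines of shell. The sketch below is only an illustration of that check, not the code that was merged into libvirt; the domain name (r71), the expected host node (0) and the assumption of a single qemu-kvm process on the host are placeholders.

#!/bin/bash
# Sketch of the ad-hoc check from this thread; assumes domain r71, that all
# guest RAM must live on host NUMA node 0, and a single qemu-kvm process.
DOM=r71
WANT=0

virsh start --paused "$DOM" || exit 1   # start paused so placement can be inspected
PID=$(pidof qemu-kvm)

# Collect the per-node counters (Nx=...) of every hugepage-backed mapping and
# fail if any counter refers to a node other than the expected one.
stray=$(grep huge "/proc/$PID/numa_maps" | grep -oE 'N[0-9]+=' | grep -v "N${WANT}=")
if [ -n "$stray" ]; then
    echo "guest RAM allocated on unexpected node(s): $stray" >&2
    virsh destroy "$DOM"
    exit 1
fi

virsh vcpupin "$DOM"    # print vCPU pinning so duplicate physical-CPU assignments can be spotted
virsh resume "$DOM"     # only now let the guest actually run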
The guest's complex memory requirements can be configured in libvirt, and an error is reported when there are not enough hugepages for the guest.

# rpm -q libvirt qemu-kvm-rhev
libvirt-1.2.8-7.el7.x86_64
qemu-kvm-rhev-2.1.2-9.el7.x86_64
# uname -r
3.10.0-205.el7.x86_64

Cross memory pinning for guest NUMA nodes. The test scenario is:

Host node #0 only has 2 1G-hugepages
Host node #1 only has 512 2M-hugepages
gNode #0: 1G with 2M-hugepages, pinned to Host node #1.
gNode #1: 2G with 1G-hugepages, pinned to Host node #0.

# cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
0
# cat /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
2
# cat /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
512
# cat /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
0

1. Configure guest NUMA with two nodes:

# virsh edit r71
...
<memory unit='KiB'>3145728</memory>
<currentMemory unit='KiB'>3145728</currentMemory>
<memoryBacking>
  <hugepages>
    <page size='2048' unit='KiB' nodeset='0'/>
    <page size='1048576' unit='KiB' nodeset='1'/>
  </hugepages>
</memoryBacking>
<vcpu placement='auto'>4</vcpu>
<numatune>
  <memory mode='strict' nodeset='0-1'/>
  <memnode cellid='0' mode='strict' nodeset='1'/>
  <memnode cellid='1' mode='strict' nodeset='0'/>
</numatune>
<cpu mode='host-model'>
  <model fallback='allow'/>
  <numa>
    <cell id='0' cpus='0-1' memory='1048576'/>
    <cell id='1' cpus='2-3' memory='2097152'/>
  </numa>
</cpu>
...

2. Start the guest:

# virsh start r71
Domain r71 started

In the guest:

<guest># numactl --hard
available: 2 nodes (0-1)
node 0 cpus: 0 1
node 0 size: 1023 MB
node 0 free: 716 MB
node 1 cpus: 2 3
node 1 size: 2047 MB
node 1 free: 899 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10

3. Check hugepage alignment between host and guest (a small loop that repeats this check for both nodes is sketched after this comment):

# grep ram-node0 /proc/`pidof qemu-kvm`/smaps
2aaaaac00000-2aaaeac00000 rw-p 00000000 00:24 19215 /dev/hugepages2M/libvirt/qemu/qemu_back_mem._objects_ram-node0.9xHi88 (deleted)
# grep 2aaaaac00000 /proc/`pidof qemu-kvm`/numa_maps
2aaaaac00000 bind:1 file=/dev/hugepages2M/libvirt/qemu/qemu_back_mem._objects_ram-node0.9xHi88\040(deleted) huge anon=512 dirty=512 N1=512

So the memory backend for guest node 0 uses 2M-hugepages and is bound to host node 1.

# grep ram-node1 /proc/`pidof qemu-kvm`/smaps
2aab00000000-2aab80000000 rw-p 00000000 00:25 19216 /dev/hugepages1G/libvirt/qemu/qemu_back_mem._objects_ram-node1.rsof2i (deleted)
# grep 2aab00000000 /proc/`pidof qemu-kvm`/numa_maps
2aab00000000 bind:0 file=/dev/hugepages1G/libvirt/qemu/qemu_back_mem._objects_ram-node1.rsof2i\040(deleted) huge anon=2 dirty=2 N0=2

So the memory backend for guest node 1 uses 1G-hugepages and is bound to host node 0.
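For convenience, the step-3 check can be wrapped in a small helper. This is only a sketch around the same grep commands, with the expected bindings (ram-node0 on host node 1, ram-node1 on host node 0) hard-coded from the scenario above and the same single-qemu-kvm-process assumption.

#!/bin/bash
# Sketch: verify that each guest memory backend is bound to the expected
# host NUMA node, using the same /proc files as the manual check above.
PID=$(pidof qemu-kvm)

check() {   # usage: check <backend-id> <expected-host-node>
    local id=$1 want=$2
    # start address of the backend's mapping, taken from smaps
    local addr=$(grep -m1 "$id" "/proc/$PID/smaps" | cut -d- -f1)
    # the matching numa_maps entry must carry "bind:<expected node>"
    if grep "^$addr " "/proc/$PID/numa_maps" | grep -q "bind:$want "; then
        echo "$id is bound to host node $want: OK"
    else
        echo "$id is NOT bound to host node $want" >&2
    fi
}

check ram-node0 1
check ram-node1 0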
Negative test for insufficient hugepages (a pre-flight availability check is sketched at the end of this report):

# echo 511 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
# virsh start r71
error: Failed to start domain r71
error: internal error: early end of file from monitor: possible problem:
2014-11-24T09:36:07.912519Z qemu-kvm: -object memory-backend-file,prealloc=yes,mem-path=/dev/hugepages2M/libvirt/qemu,size=1024M,id=ram-node0,host-nodes=1,policy=bind: unable to map backing store for hugepages: Cannot allocate memory

# echo 512 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
# echo 1 > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
# virsh start r71
error: Failed to start domain r71
error: internal error: early end of file from monitor: possible problem:
2014-11-24T09:38:09.426075Z qemu-kvm: -object memory-backend-file,prealloc=yes,mem-path=/dev/hugepages1G/libvirt/qemu,size=2048M,id=ram-node1,host-nodes=0,policy=bind: unable to map backing store for hugepages: Cannot allocate memory

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-0323.html
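As an aside, the negative scenarios above can be anticipated with a pre-flight check of the per-node free hugepage counters before calling virsh start. The sketch below is not part of libvirt; it assumes the same layout as the test (2M pages served by host node 1, 1G pages by host node 0) and the page counts the r71 XML requires (512 x 2M and 2 x 1G).

#!/bin/bash
# Sketch of a pre-flight hugepage check for the r71 scenario above.
need_2m=512   # 1G guest node backed by 2M pages -> 512 pages on host node 1
need_1g=2     # 2G guest node backed by 1G pages -> 2 pages on host node 0

free_2m=$(cat /sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages)
free_1g=$(cat /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/free_hugepages)

if [ "$free_2m" -lt "$need_2m" ] || [ "$free_1g" -lt "$need_1g" ]; then
    echo "not enough free hugepages; 'virsh start r71' is expected to fail" >&2
    exit 1
fi
virsh start r71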