Bug 1010885
Summary: kvm_init_vcpu failed: Cannot allocate memory in NUMA
Product: Red Hat Enterprise Linux 7
Component: libvirt
Version: 7.0
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: medium
Priority: high
Reporter: Jincheng Miao <jmiao>
Assignee: Martin Kletzander <mkletzan>
QA Contact: Virtualization Bugs <virt-bugs>
CC: abisogia, atheurer, berrange, dvrao.584, dyuan, gsun, honzhang, jmiao, jtomko, juzhang, mkalinin, mkletzan, mtosatti, mzhan, peljasz, pkrempa, rbalakri, reyum3, smooney, tdosek, tvvcox, xuzhang, yohmura, ypu
Target Milestone: rc
Keywords: Upstream, ZStream
Fixed In Version: libvirt-1.2.7-1.el7
Doc Type: Bug Fix
Doc Text:
Cause: For domains with <numatune><memory mode='strict' nodeset='X'/>, libvirt strictly prohibited qemu/kvm from allocating memory from NUMA nodes outside nodeset X.
Consequence: When qemu initializes, kvm needs to allocate some memory from the DMA32 zone. If that zone was present only on a NUMA node outside X, the allocation failed and the domain could not start.
Fix: libvirt now restricts the qemu process at startup using only numa_set_membind() and applies the cgroup (cpuset.mems) restriction only after that memory has been allocated, i.e. once qemu has started.
Result: libvirt can start domains with memory strictly pinned to any available nodeset.
Cloned as: 1135871, 1206424 (view as bug list)
Last Closed: 2015-03-05 07:24:51 UTC
Type: Bug
Bug Blocks: 1135871, 1171792, 1206424
Description (Jincheng Miao, 2013-09-23 09:06:40 UTC)

I am seeing a similar problem when using:

virsh numatune test2 --mode strict --nodeset 1 --config

and then starting the VM. I was able to make this work if I made sure libvirtd did not use the cpuset cgroup. Could you try your test with the cpuset cgroup not used for libvirtd? To disable the use of cpuset for libvirtd:

(1) edit /etc/libvirt/qemu.conf and change the line:

#cgroup_controllers = [ "cpu", "devices", "memory", "blkio", "cpuset", "cpuacct" ]

to:

cgroup_controllers = [ "cpu", "devices", "memory", "blkio", "cpuacct" ]

(2) restart libvirtd: systemctl restart libvirtd

(In reply to Andrew Theurer from comment #2)
> I am seeing a similar problem when using:
> virsh numatune test2 --mode strict --nodeset 1 --config

Yes, I also get the guest failing to start using this config:

<vcpu placement='auto'>4</vcpu>
<numatune>
  <memory mode='strict' nodeset='1'/>
</numatune>

I used a NUMA machine with 2 nodes:

# numactl --show
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
cpubind: 0 1
nodebind: 0 1
membind: 0 1

When I restrict memory to node 0, the guest can be started.

> and then starting the VM. I was able to make this work if I made sure
> libvirtd did not use the cpuset cgroup. Could you try your test with the
> cpuset cgroup not used for libvirtd?

So after abandoning cpuset usage, with memory restricted to node 1 (the failing one), the guest can be started:

# virsh start r7
error: Failed to start domain r7
error: internal error: process exited while connecting to monitor: kvm_init_vcpu failed: Cannot allocate memory

# vim /etc/libvirt/qemu.conf
cgroup_controllers = [ "cpu", "devices", "memory", "blkio", "cpuacct" ]

# systemctl restart libvirtd
# virsh start r7
Domain r7 started

# virsh destroy r7
Domain r7 destroyed

# while sleep 1; do virsh start r7; sleep 0.5; virsh destroy r7; done
Domain r7 started
Domain r7 destroyed
Domain r7 started
Domain r7 destroyed
Domain r7 started
Domain r7 destroyed

Does that mean that cpuset blocks kvm_init_vcpu from allocating the memory?

Thank you very much for testing this. There appears to be a conflict between the cpuset that libvirt uses and the numactl calls that qemu uses. These two methods, IMO, are redundant.

The root page table is allocated with __GFP_DMA32, which is the 0-4GB physical address range. Reassigning to KVM.

(In reply to Andrew Theurer from comment #4)
> Thank you very much for testing this. There appears to be a conflict
> between the cpuset that libvirt uses and the numactl calls that qemu uses.
> These two methods, IMO, are redundant.

That doesn't make a whole lot of sense - QEMU isn't using numactl - libvirt is fully responsible for setting the NUMA policies of QEMU before it is execed.

It is necessary for KVM to allocate pages in the 0-4GB physical address range, as noted by mmu.c:

/*
 * When emulating 32-bit mode, cr3 is only 32 bits even on x86_64.
 * Therefore we need to allocate shadow page tables in the first
 * 4GB of memory, which happens to fit the DMA32 zone.
 */
page = alloc_page(GFP_KERNEL | __GFP_DMA32);
if (!page)
    return -ENOMEM;

So libvirt should add the nodes which contain such a range to the cpuset.mems_allowed list as follows:
1) Find the nodes which contain DMA/DMA32 zones: grep "zone DMA" /proc/zoneinfo
2) Add such nodes to the cpuset.mems_allowed list.

About the kernel patch submitted on comment 8: kernel behaviour regarding GFP_DMA-type allocations and cpusets memory enforcement is not well defined.
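The detection step in that suggestion boils down to scanning /proc/zoneinfo for the nodes that carry a DMA or DMA32 zone. A minimal C sketch of just that step (illustration only, not libvirt code; it assumes the usual "Node <n>, zone <name>" header lines in /proc/zoneinfo):

#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("/proc/zoneinfo", "r");
    if (!f) { perror("fopen"); return 1; }

    char line[256];
    while (fgets(line, sizeof(line), f)) {
        int node;
        char zone[32];
        /* Header lines look like "Node 0, zone   DMA32" */
        if (sscanf(line, "Node %d, zone %31s", &node, zone) == 2 &&
            (strcmp(zone, "DMA") == 0 || strcmp(zone, "DMA32") == 0))
            printf("node %d has zone %s\n", node, zone);
    }
    fclose(f);
    return 0;
}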
(In reply to Marcelo Tosatti from comment #9)
Two questions:

1) Would it be enough to allow the emulator thread to allocate from such nodes? If not, then how are we supposed to restrict qemu to some memory nodes?

2) Is there any other way to get the information from the kernel? I don't think we want to be grepping /proc/zoneinfo in libvirt.

(In reply to Martin Kletzander from comment #11)
> 1) Would it be enough to allow the emulator thread to allocate from such
> nodes? If not, then how are we supposed to restrict qemu to some memory
> nodes?

Even adding just the emulator thread to that node(s) is really undesirable, since it allows any allocation in that thread to come from a node that the user / mgmt app has requested that we do not use. If it is really just that one kernel allocation that needs to be from a specific node, it seems better to just special-case that in the KVM kernel module.

(In reply to Martin Kletzander from comment #11)
> 1) Would it be enough to allow the emulator thread to allocate from such
> nodes? If not, then how are we supposed to restrict qemu to some memory
> nodes?

Just remove the given nodes from mems_allowed. Restrict qemu to some memory nodes by using mbind(MPOL_BIND):

The MPOL_BIND mode specifies a strict policy that restricts memory allocation to the nodes specified in nodemask. If nodemask specifies more than one node, page allocations will come from the node with the lowest numeric node ID first, until that node contains no free memory. Allocations will then come from the node with the next highest node ID specified in nodemask, and so forth, until none of the specified nodes contain free memory. ****Pages will not be allocated from any node not specified in the nodemask.****

> 2) Is there any other way to get the information from the kernel? I don't
> think we want to be grepping /proc/zoneinfo in libvirt.

I'll have to look it up. What is the preferred interface rather than reading /proc/zoneinfo? Does libvirt not parse any information in /proc/?
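To make the MPOL_BIND semantics quoted above concrete, here is a minimal standalone sketch (not libvirt code) that strictly binds an anonymous mapping to node 1 with mbind(); it assumes the host actually has a node 1 and that libnuma is installed (build with: gcc -o bind bind.c -lnuma):

#include <numaif.h>
#include <sys/mman.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    size_t len = 64 * 1024 * 1024;            /* 64 MiB region */
    void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (addr == MAP_FAILED) { perror("mmap"); return 1; }

    unsigned long nodemask = 1UL << 1;        /* node 1 only */
    /* Strict policy: pages for this range may only come from nodes in
     * the mask; allocations fail rather than falling back elsewhere. */
    if (mbind(addr, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0) != 0) {
        perror("mbind");
        return 1;
    }

    memset(addr, 0, len);                     /* fault the pages in on node 1 */
    puts("region bound to node 1");
    return 0;
}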
(In reply to Daniel Berrange from comment #12)
> Even adding just the emulator thread to that node(s) is really undesirable,
> since it allows any allocation in that thread to come from a node that the
> user / mgmt app has requested that we do not use. If it is really just that
> one kernel allocation that needs to be from a specific node, it seems better
> to just special-case that in the KVM kernel module.

Allocation from nodes which contain GFP_DMA zones is required only during initialization of the guest. So you can drop the cpuset mems_allowed "special case" (the GFP_DMA zones) after the guest has initialized.

(In reply to Marcelo Tosatti from comment #13)
> Just remove the given nodes from mems_allowed. Restrict qemu to some
> memory nodes by using mbind(MPOL_BIND):
> [...]
> ****Pages will not be allocated from any node not specified in the
> nodemask.****

And return to the normal cpuset configuration once the guest has initialized.

(In reply to Marcelo Tosatti from comment #14)
> Allocation from nodes which contain GFP_DMA zones is required only during
> initialization of the guest. So you can drop the cpuset mems_allowed
> "special case" (the GFP_DMA zones) after the guest has initialized.

There are a tonne of allocations done during initialization, so that is still going to allow the potential for many allocations to be done from the undesired NUMA nodes. So I don't think just leaving it relaxed during init is reasonable, just to get one single alloc from the DMA zone to work.

Is there any point in using cpusets but not using mbind(MPOL_BIND)?

Would the setting from numa_set_membind() live through the creation of qemu's threads? I believe it should, but I'm rather asking. We are already using that for some settings, and if that suffices then it should be possible, I guess.

(In reply to Martin Kletzander from comment #18)
> Would the setting from numa_set_membind() live through the creation of
> qemu's threads? I believe it should, but I'm rather asking. We are already
> using that for some settings, and if that suffices then it should be
> possible, I guess.

It uses set_mempolicy, and from the man page:

set_mempolicy - set default NUMA memory policy for a process and its children

The process memory policy is preserved across an execve(2), and is inherited by child processes created using fork(2) or clone(2).
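A small standalone sketch of that inheritance behaviour (again not libvirt code; it assumes libnuma and a host with a node 1): the parent sets a strict binding with numa_set_membind(), then forks and execs a child, which keeps the policy across execve(2) per the man page. Running numactl --show as the child is just one convenient way to observe the inherited policy.

#include <numa.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this host\n");
        return 1;
    }

    struct bitmask *nodes = numa_allocate_nodemask();
    numa_bitmask_setbit(nodes, 1);      /* restrict to node 1 */
    numa_set_membind(nodes);            /* MPOL_BIND default policy for this process */
    numa_bitmask_free(nodes);

    pid_t pid = fork();
    if (pid == 0) {
        /* Child: inherits the binding, and it survives execve(). */
        execlp("numactl", "numactl", "--show", (char *)NULL);
        perror("execlp");
        _exit(127);
    }
    waitpid(pid, NULL, 0);
    return 0;
}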
If I may bother you with one more question...

I'm also wondering whether we should switch from cgroups to numa_set_membind() only until the needed structures are allocated and then enforce the nodeset in cpuset.mems (possibly with cpuset.memory_migrate set to 1), or whether we should skip working with cgroups entirely and just use membind without any additional cgroup tuning.

(In reply to Martin Kletzander from comment #20)
> I'm also wondering whether we should switch from cgroups to
> numa_set_membind() only until the needed structures are allocated and then
> enforce the nodeset in cpuset.mems (possibly with cpuset.memory_migrate set
> to 1), or whether we should skip working with cgroups entirely and just use
> membind without any additional cgroup tuning.

Only until the qemu-kvm process is initialized (which is where the needed structures are allocated).

Fixed upstream with v1.2.6-176-g7e72ac7:

commit 7e72ac787848b7434c9359a57c1e2789d92350f8
Author: Martin Kletzander <mkletzan>
Date:   Tue Jul 8 09:59:49 2014 +0200

    qemu: leave restricting cpuset.mems after initialization
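Read together with the discussion above, the ordering that commit describes is roughly: apply the strict nodeset with numa_set_membind()/set_mempolicy() before exec'ing qemu, and only tighten cpuset.mems once qemu has finished initializing. The following is a hedged sketch of that ordering, not the actual libvirt code; the cgroup path and the wait_for_qemu_monitor() placeholder are hypothetical (build with: gcc -o order order.c -lnuma):

#include <numa.h>
#include <stdio.h>
#include <unistd.h>

/* Placeholder: in libvirt this would be the point where the qemu monitor
 * becomes usable, i.e. vcpu setup (and its DMA32 allocation) is done. */
static void wait_for_qemu_monitor(void) { sleep(1); }

static int write_cpuset_mems(const char *cgroup_dir, const char *nodeset)
{
    char path[256];
    snprintf(path, sizeof(path), "%s/cpuset.mems", cgroup_dir);
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); return -1; }
    fprintf(f, "%s\n", nodeset);
    return fclose(f);
}

int main(void)
{
    /* 1. Before exec'ing qemu: strict binding via numa_set_membind()
     *    (set_mempolicy under the hood), which per this bug does not
     *    break kvm's DMA32 allocation the way cpuset.mems did. */
    struct bitmask *nodes = numa_parse_nodestring("1");
    if (!nodes) { fprintf(stderr, "bad nodeset\n"); return 1; }
    numa_set_membind(nodes);

    /* ... fork() + exec() qemu here ... */

    /* 2. Only after qemu has initialized, narrow cpuset.mems as well
     *    (hypothetical cgroup path, shown for illustration only). */
    wait_for_qemu_monitor();
    return write_cpuset_mems("/sys/fs/cgroup/cpuset/machine/qemu-r7", "1") ? 1 : 0;
}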
This problem is fixed in the latest libvirt-1.2.8-7.el7.x86_64. Here are the verification steps:

# rpm -q libvirt
libvirt-1.2.8-7.el7.x86_64

1. The host is a NUMA machine with 2 nodes, and the DMA/DMA32 zones reside only in node 0:

# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
node 0 size: 65514 MB
node 0 free: 59459 MB
node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
node 1 size: 65536 MB
node 1 free: 60689 MB
node distances:
node   0   1
  0:  10  11
  1:  11  10

# grep "DMA" /proc/zoneinfo
Node 0, zone DMA
Node 0, zone DMA32

2. Configure guest memory with 'auto' placement:

# virsh edit r7
...
<vcpu placement='auto'>4</vcpu>
<numatune>
  <memory mode='strict' placement='auto'/>
</numatune>
...

# virsh start r7
Domain r7 started

3. Configure guest memory bound to node 1:

# virsh edit r7
...
<vcpu placement='auto'>4</vcpu>
<numatune>
  <memory mode='strict' nodeset='1'/>
</numatune>
...

# virsh start r7
Domain r7 started

So I am changing the status to VERIFIED.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-0323.html

*** Bug 1320830 has been marked as a duplicate of this bug. ***

Hi, I see something which looks exactly like the problem described here, but on CentOS 7.7 with libvirt-4.5.0-23.el7_7.1.x86_64 and qemu-kvm-ev-2.12.0-33.1.el7.x86_64 on kernel 3.10.0-957.21.3.el7.x86_64. Would someone have some suggestions?

Hello, I have the same problem as mentioned above. When I updated libvirt to libvirt-4.5.0-23.el7_7.1.x86_64 from libvirt-4.5.0-10.el7_7.1.x86_64, I started getting errors with 'qemu-kvm: kvm_init_vcpu failed: Cannot allocate memory'.

OS: CentOS 7.7
Kernel: 4.4.175-1.el7.elrepo.x86_64
Same with: 4.4.202-1.el7.elrepo.x86_64

My qemu-kvm-ev version is qemu-kvm-ev-2.10.0-21.el7_5.7.1.x86_64; I tried updating it to the newest version, but that did not help.

Hello again. After updating the packages to the newest libvirt-4.5.0-23.el7_7.3.x86_64, I just wanted to add that I tried the fix mentioned before, where I add this line to /etc/libvirt/qemu.conf:

cgroup_controllers = [ "cpu", "devices", "memory", "blkio", "cpuacct" ]

and then restart the libvirtd process. If I don't add that line, the same error shows up:

qemu-kvm: kvm_init_vcpu failed: Cannot allocate memory

After adding it, everything works fine and as intended: live migration, start, and stop of the virtual machine all work.