Bug 1010885 - kvm_init_vcpu failed: Cannot allocate memory in NUMA
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: libvirt
Version: 7.0
Hardware: x86_64  OS: Linux
Priority: high  Severity: medium
Target Milestone: rc
Target Release: ---
Assigned To: Martin Kletzander
QA Contact: Virtualization Bugs
Keywords: Upstream, ZStream
Duplicates: 1320830
Depends On:
Blocks: 1135871 1171792 1206424
 
Reported: 2013-09-23 05:06 EDT by Jincheng Miao
Modified: 2016-04-26 12:29 EDT
CC List: 21 users

See Also:
Fixed In Version: libvirt-1.2.7-1.el7
Doc Type: Bug Fix
Doc Text:
Cause: For domains with <numatune><memory mode='strict' nodeset='X'/>, libvirt strictly prohibited qemu/kvm from allocating memory from NUMA nodes outside the nodeset X. Consequence: When qemu initializes, kvm needs to allocate some memory from the DMA32 zone. If that zone was present only on NUMA nodes outside X, the allocation failed and the domain could not start. Fix: libvirt now restricts the qemu process at startup using only numa_set_membind() calls, and applies the cgroup restriction (cpuset.mems) only after the memory has been allocated, i.e. after qemu has started. Result: libvirt can start domains with memory strictly pinned to any available nodeset.
Story Points: ---
Clone Of:
: 1135871 1206424
Environment:
Last Closed: 2015-03-05 02:24:51 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments


External Trackers
Tracker: Red Hat Product Errata
Tracker ID: RHSA-2015:0323
Priority: normal
Status: SHIPPED_LIVE
Summary: Low: libvirt security, bug fix, and enhancement update
Last Updated: 2015-03-05 07:10:54 EST

Description Jincheng Miao 2013-09-23 05:06:40 EDT
Description of problem:
There is a race condition in libvirt:
On a NUMA machine with the numatune memory mode set to 'strict', another process may consume memory after libvirtd has determined the node range, so the domain may fail to obtain its memory and fail to start.

"
Peter Krempa 2013-09-13 04:44:50 EDT:
this is a problem in the approach libvirt is using to determine the node range. The problem for now is that there is no way to do it without the race condition as other processes may take the memory that was available at the time we determined the node range before the starting domain is able to allocate it.
"

Version-Release number of selected component (if applicable):
libvirt-1.1.1-6.el7.x86_64
qemu-kvm-1.5.3-2.el7.x86_64
kernel-3.10.0-9.el7.x86_64
numad-0.5-10.20121130git.el7.x86_64


How reproducible:
Not always; depends on the machine.

Steps to Reproduce:
1.
# virsh dumpxml r7q
...
  <vcpu placement='auto'>4</vcpu>
  <numatune>
    <memory mode='strict' placement='auto'/>
  </numatune>
...

2.
# virsh start r7q
error: Failed to start domain r7q
error: internal error: process exited while connecting to monitor: char device redirected to /dev/pts/3 (label charserial0)
kvm_init_vcpu failed: Cannot allocate memory


Actual results:
The guest fails to start.

Expected results:
The guest starts successfully.
Comment 2 Andrew Theurer 2014-03-27 12:41:09 EDT
I am seeing a similar problem, when using:

virsh numatune test2 --mode strict --nodeset 1 --config

and then starting the VM.  I was able to make this work by ensuring libvirtd did not use the cpuset cgroup controller.  Could you try your test with the cpuset controller disabled for libvirtd?  To disable the use of cpuset for libvirtd: (1) edit /etc/libvirt/qemu.conf and change the line:

"#cgroup_controllers = [ "cpu", "devices", "memory", "blkio", "cpuset", "cpuacct" ]" 

to:
"cgroup_controllers = [ "cpu", "devices", "memory", "blkio", "cpuacct" ]"

(2) restart libvirtd: systemctl restart libvirtd
Comment 3 Jincheng Miao 2014-03-28 01:29:42 EDT
(In reply to Andrew Theurer from comment #2)
> I am seeing a similar problem, when using:
> 
> virsh numatune test2 --mode strict --nodeset 1 --config

Yes, I also see the guest fail to start with this config:
  <vcpu placement='auto'>4</vcpu>
  <numatune>
    <memory mode='strict' nodeset='1'/>
  </numatune>

I used a NUMA machine with 2 nodes:
# numactl --show
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 
cpubind: 0 1 
nodebind: 0 1 
membind: 0 1 

When I restrict memory to node 0, the guest can be started.

> 
> and then starting the VM.  I was able to make this work if I made sure
> libvirtd did not use cpuset cgroup.  Could you try your test with cpuset
> cgroup not used for libvirtd?  To disable the use of cpuset for libvirtd:
> (1) edit /etc/libvirtd/qemu.conf and change the line: 
> 
> "#cgroup_controllers = [ "cpu", "devices", "memory", "blkio", "cpuset",
> "cpuacct" ]" 
> 
> to:
> "cgroup_controllers = [ "cpu", "devices", "memory", "blkio", "cpuacct" ]"
> 
> (2) restart libvirtd: systemctl restart libvirtd

So, with memory strictly bound to node 1 (the failing one): the guest fails to start while the cpuset controller is in use, and after dropping cpuset it can be started:
# virsh start r7
error: Failed to start domain r7
error: internal error: process exited while connecting to monitor: kvm_init_vcpu failed: Cannot allocate memory


# vim /etc/libvirt/qemu.conf 
"cgroup_controllers = [ "cpu", "devices", "memory", "blkio", "cpuacct" ]"

# systemctl restart libvirtd

# virsh start r7
Domain r7 started

# virsh destroy r7
Domain r7 destroyed

# while sleep 1; do virsh start r7; sleep 0.5; virsh destroy r7; done
Domain r7 started

Domain r7 destroyed

Domain r7 started

Domain r7 destroyed

Domain r7 started

Domain r7 destroyed


Does that mean that cpuset prevents kvm_init_vcpu from allocating the memory?
Comment 4 Andrew Theurer 2014-03-31 10:31:48 EDT
Thank you very much for testing this.  There appears to be a conflict between cpuset that libvirt uses and the numactl calls that qemu uses.  These two methods, IMO, are redundant.
Comment 6 Marcelo Tosatti 2014-05-21 19:19:41 EDT
The root pagetable is allocated with __GFP_DMA32, which is 0-4GB physical address range.

Reassigning to KVM.
Comment 7 Daniel Berrange 2014-05-22 05:47:25 EDT
(In reply to Andrew Theurer from comment #4)
> Thank you very much for testing this.  There appears to be a conflict
> between cpuset that libvirt uses and the numactl calls that qemu uses. 
> These two methods, IMO, are redundant.

That doesn't make a whole lot of sense - QEMU isn't using numactl - libvirt is fully responsible for setting the numa policies of QEMU before it is execed.
Comment 8 Marcelo Tosatti 2014-05-23 15:40:02 EDT
Patch posted:
http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg649629.html
Comment 9 Marcelo Tosatti 2014-05-30 17:41:10 EDT
It is necessary for KVM to allocate pages in the 0-4GB physical address range, 
as noted by mmu.c:

        /*
         * When emulating 32-bit mode, cr3 is only 32 bits even on x86_64.
         * Therefore we need to allocate shadow page tables in the first
         * 4GB of memory, which happens to fit the DMA32 zone.
         */
        page = alloc_page(GFP_KERNEL | __GFP_DMA32);
        if (!page)
                return -ENOMEM;

So libvirt should add nodes which contain such range to cpuset.mems_allowed
list as follows:

1) Find nodes which contain DMA/DMA32 zones:

grep "zone    DMA" /proc/zoneinfo

2) Add such nodes to the cpuset.mems_allowed list.
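
A minimal sketch in C of the node detection in step 1 (assuming the usual "Node <N>, zone <name>" line format of /proc/zoneinfo; this is only an illustration, not a proposal for libvirt's implementation):

/* Sketch: print the NUMA nodes that contain a DMA or DMA32 zone by
 * parsing /proc/zoneinfo, whose section headers look like
 * "Node 0, zone      DMA32". */
#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("/proc/zoneinfo", "r");
    char line[256];
    int reported[1024] = { 0 };   /* one flag per node id, sized generously */

    if (!f) {
        perror("fopen /proc/zoneinfo");
        return 1;
    }

    while (fgets(line, sizeof(line), f)) {
        int node;
        char zone[32];

        if (sscanf(line, "Node %d, zone %31s", &node, zone) == 2 &&
            (strcmp(zone, "DMA") == 0 || strcmp(zone, "DMA32") == 0) &&
            node >= 0 && node < 1024 && !reported[node]) {
            reported[node] = 1;
            printf("node %d contains a DMA/DMA32 zone\n", node);
        }
    }
    fclose(f);
    return 0;
}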
Comment 10 Marcelo Tosatti 2014-05-30 17:43:00 EDT
About the kernel patch submitted in comment 8: kernel behaviour regarding GFP_DMA-type allocations and cpusets memory enforcement is not well defined.
Comment 11 Martin Kletzander 2014-06-02 04:16:52 EDT
(In reply to Marcelo Tosatti from comment #9)
Two questions:

1) Would it be enough to allow the emulator thread to allocate from such nodes?  If not, then how are we supposed to restrict qemu to some memory nodes?

2) Is there any other way to get the information from kernel?  I don't think we want to be grepping /proc/zoneinfo in libvirt.
Comment 12 Daniel Berrange 2014-06-02 04:53:39 EDT
(In reply to Martin Kletzander from comment #11)
> (In reply to Marcelo Tosatti from comment #9)
> Two questions:
> 
> 1) Would it be enough to allow the emulator thread to allocate from such
> nodes?  If not, then how are we supposed to restrict qemu to some memory
> nodes?

Even adding just the emulator thread to that node(s) is really undesirable, since it allows any allocation in that thread to come from a node that the user / mgmt app has requested that we do not use. If it is really just that one kernel allocation that needs to be from a specific node, it seems better to just special case that in the KVM kernel module.

> 2) Is there any other way to get the information from kernel?  I don't think
> we want to be grepping /proc/zoneinfo in libvirt.
Comment 13 Marcelo Tosatti 2014-06-02 14:08:03 EDT
(In reply to Martin Kletzander from comment #11)
> (In reply to Marcelo Tosatti from comment #9)
> Two questions:
> 
> 1) Would it be enough to allow the emulator thread to allocate from such
> nodes?  If not, then how are we supposed to restrict qemu to some memory
> nodes?

Just remove the given nodes from mems_allowed. Restrict qemu to some
memory nodes by using mbind() with MPOL_BIND (see the sketch at the end
of this comment):

The MPOL_BIND mode specifies a strict policy that restricts memory
allocation to the nodes specified in nodemask. If nodemask specifies
more than one node, page allocations will come from the node with the
lowest numeric node ID first, until that node contains no free memory.
Allocations will then come from the node with the next highest node ID
specified in nodemask, and so forth, until none of the specified nodes
contain free memory.
****Pages will not be allocated from any node not specified in the nodemask.****

> 2) Is there any other way to get the information from kernel?  I don't think
> we want to be grepping /proc/zoneinfo in libvirt.

I'll have to look it up. What is the preferred interface, rather than reading
/proc/zoneinfo?
Does libvirt not parse any information in /proc/?
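
To make the MPOL_BIND behaviour quoted above concrete, here is a minimal sketch (not libvirt code) that strictly binds one anonymous mapping to node 1; the node number is only an example, and the program must be linked with -lnuma, which provides the mbind() wrapper declared in <numaif.h>:

/* Sketch: bind a single mapping to NUMA node 1 with mbind()/MPOL_BIND.
 * Pages for this range may only come from node 1; if node 1 runs out of
 * free memory the allocation fails instead of falling back elsewhere. */
#include <numaif.h>     /* mbind, MPOL_BIND */
#include <sys/mman.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    size_t len = 16UL << 20;                 /* 16 MiB */
    unsigned long nodemask = 1UL << 1;       /* node 1 only */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    if (mbind(p, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0) < 0) {
        perror("mbind");
        return 1;
    }

    memset(p, 0, len);                       /* fault the pages in */
    puts("mapping bound to node 1");
    return 0;
}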
Comment 14 Marcelo Tosatti 2014-06-02 14:11:27 EDT
(In reply to Daniel Berrange from comment #12)
> (In reply to Martin Kletzander from comment #11)
> > (In reply to Marcelo Tosatti from comment #9)
> > Two questions:
> > 
> > 1) Would it be enough to allow the emulator thread to allocate from such
> > nodes?  If not, then how are we supposed to restrict qemu to some memory
> > nodes?
> 
> Even adding just the emulator thread to that node(s) is really undesirable,
> since it allows any allocation in that thread to come from a node that the
> user / mgmt app has requested that we do not use. If it is really just that
> one kernel allocation that needs to be from a specific node, it seems better
> to just special case that in the KVM kernel module.

Allocation from nodes which contain GFP_DMA zones is required only during initialization of the guest. So you can drop the cpuset mems_allowed "special case" (GFP_DMA zones) after the guest has initialized.
Comment 15 Marcelo Tosatti 2014-06-02 14:12:28 EDT
(In reply to Marcelo Tosatti from comment #13)
> (In reply to Martin Kletzander from comment #11)
> > (In reply to Marcelo Tosatti from comment #9)
> > Two questions:
> > 
> > 1) Would it be enough to allow the emulator thread to allocate from such
> > nodes?  If not, then how are we supposed to restrict qemu to some memory
> > nodes?
> 
> Just remove the given nodes from mems_allowed. Restrict qemu to some 
> memory nodes by using mbind(MBIND):
> 
> The MPOL_BIND mode specifies a strict policy that restricts memory
> allocation to the nodes specified in  node‐mask.   
> If  nodemask  specifies  more  than one node, page allocations
> will come from the node with the lowest numeric node ID first, 
> until that node contains no free memory.  
> Allocations will then come from the node with the  next  
> highest  node ID specified in nodemask and so forth, until none of the
> specified nodes contain free memory.  
> ****Pages will not be allocated from any node not specified in the
> nodemask****.

And return to the normal cpuset configuration once the guest has initialized.
Comment 16 Daniel Berrange 2014-06-03 04:56:06 EDT
(In reply to Marcelo Tosatti from comment #14)
> (In reply to Daniel Berrange from comment #12)
> > (In reply to Martin Kletzander from comment #11)
> > > (In reply to Marcelo Tosatti from comment #9)
> > > Two questions:
> > > 
> > > 1) Would it be enough to allow the emulator thread to allocate from such
> > > nodes?  If not, then how are we supposed to restrict qemu to some memory
> > > nodes?
> > 
> > Even adding just the emulator thread to that node(s) is really undesirable,
> > since it allows any allocation in that thread to come from a node that the
> > user / mgmt app has requested that we do not use. If it is really just that
> > one kernel allocation that needs to be from a specific node, it seems better
> > to just special case that in the KVM kernel module.
> 
> Allocation from nodes which contain GFP_DMA zones is required only during
> initialization of the guest. So you can drop the vcpuset mems_allowed
> "special case" (GFP_DMA zones) after the guest has initialized.

There are a tonne of allocations done during initialization, so that is still going to allow potential for many allocations to be done from the undesired NUMA nodes. So I don't think just leaving it relaxed during init is reasonable, just to get 1 single alloc from the DMA zone to work.
Comment 17 Marcelo Tosatti 2014-06-03 12:01:47 EDT
(In reply to Daniel Berrange from comment #16)
> (In reply to Marcelo Tosatti from comment #14)
> > (In reply to Daniel Berrange from comment #12)
> > > (In reply to Martin Kletzander from comment #11)
> > > > (In reply to Marcelo Tosatti from comment #9)
> > > > Two questions:
> > > > 
> > > > 1) Would it be enough to allow the emulator thread to allocate from such
> > > > nodes?  If not, then how are we supposed to restrict qemu to some memory
> > > > nodes?
> > > 
> > > Even adding just the emulator thread to that node(s) is really undesirable,
> > > since it allows any allocation in that thread to come from a node that the
> > > user / mgmt app has requested that we do not use. If it is really just that
> > > one kernel allocation that needs to be from a specific node, it seems better
> > > to just special case that in the KVM kernel module.
> > 
> > Allocation from nodes which contain GFP_DMA zones is required only during
> > initialization of the guest. So you can drop the vcpuset mems_allowed
> > "special case" (GFP_DMA zones) after the guest has initialized.
> 
> There are a tonne of allocations done during initialization, so that is
> still going to allow potential for many allocations to be done from the
> undesired NUMA nodes. So I don't think just leaving it relaxed during init
> is reasonable, just to get 1 single alloc from the DMA zone to work.

Is there any point in using cpusets but not using mbind() with MPOL_BIND?
Comment 18 Martin Kletzander 2014-06-03 12:36:31 EDT
Would the setting from numa_set_membind() live through creating qemu's threads?  I believe it should, but I'm asking to be sure.  We are already using that for some settings, and if that suffices then it should be possible, I guess.
Comment 19 Marcelo Tosatti 2014-06-03 14:03:56 EDT
(In reply to Martin Kletzander from comment #18)
> Would the setting from numa_set_membind() live through creating qemu's
> threads?  I believe it should, but I'm rather asking.  We are already using
> that for some settings and if that suffice then it sould be possible, I
> guess.

It uses set_mempolicy, and from the man page:

       set_mempolicy - set default NUMA memory policy for a process and its children

       The process memory policy is preserved across an execve(2), and is inherited by  child  processes  created
       using fork(2) or clone(2).
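
A hedged illustration of that inheritance (a sketch, not what libvirt literally does; build with -lnuma, and node 1 is only an example): the MPOL_BIND policy installed by numa_set_membind() before fork()/execve() is still in effect in the child.

/* Sketch: a process-wide MPOL_BIND policy set with numa_set_membind()
 * (which uses set_mempolicy() underneath) survives fork()/execve();
 * the child here simply runs "numactl --show" to display it. */
#include <numa.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    struct bitmask *nodes;

    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support\n");
        return 1;
    }

    nodes = numa_allocate_nodemask();
    numa_bitmask_setbit(nodes, 1);      /* bind future allocations to node 1 */
    numa_set_membind(nodes);
    numa_free_nodemask(nodes);

    if (fork() == 0) {
        /* The child (and anything it execs) inherits the binding. */
        execlp("numactl", "numactl", "--show", (char *)NULL);
        _exit(127);
    }
    wait(NULL);
    return 0;
}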
Comment 20 Martin Kletzander 2014-06-17 10:47:05 EDT
If I may bother you with one more question...

I'm also wondering whether we should switch from cgroups to numa_set_membind() only until the needed structures are allocated, and then enforce the nodeset in cpuset.mems (possibly with cpuset.memory_migrate set to 1), or whether we should skip cgroups entirely and just use membind without any additional cgroup tuning.
Comment 21 Marcelo Tosatti 2014-06-30 14:18:24 EDT
(In reply to Martin Kletzander from comment #20)
> If I may bother you with one more question...
> 
> I'm also wondering whether we should switch from cgroups to
> numa_set_membind() only until the needed structures are allocated and then
> enforce the nodeset in cpuset.mems (possibly with cpuset.memory_migrate set
> to 1) or if we should skip working with cgroups at all and just use membind
> without any additional cgroup tuning.

Only until the qemu-kvm process is initialized (which is where the needed structures are allocated).
Comment 22 Martin Kletzander 2014-07-16 14:23:28 EDT
Fixed upstream with v1.2.6-176-g7e72ac7:

commit 7e72ac787848b7434c9359a57c1e2789d92350f8
Author: Martin Kletzander <mkletzan@redhat.com>
Date:   Tue Jul 8 09:59:49 2014 +0200

    qemu: leave restricting cpuset.mems after initialization
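
For context, a rough sketch of the sequence that commit message describes: restrict memory with numa_set_membind() before exec'ing qemu (so the kernel-internal DMA32 allocation still succeeds), then tighten cpuset.mems only once qemu has initialized. This is illustrative only; the cgroup path, the qemu invocation and the sleep-based wait are placeholders, not what libvirt actually does (build with -lnuma):

/* Sketch of the two-phase restriction described in the fix. */
#include <numa.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

static void restrict_with_membind(int node)
{
    struct bitmask *nodes = numa_allocate_nodemask();
    numa_bitmask_setbit(nodes, node);
    numa_set_membind(nodes);            /* inherited across fork/exec */
    numa_free_nodemask(nodes);
}

static void tighten_cpuset(const char *nodeset)
{
    /* Hypothetical cgroup path for an example VM. */
    FILE *f = fopen("/sys/fs/cgroup/cpuset/machine.slice/example-vm/cpuset.mems", "w");
    if (f) {
        fprintf(f, "%s\n", nodeset);
        fclose(f);
    }
}

int main(void)
{
    pid_t pid;

    restrict_with_membind(1);           /* phase 1: membind only */

    pid = fork();
    if (pid == 0) {
        execlp("qemu-system-x86_64", "qemu-system-x86_64", "-S", (char *)NULL);
        _exit(127);
    }

    sleep(5);                           /* placeholder for "wait until qemu initialized" */
    tighten_cpuset("1");                /* phase 2: now enforce via cgroups */

    waitpid(pid, NULL, 0);
    return 0;
}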
Comment 29 Jincheng Miao 2014-11-20 05:23:57 EST
This problem is fixed in the latest libvirt-1.2.8-7.el7.x86_64.

Here are the verification steps:
# rpm -q libvirt
libvirt-1.2.8-7.el7.x86_64

1. The host has 2 NUMA nodes, and the DMA/DMA32 zones reside only in node 0 (so binding to node 1 is the previously failing case).
# numactl --hard
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
node 0 size: 65514 MB
node 0 free: 59459 MB
node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
node 1 size: 65536 MB
node 1 free: 60689 MB
node distances:
node   0   1 
  0:  10  11 
  1:  11  10 

# grep "DMA" /proc/zoneinfo 
Node 0, zone      DMA
Node 0, zone    DMA32


2. Configure guest memory with 'auto' placement:
# virsh edit r7
...
  <vcpu placement='auto'>4</vcpu>
  <numatune>
    <memory mode='strict' placement='auto'/>
  </numatune>
..

# virsh start r7
Domain r7 started

3. Configure guest memory bound to node 1:
# virsh edit r7
...
  <vcpu placement='auto'>4</vcpu>
  <numatune>
    <memory mode='strict' nodeset='1'/>
  </numatune>
..

# virsh start r7
Domain r7 started

So I am changing the status to VERIFIED.
Comment 31 errata-xmlrpc 2015-03-05 02:24:51 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-0323.html
Comment 32 Ján Tomko 2016-03-24 06:08:47 EDT
*** Bug 1320830 has been marked as a duplicate of this bug. ***
