Bug 1010885 - kvm_init_vcpu failed: Cannot allocate memory in NUMA
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: libvirt
Version: 7.0
Hardware: x86_64  OS: Linux
Priority: high  Severity: medium
Target Milestone: rc
Target Release: ---
Assigned To: Martin Kletzander
QA Contact: Virtualization Bugs
Keywords: Upstream, ZStream
Duplicates: 1320830
Depends On:
Blocks: 1135871 1171792 1206424
 
Reported: 2013-09-23 05:06 EDT by Jincheng Miao
Modified: 2016-04-26 12:29 EDT
CC List: 21 users

See Also:
Fixed In Version: libvirt-1.2.7-1.el7
Doc Type: Bug Fix
Doc Text:
Cause: For domains with <numatune><memory mode='strict' nodeset='X'/>, libvirt strictly prohibited qemu/kvm from allocating memory from NUMA nodes outside the nodeset X. Consequence: When qemu initializes, kvm needs to allocate some memory from the DMA32 zone. If that zone was present only on NUMA nodes outside X, the allocation failed and the domain could not start. Fix: libvirt now restricts the qemu process at startup using only numa_set_membind() calls, and applies the cgroup restriction (cpuset.mems) only after the memory has been allocated, i.e. after qemu has started. Result: libvirt can start domains with memory strictly pinned to any available nodeset.
Story Points: ---
Clone Of:
: 1135871 1206424
Environment:
Last Closed: 2015-03-05 02:24:51 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments


External Trackers
Tracker: Red Hat Product Errata
Tracker ID: RHSA-2015:0323
Priority: normal
Status: SHIPPED_LIVE
Summary: Low: libvirt security, bug fix, and enhancement update
Last Updated: 2015-03-05 07:10:54 EST

Description Jincheng Miao 2013-09-23 05:06:40 EDT
Description of problem:
There is a race condition in libvirt:
On a NUMA machine with the numatune memory mode set to 'strict', another process may consume memory after libvirtd has determined the node range, so the domain may fail to obtain its memory and fail to start.

"
Peter Krempa 2013-09-13 04:44:50 EDT:
this is a problem in the approach libvirt is using to determine the node range. The problem for now is that there is no way to do it without the race condition as other processes may take the memory that was available at the time we determined the node range before the starting domain is able to allocate it.
"

Version-Release number of selected component (if applicable):
libvirt-1.1.1-6.el7.x86_64
qemu-kvm-1.5.3-2.el7.x86_64
kernel-3.10.0-9.el7.x86_64
numad-0.5-10.20121130git.el7.x86_64


How reproducible:
Not always; depends on the machine.

Steps to Reproduce:
1.
# virsh dumpxml r7q
...
  <vcpu placement='auto'>4</vcpu>
  <numatune>
    <memory mode='strict' placement='auto'/>
  </numatune>
...

2.
# virsh start r7q
error: Failed to start domain r7q
error: internal error: process exited while connecting to monitor: char device redirected to /dev/pts/3 (label charserial0)
kvm_init_vcpu failed: Cannot allocate memory


Actual results:
The guest fails to start.

Expected results:
The guest starts successfully.
Comment 2 Andrew Theurer 2014-03-27 12:41:09 EDT
I am seeing a similar problem, when using:

virsh numatune test2 --mode strict --nodeset 1 --config

and then starting the VM.  I was able to make this work by ensuring libvirtd did not use the cpuset cgroup controller.  Could you try your test with the cpuset controller disabled for libvirtd?  To disable the use of cpuset for libvirtd: (1) edit /etc/libvirt/qemu.conf and change the line:

"#cgroup_controllers = [ "cpu", "devices", "memory", "blkio", "cpuset", "cpuacct" ]" 

to:
"cgroup_controllers = [ "cpu", "devices", "memory", "blkio", "cpuacct" ]"

(2) restart libvirtd: systemctl restart libvirtd
Comment 3 Jincheng Miao 2014-03-28 01:29:42 EDT
(In reply to Andrew Theurer from comment #2)
> I am seeing a similar problem, when using:
> 
> virsh numatune test2 --mode strict --nodeset 1 --config

Yes, I also see the guest fail to start with this config:
  <vcpu placement='auto'>4</vcpu>
  <numatune>
    <memory mode='strict' nodeset='1'/>
  </numatune>

I used a NUMA machine with 2 nodes:
# numactl --show
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 
cpubind: 0 1 
nodebind: 0 1 
membind: 0 1 

When I restrict memory to node 0, the guest can be started.

> 
> and then starting the VM.  I was able to make this work if I made sure
> libvirtd did not use cpuset cgroup.  Could you try your test with cpuset
> cgroup not used for libvirtd?  To disable the use of cpuset for libvirtd:
> (1) edit /etc/libvirtd/qemu.conf and change the line: 
> 
> "#cgroup_controllers = [ "cpu", "devices", "memory", "blkio", "cpuset",
> "cpuacct" ]" 
> 
> to:
> "cgroup_controllers = [ "cpu", "devices", "memory", "blkio", "cpuacct" ]"
> 
> (2) restart libvirtd: systemctl restart libvirtd

So, with memory strictly bound to node 1 (the failing one): the guest fails to start while the cpuset controller is in use, and after dropping cpuset it can be started:
# virsh start r7
error: Failed to start domain r7
error: internal error: process exited while connecting to monitor: kvm_init_vcpu failed: Cannot allocate memory


# vim /etc/libvirt/qemu.conf 
"cgroup_controllers = [ "cpu", "devices", "memory", "blkio", "cpuacct" ]"

# systemctl restart libvirtd

# virsh start r7
Domain r7 started

# virsh destroy r7
Domain r7 destroyed

# while sleep 1; do virsh start r7; sleep 0.5; virsh destroy r7; done
Domain r7 started

Domain r7 destroyed

Domain r7 started

Domain r7 destroyed

Domain r7 started

Domain r7 destroyed


Does that mean that cpuset prevents kvm_init_vcpu from allocating the memory?
Comment 4 Andrew Theurer 2014-03-31 10:31:48 EDT
Thank you very much for testing this.  There appears to be a conflict between cpuset that libvirt uses and the numactl calls that qemu uses.  These two methods, IMO, are redundant.
Comment 6 Marcelo Tosatti 2014-05-21 19:19:41 EDT
The root pagetable is allocated with __GFP_DMA32, which is 0-4GB physical address range.

Reassigning to KVM.
Comment 7 Daniel Berrange 2014-05-22 05:47:25 EDT
(In reply to Andrew Theurer from comment #4)
> Thank you very much for testing this.  There appears to be a conflict
> between cpuset that libvirt uses and the numactl calls that qemu uses. 
> These two methods, IMO, are redundant.

That doesn't make a whole lot of sense - QEMU isn't using numactl - libvirt is fully responsible for setting the numa policies of QEMU before it is execed.
Comment 8 Marcelo Tosatti 2014-05-23 15:40:02 EDT
Patch posted:
http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg649629.html
Comment 9 Marcelo Tosatti 2014-05-30 17:41:10 EDT
It is necessary for KVM to allocate pages in the 0-4GB physical address range, 
as noted by mmu.c:

        /*
         * When emulating 32-bit mode, cr3 is only 32 bits even on x86_64.
         * Therefore we need to allocate shadow page tables in the first
         * 4GB of memory, which happens to fit the DMA32 zone.
         */
        page = alloc_page(GFP_KERNEL | __GFP_DMA32);
        if (!page)
                return -ENOMEM;

So libvirt should add nodes which contain such range to cpuset.mems_allowed
list as follows:

1) Find nodes which contain DMA/DMA32 zones:

grep "zone    DMA" /proc/zoneinfo

2) Add such nodes to the cpuset.mems_allowed list.
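
A minimal sketch in C of the node detection in step 1 (assuming the usual "Node <N>, zone <name>" line format of /proc/zoneinfo; this is only an illustration, not a proposal for libvirt's implementation):

/* Sketch: print the NUMA nodes that contain a DMA or DMA32 zone by
 * parsing /proc/zoneinfo, whose section headers look like
 * "Node 0, zone      DMA32". */
#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("/proc/zoneinfo", "r");
    char line[256];
    int reported[1024] = { 0 };   /* one flag per node id, sized generously */

    if (!f) {
        perror("fopen /proc/zoneinfo");
        return 1;
    }

    while (fgets(line, sizeof(line), f)) {
        int node;
        char zone[32];

        if (sscanf(line, "Node %d, zone %31s", &node, zone) == 2 &&
            (strcmp(zone, "DMA") == 0 || strcmp(zone, "DMA32") == 0) &&
            node >= 0 && node < 1024 && !reported[node]) {
            reported[node] = 1;
            printf("node %d contains a DMA/DMA32 zone\n", node);
        }
    }
    fclose(f);
    return 0;
}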
Comment 10 Marcelo Tosatti 2014-05-30 17:43:00 EDT
About the kernel patch submitted in comment 8: kernel behaviour regarding GFP_DMA-type allocations and cpusets memory enforcement is not well defined.
Comment 11 Martin Kletzander 2014-06-02 04:16:52 EDT
(In reply to Marcelo Tosatti from comment #9)
Two questions:

1) Would it be enough to allow the emulator thread to allocate from such nodes?  If not, then how are we supposed to restrict qemu to some memory nodes?

2) Is there any other way to get the information from kernel?  I don't think we want to be grepping /proc/zoneinfo in libvirt.
Comment 12 Daniel Berrange 2014-06-02 04:53:39 EDT
(In reply to Martin Kletzander from comment #11)
> (In reply to Marcelo Tosatti from comment #9)
> Two questions:
> 
> 1) Would it be enough to allow the emulator thread to allocate from such
> nodes?  If not, then how are we supposed to restrict qemu to some memory
> nodes?

Even adding just the emulator thread to that node(s) is really undesirable, since it allows any allocation in that thread to come from a node that the user / mgmt app has requested that we do not use. If it is really just that one kernel allocation that needs to be from a specific node, it seems better to just special case that in the KVM kernel module.

> 2) Is there any other way to get the information from kernel?  I don't think
> we want to be grepping /proc/zoneinfo in libvirt.
Comment 13 Marcelo Tosatti 2014-06-02 14:08:03 EDT
(In reply to Martin Kletzander from comment #11)
> (In reply to Marcelo Tosatti from comment #9)
> Two questions:
> 
> 1) Would it be enough to allow the emulator thread to allocate from such
> nodes?  If not, then how are we supposed to restrict qemu to some memory
> nodes?

Just remove the given nodes from mems_allowed. Restrict qemu to some
memory nodes by using mbind() with MPOL_BIND (see the sketch at the end
of this comment):

The MPOL_BIND mode specifies a strict policy that restricts memory
allocation to the nodes specified in nodemask. If nodemask specifies
more than one node, page allocations will come from the node with the
lowest numeric node ID first, until that node contains no free memory.
Allocations will then come from the node with the next highest node ID
specified in nodemask, and so forth, until none of the specified nodes
contain free memory.
****Pages will not be allocated from any node not specified in the nodemask.****

> 2) Is there any other way to get the information from kernel?  I don't think
> we want to be grepping /proc/zoneinfo in libvirt.

I'll have to look it up. What is the preferred interface, rather than reading
/proc/zoneinfo?
Does libvirt not parse any information in /proc/?
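
To make the MPOL_BIND behaviour quoted above concrete, here is a minimal sketch (not libvirt code) that strictly binds one anonymous mapping to node 1; the node number is only an example, and the program must be linked with -lnuma, which provides the mbind() wrapper declared in <numaif.h>:

/* Sketch: bind a single mapping to NUMA node 1 with mbind()/MPOL_BIND.
 * Pages for this range may only come from node 1; if node 1 runs out of
 * free memory the allocation fails instead of falling back elsewhere. */
#include <numaif.h>     /* mbind, MPOL_BIND */
#include <sys/mman.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    size_t len = 16UL << 20;                 /* 16 MiB */
    unsigned long nodemask = 1UL << 1;       /* node 1 only */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    if (mbind(p, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0) < 0) {
        perror("mbind");
        return 1;
    }

    memset(p, 0, len);                       /* fault the pages in */
    puts("mapping bound to node 1");
    return 0;
}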
Comment 14 Marcelo Tosatti 2014-06-02 14:11:27 EDT
(In reply to Daniel Berrange from comment #12)
> (In reply to Martin Kletzander from comment #11)
> > (In reply to Marcelo Tosatti from comment #9)
> > Two questions:
> > 
> > 1) Would it be enough to allow the emulator thread to allocate from such
> > nodes?  If not, then how are we supposed to restrict qemu to some memory
> > nodes?
> 
> Even adding just the emulator thread to that node(s) is really undesirable,
> since it allows any allocation in that thread to come from a node that the
> user / mgmt app has requested that we do not use. If it is really just that
> one kernel allocation that needs to be from a specific node, it seems better
> to just special case that in the KVM kernel module.

Allocation from nodes which contain GFP_DMA zones is required only during initialization of the guest. So you can drop the cpuset mems_allowed "special case" (GFP_DMA zones) after the guest has initialized.
Comment 15 Marcelo Tosatti 2014-06-02 14:12:28 EDT
(In reply to Marcelo Tosatti from comment #13)
> (In reply to Martin Kletzander from comment #11)
> > (In reply to Marcelo Tosatti from comment #9)
> > Two questions:
> > 
> > 1) Would it be enough to allow the emulator thread to allocate from such
> > nodes?  If not, then how are we supposed to restrict qemu to some memory
> > nodes?
> 
> Just remove the given nodes from mems_allowed. Restrict qemu to some 
> memory nodes by using mbind(MBIND):
> 
> The MPOL_BIND mode specifies a strict policy that restricts memory
> allocation to the nodes specified in  node‐mask.   
> If  nodemask  specifies  more  than one node, page allocations
> will come from the node with the lowest numeric node ID first, 
> until that node contains no free memory.  
> Allocations will then come from the node with the  next  
> highest  node ID specified in nodemask and so forth, until none of the
> specified nodes contain free memory.  
> ****Pages will not be allocated from any node not specified in the
> nodemask****.

And return to the normal cpuset configuration once the guest has initialized.
Comment 16 Daniel Berrange 2014-06-03 04:56:06 EDT
(In reply to Marcelo Tosatti from comment #14)
> (In reply to Daniel Berrange from comment #12)
> > (In reply to Martin Kletzander from comment #11)
> > > (In reply to Marcelo Tosatti from comment #9)
> > > Two questions:
> > > 
> > > 1) Would it be enough to allow the emulator thread to allocate from such
> > > nodes?  If not, then how are we supposed to restrict qemu to some memory
> > > nodes?
> > 
> > Even adding just the emulator thread to that node(s) is really undesirable,
> > since it allows any allocation in that thread to come from a node that the
> > user / mgmt app has requested that we do not use. If it is really just that
> > one kernel allocation that needs to be from a specific node, it seems better
> > to just special case that in the KVM kernel module.
> 
> Allocation from nodes which contain GFP_DMA zones is required only during
> initialization of the guest. So you can drop the vcpuset mems_allowed
> "special case" (GFP_DMA zones) after the guest has initialized.

There are a tonne of allocations done during initialization, so that is still going to allow potential for many allocations to be done from the undesired NUMA nodes. So I don't think just leaving it relaxed during init is reasonable, just to get 1 single alloc from the DMA zone to work.
Comment 17 Marcelo Tosatti 2014-06-03 12:01:47 EDT
(In reply to Daniel Berrange from comment #16)
> (In reply to Marcelo Tosatti from comment #14)
> > (In reply to Daniel Berrange from comment #12)
> > > (In reply to Martin Kletzander from comment #11)
> > > > (In reply to Marcelo Tosatti from comment #9)
> > > > Two questions:
> > > > 
> > > > 1) Would it be enough to allow the emulator thread to allocate from such
> > > > nodes?  If not, then how are we supposed to restrict qemu to some memory
> > > > nodes?
> > > 
> > > Even adding just the emulator thread to that node(s) is really undesirable,
> > > since it allows any allocation in that thread to come from a node that the
> > > user / mgmt app has requested that we do not use. If it is really just that
> > > one kernel allocation that needs to be from a specific node, it seems better
> > > to just special case that in the KVM kernel module.
> > 
> > Allocation from nodes which contain GFP_DMA zones is required only during
> > initialization of the guest. So you can drop the vcpuset mems_allowed
> > "special case" (GFP_DMA zones) after the guest has initialized.
> 
> There are a tonne of allocations done during initialization, so that is
> still going to allow potential for many allocations to be done from the
> undesired NUMA nodes. So I don't think just leaving it relaxed during init
> is reasonable, just to get 1 single alloc from the DMA zone to work.

Is there any point in using cpusets but not using mbind() with MPOL_BIND?
Comment 18 Martin Kletzander 2014-06-03 12:36:31 EDT
Would the setting from numa_set_membind() live through creating qemu's threads?  I believe it should, but I'm asking to be sure.  We are already using that for some settings, and if that suffices then it should be possible, I guess.
Comment 19 Marcelo Tosatti 2014-06-03 14:03:56 EDT
(In reply to Martin Kletzander from comment #18)
> Would the setting from numa_set_membind() live through creating qemu's
> threads?  I believe it should, but I'm rather asking.  We are already using
> that for some settings and if that suffice then it sould be possible, I
> guess.

It uses set_mempolicy, and from the man page:

       set_mempolicy - set default NUMA memory policy for a process and its children

       The process memory policy is preserved across an execve(2), and is inherited by  child  processes  created
       using fork(2) or clone(2).
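
A hedged illustration of that inheritance (a sketch, not what libvirt literally does; build with -lnuma, and node 1 is only an example): the MPOL_BIND policy installed by numa_set_membind() before fork()/execve() is still in effect in the child.

/* Sketch: a process-wide MPOL_BIND policy set with numa_set_membind()
 * (which uses set_mempolicy() underneath) survives fork()/execve();
 * the child here simply runs "numactl --show" to display it. */
#include <numa.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    struct bitmask *nodes;

    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support\n");
        return 1;
    }

    nodes = numa_allocate_nodemask();
    numa_bitmask_setbit(nodes, 1);      /* bind future allocations to node 1 */
    numa_set_membind(nodes);
    numa_free_nodemask(nodes);

    if (fork() == 0) {
        /* The child (and anything it execs) inherits the binding. */
        execlp("numactl", "numactl", "--show", (char *)NULL);
        _exit(127);
    }
    wait(NULL);
    return 0;
}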
Comment 20 Martin Kletzander 2014-06-17 10:47:05 EDT
If I may bother you with one more question...

I'm also wondering whether we should switch from cgroups to numa_set_membind() only until the needed structures are allocated, and then enforce the nodeset in cpuset.mems (possibly with cpuset.memory_migrate set to 1), or whether we should skip cgroups entirely and just use membind without any additional cgroup tuning.
Comment 21 Marcelo Tosatti 2014-06-30 14:18:24 EDT
(In reply to Martin Kletzander from comment #20)
> If I may bother you with one more question...
> 
> I'm also wondering whether we should switch from cgroups to
> numa_set_membind() only until the needed structures are allocated and then
> enforce the nodeset in cpuset.mems (possibly with cpuset.memory_migrate set
> to 1) or if we should skip working with cgroups at all and just use membind
> without any additional cgroup tuning.

Only until the qemu-kvm process is initialized (which is where the needed structures are allocated).
Comment 22 Martin Kletzander 2014-07-16 14:23:28 EDT
Fixed upstream with v1.2.6-176-g7e72ac7:

commit 7e72ac787848b7434c9359a57c1e2789d92350f8
Author: Martin Kletzander <mkletzan@redhat.com>
Date:   Tue Jul 8 09:59:49 2014 +0200

    qemu: leave restricting cpuset.mems after initialization
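
For context, a rough sketch of the sequence that commit message describes: restrict memory with numa_set_membind() before exec'ing qemu (so the kernel-internal DMA32 allocation still succeeds), then tighten cpuset.mems only once qemu has initialized. This is illustrative only; the cgroup path, the qemu invocation and the sleep-based wait are placeholders, not what libvirt actually does (build with -lnuma):

/* Sketch of the two-phase restriction described in the fix. */
#include <numa.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

static void restrict_with_membind(int node)
{
    struct bitmask *nodes = numa_allocate_nodemask();
    numa_bitmask_setbit(nodes, node);
    numa_set_membind(nodes);            /* inherited across fork/exec */
    numa_free_nodemask(nodes);
}

static void tighten_cpuset(const char *nodeset)
{
    /* Hypothetical cgroup path for an example VM. */
    FILE *f = fopen("/sys/fs/cgroup/cpuset/machine.slice/example-vm/cpuset.mems", "w");
    if (f) {
        fprintf(f, "%s\n", nodeset);
        fclose(f);
    }
}

int main(void)
{
    pid_t pid;

    restrict_with_membind(1);           /* phase 1: membind only */

    pid = fork();
    if (pid == 0) {
        execlp("qemu-system-x86_64", "qemu-system-x86_64", "-S", (char *)NULL);
        _exit(127);
    }

    sleep(5);                           /* placeholder for "wait until qemu initialized" */
    tighten_cpuset("1");                /* phase 2: now enforce via cgroups */

    waitpid(pid, NULL, 0);
    return 0;
}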
Comment 29 Jincheng Miao 2014-11-20 05:23:57 EST
This problem is fixed in the latest libvirt-1.2.8-7.el7.x86_64.

Here are the verification steps:
# rpm -q libvirt
libvirt-1.2.8-7.el7.x86_64

1. The host has 2 NUMA nodes, and the DMA/DMA32 zones reside only in node 0 (so binding to node 1 is the previously failing case).
# numactl --hard
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
node 0 size: 65514 MB
node 0 free: 59459 MB
node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
node 1 size: 65536 MB
node 1 free: 60689 MB
node distances:
node   0   1 
  0:  10  11 
  1:  11  10 

# grep "DMA" /proc/zoneinfo 
Node 0, zone      DMA
Node 0, zone    DMA32


2. Configure guest memory with 'auto' placement:
# virsh edit r7
...
  <vcpu placement='auto'>4</vcpu>
  <numatune>
    <memory mode='strict' placement='auto'/>
  </numatune>
..

# virsh start r7
Domain r7 started

3. Configure guest memory bound to node 1:
# virsh edit r7
...
  <vcpu placement='auto'>4</vcpu>
  <numatune>
    <memory mode='strict' nodeset='1'/>
  </numatune>
..

# virsh start r7
Domain r7 started

So I am changing the status to VERIFIED.
Comment 31 errata-xmlrpc 2015-03-05 02:24:51 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-0323.html
Comment 32 Ján Tomko 2016-03-24 06:08:47 EDT
*** Bug 1320830 has been marked as a duplicate of this bug. ***
