Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 949408

Summary: domain fails to start with vcpu placement set to auto
Product: Red Hat Enterprise Linux 7
Reporter: Wayne Sun <gsun>
Component: libvirt
Assignee: Peter Krempa <pkrempa>
Status: CLOSED CURRENTRELEASE
QA Contact: Virtualization Bugs <virt-bugs>
Severity: medium
Docs Contact:
Priority: medium
Version: 7.0
CC: acathrow, cwei, dallan, dyuan, honzhang, jmiao, mzhan, pkrempa
Target Milestone: rc
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: libvirt-1.1.1-1.el7
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-06-13 11:26:20 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
  domain qemu log
  libvirtd log
  cpuinfo
  sysfs dump info
  libvirtd log with Nodeset
  cpuset cgroup files of domain
  libvirtd log
  domain qemu log
  numa cpuset log

Description Wayne Sun 2013-04-08 06:12:51 UTC
Created attachment 732553 [details]
domain qemu log

Description of problem:
The domain fails to start when vcpu placement is set to auto.


Version-Release number of selected component (if applicable):
libvirt-1.0.3-1.el7.x86_64
qemu-kvm-1.4.0-1.el7.x86_64
kernel-3.7.0-0.32.el7.x86_64
numad-0.5-8.20121130git.el7.x86_64

How reproducible:
always

Steps to Reproduce:
1. Set vcpu placement to auto
# virsh dumpxml aa
...
<currentMemory unit='KiB'>1048576</currentMemory>
<vcpu placement='auto'>4</vcpu>
<numatune>
<memory mode='strict' placement='auto'/>
</numatune>
...

2. Start the domain
# virsh start aa
error: Failed to start domain aa
error: internal error process exited while connecting to monitor: 2013-04-08 05:28:33.826+0000: 5494: debug : virFileClose:72 : Closed fd 25
2013-04-08 05:28:33.826+0000: 5494: debug : virFileClose:72 : Closed fd 31
2013-04-08 05:28:33.828+0000: 5494: debug : virFileClose:72 : Closed fd 3
2013-04-08 05:28:33.828+0000: 5495: debug : virExec:602 : Run hook 0x7fae6521fef0 0x7fae6a0f53c0
2013-04-08 05:28:33.828+0000: 5495: debug : qemuProcessHook:2728 : Obtaining domain lock
2013-04-08 05:28:33.828+0000: 5495: debug : virSecuritySELinuxSetSecuritySocketLabel:1963 : Setting VM aa socket context system_u:system_r:svirt_t:s0:c160,c656
2013-04-08 05:28:33.829+0000: 5495: debug : virDomainLockProcessStart:170 : plugin=0x7fae5c005600 dom=0x7fae5c29edb0 paused=1 fd=0x7fae6a0f4f4c
2013-04-08 05:28:33.829+0000: 5495: debug : virDomainLockManagerNew:128 : plugin=0x7fae5c005600 dom=0x7fae5c29edb0 withResources=1
2013-04-08 05:28:33.829+0000: 5495: debug : virLockManagerPluginGetDriver:297 : plugin=0x7fae5c005600
2013-04-08 05:28:33.829+0000: 5

3. Check the domain log:
# vim /var/log/libvirt/qemu/aa.log
...
2013-04-08 05:53:32.972+0000: 7318: debug : virCommandHandshakeChild:377 : Handshake with parent is done
char device redirected to /dev/pts/2 (label charserial0)
kvm_init_vcpu failed: Cannot allocate memory
2013-04-08 05:53:33.175+0000: shutting down


The domain fails with 'kvm_init_vcpu failed: Cannot allocate memory'.


Actual results:
The domain fails to start.

Expected results:
The domain starts successfully.

Additional info:
Checking the libvirtd log:
2013-04-08 05:28:33.765+0000: 26491: debug : qemuProcessStart:3728 : Nodeset returned from numad: 1

so this is not a problem with numad.
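
For reference, the advisory nodeset can also be queried from numad by hand. This is only a sketch assuming this guest's 4 vCPUs and 1024 MB of memory (the exact arguments libvirt passes to numad may differ); it should report the same nodeset ("1") seen in the log line above:

# numad -w 4:1024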

Comment 2 Wayne Sun 2013-04-08 06:16:20 UTC
Created attachment 732554 [details]
libvirtd log

Comment 3 Osier Yang 2013-04-17 14:48:46 UTC
> 
> Additional info:
> check in log:
> 2013-04-08 05:28:33.765+0000: 26491: debug : qemuProcessStart:3728 : Nodeset
> returned from numad: 1
> 
> so it's not a problem with numad

Can you provide the CPU topology?

Comment 4 Wayne Sun 2013-04-18 03:23:24 UTC
Created attachment 737121 [details]
cpuinfo

# virsh nodeinfo
CPU model:           x86_64
CPU(s):              32
CPU frequency:       1064 MHz
CPU socket(s):       1
Core(s) per socket:  8
Thread(s) per core:  2
NUMA cell(s):        2
Memory size:         131875768 KiB

# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
node 0 size: 65514 MB
node 0 free: 62316 MB
node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
node 1 size: 65536 MB
node 1 free: 62744 MB
node distances:
node   0   1 
  0:  10  11 
  1:  11  10

Comment 5 Wayne Sun 2013-04-18 03:31:36 UTC
Created attachment 737124 [details]
sysfs dump info

# ll /sys/devices/system/

Comment 6 Osier Yang 2013-04-22 04:31:26 UTC
Wayne, I need more info on this. One item is the debug log: besides the lines containing "Nodeset", the log entries about setting the "cpuset" cgroup are also needed. And can you make a tarball of the domain's cpuset cgroup files?

It looks like a problem similar to this one:

https://www.redhat.com/archives/libvirt-users/2013-January/msg00085.html
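
A minimal way to capture those cgroup settings while the domain's cgroup directory exists (the mount point is assumed from the paths seen later in the libvirtd log, and "aa" is the domain from the reproducer):

# for f in /sys/fs/cgroup/cpuset/libvirt/qemu/aa/cpuset.*; do
>     echo "== $f =="; cat "$f"
> done > aa-cpuset-cgroup.txt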

Comment 7 Wayne Sun 2013-04-22 05:57:35 UTC
Created attachment 738435 [details]
libvirtd log with Nodeset

After updating the kernel to the latest build:
3.9.0-0.rc7.52.el7.x86_64

the domain starts successfully.

Also tried on another machine with
3.9.0-0.rc6.51.el7.x86_64

and it works there as well, so it might be a kernel cgroup problem.

Anyway, the full failure log from before the kernel update is attached.

Comment 8 Wayne Sun 2013-04-22 07:02:13 UTC
Created attachment 738450 [details]
cpuset cgroup files of domain

(In reply to comment #7)
> Created attachment 738435 [details]
> libvirtd log with Nodeset
> 
> After updating the kernel to the latest build:
> 3.9.0-0.rc7.52.el7.x86_64
> 
> the domain starts successfully.
> 
> Also tried on another machine with
> 3.9.0-0.rc6.51.el7.x86_64
> 
> and it works there as well, so it might be a kernel cgroup problem.
> 
> Anyway, the full failure log from before the kernel update is attached.

After repeating the run several times, the failure occurred again, so the problem still exists. The cgroup files are removed after the domain fails to start, so I will attach the cpuset cgroup files from a run without placement set to auto.

Comment 9 Osier Yang 2013-04-22 13:20:18 UTC
Okay, it is indeed the same problem as https://www.redhat.com/archives/libvirt-users/2013-January/msg00085.html: cpuset.cpus is set to 0-31 (all CPUs), while cpuset.mems is 1 (only node 1).
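
For illustration, the mismatch shows up directly in the domain's cpuset cgroup while the guest is starting (mount point and domain name "aa" are assumed from the logs above; the values are the ones reported in this comment):

# cat /sys/fs/cgroup/cpuset/libvirt/qemu/aa/cpuset.cpus
0-31
# cat /sys/fs/cgroup/cpuset/libvirt/qemu/aa/cpuset.mems
1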

Comment 10 Wayne Sun 2013-04-23 11:27:53 UTC
numatune with mode set to interleave does not take effect

# virsh dumpxml rhel6_local|grep interleave -2
  <vcpu placement='static'>2</vcpu>
  <numatune>
    <memory mode='interleave' nodeset='1-2'/>
  </numatune>
  <os>

start the domain and check process:
# cat /proc/3713/status |grep Mems_allowed_list
Mems_allowed_list:	0-3

# virsh numatune rhel6_local
numa_mode      : interleave
numa_nodeset   : 0-3

Check in log:

2013-04-23 11:19:39.156+0000: 3154: debug : virCgroupSetValueStr:331 : Set value '/sys/fs/cgroup/cpuset/libvirt/qemu/rhel6_local/emulator/cpuset.mems' to '0-3'

Checking with the numastat tool shows the memory usage is not evenly distributed across nodes 1-2. Is this the same problem?
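
One way to check the actual per-node distribution is to inspect the qemu process with the numactl tools (PID 3713 is the process from the reproducer above); with interleave in effect the usage should be spread roughly evenly across the configured nodes:

# numastat -p 3713
# grep interleave /proc/3713/numa_maps | head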

Comment 11 Osier Yang 2013-04-24 06:06:14 UTC
> Check with nmstat tool shows the memory usage is not even distributed on
> node 1-2. Is this the same problem here?

Yes, similar.

Comment 12 Osier Yang 2013-05-13 09:28:26 UTC
Patches posted upstream: https://www.redhat.com/archives/libvir-list/2013-May/msg00637.html

Comment 13 Wayne Sun 2013-06-07 07:38:28 UTC
Created attachment 758015 [details]
libvirtd log

Reproduced with:
libvirt-1.0.6-1.el7.x86_64
qemu-kvm-1.5.0-2.el7.x86_64
kernel-3.9.0-0.55.el7.x86_64

The latest libvirtd log is attached.

Comment 14 Wayne Sun 2013-06-07 07:39:55 UTC
Created attachment 758020 [details]
domain qemu log

The domain qemu log is also updated.

Comment 15 Peter Krempa 2013-07-18 10:51:20 UTC
Updated version posted for review:

http://www.redhat.com/archives/libvir-list/2013-July/msg01159.html

Comment 16 Peter Krempa 2013-07-18 13:11:47 UTC
Fixed upstream:

commit a39f69d2bb5494d661be917956baa437d01a4d13
Author: Osier Yang <jyang>
Date:   Fri May 24 17:08:28 2013 +0800

    qemu: Set cpuset.cpus for domain process
    
    When either "cpuset" of <vcpu> is specified, or the "placement" of
    <vcpu> is "auto", only setting the cpuset.mems might cause the guest
    starting to fail. E.g. ("placement" of both <vcpu> and <numatune> is
    "auto"):
    
    1) Related XMLs
      <vcpu placement='auto'>4</vcpu>
      <numatune>
        <memory mode='strict' placement='auto'/>
      </numatune>
    
    2) Host NUMA topology
      % numactl --hardware
      available: 8 nodes (0-7)
      node 0 cpus: 0 4 8 12 16 20 24 28
      node 0 size: 16374 MB
      node 0 free: 11899 MB
      node 1 cpus: 32 36 40 44 48 52 56 60
      node 1 size: 16384 MB
      node 1 free: 15318 MB
      node 2 cpus: 2 6 10 14 18 22 26 30
      node 2 size: 16384 MB
      node 2 free: 15766 MB
      node 3 cpus: 34 38 42 46 50 54 58 62
      node 3 size: 16384 MB
      node 3 free: 15347 MB
      node 4 cpus: 3 7 11 15 19 23 27 31
      node 4 size: 16384 MB
      node 4 free: 15041 MB
      node 5 cpus: 35 39 43 47 51 55 59 63
      node 5 size: 16384 MB
      node 5 free: 15202 MB
      node 6 cpus: 1 5 9 13 17 21 25 29
      node 6 size: 16384 MB
      node 6 free: 15197 MB
      node 7 cpus: 33 37 41 45 49 53 57 61
      node 7 size: 16368 MB
      node 7 free: 15669 MB
    
    4) cpuset.cpus will be set as: (from debug log)
    
    2013-05-09 16:50:17.296+0000: 417: debug : virCgroupSetValueStr:331 :
    Set value '/sys/fs/cgroup/cpuset/libvirt/qemu/toy/cpuset.cpus'
    to '0-63'
    
    5) The advisory nodeset got from querying numad (from debug log)
    
    2013-05-09 16:50:17.295+0000: 417: debug : qemuProcessStart:3614 :
    Nodeset returned from numad: 1
    
    6) cpuset.mems will be set as: (from debug log)
    
    2013-05-09 16:50:17.296+0000: 417: debug : virCgroupSetValueStr:331 :
    Set value '/sys/fs/cgroup/cpuset/libvirt/qemu/toy/cpuset.mems'
    to '0-7'
    
    I.e., the domain process's memory is restricted to the first NUMA node,
    but it can use all of the CPUs, which will likely cause the domain
    process to fail to start because the kernel fails to allocate memory
    with the memory policy set to "strict".
    
    % tail -n 20 /var/log/libvirt/qemu/toy.log
    ...
    2013-05-09 05:53:32.972+0000: 7318: debug : virCommandHandshakeChild:377 :
    Handshake with parent is done
    char device redirected to /dev/pts/2 (label charserial0)
    kvm_init_vcpu failed: Cannot allocate memory
    ...
    
    Signed-off-by: Peter Krempa <pkrempa>

commit b8b38321e724b5b1b7858c415566ab5e6e96ec8c
Author: Peter Krempa <pkrempa>
Date:   Thu Jul 18 11:21:48 2013 +0200

    caps: Add helpers to convert NUMA nodes to corresponding CPUs
    
    These helpers use the remembered host capabilities to retrieve the CPU
    map rather than querying the host again. The intended usage for these
    helpers is to fix automatic NUMA placement with strict memory allocation.
    The code doing the prepare step needs to pin the emulator process only
    to CPUs belonging to a subset of the host's NUMA nodes.

v1.1.0-254-ga39f69d
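
Roughly, the fix derives the CPU list for cpuset.cpus from the advisory NUMA nodeset so that it matches cpuset.mems. The same node-to-CPU mapping can be read from sysfs; a sketch on the two-node host from comment 4, with numad returning nodeset "1" (libvirt itself computes this from its cached host capabilities rather than from sysfs):

# cat /sys/devices/system/node/node1/cpulist
8-15,24-31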

Comment 17 Jincheng Miao 2013-08-01 09:43:34 UTC
Created attachment 781506 [details]
numa cpuset log

Comment 18 Jincheng Miao 2013-08-01 09:45:05 UTC
Hi Peter,

I also hit this problem with the latest libvirt:

# rpm -q libvirt qemu-kvm kernel numad
libvirt-1.1.1-1.el7.x86_64
qemu-kvm-1.5.2-1.el7.x86_64
kernel-3.10.0-3.el7.x86_64
numad-0.5-10.20121130git.el7.x86_64

# virsh dumpxml r7q
...
  <vcpu placement='auto'>4</vcpu>
  <numatune>
    <memory mode='strict' placement='auto'/>
  </numatune>
...

# virsh start r7q
error: Failed to start domain r7q
error: internal error: process exited while connecting to monitor: char device redirected to /dev/pts/3 (label charserial0)
kvm_init_vcpu failed: Cannot allocate memory

The CPU topology is the same as in comment 4:
# virsh nodeinfo
CPU model:           x86_64
CPU(s):              32
CPU frequency:       1064 MHz
CPU socket(s):       1
Core(s) per socket:  8
Thread(s) per core:  2
NUMA cell(s):        2
Memory size:         131752920 KiB

# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
node 0 size: 65514 MB
node 0 free: 61865 MB
node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
node 1 size: 65536 MB
node 1 free: 63435 MB
node distances:
node   0   1 
  0:  10  11 
  1:  11  10

And in libvirtd.log (the attachment I uploaded) I can see cpuset.cpus = 0-31 and cpuset.mems = 0-1. According to comment 9 from Osier, this should be correct.

Does that mean libvirt is OK and the error is in qemu-kvm?

Comment 19 Peter Krempa 2013-09-02 14:00:05 UTC
Well, libvirt takes the information returned from "numad" and uses it to create the topology of the guest. The data returned from numad depend on multiple factors, and it is not guaranteed that the guest will start successfully even after querying numad.

According to the log, NUMA nodes 0-1 should be used, which corresponds to all CPUs (0-31) in the host. To verify that the fix is okay, you have to re-run the guest (perhaps with less memory) so that "numad" provides a different node range. Then you need to verify that the CPU range provided to the guest corresponds to the NUMA node range. When they match, the fix is okay, but it is still not guaranteed that qemu will successfully be able to allocate its memory.
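
A possible way to perform that check, assuming debug logging goes to /var/log/libvirt/libvirtd.log and using the domain "r7q" from comment 18; with the fix, cpuset.cpus should contain only the CPUs belonging to the nodes listed in cpuset.mems:

# grep 'Nodeset returned from numad' /var/log/libvirt/libvirtd.log | tail -1
# cat /sys/fs/cgroup/cpuset/libvirt/qemu/r7q/cpuset.mems
# cat /sys/fs/cgroup/cpuset/libvirt/qemu/r7q/cpuset.cpus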

Comment 21 Jincheng Miao 2013-09-06 06:29:03 UTC
According to libvirtd.log, the CPU range matches the NUMA node range.

But the fact that qemu is still not guaranteed to start successfully confuses me.

Should this bug be moved to the qemu-kvm component?

Comment 22 Peter Krempa 2013-09-13 08:44:50 UTC
(In reply to Jincheng Miao from comment #21)
> According to libvirtd.log, the cpu range matches NUMA node range. 

That corresponds to the original problem described by this bug.

> 
> But no guaranteed success of qemu starting makes me confused.

The problem is that invoking numad to find out which nodes contain enough memory to accommodate a guest doesn't guarantee that the memory will still be available at the time the guest allocates it. This creates a race condition that may sometimes result in the guest failing to start when there is "just enough" free memory.

> 
> Should this bug move to qemu-kvm component ?

No, this is a problem in the approach libvirt uses to determine the node range. For now there is no way to do it without the race condition, as other processes may take the memory that was available when we determined the node range, before the starting domain is able to allocate it.

It may be worth opening a separate bug to track that issue, as this bug concerns the invalid CPU range generated from the node list, which was fixed by the patches mentioned above.

Comment 24 Jincheng Miao 2013-09-23 09:12:05 UTC
According to Peter's reply in comment 22, this bug is about the invalid CPU range, and that is fixed by the patches, so I am changing the status to VERIFIED.

For the domain failing to start, there is a race condition when allocating memory, so I opened a new bug ( https://bugzilla.redhat.com/show_bug.cgi?id=1010885 ) to track that issue.

Thanks for Peter's advice.

Comment 25 Ludek Smid 2014-06-13 11:26:20 UTC
This request was resolved in Red Hat Enterprise Linux 7.0.

Contact your manager or support representative in case you have further questions about the request.