Bug 1118517
Summary: | Can't simultaneously boot multiple machines using NUMA automatic placement (regression from R6.5) | |
---|---|---|---
Product: | Red Hat Enterprise Linux 7 | Reporter: | Dax Kelson <dkelson>
Component: | libvirt | Assignee: | Martin Kletzander <mkletzan>
Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Virtualization Bugs <virt-bugs>
Severity: | unspecified | Docs Contact: |
Priority: | unspecified | |
Version: | 7.0 | CC: | dkelson, dyuan, honzhang, jmiao, mkletzan, mzhan, rbalakri
Target Milestone: | rc | Keywords: | Reopened, Upstream
Target Release: | --- | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | libvirt-1.2.7-1.el7 | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2014-12-22 10:08:18 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Dax Kelson
2014-07-10 23:01:28 UTC
This might already be fixed. Could you try a few things to confirm that?

1) Check libvirtd.log for the placements offered by numad and paste it here.
2) Paste the output of 'grep DMA /proc/zoneinfo' here.
3) Edit qemu.conf and set cgroup_controllers to the default list (the commented one), but remove "cpuset", and try whether it fixes your issue.

Thank you,
Martin

I suspect this is caused by numad advising nodesets that do not include node 0, which probably has the DMA and DMA32 zones that KVM needs. Fixed upstream with v1.2.6-176-g7e72ac7:

commit 7e72ac787848b7434c9359a57c1e2789d92350f8
Author: Martin Kletzander <mkletzan>
Date:   Tue Jul 8 09:59:49 2014 +0200

    qemu: leave restricting cpuset.mems after initialization

(In reply to Martin Kletzander from comment #2)
> This might already be fixed. Could you try a few things to confirm that?
>
> 1) Check libvirtd.log for the placements offered by numad and paste it here.

2014-09-17 16:38:16.060+0000: 4685: debug : virCommandRunAsync:2282 : About to run /bin/numad -w 2:8192
2014-09-17 16:38:18.066+0000: 4685: debug : qemuProcessStart:3771 : Nodeset returned from numad: 1
2014-09-17 16:38:18.087+0000: 4685: debug : qemuProcessInitCpuAffinity:2021 : Set CPU affinity with advisory nodeset from numad
2014-09-17 16:38:18.444+0000: 4684: debug : virCommandRunAsync:2282 : About to run /bin/numad -w 2:768
2014-09-17 16:38:20.449+0000: 4684: debug : qemuProcessStart:3771 : Nodeset returned from numad: 0
2014-09-17 16:38:20.468+0000: 4684: debug : qemuProcessInitCpuAffinity:2021 : Set CPU affinity with advisory nodeset from numad

> 2) Paste the output of 'grep DMA /proc/zoneinfo' here.

Node 0, zone DMA
Node 0, zone DMA32

> 3) Edit qemu.conf and set cgroup_controllers to the default list (the
> commented one), but remove "cpuset", and try whether it fixes your issue.

I made the requested edit, but it is still broken.

for i in keystonedev.gurulabs.com storagedev.gurulabs.com unifi.gurulabs.com; do virsh start $i; done
error: Failed to start domain keystonedev.gurulabs.com
error: internal error: process exited while connecting to monitor: kvm_init_vcpu failed: Cannot allocate memory
Domain storagedev.gurulabs.com started
error: Failed to start domain unifi.gurulabs.com
error: internal error: process exited while connecting to monitor: kvm_init_vcpu failed: Cannot allocate memory

I guess the thread that was returned nodeset 0 (4684) started its domain successfully and the other one (4685) failed to start its domain. As I expected, this should already be fixed. I'm not sure why the workaround doesn't work, though.

Using libvirt-1.1.1-29.el7_0.3.x86_64 it is still broken, but with a new error message.
Using <vcpu placement='auto'>, I get the following when trying to rapidly start a bunch of VMs:
# for i in computedev1 computedev2 edxdev keystonedev oldspidey shaka storagedev tlv unifi
> do
> virsh start $i
> done
error: Failed to start domain computedev1
error: Unable to write to '/sys/fs/cgroup/cpuset/machine.slice/machine-qemu\x2dcomputedev1.scope/cpuset.cpus': Device or resource busy
error: Failed to start domain computedev2
error: Unable to write to '/sys/fs/cgroup/cpuset/machine.slice/machine-qemu\x2dcomputedev2.scope/cpuset.cpus': Device or resource busy
error: Failed to start domain edxdev
error: Unable to write to '/sys/fs/cgroup/cpuset/machine.slice/machine-qemu\x2dedxdev.scope/cpuset.cpus': Device or resource busy
Domain keystonedev started
error: Failed to start domain oldspidey
error: Unable to write to '/sys/fs/cgroup/cpuset/machine.slice/machine-qemu\x2doldspidey.scope/cpuset.cpus': Device or resource busy
error: Failed to start domain shaka
error: Unable to write to '/sys/fs/cgroup/cpuset/machine.slice/machine-qemu\x2dshaka.scope/cpuset.cpus': Device or resource busy
Domain storagedev started
error: Failed to start domain tlv
error: Unable to write to '/sys/fs/cgroup/cpuset/machine.slice/machine-qemu\x2dtlv.scope/cpuset.cpus': Device or resource busy
error: Failed to start domain unifi
error: Unable to write to '/sys/fs/cgroup/cpuset/machine.slice/machine-qemu\x2dunifi.scope/cpuset.cpus': Device or resource busy
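For reference, a quick way to see what the cpuset controller actually ended up with for these guests is a sketch along the following lines; the paths simply mirror the machine.slice naming in the errors above (cgroup v1 layout) and may differ on other setups:

# Show the cpuset values on the parent slice and on any per-guest scope that
# still exists; adjust the paths if your cgroup mount or slice names differ.
cgroot=/sys/fs/cgroup/cpuset/machine.slice
cat "$cgroot/cpuset.cpus" "$cgroot/cpuset.mems"
for scope in "$cgroot"/machine-qemu*.scope; do
    [ -d "$scope" ] || continue
    echo "== $scope"
    cat "$scope/cpuset.cpus" "$scope/cpuset.mems" 2>/dev/null
done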
Can you attach a debug log, please?

I can't reproduce it with libvirt-1.1.1-29.el7_0.3.x86_64. In my environment, the machine has two NUMA nodes:

# numactl --hard
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
node 0 size: 65514 MB
node 0 free: 58918 MB
node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
node 1 size: 65536 MB
node 1 free: 55141 MB
node distances:
node   0   1
  0:  10  11
  1:  11  10

And the DMA zones reside in node 0:

# grep "DMA" /proc/zoneinfo
Node 0, zone DMA
Node 0, zone DMA32

So both guests are configured with their memory bound to node 1:

# virsh dumpxml a
...
<vcpu placement='static'>4</vcpu>
<numatune>
  <memory mode='strict' nodeset='1'/>
</numatune>
...

Then start them:

# for i in {a..b}; do virsh start $i; done
Domain a started
Domain b started
# for i in {a..b}; do virsh destroy $i; done
Domain a destroyed
Domain b destroyed
# for i in {a..b}; do virsh start $i; done
Domain a started
Domain b started
# for i in {a..b}; do virsh destroy $i; done
Domain a destroyed
Domain b destroyed

No error occurs. IMO, an error like "Device or resource busy" points to a problem in the kernel:

error: Unable to write to '/sys/fs/cgroup/cpuset/machine.slice/machine-qemu\x2dunifi.scope/cpuset.cpus': Device or resource busy

My software stack is:
libvirt-1.1.1-29.el7_0.3.x86_64
qemu-kvm-1.5.3-60.el7.x86_64
kernel-3.10.0-123.17.1.el7.x86_64

I also used libvirt-1.1.1-29.el7_0.3.x86_64 and kernel-3.10.0-123.4.2.el7.x86_64, but still can't reproduce it.

# rpm -q libvirt
libvirt-1.1.1-29.el7_0.3.x86_64
# uname -r
3.10.0-123.4.2.el7.x86_64

Using the following configuration:

<vcpu placement='auto'>4</vcpu>
<numatune>
  <memory mode='strict' nodeset='1'/>
</numatune>

[root@localhost ~]# for i in {a..e}; do virsh start $i; done
Domain a started
Domain b started
Domain c started
Domain d started
Domain e started
[root@localhost ~]# for i in {a..e}; do virsh destroy $i; done
Domain a destroyed
Domain b destroyed
Domain c destroyed
Domain d destroyed
Domain e destroyed
[root@localhost ~]# for i in {a..e}; do virsh start $i; done
Domain a started
Domain b started
Domain c started
Domain d started
Domain e started
[root@localhost ~]# for i in {a..e}; do virsh destroy $i; done
Domain a destroyed
Domain b destroyed
Domain c destroyed
Domain d destroyed
Domain e destroyed

Hi Martin,
I think this bug is triggered by the strict cpuset.mems setting during cgroup initialization. As for "Unable to write to cpuset.cpus: Device or resource busy", I couldn't reproduce it in my environment. Is it a real problem, or could I verify this bug by testing cpuset.mems against the DMA zone?

For the "Device or resource busy" error, this looks like it has the same root cause as Bug 1168866, so this should "just work" with libvirt-1.2.8-10.el7. Just to be sure that my previous comment is not a complete guess, could you attach an XML of a domain that cannot be started, please? Thank you.
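For reference, an XML dump like the one requested can be produced along these lines; "computedev1" is just one of the failing guest names from the earlier output, so substitute any domain that cannot be started:

# Dump the persistent (inactive) definition of a failing domain so its
# <vcpu placement='auto'> and <numatune> settings can be reviewed.
virsh dumpxml --inactive computedev1 > computedev1.xml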
For the problem "kvm_init_vcpu failed: Cannot allocate memory", I can reproduce it with libvirt-1.1.1-29.el7_0.1.x86_64.

# rpm -q libvirt; uname -r
libvirt-1.1.1-29.el7_0.1.x86_64

Using the following configuration:

<vcpu placement='auto'>4</vcpu>
<numatune>
  <memory mode='strict' nodeset='1'/>
</numatune>

# for i in {a..e}; do echo starting $i; virsh start $i; done
starting a
error: Failed to start domain a
error: internal error: process exited while connecting to monitor: kvm_init_vcpu failed: Cannot allocate memory
starting b
error: Failed to start domain b
error: internal error: process exited while connecting to monitor: kvm_init_vcpu failed: Cannot allocate memory
starting c
error: Failed to start domain c
error: internal error: process exited while connecting to monitor: kvm_init_vcpu failed: Cannot allocate memory
starting d
error: Failed to start domain d
error: internal error: process exited while connecting to monitor: kvm_init_vcpu failed: Cannot allocate memory
starting e
error: Failed to start domain e
error: internal error: process exited while connecting to monitor: kvm_init_vcpu failed: Cannot allocate memory

According to the bug description, the "Device or resource busy" problem is similar to bug 1168866, so I will mark this bug as a duplicate of bug 1168866. If you still have questions about it, feel free to reopen it.

*** This bug has been marked as a duplicate of bug 1168866 ***

We are not sure whether this is a duplicate or not. Please read my previous messages saying that we cannot do anything unless we have more information from the reporter. I'm moving this back to ASSIGNED until we have that info and until we are sure what fixes it. If there is no new info and we're past some deadline, we'll have to close this bug as INSUFFICIENT_DATA.

There has been no response from the reporter for some time, and we are not sure whether this is already fixed or not due to reproducibility issues and insufficient data. I'm closing it as such; feel free to reopen this BZ if there is any new information.

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.