Bug 1150505

Summary: Domain is out of control from libvirt when running some concurrent define/undefine/start/destroy jobs rapidly
Product: Red Hat Enterprise Linux 7
Reporter: Hu Jianwei <jiahu>
Component: libvirt
Assignee: Martin Kletzander <mkletzan>
Status: CLOSED ERRATA
QA Contact: Virtualization Bugs <virt-bugs>
Severity: medium
Priority: unspecified
Version: 7.1
CC: berrange, dyuan, honzhang, jiahu, jmiao, lmiksik, mkletzan, mzhan, rbalakri, vivianzhang
Target Milestone: rc
Keywords: Upstream
Hardware: x86_64
OS: Linux
Fixed In Version: libvirt-1.2.8-11.el7
Doc Type: Bug Fix
Last Closed: 2015-03-05 07:46:15 UTC
Type: Bug
Attachments:
Description                          Flags
Error log for scratch build          none
log for libvirtd on 1.2.8-7 build    none

Description Hu Jianwei 2014-10-08 11:37:59 UTC
Description
The domain falls out of libvirt's control when concurrent define/undefine/start/destroy jobs are run in rapid succession.

Version:
libvirt-1.2.8-4.el7.x86_64
qemu-kvm-1.5.3-73.el7.x86_64
kernel-3.10.0-123.el7.x86_64
libcgroup-0.41-6.el7.x86_64
libcgroup-tools-0.41-6.el7.x86_64

How reproducible:
95%

Steps to Reproduce:
1. In the first terminal:
[root@ibm-x3850x5-06 ~]# while true; do virsh undefine test1;virsh define test1.xml; done

2. In the second terminal:
[root@ibm-x3850x5-06 libvirt-1.2.8]# while true; do virsh destroy test1;virsh start test1; done

3. After running the stress loops for a while:
[root@ibm-x3850x5-06 machine.slice]# ps aux | grep test1
qemu       748 46.4  0.8 1639612 282100 ?      Sl   16:25   0:31 /usr/libexec/qemu-kvm -name test1 -S -machine pc-i440fx-rhel7.0.0,accel=kvm,usb=off -m 1024 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -uuid 4309adb4-30f0-4f23-9109-a3a2c3877868 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/test1.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 -drive file=/var/lib/libvirt/images/test.img,if=none,id=drive-ide0-0-0,format=raw,cache=none -device ide-hd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1 -netdev tap,fd=28,id=hostnet0,vhost=on,vhostfd=29 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:46:9d:f0,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -chardev spicevmc,id=charchannel0,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.spice.0 -spice port=5900,addr=127.0.0.1,disable-ticketing,seamless-migration=on -vga qxl -global qxl-vga.ram_size=67108864 -global qxl-vga.vram_size=67108864 -device intel-hda,id=sound0,bus=pci.0,addr=0x4 -device hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 -msg timestamp=on

[root@ibm-x3850x5-06 libvirt-1.2.8]# virsh start test1
error: Failed to start domain test1
error: error from service: CreateMachine: File exists

[root@ibm-x3850x5-06 machine.slice]# virsh list --all
 Id    Name                           State
----------------------------------------------------
 -     test1                          shut off

[root@ibm-x3850x5-06 machine.slice]# pwd
/sys/fs/cgroup/systemd/machine.slice
[root@ibm-x3850x5-06 machine.slice]# ll
total 0
-rw-r--r--. 1 root root 0 Sep 29 21:06 cgroup.clone_children
--w--w--w-. 1 root root 0 Sep 29 21:06 cgroup.event_control
-rw-r--r--. 1 root root 0 Sep 29 21:06 cgroup.procs
drwxr-xr-x. 2 root root 0 Sep 30 15:35 machine-qemu\x2dtest1.scope
-rw-r--r--. 1 root root 0 Sep 29 21:06 notify_on_release
-rw-r--r--. 1 root root 0 Sep 29 21:06 tasks

Actual results:
As the steps above show, the domain's qemu process is left running but detached from libvirt, and libvirt can no longer start the domain.

2014-09-30 07:22:41.715+0000: 2757: debug : virEventPollDispatchHandles:494 : i=0 w=1
2014-09-30 07:22:41.715+0000: 2761: error : virDBusCall:1429 : error from service: CreateMachine: File exists
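
The leftover state shown above (a machine-qemu\x2dtest1.scope directory that survives while virsh reports the domain as shut off) can be checked with a small helper. This is only a sketch: CGROUP_ROOT, scope_name and leftover_pids are illustrative names that do not come from libvirt or systemd, and the escaping handles only "-", not every character systemd would escape.

```shell
#!/bin/bash
# Sketch: report whether a systemd machine scope for a libvirt domain still
# exists and which PIDs it holds. CGROUP_ROOT and both function names are
# assumptions made for illustration.
CGROUP_ROOT="${CGROUP_ROOT:-/sys/fs/cgroup/systemd/machine.slice}"

# Build the scope directory name seen in the listing above,
# e.g. test1 -> machine-qemu\x2dtest1.scope. Only "-" is escaped here.
scope_name() {
    local escaped
    escaped=$(printf '%s' "qemu-$1" | sed 's/-/\\x2d/g')
    printf 'machine-%s.scope\n' "$escaped"
}

# Print the PIDs still attached to the domain's scope, one per line.
# Returns 1 when no scope directory is left behind.
leftover_pids() {
    local dir
    dir="$CGROUP_ROOT/$(scope_name "$1")"
    [ -d "$dir" ] || return 1
    cat "$dir/cgroup.procs" 2>/dev/null
    return 0
}
```

For example, `leftover_pids test1` exits with status 1 on a clean host; in the state reported here it would be expected to succeed and print the PID of the stray qemu-kvm process.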

Expected results:
libvirt should start the domain.

Additional info:

Comment 1 Daniel Berrangé 2014-10-08 11:43:02 UTC
Probably need this upstream commit

commit 4882618ed13b469d92fa8b2b4a158fdb17dbe9f1
Author: Guido Günther <agx>
Date:   Thu Sep 25 13:32:58 2014 +0200

    qemu: use systemd's TerminateMachine to kill all processes
    
    If we don't properly clean up all processes in the
    machine-<vmname>.scope systemd won't remove the cgroup and subsequent vm
    starts fail with
    
      'CreateMachine: File exists'
    
    Additional processes can e.g. be added via
    
      echo $PID > /sys/fs/cgroup/systemd/machine.slice/machine-${VMNAME}.scope/tasks
    
    but there are other cases like
    
      http://bugs.debian.org/761521
    
    Invoke TerminateMachine to be on the safe side since systemd tracks the
    cgroup anyway. This is a noop if all processes have terminated already.
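
For illustration, the same termination request can be issued by hand with machinectl, which asks systemd-machined to terminate a registered machine. This is a sketch, not what the commit itself does in code: it assumes the machine is registered as "qemu-<domain>" (inferred from the scope name machine-qemu\x2dtest1.scope), and terminate_machine and MACHINECTL are hypothetical names. Whether this clears the state reported in this bug is not established here; machinectl will report an error if machined no longer knows the machine.

```shell
#!/bin/bash
# MACHINECTL is overridable so the sketch can be exercised without systemd.
MACHINECTL="${MACHINECTL:-machinectl}"

# Ask systemd-machined to terminate every process in the domain's machine.
# Assumption: the machine name is "qemu-<domain>".
terminate_machine() {
    local domain="$1"
    "$MACHINECTL" terminate "qemu-$domain"
}
```

Usage would be `terminate_machine test1` as root on the affected host.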

Comment 5 Martin Kletzander 2014-10-15 07:25:44 UTC
Could you please provide debug logs from libvirt, captured while reproducing the issue? Thank you.
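
One way to capture such logs is to raise the daemon's log level and send the output to a file. The sketch below appends the log_level and log_outputs settings to libvirtd.conf; the function name, the default configuration path and the log file location are assumptions, and libvirtd has to be restarted afterwards for the settings to take effect.

```shell
#!/bin/bash
# Append debug-logging settings to a libvirtd configuration file.
# The default path and the function name are assumptions for illustration.
# Calling it twice appends the settings twice.
enable_libvirtd_debug() {
    local conf="${1:-/etc/libvirt/libvirtd.conf}"
    cat >> "$conf" <<'EOF'
log_level = 1
log_outputs = "1:file:/var/log/libvirt/libvirtd.log"
EOF
}

# Afterwards, restart the daemon, e.g.: systemctl restart libvirtd
```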

Comment 6 Hu Jianwei 2014-10-15 08:03:56 UTC
Created attachment 947129 [details]
Error log for scratch build

Please check the error log for scratch build

Comment 7 Martin Kletzander 2014-11-04 09:58:41 UTC
Fixed upstream with v1.2.10-9-gb629c64:

commit b629c64e5e0a32ef439b8eeb3a697e2cd76f3248
Author:     Martin Kletzander <mkletzan>
AuthorDate: Thu Oct 30 14:38:35 2014 +0100

    qemu: avoid rare race when undefining domain

Comment 10 Hu Jianwei 2014-11-25 02:46:52 UTC
I can still reproduce it.

[root@ibm-x3850x5-06 ~]# rpm -q libvirt
libvirt-1.2.8-7.el7.x86_64

After running the concurrent jobs rapidly:

[root@ibm-x3850x5-06 ~]# virsh list --all
 Id    Name                           State
----------------------------------------------------
 -     test                           shut off

[root@ibm-x3850x5-06 ~]# virsh start test
error: Failed to start domain test
error: error from service: CreateMachine: File exists

[root@ibm-x3850x5-06 ~]# ps aux | grep qemu-kvm
qemu       377  7.1  0.8 1661472 290980 ?      Sl   10:34   0:38 /usr/libexec/qemu-kvm -name test -S -machine pc-i440fx-rhel7.0.0,accel=kvm,usb=off -m 1024 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -uuid 2ce8d663-981e-416e-8760-a21216481992 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/test.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 -drive file=/var/lib/libvirt/images/test.img,if=none,id=drive-ide0-0-0,format=raw,cache=none -device ide-hd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1 -netdev tap,fd=24,id=hostnet0,vhost=on,vhostfd=21 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:9d:96:2a,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -chardev spicevmc,id=charchannel0,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.spice.0 -spice port=5900,addr=127.0.0.1,disable-ticketing,seamless-migration=on -device qxl-vga,id=video0,ram_size=67108864,vram_size=67108864,bus=pci.0,addr=0x2 -device intel-hda,id=sound0,bus=pci.0,addr=0x4 -device hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 -msg timestamp=on
root       858  0.0  0.0 112644   972 pts/0    S+   10:43   0:00 grep --color=auto qemu-kvm

Comment 11 Hu Jianwei 2014-11-25 02:50:35 UTC
Created attachment 960995 [details]
log for libvirtd on 1.2.8-7 build

Comment 12 Martin Kletzander 2014-12-08 13:06:27 UTC
I need to investigate more if this is still not fixed.  Moving back to assigned.

Comment 14 vivian zhang 2014-12-23 08:08:21 UTC
I can reproduce this bug on build
libvirt-1.2.8-10.el7.x86_64

Verified on build
libvirt-1.2.8-11.el7.x86_64

Verification steps:

1. Prepare a guest XML on the host.
In the first terminal:
# while true; do virsh undefine vm1;virsh define vm1.xml; done

In the second terminal:
# while true;do virsh destroy vm1;virsh start vm1;done


2. Run the stress scripts for more than 2 hours. The guest still works normally,
and no stray qemu-kvm process is left behind.

 # virsh start vm1
Domain vm1 started

[root@intel-e31225-16-2 ~]# virsh list
 Id    Name                           State
----------------------------------------------------
 12824 vm1                            running
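
The two loops and the final start check from the verification steps above can be folded into one time-bounded script. This is a minimal sketch rather than the procedure actually used: stress_domain and VIRSH are hypothetical names, and the final check only confirms that the domain can still be started.

```shell
#!/bin/bash
# VIRSH is overridable so the loops can be dry-run against a stub.
VIRSH="${VIRSH:-virsh}"

# Run both loops concurrently until the deadline, then try a clean start.
# $1 = duration in seconds, $2 = domain name, $3 = domain XML file
stress_domain() {
    local secs="$1" dom="$2" xml="$3"
    local end=$(( $(date +%s) + secs ))

    # Loop 1: undefine/define, as in the first terminal.
    ( while [ "$(date +%s)" -lt "$end" ]; do
          "$VIRSH" undefine "$dom"; "$VIRSH" define "$xml"
      done ) >/dev/null 2>&1 &
    local pid1=$!

    # Loop 2: destroy/start, as in the second terminal.
    ( while [ "$(date +%s)" -lt "$end" ]; do
          "$VIRSH" destroy "$dom"; "$VIRSH" start "$dom"
      done ) >/dev/null 2>&1 &
    local pid2=$!

    # Individual virsh failures inside the loops are expected; ignore them.
    wait "$pid1" "$pid2" || true

    # Final check: the domain should still be startable under libvirt.
    "$VIRSH" destroy "$dom" >/dev/null 2>&1
    "$VIRSH" start "$dom"
}
```

For a run comparable to the one described above, `stress_domain 7200 vm1 vm1.xml` would keep both loops going for two hours; a non-zero exit status means the final start failed.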


Moving to VERIFIED.

Comment 16 errata-xmlrpc 2015-03-05 07:46:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-0323.html