Description of problem:
After assigning a CPU profile with a QoS limit to a VM, an attempt to migrate it causes the VM to go down: the migration source reports the migration as successful while the destination fails.

Version-Release number of selected component (if applicable):
vdsm-4.19.4-1.el7.centos.x86_64
qemu-kvm-ev-2.6.0-28.el7_3.3.1.x86_64
package libvirtd is not installed

How reproducible:
About 70% of the time.

Steps to Reproduce:
1. install oVirt 4.1.0 with at least 2 hosts in the same DC
2. go to Data Centers -> select one -> QoS -> CPU
3. create a new QoS policy with a limit of 10% and name it "test"
4. go to Clusters -> select one -> CPU Profiles
5. create a new profile "test" and use the QoS policy created in step 3
6. create a VM in this data center and press Edit
7. click "Show Advanced Options" and go to "Resource Allocation"
8. set "test" as "CPU Profile"
9. start the VM
10. after it starts - live migrate it

Actual results:
Migration fails. The VM goes to "down" state and has to be started manually.

Expected results:
Migration succeeds.

Additional info:
This also happened to VMs where a quota was added while they were running - after migration attempts they crash. Will attach related log snippets.
Source VDSM log - all OK here:

2017-02-24 16:33:54,341 INFO (jsonrpc/1) [vdsm.api] START migrate args=(<virt.vm.Vm object at 0x25bf090>, {u'incomingLimit': 2, u'tunneled': u'false', u'dstqemu': u'66.187.230.4', u'autoConverge': u'false', u'src': u'ovirt-srv03.ovirt.org', u'enableGuestEvents': False, u'dst': u'ovirt-srv02.ovirt.org:54321', u'vmId': u'3ef3ba2a-47ce-4579-912d-3891a2dc9ce4', u'abortOnError': u'true', u'outgoingLimit': 2, u'compressed': u'false', u'method': u'online', 'mode': 'remote'}) kwargs={} (api:37)
2017-02-24 16:33:55,606 INFO (migsrc/3ef3ba2a) [virt.vm] (vmId='3ef3ba2a-47ce-4579-912d-3891a2dc9ce4') Creation of destination VM took: 0 seconds (migration:431)
2017-02-24 16:33:55,606 INFO (migsrc/3ef3ba2a) [virt.vm] (vmId='3ef3ba2a-47ce-4579-912d-3891a2dc9ce4') starting migration to qemu+tls://ovirt-srv02.ovirt.org/system with miguri tcp://66.187.230.4 (migration:453)
2017-02-24 16:34:01,997 INFO (libvirt/events) [virt.vm] (vmId='3ef3ba2a-47ce-4579-912d-3891a2dc9ce4') CPU stopped: onSuspend (vm:4877)
2017-02-24 16:34:02,977 INFO (migsrc/3ef3ba2a) [virt.vm] (vmId='3ef3ba2a-47ce-4579-912d-3891a2dc9ce4') migration took 7 seconds to complete (migration:465)
2017-02-24 16:34:02,981 INFO (migsrc/3ef3ba2a) [virt.vm] (vmId='3ef3ba2a-47ce-4579-912d-3891a2dc9ce4') Changed state to Down: Migration succeeded (code=4) (vm:1199)
2017-02-24 16:34:02,982 INFO (migsrc/3ef3ba2a) [virt.vm] (vmId='3ef3ba2a-47ce-4579-912d-3891a2dc9ce4') Stopping connection (guestagent:430)
2017-02-24 16:34:03,242 INFO (jsonrpc/0) [virt.vm] (vmId='3ef3ba2a-47ce-4579-912d-3891a2dc9ce4') Release VM resources (vm:4092)
2017-02-24 16:34:03,242 WARN (jsonrpc/0) [virt.vm] (vmId='3ef3ba2a-47ce-4579-912d-3891a2dc9ce4') trying to set state to Powering down when already Down (vm:351)
2017-02-24 16:34:03,242 INFO (jsonrpc/0) [virt.vm] (vmId='3ef3ba2a-47ce-4579-912d-3891a2dc9ce4') Stopping connection (guestagent:430)
2017-02-24 16:34:03,243 INFO (jsonrpc/0) [virt.vm] (vmId='3ef3ba2a-47ce-4579-912d-3891a2dc9ce4') _destroyVmGraceful attempt #0 (vm:4128)
2017-02-24 16:34:03,244 INFO (jsonrpc/0) [virt.vm] (vmId='3ef3ba2a-47ce-4579-912d-3891a2dc9ce4') VM '3ef3ba2a-47ce-4579-912d-3891a2dc9ce4' already down and destroyed (vm:4146)
2017-02-24 16:34:03,248 INFO (jsonrpc/0) [virt.vm] (vmId='3ef3ba2a-47ce-4579-912d-3891a2dc9ce4') Stopping connection (guestagent:430)

Destination VDSM log - error parsing vcpuLimit?

2017-02-24 16:33:55,441 INFO (jsonrpc/3) [vdsm.api] START __init__ args=(<virt.vm.Vm object at 0x7fa52034fb50>, <clientIF.clientIF object at 0x2049590>, ...
2017-02-24 16:33:55,450 INFO (vm/3ef3ba2a) [virt.vm] (vmId='3ef3ba2a-47ce-4579-912d-3891a2dc9ce4') VM wrapper has started (vm:1903)
2017-02-24 16:34:02,076 INFO (libvirt/events) [virt.vm] (vmId='3ef3ba2a-47ce-4579-912d-3891a2dc9ce4') CPU running: onResume (vm:4877)
2017-02-24 16:34:02,077 INFO (libvirt/events) [virt.vm] (vmId='3ef3ba2a-47ce-4579-912d-3891a2dc9ce4') CPU running: onResume (vm:4877)
2017-02-24 16:34:02,139 INFO (vm/3ef3ba2a) [virt.vm] (vmId='3ef3ba2a-47ce-4579-912d-3891a2dc9ce4') Starting connection (guestagent:245)
2017-02-24 16:34:02,162 INFO (vm/3ef3ba2a) [virt.vm] (vmId='3ef3ba2a-47ce-4579-912d-3891a2dc9ce4') CPU running: domain initialization (vm:4877)
2017-02-24 16:34:02,169 ERROR (vm/3ef3ba2a) [virt.vm] (vmId='3ef3ba2a-47ce-4579-912d-3891a2dc9ce4') The vm start process failed (vm:616)
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 571, in _startUnderlyingVm
    self._completeIncomingMigration()
  File "/usr/share/vdsm/virt/vm.py", line 3084, in _completeIncomingMigration
    self._domDependentInit()
  File "/usr/share/vdsm/virt/vm.py", line 1787, in _domDependentInit
    self._vmDependentInit()
  File "/usr/share/vdsm/virt/vm.py", line 1854, in _vmDependentInit
    self._dom_vcpu_setup()
  File "/usr/share/vdsm/virt/vm.py", line 1877, in _dom_vcpu_setup
    self._updateVcpuLimit()
  File "/usr/share/vdsm/virt/vm.py", line 2564, in _updateVcpuLimit
    self._vcpuLimit = vmxml.dom_text(vcpuLimit)
AttributeError: 'module' object has no attribute 'dom_text'
2017-02-24 16:34:02,192 INFO (vm/3ef3ba2a) [virt.vm] (vmId='3ef3ba2a-47ce-4579-912d-3891a2dc9ce4') Changed state to Down: 'module' object has no attribute 'dom_text' (code=1) (vm:1199)
2017-02-24 16:34:02,192 INFO (vm/3ef3ba2a) [virt.vm] (vmId='3ef3ba2a-47ce-4579-912d-3891a2dc9ce4') Stopping connection (guestagent:430)
2017-02-24 16:34:02,299 INFO (jsonrpc/1) [virt.vm] (vmId='3ef3ba2a-47ce-4579-912d-3891a2dc9ce4') Release VM resources (vm:4092)
2017-02-24 16:34:02,300 WARN (jsonrpc/1) [virt.vm] (vmId='3ef3ba2a-47ce-4579-912d-3891a2dc9ce4') trying to set state to Powering down when already Down (vm:351)
2017-02-24 16:34:02,300 INFO (jsonrpc/1) [virt.vm] (vmId='3ef3ba2a-47ce-4579-912d-3891a2dc9ce4') Stopping connection (guestagent:430)
2017-02-24 16:34:02,300 INFO (jsonrpc/1) [virt.vm] (vmId='3ef3ba2a-47ce-4579-912d-3891a2dc9ce4') _destroyVmGraceful attempt #0 (vm:4128)
2017-02-24 16:34:02,831 INFO (jsonrpc/1) [virt.vm] (vmId='3ef3ba2a-47ce-4579-912d-3891a2dc9ce4') Stopping connection (guestagent:430)

qemu domain log on destination (qemu process receives SIGTERM):

2017-02-24 16:33:56.576+0000: starting up libvirt version: 2.0.0, package: 10.el7_3.4 (CentOS BuildSystem <http://bugs.centos.org>, 2017-01-17-23:37:48, c1bm.rdu2.centos.org), qemu version: 2.6.0 (qemu-kvm-ev-2.6.0-28.el7_3.3.1), hostname: ovirt-srv02.ovirt.org
2017-02-24T16:33:56.734612Z qemu-kvm: warning: CPU(s) not present in any NUMA nodes: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
2017-02-24T16:33:56.735254Z qemu-kvm: warning: All CPU(s) up to maxcpus should be described in NUMA config
main_channel_link: add main channel client
red_dispatcher_set_cursor_peer:
inputs_connect: inputs channel client create
red_dispatcher_loadvm_commands:
2017-02-24T16:34:02.302400Z qemu-kvm: terminating on signal 15 from pid 1564
2017-02-24 16:34:02.503+0000: shutting down

messages on destination:

Feb 24 16:33:56 ovirt-srv02 kernel: device vnet10 entered promiscuous mode
Feb 24 16:33:56 ovirt-srv02 kernel: ovirtmgmt: port 12(vnet10) entered forwarding state
Feb 24 16:33:56 ovirt-srv02 kernel: ovirtmgmt: port 12(vnet10) entered forwarding state
Feb 24 16:33:56 ovirt-srv02 systemd-machined: New machine qemu-23-test-41-migration.
Feb 24 16:33:56 ovirt-srv02 systemd: Started Virtual Machine qemu-23-test-41-migration.
Feb 24 16:33:56 ovirt-srv02 systemd: Starting Virtual Machine qemu-23-test-41-migration.
Feb 24 16:33:56 ovirt-srv02 kvm: 11 guests now active
Feb 24 16:33:59 ovirt-srv02 journal: vdsm vds.dispatcher ERROR SSL error receiving from <yajsonrpc.betterAsyncore.Dispatcher connected ('::1', 48866, 0, 0) at 0x7fa5206f1cf8>: unexpected eof
Feb 24 16:34:01 ovirt-srv02 journal: vdsm vds.dispatcher ERROR SSL error receiving from <yajsonrpc.betterAsyncore.Dispatcher connected ('::1', 48868, 0, 0) at 0x7fa52074c878>: unexpected eof
Feb 24 16:34:01 ovirt-srv02 journal: vdsm vds.dispatcher ERROR SSL error receiving from <yajsonrpc.betterAsyncore.Dispatcher connected ('::1', 48870, 0, 0) at 0x7fa52074c200>: unexpected eof
Feb 24 16:34:02 ovirt-srv02 kernel: ovirtmgmt: port 12(vnet10) entered disabled state
Feb 24 16:34:02 ovirt-srv02 kernel: device vnet10 left promiscuous mode
Feb 24 16:34:02 ovirt-srv02 kernel: ovirtmgmt: port 12(vnet10) entered disabled state
Feb 24 16:34:02 ovirt-srv02 kvm: 10 guests now active
Feb 24 16:34:02 ovirt-srv02 systemd-machined: Machine qemu-23-test-41-migration terminated.
Feb 24 16:34:02 ovirt-srv02 journal: vdsm root WARN File: /var/lib/libvirt/qemu/channels/3ef3ba2a-47ce-4579-912d-3891a2dc9ce4.com.redhat.rhevm.vdsm already removed
Feb 24 16:34:02 ovirt-srv02 journal: vdsm root WARN File: /var/lib/libvirt/qemu/channels/3ef3ba2a-47ce-4579-912d-3891a2dc9ce4.org.qemu.guest_agent.0 already removed
Feb 24 16:34:03 ovirt-srv02 ntpd[955]: Deleting interface #29 vnet10, fe80::fc1a:4aff:fe5b:7e0a#123, interface stats: received=0, sent=0, dropped=0, active_time=5 secs
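For reference, the destination-side traceback boils down to calling a helper that does not exist on the vmxml module, so the attribute lookup itself raises and the incoming-migration init path aborts, which is why vdsm then tears the VM down. A minimal, self-contained sketch of that failure mode (the module here is a stand-in built with the standard library, not vdsm's actual vmxml, and the helper name `text` is an assumption for illustration):

```python
import types

# Stand-in for vdsm's vmxml helper module: it exposes some text-extraction
# helper, but nothing named dom_text (the name the broken code calls).
vmxml = types.ModuleType("vmxml")
vmxml.text = lambda element: element.text  # hypothetical correct helper

def update_vcpu_limit(vcpu_limit_element):
    # Mirrors the failing line in _updateVcpuLimit: the attribute lookup
    # fails before the argument is even touched.
    return vmxml.dom_text(vcpu_limit_element)

try:
    update_vcpu_limit(None)
except AttributeError as e:
    # The message names the missing helper, matching the log:
    # ... has no attribute 'dom_text'
    print(e)
```

Because the exception escapes `_startUnderlyingVm`, the VM state flips to Down with exactly this message, as seen at 16:34:02,192 in the destination log.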
Does it work for non-QoS VMs?
Yes, the same VM migrates without issues when the CPU profile is reset to default without a CPU quota.
Adding Martin, since the exception is raised in QoS code.
Software versions used, for reference:
kernel-3.10.0-514.6.1.el7.x86_64
libvirt-daemon-2.0.0-10.el7_3.4.x86_64
libvirt-daemon-driver-qemu-2.0.0-10.el7_3.4.x86_64
libvirt-daemon-kvm-2.0.0-10.el7_3.4.x86_64
qemu-kvm-ev-2.6.0-28.el7_3.3.1.x86_64
Looking at the proper libvirtd log, the following public APIs are called (in this order):

<migration finishes, vcpus are resumed>
virDomainGetXMLDesc
virDomainGetInfo
virDomainGetSchedulerType
virDomainGetSchedulerParameters
virDomainSetSchedulerParameters params["cpu_shares"]=(ullong)1020
virDomainGetMetadata
virDomainGetXMLDesc
virDomainGetXMLDesc
virDomainDestroyFlags

The VM is destroyed deliberately by one of the clients of libvirtd. According to the logs it looks like the same client executed all the APIs above, while the migration was initiated by a different client (probably peer2peer). The time difference between the first virDomainGetXMLDesc and virDomainDestroyFlags is ~0.2s.
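To make the sequence concrete, it can be replayed against a mocked domain handle. The method names below follow the libvirt-python bindings (e.g. virDomainGetXMLDesc maps to `XMLDesc()`); the metadata arguments are placeholders, and the mock only records call order - this is a sketch of the observed client behavior, not vdsm code:

```python
from unittest import mock

dom = mock.Mock(name="virDomain")  # stand-in for a libvirt virDomain handle

# Replay the post-migration sequence seen in the libvirtd log.
dom.XMLDesc(0)                                     # virDomainGetXMLDesc
dom.info()                                         # virDomainGetInfo
dom.schedulerType()                                # virDomainGetSchedulerType
dom.schedulerParameters()                          # virDomainGetSchedulerParameters
dom.setSchedulerParameters({"cpu_shares": 1020})   # virDomainSetSchedulerParameters
dom.metadata(2, "uri-placeholder", 0)              # virDomainGetMetadata (args illustrative)
dom.XMLDesc(0)                                     # virDomainGetXMLDesc
dom.XMLDesc(0)                                     # virDomainGetXMLDesc
dom.destroyFlags(0)                                # virDomainDestroyFlags -> SIGTERM to qemu

names = [name for name, args, kwargs in dom.method_calls]
print(names[-1])  # destroyFlags
```

The point of the replay is the last call: the same client that queried and tuned the scheduler ends by destroying the domain, which matches "terminating on signal 15" in the qemu log.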
Ah, yes - it's killed as a consequence of that exception during initialization (line 617 in vdsm/virt/vm.py).
The fix is trivial; we can fix that quickly. As a follow-up we should investigate why this was not caught in testing, since the offending code has been there for 3 months!
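For context on why the fix is trivial: the crashing code only needs to read the text of the vcpuLimit element out of the domain metadata, i.e. an ordinary "element text" helper. A standard-library sketch of that extraction (the metadata fragment and element names here are illustrative, not vdsm's exact schema):

```python
import xml.etree.ElementTree as ET

# Illustrative metadata fragment carrying a QoS vcpu limit (layout and
# names are assumptions for this sketch, not vdsm's exact schema).
METADATA = "<qos><vcpuLimit>10</vcpuLimit></qos>"

def element_text(element):
    """The kind of helper the broken code expected to find as vmxml.dom_text."""
    if element is None or element.text is None:
        return None
    return element.text.strip()

root = ET.fromstring(METADATA)
vcpu_limit = element_text(root.find("vcpuLimit"))
print(vcpu_limit)  # 10
```

With a helper like this in place (under whatever name the vmxml module actually exports), `_updateVcpuLimit` completes and the incoming migration no longer aborts.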
*** Bug 1425371 has been marked as a duplicate of this bug. ***
No need for a doc_text
Verified with:
Red Hat Virtualization Manager Version: 4.1.1.4-0.1.el7
Host:
OS Version: RHEL - 7.3 - 7.el7
Kernel Version: 3.10.0 - 550.el7.x86_64
KVM Version: 2.6.0 - 28.el7_3.3.1
LIBVIRT Version: libvirt-2.0.0-10.el7_3.5
VDSM Version: vdsm-4.19.7-1.el7ev
SPICE Version: 0.12.4 - 20.el7_3

Steps:
1. install oVirt 4.1.0 with at least 2 hosts in the same DC
2. go to Data Centers -> select one -> QoS -> CPU
3. create a new QoS policy with a limit of 10% and name it "test"
4. go to Clusters -> select one -> CPU Profiles
5. create a new profile "test" and use the QoS policy created in step 3
6. create a VM in this data center and press Edit
7. click "Show Advanced Options" and go to "Resource Allocation"
8. set "test" as "CPU Profile"
9. start the VM (with and without load on the CPU)
10. after it starts - live migrate it

Actual results:
Migration PASS