Bug 1426727 - VM dies during live migration with CPU quotas enabled
Summary: VM dies during live migration with CPU quotas enabled
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: General
Version: 4.19.4
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ovirt-4.1.1
Target Release: 4.19.7
Assignee: Francesco Romani
QA Contact: Israel Pinto
URL:
Whiteboard:
Duplicates: 1425371
Depends On:
Blocks:
 
Reported: 2017-02-24 17:10 UTC by Evgheni Dereveanchin
Modified: 2017-04-21 09:35 UTC
CC: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-04-21 09:35:18 UTC
oVirt Team: Virt
Embargoed:
rule-engine: ovirt-4.1+
rule-engine: blocker+


Attachments


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 73195 0 master MERGED vm: fix vcpu limit detection on create 2017-02-28 09:32:44 UTC
oVirt gerrit 73198 0 master MERGED vm: tests: add test for initial vcpu limit setup 2017-02-28 09:32:41 UTC
oVirt gerrit 73233 0 ovirt-4.1 MERGED vm: fix vcpu limit detection on create 2017-02-28 10:18:53 UTC
oVirt gerrit 73234 0 ovirt-4.1 MERGED vm: tests: add test for initial vcpu limit setup 2017-02-28 10:18:50 UTC

Description Evgheni Dereveanchin 2017-02-24 17:10:07 UTC
Description of problem:
After assigning a CPU profile to a VM, an attempt to migrate it causes the VM to go down: the migration source reports the migration as successful while the destination fails.

Version-Release number of selected component (if applicable):
vdsm-4.19.4-1.el7.centos.x86_64
qemu-kvm-ev-2.6.0-28.el7_3.3.1.x86_64
package libvirtd is not installed (the installed libvirt-daemon package versions are listed in comment 8)

How reproducible:
About 70% of the time.

Steps to Reproduce:
1. install oVirt 4.1.0 with at least 2 hosts in the same DC
2. go to Data Centers -> select one -> QoS -> CPU
3. create a new QoS policy with a limit of 10% and name it "test"
4. go to Clusters -> select one -> CPU Profiles
5. create a new profile "test" and use the QoS policy created in step 3
6. create a VM in this data center and press Edit
7. click "Show Advanced Options" and go to "Resource Allocation"
8. set "test" as "CPU Profile"
9. start VM
10. after it starts, live migrate it (a scripted equivalent of the QoS setup is sketched below)
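
The QoS and CPU-profile setup above (steps 2 to 8) can also be scripted. Below is a rough sketch using the ovirt-engine-sdk4 Python bindings; the engine URL, credentials, and the "Default"/"testvm" names are placeholders, and the exact type and attribute names should be double-checked against the SDK version in use.

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Placeholder connection details.
conn = sdk.Connection(url='https://engine.example.com/ovirt-engine/api',
                      username='admin@internal', password='secret', insecure=True)
system = conn.system_service()

# Steps 2-3: a CPU QoS entry with a 10% limit in the data center.
dcs = system.data_centers_service()
dc = dcs.list(search='name=Default')[0]
qos = dcs.data_center_service(dc.id).qoss_service().add(
    types.Qos(name='test', type=types.QosType.CPU, cpu_limit=10))

# Steps 4-5: a CPU profile in the cluster pointing at that QoS entry.
clusters = system.clusters_service()
cluster = clusters.list(search='name=Default')[0]
profile = clusters.cluster_service(cluster.id).cpu_profiles_service().add(
    types.CpuProfile(name='test', qos=types.Qos(id=qos.id)))

# Steps 6-8: attach the profile to the VM; start and migrate it afterwards
# (migration can also be triggered via vms.vm_service(vm.id).migrate()).
vms = system.vms_service()
vm = vms.list(search='name=testvm')[0]
vms.vm_service(vm.id).update(types.Vm(cpu_profile=types.CpuProfile(id=profile.id)))

conn.close()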


Actual results:
Migration fails. The VM goes to the "down" state and has to be started manually.

Expected results:
Migration succeeds

Additional info:
This also happened to VMs where the quota was added while they were running: after migration attempts they crash. Related log snippets are attached in the comments below.

Comment 1 Evgheni Dereveanchin 2017-02-24 17:15:56 UTC
Source VDSM log - all OK here

2017-02-24 16:33:54,341 INFO  (jsonrpc/1) [vdsm.api] START migrate args=(<virt.vm.Vm object at 0x25bf090>, {u'incomingLimit': 2, u'tunneled': u'false', u'dstqemu': u'66.187.230.4', u'autoConverge': u'false', u'src': u'ovirt-srv03.ovirt.org', u'enableGuestEvents': False, u'dst': u'ovirt-srv02.ovirt.org:54321', u'vmId': u'3ef3ba2a-47ce-4579-912d-3891a2dc9ce4', u'abortOnError': u'true', u'outgoingLimit': 2, u'compressed': u'false', u'method': u'online', 'mode': 'remote'}) kwargs={} (api:37)
2017-02-24 16:33:55,606 INFO  (migsrc/3ef3ba2a) [virt.vm] (vmId='3ef3ba2a-47ce-4579-912d-3891a2dc9ce4') Creation of destination VM took: 0 seconds (migration:431)
2017-02-24 16:33:55,606 INFO  (migsrc/3ef3ba2a) [virt.vm] (vmId='3ef3ba2a-47ce-4579-912d-3891a2dc9ce4') starting migration to qemu+tls://ovirt-srv02.ovirt.org/system with miguri tcp://66.187.230.4 (migration:453)
2017-02-24 16:34:01,997 INFO  (libvirt/events) [virt.vm] (vmId='3ef3ba2a-47ce-4579-912d-3891a2dc9ce4') CPU stopped: onSuspend (vm:4877)
2017-02-24 16:34:02,977 INFO  (migsrc/3ef3ba2a) [virt.vm] (vmId='3ef3ba2a-47ce-4579-912d-3891a2dc9ce4') migration took 7 seconds to complete (migration:465)
2017-02-24 16:34:02,981 INFO  (migsrc/3ef3ba2a) [virt.vm] (vmId='3ef3ba2a-47ce-4579-912d-3891a2dc9ce4') Changed state to Down: Migration succeeded (code=4) (vm:1199)
2017-02-24 16:34:02,982 INFO  (migsrc/3ef3ba2a) [virt.vm] (vmId='3ef3ba2a-47ce-4579-912d-3891a2dc9ce4') Stopping connection (guestagent:430)
2017-02-24 16:34:03,242 INFO  (jsonrpc/0) [virt.vm] (vmId='3ef3ba2a-47ce-4579-912d-3891a2dc9ce4') Release VM resources (vm:4092)
2017-02-24 16:34:03,242 WARN  (jsonrpc/0) [virt.vm] (vmId='3ef3ba2a-47ce-4579-912d-3891a2dc9ce4') trying to set state to Powering down when already Down (vm:351)
2017-02-24 16:34:03,242 INFO  (jsonrpc/0) [virt.vm] (vmId='3ef3ba2a-47ce-4579-912d-3891a2dc9ce4') Stopping connection (guestagent:430)
2017-02-24 16:34:03,243 INFO  (jsonrpc/0) [virt.vm] (vmId='3ef3ba2a-47ce-4579-912d-3891a2dc9ce4') _destroyVmGraceful attempt #0 (vm:4128)
2017-02-24 16:34:03,244 INFO  (jsonrpc/0) [virt.vm] (vmId='3ef3ba2a-47ce-4579-912d-3891a2dc9ce4') VM '3ef3ba2a-47ce-4579-912d-3891a2dc9ce4' already down and destroyed (vm:4146)
2017-02-24 16:34:03,248 INFO  (jsonrpc/0) [virt.vm] (vmId='3ef3ba2a-47ce-4579-912d-3891a2dc9ce4') Stopping connection (guestagent:430)



Destination VDSM log - error parsing vcpuLimit?

2017-02-24 16:33:55,441 INFO  (jsonrpc/3) [vdsm.api] START __init__ args=(<virt.vm.Vm object at 0x7fa52034fb50>, <clientIF.clientIF object at 0x2049590>, ...
2017-02-24 16:33:55,450 INFO  (vm/3ef3ba2a) [virt.vm] (vmId='3ef3ba2a-47ce-4579-912d-3891a2dc9ce4') VM wrapper has started (vm:1903)
2017-02-24 16:34:02,076 INFO  (libvirt/events) [virt.vm] (vmId='3ef3ba2a-47ce-4579-912d-3891a2dc9ce4') CPU running: onResume (vm:4877)
2017-02-24 16:34:02,077 INFO  (libvirt/events) [virt.vm] (vmId='3ef3ba2a-47ce-4579-912d-3891a2dc9ce4') CPU running: onResume (vm:4877)
2017-02-24 16:34:02,139 INFO  (vm/3ef3ba2a) [virt.vm] (vmId='3ef3ba2a-47ce-4579-912d-3891a2dc9ce4') Starting connection (guestagent:245)
2017-02-24 16:34:02,162 INFO  (vm/3ef3ba2a) [virt.vm] (vmId='3ef3ba2a-47ce-4579-912d-3891a2dc9ce4') CPU running: domain initialization (vm:4877)
2017-02-24 16:34:02,169 ERROR (vm/3ef3ba2a) [virt.vm] (vmId='3ef3ba2a-47ce-4579-912d-3891a2dc9ce4') The vm start process failed (vm:616)
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 571, in _startUnderlyingVm
    self._completeIncomingMigration()
  File "/usr/share/vdsm/virt/vm.py", line 3084, in _completeIncomingMigration
    self._domDependentInit()
  File "/usr/share/vdsm/virt/vm.py", line 1787, in _domDependentInit
    self._vmDependentInit()
  File "/usr/share/vdsm/virt/vm.py", line 1854, in _vmDependentInit
    self._dom_vcpu_setup()
  File "/usr/share/vdsm/virt/vm.py", line 1877, in _dom_vcpu_setup
    self._updateVcpuLimit()
  File "/usr/share/vdsm/virt/vm.py", line 2564, in _updateVcpuLimit
    self._vcpuLimit = vmxml.dom_text(vcpuLimit)
AttributeError: 'module' object has no attribute 'dom_text'
2017-02-24 16:34:02,192 INFO  (vm/3ef3ba2a) [virt.vm] (vmId='3ef3ba2a-47ce-4579-912d-3891a2dc9ce4') Changed state to Down: 'module' object has no attribute 'dom_text' (code=1) (vm:1199)
2017-02-24 16:34:02,192 INFO  (vm/3ef3ba2a) [virt.vm] (vmId='3ef3ba2a-47ce-4579-912d-3891a2dc9ce4') Stopping connection (guestagent:430)
2017-02-24 16:34:02,299 INFO  (jsonrpc/1) [virt.vm] (vmId='3ef3ba2a-47ce-4579-912d-3891a2dc9ce4') Release VM resources (vm:4092)
2017-02-24 16:34:02,300 WARN  (jsonrpc/1) [virt.vm] (vmId='3ef3ba2a-47ce-4579-912d-3891a2dc9ce4') trying to set state to Powering down when already Down (vm:351)
2017-02-24 16:34:02,300 INFO  (jsonrpc/1) [virt.vm] (vmId='3ef3ba2a-47ce-4579-912d-3891a2dc9ce4') Stopping connection (guestagent:430)
2017-02-24 16:34:02,300 INFO  (jsonrpc/1) [virt.vm] (vmId='3ef3ba2a-47ce-4579-912d-3891a2dc9ce4') _destroyVmGraceful attempt #0 (vm:4128)
2017-02-24 16:34:02,831 INFO  (jsonrpc/1) [virt.vm] (vmId='3ef3ba2a-47ce-4579-912d-3891a2dc9ce4') Stopping connection (guestagent:430)
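
To make the failure mode concrete: the AttributeError above is raised while the guest is already running on the destination, and vdsm's error handling then tears the domain down, which is why the migration itself "succeeds" on the source while the VM ends up Down and qemu receives SIGTERM moments later. A rough, self-contained paraphrase of that control flow (not vdsm source; names simplified):

import types

# Stand-in for vdsm's vmxml module as shipped in vdsm-4.19.4: no dom_text() helper.
vmxml = types.ModuleType('vmxml')

class ToyVm(object):
    """Toy model of the destination-side error path; not vdsm source."""

    def __init__(self):
        self.state = 'Migration Destination'

    def _complete_incoming_migration(self):
        # Mirrors the failing call in _updateVcpuLimit(): the helper does not
        # exist, so an AttributeError escapes while the guest is already running.
        return vmxml.dom_text('<vcpuLimit>10</vcpuLimit>')

    def start_underlying_vm(self):
        try:
            self._complete_incoming_migration()
        except Exception as exc:
            # vdsm's error path marks the VM Down and releases its resources,
            # which ends in virDomainDestroyFlags() and a SIGTERM to qemu.
            self.state = 'Down: %s' % exc

vm = ToyVm()
vm.start_underlying_vm()
print(vm.state)   # e.g. "Down: module 'vmxml' has no attribute 'dom_text'"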



qemu domain log on destination (qemu process receives SIGTERM)

2017-02-24 16:33:56.576+0000: starting up libvirt version: 2.0.0, package: 10.el7_3.4 (CentOS BuildSystem <http://bugs.centos.org>, 2017-01-17-23:37:48, c1bm.rdu2.centos.org), qemu version: 2.6.0 (qemu-kvm-ev-2.6.0-28.el7_3.3.1), hostname: ovirt-srv02.ovirt.org
2017-02-24T16:33:56.734612Z qemu-kvm: warning: CPU(s) not present in any NUMA nodes: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
2017-02-24T16:33:56.735254Z qemu-kvm: warning: All CPU(s) up to maxcpus should be described in NUMA config
main_channel_link: add main channel client
red_dispatcher_set_cursor_peer: 
inputs_connect: inputs channel client create
red_dispatcher_loadvm_commands: 
2017-02-24T16:34:02.302400Z qemu-kvm: terminating on signal 15 from pid 1564
2017-02-24 16:34:02.503+0000: shutting down



/var/log/messages on the destination

Feb 24 16:33:56 ovirt-srv02 kernel: device vnet10 entered promiscuous mode
Feb 24 16:33:56 ovirt-srv02 kernel: ovirtmgmt: port 12(vnet10) entered forwarding state
Feb 24 16:33:56 ovirt-srv02 kernel: ovirtmgmt: port 12(vnet10) entered forwarding state
Feb 24 16:33:56 ovirt-srv02 systemd-machined: New machine qemu-23-test-41-migration.
Feb 24 16:33:56 ovirt-srv02 systemd: Started Virtual Machine qemu-23-test-41-migration.
Feb 24 16:33:56 ovirt-srv02 systemd: Starting Virtual Machine qemu-23-test-41-migration.
Feb 24 16:33:56 ovirt-srv02 kvm: 11 guests now active
Feb 24 16:33:59 ovirt-srv02 journal: vdsm vds.dispatcher ERROR SSL error receiving from <yajsonrpc.betterAsyncore.Dispatcher connected ('::1', 48866, 0, 0) at 0x7fa5206f1cf8>: unexpected eof
Feb 24 16:34:01 ovirt-srv02 journal: vdsm vds.dispatcher ERROR SSL error receiving from <yajsonrpc.betterAsyncore.Dispatcher connected ('::1', 48868, 0, 0) at 0x7fa52074c878>: unexpected eof
Feb 24 16:34:01 ovirt-srv02 journal: vdsm vds.dispatcher ERROR SSL error receiving from <yajsonrpc.betterAsyncore.Dispatcher connected ('::1', 48870, 0, 0) at 0x7fa52074c200>: unexpected eof
Feb 24 16:34:02 ovirt-srv02 kernel: ovirtmgmt: port 12(vnet10) entered disabled state
Feb 24 16:34:02 ovirt-srv02 kernel: device vnet10 left promiscuous mode
Feb 24 16:34:02 ovirt-srv02 kernel: ovirtmgmt: port 12(vnet10) entered disabled state
Feb 24 16:34:02 ovirt-srv02 kvm: 10 guests now active
Feb 24 16:34:02 ovirt-srv02 systemd-machined: Machine qemu-23-test-41-migration terminated.
Feb 24 16:34:02 ovirt-srv02 journal: vdsm root WARN File: /var/lib/libvirt/qemu/channels/3ef3ba2a-47ce-4579-912d-3891a2dc9ce4.com.redhat.rhevm.vdsm already removed
Feb 24 16:34:02 ovirt-srv02 journal: vdsm root WARN File: /var/lib/libvirt/qemu/channels/3ef3ba2a-47ce-4579-912d-3891a2dc9ce4.org.qemu.guest_agent.0 already removed
Feb 24 16:34:03 ovirt-srv02 ntpd[955]: Deleting interface #29 vnet10, fe80::fc1a:4aff:fe5b:7e0a#123, interface stats: received=0, sent=0, dropped=0, active_time=5 secs

Comment 3 Michal Skrivanek 2017-02-25 06:18:52 UTC
Does it work for non-QoS VMs?

Comment 4 Evgheni Dereveanchin 2017-02-25 15:09:20 UTC
Yes, the same VM migrates without issues when the CPU profile is reset to default without a CPU quota.

Comment 5 Michal Skrivanek 2017-02-25 18:00:17 UTC
Adding Martin since the exception is raised in the QoS code.

Comment 8 Evgheni Dereveanchin 2017-02-27 15:26:47 UTC
Software versions used, for reference:

kernel-3.10.0-514.6.1.el7.x86_64
libvirt-daemon-2.0.0-10.el7_3.4.x86_64
libvirt-daemon-driver-qemu-2.0.0-10.el7_3.4.x86_64
libvirt-daemon-kvm-2.0.0-10.el7_3.4.x86_64
qemu-kvm-ev-2.6.0-28.el7_3.3.1.x86_64

Comment 10 Peter Krempa 2017-02-27 17:09:45 UTC
Looking at the proper libvirtd log, the following public APIs are called (in the given order):

<migration finishes, vcpus are resumed>
virDomainGetXMLDesc 
virDomainGetInfo 
virDomainGetSchedulerType 
virDomainGetSchedulerParameters  
virDomainSetSchedulerParameters params["cpu_shares"]=(ullong)1020 
virDomainGetMetadata  
virDomainGetXMLDesc   
virDomainGetXMLDesc   
virDomainDestroyFlags 

The VM is destroyed deliberately by one of the clients of libvirtd. According to the logs it looks like the same client executed all the APIs above. The migration was initiated by a different client (probably peer2peer).

The time difference between the first virDomainGetXMLDesc and virDomainDestroyFlags is ~0.2s
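
For anyone replaying this sequence by hand, the calls above map onto the libvirt-python bindings roughly as follows; the connection URI is an assumption, while the UUID and the cpu_shares value are the ones from this bug's logs.

import libvirt

conn = libvirt.open('qemu:///system')
dom = conn.lookupByUUIDString('3ef3ba2a-47ce-4579-912d-3891a2dc9ce4')

xml = dom.XMLDesc(0)                              # virDomainGetXMLDesc
info = dom.info()                                 # virDomainGetInfo
sched_type, nparams = dom.schedulerType()         # virDomainGetSchedulerType
params = dom.schedulerParameters()                # virDomainGetSchedulerParameters
dom.setSchedulerParameters({'cpu_shares': 1020})  # virDomainSetSchedulerParameters

try:
    meta = dom.metadata(libvirt.VIR_DOMAIN_METADATA_ELEMENT,
                        'http://ovirt.org/vm/tune/1.0', 0)   # virDomainGetMetadata
except libvirt.libvirtError:
    meta = None                                   # no QoS metadata on this domain

# virDomainDestroyFlags maps to dom.destroyFlags(); in this bug it is issued by
# vdsm's own error handling rather than an operator, so it is omitted here.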

Comment 11 Michal Skrivanek 2017-02-28 08:15:03 UTC
Ah, yes.
It's killed as a consequence of that exception during initialization (line 617 in vdsm/virt/vm.py).

Comment 12 Michal Skrivanek 2017-02-28 08:26:38 UTC
The fix is trivial; we can fix that quickly. As a follow-up we should see why this was not caught in testing, since the offending code has been there for 3 months!
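
For context, the traceback shows the vcpu-limit lookup calling vmxml.dom_text(), a helper that does not exist in this vdsm build. The snippet below is a standalone sketch of what that lookup needs to do, namely read the <vcpuLimit> text out of the QoS metadata carried in the domain XML; the metadata namespace and layout shown are assumptions for illustration only, and the actual correction is the one merged in gerrit change 73195 (and its 4.1 backport 73233).

import xml.etree.ElementTree as ET

# Assumed namespace and element names, for illustration only.
TUNE_NS = 'http://ovirt.org/vm/tune/1.0'
DOMAIN_XML = """
<domain type='kvm'>
  <name>test-41-migration</name>
  <metadata>
    <qos xmlns='%s'>
      <vcpuLimit>10</vcpuLimit>
    </qos>
  </metadata>
</domain>
""" % TUNE_NS

def vcpu_limit(domain_xml):
    root = ET.fromstring(domain_xml)
    node = root.find('./metadata/{%s}qos/{%s}vcpuLimit' % (TUNE_NS, TUNE_NS))
    # The intended behaviour boils down to reading the element's text, instead of
    # calling the nonexistent vmxml.dom_text() helper seen in the traceback.
    return node.text if node is not None else None

print(vcpu_limit(DOMAIN_XML))   # -> '10'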

Comment 13 Michal Skrivanek 2017-02-28 10:48:25 UTC
*** Bug 1425371 has been marked as a duplicate of this bug. ***

Comment 14 Francesco Romani 2017-03-01 08:59:16 UTC
No need for a doc_text

Comment 15 Israel Pinto 2017-03-14 13:55:53 UTC
Verified with:
Red Hat Virtualization Manager Version: 4.1.1.4-0.1.el7
Host:
OS Version: RHEL - 7.3 - 7.el7
Kernel Version: 3.10.0 - 550.el7.x86_64
KVM Version: 2.6.0 - 28.el7_3.3.1
LIBVIRT Version: libvirt-2.0.0-10.el7_3.5
VDSM Version: vdsm-4.19.7-1.el7ev
SPICE Version: 0.12.4 - 20.el7_3

Steps: 
1. install oVirt 4.1.0 with at least 2 hosts in the same DC
2. go to Data Centers -> select one -> QoS -> CPU
3. create a new QoS policy with a limit of 10% and name it "test"
4. go to Clusters -> select one -> CPU Profiles
5. create a new profile "test" and use the QoS policy created in step 3
6. create a VM in this data center and press Edit
7. click "Show Advanced Options" and go to "Resource Allocation"
8. set "test" as "CPU Profile"
9. start VM (with load and without load on CPU)
10. after it starts, live migrate it


Actual results:
Migration PASS

