Bug 1425371 - Live migration fails
Summary: Live migration fails
Keywords:
Status: CLOSED DUPLICATE of bug 1426727
Alias: None
Product: vdsm
Classification: oVirt
Component: General
Version: 4.19.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Dan Kenigsberg
QA Contact: meital avital
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-02-21 09:47 UTC by spamcop
Modified: 2017-02-28 10:48 UTC
CC: 3 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2017-02-28 10:48:25 UTC
oVirt Team: Virt
Embargoed:


Attachments
vdsm.log dest (1.76 MB, text/plain), 2017-02-22 08:42 UTC, spamcop
vdsm.log source (1.76 MB, text/plain), 2017-02-22 08:43 UTC, spamcop
engine.log (162.34 KB, text/plain), 2017-02-22 09:22 UTC, spamcop
VM guest agent log (9.68 KB, text/plain), 2017-02-22 09:29 UTC, spamcop
Migration 11:17 AM (46.97 KB, text/plain), 2017-02-22 10:31 UTC, spamcop
Migration 11:17 AM (54.46 KB, text/plain), 2017-02-22 10:32 UTC, spamcop
Migration 11:17 AM (59.95 KB, text/plain), 2017-02-22 10:32 UTC, spamcop
journal dest (31.37 KB, text/x-vhdl), 2017-02-22 12:11 UTC, spamcop
journal source (33.88 KB, text/x-vhdl), 2017-02-22 12:11 UTC, spamcop
qemu.log (6.83 KB, text/plain), 2017-02-22 12:12 UTC, spamcop

Description spamcop 2017-02-21 09:47:30 UTC
Description of problem:
Live migration of a VM with the latest oVirt 4.1 release fails.



Version-Release number of selected component (if applicable):
OS Version: RHEL - 7 - 3.1611.el7.centos
OS Description: CentOS Linux 7 (Core)
Kernel Version: 3.10.0 - 514.6.1.el7.x86_64
KVM Version: 2.6.0 - 28.el7_3.3.1
LIBVIRT Version: libvirt-2.0.0-10.el7_3.4
VDSM Version: vdsm-4.19.4-1.el7.centos
SPICE Version: 0.12.4 - 19.el7
GlusterFS Version: glusterfs-3.8.8-1.el7
CEPH Version: librbd1-0.94.5-1.el7



How reproducible:
Update oVirt Node to the latest available oVirt 4.1 release and try to live-migrate a VM. The migration ends with:

Migration failed (VM: <Hostname>, Source: <source-node>, Destination: <dest-node>).



Steps to Reproduce:
1. Update oVirt Node to the latest available oVirt 4.1 release.
2. Start a live migration of a VM.
3. Observe that the migration fails with the error above.



Actual results:
/var/log/vdsm/vdsm.log (Dest-Node)

2017-02-20 13:52:41,095 ERROR (vm/3dc71915) [virt.vm] (vmId='3dc71915-b154-4731-b933-26dc63ebd6bc') The vm start process failed (vm:616)
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 571, in _startUnderlyingVm
    self._completeIncomingMigration()
  File "/usr/share/vdsm/virt/vm.py", line 3084, in _completeIncomingMigration
    self._domDependentInit()
  File "/usr/share/vdsm/virt/vm.py", line 1787, in _domDependentInit
    self._vmDependentInit()
  File "/usr/share/vdsm/virt/vm.py", line 1854, in _vmDependentInit
    self._dom_vcpu_setup()
  File "/usr/share/vdsm/virt/vm.py", line 1877, in _dom_vcpu_setup
    self._updateVcpuLimit()
  File "/usr/share/vdsm/virt/vm.py", line 2564, in _updateVcpuLimit
    self._vcpuLimit = vmxml.dom_text(vcpuLimit)
AttributeError: 'module' object has no attribute 'dom_text'
2017-02-20 13:52:41,867 WARN  (jsonrpc/4) [virt.vm] (vmId='3dc71915-b154-4731-b933-26dc63ebd6bc') trying to set state to Powering down when already Down (vm:351)
2017-02-20 13:52:42,241 WARN  (jsonrpc/4) [root] File: /var/lib/libvirt/qemu/channels/3dc71915-b154-4731-b933-26dc63ebd6bc.com.redhat.rhevm.vdsm already removed (utils:120)
2017-02-20 13:52:42,241 WARN  (jsonrpc/4) [root] File: /var/lib/libvirt/qemu/channels/3dc71915-b154-4731-b933-26dc63ebd6bc.org.qemu.guest_agent.0 already removed (utils:120)
2017-02-20 13:52:42,243 WARN  (libvirt/events) [virt.vm] (vmId='3dc71915-b154-4731-b933-26dc63ebd6bc') trying to set state to Down when already Down (vm:351)
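
For context on the traceback: vm.py looks up vmxml.dom_text while completing the incoming migration, and the vmxml module in this vdsm build has no such attribute, so the attribute lookup itself raises AttributeError and the incoming VM is torn down. A minimal sketch of the same pattern in plain Python follows; the <vcpulimit> element name and XML layout are illustrative assumptions, not vdsm's actual schema:

import xml.etree.ElementTree as ET

# Illustrative domain XML; only VMs that carry a vCPU limit (e.g. set
# via a CPU QoS profile) enter the failing code path.
DOM_XML = """\
<domain>
  <metadata>
    <vcpulimit>50</vcpulimit>
  </metadata>
</domain>
"""

def update_vcpu_limit(dom_xml):
    root = ET.fromstring(dom_xml)
    limit = root.find('./metadata/vcpulimit')
    if limit is None:
        return None  # no vCPU limit set, so the failing call is never reached
    # vdsm does the equivalent of vmxml.dom_text(limit) here; because the
    # module has no 'dom_text' attribute, the lookup raises AttributeError
    # before the element is ever read. Reading the text directly shows
    # what the helper was meant to return:
    return limit.text

print(update_vcpu_limit(DOM_XML))  # -> '50'

The failure is thus a plain Python attribute error rather than corrupted migration state, which is why it reproduces on every migration that exercises the vCPU-limit path.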



Expected results:
Live migration of a VM on oVirt Node with the latest stable release should succeed.



Additional info:

/var/log/libvirt/qemu/<vm>.log (Source-Node)

2017-02-21 09:32:14.104+0000: initiating migration
2017-02-21 09:32:22.998+0000: shutting down
2017-02-21T09:32:23.144289Z qemu-kvm: terminating on signal 15 from pid 16029

/var/log/libvirt/qemu/<vm>.log (Dest-Node)

2017-02-21 09:32:13.901+0000: starting up libvirt version: 2.0.0, package: 10.el7_3.4 (CentOS BuildSystem <http://bugs.centos.org>, 2017-01-17-23:37:48, c1bm.rdu2.centos.org), qemu version: 2.6.0 (qemu-kvm-ev-2.6.0-28.el7_3.3.1), hostname: .com
LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin QEMU_AUDIO_DRV=spice /usr/libexec/qemu-kvm -name guest=.com,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-18-.com/master-key.aes -machine pc-i440fx-rhel7.2.0,accel=kvm,usb=off -cpu Nehalem -m size=2097152k,slots=16,maxmem=8388608k -realtime mlock=off -smp 2,maxcpus=16,sockets=16,cores=1,threads=1 -numa node,nodeid=0,cpus=0-1,mem=2048 -uuid 3dc71915-b154-4731-b933-26dc63ebd6bc -smbios 'type=1,manufacturer=oVirt,product=oVirt Node,version=7-3.1611.el7.centos,serial=36343035-3333-5A43-4A30-303730314259,uuid=3dc71915-b154-4731-b933-26dc63ebd6bc' -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-18-.com/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=2017-02-21T10:32:13,driftfix=slew -global kvm-pit.lost_tick_policy=discard -no-hpet -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-serial-pci,id=virtio-serial0,max_ports=16,bus=pci.0,addr=0x4 -drive if=none,id=drive-ide0-1-0,readonly=on -device ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -drive file=/rhev/data-center/2ea1031e-ebc3-4be9-b97f-bc8436745d45/4d256da4-f5f2-4f77-ab18-8d5264033c23/images/fe9c5408-1edc-4415-8396-f2436cac5f3c/8fe48b5c-96d1-46bf-bb1d-7dfd3f0a3007,format=raw,if=none,id=drive-virtio-disk0,serial=fe9c5408-1edc-4415-8396-f2436cac5f3c,cache=none,werror=stop,rerror=stop,aio=threads -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=30,id=hostnet0,vhost=on,vhostfd=31 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:1a:4a:16:01:56,bus=pci.0,addr=0x3,bootindex=2 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channels/3dc71915-b154-4731-b933-26dc63ebd6bc.com.redhat.rhevm.vdsm,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.rhevm.vdsm -chardev socket,id=charchannel1,path=/var/lib/libvirt/qemu/channels/3dc71915-b154-4731-b933-26dc63ebd6bc.org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 -chardev spicevmc,id=charchannel2,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=3,chardev=charchannel2,id=channel2,name=com.redhat.spice.0 -spice tls-port=5900,addr=172.xx.xx.xx,x509-dir=/etc/pki/vdsm/libvirt-spice,tls-channel=default,tls-channel=main,tls-channel=display,tls-channel=inputs,tls-channel=cursor,tls-channel=playback,tls-channel=record,tls-channel=smartcard,tls-channel=usbredir,seamless-migration=on -device qxl-vga,id=video0,ram_size=67108864,vram_size=8388608,vram64_size_mb=0,vgamem_mb=16,bus=pci.0,addr=0x2 -incoming defer -msg timestamp=on
2017-02-21T09:32:14.005364Z qemu-kvm: warning: CPU(s) not present in any NUMA nodes: 2 3 4 5 6 7 8 9 10 11 12 13 14 15
2017-02-21T09:32:14.005474Z qemu-kvm: warning: All CPU(s) up to maxcpus should be described in NUMA config
red_dispatcher_loadvm_commands:
2017-02-21T09:32:32.814759Z qemu-kvm: terminating on signal 15 from pid 2039
2017-02-21 09:32:33.015+0000: shutting down

Comment 1 Yaniv Kaul 2017-02-21 10:55:59 UTC
Can you please attach complete VDSM logs of source and destination?

Comment 2 Michal Skrivanek 2017-02-22 06:20:31 UTC
You seem to have shut down the guest in the middle of the migration. A few warnings are expected in that case.

Comment 3 spamcop 2017-02-22 08:42:52 UTC
Created attachment 1256377 [details]
vdsm.log dest

Comment 4 spamcop 2017-02-22 08:43:23 UTC
Created attachment 1256379 [details]
vdsm.log source

Comment 5 Michal Skrivanek 2017-02-22 08:46:54 UTC
(In reply to Michal Skrivanek from comment #2)
> You seem to have shut down the guest in the middle of the migration. A few
> warnings are expected in that case.

Any logs from the guest, then? It looks like a normal guest shutdown (initiated either by you or by the guest OS itself).

Comment 6 spamcop 2017-02-22 09:22:09 UTC
Created attachment 1256387 [details]
engine.log

oVirt engine log is also attached.

Search for:
- ovirt-node-source.tld
- ovirt-node-dest.tld
- ovirt-engine.tld
- vm.tld

Comment 7 spamcop 2017-02-22 09:29:58 UTC
Created attachment 1256388 [details]
VM guest agent log

Guest agent log is attached.

Installed Packages
Name        : ovirt-guest-agent-common
Arch        : noarch
Version     : 1.0.13
Release     : 1.20161220085008.git165fff1.el7.centos

Comment 8 Michal Skrivanek 2017-02-22 09:51:15 UTC
The guest agent log doesn't seem to be from the same time interval (the problem happened at 10:05, but the agent log starts at 10:26).
What I meant were the guest's system logs, such as /var/log/messages or the systemd journal: I suspect your guest simply decided to shut down normally, and it just so happens that it did so in the middle of the migration, so the shutdown seems unrelated to the migration itself.
Thanks!

Comment 9 spamcop 2017-02-22 10:31:46 UTC
Created attachment 1256408 [details]
Migration 11:17 AM

Comment 10 spamcop 2017-02-22 10:32:12 UTC
Created attachment 1256409 [details]
Migration 11:17 AM

Comment 11 spamcop 2017-02-22 10:32:34 UTC
Created attachment 1256411 [details]
Migration 11:17 AM

Comment 12 spamcop 2017-02-22 10:32:58 UTC
Michal: unfortunately there are no other log entries in /var/log/messages, dmesg, or journalctl, but there is a new one in vdsm-source-20170222-1117.log.

FYI: I just started another migration at 11:17 AM, so the timestamps in all the log files should now match. Sorry for causing the confusion.

Comment 13 Michal Skrivanek 2017-02-22 11:37:51 UTC
OK, thanks. In this case I can see that the VM stops running immediately after the migration is triggered. This is suspicious; it may be that the qemu process crashes as a direct consequence of starting the migration. Can you please dig out the corresponding systemd journal on the host, and the qemu log in /var/log/libvirt/qemu/<vmname>.log?

Comment 14 spamcop 2017-02-22 12:11:20 UTC
Created attachment 1256428 [details]
journal dest

Comment 15 spamcop 2017-02-22 12:11:46 UTC
Created attachment 1256429 [details]
journal source

Comment 16 spamcop 2017-02-22 12:12:14 UTC
Created attachment 1256430 [details]
qemu.log

Comment 17 spamcop 2017-02-24 09:37:24 UTC
Michal: is there any news concerning this issue?

Comment 18 Michal Skrivanek 2017-02-25 06:22:12 UTC
Bug 1426727 is likely the same thing. Do you use QoS or any CPU profiles?
When you see which PID is killing the VM, can you please get the process name, to check whether it is libvirt or something else?
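
For reference, a minimal sketch of such a PID-to-name lookup on the host (a plain /proc read, not a vdsm facility; it only works while the process is still alive):

def process_name(pid):
    # Read the short command name from /proc/<pid>/comm (Linux-only).
    try:
        with open('/proc/%d/comm' % pid) as f:
            return f.read().strip()
    except IOError:
        return None  # the process has already exited

# e.g. the source qemu log reports "terminating on signal 15 from pid 16029"
print(process_name(16029))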

Comment 19 spamcop 2017-02-28 08:16:00 UTC
Yes, this is exactly the same issue (maybe someone wants to mark this ticket as a duplicate).

Migration works again after renaming the cluster's CPU profile to "Default" and setting "QoS Name" to "[Unlimited]".

Comment 20 Michal Skrivanek 2017-02-28 10:48:25 UTC
Thanks for the confirmation; we now have a fix as well.

*** This bug has been marked as a duplicate of bug 1426727 ***

