Bug 1524119 - Failed to migrate HE VM
Summary: Failed to migrate HE VM
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: General
Version: 4.20.9.1
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ovirt-4.2.0
Target Release: 4.20.9.3
Assignee: Francesco Romani
QA Contact: Artyom
URL:
Whiteboard:
Duplicates: 1502768 1524989
Depends On: 1512534
Blocks: 1455169 1516712
 
Reported: 2017-12-10 08:17 UTC by Artyom
Modified: 2019-04-28 17:31 UTC (History)
CC List: 15 users

Fixed In Version: vdsm-4.20.9.3
Clone Of:
Environment:
Last Closed: 2017-12-20 11:16:17 UTC
oVirt Team: Virt
Embargoed:
gklein: ovirt-4.2+
gklein: blocker+


Attachments (Terms of Use)
logs (885.02 KB, application/zip)
2017-12-10 08:18 UTC, Artyom
no flags Details
hosted-engine dumpxml (8.05 KB, text/plain)
2017-12-12 10:29 UTC, Artyom
no flags Details
logs with debug mode (17.88 MB, application/zip)
2017-12-12 12:33 UTC, Artyom
no flags Details
new_vdsm_logs (5.67 MB, application/zip)
2017-12-13 07:23 UTC, Artyom
no flags Details


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 85370 0 master MERGED vm: devices: add `type` to Lease device 2020-12-23 02:26:52 UTC
oVirt gerrit 85372 0 ovirt-4.2.0 MERGED vm: devices: add `type` to Lease device 2020-12-23 02:26:54 UTC
oVirt gerrit 85446 0 master MERGED virt: graphics: displayNetwork may be absent 2020-12-23 02:26:54 UTC
oVirt gerrit 85457 0 ovirt-4.2.0 MERGED virt: graphics: displayNetwork may be absent 2020-12-23 02:26:54 UTC
oVirt gerrit 85459 0 master MERGED vm: graphics: don't inject displayNetwork=None 2020-12-23 02:26:54 UTC

Description Artyom 2017-12-10 08:17:14 UTC
Description of problem:
Cannot migrate the HE VM, neither via WebAdmin nor via the CLI (hosted-engine --set-maintenance --mode=local). On the source host I can see:
    if ret == -1: raise libvirtError ('virDomainMigrateToURI3() failed', dom=self)
libvirtError: Cannot access storage file '/var/run/vdsm/storage/7a9a1478-6a7c-465b-b0e5-598205be7faf/76b07256-29ce-4634-9f4a-db35d3bc61c6/0fa79d54-74ac-41bb-9936-c6ecdd947150' (as uid:107, gid:107): No such file or directory

Version-Release number of selected component (if applicable):
# rpm -qa |  grep vdsm
vdsm-client-4.20.9-1.el7ev.noarch
vdsm-4.20.9-1.el7ev.x86_64
vdsm-api-4.20.9-1.el7ev.noarch
vdsm-hook-openstacknet-4.20.9-1.el7ev.noarch
vdsm-hook-vhostmd-4.20.9-1.el7ev.noarch
vdsm-jsonrpc-4.20.9-1.el7ev.noarch
vdsm-hook-ethtool-options-4.20.9-1.el7ev.noarch
vdsm-yajsonrpc-4.20.9-1.el7ev.noarch
vdsm-python-4.20.9-1.el7ev.noarch
vdsm-hook-vmfex-dev-4.20.9-1.el7ev.noarch
vdsm-hook-vfio-mdev-4.20.9-1.el7ev.noarch
vdsm-http-4.20.9-1.el7ev.noarch
vdsm-hook-fcoe-4.20.9-1.el7ev.noarch

# rpm -qa |  grep hosted
ovirt-hosted-engine-ha-2.2.0-1.el7ev.noarch
ovirt-hosted-engine-setup-2.2.0-2.el7ev.noarch

How reproducible:
Always
We encountered the same issue in our automation environment.

Steps to Reproduce:
1. Deploy HE environment
2. Try to migrate HE VM via WebAdmin

Actual results:
HE VM fails to migrate

Expected results:
HE VM migrates successfully

Additional info:

Comment 1 Artyom 2017-12-10 08:18:46 UTC
Created attachment 1365499 [details]
logs

Comment 2 Yaniv Kaul 2017-12-12 08:57:57 UTC
Does it happen with the node zero deployment, the other way, or both?

Comment 3 Martin Sivák 2017-12-12 09:13:29 UTC
- What kind of storage was used?
- Can you check if the file is there?
- Can you check (using mount) if the storage domain is properly mounted?

Comment 4 Michal Skrivanek 2017-12-12 09:23:08 UTC
Also include vdsm.log from the time when the HE VM was started on that host, as it contains the exact libvirt definition. The qemu log would be nice too.

Comment 5 Artyom 2017-12-12 10:28:15 UTC
- NFS


- The file does not exist under the /var/run directory, but it does exist under the mount directory
# cd /var/run/vdsm/storage/cd6f0e9d-07fb-4751-8eb4-69a4d1497fc4/09ad8671-2448-46b0-8f83-57d821d3fff8/
-bash: cd: /var/run/vdsm/storage/cd6f0e9d-07fb-4751-8eb4-69a4d1497fc4/09ad8671-2448-46b0-8f83-57d821d3fff8/: No such file or directory
# pwd
/rhev/data-center/mnt/vserver-nas01.qa.lab.tlv.redhat.com:_nas01_ge__he__2__he/cd6f0e9d-07fb-4751-8eb4-69a4d1497fc4/images/09ad8671-2448-46b0-8f83-57d821d3fff8
# ll
total 4009548
-rw-rw----. 1 vdsm kvm 53687091200 Dec 12 12:20 61d3bcf5-3415-4539-a835-50eea4f8c2b8
-rw-rw----. 1 vdsm kvm     1048576 Dec 12 11:11 61d3bcf5-3415-4539-a835-50eea4f8c2b8.lease
-rw-r--r--. 1 vdsm kvm         285 Dec 11 13:02 61d3bcf5-3415-4539-a835-50eea4f8c2b8.meta


- I can see the mount of the HE storage on the host:
vserver-nas01.qa.lab.tlv.redhat.com:/nas01/ge_he_2_he on /rhev/data-center/mnt/vserver-nas01.qa.lab.tlv.redhat.com:_nas01_ge__he__2__he type nfs (rw,relatime,vers=3,rsize=65536,wsize=65536,namlen=255,soft,nosharecache,proto=tcp,timeo=600,retrans=6,sec=sys,mountaddr=10.35.66.202,mountvers=3,mountport=635,mountproto=udp,local_lock=none,addr=10.35.66.202)
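
For reference, a minimal check of this situation (a hypothetical helper, not vdsm code; the UUIDs and mount path below are copied from the output above) could look like this:

import os

# UUIDs and mount path copied from the diagnostic output above (hypothetical helper, not vdsm code)
SD_UUID = "cd6f0e9d-07fb-4751-8eb4-69a4d1497fc4"
IMG_UUID = "09ad8671-2448-46b0-8f83-57d821d3fff8"

run_link = os.path.join("/var/run/vdsm/storage", SD_UUID, IMG_UUID)
mount_dir = ("/rhev/data-center/mnt/"
             "vserver-nas01.qa.lab.tlv.redhat.com:_nas01_ge__he__2__he/"
             "%s/images/%s" % (SD_UUID, IMG_UUID))

# The path under /var/run/vdsm/storage is the one libvirt/qemu uses;
# the real files live under the NFS mount in /rhev/data-center/mnt.
print("run link exists:  %s" % os.path.islink(run_link))
print("mount dir exists: %s" % os.path.isdir(mount_dir))
if os.path.islink(run_link):
    print("run link target:  %s" % os.readlink(run_link))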


QEMU logs
2017-12-12T08:48:48.132890Z qemu-kvm: terminating on signal 15 from pid 1067 (<unknown process>)
2017-12-12 09:14:57.575+0000: starting up libvirt version: 3.2.0, package: 14.el7_4.4 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2017-11-06-09:44:21, x86-039.build.eng.bos.redhat.com), qemu version: 2.9.0(qemu-kvm-rhev-2.9.0-16.el7_4.12), hostname: puma29.scl.lab.tlv.redhat.com
LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin QEMU_AUDIO_DRV=none /usr/libexec/qemu-kvm -name guest=HostedEngine,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-67-HostedEngine/master-key.aes -machine rhel6.5.0,accel=kvm,usb=off,dump-guest-core=off -cpu Westmere,x2apic=on,hypervisor=on -m 16384 -realtime mlock=off -smp 4,maxcpus=6,sockets=6,cores=1,threads=1 -uuid 24128148-d2b8-4c5b-959c-10f55a43676d -smbios 'type=1,manufacturer=Red Hat,product=RHEV Hypervisor,version=7.4-18.el7,serial=30393036-3539-5A43-3332-313142363550,uuid=24128148-d2b8-4c5b-959c-10f55a43676d' -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-67-HostedEngine/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=2017-12-12T09:14:56,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-reboot -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x4 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 -drive if=none,id=drive-ide0-1-0,readonly=on -device ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -drive file=/var/run/vdsm/storage/cd6f0e9d-07fb-4751-8eb4-69a4d1497fc4/09ad8671-2448-46b0-8f83-57d821d3fff8/61d3bcf5-3415-4539-a835-50eea4f8c2b8,format=raw,if=none,id=drive-virtio-disk0,serial=09ad8671-2448-46b0-8f83-57d821d3fff8,cache=none,werror=stop,rerror=stop,aio=threads -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=31,id=hostnet0,vhost=on,vhostfd=33 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:1a:4a:16:27:30,bus=pci.0,addr=0x3 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channels/24128148-d2b8-4c5b-959c-10f55a43676d.com.redhat.rhevm.vdsm,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.rhevm.vdsm -chardev socket,id=charchannel1,path=/var/lib/libvirt/qemu/channels/24128148-d2b8-4c5b-959c-10f55a43676d.org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 -chardev socket,id=charchannel2,path=/var/lib/libvirt/qemu/channels/24128148-d2b8-4c5b-959c-10f55a43676d.org.ovirt.hosted-engine-setup.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=3,chardev=charchannel2,id=channel2,name=org.ovirt.hosted-engine-setup.0 -chardev pty,id=charconsole0 -device virtconsole,chardev=charconsole0,id=console0 -vnc 0:0,password -device VGA,id=video0,vgamem_mb=32,bus=pci.0,addr=0x2 -incoming defer -object rng-random,id=objrng0,filename=/dev/urandom -device virtio-rng-pci,rng=objrng0,id=rng0,bus=pci.0,addr=0x7 -msg timestamp=on
2017-12-12 09:14:57.583+0000: 11993: info : libvirt version: 3.2.0, package: 14.el7_4.4 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2017-11-06-09:44:21, x86-039.build.eng.bos.redhat.com)
2017-12-12 09:14:57.583+0000: 11993: info : hostname: puma29.scl.lab.tlv.redhat.com
2017-12-12 09:14:57.583+0000: 11993: info : virObjectUnref:259 : OBJECT_UNREF: obj=0x7ffaec114f80
2017-12-12T09:14:57.814558Z qemu-kvm: -chardev pty,id=charconsole0: char device redirected to /dev/pts/0 (label charconsole0)

Comment 6 Artyom 2017-12-12 10:29:39 UTC
Created attachment 1366545 [details]
hosted-engine dumpxml

Comment 7 Artyom 2017-12-12 12:33:12 UTC
Created attachment 1366624 [details]
logs with debug mode

Updated logs with DEBUG mode. I see that the problem re-occurred after we had a short connectivity problem with VDSM.
It is also strange that we have a lot of lines for the same JSONRPC message at the same time:

Thread-6::DEBUG::2017-12-12 14:14:03,601::stompreactor::479::jsonrpc.AsyncoreClient::(send) Sending response
Thread-6::DEBUG::2017-12-12 14:14:03,609::util::374::ovirt_hosted_engine_ha.lib.util._EventDistributor::(__log_debug) VDSM jsonrpc connection is not ready
Thread-6::ERROR::2017-12-12 14:14:03,609::util::274::ovirt_hosted_engine_ha.lib.util._EventDistributor::(_thread_func) Couldn't  connect to VDSM within 15 seconds
Thread-4::DEBUG::2017-12-12 14:14:07,384::cpu_load_no_engine::76::cpu_load_no_engine.CpuLoadNoEngine::(refresh_ticks) Ticks: total=199397071, busy=1876509
Thread-4::DEBUG::2017-12-12 14:14:07,437::util::374::cpu_load_no_engine.CpuLoadNoEngine::(__log_debug) Creating a new json-rpc connection to VDSM
Thread-4::DEBUG::2017-12-12 14:14:07,569::util::374::cpu_load_no_engine.CpuLoadNoEngine::(__log_debug) Waiting for VDSM to connect
Thread-4::DEBUG::2017-12-12 14:14:08,706::util::374::cpu_load_no_engine.CpuLoadNoEngine::(__log_debug) Waiting for VDSM to connect
Client localhost:54321::DEBUG::2017-12-12 14:14:09,048::stomp::617::yajsonrpc.protocols.stomp.AsyncClient::(handle_timeout) Timeout occurred, trying to reconnect
Client localhost:54321::DEBUG::2017-12-12 14:14:09,049::stomp::617::yajsonrpc.protocols.stomp.AsyncClient::(handle_timeout) Timeout occurred, trying to reconnect
Client localhost:54321::DEBUG::2017-12-12 14:14:09,051::stomp::617::yajsonrpc.protocols.stomp.AsyncClient::(handle_timeout) Timeout occurred, trying to reconnect
Client localhost:54321::DEBUG::2017-12-12 14:14:09,052::stomp::617::yajsonrpc.protocols.stomp.AsyncClient::(handle_timeout) Timeout occurred, trying to reconnect
Client localhost:54321::DEBUG::2017-12-12 14:14:09,052::stomp::617::yajsonrpc.protocols.stomp.AsyncClient::(handle_timeout) Timeout occurred, trying to reconnect
Client localhost:54321::DEBUG::2017-12-12 14:14:09,052::stomp::617::yajsonrpc.protocols.stomp.AsyncClient::(handle_timeout) Timeout occurred, trying to reconnect
Client localhost:54321::DEBUG::2017-12-12 14:14:09,053::stomp::617::yajsonrpc.protocols.stomp.AsyncClient::(handle_timeout) Timeout occurred, trying to reconnect
Client localhost:54321::DEBUG::2017-12-12 14:14:09,053::stomp::617::yajsonrpc.protocols.stomp.AsyncClient::(handle_timeout) Timeout occurred, trying to reconnect
Client localhost:54321::DEBUG::2017-12-12 14:14:09,055::stomp::617::yajsonrpc.protocols.stomp.AsyncClient::(handle_timeout) Timeout occurred, trying to reconnect
Client localhost:54321::DEBUG::2017-12-12 14:14:09,056::stomp::617::yajsonrpc.protocols.stomp.AsyncClient::(handle_timeout) Timeout occurred, trying to reconnect
Client localhost:54321::DEBUG::2017-12-12 14:14:09,057::stomp::617::yajsonrpc.protocols.stomp.AsyncClient::(handle_timeout) Timeout occurred, trying to reconnect
Client localhost:54321::DEBUG::2017-12-12 14:14:09,058::stomp::617::yajsonrpc.protocols.stomp.AsyncClient::(handle_timeout) Timeout occurred, trying to reconnect
Client localhost:54321::DEBUG::2017-12-12 14:14:09,058::stomp::617::yajsonrpc.protocols.stomp.AsyncClient::(handle_timeout) Timeout occurred, trying to reconnect
Client localhost:54321::DEBUG::2017-12-12 14:14:09,059::stomp::617::yajsonrpc.protocols.stomp.AsyncClient::(handle_timeout) Timeout occurred, trying to reconnect
Client localhost:54321::DEBUG::2017-12-12 14:14:09,060::stomp::617::yajsonrpc.protocols.stomp.AsyncClient::(handle_timeout) Timeout occurred, trying to reconnect
Client localhost:54321::DEBUG::2017-12-12 14:14:09,061::stomp::617::yajsonrpc.protocols.stomp.AsyncClient::(handle_timeout) Timeout occurred, trying to reconnect
Client localhost:54321::DEBUG::2017-12-12 14:14:09,061::stomp::617::yajsonrpc.protocols.stomp.AsyncClient::(handle_timeout) Timeout occurred, trying to reconnect
Client localhost:54321::DEBUG::2017-12-12 14:14:09,062::stomp::617::yajsonrpc.protocols.stomp.AsyncClient::(handle_timeout) Timeout occurred, trying to reconnect
...more

Comment 8 Denis Chaplygin 2017-12-12 13:04:45 UTC
Looks like vdsm (died and) restarted, and on start it removed all the symlinks during cleanup. At the same time the broker, for some reason, failed to restart correctly and restore the symlinks on its startup. This should not happen very often; at least we do not expect vdsm to restart at all, and in my tests the broker was always restarting successfully alongside vdsm. Therefore, even though we have a good reproducer, I don't think it is a blocker for 4.2, as it should happen rarely.
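
The expected recovery flow would be roughly the sketch below (a hypothetical outline, not the actual broker code; prepare_image stands in for whatever vdsm call the broker uses to prepare an image):

import os

def restore_missing_links(sd_uuid, img_uuids, prepare_image):
    # Hypothetical outline: after a vdsm restart wiped /var/run/vdsm/storage,
    # ask vdsm to prepare every HE image again so the symlinks are recreated.
    # `prepare_image` is a stand-in for the real vdsm image-preparation call.
    storage_dir = os.path.join("/var/run/vdsm/storage", sd_uuid)
    for img_uuid in img_uuids:
        if not os.path.islink(os.path.join(storage_dir, img_uuid)):
            prepare_image(sd_uuid, img_uuid)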

Comment 9 Martin Sivák 2017-12-12 14:40:17 UTC
So, the VM uses an NFS disk with this path:

<source file='/var/run/vdsm/storage/cd6f0e9d-07fb-4751-8eb4-69a4d1497fc4/09ad8671-2448-46b0-8f83-57d821d3fff8/61d3bcf5-3415-4539-a835-50eea4f8c2b8'/>


And the path does not exist on the destination when vdsm dies and is restarted. How come VDSM does not prepare the symlink during migration? And it seems that neither does the engine, since the migration also fails when the engine handles it.

Comment 10 Francesco Romani 2017-12-12 15:05:51 UTC
(In reply to Martin Sivák from comment #9)
> So, the VM uses an NFS disk with this path:
> 
> <source
> file='/var/run/vdsm/storage/cd6f0e9d-07fb-4751-8eb4-69a4d1497fc4/09ad8671-
> 2448-46b0-8f83-57d821d3fff8/61d3bcf5-3415-4539-a835-50eea4f8c2b8'/>
> 
> 
> And the path does not exist on the destination when vdsm dies and is
> restarted. How comes VDSM does not prepare the symlink during migration? And
> neither does engine it seems since it also fails when engine handles the
> migration.

It totally should, nothing has changed in this regard.

Does this mean that any migration involving NFS disks is broken? Or is it only hosted engine (and if so, why is it special?)

Also, which was the last 4.20.z version that worked?

Comment 11 Michal Skrivanek 2017-12-12 15:15:59 UTC
after reviewing the conditions it seems as a duplicate of bug 1522901

please retry with fixed build

Comment 17 Nir Soffer 2017-12-12 16:18:02 UTC
(In reply to Martin Sivák from comment #9)
> So, the VM uses an NFS disk with this path:
> 
> <source
> file='/var/run/vdsm/storage/cd6f0e9d-07fb-4751-8eb4-69a4d1497fc4/09ad8671-
> 2448-46b0-8f83-57d821d3fff8/61d3bcf5-3415-4539-a835-50eea4f8c2b8'/>
> 
> 
> And the path does not exist on the destination when vdsm dies and is
> restarted.

What do you mean does not exist on the destination?

> How comes VDSM does not prepare the symlink during migration? And
> neither does engine it seems since it also fails when engine handles the
> migration.

Vdsm creates the symlinks in /run/vdsm/storage when disks are prepared. If the symlink was not created, maybe the disk was not prepared.
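
Conceptually, preparing a file-based disk provides a stable link like the one in the sketch below (a simplified illustration under the path layout seen in this bug, not the actual vdsm implementation):

import os

def prepare_disk_symlink(sd_uuid, img_uuid, mount_images_dir):
    # Simplified illustration: "preparing" a file-based disk leaves a stable
    # /run/vdsm/storage/<sd>/<img> symlink pointing at the image directory
    # on the mounted storage domain, and that is the path handed to libvirt.
    link_dir = os.path.join("/run/vdsm/storage", sd_uuid)
    link = os.path.join(link_dir, img_uuid)
    if not os.path.isdir(link_dir):
        os.makedirs(link_dir)
    if not os.path.islink(link):
        os.symlink(os.path.join(mount_images_dir, img_uuid), link)
    return link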

Comment 19 Artyom 2017-12-12 17:00:05 UTC
With the new VDSM I have another error on the destination host:
2017-12-12 18:57:41,973+0200 INFO  (vm/346b8d80) [virt.vm] (vmId='346b8d80-b4f2-4a81-bf60-a662220c27a4') Enabling drive monitoring (drivemonitor:53)
2017-12-12 18:57:41,974+0200 ERROR (vm/346b8d80) [virt.vm] (vmId='346b8d80-b4f2-4a81-bf60-a662220c27a4') The vm start process failed (vm:914)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 843, in _startUnderlyingVm
    self._run()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2632, in _run
    self._devices = self._make_devices()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2595, in _make_devices
    if dev.type != hwclass.DISK:
AttributeError: type

Comment 20 Yaniv Kaul 2017-12-12 17:13:42 UTC
(In reply to Artyom from comment #19)
> With new VDSM I have another error on destination host
> 2017-12-12 18:57:41,973+0200 INFO  (vm/346b8d80) [virt.vm]
> (vmId='346b8d80-b4f2-4a81-bf60-a662220c27a4') Enabling drive monitoring
> (drivemonitor:53)
> 2017-12-12 18:57:41,974+0200 ERROR (vm/346b8d80) [virt.vm]
> (vmId='346b8d80-b4f2-4a81-bf60-a662220c27a4') The vm start process failed
> (vm:914)
> Traceback (most recent call last):
>   File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 843, in
> _startUnderlyingVm
>     self._run()
>   File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2632, in _run
>     self._devices = self._make_devices()
>   File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2595, in
> _make_devices
>     if dev.type != hwclass.DISK:
> AttributeError: type

And? How does that affect migration? Is that a different BZ?

Comment 21 Michal Skrivanek 2017-12-12 17:29:57 UTC
please supply source and destination vdsm log containing VM start and the migration

Comment 22 Artyom 2017-12-13 07:23:46 UTC
Created attachment 1367106 [details]
new_vdsm_logs

Source host alma07
2017-12-13 08:51:08,912+0200 ERROR (migsrc/346b8d80) [virt.vm] (vmId='346b8d80-b4f2-4a81-bf60-a662220c27a4') migration destination error: Error creating the requested VM (migration:290)

Destination host master-vds10
2017-12-13 08:50:32,957+0200 ERROR (vm/346b8d80) [virt.vm] (vmId='346b8d80-b4f2-4a81-bf60-a662220c27a4') The vm start process failed (vm:914)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 843, in _startUnderlyingVm
    self._run()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2632, in _run
    self._devices = self._make_devices()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2595, in _make_devices
    if dev.type != hwclass.DISK:
AttributeError: type

Comment 23 Artyom 2017-12-13 07:25:36 UTC
Migration failed because the VM could not be started on the destination host.

Comment 24 Francesco Romani 2017-12-13 08:30:02 UTC
(In reply to Artyom from comment #22)
> Created attachment 1367106 [details]
> new_vdsm_logs
> 
> Source host alma07
> 2017-12-13 08:51:08,912+0200 ERROR (migsrc/346b8d80) [virt.vm]
> (vmId='346b8d80-b4f2-4a81-bf60-a662220c27a4') migration destination error:
> Error creating the requested VM (migration:290)
> 
> Destination host master-vds10
> 2017-12-13 08:50:32,957+0200 ERROR (vm/346b8d80) [virt.vm]
> (vmId='346b8d80-b4f2-4a81-bf60-a662220c27a4') The vm start process failed
> (vm:914)
> Traceback (most recent call last):
>   File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 843, in
> _startUnderlyingVm
>     self._run()
>   File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2632, in _run
>     self._devices = self._make_devices()
>   File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2595, in
> _make_devices
>     if dev.type != hwclass.DISK:
> AttributeError: type

Unfortunately this is a new bug, exposed by https://gerrit.ovirt.org/#/c/85200/

Fix available: https://gerrit.ovirt.org/#/c/85370/
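
Based on the traceback above and on the patch title ("vm: devices: add `type` to Lease device"), the failure and the shape of the fix are roughly as follows (a simplified illustration, not the actual vdsm source):

# Simplified illustration of the AttributeError and the fix; not the real vdsm classes.

DISK = "disk"
LEASE = "lease"

class Lease(object):
    # The HE VM carries a lease device. Before the fix this class did not
    # define `type`, so the check in Vm._make_devices() below raised
    # "AttributeError: type". The conceptual fix is to expose the device class:
    type = LEASE

def non_disk_devices(devices):
    # Shape of the check that failed: every device object must have `type`.
    return [dev for dev in devices if dev.type != DISK]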

Comment 25 Doron Fediuck 2017-12-13 10:30:05 UTC
*** Bug 1524989 has been marked as a duplicate of this bug. ***

Comment 26 Martin Sivák 2017-12-13 13:14:26 UTC
*** Bug 1524989 has been marked as a duplicate of this bug. ***

Comment 27 Artyom 2017-12-14 11:14:05 UTC
Checked on:
# rpm -qa | grep vdsm
vdsm-hook-vmfex-dev-4.20.9.2-1.el7ev.noarch
vdsm-hook-openstacknet-4.20.9.2-1.el7ev.noarch
vdsm-api-4.20.9.2-1.el7ev.noarch
vdsm-jsonrpc-4.20.9.2-1.el7ev.noarch
vdsm-hook-fcoe-4.20.9.2-1.el7ev.noarch
vdsm-hook-vhostmd-4.20.9.2-1.el7ev.noarch
vdsm-yajsonrpc-4.20.9.2-1.el7ev.noarch
vdsm-4.20.9.2-1.el7ev.x86_64
vdsm-client-4.20.9.2-1.el7ev.noarch
vdsm-python-4.20.9.2-1.el7ev.noarch
vdsm-hook-ethtool-options-4.20.9.2-1.el7ev.noarch
vdsm-http-4.20.9.2-1.el7ev.noarch
vdsm-hook-vfio-mdev-4.20.9.2-1.el7ev.noarch

# rpm -qa | grep hosted
ovirt-hosted-engine-ha-2.2.1-1.el7ev.noarch
ovirt-hosted-engine-setup-2.2.1-1.el7ev.noarch

I still have the issue that I described in the bug description
2017-12-14 13:01:14,866+0200 ERROR (migsrc/346b8d80) [virt.vm] (vmId='346b8d80-b4f2-4a81-bf60-a662220c27a4') Failed to migrate (migration:455)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 437, in _regular_run
    self._startUnderlyingMigration(time.time())
  File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 510, in _startUnderlyingMigration
    self._perform_with_downtime_thread(duri, muri)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 579, in _perform_with_downtime_thread
    self._perform_migration(duri, muri)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 528, in _perform_migration
    self._migration_flags)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 98, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 126, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 512, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1679, in migrateToURI3
    if ret == -1: raise libvirtError ('virDomainMigrateToURI3() failed', dom=self)
libvirtError: Cannot access storage file '/var/run/vdsm/storage/7a9a1478-6a7c-465b-b0e5-598205be7faf/76b07256-29ce-4634-9f4a-db35d3bc61c6/0fa79d54-74ac-41bb-9936-c6ecdd947150' (as uid:107, gid:107): No such file or directory
2017-12-14 13:01:14,905+0200 INFO  (jsonrpc/3) [api.virt] START getMigrationStatus() from=::ffff:10.35.72.159,58386 (api:46)
2017-12-14 13:01:14,905+0200 INFO  (jsonrpc/3) [api.virt] FINISH getMigrationStatus return={'status': {'message': 'Done', 'code': 0}, 'migrationStats': {'status': {'message': 'Fatal error during migration', 'code': 12}, 'progress': 0}} from=::ffff:10.35.72.159,58386 (api:52)


I think it is not connected to a vdsmd restart:

# date
Thu Dec 14 13:08:29 IST 2017
# systemctl status vdsmd
● vdsmd.service - Virtual Desktop Server Manager
   Loaded: loaded (/usr/lib/systemd/system/vdsmd.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2017-12-14 09:22:05 IST; 3h 46min ago

ovirt-ha-broker log:
MainThread::INFO::2017-12-14 12:56:07,182::storage_backends::355::ovirt_hosted_engine_ha.lib.storage_backends::(connect) Preparing images
MainThread::INFO::2017-12-14 12:56:07,182::image::126::ovirt_hosted_engine_ha.lib.image.Image::(prepare_images) Preparing images
MainThread::DEBUG::2017-12-14 12:56:07,182::stompreactor::479::jsonrpc.AsyncoreClient::(send) Sending response
MainThread::DEBUG::2017-12-14 12:56:07,186::stompreactor::479::jsonrpc.AsyncoreClient::(send) Sending response
MainThread::DEBUG::2017-12-14 12:56:07,193::image::104::ovirt_hosted_engine_ha.lib.image.Image::(get_images_list) getImagesList: ['45276ef6-0b41-4cde-a614-6651d0984302', '2b3ea455-941f-4b84-8821-62431afeec20', '76b07256-29ce-4634-9f4a-db35d3bc61c6', 'a405d488-cc8e-490d-a2cf-aa7507306453', '221b120c-f326-4340-9e78-61c9e55bace9', '0c7c0bfc-e32f-42f0-aad0-d4526df918ad']
MainThread::DEBUG::2017-12-14 12:56:07,193::stompreactor::479::jsonrpc.AsyncoreClient::(send) Sending response
MainThread::DEBUG::2017-12-14 12:56:07,201::image::140::ovirt_hosted_engine_ha.lib.image.Image::(prepare_images) ['46e91ef3-1ee7-459f-bc1e-d5a79fa9014d']
MainThread::DEBUG::2017-12-14 12:56:07,201::image::156::ovirt_hosted_engine_ha.lib.image.Image::(prepare_images) Prepare image 00000000-0000-0000-0000-000000000000 7a9a1478-6a7c-465b-b0e5-598205be7faf 45276ef6-0b41-4cde-a614-6651d0984302 46e91ef3-1ee7-459f-bc1e-d5a79fa9014d
MainThread::DEBUG::2017-12-14 12:56:07,201::stompreactor::479::jsonrpc.AsyncoreClient::(send) Sending response
MainThread::DEBUG::2017-12-14 12:56:07,216::stompreactor::479::jsonrpc.AsyncoreClient::(send) Sending response
MainThread::DEBUG::2017-12-14 12:56:07,224::image::140::ovirt_hosted_engine_ha.lib.image.Image::(prepare_images) ['72297b7d-ab36-4511-acfe-28e4d4a7e165']
MainThread::DEBUG::2017-12-14 12:56:07,224::image::156::ovirt_hosted_engine_ha.lib.image.Image::(prepare_images) Prepare image 00000000-0000-0000-0000-000000000000 7a9a1478-6a7c-465b-b0e5-598205be7faf 2b3ea455-941f-4b84-8821-62431afeec20 72297b7d-ab36-4511-acfe-28e4d4a7e165
MainThread::DEBUG::2017-12-14 12:56:07,224::stompreactor::479::jsonrpc.AsyncoreClient::(send) Sending response
MainThread::DEBUG::2017-12-14 12:56:07,240::stompreactor::479::jsonrpc.AsyncoreClient::(send) Sending response
MainThread::DEBUG::2017-12-14 12:56:07,247::image::140::ovirt_hosted_engine_ha.lib.image.Image::(prepare_images) ['0fa79d54-74ac-41bb-9936-c6ecdd947150']
MainThread::DEBUG::2017-12-14 12:56:07,247::image::156::ovirt_hosted_engine_ha.lib.image.Image::(prepare_images) Prepare image 00000000-0000-0000-0000-000000000000 7a9a1478-6a7c-465b-b0e5-598205be7faf 76b07256-29ce-4634-9f4a-db35d3bc61c6 0fa79d54-74ac-41bb-9936-c6ecdd947150
MainThread::DEBUG::2017-12-14 12:56:07,247::stompreactor::479::jsonrpc.AsyncoreClient::(send) Sending response
MainThread::DEBUG::2017-12-14 12:56:07,263::stompreactor::479::jsonrpc.AsyncoreClient::(send) Sending response
MainThread::DEBUG::2017-12-14 12:56:07,270::image::140::ovirt_hosted_engine_ha.lib.image.Image::(prepare_images) ['bbb4b206-7e37-46d0-b80a-4d6161c99c94']
MainThread::DEBUG::2017-12-14 12:56:07,270::image::156::ovirt_hosted_engine_ha.lib.image.Image::(prepare_images) Prepare image 00000000-0000-0000-0000-000000000000 7a9a1478-6a7c-465b-b0e5-598205be7faf a405d488-cc8e-490d-a2cf-aa7507306453 bbb4b206-7e37-46d0-b80a-4d6161c99c94
MainThread::DEBUG::2017-12-14 12:56:07,270::stompreactor::479::jsonrpc.AsyncoreClient::(send) Sending response
MainThread::DEBUG::2017-12-14 12:56:07,286::stompreactor::479::jsonrpc.AsyncoreClient::(send) Sending response
MainThread::DEBUG::2017-12-14 12:56:07,293::image::140::ovirt_hosted_engine_ha.lib.image.Image::(prepare_images) ['f864d1d1-f32a-4bea-bd28-c6bcc1ccd51c']
MainThread::DEBUG::2017-12-14 12:56:07,293::image::156::ovirt_hosted_engine_ha.lib.image.Image::(prepare_images) Prepare image 00000000-0000-0000-0000-000000000000 7a9a1478-6a7c-465b-b0e5-598205be7faf 221b120c-f326-4340-9e78-61c9e55bace9 f864d1d1-f32a-4bea-bd28-c6bcc1ccd51c
MainThread::DEBUG::2017-12-14 12:56:07,294::stompreactor::479::jsonrpc.AsyncoreClient::(send) Sending response
MainThread::DEBUG::2017-12-14 12:56:07,309::stompreactor::479::jsonrpc.AsyncoreClient::(send) Sending response
MainThread::DEBUG::2017-12-14 12:56:07,316::image::140::ovirt_hosted_engine_ha.lib.image.Image::(prepare_images) ['fe1a3c75-5581-46fe-9b36-5f98bac0e03c']
MainThread::DEBUG::2017-12-14 12:56:07,317::image::156::ovirt_hosted_engine_ha.lib.image.Image::(prepare_images) Prepare image 00000000-0000-0000-0000-000000000000 7a9a1478-6a7c-465b-b0e5-598205be7faf 0c7c0bfc-e32f-42f0-aad0-d4526df918ad fe1a3c75-5581-46fe-9b36-5f98bac0e03c

Comment 28 Artyom 2017-12-14 11:17:41 UTC
Looks like I found a way to reproduce it:

The HE VM runs on host_1.

1) Restart ovirt-ha-broker
The host has the correct symlinks under /var/run/vdsm/storage:
# ll
total 24
lrwxrwxrwx. 1 vdsm kvm 175 Dec  7 14:35 0c7c0bfc-e32f-42f0-aad0-d4526df918ad -> /rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Compute__NFS_alukiano_compute-ge-he-1/7a9a1478-6a7c-465b-b0e5-598205be7faf/images/0c7c0bfc-e32f-42f0-aad0-d4526df918ad
lrwxrwxrwx. 1 vdsm kvm 175 Dec  7 14:35 221b120c-f326-4340-9e78-61c9e55bace9 -> /rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Compute__NFS_alukiano_compute-ge-he-1/7a9a1478-6a7c-465b-b0e5-598205be7faf/images/221b120c-f326-4340-9e78-61c9e55bace9
lrwxrwxrwx. 1 vdsm kvm 175 Dec  7 14:35 2b3ea455-941f-4b84-8821-62431afeec20 -> /rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Compute__NFS_alukiano_compute-ge-he-1/7a9a1478-6a7c-465b-b0e5-598205be7faf/images/2b3ea455-941f-4b84-8821-62431afeec20
lrwxrwxrwx. 1 vdsm kvm 175 Dec 12 15:03 45276ef6-0b41-4cde-a614-6651d0984302 -> /rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Compute__NFS_alukiano_compute-ge-he-1/7a9a1478-6a7c-465b-b0e5-598205be7faf/images/45276ef6-0b41-4cde-a614-6651d0984302
lrwxrwxrwx. 1 vdsm kvm 175 Dec 14 13:11 76b07256-29ce-4634-9f4a-db35d3bc61c6 -> /rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Compute__NFS_alukiano_compute-ge-he-1/7a9a1478-6a7c-465b-b0e5-598205be7faf/images/76b07256-29ce-4634-9f4a-db35d3bc61c6
lrwxrwxrwx. 1 vdsm kvm 175 Dec 12 15:03 a405d488-cc8e-490d-a2cf-aa7507306453 -> /rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Compute__NFS_alukiano_compute-ge-he-1/7a9a1478-6a7c-465b-b0e5-598205be7faf/images/a405d488-cc8e-490d-a2cf-aa7507306453

2) Migrate HE VM - PASS
3) Recheck /var/run/vdsm/storage
# ll
total 20
lrwxrwxrwx. 1 vdsm kvm 175 Dec  7 14:35 0c7c0bfc-e32f-42f0-aad0-d4526df918ad -> /rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Compute__NFS_alukiano_compute-ge-he-1/7a9a1478-6a7c-465b-b0e5-598205be7faf/images/0c7c0bfc-e32f-42f0-aad0-d4526df918ad
lrwxrwxrwx. 1 vdsm kvm 175 Dec  7 14:35 221b120c-f326-4340-9e78-61c9e55bace9 -> /rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Compute__NFS_alukiano_compute-ge-he-1/7a9a1478-6a7c-465b-b0e5-598205be7faf/images/221b120c-f326-4340-9e78-61c9e55bace9
lrwxrwxrwx. 1 vdsm kvm 175 Dec  7 14:35 2b3ea455-941f-4b84-8821-62431afeec20 -> /rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Compute__NFS_alukiano_compute-ge-he-1/7a9a1478-6a7c-465b-b0e5-598205be7faf/images/2b3ea455-941f-4b84-8821-62431afeec20
lrwxrwxrwx. 1 vdsm kvm 175 Dec 12 15:03 45276ef6-0b41-4cde-a614-6651d0984302 -> /rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Compute__NFS_alukiano_compute-ge-he-1/7a9a1478-6a7c-465b-b0e5-598205be7faf/images/45276ef6-0b41-4cde-a614-6651d0984302
lrwxrwxrwx. 1 vdsm kvm 175 Dec 12 15:03 a405d488-cc8e-490d-a2cf-aa7507306453 -> /rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Compute__NFS_alukiano_compute-ge-he-1/7a9a1478-6a7c-465b-b0e5-598205be7faf/images/a405d488-cc8e-490d-a2cf-aa7507306453

Looks like the symlink for image 76b07256-29ce-4634-9f4a-db35d3bc61c6 disappears after the HE VM migration.

Comment 29 Francesco Romani 2017-12-14 11:51:51 UTC
Please attach vdsm logs from both the source and the destination side of the failed migration.

Comment 30 Michal Skrivanek 2017-12-14 12:53:02 UTC
So what did you reproduce, actually? In comment #28 you say that the migration passed - so is this a new issue for another bug?

Comment 31 Artyom 2017-12-14 13:30:47 UTC
In https://bugzilla.redhat.com/show_bug.cgi?id=1524119#c28 I gave exact steps to reproduce the initial problem with HE VM migration.

The migration passed only once, right after the ovirt-ha-broker restart; if I try to migrate the HE VM again from the same host, I receive:
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 437, in _regular_run
    self._startUnderlyingMigration(time.time())
  File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 510, in _startUnderlyingMigration
    self._perform_with_downtime_thread(duri, muri)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 579, in _perform_with_downtime_thread
    self._perform_migration(duri, muri)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 528, in _perform_migration
    self._migration_flags)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 98, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 126, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 512, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1679, in migrateToURI3
    if ret == -1: raise libvirtError ('virDomainMigrateToURI3() failed', dom=self)
libvirtError: Cannot access storage file '/var/run/vdsm/storage/7a9a1478-6a7c-465b-b0e5-598205be7faf/76b07256-29ce-4634-9f4a-db35d3bc61c6/0fa79d54-74ac-41bb-9936-c6ecdd947150' (as uid:107, gid:107): No such file or directory


You can find logs with the problem under https://bugzilla.redhat.com/attachment.cgi?id=1366624.

The fix http://gerrit.ovirt.org/85372 does not fix the initial bug (please see https://bugzilla.redhat.com/show_bug.cgi?id=1524119#c0).

Comment 32 Francesco Romani 2017-12-14 13:45:55 UTC
(In reply to Artyom from comment #31)
> In https://bugzilla.redhat.com/show_bug.cgi?id=1524119#c28 I gave exactly
> reproduce steps for initial problem with HE VM migration.
> 
> I passed only once after ovirt-ha-broker restart, if I will try to migrate
> HE VM again from the same host, I will receive
> Traceback (most recent call last):
>   File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 437,
> in _regular_run
>     self._startUnderlyingMigration(time.time())
>   File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 510,
> in _startUnderlyingMigration
>     self._perform_with_downtime_thread(duri, muri)
>   File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 579,
> in _perform_with_downtime_thread
>     self._perform_migration(duri, muri)
>   File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 528,
> in _perform_migration
>     self._migration_flags)
>   File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 98,
> in f
>     ret = attr(*args, **kwargs)
>   File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line
> 126, in wrapper
>     ret = f(*args, **kwargs)
>   File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 512, in wrapper
>     return func(inst, *args, **kwargs)
>   File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1679, in
> migrateToURI3
>     if ret == -1: raise libvirtError ('virDomainMigrateToURI3() failed',
> dom=self)
> libvirtError: Cannot access storage file
> '/var/run/vdsm/storage/7a9a1478-6a7c-465b-b0e5-598205be7faf/76b07256-29ce-
> 4634-9f4a-db35d3bc61c6/0fa79d54-74ac-41bb-9936-c6ecdd947150' (as uid:107,
> gid:107): No such file or directory
> 
> 
> You can find logs with the problem under
> https://bugzilla.redhat.com/attachment.cgi?id=1366624.
> 
> The fix http://gerrit.ovirt.org/85372 does not fix the initial bug(please
> see https://bugzilla.redhat.com/show_bug.cgi?id=1524119#c0)

Thanks. Unfortunately in https://bugzilla.redhat.com/attachment.cgi?id=1366624 I see only one vdsm log, the source side if I'm not mistaken, so I can't really tell why it fails. The best option would be to have an environment on which it easily reproduces; I don't have one handy.

Comment 33 Francesco Romani 2017-12-14 14:58:39 UTC
Surprisingly, I managed to get access to the environment on which this reproduces, and it seems the real cause of the error is actually:

2017-12-14 13:23:06,764+0200 ERROR (vm/346b8d80) [virt.vm] (vmId='346b8d80-b4f2-4a81-bf60-a662220c27a4') Failed to setup device vnc (vm:2519)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2516, in _setup_devices
    dev_object.setup()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vmdevices/graphics.py", line 65, in setup
    display_network = self.specParams['displayNetwork']
KeyError: 'displayNetwork'
2017-12-14 13:23:06,765+0200 ERROR (vm/346b8d80) [virt.vm] (vmId='346b8d80-b4f2-4a81-bf60-a662220c27a4') The vm start process failed (vm:914)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 843, in _startUnderlyingVm
    self._run()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2644, in _run
    self._setup_devices()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2516, in _setup_devices
    dev_object.setup()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vmdevices/graphics.py", line 65, in setup
    display_network = self.specParams['displayNetwork']
KeyError: 'displayNetwork'
2017-12-14 13:23:06,765+0200 INFO  (vm/346b8d80) [virt.vm] (vmId='346b8d80-b4f2-4a81-bf60-a662220c27a4') Changed state to Down: 'displayNetwork' (code=1) (vm:1633)

This is surely one issue - I am not sure it is the last fix needed, and not sure this means it is always broken.
Still, it needs to be fixed, and https://gerrit.ovirt.org/85446 aims to do that.
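
Judging from the patch titles ("virt: graphics: displayNetwork may be absent" and "vm: graphics: don't inject displayNetwork=None"), the fix is conceptually the change sketched below (simplified, not the actual graphics device code):

def setup_display_network(spec_params, configure_network):
    # Simplified sketch. Before: specParams['displayNetwork'] raised KeyError
    # when the HE VM configuration carried no displayNetwork at all.
    # After: treat a missing key the same as "no display network configured".
    display_network = spec_params.get('displayNetwork')
    if display_network:
        configure_network(display_network)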

Comment 34 Artyom 2017-12-14 15:46:22 UTC
With the patch https://gerrit.ovirt.org/85446, migration works well.

Comment 35 Artyom 2017-12-19 10:37:15 UTC
Verified on vdsm-4.20.9.3-1.el7ev.x86_64

Comment 36 Sandro Bonazzola 2017-12-20 11:16:17 UTC
This bugzilla is included in oVirt 4.2.0 release, published on Dec 20th 2017.

Since the problem described in this bug report should be
resolved in oVirt 4.2.0 release, published on Dec 20th 2017, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

Comment 37 Francesco Romani 2018-01-08 10:10:01 UTC
*** Bug 1502768 has been marked as a duplicate of this bug. ***

