Created attachment 1496367 [details]
vdsm source/destination logs

Description of problem:
Migration of the HE VM failed with a '[Executor] Worker blocked' vdsm error. The storage domains are unreachable after this, and the connection to the hosts is lost.

Version-Release number of selected component (if applicable):
redhat-release-server-7.6-4.el7.x86_64
rhv-release-4.2.7-5-001.noarch
vdsm-http-4.20.43-1.el7ev.noarch
vdsm-api-4.20.43-1.el7ev.noarch
vdsm-python-4.20.43-1.el7ev.noarch
vdsm-hook-vhostmd-4.20.43-1.el7ev.noarch
vdsm-yajsonrpc-4.20.43-1.el7ev.noarch
vdsm-client-4.20.43-1.el7ev.noarch
vdsm-hook-vmfex-dev-4.20.43-1.el7ev.noarch
vdsm-hook-fcoe-4.20.43-1.el7ev.noarch
vdsm-hook-openstacknet-4.20.43-1.el7ev.noarch
vdsm-jsonrpc-4.20.43-1.el7ev.noarch
vdsm-4.20.43-1.el7ev.x86_64
vdsm-hook-ethtool-options-4.20.43-1.el7ev.noarch
vdsm-network-4.20.43-1.el7ev.x86_64
vdsm-common-4.20.43-1.el7ev.noarch

How reproducible:
Happened during a tier3 automation run - https://rhv-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/rhv-4.2-ge-runner-tier3/87/
After this the storage domains were unreachable for a period of time.

Steps to Reproduce:
Pre-condition: environment with 3 hosts. The HE VM is on host3, which is also the SPM.
1. Send a migration action for the HE VM:

2018-10-19 02:34:22,287 - MainThread - vms - DEBUG - Action request content is --
url:/ovirt-engine/api/vms/31408678-5102-4750-8702-66ad6cf7d1b0/migrate
body:
<action>
  <async>false</async>
  <force>true</force>
  <grace_period>
    <expiry>10</expiry>
  </grace_period>
  <host id="6aa23fa7-86cb-4ab3-b04f-d15fa6d4739d"/>
</action>

Actual results:
Migration fails.
On destination host:

2018-10-19 02:39:16,220+0300 ERROR (vm/31408678) [virt.vm] (vmId='31408678-5102-4750-8702-66ad6cf7d1b0') The vm start process failed (vm:948)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 877, in _startUnderlyingVm
    self._run()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2898, in _run
    dom.createWithFlags(flags)
  File "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 130, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1110, in createWithFlags
    if ret == -1: raise libvirtError ('virDomainCreateWithFlags() failed', dom=self)
libvirtError: resource busy: Failed to acquire lock: Lease is held by another host

###########################

On source host:

2018-10-19 02:36:34,225+0300 DEBUG (mailbox-spm) [storage.Misc.excCmd] SUCCESS: <err> = '1+0 records in\n1+0 records out\n1024000 bytes (1.0 MB) copied, 0.00883248 s, 116 MB/s\n'; <rc> = 0 (commands:86)
2018-10-19 02:36:34,986+0300 ERROR (migmon/31408678) [root] Unhandled exception (logutils:412)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/common/logutils.py", line 409, in wrapper
    return f(*a, **kw)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 758, in run
    self.monitor_migration()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 791, in monitor_migration
    job_stats = self._vm._dom.jobStats()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 98, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 130, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1431, in jobStats
    if ret is None: raise libvirtError ('virDomainGetJobStats() failed', dom=self)
libvirtError: Requested operation is not valid: domain is not running

2018-10-19 02:36:34,987+0300 ERROR (migmon/31408678) [root] FINISH thread <Thread(migmon/31408678, stopped daemon 140055173506816)> failed (concurrent:201)
Traceback (most recent call last)

################

Connection to the hosts is lost. The SDs are unreachable for a period of time.

Expected results:
Migration succeeds and the SDs are available.

Additional info:
Engine and vdsm source/destination logs are available.
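For reference, the migrate action from step 1 can be rebuilt with a short standalone sketch. This is only an illustration of the request body logged above; `build_migrate_action` is a hypothetical helper (not part of the automation), and the VM/host UUIDs are the ones from the log:

```python
# Minimal sketch: reconstruct the <action> body that the automation POSTs to
# /ovirt-engine/api/vms/<vm_id>/migrate. The helper name and its defaults are
# assumptions for illustration; only the XML shape and UUIDs come from the log.
import xml.etree.ElementTree as ET

def build_migrate_action(host_id, force=True, async_=False, grace_expiry=10):
    """Return the XML body for a forced, synchronous migration to host_id."""
    action = ET.Element("action")
    ET.SubElement(action, "async").text = str(async_).lower()
    ET.SubElement(action, "force").text = str(force).lower()
    grace = ET.SubElement(action, "grace_period")
    ET.SubElement(grace, "expiry").text = str(grace_expiry)
    ET.SubElement(action, "host", id=host_id)  # destination host UUID
    return ET.tostring(action, encoding="unicode")

# Destination host UUID taken from the logged request above.
body = build_migrate_action("6aa23fa7-86cb-4ab3-b04f-d15fa6d4739d")
print(body)
```

The resulting string matches the logged body and could be POSTed to the engine's `/ovirt-engine/api/vms/31408678-5102-4750-8702-66ad6cf7d1b0/migrate` endpoint with any HTTP client, given valid credentials.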
Created attachment 1501681 [details]
HE migration error logs

We hit the migration error at high frequency, this time in:
ovirt-engine-4.2.7.4-0.1.el7ev.noarch

It strongly affects our automation.
Re-targeting, because these bugs either do not have blocker+, or do not have a patch posted
Polina, still reproducible?
Created attachment 1527555 [details]
migration_last_4.2.8-5_build.tar.gz

Yes, I see it in the last automation run. I am attaching migration_last_4.2.8-5_build.tar.gz, containing the extracted engine and vdsm logs with the errors around the incident time. Please let me know if more logs are needed.
I do not see any problem with the SDs. Is there anything in the event log?
Removing blocker+ until there's a reproducer in engineering.

Can you please attach the libvirt/qemu logs, so we can get an idea why the VM doesn't start?

Simone, any idea why the host is allowed to go into maintenance while the HE VM is actually still running there, and whether there are potential side effects other than the observed storage disconnect?
It can definitely happen when the host becomes unresponsive during PreparingForMaintenance (see MaintenanceVdsCommand).
Polina, so what is the problem again?
Re-test?
Will be re-tested in the next automation run.
Not seen in the last automation runs (all the tiers) for ovirt-engine-4.3.6.5-0.1.el7.noarch.