Bug 1641617 - [Executor] Worker blocked vdsm error happens during HE VM migration, followed by unreachable SDs
Summary: [Executor] Worker blocked vdsm error happens during HE VM migration, followed by unreachable SDs
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Virt
Version: 4.2.7
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ovirt-4.4.0
Target Release: ---
Assignee: Ryan Barry
QA Contact: Polina
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-10-22 11:13 UTC by Polina
Modified: 2019-09-16 19:09 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-09-16 19:09:41 UTC
oVirt Team: Virt
Embargoed:
pm-rhel: ovirt-4.4+


Attachments
vdsm source/destination logs (7.64 MB, application/x-gzip)
2018-10-22 11:13 UTC, Polina
no flags
he migration error logs (66.45 KB, application/x-xz)
2018-11-05 08:25 UTC, Liran Rotenberg
no flags
migration_last_4.2.8-5_build.tar.gz (253.14 KB, application/gzip)
2019-02-06 15:42 UTC, Polina
no flags

Description Polina 2018-10-22 11:13:08 UTC
Created attachment 1496367 [details]
vdsm source/destination logs

Description of problem: Migration of the HE VM fails with an '[Executor] Worker blocked' vdsm error. The Storage Domains are unreachable afterwards and the connection to the hosts is lost.

Version-Release number of selected component (if applicable):
redhat-release-server-7.6-4.el7.x86_64
rhv-release-4.2.7-5-001.noarch
vdsm-http-4.20.43-1.el7ev.noarch
vdsm-api-4.20.43-1.el7ev.noarch
vdsm-python-4.20.43-1.el7ev.noarch
vdsm-hook-vhostmd-4.20.43-1.el7ev.noarch
vdsm-yajsonrpc-4.20.43-1.el7ev.noarch
vdsm-client-4.20.43-1.el7ev.noarch
vdsm-hook-vmfex-dev-4.20.43-1.el7ev.noarch
vdsm-hook-fcoe-4.20.43-1.el7ev.noarch
vdsm-hook-openstacknet-4.20.43-1.el7ev.noarch
vdsm-jsonrpc-4.20.43-1.el7ev.noarch
vdsm-4.20.43-1.el7ev.x86_64
vdsm-hook-ethtool-options-4.20.43-1.el7ev.noarch
vdsm-network-4.20.43-1.el7ev.x86_64
vdsm-common-4.20.43-1.el7ev.noarch

How reproducible: happened during a tier3 automation run - https://rhv-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/rhv-4.2-ge-runner-tier3/87/
After this, the Storage Domains were unreachable for a period of time.

Steps to Reproduce:
pre-condition: environment with 3 hosts. The HE VM is on host3, which is also the SPM.
1. Send a migration action for the HE VM (an equivalent request sketch follows the log excerpt below):
2018-10-19 02:34:22,287 - MainThread - vms - DEBUG - Action request content is --  url:/ovirt-engine/api/vms/31408678-5102-4750-8702-66ad6cf7d1b0/migrate body:<action>
    <async>false</async>
    <force>true</force>
    <grace_period>
        <expiry>10</expiry>
    </grace_period>
    <host id="6aa23fa7-86cb-4ab3-b04f-d15fa6d4739d"/>
</action>
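
For reference, below is a minimal sketch of an equivalent request issued directly against the REST API; the URL path, VM id, and body are taken from the request above, while the engine FQDN and credentials are placeholders.

# Minimal sketch (assumptions: python-requests is available; the engine FQDN,
# user, and password below are placeholders, not taken from this report).
import requests

URL = ('https://engine.example.com/ovirt-engine/api/vms/'
       '31408678-5102-4750-8702-66ad6cf7d1b0/migrate')
BODY = """<action>
    <async>false</async>
    <force>true</force>
    <grace_period>
        <expiry>10</expiry>
    </grace_period>
    <host id="6aa23fa7-86cb-4ab3-b04f-d15fa6d4739d"/>
</action>"""

resp = requests.post(
    URL,
    data=BODY,
    headers={'Content-Type': 'application/xml', 'Accept': 'application/xml'},
    auth=('admin@internal', 'password'),  # placeholder credentials
    verify=False)                         # lab setup; verify the CA in production
print(resp.status_code)
print(resp.text)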

Actual results: Migration fails. 
On Destination host: 
2018-10-19 02:39:16,220+0300 ERROR (vm/31408678) [virt.vm] (vmId='31408678-5102-4750-8702-66ad6cf7d1b0') The vm start process failed (vm:948)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 877, in _startUnderlyingVm
    self._run()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2898, in _run
    dom.createWithFlags(flags)
  File "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 130, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1110, in createWithFlags
    if ret == -1: raise libvirtError ('virDomainCreateWithFlags() failed', dom=self)
libvirtError: resource busy: Failed to acquire lock: Lease is held by another host
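
The 'Lease is held by another host' error suggests the source host still holds the storage lease for the HE VM. A minimal diagnostic sketch, assuming libvirt-python and qemu+ssh access to both hosts (the host FQDNs are placeholders), would be to check where the domain is still defined and active:

# Minimal sketch (assumptions: libvirt-python installed, root SSH access to
# both hosts; the FQDNs are placeholders, the VM UUID is from the logs above).
import libvirt

VM_UUID = '31408678-5102-4750-8702-66ad6cf7d1b0'

for host in ('source-host.example.com', 'destination-host.example.com'):
    conn = libvirt.open('qemu+ssh://root@%s/system' % host)
    try:
        dom = conn.lookupByUUIDString(VM_UUID)
        print('%s: domain defined, active=%s' % (host, bool(dom.isActive())))
    except libvirt.libvirtError:
        print('%s: domain not defined' % host)
    finally:
        conn.close()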
###########################
On source host:
2018-10-19 02:36:34,225+0300 DEBUG (mailbox-spm) [storage.Misc.excCmd] SUCCESS: <err> = '1+0 records in\n1+0 records out\n1024000 bytes (1.0 MB) copied, 0.00883248 s, 116 MB/s\n'; <rc> = 0 (commands:86)
2018-10-19 02:36:34,986+0300 ERROR (migmon/31408678) [root] Unhandled exception (logutils:412)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/common/logutils.py", line 409, in wrapper
    return f(*a, **kw)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 758, in run
    self.monitor_migration()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 791, in monitor_migration
    job_stats = self._vm._dom.jobStats()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 98, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 130, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1431, in jobStats
    if ret is None: raise libvirtError ('virDomainGetJobStats() failed', dom=self)
libvirtError: Requested operation is not valid: domain is not running
2018-10-19 02:36:34,987+0300 ERROR (migmon/31408678) [root] FINISH thread <Thread(migmon/31408678, stopped daemon 140055173506816)> failed (concurrent:201)
Traceback (most recent call last)
################
The connection to the hosts is lost.
The SDs are unreachable for a period of time.

Expected results: migration succeeds and the SDs remain available.

Additional info: engine and vdsm source/destination logs are attached.

Comment 2 Liran Rotenberg 2018-11-05 08:25:14 UTC
Created attachment 1501681 [details]
he migration error logs

We hit the migration error with high frequency, this time in:
ovirt-engine-4.2.7.4-0.1.el7ev.noarch
It strongly affects our automation.

Comment 4 Ryan Barry 2019-01-21 13:34:22 UTC
Re-targeting, because these bugs either do not have blocker+, or do not have a patch posted

Comment 5 Ryan Barry 2019-01-24 00:13:43 UTC
Polina, still reproducible?

Comment 6 Polina 2019-02-06 15:42:34 UTC
Created attachment 1527555 [details]
migration_last_4.2.8-5_build.tar.gz

Yes, I see it in the last automation run.

I attach migration_last_4.2.8-5_build.tar.gz, containing the extracted engine and vdsm logs with the errors around the incident time.
Please let me know if more logs are needed.

Comment 7 Michal Skrivanek 2019-03-07 13:16:24 UTC
I do not see any problem with the SDs. Is there anything in the event log?

Comment 8 Ryan Barry 2019-03-07 14:02:49 UTC
Removing blocker+ until there's a reproducer in engineering. Can you please attach libvirt/qemu logs, so we can get an idea why the VM doesn't start?

Simone, any idea why the host is allowed to go into maintenance when the HE VM is actually still running there, and what potential side effects there might be other than the observed storage disconnect?

Comment 10 Michal Skrivanek 2019-03-14 11:58:08 UTC
It can definitely happen when the host becomes unresponsive during PreparingForMaintenance (see MaintenanceVdsCommand).

Comment 11 Michal Skrivanek 2019-03-14 12:04:24 UTC
Polina, so what is the problem again?

Comment 12 Ryan Barry 2019-07-31 11:05:57 UTC
Re-test?

Comment 13 Polina 2019-07-31 11:36:19 UTC
It will be re-tested in the next automation run.

Comment 14 Polina 2019-09-16 12:32:35 UTC
Not seen in the last automation runs (all tiers) for ovirt-engine-4.3.6.5-0.1.el7.noarch.

