Bug 1598509 - Hosted engine VM migration failed on "libvirtError: Domain not found: no domain with matching name 'HostedEngine'"
Summary: Hosted engine VM migration failed on "libvirtError: Domain not found: no domain with matching name 'HostedEngine'"
Keywords:
Status: CLOSED DUPLICATE of bug 1665934
Alias: None
Product: ovirt-hosted-engine-ha
Classification: oVirt
Component: Agent
Version: 2.2.13
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.3.2
Target Release: ---
Assignee: Simone Tiraboschi
QA Contact: Nikolai Sednev
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-07-05 16:26 UTC by Elad
Modified: 2019-02-20 08:53 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-02-20 08:53:44 UTC
oVirt Team: Integration
Embargoed:
rule-engine: ovirt-4.3+


Attachments
logs (8.79 MB, application/x-gzip)
2018-07-05 16:26 UTC, Elad


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1550925 0 low CLOSED virt-manager loses track of VM after 'virt-clone --replace' is used 2021-02-22 00:41:40 UTC
Red Hat Bugzilla 1665934 0 high CLOSED Host stuck in preparing to maintenance - operation failed: migration job: unexpectedly failed 2021-02-22 00:41:40 UTC

Internal Links: 1550925

Description Elad 2018-07-05 16:26:46 UTC
Created attachment 1456800 [details]
logs

Description of problem:
Hosted engine VM migration, triggered as part of host maintenance, failed on libvirtError.

Version-Release number of selected component (if applicable):
rhvm-4.2.5-0.1.el7ev.noarch

RHEL7.5
ovirt-host-deploy-1.7.4-1.el7ev.noarch
ovirt-imageio-daemon-1.4.1-0.el7ev.noarch
ovirt-host-4.2.3-1.el7ev.x86_64
python-ovirt-engine-sdk4-4.2.6-1.el7ev.x86_64
ovirt-engine-sdk-python-3.6.9.1-1.el7ev.noarch
ovirt-hosted-engine-ha-2.2.15-1.el7ev.noarch
ovirt-hosted-engine-setup-2.2.23-1.el7ev.noarch
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
ovirt-host-dependencies-4.2.3-1.el7ev.x86_64
cockpit-ovirt-dashboard-0.11.29-1.el7ev.noarch
ovirt-vmconsole-1.0.4-1.el7ev.noarch
ovirt-setup-lib-1.1.4-1.el7ev.noarch
ovirt-provider-ovn-driver-1.2.12-1.el7ev.noarch
ovirt-imageio-common-1.4.1-0.el7ev.noarch
cockpit-machines-ovirt-169-1.el7.noarch
vdsm-yajsonrpc-4.20.33-1.el7ev.noarch
vdsm-http-4.20.33-1.el7ev.noarch
vdsm-hook-fcoe-4.20.33-1.el7ev.noarch
vdsm-client-4.20.33-1.el7ev.noarch
vdsm-4.20.33-1.el7ev.x86_64
vdsm-hook-openstacknet-4.20.33-1.el7ev.noarch
vdsm-hook-vhostmd-4.20.33-1.el7ev.noarch
vdsm-api-4.20.33-1.el7ev.noarch
vdsm-python-4.20.33-1.el7ev.noarch
vdsm-hook-vmfex-dev-4.20.33-1.el7ev.noarch
vdsm-hook-vfio-mdev-4.20.33-1.el7ev.noarch
vdsm-network-4.20.33-1.el7ev.x86_64
vdsm-jsonrpc-4.20.33-1.el7ev.noarch
vdsm-hook-ethtool-options-4.20.33-1.el7ev.noarch
vdsm-common-4.20.33-1.el7ev.noarch
ovirt-hosted-engine-ha-2.2.15-1.el7ev.noarch
ovirt-hosted-engine-setup-2.2.23-1.el7ev.noarch
libvirt-daemon-config-nwfilter-3.9.0-14.el7_5.6.x86_64
libvirt-daemon-driver-storage-iscsi-3.9.0-14.el7_5.6.x86_64
libvirt-3.9.0-14.el7_5.6.x86_64
libvirt-client-3.9.0-14.el7_5.6.x86_64
libvirt-daemon-driver-storage-core-3.9.0-14.el7_5.6.x86_64
libvirt-daemon-driver-storage-3.9.0-14.el7_5.6.x86_64
libvirt-daemon-config-network-3.9.0-14.el7_5.6.x86_64
libvirt-python-3.9.0-1.el7.x86_64
libvirt-daemon-driver-nodedev-3.9.0-14.el7_5.6.x86_64
libvirt-daemon-driver-storage-rbd-3.9.0-14.el7_5.6.x86_64
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.6.x86_64
libvirt-daemon-kvm-3.9.0-14.el7_5.6.x86_64
libvirt-libs-3.9.0-14.el7_5.6.x86_64
libvirt-daemon-driver-nwfilter-3.9.0-14.el7_5.6.x86_64
libvirt-daemon-driver-secret-3.9.0-14.el7_5.6.x86_64
libvirt-lock-sanlock-3.9.0-14.el7_5.6.x86_64
libvirt-daemon-driver-storage-disk-3.9.0-14.el7_5.6.x86_64
libvirt-daemon-driver-storage-scsi-3.9.0-14.el7_5.6.x86_64
libvirt-daemon-driver-network-3.9.0-14.el7_5.6.x86_64
libvirt-daemon-driver-lxc-3.9.0-14.el7_5.6.x86_64
libvirt-daemon-3.9.0-14.el7_5.6.x86_64
libvirt-daemon-driver-interface-3.9.0-14.el7_5.6.x86_64
libvirt-daemon-driver-storage-logical-3.9.0-14.el7_5.6.x86_64
libvirt-daemon-driver-qemu-3.9.0-14.el7_5.6.x86_64
libvirt-daemon-driver-storage-mpath-3.9.0-14.el7_5.6.x86_64
ipxe-roms-qemu-20170123-1.git4e85b27.el7_4.1.noarch
qemu-img-rhev-2.10.0-21.el7_5.4.x86_64
libvirt-daemon-driver-qemu-3.9.0-14.el7_5.6.x86_64
qemu-kvm-common-rhev-2.10.0-21.el7_5.4.x86_64
qemu-kvm-rhev-2.10.0-21.el7_5.4.x86_64
Linux lynx14.lab.eng.tlv2.redhat.com 3.10.0-862.6.3.el7.x86_64 #1 SMP Fri Jun 15 17:57:37 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux


How reproducible:
Encountered once

Steps to Reproduce:
1. Put the host that is running the hosted engine VM into maintenance (local maintenance) while there are other hosted-engine hosts with a proper score in the cluster.
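
For reference, local maintenance can be entered from the engine UI or directly on the host via the standard hosted-engine CLI; the command below is shown only to illustrate the trigger:

    hosted-engine --set-maintenance --mode=local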


Actual results:

2018-07-04 01:43:32,369+0300 ERROR (migsrc/03d8d9a4) [virt.vm] (vmId='03d8d9a4-776a-4cba-a30c-ed369a276027') Failed to migrate (migration:455)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 437, in _regular_run
    self._startUnderlyingMigration(time.time())
  File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 509, in _startUnderlyingMigration
    self._perform_with_conv_schedule(duri, muri)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 587, in _perform_with_conv_schedule
    self._perform_migration(duri, muri)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 529, in _perform_migration
    self._migration_flags)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 98, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 130, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1746, in migrateToURI3
    if ret == -1: raise libvirtError ('virDomainMigrateToURI3() failed', dom=self)
libvirtError: Domain not found: no domain with matching name 'HostedEngine'
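
For triage, this is the generic libvirt VIR_ERR_NO_DOMAIN error class. A minimal standalone sketch (not vdsm code; it only assumes libvirt-python and a local qemu:///system socket) of how the same error can be inspected and told apart from other migration failures:

# Illustrative only -- not taken from vdsm.
import libvirt

conn = libvirt.open('qemu:///system')
try:
    conn.lookupByName('HostedEngine')   # the domain name used by hosted-engine
except libvirt.libvirtError as e:
    if e.get_error_code() == libvirt.VIR_ERR_NO_DOMAIN:
        # The domain is not defined/running on this host, which is what
        # migrateToURI3() reports in the traceback above.
        print('Domain not found: ' + e.get_error_message())
    else:
        raise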



libvirt.log:

2018-07-03 22:43:30.818+0000: 43114: info : qemuMonitorIOProcess:447 : QEMU_MONITOR_IO_PROCESS: mon=0x7f4f24020600 buf={"return": {"status": "failed"}, "id": "libvirt-1673"} len=56
2018-07-03 22:43:30.818+0000: 43114: info : virObjectUnref:350 : OBJECT_UNREF: obj=0x7f4f24020600
2018-07-03 22:43:30.818+0000: 43118: info : virObjectUnref:350 : OBJECT_UNREF: obj=0x7f4f24020600
2018-07-03 22:43:30.818+0000: 43118: error : qemuMigrationCheckJobStatus:1457 : operation failed: migration job: unexpectedly failed



agent.log:


MainThread::INFO::2018-07-04 01:43:02,908::states::243::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Score is 0 due to local maintenance mode
MainThread::INFO::2018-07-04 01:43:02,908::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state LocalMaintenanceMigrateVm (score: 0)
MainThread::INFO::2018-07-04 01:43:03,035::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (LocalMaintenanceMigrateVm-EngineMigratingAway) sent? sent
MainThread::WARNING::2018-07-04 01:43:03,035::hosted_engine::601::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_domain_monitor_if_possible) The VM is running locally or we have no data, keeping the domain monitor.
MainThread::INFO::2018-07-04 01:43:03,040::states::72::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_penalize_memory) Penalizing score by 400 due to free memory 12352 being lower than required 16384
MainThread::INFO::2018-07-04 01:43:03,040::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineMigratingAway (score: 3000)
MainThread::INFO::2018-07-04 01:43:13,062::state_machine::169::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Global metadata: {'maintenance': False}
MainThread::INFO::2018-07-04 01:43:13,062::state_machine::174::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Host lynx16.lab.eng.tlv2.redhat.com (id 2): {'conf_on_shared_storage': True, 'extra': 'metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=86119 (Wed Jul  4 01:42:59 2018)\nhost-id=2\nscore=3400\nvm_conf_refresh_time=86120 (Wed Jul  4 01:42:59 2018)\nconf_on_shared_storage=True\nmaintenance=False\nstate=EngineDown\nstopped=False\n', 'hostname': 'lynx16.lab.eng.tlv2.redhat.com', 'alive': True, 'host-id': 2, 'engine-status': {'reason': 'vm not running on this host', 'health': 'bad', 'vm': 'down', 'detail': 'unknown'}, 'score': 3400, 'stopped': False, 'maintenance': False, 'crc32': '93239882', 'local_conf_timestamp': 86120, 'host-ts': 86119}




MainThread::ERROR::2018-07-04 01:43:34,010::states::828::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Migration failed: {'status': {'message': 'Fatal error during migration', 'code': 12}, 'progress': 15}



Expected results:
Hosted engine VM migration should succeed.

Additional info:
logs

Comment 1 Elad 2018-07-06 00:38:47 UTC
Platform related bug 1550925

Comment 2 Ryan Barry 2018-08-15 16:21:21 UTC
Martin -

Any guesses here? The platform bug doesn't look related, but I'm not sure if the agent matches by uuid. Judging from the description, it looks like we shouldn't be trying to migrate at all, or the cluster should be in global maintenance.

Elad -

When you say "no other hosts with appropriate score", does that mean all other hosts are also in maintenance?

Comment 3 Ryan Barry 2018-08-15 16:24:34 UTC
Removing blocker until the questions above are answered, as this may not be a valid use case.

Comment 4 Michal Skrivanek 2018-08-15 16:43:29 UTC
Yeah, that bug in comment #1 is totally unrelated.

Comment 5 Michal Skrivanek 2018-08-15 16:58:10 UTC
The only relevant thing besides the migration failure is in the qemu log (in UTC):
2018-07-03 22:43:05.389+0000: initiating migration
2018-07-04 01:10:07.802+0000: initiating migration
2018-07-04 01:10:37.494+0000: shutting down, reason=migrated

We need logs from the destination host. From the looks of it, on the source side it could be a qemu crash. Nothing interesting in the libvirt log.

Is it reproducible?

Comment 6 Elad 2018-08-20 11:00:56 UTC
I don't have these logs and we haven't seen this ever since.

Comment 7 Michal Skrivanek 2018-08-20 12:25:36 UTC
OK, that's too bad; we'll probably need to close this as WORKSFORME. I suppose it's not an automation blocker then?

Comment 8 Elad 2018-08-20 12:27:04 UTC
Closing for now. Will re-open if needed.

Comment 9 Elad 2018-09-02 15:09:26 UTC
Encountered again, re-opening.

hypervisor-lynx02/var/tmp/ovirt-logs-hypervisor/vdsm-logs/vdsm.log.11.xz:


2018-08-30 16:35:35,458+0300 ERROR (migsrc/2429cc35) [virt.vm] (vmId='2429cc35-1fbc-4c4a-8b32-5cb287c0d176') Failed to migrate (migration:455)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 437, in _regular_run
    self._startUnderlyingMigration(time.time())
  File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 509, in _startUnderlyingMigration
    self._perform_with_conv_schedule(duri, muri)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 587, in _perform_with_conv_schedule
    self._perform_migration(duri, muri)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 529, in _perform_migration
    self._migration_flags)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 98, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 130, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1746, in migrateToURI3
    if ret == -1: raise libvirtError ('virDomainMigrateToURI3() failed', dom=self)
libvirtError: Domain not found: no domain with matching name 'HostedEngine'



Logs:

https://drive.google.com/file/d/1Iee7jqNnqMu8PqQp2kLbQ42VhiymL7sV/view?usp=sharing

Comment 10 Michal Skrivanek 2018-09-03 10:04:38 UTC
Migration was started on lynx02:
2018-08-30 16:35:16,755+0300 INFO  (migsrc/2429cc35) [virt.vm] (vmId='2429cc35-1fbc-4c4a-8b32-5cb287c0d176') starting migration to qemu+tls://lynx01...
which was proceeding fine on the destination (lynx01) until the destination host initiated a destroy of that VM:
2018-08-30 16:35:33,437+0300 INFO  (jsonrpc/0) [api.virt] START destroy(gracefulAttempts=1) from=::1,36190, vmId=2429cc35-1fbc-4c4a-8b32-5cb287c0d176 (api:46)

which corresponds to the migration failure seen on source.

It seems the VM was killed by the HE agent on lynx01 because of:
MainThread::INFO::2018-08-30 16:35:23,761::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineDown (score: 3400)
MainThread::INFO::2018-08-30 16:35:32,784::states::510::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine down and local host has best score (3400), attempting to start engine VM
MainThread::INFO::2018-08-30 16:35:32,931::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineDown-EngineStart) sent? sent
MainThread::INFO::2018-08-30 16:35:33,385::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStart (score: 3400)
MainThread::INFO::2018-08-30 16:35:33,420::hosted_engine::968::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_clean_vdsm_state) Ensuring VDSM state is clear for engine VM
MainThread::INFO::2018-08-30 16:35:33,432::hosted_engine::988::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_clean_vdsm_state) Cleaning state for non-running VM
MainThread::INFO::2018-08-30 16:35:36,516::hosted_engine::980::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_clean_vdsm_state) Vdsm state for VM clean
MainThread::INFO::2018-08-30 16:35:36,516::hosted_engine::929::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm) Starting vm using `/usr/sbin/hosted-engine --vm-start`


So ultimately this seems to be a fault of the HE agent mistakenly thinking the HE VM is not running while it is migrating in.
It may look harmless here, since the migration was merely aborted and the HE VM continued to run on the source host, but there is certainly a window for a race: a perfectly healthy HE VM could be killed if the decision to clean up comes right before the migration succeeds.
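
A rough sketch of the kind of guard this analysis suggests is missing. This is not the actual ovirt-hosted-engine-ha code: the helper callables are hypothetical stand-ins for the agent's VDSM wrappers, and the 'Migration Destination' status string is an assumption based on VDSM's reported VM states.

# Sketch only: hypothetical pre-cleanup check in the agent's EngineStart path.
# get_vm_status() and destroy_vm() stand in for the agent's VDSM wrappers.
def clean_vdsm_state_safely(get_vm_status, destroy_vm, vm_id):
    status = get_vm_status(vm_id)  # e.g. 'Down', 'Up', 'Migration Destination'
    if status == 'Migration Destination':
        # An incoming live migration owns this domain; destroying it now
        # aborts the migration (the race described above) or, in the worst
        # case, kills a healthy HE VM just as the migration completes.
        return False
    if status != 'Down':
        destroy_vm(vm_id, graceful_attempts=1)
    return True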

Comment 11 Sandro Bonazzola 2018-10-01 07:28:13 UTC
Elad, are you able to reproduce?

Comment 12 Elad 2018-10-04 04:23:18 UTC
This has occurred at least twice in our automation since it was reported (both incidents are detailed in this bug).

Comment 13 Sandro Bonazzola 2018-12-10 11:15:28 UTC
Does this reproduce in the latest 4.2.8 build?

Comment 14 Elad 2018-12-11 11:51:59 UTC
Not that I'm aware of. But if no fix was applied, I'm afraid the bug still exists and it's only a matter of time before we see it again, just as we did in comment #10.

Comment 15 Sandro Bonazzola 2018-12-13 12:54:57 UTC
Postponing to 4.3.0 as this has not been identified as a blocker for 4.2.8.

Comment 16 Sandro Bonazzola 2019-02-18 07:54:54 UTC
Moving to 4.3.2 as this has not been identified as a blocker for 4.3.1.

Comment 17 Sandro Bonazzola 2019-02-20 08:53:44 UTC
Closing as a duplicate of bug #1665934; it seems to be the same root cause.

*** This bug has been marked as a duplicate of bug 1665934 ***

