Created attachment 1456800 [details]
logs

Description of problem:
Hosted engine VM migration, triggered as part of host maintenance, failed with a libvirtError.

Version-Release number of selected component (if applicable):
rhvm-4.2.5-0.1.el7ev.noarch
RHEL 7.5
ovirt-host-deploy-1.7.4-1.el7ev.noarch
ovirt-imageio-daemon-1.4.1-0.el7ev.noarch
ovirt-host-4.2.3-1.el7ev.x86_64
python-ovirt-engine-sdk4-4.2.6-1.el7ev.x86_64
ovirt-engine-sdk-python-3.6.9.1-1.el7ev.noarch
ovirt-hosted-engine-ha-2.2.15-1.el7ev.noarch
ovirt-hosted-engine-setup-2.2.23-1.el7ev.noarch
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
ovirt-host-dependencies-4.2.3-1.el7ev.x86_64
cockpit-ovirt-dashboard-0.11.29-1.el7ev.noarch
ovirt-vmconsole-1.0.4-1.el7ev.noarch
ovirt-setup-lib-1.1.4-1.el7ev.noarch
ovirt-provider-ovn-driver-1.2.12-1.el7ev.noarch
ovirt-imageio-common-1.4.1-0.el7ev.noarch
cockpit-machines-ovirt-169-1.el7.noarch
vdsm-yajsonrpc-4.20.33-1.el7ev.noarch
vdsm-http-4.20.33-1.el7ev.noarch
vdsm-hook-fcoe-4.20.33-1.el7ev.noarch
vdsm-client-4.20.33-1.el7ev.noarch
vdsm-4.20.33-1.el7ev.x86_64
vdsm-hook-openstacknet-4.20.33-1.el7ev.noarch
vdsm-hook-vhostmd-4.20.33-1.el7ev.noarch
vdsm-api-4.20.33-1.el7ev.noarch
vdsm-python-4.20.33-1.el7ev.noarch
vdsm-hook-vmfex-dev-4.20.33-1.el7ev.noarch
vdsm-hook-vfio-mdev-4.20.33-1.el7ev.noarch
vdsm-network-4.20.33-1.el7ev.x86_64
vdsm-jsonrpc-4.20.33-1.el7ev.noarch
vdsm-hook-ethtool-options-4.20.33-1.el7ev.noarch
vdsm-common-4.20.33-1.el7ev.noarch
libvirt-daemon-config-nwfilter-3.9.0-14.el7_5.6.x86_64
libvirt-daemon-driver-storage-iscsi-3.9.0-14.el7_5.6.x86_64
libvirt-3.9.0-14.el7_5.6.x86_64
libvirt-client-3.9.0-14.el7_5.6.x86_64
libvirt-daemon-driver-storage-core-3.9.0-14.el7_5.6.x86_64
libvirt-daemon-driver-storage-3.9.0-14.el7_5.6.x86_64
libvirt-daemon-config-network-3.9.0-14.el7_5.6.x86_64
libvirt-python-3.9.0-1.el7.x86_64
libvirt-daemon-driver-nodedev-3.9.0-14.el7_5.6.x86_64
libvirt-daemon-driver-storage-rbd-3.9.0-14.el7_5.6.x86_64
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.6.x86_64
libvirt-daemon-kvm-3.9.0-14.el7_5.6.x86_64
libvirt-libs-3.9.0-14.el7_5.6.x86_64
libvirt-daemon-driver-nwfilter-3.9.0-14.el7_5.6.x86_64
libvirt-daemon-driver-secret-3.9.0-14.el7_5.6.x86_64
libvirt-lock-sanlock-3.9.0-14.el7_5.6.x86_64
libvirt-daemon-driver-storage-disk-3.9.0-14.el7_5.6.x86_64
libvirt-daemon-driver-storage-scsi-3.9.0-14.el7_5.6.x86_64
libvirt-daemon-driver-network-3.9.0-14.el7_5.6.x86_64
libvirt-daemon-driver-lxc-3.9.0-14.el7_5.6.x86_64
libvirt-daemon-3.9.0-14.el7_5.6.x86_64
libvirt-daemon-driver-interface-3.9.0-14.el7_5.6.x86_64
libvirt-daemon-driver-storage-logical-3.9.0-14.el7_5.6.x86_64
libvirt-daemon-driver-qemu-3.9.0-14.el7_5.6.x86_64
libvirt-daemon-driver-storage-mpath-3.9.0-14.el7_5.6.x86_64
ipxe-roms-qemu-20170123-1.git4e85b27.el7_4.1.noarch
qemu-img-rhev-2.10.0-21.el7_5.4.x86_64
qemu-kvm-common-rhev-2.10.0-21.el7_5.4.x86_64
qemu-kvm-rhev-2.10.0-21.el7_5.4.x86_64
Linux lynx14.lab.eng.tlv2.redhat.com 3.10.0-862.6.3.el7.x86_64 #1 SMP Fri Jun 15 17:57:37 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux

How reproducible:
Encountered once

Steps to Reproduce:
1. Put the host that is running the hosted engine VM into maintenance (local maintenance) while other hosted engine hosts with an adequate score are present in the cluster (see the sketch below).
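For reference, a minimal scripted sketch of the reproduction step, using the python-ovirt-engine-sdk4 package from the version list above. The engine URL, credentials, and host search string are placeholders, and it is assumed here that deactivating the host through the engine triggers hosted engine local maintenance as described.

import ovirtsdk4 as sdk

# Connect to the engine API (placeholder URL/credentials).
connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='secret',
    insecure=True,  # lab setup; prefer ca_file in production
)

hosts_service = connection.system_service().hosts_service()

# Find the host currently running the hosted engine VM (placeholder search).
host = hosts_service.list(search='name=lynx14*')[0]

# Deactivating the host via the engine puts it into maintenance, which is
# assumed to set hosted engine local maintenance and trigger the HE VM
# migration described in this report.
hosts_service.host_service(host.id).deactivate()

connection.close()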
Actual results:

2018-07-04 01:43:32,369+0300 ERROR (migsrc/03d8d9a4) [virt.vm] (vmId='03d8d9a4-776a-4cba-a30c-ed369a276027') Failed to migrate (migration:455)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 437, in _regular_run
    self._startUnderlyingMigration(time.time())
  File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 509, in _startUnderlyingMigration
    self._perform_with_conv_schedule(duri, muri)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 587, in _perform_with_conv_schedule
    self._perform_migration(duri, muri)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 529, in _perform_migration
    self._migration_flags)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 98, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 130, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1746, in migrateToURI3
    if ret == -1: raise libvirtError ('virDomainMigrateToURI3() failed', dom=self)
libvirtError: Domain not found: no domain with matching name 'HostedEngine'

libvirt.log:

2018-07-03 22:43:30.818+0000: 43114: info : qemuMonitorIOProcess:447 : QEMU_MONITOR_IO_PROCESS: mon=0x7f4f24020600 buf={"return": {"status": "failed"}, "id": "libvirt-1673"} len=56
2018-07-03 22:43:30.818+0000: 43114: info : virObjectUnref:350 : OBJECT_UNREF: obj=0x7f4f24020600
2018-07-03 22:43:30.818+0000: 43118: info : virObjectUnref:350 : OBJECT_UNREF: obj=0x7f4f24020600
2018-07-03 22:43:30.818+0000: 43118: error : qemuMigrationCheckJobStatus:1457 : operation failed: migration job: unexpectedly failed

agent.log:

MainThread::INFO::2018-07-04 01:43:02,908::states::243::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Score is 0 due to local maintenance mode
MainThread::INFO::2018-07-04 01:43:02,908::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state LocalMaintenanceMigrateVm (score: 0)
MainThread::INFO::2018-07-04 01:43:03,035::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (LocalMaintenanceMigrateVm-EngineMigratingAway) sent? sent
MainThread::WARNING::2018-07-04 01:43:03,035::hosted_engine::601::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_domain_monitor_if_possible) The VM is running locally or we have no data, keeping the domain monitor.
MainThread::INFO::2018-07-04 01:43:03,040::states::72::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_penalize_memory) Penalizing score by 400 due to free memory 12352 being lower than required 16384
MainThread::INFO::2018-07-04 01:43:03,040::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineMigratingAway (score: 3000)
MainThread::INFO::2018-07-04 01:43:13,062::state_machine::169::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Global metadata: {'maintenance': False}
MainThread::INFO::2018-07-04 01:43:13,062::state_machine::174::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Host lynx16.lab.eng.tlv2.redhat.com (id 2): {'conf_on_shared_storage': True, 'extra': 'metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=86119 (Wed Jul 4 01:42:59 2018)\nhost-id=2\nscore=3400\nvm_conf_refresh_time=86120 (Wed Jul 4 01:42:59 2018)\nconf_on_shared_storage=True\nmaintenance=False\nstate=EngineDown\nstopped=False\n', 'hostname': 'lynx16.lab.eng.tlv2.redhat.com', 'alive': True, 'host-id': 2, 'engine-status': {'reason': 'vm not running on this host', 'health': 'bad', 'vm': 'down', 'detail': 'unknown'}, 'score': 3400, 'stopped': False, 'maintenance': False, 'crc32': '93239882', 'local_conf_timestamp': 86120, 'host-ts': 86119}
MainThread::ERROR::2018-07-04 01:43:34,010::states::828::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Migration failed: {'status': {'message': 'Fatal error during migration', 'code': 12}, 'progress': 15}

Expected results:
Hosted engine VM migration should succeed.

Additional info:
logs
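For context, the innermost frame of the traceback above is vdsm handing the migration off to libvirt-python's migrateToURI3(). Below is a standalone sketch of that call; the destination URIs are placeholders and the flags are a typical combination, not necessarily the exact flags vdsm computed in this run.

import libvirt

conn = libvirt.open('qemu:///system')
try:
    dom = conn.lookupByName('HostedEngine')
    # muri equivalent: the migration data transport URI (placeholder host).
    params = {libvirt.VIR_MIGRATE_PARAM_URI: 'tcp://lynx16.lab.eng.tlv2.redhat.com'}
    flags = (libvirt.VIR_MIGRATE_LIVE
             | libvirt.VIR_MIGRATE_PEER2PEER
             | libvirt.VIR_MIGRATE_ABORT_ON_ERROR)
    # duri equivalent: the destination libvirtd connection URI (placeholder).
    dom.migrateToURI3('qemu+tls://lynx16.lab.eng.tlv2.redhat.com/system',
                      params, flags)
except libvirt.libvirtError as e:
    # The "Domain not found: no domain with matching name 'HostedEngine'"
    # failure above surfaces as VIR_ERR_NO_DOMAIN from this call.
    if e.get_error_code() == libvirt.VIR_ERR_NO_DOMAIN:
        print('domain disappeared during migration: %s' % e)
    else:
        raise
finally:
    conn.close()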
Platform-related bug 1550925
Martin - any guesses here? The platform bug doesn't look related, but I'm not sure whether the agent matches by UUID. Judging from the description, it looks like we shouldn't be trying to migrate at all, or the cluster should be in global maintenance.

Elad - when you say "no other hosts with appropriate score", does that mean all other hosts are also in maintenance?
Removing blocker until the questions above are answered, as this may not be a valid use case.
Yeah, the bug in comment #1 is totally unrelated.
The only relevant thing besides the migration failure is in the qemu log (in UTC):

2018-07-03 22:43:05.389+0000: initiating migration
2018-07-04 01:10:07.802+0000: initiating migration
2018-07-04 01:10:37.494+0000: shutting down, reason=migrated

We need logs from the destination host. From the looks of it on the source side, it could be a qemu crash. There is nothing interesting in the libvirt log. Is it reproducible?
I don't have these logs, and we haven't seen this since.
OK, that's too bad; we'll probably need to close this as WORKSFORME. I suppose it's not an automation blocker then?
Closing for now. Will re-open if needed.
Encountered again, re-opening.

hypervisor-lynx02/var/tmp/ovirt-logs-hypervisor/vdsm-logs/vdsm.log.11.xz:

2018-08-30 16:35:35,458+0300 ERROR (migsrc/2429cc35) [virt.vm] (vmId='2429cc35-1fbc-4c4a-8b32-5cb287c0d176') Failed to migrate (migration:455)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 437, in _regular_run
    self._startUnderlyingMigration(time.time())
  File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 509, in _startUnderlyingMigration
    self._perform_with_conv_schedule(duri, muri)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 587, in _perform_with_conv_schedule
    self._perform_migration(duri, muri)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 529, in _perform_migration
    self._migration_flags)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 98, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 130, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1746, in migrateToURI3
    if ret == -1: raise libvirtError ('virDomainMigrateToURI3() failed', dom=self)
libvirtError: Domain not found: no domain with matching name 'HostedEngine'

Logs: https://drive.google.com/file/d/1Iee7jqNnqMu8PqQp2kLbQ42VhiymL7sV/view?usp=sharing
Migration was started on lynx02:

2018-08-30 16:35:16,755+0300 INFO (migsrc/2429cc35) [virt.vm] (vmId='2429cc35-1fbc-4c4a-8b32-5cb287c0d176') starting migration to qemu+tls://lynx01...

It was proceeding fine on the destination, lynx01, until that host initiated a destroy of the VM:

2018-08-30 16:35:33,437+0300 INFO (jsonrpc/0) [api.virt] START destroy(gracefulAttempts=1) from=::1,36190, vmId=2429cc35-1fbc-4c4a-8b32-5cb287c0d176 (api:46)

which corresponds to the migration failure seen on the source. It seems the VM was killed by the HE agent on lynx01 because of:

MainThread::INFO::2018-08-30 16:35:23,761::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineDown (score: 3400)
MainThread::INFO::2018-08-30 16:35:32,784::states::510::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine down and local host has best score (3400), attempting to start engine VM
MainThread::INFO::2018-08-30 16:35:32,931::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineDown-EngineStart) sent? sent
MainThread::INFO::2018-08-30 16:35:33,385::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStart (score: 3400)
MainThread::INFO::2018-08-30 16:35:33,420::hosted_engine::968::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_clean_vdsm_state) Ensuring VDSM state is clear for engine VM
MainThread::INFO::2018-08-30 16:35:33,432::hosted_engine::988::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_clean_vdsm_state) Cleaning state for non-running VM
MainThread::INFO::2018-08-30 16:35:36,516::hosted_engine::980::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_clean_vdsm_state) Vdsm state for VM clean
MainThread::INFO::2018-08-30 16:35:36,516::hosted_engine::929::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm) Starting vm using `/usr/sbin/hosted-engine --vm-start`

So ultimately this seems to be a fault of the HE agent, which mistakenly thinks the HE VM is not running while it is actually migrating in. In this case the outcome may look acceptable, since the migration was simply aborted and the HE VM continued to run on the source host, but there is certainly a window for a race: a perfectly healthy HE VM could be killed if the decision to clean up comes right before the migration succeeds.
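To make the race concrete, here is a purely hypothetical sketch (not the actual agent code) of the kind of guard that would avoid it: before cleaning VDSM state and starting the VM, check whether the HE domain already exists locally as an incoming migration target.

import libvirt

def incoming_migration_in_progress(conn, name='HostedEngine'):
    """Return True if the domain exists locally and looks like a migration target."""
    try:
        dom = conn.lookupByName(name)
    except libvirt.libvirtError as e:
        if e.get_error_code() == libvirt.VIR_ERR_NO_DOMAIN:
            return False  # genuinely not present on this host
        raise
    state, reason = dom.state()
    # While migrating in, the target domain is typically paused with
    # reason VIR_DOMAIN_PAUSED_MIGRATION until the handover completes.
    return (state == libvirt.VIR_DOMAIN_PAUSED
            and reason == libvirt.VIR_DOMAIN_PAUSED_MIGRATION)

conn = libvirt.open('qemu:///system')
if incoming_migration_in_progress(conn):
    print('HE VM is migrating in; skip cleanup/start')  # avoids the destroy above
else:
    print('safe to clean state and start the engine VM')
conn.close()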
Elad, are you able to reproduce?
This has occurred at least twice in our automation since it was reported (both incidents are detailed in this bug).
Does this reproduce in the latest 4.2.8 build?
Not that I'm aware of. But if no fix was applied, I'm afraid the bug still exists and it's only a matter of time before we see it again, just as we did in comment #10.
Postponing to 4.3.0, as this is not identified as a blocker for 4.2.8.
Moving to 4.3.2, as this is not identified as a blocker for 4.3.1.
Closing as a duplicate of bug #1665934; it seems to have the same root cause.

*** This bug has been marked as a duplicate of bug 1665934 ***