Created attachment 1360964 [details]
logs from engine and from source host

Description of problem:
If migration of the HE VM fails because of a timeout, the source host is stuck in the HE state "EngineMigratingAway".

Example of the timeout traceback in vdsm.log:

2017-11-30 16:12:19,020+0200 ERROR (migsrc/30e333df) [virt.vm] (vmId='30e333df-79e9-4749-af8a-37e3c68ddce5') Failed to migrate (migration:455)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 437, in _regular_run
    self._startUnderlyingMigration(time.time())
  File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 510, in _startUnderlyingMigration
    self._perform_with_downtime_thread(duri, muri)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 579, in _perform_with_downtime_thread
    self._perform_migration(duri, muri)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 528, in _perform_migration
    self._migration_flags)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 98, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 125, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 586, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1679, in migrateToURI3
    if ret == -1: raise libvirtError ('virDomainMigrateToURI3() failed', dom=self)
libvirtError: operation aborted: migration job: canceled by client

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha-2.2.0-0.2.master.gitcbe3c76.el7ev.noarch

How reproducible:
Always

Steps to Reproduce:
1. Configure an HE environment with at least two hosts
2. Put the host running the HE VM into maintenance
3.

Actual results:
If the migration takes too long, VDSM cancels it, but the host keeps the HE state "EngineMigratingAway".

Expected results:
I believe that when VDSM cancels the HE VM migration, the host HE state must be changed back to EngineUp.

Additional info:
The bug exists in both 4.1 and 4.2, but it is more critical in 4.1. In 4.2 you can enable global maintenance and it will reset the host state, but in 4.1 that does not work (restarting the HE VM helped me in this case). See the illustrative sketch of the expected state transition below.
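To make the expected behavior concrete, here is a minimal, hypothetical Python sketch of the transition logic described above. All names in it (MigrationStatus, check_migration_status, the state classes' structure) are illustrative assumptions and do not claim to match the actual ovirt-hosted-engine-ha state machine; only the state names EngineMigratingAway/EngineUp/EngineDown come from the report itself.

# Hypothetical sketch, NOT the real ovirt-hosted-engine-ha code.

class MigrationStatus:
    IN_PROGRESS = "in_progress"
    DONE = "done"
    FAILED = "failed"  # covers VDSM aborting the job, e.g. on timeout


class EngineMigratingAway:
    """HE agent state while the engine VM migrates to another host."""

    def __init__(self, vdsm_client, vm_id):
        # vdsm_client and check_migration_status() are assumed helpers
        # for querying VDSM about the migration job.
        self._vdsm = vdsm_client
        self._vm_id = vm_id

    def consume(self):
        """Poll migration progress and return the next agent state."""
        status = self._vdsm.check_migration_status(self._vm_id)
        if status == MigrationStatus.DONE:
            # VM now runs on the destination host.
            return EngineDown()
        if status == MigrationStatus.FAILED:
            # The fix requested in this report: when VDSM cancels the
            # migration ("operation aborted: migration job: canceled
            # by client"), the VM is still running locally, so fall
            # back to EngineUp instead of staying here forever.
            return EngineUp()
        return self  # migration still in progress, stay put


class EngineUp:
    """HE agent state: engine VM runs locally and is healthy."""


class EngineDown:
    """HE agent state: engine VM is not running on this host."""

In the buggy behavior the FAILED branch is effectively never taken, so the source host keeps reporting "EngineMigratingAway" until the VM is restarted or, on 4.2, global maintenance is toggled.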
I've migrated back and forth at least 10 times between a pair of ha-hosts and have not seen this bug reproduced, hence moving to verified.

Works for me with these components on the host:
rhvm-appliance-4.2-20180202.0.el7.noarch
ovirt-hosted-engine-ha-2.2.4-1.el7ev.noarch
ovirt-hosted-engine-setup-2.2.9-1.el7ev.noarch
Linux 3.10.0-693.el7.x86_64 #1 SMP Thu Jul 6 19:56:57 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux

On the engine:
ovirt-engine-setup-4.2.1.5-0.1.el7.noarch
Linux 3.10.0-693.19.1.el7.x86_64 #1 SMP Thu Feb 1 12:34:44 EST 2018 x86_64 x86_64 x86_64 GNU/Linux
This bugzilla is included in the oVirt 4.2.1 release, published on Feb 12th 2018.

Since the problem described in this bug report should be resolved in the oVirt 4.2.1 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.