Description of problem:
If the vdsm service on the destination hypervisor of a VM migration experiences an issue during the migration, the migration task remains active even after the hypervisor gets fenced by the engine. This was tested by initiating a VM migration and issuing 'service vdsmd stop' on the destination hypervisor. The engine soft-fenced the hypervisor, but the migration task remained active for 6 hours, at which point the migration finished successfully.

Version-Release number of selected component (if applicable):
rhevm-3.3.2-0.50
vdsm-4.13.2-0.13

How reproducible:
I have not had a 6-hour window in which to wait for the migration task to complete or clear. However, I can easily reproduce the migration task remaining active and engine.log being spammed with the messages to follow when vdsmd is killed on the destination hypervisor.

Steps to Reproduce:
1. Start a migration
2. Stop vdsmd on the destination hypervisor

Actual results:
VM goes into an "Unknown" state but is accessible. Migration task remains active for a very long time.

Expected results:
Engine should fail the migration once the problem with vdsm on the destination is detected, and the VM should return to a normal state on the source hypervisor.
taking the bug
One confirmed issue is that VDSM can go out of sync if it is restarted, or is down for whatever reason, when a migration completes. The sequence of events is:
- migration is in progress
- VDSM goes down
- migration completes -> the VM is UP on the dst host according to libvirt!
- VDSM comes back up, does recovery, and possibly does not properly recognize what happened in the meantime

In that case VDSM will diligently wait for the full migration timeout to expire before reporting the VM as UP; the default value for the timeout is 21600s, i.e. 6h. I'll make a patch to make sure VDSM handles this case correctly.
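To illustrate the idea behind the fix, here is a minimal sketch in Python. It is not the actual VDSM code or the merged patch: DomainState, status_after_recovery and the returned labels are hypothetical names invented for this example, and the hypervisor state is passed in as a plain value instead of being queried from libvirt. It only shows the ordering of checks: ask what the hypervisor already knows first, and keep waiting on the migration timeout only when the migration is genuinely unfinished.

```python
from enum import Enum

# Default migration timeout quoted in this report: 21600 s, i.e. 6 hours.
MIGRATION_TIMEOUT_S = 21600


class DomainState(Enum):
    """Simplified, hypothetical stand-in for the domain state seen by the hypervisor."""
    PAUSED = "paused"    # incoming migration still in flight
    RUNNING = "running"  # migration finished, guest is executing
    SHUTOFF = "shutoff"  # guest is not running on this host


def status_after_recovery(domain_state, seconds_waited):
    """Decide what a recovering agent could report for an incoming VM.

    Hypothetical helper: the labels returned here are illustrative only
    and are not the status strings used by VDSM.
    """
    if domain_state is DomainState.RUNNING:
        # The migration completed while the agent was down: report it now.
        return "up"
    if domain_state is DomainState.SHUTOFF:
        return "down"
    if seconds_waited >= MIGRATION_TIMEOUT_S:
        # Still unfinished after the full timeout.
        return "timed_out"
    return "waiting"


if __name__ == "__main__":
    # Agent restarted one minute after the migration had already completed.
    print(status_after_recovery(DomainState.RUNNING, 60))
    # Agent restarted while the migration is still in flight.
    print(status_after_recovery(DomainState.PAUSED, 60))
```

With this ordering, a VM that the hypervisor already reports as running is reported as up on the first check after recovery, instead of after the full 21600s wait described above.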
posted tentative patch. Needs careful testing, in progress.
Jake, After deeper investigation I think I narrowed down the issue, and your last report confirms that this is also a matter of a specific -and unfortunate- sequence of events. The logs are no longer required, thanks.
Easier way to reproduce and test:
- start migration;
- stop VDSM on dst host; migration will continue to run as long as libvirt and qemu are up and running
- once migration is done, restart VDSM on dst host
- now the VM should be in unknown state for the said 6 hours despite being actually up and running.
@Michal, this bug doesn't have a DEV ack yet. QE will ack/nack based on the target release and time frames, per the regular Bugzilla workflow.
@Gil, the question is more about 3.5 vs 3.4 vs 3.3 considerations. The missing dev_ack is due to me not agreeing with backports to 3.3 or 3.4. I'm fine with a 3.5 fix (adding back original needinfo on dave)
Patches merged to ovirt 3.5 (see http://gerrit.ovirt.org/#/c/31671/ and its deps); they will be included in the next RC. Moving to MODIFIED.
Verified on rhevm-3.5.0-0.10.master.el6ev.noarch. Instead of stopping vdsm I stopped the network (because of Soft Fencing); the migration failed and the VM stayed on the source host.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2015-0159.html