Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1104733

Summary:	VDSM failure on migration destination causes stuck migration task
Product:	Red Hat Enterprise Virtualization Manager	Reporter:	Jake Hunsaker <jhunsaker>
Component:	vdsm	Assignee:	Francesco Romani <fromani>
Status:	CLOSED ERRATA	QA Contact:	Artyom <alukiano>
Severity:	high	Docs Contact:
Priority:	high
Version:	3.3.0	CC:	bazulay, dsulliva, fromani, gklein, iheim, jhunsaker, lpeer, mavital, michal.skrivanek, mkalinin, ofrenkel, sherold, vfeenstr, yeylon
Target Milestone:	---	Keywords:	UseCase
Target Release:	3.5.0
Hardware:	x86_64
OS:	Linux
Whiteboard:	virt
Fixed In Version:	vt2.2	Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2015-02-11 21:11:23 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1142923, 1156165

Description Jake Hunsaker 2014-06-04 14:48:23 UTC

Description of problem:

If the vdsm service on the destination hypervisor of a VM migration experiences an issue during the migration, the migration task remains active even after the hypervisor gets fenced by the engine.

This was tested by initiating a VM migration and issuing a 'service vdsmd stop' on the destination hypervisor. The engine soft-fenced the hypervisor, but the migration task remained active for 6 hours, at which point the migration finished successfully.

Version-Release number of selected component (if applicable):

rhevm-3.3.2-0.50
vdsm-4.13.2-0.13

How reproducible:

I have not had a 6-hour window in which to wait for the migration task to complete or clear; however, I can easily reproduce the fact that the migration task remains active and engine.log gets spammed with the messages to follow when vdsmd is killed on the destination hypervisor.

Steps to Reproduce:
1. Start a migration
2. Stop vdsmd on the destination hypervisor

Actual results:

VM goes into an "Unknown" state but is accessible. Migration task remains active for a very long time.

Expected results:

Engine should fail the migration once the problem with vdsm on the destination is detected, and the VM should return to a normal state on the source hypervisor

Comment 5 Francesco Romani 2014-06-06 13:31:47 UTC

taking the bug

Comment 7 Francesco Romani 2014-06-09 12:58:15 UTC

One confirmed issue is that VDSM can go out of sync if it is restarted, or is down for whatever reason, when the migration completes.

The sequence of events is:
- migration is in progress
- VDSM goes down
- migration completes -> the VM is UP on the dst host according to libvirt!
- VDSM comes back up, does recovery and possibly does not properly recognize what happened in the meantime

In that case VDSM will diligently wait for the full migration timeout to expire before reporting the VM as UP; the default value for the timeout is 21600s, so 6h.

I'll make a patch to make sure VDSM handles this case correctly.
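The timeout behaviour described in this comment can be sketched as a small model. This is an illustrative sketch only, not VDSM code: the names `VmStatus`, `status_waiting_for_timeout` and `status_checking_hypervisor` are hypothetical, and the second function only shows one plausible shape of a fix (trusting the hypervisor when the domain is already running), not the patch that was actually merged.

```python
from enum import Enum

# Default migration timeout quoted in this comment: 21600 s, i.e. 6 hours.
MIGRATION_TIMEOUT_S = 21600


class VmStatus(Enum):
    """Hypothetical status names used only for this sketch."""
    WAITING = "waiting for incoming migration"
    UP = "up"


def status_waiting_for_timeout(elapsed_s, timeout_s=MIGRATION_TIMEOUT_S):
    """Model of the reported behaviour: only the timeout ends the wait."""
    if elapsed_s >= timeout_s:
        return VmStatus.UP
    return VmStatus.WAITING


def status_checking_hypervisor(domain_running, elapsed_s,
                               timeout_s=MIGRATION_TIMEOUT_S):
    """Hypothetical alternative: if the domain already runs, report it as up."""
    if domain_running:
        return VmStatus.UP
    # Otherwise fall back to the timeout-based wait.
    return status_waiting_for_timeout(elapsed_s, timeout_s)
```

Under these assumptions, a VM whose migration finished while VDSM was down stays in the waiting state for the whole six hours in the first model, while the second model reports it as up as soon as recovery sees a running domain.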

Comment 8 Francesco Romani 2014-06-09 14:38:07 UTC

posted tentative patch. Needs careful testing, in progress.

Comment 10 Francesco Romani 2014-06-09 14:54:10 UTC

Jake,

After deeper investigation I think I narrowed down the issue, and your last report confirms that this is also a matter of a specific -and unfortunate- sequence of events. The logs are no longer required, thanks.

Comment 14 Francesco Romani 2014-06-11 08:43:21 UTC

easier way to reproduce and test:

- start migration;
- stop VDSM on dst host; migration will continue to run as long as libvirt and qemu are up and running
- once migration is done, restart VDSM on dst host
- now the VM should be in unknown state for the said 6 hours despite being actually up and running.
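
When testing the steps above, it can help to measure how long the VM stays in the unknown state. The helper below is a minimal sketch under stated assumptions: `wait_until_not_unknown` and its `get_status` argument are hypothetical, and `get_status` stands for whatever callable the tester uses to read the VM status as a string. It does not call any real engine or VDSM API.

```python
import time


def wait_until_not_unknown(get_status, deadline_s, poll_s=10.0,
                           clock=time.monotonic, sleep=time.sleep):
    """Poll get_status() until it differs from "Unknown" or deadline_s elapses.

    get_status is a hypothetical, tester-supplied callable returning the
    VM status as a string. Returns a (status, waited_seconds) tuple.
    """
    start = clock()
    while True:
        status = get_status()
        waited = clock() - start
        # Stop when the status has changed or the deadline has been reached.
        if status != "Unknown" or waited >= deadline_s:
            return status, waited
        sleep(poll_s)
```

The `clock` and `sleep` parameters are injectable so the helper can be exercised without really waiting; with the defaults it polls every 10 seconds in real time.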

Comment 24 Gil Klein 2014-08-26 10:37:36 UTC

@Michal, this bug doesn't have a DEV ack yet. QE will ack/nack based on the target release and time frames, in the regular Bugzilla workflow.

Comment 25 Michal Skrivanek 2014-08-26 10:55:31 UTC

@Gil, the question is more about 3.5 vs 3.4 vs 3.3 considerations.
The missing dev_ack is due to me not agreeing with backports to 3.3 or 3.4. I'm fine with a 3.5 fix
(adding back original needinfo on dave)

Comment 26 Francesco Romani 2014-08-27 13:10:31 UTC

Patches merged to ovirt 3.5 (see http://gerrit.ovirt.org/#/c/31671/ and its deps), will be included in the next RC, moving to MODIFIED

Comment 29 Artyom 2014-09-07 12:09:16 UTC

Verified on rhevm-3.5.0-0.10.master.el6ev.noarch
Instead of stopping vdsm I stopped the network (because of Soft Fencing); the migration failed and the VM stayed on the source host.

Comment 37 errata-xmlrpc 2015-02-11 21:11:23 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0159.html