1104733 – VDSM failure on migration destination causes stuck migration task

Summary: VDSM failure on migration destination causes stuck migration task

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Virtualization Manager
Classification:	Red Hat
Component:	vdsm
Sub Component:
Version:	3.3.0
Hardware:	x86_64
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	3.5.0
Assignee:	Francesco Romani
QA Contact:	Artyom
Docs Contact:
URL:
Whiteboard:	virt
Depends On:
Blocks:	rhev3.5beta 1156165

Reported:	2014-06-04 14:48 UTC by Jake Hunsaker
Modified:	2019-06-13 08:01 UTC
CC List:	14 users
Fixed In Version:	vt2.2
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2015-02-11 21:11:23 UTC
oVirt Team:	---
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2015:0159	normal	SHIPPED_LIVE	vdsm 3.5.0 - bug fix and enhancement update	2015-02-12 01:35:58 UTC
oVirt gerrit	28511	master	MERGED	vm: detect migration completed on recovery	Never
oVirt gerrit	31669	ovirt-3.5	MERGED	vm: split migration completion in smaller methods	Never
oVirt gerrit	31670	ovirt-3.5	MERGED	vm: ensure valid Vm._dom before domDependentInit	Never
oVirt gerrit	31671	ovirt-3.5	MERGED	vm: detect migration completed on recovery	Never

Description Jake Hunsaker 2014-06-04 14:48:23 UTC

Description of problem:

If the vdsm service on the destination hypervisor of a VM migration fails during the migration, the migration task remains active even after the engine fences the hypervisor.

This was tested by initiating a VM migration and issuing 'service vdsmd stop' on the destination hypervisor. The engine soft-fenced the hypervisor, but the migration task remained active for 6 hours, at which point the migration finished successfully.

Version-Release number of selected component (if applicable):

rhevm-3.3.2-0.50
vdsm-4.13.2-0.13

How reproducible:

I have not had a 6-hour window in which to wait for the migration task to complete or clear. However, I can easily reproduce the fact that the migration task remains active and engine.log gets spammed with the messages to follow when vdsmd is killed on the destination hypervisor.

Steps to Reproduce:
1. Start a migration
2. Stop vdsmd on the destination hypervisor

Actual results:

VM goes into an "Unknown" state but is accessible. Migration task remains active for a very long time.

Expected results:

The engine should fail the migration once the problem with vdsm on the destination is detected, and the VM should return to a normal state on the source hypervisor.

Comment 5 Francesco Romani 2014-06-06 13:31:47 UTC

taking the bug

Comment 7 Francesco Romani 2014-06-09 12:58:15 UTC

One confirmed issue is that VDSM can go out of sync if it is restarted, or down for whatever reason, when the migration completes.

The sequence of events is:
- migration is in progress
- VDSM goes down
- migration completes -> the VM is UP on the dst host according to libvirt!
- VDSM comes back up, does recovery, and possibly does not properly recognize what happened in the meantime

In that case VDSM will diligently wait for the full migration timeout to expire before reporting the VM as UP; the default value for the timeout is 21600s, i.e. 6h.
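
As a quick check of that figure, the conversion can be written out. This is only an illustration: the constant name is hypothetical and is not claimed to be the VDSM configuration key; only the value comes from the comment above.

```python
from datetime import timedelta

# Hypothetical name; the value is the default timeout quoted above.
MIGRATION_TIMEOUT_SECONDS = 21600

timeout = timedelta(seconds=MIGRATION_TIMEOUT_SECONDS)
hours = timeout.total_seconds() / 3600

# 21600 s / 3600 s per hour = 6 h, matching the 6 hours seen in the report.
print(f"{MIGRATION_TIMEOUT_SECONDS}s = {hours:g}h")
```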

I'll make a patch to make sure VDSM handles this case correctly.

Comment 8 Francesco Romani 2014-06-09 14:38:07 UTC

posted tentative patch. Needs careful testing, in progress.

Comment 10 Francesco Romani 2014-06-09 14:54:10 UTC

Jake,

After deeper investigation I think I narrowed down the issue, and your last report confirms that this is also a matter of a specific (and unfortunate) sequence of events. The logs are no longer required, thanks.

Comment 14 Francesco Romani 2014-06-11 08:43:21 UTC

easier way to reproduce and test:

- start migration;
- stop VDSM on dst host; migration will continue to run as long as libvirt and qemu are up and running
- once migration is done, restart VDSM on dst host
- the VM should now stay in the Unknown state for the said 6 hours despite actually being up and running.
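
The scenario in these steps can be modelled with a small sketch. It is not the merged patch: the class, the function, and the `detect_on_recovery` flag are hypothetical names, and it assumes that a running libvirt domain on the destination means the incoming migration has finished. The two state constants are defined locally with the values of libvirt's virDomainState enum so the snippet runs without libvirt installed.

```python
from dataclasses import dataclass

# Local copies of two libvirt virDomainState values (RUNNING=1, PAUSED=3).
VIR_DOMAIN_RUNNING = 1
VIR_DOMAIN_PAUSED = 3

MIGRATION_TIMEOUT_SECONDS = 21600  # default timeout quoted in comment 7


@dataclass
class RecoveredVm:
    """Hypothetical snapshot of what recovery knows about a VM."""
    was_migration_destination: bool  # VM was an incoming migration before VDSM stopped
    libvirt_state: int               # domain state libvirt reports after the restart


def max_seconds_in_unknown(vm, detect_on_recovery):
    """Upper bound on how long the VM is reported as Unknown after recovery."""
    if not vm.was_migration_destination:
        return 0
    # Assumption: a running domain on the destination means migration is done.
    finished = vm.libvirt_state == VIR_DOMAIN_RUNNING
    if detect_on_recovery and finished:
        return 0
    # Otherwise the VM is only reported UP once the full timeout expires.
    return MIGRATION_TIMEOUT_SECONDS


vm = RecoveredVm(was_migration_destination=True, libvirt_state=VIR_DOMAIN_RUNNING)
print(max_seconds_in_unknown(vm, detect_on_recovery=False))
print(max_seconds_in_unknown(vm, detect_on_recovery=True))
```

With the check disabled the model reproduces the 6-hour wait; with it enabled the VM is reported as up immediately, which is the behaviour the test steps are meant to confirm.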

Comment 24 Gil Klein 2014-08-26 10:37:36 UTC

@Michal, this bug doesn't have a DEV ack yet. QE will ack/nack based on the target release and time frames, in the regular Bugzilla workflow.

Comment 25 Michal Skrivanek 2014-08-26 10:55:31 UTC

@Gil, the question is more about 3.5 vs 3.4 vs 3.3 considerations.
The missing dev_ack is due to me not agreeing with backports to 3.3 or 3.4. I'm fine with a 3.5 fix
(adding back original needinfo on dave)

Comment 26 Francesco Romani 2014-08-27 13:10:31 UTC

Patches merged to ovirt 3.5 (see http://gerrit.ovirt.org/#/c/31671/ and its deps), will be included in the next RC, moving to MODIFIED

Comment 29 Artyom 2014-09-07 12:09:16 UTC

Verified on rhevm-3.5.0-0.10.master.el6ev.noarch
Instead of stopping vdsm, I stopped the network (because of Soft Fencing); the migration failed and the VM stayed on the source host.

Comment 37 errata-xmlrpc 2015-02-11 21:11:23 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0159.html
