Bug 1104733 - VDSM failure on migration destination causes stuck migration task
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.3.0
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.5.0
Assignee: Francesco Romani
QA Contact: Artyom
URL:
Whiteboard: virt
Depends On:
Blocks: rhev3.5beta 1156165
 
Reported: 2014-06-04 14:48 UTC by Jake Hunsaker
Modified: 2019-06-13 08:01 UTC
CC: 14 users

Fixed In Version: vt2.2
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-02-11 21:11:23 UTC
oVirt Team: ---
Target Upstream Version:
Embargoed:




Links
Red Hat Product Errata RHBA-2015:0159 (SHIPPED_LIVE): vdsm 3.5.0 - bug fix and enhancement update, last updated 2015-02-12 01:35:58 UTC
oVirt gerrit 28511 (master, MERGED): vm: detect migration completed on recovery
oVirt gerrit 31669 (ovirt-3.5, MERGED): vm: split migration completion in smaller methods
oVirt gerrit 31670 (ovirt-3.5, MERGED): vm: ensure valid Vm._dom before domDependentInit
oVirt gerrit 31671 (ovirt-3.5, MERGED): vm: detect migration completed on recovery

Description Jake Hunsaker 2014-06-04 14:48:23 UTC
Description of problem:

If the vdsm service on the destination hypervisor of a VM migration experiences an issue during the migration, the migration task remains active even after the hypervisor is fenced by the engine.

This was tested by initiating a VM migration and issuing 'service vdsmd stop' on the destination hypervisor. The engine soft-fenced the hypervisor, but the migration task remained active for 6 hours, at which point the migration finished successfully.

Version-Release number of selected component (if applicable):

rhevm-3.3.2-0.50
vdsm-4.13.2-0.13

How reproducible:

I have not had a 6-hour window in which to wait for the migration task to complete/clear; however, I can easily reproduce the fact that the migration task remains active and that engine.log gets spammed with the messages that follow when vdsmd is killed on the destination hypervisor.

Steps to Reproduce:
1. Start a migration
2. Stop vdsmd on the destination hypervisor
3.

Actual results:

The VM goes into an "Unknown" state but remains accessible. The migration task stays active for a very long time.

Expected results:

The engine should fail the migration once the problem with vdsm on the destination is detected, and the VM should return to a normal state on the source hypervisor.

Comment 5 Francesco Romani 2014-06-06 13:31:47 UTC
taking the bug

Comment 7 Francesco Romani 2014-06-09 12:58:15 UTC
One confirmed issue is that VDSM can go out of sync if it is restarted, or is down for whatever reason, when a migration completes.

The events sequence is:
- migration is in progress
- VDSM goes down
- migration completes -> the VM is UP on the dst host according to libvirt!
- VDSM comes back up, does recovery, and possibly does not properly recognize what happened in the meantime

In that case VDSM will diligently wait for the full migration timeout to expire before reporting the VM as UP; the default value for the timeout is 21600s, i.e. 6 hours.

I'll make a patch to make sure VDSM handles this case correctly.
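
A minimal sketch of the detection idea, assuming the libvirt-python bindings; the function name, return values and vm_uuid parameter are illustrative and not the actual vdsm code:

    # Illustrative sketch only, not the vdsm patch itself (assumes libvirt-python).
    import libvirt

    MIGRATION_TIMEOUT = 21600  # default destination-side migration timeout, in seconds (6h)

    def recover_incoming_migration(vm_uuid):
        """On VDSM restart, check whether an incoming migration already finished
        while VDSM was down, instead of blindly waiting for the timeout."""
        conn = libvirt.open('qemu:///system')
        try:
            dom = conn.lookupByUUIDString(vm_uuid)
        except libvirt.libvirtError:
            # Domain not known to libvirt on this host: the migration did not complete here.
            return 'waiting'

        state, _reason = dom.state()
        if state == libvirt.VIR_DOMAIN_RUNNING:
            # libvirt is already running the guest: the migration completed while
            # VDSM was down, so the VM can be reported as Up right away.
            return 'up'

        # Otherwise fall back to the normal wait-for-migration path,
        # bounded by MIGRATION_TIMEOUT.
        return 'waiting'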

Comment 8 Francesco Romani 2014-06-09 14:38:07 UTC
Posted a tentative patch. Needs careful testing, which is in progress.

Comment 10 Francesco Romani 2014-06-09 14:54:10 UTC
Jake,

After deeper investigation I think I have narrowed down the issue, and your last report confirms that this is also a matter of a specific (and unfortunate) sequence of events. The logs are no longer required, thanks.

Comment 14 Francesco Romani 2014-06-11 08:43:21 UTC
An easier way to reproduce and test:

- start a migration;
- stop VDSM on the dst host; the migration will continue to run as long as libvirt and qemu are up and running
- once the migration is done, restart VDSM on the dst host
- now the VM should stay in the Unknown state for the aforementioned 6 hours despite actually being up and running (the libvirt check sketched below confirms this on the dst host).
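
As a quick way to confirm the mismatch, something like the following can be run on the dst host; this is a sketch assuming libvirt-python, and 'my_migrated_vm' is a placeholder name, not taken from this bug:

    # Check on the dst host whether libvirt is already running the guest
    # even though the engine still reports it as Unknown.
    # Assumes libvirt-python; 'my_migrated_vm' is a placeholder domain name.
    import libvirt

    conn = libvirt.open('qemu:///system')
    dom = conn.lookupByName('my_migrated_vm')
    state, _reason = dom.state()
    print('running' if state == libvirt.VIR_DOMAIN_RUNNING else 'not running')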

Comment 24 Gil Klein 2014-08-26 10:37:36 UTC
@Michal, this bug doesn't have a DEV ack yet. QE will ack/nack based on the target release and time frames, in the regular Bugzilla workflow.

Comment 25 Michal Skrivanek 2014-08-26 10:55:31 UTC
@Gil, the question is more about 3.5 vs 3.4 vs 3.3 considerations.
The missing dev_ack is due to me not agreeing with backports to 3.3 or 3.4; I'm fine with a 3.5 fix.
(adding back the original needinfo on Dave)

Comment 26 Francesco Romani 2014-08-27 13:10:31 UTC
Patches merged to oVirt 3.5 (see http://gerrit.ovirt.org/#/c/31671/ and its deps); they will be included in the next RC. Moving to MODIFIED.

Comment 29 Artyom 2014-09-07 12:09:16 UTC
Verified on rhevm-3.5.0-0.10.master.el6ev.noarch
Instead of stopping vdsm I stopped the network (because of soft fencing); the migration failed and the VM stayed on the source host.

Comment 37 errata-xmlrpc 2015-02-11 21:11:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0159.html

