Bug 1413847 - Live migration failure not detected by OVirt
Summary: Live migration failure not detected by OVirt
Keywords:
Status: CLOSED DUPLICATE of bug 1414626
Alias: None
Product: vdsm
Classification: oVirt
Component: General
Version: 4.18.15.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Dan Kenigsberg
QA Contact: meital avital
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2017-01-17 06:38 UTC by Markus Stockhausen
Modified: 2017-01-25 10:21 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-01-25 10:21:43 UTC
oVirt Team: Virt


Attachments (Terms of Use)
VDSM target (648.53 KB, application/zip)
2017-01-17 06:39 UTC, Markus Stockhausen
VDSM source (217.74 KB, application/zip)
2017-01-17 06:40 UTC, Markus Stockhausen

Description Markus Stockhausen 2017-01-17 06:38:31 UTC
Description of problem:

Live migration between hosts of the same cluster fails. The source is a CentOS 7.3 node; the target is a CentOS 7.2 node.

Version-Release number of selected component (if applicable):

CentOS 7.2 host:
- libvirt 1.2.17-13.el7_2.6
- qemu 2.3.0-31.el7.21.1

CentOS 7.3 host:
- libvirt 2.0.0-10.el7_3.2
- qemu 2.6.0-27.1.el7

oVirt engine:
- ovirt 4.0.6

How reproducible:

100%

Steps to Reproduce:
1. Start a VM on the CentOS 7.3 node.
2. Live migrate the VM to the CentOS 7.2 node.

Actual results:

The migration does not finish. It was cancelled after 6 hours.

Expected results:

The migration succeeds.

Additional info:

Logs of the source vdsm and the target vdsm are attached. The live-migrated VM is colvm60. Processing started at ~20:15:22.

Comment 1 Markus Stockhausen 2017-01-17 06:39:41 UTC
Created attachment 1241559 [details]
VDSM target

Comment 2 Markus Stockhausen 2017-01-17 06:40:17 UTC
Created attachment 1241560 [details]
VDSM source

Comment 3 Tomas Jelinek 2017-01-18 06:55:14 UTC
So, what happens is that the downtime thread fails right at the beginning of the migration because of:
KeyError: 'memory_bps'
So the migration then keeps going with the minimal downtime, which is not enough to finish the migration successfully, and then it is cancelled. The strange thing is why libvirt did not return memory_bps...

I would guess the issue is that the monitor thread started before the migration actually started, so the data returned by libvirt was not there yet.

@Markus: is this happening all the time, or was this a one-time issue? Is it happening with all VMs or only with this one?

Comment 4 Markus Stockhausen 2017-01-18 12:24:23 UTC
Failure rate is 100%. 5/5 migrations stalled because of this error.

I'm raising the severity to high as it affects core oVirt features for 7.3 nodes.

Comment 5 Markus Stockhausen 2017-01-18 20:36:53 UTC
The reason for the bug was a faulty network card with a high packet drop rate. After replacing it, everything works flawlessly.

Nevertheless, oVirt should detect and report the issue in the WebUI.

Comment 6 Tomas Jelinek 2017-01-25 10:21:43 UTC
OK, this is actually a subset of bug 1414626, so I am marking it as a duplicate.

The root cause of this one is that when the migration stats are missing some value, reading them throws an exception that turns the monitor and downtime threads off, letting the migration run unsupervised for a couple of hours.
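As an illustration only (this is not the vdsm code and not the fix that went into bug 1414626), the failure mode described here can be avoided by validating each stats sample before using it, so that a missing key such as the memory_bps from comment 3 skips one poll instead of ending the thread. A minimal sketch, assuming a hypothetical get_stats callable that returns a dict, and hypothetical on_sample and on_failure callbacks:

```python
import logging
import threading

log = logging.getLogger("migration_monitor_sketch")

# Keys the monitor needs in every sample. "memory_bps" is the key from the
# KeyError in comment 3; a real list would depend on the monitor's logic.
REQUIRED_KEYS = ("memory_bps",)


def poll_once(get_stats, on_sample):
    """Run one monitoring iteration; return True if a sample was processed."""
    try:
        stats = get_stats()
    except Exception:
        log.exception("fetching migration stats failed; will retry")
        return False
    missing = [key for key in REQUIRED_KEYS if key not in stats]
    if missing:
        # Incomplete sample: log it and skip, rather than raising KeyError.
        log.warning("migration stats missing %s; skipping sample", missing)
        return False
    on_sample(stats)
    return True


def monitor(get_stats, on_sample, on_failure, stop_event,
            interval=1.0, max_skipped=30):
    """Poll until stop_event is set; report if stats stay unusable."""
    skipped = 0
    while not stop_event.is_set():
        if poll_once(get_stats, on_sample):
            skipped = 0
        else:
            skipped += 1
            if skipped >= max_skipped:
                # Surface the problem instead of silently going quiet.
                on_failure(skipped)
                return
        stop_event.wait(interval)
```

With this shape, a single incomplete sample costs one polling interval, while a persistent lack of usable stats is handed to on_failure, where a caller could abort the migration or raise an event for the UI.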

*** This bug has been marked as a duplicate of bug 1414626 ***

