Bug 1413847

Summary: Live migration failure not detected by oVirt
Product: [oVirt] vdsm
Reporter: Markus Stockhausen <mst>
Component: General
Assignee: Dan Kenigsberg <danken>
Status: CLOSED DUPLICATE
QA Contact: meital avital <mavital>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 4.18.15.2
CC: bugs, mst, tjelinek
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-01-25 10:21:43 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Virt
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description    Flags
VDSM target    none
VDSM source    none

Description Markus Stockhausen 2017-01-17 06:38:31 UTC
Description of problem:

Live migration between hosts of the same cluster fails. The source is a CentOS 7.3 node, the target is a CentOS 7.2 node.

Version-Release number of selected component (if applicable):

CentOS 7.2 host:
- libvirt 1.2.17-13.el7_2.6
- qemu 2.3.0-31.el7.21.1

CentOS 7.3 host:
- libvirt 2.0.0-10.el7_3.2
- qemu 2.6.0-27.1.el7

oVirt engine
- oVirt 4.0.6

How reproducible:

100%

Steps to Reproduce:
1. Start a VM on the CentOS 7.3 node
2. Live migrate the VM to the CentOS 7.2 node

Actual results:

Migration does not finish. Cancelled after 6 hours.

Expected results:

Migration succeeds.

Additional info:

Logs of source vdsm and target vdsm attached. Live migrated VM is colvm60. Processing started at ~ 20:15:22

Comment 1 Markus Stockhausen 2017-01-17 06:39:41 UTC
Created attachment 1241559 [details]
VDSM target

Comment 2 Markus Stockhausen 2017-01-17 06:40:17 UTC
Created attachment 1241560 [details]
VDSM source

Comment 3 Tomas Jelinek 2017-01-18 06:55:14 UTC
So, what happens is that the downtime thread fails right at the beginning of the migration because of:
KeyError: 'memory_bps'
So the migration then keeps going with the minimal downtime, which is not enough to finish the migration successfully, and then it is cancelled. The strange thing is why libvirt did not return memory_bps...

I would guess the issue is that the monitor thread started before the migration actually started, so the data returned by libvirt was not there yet.
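
The failure mode described above can be reproduced in isolation. The following is only a minimal sketch, not vdsm code: `downtime_worker`, `run_demo` and the sample data are hypothetical names made up for illustration, and only the key name `memory_bps` comes from the traceback. It assumes Python 3.8 or later for `threading.excepthook`. It shows that an uncaught KeyError ends the worker thread without stopping anything else, so whatever the thread was supposed to adjust is never adjusted.

```python
import threading


def downtime_worker(samples, applied):
    # Hypothetical stand-in for a downtime thread: it reads one value
    # from every statistics sample by indexing the key directly.
    for stats in samples:
        applied.append(stats['memory_bps'])  # KeyError if the key is absent


def run_demo():
    applied, failures = [], []
    previous_hook = threading.excepthook
    # Record the uncaught exception instead of printing a traceback.
    threading.excepthook = lambda args: failures.append(args.exc_type.__name__)
    try:
        worker = threading.Thread(
            target=downtime_worker,
            # The first sample lacks the key, the second one has it.
            args=([{}, {'memory_bps': 1000}], applied),
        )
        worker.start()
        worker.join()
    finally:
        threading.excepthook = previous_hook
    return worker.is_alive(), applied, failures


print(run_demo())
```

The second sample is never processed: the thread is already gone after the first one, while the caller (here the main thread, in the bug the migration itself) simply carries on.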

@Markus: is this happening all the time or was this a one time issue? Is it happening with all VMs or only with this one?

Comment 4 Markus Stockhausen 2017-01-18 12:24:23 UTC
Failure rate is 100%. 5/5 migrations stalled because of this error.

I'm raising the severity to high as it affects core oVirt features for 7.3 nodes.

Comment 5 Markus Stockhausen 2017-01-18 20:36:53 UTC
The cause of the bug was a faulty network card with a high packet drop rate. After replacing it, everything works flawlessly.

Nevertheless, oVirt should detect and report the issue in the WebUI.

Comment 6 Tomas Jelinek 2017-01-25 10:21:43 UTC
OK, this is actually a subset of bug 1414626, so marking it as a duplicate.

The root cause of this one is that when the stats are missing some value, an exception is thrown that turns off the monitor and downtime threads, letting the migration progress wrongly for a couple of hours.
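
One way to avoid that kind of silent failure is to treat an incomplete sample as a reportable condition rather than letting it raise. The sketch below is an illustration under assumptions, not the fix that went into vdsm for bug 1414626: `monitor_step`, `REQUIRED_KEYS`, the `state` dictionary and the `max_missing` threshold are all hypothetical, and only the key name `memory_bps` is taken from the traceback in comment 3.

```python
import logging

log = logging.getLogger("migration_sketch")

# Key name taken from the traceback; the tuple itself is an assumption.
REQUIRED_KEYS = ('memory_bps',)


def monitor_step(stats, state, max_missing=3):
    """Process one statistics sample.

    Returns 'ok' when the sample is complete, 'skipped' when it is
    incomplete, and 'abort' once max_missing consecutive samples
    were incomplete, so the caller can report the problem.
    """
    missing = [key for key in REQUIRED_KEYS if key not in stats]
    if missing:
        state['missing_count'] = state.get('missing_count', 0) + 1
        log.warning("stats sample lacks %s (%d in a row)",
                    missing, state['missing_count'])
        if state['missing_count'] >= max_missing:
            return 'abort'
        return 'skipped'
    # A complete sample resets the counter and is used normally.
    state['missing_count'] = 0
    state['last_bps'] = stats['memory_bps']
    return 'ok'


state = {}
print(monitor_step({}, state))                    # incomplete sample
print(monitor_step({'memory_bps': 5}, state))     # complete sample
```

With this shape the monitoring loop survives an early sample that arrives before the migration statistics are populated, and a persistent lack of data turns into an explicit 'abort' result that could be surfaced to the user instead of a thread that quietly stops.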

*** This bug has been marked as a duplicate of bug 1414626 ***