Bug 1527416

Summary: Wrong state returned in VM getStats when actual state changes in the middle
Product: [oVirt] vdsm
Component: Core
Reporter: Michal Skrivanek <michal.skrivanek>
Assignee: Milan Zamazal <mzamazal>
QA Contact: Israel Pinto <ipinto>
Status: CLOSED CURRENTRELEASE
Severity: urgent
Priority: high
Version: 4.20.15
CC: bugs, lveyde
Target Milestone: ovirt-4.2.1
Target Release: ---
Keywords: Regression
Flags: rule-engine: ovirt-4.2+, ykaul: blocker+
Hardware: Unspecified
OS: Unspecified
Fixed In Version: vdsm v4.20.14
Doc Type: Bug Fix
Doc Text: Due to a race in VM information retrieval, VMs could be lost during migration. This has been fixed and migrations should be safe again.
Last Closed: 2018-02-12 11:54:16 UTC
Type: Bug
oVirt Team: Virt

Description Michal Skrivanek 2017-12-19 11:41:11 UTC
The guest drive mapping introduced a significant delay into the VM.getStats call, since it tries to update the mapping whenever it detects a change, which is likely to happen on lifecycle changes. In the OST case the whole call took 1.2 s, and in the meantime the migration had finished. The getStats() call is not written with a possible state change in mind, so if the state moves from anything to Down in the middle of the call, it returns a Down state without exitCode and exitReason, which confuses the engine. Since ~4.1 the engine uses the exitReason code to differentiate the various flavors of Down, and in this case the result is a misleading "VM powered off by admin" event.
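To make the race concrete, here is a minimal Python sketch of the problematic pattern; the names are illustrative, not vdsm's actual internals:

    # Illustrative sketch of the race (hypothetical names, not vdsm code).
    def get_stats(vm):
        stats = {'status': vm.status}   # first read, e.g. 'Migration Source'
        # Slow step (~1.2 s in the OST case): the migration can complete
        # while this runs, flipping vm.status to 'Down' behind our back.
        vm.update_guest_drive_mapping()
        if vm.status == 'Down':         # second, inconsistent read
            stats['status'] = 'Down'    # exitCode/exitReason are never
                                        # filled in on this path
        return stats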

We need to fix VM.getStats() to handle VM state changes in the middle of the call.
We need to fix the guest drive mapping updates to cleanly handle situations where the VM is either not ready yet or already gone.
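A hedged sketch of what both fixes could look like (Python, with hypothetical names such as vm.state_lock; the actual vdsm patch may differ):

    import libvirt

    def get_stats(vm):
        # Fix 1: snapshot the lifecycle state once, under the VM's state
        # lock, and build the whole reply from that snapshot, so a
        # concurrent transition to Down cannot produce a reply that mixes
        # fields from two different states.
        with vm.state_lock:
            status = vm.status
            exit_code, exit_reason = vm.exit_code, vm.exit_reason
        stats = {'status': status}
        if status == 'Down':
            stats['exitCode'] = exit_code
            stats['exitReason'] = exit_reason
        return stats

    def update_guest_drive_mapping(dom):
        # Fix 2: tolerate the domain not being ready yet or already gone
        # (e.g. the migration just finished and the VM runs elsewhere).
        try:
            return dom.XMLDesc()        # stand-in for the real lookup
        except libvirt.libvirtError as e:
            if e.get_error_code() == libvirt.VIR_ERR_NO_DOMAIN:
                return None             # VM already gone; skip the update
            raise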

See http://lists.ovirt.org/pipermail/devel/2017-December/032282.html

Comment 1 Michal Skrivanek 2017-12-19 11:42:47 UTC
A workaround should be to not run ovirt-guest-agent in the guest during VM migration.

Comment 2 Israel Pinto 2018-01-25 08:43:59 UTC
Verified with:
Engine Version: 4.2.1.2-0.1.el7
Host:
OS Version: RHEL 7.4-18.el7
Kernel Version: 3.10.0-693.17.1.el7.x86_64
KVM Version: 2.9.0-16.el7_4.14
LIBVIRT Version: libvirt-3.2.0-14.el7_4.7
VDSM Version: vdsm-4.20.14-1.el7ev

Steps:
1. Create 12 VMs and start them
2. Set the migration bandwidth to 5 Mbps (minimum migration time of 1 min 50 sec; see the sanity check after the results)
3. Migrate all VMs and monitor VM status
Results:
All VMs migrated successfully.
The status reported in the UI was correct for all VMs.
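
As a sanity check of the figures in step 2 (illustrative arithmetic only; the real migration time also depends on guest RAM size and dirty-page rate):

    def min_migration_time_s(transfer_mb, bandwidth_mbps):
        # Lower bound on migration time: data volume over link bandwidth.
        return transfer_mb * 8 / bandwidth_mbps

    # The stated minimum of 1 min 50 sec (110 s) at 5 Mbps implies roughly
    # 110 * 5 / 8 ~= 69 MB of guest state actually transferred per VM.
    assert round(min_migration_time_s(68.75, 5)) == 110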

Comment 3 Sandro Bonazzola 2018-02-12 11:54:16 UTC
This bug is included in the oVirt 4.2.1 release, published on February 12th 2018.

Since the problem described in this bug report should be resolved in the oVirt 4.2.1 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.