1369521 – After cluster upgrade from 3.6 to 4.0 with running HA vm, if vm is killed outside engine it starts as a 3.6 vm

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	ovirt-engine
Classification:	oVirt
Component:	BLL.Virt
Sub Component:
Version:	4.0.3
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	medium
Target Milestone:	ovirt-4.0.6
Target Release:	4.0.6
Assignee:	Marek Libra
QA Contact:	sefi litmanovich
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2016-08-23 15:57 UTC by sefi litmanovich
Modified:	2017-01-18 07:26 UTC
CC List:	5 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2017-01-18 07:26:08 UTC
oVirt Team:	Virt
Embargoed:
Dependent Products:
Flags:	rule-engine: ovirt-4.0.z+ mgoldboi: planning_ack+ tjelinek: devel_ack+ mavital: testing_ack+

Attachments
engine log + vm xml (877.19 KB, application/x-gzip) 2016-08-23 15:57 UTC, sefi litmanovich	no flags	Details

Links
System	ID	Priority	Status	Summary	Last Updated
oVirt gerrit	63211	master	MERGED	core: Fix race cond for NextRun and HA VM	2016-09-06 14:30:12 UTC
oVirt gerrit	63436	ovirt-engine-4.0	MERGED	core: Fix race cond for NextRun and HA VM	2016-09-06 15:53:02 UTC
oVirt gerrit	65906	master	POST	core: Fix race between AutoStartVmsRunner and ProcessDownVmCommand	2016-11-02 09:18:23 UTC
oVirt gerrit	66006	ovirt-engine-4.0	POST	core: Fix race between AutoStartVmsRunner and ProcessDownVmCommand	2016-11-03 08:17:36 UTC

Description sefi litmanovich 2016-08-23 15:57:24 UTC

Created attachment 1193358
engine log + vm xml

Description of problem:

When performing a cluster upgrade (changing the compatibility version) from 3.6 to 4.0 with running vms, the vms are automatically set to compatibility version 3.6 and marked for restart. After the restart, the vms should start as 4.0 vms.
In this case, a vm was running as an HA vm before the upgrade.
After the upgrade it was marked for restart as expected and kept its former state as a 3.6 compatibility vm.
Then I killed the vm's process on the host to test HA functionality. The vm was restarted immediately, as expected, but it started with 3.6 compatibility, which is now set in the custom compatibility version field. The vm's xml is also a 3.6 vm xml, e.g. the vcpu element is set like this:

<vcpu placement='static'>1</vcpu>
instead of:
<vcpu placement='static' current='1'>16</vcpu>
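
A minimal sketch of how the two forms above could be told apart when inspecting a dumped domain xml. It relies only on the two snippets quoted in this report and on libvirt's convention that the element text is the maximum vCPU count while `current` is the number enabled at start. The function name `vcpu_layout` and the surrounding `<domain>` wrappers are illustrative assumptions, not part of the attached xml.

```python
import xml.etree.ElementTree as ET


def vcpu_layout(domain_xml: str) -> dict:
    """Return current/maximum vCPU counts from a libvirt-style domain xml.

    When the 'current' attribute is absent, all vCPUs are enabled at start,
    so there is no headroom left for CPU hot plug.
    """
    root = ET.fromstring(domain_xml)
    vcpu = root.find("vcpu")
    if vcpu is None:
        raise ValueError("no <vcpu> element in domain xml")
    maximum = int(vcpu.text.strip())
    current = int(vcpu.get("current", maximum))
    return {
        "current": current,
        "maximum": maximum,
        "hotplug_headroom": maximum - current,
    }


# The two forms quoted in this report, wrapped in a hypothetical <domain> root.
old_style = "<domain><vcpu placement='static'>1</vcpu></domain>"
new_style = "<domain><vcpu placement='static' current='1'>16</vcpu></domain>"

print(vcpu_layout(old_style))
print(vcpu_layout(new_style))
```

A headroom of zero matches the 3.6-style element seen after the HA restart; a positive headroom matches the 4.0-style element that allows cpu hot plug.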

Version-Release number of selected component (if applicable):
The bug occurred in pre integration build: 
ovirt-engine-4.0.2.7-0.0.master.20160823072450.git3b10fd7.el7.centos.noarch which is equivalent to 4.0.3 

How reproducible:
always

Steps to Reproduce:

1. Before cluster upgrade - dc-cluster are 3.6, hosts have vdsm-4.17.33 - create a vm and set it to HA - start the vm.
2. Upgrade the hosts one by one to vdsm-4.18.11 and then upgrade the cluster to 4.0 - get the warning and so on.
3. After upgrade vm has 'new configuration' triangle/mark and is still a 3.6 vm.
4. In the host machine kill the qemu-kvm process of the vm.

Actual results:
The HA policy revives the vm right away, the new configuration mark is gone - vm is set to ccv 3.6 but as custom ccv - that seems to be some intermediate state of the vm in the middle of automatic ccv flow. The vm's xml is a 3.6 xml (e.g. vcpu_placement attribute is like in 3.6 thus blocking in this case the cpu hot plug feature)

Expected results:
The HA policy revives the vm right away, the new configuration mark is gone - vm is set to ccv 4.0 both in configuration and in practice (4.0 features are available for this vm).

Additional info:
No helpful information can be found in engine.log but I attach it anyway.

2016-08-23 16:46:13,232 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-7) [23eecac0] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM test-vm-1 is down with error. Exit message: Lost connection with qemu process.
2016-08-23 16:46:13,232 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-7) [23eecac0] add VM 'ec3e0c08-f2e2-46d5-9f11-b55efd14bee1'(test-vm-1) to HA rerun treatment
2016-08-23 16:46:13,402 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-7) [23eecac0] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Highly Available VM test-vm-1 failed. It will be restarted automatically.

Attaching the vm's xml after HA policy has restarted it.

Comment 1 Michal Skrivanek 2016-08-23 17:39:12 UTC

It creates an inconsistency: the pending change is missing for some 3.6 VMs, leading the user to think that all machines are fully upgraded to 4.0.

Comment 2 sefi litmanovich 2016-08-30 15:51:00 UTC

Update - this bug has been reproduced with rhevm-4.0.3-0.1.el7ev.noarch

Comment 3 Tomas Jelinek 2016-09-01 13:02:35 UTC

So, after a bit more digging, it looks like a race between the AutoStartVmsRunner and the ProcessDownVmCommand. If the AutoStartVmsRunner is faster, it takes the lock on the VM. If at that moment the ProcessDownVmCommand tries to apply the next_run snapshot, the UpdateVmCommand does not get the lock and does not apply the snapshot, so the VM starts in the old compatibility version.

I think the fix should be that the AutoStartVmsRunner makes sure not to start VMs that have a next run snapshot, giving ProcessDownVmCommand the chance to apply it.
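
The race and the proposed fix can be illustrated with a small, deterministic model. This is only a sketch under stated assumptions, not the engine code: the `Engine` and `Vm` classes, their fields, and the `skip_pending_next_run` switch are hypothetical stand-ins for AutoStartVmsRunner, ProcessDownVmCommand and the per-VM lock, and the model ignores the custom compatibility version detail from the description.

```python
import threading
from dataclasses import dataclass
from typing import Optional


@dataclass
class Vm:
    name: str
    compat_version: str
    next_run_version: Optional[str] = None  # pending "next run" configuration


class Engine:
    """Hypothetical model of the two actors competing for one VM lock."""

    def __init__(self, vm: Vm, skip_pending_next_run: bool):
        self.vm = vm
        self.skip_pending_next_run = skip_pending_next_run
        self.lock = threading.Lock()
        self.started_as: Optional[str] = None

    def auto_start(self) -> bool:
        """Stand-in for AutoStartVmsRunner; returns True if the VM was started."""
        if self.skip_pending_next_run and self.vm.next_run_version is not None:
            return False  # proposed fix: leave the VM for a later cycle
        if not self.lock.acquire(blocking=False):
            return False
        # The lock stays held while the VM is starting (see finish_start).
        self.started_as = self.vm.compat_version
        return True

    def finish_start(self) -> None:
        self.lock.release()

    def process_down(self) -> bool:
        """Stand-in for ProcessDownVmCommand applying the next-run snapshot."""
        if self.vm.next_run_version is None:
            return True
        if not self.lock.acquire(blocking=False):
            return False  # update loses the lock; snapshot is not applied
        try:
            self.vm.compat_version = self.vm.next_run_version
            self.vm.next_run_version = None
            return True
        finally:
            self.lock.release()


def simulate(skip_pending_next_run: bool) -> Optional[str]:
    """Replay the bad interleaving: the auto-start side runs first."""
    engine = Engine(Vm("test-vm-1", "3.6", next_run_version="4.0"),
                    skip_pending_next_run)
    started = engine.auto_start()
    engine.process_down()
    if not started:
        started = engine.auto_start()  # next auto-start cycle
    if started:
        engine.finish_start()
    return engine.started_as


print("without fix:", simulate(False))
print("with fix:   ", simulate(True))
```

In the model, the unfixed path starts the VM with the old version because the update cannot take the lock, while the fixed path defers the start until the pending configuration has been applied.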

Comment 4 Marek Libra 2016-10-31 14:15:24 UTC

There's a race condition between AutoStartVmsRunner and ProcessDownVmCommand: 

AutoStartVmsRunner acquires the VM lock for RunVmCommand, which conflicts with the lock acquisition by UpdateVmCommand, called from ProcessDownVmCommand.

Comment 5 Red Hat Bugzilla Rules Engine 2016-10-31 14:42:16 UTC

Target release should be placed once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for an oVirt release.

Comment 6 sefi litmanovich 2016-11-21 16:18:31 UTC

Verified with rhevm-4.0.6-0.1.el7ev.noarch, host - vdsm-4.18.16-1.el7ev.x86_64.
Verified according to the steps in the bug's description; behaviour after kill -9 is as expected.
