Bug 1369521

Summary: After cluster upgrade from 3.6 to 4.0 with running HA vm, if vm is killed outside engine it starts as a 3.6 vm
Product: [oVirt] ovirt-engine
Component: BLL.Virt
Version: 4.0.3
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: high
Reporter: sefi litmanovich <slitmano>
Assignee: Marek Libra <mlibra>
QA Contact: sefi litmanovich <slitmano>
CC: bugs, mavital, mgoldboi, michal.skrivanek, tjelinek
Target Milestone: ovirt-4.0.6
Target Release: 4.0.6
Flags: rule-engine: ovirt-4.0.z+, mgoldboi: planning_ack+, tjelinek: devel_ack+, mavital: testing_ack+
Hardware: Unspecified
OS: Unspecified
oVirt Team: Virt
Type: Bug
Last Closed: 2017-01-18 07:26:08 UTC
Attachments: engine log + vm xml

Description sefi litmanovich 2016-08-23 15:57:24 UTC
Created attachment 1193358 [details]
engine log + vm xml

Description of problem:

When performing a cluster upgrade (changing the compatibility version) from 3.6 to 4.0 with running vms, the vms are automatically set to compatibility version 3.6 and are marked for restart - after the restart is done the vms should start as 4.0 vms.
In this case, a vm was running as an HA vm before the upgrade.
After the upgrade it was marked for restart as expected and kept its former state as a 3.6 compatibility vm.
Then I killed the vm's process on the host to test the HA functionality - the vm was immediately restarted, as expected, but it started with 3.6 compatibility, which is actually set in the custom compatibility version field. The vm's xml is also a 3.6 vm xml, e.g. the vcpu placement attribute is set like this:

<vcpu placement='static'>1</vcpu>
instead of:
<vcpu placement='static' current='1'>16</vcpu>

Version-Release number of selected component (if applicable):
The bug occurred in a pre-integration build:
ovirt-engine-4.0.2.7-0.0.master.20160823072450.git3b10fd7.el7.centos.noarch, which is equivalent to 4.0.3.

How reproducible:
always

Steps to Reproduce:

1. Before the cluster upgrade - DC and cluster are 3.6, hosts have vdsm-4.17.33 - create a vm, set it to HA and start it.
2. Upgrade the hosts one by one to vdsm-4.18.11 and then upgrade the cluster to 4.0 - a warning about the running vms is shown, and so on.
3. After the upgrade the vm has the 'new configuration' triangle/mark and is still a 3.6 vm.
4. On the host machine, kill the vm's qemu-kvm process (kill -9); see the sketch below.
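
On the host this step is normally done from the shell (find the qemu-kvm pid and kill -9 it). Purely as an illustration of step 4, here is a minimal Java sketch using the standard ProcessHandle API; the vm name default, the "qemu-kvm" command-line match and the class name are assumptions for the example, not part of the original report, and reading another user's command line generally requires running as root.

public class KillQemuSketch {

    public static void main(String[] args) {
        // Hypothetical vm name; in this bug report the vm is "test-vm-1".
        String vmName = args.length > 0 ? args[0] : "test-vm-1";

        // Find a qemu-kvm process whose command line mentions the vm name and
        // kill it forcibly (SIGKILL on Linux), emulating "kill -9" from the shell.
        ProcessHandle.allProcesses()
                .filter(ph -> ph.info().commandLine()
                        .map(cmd -> cmd.contains("qemu-kvm") && cmd.contains(vmName))
                        .orElse(false))
                .findFirst()
                .ifPresentOrElse(
                        ph -> System.out.println("SIGKILL requested for pid " + ph.pid()
                                + ": " + ph.destroyForcibly()),
                        () -> System.out.println("No qemu-kvm process found for vm " + vmName));
    }
}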


Actual results:
The HA policy revives the vm right away and the new configuration mark is gone, but the vm is set to ccv 3.6 as a custom compatibility version - that seems to be an intermediate state left over from the middle of the automatic ccv flow. The vm's xml is a 3.6 xml (e.g. the vcpu placement attribute is as in 3.6, which in this case blocks the cpu hot plug feature).

Expected results:
The HA policy revives the vm right away, the new configuration mark is gone, and the vm is set to ccv 4.0 both in configuration and in practice (4.0 features are available for this vm).

Additional info:
No helpful information can be found in engine.log but I attach it anyway.

2016-08-23 16:46:13,232 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-7) [23eecac0] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM test-vm-1 is down with error. Exit message: Lost connection with qemu process.
2016-08-23 16:46:13,232 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-7) [23eecac0] add VM 'ec3e0c08-f2e2-46d5-9f11-b55efd14bee1'(test-vm-1) to HA rerun treatment
2016-08-23 16:46:13,402 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-7) [23eecac0] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Highly Available VM test-vm-1 failed. It will be restarted automatically.

Attaching the vm's xml after the HA policy has restarted it.

Comment 1 Michal Skrivanek 2016-08-23 17:39:12 UTC
It creates an inconsistency where the pending change is not there for some 3.6 VMs, leading the user to think that all machines are fully upgraded to 4.0.

Comment 2 sefi litmanovich 2016-08-30 15:51:00 UTC
Update - this bug has been reproduced with rhevm-4.0.3-0.1.el7ev.noarch.

Comment 3 Tomas Jelinek 2016-09-01 13:02:35 UTC
So, after a bit more digging into it, this looks like a race between the AutoStartVmsRunner and the ProcessDownVmCommand. If the AutoStartVmsRunner is faster, it takes the VM's lock; if at that moment the ProcessDownVmCommand tries to apply the next_run snapshot, the UpdateVmCommand will not get the lock and will not apply the snapshot, consequently starting the VM with the old compatibility version.

I think the fix should be that the AutoStartVmsRunner makes sure it does not start vms which have a next_run snapshot, giving the ProcessDownVmCommand the chance to apply it first.
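
For illustration, a minimal, self-contained Java sketch of that guard. This is not the actual ovirt-engine code: VmRecord, hasNextRunSnapshot() and startVm() are hypothetical stand-ins for the real DAO lookups and RunVmCommand; only the skip-while-a-next_run-snapshot-is-pending check mirrors the proposal above.

import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.UUID;

public class AutoStartGuardSketch {

    // Hypothetical stand-in for the engine's VM data; the real runner reads
    // this information through the VM and snapshot DAOs.
    record VmRecord(UUID id, String name, boolean hasNextRunSnapshot) {}

    /**
     * Restart failed HA vms, but skip any vm that still has a pending
     * next_run snapshot so ProcessDownVmCommand gets the chance to apply it
     * first; the skipped vm is kept for a later retry.
     */
    static void restartFailedAutoStartVms(List<VmRecord> failedHaVms, Set<UUID> retryLater) {
        for (VmRecord vm : failedHaVms) {
            if (vm.hasNextRunSnapshot()) {
                // Pending configuration from the cluster upgrade: do not start
                // the vm yet, otherwise it comes up with the old (3.6) config.
                retryLater.add(vm.id());
                continue;
            }
            startVm(vm);
        }
    }

    // Placeholder for issuing RunVmCommand in the real engine.
    static void startVm(VmRecord vm) {
        System.out.println("Starting HA vm " + vm.name());
    }

    public static void main(String[] args) {
        VmRecord upgraded = new VmRecord(UUID.randomUUID(), "test-vm-1", true);
        VmRecord plain = new VmRecord(UUID.randomUUID(), "test-vm-2", false);
        Set<UUID> retryLater = new HashSet<>();
        restartFailedAutoStartVms(List.of(upgraded, plain), retryLater);
        System.out.println("Deferred until next_run is applied: " + retryLater);
    }
}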

Comment 4 Marek Libra 2016-10-31 14:15:24 UTC
There's a race condition between AutoStartVmsRunner and ProcessDownVmCommand: 

AutoStartVmsRunner acquires the VM lock for RunVmCommand, which conflicts with the lock acquisition attempted by UpdateVmCommand called from the ProcessDownVmCommand.
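
A minimal sketch of that lock conflict, assuming a simple per-vm exclusive lock; the class and the map below are illustration only, the real engine uses its own lock manager rather than java.util.concurrent directly.

import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

public class VmLockRaceSketch {

    // One exclusive "lock" per vm id, standing in for the engine's lock manager.
    static final Map<UUID, Semaphore> VM_LOCKS = new ConcurrentHashMap<>();

    static Semaphore lockFor(UUID vmId) {
        return VM_LOCKS.computeIfAbsent(vmId, id -> new Semaphore(1));
    }

    public static void main(String[] args) {
        UUID vmId = UUID.randomUUID();

        // AutoStartVmsRunner wins the race: RunVmCommand now holds the vm lock.
        boolean runVmHoldsLock = lockFor(vmId).tryAcquire();
        System.out.println("RunVmCommand acquired the vm lock: " + runVmHoldsLock);

        // ProcessDownVmCommand then tries to apply the next_run snapshot via
        // UpdateVmCommand, which cannot acquire the same lock and gives up,
        // so the vm is started with the old (3.6) configuration.
        boolean updateVmHoldsLock = lockFor(vmId).tryAcquire();
        System.out.println("UpdateVmCommand acquired the vm lock: " + updateVmHoldsLock);

        if (runVmHoldsLock) {
            lockFor(vmId).release();
        }
    }
}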

Comment 5 Red Hat Bugzilla Rules Engine 2016-10-31 14:42:16 UTC
Target release should be placed once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for an oVirt release.

Comment 6 sefi litmanovich 2016-11-21 16:18:31 UTC
Verified with rhevm-4.0.6-0.1.el7ev.noarch, host - vdsm-4.18.16-1.el7ev.x86_64.
Verified according to the steps in the bug's description; the behaviour after kill -9 is as expected.