Bug 1369521 - After cluster upgrade from 3.6 to 4.0 with running HA vm, if vm is killed outside engine it starts as a 3.6 vm
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Virt
Version: 4.0.3
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ovirt-4.0.6
Target Release: 4.0.6
Assignee: Marek Libra
QA Contact: sefi litmanovich
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-08-23 15:57 UTC by sefi litmanovich
Modified: 2017-01-18 07:26 UTC (History)
5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-01-18 07:26:08 UTC
oVirt Team: Virt
rule-engine: ovirt-4.0.z+
mgoldboi: planning_ack+
tjelinek: devel_ack+
mavital: testing_ack+


Attachments (Terms of Use)
engine log + vm xml (877.19 KB, application/x-gzip)
2016-08-23 15:57 UTC, sefi litmanovich
no flags


Links
System ID Priority Status Summary Last Updated
oVirt gerrit 63211 master MERGED core: Fix race cond for NextRun and HA VM 2016-09-06 14:30:12 UTC
oVirt gerrit 63436 ovirt-engine-4.0 MERGED core: Fix race cond for NextRun and HA VM 2016-09-06 15:53:02 UTC
oVirt gerrit 65906 master POST core: Fix race between AutoStartVmsRunner and ProcessDownVmCommand 2016-11-02 09:18:23 UTC
oVirt gerrit 66006 ovirt-engine-4.0 POST core: Fix race between AutoStartVmsRunner and ProcessDownVmCommand 2016-11-03 08:17:36 UTC

Description sefi litmanovich 2016-08-23 15:57:24 UTC
Created attachment 1193358 [details]
engine log + vm xml

Description of problem:

When performing a cluster upgrade (changing the compatibility version) from 3.6 to 4.0 with running VMs, the VMs are automatically set to custom compatibility version 3.6 and marked for restart - after the restart the VMs should start as 4.0 VMs.
In this case, a VM was running as an HA VM before the upgrade.
After the upgrade it was marked for restart as expected and kept its former state as a 3.6-compatibility VM.
Then I killed the VM's process on the host to test HA functionality - the VM was immediately restarted, as expected, but it started with 3.6 compatibility, which is actually set in the custom compatibility version field. The VM's XML is also a 3.6 VM XML, e.g. the vcpu placement attribute is set like this:

<vcpu placement='static'>1</vcpu>
instead of:
<vcpu placement='static' current='1'>16</vcpu>

Version-Release number of selected component (if applicable):
The bug occurred in a pre-integration build:
ovirt-engine-4.0.2.7-0.0.master.20160823072450.git3b10fd7.el7.centos.noarch, which is equivalent to 4.0.3.

How reproducible:
always

Steps to Reproduce:

1. Before the cluster upgrade - DC and cluster are at 3.6, hosts have vdsm-4.17.33 - create a VM, set it to HA and start it.
2. Upgrade the hosts one by one to vdsm-4.18.11 and then upgrade the cluster to 4.0 - acknowledge the warning and so on.
3. After the upgrade the VM has the 'new configuration' triangle/mark and is still a 3.6 VM.
4. On the host machine, kill the qemu-kvm process of the VM.


Actual results:
The HA policy revives the VM right away and the new configuration mark is gone, but the VM is set to CCV 3.6 as a custom CCV - apparently an intermediate state left over from the automatic CCV flow. The VM's XML is a 3.6 XML (e.g. the vcpu placement attribute is as in 3.6, which in this case blocks the CPU hot plug feature).

Expected results:
The HA policy revives the VM right away, the new configuration mark is gone, and the VM is set to CCV 4.0 both in configuration and in practice (4.0 features are available for this VM).

Additional info:
No helpful information can be found in engine.log, but I am attaching it anyway.

2016-08-23 16:46:13,232 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-7) [23eecac0] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM test-vm-1 is down with error. Exit message: Lost connection with qemu process.
2016-08-23 16:46:13,232 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-7) [23eecac0] add VM 'ec3e0c08-f2e2-46d5-9f11-b55efd14bee1'(test-vm-1) to HA rerun treatment
2016-08-23 16:46:13,402 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-7) [23eecac0] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Highly Available VM test-vm-1 failed. It will be restarted automatically.

Attaching the vm's xml after HA policy has restarted it.

Comment 1 Michal Skrivanek 2016-08-23 17:39:12 UTC
It creates an inconsistency where the pending-change mark is missing for some 3.6 VMs, leading the user to think that all machines are fully upgraded to 4.0.

Comment 2 sefi litmanovich 2016-08-30 15:51:00 UTC
Update - this bug has been reproduced with rhevm-4.0.3-0.1.el7ev.noarch.

Comment 3 Tomas Jelinek 2016-09-01 13:02:35 UTC
So, after a bit more digging it looks like a race between the AutoStartVmsRunner and the ProcessDownVmCommand. If the AutoStartVmsRunner is faster, it takes the VM's lock; if at that moment the ProcessDownVmCommand tries to apply the next_run snapshot, the UpdateVmCommand will not get the lock and will not apply the snapshot, consequently starting the VM in the old compatibility version.

I think the fix should be that the AutoStartVmsRunner makes sure it does not start VMs which have a next_run snapshot, giving the ProcessDownVmCommand a chance to apply it.
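The proposed fix can be sketched as a filtering step: before auto-starting HA VMs, skip any VM that still has a pending next_run snapshot. This is a minimal, hypothetical sketch - the class and method names below are illustrative and not the real ovirt-engine API.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch of the proposed fix: the auto-start pass skips any
// HA VM that still has a pending next_run snapshot, so the down-VM
// processing can apply the new configuration before the VM is restarted.
public class AutoStartSketch {
    // Map of VM name -> "has a pending next_run snapshot" flag.
    static List<String> vmsToStart(Map<String, Boolean> haVms) {
        return haVms.entrySet().stream()
                .filter(e -> !e.getValue())  // next_run pending? defer restart
                .map(Map.Entry::getKey)
                .sorted()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Boolean> vms = Map.of(
                "test-vm-1", true,   // pending next_run: defer, let it be applied
                "test-vm-2", false); // no pending config: safe to auto-start
        System.out.println(vmsToStart(vms)); // prints [test-vm-2]
    }
}
```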

Comment 4 Marek Libra 2016-10-31 14:15:24 UTC
There's a race condition between the AutoStartVmsRunner and the ProcessDownVmCommand:

The AutoStartVmsRunner acquires the VM lock for the RunVmCommand, causing a conflict when the UpdateVmCommand, called from the ProcessDownVmCommand, tries to acquire the same lock.
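The lock conflict above can be illustrated with a minimal, self-contained sketch (this is not ovirt-engine code; a `Semaphore` stands in for the engine's per-VM exclusive lock):

```java
import java.util.concurrent.Semaphore;

// Illustrative sketch of the conflict: if the auto-start path takes the
// per-VM lock first, the update path's non-blocking acquire fails and the
// next_run snapshot is never applied before the VM restarts.
public class LockRaceSketch {
    public static void main(String[] args) {
        Semaphore vmLock = new Semaphore(1); // stand-in for the per-VM lock

        // AutoStartVmsRunner wins the race: RunVmCommand holds the VM lock.
        boolean runAcquired = vmLock.tryAcquire();

        // ProcessDownVmCommand then calls UpdateVmCommand, which tries the
        // same lock to apply the next_run snapshot and fails, so the VM
        // comes back up with the old (3.6) configuration.
        boolean updateAcquired = vmLock.tryAcquire();

        System.out.println("run=" + runAcquired + " update=" + updateAcquired);
        // prints run=true update=false
    }
}
```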

Comment 5 Red Hat Bugzilla Rules Engine 2016-10-31 14:42:16 UTC
Target release should be placed once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for an oVirt release.

Comment 6 sefi litmanovich 2016-11-21 16:18:31 UTC
Verified with rhevm-4.0.6-0.1.el7ev.noarch, host - vdsm-4.18.16-1.el7ev.x86_64.
Verified according to the steps in the bug's description; the behaviour after kill -9 is as expected.

