Created attachment 1193358 [details]
engine log + vm xml
Description of problem:
When performing a cluster upgrade (changing the compatibility version) from 3.6 to 4.0 with running VMs, the VMs are automatically set to custom compatibility version 3.6 and marked for restart; after the restart they should come up as 4.0 VMs.
In this case, a VM was running as an HA VM before the upgrade.
After the upgrade it was marked for restart, as expected, and kept its former state as a 3.6-compatibility VM.
I then killed the VM's process on the host to test HA functionality. The VM was immediately restarted, as expected, but it started with 3.6 compatibility, which is actually set in the custom compatibility version field. The VM's XML is also a 3.6 VM XML, e.g. the vcpu placement attribute is set like this:
<vcpu placement='static' current='1'>16</vcpu>
Version-Release number of selected component (if applicable):
The bug occurred in pre integration build:
ovirt-engine-188.8.131.52-0.0.master.20160823072450.git3b10fd7.el7.centos.noarch which is equivalent to 4.0.3
Steps to Reproduce:
1. Before the cluster upgrade: DC and cluster are 3.6, hosts have vdsm-4.17.33. Create a VM, set it to HA, and start it.
2. Upgrade the hosts one by one to vdsm-4.18.11, then upgrade the cluster to 4.0 (acknowledging the upgrade warning).
3. After the upgrade the VM has the 'new configuration' triangle mark and is still a 3.6 VM.
4. On the host machine, kill the qemu-kvm process of the VM.
Actual results:
The HA policy revives the VM right away and the new configuration mark is gone, but the VM is set to CCV 3.6 as a custom CCV - apparently an intermediate state from the middle of the automatic CCV flow. The VM's XML is a 3.6 XML (e.g. the vcpu placement attribute is as in 3.6, which in this case blocks the CPU hot-plug feature).
Expected results:
The HA policy revives the VM right away, the new configuration mark is gone, and the VM is set to CCV 4.0 both in configuration and in practice (4.0 features are available for this VM).
No helpful information can be found in engine.log, but I am attaching it anyway.
2016-08-23 16:46:13,232 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-7) [23eecac0] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM test-vm-1 is down with error. Exit message: Lost connection with qemu process.
2016-08-23 16:46:13,232 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-7) [23eecac0] add VM 'ec3e0c08-f2e2-46d5-9f11-b55efd14bee1'(test-vm-1) to HA rerun treatment
2016-08-23 16:46:13,402 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-7) [23eecac0] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Highly Available VM test-vm-1 failed. It will be restarted automatically.
Attaching the VM's XML after the HA policy has restarted it.
This creates an inconsistency: the pending-change mark is not shown for some 3.6 VMs, leading the user to think that all machines are fully upgraded to 4.0.
Update: this bug has been reproduced with rhevm-4.0.3-0.1.el7ev.noarch.
After a bit more digging, this looks like a race between the AutoStartVmsRunner and the ProcessDownVmCommand. If the AutoStartVmsRunner is faster, it takes the VM's lock; if at that moment the ProcessDownVmCommand tries to apply the next_run snapshot, the UpdateVmCommand cannot get the lock and does not apply the snapshot, so the VM starts with the old compatibility version.
I think the fix should be that the AutoStartVmsRunner does not start VMs that have a next-run snapshot, giving the ProcessDownVmCommand the chance to apply it.
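The race and the proposed fix can be modeled with a minimal Python sketch (all names and the lock model here are hypothetical simplifications; the real engine code is Java and uses its own in-memory lock manager). When AutoStartVmsRunner wins the lock, UpdateVmCommand cannot apply the next_run snapshot; when the runner first checks for a pending next-run snapshot and defers, the snapshot is applied and the VM starts as 4.0:

```python
import threading

class VmLockManager:
    """Simplified per-VM exclusive lock table (hypothetical stand-in
    for the engine's in-memory lock manager)."""
    def __init__(self):
        self._held = {}
        self._guard = threading.Lock()

    def try_acquire(self, vm_id, owner):
        with self._guard:
            if vm_id in self._held:
                return False              # already locked by another command
            self._held[vm_id] = owner
            return True

    def release(self, vm_id):
        with self._guard:
            self._held.pop(vm_id, None)

def run_race(auto_start_checks_snapshot):
    """Replay the race with AutoStartVmsRunner winning the lock first."""
    locks = VmLockManager()
    vm = {"id": "vm-1", "next_run_snapshot": True, "ccv": "3.6"}
    events = []

    def auto_start_vms_runner():
        # HA rerun: RunVmCommand takes the VM lock and starts the VM.
        if auto_start_checks_snapshot and vm["next_run_snapshot"]:
            events.append("start deferred")   # proposed fix: let the snapshot land first
            return
        if locks.try_acquire(vm["id"], "RunVmCommand"):
            # The lock stays held for the duration of the run.
            events.append("started with ccv " + vm["ccv"])

    def process_down_vm():
        # ProcessDownVmCommand -> UpdateVmCommand: apply the next_run snapshot.
        if locks.try_acquire(vm["id"], "UpdateVmCommand"):
            vm["ccv"] = "4.0"
            vm["next_run_snapshot"] = False
            locks.release(vm["id"])
            events.append("snapshot applied")
        else:
            events.append("snapshot skipped")  # the bug: lock already taken

    auto_start_vms_runner()        # AutoStartVmsRunner wins the race
    process_down_vm()
    if events[-1] == "snapshot applied":
        auto_start_vms_runner()    # with the fix, the VM now starts as 4.0
    return events

print(run_race(False))  # ['started with ccv 3.6', 'snapshot skipped']
print(run_race(True))   # ['start deferred', 'snapshot applied', 'started with ccv 4.0']
```

The sketch only captures the ordering problem, not real engine semantics: the essential point is that the fix moves the next-run-snapshot check ahead of the lock acquisition in the HA restart path.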
There's a race condition between AutoStartVmsRunner and ProcessDownVmCommand:
AutoStartVmsRunner acquires the VM lock for RunVmCommand, causing a conflict when UpdateVmCommand, called from ProcessDownVmCommand, tries to acquire the same lock.
Target release should be placed once a package build is known to fix an issue. Since this bug is not in MODIFIED status, the target version has been reset. Please use the target milestone to plan a fix for an oVirt release.
Verified with rhevm-4.0.6-0.1.el7ev.noarch, host: vdsm-4.18.16-1.el7ev.x86_64.
Verified according to the steps in the bug's description; behaviour after kill -9 is as expected.