Description of problem:
Following https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Virtualization/3.6/html/Upgrade_Guide/Upgrading_a_Red_Hat_Enterprise_Linux_6_Cluster_to_Red_Hat_Enterprise_Linux_7.html, step #3 "Set the InClusterUpgrade scheduling policy" fails. We suspect a race condition when the cluster has more than 100 VMs.

Version-Release number of selected component (if applicable):
Engine:
rhevm-3.6.8.1-0.1.el6.noarch
postgresql-server-8.4.20-6.el6.x86_64
Hypervisors:
Red Hat Enterprise Virtualization Hypervisor release 6.7 (20160219.0.el6ev)

How reproducible:
100% in the customer's environment

Steps to Reproduce:
1. Have a cluster with more than 100 VMs (in the customer's case, 310 VMs)
2. Follow steps #1 and #2 from the documentation
3. Set the scheduling policy to InClusterUpgrade mode for the cluster you want to upgrade

Actual results:
Fails with:
2016-08-12 11:11:18,604 ERROR [org.ovirt.engine.core.bll.UpdateVmCommand] (ajp-/127.0.0.1:8702-8) [17d532c7] Command 'org.ovirt.engine.core.bll.UpdateVmCommand' failed: CallableStatementCallback; SQL [{call incrementdbgeneration(?)}]; ERROR: deadlock detected

Expected results:
The InClusterUpgrade policy should be set successfully for the cluster

Additional info:
This happened on another setup where the customer had 172 VMs in the cluster; on a different cluster with only 11 VMs there were no issues. Will attach relevant logs.
According to your log statements, this engine is still 3.5 (VdsUpdateRunTimeInfo doesn't exist in the 3.6 codebase), so this setup seems unfinished.
In the engine.log I can't see any reference to VdsUpdateRunTimeInfo after 8/12 @ 09:03:15,629.
According to the setup log, the system was updated to 3.6.8 on 8/12 @ 09:58:20.
The ovirt-engine restart was on 8/12 @ 11:09:04,968.
Attaching the setup logs in priv.
Definitely a race between incrementdbgeneration, which is called by an update to a VM, and insertvmguestagentinterface. The update of the cluster does an UpdateVm for every VM in the cluster, including the Up VMs of course. This collides with a batch update from the VM monitoring, and apparently the updates aren't made in the same order, so we hit the deadlock. Looking further.
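To illustrate the mechanism (this is a minimal, hypothetical sketch in Python, not oVirt code): two writers that touch overlapping sets of rows must take the per-row locks in the same global order, otherwise writer A holding row 1 and waiting on row 2 can meet writer B holding row 2 and waiting on row 1, which is exactly the "deadlock detected" PostgreSQL reports. Sorting by VM id before locking removes the cycle:

```python
# Sketch of the lock-ordering rule that avoids this class of deadlock.
# One lock per "VM row"; in PostgreSQL these are the implicit row locks
# taken by UPDATE statements inside a transaction.
import threading

row_locks = {vm_id: threading.Lock() for vm_id in range(5)}

def update_vms(vm_ids, results, name):
    # The fix: always acquire in sorted order, regardless of input order.
    acquired = sorted(vm_ids)
    for vm_id in acquired:
        row_locks[vm_id].acquire()
    try:
        results.append(name)  # simulated UPDATE work
    finally:
        for vm_id in acquired:
            row_locks[vm_id].release()

results = []
# Two concurrent writers with overlapping, differently-ordered VM lists,
# analogous to the cluster-update loop vs. the monitoring batch update.
t1 = threading.Thread(target=update_vms, args=([3, 1, 2], results, "cluster-update"))
t2 = threading.Thread(target=update_vms, args=([2, 3, 1], results, "monitoring-batch"))
t1.start(); t2.start()
t1.join(timeout=5); t2.join(timeout=5)
print(sorted(results))  # both writers finished: ['cluster-update', 'monitoring-batch']
```

Without the `sorted()` call, the two threads could each hold one lock while waiting for the other's, which is the same cycle the database's deadlock detector breaks by aborting one transaction.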
This problem will not be limited to any particular cluster parameter change (like InClusterUpgrade) but would apply to any update to the cluster where incrementdbgeneration is called, right? e.g. could we hit this race condition while changing the cluster compatibility level?
Full analysis of the cause:
1. UpdateClusterCommand calls UpdateVm in a loop, without sorting, for *every* call, without any condition. The fix should be to call it *only* if the cluster compatibility level changed AND to sort the VMs before looping over them.
2. The batch update of guest interfaces should also be sorted by VM id to prevent deadlocking with the loop update from UpdateCluster.
3. Nice to have: @ahadas suggests maybe not calling the guest interfaces update in a TX.
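A minimal, hypothetical sketch of fixes 1 and 2 above (the function and parameter names are illustrative, not the actual UpdateClusterCommand code): skip the per-VM update entirely when the compatibility level did not change, and iterate the VMs in a stable order so every writer locks the rows in the same sequence:

```python
# Illustrative sketch only; names are assumptions, not oVirt engine API.
def vm_ids_to_update(old_compat_level, new_compat_level, vm_ids):
    """Return the VM ids to update, in a deterministic (sorted) order."""
    if old_compat_level == new_compat_level:
        return []              # fix 1: no compat change -> no UpdateVm loop
    return sorted(vm_ids)      # fix 2: sorted ids -> consistent lock order

print(vm_ids_to_update("3.6", "3.6", [3, 1, 2]))  # []
print(vm_ids_to_update("3.5", "3.6", [3, 1, 2]))  # [1, 2, 3]
```

Sorting alone is enough to prevent the deadlock with the monitoring batch as long as that batch sorts by the same key (point 2); the compat-level guard simply reduces how often the two writers can collide at all.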
In response to comment #14: UpdateVm is called only if the cluster version has changed (please see UpdateClusterCommand.getSharedLocks()).
Update: this deadlock cannot be reproduced with PostgreSQL 9.5.3 (Fedora 24) but can be reproduced with PostgreSQL 8.4.20 (RHEL 6). I'll check with PostgreSQL 9.2 (RHEL 7).
It can happen with PostgreSQL 9.2 as well, so a proper fix will be done for 4.0.
Verified with rhevm-4.1.0.4-0.1.el7.noarch. Had a cluster with 120 VMs running, with the guest agent working and reporting IPs. Changed the cluster compatibility version from 4.0 to 4.1 (the hosts were 4.1 all the time) and monitored the UpdateVm calls with tail on the engine log. Repeated this upgrade flow several times; the race did not occur in any of the iterations.