Bug 1366786 - [InClusterUpgrade] Possible race condition with large amount of VMs in cluster
Summary: [InClusterUpgrade] Possible race condition with large amount of VMs in cluster
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.6.8
Hardware: Unspecified
OS: Linux
Priority: high
Severity: high
Target Milestone: ovirt-4.1.0-alpha
Assignee: Arik
QA Contact: sefi litmanovich
URL:
Whiteboard:
Depends On:
Blocks: 1369415 1369418
 
Reported: 2016-08-12 19:54 UTC by Javier Coscia
Modified: 2019-12-16 06:21 UTC
CC List: 19 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, updating the compatibility version of a cluster with many running virtual machines that have the guest agent installed could cause a database deadlock, which made the update fail. In some cases, these clusters could not be upgraded to a newer compatibility version. Now, the deadlock in the database is prevented, so a cluster with many running virtual machines that have the guest agent installed can be upgraded to a newer compatibility version.
Clone Of:
: 1369415 1369418 (view as bug list)
Environment:
Last Closed: 2017-04-25 00:54:44 UTC
oVirt Team: Virt
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 2525531 0 None None None 2016-08-16 15:01:26 UTC
Red Hat Product Errata RHEA-2017:0997 0 normal SHIPPED_LIVE Red Hat Virtualization Manager (ovirt-engine) 4.1 GA 2017-04-18 20:11:26 UTC
oVirt gerrit 62372 0 ovirt-engine-4.0 MERGED core: fix monitoring of guest agent nics 2016-08-22 08:18:32 UTC
oVirt gerrit 62392 0 ovirt-engine-3.6 MERGED core: Cluster update updates VMs only if necessary 2016-08-18 08:22:33 UTC
oVirt gerrit 62514 0 ovirt-engine-4.0 MERGED core: make VmBase comparable 2016-08-22 08:17:59 UTC
oVirt gerrit 62515 0 ovirt-engine-4.0 MERGED core: update cluster to query only static vm data 2016-08-22 08:18:20 UTC
oVirt gerrit 62516 0 ovirt-engine-4.0 MERGED core: determine the order of vm statistics updates 2016-08-22 08:17:53 UTC
oVirt gerrit 62517 0 ovirt-engine-4.0 MERGED core: determine the order of guest agent nic updates 2016-08-22 08:18:09 UTC
oVirt gerrit 62518 0 ovirt-engine-4.0 MERGED core: fix possible deadlock on update cluster version 2016-08-22 08:17:47 UTC
oVirt gerrit 62521 0 master MERGED core: make VmBase comparable 2016-08-18 12:54:06 UTC
oVirt gerrit 62522 0 master MERGED core: update cluster to query only static vm data 2016-08-18 13:50:01 UTC
oVirt gerrit 62523 0 master MERGED core: determine the order of vm statistics updates 2016-08-21 07:03:12 UTC
oVirt gerrit 62524 0 master MERGED core: determine the order of guest agent nic updates 2016-08-21 07:31:48 UTC
oVirt gerrit 62525 0 master MERGED core: fix possible deadlock on update cluster version 2016-08-22 07:45:36 UTC
oVirt gerrit 62637 0 ovirt-engine-3.6 MERGED core: fix monitoring of guest agent nics 2016-08-22 12:19:01 UTC
oVirt gerrit 62742 0 ovirt-engine-4.0.3 MERGED core: fix monitoring of guest agent nics 2016-08-24 08:04:27 UTC

Description Javier Coscia 2016-08-12 19:54:27 UTC
Description of problem:

Following https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Virtualization/3.6/html/Upgrade_Guide/Upgrading_a_Red_Hat_Enterprise_Linux_6_Cluster_to_Red_Hat_Enterprise_Linux_7.html, step #3 "Set the InClusterUpgrade scheduling policy" fails. We suspect this is a race condition that occurs when the cluster contains a large number of VMs (more than 100).

Version-Release number of selected component (if applicable):

Engine: 
rhevm-3.6.8.1-0.1.el6.noarch
postgresql-server-8.4.20-6.el6.x86_64

Hypervisors: 
Red Hat Enterprise Virtualization Hypervisor release 6.7 (20160219.0.el6ev)


How reproducible:
100% in the customer's environment

Steps to Reproduce:
1. Have a cluster with more than 100 VMs (in the customer's case, 310 VMs)
2. Follow steps #1 and #2 from the documentation
3. Set the scheduling policy to InClusterUpgrade for the cluster you want to upgrade

Actual results:

Fails with:

2016-08-12 11:11:18,604 ERROR [org.ovirt.engine.core.bll.UpdateVmCommand] (ajp-/127.0.0.1:8702-8) [17d532c7] Command 'org.ovirt.engine.core.bll.UpdateVmCommand' failed: CallableStatementCallback; SQL [{call incrementdbgeneration(?)}]; ERROR: deadlock detected

Expected results:

The InClusterUpgrade policy should be set successfully for the cluster

Additional info:

This happened on another setup where the customer had 172 VMs in the cluster; on a different cluster with only 11 VMs, there were no issues.

Will attach relevant logs.

Comment 5 Roy Golan 2016-08-14 08:07:46 UTC
According to your log statements, this engine is still 3.5 (VdsUpdateRunTimeInfo doesn't exist in the 3.6 codebase), so this setup seems unfinished.

Comment 6 Javier Coscia 2016-08-14 21:32:01 UTC
In the engine.log I can't see any reference to VdsUpdateRunTimeInfo after 8/12 @ 09:03:15,629

According to the setup log, the system was updated to 3.6.8 on 8/12 @ 09:58:20. ovirt-engine restart was on 8/12 @ 11:09:04,968

Attaching the setup logs privately

Comment 8 Roy Golan 2016-08-15 12:58:56 UTC
Definitely a race between incrementdbgeneration, which is called by an update to a VM, and insertvmguestagentinterface.

The cluster update performs an UpdateVM for every VM in the cluster, including the Up VMs of course. This collides with a batch update from the VM monitoring, and apparently the updates are not made in the same order, so we hit the deadlock.

looking further.
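
To illustrate the kind of lock-order inversion described above, here is a minimal, self-contained Java sketch (hypothetical names, not the actual ovirt-engine code): two writers that update the same set of VM rows in different orders can each end up waiting on a row lock the other already holds, whereas sorting both writers by the same key removes the inversion.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.UUID;

// Minimal sketch of the lock-order issue (hypothetical, not ovirt-engine code).
// Writer A (the cluster update loop) and writer B (the monitoring batch update)
// each update the same VM rows inside a transaction. If A processes [vm1, vm2]
// while B processes [vm2, vm1], A can hold the row lock on vm1 and wait for vm2
// while B holds vm2 and waits for vm1 -> PostgreSQL aborts one of them with
// "deadlock detected". Sorting both writers by VM id removes the inversion.
public class LockOrderSketch {

    static List<UUID> inDeterministicOrder(List<UUID> vmIds) {
        List<UUID> sorted = new ArrayList<>(vmIds);
        sorted.sort(Comparator.naturalOrder());
        return sorted;
    }

    static void updateVms(List<UUID> vmIds) {
        for (UUID id : inDeterministicOrder(vmIds)) {
            // hypothetical per-row update executed inside one transaction,
            // e.g. the call that ends up in incrementdbgeneration
            System.out.println("updating vm " + id);
        }
    }

    public static void main(String[] args) {
        UUID vm1 = UUID.randomUUID();
        UUID vm2 = UUID.randomUUID();
        // Both writers now lock rows in the same (sorted) order, regardless of
        // the order in which they received the VM ids.
        updateVms(List.of(vm1, vm2));
        updateVms(List.of(vm2, vm1));
    }
}
```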

Comment 9 Ameya Charekar 2016-08-15 16:04:39 UTC
This problem is not limited to any particular cluster parameter change (like InClusterUpgrade) but would apply to any update of the cluster in which incrementdbgeneration is called, right?

e.g. can we hit this race condition while changing the cluster compatibility level?

Comment 14 Roy Golan 2016-08-16 08:48:18 UTC
Full analysis of the cause:

1. UpdateClusterCommand calls UpdateVm in a loop, without sorting, on *every* cluster update, without any condition. The fix should be to call it *only* if the cluster compatibility level changed AND to sort the VMs before looping over them

2. The batch update of guest interfaces should also be sorted by VM id to prevent deadlocking with the loop update from UpdateCluster (a rough sketch of points 1 and 2 follows below)

3. Nice to have: @ahadas suggests maybe not calling the guest interfaces update in a transaction.
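
A rough sketch of the direction in points 1 and 2 (hypothetical types and method names, not the actual patch): skip the per-VM update when the compatibility version is unchanged, and iterate the VMs in a fixed order so the cluster-update loop and the monitoring batch take row locks in the same order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.UUID;

// Rough sketch of the fix direction from points 1 and 2 above
// (hypothetical types and method names, not the actual ovirt-engine patch).
public class UpdateClusterSketch {

    // Hypothetical minimal stand-ins for the real entities.
    static class Cluster {
        final String compatibilityVersion;
        Cluster(String compatibilityVersion) { this.compatibilityVersion = compatibilityVersion; }
    }

    static class Vm {
        final UUID id;
        Vm(UUID id) { this.id = id; }
        UUID getId() { return id; }
    }

    void updateVmsIfNeeded(Cluster oldCluster, Cluster newCluster, List<Vm> vms) {
        // Point 1: only touch the VMs when the compatibility version actually changes.
        if (oldCluster.compatibilityVersion.equals(newCluster.compatibilityVersion)) {
            return;
        }
        // Points 1 and 2: iterate in a deterministic order (by VM id) so this loop
        // and the monitoring batch update acquire row locks in the same order.
        List<Vm> ordered = new ArrayList<>(vms);
        ordered.sort(Comparator.comparing(Vm::getId));
        for (Vm vm : ordered) {
            // updateVm(vm);  // hypothetical per-VM update (ends up in incrementdbgeneration)
        }
    }
}
```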

Comment 16 Marek Libra 2016-08-16 09:23:22 UTC
In response to comment #14: 
UpdateVm is called if the cluster version has changed (please see UpdateClusterCommand.getSharedLocks())

Comment 21 Arik 2016-08-17 15:14:30 UTC
Update: this deadlock cannot be reproduced with postgresql 9.5.3 (fedora 24) and can be reproduced with postgresql 8.4.20 (rhel 6).
I'll check with postgres 9.2 (rhel 7)

Comment 23 Arik 2016-08-18 08:35:08 UTC
It can happen with postgres 9.2 as well, so a proper fix will be done for 4.0

Comment 35 sefi litmanovich 2017-02-05 16:40:20 UTC
Verified with rhevm-4.1.0.4-0.1.el7.noarch.

Had a cluster with 120 VMs running, with the guest agent working and reporting IPs.
Changed the cluster compatibility version from 4.0 to 4.1 (the hosts were 4.1 the whole time) and monitored the updateVm calls with tail on the engine log.
Repeated this upgrade flow several times; no race occurred in any of the iterations.

