Description of problem:
Following https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Virtualization/3.6/html/Upgrade_Guide/Upgrading_a_Red_Hat_Enterprise_Linux_6_Cluster_to_Red_Hat_Enterprise_Linux_7.html, step #3 "Set the InClusterUpgrade scheduling policy" fails. We suspect a race condition when the cluster has more than 100 VMs.

Version-Release number of selected component (if applicable):
Engine:
rhevm-3.6.8.1-0.1.el6.noarch
postgresql-server-8.4.20-6.el6.x86_64
Hypervisors:
Red Hat Enterprise Virtualization Hypervisor release 6.7 (20160219.0.el6ev)

How reproducible:
100% in the customer's environment

Steps to Reproduce:
1. Have a cluster with more than 100 VMs (in the customer's case, 310 VMs)
2. Follow steps #1 and #2 from the documentation
3. Set the scheduling policy to InClusterUpgrade mode for the cluster you want to upgrade

Actual results:
Fails with:
2016-08-12 11:11:18,604 ERROR [org.ovirt.engine.core.bll.UpdateVmCommand] (ajp-/127.0.0.1:8702-8) [17d532c7] Command 'org.ovirt.engine.core.bll.UpdateVmCommand' failed: CallableStatementCallback; SQL [{call incrementdbgeneration(?)}]; ERROR: deadlock detected

Expected results:
The InClusterUpgrade policy should be set successfully for the cluster

Additional info:
This happened on another setup where the customer had 172 VMs in the cluster; on a different cluster with only 11 VMs there were no issues. Will attach relevant logs.
According to your log statements, this engine is still 3.5 (VdsUpdateRunTimeInfo doesn't exist in the 3.6 codebase), so this setup seems unfinished.
In the engine.log I can't see any reference to VdsUpdateRunTimeInfo after 8/12 @ 09:03:15,629.
According to the setup log, the system was updated to 3.6.8 on 8/12 @ 09:58:20.
The ovirt-engine restart was on 8/12 @ 11:09:04,968.
Attaching the setup logs in priv.
Definitely a race between incrementdbgeneration, which is called by an update to a VM, and insertvmguestagentinterface. The update of the cluster does an UpdateVm for every VM in the cluster, including the Up VMs of course. This collides with a batch update from the VM monitoring, and apparently the updates aren't made in the same order, so we hit the deadlock. Looking further.
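To illustrate the mechanism (this is a minimal, hypothetical sketch in Python, not oVirt code): two writers that touch overlapping sets of rows must take the per-row locks in the same global order, otherwise writer A holding row 1 and waiting on row 2 can meet writer B holding row 2 and waiting on row 1, which is exactly the "deadlock detected" PostgreSQL reports. Sorting by VM id before locking removes the cycle:

```python
# Sketch of the lock-ordering rule that avoids this class of deadlock.
# One lock per "VM row"; in PostgreSQL these are the implicit row locks
# taken by UPDATE statements inside a transaction.
import threading

row_locks = {vm_id: threading.Lock() for vm_id in range(5)}

def update_vms(vm_ids, results, name):
    # The fix: always acquire in sorted order, regardless of input order.
    acquired = sorted(vm_ids)
    for vm_id in acquired:
        row_locks[vm_id].acquire()
    try:
        results.append(name)  # simulated UPDATE work
    finally:
        for vm_id in acquired:
            row_locks[vm_id].release()

results = []
# Two concurrent writers with overlapping, differently-ordered VM lists,
# analogous to the cluster-update loop vs. the monitoring batch update.
t1 = threading.Thread(target=update_vms, args=([3, 1, 2], results, "cluster-update"))
t2 = threading.Thread(target=update_vms, args=([2, 3, 1], results, "monitoring-batch"))
t1.start(); t2.start()
t1.join(timeout=5); t2.join(timeout=5)
print(sorted(results))  # both writers finished: ['cluster-update', 'monitoring-batch']
```

Without the `sorted()` call, the two threads could each hold one lock while waiting for the other's, which is the same cycle the database's deadlock detector breaks by aborting one transaction.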
This problem will not be limited to any particular cluster parameter change (like InClusterUpgrade) but would apply to any update to the cluster where incrementdbgeneration is called, right? e.g. could we hit this race condition while changing the cluster compatibility level?
Full analysis of the cause:
1. UpdateClusterCommand calls UpdateVm in a loop, without sorting, for *every* call, without any condition. The fix should be to call it *only* if the cluster compatibility level changed AND to sort the VMs before looping over them.
2. The batch update of guest interfaces should also be sorted by VM id to prevent deadlocking with the loop update from UpdateCluster.
3. Nice to have: @ahadas suggests maybe not calling the guest interfaces update in a TX.
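A minimal, hypothetical sketch of fixes 1 and 2 above (the function and parameter names are illustrative, not the actual UpdateClusterCommand code): skip the per-VM update entirely when the compatibility level did not change, and iterate the VMs in a stable order so every writer locks the rows in the same sequence:

```python
# Illustrative sketch only; names are assumptions, not oVirt engine API.
def vm_ids_to_update(old_compat_level, new_compat_level, vm_ids):
    """Return the VM ids to update, in a deterministic (sorted) order."""
    if old_compat_level == new_compat_level:
        return []              # fix 1: no compat change -> no UpdateVm loop
    return sorted(vm_ids)      # fix 2: sorted ids -> consistent lock order

print(vm_ids_to_update("3.6", "3.6", [3, 1, 2]))  # []
print(vm_ids_to_update("3.5", "3.6", [3, 1, 2]))  # [1, 2, 3]
```

Sorting alone is enough to prevent the deadlock with the monitoring batch as long as that batch sorts by the same key (point 2); the compat-level guard simply reduces how often the two writers can collide at all.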
In response to comment #14: UpdateVm is called only if the cluster version has changed (please see UpdateClusterCommand.getSharedLocks()).
Update: this deadlock cannot be reproduced with PostgreSQL 9.5.3 (Fedora 24) but can be reproduced with PostgreSQL 8.4.20 (RHEL 6). I'll check with PostgreSQL 9.2 (RHEL 7).
It can happen with PostgreSQL 9.2 as well, so a proper fix will be done for 4.0.
Verified with rhevm-4.1.0.4-0.1.el7.noarch. Had a cluster with 120 VMs running, with the guest agent working and reporting IPs. Changed the cluster compatibility version from 4.0 to 4.1 (the hosts were 4.1 all the time) and monitored the UpdateVm calls with tail on the engine log. Repeated this upgrade flow several times; the race did not occur in any of the iterations.