Description of problem:
After upgrading oVirt from 4.1 to 4.2.4, I'm trying to update the cluster compatibility level from 4.1 to 4.2, but the task fails with the error "Failed to update Host cluster".

Version-Release number of selected component (if applicable):
ovirt-engine-4.2.4.5-0.1.el7_3.noarch

How reproducible:
The problem seems to arise only with a large number of VMs per cluster (>700).

Steps to Reproduce:
1. Edit the cluster properties.
2. Change the Compatibility Version from 4.1 to 4.2.
3. Confirm and wait until the error appears.

Actual results:
The update task fails after nearly 5 minutes. The VM that triggers the error appears to be random on every attempt.

Expected results:
The cluster compatibility update completes without errors and the version changes to 4.2.

Additional info:
engine.log shows the following errors:

2018-07-27 14:16:29,017+02 ERROR [org.ovirt.engine.core.bll.UpdateVmCommand] (Transaction Reaper Worker 0) [] Transaction rolled-back for command 'org.ovirt.engine.core.bll.UpdateVmCommand'.
2018-07-27 14:16:29,017+02 ERROR [org.ovirt.engine.core.bll.UpdateClusterCommand] (Transaction Reaper Worker 0) [] Transaction rolled-back for command 'org.ovirt.engine.core.bll.UpdateClusterCommand'.
2018-07-27 14:16:29,297+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-15) [52696305] EVENT_ID: SYSTEM_FAILED_UPDATE_VM(252), Failed to Update VM CRLABOV-037-33 that was initiated by system.
2018-07-27 14:16:29,307+02 ERROR [org.ovirt.engine.core.bll.CommandsFactory] (default task-15) [] An exception has occurred while trying to create a command object for command 'UpdateVm' with parameters 'VmManagementParametersBase:{commandId='748892c9-7fc5-4dda-b0da-a407765e2a8c', user='null', commandType='Unknown', vmId='cf8fec91-a8aa-43d7-9c58-f53829249470'}': WELD-000049: Unable to invoke protected final void org.ovirt.engine.core.bll.CommandBase.postConstruct() on org.ovirt.engine.core.bll.UpdateVmCommand@52b30ee2
2018-07-27 14:16:29,311+02 ERROR [org.ovirt.engine.core.bll.CommandsFactory] (default task-15) [] Exception: java.lang.IllegalStateException: Transaction Local transaction (delegate=TransactionImple < ac, BasicAction: 0:ffffac15fa64:3d3ee80d:5b5b0064:59cf status: ActionStatus.ABORTED >, owner=Local transaction context for provider JBoss JTA transaction provider) is not active STATUS_ROLLEDBACK
2018-07-27 14:16:29,368+02 ERROR [org.ovirt.engine.core.bll.UpdateClusterCommand] (default task-15) [] Command 'org.ovirt.engine.core.bll.UpdateClusterCommand' failed: null
2018-07-27 14:16:29,368+02 ERROR [org.ovirt.engine.core.bll.UpdateClusterCommand] (default task-15) [] Exception: javax.ejb.EJBTransactionRolledbackException
2018-07-27 14:16:29,402+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-15) [] EVENT_ID: USER_UPDATE_CLUSTER_FAILED(812), Failed to update Host cluster (User: pavanm@unicloudidattica.local-authz)

Please find attached the full engine.log.
Created attachment 1471130 [details] full engine.log
Interesting. This bug reveals a fundamental issue in the implementation of the cluster version upgrade.

The flow is as follows:
1. UpdateCluster is called.
2. UpdateVm is called per-VM in the cluster.
3. Optionally, SetVmDestroyOnReboot is called for each running VM.

UpdateCluster is transactional. UpdateVm is transactional. Since UpdateVm uses transaction scope Required, each execution of UpdateVm runs within the transaction scope of UpdateCluster. This is the desired behavior, because we want successful UpdateVm executions to be rolled back upon a failure to upgrade any particular VM in the cluster. That is also why all UpdateVm executions run in the same thread as the one executing UpdateCluster.

So we have a transaction timeout of 5 min (i.e., 300 sec) for UpdateCluster, which means all VM updates must complete within that time frame. With this many VMs, that may not be possible. Looking at the log, each VM update itself is fairly quick (~60 msec), but it takes time for the execution thread to switch from one VM update to the next, so completing one VM update and moving on to the next takes ~1 sec. Therefore, in 300 sec we manage to update only ~300 VMs, far fewer than the number of VMs in the cluster.

Note: in this case, none of the VMs was running (SetVmDestroyOnReboot was not called). Updating a running VM would take even longer due to the interaction with the host.

Conclusions. To resolve this issue:
1. UpdateCluster should no longer be transactional. That was fine when update-cluster was a short operation that only updated the cluster's configuration in the database, but now that it updates every VM in the cluster and possibly interacts with the host, it is no longer appropriate.
2. UpdateVm should no longer be transactional. It should use the compensation context of UpdateCluster when executed as a child of it.
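The timeout arithmetic above can be sketched as a quick back-of-the-envelope check. This is an illustration only, not oVirt engine code: the class and method names are made up, while the 300-sec timeout and the ~1 sec effective per-VM cost are taken from the log analysis.

```java
// Illustrative estimate only -- not oVirt engine code.
public class ClusterUpgradeEstimate {

    /** How many per-VM updates fit inside one transaction timeout. */
    static int maxVmsWithin(int timeoutSeconds, double secondsPerVm) {
        return (int) (timeoutSeconds / secondsPerVm);
    }

    public static void main(String[] args) {
        int timeout = 300;     // JTA transaction timeout for UpdateCluster (5 min)
        double perVm = 1.0;    // ~60 msec of DB work plus ~1 sec thread switching, per the log
        int vmsInCluster = 700;

        int updated = maxVmsWithin(timeout, perVm);
        System.out.println("VMs updated before rollback: " + updated);
        System.out.println("Timeout hit: " + (updated < vmsInCluster));
    }
}
```

With these figures the transaction reaper aborts the whole operation after roughly 300 VM updates, which matches the rollback seen in the attached log.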
Other things I thought of while investigating the log:

3. Once we make those commands non-transactional, we could also speed up the upgrade process by invoking the VM updates from more than one thread.

4. We currently may call SetVmDestroyOnReboot as part of each VM update. This has two drawbacks:
(a) It makes a costly (in terms of execution time) call to VDSM per VM.
(b) This host-level flag is not rolled back for VMs that were already updated successfully when a problem occurs afterward.

It would be much smarter to call this verb once per host, with all the updated VMs, only at the end of UpdateCluster, once we know all VMs were updated successfully.
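The per-host batching in point 4 could look roughly like this. A hypothetical sketch only: the grouping helper and the printed call are invented for illustration and are not the real engine or VDSM API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of suggestion 4: group the successfully updated running VMs by the
// host they run on, then issue one destroy-on-reboot call per host at the end
// of UpdateCluster, instead of one VDSM call per VM during each VM update.
public class DestroyOnRebootBatcher {

    /** Group updated running VMs (vmId -> hostId) by host. */
    static Map<String, List<String>> groupByHost(Map<String, String> vmToHost) {
        Map<String, List<String>> byHost = new HashMap<>();
        for (Map.Entry<String, String> e : vmToHost.entrySet()) {
            byHost.computeIfAbsent(e.getValue(), h -> new ArrayList<>()).add(e.getKey());
        }
        return byHost;
    }

    public static void main(String[] args) {
        Map<String, String> vmToHost = new HashMap<>();
        vmToHost.put("vm1", "hostA");
        vmToHost.put("vm2", "hostA");
        vmToHost.put("vm3", "hostB");

        // One batched call per host instead of three per-VM calls.
        groupByHost(vmToHost).forEach((host, vms) ->
            System.out.println("setDestroyOnReboot(" + host + ", " + vms + ")"));
    }
}
```

Batching per host also helps with drawback (b): if a later VM update fails, the flag has not yet been set anywhere, so there is nothing to roll back.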
Hi Andrej, Please add 'priority' to the bug.
Verified by adding a 3-sec delay after each VM update and performing a cluster compatibility version update from 4.1 to 4.2. The upgrade was done to the latest 4.2.7 build currently available.

Steps:
1. Deploy a 4.1 env.
2. Create a large number of VMs (450 in my case).
3. Upgrade 4.1 to 4.2.
4. Inject the delay into the engine process on the updateVM action (used byteman).
5. Change the cluster compatibility version.

Note: On the previous build, ovirt-engine-4.2.7.2-0.1.el7ev.noarch, I was able to reproduce the issue; it occurred after 300 sec. With the new build, the cluster compatibility update process took 26 min 10 sec (as expected, given the delay after each updateVM call). All VMs were updated successfully.

RHV release: rhv-release-4.2.7-5-001.noarch
ovirt-engine: ovirt-engine-4.2.7.3-0.1.el7ev.noarch
rhel: Red Hat Enterprise Linux Server release 7.6 (Maipo)
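As a sanity check on the measured duration (an illustration only, not part of the verification procedure; the class name is made up): 450 VMs with a 3-sec injected delay give 1350 sec of pure sleep, and the observed 26 min 10 sec run far exceeds the old 300-sec transaction timeout without aborting, which is exactly what the fix was meant to allow.

```java
// Rough timing check for the verification run above -- illustration only.
public class VerificationTiming {

    /** Total injected sleep across all VM updates, in seconds. */
    static int expectedSleepSeconds(int vms, int delaySeconds) {
        return vms * delaySeconds;
    }

    public static void main(String[] args) {
        int sleep = expectedSleepSeconds(450, 3);  // 1350 sec = 22.5 min of injected delay
        int observed = 26 * 60 + 10;               // 1570 sec measured end to end
        System.out.println("Injected delay: " + sleep + " sec, observed: " + observed + " sec");
        // The old 300-sec JTA timeout would have rolled the operation back long before this.
        System.out.println("Exceeds old timeout: " + (observed > 300));
    }
}
```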
This bugzilla is included in oVirt 4.2.7 release, published on November 2nd 2018. Since the problem described in this bug report should be resolved in oVirt 4.2.7 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.