Bug 1609319 - failed to update cluster compatibility to version 4.2
Summary: failed to update cluster compatibility to version 4.2
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: General
Version: 4.2.4.5
Hardware: Unspecified
OS: Unspecified
medium
medium
Target Milestone: ovirt-4.2.7
: ---
Assignee: Andrej Krejcir
QA Contact: Vitalii Yerys
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2018-07-27 14:30 UTC by Stefano Stagnaro
Modified: 2018-11-02 14:31 UTC (History)
5 users (show)

Fixed In Version: ovirt-engine-4.2.7.3
Clone Of:
Environment:
Last Closed: 2018-11-02 14:31:49 UTC
oVirt Team: Virt
Embargoed:
rule-engine: ovirt-4.2+


Attachments (Terms of Use)
full engine.log (5.77 MB, text/plain)
2018-07-27 14:37 UTC, Stefano Stagnaro
no flags Details


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 93746 0 master MERGED core: Small cleanup of UpdateClusterCommand 2020-06-17 13:49:43 UTC
oVirt gerrit 93747 0 master MERGED core: Change order of code blocks in UpdateClusterCommand 2020-06-17 13:49:42 UTC
oVirt gerrit 93748 0 master MERGED core: Make SupportedAdditionalClusterFeature a BusinessEntity 2020-06-17 13:49:42 UTC
oVirt gerrit 93778 0 master ABANDONED core: Create generic compensation entity. 2020-06-17 13:49:42 UTC
oVirt gerrit 93842 0 master MERGED core: Create ChildCompensationWrapper class 2020-06-17 13:49:42 UTC
oVirt gerrit 93889 0 master MERGED core: Cleanup UpdateVmCommand 2020-06-17 13:49:42 UTC
oVirt gerrit 93890 0 master MERGED core: Small cleanup of VmDeviceUtils 2020-06-17 13:49:41 UTC
oVirt gerrit 93892 0 master MERGED core: UpdateVmCommand uses CompensationContext 2020-06-17 13:49:41 UTC
oVirt gerrit 93893 0 master MERGED core: Small cleanup of VmTemplateManagementCommand 2020-06-17 13:49:41 UTC
oVirt gerrit 93894 0 master MERGED core: UpdateVmTemplateCommand uses CompensationContext 2020-06-17 13:49:41 UTC
oVirt gerrit 93895 0 master MERGED core: Make UpdateClusterCommand non-transactive 2020-06-17 13:49:41 UTC
oVirt gerrit 93979 0 master MERGED core: Fix device id when updating graphics device 2020-06-17 13:49:41 UTC
oVirt gerrit 94125 0 ovirt-engine-4.2 MERGED core: Small cleanup of UpdateClusterCommand 2020-06-17 13:49:40 UTC
oVirt gerrit 94126 0 ovirt-engine-4.2 MERGED core: Change order of code blocks in UpdateClusterCommand 2020-06-17 13:49:40 UTC
oVirt gerrit 94127 0 ovirt-engine-4.2 MERGED core: Cleanup UpdateVmCommand 2020-06-17 13:49:40 UTC
oVirt gerrit 94128 0 ovirt-engine-4.2 MERGED core: Small cleanup of VmDeviceUtils 2020-06-17 13:49:40 UTC
oVirt gerrit 94129 0 ovirt-engine-4.2 MERGED core: Small cleanup of VmTemplateManagementCommand 2020-06-17 13:49:40 UTC
oVirt gerrit 94130 0 ovirt-engine-4.2 MERGED core: Fix device id when updating graphics device 2020-06-17 13:49:40 UTC
oVirt gerrit 94205 0 ovirt-engine-4.2 MERGED core: Make SupportedAdditionalClusterFeature a BusinessEntity 2020-06-17 13:49:39 UTC
oVirt gerrit 94206 0 ovirt-engine-4.2 MERGED core: Create ChildCompensationWrapper class 2020-06-17 13:49:39 UTC
oVirt gerrit 94556 0 ovirt-engine-4.2 MERGED core: UpdateVmCommand uses CompensationContext 2020-06-17 13:49:39 UTC
oVirt gerrit 94557 0 ovirt-engine-4.2 MERGED core: UpdateVmTemplateCommand uses CompensationContext 2020-06-17 13:49:39 UTC
oVirt gerrit 94558 0 ovirt-engine-4.2 MERGED core: Make UpdateClusterCommand non-transactive 2020-06-17 13:49:39 UTC

Description Stefano Stagnaro 2018-07-27 14:30:28 UTC
Description of problem:
After upgrading oVirt from 4.1 to 4.2.4 I'm trying to update the cluster compatibility level from 4.1 to 4.2 but the task fails with error "Failed to update Host cluster".

Version-Release number of selected component (if applicable):
ovirt-engine-4.2.4.5-0.1.el7_3.noarch

How reproducible:
The problem seems arise only with large amount of VMs per cluster (>700).

Steps to Reproduce:
1. edit cluster properties
2. change the Compatibility Version from 4.1 to 4.2
3. confirm and wait till the error appear

Actual results:
The update task fails after nearly 5 minutes. The VM that causes the error seems random at every attempt. 

Expected results:
Cluster compatibility update ends without errors and turns to 4.2.

Additional info:
engine.log shows the following errors:
2018-07-27 14:16:29,017+02 ERROR [org.ovirt.engine.core.bll.UpdateVmCommand] (Transaction Reaper Worker 0) [] Transaction rolled-back for command 'org.ovirt.engine.core.bll.UpdateVmCommand'.
2018-07-27 14:16:29,017+02 ERROR [org.ovirt.engine.core.bll.UpdateClusterCommand] (Transaction Reaper Worker 0) [] Transaction rolled-back for command 'org.ovirt.engine.core.bll.UpdateClusterCommand'.
2018-07-27 14:16:29,297+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-15) [52696305] EVENT_ID: SYSTEM_FAILED_UPDATE_VM(252), Failed to Update VM CRLABOV-037-33 that was initiated by system.
2018-07-27 14:16:29,307+02 ERROR [org.ovirt.engine.core.bll.CommandsFactory] (default task-15) [] An exception has occurred while trying to create a command object for command 'UpdateVm' with parameters 'VmManagementParametersBase:{commandId='748892c9-7fc5-4dda-b0da-a407765e2a8c', user='null', commandType='Unknown', vmId='cf8fec91-a8aa-43d7-9c58-f53829249470'}': WELD-000049: Unable to invoke protected final void org.ovirt.engine.core.bll.CommandBase.postConstruct() on org.ovirt.engine.core.bll.UpdateVmCommand@52b30ee2
2018-07-27 14:16:29,311+02 ERROR [org.ovirt.engine.core.bll.CommandsFactory] (default task-15) [] Exception: java.lang.IllegalStateException: Transaction Local transaction (delegate=TransactionImple < ac, BasicAction: 0:ffffac15fa64:3d3ee80d:5b5b0064:59cf status: ActionStatus.ABORTED >, owner=Local transaction context for provider JBoss JTA transaction provider) is not active STATUS_ROLLEDBACK
2018-07-27 14:16:29,368+02 ERROR [org.ovirt.engine.core.bll.UpdateClusterCommand] (default task-15) [] Command 'org.ovirt.engine.core.bll.UpdateClusterCommand' failed: null
2018-07-27 14:16:29,368+02 ERROR [org.ovirt.engine.core.bll.UpdateClusterCommand] (default task-15) [] Exception: javax.ejb.EJBTransactionRolledbackException
2018-07-27 14:16:29,402+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-15) [] EVENT_ID: USER_UPDATE_CLUSTER_FAILED(812), Failed to update Host cluster (User: pavanm@unicloudidattica.local-authz)

Please find annexed the full engine.log

Comment 1 Stefano Stagnaro 2018-07-27 14:37:12 UTC
Created attachment 1471130 [details]
full engine.log

Comment 2 Arik 2018-07-29 07:32:08 UTC
Interesting. This bug reveals a fundamental issue in the implementation of upgrading cluster version:

The flow goes like that:
1. UpdateCluster is called
2. UpdateVm is called per-VM in the cluster
3. Optionally, SetVmDestroyOnReboot is called for each running VM

UpdateCluster is transactional.
UpdateVm is transactional.

Since UpdateVm uses transaction scope Required, each execution of UpdateVm is done within the transaction scope of UpdateCluster. This is the desired behavior because we wanted successful UpdateVm executions to be rolled-back upon a failure to upgrade one particular VM in the cluster. That's also why all UpdateVm executions are done within the same thread of the one that executes UpdateCluster.

So we have a transaction timeout of 5 min (i.e., 300 sec) for UpdateCluster. That means all VM updates should be completed within this time frame.

But when it comes to such amount of VMs, it may not be possible to complete all the VM updates in 5 min. Looking at the log, it seems that each VM update is fairly quick (~60msec) but it takes time for the execution thread to switch from one VM update to another. So it takes ~1 sec to complete a particular VM update and move to the next one.

Therefore, in 300 sec we manage to update ~300 VMs, which is far less than the number of VMs in the cluster.

Note: in this case, all VMs were not running (because SetVmDestroyOnReboot was not called). It would take more time to update a running VM due to the interaction with the host.

Conclusions:

Things to resolve this issue:

1. UpdateCluster should no longer be transactional. It was alright when update-cluster was a short operation that only updated the configuration of the cluster in the database, but now that it updates every VM in the cluster + possibly interacts with the host it is not appropriate anymore.

2. UpdateVm should no longer be transactional. It should use the compensation context of UpdateCluster when executed as a child of it.

Other things I thought of while investigating the log:

3. Once we change those commands to be non-transactional, we could also speed up the upgrade process by invoking the VM updates using more than one thread.

4. We currently possibly call SetVmDestroyOnReboot by each VM update. This has two drawbacks:
(a) There is a costly (in terms of execution time) call to VDSM per-VM
(b) This host-level flag is not rolled back for VMs that were successfully updated when a problem occurs afterward
It would be much smarter to call this verb per-host with all the VMs that were updated only at the end of UpdateCluster, when we know all VMs were updated successfully.

Comment 3 Raz Tamir 2018-10-13 09:59:19 UTC
Hi Andrej,

Please add 'priority' to the bug.

Comment 4 Vitalii Yerys 2018-10-18 16:33:59 UTC
Verified by adding a timeout after VM update of 3 sec, and performing cluster compatibility version update from 4.1 to 4.2.
Upgrade was done to the latest 4.2.7 currently available build.

steps:
1. Deploy 4.1 env.
2. create big amount of VMs - (450) in my case.
3. Upgrade 4.1 to 4.2
4. Insert timeout into engine process on updateVM action (used byteman)
5. Change cluster compatibility version.

Note: On previous ovirt-engine-4.2.7.2-0.1.el7ev.noarch I was able to reproduce the issue, and it occurred after 300 sec. With new build, cluster compatibility update process took 26 min 10 sec (as it should because of timeout after updateVM procedure). All VMs were update successfully.

RHV release: rhv-release-4.2.7-5-001.noarch
ovirt-engine: ovirt-engine-4.2.7.3-0.1.el7ev.noarch
rhel: Red Hat Enterprise Linux Server release 7.6 (Maipo)

Comment 5 Sandro Bonazzola 2018-11-02 14:31:49 UTC
This bugzilla is included in oVirt 4.2.7 release, published on November 2nd 2018.

Since the problem described in this bug report should be
resolved in oVirt 4.2.7 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.


Note You need to log in before you can comment on or make changes to this bug.