Bug 1609319
| Field | Value |
| --- | --- |
| Summary | failed to update cluster compatibility to version 4.2 |
| Product | [oVirt] ovirt-engine |
| Component | General |
| Version | 4.2.4.5 |
| Status | CLOSED CURRENTRELEASE |
| Severity | medium |
| Priority | medium |
| Reporter | Stefano Stagnaro <stefano.stagnaro> |
| Assignee | Andrej Krejcir <akrejcir> |
| QA Contact | Vitalii Yerys <vyerys> |
| CC | ahadas, akrejcir, bugs, michal.skrivanek, ratamir |
| Target Milestone | ovirt-4.2.7 |
| Flags | rule-engine: ovirt-4.2+ |
| Hardware | Unspecified |
| OS | Unspecified |
| Fixed In Version | ovirt-engine-4.2.7.3 |
| Doc Type | If docs needed, set a value |
| oVirt Team | Virt |
| Type | Bug |
| Last Closed | 2018-11-02 14:31:49 UTC |
Description
Stefano Stagnaro
2018-07-27 14:30:28 UTC
Created attachment 1471130 [details]
full engine.log
Interesting. This bug reveals a fundamental issue in the implementation of upgrading the cluster compatibility version.

The flow goes like this:
1. UpdateCluster is called.
2. UpdateVm is called per VM in the cluster.
3. Optionally, SetVmDestroyOnReboot is called for each running VM.

UpdateCluster is transactional, and so is UpdateVm. Since UpdateVm uses transaction scope Required, each execution of UpdateVm runs within the transaction scope of UpdateCluster. This is the desired behavior, because we wanted successful UpdateVm executions to be rolled back upon a failure to upgrade any particular VM in the cluster. That is also why all UpdateVm executions run on the same thread as the one executing UpdateCluster. (A minimal sketch of this nesting appears after this comment thread.)

So we have a transaction timeout of 5 min (i.e., 300 sec) for UpdateCluster, which means all VM updates must complete within that time frame. With such a large number of VMs, that may not be possible. Looking at the log, each VM update itself is fairly quick (~60 ms), but it takes time for the execution thread to switch from one VM update to the next, so completing one VM update and moving on to the next takes ~1 sec. Therefore, in 300 sec we manage to update only ~300 VMs, far fewer than the number of VMs in this cluster.

Note: in this case none of the VMs were running (SetVmDestroyOnReboot was not called). Updating a running VM would take even longer because of the interaction with the host.

Conclusions. Things needed to resolve this issue:
1. UpdateCluster should no longer be transactional. That was fine when updating a cluster was a short operation that only changed the cluster's configuration in the database, but now that it updates every VM in the cluster and possibly interacts with the host, it is no longer appropriate.
2. UpdateVm should no longer be transactional. It should use the compensation context of UpdateCluster when executed as its child (see the compensation sketch below).

Other things I thought of while investigating the log:
3. Once those commands are non-transactional, we could also speed up the upgrade by invoking the VM updates from more than one thread (see the thread-pool sketch below).
4. We currently may call SetVmDestroyOnReboot as part of each VM update. This has two drawbacks: (a) there is a costly (in terms of execution time) call to VDSM per VM; (b) the host-level flag is not rolled back for VMs that were successfully updated when a problem occurs afterwards. It would be much smarter to call this verb per host, with all the updated VMs, only at the end of UpdateCluster, once we know all VMs were updated successfully (see the batching sketch below).

Hi Andrej,
Please add 'priority' to the bug.

Verified by adding a 3 sec delay after each VM update and performing a cluster compatibility version update from 4.1 to 4.2. The upgrade was done with the latest 4.2.7 build currently available.

Steps:
1. Deploy a 4.1 environment.
2. Create a large number of VMs (450 in my case).
3. Upgrade 4.1 to 4.2.
4. Inject a delay into the engine process on the UpdateVm action (done with Byteman; an illustrative rule appears below).
5. Change the cluster compatibility version.

Note: on the previous build, ovirt-engine-4.2.7.2-0.1.el7ev.noarch, I was able to reproduce the issue; it occurred after 300 sec. With the new build, the cluster compatibility update took 26 min 10 sec, which is expected given the injected delay (450 VMs × 3 sec is ~22.5 min of delay alone). All VMs were updated successfully.
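To make the failure mode concrete, here is a minimal sketch of the nested-transaction behavior described in the analysis above, written with JTA-style annotations. All class and method names are placeholders, not the actual ovirt-engine code:

```java
import java.util.List;
import javax.transaction.Transactional;
import javax.transaction.Transactional.TxType;

// Hypothetical stand-in for a VM entity; not the engine's real class.
class Vm {
    String name;
}

class UpdateVmCommand {
    // TxType.REQUIRED: join the caller's transaction if one is open,
    // so this update never commits on its own.
    @Transactional(TxType.REQUIRED)
    void execute(Vm vm) {
        // ~60 ms of actual work per VM, plus ~1 s of thread-switching
        // overhead, according to the log analysis above.
    }
}

class UpdateClusterCommand {
    // A single transaction opens here; the container's transaction
    // timeout (300 s in this deployment) covers everything below.
    @Transactional(TxType.REQUIRED)
    void execute(List<Vm> vmsInCluster) {
        for (Vm vm : vmsInCluster) {
            new UpdateVmCommand().execute(vm); // joins our transaction
        }
        // Commit (or a full rollback) happens once, here. At ~1 s per
        // VM, only ~300 VMs fit inside the 300 s window.
    }
}
```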
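A rough illustration of the compensation approach from conclusion 2, under the assumption that each per-VM update becomes short and self-committing. CompensationContext and every other name here are illustrative, not the engine's real API:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Record one undo action per VM and replay the undo log, newest first,
// if a later update fails.
class CompensationContext {
    private final Deque<Runnable> undoLog = new ArrayDeque<>();

    void snapshot(Runnable undo) {
        undoLog.push(undo);
    }

    void compensate() {
        while (!undoLog.isEmpty()) {
            undoLog.pop().run();
        }
    }
}

class ClusterUpgrade {
    // Hypothetical immutable VM config; stands in for the VM's static data.
    record VmConfig(String id, String clusterVersion) { }

    void updateAllVms(List<VmConfig> vms, CompensationContext ctx) {
        for (VmConfig vm : vms) {
            VmConfig before = vm;                 // capture the old config
            ctx.snapshot(() -> restore(before));  // register the undo step
            try {
                updateVm(vm);                     // short, commits on its own
            } catch (RuntimeException e) {
                ctx.compensate();                 // undo earlier VM updates
                throw e;
            }
        }
    }

    void updateVm(VmConfig vm) { /* per-VM update, self-committing */ }
    void restore(VmConfig before) { /* write the old config back */ }
}
```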
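For conclusion 3, a small sketch of fanning the now-independent VM updates out over a thread pool; the pool size and names are arbitrary examples:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Once the per-VM updates no longer share one transaction, they can
// run concurrently instead of serially on the UpdateCluster thread.
class ParallelVmUpdate {
    void updateAllVms(List<Runnable> vmUpdates) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(8);
        try {
            vmUpdates.forEach(pool::submit);
        } finally {
            pool.shutdown();
            // Generous upper bound; a real implementation would also
            // collect per-VM results and failures.
            pool.awaitTermination(30, TimeUnit.MINUTES);
        }
    }
}
```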
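And for conclusion 4, a sketch of batching SetVmDestroyOnReboot per host instead of per VM; again, the names and shape are assumptions, not the actual VDSM interface:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Group successfully updated running VMs by host, then issue a single
// host-level call at the very end instead of one VDSM call per VM.
class DestroyOnRebootBatcher {
    private final Map<String, List<String>> vmsByHost = new HashMap<>();

    // Called as each running VM is updated successfully.
    void markUpdated(String hostId, String vmId) {
        vmsByHost.computeIfAbsent(hostId, h -> new ArrayList<>()).add(vmId);
    }

    // Called once, after the whole cluster update has succeeded, so the
    // host-level flag is never left set for a rolled-back upgrade.
    void flush() {
        vmsByHost.forEach(this::setDestroyOnRebootForHost);
    }

    void setDestroyOnRebootForHost(String hostId, List<String> vmIds) {
        /* one host-level verb carrying all VM ids */
    }
}
```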
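For step 4 of the verification, a hypothetical Byteman rule of the kind that injects the 3 sec delay; the target class and method are assumptions, not necessarily what the tester used:

```
# Sleep 3 s after every VM update to stretch the total update time.
# Class and method names below are placeholders.
RULE delay after each vm update
CLASS UpdateVmCommand
METHOD executeCommand
AT EXIT
IF true
DO Thread.sleep(3000)
ENDRULE
```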
Verified with:
- RHV release: rhv-release-4.2.7-5-001.noarch
- ovirt-engine: ovirt-engine-4.2.7.3-0.1.el7ev.noarch
- RHEL: Red Hat Enterprise Linux Server release 7.6 (Maipo)

This bugzilla is included in the oVirt 4.2.7 release, published on November 2nd 2018. Since the problem described in this bug report should be resolved in oVirt 4.2.7, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.