Bug 1609319
| Field | Value |
| --- | --- |
| Summary | failed to update cluster compatibility to version 4.2 |
| Product | [oVirt] ovirt-engine |
| Component | General |
| Version | 4.2.4.5 |
| Status | CLOSED CURRENTRELEASE |
| Severity | medium |
| Priority | medium |
| Reporter | Stefano Stagnaro <stefano.stagnaro> |
| Assignee | Andrej Krejcir <akrejcir> |
| QA Contact | Vitalii Yerys <vyerys> |
| CC | ahadas, akrejcir, bugs, michal.skrivanek, ratamir |
| Target Milestone | ovirt-4.2.7 |
| Flags | rule-engine: ovirt-4.2+ |
| Hardware | Unspecified |
| OS | Unspecified |
| Fixed In Version | ovirt-engine-4.2.7.3 |
| Doc Type | If docs needed, set a value |
| oVirt Team | Virt |
| Type | Bug |
| Last Closed | 2018-11-02 14:31:49 UTC |
Description
Stefano Stagnaro
2018-07-27 14:30:28 UTC
Created attachment 1471130 [details]
full engine.log
Interesting. This bug reveals a fundamental issue in the implementation of upgrading the cluster compatibility version.

The flow goes like this:
1. UpdateCluster is called.
2. UpdateVm is called per VM in the cluster.
3. Optionally, SetVmDestroyOnReboot is called for each running VM.

UpdateCluster is transactional, and so is UpdateVm. Since UpdateVm uses transaction scope Required, each execution of UpdateVm runs within the transaction scope of UpdateCluster. This is the desired behavior, because we wanted successful UpdateVm executions to be rolled back upon a failure to upgrade any particular VM in the cluster. That is also why all UpdateVm executions run on the same thread as the one executing UpdateCluster. (A minimal sketch of this nesting appears after this comment thread.)

So we have a transaction timeout of 5 min (i.e., 300 sec) for UpdateCluster, which means all VM updates must complete within that time frame. With such a large number of VMs, that may not be possible. Looking at the log, each VM update itself is fairly quick (~60 ms), but it takes time for the execution thread to switch from one VM update to the next, so completing one VM update and moving on to the next takes ~1 sec. Therefore, in 300 sec we manage to update only ~300 VMs, far fewer than the number of VMs in this cluster.

Note: in this case none of the VMs were running (SetVmDestroyOnReboot was not called). Updating a running VM would take even longer because of the interaction with the host.

Conclusions. Things needed to resolve this issue:
1. UpdateCluster should no longer be transactional. That was fine when updating a cluster was a short operation that only changed the cluster's configuration in the database, but now that it updates every VM in the cluster and possibly interacts with the host, it is no longer appropriate.
2. UpdateVm should no longer be transactional. It should use the compensation context of UpdateCluster when executed as its child (see the compensation sketch below).

Other things I thought of while investigating the log:
3. Once those commands are non-transactional, we could also speed up the upgrade by invoking the VM updates from more than one thread (see the thread-pool sketch below).
4. We currently may call SetVmDestroyOnReboot as part of each VM update. This has two drawbacks: (a) there is a costly (in terms of execution time) call to VDSM per VM; (b) the host-level flag is not rolled back for VMs that were successfully updated when a problem occurs afterwards. It would be much smarter to call this verb per host, with all the updated VMs, only at the end of UpdateCluster, once we know all VMs were updated successfully (see the batching sketch below).

Hi Andrej,
Please add 'priority' to the bug.

Verified by adding a 3 sec delay after each VM update and performing a cluster compatibility version update from 4.1 to 4.2. The upgrade was done with the latest 4.2.7 build currently available.

Steps:
1. Deploy a 4.1 environment.
2. Create a large number of VMs (450 in my case).
3. Upgrade 4.1 to 4.2.
4. Inject a delay into the engine process on the UpdateVm action (done with Byteman; an illustrative rule appears below).
5. Change the cluster compatibility version.

Note: on the previous build, ovirt-engine-4.2.7.2-0.1.el7ev.noarch, I was able to reproduce the issue; it occurred after 300 sec. With the new build, the cluster compatibility update took 26 min 10 sec, which is expected given the injected delay (450 VMs × 3 sec is ~22.5 min of delay alone). All VMs were updated successfully.
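To make the failure mode concrete, here is a minimal sketch of the nested-transaction behavior described in the analysis above, written with JTA-style annotations. All class and method names are placeholders, not the actual ovirt-engine code:

```java
import java.util.List;
import javax.transaction.Transactional;
import javax.transaction.Transactional.TxType;

// Hypothetical stand-in for a VM entity; not the engine's real class.
class Vm {
    String name;
}

class UpdateVmCommand {
    // TxType.REQUIRED: join the caller's transaction if one is open,
    // so this update never commits on its own.
    @Transactional(TxType.REQUIRED)
    void execute(Vm vm) {
        // ~60 ms of actual work per VM, plus ~1 s of thread-switching
        // overhead, according to the log analysis above.
    }
}

class UpdateClusterCommand {
    // A single transaction opens here; the container's transaction
    // timeout (300 s in this deployment) covers everything below.
    @Transactional(TxType.REQUIRED)
    void execute(List<Vm> vmsInCluster) {
        for (Vm vm : vmsInCluster) {
            new UpdateVmCommand().execute(vm); // joins our transaction
        }
        // Commit (or a full rollback) happens once, here. At ~1 s per
        // VM, only ~300 VMs fit inside the 300 s window.
    }
}
```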
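A rough illustration of the compensation approach from conclusion 2, under the assumption that each per-VM update becomes short and self-committing. CompensationContext and every other name here are illustrative, not the engine's real API:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Record one undo action per VM and replay the undo log, newest first,
// if a later update fails.
class CompensationContext {
    private final Deque<Runnable> undoLog = new ArrayDeque<>();

    void snapshot(Runnable undo) {
        undoLog.push(undo);
    }

    void compensate() {
        while (!undoLog.isEmpty()) {
            undoLog.pop().run();
        }
    }
}

class ClusterUpgrade {
    // Hypothetical immutable VM config; stands in for the VM's static data.
    record VmConfig(String id, String clusterVersion) { }

    void updateAllVms(List<VmConfig> vms, CompensationContext ctx) {
        for (VmConfig vm : vms) {
            VmConfig before = vm;                 // capture the old config
            ctx.snapshot(() -> restore(before));  // register the undo step
            try {
                updateVm(vm);                     // short, commits on its own
            } catch (RuntimeException e) {
                ctx.compensate();                 // undo earlier VM updates
                throw e;
            }
        }
    }

    void updateVm(VmConfig vm) { /* per-VM update, self-committing */ }
    void restore(VmConfig before) { /* write the old config back */ }
}
```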
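For conclusion 3, a small sketch of fanning the now-independent VM updates out over a thread pool; the pool size and names are arbitrary examples:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Once the per-VM updates no longer share one transaction, they can
// run concurrently instead of serially on the UpdateCluster thread.
class ParallelVmUpdate {
    void updateAllVms(List<Runnable> vmUpdates) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(8);
        try {
            vmUpdates.forEach(pool::submit);
        } finally {
            pool.shutdown();
            // Generous upper bound; a real implementation would also
            // collect per-VM results and failures.
            pool.awaitTermination(30, TimeUnit.MINUTES);
        }
    }
}
```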
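And for conclusion 4, a sketch of batching SetVmDestroyOnReboot per host instead of per VM; again, the names and shape are assumptions, not the actual VDSM interface:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Group successfully updated running VMs by host, then issue a single
// host-level call at the very end instead of one VDSM call per VM.
class DestroyOnRebootBatcher {
    private final Map<String, List<String>> vmsByHost = new HashMap<>();

    // Called as each running VM is updated successfully.
    void markUpdated(String hostId, String vmId) {
        vmsByHost.computeIfAbsent(hostId, h -> new ArrayList<>()).add(vmId);
    }

    // Called once, after the whole cluster update has succeeded, so the
    // host-level flag is never left set for a rolled-back upgrade.
    void flush() {
        vmsByHost.forEach(this::setDestroyOnRebootForHost);
    }

    void setDestroyOnRebootForHost(String hostId, List<String> vmIds) {
        /* one host-level verb carrying all VM ids */
    }
}
```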
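For step 4 of the verification, a hypothetical Byteman rule of the kind that injects the 3 sec delay; the target class and method are assumptions, not necessarily what the tester used:

```
# Sleep 3 s after every VM update to stretch the total update time.
# Class and method names below are placeholders.
RULE delay after each vm update
CLASS UpdateVmCommand
METHOD executeCommand
AT EXIT
IF true
DO Thread.sleep(3000)
ENDRULE
```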
Verified with:
- RHV release: rhv-release-4.2.7-5-001.noarch
- ovirt-engine: ovirt-engine-4.2.7.3-0.1.el7ev.noarch
- RHEL: Red Hat Enterprise Linux Server release 7.6 (Maipo)

This bugzilla is included in the oVirt 4.2.7 release, published on November 2nd 2018. Since the problem described in this bug report should be resolved in oVirt 4.2.7, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.