Bug 1366786
| Summary: | [InClusterUpgrade] Possible race condition with large amount of VMs in cluster | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Javier Coscia <jcoscia> |
| Component: | ovirt-engine | Assignee: | Arik <ahadas> |
| Status: | CLOSED ERRATA | QA Contact: | sefi litmanovich <slitmano> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 3.6.8 | CC: | achareka, ahadas, baptiste.agasse, bgraveno, flo_bugzilla, gklein, jcoscia, kmashalk, kshukla, lsurette, mgoldboi, michal.skrivanek, mlibra, mtessun, rbalakri, rgolan, Rhev-m-bugs, srevivo, ykaul |
| Target Milestone: | ovirt-4.1.0-alpha | Keywords: | ZStream |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | Previously, updating the compatibility version of a cluster with many running virtual machines that had the guest agent installed could fail because of a database deadlock. In some cases, such clusters could not be upgraded to a newer compatibility version. The database deadlock is now prevented, so a cluster with many running virtual machines with the guest agent installed can be upgraded to a newer compatibility version. | Story Points: | --- |
| Clone Of: | | | |
| : | 1369415, 1369418 (view as bug list) | Environment: | |
| Last Closed: | 2017-04-25 00:54:44 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Virt | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1369415, 1369418 | | |
Description (Javier Coscia, 2016-08-12 19:54:27 UTC)
According to your log statements this engine is still 3.5 (VdsUpdateRunTimeInfo doesn't exist in the 3.6 codebase), so this setup seems unfinished. In the engine.log I can't see any reference to VdsUpdateRunTimeInfo after 8/12 @ 09:03:15,629. According to the setup log, the system was updated to 3.6.8 on 8/12 @ 09:58:20, and ovirt-engine was restarted on 8/12 @ 11:09:04,968.

Attaching the setup logs in private.

This is definitely a race between IncrementDbGeneration, which is called by an update to a VM, and InsertVmGuestAgentInterface. The cluster update runs UpdateVM for every VM in the cluster, including the Up VMs of course. This collides with a batch update from the VMs monitoring, and apparently the two do not perform their updates in the same order, so we hit the deadlock. Looking further.

This problem will not be limited to any particular cluster parameter change (like InClusterUpgrade) but would apply to any update of the cluster in which IncrementDbGeneration is called, right? For example, can we hit this race condition while changing the cluster compatibility level?

Full analysis of the cause:

1. UpdateClusterCommand calls UpdateVm in a loop, without sorting, on *every* call, without any condition. The fix should be to call it *only* if the cluster compatibility level changed AND to sort the VMs before looping over them.
2. The batch update of guest interfaces should also be sorted by VM id, to prevent it from deadlocking with the loop update from UpdateCluster.
3. Nice to have: @ahadas suggests maybe not calling the guest interface update inside a transaction.

(See the sketches at the end of this section for an illustration of the lock-ordering problem and of the sorting fix.)

In response to comment #14: UpdateVm is called if the cluster version has changed (please see UpdateClusterCommand.getSharedLocks()).

Update: this deadlock cannot be reproduced with PostgreSQL 9.5.3 (Fedora 24) but can be reproduced with PostgreSQL 8.4.20 (RHEL 6). I'll check with PostgreSQL 9.2 (RHEL 7). It can happen with PostgreSQL 9.2, so a proper fix will be done for 4.0.

Verified with rhevm-4.1.0.4-0.1.el7.noarch. Had a cluster with 120 VMs running with the guest agent working and reporting IPs. Changed the cluster compatibility version from 4.0 to 4.1 (the hosts were 4.1 all the time) and monitored the UpdateVm calls with tail on the engine log. Repeated this upgrade flow several times; no race occurred in any of the iterations.
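The deadlock described above is a classic lock-ordering problem: two concurrent transactions update overlapping rows in different orders, each ends up waiting for a row lock the other already holds, and PostgreSQL aborts one of them. The following is a minimal, self-contained sketch of that pattern; the connection string and the table and column names (vm_static, db_generation, vm_guid) are assumptions loosely modeled on the engine schema, not code taken from ovirt-engine.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Arrays;
import java.util.List;

/**
 * Illustrative sketch of the lock-ordering deadlock: two transactions update
 * the same two rows in opposite orders. Each takes its first row lock, then
 * blocks waiting for the row the other transaction already holds; PostgreSQL
 * detects the cycle and aborts one transaction with a deadlock error.
 * Connection string, table and column names are assumptions for the demo.
 */
public class LockOrderingDeadlockDemo {

    private static final String URL =
            "jdbc:postgresql://localhost/engine?user=engine&password=engine";

    static void bumpGenerations(List<String> vmIdsInLockOrder) throws Exception {
        try (Connection conn = DriverManager.getConnection(URL)) {
            conn.setAutoCommit(false); // one transaction for the whole loop
            try (PreparedStatement ps = conn.prepareStatement(
                    "UPDATE vm_static SET db_generation = db_generation + 1 WHERE vm_guid = ?::uuid")) {
                for (String vmId : vmIdsInLockOrder) {
                    ps.setString(1, vmId);
                    ps.executeUpdate();    // row lock is acquired in iteration order
                    Thread.sleep(500);     // widen the window so both transactions hold their first lock
                }
            }
            conn.commit();
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> forward = Arrays.asList(
                "11111111-1111-1111-1111-111111111111",
                "22222222-2222-2222-2222-222222222222");
        List<String> backward = Arrays.asList(forward.get(1), forward.get(0));

        // Thread A locks vm1 then vm2; thread B locks vm2 then vm1 -> circular wait.
        Thread a = new Thread(() -> { try { bumpGenerations(forward);  } catch (Exception e) { e.printStackTrace(); } });
        Thread b = new Thread(() -> { try { bumpGenerations(backward); } catch (Exception e) { e.printStackTrace(); } });
        a.start(); b.start();
        a.join();  b.join();
    }
}
```

Run against a database containing the two assumed rows, one of the two threads will typically fail with "ERROR: deadlock detected"; which transaction loses is decided by PostgreSQL's deadlock detector.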
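The fix suggested in points 1 and 2 of the analysis boils down to imposing a single global lock order: sort the affected VMs by id before issuing the updates, in both the UpdateCluster loop and the batched guest-agent interface update. Below is a minimal sketch of that idea under that assumption; the class, entity and callback names are illustrative, not the actual engine code.

```java
import java.util.Comparator;
import java.util.List;
import java.util.UUID;
import java.util.stream.Collectors;

/**
 * Sketch of the ordering fix: every writer that touches a set of VMs sorts
 * them by id first, so row locks are always taken in ascending-id order and
 * a circular wait between the cluster update loop and the monitoring batch
 * update can no longer form. Names here are illustrative, not engine code.
 */
public class OrderedVmUpdates {

    /** Minimal stand-in for the engine's VM entity. */
    public static final class Vm {
        private final UUID id;
        public Vm(UUID id) { this.id = id; }
        public UUID getId() { return id; }
    }

    /** Hypothetical persistence callback; in the engine this would be a DAO call. */
    public interface VmUpdater {
        void update(Vm vm);
    }

    /** Apply an update to every VM, always in ascending id order. */
    public static void updateInStableOrder(List<Vm> vms, VmUpdater updater) {
        List<Vm> sorted = vms.stream()
                .sorted(Comparator.comparing(Vm::getId)) // single global lock order: by VM id
                .collect(Collectors.toList());
        for (Vm vm : sorted) {
            updater.update(vm);
        }
    }
}
```

Point 3 of the analysis (not running the guest-interface update inside one transaction) attacks the same problem from the other side: if the batch writer does not hold all of its row locks until commit, the window in which a circular wait can form shrinks accordingly.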