Bug 1486543
Summary: Migration leads to VM running on 2 Hosts

| Field | Value |
|---|---|
| Product | Red Hat Enterprise Virtualization Manager |
| Component | vdsm |
| Version | 4.1.2 |
| Status | CLOSED ERRATA |
| Severity | urgent |
| Priority | high |
| Hardware | x86_64 |
| OS | Linux |
| Reporter | Germano Veit Michel <gveitmic> |
| Assignee | Milan Zamazal <mzamazal> |
| QA Contact | Nisim Simsolo <nsimsolo> |
| CC | danken, lsurette, mavital, michal.skrivanek, nsimsolo, ratamir, srevivo, trichard, ycui, ykaul, ylavi |
| Keywords | ZStream |
| Target Milestone | ovirt-4.2.0 |
| Flags | lsvaty: testing_plan_complete- |
| oVirt Team | Virt |
| Type | Bug |
| Doc Type | Bug Fix |
| Clones | 1487913 (view as bug list) |
| Bug Blocks | 1487913 |
| Last Closed | 2018-05-15 17:51:57 UTC |

Doc Text:

> Previously, when a VM was migrating and the source host became non-operational, the VM could end up running on two hosts simultaneously. This has now been fixed.
**Description** (Germano Veit Michel, 2017-08-30 05:36:09 UTC)
**Comment 3 (Yaniv Kaul):**

While looking at the logs, I've noticed (on dch11bl01):

    2017-08-28 04:26:54,119+0530 INFO (jsonrpc/2) [jsonrpc.JsonRpcServer] RPC call Host.getAllVmStats succeeded in 0.00 seconds (__init__:533)
    2017-08-28 04:26:54,124+0530 WARN (jsonrpc/4) [virt.vm] (vmId='9d3969d2-af9e-4033-b20d-865b82a73a23') Failed to get metadata, domain not connected. (vm:2765)

which was happening while libvirt had this:

    Aug 28 04:27:14 dch11bl01 journal: Cannot start job (query, none) for domain cdvpgint02; current job is (none, migration in) owned by (0 <null>, 0 remoteDispatchDomainMigratePrepare3Params) for (0s, 33s)
    Aug 28 04:27:14 dch11bl01 journal: Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainMigratePrepare3Params)

Could it be that libvirt had not reported the domain to us yet, so we did not report it in getAllVmStats?

From https://bugs.launchpad.net/nova/+bug/1254872/comments/2 I see the explanation for this error: it indicates that two separate threads have made libvirt API calls against the same VM. One of the calls has either hung completely, or is taking a very long time to respond, causing the second API call to report this error message.

So we need to understand:

1. Did we or didn't we report the VM to Engine?
2. Why did libvirt hang?

**Reply (in reply to Yaniv Kaul from comment #3):**

> Could it be that libvirt did not report the domain to us yet, so we did not report it in getAllVmStats?

No, vdsm reports it even before we reach libvirt.

> 1. Did we or didn't we report the VM to Engine?

It's hard to say when there are so many issues in the VdsBroker calls to the hosts; many seem to fail on a networking exception, so it can easily happen that an operation is triggered on the vdsm side but returns an error on the engine side.

> 2. Why did libvirt hang?
It's not surprising when there are networking issues between the hosts and/or storage problems (especially on NFS).

OK, based on the provided logs, I think we have a plausible explanation:

1. The engine starts a migration from host A to host B.
2. A starts migrating; B starts waiting in _waitForUnderlyingMigration().
3. A goes NonOperational due to a storage connection issue, but the VM is not affected (the libvirt migration continues).
4. The engine calls destroy() on the destination (the code in moveVmsToUnknown() calling destroyVmOnDestination()).
5. On B, the vdsm VM entry is "destroyed", but the actual libvirt/QEMU VM is left in place (completeIncomingMigration() doesn't destroy the VM until the migration is completed).
6. B now reports the VM as Down (and it is then "destroyed" from the engine a second time to pick up the return value), and the VM stops being reported on B completely.
7. A comes back to the Up state while the libvirt migration is still ongoing; the migration eventually completes some 15 minutes later.
8. A reports Down/Migration Succeeded, and the engine's migration monitoring is switched to the supposed destination host B.
9. But B doesn't report anything about this VM, and it's no longer running on A either -> "VM is running in DB but not on host" -> the HA logic restarts the VM because it's no longer running anywhere.

This should be fixable by an explicit VM destroy on the destination even before the migration is completed. It should be easy to simulate: it should happen every time the source host goes NonOperational while successfully migrating a VM and then eventually comes back Up.

**Michal Skrivanek (comment #9):** can you verify it on master? It should be in tomorrow's nightly build.

**Comment:** Michal, I think that https://gerrit.ovirt.org/#/c/81227/ would not save us from corruption; we need something like https://gerrit.ovirt.org/#/c/78772 . Without the latter, qemu on the destination can start corrupting the data before Vdsm's domDependentInit runs (which can be forever if vdsm crashed). Please consider taking it in before resolving the bug.
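The proposed fix (an explicit VM destroy on the destination even before the migration completes) can be illustrated with a toy simulation of the sequence above. All class and function names here are invented for illustration; this is not actual vdsm or engine code:

```python
# Toy model of the race: on the destination, vdsm's bookkeeping entry and
# the underlying libvirt/QEMU domain are tracked separately, as in the bug.

class DestHost:
    """Destination host B for an incoming migration."""

    def __init__(self):
        self.vdsm_entry = None    # what vdsm reports to the engine
        self.qemu_domain = None   # the real incoming QEMU process

    def migration_prepare(self, vm_id):
        self.vdsm_entry = vm_id
        self.qemu_domain = vm_id  # QEMU starts receiving migration data

    def destroy(self, vm_id, kill_underlying_domain):
        """The engine's destroy call when the source goes NonOperational."""
        if self.vdsm_entry == vm_id:
            self.vdsm_entry = None
        if kill_underlying_domain and self.qemu_domain == vm_id:
            self.qemu_domain = None  # the fix: abort the incoming QEMU too


def run_scenario(destroy_kills_qemu):
    dest = DestHost()
    dest.migration_prepare("vm-1")

    # Source host A goes NonOperational; the engine destroys the VM on B.
    dest.destroy("vm-1", kill_underlying_domain=destroy_kills_qemu)

    # The libvirt migration was unaffected; if the incoming QEMU survived,
    # the VM is now really running on B, but vdsm no longer reports it.
    vm_running_on_dest = dest.qemu_domain is not None
    vm_reported_on_dest = dest.vdsm_entry is not None

    # A reports Down/Migration Succeeded; the engine sees the VM running
    # nowhere, so HA restarts it, possibly on the recovered source host.
    ha_restart_may_hit_source = not vm_reported_on_dest

    return vm_running_on_dest and ha_restart_may_hit_source  # split brain?


assert run_scenario(destroy_kills_qemu=False)      # buggy path: split brain
assert not run_scenario(destroy_kills_qemu=True)   # fixed path: single copy
```

The point of the model is that destroying only vdsm's bookkeeping entry leaves the incoming QEMU alive but invisible to the engine, which is exactly the precondition for the HA restart to create a second copy.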
The discussion will continue in patch 78772, but for now I consider this ready for QE testing in 4.2.

(In reply to Michal Skrivanek from comment #9)
> can you verify it on master? should be in tomorrow's nightly build

Yes, we can try.

INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason: [Project 'ovirt-engine'/Component 'vdsm' mismatch] For more info please contact: rhv-devops

Verification builds:

- rhvm-4.2.0.2-0.1.el7
- vdsm-4.20.9.1-1.el7ev.x86_64
- sanlock-3.5.0-1.el7.x86_64
- qemu-kvm-rhev-2.9.0-16.el7_4.12.x86_64
- libvirt-client-3.2.0-14.el7_4.5.x86_64

Verification scenario:

1. Change the migration policy to migrate slower (in webadmin: edit cluster -> migration policy -> set bandwidth to custom -> 10 Mbps).
2. Start migrating a VM from a non-SPM host to the SPM host.
3. Wait until the migration progress shown in webadmin is above 0 (i.e. the migration is running and copying something).
4. Cut the line between the engine and the source host (using an iptables DROP rule).
5. From webadmin, refresh the source host capabilities (this forces the engine to try to talk to the source host and fail on the network exception we want).
6. Wait until the migration actually finishes.
7. When it finishes, allow the connection to the source host again and wait until it is re-established.
8. When the source host is back up, verify that:
   - in webadmin the VM is marked as down;
   - "virsh -r list" on the destination host shows the migrated VM running on it.
9. Wait and observe the engine logs: the engine tries to execute RunVmCommand due to HA handling. In each try it has about a 50% chance of starting the VM on the destination host, not causing a split brain (but still about a 50% chance of starting it on the source host, which would cause a split brain).
10. Wait for the VM state to change back to Up in webadmin and verify that no split brain has occurred, using "virsh -r list" on both hosts.

INFO: Bug status (VERIFIED) wasn't changed but the following should be fixed: [Project 'ovirt-engine'/Component 'vdsm' mismatch] For more info please contact: rhv-devops

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:1489
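The check in steps 8 and 10 (the VM must be running on exactly one host) can be scripted. Here is a small sketch that parses "virsh -r list" output collected from both hosts; it assumes virsh's usual table layout of two header lines followed by Id/Name/State rows, and the host names and canned output are illustrative only:

```python
def hosts_running_vm(virsh_outputs, vm_name):
    """virsh_outputs maps a host name to the stdout of `virsh -r list`.
    Returns the list of hosts on which vm_name is currently running."""
    running_on = []
    for host, output in virsh_outputs.items():
        for line in output.splitlines()[2:]:   # skip the two header lines
            fields = line.split()
            if len(fields) >= 3 and fields[1] == vm_name and fields[2] == "running":
                running_on.append(host)
    return running_on


# Canned example: the VM runs only on the destination, so no split brain.
src_out = """ Id    Name                           State
----------------------------------------------------
"""
dst_out = """ Id    Name                           State
----------------------------------------------------
 3     cdvpgint02                     running
"""
owners = hosts_running_vm({"source": src_out, "dest": dst_out}, "cdvpgint02")
assert owners == ["dest"], "split brain (or VM missing): %s" % owners
```

A split brain would show up as the VM name appearing in the "running" state in both outputs, making `owners` contain both hosts.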