Description of problem:

The logs and events here are extremely complex and convoluted, but this looks like a vdsm bug to me. The way I see it:

1. The VM migrates from host A to host B.
2. B does not report it, but the VM migrates and exists on B.
3. At the end of the migration, A stops reporting the VM (destroyed).
4. The engine does not see the VM on either B (not reporting it) or A (destroyed after the migration finished).
5. The VM is HA, so the engine triggers HA handling (the VM must be up somewhere) and restarts it elsewhere; the VM is now possibly running on 2 hosts (the HA restart host and B).

Note: the VM (cdvpgira01) is HA but without a lease.

Here are the details. These are from the qemu logs for the VM:

TIMESTAMP                      ACTION    HOST
2017-08-28 21:35:36.555+0000   MIG_FROM  dch12bl01
2017-08-28 21:35:37.350+0000   starting  dch11bl13
2017-08-28 21:56:54.369+0000   shutting  dch12bl01   = Running on dch11bl13
2017-08-28 22:00:14.382+0000   starting  dch11bl04   (???)
2017-08-28 22:16:18.352+0000   starting  dch12bl04
2017-08-28 22:17:30.254+0000   MIG_FROM  dch11bl04
2017-08-28 22:18:30.200+0000   shutting  dch11bl04   = Running on dch11bl13 AND dch12bl04

* On the last line we migrated the SECOND, just-started copy of the VM.
* Engine and VDSM timestamps are around 5h30m ahead of the qemu logs.
* Engine and VDSM timestamps have drifts of +/- 2 minutes.

Now, what happened at the (???) line: why was the VM started on dch11bl04?

2017-08-29 03:29:21,909+05 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler5) [17b50bde] VM 'accf8231-66bf-446f-8f41-1bf645abe31a'(cdvpgira01) is running in db and not running on VDS '2a02097f-7026-4d63-93d7-3db474b1f889'(dch11bl13)

So it was not reported by the host it had just migrated to (dch11bl13, 21:35 in the qemu logs), even though the qemu-kvm process existed there for several hours.

Destination Migration (dch11bl13):

2017-08-29 03:05:34,853+0530 INFO  (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::ffff:A.B.C.D:34046 (protocoldetector:72)
2017-08-29 03:05:35,058+0530 INFO  (jsonrpc/7) [vdsm.api] START __init__(cif=<clientIF.clientIF object at 0x2c78090>, params={u'guestFQDN': u'cdvpgira01
2017-08-29 03:05:35,063+0530 INFO  (vm/accf8231) [virt.vm] (vmId='accf8231-66bf-446f-8f41-1bf645abe31a') VM wrapper has started (vm:1930)

Could this be related?

2017-08-29 03:05:37,963+0530 WARN  (jsonrpc/1) [virt.vm] (vmId='accf8231-66bf-446f-8f41-1bf645abe31a') Failed to get metadata, domain not connected. (vm:2765)
2017-08-29 03:05:37,963+0530 ERROR (jsonrpc/1) [jsonrpc.JsonRpcServer] Internal server error (__init__:570)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 565, in _handle_request
    res = method(**params)
  File "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 198, in _dynamicMethod
    result = fn(*methodArgs)
  File "/usr/share/vdsm/API.py", line 1454, in getAllVmIoTunePolicies
    io_tune_policies_dict = self._cif.getAllVmIoTunePolicies()
  File "/usr/share/vdsm/clientIF.py", line 448, in getAllVmIoTunePolicies
    'current_values': v.getIoTune()}
  File "/usr/share/vdsm/virt/vm.py", line 2803, in getIoTune
    result = self.getIoTuneResponse()
  File "/usr/share/vdsm/virt/vm.py", line 2816, in getIoTuneResponse
    res = self._dom.blockIoTune(
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 47, in __getattr__
    % self.vmid)
NotConnectedError: VM u'accf8231-66bf-446f-8f41-1bf645abe31a' was not started yet or was shut down

Later:

2017-08-29 03:13:11,282+0530 INFO  (jsonrpc/2) [vdsm.api] START destroy(gracefulAttempts=1) (api:39)
2017-08-29 03:13:11,287+0530 INFO  (vm/accf8231) [virt.vm] (vmId='accf8231-66bf-446f-8f41-1bf645abe31a') Changed state to Down: VM destroyed during the

Source Host (dch12bl01):

2017-08-29 03:05:33,557+0530 INFO  (jsonrpc/2) [vdsm.api] START migrate
2017-08-29 03:05:35,453+0530 INFO  (migsrc/accf8231) [virt.vm] (vmId='accf8231-66bf-446f-8f41-1bf645abe31a') Creation of destination VM took: 1 seconds (migration:455)
2017-08-29 03:05:35,453+0530 INFO  (migsrc/accf8231) [virt.vm] (vmId='accf8231-66bf-446f-8f41-1bf645abe31a') starting migration to qemu+tls://A.B.C.D...

It's migrating:

2017-08-29 03:10:45,512+0530 INFO  (migmon/accf8231) [virt.vm] (vmId='accf8231-66bf-446f-8f41-1bf645abe31a') Migration Progress: 310 seconds elapsed, 26% of data processed, total data: 65632MB, processed data: 16062MB, remaining data: 48945MB, transfer speed 52MBps, zero pages: 168320MB, compressed: 0MB, dirty rate: 0, memory iteration: 1 (migration:815)

And it finishes fine:

2017-08-29 03:26:56,571+0530 INFO  (migsrc/accf8231) [virt.vm] (vmId='accf8231-66bf-446f-8f41-1bf645abe31a') Changed state to Down: Migration succeeded (code=4) (vm:1222)

So:
1. Migration from dch12bl01 to dch11bl13.
2. From dch12bl01's perspective, it succeeded.
3. From dch11bl13's perspective, the VM was reported as MigratingTo and then not reported anymore. Failed?
4. As soon as the Destroy for the VM on the source host is sent, the engine notices the VM is not running anywhere and starts it on dch11bl04.

More proof that the VM existed on both hosts at the same time (host clocks are off by +/- 2 minutes; these journal timestamps are the qemu-kvm log times +5h30m). This is the host the VM migrated to but which was not reported to the engine; the VM existed there for several hours:

Aug 29 03:05:37 dch11bl13 systemd-machined: New machine qemu-47-cdvpgira01.
Aug 29 16:01:22 dch11bl13 systemd-machined: Machine qemu-47-cdvpgira01 terminated.

Overlap here:

Aug 29 03:30:14 dch11bl04 systemd-machined: New machine qemu-36-cdvpgira01.
Aug 29 03:48:30 dch11bl04 systemd-machined: Machine qemu-36-cdvpgira01 terminated.

Overlap here:

Aug 29 03:46:18 dch12bl04 systemd-machined: New machine qemu-54-cdvpgira01.
Aug 29 06:11:57 dch12bl04 systemd-machined: Machine qemu-54-cdvpgira01 terminated.

And the VM jumped to several other hosts... all overlapping.

If the VM still existed on dch11bl13 after the migration, why was it not reported in getAllVmStats? I think this is what triggered the HA re-run.
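To check that last question in a reproducible way, a cross-check like the following could be used. This is a purely hypothetical diagnostic sketch, not part of vdsm or the engine; the libvirt URI and the set of VM IDs the host reported (e.g. extracted from a Host.getAllVmStats response) are assumed example inputs.

import libvirt  # provided by the libvirt-python package

def unreported_domains(uri, reported_vm_ids):
    # List domains that are active in libvirt but missing from the set
    # of VM IDs the host reported to the engine.
    conn = libvirt.openReadOnly(uri)  # e.g. 'qemu:///system' when run on the host
    try:
        missing = []
        for dom in conn.listAllDomains():
            if dom.isActive() and dom.UUIDString() not in reported_vm_ids:
                missing.append((dom.name(), dom.UUIDString()))
        return missing
    finally:
        conn.close()

if __name__ == '__main__':
    # VM IDs the host reported to the engine (assumed example value).
    reported = {'9d3969d2-af9e-4033-b20d-865b82a73a23'}
    for name, uuid in unreported_domains('qemu:///system', reported):
        print('active in libvirt but not reported: %s (%s)' % (name, uuid))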
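Regarding the "Failed to get metadata, domain not connected" warning and the NotConnectedError traceback earlier in this comment: the whole getAllVmIoTunePolicies RPC fails with an internal server error because a single VM's libvirt domain is not connected yet. A minimal sketch of how such a per-VM failure could be contained, assuming a dict of VM wrappers; this is an illustration, not the actual vdsm code:

class NotConnectedError(Exception):
    # Stand-in for vdsm.virt.virdomain.NotConnectedError.
    pass

def get_all_vm_io_tune_policies(vms):
    # vms: dict of vm_id -> VM wrapper exposing io_tune_policy() and io_tune().
    policies = {}
    for vm_id, vm in vms.items():
        try:
            policies[vm_id] = {'policy': vm.io_tune_policy(),
                               'current_values': vm.io_tune()}
        except NotConnectedError:
            # The domain is still being set up (e.g. incoming migration);
            # skip it instead of failing the whole RPC with an internal error.
            continue
    return policies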
Also note that when the HA re-run was triggered, the engine (by luck) tried to start the VM on the host where it already existed (the migration destination, dch11bl13)! And it failed:

2017-08-29 03:30:20,982+05 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler2) [] EVENT_ID: VM_DOWN_ERROR(119), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM cdvpgira01 is down with error. Exit message: Requested operation is not valid: domain 'cdvpgira01' is already active.

There are also several messages like this, which don't make much sense to me as they contradict the MigrateVDS command:

2017-08-29 03:07:12,256+05 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler2) [] VM 'accf8231-66bf-446f-8f41-1bf645abe31a'(cdvpgira01) was unexpectedly detected as 'MigratingTo' on VDS '2a02097f-7026-4d63-93d7-3db474b1f889'(dch11bl13) (expected on 'd7191d33-899d-49ea-9309-d51da80bc5eb')

Version-Release number of selected component (if applicable):
ovirt-engine-4.1.2.3-0.1.el7.noarch
vdsm-4.19.15-1.el7ev.x86_64
While looking at the logs, I've noticed (on dch11bl01):

2017-08-28 04:26:54,119+0530 INFO  (jsonrpc/2) [jsonrpc.JsonRpcServer] RPC call Host.getAllVmStats succeeded in 0.00 seconds (__init__:533)
2017-08-28 04:26:54,124+0530 WARN  (jsonrpc/4) [virt.vm] (vmId='9d3969d2-af9e-4033-b20d-865b82a73a23') Failed to get metadata, domain not connected. (vm:2765)

which was happening while libvirt had this:

Aug 28 04:27:14 dch11bl01 journal: Cannot start job (query, none) for domain cdvpgint02; current job is (none, migration in) owned by (0 <null>, 0 remoteDispatchDomainMigratePrepare3Params) for (0s, 33s)
Aug 28 04:27:14 dch11bl01 journal: Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainMigratePrepare3Params)

Could it be that libvirt had not reported the domain to us yet, so we did not report it in getAllVmStats?

From https://bugs.launchpad.net/nova/+bug/1254872/comments/2 I see the explanation for this error: what it indicates is that two separate threads have made libvirt API calls against the same VM. One of the calls has either hung completely, or is taking a very long time to respond, causing the second API call to report this error message.

So we need to understand:
1. Did we or didn't we report the VM to the engine?
2. Why did libvirt get hung?
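To illustrate the mechanism described in the nova comment, here is a toy sketch (purely hypothetical, not libvirt or vdsm code): one thread holds a per-domain lock for a long time, so a second call on the same domain times out, mirroring libvirt's "cannot acquire state change lock" error. All names here are invented for the illustration.

import threading
import time

domain_lock = threading.Lock()  # stands in for libvirt's per-domain job lock

def incoming_migration_prepare():
    # Simulates remoteDispatchDomainMigratePrepare3Params hanging for a long time.
    with domain_lock:
        time.sleep(60)

def query_domain(timeout=30):
    # Simulates a monitoring/query call against the same domain.
    if not domain_lock.acquire(timeout=timeout):
        raise RuntimeError('Timed out during operation: cannot acquire state change lock')
    try:
        return 'running'
    finally:
        domain_lock.release()

if __name__ == '__main__':
    t = threading.Thread(target=incoming_migration_prepare)
    t.daemon = True
    t.start()
    time.sleep(1)
    try:
        print(query_domain(timeout=5))
    except RuntimeError as e:
        print(e)  # the second caller fails, which is the effect vdsm sees from libvirt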
(In reply to Yaniv Kaul from comment #3)
> While looking at the logs, I've noticed (on dch11bl01):
> 2017-08-28 04:26:54,119+0530 INFO (jsonrpc/2) [jsonrpc.JsonRpcServer] RPC call Host.getAllVmStats succeeded in 0.00 seconds (__init__:533)
> 2017-08-28 04:26:54,124+0530 WARN (jsonrpc/4) [virt.vm] (vmId='9d3969d2-af9e-4033-b20d-865b82a73a23') Failed to get metadata, domain not connected. (vm:2765)
>
> which was happening while libvirt had this:
> Aug 28 04:27:14 dch11bl01 journal: Cannot start job (query, none) for domain cdvpgint02; current job is (none, migration in) owned by (0 <null>, 0 remoteDispatchDomainMigratePrepare3Params) for (0s, 33s)
> Aug 28 04:27:14 dch11bl01 journal: Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainMigratePrepare3Params)
>
> Could it be that libvirt did not report the domain to us yet, so we did not report it in getAllVmStats?

No, vdsm reports it even before we reach libvirt.

> From https://bugs.launchpad.net/nova/+bug/1254872/comments/2 I see the explanation for this error:
> What this error message indicates is that two separate threads have made libvirt API calls against the same VM. One of the calls has either hung completely, or is taking a very long time to respond, causing the second API call to report this error message.
>
> So need to understand:
> 1. Did we or didn't we report the VM to Engine?

It's hard to say when there are so many issues in the VdsBroker calls to the hosts; many seem to fail on a networking exception, so it can easily happen that an operation is triggered on the vdsm side but returns an error on the engine side.

> 2. Why libvirt got hung?

That's not surprising when there are networking issues between hosts and/or storage problems (especially on NFS).
OK, based on the provided logs I think we have a plausible explanation:

- The engine starts a migration from host A to host B.
- A starts migrating; B starts waiting in _waitForUnderlyingMigration().
- A goes NonOperational due to a storage connection issue, but the VM is not affected (the libvirt migration continues).
- The engine calls destroy() on the destination (the code in moveVmsToUnknown() calling destroyVmOnDestination()).
- B's vdsm VM entry is "destroyed", but the actual libvirt/QEMU VM is left in place (completeIncomingMigration() doesn't destroy the VM until the migration is completed).
- Now B reports the VM as Down (and it is then "destroyed" from the engine a second time to pick up the return value), and the VM stops being reported on B completely.
- A comes back to the Up state while the libvirt migration is still ongoing; the migration eventually completes some 15 minutes later.
- A reports Down/Migration Succeeded, and the engine's migration monitoring is switched to the supposed destination host B.
- But B doesn't report anything about this VM, and it's no longer running on A either -> "VM is running in DB but not on host" -> the HA logic restarts the VM because it's no longer running.

This should be fixable by explicitly destroying the VM on the destination even before the migration is completed.

This should be easy to simulate: it should happen every time the source host goes NonOperational while successfully migrating a VM and the host eventually goes back to Up.
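A minimal sketch of the proposed direction, assuming a toy VM wrapper and an already-obtained libvirt domain handle; this is not the actual fix (see https://gerrit.ovirt.org/#/c/81227/), only an illustration of destroying the incoming domain on the destination even before the migration completes:

import threading

class IncomingMigrationVm(object):
    # Toy model of a destination-side VM during an incoming migration.

    def __init__(self, dom):
        self._dom = dom            # libvirt domain handle (assumed)
        self._lock = threading.Lock()
        self._destroy_requested = False

    def destroy(self):
        # Called when the engine sends destroy() to the destination.
        with self._lock:
            self._destroy_requested = True
            # The key change: tear down the underlying libvirt/QEMU domain
            # right away instead of only dropping vdsm's VM entry and
            # waiting for completeIncomingMigration().
            self._dom.destroy()

    def complete_incoming_migration(self):
        # Called when the libvirt migration finishes on the destination.
        with self._lock:
            if self._destroy_requested:
                try:
                    # The migration still finished after a destroy was
                    # requested: make sure the domain does not stay active
                    # and unreported (it may already be gone).
                    self._dom.destroy()
                except Exception:
                    pass
                return False   # do not start reporting the VM as Up
            return True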
can you verify it on master? should be in tomorrow's nightly build
Michal, I think that https://gerrit.ovirt.org/#/c/81227/ would not save us from corruption, we need something like https://gerrit.ovirt.org/#/c/78772 . Without the latter, qemu on the destination can start corrupting the data before Vdsm's domDependentInit runs (which can be forever if vdsm crashed). Please consider taking it in before resolving the bug.
The discussion will continue in patch 78772, but for now I consider this ready for QE testing in 4.2.
(In reply to Michal Skrivanek from comment #9)
> can you verify it on master? should be in tomorrow's nightly build

Yes, we can try.
INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason: [Project 'ovirt-engine'/Component 'vdsm' mismatch] For more info please contact: rhv-devops
Verification builds:
rhvm-4.2.0.2-0.1.el7
vdsm-4.20.9.1-1.el7ev.x86_64
sanlock-3.5.0-1.el7.x86_64
qemu-kvm-rhev-2.9.0-16.el7_4.12.x86_64
libvirt-client-3.2.0-14.el7_4.5.x86_64

Verification scenario:
1. Change the migration policy to migrate slower (in webadmin: edit cluster -> migration policy -> set bandwidth to custom -> 10 Mbps).
2. Start migrating from a non-SPM host to the SPM host.
3. Wait until the migration progress starts showing up in webadmin and is not 0 (i.e. the migration is ongoing and copying something).
4. Cut the line between the engine and the source host (using an iptables DROP rule).
5. From webadmin, refresh the source host capabilities (this forces the engine to try to talk to the source host and fail on the network exception we want).
6. Wait until the migration actually finishes.
7. When it does finish, allow the connection to the source host again and wait until it is re-initiated.
8. When the source host is back up again, verify:
   - in webadmin the VM is marked as down
   - "virsh -r list" on the destination host shows the migrated VM running on it
9. Wait and observe the engine logs; the engine tries to execute RunVmCommand due to HA handling. In each try it has about a 50% chance of doing it on the destination host, not causing a split brain (but we still have a 50% chance it will start it on the source host, causing a split brain :)).
10. Wait for the VM state to change back to up in webadmin and verify that a split brain did not occur, using the "virsh -r list" command on both hosts (a scripted variant is sketched below).
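For step 10, a hypothetical helper (not part of the verification tooling) that checks for a split brain by connecting read-only to both hosts via libvirt; the host URIs and the VM name are example values to adjust for the environment under test:

import libvirt  # libvirt-python

def hosts_running_vm(vm_name, uris):
    # Return the URIs of the hosts on which the VM is currently active.
    running_on = []
    for uri in uris:
        conn = libvirt.openReadOnly(uri)
        try:
            active = [d.name() for d in conn.listAllDomains() if d.isActive()]
            if vm_name in active:
                running_on.append(uri)
        finally:
            conn.close()
    return running_on

if __name__ == '__main__':
    uris = ['qemu+tls://source-host/system', 'qemu+tls://dest-host/system']
    hosts = hosts_running_vm('cdvpgira01', uris)
    if len(hosts) > 1:
        print('SPLIT BRAIN: VM is active on: %s' % ', '.join(hosts))
    else:
        print('OK: VM is active on: %s' % (', '.join(hosts) if hosts else 'no host'))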
INFO: Bug status (VERIFIED) wasn't changed but the following should be fixed: [Project 'ovirt-engine'/Component 'vdsm' mismatch] For more info please contact: rhv-devops
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:1489