Bug 1486543
Summary: Migration leads to VM running on 2 Hosts

| Field | Value |
|---|---|
| Product | Red Hat Enterprise Virtualization Manager |
| Component | vdsm |
| Version | 4.1.2 |
| Status | CLOSED ERRATA |
| Severity | urgent |
| Priority | high |
| Hardware | x86_64 |
| OS | Linux |
| Reporter | Germano Veit Michel <gveitmic> |
| Assignee | Milan Zamazal <mzamazal> |
| QA Contact | Nisim Simsolo <nsimsolo> |
| CC | danken, lsurette, mavital, michal.skrivanek, nsimsolo, ratamir, srevivo, trichard, ycui, ykaul, ylavi |
| Keywords | ZStream |
| Target Milestone | ovirt-4.2.0 |
| Flags | lsvaty: testing_plan_complete- |
| oVirt Team | Virt |
| Type | Bug |
| Doc Type | Bug Fix |
| Clones | 1487913 (view as bug list) |
| Bug Blocks | 1487913 |
| Last Closed | 2018-05-15 17:51:57 UTC |

Doc Text:

> Previously, when a VM was migrating and the source host became non-operational, the VM could end up running on two hosts simultaneously. This has now been fixed.
**Description** (Germano Veit Michel, 2017-08-30 05:36:09 UTC)
**Comment 3 (Yaniv Kaul):**

While looking at the logs, I've noticed (on dch11bl01):

    2017-08-28 04:26:54,119+0530 INFO (jsonrpc/2) [jsonrpc.JsonRpcServer] RPC call Host.getAllVmStats succeeded in 0.00 seconds (__init__:533)
    2017-08-28 04:26:54,124+0530 WARN (jsonrpc/4) [virt.vm] (vmId='9d3969d2-af9e-4033-b20d-865b82a73a23') Failed to get metadata, domain not connected. (vm:2765)

which was happening while libvirt had this:

    Aug 28 04:27:14 dch11bl01 journal: Cannot start job (query, none) for domain cdvpgint02; current job is (none, migration in) owned by (0 <null>, 0 remoteDispatchDomainMigratePrepare3Params) for (0s, 33s)
    Aug 28 04:27:14 dch11bl01 journal: Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainMigratePrepare3Params)

Could it be that libvirt had not reported the domain to us yet, so we did not report it in getAllVmStats?

From https://bugs.launchpad.net/nova/+bug/1254872/comments/2 I see the explanation for this error: it indicates that two separate threads have made libvirt API calls against the same VM. One of the calls has either hung completely, or is taking a very long time to respond, causing the second API call to report this error message.

So we need to understand:

1. Did we or didn't we report the VM to Engine?
2. Why did libvirt hang?

**Reply (in reply to Yaniv Kaul from comment #3):**

> Could it be that libvirt did not report the domain to us yet, so we did not report it in getAllVmStats?

No, vdsm reports it even before we reach libvirt.

> 1. Did we or didn't we report the VM to Engine?

It's hard to say when there are so many issues in the VdsBroker calls to the hosts; many seem to fail on a networking exception, so it can easily happen that an operation is triggered on the vdsm side but returns an error on the engine side.

> 2. Why did libvirt hang?
It's not surprising when there are networking issues between the hosts and/or storage problems (especially on NFS).

OK, based on the provided logs, I think we have a plausible explanation:

1. The engine starts a migration from host A to host B.
2. A starts migrating; B starts waiting in _waitForUnderlyingMigration().
3. A goes NonOperational due to a storage connection issue, but the VM is not affected (the libvirt migration continues).
4. The engine calls destroy() on the destination (the code in moveVmsToUnknown() calling destroyVmOnDestination()).
5. On B, the vdsm VM entry is "destroyed", but the actual libvirt/QEMU VM is left in place (completeIncomingMigration() doesn't destroy the VM until the migration is completed).
6. B now reports the VM as Down (and it is then "destroyed" from the engine a second time to pick up the return value), and the VM stops being reported on B completely.
7. A comes back to the Up state while the libvirt migration is still ongoing; the migration eventually completes some 15 minutes later.
8. A reports Down/Migration Succeeded, and the engine's migration monitoring is switched to the supposed destination host B.
9. But B doesn't report anything about this VM, and it's no longer running on A either -> "VM is running in DB but not on host" -> the HA logic restarts the VM because it's no longer running anywhere.

This should be fixable by an explicit VM destroy on the destination even before the migration is completed. It should be easy to simulate: it should happen every time the source host goes NonOperational while successfully migrating a VM and then eventually comes back Up.

**Michal Skrivanek (comment #9):** can you verify it on master? It should be in tomorrow's nightly build.

**Comment:** Michal, I think that https://gerrit.ovirt.org/#/c/81227/ would not save us from corruption; we need something like https://gerrit.ovirt.org/#/c/78772 . Without the latter, qemu on the destination can start corrupting the data before Vdsm's domDependentInit runs (which can be forever if vdsm crashed). Please consider taking it in before resolving the bug.
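The proposed fix (an explicit VM destroy on the destination even before the migration completes) can be illustrated with a toy simulation of the sequence above. All class and function names here are invented for illustration; this is not actual vdsm or engine code:

```python
# Toy model of the race: on the destination, vdsm's bookkeeping entry and
# the underlying libvirt/QEMU domain are tracked separately, as in the bug.

class DestHost:
    """Destination host B for an incoming migration."""

    def __init__(self):
        self.vdsm_entry = None    # what vdsm reports to the engine
        self.qemu_domain = None   # the real incoming QEMU process

    def migration_prepare(self, vm_id):
        self.vdsm_entry = vm_id
        self.qemu_domain = vm_id  # QEMU starts receiving migration data

    def destroy(self, vm_id, kill_underlying_domain):
        """The engine's destroy call when the source goes NonOperational."""
        if self.vdsm_entry == vm_id:
            self.vdsm_entry = None
        if kill_underlying_domain and self.qemu_domain == vm_id:
            self.qemu_domain = None  # the fix: abort the incoming QEMU too


def run_scenario(destroy_kills_qemu):
    dest = DestHost()
    dest.migration_prepare("vm-1")

    # Source host A goes NonOperational; the engine destroys the VM on B.
    dest.destroy("vm-1", kill_underlying_domain=destroy_kills_qemu)

    # The libvirt migration was unaffected; if the incoming QEMU survived,
    # the VM is now really running on B, but vdsm no longer reports it.
    vm_running_on_dest = dest.qemu_domain is not None
    vm_reported_on_dest = dest.vdsm_entry is not None

    # A reports Down/Migration Succeeded; the engine sees the VM running
    # nowhere, so HA restarts it, possibly on the recovered source host.
    ha_restart_may_hit_source = not vm_reported_on_dest

    return vm_running_on_dest and ha_restart_may_hit_source  # split brain?


assert run_scenario(destroy_kills_qemu=False)      # buggy path: split brain
assert not run_scenario(destroy_kills_qemu=True)   # fixed path: single copy
```

The point of the model is that destroying only vdsm's bookkeeping entry leaves the incoming QEMU alive but invisible to the engine, which is exactly the precondition for the HA restart to create a second copy.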
The discussion will continue in patch 78772, but for now I consider this ready for QE testing in 4.2.

(In reply to Michal Skrivanek from comment #9)
> can you verify it on master? should be in tomorrow's nightly build

Yes, we can try.

INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason: [Project 'ovirt-engine'/Component 'vdsm' mismatch] For more info please contact: rhv-devops

Verification builds:

- rhvm-4.2.0.2-0.1.el7
- vdsm-4.20.9.1-1.el7ev.x86_64
- sanlock-3.5.0-1.el7.x86_64
- qemu-kvm-rhev-2.9.0-16.el7_4.12.x86_64
- libvirt-client-3.2.0-14.el7_4.5.x86_64

Verification scenario:

1. Change the migration policy to migrate slower (in webadmin: edit cluster -> migration policy -> set bandwidth to custom -> 10 Mbps).
2. Start migrating a VM from a non-SPM host to the SPM host.
3. Wait until the migration progress shown in webadmin is above 0 (i.e. the migration is running and copying something).
4. Cut the line between the engine and the source host (using an iptables DROP rule).
5. From webadmin, refresh the source host capabilities (this forces the engine to try to talk to the source host and fail on the network exception we want).
6. Wait until the migration actually finishes.
7. When it finishes, allow the connection to the source host again and wait until it is re-established.
8. When the source host is back up, verify that:
   - in webadmin the VM is marked as down;
   - "virsh -r list" on the destination host shows the migrated VM running on it.
9. Wait and observe the engine logs: the engine tries to execute RunVmCommand due to HA handling. In each try it has about a 50% chance of starting the VM on the destination host, not causing a split brain (but still about a 50% chance of starting it on the source host, which would cause a split brain).
10. Wait for the VM state to change back to Up in webadmin and verify that no split brain has occurred, using "virsh -r list" on both hosts.

INFO: Bug status (VERIFIED) wasn't changed but the following should be fixed: [Project 'ovirt-engine'/Component 'vdsm' mismatch] For more info please contact: rhv-devops

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:1489
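The check in steps 8 and 10 (the VM must be running on exactly one host) can be scripted. Here is a small sketch that parses "virsh -r list" output collected from both hosts; it assumes virsh's usual table layout of two header lines followed by Id/Name/State rows, and the host names and canned output are illustrative only:

```python
def hosts_running_vm(virsh_outputs, vm_name):
    """virsh_outputs maps a host name to the stdout of `virsh -r list`.
    Returns the list of hosts on which vm_name is currently running."""
    running_on = []
    for host, output in virsh_outputs.items():
        for line in output.splitlines()[2:]:   # skip the two header lines
            fields = line.split()
            if len(fields) >= 3 and fields[1] == vm_name and fields[2] == "running":
                running_on.append(host)
    return running_on


# Canned example: the VM runs only on the destination, so no split brain.
src_out = """ Id    Name                           State
----------------------------------------------------
"""
dst_out = """ Id    Name                           State
----------------------------------------------------
 3     cdvpgint02                     running
"""
owners = hosts_running_vm({"source": src_out, "dest": dst_out}, "cdvpgint02")
assert owners == ["dest"], "split brain (or VM missing): %s" % owners
```

A split brain would show up as the VM name appearing in the "running" state in both outputs, making `owners` contain both hosts.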