Bug 1458745

Summary: [DR] - Hosted engine VM migration fails with 10ms latency between hosted engine hosts
Product: [oVirt] ovirt-hosted-engine-ha
Component: General
Reporter: Elad <ebenahar>
Assignee: bugs <bugs>
QA Contact: Artyom <alukiano>
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: high
Version: 2.1.0.5
CC: bugs, dfediuck, ebenahar, jcall, mavital, msivak, tnisan, ylavi
Target Milestone: ovirt-4.2.0
Keywords: TestOnly, Triaged
Flags: rule-engine: ovirt-4.2+
Hardware: x86_64
OS: Unspecified
Type: Bug
oVirt Team: SLA
Last Closed: 2017-12-20 10:57:55 UTC
Bug Depends On: 1337914, 1478848, 1498327
Bug Blocks: 1284364, 1534978

Attachments:
A migration failure of "normal" VM (flags: none)

Description Elad 2017-06-05 11:30:19 UTC
Description of problem:
On an environment deployed as hosted engine for disaster recovery scenario tests, with 4 hosts in the HE cluster, I simulated 10ms latency on 2 of the hosts (using tc configured on the hosts' NIC). These 2 hosts are supposed to be in a remote location of a stretched cluster together with the other 2 hosts.

We see that when trying to migrate the hosted engine VM from one of the hosts that does not have latency configured on its NIC to a host that has latency configured, the HE VM migration fails. Sometimes it even fails to start.


Version-Release number of selected component (if applicable):


How reproducible:
Most of HE VM migration attempts

Steps to Reproduce:
On an environment with 4 hosts, deployed as hosted engine over iSCSI, with additional data domains (iSCSI and NFS) in the DC:
1. Configure 10ms latency on the NIC of 2 of the hosts (a fuller apply/verify/remove sketch follows these steps):
tc qdisc add dev eno1 root netem delay 10ms
2. Put the host the HE VM is running on (one of the hosts with no latency configured) into maintenance, so the HE VM is migrated to a host with latency configured on its NIC
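
For reference, a minimal sketch (shell) of applying, verifying, and removing such a delay; the peer address below is a placeholder used only to observe the change in round-trip time:

# apply a fixed 10ms delay to all egress traffic on the NIC used for host traffic
tc qdisc add dev eno1 root netem delay 10ms

# verify the qdisc and the observed latency (192.0.2.10 is a placeholder peer address)
tc qdisc show dev eno1
ping -c 5 192.0.2.10

# remove the delay once the test is done
tc qdisc del dev eno1 root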


Actual results:

Sometimes HE VM migration fails with the following error from Sanlock:

2017-05-29 10:43:00,582+0300 ERROR (vm/2eb55ea4) [virt.vm] (vmId='2eb55ea4-dd9e-42b2-bec7-9eea1cfaf322') The vm start process failed (vm:632)
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 563, in _startUnderlyingVm
    self._run()
  File "/usr/share/vdsm/virt/vm.py", line 2021, in _run
    self._connection.createXML(domxml, flags),
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 123, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 941, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 3777, in createXML
    if ret is None:raise libvirtError('virDomainCreateXML() failed', conn=self)
libvirtError: resource busy: Failed to acquire lock: error -243


hosted-engine-ha agent.log:

MainThread::INFO::2017-05-29 10:44:51,984::hosted_engine::453::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineUnexpectedlyDown (score: 0)



Sometimes, hosted engine VM migration fails (seen in vdsm.log) after putting the hosts with the high latency configured on their NIC into maintenance while the HE VM is running on one of them:

2017-06-05 13:37:04,837+0300 ERROR (migsrc/2eb55ea4) [virt.vm] (vmId='2eb55ea4-dd9e-42b2-bec7-9eea1cfaf322') Failed to acquire lock: No space left on device (migration:287)
2017-06-05 13:37:05,348+0300 ERROR (migsrc/2eb55ea4) [virt.vm] (vmId='2eb55ea4-dd9e-42b2-bec7-9eea1cfaf322') Failed to migrate (migration:429)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 411, in run
    self._startUnderlyingMigration(time.time())
  File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 489, in _startUnderlyingMigration
    self._perform_with_downtime_thread(duri, muri)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 556, in _perform_with_downtime_thread
    self._perform_migration(duri, muri)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 529, in _perform_migration
    self._vm._dom.migrateToURI3(duri, params, flags)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 69, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 123, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 941, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1939, in migrateToURI3
    if ret == -1: raise libvirtError ('virDomainMigrateToURI3() failed', dom=self)
libvirtError: Failed to acquire lock: No space left on device



There are 2 other non-HE VMs which are configured with VM leases to be HA. Their migration works with no issues under the same environment conditions (10ms latency on 2 of the 4 hosts).


Expected results:
Need to decide if HE VM migration should succeed (considering the fact that non HE VMs migration works properly with the same conditions)

Additional info:

All logs from hypervisors and engine will be provided soon.
The setup is available for investigation, please contact me.

Comment 1 Elad 2017-06-05 11:39:29 UTC
Version-Release number of selected component (if applicable):

ovirt-hosted-engine-setup-2.1.0.5-1.el7ev.noarch
ovirt-hosted-engine-ha-2.1.0.5-1.el7ev.noarch
vdsm-4.19.11-1.el7ev.x86_64
libvirt-daemon-2.0.0-10.el7_3.8.x86_64
qemu-img-rhev-2.6.0-28.el7_3.9.x86_64
libvirt-daemon-driver-qemu-2.0.0-10.el7_3.8.x86_64
qemu-kvm-common-rhev-2.6.0-28.el7_3.9.x86_64
qemu-kvm-tools-rhev-2.6.0-28.el7_3.9.x86_64
ipxe-roms-qemu-20160127-5.git6366fa7a.el7.noarch
qemu-kvm-rhev-2.6.0-28.el7_3.9.x86_64
sanlock-3.4.0-1.el7.x86_64
selinux-policy-3.13.1-102.el7_3.16.noarch
device-mapper-multipath-0.4.9-99.el7_3.3.x86_64
rhevm-4.1.2-0.1.el7.noarch
rhvm-appliance-4.1.20170425.0-1.el7.noarch

Comment 2 Martin Sivák 2017-06-05 15:56:13 UTC
Created attachment 1285083 [details]
A migration failure of "normal" VM

Before I answer the comments I have two questions about the test scenario:

- Do we support remote storage at all? Or a cluster that is not all at a single location? I am pretty sure we are telling people it is a bad idea (especially when using NFS).

- Was the latency just between the hosts, or between the hosts and storage as well?

The reason for all the issues you were describing:

The test setup was not really able to synchronize the two separated sets of hosts with this latency. The hosts from each group did not see the hosts from the other group as reliable. This caused the host that was moving to maintenance to avoid any migration, as there was no "best backup destination". The engine then got stuck in Preparing for maintenance (recoverable by Activating the host again).

> How reproducible:
> Most of HE VM migration attempts

Manual or maintenance-flow related? Hosted engine uses migration in just those two cases. I saw manual migration finish fine and maintenance fail because there was no good enough destination.


> Sometimes HE VM migration fails with the following error from Sanlock:
> libvirtError: resource busy: Failed to acquire lock: error -243

This is not a migration error. You should indicate which host reported this; it was reported by a free and active host from the second group. And the host did the right thing: it could not find the engine and so it tried to start a new instance of it. Sanlock properly prevented that, because a perfectly good engine is still up on another host.


> MainThread::INFO::2017-05-29
> 10:44:51,984::hosted_engine::453::ovirt_hosted_engine_ha.agent.hosted_engine.
> HostedEngine::(start_monitoring) Current state EngineUnexpectedlyDown
> (score: 0)

This is a separate bug; we have an issue in detecting the locking failure. VDSM events will solve this (we just merged the support into the vdsm master branch).


>     if ret == -1: raise libvirtError ('virDomainMigrateToURI3() failed',
> dom=self)
> libvirtError: Failed to acquire lock: No space left on device

This is either a storage space issue or a latency issue when talking to the remote storage. Not hosted engine related...


> Expected results:
> Need to decide if HE VM migration should succeed (considering the fact that
> non HE VMs migration works properly with the same conditions)

It definitely does not work fine either. I see that multiple attempts were needed to actually migrate the VM. We do not take this risk with the HE VM. We only perform one migration attempt and then let the admin fix the situation.

Note: the engine could do a better job of informing the admin about this; it already has all the data.


Please note that severe storage connection latency or packet loss can cause sanlock to lose synchronization, and that can result in an involuntary host reboot via the watchdog.

Comment 3 Yaniv Kaul 2017-06-05 17:11:54 UTC
Elad, can you clarify the topology? The latency is on which network?
- To the storage?
- On the migration network?
- The mgmt network?
- Some of the above are essentially on the same interface / same network?

(we need to test each independently)

Comment 4 Elad 2017-06-06 08:20:42 UTC
(In reply to Yaniv Kaul from comment #3)
> Elad, can you clarify the topology? The latency is on which network?
1 DC, 1 cluster, 4 hosts in it, deployed as hosted engine. On 2 of the hosts, latency is simulated by tc on their NIC for all inbound and outbound traffic. The NIC is used for all host traffic, with no separation (mgmt, storage, VM migration).

Comment 5 Doron Fediuck 2017-06-06 16:34:14 UTC
(In reply to Elad from comment #4)
> On 2 of the hosts, latency is simulated by tc on their NIC for all inbound
> and outbound traffic. The NIC is used for all host traffic, with no
> separation (mgmt, storage, VM migration)

In this case, this is unrelated to hosted engine. We can possibly add a warning,
as Martin mentioned, but it will be relevant for all storage-related activities.

Comment 6 Nikolai Sednev 2017-06-14 12:09:07 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1448699 might be related to https://bugzilla.redhat.com/show_bug.cgi?id=1458745#c2 

"
> MainThread::INFO::2017-05-29
> 10:44:51,984::hosted_engine::453::ovirt_hosted_engine_ha.agent.hosted_engine.
> HostedEngine::(start_monitoring) Current state EngineUnexpectedlyDown
> (score: 0)

This is a separate bug; we have an issue in detecting the locking failure. VDSM events will solve this (we just merged the support into the vdsm master branch)."

Comment 7 Yaniv Lavi 2017-06-19 13:17:37 UTC
We are working on active-active testing for oVirt 4.2. We want to make sure it works well and this is part of that work.

Comment 8 Martin Sivák 2017-06-19 13:44:18 UTC
The storage must be reliable enough; the connections apparently break too often in the simulated environment.

Was the delay of every packet independent?

A ---- sent
  B ---- sent

Or was it sequential?

A ---- sent B ---- sent

The first case would be something we might want to handle to some degree. The second case is something you won't encounter on a real network.

Hosted engine has nothing to do with either; it just asks for data from storage. But it only considers other hosts up if their last report arrived less than a minute ago. And NFS probably takes more than a single packet to perform any disk operation.
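
For reference, and assuming standard netem behavior rather than anything specific to this setup: a plain fixed delay like the one used here is applied to each packet independently (the first case above); jitter and correlation are optional extras. The values below are illustrative only:

# fixed 10ms delay, applied to every packet independently (the form used in this bug)
tc qdisc add dev eth0 root netem delay 10ms

# delay with jitter and correlation: 10ms +/- 2ms, 25% correlated with the previous packet
tc qdisc change dev eth0 root netem delay 10ms 2ms 25%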

Comment 9 Elad 2017-07-23 14:31:19 UTC
Martin, all I can say for sure is that the latency on the host's NIC was 10ms during the migration

Comment 10 Martin Sivák 2017-07-24 15:05:03 UTC
And how was the latency inserted? Have you tried downloading a plain file to see the effect of it?

independent latency: the download starts slower, but reaches full speed
sequential latency: it will be sloooow
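
A minimal sketch of such a check (the host address and file URL below are placeholders):

# per-packet effect: each ICMP round trip should grow by roughly the configured delay
ping -c 10 192.0.2.10

# throughput effect: with independent per-packet delay the transfer should still approach
# full speed once the TCP window opens; with serialized delay it stays slow throughout
curl -o /dev/null -s -w 'average speed: %{speed_download} bytes/s\n' http://192.0.2.10/bigfile.bin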

Comment 11 Elad 2017-07-24 15:44:39 UTC
(In reply to Martin Sivák from comment #10)
> And how was the latency inserted? Have you tried downloading a plain file to
> see the effect of it?
Using tc:
# tc qdisc add dev eth0 root netem delay 10ms

The effect was seen in ping; the latency was higher than 10ms for each ICMP packet.
> 
> independent latency: the download starts slower, but reaches full speed
> sequential latency: it will be sloooow

Comment 12 Martin Sivák 2017-11-22 12:27:59 UTC
We enabled migration profiles for the hosted engine VM, which might fix the migration issues as well. Please see bug 1478848.

As for the storage and synchronization issues when latency is introduced, I do not see anything we can do here from the hosted engine side. The grace periods are already pretty long (1 minute for the host liveness check, 5 minutes for the engine-up check).

Comment 13 Martin Sivák 2017-11-27 14:16:59 UTC
I just realized we fixed a couple of bugs where we interrogated the storage too often. That, combined with the latency, might have also contributed to the issue.

Can you please retest using the latest 4.2 Hosted Engine (or even 4.1; it contains the most important fixes as well)?

Comment 14 Artyom 2017-11-30 14:41:31 UTC
HE environment with 3 hosts on NFS storage
HE VM runs on host_3

Verification steps:
1) Configure network delay on host_1 and host_2
# tc qdisc show dev enp26s0f0
qdisc netem 8001: root refcnt 65 limit 1000 delay 10.0ms
2) Put host with HE VM to maintenance
3) HE VM migrated to host_1 without any troubles

I tried it a number of times and it looks fine; no exceptions in the agent or vdsm logs.

I checked it on both 4.1 and 4.2.
4.2
=================================
[root@alma07 ~]# rpm -qa | grep vdsm
vdsm-python-4.20.8-1.el7ev.noarch
vdsm-4.20.8-1.el7ev.x86_64
vdsm-hook-vfio-mdev-4.20.8-1.el7ev.noarch
vdsm-yajsonrpc-4.20.8-1.el7ev.noarch
vdsm-jsonrpc-4.20.8-1.el7ev.noarch
vdsm-hook-ethtool-options-4.20.8-1.el7ev.noarch
vdsm-hook-openstacknet-4.20.8-1.el7ev.noarch
vdsm-http-4.20.8-1.el7ev.noarch
vdsm-client-4.20.8-1.el7ev.noarch
vdsm-hook-vhostmd-4.20.8-1.el7ev.noarch
vdsm-api-4.20.8-1.el7ev.noarch
vdsm-hook-vmfex-dev-4.20.8-1.el7ev.noarch
vdsm-hook-fcoe-4.20.8-1.el7ev.noarch
[root@alma07 ~]# rpm -qa | grep hoste
ovirt-hosted-engine-ha-2.2.0-0.2.master.gitcbe3c76.el7ev.noarch
ovirt-hosted-engine-setup-2.2.0-0.4.master.git0b67c21.el7ev.noarch

4.1
=================================
[root@alma05 ~]# rpm -qa | grep vdsm
vdsm-hook-vmfex-dev-4.19.40-1.el7ev.noarch
vdsm-xmlrpc-4.19.40-1.el7ev.noarch
vdsm-cli-4.19.40-1.el7ev.noarch
vdsm-yajsonrpc-4.19.40-1.el7ev.noarch
vdsm-api-4.19.40-1.el7ev.noarch
vdsm-jsonrpc-4.19.40-1.el7ev.noarch
vdsm-python-4.19.40-1.el7ev.noarch
vdsm-client-4.19.40-1.el7ev.noarch
vdsm-4.19.40-1.el7ev.x86_64
[root@alma05 ~]# rpm -qa | grep hosted
ovirt-hosted-engine-setup-2.1.4-1.el7ev.noarch
ovirt-hosted-engine-ha-2.1.8-1.el7ev.noarch

I opened another bug, https://bugzilla.redhat.com/show_bug.cgi?id=1519289, for an issue I encountered during the verification process, but it does not really relate to this bug.

Comment 15 Sandro Bonazzola 2017-12-20 10:57:55 UTC
This bugzilla is included in oVirt 4.2.0 release, published on Dec 20th 2017.

Since the problem described in this bug report should be resolved in that release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.