Bug 1459829 - [DR] - Hosted-Engine VM migration fails on: 'libvirtError: Failed to acquire lock: No space left on device' (while there is latency of more than 7 ms between src and dst hosts.) [NEEDINFO]
Status: POST
Product: ovirt-engine
Classification: oVirt
Component: BLL.HostedEngine
---
x86_64 Unspecified
high Severity high
: ovirt-4.2.4
: ---
Assigned To: Yanir Quinn
Elad
:
Depends On:
Blocks:
 
Reported: 2017-06-08 06:03 EDT by Elad
Modified: 2018-05-25 06:30 EDT (History)
11 users

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: SLA
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
yquinn: needinfo? (pagranat)
rule-engine: ovirt‑4.2+
ylavi: exception+


Attachments
dst-vdsm.log (1.15 MB, application/x-xz)
2017-06-10 06:52 EDT, Elad
no flags Details
sanlock.log (61.14 KB, application/x-gzip)
2017-06-12 10:45 EDT, Elad
no flags Details
logs-25.6 (2.05 MB, application/x-gzip)
2017-06-25 08:20 EDT, Elad
no flags Details
logs - 17.9.17 (1.96 MB, application/x-gzip)
2017-09-17 09:06 EDT, Elad
no flags Details


External Trackers
Tracker ID Priority Status Summary Last Updated
oVirt gerrit 89445 master POST core : migrate HE VM with engine in Maintenance 2018-04-25 08:05 EDT

Description Elad 2017-06-08 06:03:40 EDT
Description of problem:
VM migration fails with the following error in vdsm:


2017-06-08 12:25:34,558+0300 ERROR (migsrc/2eb55ea4) [virt.vm] (vmId='2eb55ea4-dd9e-42b2-bec7-9eea1cfaf322') Failed to migrate (migration:429)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 411, in run
    self._startUnderlyingMigration(time.time())
  File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 489, in _startUnderlyingMigration
    self._perform_with_downtime_thread(duri, muri)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 556, in _perform_with_downtime_thread
    self._perform_migration(duri, muri)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 529, in _perform_migration
    self._vm._dom.migrateToURI3(duri, params, flags)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 69, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 123, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 941, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1939, in migrateToURI3
    if ret == -1: raise libvirtError ('virDomainMigrateToURI3() failed', dom=self)
libvirtError: Failed to acquire lock: No space left on device



Version-Release number of selected component (if applicable):
vdsm-4.19.12-1.el7ev.x86_64
libvirt-daemon-2.0.0-10.el7_3.8.x86_64
qemu-kvm-rhev-2.6.0-28.el7_3.9.x86_64
sanlock-3.4.0-1.el7.x86_64
selinux-policy-3.13.1-102.el7_3.16.noarch

How reproducible:
Not always

Steps to Reproduce:
In my RHV environment, I had 4 hosts deployed as hosted-engine hosts. I put the host running the hosted-engine VM into maintenance, and the hosted-engine VM migration started.


Actual results:
Hosted engine VM migration failed with the mentioned error in vdsm


libvirt.log:

2017-06-08 09:25:34.308+0000: 1562: error : virNetClientProgramDispatchError:177 : Failed to acquire lock: No space left on device


Expected results:
VM migration should succeed

Will attach logs separately as they are too big for attachment
Comment 2 Michal Skrivanek 2017-06-09 02:33:36 EDT
(In reply to Elad from comment #0)
> Description of problem:
> VM migration fails with the following error in vdsm:
[...]
>   File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 556, in _perform_with_downtime_thread

Is this a new installation or an upgraded 3.6? Apparently you're using the 3.6 legacy migration policy. Is that intentional?

> libvirtError: Failed to acquire lock: No space left on device

This is quite self-explanatory. Even if it were bogus, I'd expect you to actually check the disk space and note it in the bug description.

> Version-Release number of selected component (if applicable):
> vdsm-4.19.12-1.el7ev.x86_64
> libvirt-daemon-2.0.0-10.el7_3.8.x86_64

This is an unreleased version. Moreover, el7_3.9 was released 15 days ago. So your system is not up to date.

> qemu-kvm-rhev-2.6.0-28.el7_3.9.x86_64
> sanlock-3.4.0-1.el7.x86_64
> selinux-policy-3.13.1-102.el7_3.16.noarch
> 
> How reproducible:
> Not always
> 
> Steps to Reproduce:
> In my RHV environment, had 4 hosts deployed as hosted-engine. I put the host
> that had the hosted engine VM running on to maintenance, the hosted-engine
> VM migration started

If you talk about HE, please always distinguish between the special HE maintenance and engine/UI-initiated host maintenance.

> Actual results:
> Hosted engine VM migration failed with the mentioned error in vdsm
> 
> 
> libvirt.log:
> 
> 2017-06-08 09:25:34.308+0000: 1562: error :
> virNetClientProgramDispatchError:177 : Failed to acquire lock: No space left
> on device
> 
> 
> Expected results:
> VM migration should succeed
> 
> Will attach logs separately as they are too big for attachment

Sorry, but this is a very poor bug report. Before dumping a 339MB file on us, I do expect you to do some basic analysis; at the very least, check the disk space when you see a failure due to no space left.
Comment 3 Elad 2017-06-09 05:31:53 EDT
(In reply to Michal Skrivanek from comment #2)
> (In reply to Elad from comment #0)
> > Description of problem:
> > VM migration fails with the following error in vdsm:
> [...]
> >   File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 556, in _perform_with_downtime_thread
> 
> is this a new installation or upgraded 3.6? Apparently you're using 3.6
> legacy migration policy. Is that intentional?
This is not an upgraded setup; it's a new 4.1 installation.
> > libvirtError: Failed to acquire lock: No space left on device
> 
> this is quite self-explanatory. Even if it would be bogus I expect you
> actually check the disk space and note it in the bug description
VM disk space? This is the hosted-engine VM, whose disk resides on the hosted_storage SD, which had 36G free at the time of the VM migration.

> > Version-Release number of selected component (if applicable):
> > vdsm-4.19.12-1.el7ev.x86_64
> > libvirt-daemon-2.0.0-10.el7_3.8.x86_64
> 
> this is an unreleased version. Moreover el7_3.9 was released 15 days ago. So
> you r system is not up to date
> 
> > qemu-kvm-rhev-2.6.0-28.el7_3.9.x86_64
> > sanlock-3.4.0-1.el7.x86_64
> > selinux-policy-3.13.1-102.el7_3.16.noarch
> > 
> > How reproducible:
> > Not always
> > 
> > Steps to Reproduce:
> > In my RHV environment, had 4 hosts deployed as hosted-engine. I put the host
> > that had the hosted engine VM running on to maintenance, the hosted-engine
> > VM migration started
> 
> if you talk about HE please always distinguish between the special HE
> maintenance and engine/UI initiated host maintenance.
UI maintenance

> 
> > Actual results:
> > Hosted engine VM migration failed with the mentioned error in vdsm
> > 
> > 
> > libvirt.log:
> > 
> > 2017-06-08 09:25:34.308+0000: 1562: error :
> > virNetClientProgramDispatchError:177 : Failed to acquire lock: No space left
> > on device
> > 
> > 
> > Expected results:
> > VM migration should succeed
> > 
> > Will attach logs separately as they are too big for attachment
> 
> Sorry, but this is a very poor bug report. Before dumping a 339MB file on
> us, I do expect you do some basic analysis, at the very least check the disk
> space when you see a failure due to no space left.

Sorry for the huge logs file; our systems have libvirtd.log at DEBUG log level, which is why the file is so big. Other than libvirtd.log, the vdsm and engine logs are provided; I'll also upload the HE logs.
Comment 4 Elad 2017-06-09 05:42:58 EDT
Correction: libvirtd.log was not rotated for a long period. It's not in debug.
Comment 5 Michal Skrivanek 2017-06-09 05:47:24 EDT
(In reply to Elad from comment #3)
> (In reply to Michal Skrivanek from comment #2)
> > (In reply to Elad from comment #0)
> > > Description of problem:
> > > VM migration fails with the following error in vdsm:
> > [...]
> > >   File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 556, in _perform_with_downtime_thread
> > 
> > is this a new installation or upgraded 3.6? Apparently you're using 3.6
> > legacy migration policy. Is that intentional?
> This is not an upgraded setup, it's new 4.1 installation 

this seems to be a common problem, I filed bug 1460160 for that. It's not related to your issue though.

> > > libvirtError: Failed to acquire lock: No space left on device
> > 
> > this is quite self-explanatory. Even if it would be bogus I expect you
> > actually check the disk space and note it in the bug description
> VM disk space? This is the hosted-engine VM which its disk resides on the
> hosted_storage SD which had 36G free space in the time of the VM migration. 

The SD of the HE VM is not relevant. The virtualization host's disk space is relevant. Since I see the libvirt log file alone is 6GB, I wouldn't be surprised if you ran out of space. Please check it.

[...]
> vdsm and engine log are provided, I'll upload also HE logs.

no need for more logs right now, rather check the disk space first
Comment 6 Elad 2017-06-09 05:52:06 EDT
(In reply to Michal Skrivanek from comment #5)

> SD of HE VM is not relevant. The virtualization host's disk space is
> relevant. Since I see the libvirt log file alone is 6GB I wouldn't be
> surprised if you run out of space. Please check it

Both source and destination hosts file system have enough free space:

Source host:

[root@green-vdse ~]# df -h
Filesystem                                                   Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup01-root                                  263G   24G  240G   9% /
devtmpfs                                                      16G     0   16G   0% /dev
tmpfs                                                         16G  8.0K   16G   1% /dev/shm
tmpfs                                                         16G  362M   16G   3% /run
tmpfs                                                         16G     0   16G   0% /sys/fs/cgroup
/dev/sda1                                                    673M  192M  447M  31% /boot
yellow-vdsb.qa.lab.tlv.redhat.com:/Storage_NFS/dr/dr0        2.0T  441G  1.6T  22% /rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_dr_dr0
yellow-vdsb.qa.lab.tlv.redhat.com:/Storage_NFS/dr/dr_export  2.0T  441G  1.6T  22% /rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_dr_dr__export
tmpfs                                                        3.2G     0  3.2G   0% /run/user/0


Destination host:

[root@green-vdsd ~]# df -h
Filesystem                                                   Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup01-root                                  263G   17G  247G   7% /
devtmpfs                                                      16G     0   16G   0% /dev
tmpfs                                                         16G  8.0K   16G   1% /dev/shm
tmpfs                                                         16G  362M   16G   3% /run
tmpfs                                                         16G     0   16G   0% /sys/fs/cgroup
/dev/sda1                                                    673M  193M  446M  31% /boot
yellow-vdsb.qa.lab.tlv.redhat.com:/Storage_NFS/dr/dr0        2.0T  441G  1.6T  22% /rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_dr_dr0
yellow-vdsb.qa.lab.tlv.redhat.com:/Storage_NFS/dr/dr_export  2.0T  441G  1.6T  22% /rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_dr_dr__export
/dev/mapper/49eb9232--588b--4ee9--9d42--da711d450fa4-master  976M  1.3M  924M   1% /rhev/data-center/mnt/blockSD/49eb9232-588b-4ee9-9d42-da711d450fa4/master
tmpfs                                                        3.2G     0  3.2G   0% /run/user/0
Comment 7 Michal Skrivanek 2017-06-09 10:38:57 EDT
It might be the sanlock lockspace for HE.

Your destination vdsm.log is from a different period, so there is no useful information from the dst host.
Comment 8 Elad 2017-06-10 06:52 EDT
Created attachment 1286649 [details]
dst-vdsm.log
Comment 9 Michal Skrivanek 2017-06-12 04:34:28 EDT
thanks
seems sanlock ran out of space for some reason
we also need sanlock.log from the destination host
Comment 10 Elad 2017-06-12 10:45 EDT
Created attachment 1287064 [details]
sanlock.log
Comment 11 Tomas Jelinek 2017-06-16 10:58:21 EDT
The VDSM log is quite flooded with:

2017-06-08 12:20:54,406+0300 ERROR (jsonrpc/5) [root] failed to retrieve Hosted Engine HA info (api:252)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/host/api.py", line 231, in _getHaInfo
    stats = instance.get_all_stats()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 105, in get_all_stats
    stats = broker.get_stats_from_storage(service)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 233, in get_stats_from_storage
    result = self._checked_communicate(request)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 261, in _checked_communicate
    .format(message or response))
RequestError: Request failed: failed to read metadata: [Errno 5] Input/output error

It looks a lot like a storage issue. Can you please check if the HE SD is accessible from the destination host? Also please check that the HE metadata is accessible (you can find the metadata_volume_UUID and metadata_image_UUID in /etc/ovirt-hosted-engine/hosted-engine.conf).
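A minimal sketch of pulling those keys out of the file, assuming the plain key=value format of hosted-engine.conf (the sample UUID values below are illustrative, not from a real host):

```python
# Hedged sketch: hosted-engine.conf is a plain key=value file; this parses it
# and picks out the metadata UUID keys named above. Checking that the paths
# derived from them are actually readable is left to the admin.
def parse_conf(text):
    conf = {}
    for raw in text.splitlines():
        line = raw.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            conf[key.strip()] = value.strip()
    return conf

sample = """\
# illustrative values, not from a real host
metadata_volume_UUID=00000000-0000-0000-0000-000000000001
metadata_image_UUID=00000000-0000-0000-0000-000000000002
"""
metadata_keys = {k: v for k, v in parse_conf(sample).items()
                 if k.startswith("metadata_")}
```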
Comment 12 Elad 2017-06-25 08:20 EDT
Created attachment 1291700 [details]
logs-25.6

Managed to reproduce with no IO errors.
Comment 13 Tomas Jelinek 2017-06-26 03:10:26 EDT
OK, and can you please check if the HE SD is accessible from both the source and the destination host? Also please check that the HE metadata is accessible (you can find the metadata_volume_UUID and metadata_image_UUID in /etc/ovirt-hosted-engine/hosted-engine.conf).
Comment 14 Elad 2017-06-26 04:02:02 EDT
Source:

[root@aqua-vds4 eebb2bc4-63d3-4510-8a5f-2c2d10db089b]# cat /etc/ovirt-hosted-engine/hosted-engine.conf |grep -i metadata_image_UUID
metadata_image_UUID=eebb2bc4-63d3-4510-8a5f-2c2d10db089b
[root@aqua-vds4 eebb2bc4-63d3-4510-8a5f-2c2d10db089b]# ll /rhev/data-center/mnt/blockSD/6c381bfc-3c4e-4421-8b56-9a091274a14d/images/eebb2bc4-63d3-4510-8a5f-2c2d10db089b
total 0
lrwxrwxrwx. 1 vdsm kvm 78 Jun 25 11:03 2e858842-e776-4ecb-a3f0-fbc95f60fd8e -> /dev/6c381bfc-3c4e-4421-8b56-9a091274a14d/2e858842-e776-4ecb-a3f0-fbc95f60fd8e



[root@aqua-vds4 dom_md]# less /rhev/data-center/mnt/blockSD/6c381bfc-3c4e-4421-8b56-9a091274a14d/dom_md/metadata |grep


Destination:

[root@rose12 ~]# ll /rhev/data-center/mnt/blockSD/6c381bfc-3c4e-4421-8b56-9a091274a14d/images/eebb2bc4-63d3-4510-8a5f-2c2d10db089b
total 4
lrwxrwxrwx. 1 vdsm kvm 78 May  8 15:56 2e858842-e776-4ecb-a3f0-fbc95f60fd8e -> /dev/6c381bfc-3c4e-4421-8b56-9a091274a14d/2e858842-e776-4ecb-a3f0-fbc95f60fd8e
[root@rose12 ~]#
Comment 15 Tomas Jelinek 2017-06-26 11:32:10 EDT
moving to storage for further investigation
Comment 16 Allon Mureinik 2017-06-29 03:22:37 EDT
We have a known issue about bogus "no space left" errors (IIRC, this is solved in RHEL 7.4 - Nir, keep me honest here?).
Looking at the sanlock logs seems to confirm this is indeed not a space issue:

2017-06-12 17:33:19+0300 86368 [4339]: e573f26c aio timeout RD 0x7f7ac40009b0:0x7f7ac40009c0:0x7f7ad07ec000 ioto 10 to_count 4
2017-06-12 17:33:19+0300 86368 [4339]: s3 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_dr_dr0/e573f26c-077a-4902-93df-863711a46647/dom_md/ids
2017-06-12 17:33:19+0300 86368 [4339]: s3 renewal error -202 delta_length 20 last_success 86276


Nir - can you take a look please?
Comment 17 Allon Mureinik 2017-07-02 16:38:26 EDT
4.1.4 is planned as a minimal, fast, z-stream version to fix any open issues we may have in supporting the upcoming EL 7.4.

Pushing out anything unrelated, although if there's a minimal/trivial, SAFE fix that's ready on time, we can consider introducing it in 4.1.4.
Comment 19 Nir Soffer 2017-08-04 14:00:33 EDT
(In reply to Allon Mureinik from comment #16)
> We have a known issue about bogus "no space left" errors (IIRC, this is
> solved in RHEL 7.4 - Nir, keep me honest here?).

"No space left on device" is not a bogus error; it is just a reuse of system errors
for sanlock purposes. The meaning of this error is that there is no lockspace
("no space"). David and I discussed improving these errors for 7.5, but this is
not easy, as libvirt depends on some of the reused system errors.


> Looking at the sanlock logs seems to confirm this is indeed not a space
> issue:
> 
> 2017-06-12 17:33:19+0300 86368 [4339]: e573f26c aio timeout RD
> 0x7f7ac40009b0:0x7f7ac40009c0:0x7f7ad07ec000 ioto 10 to_count 4
> 2017-06-12 17:33:19+0300 86368 [4339]: s3 delta_renew read timeout 10 sec
> offset 0
> /rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_dr_dr0/
> e573f26c-077a-4902-93df-863711a46647/dom_md/ids
> 2017-06-12 17:33:19+0300 86368 [4339]: s3 renewal error -202 delta_length 20
> last_success 86276

$ git grep '\-202'
src/sanlock_rv.h:#define SANLK_AIO_TIMEOUT      -202

This storage had timeouts; this is not a normal situation, and it is expected that
sanlock will fail to renew the lease if storage is not responsive enough.
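As a side note, the renewal error code in such log lines can be pulled out mechanically; a small sketch (illustrative only, not part of vdsm), with -202 being the only code confirmed above:

```python
import re

# Sketch: extract the sanlock renewal error code from a sanlock.log line.
# Per src/sanlock_rv.h, -202 is SANLK_AIO_TIMEOUT (an async I/O timeout).
RENEWAL_RE = re.compile(r"renewal error (-?\d+)")

def renewal_error(line):
    """Return the renewal error code in `line`, or None if absent."""
    m = RENEWAL_RE.search(line)
    return int(m.group(1)) if m else None

log_line = ("2017-06-12 17:33:19+0300 86368 [4339]: s3 renewal error -202 "
            "delta_length 20 last_success 86276")
```

renewal_error(log_line) yields -202, i.e. SANLK_AIO_TIMEOUT.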

This looks like the same issue we had with storage jobs started on a host that had
not yet joined the lockspace. In this case, vdsm reports {"acquired": False}
in repoStats, but engine is *ignoring* this.

For storage jobs, we solved the issue by enhancing our clusterlock APIs to wait
until a lockspace is joined when trying to acquire a lease. Here, libvirt is
taking the lease, and it does not have such a feature. See
https://gerrit.ovirt.org/#/q/topic:wait-for-host-id

The solution for this is to keep the domain "acquired" status on the engine side
and, when selecting a host for migration of an HA VM, filter out hosts that did
not acquire the lockspace on the required domains.

I have not looked at the logs yet. I expect to find "acquired": False in the
repoStats messages in the vdsm log on the destination host.
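The filtering described above could be sketched roughly as follows (illustrative Python; the real scheduling code lives in ovirt-engine, and these names are not its API):

```python
# Hedged sketch of the proposed fix: given the repoStats each candidate host
# reports, keep only hosts whose sanlock lockspace is acquired
# ("acquired": True) on every storage domain the VM's lease requires.
def hosts_with_acquired_lockspace(repo_stats_by_host, required_domains):
    eligible = []
    for host, stats in repo_stats_by_host.items():
        if all(stats.get(dom, {}).get("acquired", False)
               for dom in required_domains):
            eligible.append(host)
    return eligible
```

For example, with one host reporting {"acquired": False} on the HE domain, only the other host remains a migration candidate.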
Comment 22 Maor 2017-09-12 11:13:38 EDT
It seems to be very similar to https://bugzilla.redhat.com/1422508.
Nir, what do you think?
Comment 23 Nir Soffer 2017-09-12 14:39:16 EDT
This is the same error (ENOSPC) but not the same issue.

When we run storage jobs, we now wait until the host id is acquired when trying to
acquire a lease. Libvirt does not wait until a host id is acquired, so if engine
tries to start a VM with a lease (e.g. the hosted-engine VM) on a host that has not
acquired the host id, the operation is expected to fail.

The solution is filtering out hosts that did not acquire a host id when starting
VMs; see comment 19.
Comment 24 Michal Skrivanek 2017-09-12 17:08:45 EDT
So we keep it unsolved in 4.1.z; it might be OK assuming the migration retry mechanism takes care of the failure. But I'm not sure the retry mechanism works for the HE VM.
Elad, can you check the overall result of the migration? Retries should be visible in engine.log.
Comment 25 Allon Mureinik 2017-09-13 04:34:06 EDT
(In reply to Michal Skrivanek from comment #24)
> So we keep it unsolved in 4.1.z, imght be ok assuming that there is the
> migration retry mechanism taking care of the failure. But I'm not sure if
> the retry mechanism works for HE VM. 
> Elad, can you check the overall result of the migration, retries should be
> visible in engine.log
I agree.
So, if I understand your bottom line:
- In 4.2, for a regular VM, retries should solve this
- For SHE, we need to make sure the migration also has a retry feature?
Comment 26 Michal Skrivanek 2017-09-13 07:46:18 EDT
yes
Comment 27 Elad 2017-09-17 09:06 EDT
Created attachment 1326959 [details]
logs - 17.9.17

(In reply to Michal Skrivanek from comment #24)
> So we keep it unsolved in 4.1.z, imght be ok assuming that there is the
> migration retry mechanism taking care of the failure. But I'm not sure if
> the retry mechanism works for HE VM. 
> Elad, can you check the overall result of the migration, retries should be
> visible in engine.log

The host is stuck in 'preparing for maintenance' forever unless it is activated manually.

In engine.log the following is repeated:

2017-09-17 08:58:09,611-04 INFO  [org.ovirt.engine.core.bll.MaintenanceVdsCommand] (DefaultQuartzScheduler2) [2b754914] Running command: MaintenanceVdsCommand internal: true. Entities affected :  ID: fca7300c-760a-45aa-aa81-a8968fb8abef Type: VDS
2017-09-17 08:58:09,614-04 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SetHaMaintenanceModeVDSCommand] (DefaultQuartzScheduler2) [2b754914] START, SetHaMaintenanceModeVDSCommand(HostName = host_mixed_2, SetHaMaintenanceModeVDSCommandParameters:{runAsync='true', hostId='fca7300c-760a-45aa-aa81-a8968fb8abef'}), log id: 4378d73d
2017-09-17 08:58:10,619-04 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SetHaMaintenanceModeVDSCommand] (DefaultQuartzScheduler2) [2b754914] FINISH, SetHaMaintenanceModeVDSCommand, log id: 4378d73d
2017-09-17 08:58:10,656-04 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler2) [2b754914] EVENT_ID: VDS_MAINTENANCE(15), Correlation ID: 2b754914, Job ID: 77d7427a-d311-457c-a5f8-170839213f4a, Call Stack: null, Custom ID: null, Custom Event ID: -1, Message: Host host_mixed_2 was switched to Maintenance Mode.


Attaching logs (engine and vdsm logs from the src and dst hosts).
Comment 28 Allon Mureinik 2017-11-15 10:36:26 EST
(In reply to Allon Mureinik from comment #25)
> (In reply to Michal Skrivanek from comment #24)
> > So we keep it unsolved in 4.1.z, imght be ok assuming that there is the
> > migration retry mechanism taking care of the failure. But I'm not sure if
> > the retry mechanism works for HE VM. 
> > Elad, can you check the overall result of the migration, retries should be
> > visible in engine.log
> I agree.
> So, if I understand your bottom line:
> - In 4.2, for a regular VM, retries should solve this
> - For SHE, need to make sure the migration also has a retry feature?
If this is the gap, it should be owned by SLA.
Comment 29 Martin Sivák 2017-11-16 05:44:50 EST
And we are intentionally not fixing this gap. A failing HE migration requires a manual check by design. That is also the reason we never migrate the VM except in the maintenance case.

We can reevaluate this decision, but I do not see us doing this for 4.2.

This can also be considered a libvirt bug btw, retry is just a workaround.
Comment 30 Nir Soffer 2017-11-16 06:10:09 EST
(In reply to Martin Sivák from comment #29)
> This can also be considered a libvirt bug btw, retry is just a workaround.

I don't think this can be considered a libvirt bug. Libvirt assumes that the host
is already part of the sanlock cluster; it does not control the sanlock
configuration of the cluster.

Hosts publish the status of the host id in repoStats: {"acquired": False|True, ...}.
This status is ignored by engine, so engine may try to start a storage job on, or
migrate a VM to, such a host.

In storage we handle this case now by waiting until the host id is acquired on
a host before starting the storage job. This is easy since the same code that
takes a storage lease also manages acquiring the host id.

In other flows, the host's acquired status should be considered when selecting a
target host for migration or for starting a VM with a lease.

Note also that a host may move from {"acquired": True} to {"acquired": False} after
a storage issue, when sanlock loses access to storage. Vdsm monitors the status and
will try to acquire the host id again, which takes at least 20 seconds. During
the "re-acquire" window, operations on the host which require a lease will fail
with "no space left" (meaning no lockspace).
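The storage-jobs behaviour described above ("waiting until the host id is acquired") can be sketched like this (illustrative only; not vdsm's actual clusterlock code):

```python
import time

# Hedged sketch: poll an "is the host id acquired?" predicate (e.g. derived
# from repoStats) until it turns True or a timeout expires. The clock/sleep
# parameters exist only to make the sketch testable.
def wait_for_host_id(is_acquired, timeout=60.0, interval=2.0,
                     clock=time.monotonic, sleep=time.sleep):
    deadline = clock() + timeout
    while True:
        if is_acquired():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

Only after this returns True would the lease acquisition proceed; libvirt performs no such wait, which is why starting or migrating a lease-holding VM to a not-yet-acquired host fails immediately.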
Comment 31 Martin Sivák 2017-11-21 04:50:03 EST
Hosted engine does not report the host as ready until the lock is acquired. But I can't promise there won't be a race there. Fully propagated and reliable status is hard in this case and might not be worth the effort.
Comment 32 Martin Sivák 2017-11-22 07:29:34 EST
Elad? How often does this reproduce? Is it possible it was a "random" fluke only?
Comment 33 Elad 2017-12-06 04:25:39 EST
It reproduces every time with hosted engine VM while there is latency of more than 7 ms between src and dst hosts.
Comment 34 Martin Sivák 2018-01-10 07:17:11 EST
I do not see us fixing this for the following reasons:

- HE will not retry migrations by design
- it requires a rather high latency between hosts that are supposed to be on the same L2 network
- it only happens during the maintenance flow, meaning a sysadmin is nearby
- the host can be activated again, cancelling the 'preparing for maintenance' state
- engine initiated migration will work just fine

So I consider this to be a corner case. Maybe better reporting might be needed to alert the sysadmin, but we do not want to change the automated flow for safety reasons.

The severity is not high from my point of view for all those reasons, and also because a simple workaround exists: migrate the VM manually before putting the host into maintenance.
Comment 35 Yaniv Lavi 2018-01-17 08:17:42 EST
When the host running the HE VM is moved to maintenance via the API/UI, we need to migrate the HE VM first and only then move the host to maintenance.
Comment 36 Yanir Quinn 2018-03-29 04:01:18 EDT
I'm trying to reproduce this at small scale before testing a potential fix, but am unable to do so.
I have 2 hosts with a latency of 200ms between them.
I put the host running the HE VM in maintenance via the UI; the HE VM migrated successfully every time.

Version-Release number of selected component (if applicable):

ovirt-host-4.3.0-0.0.master.20180313084422.git907085c.el7.centos.x86_64
ovirt-hosted-engine-ha-2.3.0-0.0.master.20180327115926.20180327115923.gita82c0c6.el7.centos.noarch
ovirt-host-deploy-1.8.0-0.0.master.20180313163958.git3b2935f.el7.centos.noarch
ovirt-host-dependencies-4.3.0-0.0.master.20180313084422.git907085c.el7.centos.x86_64
vdsm-4.30.0-184.gite147ade.el7.centos.x86_64
libvirt-daemon-3.2.0-14.el7_4.9.x86_64
qemu-kvm-ev-2.9.0-16.el7_4.14.1.x86_64
sanlock-3.5.0-1.el7.x86_64
selinux-policy-3.13.1-166.el7_4.9.noarch


On your side, does the issue reproduce when running only 2 hosts?
Comment 37 Elad 2018-04-08 05:11:35 EDT
Hi, as I don't have such an environment at the moment, I can't tell.
But I think that having 2 or 4 hosts in the cluster shouldn't affect the results of the migration.

How did you simulate latency?
Comment 38 Yanir Quinn 2018-04-08 05:50:34 EDT
I used netem on one host, then tried on both (shouldn't really matter).
for 100ms latency for example : 
# tc qdisc add dev eth0 root netem delay 100ms

(ping shows there is a delay)
Comment 39 Yanir Quinn 2018-04-16 08:10:20 EDT
Tried also on the ovirtmgmt NIC; still couldn't reproduce.

In https://bugzilla.redhat.com/show_bug.cgi?id=1458745 I can see that a similar issue was reproduced (or an attempt was made to reproduce it) using the host's NIC (as I did previously).

Would it be possible to set up and run the test environment again and reproduce the issue?

If the issue still reproduces, would it be possible to set up and run a similar environment with a possible fix from this package:
http://jenkins.ovirt.org/job/ovirt-engine_master_check-patch-el7-x86_64/38863/
Comment 40 Elad 2018-04-16 09:59:31 EDT
Meital, please handle this, thanks.
Comment 41 Yanir Quinn 2018-04-16 12:02:15 EDT
I have performed the same steps on an environment with 3 physical hosts and an HE VM; here are the results:

- applying a 100ms delay on each host (200ms in total):
  The issue reproduced; the src host was stuck in 'preparing for maintenance'.

- applying a 7ms or 10ms delay on each host (14ms or 20ms in total):
  Successful migration of the HE VM, no issues.

- applying a 55ms delay on each host (110ms in total):
  Successful migration of the HE VM, no issues.

The delay was applied using tc, e.g.: tc qdisc change dev p1p1 root netem delay 7ms.

These results indicate that the issue can occur when the latency is extremely high.
Comment 43 Yaniv Lavi 2018-04-22 09:49:32 EDT
Please test with 150ms latency and open a bug to update to 110ms for now.
