Bug 1236161 - VDSM response to path preparation during live VM migration was very slow
Summary: VDSM response to path preparation during live VM migration was very slow
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.5.1
Hardware: Unspecified
OS: Linux
Importance: medium / high
Target Milestone: ovirt-4.1.0-alpha
Target Release: ---
Assignee: Nir Soffer
QA Contact: mlehrer
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2015-06-26 17:51 UTC by Gordon Watson
Modified: 2019-08-15 04:48 UTC
CC List: 21 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-02-06 11:38:57 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:
ykaul: needinfo+


Attachments
Repoplot of vdsm log encompassing the failed migration (359.20 KB, application/pdf)
2016-03-09 00:15 UTC, Nir Soffer
Repoplot of vdsm log after restart (2.18 MB, text/plain)
2016-03-09 00:16 UTC, Nir Soffer
Repoplot of vdsm log after restart (63.81 KB, application/pdf)
2016-03-09 00:17 UTC, Nir Soffer


Links
System ID | Private | Priority | Status | Summary | Last Updated
Red Hat Bugzilla 1374545 | 0 | urgent | CLOSED | Guest LVs created in ovirt raw volumes are auto activated on the hypervisor in RHEL 7 | 2021-06-10 11:42:20 UTC
Red Hat Product Errata RHEA-2017:0998 | 0 | normal | SHIPPED_LIVE | VDSM bug fix and enhancement update 4.1 GA | 2017-04-18 20:11:39 UTC

Internal Links: 1374545

Description Gordon Watson 2015-06-26 17:51:50 UTC
Description of problem:

A VM was migrated to a specific host and the migration failed because only three of the five Logical Volumes could be activated during the path preparation phase on the destination host before the operation timed out (based on the default 'migration_listener_timeout').
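For reference, the phase that timed out here is bounded by vdsm's 'migration_listener_timeout' option. A minimal sketch of raising it as a temporary workaround (the [vars] section placement and the value shown are assumptions, not settings verified for this case):

# /etc/vdsm/vdsm.conf (illustrative only)
[vars]
# give the destination host more time to prepare all volume paths
migration_listener_timeout = 180

# vdsmd must be restarted for the change to take effect
$ systemctl restart vdsmd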


Version-Release number of selected component (if applicable):

RHEV 3.5.1
RHEV-H 7.1 '20150420.0' w/vdsm-4.16.13.1-1.el7ev


How reproducible:

Not reproducible.


Steps to Reproduce:
1.
2.
3.

Actual results:

"Timeout while waiting for path preparation".


Expected results:

It should be possible to activate five Logical Volumes within 78 seconds (the timeout in this specific case).


Additional info:

Comment 13 Nir Soffer 2015-09-09 16:30:16 UTC
May be related to bug 1247075.

I suggest testing with https://gerrit.ovirt.org/45738 once the patch is ready.

Comment 16 Nir Soffer 2015-10-07 12:51:22 UTC
(In reply to Gordon Watson from comment #0)
> A VM was migrated to a specific host and it failed as only three of the five
> Logical Volumes were able to be activated during the path preparation phase
> on the destination host before it timed out (based on the default
> 'migration_listener_timeout').

Gordon, can we have information about the hardware and about the load on the machine in the same timeframe as the logs you mention?

Hardware:
- Number of cores
- Memory

Software:
- OS (EL7.x? EL6.x? both?)

Load:
- cpu load
- vdsm cpu usage
- overall cpu usage
- io usage
- memory usage
- swap usage
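One way to capture these figures over the relevant timeframe (the tool choice and the vdsm process match pattern are assumptions, not what was collected in this case):

$ lscpu | grep '^CPU(s):'                      # number of cores
$ free -m                                      # memory and swap usage
$ sar -u -b 1 60                               # overall cpu and io usage, 1-second samples for 60s
$ pidstat -u -r -p "$(pgrep -of vdsmd)" 1 60   # vdsm cpu and memory usage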

Comment 20 Nir Soffer 2015-11-20 13:12:12 UTC
Investigating the logs; I will update when I have a better understanding of this issue.

Comment 21 Nir Soffer 2016-03-09 00:15:02 UTC
Created attachment 1134349 [details]
Repoplot of vdsm log encompassing the failed migration

Comment 22 Nir Soffer 2016-03-09 00:16:22 UTC
Created attachment 1134350 [details]
Repoplot of vdsm log after restart

Comment 23 Nir Soffer 2016-03-09 00:17:14 UTC
Created attachment 1134355 [details]
Repoplot of vdsm log after restart

Comment 24 Nir Soffer 2016-03-09 00:52:37 UTC
Comparing the log during migration (attachment 1134349 [details]) and after vdsm restart
(attachment 1134355 [details]), we can see:

                  Before                 After
-----------------------------------------------------------
lastCheck         med (up to 30s)        low
read delay        very low               very low
lvm commands      slow (up to 6s)        fast (up to 0.4s)
monitor commands  slow (up to 8s)        fast (up to 0.04s)

According to comment 19, the hypervisor was not overloaded.
We don't have data on vdsm cpu usage or memory usage.

We can see a lot of "unfetched domain" errors:

Thread-14::ERROR::2015-06-18 02:57:37,259::sdc::137::Storage.StorageDomainCache::(_findDomain) looking for unfetched domain 37f6b4bf-85e5-4098-899c-a83be5fe1667
Thread-14::ERROR::2015-06-18 02:57:37,259::sdc::154::Storage.StorageDomainCache::(_findUnfetchedDomain) looking for domain 37f6b4bf-85e5-4098-899c-a83be5fe1667

$ grep ERROR vdsm.log.2 | grep -i unfetched | wc -l
1034

Each time we see this, vdsm performs a vgs scan of all devices to find the domain.
This generates a lot of useless lvm commands, slowing down other operations waiting
on the lvm operation mutex. This is fixed in 3.6:
https://gerrit.ovirt.org/#/q/project:vdsm+branch:master+topic:fc-connect-storage-server
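A rough way to gauge how many extra full-device scans these cache misses triggered in the same log (the grep patterns assume the lvm command lines appear verbatim in vdsm.log, which may differ by version):

$ grep -c 'lvm vgs' vdsm.log.2        # vgs scans, including those triggered by cache misses
$ grep -c 'lvm lvchange' vdsm.log.2   # activations actually needed for path preparation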

This may be related to overloading vdsm on many cores (bug 1247075), or to a
vdsm memory leak (bug 1299491), or both. We should check whether 3.5.8 or 3.6
resolves this issue. Note that on 3.5 cpu_affinity must be configured manually.

We still need to investigate the _startUnderlyingVm error to understand the cause
of the delays.

Comment 25 Yaniv Lavi 2016-05-09 11:02:33 UTC
oVirt 4.0 Alpha has been released, moving to oVirt 4.0 Beta target.

Comment 31 Nir Soffer 2016-07-24 16:17:45 UTC
This issue should be much improved in 4.0, since lvm commands are no longer serialized,
so unrelated lvm commands should not delay image preparation.

Comment 32 Yaniv Lavi 2016-08-18 08:54:56 UTC
Should we move this to QA testing on 4.0.x?

Comment 33 Yaniv Kaul 2016-12-01 14:17:04 UTC
(In reply to Yaniv Dary from comment #32)
> Should we move this to QA testing on 4.0.x?

Yes.

Comment 34 Raz Tamir 2016-12-01 14:28:10 UTC
Can we get clear steps to reproduce and any additional info about special hardware needed for testing this?

Comment 40 Nir Soffer 2017-01-04 08:50:11 UTC
We don't know the root cause and could never reproduce it, so we cannot
document anything.

Comment 41 Yaniv Lavi 2017-01-04 09:37:09 UTC
(In reply to Nir Soffer from comment #40)
> We don't know the root cause and could never reproduce it, so we cannot
> document anything.

Please nack the doc text rather than setting the doc type.

Comment 43 eberman 2017-02-01 10:52:04 UTC
Nir, what are the reproduction details?

Comment 44 Nir Soffer 2017-02-01 12:31:46 UTC
(In reply to eberman from comment #43)
> Nir, what are the reproduction details?

See comment 40.

Comment 45 eberman 2017-02-06 11:38:57 UTC
Closing; reproduction steps are needed.

