Bug 1236161

Summary: VDSM response to path preparation during live VM migration was very slow
Product: Red Hat Enterprise Virtualization Manager Reporter: Gordon Watson <gwatson>
Component: vdsmAssignee: Nir Soffer <nsoffer>
Status: CLOSED INSUFFICIENT_DATA QA Contact: mlehrer
Severity: high Docs Contact:
Priority: medium    
Version: 3.5.1CC: ahoness, amureini, bazulay, eberman, fromani, gklein, gwatson, lpeer, lsurette, mkalinin, nsoffer, oourfali, pzhukov, ratamir, rick.beldin, sherold, srevivo, tnisan, ycui, ykaul, ylavi
Target Milestone: ovirt-4.1.0-alphaKeywords: Performance
Target Release: ---Flags: ykaul: needinfo+
Hardware: Unspecified   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
undefined
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-02-06 11:38:57 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
Repoplot of vdsm log encompassing the failed migration
none
Repoplot of vdsm log after restart
none
Repoplot of vdsm log after restart none

Description Gordon Watson 2015-06-26 17:51:50 UTC
Description of problem:

A VM was migrated to a specific host and it failed as only three of the five Logical Volumes were able to be activated during the path preparation phase on the destination host before it timed out (based on the default 'migration_listener_timeout').


Version-Release number of selected component (if applicable):

RHEV 3.5.1
RHEV-H 7.1 '20150420.0' w/vdsm-4.16.13.1-1.el7ev


How reproducible:

Not.


Steps to Reproduce:
1.
2.
3.

Actual results:

"Timeout while waiting for path preparation".


Expected results:

Five Logical Volumes should be able to be activated within 78 seconds (in this specific case).


Additional info:

Comment 13 Nir Soffer 2015-09-09 16:30:16 UTC
May be related to bug 1247075.

I suggest to test with https://gerrit.ovirt.org/45738 when the patch is ready.

Comment 16 Nir Soffer 2015-10-07 12:51:22 UTC
(In reply to Gordon Watson from comment #0)
> A VM was migrated to a specific host and it failed as only three of the five
> Logical Volumes were able to be activated during the path preparation phase
> on the destination host before it timed out (based on the default
> 'migration_listener_timeout').

Gordon, can we have information about the hardware and about the load the machine in the same timeframe of the logs you mention?

Hardware:
- Number of cores
- Memory

Software:
- OS (EL7.x? EL6.x? both?)

Load:
- cpu load
- vdsm cpu usage
- overall cpu usage
- io usage
- memory usage
- swap usage

Comment 20 Nir Soffer 2015-11-20 13:12:12 UTC
Investigating logs, will update when I have better understanding of this issue.

Comment 21 Nir Soffer 2016-03-09 00:15:02 UTC
Created attachment 1134349 [details]
Repoplot of vdsm log encompassing the failed migration

Comment 22 Nir Soffer 2016-03-09 00:16:22 UTC
Created attachment 1134350 [details]
Repoplot of vdsm log after restart

Comment 23 Nir Soffer 2016-03-09 00:17:14 UTC
Created attachment 1134355 [details]
Repoplot of vdsm log after restart

Comment 24 Nir Soffer 2016-03-09 00:52:37 UTC
Comparing the log during migration (attachment 1134349 [details]) and after vdsm restart
(attachment 1134355 [details]), we can see:

                  Before                 After
-----------------------------------------------------------
lastCheck         med (up to 30s)        low
read delay        very low               very low
lvm commands      slow (up to 6s)        fast (up to 0.4s
monitor commands  slow (up to 8s)        fast (up to 0.04s)

According to comment 19, we don't have overloaded hypervisor.
We don't know about vdsm cpu usage, or memory usage.

We can see lot of "unfetched domain" errors:

Thread-14::ERROR::2015-06-18 02:57:37,259::sdc::137::Storage.StorageDomainCache::(_findDomain) looking for unfetched domain 37f6b4bf-85e5-4098-899c-a83be5fe1667
Thread-14::ERROR::2015-06-18 02:57:37,259::sdc::154::Storage.StorageDomainCache::(_findUnfetchedDomain) looking for domain 37f6b4bf-85e5-4098-899c-a83be5fe1667

$ grep ERROR vdsm.log.2 | grep -i unfetched | wc -l
1034

Each time we see this, vdsm perform a vgs with all devices to find the domain.
This generate lot of useless lvm commands, slowing down other operations waiting
on lvm operation mutex. This is fixed in 3.6:
https://gerrit.ovirt.org/#/q/project:vdsm+branch:master+topic:fc-connect-storage-server

This may be related to overloading vdsm on many cores (bug 1247075), or to
vdsm memory leak (bug 1299491), or both. We should check if 3.5.8 or 3.6
resolve this issue. Note that on 3.5 cpu_affinity must be configured manually.

Need to investigate the _startUnderlyingVm error to understand what is the cause
of the delays.

Comment 25 Yaniv Lavi 2016-05-09 11:02:33 UTC
oVirt 4.0 Alpha has been released, moving to oVirt 4.0 Beta target.

Comment 31 Nir Soffer 2016-07-24 16:17:45 UTC
This issue should be much improved on 4.0, since lvm commands are not serialized
any more, so unrelated lvm commands should not delay image preparation.

Comment 32 Yaniv Lavi 2016-08-18 08:54:56 UTC
Should we move this to QA testing on 4.0.x?

Comment 33 Yaniv Kaul 2016-12-01 14:17:04 UTC
(In reply to Yaniv Dary from comment #32)
> Should we move this to QA testing on 4.0.x?

Yes.

Comment 34 Raz Tamir 2016-12-01 14:28:10 UTC
Can we get a clear steps to reproduce and any additional info about special hardware that needed for testing this?

Comment 40 Nir Soffer 2017-01-04 08:50:11 UTC
We don't know what is the root cause, and could never reproduced it, so we cannot
document anything.

Comment 41 Yaniv Lavi 2017-01-04 09:37:09 UTC
(In reply to Nir Soffer from comment #40)
> We don't know what is the root cause, and could never reproduced it, so we
> cannot
> document anything.

Do please nack the doc text, don't set the type.

Comment 43 eberman 2017-02-01 10:52:04 UTC
Nir- what are the reproduction detail's?

Comment 44 Nir Soffer 2017-02-01 12:31:46 UTC
(In reply to eberman from comment #43)
> Nir- what are the reproduction detail's?

See comment 40.

Comment 45 eberman 2017-02-06 11:38:57 UTC
closing, need reproduction steps