Description of problem:

A VM was migrated to a specific host and the migration failed: only three of the five Logical Volumes could be activated during the path preparation phase on the destination host before it timed out (based on the default 'migration_listener_timeout').

Version-Release number of selected component (if applicable):
RHEV 3.5.1
RHEV-H 7.1 '20150420.0' w/vdsm-4.16.13.1-1.el7ev

How reproducible:
Not reproducible.

Steps to Reproduce:
1.
2.
3.

Actual results:
"Timeout while waiting for path preparation".

Expected results:
All five Logical Volumes should be activated within 78 seconds (in this specific case).

Additional info:
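For reference, if a longer timeout is wanted as a workaround, something like the following should work. This is only a sketch: it assumes 'migration_listener_timeout' is read from the [vars] section of /etc/vdsm/vdsm.conf (check the config module of your vdsm build), and the 180s value is illustrative, not a recommendation.

$ cat >> /etc/vdsm/vdsm.conf <<'EOF'
[vars]
# Destination-side timeout (seconds) for migration path preparation;
# 180 is an illustrative value only.
migration_listener_timeout = 180
EOF
$ systemctl restart vdsmd    # RHEV-H 7.1; 'service vdsmd restart' on EL6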
May be related to bug 1247075. I suggest testing with https://gerrit.ovirt.org/45738 when the patch is ready.
(In reply to Gordon Watson from comment #0)
> A VM was migrated to a specific host and it failed as only three of the five
> Logical Volumes were able to be activated during the path preparation phase
> on the destination host before it timed out (based on the default
> 'migration_listener_timeout').

Gordon, can we have information about the hardware and about the load on the machine in the same timeframe as the logs you mention? (A collection sketch follows below.)

Hardware:
- Number of cores
- Memory

Software:
- OS (EL7.x? EL6.x? both?)

Load:
- cpu load
- vdsm cpu usage
- overall cpu usage
- io usage
- memory usage
- swap usage
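Something along these lines on the destination host would cover most of it (a sketch using standard EL tools; the sar lines assume the sysstat package is installed and collecting):

$ grep -c ^processor /proc/cpuinfo    # number of cores
$ free -m                             # memory and swap usage
$ cat /etc/redhat-release             # OS level (EL6.x / EL7.x)
$ ps aux | grep [v]dsm                # vdsm cpu/memory usage snapshot
$ sar -u -f /var/log/sa/saNN          # overall cpu load for day NN
$ sar -b -f /var/log/sa/saNN          # io usage for day NN
$ sar -S -f /var/log/sa/saNN          # swap usage for day NN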
Investigating logs, will update when I have a better understanding of this issue.
Created attachment 1134349 [details] Repoplot of vdsm log encompassing the failed migration
Created attachment 1134350 [details] Repoplot of vdsm log after restart
Created attachment 1134355 [details] Repoplot of vdsm log after restart
Comparing the log during migration (attachment 1134349 [details]) and after vdsm restart (attachment 1134355 [details]), we can see:

                    Before            After
-----------------------------------------------------------
lastCheck           med (up to 30s)   low
read delay          very low          very low
lvm commands        slow (up to 6s)   fast (up to 0.4s)
monitor commands    slow (up to 8s)   fast (up to 0.04s)

According to comment 19, we don't have an overloaded hypervisor. We don't know about vdsm cpu usage or memory usage.

We can see a lot of "unfetched domain" errors:

Thread-14::ERROR::2015-06-18 02:57:37,259::sdc::137::Storage.StorageDomainCache::(_findDomain) looking for unfetched domain 37f6b4bf-85e5-4098-899c-a83be5fe1667
Thread-14::ERROR::2015-06-18 02:57:37,259::sdc::154::Storage.StorageDomainCache::(_findUnfetchedDomain) looking for domain 37f6b4bf-85e5-4098-899c-a83be5fe1667

$ grep ERROR vdsm.log.2 | grep -i unfetched | wc -l
1034

Each time we see this, vdsm performs a vgs with all devices to find the domain. This generates a lot of useless lvm commands, slowing down other operations waiting on the lvm operation mutex.

This is fixed in 3.6:
https://gerrit.ovirt.org/#/q/project:vdsm+branch:master+topic:fc-connect-storage-server

This may be related to overloading vdsm on many cores (bug 1247075), or to the vdsm memory leak (bug 1299491), or both. We should check whether 3.5.8 or 3.6 resolves this issue. Note that on 3.5, cpu_affinity must be configured manually (a sketch follows below).

We still need to investigate the _startUnderlyingVm error to understand the cause of the delays.
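About configuring cpu_affinity manually on 3.5: a minimal sketch, assuming a systemd-managed vdsmd as on RHEV-H 7.1. The core number (1) is illustrative, and the setting does not survive a vdsm restart; 3.6 applies this at startup via the cpu_affinity config option.

$ VDSM_PID=$(systemctl show -p MainPID vdsmd | cut -d= -f2)
$ taskset -pc 1 "$VDSM_PID"    # pin vdsm to core 1 (illustrative)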
oVirt 4.0 Alpha has been released, moving to oVirt 4.0 Beta target.
This issue should be much improved on 4.0, since lvm commands are not serialized any more, so unrelated lvm commands should not delay image preparation.
Should we move this to QA testing on 4.0.x?
(In reply to Yaniv Dary from comment #32) > Should we move this to QA testing on 4.0.x? Yes.
Can we get clear steps to reproduce, and any additional info about special hardware needed for testing this?
We don't know what the root cause is, and we could never reproduce it, so we cannot document anything.
(In reply to Nir Soffer from comment #40)
> We don't know what the root cause is, and we could never reproduce it, so
> we cannot document anything.

Please do nack the doc text then, rather than setting the type.
Nir - what are the reproduction details?
(In reply to eberman from comment #43)
> Nir - what are the reproduction details?

See comment 40.
Closing; we need reproduction steps.