Description of problem:
When many block (LVM) storage domains are used, the monitoring threads may block other tasks for a long time, which can result in timeouts during storage domain activation
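To illustrate why activation latency grows with the domain count, here is a toy cost model (not vdsm code; the per-call duration is an assumption for illustration): with one shared lock, an activation request queues behind every monitor operation already holding or waiting for it.

```python
# Toy model (assumption, not vdsm code): with a single shared lock,
# activation waits for every monitor operation queued ahead of it,
# so its latency grows linearly with the number of domains.

def activation_wait(queued_monitors, seconds_per_lvm_call):
    # each queued monitor holds the lock for one lvm call before
    # activation gets its turn
    return queued_monitors * seconds_per_lvm_call

# 50 domains, each monitor cycle issuing a 2-second vgs call:
print(activation_wait(50, 2.0))  # 100.0 seconds, well past typical timeouts
```

The numbers are made up, but the linear relationship is the point: doubling the domain count doubles the worst-case wait for any task contending on the same lock.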
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. create 50 LVM storage domains
2. attach one more storage domain
Actual results:
operation may fail due to a timeout

Expected results:
storage domain is attached successfully
Can you attach these logs, describing the same timeframe?
Currently targeting to 3.6.0, as we're quite late in 3.5, and this probably won't be an easy fix.
If we see it's simple enough, we can consider backporting to 3.5.1.
(In reply to Evgheni Dereveanchin from comment #0)
> Steps to Reproduce:
> 1. create 50 LVM storage domains
> 2. attach one more storage domain
This BZ was reported against 3.4.0.
3.5.0 includes a whole lot of performance enhancements, and we've clocked 80 domains in a single DC.
Moving to ON_QA to verify.
Created attachment 1132571 [details]
Repoplot of vdsm.log
Looking at attachment 1132571 [details], we can see:
1. High lastCheck values for a long time between 09:35-09:58
2. Very low read delay values
3. Very slow lvm commands (up to 30+ seconds)
4. Very slow mailbox dd commands (up to 38 seconds)
5. Very slow storage domain monitoring dd commands (up to 14 seconds).
This can be explained by an overloaded hypervisor, or by an overloaded
vdsm running on many cores.
This issue may be fixed in 3.6, which pins vdsm to 1 cpu by default.
In the vdsm log, we see 54 SCSI rescan operations in this timeframe. Each time we
do a rescan, we drop the entire lvm cache, which causes a storm of lvm reloads.
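A rough cost model (the per-rescan numbers come from the log above; the comparison case is an assumption about a targeted invalidation, not existing code) shows why full-cache invalidation is so expensive:

```python
# Cost model: dropping the whole lvm cache on every SCSI rescan forces
# a vgs reload per storage domain.

def full_invalidation_reloads(rescans, domains):
    # every rescan re-reads every vg
    return rescans * domains

def targeted_invalidation_reloads(rescans, changed_per_rescan=1):
    # hypothetical alternative: reload only vgs that actually changed
    return rescans * changed_per_rescan

print(full_invalidation_reloads(54, 50))   # 2700 vgs calls in the timeframe
print(targeted_invalidation_reloads(54))   # 54 calls
```

With 54 rescans against 50 domains, the full invalidation generates on the order of 2700 vgs reloads where a targeted one would need about 54.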
We also see 564 refresh domain operations, and 542 errors related to
Thread-44::DEBUG::2014-10-23 09:01:33,289::domainMonitor::192::Storage.DomainMonitorThread::(_monitorDomain) Refreshing domain 8ea8a6d5-b5e6-4120-bed0-539dd45a2457
Thread-44::ERROR::2014-10-23 09:01:33,675::sdc::137::Storage.StorageDomainCache::(_findDomain) looking for unfetched domain 8ea8a6d5-b5e6-4120-bed0-539dd45a2457
Thread-44::ERROR::2014-10-23 09:01:33,840::sdc::154::Storage.StorageDomainCache::(_findUnfetchedDomain) looking for domain 8ea8a6d5-b5e6-4120-bed0-539dd45a2457
When we have unfetched domains during a refresh operation, we perform an
additional vgs operation to look up each domain.
This issue was fixed in 3.6 by adding connectStorageServer to FC storage.
Domains are prefetched at this point, avoiding the "looking for unfetched
domain" errors and the additional lvm commands.
Also we can see calls to multipath -r:
Thread-47::DEBUG::2014-10-23 09:56:54,681::multipath::110::Storage.Misc.excCmd::(rescan) '/usr/bin/sudo -n /sbin/multipath -r' (cwd None)
Thread-47::DEBUG::2014-10-23 09:57:30,991::multipath::110::Storage.Misc.excCmd::(rescan) SUCCESS: <err> = ''; <rc> = 0
This call reloads the multipath maps (multipath -r), which may affect
lvm commands. We have not called "multipath -r" since 3.4.
Since some of the issues are relevant only to FC storage, this must be verified on FC.
Some issues are similar to bug 1081962. This is a variant of that bug, or a duplicate.
Returning to ON_QA until we get the logs from the scsi tests, and the results on FC.
Created attachment 1134333 [details]
Vdsm logs for FC test
Created attachment 1134334 [details]
Repoplot of vdsm.log for FC test
Looking at the repoplot (attachment 1134334 [details]) we can see:
1. High lastCheck values (up to 75 seconds). This is expected with current
code when using 50 storage domains, tracked in bug 1081962.
2. Very low read delay
3. Lots of lvm commands; most take less than one second, some up to 4.5 seconds
Activating a domain takes less than 60 seconds, so it did not time out.
A timeout is possible if the hypervisor is overloaded, or if storage operations
are being performed, slowing down lvm commands. This was not tested here, and
will be tested for bug 1081962.
Looking in the vdsm log (attachment 1134333 [details]), at the activation at 2016-03-08 00:07:41:
1. Storage domain activation took 52 seconds
2. Most of the time was spent waiting for the lvm operation mutex (29.4 seconds).
Improved in https://gerrit.ovirt.org/53540
3. We invalidate the entire lvm cache 4(!!!!) times - this causes an lvm reload
in all storage domains - at least 50 calls to vgs each time we invalidate
the cache. We should invalidate only the vg being activated, to reload
the vg metadata before activating it.
4. We reload vg metadata 4(!) times (6.8 seconds)
We should reload the vg once.
5. We refresh multipath 2 times (8.2 seconds)
We should refresh multipath once.
6. We scan FC devices 2 times (1.8 seconds)
We should scan FC devices once
7. We scan iSCSI devices 2 times (0.8 seconds)
We should not scan iSCSI devices as the activated domain is FC
With fast storage and little io, we cannot reproduce this issue, but
the system behavior is poor.
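The repeated steps in items 3-7 above could in principle be collapsed by running each expensive operation at most once per activation. A sketch of that idea (an assumed design for illustration, not the actual fix or vdsm code):

```python
# Sketch (assumed design, not vdsm code): memoize expensive steps per
# activation so each rescan/refresh/invalidate runs at most once, even
# if several code paths request it.

class ActivationFlow:
    def __init__(self):
        self.calls = []
        self._done = set()

    def once(self, step):
        # run each expensive step at most once per activation
        if step not in self._done:
            self._done.add(step)
            self.calls.append(step)

    def activate(self):
        # the log above shows each of these happening 2-4 times today;
        # duplicates here model the redundant requests
        for step in ("invalidate_cache", "invalidate_cache",
                     "reload_vg", "reload_vg",
                     "refresh_multipath", "refresh_multipath",
                     "scan_fc", "scan_fc"):
            self.once(step)
        return self.calls

flow = ActivationFlow()
print(flow.activate())
# ['invalidate_cache', 'reload_vg', 'refresh_multipath', 'scan_fc']
```

Each redundant rescan or cache invalidation eliminated removes tens of seconds from the 52-second activation measured above.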
Created attachment 1134337 [details]
executor 6 log from FC test
Created attachment 1134338 [details]
activation log at 2016-03-08 00:07:41
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.