Bug 1158825
Summary: [scale] storage domain activation may time out if a large number of block domains is in use

Product: Red Hat Enterprise Virtualization Manager
Component: vdsm
Version: 3.4.0
Status: CLOSED ERRATA
Severity: high
Priority: medium
Reporter: Evgheni Dereveanchin <ederevea>
Assignee: Nir Soffer <nsoffer>
QA Contact: Eldad Marciano <emarcian>
CC: acanan, aefrat, amureini, bazulay, ederevea, emarcian, iheim, lpeer, lsurette, mkalinin, mtessun, nsoffer, scohen, yeylon, ykaul
Target Milestone: ovirt-3.6.0-rc
Target Release: 3.6.0
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Type: Bug
oVirt Team: Scale
Last Closed: 2016-03-09 19:27:09 UTC
Description
Evgheni Dereveanchin
2014-10-30 10:17:11 UTC
Can you attach these logs, describing the same timeframe?
- /var/log/messages
- /var/log/sanlock.log

Currently targeting 3.6.0, as we're quite late in 3.5, and this probably won't be an easy fix. If we see it's simple enough, we can consider backporting to 3.5.1.

(In reply to Evgheni Dereveanchin from comment #0)
> Steps to Reproduce:
> 1. create 50 LVM storage domains
> 2. attach one more storage domain

This BZ was reported against 3.4.0. 3.5.0 includes a whole lot of performance enhancements, and we've clocked 80 domains in a single DC. Moving to ON_QA to verify.

Created attachment 1132571 [details]
Repoplot of vdsm.log
Looking at attachment 1132571 [details], we can see:
1. High lastCheck values for a long time between 09:35-09:58
2. Very low read delay values
3. Very slow lvm commands (up to 30+ seconds)
4. Very slow mailbox dd commands (up to 38 seconds)
5. Very slow storage domain monitoring dd commands (up to 14 seconds)

This can be explained by an overloaded hypervisor, or by an overloaded vdsm running on many cores. This issue may be fixed in 3.6, which pins vdsm to 1 cpu by default.

In the vdsm log, we see 54 SCSI rescan operations in this timeframe. Each time we do a rescan, we drop the entire lvm cache, which causes a storm of lvm reload commands. We also see 564 refresh domain operations, and 542 errors related to unfetched domains:

Thread-44::DEBUG::2014-10-23 09:01:33,289::domainMonitor::192::Storage.DomainMonitorThread::(_monitorDomain) Refreshing domain 8ea8a6d5-b5e6-4120-bed0-539dd45a2457
Thread-44::ERROR::2014-10-23 09:01:33,675::sdc::137::Storage.StorageDomainCache::(_findDomain) looking for unfetched domain 8ea8a6d5-b5e6-4120-bed0-539dd45a2457
Thread-44::ERROR::2014-10-23 09:01:33,840::sdc::154::Storage.StorageDomainCache::(_findUnfetchedDomain) looking for domain 8ea8a6d5-b5e6-4120-bed0-539dd45a2457

When we have unfetched domains during a refresh operation, we perform an additional vgs operation to look up the domain. This issue was fixed in 3.6 by adding connectStorageServer to FC storage. Domains are prefetched at this point, avoiding the "looking for unfetched domain" errors and the additional lvm commands.
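To count rescan or "unfetched domain" events like the ones above, the vdsm log lines can be parsed mechanically. A minimal sketch, assuming only the line format visible in the excerpts (thread, level, timestamp, module, line number, logger, function, message, separated by "::"); this is an illustrative helper, not part of vdsm:

```python
import re

# Format seen in the log excerpts above:
# Thread::LEVEL::timestamp::module::lineno::logger::(func) message
VDSM_LOG_RE = re.compile(
    r"^(?P<thread>[^:]+)::(?P<level>[A-Z]+)::"
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})::"
    r"(?P<module>[^:]+)::(?P<lineno>\d+)::(?P<logger>[^:]+)::"
    r"\((?P<func>[^)]+)\) (?P<message>.*)$"
)

def parse_vdsm_line(line):
    """Return a dict of fields for one vdsm log line, or None if it
    does not match the expected format."""
    m = VDSM_LOG_RE.match(line)
    return m.groupdict() if m else None
```

Feeding the log through this and counting messages that start with "looking for unfetched domain" would reproduce the 542-error figure quoted above.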
- https://gerrit.ovirt.org/#/q/project:vdsm+branch:master+topic:fc-connect-storage-server
- https://gerrit.ovirt.org/44272

We can also see calls to multipath -r:

Thread-47::DEBUG::2014-10-23 09:56:54,681::multipath::110::Storage.Misc.excCmd::(rescan) '/usr/bin/sudo -n /sbin/multipath -r' (cwd None)
Thread-47::DEBUG::2014-10-23 09:57:30,991::multipath::110::Storage.Misc.excCmd::(rescan) SUCCESS: <err> = ''; <rc> = 0

This call reloads multipath maps (multipath -r), which may affect lvm commands. We have not been calling "multipath -r" since 3.4.

Since some of the issues are relevant only to FC storage, this must be verified on FC. Some issues are similar to bug 1081962. This is a variant of that bug, or a duplicate.

Returning to ON_QA until we get the logs from the scsi tests, and the results on FC.

Created attachment 1134333 [details]
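The duration of such a command can be computed from the paired start/SUCCESS log lines. A small sketch, assuming only that the timestamp is the third "::"-separated field of each line (as in the excerpts above); the helper names are illustrative:

```python
from datetime import datetime

TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"

def log_timestamp(line):
    """Extract the timestamp (third '::' field) of a vdsm log line."""
    return datetime.strptime(line.split("::")[2], TS_FORMAT)

def command_duration(start_line, end_line):
    """Seconds elapsed between a command's start line and its
    SUCCESS/FAILED completion line."""
    return (log_timestamp(end_line) - log_timestamp(start_line)).total_seconds()
```

Applied to the two multipath -r lines quoted above, this gives about 36.3 seconds, which matches the "up to 38 seconds" observation from the repoplot.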
Vdsm logs for FC test
Created attachment 1134334 [details]
Repoplot of vdsm.log for FC test
Looking at the repoplot (attachment 1134334 [details]) we can see:
1. High lastCheck values (up to 75 seconds). This is expected with the current code when using 50 storage domains, tracked in bug 1081962.
2. Very low read delay
3. Lots of lvm commands; most take less than one second, some up to 4.5 seconds

Activating a domain takes less than 60 seconds, so it did not time out. A timeout is possible if the hypervisor is overloaded, or storage operations are performed, slowing down lvm commands. This was not tested here, and will be tested for bug 1081962.

Looking in the vdsm log (attachment 1134333 [details]), in the activation at 2016-03-08 00:07:41:
1. Storage domain activation took 52 seconds.
2. Most of the time was spent waiting for the lvm operation mutex (29.4 seconds). Improved in https://gerrit.ovirt.org/53540
3. We invalidate the entire lvm cache 4(!!!!) times - this causes an lvm reload in all storage domains - at least 50 calls to vgs each time we invalidate the cache. We should invalidate only the vg being activated, for reloading the vg metadata before activating it.
4. We reload the vg metadata 4(!) times (6.8 seconds). We should reload the vg once.
5. We refresh multipath 2 times (8.2 seconds). We should refresh multipath once.
6. We scan FC devices 2 times (1.8 seconds). We should scan FC devices once.
7. We scan iSCSI devices 2 times (0.8 seconds). We should not scan iSCSI devices, as the activated domain is FC.

Summary: with fast storage and little io, we cannot reproduce this issue, but the system behavior is poor.

Created attachment 1134337 [details]
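The redundant work in points 3-7 (the same refresh, reload, and scan steps repeated within one activation) suggests deduplicating expensive steps per activation flow. A minimal sketch of that idea, under the assumption that each step can be keyed by name; this is an illustrative design, not vdsm's actual code:

```python
import threading

class ActivationFlow:
    """Illustrative: run each expensive refresh step at most once per
    activation, so duplicate callers share a single execution."""

    def __init__(self):
        self._done = set()
        self._lock = threading.Lock()

    def run_once(self, name, fn):
        """Execute fn only the first time this step name is seen in the
        flow; later calls with the same name are no-ops."""
        with self._lock:
            if name in self._done:
                return False
            self._done.add(name)
        fn()
        return True
```

With this shape, the second multipath refresh or FC scan in the activation above would become a no-op instead of repeating an 8.2- or 1.8-second operation.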
executor 6 log from FC test
Created attachment 1134338 [details]
activation log at 2016-03-08 00:07:41
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2016-0362.html