Bug 1158825 - [scale] storage domain activation may time out if a large number of block domains is in use
Summary: [scale] storage domain activation may time out if a large number of block domains is in use
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.4.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ovirt-3.6.0-rc
Target Release: 3.6.0
Assignee: Nir Soffer
QA Contact: Eldad Marciano
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2014-10-30 10:17 UTC by Evgheni Dereveanchin
Modified: 2019-10-10 09:27 UTC
CC List: 15 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-03-09 19:27:09 UTC
oVirt Team: Scale
Target Upstream Version:
Embargoed:


Attachments
Repoplot of vdsm.log (378.17 KB, application/pdf)
2016-03-02 23:12 UTC, Nir Soffer
Vdsm logs for FC test (2.37 MB, application/x-bzip)
2016-03-08 22:59 UTC, Nir Soffer
Repoplot of vdsm.log for FC test (764.47 KB, application/pdf)
2016-03-08 23:00 UTC, Nir Soffer
executor 6 log from FC test (1.94 MB, text/plain)
2016-03-08 23:40 UTC, Nir Soffer
activation log at 2016-03-08 00:07:41 (23.82 KB, text/plain)
2016-03-08 23:43 UTC, Nir Soffer


Links
System: Red Hat Product Errata    ID: RHBA-2016:0362    Priority: normal    Status: SHIPPED_LIVE
Summary: vdsm 3.6.0 bug fix and enhancement update    Last Updated: 2016-03-09 23:49:32 UTC

Description Evgheni Dereveanchin 2014-10-30 10:17:11 UTC
Description of problem:
When a large number of block (LVM) storage domains is in use, the monitoring threads may block other tasks for a long time, which can result in timeouts during storage domain activation.

Version-Release number of selected component (if applicable):
vdsm-4.14.11-5.el6ev

How reproducible:
sometimes

Steps to Reproduce:
1. create 50 LVM storage domains
2. attach one more storage domain

Actual results:
operation may fail due to a timeout

Expected results:
storage domain is attached successfully

Comment 4 Nir Soffer 2014-11-02 20:21:24 UTC
Can you attach these logs, describing the same timeframe?

- /var/log/messages
- /var/log/sanlock.log

Comment 7 Allon Mureinik 2014-11-07 09:28:22 UTC
Currently targeting 3.6.0, as we're quite late in 3.5 and this probably won't be an easy fix.

If we see it's simple enough, we can consider backporting to 3.5.1.

Comment 8 Allon Mureinik 2015-03-22 15:10:47 UTC
(In reply to Evgheni Dereveanchin from comment #0)
> Steps to Reproduce:
> 1. create 50 LVM storage domains
> 2. attach one more storage domain
This BZ was reported against 3.4.0.
3.5.0 includes a whole lot of performance enhancements, and we've clocked 80 domains in a single DC.

Moving to ON_QA to verify.

Comment 15 Nir Soffer 2016-03-02 23:12:29 UTC
Created attachment 1132571 [details]
Repoplot of vdsm.log

Comment 16 Nir Soffer 2016-03-03 00:13:40 UTC
Looking at attachment 1132571 [details], we can see:

1. High lastCheck values for a long time between 09:35-09:58
2. Very low read delay values
3. Very slow lvm commands (up to 30+ seconds)
4. Very slow mailbox dd commands (up to 38 seconds)
5. Very slow storage domain monitoring dd commands (up to 14 seconds).

This can be explained by an overloaded hypervisor, or by an overloaded
vdsm running on many cores.
This issue may be fixed in 3.6, which pins vdsm to 1 CPU by default.

In the vdsm log, we see 54 SCSI rescan operations in this timeframe. Each time we
do a rescan, we drop the entire lvm cache, which causes a storm of lvm reload
commands.
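
For illustration only (the LvmCache/run_vgs names below are made up, this is
not the vdsm lvm module), this is roughly why each rescan is expensive: once
the whole cache is dropped, the next round of domain monitoring turns into one
vgs reload per domain.

# Illustrative sketch only (hypothetical names, not vdsm code): why one
# SCSI rescan is expensive when 50 block domains are monitored.

class LvmCache(object):
    def __init__(self):
        self._vgs = {}                  # vg name -> cached metadata

    def invalidate_all(self):
        # what effectively happens on every SCSI rescan
        self._vgs.clear()

    def get_vg(self, name):
        if name not in self._vgs:
            # cache miss: run an external "vgs" command (expensive)
            self._vgs[name] = run_vgs(name)
        return self._vgs[name]

def run_vgs(name):
    # stand-in for the real "vgs" invocation
    return {"name": name}

cache = LvmCache()
domains = ["vg-%02d" % i for i in range(50)]

for vg in domains:                      # steady state: 50 cached entries
    cache.get_vg(vg)

cache.invalidate_all()                  # one rescan ...
for vg in domains:
    cache.get_vg(vg)                    # ... triggers 50 vgs reloads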

We also see 564 refresh domain operations, and 542 errors related to
unfetched domains:

Thread-44::DEBUG::2014-10-23 09:01:33,289::domainMonitor::192::Storage.DomainMonitorThread::(_monitorDomain) Refreshing domain 8ea8a6d5-b5e6-4120-bed0-539dd45a2457
Thread-44::ERROR::2014-10-23 09:01:33,675::sdc::137::Storage.StorageDomainCache::(_findDomain) looking for unfetched domain 8ea8a6d5-b5e6-4120-bed0-539dd45a2457
Thread-44::ERROR::2014-10-23 09:01:33,840::sdc::154::Storage.StorageDomainCache::(_findUnfetchedDomain) looking for domain 8ea8a6d5-b5e6-4120-bed0-539dd45a2457

When we have unfetched domains during a refresh operation, we perform an
additional vgs operation to look up each domain.

This issue was fixed in 3.6 by adding connectStorageServer to FC storage.
Domains are prefetched at that point, avoiding the "looking for unfetched
domain" errors and the additional lvm commands (see the sketch after the links below).
- https://gerrit.ovirt.org/#/q/project:vdsm+branch:master+topic:fc-connect-storage-server
- https://gerrit.ovirt.org/44272
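
As a rough illustration of the change (again with hypothetical names, not the
real vdsm API), prefetching the domain list once at connect time replaces the
per-domain "unfetched domain" lookups during refresh:

# Illustrative sketch (hypothetical DomainCache/run_vgs_bulk names):
# prefetching at connect time vs. looking up unfetched domains lazily.

class DomainCache(object):
    def __init__(self):
        self._domains = {}

    def prefetch(self, sd_uuids):
        # 3.6-style behaviour: after connectStorageServer the domains are
        # fetched up front, in one bulk lvm call
        self._domains.update(run_vgs_bulk(sd_uuids))

    def produce(self, sd_uuid):
        if sd_uuid not in self._domains:
            # pre-3.6 path on FC: "looking for unfetched domain ...",
            # followed by an extra per-domain vgs lookup
            self._domains[sd_uuid] = run_vgs_bulk([sd_uuid])[sd_uuid]
        return self._domains[sd_uuid]

def run_vgs_bulk(sd_uuids):
    # stand-in for an lvm "vgs" invocation
    return dict((u, {"uuid": u}) for u in sd_uuids)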

We can also see calls to multipath -r:

Thread-47::DEBUG::2014-10-23 09:56:54,681::multipath::110::Storage.Misc.excCmd::(rescan) '/usr/bin/sudo -n /sbin/multipath -r' (cwd None)
Thread-47::DEBUG::2014-10-23 09:57:30,991::multipath::110::Storage.Misc.excCmd::(rescan) SUCCESS: <err> = ''; <rc> = 0

This call reloads the multipath maps (multipath -r), which may affect
lvm commands. We have not been calling "multipath -r" since 3.4.
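
For reference, the duration of that call can be read directly off the two log
lines above (a trivial helper, just to show the arithmetic):

# Compute how long the "multipath -r" call above took from its start and
# SUCCESS log timestamps.

from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S,%f"
start = datetime.strptime("2014-10-23 09:56:54,681", FMT)
end = datetime.strptime("2014-10-23 09:57:30,991", FMT)

print("multipath -r took %.1f seconds" % (end - start).total_seconds())
# -> multipath -r took 36.3 seconds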

Since some of the issues are relevant only to FC storage, this must be verified on FC.

Some issues are similar to bug 1081962. This is a variant of that bug, or a duplicate.

Returning to ON_QA until we get the logs from the scsi tests, and the results on FC.

Comment 21 Nir Soffer 2016-03-08 22:59:48 UTC
Created attachment 1134333 [details]
Vdsm logs for FC test

Comment 22 Nir Soffer 2016-03-08 23:00:44 UTC
Created attachment 1134334 [details]
Repoplot of vdsm.log for FC test

Comment 23 Nir Soffer 2016-03-08 23:38:07 UTC
Looking at the repoplot (attachment 1134334 [details]), we can see:

1. High lastCheck values (up to 75 seconds). This is expected with the current 
   code when using 50 storage domains, tracked in bug 1081962.

2. Very low read delay

3. Lots of lvm commands; most take less than one second, some up to 4.5 seconds

Activating a domain takes less than 60 seconds, so it did not time out.

A timeout is possible if the hypervisor is overloaded, or if storage operations are
being performed, slowing down lvm commands. This was not tested here, and will be
tested for bug 1081962.

Looking in the vdsm log (attachment 1134333 [details]), at the activation at 2016-03-08 00:07:41:

1. Storage domain activation took 52 seconds

2. Most of the time was spent waiting for the lvm operation mutex (29.4 seconds).
   Improved in https://gerrit.ovirt.org/53540

3. We invalidate the entire lvm cache 4(!!!!) times - this causes an lvm reload
   in all storage domains - at least 50 calls to vgs each time we invalidate
   the cache. We should invalidate only the vg being activated, to reload
   its metadata before activating it.

4. We reload the vg metadata 4(!) times (6.8 seconds).
   We should reload the vg once.

5. We refresh multipath 2 times (8.2 seconds).
   We should refresh multipath once.

6. We scan FC devices 2 times (1.8 seconds).
   We should scan FC devices once.

7. We scan iSCSI devices 2 times (0.8 seconds).
   We should not scan iSCSI devices, as the activated domain is FC
   (a rough tally of the durations above follows this list).
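
The tally below is only a back-of-the-envelope check of the numbers in items
2-7 (it assumes, for illustration, that repeated operations take roughly equal
time):

# Rough accounting of the 52-second activation, using the durations from
# items 2-7 above.

measured = {
    "lvm operation mutex wait": 29.4,
    "vg metadata reload (x4)": 6.8,
    "multipath refresh (x2)": 8.2,
    "FC scan (x2)": 1.8,
    "iSCSI scan (x2)": 0.8,
}

print("accounted for: %.1f of ~52 seconds" % sum(measured.values()))
# -> accounted for: 47.0 of ~52 seconds

# If each operation ran only once (and the iSCSI scan not at all), the
# redundant repetitions alone would save roughly:
savings = 6.8 * 3 / 4.0 + 8.2 / 2.0 + 1.8 / 2.0 + 0.8
print("potential saving from deduplication: ~%.1f seconds" % savings)
# -> potential saving from deduplication: ~10.9 seconds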

Summary:

With fast storage and little I/O, we cannot reproduce this issue, but
the system behavior is poor.

Comment 24 Nir Soffer 2016-03-08 23:40:04 UTC
Created attachment 1134337 [details]
executor 6 log from FC test

Comment 25 Nir Soffer 2016-03-08 23:43:50 UTC
Created attachment 1134338 [details]
activation log at 2016-03-08 00:07:41

Comment 27 errata-xmlrpc 2016-03-09 19:27:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0362.html

