Bug 1099856 - [SCALE] VDSM is consuming a lot of CPU time even with no active VMs on 100 NFS storage domains
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.3.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 3.5.0
Assignee: Nir Soffer
QA Contact: Yuri Obshansky
URL:
Whiteboard: storage
Depends On:
Blocks: rhev3.5beta 1156165
 
Reported: 2014-05-21 11:07 UTC by Aharon Canan
Modified: 2016-02-10 18:18 UTC
CC: 11 users

Fixed In Version: vt1.3, 4.16.0-1.el6_5
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-02-16 13:37:34 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:
scohen: needinfo+
scohen: Triaged+


Attachments
script to create 100 SDs (822 bytes, text/x-python), 2014-05-21 11:07 UTC, Aharon Canan
logs (3.98 MB, application/x-gzip), 2014-05-21 11:13 UTC, Aharon Canan
create_sd.py (1.09 KB, text/x-python), 2014-05-21 11:19 UTC, Aharon Canan
profile results sorted by time (10.84 KB, text/plain), 2014-05-21 15:06 UTC, Nir Soffer
profile results sorted by cumulative time (15.83 KB, text/plain), 2014-05-21 15:07 UTC, Nir Soffer


Links
oVirt gerrit 28089 (master): MERGED, "nfsSD: Remove unneeded and expensive mount check"

Description Aharon Canan 2014-05-21 11:07:52 UTC
Created attachment 897917 [details]
script to create 100 SDs

Description of problem:
High CPU usage attributed to 'vdsm' after setting up 100 NFS storage domains.

Version-Release number of selected component (if applicable):
is36.4 
vdsm-4.13.2-0.17.el6ev


How reproducible:
100%

Steps to Reproduce:
1. Set up an NFS DC with 3 hosts (not sure if we really need 3 hosts)
2. Create 100 NFS storage domains (a sketch of such a script follows this list)
3. Run "top" on one of the hosts and check vdsm

Actual results:
==========
 7589 vdsm       0 -20 3537m  55m 6532 S 249.4  0.6   2540:04 vdsm


Expected results:


Additional info:

Comment 1 Aharon Canan 2014-05-21 11:13:07 UTC
Created attachment 897918 [details]
logs

Comment 2 Aharon Canan 2014-05-21 11:19:01 UTC
Created attachment 897920 [details]
create_sd.py

Comment 3 Nir Soffer 2014-05-21 11:29:30 UTC
How many cores does the machine that shows 249% CPU usage have?

Comment 4 Aharon Canan 2014-05-21 11:33:08 UTC
4 cores, 1 socket.

Comment 5 Nir Soffer 2014-05-21 11:41:35 UTC
Some more info from the machine:

cpu: Intel(R) Xeon(R) CPU E5504 @ 2.00GHz
release: Red Hat Enterprise Linux Server release 6.5 Beta (Santiago)
last yum update: 2014-05-04 (missing a lot of updates)

Comment 6 Nir Soffer 2014-05-21 14:23:32 UTC
Please repeat this test with a sane number of storage domains. We have customers using 30-40 storage domains, and it would be useful to see how the system behaves under normal conditions to evaluate the severity of this issue.

Comment 7 Nir Soffer 2014-05-21 15:04:13 UTC
I partially reproduced this using a master (2014-05-21) setup with 30 NFS storage domains. I don't see the extreme CPU usage reported by Aharon, only somewhat elevated usage of about 20% out of 800%.

Attached are profiles showing where time is spent in this setup on the SPM.
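
The attached profiles are not reproduced inline. For reference, output of this shape can be produced with the standard library profiler; a minimal sketch assuming cProfile (vdsm's actual profiling hooks may differ, and run_monitoring_cycle is a placeholder for the code under test):

# Minimal profiling sketch, assuming cProfile; not vdsm's actual
# instrumentation. run_monitoring_cycle is a made-up placeholder.
import cProfile
import pstats

def run_monitoring_cycle():
    pass  # placeholder for the monitored workload

profiler = cProfile.Profile()
profiler.enable()
run_monitoring_cycle()
profiler.disable()

stats = pstats.Stats(profiler)
stats.sort_stats('time').print_stats(30)        # cf. "sorted by time"
stats.sort_stats('cumulative').print_stats(30)  # cf. "sorted by cumulative time"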

Comment 8 Nir Soffer 2014-05-21 15:06:28 UTC
Created attachment 898046 [details]
profile results sorted by time

Comment 9 Nir Soffer 2014-05-21 15:07:07 UTC
Created attachment 898047 [details]
profile results sorted by cumulative time

Comment 10 Nir Soffer 2014-05-21 15:15:40 UTC
The high CPU usage is caused by an inefficient implementation of the mount-related code, which has O(N^2) complexity.

NfsStorageDomain.selftest is responsible for 267 seconds of the total 458 seconds of CPU time (58%).
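
To illustrate the complexity class (a simplified sketch, not vdsm's actual code): if each domain's selftest verifies its own mount by scanning the whole mount table, then N domains cost N full table scans per monitoring cycle:

# Simplified sketch of the O(N^2) pattern; not vdsm's actual code.
# Each domain's selftest re-reads and scans the full mount table, so
# N domains x N mount entries = O(N^2) work per monitoring cycle.

def iter_mountpoints():
    # Parse /proc/mounts into mountpoint strings.
    with open('/proc/mounts') as f:
        for line in f:
            yield line.split()[1]

def is_mounted(mountpoint):
    # O(N) scan of the mount table for a single domain.
    return any(mp == mountpoint for mp in iter_mountpoints())

def selftest_all(mountpoints):
    # Called once per domain per cycle: O(N^2) overall.
    for mountpoint in mountpoints:
        if not is_mounted(mountpoint):
            raise RuntimeError('not mounted: %s' % mountpoint)

The merged change (oVirt gerrit 28089, "nfsSD: Remove unneeded and expensive mount check") removes this per-domain mount check.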

Comment 11 Nir Soffer 2014-05-21 15:20:16 UTC
Setting severity to medium and scheduling for 3.5.0, since with a normal setup (30 storage domains) this is not a major issue. This is also not a regression; the code responsible for it dates from 2012.

Comment 12 Nir Soffer 2014-05-21 15:23:07 UTC
Marina, can you tell us what a common number of storage domains in the field is? Do we support systems with more than 30-40 NFS storage domains?

Comment 13 Allon Mureinik 2014-05-21 17:57:02 UTC
Without requirement guidelines, these kinds of bugs are pointless.

Sean - We need a concrete definition of the size of environment we need to support, and of the hardware we require customers to have for it.

Aharon/Gil - we need input on what QA is able to test.

[In any event, 100 SDs sounds like a use case we'll never see in the field, and if we do, the first action item would be to consolidate them.]

Comment 16 Aharon Canan 2014-05-22 09:18:32 UTC
Nir, 

You asked me to set up 100 SDs in comment #7 of https://bugzilla.redhat.com/show_bug.cgi?id=1095907
 
Anyway, if it is supported we need to fix it;
if it is not, we need to block the option to add storage domains above the supported number.

I think it is up to PM to decide, and then we should continue accordingly.

Sean?

Comment 17 Nir Soffer 2014-05-22 09:48:07 UTC
(In reply to Aharon Canan from comment #16)
> You asked me to set 100 SDs in comment #7 from
> https://bugzilla.redhat.com/show_bug.cgi?id=1095907

In https://bugzilla.redhat.com/show_bug.cgi?id=1095907#c6 I asked for "30 iSCSI storage domains".
In https://bugzilla.redhat.com/show_bug.cgi?id=1095907#c7 I suggested creating "lot of (100?) mounts".

Sorry if that was not clear.

Comment 20 Nir Soffer 2014-05-25 15:48:50 UTC
Aharon, can you test the attached patch with your setup?

Comment 21 Aharon Canan 2014-05-26 08:06:39 UTC
We don't have resources for integration testing right now.

Comment 23 Yuri Obshansky 2014-12-23 07:53:14 UTC
Bug verified on
RHEV-M 3.5.0-0.22.el6ev
RHEL - 6Server - 6.6.0.2.el6
libvirt-0.10.2-46.el6_6.1
vdsm-4.16.7.6-1.el6ev 
Created 100 NFS storage domains and checked top on the host:

top - 18:52:14 up 12 days,  9:09,  2 users,  load average: 1.29, 1.21, 1.15
Tasks: 1343 total,   1 running, 1341 sleeping,   0 stopped,   1 zombie
Cpu(s):  0.3%us,  1.0%sy,  0.0%ni, 97.9%id,  0.7%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  396875340k total,  6112184k used, 390763156k free,   159604k buffers
Swap: 16383996k total,        0k used, 16383996k free,  1931216k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
45620 vdsm       0 -20 33.6g 121m 9700 S 62.4  0.0 171:36.09 vdsm

The bug did not reproduce.

