Description of problem:

Scenario (scale):
- 50+ Storage Domains in the DC
- 25 Hosts

Action:
- Adding another SD to the Pool

Problems:
- Unexpected temporary load spikes, causing sub-optimal service for already running VMs.
- VM monitors unresponsive for a short time.

In the VDSM logs:

1. ConnectStorageServer:
2018-12-12 13:42:54,474+0100 DEBUG (jsonrpc/0) [jsonrpc.JsonRpcServer] Calling 'StoragePool.connectStorageServer'....

2. HBA scan:
2018-12-12 13:42:54,565+0100 DEBUG (jsonrpc/0) [storage.HBA] Starting scan (hba:58)

3. vgs to find the new SD:
2018-12-12 13:43:00,528+0100 DEBUG (monitor/6d637c9) [storage.Misc.excCmd] /usr/bin/taskset --cpu-list 0-47 /usr/bin/sudo -n /sbin/lvm vgs ...
2018-12-12 13:43:00,653+0100 DEBUG (jsonrpc/0) [storage.HSM] Found SD uuids: ('.....') (hsm:2350)

4. ConnectStorageServer returns:
2018-12-12 13:43:00,658+0100 INFO (jsonrpc/0) [jsonrpc.JsonRpcServer] RPC call StoragePool.connectStorageServer succeeded in 6.18 seconds (__init__:539)

Up to this point everything looks fine, and even the lvm vgs is quick. But from here on we start seeing load spikes as the existing SDs are refreshed and monitoring of the new SD starts.

Some things I noticed:

a) Before the new SD was added we had 3-4 LVM commands per second, but while the new SD is being added there are tons. I understand vdsm dropped its caches and is refreshing everything, including the SDs that already existed. See the number of lvm commands per minute (a rough way to tally these from vdsm.log is sketched at the end of this comment):

  36 2018-12-12 13:41
  52 2018-12-12 13:42
 443 2018-12-12 13:43  <--- 1 SD added, going from 56 to 57
 283 2018-12-12 13:44
 438 2018-12-12 13:45
 328 2018-12-12 13:46
  48 2018-12-12 13:47
  24 2018-12-12 13:48

Multiply this by 25 hosts and we should have ~10000 lvm commands per minute, which may even impact the storage itself.

b) It seems each SD monitor runs its own vgs when refreshing, instead of a common vgs for all of them:
2018-12-12 13:43:11,014+0100 (monitor/3ff1fae) /sbin/lvm vgs
2018-12-12 13:43:11,031+0100 (monitor/093830f) /sbin/lvm vgs
2018-12-12 13:43:11,047+0100 (monitor/e36c27b) /sbin/lvm vgs
2018-12-12 13:43:11,064+0100 (monitor/ec6f1c2) /sbin/lvm vgs
2018-12-12 13:43:11,081+0100 (monitor/8f6811b) /sbin/lvm vgs
2018-12-12 13:43:11,099+0100 (monitor/0e39d15) /sbin/lvm vgs
2018-12-12 13:43:11,115+0100 (monitor/f6c271d) /sbin/lvm vgs
2018-12-12 13:43:11,131+0100 (monitor/4b8d054) /sbin/lvm vgs
2018-12-12 13:43:11,146+0100 (monitor/6d637c9) /sbin/lvm vgs
2018-12-12 13:43:11,168+0100 (monitor/3bba612) /sbin/lvm vgs
2018-12-12 13:43:11,190+0100 (monitor/dc425d0) /sbin/lvm vgs
2018-12-12 13:43:11,205+0100 (monitor/894d769) /sbin/lvm vgs
2018-12-12 13:43:11,222+0100 (monitor/8a49441) /sbin/lvm vgs
2018-12-12 13:43:11,237+0100 (monitor/0b380e6) /sbin/lvm vgs
2018-12-12 13:43:11,253+0100 (monitor/f3d9d4f) /sbin/lvm vgs
2018-12-12 13:43:11,268+0100 (monitor/9c14d90) /sbin/lvm vgs
2018-12-12 13:43:11,285+0100 (monitor/0455c44) /sbin/lvm vgs
2018-12-12 13:43:11,301+0100 (monitor/c24021c) /sbin/lvm vgs
2018-12-12 13:43:11,317+0100 (monitor/a2050d0) /sbin/lvm vgs
2018-12-12 13:43:11,332+0100 (monitor/62162e6) /sbin/lvm vgs
2018-12-12 13:43:11,347+0100 (monitor/330d343) /sbin/lvm vgs
2018-12-12 13:43:11,363+0100 (monitor/7c5f77f) /sbin/lvm vgs
2018-12-12 13:43:11,378+0100 (monitor/731c37b) /sbin/lvm vgs
2018-12-12 13:43:11,393+0100 (monitor/59a2d31) /sbin/lvm vgs
2018-12-12 13:43:11,409+0100 (monitor/b75c534) /sbin/lvm vgs
2018-12-12 13:43:11,424+0100 (monitor/8114b42) /sbin/lvm vgs
2018-12-12 13:43:11,440+0100 (monitor/aa26ebd) /sbin/lvm vgs
2018-12-12 13:43:11,456+0100 (monitor/4f349c7) /sbin/lvm vgs
2018-12-12 13:43:11,472+0100 (monitor/0260715) /sbin/lvm vgs

c) The load only seems to go back to normal levels once all SDs are produced and monitored.

2018-12-12 13:46:30,075+0100 INFO (monitor/012101a) [storage.Monitor] Domain 012101ae-e683-4393-bba6-cee062d67a2a became VALID (monitor:471)

d) Context switches also spike, I assume due to the large number of new processes being spawned (the lvm commands):

10:41:35 PM     proc/s   cswch/s
10:42:35 PM      33.78  33366.76
10:43:36 PM      52.67  62432.96
10:44:36 PM      64.65  62035.40
10:45:36 PM      80.33  85906.61
10:46:36 PM      84.95  87470.62
10:47:36 PM      40.98  36692.07

e) Run queue and load average also jump:

10:41:35 PM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15   blocked
10:42:35 PM         4      2452      1.99      2.65      2.79         0
10:43:36 PM        40      2574     17.38      6.12      3.92         4
10:44:36 PM         7      2473     17.23      8.94      5.08         0
10:45:36 PM         2      2481     29.93     14.19      7.13         0
10:46:36 PM         3      2515     35.19     18.42      9.01         0
10:47:36 PM         1      2495     19.31     16.78      9.03         0

f) CPU usage - note the amount of time spent in kernel space (lvm syscalls?):

10:41:35 PM     CPU     %user     %nice   %system   %iowait    %steal     %idle
10:42:35 PM     all      6.11      0.00      1.90      0.14      0.00     91.84
10:43:36 PM     all      8.12      0.00     23.80      0.48      0.00     67.61
10:44:36 PM     all      7.84      0.00     22.88      0.57      0.00     68.71
10:45:36 PM     all      9.41      0.00     40.94      0.75      0.00     48.91
10:46:36 PM     all      9.58      0.00     43.07      0.75      0.00     46.60
10:47:36 PM     all      6.28      0.00      2.93      0.15      0.00     90.64

g) IO activity noticeably spikes too:

10:41:35 PM       tps      rtps      wtps   bread/s   bwrtn/s
10:42:35 PM    331.78    231.36    100.42   9887.35   2938.26
10:43:36 PM   1432.41   1327.85    104.57  28176.10   2975.54
10:44:36 PM   1772.15   1657.40    114.74  36133.27   4514.01
10:45:36 PM   2464.68   2355.14    109.54  48581.85   3271.79
10:46:36 PM   2520.50   2410.22    110.27  51025.23   3544.96
10:47:36 PM    494.47    395.87     98.60  13853.77   3006.62

h) Finally, VM monitors go unresponsive at 13:46:24:
2018-12-12 13:46:24,303+0100 WARN (jsonrpc/7) [virt.vm] (vmId='fafa7e5d-cbd7-4f46-ba65-0f7254dd91a6') monitor became unresponsive (command timeout, age=63.5500000007) (vm:5132)

I'm not sure, but this looks like the same problem as BZ1598266, which is about GetVolumeSize apparently not being able to run while the SDs are being refreshed/produced, so UpdateVolumes is blocked for each VM. I don't fully understand that area (far from it), so the above may contain some misunderstandings of how this part works.

Version-Release number of selected component (if applicable):
vdsm-4.19.45-1.el7ev.x86_64

How reproducible:
Happens constantly on the customer environment (scale).

Steps to Reproduce:
1. Add a Storage Domain to a Pool with 50+ SDs.

Actual results:
Higher than expected load.

Expected results:
Some resource usage, but nothing this significant.
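For reference, here is a minimal sketch of how the per-minute lvm command counts in item (a) can be tallied from vdsm.log. It assumes the log format shown in the excerpts above (timestamp at the start of each line, lvm invocations containing '/sbin/lvm'); the script name is just an example, and the counts are approximate since a single invocation may be logged on more than one line.

# count_lvm_per_minute.py - rough sketch, assumes vdsm.log lines start with
# "YYYY-MM-DD HH:MM:SS,mmm+ZZZZ" and that lvm invocations contain "/sbin/lvm".
import sys
from collections import Counter

counts = Counter()
with open(sys.argv[1]) as log:      # e.g. /var/log/vdsm/vdsm.log
    for line in log:
        if "/sbin/lvm" in line:
            # "2018-12-12 13:43" -> one bucket per minute
            counts[line[:16]] += 1

for minute in sorted(counts):
    print(counts[minute], minute)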
Germano, can you look at https://bugzilla.redhat.com/show_bug.cgi?id=1488892 - it reduced the number of full refreshes and landed in 4.2.1.
(In reply to Roy Golan from comment #6)
> Germano, can you look at
> https://bugzilla.redhat.com/show_bug.cgi?id=1488892 - it reduced the
> number of full refreshes and landed in 4.2.1.

Hi Roy,

Nice. But this case has just 2 active and 2 enabled paths, and the load spikes are *very* high. There seems to be a lot of contention between LVM commands because there are 50+ SDs in total, so dropping all the caches and refreshing everything fires a lot of lvm commands. Even if each SD is refreshed just once, I think the problem may still be perceptible. So I'm not sure that patch is enough to fully solve this. What do you think?
Vdsm does not limit the number of lvm commands, and we refresh the lvm cache for every domain when performing operations like adding a storage domain. With a large number of storage domains you get a large number of concurrent lvm commands in such a case. We discussed this possible issue in the past, but never saw a real case where it was problematic.

I think this patch can help:
https://gerrit.ovirt.org/c/66932/

I suggest first reproducing this issue in a scale environment, and then seeing if the patch makes a difference.
Patch https://gerrit.ovirt.org/96422 is only for discussion; it eliminates unneeded lvm refreshes. I think we can work on this for 4.4, since it requires major changes in the system.
Patch https://gerrit.ovirt.org/96436 is also only for discussion. I think this is the direction we should take in 4.4 if we want to improve scalability.
This bug has not been marked as blocker for oVirt 4.3.0. Since we are releasing it tomorrow, January 29th, this bug has been re-targeted to 4.3.1.
We now limit the number of concurrent lvm calls to 10 per host. This may help reduce the load.

We also have experimental patches that may improve caching further, but they require much more work. We will continue working on this after we have test results for the current code.

Daniel, can you reproduce with vdsm 4.30.8, and test with vdsm 4.30.9?
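For illustration, a minimal sketch of the throttling idea, assuming a simple semaphore around lvm command execution. This is not the actual vdsm code; the function and variable names are made up, and only the limit of 10 comes from the change described above.

# Sketch only: cap the number of concurrently running lvm commands at 10.
import subprocess
import threading

MAX_CONCURRENT_LVM = 10
_lvm_slots = threading.BoundedSemaphore(MAX_CONCURRENT_LVM)

def run_lvm(args):
    # Callers block here once 10 lvm commands are already running, so a
    # refresh of 50+ storage domains no longer spawns 50+ lvm processes
    # at once; the same work is spread over a longer period.
    with _lvm_slots:
        return subprocess.run(
            ["/sbin/lvm"] + list(args),
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
        )

# Example: 50 monitor threads asking for a vgs refresh at the same time will
# keep at most 10 lvm processes running concurrently.
monitors = [threading.Thread(target=run_lvm, args=(["vgs"],)) for _ in range(50)]
for t in monitors:
    t.start()
for t in monitors:
    t.join()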
As I see the bug is now in MODIFIED state, we will receive the fix in the coming build. We have added it to our sprint to validate it in that build. Regarding your request about VDSM: which VDSM version will be part of the RHV 4.3 Beta 2 that will be built today, Feb 21?
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason: [Found non-acked flags: '{'rhevm-4.3-ga': '?'}', ] For more info please contact: rhv-devops
I have noticed that the QA ack was missing (acked it now). Is this part of the 4.3.1-2 drop?
Germano, can you define what "significant load" is, and how we can measure it? I think we should first reproduce this issue with vdsm 4.30.8, then try the same with vdsm-4.30.9, and compare the results.
Nir, can you please explain how we can verify your latest change limiting concurrent lvm calls to 10 per host?
(In reply to Daniel Gur from comment #18)
> Nir, can you please explain how we can verify your latest change limiting
> concurrent lvm calls to 10 per host?

We can watch the number of lvm processes run by vdsm, and check that we never see more than 10 processes while working with 50 storage domains. In previous versions we could see up to 50 lvm processes (one per domain).
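A rough way to watch this during the test, assuming lvm is invoked as /sbin/lvm as in the logs above (a sketch, not a vdsm tool):

# Sketch: print the number of running lvm processes once per second by
# scanning /proc; during the test the count should stay at or below 10.
import glob
import time

def count_lvm_processes():
    count = 0
    for path in glob.glob("/proc/[0-9]*/cmdline"):
        try:
            with open(path, "rb") as f:
                cmdline = f.read()
        except OSError:      # the process exited while we were scanning
            continue
        if b"/sbin/lvm" in cmdline:
            count += 1
    return count

while True:
    print(time.strftime("%H:%M:%S"), count_lvm_processes())
    time.sleep(1)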
(In reply to Nir Soffer from comment #17)
> Germano, can you define what "significant load" is, and how we can measure
> it?
>
> I think we should first reproduce this issue with vdsm 4.30.8, then
> try the same with vdsm-4.30.9, and compare the results.

Hi Nir,

I added some examples of what the "significant load" produces in comment #0. See items 'd' to 'g', which are easier to see from a single host. On the actual customer environment, which has dozens of hosts, all hosts doing this at the same time impacts the storage performance for the VMs, so they get reduced (IO) performance too.

If I understand your patch correctly, we will spread this load over more time to generate less impact by limiting concurrent lvm commands. But it is not doing N SDs at a time and starting another SD only when one finishes; it will still do all SDs at the same time, just limiting the number of lvm commands. So each SD will take much longer to produce than if N SDs were done at a time. I'm afraid this will make things like GetVolumeSize block for much longer, as producing a SD will now take longer, triggering things like BZ1598266 even more?
Scale Environment:
- Hosted-Engine
- 2 Hosts - SuperMicro servers with ~200 VMs each
- rhv-release-4.3.1-1-001
- vdsm-4.30.9-1.el7ev.x86_64

Test Preparation:
* Create 60 LUNs on NetApp (FC)

Two tests were run:
1. Add 50 Storage Domains - Duration: 2:05 hrs (2019-02-24 17:00-19:00 Israel Time)
2. Add 10 more Storage Domains - Duration: ~40 min (2019-02-25 17:15-17:55 Israel Time)

During the tests, resource usage was collected using 'nmon' (link below).

Results:
1. Hosts CPU - ~12% (Sys - 2%, IO wait - 0%)
2. Hosts Disk Busy - 4-5%

VDSM log: no 'monitor became unresponsive' messages.

Link for 'nmon' results:
https://drive.google.com/open?id=1-I8JsjrZLJaE29qndwQmUR5phA6cCwpp
(In reply to Germano Veit Michel from comment #20)
> (In reply to Nir Soffer from comment #17)
> If I understand your patch correctly, we will spread this load over more
> time to generate less impact by limiting concurrent lvm commands.

Right. Before this patch, we could have up to 50 lvm commands running at the same time on every host, reading from remote storage. With the patch, we can have only 10 lvm commands running at the same time.

> But it is not doing N SDs at a time and starting another SD only when one
> finishes; it will still do all SDs at the same time, just limiting the
> number of lvm commands. So each SD will take much longer to produce than if
> N SDs were done at a time.

In my tests, running many lvm commands just makes every command slower. A few years ago we saw single lvm commands take up to 80 seconds. So I don't think limiting the number of commands will slow anything down.

> I'm afraid this will make things like GetVolumeSize block for much longer,
> as producing a SD will now take longer, triggering things like BZ1598266
> even more?

I don't think it will affect this, but we need to test it. Basically, any storage domain function can be blocked for an unlimited time because of the shared domain cache if the iSCSI/FC rescan is very slow. The real issue is slow rescans, not slow lvm commands.
We could not reproduce the issue in 4.3, and we did not test with an earlier version, so I'm not sure we understand this issue. It is possible that this was improved by limiting the number of concurrent lvm commands in vdsm, but without reproducing the issue with a version that does not include this change we cannot be sure. Therefore I don't think we can have meaningful doc text.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:1077