Bug 1366556 - MOM causes Vdsm to slow down, high number of 'vmGetIoTunePolicy' API calls
Summary: MOM causes Vdsm to slow down, high number of 'vmGetIoTunePolicy' API calls
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: mom
Classification: oVirt
Component: Core
Version: 0.5.5
Hardware: Unspecified
OS: Unspecified
high
urgent
Target Milestone: ovirt-4.0.5
: 0.5.8
Assignee: Jenny Tokar
QA Contact: Shira Maximov
URL:
Whiteboard: SLA
: 1337229 (view as bug list)
Depends On: 1374988
Blocks: 1392986
TreeView+ depends on / blocked
 
Reported: 2016-08-12 10:31 UTC by Francesco Romani
Modified: 2017-01-18 08:12 UTC (History)
9 users (show)

Fixed In Version: mom-0.5.8-1
Doc Type: Enhancement
Doc Text:
This update ensures that the Memory Overcommit Manager (MoM) uses just one request to retrieve the ioTune configuration and status. This retrieval is more efficient and lowers the load imposed on the VDSM. Previously one request per virtual machine was made which caused a high load on the VDSM.
Clone Of:
Environment:
Last Closed: 2017-01-18 07:37:42 UTC
oVirt Team: SLA
Embargoed:
rule-engine: ovirt-4.0.z+
mgoldboi: planning_ack+
dfediuck: devel_ack+
mavital: testing_ack+


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1374988 0 high CLOSED MOM causes Vdsm to slow down, high number of 'vmGetIoTunePolicy' API calls - vdsm 2021-02-22 00:41:40 UTC
oVirt gerrit 63971 0 'None' 'MERGED' 'Add caching for vm IO tune policies' 2019-11-28 14:40:17 UTC
oVirt gerrit 65169 0 'None' 'MERGED' 'Add bulk interfaces to the makefile.' 2019-11-28 14:40:17 UTC

Internal Links: 1374988

Description Francesco Romani 2016-08-12 10:31:19 UTC
Description of problem:
MoM 0.5.5 uses an excessive amountof calls to 'vmGetIoTunePolicy' to Vdsm, causing high load on the Vdsm server. Considering one box with 110 active VMs, we see

6329 calls in 15m19s (roughly 6.88 calls per second!)

Vdsm load with MOM running, reported by the Vdsm health monitor:

Thread-129::DEBUG::2016-08-12 07:01:25,708::health::122::health::(_check_resources) user=27.78%, sys=9.55%, rss=412200 kB (+272), threads=47
Thread-129::DEBUG::2016-08-12 07:02:25,885::health::122::health::(_check_resources) user=27.97%, sys=9.00%, rss=412360 kB (+160), threads=50
Thread-129::DEBUG::2016-08-12 07:03:26,065::health::122::health::(_check_resources) user=25.28%, sys=8.43%, rss=410700 kB (-1660), threads=47
Thread-129::DEBUG::2016-08-12 07:04:26,238::health::122::health::(_check_resources) user=24.88%, sys=8.08%, rss=410456 kB (-244), threads=47
Thread-129::DEBUG::2016-08-12 07:05:26,413::health::122::health::(_check_resources) user=25.60%, sys=8.47%, rss=410116 kB (-340), threads=47
Thread-129::DEBUG::2016-08-12 07:06:26,591::health::122::health::(_check_resources) user=34.38%, sys=11.73%, rss=410764 kB (+648), threads=47
Thread-129::DEBUG::2016-08-12 07:07:26,766::health::122::health::(_check_resources) user=27.57%, sys=8.82%, rss=410748 kB (-16), threads=47
Thread-129::DEBUG::2016-08-12 07:08:26,947::health::122::health::(_check_resources) user=28.97%, sys=9.28%, rss=411192 kB (+444), threads=48
Thread-129::DEBUG::2016-08-12 07:09:27,134::health::122::health::(_check_resources) user=28.60%, sys=9.37%, rss=411864 kB (+672), threads=47
Thread-129::DEBUG::2016-08-12 07:10:27,312::health::122::health::(_check_resources) user=30.82%, sys=10.32%, rss=411840 kB (-24), threads=50

Vdsm load with MOM disabled:
Thread-129::DEBUG::2016-08-12 07:20:29,174::health::122::health::(_check_resources) user=6.30%, sys=2.82%, rss=410392 kB (+40), threads=47
Thread-129::DEBUG::2016-08-12 07:21:29,364::health::122::health::(_check_resources) user=6.28%, sys=2.77%, rss=411708 kB (+1316), threads=47
Thread-129::DEBUG::2016-08-12 07:22:29,542::health::122::health::(_check_resources) user=6.12%, sys=2.63%, rss=410372 kB (-1336), threads=47
Thread-129::DEBUG::2016-08-12 07:23:29,743::health::122::health::(_check_resources) user=6.47%, sys=2.90%, rss=410372 kB (+0), threads=47
Thread-129::DEBUG::2016-08-12 07:24:29,925::health::122::health::(_check_resources) user=6.23%, sys=2.78%, rss=411692 kB (+1320), threads=47
Thread-129::DEBUG::2016-08-12 07:25:30,131::health::122::health::(_check_resources) user=6.05%, sys=2.80%, rss=411804 kB (+112), threads=47
Thread-129::DEBUG::2016-08-12 07:26:30,315::health::122::health::(_check_resources) user=6.20%, sys=2.88%, rss=411812 kB (+8), threads=47
Thread-129::DEBUG::2016-08-12 07:27:30,493::health::122::health::(_check_resources) user=6.07%, sys=2.68%, rss=411796 kB (-16), threads=47
Thread-129::DEBUG::2016-08-12 07:28:30,671::health::122::health::(_check_resources) user=6.08%, sys=2.92%, rss=410584 kB (-1212), threads=47
Thread-129::DEBUG::2016-08-12 07:29:30,849::health::122::health::(_check_resources) user=6.33%, sys=2.88%, rss=412068 kB (+1484), threads=47



Version-Release number of selected component (if applicable):
MOM 0.5.5
Vdsm 4.18.11

How reproducible:
100%

Steps to Reproduce:
1. Run 110 VMs to have comparable numbers has above

Actual results:
high load on Vdsm

Expected results:
less load on Vdsm

Additional info:
Will probably requires new API to Vdsm

Comment 1 Francesco Romani 2016-08-12 11:15:22 UTC
Happens also with vdsm 4.18.10 and 4.18.6

Comment 2 Eldad Marciano 2016-08-15 11:56:37 UTC
(In reply to Francesco Romani from comment #1)
> Happens also with vdsm 4.18.10 and 4.18.6

Can you elaborate please what in your box?
CPU's Memory etc..

Can we estimate the ratio ? like 110vms running x amount of 'vmGetIoTunePolicy'

Comment 3 Francesco Romani 2016-08-17 12:55:25 UTC
(In reply to Eldad Marciano from comment #2)
> (In reply to Francesco Romani from comment #1)
> > Happens also with vdsm 4.18.10 and 4.18.6
> 
> Can you elaborate please what in your box?
> CPU's Memory etc..

The data I mentioned in description is a QE box with RHEL 7.2.
It's running 32 Xeon(R) CPU E5-2630 v3 @ 2.40GHz with 64 GiB of RAM

I'll attach one example xml

> Can we estimate the ratio ? like 110vms running x amount of
> 'vmGetIoTunePolicy'

IIRC MOM calls vmGetIoTunePolicy every 15s, for each vm.

Comment 5 Francesco Romani 2016-08-18 11:09:56 UTC
(In reply to Francesco Romani from comment #0)
> Description of problem:

This is way less clear than it should. Let me clarify a bit.

> MoM 0.5.5 uses an excessive amountof calls to 'vmGetIoTunePolicy' to Vdsm,
> causing high load on the Vdsm server. Considering one box with 110 active
> VMs, we see
> 
> 6329 calls in 15m19s (roughly 6.88 calls per second!)

To get this number, this is what I did:

on the testing host, count the number of calls, using

# xzgrep 'RPC call vmGetIoTune' /var/log/vdsm/vdsm.log.3.xz > calls.txt

excerpt of 'calls.txt'

Thread-2936835::INFO::2016-08-17 11:01:05,885::bindingxmlrpc::1282::vds::(wrapper) RPC call vmGetIoTunePolicy finished (code=0) in 0.82 seconds
Thread-2936836::INFO::2016-08-17 11:01:05,886::bindingxmlrpc::1282::vds::(wrapper) RPC call vmGetIoTunePolicy finished (code=0) in 0.82 seconds
Thread-2936837::INFO::2016-08-17 11:01:05,890::bindingxmlrpc::1282::vds::(wrapper) RPC call vmGetIoTunePolicy finished (code=0) in 0.00 seconds


This is logged by the RPC subsystem once every call ends. We don't care now about how a single call took, we just need the numbers of them. So we do:

# wc -l calls.txt 
12757 calls.txt

Now in this run I have less calls. Let's see the time interval.
The first call in this set

# head -n 1 calls.txt 
Thread-2936835::INFO::2016-08-17 11:01:05,885::bindingxmlrpc::1282::vds::(wrapper) RPC call vmGetIoTunePolicy finished (code=0) in 0.82 seconds

while the last call in the set

# tail -n 1 calls.txt 
Thread-2950401::INFO::2016-08-17 12:01:00,921::bindingxmlrpc::1282::vds::(wrapper) RPC call vmGetIoTunePolicy finished (code=0) in 0.00 seconds

We can now compute the timespan, considering begin time = 11:01:06 and end time = 12:01:01 -> this gives us 59m55s -> 3595s

so 12757 calls in 3595s is 3.55calls/s assuming uniform rate, which is debatable, but I want to highlight the high number of calls to vmGetIoTunePolicy, which is easily understandable grepping the vdsm log.

Furthermore, the following is a proof of the impact on Vdsm of those calls, but it is unrelated to the number of calls

> Vdsm load with MOM running, reported by the Vdsm health monitor:
> 
> Thread-129::DEBUG::2016-08-12
> 07:01:25,708::health::122::health::(_check_resources) user=27.78%,
> sys=9.55%, rss=412200 kB (+272), threads=47
> Thread-129::DEBUG::2016-08-12
> 07:02:25,885::health::122::health::(_check_resources) user=27.97%,
> sys=9.00%, rss=412360 kB (+160), threads=50
> Thread-129::DEBUG::2016-08-12
> 07:03:26,065::health::122::health::(_check_resources) user=25.28%,
> sys=8.43%, rss=410700 kB (-1660), threads=47
> Thread-129::DEBUG::2016-08-12
> 07:04:26,238::health::122::health::(_check_resources) user=24.88%,
> sys=8.08%, rss=410456 kB (-244), threads=47
> Thread-129::DEBUG::2016-08-12
> 07:05:26,413::health::122::health::(_check_resources) user=25.60%,
> sys=8.47%, rss=410116 kB (-340), threads=47
> Thread-129::DEBUG::2016-08-12
> 07:06:26,591::health::122::health::(_check_resources) user=34.38%,
> sys=11.73%, rss=410764 kB (+648), threads=47
> Thread-129::DEBUG::2016-08-12
> 07:07:26,766::health::122::health::(_check_resources) user=27.57%,
> sys=8.82%, rss=410748 kB (-16), threads=47
> Thread-129::DEBUG::2016-08-12
> 07:08:26,947::health::122::health::(_check_resources) user=28.97%,
> sys=9.28%, rss=411192 kB (+444), threads=48
> Thread-129::DEBUG::2016-08-12
> 07:09:27,134::health::122::health::(_check_resources) user=28.60%,
> sys=9.37%, rss=411864 kB (+672), threads=47
> Thread-129::DEBUG::2016-08-12
> 07:10:27,312::health::122::health::(_check_resources) user=30.82%,
> sys=10.32%, rss=411840 kB (-24), threads=50
> 
> Vdsm load with MOM disabled:
> Thread-129::DEBUG::2016-08-12
> 07:20:29,174::health::122::health::(_check_resources) user=6.30%, sys=2.82%,
> rss=410392 kB (+40), threads=47
> Thread-129::DEBUG::2016-08-12
> 07:21:29,364::health::122::health::(_check_resources) user=6.28%, sys=2.77%,
> rss=411708 kB (+1316), threads=47
> Thread-129::DEBUG::2016-08-12
> 07:22:29,542::health::122::health::(_check_resources) user=6.12%, sys=2.63%,
> rss=410372 kB (-1336), threads=47
> Thread-129::DEBUG::2016-08-12
> 07:23:29,743::health::122::health::(_check_resources) user=6.47%, sys=2.90%,
> rss=410372 kB (+0), threads=47
> Thread-129::DEBUG::2016-08-12
> 07:24:29,925::health::122::health::(_check_resources) user=6.23%, sys=2.78%,
> rss=411692 kB (+1320), threads=47
> Thread-129::DEBUG::2016-08-12
> 07:25:30,131::health::122::health::(_check_resources) user=6.05%, sys=2.80%,
> rss=411804 kB (+112), threads=47
> Thread-129::DEBUG::2016-08-12
> 07:26:30,315::health::122::health::(_check_resources) user=6.20%, sys=2.88%,
> rss=411812 kB (+8), threads=47
> Thread-129::DEBUG::2016-08-12
> 07:27:30,493::health::122::health::(_check_resources) user=6.07%, sys=2.68%,
> rss=411796 kB (-16), threads=47
> Thread-129::DEBUG::2016-08-12
> 07:28:30,671::health::122::health::(_check_resources) user=6.08%, sys=2.92%,
> rss=410584 kB (-1212), threads=47
> Thread-129::DEBUG::2016-08-12
> 07:29:30,849::health::122::health::(_check_resources) user=6.33%, sys=2.88%,
> rss=412068 kB (+1484), threads=47

I think it is better if I repharase this bug to highlight that MOM is having high performance impact on Vdsm (demonstrated by the health check above), most likely related to vmGetIoTunePolicy.

Comment 6 Martin Sivák 2016-09-27 15:43:20 UTC
*** Bug 1337229 has been marked as a duplicate of this bug. ***

Comment 7 Martin Sivák 2016-10-31 13:37:17 UTC
The MOM side is fixed, but there is no VDSM counterpart ready in 4.0.5. You need master branch vdsm and a config file change to test this or just sanity for 4.0.5 as the defaults there won't use anything from this new functionality by default.

Comment 8 Eldad Marciano 2016-11-03 17:06:36 UTC
verifying on top of 4.0.5.8 
with 110 vms
stil we have the same amount of calls

-=>>cat /var/log/vdsm/vdsm.log |grep '2016-11-03 17:04' |grep 'vmGetIoTunePolicy finished' |wc -l |awk '{print $1 / 60}'
6.4

Comment 9 Eldad Marciano 2016-11-03 17:07:48 UTC
(In reply to Eldad Marciano from comment #8)
> verifying on top of 4.0.5.8 
> with 110 vms
> stil we have the same amount of calls
> 
> -=>>cat /var/log/vdsm/vdsm.log |grep '2016-11-03 17:04' |grep
> 'vmGetIoTunePolicy finished' |wc -l |awk '{print $1 / 60}'
> 6.4

also we have the same load 
Thread-13::DEBUG::2016-11-03 17:06:11,008::health::122::health::(_check_resources) user=38.03%, sys=10.05%, rss=359656 kB (+26744), threads=63

Comment 10 Jenny Tokar 2016-11-07 11:55:04 UTC
That sounds weird. 
Are you using the new interface: vdsmjsonrpcbulk?

Comment 11 Eldad Marciano 2016-11-07 12:14:20 UTC
(In reply to Jenny Tokar from comment #10)
> That sounds weird. 
> Are you using the new interface: vdsmjsonrpcbulk?

Yep, mom version mom-0.5.8-1.el7ev.noarch

Comment 12 Eldad Marciano 2016-11-07 12:15:15 UTC
(In reply to Eldad Marciano from comment #11)
> (In reply to Jenny Tokar from comment #10)
> > That sounds weird. 
> > Are you using the new interface: vdsmjsonrpcbulk?
> 
> Yep, mom version mom-0.5.8-1.el7ev.noarch

any further steps required ?
we using default installation setup

Comment 13 Martin Sivák 2016-11-07 13:59:19 UTC
Eldad: did you read comment #7 that explicitly states that the 4.0.5 VDSM does NOT use the new functionality?

Comment 14 Eldad Marciano 2016-11-07 14:48:03 UTC
(In reply to Martin Sivák from comment #13)
> Eldad: did you read comment #7 that explicitly states that the 4.0.5 VDSM
> does NOT use the new functionality?

So could you change the target release ?
since I verified i have the same version as the target release.

Comment 15 Shira Maximov 2016-11-13 10:25:14 UTC
verified on :
mom-0.5.8-1.el7ev.noarch
vdsm-4.18.15-23.git9656d58.el7.centos.x86_64

load :
Thread-15::DEBUG::2016-11-13 09:48:37,366::health::122::health::(_check_resources) user=6.92%, sys=3.65%, rss=347040 kB (-6340), threads=63
Thread-15::DEBUG::2016-11-13 09:49:37,367::health::90::health::(_check) Checking health
Thread-15::DEBUG::2016-11-13 09:49:37,660::health::97::health::(_check_garbage) Collected 424 objects
Thread-15::DEBUG::2016-11-13 09:49:37,663::health::122::health::(_check_resources) user=6.98%, sys=3.62%, rss=347096 kB (+56), threads=63


policies Calls:
-=>>cat /var/log/vdsm/vdsm.log | grep 'getAllVmIoTunePolicies succeeded' | grep "2016-11-13 10:21" | wc -l 
4


Note You need to log in before you can comment on or make changes to this bug.