Bug 1275762 - [SCALE] vdsm may use several 1000 percent of CPU
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.5.4
Hardware: x86_64 Linux
Priority: high
Severity: high
Target Milestone: ovirt-3.5.6
Target Release: ---
Assigned To: Francesco Romani
QA Contact: Eldad Marciano
Depends On:
Blocks:
Reported: 2015-10-27 12:36 EDT by Martin Tessun
Modified: 2016-04-19 21:26 EDT
CC List: 10 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Due to the implementation of the Python VM's Global Interpreter Lock (GIL), a high number of threads causes performance penalties when running on multiple cores.
Consequence: VDSM requires an exceptionally high amount of CPU, wasting resources.
Fix: Implemented CPU pinning support in VDSM.
Result: Performance gains and a large reduction in CPU consumption.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-04-19 21:26:01 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Virt
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


External Trackers
Tracker ID Priority Status Summary Last Updated
oVirt gerrit 46522 None None None Never
oVirt gerrit 47013 None None None Never

Description Martin Tessun 2015-10-27 12:36:39 EDT
Description of problem:
The issue is mainly that on some hypervisors the load of the vdsm process goes up to several 1000 percent.
As soon as the running VMs are migrated away the CPU consumption of vdsm gets back to a normal level.

Version-Release number of selected component (if applicable):
RHEV 3.5.4
vdsm 4.16

How reproducible:
Sometimes.


Steps to Reproduce:

* Hardware used: Dell R820 with 4xE5-4620v2 (8 cores) and 512GB RAM
* Hypervisor OS: RHEL6
* RHEV-M 3.5.4
* Approx 50+ VMs running with guest-agent enabled.

Actual results:

Sometimes the load of vdsm goes up to several 1000 percent (4000-5000% have been observed).
The consumption stays at that level until VMs are migrated of the host. As soon as the migrations are cancelled the consumption goes back to that high level of several 1000 percent.


Expected results:
vdsm CPU consumption should be much lower.

Additional info:
Gil has seen this issue onsite as well.
We should try to reproduce the issue in house with our scaling team.
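
One way to watch the per-thread CPU usage while reproducing (a sketch; it assumes the daemon's command line matches pgrep -f vdsm):

# VDSM_PID=$(pgrep -f vdsm | head -n 1)
# top -H -p $VDSM_PID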
Comment 6 Francesco Romani 2015-11-02 06:55:38 EST
(In reply to Martin Tessun from comment #0)
> The consumption stays at that level until VMs are migrated of the host. As
> soon as the migrations are cancelled the consumption goes back to that high
> level of several 1000 percent.

To be sure I understood: you mean "migrated off" (like "migrated away"), right?

So it eats huge amounts of CPU while it is running VMs, right? If no VMs are running, then the CPU consumption is low (or negligible), can you confirm?

> Additional info:
> Gil has seen this issue onsite as well.
> We should try to reproduce the issue in house with our scaling team.

Any attempt to do that?
Comment 8 Martin Tessun 2015-11-02 07:03:02 EST
(In reply to Francesco Romani from comment #6)
> (In reply to Martin Tessun from comment #0)
> > The consumption stays at that level until VMs are migrated of the host. As
> > soon as the migrations are cancelled the consumption goes back to that high
> > level of several 1000 percent.
> 
> To be sure I understood: you mean "migrated off" (like "migrated away"),
> right?
> 
> So it eats huge amounts of CPU while it is running VMs, right? If no VMs are
> running, then the CPU consumption is low (or negligible), can you confirm?

Indeed. Even if the VMs are still running, but in the process of migrating off the host, the load already drops to a convenient level.

If you abort the migration, the load of the vdsm process goes back up to its previous level.

> 
> > Additional info:
> > Gil has seen this issue onsite as well.
> > We should try to reproduce the issue in house with our scaling team.
> 
> Any attempt to do that?
Comment 14 Francesco Romani 2015-11-02 07:52:55 EST
(In reply to Martin Tessun from comment #8)
> (In reply to Francesco Romani from comment #6)
> > (In reply to Martin Tessun from comment #0)
> > > The consumption stays at that level until VMs are migrated of the host. As
> > > soon as the migrations are cancelled the consumption goes back to that high
> > > level of several 1000 percent.
> > 
> > To be sure I understood: you mean "migrated off" (like "migrated away"),
> > right?
> > 
> > So it eats huge amounts of CPU while it is running VMs, right? If no VMs are
> > running, then the CPU consumption is low (or negligible), can you confirm?
> 
> Indeed. Even if the VMs are still running, but in the process of migrating
> off the host, the load already drops to a convenient level.
> 
> If you abort the migration, the load of the vdsm process goes back up to its
> previous level.

OK.

In 4.16.28, VDSM gained cpu_affinity support. It is possible to pin VDSM to just one core by editing vdsm.conf:

===
[vars]
cpu_affinity = 1
===

We don't recommend pinning to CPU #0 because system tasks often default to that CPU, so it can be pretty crowded.

We expect significant improvements in CPU consumption if VDSM runs pinned to one CPU. Please try this setting and see if/how it helps.
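
For example (a sketch; it assumes the standard RHEL6 layout, where the file is /etc/vdsm/vdsm.conf and the daemon runs as the vdsmd service):

# vi /etc/vdsm/vdsm.conf      (add "cpu_affinity = 1" under the [vars] section)
# service vdsmd restart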

Please note I'm _not_ claiming this is the definitive fix for this issue. I'm still reviewing the logs, to see if there is something else here.
Comment 15 Francesco Romani 2015-11-02 08:36:33 EST
(In reply to Francesco Romani from comment #14)
> In 4.16.28, VDSM gained cpu_affinity support. It is possible to pin VDSM
> to just one core by editing vdsm.conf:
> 
> ===
> [vars]
> cpu_affinity = 1
> ===
> 
> We don't recommend pinning to CPU #0 because system tasks often default to
> that CPU, so it can be pretty crowded.
> 
> We expect significant improvements in CPU consumption if VDSM runs pinned
> to one CPU. Please try this setting and see if/how it helps.
> 
> Please note I'm _not_ claiming this is the definitive fix for this issue.
> I'm still reviewing the logs, to see if there is something else here.

You can get a good approximation of this behaviour by using the taskset tool manually. The problem is that this is a little cumbersome on RHEL6.

Steps:
1. locate VDSM pid. Let's call this VDSM_PID (shell variable syntax)
2. run taskset manually, with something like

# for pid in $( ls /proc/$VDSM_PID/task/ ); do taskset -c -p 1 $pid; done

3. to verify the affinity of existing VDSM threads

# for pid in $( ls /proc/$VDSM_PID/task/ ); do taskset -c -p $pid; done

This setting will persist until the next VDSM restart.

With the above, it is easier to evaluate the benefits of CPU affinity without upgrading VDSM.
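
Putting the steps together (a sketch; it assumes the daemon's pid can be found with pgrep -f vdsm):

#!/bin/sh
# Pin every existing VDSM thread to CPU 1, then print the resulting affinity.
VDSM_PID=$(pgrep -f vdsm | head -n 1)
for tid in /proc/$VDSM_PID/task/*; do
    taskset -c -p 1 "${tid##*/}"    # set this thread's affinity to CPU 1
done
for tid in /proc/$VDSM_PID/task/*; do
    taskset -c -p "${tid##*/}"      # print the thread's current affinity
done

Threads created afterwards inherit the mask from the thread that spawned them, so this holds until the daemon restarts; the cpu_affinity option from comment 14 is the persistent way to get the same effect.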
Comment 16 Martin Tessun 2015-11-03 04:23:22 EST
(In reply to Francesco Romani from comment #15)
> 
> Steps:
> 1. locate VDSM pid. Let's call this VDSM_PID (shell variable syntax)
> 2. run taskset manually, with something like
> 
> # for pid in $( ls /proc/$VDSM_PID/task/ ); do taskset -c -p 1 $pid; done
> 
> 3. to verify the affinity of existing VDSM threads
> 
> # for pid in $( ls /proc/$VDSM_PID/task/ ); do taskset -c -p $pid; done
> 
> This setting will persist until the next VDSM restart.
> 
> With the above, it is easier to evaluate the benefits of CPU affinity
> without upgrading VDSM.



I just asked for the steps to be taken. I will report back with the results once they are available.

Leaving the needinfo on me until then.

Cheers,
Martin
Comment 17 Martin Tessun 2015-11-03 12:30:52 EST
Just sharing the feedback for pinning/affinity:

Before upgrading to 3.5.5, I wanted to test the CPU affinity, and I have to say that it seems to solve the strange CPU usage behaviour. I tried it on 4 hosts that were showing that "strange" usage, and right now it looks as expected.

Cheers,
Martin
Comment 18 Francesco Romani 2015-11-04 07:16:51 EST
A first scan of the provided logs revealed no suspicious activity.
Comment 21 Francesco Romani 2015-11-11 09:08:03 EST
The taskset patches landed in 3.5.x (https://gerrit.ovirt.org/#/c/46522/) and they are reported to work well (comment 17). No other issues were found, so moving to MODIFIED.
Comment 22 Red Hat Bugzilla Rules Engine 2015-11-11 09:19:24 EST
Fixed bug tickets must have target milestone set prior to fixing them. Please set the correct milestone and move the bugs back to the previous status after this is corrected.
Comment 23 Gil Klein 2015-11-16 12:16:37 EST
Michal, can this be verified based on BZ #1265205?
Comment 24 Michal Skrivanek 2015-11-16 12:19:32 EST
It is likely a duplicate. Indeed, it makes sense to test it once.
Comment 25 Gil Klein 2015-11-18 02:46:25 EST
Verified based on the verification of BZ #1265205
