Bug 1182097 - [RFE][scale] improve MOM performance
Summary: [RFE][scale] improve MOM performance
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: mom
Classification: oVirt
Component: RFEs
Version: ---
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-3.6.2
Target Release: 0.5.1
Assignee: Adam Litke
QA Contact: Eldad Marciano
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2015-01-14 12:48 UTC by Michal Skrivanek
Modified: 2016-03-11 07:21 UTC (History)

Fixed In Version: mom-0.5.0
Clone Of:
Environment:
Last Closed: 2016-03-11 07:21:01 UTC
oVirt Team: SLA
Embargoed:
dfediuck: ovirt-3.6.z?
sherold: Triaged+
rule-engine: planning_ack?
dfediuck: devel_ack+
rule-engine: testing_ack+



Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1227714 0 unspecified CLOSED [RFE] Allow mom to run as standalone process 2021-02-22 00:41:40 UTC
oVirt gerrit 41602 0 None None None Never

Internal Links: 1227714

Description Michal Skrivanek 2015-01-14 12:48:48 UTC
This is a follow-up bug to the scaling issues identified as part of bug 1177634.

The profiling data shows a significant amount of time spent in MOM rules parsing. Perhaps we don't need to parse that often, or we can find a different way to do it.

A separate problem is the thread-per-VM approach in MOM, which contributes up to 50% of the total threads we have in vdsm. This is a serious performance problem on beefy systems where we run ~200 VMs per host.

Comment 1 Michal Skrivanek 2015-01-19 08:39:21 UTC
From the parent bug:

4. MOM
------

spark.py:211(Parser.buildState) is taking much more time on 3.5 - it is called
much more often and consumes 40 seconds *in* the function in 3.5, but 0 seconds in
3.4 - looks like a bad change in this function.

3.4:

 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  47369    0.000    0.000   52.608    0.001 spark.py:211(Parser.buildState)

3.5:

 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 157109   40.947    0.000   91.926    0.001 spark.py:211(Parser.buildState)


I suggest repeating this profiling on a much bigger machine so we can check
the handling of many more VMs.
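
For readers who want to reproduce a table like the ones above, here is a minimal sketch using Python's standard cProfile and pstats modules. The functions build_state and parse_policy are hypothetical stand-ins, not MOM or spark.py code; only the profiling calls are real. Note that tottime counts time spent inside the function itself, while cumtime also includes time spent in everything it calls.

```python
import cProfile
import io
import pstats


def build_state(n):
    """Hypothetical stand-in for an expensive parser step."""
    return sum(i * i for i in range(n))


def parse_policy(repeats):
    """Hypothetical caller that re-runs the expensive step on every call."""
    return [build_state(1000) for _ in range(repeats)]


def profile_report(func, *args):
    """Run func under cProfile and return the stats table as text."""
    profiler = cProfile.Profile()
    profiler.enable()
    func(*args)
    profiler.disable()
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    # Sort by cumulative time and restrict the listing to entries
    # whose name matches the regular expression "build_state".
    stats.sort_stats("cumulative").print_stats("build_state")
    return out.getvalue()


report = profile_report(parse_policy, 50)
print(report)
```

Comparing the ncalls and tottime columns of two such reports (for example, one per release) is what exposes a difference like the one between 3.4 and 3.5 above.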

Comment 2 Michal Skrivanek 2015-01-21 07:47:59 UTC
Requesting 3.5.z since it seems like a regression.

Comment 4 Eyal Edri 2015-02-25 08:39:37 UTC
3.5.1 is already full of bugs (over 80), and since none of these bugs were added as urgent for the 3.5.1 release in the tracker bug, moving to 3.5.2.

Comment 5 Doron Fediuck 2015-04-12 07:55:18 UTC
Michal,
this should be a duplicate of bug 1182094.
Please close it as such.

Comment 6 Michal Skrivanek 2015-04-12 08:44:15 UTC
(In reply to Doron Fediuck from comment #5)
> Michal,
> this should be a duplicate of bug 1182094.
> Please close it as such.

These are two separate issues. This bug is about MOM performance, which has nothing to do with the NUMA problems in bug 1182094.

Comment 7 Doron Fediuck 2015-04-12 08:58:14 UTC
(In reply to Michal Skrivanek from comment #6)
> These are two separate issues. This bug is about MOM performance, which has
> nothing to do with the NUMA problems in bug 1182094.

In this case it's not a .z item, as the desired fix is changing the architecture
vdsm and mom are using in 3.5.
It may or may not be done for 3.6.0 depending on capacity.

1. Fixing target version.
2. Removed regression as the changes are related to NUMA which is a different BZ.

Comment 8 Michal Skrivanek 2015-04-12 09:27:01 UTC
(In reply to Doron Fediuck from comment #7)
> In this case it's not a .z item, as the desired fix is changing the
> architecture vdsm and mom are using in 3.5.
> It may or may not be done for 3.6.0 depending on capacity.
> 
> 1. Fixing target version.
> 2. Removed regression as the changes are related to NUMA which is a
> different BZ.

I understand it's problematic to fix, but see comment #1 which does indicate this is a regression in 3.5 since 3.4.
I don't think we should ignore it without sufficient scalability testing.
Barak, thoughts?

Comment 10 Doron Fediuck 2015-04-12 09:48:22 UTC
(In reply to Michal Skrivanek from comment #8)
> I understand it's problematic to fix, but see comment #1 which does indicate
> this is a regression in 3.5 since 3.4.
> I don't think we should ignore it without sufficient scalability testing.
> Barak, thoughts?

As explained in comment 7, the regressions are NUMA-related and handled as part
of bug 1182094. Nothing changed in the MoM architecture, so there is no regression
other than the one handled in 1182094.

Comment 11 Doron Fediuck 2015-04-12 09:58:25 UTC
Based on current information this RFE should cover:
1. Reducing number of threads in MoM (avoid thread per VM where possible).
2. Consider changing mom-vdsm architecture to a separate service (as mom used to be prior to current implementation).
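
As an illustration of item 1 only, here is a minimal sketch of polling many VMs from a small fixed pool of worker threads instead of one thread per VM. It is not the change that was merged for this bug; collect_stats, collect_all, the VM names, and the pool size of 4 are all assumptions made for the example.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def collect_stats(vm_name):
    """Hypothetical per-VM collector; records which worker thread ran it."""
    return {"vm": vm_name, "thread": threading.current_thread().name}


def collect_all(vm_names, max_workers=4):
    """Collect stats for every VM using at most max_workers threads."""
    with ThreadPoolExecutor(max_workers=max_workers,
                            thread_name_prefix="collector") as pool:
        # pool.map preserves input order and reuses the same few threads.
        return list(pool.map(collect_stats, vm_names))


vms = ["vm-%d" % i for i in range(200)]
results = collect_all(vms)
threads_used = {r["thread"] for r in results}
print(len(results), "VMs polled by", len(threads_used), "threads")
```

With this shape the thread count is bounded by the pool size rather than growing with the number of VMs on the host, at the cost of collections being queued when all workers are busy.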

Comment 12 Doron Fediuck 2015-04-12 10:00:14 UTC
Adam,
any additional improvements that should be considered for this RFE?

Comment 13 Adam Litke 2015-04-14 19:50:58 UTC
Nothing additional but just a comment about the spark.py profile data:

spark.py is the lisp lexer/parser we're using, and it hasn't changed since its initial import.  The line being referenced in the profile is only called as a result of the policy being changed.  Did we change the frequency of calls to the vdsm setMOMPolicyParameters API?
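
If the policy is in fact being resubmitted more often than it changes, one possible mitigation is to re-parse only when the policy text differs from the last one seen. The sketch below is a hypothetical illustration of that idea, not MOM code: PolicyCache and toy_parse are invented names, and toy_parse merely stands in for the real parser.

```python
import hashlib


class PolicyCache:
    """Re-parse a policy only when its text changes (hypothetical helper)."""

    def __init__(self, parse):
        self._parse = parse
        self._digest = None
        self._parsed = None
        self.parse_calls = 0  # counts how often the parser actually ran

    def get(self, policy_text):
        digest = hashlib.sha256(policy_text.encode("utf-8")).hexdigest()
        if digest != self._digest:
            # Text differs from the cached policy: parse and remember it.
            self._parsed = self._parse(policy_text)
            self._digest = digest
            self.parse_calls += 1
        return self._parsed


def toy_parse(text):
    """Stand-in for the real parser: split into whitespace tokens."""
    return text.split()


cache = PolicyCache(toy_parse)
for _ in range(3):
    cache.get("(def x 1)")   # parsed once, then served from the cache
cache.get("(def x 2)")       # different text, parsed again
print(cache.parse_calls)     # 2
```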

Comment 14 Scott Herold 2015-04-27 16:52:30 UTC
Removing from 3.6 due to capacity

Comment 15 Yaniv Lavi 2015-12-01 18:37:43 UTC
I see this was moved to future, but a merged patch is attached.
Is this fixed?

Comment 16 Michal Skrivanek 2015-12-02 11:16:36 UTC
The comment #11 items are covered as far as I know. This can be closed as fixed on my side, at Martin's discretion.
Anything on Adam's comment #13?

Comment 17 Red Hat Bugzilla Rules Engine 2015-12-02 14:57:20 UTC
Fixed bug tickets must have version flags set prior to fixing them. Please set the correct version flags and move the bugs back to the previous status after this is corrected.

Comment 18 Red Hat Bugzilla Rules Engine 2015-12-02 14:57:20 UTC
Bug tickets must have version flags set prior to targeting them to a release. Please ask maintainer to set the correct version flags and only then set the target milestone.

Comment 19 Red Hat Bugzilla Rules Engine 2015-12-02 14:57:20 UTC
Target release should be placed once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for an oVirt release.

Comment 20 Red Hat Bugzilla Rules Engine 2015-12-02 14:58:58 UTC
This request has been proposed for two releases. This is invalid flag usage. The ovirt-future release flag has been cleared. If you wish to change the release flag, you must clear one release flag and then set the other release flag to ?.

Comment 21 Red Hat Bugzilla Rules Engine 2015-12-28 11:51:19 UTC
Target release should be placed once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for an oVirt release.

Comment 22 Eyal Edri 2016-01-13 10:59:21 UTC
AFAIK mom 0.5.1 is already released; shouldn't this bug be ON_QA?

Comment 23 Martin Sivák 2016-01-13 11:50:53 UTC
It was released so long ago that CLOSED CURRENTRELEASE might be better. Objections?

Comment 24 Red Hat Bugzilla Rules Engine 2016-01-13 11:50:57 UTC
Bug tickets that are moved to testing must have target release set to make sure tester knows what to test. Please set the correct target release before moving to ON_QA.

Comment 25 Eyal Edri 2016-01-13 12:05:34 UTC
It can't be closed as CURRENTRELEASE if the milestone is ovirt-3.6.2.
We need to wait until oVirt 3.6.2 is GA.

Comment 26 Gil Klein 2016-02-24 14:33:45 UTC
Verified based on the results of https://bugzilla.redhat.com/show_bug.cgi?id=1177634#c103

