This is a follow-up bug to the scaling issues identified as part of bug 1177634. The profiling data shows a significant amount of time spent in MOM rules parsing. Perhaps we don't need to do it that often, or we can find a different way. A separate problem is the thread-per-VM approach in MOM, which accounts for up to 50% of the total threads we have in vdsm. That is a serious performance problem on beefy systems where we run ~200 VMs per host.
From the parent bug:

4. MOM
------

spark.py:211(Parser.buildState) is taking much more time on 3.5. It is called much more often and consumes 40 seconds *in* the function in 3.5, but 0 seconds in 3.4. This looks like a bad change in this function.

3.4:

ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
 47369    0.000    0.000   52.608    0.001  spark.py:211(Parser.buildState)

3.5:

ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
157109   40.947    0.000   91.926    0.001  spark.py:211(Parser.buildState)

I suggest repeating this profiling on a much bigger machine so we can check the handling of many more VMs.
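For anyone repeating the measurement, here is a minimal sketch of how a table in this format can be produced with Python's standard cProfile and pstats modules. The functions `build_state` and `workload` are hypothetical stand-ins for the hot parser path, not the real spark.py or MOM code.

```python
import cProfile
import io
import pstats


def build_state(n):
    # Hypothetical stand-in for a hot parser function; not the real spark.py code.
    return sum(i * i for i in range(n))


def workload(calls):
    # Call the hot function repeatedly, as repeated policy parsing would.
    for _ in range(calls):
        build_state(100)


def profile_report(calls, pattern="build_state"):
    """Profile workload() and return the pstats text restricted to `pattern`."""
    profiler = cProfile.Profile()
    profiler.enable()
    workload(calls)
    profiler.disable()

    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    # Sort by cumulative time and print only entries matching the pattern.
    stats.sort_stats("cumulative").print_stats(pattern)
    return out.getvalue()


if __name__ == "__main__":
    print(profile_report(1000))
```

Running the same script against both versions and comparing the ncalls and tottime columns for the same function gives the kind of side-by-side view quoted above.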
requesting 3.5.z since it seems like a regression
3.5.1 is already full with bugs (over 80), and since none of these bugs were added as urgent for 3.5.1 release in the tracker bug, moving to 3.5.2
Michal, this should be a duplicate of bug 1182094. Please close it as such.
(In reply to Doron Fediuck from comment #5)
> Michal,
> this should be a duplicate of bug 1182094.
> Please close it as such.

These are two separate issues. This bug is about MOM performance, which has nothing to do with the NUMA problems in bug 1182094.
(In reply to Michal Skrivanek from comment #6)
> These are two separate issues. This bug is about MOM performance which has
> nothing to do with NUMA problems in bug 1182094

In this case it's not a .z item, as the desired fix is changing the architecture vdsm and mom are using in 3.5. It may or may not be done for 3.6.0 depending on capacity.

1. Fixing target version.
2. Removed regression as the changes are related to NUMA, which is a different BZ.
(In reply to Doron Fediuck from comment #7)
> In this case it's not a .z item as the desired fix is changing the
> architecture vdsm and mom are using in 3.5.
> It may or may not be done for 3.6.0 depending on capacity.
>
> 1. Fixing target version.
> 2. Removed regression as the changes are related to NUMA which is a
> different BZ.

I understand it's problematic to fix, but see comment #1, which does indicate this is a regression in 3.5 since 3.4. I don't think we should ignore it without sufficient scalability testing. Barak, thoughts?
(In reply to Michal Skrivanek from comment #8)
> I understand it's problematic to fix, but see comment #1 which does indicate
> this is a regression in 3.5 since 3.4.
> I don't think we should ignore it without a sufficient scalability testing.
> Barak, thoughts?

As explained in comment 7, the regressions are NUMA related and handled as part of bug 1182094. Nothing changed in the MoM architecture, so there is no regression other than the one handled in 1182094.
Based on current information this RFE should cover:
1. Reducing number of threads in MoM (avoid thread per VM where possible).
2. Consider changing mom-vdsm architecture to a separate service (as mom used to be prior to current implementation).
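To illustrate item 1, here is a minimal sketch of polling many VMs from a bounded worker pool instead of one thread per VM. It uses the standard `concurrent.futures.ThreadPoolExecutor`; `collect_vm_stats` and `poll_all` are hypothetical names for illustration and do not reflect MoM's actual collector code or the patch that was eventually merged.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def collect_vm_stats(vm_id):
    # Hypothetical collector; a real one would query the hypervisor for this VM.
    return {"vm": vm_id, "thread": threading.current_thread().name}


def poll_all(vm_ids, max_workers=8):
    """Poll every VM once using a bounded pool rather than one thread per VM."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the input ids.
        return list(pool.map(collect_vm_stats, vm_ids))


if __name__ == "__main__":
    results = poll_all(range(200))
    used = {r["thread"] for r in results}
    print(len(results), "VMs polled on", len(used), "threads")
```

With this shape the thread count is set by `max_workers` and stays constant as the number of VMs per host grows.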
Adam, any additional improvements that should be considered for this RFE?
Nothing additional, just a comment about the spark.py profile data: spark.py is the Lisp lexer/parser we're using, and it hasn't changed since its initial import. The line referenced in the profile is only called as a result of the policy being changed. Did we change the frequency of calls to the vdsm setMOMPolicyParameters API?
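If the policy turns out to be re-sent unchanged, one possible mitigation is to skip parsing when the text is identical to the last one parsed. The sketch below is only an illustration under that assumption: `PolicyCache` and the `parse` callable are hypothetical and are not part of MoM or vdsm.

```python
import hashlib


class PolicyCache:
    """Re-parse a policy only when its text actually changes."""

    def __init__(self, parse):
        self._parse = parse      # hypothetical parser callable
        self._digest = None      # digest of the last parsed policy text
        self._parsed = None      # cached parse result
        self.parse_calls = 0     # counts real parses, for inspection

    def get(self, policy_text):
        digest = hashlib.sha256(policy_text.encode("utf-8")).hexdigest()
        if digest != self._digest:
            # Text differs from the cached one, so parse and remember it.
            self._parsed = self._parse(policy_text)
            self._digest = digest
            self.parse_calls += 1
        return self._parsed


if __name__ == "__main__":
    cache = PolicyCache(lambda text: text.split())
    cache.get("(def x 1)")
    cache.get("(def x 1)")   # unchanged, served from the cache
    print(cache.parse_calls)
```

This only helps if the repeated calls carry the same policy text. If the parameters really change on every call, the call frequency itself is what needs to be reduced.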
Removing from 3.6 due to capacity
I see this was moved to future, but a merged patch is attached. Is this fixed?
comment #11 items are covered as far as I know. This can be closed as fixed on my side, at Martin's discretion. Anything on Adam's comment #14?
Fixed bug tickets must have version flags set prior to fixing them. Please set the correct version flags and move the bugs back to the previous status after this is corrected.
Bug tickets must have version flags set prior to targeting them to a release. Please ask maintainer to set the correct version flags and only then set the target milestone.
Target release should be placed once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for an oVirt release.
This request has been proposed for two releases. This is invalid flag usage. The ovirt-future release flag has been cleared. If you wish to change the release flag, you must clear one release flag and then set the other release flag to ?.
afaik mom 0.5.1 is already released, shouldn't this bug be ON_QA?
It was released so long ago that CLOSED CURRENT RELEASE might be better. Objections?
Bug tickets that are moved to testing must have target release set to make sure tester knows what to test. Please set the correct target release before moving to ON_QA.
It can't be closed as CURRENT RELEASE if the milestone is ovirt-3.6.2. We need to wait until oVirt 3.6.2 is GA.
Verified based on the results of https://bugzilla.redhat.com/show_bug.cgi?id=1177634#c103