Bug 1273783

Summary: After PLANNER-252 Planner slowed down chaining solving more than 20%
Product: [Retired] JBoss BRMS Platform 6
Reporter: jvahala
Component: OptaPlanner
Assignee: Geoffrey De Smet <gdesmet>
Status: CLOSED EOL
QA Contact: Lukáš Petrovický <lpetrovi>
Severity: high
Priority: urgent
Version: 6.2.0
CC: kverlaen, lpetrovi, rrajasek
Target Milestone: CR1
Keywords: Performance, Regression
Target Release: 6.2.0
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Type: Bug
Last Closed: 2020-03-27 19:05:21 UTC
Attachments:
- constr. heur. 9 results averaged, CR1 vs CR2
- results 100 averaged, CR1 vs CR2 vs snapshot from 30 October
- same as previous attachment but with snapshot from 3 November 2015
- benchmark, 20 Nov.

Description jvahala 2015-10-21 08:34:43 UTC
Description of problem:
I am going to add much bigger datasets so we can see how Planner scales up. If there is no performance loss on big datasets, I would leave it as it is; otherwise it would be appropriate to deal with this.


How reproducible:
compare results from optaplanner perf job

Comment 1 Geoffrey De Smet 2015-10-21 12:02:17 UTC
Based on Jiri's work, we currently believe the perf degradation sneaked in during the PLANNER-252 work (because CR1 was fine and the snapshot of 24-AUG-2015 wasn't). It reproduces on VRP's Time Windowed 400 dataset too, so this is serious.

Comment 2 Geoffrey De Smet 2015-10-23 15:10:03 UTC
Attempt 1 to reproduce this performance degradation does NOT reproduce (this also explains why I didn't see it when I pushed the changes of PLANNER-252 as I did a perf comparison then).

What I did (which is far less scientific than Jiri's JMH benchmark):
I checked out tags 6.3.0.CR1 and 6.3.0.CR2 into two separate, new IntelliJ projects. In both cases I ran VehicleRoutingApp, loaded the dataset "cvrptw-400customers" and solved it for a few minutes (side by side).

6.3.0.CR1: Solving ended: time spent (172700), best score (0hard/-12987815soft), average calculate count per second (19004), environment mode (REPRODUCIBLE).
6.3.0.CR2: Solving ended: time spent (172465), best score (0hard/-12987815soft), average calculate count per second (20202), environment mode (REPRODUCIBLE).

If we look at the ACC, it indicates that CR2 is actually 6% faster (instead of 20% slower) than CR1.
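The 6% figure follows directly from the two ACC numbers above; a quick sanity check in plain Java (the two constants are copied from the solver logs, the helper class is mine):

```java
import java.util.Locale;

public class AccRatio {
    public static void main(String[] args) {
        double cr1Acc = 19004.0; // CR1 average calculate count per second
        double cr2Acc = 20202.0; // CR2 average calculate count per second
        // Relative speedup of CR2 over CR1, in percent.
        double speedup = (cr2Acc / cr1Acc - 1.0) * 100.0;
        System.out.printf(Locale.ROOT, "CR2 is %.1f%% faster than CR1%n", speedup);
    }
}
```

So the manual side-by-side run points the opposite way from the JMH report, which is what motivates the theories below.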

So let's investigate further. Some theories to check:

I am using these VM arguments: "-Xmx1536M -server" - what is being used during the JMH benchmarks?
Theory A) JMH benchmarks use a memory size that is ok for CR1 but too low for CR2.
Theory B) There's a memory leak that affects performance and it only surfaces in longer testing.
Theory C) The UI is somehow a bottleneck that makes my manual runs useless. I did uncheck "refresh screen", but the score is still calculated for every new best solution in the GUI KieBase anyway.
Theory D) We're using different datasets, etc. But I doubt this is the case because the JMH benchmark report clearly used the same TW 400 dataset.

Comment 3 Geoffrey De Smet 2015-10-23 15:20:35 UTC
VisualVM shows that CR2 does use more memory than CR1, but I don't see a memory leak. For TW400, CR1 nicely goes up to 150MB, while CR2 goes up to 175MB. In both cases the GC activity is 13%.

Theory E) A hardware difference is causing the differing results.

Comment 4 Geoffrey De Smet 2015-10-23 15:29:54 UTC
Theory F) What's the log level of the JMH tests? What's the environmentMode? Maybe we added a trace message or an extra assert; those can heavily influence performance.

Comment 5 Geoffrey De Smet 2015-10-23 15:46:13 UTC
on dataset belgium-n2750-k55.vrp (which doesn't have time windows, so not expecting a gain in CR2 for it)
CR1: Solving ended: time spent (502933), best score (0hard/-150181soft), average calculate count per second (16696), environment mode (REPRODUCIBLE).
CR2: Solving ended: time spent (503291), best score (0hard/-150181soft), average calculate count per second (16432), environment mode (REPRODUCIBLE).
so 1% slower (negligible)

Comment 6 Geoffrey De Smet 2015-10-23 15:52:09 UTC
on cvrptw-400customers again, but incremental score calculation (instead of drools):

CR1: Solving ended: time spent (85020), best score (0hard/-11957959soft), average calculate count per second (156846), environment mode (REPRODUCIBLE).
CR2: Solving ended: time spent (85326), best score (0hard/-11978188soft), average calculate count per second (151254), environment mode (REPRODUCIBLE).
CR2 is 3% slower than CR1. Weird, as in the Drools comparison it's faster. Could be (bad) luck.

Comment 7 jvahala 2015-10-23 19:35:04 UTC
After Geoffrey's analysis I am going to start looking for mistakes I could have made :)

Comment 8 jvahala 2015-10-23 20:51:12 UTC
About the theories...
Right now, I can't say anything about B and C.

A: JMH used these params: -jvmArgs "-Xms6144m -Xmx6144m -XX:PermSize=512m -XX:MaxPermSize=768m" -gc true \

D: we have seen the performance loss not only on TW 400 but also on TSP datasets (with more than 2K entities), so we need to look beyond VRP and think about chaining in general.

E: unfortunately I can't see any performance information about the machine, but it was run on brmsperf05-rhel6-x86_64. I suppose it should be some regular PC workstation.

F: environment mode is set to REPRODUCIBLE, since I always need the same seed, and it logs the same as PRODUCTION. But I like this theory: slf4j was present as it usually would be in production (OK, I am not sure about this), but its log level was set to "warn" only.
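Theory F matters because an unguarded log statement pays for building its message even when the level is disabled. A self-contained sketch with java.util.logging standing in for slf4j (which offers parameterized messages for exactly this reason); the logger name and message are illustrative:

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class LogGuardSketch {
    // Stands in for an expensive message argument (score dump, move string, ...).
    static String expensiveToString() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            sb.append(i);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Logger log = Logger.getLogger("bench");
        log.setLevel(Level.WARNING); // comparable to the benchmark's "warn" level

        // Unguarded: expensiveToString() runs even though FINEST is disabled,
        // so a newly added trace message costs CPU on every solver step.
        log.finest("state = " + expensiveToString());

        // Guarded: message construction is skipped when the level is disabled.
        if (log.isLoggable(Level.FINEST)) {
            log.finest("state = " + expensiveToString());
        }
        System.out.println(log.isLoggable(Level.FINEST));
    }
}
```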

My own theory: the tests were very short, and Planner's performance might be worse at the start but then converge to better values. This is not yet tested and needs to be before we are sure what is happening.

Comment 9 jvahala 2015-10-23 20:58:52 UTC
For now, I am going to focus on construction heuristics only, since we saw a perf loss in CH too. The benchmarking rule will be simple: terminate as soon as the CH finishes.

This can give us a nice baseline, and the perf tests will finish much quicker than local search tests. I am going to use scaled datasets and no logging.
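A CH-only run can be expressed in the solver config by including only a construction heuristic phase, so the solver terminates when the CH finishes. A minimal sketch, assuming the OptaPlanner 6.x XML format (the domain classes and score config are omitted because they depend on the example):

```xml
<solver>
  <environmentMode>REPRODUCIBLE</environmentMode>
  <!-- solutionClass / entityClass / scoreDirectorFactory omitted for brevity -->
  <constructionHeuristic>
    <constructionHeuristicType>FIRST_FIT_DECREASING</constructionHeuristicType>
  </constructionHeuristic>
  <!-- No <localSearch> phase and no <termination>:
       solving ends as soon as the construction heuristic completes. -->
</solver>
```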

Comment 10 jvahala 2015-10-24 21:39:08 UTC
Created attachment 1086136 [details]
constr. heur.  9 results averaged, CR1 vs CR2

Comment 11 jvahala 2015-10-24 21:40:05 UTC
I attach the first results of construction heuristics. I've benchmarked 6.3.0.CR1 against 6.3.0.CR2. Results are in milliseconds, so lower is better. There is also a percentage comparison.

Numbers are averages over 9 results, so it's not(!) statistically strong; anyway, it gives us some hint.

The only change in the perf repo between those test runs is here: https://gitlab.mw.lab.eng.bos.redhat.com/jvahala/optaplanner-perf/commit/065a85388070a4586fa43affee695b740ca1f5b7

I am running statistically much stronger tests right now, but they take more than two days (for one Planner version).

Comment 12 Geoffrey De Smet 2015-10-26 07:34:08 UTC
Jiri, please give your theories a unique id too, to avoid confusion. Once either of us forks a theory, give each fork of the theory a unique id too. So:

Theory G: the tests were very short, and Planner's performance might be worse at the start but then converges to better values.

Comment 13 Geoffrey De Smet 2015-10-26 09:21:22 UTC
Theory H: It's Drools. Wrong, because Jiri's and my benchmarks used incremental Java score calculation to rule this out.

Theory I: the example code itself changed.

Comment 14 Geoffrey De Smet 2015-10-26 10:03:48 UTC
I managed to reproduce locally with VRP on dataset USA 10k (usa-n10000-k100.vrp) and incremental score calculation:

CH FFD only:
CR1: average calculate count per second (804373)
CR2: average calculate count per second (551718)
68%

LS LA with change and swap moves only:
Solving ended: time spent (1302094), best score (0hard/-3767996615soft), average calculate count per second (332692), environment mode (REPRODUCIBLE).
Solving ended: time spent (1302198), best score (0hard/-3772280714soft), average calculate count per second (279519), environment mode (REPRODUCIBLE).
84%

Comment 15 Geoffrey De Smet 2015-10-26 12:07:04 UTC
If I remove ArrivalTimeUpdatingVariableListener (and TimeWindowedCustomer), which aren't used by USA 10k because it has no time windows, I get:

CR1: Solving ended: time spent (1882298), best score (0hard/-3755403694soft), average calculate count per second (353721), environment mode (REPRODUCIBLE).
CR2: Solving ended: time spent (1882432), best score (0hard/-3757430725soft), average calculate count per second (328260), environment mode (REPRODUCIBLE).
93%

Comment 16 jvahala 2015-10-26 12:08:44 UTC
Interesting catch. 

BTW, wouldn't it be easier to start with TSP, since it's a much smaller problem?

Comment 17 Geoffrey De Smet 2015-10-26 12:55:40 UTC
Homberger_1000_RC110_0.vrp:

CH's:
CR1: Solving ended: time spent (1895), best score (0hard/-61013057soft), average calculate count per second (391706), environment mode (REPRODUCIBLE).
CR2: Solving ended: time spent (2245), best score (0hard/-61013057soft), average calculate count per second (330638), environment mode (REPRODUCIBLE).
84%

LS change and swap moves
CR1: Solving ended: time spent (487765), best score (0hard/-52697638soft), average calculate count per second (180849), environment mode (REPRODUCIBLE).
CR2: Solving ended: time spent (487933), best score (0hard/-52393907soft), average calculate count per second (193441), environment mode (REPRODUCIBLE).
107%

Comment 18 Geoffrey De Smet 2015-10-26 12:55:51 UTC
Homberger_1000_RC110_0.vrp LS change moves only
CR1: Solving ended: time spent (107468), best score (0hard/-56295277soft), average calculate count per second (296103), environment mode (REPRODUCIBLE).
CR2: Solving ended: time spent (107393), best score (0hard/-56923409soft), average calculate count per second (243960), environment mode (REPRODUCIBLE).
82%

Comment 19 Geoffrey De Smet 2015-10-26 14:11:52 UTC
One probable cause found through profiling:
  https://issues.jboss.org/browse/PLANNER-471
I am not yet convinced that this is the only problem, but I'll fix this one first and then we can reevaluate.

Comment 20 Geoffrey De Smet 2015-10-27 11:10:05 UTC
The SortedMap problem is fixed on master:
  http://github.com/droolsjbpm/optaplanner/commit/8d2037792
and on 6.3.x:
  http://github.com/droolsjbpm/optaplanner/commit/cc41efa10

Jiri, can you pull that and rerun the benchmarks? I don't believe this handles the entire performance regression, but it's a step in the right direction, so we need to measure, not guess.
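The linked commits replace a SortedMap in a hot path. As a self-contained illustration of why that substitution pays off (the collections and workload here are illustrative, not OptaPlanner's actual code): TreeMap does an O(log n) comparison walk per access, while HashMap is amortized O(1), yet plain key lookups return the same values either way:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class MapLookupSketch {
    public static void main(String[] args) {
        int n = 200_000;
        Map<Integer, Integer> sorted = new TreeMap<>(); // O(log n) per get/put
        Map<Integer, Integer> hashed = new HashMap<>(); // amortized O(1) per get/put
        for (int i = 0; i < n; i++) {
            sorted.put(i, i);
            hashed.put(i, i);
        }
        long sum = 0;
        for (int i = 0; i < n; i++) {
            // Both maps must agree on every lookup for the swap to be safe.
            sum += sorted.get(i) + hashed.get(i);
        }
        // Each key 0..n-1 is added twice, so the total is n * (n - 1).
        System.out.println(sum == (long) n * (n - 1));
    }
}
```

The swap only works when nothing depends on the iteration order, which is presumably what the profiling had to confirm.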

Comment 21 Geoffrey De Smet 2015-10-27 14:27:10 UTC
Another cause found through profiling: usage of HashSet.add()
  https://issues.jboss.org/browse/PLANNER-472
Not sure if this one is fixable though.

Comment 22 Geoffrey De Smet 2015-10-29 13:20:26 UTC
Fixed PLANNER-472 too. So the Set.add() perf loss is fixed on master:
  http://github.com/droolsjbpm/optaplanner/compare/47ee88a...3708619
and on 6.3.x:
  http://github.com/droolsjbpm/optaplanner/compare/cc41efa...9bdb0ec

Now I see performance in the same ballpark as CR1. Jiri's benchmarking should give us a good measurement of whether there's an actual perf loss or gain on TSP, VRP and VRPTW. Note that use cases more complex than VRPTW should definitely see a good overall performance gain due to the smart shadow variables.

Comment 23 jvahala 2015-11-01 18:01:09 UTC
A new attachment has been created: a summary of CR1, CR2 and the SNAPSHOT (30 Oct) on the CH perf test. We can see a great perf rise from CR2 to the snapshot, but the snapshot is still slower than CR1.

For VRP, it can be seen that time-windowed solutions decreased to ~91% of the original performance, but those without time windows decreased to ~84%.

Huge instances of Cloud Balance are slightly better, smaller ones slightly slower, so that's no big deal.

TSP is in general still slower by ~5%.

Comment 24 jvahala 2015-11-01 18:02:27 UTC
Created attachment 1088398 [details]
results 100 averaged, CR1 vs CR2 vs snapshot from 30. october

Comment 25 Geoffrey De Smet 2015-11-04 10:03:06 UTC
Those last benchmark results mentioned by Jiri might not be using the code with the last fix (due to Jenkins rebuild speed, JBoss Nexus up-to-dateness, Maven updatePolicy, etc., we can't guarantee it 100%, although it's likely). Rather than speculate, Jiri and I decided to wait for today's benchmarks before investigating further.

Comment 26 jvahala 2015-11-05 10:57:32 UTC
Created attachment 1090018 [details]
same as previous attachment but with snapshot from 3.11.2015

Comment 27 Geoffrey De Smet 2015-11-06 19:59:56 UTC
All fixes are in both of the last reports. It's better, but we still have some perf impact for simple VRP cases (at least for CH's).

Next things to benchmark:
LocalSearch on VRP with Time Windows on different sizes.

Comment 28 Geoffrey De Smet 2015-11-17 09:21:35 UTC
Relates to:
VRP TW datasets: build realistic Belgium and USA timewindowed datasets for demo's and benchmarking
  https://issues.jboss.org/browse/PLANNER-477

Comment 29 jvahala 2015-11-20 09:30:46 UTC
Created attachment 1097083 [details]
benchmark, 20 nov.

Comment 30 jvahala 2015-11-20 15:02:30 UTC
We can see a speed improvement against CR1, but there are still some cases where Planner is 10-20% slower. These problems need to be fixed, but I suggest splitting this issue into separate ones.

Moving to VERIFIED.