Bug 1273783
Summary: | After PLANNER-252 Planner slowed down chaining solving more than 20% | |
---|---|---|---
Product: | [Retired] JBoss BRMS Platform 6 | Reporter: | jvahala
Component: | OptaPlanner | Assignee: | Geoffrey De Smet <gdesmet>
Status: | CLOSED EOL | QA Contact: | Lukáš Petrovický <lpetrovi>
Severity: | high | Docs Contact: |
Priority: | urgent | |
Version: | 6.2.0 | CC: | kverlaen, lpetrovi, rrajasek
Target Milestone: | CR1 | Keywords: | Performance, Regression
Target Release: | 6.2.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2020-03-27 19:05:21 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Attachments: | | |
Description
jvahala
2015-10-21 08:34:43 UTC
Based on Jiri's work, we currently believe the perf degradation sneaked in during the PLANNER-252 work (because CR1 was fine and the snapshot of 24-AUG-2015 wasn't). It reproduces on VRP's Time Windowed 400 dataset too, so this is serious.

Attempt 1 to reproduce this performance degradation does NOT reproduce it (this also explains why I didn't see it when I pushed the changes of PLANNER-252, as I did a perf comparison then). What I did (which is far less scientific than Jiri's JMH benchmark): I checked out git tags 6.3.0.CR1 and 6.3.0.CR2 in 2 different, new IntelliJ projects. In both cases I ran VehicleRoutingApp, loaded the dataset "cvrptw-400customers" and solved it for a few minutes (side by side).

6.3.0.CR1: Solving ended: time spent (172700), best score (0hard/-12987815soft), average calculate count per second (19004), environment mode (REPRODUCIBLE).
6.3.0.CR2: Solving ended: time spent (172465), best score (0hard/-12987815soft), average calculate count per second (20202), environment mode (REPRODUCIBLE).

If we look at the ACC, it indicates that CR2 is actually 6% faster (instead of 20% slower) than CR1. So let's investigate further. Some theories to investigate (I am using these VM arguments: "-Xmx1536M -server" - what is being used during the JMH benchmarks?):

Theory A) The JMH benchmarks use a memory size that is OK for CR1 but too low for CR2.
Theory B) There's a memory leak that affects performance, and it only surfaces in longer runs.
Theory C) The UI is somehow a bottleneck that makes my manual runs useless. I did uncheck "refresh screen", but the score is still calculated on every new best solution in the GUI KieBase anyway.
Theory D) We're using different datasets, etc. But I doubt this is the case, because the JMH benchmark report clearly used the same TW 400 dataset.

VisualVM shows that CR2 does use more memory than CR1, but I don't see a memory leak. For TW 400, CR1 nicely goes up to 150MB, while CR2 goes up to 175MB. In both cases the GC activity is 13%.
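The 6% figure above is just the ratio of the two ACC values. A minimal sketch of that arithmetic (the class and method names are mine, purely illustrative, not from the OptaPlanner code base):

```java
// Sketch: compare two "average calculate count per second" (ACC) values.
// Class and method names are illustrative, not part of OptaPlanner.
public class AccComparison {

    /** Returns the candidate's speed as a percentage of the baseline's speed. */
    static double speedPercentage(long baselineAcc, long candidateAcc) {
        return 100.0 * candidateAcc / baselineAcc;
    }

    public static void main(String[] args) {
        // Figures from the side-by-side cvrptw-400customers run above.
        double pct = speedPercentage(19004, 20202);
        System.out.printf("CR2 runs at %.1f%% of CR1's speed%n", pct);
    }
}
```

With the numbers from the run above this yields roughly 106%, i.e. CR2 about 6% faster than CR1.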
Theory E) A hardware difference is causing the differing results.
Theory F) What's the log level of the JMH tests? What's the environmentMode? Maybe we added a trace message or an extra assert; this can heavily influence performance.

On dataset belgium-n2750-k55.vrp (which doesn't have time windows, so not expecting a gain in CR2 for it):
CR1: Solving ended: time spent (502933), best score (0hard/-150181soft), average calculate count per second (16696), environment mode (REPRODUCIBLE).
CR2: Solving ended: time spent (503291), best score (0hard/-150181soft), average calculate count per second (16432), environment mode (REPRODUCIBLE).
So 1% slower (ignorable).

On cvrptw-400customers again, but with incremental score calculation (instead of Drools):
CR1: Solving ended: time spent (85020), best score (0hard/-11957959soft), average calculate count per second (156846), environment mode (REPRODUCIBLE).
CR2: Solving ended: time spent (85326), best score (0hard/-11978188soft), average calculate count per second (151254), environment mode (REPRODUCIBLE).
CR2 is 3% slower than CR1. Weird, as in the Drools comparison it's faster. Could be (un)luck.

After Geoffrey's analysis I am going to look for mistakes I could have made :) About the theories: right now I can't say anything about B and C.
A: JMH used params: -jvmArgs "-Xms6144m -Xmx6144m -XX:PermSize=512m -XX:MaxPermSize=768m" -gc true \
D: We have seen the performance loss not only on TW 400 but also on TSP datasets (with more than 2K entities), so we need to step away from VRP and think about chaining in general.
E: Unfortunately I can't see any performance information about the machine, but it ran on brmsperf05-rhel6-x86_64; I suppose it is some regular PC workstation.
F: environmentMode is set to REPRODUCIBLE, since I always need the same seed, and it logs the same as PRODUCTION.
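One cheap way to rule Theory A in or out is to have the benchmarked JVM itself report the heap limits it actually got, and compare that output between the CR1 and CR2 runs. A minimal diagnostic sketch (my own helper, not part of the benchmark harness):

```java
// Sketch for Theory A: print the heap limits the forked benchmark JVM
// actually received, so CR1 and CR2 runs can be checked for identical
// memory settings. This is an ad-hoc diagnostic, not part of JMH.
public class HeapReport {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        System.out.printf("max heap:   %d MB%n", rt.maxMemory() / (1024 * 1024));
        System.out.printf("total heap: %d MB%n", rt.totalMemory() / (1024 * 1024));
        System.out.printf("free heap:  %d MB%n", rt.freeMemory() / (1024 * 1024));
    }
}
```

If both versions report the same max heap (here 6144 MB from the -Xmx6144m JMH argument), Theory A can be discarded.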
But I like this theory. slf4j was present, as it usually would be in production (OK, I am not sure about this), but its log level was set to "warn" only.

My own theory: the tests were very short, and Planner's performance might be worse at the start but then converge to better values. This is not yet tested and needs to be, so we are sure what is happening.

For now, I am going to focus on construction heuristics only, since we saw the perf loss in CH too. The rule for benchmarking will be simple: end when the CH is finished. This gives us a nice baseline, and the perf tests will finish much more quickly than local search tests. I am going to use scaled datasets and no logging.

Created attachment 1086136 [details]
constr. heur. 9 results averaged, CR1 vs CR2
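Ending a run exactly when the construction heuristic finishes needs no explicit termination in OptaPlanner: a solver config whose only phase is a `<constructionHeuristic>` stops after that phase. A sketch of such a CH-only benchmark config (the class names are assumed from the stock VRP example and would need adjusting to the actual benchmarked domain):

```xml
<!-- Sketch: CH-only benchmark config. The solver stops when its single
     construction heuristic phase finishes, so no <termination> is needed.
     Class names assumed from the OptaPlanner VRP example. -->
<solver>
  <environmentMode>REPRODUCIBLE</environmentMode>
  <solutionClass>org.optaplanner.examples.vehiclerouting.domain.VehicleRoutingSolution</solutionClass>
  <entityClass>org.optaplanner.examples.vehiclerouting.domain.Customer</entityClass>
  <scoreDirectorFactory>
    <incrementalScoreCalculatorClass>org.optaplanner.examples.vehiclerouting.solver.score.VehicleRoutingIncrementalScoreCalculator</incrementalScoreCalculatorClass>
  </scoreDirectorFactory>
  <constructionHeuristic>
    <constructionHeuristicType>FIRST_FIT_DECREASING</constructionHeuristicType>
  </constructionHeuristic>
</solver>
```

Because the CH is deterministic under REPRODUCIBLE mode, wall-clock time to finish the phase is directly comparable between CR1 and CR2.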
I attach the first results for construction heuristics. I've benchmarked 6.3.0.CR1 against 6.3.0.CR2. Results are in milliseconds, so lower is better. There is also a percentage comparison. The numbers are averages over 9 runs, so they are not(!) statistically strong, but they give us some hint. The only change in the perf repo between those test runs is here: https://gitlab.mw.lab.eng.bos.redhat.com/jvahala/optaplanner-perf/commit/065a85388070a4586fa43affee695b740ca1f5b7 I am running statistically much stronger tests right now, but they take more than two days (for one Planner version).

Jiri, please give your theories a unique id too, to avoid confusion. Once either of us forks a theory, give each fork of the theory a unique id too. So:
Theory G: the tests were very short and Planner's performance might be worse at the start but then converge to better values.
Theory H: It's Drools. Wrong, because Jiri's and my benchmarks used incremental java score calculation to rule this out.
Theory I: the example code itself changed.

I managed to reproduce locally with VRP on dataset USA 10k (usa-n10000-k100.vrp) and incremental score calculation:
CH FFD only:
CR1: average calculate count per second (804373)
CR2: average calculate count per second (551718)
68%
LS LA with change and swap moves only:
CR1: Solving ended: time spent (1302094), best score (0hard/-3767996615soft), average calculate count per second (332692), environment mode (REPRODUCIBLE).
CR2: Solving ended: time spent (1302198), best score (0hard/-3772280714soft), average calculate count per second (279519), environment mode (REPRODUCIBLE).
84%
If I remove ArrivalTimeUpdatingVariableListener (and TimeWindowedCustomer), which aren't used by USA 10k because it has no time windows, I get:
CR1: Solving ended: time spent (1882298), best score (0hard/-3755403694soft), average calculate count per second (353721), environment mode (REPRODUCIBLE).
CR2: Solving ended: time spent (1882432), best score (0hard/-3757430725soft), average calculate count per second (328260), environment mode (REPRODUCIBLE).
93%
Interesting catch. By the way, wouldn't it be easier to start with TSP, since it's a much smaller problem?

Homberger_1000_RC110_0.vrp, CH's:
CR1: Solving ended: time spent (1895), best score (0hard/-61013057soft), average calculate count per second (391706), environment mode (REPRODUCIBLE).
CR2: Solving ended: time spent (2245), best score (0hard/-61013057soft), average calculate count per second (330638), environment mode (REPRODUCIBLE).
84%
LS change and swap moves:
CR1: Solving ended: time spent (487765), best score (0hard/-52697638soft), average calculate count per second (180849), environment mode (REPRODUCIBLE).
CR2: Solving ended: time spent (487933), best score (0hard/-52393907soft), average calculate count per second (193441), environment mode (REPRODUCIBLE).
107%
Homberger_1000_RC110_0.vrp, LS change moves only:
CR1: Solving ended: time spent (107468), best score (0hard/-56295277soft), average calculate count per second (296103), environment mode (REPRODUCIBLE).
CR2: Solving ended: time spent (107393), best score (0hard/-56923409soft), average calculate count per second (243960), environment mode (REPRODUCIBLE).
82%

One probable cause found through profiling: https://issues.jboss.org/browse/PLANNER-471 I am not yet convinced that it is the only problem, but I'll fix this one first and then we can reevaluate.

The SortedMap problem is fixed on master: http://github.com/droolsjbpm/optaplanner/commit/8d2037792 and on 6.3.x: http://github.com/droolsjbpm/optaplanner/commit/cc41efa10 Jiri, can you pull that and rerun the benchmarks? I don't believe this handles the entire performance regression, but it's a step in the right direction, and we need to measure, not guess.

Another cause found through profiling: usage of HashSet.add() https://issues.jboss.org/browse/PLANNER-472 Not sure if this one is fixable though.
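The exact SortedMap change is in the commits linked above; the general pattern behind that class of fix is worth noting. A sketch (my own reconstruction, not the actual OptaPlanner code): when a map on a hot path is only used for key lookups and sorted iteration is never needed, a TreeMap's O(log n) operations can be swapped for a HashMap's expected O(1) ones without changing any results.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Illustrative reconstruction of the SortedMap-on-a-hot-path pattern.
// Names are mine; the real fix is in the commits linked above.
public class ChainLookup {

    /** Original style: O(log n) per lookup, keeps keys sorted (unused here). */
    static Integer lookupSorted(TreeMap<String, Integer> chainIndex, String anchor) {
        return chainIndex.get(anchor);
    }

    /** Replacement: same result per key, expected O(1) per lookup. */
    static Integer lookupHashed(Map<String, Integer> chainIndex, String anchor) {
        return chainIndex.get(anchor);
    }

    public static void main(String[] args) {
        TreeMap<String, Integer> sorted = new TreeMap<>();
        sorted.put("vehicle-1", 42);
        Map<String, Integer> hashed = new HashMap<>(sorted);
        // Both variants agree; only the per-operation cost differs.
        System.out.println(lookupSorted(sorted, "vehicle-1")
                .equals(lookupHashed(hashed, "vehicle-1")));
    }
}
```

In a move evaluation loop that runs hundreds of thousands of times per second, that per-lookup constant is exactly the kind of cost that shows up as a double-digit ACC regression.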
Fixed PLANNER-472 too. So the Set.add() perf loss is fixed on master: http://github.com/droolsjbpm/optaplanner/compare/47ee88a...3708619 and on 6.3.x: http://github.com/droolsjbpm/optaplanner/compare/cc41efa...9bdb0ec Now I see performance in the same ballpark as CR1. Jiri's benchmarking should give us a good measurement of whether there's an actual perf loss or gain on TSP, VRP and VRPTW. Note that use cases more complex than VRPTW should definitely see a good overall performance gain due to the smart shadow variables.

New attachment created: a summary of CR1, CR2 and SNAPSHOT (30. 10.) on the CH perf test. We can see a great perf rise from CR2 to the snapshot, but the snapshot is still slower than CR1. For VRP, time windowed solutions decreased to ~91% of the original performance, but solutions without time windows decreased to ~84%. Huge instances of cloud balance are slightly better, smaller ones slightly slower, so no big deal there. TSP is in general still slower by ~5%.

Created attachment 1088398 [details]
results 100 averaged, CR1 vs CR2 vs snapshot from 30. october
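The compare links above hold the real PLANNER-472 change; as a hedged illustration of why HashSet.add() shows up in a profile at all, consider the generic pattern below (entirely my own sketch, not OptaPlanner code). Tracking "already visited" entities with a HashSet costs a hash computation plus a table probe per move; when the entities are under the solver's control, a generation counter stored on the entity turns the same test into a single field compare.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only: two equivalent ways to track visited entities.
// Names and structure are mine, not the actual PLANNER-472 fix.
public class VisitedTracking {

    static final class Entity {
        long visitedGeneration; // 0 = never visited
    }

    /** Set-based tracking: one hash + probe per call. */
    static boolean markVisitedWithSet(Set<Entity> visited, Entity e) {
        return visited.add(e);
    }

    /** Field-based tracking: one long compare + store per call. */
    static boolean markVisitedWithGeneration(long currentGeneration, Entity e) {
        if (e.visitedGeneration == currentGeneration) {
            return false;
        }
        e.visitedGeneration = currentGeneration;
        return true;
    }

    public static void main(String[] args) {
        Entity e = new Entity();
        Set<Entity> visited = new HashSet<>();
        // Both strategies agree: the first mark succeeds, the second is a no-op.
        System.out.println(markVisitedWithSet(visited, e));
        System.out.println(markVisitedWithSet(visited, e));
        System.out.println(markVisitedWithGeneration(1L, e));
        System.out.println(markVisitedWithGeneration(1L, e));
    }
}
```

The generation-counter variant also avoids clearing the set between moves: bumping the counter invalidates all previous marks at once.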
Those last benchmark results mentioned by Jiri might not be using the code with the last fix (due to Jenkins rebuild speed, JBoss Nexus up-to-dateness, Maven updatePolicy, etc., we can't guarantee it 100%, although it's likely). Rather than speculate, Jiri and I decided to wait for today's benchmarks before investigating further.

Created attachment 1090018 [details]
same as previous attachment but with snapshot from 3.11.2015
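The Maven updatePolicy uncertainty mentioned above can be removed for future runs: forcing Maven to re-check the snapshot repository on every build guarantees the benchmark never silently uses a stale OptaPlanner snapshot. A sketch of the repository entry (the id and URL shown are the public JBoss Nexus snapshot repository and are illustrative; the internal repo actually used may differ):

```xml
<!-- Sketch: force Maven to re-check snapshots on every build so a benchmark
     cannot silently run against a stale OptaPlanner snapshot.
     Repository id/url are illustrative. -->
<repository>
  <id>jboss-snapshots</id>
  <url>https://repository.jboss.org/nexus/content/repositories/snapshots/</url>
  <snapshots>
    <enabled>true</enabled>
    <updatePolicy>always</updatePolicy>
  </snapshots>
</repository>
```

Alternatively, running the benchmark build with `mvn -U` (--update-snapshots) forces the same re-check for a single invocation.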
All fixes are in both of the last reports. It's better, but we still have some perf impact for simple VRP cases (at least for CH's). Next things to benchmark: LocalSearch on VRP with Time Windows at different sizes.

Relates to: VRP TW datasets: build realistic Belgium and USA time windowed datasets for demos and benchmarking https://issues.jboss.org/browse/PLANNER-477

Created attachment 1097083 [details]
benchmark, 20 nov.
We can see a speed improvement against CR1, but there are still some cases where Planner is 10 - 20% slower. These problems need to be fixed, but I suggest splitting this issue into separate ones. Moving to verified.