I've made a simple test with one DIST cache, 10 nodes, and no data at all. Initial state transfer completes in 3 seconds with numSegments == 500, but takes more than 50 seconds with numSegments == 5000. Item #2 seems to have the most impact: switching to a simple HashSet in InboundTransferTask drops the time significantly, about 5x. Unfortunately, the trivial solution of just replacing CopyOnWriteArraySet with HashSet breaks some concurrency-related concerns, so it's not immediately applicable. Instead, I will try to see whether using synchronization gets us better performance than using concurrent collections. Item #1 does not seem to give us much improvement, but it is indeed a potential optimization; it should be solved by extracting the wCh.getSegmentsForOwner(..) call outside the loop. I'm still working on #2; not sure I can have a quick solution today.
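A minimal sketch of the two candidate optimizations, under stated assumptions: all class and method names below are hypothetical stand-ins, not the actual Infinispan API. It shows (item #2) a plain HashSet guarded by synchronized methods instead of CopyOnWriteArraySet, which copies its backing array on every mutation, and (item #1) the per-owner segment lookup hoisted out of the inner loop so it runs once per owner rather than once per iteration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy sketch only; names are hypothetical, not the real Infinispan classes.
public class StateTransferSketch {

    // Item #2: CopyOnWriteArraySet pays O(n) per add/remove because it copies
    // the whole backing array. A plain HashSet guarded by synchronized methods
    // keeps add/remove at O(1) while preserving thread safety.
    static class SegmentSet {
        private final Set<Integer> segments = new HashSet<>();

        synchronized boolean add(int segment) { return segments.add(segment); }
        synchronized boolean remove(int segment) { return segments.remove(segment); }
        synchronized boolean isEmpty() { return segments.isEmpty(); }
        // Hand out a defensive copy so callers never iterate the live set.
        synchronized Set<Integer> snapshot() { return new HashSet<>(segments); }
    }

    // Item #1: call the segment lookup once per owner, outside the loop body,
    // instead of re-evaluating it on every iteration.
    static Map<String, Set<Integer>> segmentsByOwner(List<String> owners, int numSegments) {
        Map<String, Set<Integer>> result = new HashMap<>();
        for (String owner : owners) {
            Set<Integer> owned = getSegmentsForOwner(owner, numSegments); // hoisted lookup
            result.put(owner, owned);
        }
        return result;
    }

    // Stand-in for wCh.getSegmentsForOwner(..): deterministically assigns
    // every third segment to an owner, just to make the sketch runnable.
    static Set<Integer> getSegmentsForOwner(String owner, int numSegments) {
        Set<Integer> owned = new HashSet<>();
        for (int s = Math.floorMod(owner.hashCode(), 3); s < numSegments; s += 3) {
            owned.add(s);
        }
        return owned;
    }

    public static void main(String[] args) {
        SegmentSet set = new SegmentSet();
        set.add(1);
        set.add(2);
        set.remove(1);
        System.out.println(set.snapshot()); // prints [2]
        System.out.println(segmentsByOwner(List.of("a", "b"), 10).size()); // prints 2
    }
}
```

Whether synchronized blocks actually beat the concurrent collection here depends on the contention profile; the point of the sketch is only that the mutation cost stops scaling with the number of segments.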
PR here: https://github.com/infinispan/jdg/pull/123
I believe the functional side of this PR can be verified with the usual elasticity/resilience tests (in fact, the changes are simple enough that the Infinispan testsuite should catch any bug as well). For a performance check, I could set up a cluster with many segments and see from the logs whether the startup time has improved; I don't think any automation would be beneficial here.
Please run the usual elasticity/resilience tests. Thanks.
The fix has been verified. It does NOT introduce any regression. We did not measure the speed-up, though.
After additional review in the community, we found a few more small, rather cosmetic, fixes: https://github.com/infinispan/jdg/pull/146