Similar symptoms to https://bugzilla.redhat.com/show_bug.cgi?id=809631: the load on the nodes is not well balanced after the killed node is restored. This can be seen nicely in this chart: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/EDG6/view/EDG-REPORTS-RESILIENCE/job/edg-60-resilience-dist-4-3/59/artifact/report/stats-throughput.png but it happens even when killing only one node of 4.
Illustration of the issue from several hyperion test runs:
Elasticity test 8-16-8: http://www.qa.jboss.com/~mlinhard/hyperion3/run0009/report/stats-throughput.png
Elasticity test 16-32-16: http://www.qa.jboss.com/~mlinhard/hyperion3/run0010/report/stats-throughput.png
Resilience test 8-7-8: http://www.qa.jboss.com/~mlinhard/hyperion3/run0013/report/stats-throughput.png
Galder Zamarreño <galder.zamarreno> made a comment on jira ISPN-2632 @Michal, is the issue still present once an Infinispan version containing a fix for ISPN-2550 is used? You wanna wait to test CR1 and then we see if the issue is still present?
Michal Linhard <mlinhard> made a comment on jira ISPN-2632 The run http://www.qa.jboss.com/~mlinhard/hyperion3/run0013/report/stats-throughput.png that exhibits this issue was done with JDG 6.1.0.ER5 patched with Dan's fixes for ISPN-2550; see my last comments on ISPN-2550.
Galder Zamarreño <galder.zamarreno> made a comment on jira ISPN-2632 @Michal, Ok. What do you have in terms of logs? If this can be replicated at small scale, client and server TRACE logs on org.infinispan would be the easiest. How are these caches configured? Replication or distribution? I suppose it's the latter, since that's the area that's changed the most for 5.2. If you need to scale the logging down, I'd suggest org.infinispan.client on the client side, and org.infinispan.server and org.infinispan.distribution on the server side.
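[Editor's note] The categories suggested above can be enabled with a standard log4j configuration fragment, for example (a sketch assuming the client and server log via log4j; appender setup omitted):

```
# Client side: Hot Rod client internals only
log4j.logger.org.infinispan.client=TRACE
# Server side: endpoint and distribution logic
log4j.logger.org.infinispan.server=TRACE
log4j.logger.org.infinispan.distribution=TRACE
# Or, for full detail on either side:
# log4j.logger.org.infinispan=TRACE
```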
Michal Linhard <mlinhard> made a comment on jira ISPN-2632 I've been on this since yesterday but ran into some problems with our test framework. I'll do the trace logs for JDG 6.1.0.ER8 (Infinispan 5.2.0.CR1), which we should have available tomorrow.
Michal Linhard <mlinhard> made a comment on jira ISPN-2632 By the way, would it help to run the TRACE-logged test with the patched ER5 (the patch containing fixes for ISPN-2642 and ISPN-2550)? I could do that before ER8 - that's the code that produced http://www.qa.jboss.com/~mlinhard/hyperion3/run0013/report/stats-throughput.png
Mircea Markus <mmarkus> made a comment on jira ISPN-2632 the sooner the better as this is a blocker. That would run 5.2.CR1, right?
Michal Linhard <mlinhard> made a comment on jira ISPN-2632 No, that's the thing: we don't yet have a JDG ER with 5.2.0.CR1 inside.
Galder Zamarreño <galder.zamarreno> made a comment on jira ISPN-2632 Getting TRACE with patched ER5 could be handy while we wait for ER6, which I assume should have CR1 in it.
Still present in ER8
Michal Linhard <mlinhard> made a comment on jira ISPN-2632 TRACE log: http://www.qa.jboss.com/~mlinhard/test_results/run0020.zip - resilience test, 8 nodes, crash 1, 50 clients, 5000 entries, JDG 6.1.0.ER8 (Infinispan 5.2.0.CR1)
Galder Zamarreño <galder.zamarreno> updated the status of jira ISPN-2632 to Coding In Progress
Galder Zamarreño <galder.zamarreno> made a comment on jira ISPN-2632 Michal, thx for attaching the logs. It only contains TRACE on the client, but it's Ok. Taking one client as an example (https://gist.github.com/4511311), the number of servers sent back looks right (missing 192.168.87.122), so it doesn't appear to be an issue with the way the view is managed. IMO, it seems to be more of an issue with the hashing itself. Looking at the cached positions on the client, I don't see anything wrong at 1st glance: https://gist.github.com/4511569. The positions taken by .122 appear to have gone, and it doesn't seem to be a problem with node4 taking more positions on the hash map. I wonder if this issue has to do with ISPN-2670? I see org.infinispan.client.hotrod.impl.protocol.Codec12 in the logs, but ISPN-2670 seems to imply that the Hot Rod client does not work with version 1.2 of the protocol. I'm gonna tackle ISPN-2670 first and then we can see if the issue is still there.
Galder Zamarreño <galder.zamarreno> made a comment on jira ISPN-2632 @Michal, in the meantime, could you try running the clients with infinispan.client.hotrod.protocol_version=1.1? The default protocol is 1.2 (even if RemoteCacheManager says otherwise...)
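[Editor's note] The property named above can be set in the client's hotrod-client.properties (or as a -D JVM system property); a minimal fragment, assuming the clients pick up that properties file:

```
# Force Hot Rod protocol version 1.1 instead of the 1.2 default
infinispan.client.hotrod.protocol_version=1.1
```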
Galder Zamarreño <galder.zamarreno> made a comment on jira ISPN-2632 Actually, ISPN-2670 is about the server module's test client, so it's not really relevant here. It's still worth testing with other protocol versions, 1.1 and 1.0, to see if this problem is specific to protocol version 1.2. @Michal, in terms of uneven request balancing, do you know whether the unbalancing happens for all clients, or whether only some of them suffer?
Galder Zamarreño <galder.zamarreno> made a comment on jira ISPN-2632 @Michal, one final question: I see 5 driver threads in each client. I assume each of these threads is using its own RemoteCacheManager instance?
Galder Zamarreño <galder.zamarreno> made a comment on jira ISPN-2632 If so, I wonder if the issue can be replicated with just 1 driver thread per client, so 10 clients. This would make debugging through the client logs easier.
Michal Linhard <mlinhard> made a comment on jira ISPN-2632 for server logs see: report/serverlogs.zip
Michal Linhard <mlinhard> made a comment on jira ISPN-2632 In this test there were 50 driver threads distributed over 11 driver nodes, with one RemoteCacheManager per thread.
Michal Linhard <mlinhard> made a comment on jira ISPN-2632 The behaviour is the same with protocol version 1.1 http://www.qa.jboss.com/~mlinhard/hyperion3/run0027/report/stats-throughput.png
Dan Berindei <dberinde> made a comment on jira ISPN-2632 Send the cache topology id to the client instead of the JGroups view id. After a rebalance, the cache topology id is incremented, so this ensures that the client's topology information is updated as well. To work properly with replicated caches, this change required always incrementing the cache topology id after a leave, even without a rebalance (https://issues.jboss.org/browse/ISPN-2420).
Michal Linhard <mlinhard> updated the status of jira ISPN-2632 to Reopened
Michal Linhard <mlinhard> made a comment on jira ISPN-2632 This appeared again in JDG 6.1.0.ER9 tests (Infinispan 5.2.0.CR2): http://www.qa.jboss.com/~mlinhard/hyperion3/run0034-resi-8-6-8-ER9/report/stats-throughput.png
Michal Linhard <mlinhard> made a comment on jira ISPN-2632 In resilience test 32-28-32 (numOwners 5) the unevenness is present from the beginning of the test (I don't think this needs a different JIRA - or does it?) http://www.qa.jboss.com/~mlinhard/hyperion3/run0035-resi-32-28-32-ER9/report/stats-throughput.png
Michal Linhard <mlinhard> made a comment on jira ISPN-2632 results from another resilience test 32-31-32 (owners 2): http://www.qa.jboss.com/~mlinhard/hyperion3/run0036-resi-32-31-32-ER9/report/stats-throughput.png
Galder Zamarreño <galder.zamarreno> made a comment on jira ISPN-2632 @Michal, 5.2.0.CR2 is out, so please create a separate issue.
Michal Linhard <mlinhard> made a comment on jira ISPN-2632 Created ISPN-2750
Michal Linhard <mlinhard> updated the status of jira ISPN-2632 to Resolved
Michal Linhard <mlinhard> made a comment on jira ISPN-2632 marking back as resolved
Michal Linhard <mlinhard> made a comment on jira ISPN-2750 I've traced resilience test runs 8-7-8 and 16-15-16 with 10 clients, but the charts don't show the same pattern as the 32-31-32 test; at least I can't be sure with such small absolute values (under 20 ops/sec): http://dev39.mw.lab.eng.bos.redhat.com/~mlinhard/hyperion3/run0039-resi-08-ER9-trace/report/stats-throughput.png http://dev39.mw.lab.eng.bos.redhat.com/~mlinhard/hyperion3/run0040-resi-16-ER9-trace/report/stats-throughput.png The original chart that showed the problem, http://dev39.mw.lab.eng.bos.redhat.com/~mlinhard/hyperion3/run0036-resi-32-31-32-ER9/report/stats-throughput.png, had higher throughput values (around 250 ops/sec per node).

In all cases I can see the topology info updated in each client: http://dev39.mw.lab.eng.bos.redhat.com/~mlinhard/hyperion3/run0039-resi-08-ER9-trace/report/loganalysis/client-topology-info/ http://dev39.mw.lab.eng.bos.redhat.com/~mlinhard/hyperion3/run0040-resi-16-ER9-trace/report/loganalysis/client-topology-info/ http://dev39.mw.lab.eng.bos.redhat.com/~mlinhard/hyperion3/run0036-resi-32-31-32-ER9/report/loganalysis/client-topology-info/ The categories "INFO New topology received Full Before" and "INFO New topology received Full After" have an entry for each client thread. In the 32-31-32 run, where the problem manifests, all threads received the same topology id=62 before the crash and id=67 after the rejoin.

Hmm, I just checked the entry distribution in the failing test: http://dev39.mw.lab.eng.bos.redhat.com/~mlinhard/hyperion3/run0036-resi-32-31-32-ER9/report/cache_entries.png vs. the traced runs: http://dev39.mw.lab.eng.bos.redhat.com/~mlinhard/hyperion3/run0039-resi-08-ER9-trace/report/cache_entries.png http://dev39.mw.lab.eng.bos.redhat.com/~mlinhard/hyperion3/run0040-resi-16-ER9-trace/report/cache_entries.png So it seems the HotRod servers really are following the cache topology distribution, and it's the cache topology itself that's weird.
The trace logs for the 8-7-8 and 16-15-16 runs can be found here:
http://dev39.mw.lab.eng.bos.redhat.com/~mlinhard/hyperion3/run0039-resi-08-ER9-trace/report/clientlogs.zip
http://dev39.mw.lab.eng.bos.redhat.com/~mlinhard/hyperion3/run0039-resi-08-ER9-trace/report/serverlogs.zip
http://dev39.mw.lab.eng.bos.redhat.com/~mlinhard/hyperion3/run0040-resi-16-ER9-trace/report/clientlogs.zip
http://dev39.mw.lab.eng.bos.redhat.com/~mlinhard/hyperion3/run0040-resi-16-ER9-trace/report/serverlogs.zip
Dan Berindei <dberinde> updated the status of jira ISPN-2750 to Resolved
Dan Berindei <dberinde> made a comment on jira ISPN-2750 Looks like a configuration problem again: numSegments is only 40, and there are 32 nodes, which means the segments are not evenly divided between the cache members. Here's an ASCII "graph" to that shows how many segments are owned by each node in a sample consistent hash ('=' means it's a primary owner, '+' means it's a backup owner): {noformat} + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + = + + = = = = + = + + + + = + + + + + + + + = + + + + + + + + + = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = {noformat} You can see from the graph that before the ISPN-2643 fix, when the HotRod client would contact a random owner, the load was balanced more evenly. After that fix however, the HotRod client only contacts the primary owner, and there is a clear difference in load between the nodes who primary-own 2 segments and the nodes who primary-own only 1 segment.
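[Editor's note] Dan's point can be illustrated with a small arithmetic sketch (not Infinispan's actual consistent-hash algorithm, just the pigeonhole argument): with 40 segments over 32 nodes, even a perfectly even primary assignment leaves 8 nodes with two primary segments and 24 with one, i.e. up to a 2x difference in load once clients contact only primary owners.

```python
# Pigeonhole sketch: 40 segments, 32 nodes, round-robin primary
# assignment (hypothetical; the real CH factory is more involved,
# but it cannot avoid this imbalance either).
num_segments = 40
num_nodes = 32

primaries = [0] * num_nodes
for seg in range(num_segments):
    primaries[seg % num_nodes] += 1

nodes_with_two = sum(1 for c in primaries if c == 2)
nodes_with_one = sum(1 for c in primaries if c == 1)
print(nodes_with_two, nodes_with_one)   # 8 24
print(max(primaries) / min(primaries))  # 2.0 -> up to 2x load skew
```

With numSegments=320 (about 10 primary segments per node) the worst-case ratio drops to 11/10, i.e. roughly a 10% skew.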
Michal Linhard <mlinhard> made a comment on jira ISPN-2750 Damn, right. I'll try to run the 32-31-32 test with numSegments=320
Michal Linhard <mlinhard> made a comment on jira ISPN-2750 http://www.qa.jboss.com/~mlinhard/hyperion3/run0041-resi-32-31-32-ER9/report/stats-throughput.png With numSegments=320 the results are much nicer. Sorry for the false alarm.
I'll run all the DIST mode elasticity/resilience tests with proper numSegments and mark this as verified if they're clean.
In other tests there is a little unevenness: http://www.qa.jboss.com/~mlinhard/hyperion3/run0042-resi-32-28-32-ER9/report/stats-throughput.png http://www.qa.jboss.com/~mlinhard/hyperion3/run0043-resi-8-6-8-ER9/report/stats-throughput.png By eye it's around 8%, and it occurs only during the crash. I need an opinion on whether this is tolerable. Both tests use numSegments = 10 * clusterSize. I'm going to compute the exact numbers for both throughput and entry distribution (taken from stats).
I calculated how the entry counts in the caches and the throughput on the nodes differ during the test, by computing how far each one deviates from the average (counting only active nodes). These are the maximum differences in entry counts and throughput on individual nodes:
32-28-32 test: entries 4.8% (ignoring 6 extreme values), throughput 17.6% (ignoring 6 extreme values)
32-31-32 test: entries 3.6% (ignoring one extreme value), throughput 18.6% (ignoring one extreme value)
8-6-8 test: entries 7.9% (ignoring 10 extreme values), throughput 17.1% (ignoring 3 extreme values)
Other than this the tests are fine - no unusual client or server errors in the logs. So this depends on what we're able to tolerate.
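[Editor's note] For reference, the "maximum difference from average" figures above can be computed as follows (a sketch with made-up per-node numbers, not the actual test data):

```python
def max_deviation_pct(values):
    """Largest relative deviation of any node from the mean, in percent."""
    avg = sum(values) / len(values)
    return max(abs(v - avg) / avg for v in values) * 100

# Hypothetical entry counts for 5 active nodes (illustrative only)
entries = [5000, 5100, 4900, 5200, 4800]
print(round(max_deviation_pct(entries), 1))  # 4.0
```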
Michal, the throughput can vary a lot during the test even if the distribution of the entries stays the same. E.g. between minutes 9 and 13 in http://www.qa.jboss.com/~mlinhard/hyperion3/run0043-resi-8-6-8-ER9/report/stats-throughput.png, the entry distribution doesn't change, but the throughput varies by ~5%. So I think it would be fairer to compare the average throughput of each node over a longer period. By this measure, I would expect differences of ~10% (as there are ~10 segments per node, and while there are 6/7 nodes alive some of the nodes will have 1 extra "primary" segment). You could also try to extract the current ConsistentHash from a client/server and log the primary owner of each segment (or rather the number of segments/hash-space points primary-owned by each node). This is still a proxy measurement, but the hash function is pretty uniform, so it should be more accurate than the throughput numbers. On the other hand, it would be more complicated, and I'm not sure it would bring any new information compared to the throughput graph.
OK, understood. The number of entries (which is stable during a phase with the same number of nodes) doesn't differ by more than 8% - still less than the expected ~10%. The throughput number is much more volatile: it depends not only on the number of entries and the CH but also on all the nondeterminism of the network, so it's only an illustrative indicator. The 17-19% figures are maximums over the whole test, so the averages will be closer to the acceptable ~10% plus network randomness. I'm marking this verified.