Bug 886549 - Uneven request balancing after node crash
Summary: Uneven request balancing after node crash
Keywords:
Status: VERIFIED
Alias: None
Product: JBoss Data Grid 6
Classification: JBoss
Component: Infinispan
Version: 6.1.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ER10
Target Release: 6.1.0
Assignee: Tristan Tarrant
QA Contact: Nobody
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2012-12-12 14:25 UTC by Michal Linhard
Modified: 2023-03-02 08:27 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:
Type: Bug
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker ISPN-2632 0 Blocker Resolved Uneven request balancing after node crash 2013-12-04 18:40:28 UTC
Red Hat Issue Tracker ISPN-2750 0 Major Resolved Uneven request balancing via hotrod 2013-12-04 18:40:28 UTC

Description Michal Linhard 2012-12-12 14:25:13 UTC
Similar symptoms as in
https://bugzilla.redhat.com/show_bug.cgi?id=809631

The load across the nodes is not well balanced after the killed node is restored again. This can be seen nicely in this chart:
https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/EDG6/view/EDG-REPORTS-RESILIENCE/job/edg-60-resilience-dist-4-3/59/artifact/report/stats-throughput.png

but it happens when killing only one node out of 4.

Comment 1 Michal Linhard 2012-12-18 10:33:59 UTC
Illustration of the issue from several hyperion test runs:

Elasticity test 8-16-8:
http://www.qa.jboss.com/~mlinhard/hyperion3/run0009/report/stats-throughput.png
Elasticity test 16-32-16:
http://www.qa.jboss.com/~mlinhard/hyperion3/run0010/report/stats-throughput.png
Resilience test 8-7-8:
http://www.qa.jboss.com/~mlinhard/hyperion3/run0013/report/stats-throughput.png

Comment 2 JBoss JIRA Server 2013-01-08 11:46:52 UTC
Galder Zamarreño <galder.zamarreno> made a comment on jira ISPN-2632

@Michal, is the issue still present once an Infinispan version containing a fix for ISPN-2550 is used? Do you want to wait to test CR1, and then we'll see if the issue is still present?

Comment 3 JBoss JIRA Server 2013-01-08 12:14:17 UTC
Michal Linhard <mlinhard> made a comment on jira ISPN-2632

The run
http://www.qa.jboss.com/~mlinhard/hyperion3/run0013/report/stats-throughput.png
that shows this issue was done with JDG 6.1.0.ER5 patched with Dan's fixes for ISPN-2550; see my last comments in ISPN-2550.

Comment 4 JBoss JIRA Server 2013-01-09 09:43:41 UTC
Galder Zamarreño <galder.zamarreno> made a comment on jira ISPN-2632

@Michal, OK. What do you have in terms of logs? If this can be replicated at small scale, client and server TRACE logs on org.infinispan would be the easiest.

How are these caches configured? Replication or distribution? I suppose it's the latter, since that's the area that changed the most for 5.2. If you need to scale the logging down, I'd suggest org.infinispan.client on the client side, and org.infinispan.server and org.infinispan.distribution on the server side.

Comment 5 JBoss JIRA Server 2013-01-10 10:22:48 UTC
Michal Linhard <mlinhard> made a comment on jira ISPN-2632

I've been on this since yesterday but ran into some problems with our test framework. I'll collect the TRACE logs for JDG 6.1.0.ER8 (Infinispan 5.2.0.CR1), which we should have available tomorrow.

Comment 6 JBoss JIRA Server 2013-01-10 11:07:51 UTC
Michal Linhard <mlinhard> made a comment on jira ISPN-2632

Btw, would it help to run the TRACE-logged test with the patched ER5 (the patch containing fixes for ISPN-2642 and ISPN-2550)? I could do that before ER8.

Comment 7 JBoss JIRA Server 2013-01-10 11:08:41 UTC
Michal Linhard <mlinhard> made a comment on jira ISPN-2632

Btw, would it help to run the TRACE-logged test with the patched ER5 (the patch containing fixes for ISPN-2642 and ISPN-2550)? I could do that before ER8 - that's the code that produced this: http://www.qa.jboss.com/~mlinhard/hyperion3/run0013/report/stats-throughput.png

Comment 8 JBoss JIRA Server 2013-01-10 11:20:01 UTC
Mircea Markus <mmarkus> made a comment on jira ISPN-2632

The sooner the better, as this is a blocker. That would run 5.2.0.CR1, right?

Comment 9 JBoss JIRA Server 2013-01-10 11:22:16 UTC
Michal Linhard <mlinhard> made a comment on jira ISPN-2632

No, that's the thing: we don't yet have a JDG ER with 5.2.0.CR1 inside.

Comment 10 JBoss JIRA Server 2013-01-10 13:01:00 UTC
Galder Zamarreño <galder.zamarreno> made a comment on jira ISPN-2632

Getting TRACE with patched ER5 could be handy while we wait for ER6, which I assume should have CR1 in it.

Comment 11 Michal Linhard 2013-01-10 18:12:08 UTC
Still present in ER8

Comment 12 JBoss JIRA Server 2013-01-10 18:13:57 UTC
Michal Linhard <mlinhard> made a comment on jira ISPN-2632

TRACE log: http://www.qa.jboss.com/~mlinhard/test_results/run0020.zip

resilience test 8 nodes, crash 1, 50 clients, 5000 entries
JDG 6.1.0.ER8 (Infinispan 5.2.0.CR1)

Comment 13 JBoss JIRA Server 2013-01-11 11:47:07 UTC
Galder Zamarreño <galder.zamarreno> updated the status of jira ISPN-2632 to Coding In Progress

Comment 14 JBoss JIRA Server 2013-01-11 15:40:49 UTC
Galder Zamarreño <galder.zamarreno> made a comment on jira ISPN-2632

Michal, thanks for attaching the logs. They only contain TRACE on the client, but that's OK.

Taking one client as an example (https://gist.github.com/4511311), the number of servers sent back looks right (192.168.87.122 is missing), so it doesn't appear to be an issue with the way the view is managed. IMO, it seems to be more of an issue with the hashing itself.

Looking at the cached positions on the client, I don't see anything wrong at first glance: https://gist.github.com/4511569. The positions taken by .122 appear to have gone. It doesn't seem to be a problem of node4 taking more positions on the hash map.

I wonder if this issue has anything to do with ISPN-2670? I see org.infinispan.client.hotrod.impl.protocol.Codec12 in the logs, but ISPN-2670 seems to imply that the Hot Rod client does not work with version 1.2 of the protocol.

I'm gonna tackle ISPN-2670 first and then we can see if the issue is still there.

Comment 15 JBoss JIRA Server 2013-01-11 15:43:27 UTC
Galder Zamarreño <galder.zamarreno> made a comment on jira ISPN-2632

@Michal, in the meantime, could you try running the clients with infinispan.client.hotrod.protocol_version=1.1? The default protocol version is 1.2 (even if RemoteCacheManager says otherwise...).
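
A minimal sketch of forcing the older protocol version on the client, assuming the Properties-based RemoteCacheManager constructor of the 5.2-era Hot Rod client; the server list and key/value are placeholders:

{noformat}
import java.util.Properties;

import org.infinispan.client.hotrod.RemoteCache;
import org.infinispan.client.hotrod.RemoteCacheManager;

public class ProtocolVersionCheck {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder server list; use the real node addresses from the test.
        props.setProperty("infinispan.client.hotrod.server_list", "node01:11222;node02:11222");
        // Force Hot Rod protocol version 1.1 instead of the 1.2 default.
        props.setProperty("infinispan.client.hotrod.protocol_version", "1.1");

        RemoteCacheManager rcm = new RemoteCacheManager(props);
        RemoteCache<String, String> cache = rcm.getCache();
        cache.put("k1", "v1");
        rcm.stop();
    }
}
{noformat}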

Comment 16 JBoss JIRA Server 2013-01-11 15:49:22 UTC
Galder Zamarreño <galder.zamarreno> made a comment on jira ISPN-2632

Actually, ISPN-2670 is about the server module's test client, so not really relevant here. It's worth testing with other protocol versions, 1.1 and 1.0, to see if this problem is specific to protocol version 1.2.

@Michal, in terms of uneven request balancing, do you know whether the unbalancing happens for all clients, or only some of them suffer?

Comment 17 JBoss JIRA Server 2013-01-11 15:52:33 UTC
Galder Zamarreño <galder.zamarreno> made a comment on jira ISPN-2632

@Michal, one final question: I see 5 driver threads in each client. I assume each of these threads is using its own RemoteCacheManager instance?

Comment 18 JBoss JIRA Server 2013-01-11 15:53:22 UTC
Galder Zamarreño <galder.zamarreno> made a comment on jira ISPN-2632

If so, I wonder if the issue can be replicated with just 1 driver thread per client, so 10 clients. This would make debugging through the client logs easier.

Comment 19 JBoss JIRA Server 2013-01-11 16:42:33 UTC
Michal Linhard <mlinhard> made a comment on jira ISPN-2632

For the server logs, see report/serverlogs.zip

Comment 20 JBoss JIRA Server 2013-01-11 16:45:58 UTC
Michal Linhard <mlinhard> made a comment on jira ISPN-2632

In this test there were 50 driver threads distributed over 11 driver nodes, with one RemoteCacheManager per thread.

Comment 21 JBoss JIRA Server 2013-01-11 17:53:49 UTC
Michal Linhard <mlinhard> made a comment on jira ISPN-2632

The behaviour is the same with protocol version 1.1
http://www.qa.jboss.com/~mlinhard/hyperion3/run0027/report/stats-throughput.png

Comment 22 JBoss JIRA Server 2013-01-18 14:36:02 UTC
Dan Berindei <dberinde> made a comment on jira ISPN-2632

Send the cache topology id to the client instead of the JGroups view id.
After a rebalance, the cache topology id is incremented, so this ensures
that the client's topology information is updated as well.

To work properly with replicated caches, this change required always incrementing the cache topology id after a leave, even without a rebalance (https://issues.jboss.org/browse/ISPN-2420).

Comment 23 JBoss JIRA Server 2013-01-23 09:18:36 UTC
Michal Linhard <mlinhard> updated the status of jira ISPN-2632 to Reopened

Comment 24 JBoss JIRA Server 2013-01-23 09:18:36 UTC
Michal Linhard <mlinhard> made a comment on jira ISPN-2632

This appeared again in the JDG 6.1.0.ER9 tests (Infinispan 5.2.0.CR2):
http://www.qa.jboss.com/~mlinhard/hyperion3/run0034-resi-8-6-8-ER9/report/stats-throughput.png

Comment 25 JBoss JIRA Server 2013-01-23 10:17:49 UTC
Michal Linhard <mlinhard> made a comment on jira ISPN-2632

In the resilience test 32-28-32 (numOwners=5) the unevenness is present from the beginning of the test.
(I don't think this needs a different JIRA, or does it?)
http://www.qa.jboss.com/~mlinhard/hyperion3/run0035-resi-32-28-32-ER9/report/stats-throughput.png

Comment 26 JBoss JIRA Server 2013-01-23 11:36:15 UTC
Michal Linhard <mlinhard> made a comment on jira ISPN-2632

Results from another resilience test, 32-31-32 (owners=2):
http://www.qa.jboss.com/~mlinhard/hyperion3/run0036-resi-32-31-32-ER9/report/stats-throughput.png

Comment 27 JBoss JIRA Server 2013-01-23 15:31:50 UTC
Galder Zamarreño <galder.zamarreno> made a comment on jira ISPN-2632

@Michal, 5.2.0.CR2 is out, so please create a separate issue.

Comment 28 JBoss JIRA Server 2013-01-23 15:40:49 UTC
Michal Linhard <mlinhard> made a comment on jira ISPN-2632

Created ISPN-2750

Comment 29 JBoss JIRA Server 2013-01-23 15:42:23 UTC
Michal Linhard <mlinhard> updated the status of jira ISPN-2632 to Resolved

Comment 30 JBoss JIRA Server 2013-01-23 15:42:23 UTC
Michal Linhard <mlinhard> made a comment on jira ISPN-2632

Marking back as resolved.

Comment 31 JBoss JIRA Server 2013-01-24 09:52:47 UTC
Michal Linhard <mlinhard> made a comment on jira ISPN-2750

I've traced resilience test runs 8-7-8 and 16-15-16 with 10 clients, but the chart doesn't have the same look as in the 32-31-32 test;
at least I can't be sure with such small absolute values (under 20 ops/sec):
http://dev39.mw.lab.eng.bos.redhat.com/~mlinhard/hyperion3/run0039-resi-08-ER9-trace/report/stats-throughput.png
http://dev39.mw.lab.eng.bos.redhat.com/~mlinhard/hyperion3/run0040-resi-16-ER9-trace/report/stats-throughput.png
The original chart that showed the problem,
http://dev39.mw.lab.eng.bos.redhat.com/~mlinhard/hyperion3/run0036-resi-32-31-32-ER9/report/stats-throughput.png
had higher throughput values (around 250 ops/sec per node).

In all cases I can see topology info updated in each client:
http://dev39.mw.lab.eng.bos.redhat.com/~mlinhard/hyperion3/run0039-resi-08-ER9-trace/report/loganalysis/client-topology-info/
http://dev39.mw.lab.eng.bos.redhat.com/~mlinhard/hyperion3/run0040-resi-16-ER9-trace/report/loganalysis/client-topology-info/
http://dev39.mw.lab.eng.bos.redhat.com/~mlinhard/hyperion3/run0036-resi-32-31-32-ER9/report/loganalysis/client-topology-info/
the category "INFO New topology received Full Before" and "INFO New topology received Full After" has an entry for each client thread.

In the 32-31-32 run where the problem manifests, all threads received the same topology id=62 before the crash and id=67 after the rejoin.

Hmm, just checked the entry distribution in the failing test:
http://dev39.mw.lab.eng.bos.redhat.com/~mlinhard/hyperion3/run0036-resi-32-31-32-ER9/report/cache_entries.png

vs the traced runs:
http://dev39.mw.lab.eng.bos.redhat.com/~mlinhard/hyperion3/run0039-resi-08-ER9-trace/report/cache_entries.png
http://dev39.mw.lab.eng.bos.redhat.com/~mlinhard/hyperion3/run0040-resi-16-ER9-trace/report/cache_entries.png

So it seems like the Hot Rod servers really are following the cache topology distribution, and it's the cache topology itself that's weird.

The TRACE logs for the 8-7-8 and 16-15-16 runs can be found here:
http://dev39.mw.lab.eng.bos.redhat.com/~mlinhard/hyperion3/run0039-resi-08-ER9-trace/report/clientlogs.zip
http://dev39.mw.lab.eng.bos.redhat.com/~mlinhard/hyperion3/run0039-resi-08-ER9-trace/report/serverlogs.zip
http://dev39.mw.lab.eng.bos.redhat.com/~mlinhard/hyperion3/run0040-resi-16-ER9-trace/report/clientlogs.zip
http://dev39.mw.lab.eng.bos.redhat.com/~mlinhard/hyperion3/run0040-resi-16-ER9-trace/report/serverlogs.zip

Comment 32 JBoss JIRA Server 2013-01-24 09:56:52 UTC
Dan Berindei <dberinde> updated the status of jira ISPN-2750 to Resolved

Comment 33 JBoss JIRA Server 2013-01-24 09:56:52 UTC
Dan Berindei <dberinde> made a comment on jira ISPN-2750

Looks like a configuration problem again: numSegments is only 40, and there are 32 nodes, which means the segments are not evenly divided between the cache members.

Here's an ASCII "graph" that shows how many segments are owned by each node in a sample consistent hash ('=' means the node is a primary owner, '+' means it's a backup owner):

{noformat}
+ +       +     + + +     +             +                       
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 
= + + = = = = + = + + + + = + + + + + + + + = + + + + + + + + + 
= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = 
{noformat}

You can see from the graph that before the ISPN-2643 fix, when the Hot Rod client would contact a random owner, the load was balanced more evenly. After that fix, however, the Hot Rod client only contacts the primary owner, and there is a clear difference in load between the nodes that primary-own 2 segments and the nodes that primary-own only 1 segment.

Comment 34 JBoss JIRA Server 2013-01-24 10:02:17 UTC
Michal Linhard <mlinhard> made a comment on jira ISPN-2750

Damn, right. I'll try to run the 32-31-32 test with numSegments=320
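
For reference, a minimal sketch of where numSegments is set in the embedded configuration API (assuming the 5.2-era ConfigurationBuilder; the JDG server configuration exposes an equivalent setting on the distributed cache definition), using values that mirror the 32-node run:

{noformat}
import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.Configuration;
import org.infinispan.configuration.cache.ConfigurationBuilder;

public class SegmentConfigSketch {
    public static Configuration distConfig() {
        return new ConfigurationBuilder()
            .clustering()
                .cacheMode(CacheMode.DIST_SYNC)
                .hash()
                    // Roughly 10 segments per node keeps the number of
                    // primary-owned segments per node close to equal:
                    // 320 segments for a 32-node cluster.
                    .numSegments(320)
                    .numOwners(2)
            .build();
    }
}
{noformat}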

Comment 35 JBoss JIRA Server 2013-01-24 10:45:32 UTC
Michal Linhard <mlinhard> made a comment on jira ISPN-2750

http://www.qa.jboss.com/~mlinhard/hyperion3/run0041-resi-32-31-32-ER9/report/stats-throughput.png
With numSegments=320 the results are much nicer. Sorry for the false alarm.

Comment 36 Michal Linhard 2013-01-24 10:48:49 UTC
I'll run all the DIST-mode elasticity/resilience tests with a proper numSegments and mark this as verified if they're clean.

Comment 37 Michal Linhard 2013-01-24 12:40:12 UTC
In other tests there is a little unevenness:
http://www.qa.jboss.com/~mlinhard/hyperion3/run0042-resi-32-28-32-ER9/report/stats-throughput.png
http://www.qa.jboss.com/~mlinhard/hyperion3/run0043-resi-8-6-8-ER9/report/stats-throughput.png

By eye it's around 8%, and only during the crash. I need an opinion on whether this is tolerable. Both tests use numSegments = 10 * clusterSize.

I'm going to compute the exact numbers for both throughput and entry distribution (taken from the stats).

Comment 38 Michal Linhard 2013-01-24 17:17:35 UTC
I calculated how the entry counts in the caches and the throughput on individual nodes differ from each other during the test, by computing how far each one deviates from the average (counting only active nodes). These are the maximum differences in entry counts and throughput on individual nodes:

32-28-32 test:
entries:     4.8 % (ignoring 6 extreme values)
throughput: 17.6 % (ignoring 6 extreme values)

32-31-32 test:
entries:     3.6 % (ignoring one extreme value)
throughput: 18.6 % (ignoring one extreme value)

8-6-8 test:
entries:     7.9 % (ignoring 10 extreme values)
throughput: 17.1 % (ignoring 3 extreme values)

Other than this, the tests are fine; there are no unusual client or server errors in the logs.
So this depends on what we're able to tolerate.
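
A minimal sketch of the deviation calculation described above, with hypothetical per-node throughput values; it computes each active node's percentage deviation from the average and reports the maximum:

{noformat}
public class DeviationSketch {
    // Returns the maximum percentage deviation from the average of the given per-node values.
    static double maxDeviationPercent(double[] perNodeValues) {
        double sum = 0;
        for (double v : perNodeValues) {
            sum += v;
        }
        double avg = sum / perNodeValues.length;
        double maxDev = 0;
        for (double v : perNodeValues) {
            maxDev = Math.max(maxDev, Math.abs(v - avg) / avg * 100.0);
        }
        return maxDev;
    }

    public static void main(String[] args) {
        // Hypothetical throughput samples (ops/sec) for the active nodes only.
        double[] throughput = {250, 240, 262, 255, 231, 248, 258};
        System.out.printf("max deviation: %.1f %%%n", maxDeviationPercent(throughput));
    }
}
{noformat}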

Comment 39 Dan Berindei 2013-01-28 16:00:18 UTC
Michal, the throughput can vary a lot during the test even if the distribution of the entries stays the same. E.g. between minutes 9 and 13 in http://www.qa.jboss.com/~mlinhard/hyperion3/run0043-resi-8-6-8-ER9/report/stats-throughput.png, the entry distribution doesn't change, but the throughput seems to vary by ~5%.

So I think it would be fairer to compare the average throughput of each node over a longer period. By this measure, I would expect differences of ~10% (as there are ~10 segments per node, and while there are 6/7 nodes alive some of the nodes will have 1 extra "primary" segment).

You could also try to extract the current ConsistentHash from a client/server and log the primary owner of each segment (or rather the number of segments/hash space points primary-owned by each node). This is still a proxy measurement, but the hashing function is pretty uniform so it should be more accurate than the throughput number. On the other hand, it would be more complicated, and I'm not sure it would bring any new information compared to the throughput graph.
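
A rough sketch of the kind of probe suggested here, assuming the 5.2-era DistributionManager/ConsistentHash API (method names such as locatePrimaryOwnerForSegment may differ between versions); it counts how many segments each node primary-owns:

{noformat}
import java.util.HashMap;
import java.util.Map;

import org.infinispan.Cache;
import org.infinispan.distribution.ch.ConsistentHash;
import org.infinispan.remoting.transport.Address;

public class SegmentOwnershipProbe {
    // Logs how many segments each node primary-owns in the current consistent hash.
    static void logPrimaryOwnership(Cache<?, ?> cache) {
        ConsistentHash ch = cache.getAdvancedCache().getDistributionManager().getConsistentHash();
        Map<Address, Integer> primarySegments = new HashMap<Address, Integer>();
        for (int segment = 0; segment < ch.getNumSegments(); segment++) {
            Address primary = ch.locatePrimaryOwnerForSegment(segment);
            Integer count = primarySegments.get(primary);
            primarySegments.put(primary, count == null ? 1 : count + 1);
        }
        System.out.println("Primary-owned segments per node: " + primarySegments);
    }
}
{noformat}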

Comment 40 Michal Linhard 2013-01-28 17:21:46 UTC
OK, understood. The number of entries (which is stable during a phase with the same number of nodes) doesn't differ by more than 8%, so still less than the expected ~10%. The throughput number is much more volatile: it depends not only on the number of entries and the CH, but also on all the nondeterminism of the network, so it's only an illustrative indicator. The 17-19% figures are maximums over the whole test, so the average will be closer to the acceptable 10% plus network randomness.

I'm marking this verified.

