Bug 786202 - State transfer taking too long on node join
Summary: State transfer taking too long on node join
Keywords:
Status: VERIFIED
Alias: None
Product: JBoss Data Grid 6
Classification: JBoss
Component: Infinispan
Version: 6.0.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 6.0.0
Assignee: Tristan Tarrant
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2012-01-31 17:31 UTC by Michal Linhard
Modified: 2023-03-02 08:27 UTC
CC: 4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
With a JBoss Data Grid cluster of 4 or more nodes, if a node crashes and is then brought up again, the subsequent state transfer takes a very long time to conclude. This occurs despite a mild data load (5% of heap and client load). The data load concludes within a minute, while the state transfer unexpectedly requires more time than this. This problem has only occurred once and has not been reproduced in tests since. It is a low-risk performance problem.
Clone Of:
Environment:
Last Closed: 2012-03-08 12:55:06 UTC
Type: Bug
Embargoed:


Attachments


Links
Red Hat Issue Tracker ISPN-1838 (Priority: Major, Status: Resolved) - "State transfer takes more than 10 minutes with only 10K entries." Last updated: 2019-01-17 02:04:00 UTC

Description Michal Linhard 2012-01-31 17:31:56 UTC
EDG build: http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-QE/job/edg-60-build-edg-from-source/101/

Problematic run: http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-QE/job/edg-60-build-edg-from-source/101/

This is a 4-node failover test where we crash node2 after 5 minutes and then bring it back up again after another 5 minutes. After the restart, the resulting state transfer takes forever, even though we have only a mild data load (5% of heap) and a client load of 500. It took under 1 minute to load all the data, and the state transfer should take less than that...

Comment 2 Michal Linhard 2012-02-02 17:44:43 UTC
On node03 (the coordinator) we can see that the preparation of testCache cache view 8 fails after the 10-minute timeout.

Comment 3 JBoss JIRA Server 2012-02-03 09:16:00 UTC
Michal Linhard <mlinhard> made a comment on jira ISPN-1838

I forgot: trace logs from the client side (edg-perf06, edg-perf07 console outputs) were too big (> 2.5 GB), so I zipped them and they're available here:

for http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-REPORTS-RESILIENCE/job/edg-60-failover-dist-basic/23
http://www.qa.jboss.com/~mlinhard/b23c1.zip
http://www.qa.jboss.com/~mlinhard/b23c2.zip
for http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-REPORTS-RESILIENCE/job/edg-60-failover-dist-basic/24
http://www.qa.jboss.com/~mlinhard/b24c1.zip
http://www.qa.jboss.com/~mlinhard/b24c2.zip

should you be interested in the client-side trace logs.

Comment 4 JBoss JIRA Server 2012-02-05 20:26:12 UTC
Dan Berindei <dan.berindei> made a comment on jira ISPN-1838

The logs are quite weird: node-01 sends an APPLY_STATE command to node-02 for view 8 that never seems to reach its destination.

Each of the surviving nodes (node-01, node-03, node-04) sent an APPLY_STATE command to node-02. The node-03 command took 1 second to be received and unmarshalled, the node-04 command took 1 minute, and the node-01 command never started executing (either because it never arrived or because unmarshalling took too long).

{noformat}
2012-02-02 05:18:17,479 TRACE [RpcManagerImpl] (transport-thread-4) edg-perf03-36944 broadcasting call StateTransferControlCommand{cache=testCache, type=APPLY_STATE, sender=edg-perf03-36944, viewId=8, state=1806} to recipient list [edg-perf02-21799]
2012-02-02 05:18:18,479 TRACE [CommandAwareRpcDispatcher] (OOB-100,edg-perf02-21799) Attempting to execute command: StateTransferControlCommand{cache=testCache, type=APPLY_STATE, sender=edg-perf03-36944, viewId=8, state=1806} [sender=edg-perf03-36944]
{noformat}

{noformat}
2012-02-02 05:18:17,468 TRACE [CommandAwareRpcDispatcher] (transport-thread-5) Replication task sending StateTransferControlCommand{cache=testCache, type=APPLY_STATE, sender=edg-perf04-45788, viewId=8, state=1443} to addresses [edg-perf02-21799]
2012-02-02 05:19:16,630 TRACE [CommandAwareRpcDispatcher] (OOB-98,edg-perf02-21799) Attempting to execute command: StateTransferControlCommand{cache=testCache, type=APPLY_STATE, sender=edg-perf04-45788, viewId=8, state=1443} [sender=edg-perf04-45788]
{noformat}

{noformat}
2012-02-02 05:18:17,468 TRACE [CommandAwareRpcDispatcher] (transport-thread-5) Replication task sending StateTransferControlCommand{cache=testCache, type=APPLY_STATE, sender=edg-perf01-47003, viewId=8, state=1903} to addresses [edg-perf02-21799]
???
{noformat}

Michal, can you run the test again with TRACE enabled for org.jgroups and for org.infinispan.marshall as well?

Comment 5 JBoss JIRA Server 2012-02-06 12:00:56 UTC
Dan Berindei <dan.berindei> made a comment on jira ISPN-1838

For comparison, here are the logs for the next view's APPLY_STATE commands. They all take < 100ms from sending to execution on the target:

{noformat}
2012-02-02 05:28:17,689 TRACE [CommandAwareRpcDispatcher] (transport-thread-6) Replication task sending StateTransferControlCommand{cache=testCache, type=APPLY_STATE, sender=edg-perf04-45788, viewId=10, state=1443} to addresses [edg-perf02-21799]
2012-02-02 05:28:17,751 TRACE [CommandAwareRpcDispatcher] (OOB-99,edg-perf02-21799) Attempting to execute command: StateTransferControlCommand{cache=testCache, type=APPLY_STATE, sender=edg-perf04-45788, viewId=10, state=1443} [sender=edg-perf04-45788]
{noformat}

{noformat}
2012-02-02 05:28:17,724 TRACE [CommandAwareRpcDispatcher] (transport-thread-24) Replication task sending StateTransferControlCommand{cache=testCache, type=APPLY_STATE, sender=edg-perf03-36944, viewId=10, state=1806} to addresses [edg-perf02-21799]
2012-02-02 05:28:17,799 TRACE [CommandAwareRpcDispatcher] (OOB-100,edg-perf02-21799) Attempting to execute command: StateTransferControlCommand{cache=testCache, type=APPLY_STATE, sender=edg-perf03-36944, viewId=10, state=1806} [sender=edg-perf03-36944]
{noformat}

{noformat}
2012-02-02 05:28:17,792 TRACE [CommandAwareRpcDispatcher] (transport-thread-6) Replication task sending StateTransferControlCommand{cache=testCache, type=APPLY_STATE, sender=edg-perf01-47003, viewId=10, state=1903} to addresses [edg-perf02-21799]
2012-02-02 05:28:17,876 TRACE [CommandAwareRpcDispatcher] (OOB-98,edg-perf02-21799) Attempting to execute command: StateTransferControlCommand{cache=testCache, type=APPLY_STATE, sender=edg-perf01-47003, viewId=10, state=1903} [sender=edg-perf01-47003]
{noformat}

One difference that might be significant is that in view 8 all three nodes send their commands at almost exactly the same time (within 1 ms of each other). In view 10 the commands are more spread out; there is about 100 ms between the first and the last send.

Comment 6 JBoss JIRA Server 2012-02-13 16:25:38 UTC
Michal Linhard <mlinhard> made a comment on jira ISPN-1838

Hi Dan, back from PTO... the TRACE log request still stands, I guess? Is it OK to leave out the client-side trace logs (or is DEBUG level enough)? They're quite big, and Vojta (the Hudson admin) needs to manually delete them after each such run...

Comment 8 JBoss JIRA Server 2012-02-14 15:23:56 UTC
Dan Berindei <dan.berindei> made a comment on jira ISPN-1838

Filtered JGroups logs from this test run: http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-QE/job/edg-60-experiments-mlinhard/200/

Comment 9 JBoss JIRA Server 2012-02-15 08:17:29 UTC
Dan Berindei <dan.berindei> made a comment on jira ISPN-1838

This looks like a bad interaction between our state transfer behaviour on join and UNICAST2.

I wrote a test called UUPerf, modeled after the existing UPerf but reproducing our state transfer behaviour: all the nodes send very few, very large messages to a single node (the last to join the cluster). The code is here: https://github.com/danberindei/JGroups/commit/8bdbce0d0fe3e5a65caf5d28e435853f29a20ff3
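
For readers without access to the commit above, here is a minimal, self-contained sketch of the traffic pattern UUPerf models, written against the plain JGroups 3.x API. It is not the actual UUPerf code from the linked commit; the cluster name, chunk size, message count and wait time are illustrative assumptions.

{code:java}
import org.jgroups.Address;
import org.jgroups.JChannel;
import org.jgroups.Message;
import org.jgroups.ReceiverAdapter;
import org.jgroups.View;

// Sketch of the UUPerf traffic pattern: every existing member unicasts a few very
// large messages to the newest member of the view (the "joiner"), mimicking the
// APPLY_STATE pushes that follow a join.
public class UUPerfLikeSketch {
    public static void main(String[] args) throws Exception {
        // Same stack as in the tests, e.g. Infinispan's jgroups-udp.xml or jgroups-tcp.xml
        JChannel ch = new JChannel(args.length > 0 ? args[0] : "jgroups-udp.xml");
        ch.setReceiver(new ReceiverAdapter() {
            @Override
            public void receive(Message msg) {
                System.out.println("received " + msg.getLength() + " bytes from " + msg.getSrc());
            }
        });
        ch.connect("uuperf-sketch");                 // cluster name is arbitrary

        View view = ch.getView();
        Address joiner = view.getMembers().get(view.getMembers().size() - 1);
        if (!ch.getAddress().equals(joiner)) {
            byte[] chunk = new byte[512 * 1024];     // one large "state chunk" (assumed size)
            for (int i = 0; i < 4; i++)              // very few, very large messages
                ch.send(new Message(joiner, chunk)); // all senders target the same node
        }
        Thread.sleep(60000);                         // crude wait for delivery/measurement
        ch.close();
    }
}
{code}

The important characteristic is that all non-joiner members start sending at essentially the same moment, which matches the view 8 timings noted earlier (all three senders within 1 ms of each other).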

I have run the test with Infinispan's jgroups-udp.xml and jgroups-tcp.xml configurations on cluster01..04, and I have attached the results as uuperf-udp.txt and uuperf-tcp.txt.

As I increased the number of nodes in the cluster, the throughput with TCP dropped relatively quickly, because of (I assume) collisions on the network between the switch and the joiner. With UDP, however, the effect is much more pronounced; I even got a timeout after 400 seconds (thus reproducing Michal's original report).

The problem seems to be that UNICAST2 (just like NAKACK(2)) can't detect missing messages when they are the last messages sent by a node. So if there are lots of collisions on the wire, there is a high chance that the missing messages will only be detected when the STABLE timer expires (every 50 seconds in our configurations). However, since the joiner then sends retransmission requests to all the nodes at the same time, there is again a high chance that the retransmission responses will collide and repeat the cycle.

Comment 10 JBoss JIRA Server 2012-02-15 08:35:46 UTC
Dan Berindei <dan.berindei> made a comment on jira ISPN-1838

I discussed this with Bela yesterday and we came up with several workarounds:
1. Stagger APPLY_STATE commands - add a random 0-500ms delay before sending the command
2. Make the chunk size smaller to decrease the chance of collisions.
3. Use RSVP for APPLY_STATE commands - would require JGroups 3.1
4. Use UNICAST instead of UNICAST2
5. Decrease the STABLE timeout - can't lower it too much, as it is quite expensive
6. Decrease FRAG2.frag_size (currently 60000 bytes). When there is a collision, an entire fragment has to be discarded and sent again, even if the Ethernet frame that collided was only 1500 bytes (see the sketch below).
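
For illustration, here is a back-of-the-envelope sketch of why option #6 helps, under the assumption stated above that losing a single 1500-byte Ethernet frame forces the whole JGroups fragment it belongs to to be retransmitted. The frag_size values are the current one and the reductions discussed in this thread.

{code:java}
// Rough retransmission-amplification estimate for different FRAG2.frag_size values.
// Assumption (from the discussion above): one lost Ethernet frame discards and
// resends the entire fragment it belonged to.
public class FragmentLossCost {
    public static void main(String[] args) {
        int mtu = 1500;                          // standard Ethernet MTU
        int[] fragSizes = {60000, 15000, 8000};  // current value and candidate reductions
        for (int fragSize : fragSizes) {
            int framesPerFragment = (fragSize + mtu - 1) / mtu;  // ceiling division
            System.out.printf("frag_size=%5d -> %2d frames per fragment; "
                    + "one lost frame costs ~%d bytes of retransmission%n",
                    fragSize, framesPerFragment, fragSize);
        }
    }
}
{code}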


I have tried options #2 (500 KB messages), #5 (10 seconds) and #6 (15000 bytes), but I still got a timeout after 400 seconds with 4 senders and 1 receiver.

I have also tried option #4; the results are in the attached uuperf-unicast1.txt. It still scales worse than UNICAST2 over TCP, but it's much better than UNICAST2 over UDP.

I haven't tried options #1 and #3 yet, but I suspect #1 will only make the problem harder to reproduce (because once STABLE kicks in, the retransmission responses are not staggered).

Comment 11 JBoss JIRA Server 2012-02-17 10:06:37 UTC
Bela Ban <bela> made a comment on jira ISPN-1838

I ran UUPerf on my local cluster and got good results (between 30 and 40 reqs/sec/node for a 4-node cluster).

However, the same test (same config and same JGroups version) had abysmal performance on cluster01-04! I think this is due to a switch that is either not optimally configured for UDP datagrams or has small buffers for datagram packets.
Note that I've had this suspicion before, as MPerf numbers are also bad...

The TCP transport scales its send window down to reduce packet loss and subsequent retransmission. However, JGroups' UFC (Unicast Flow Control) protocol has a fixed window size (number of credits); this will be changed in 3.2 or 3.3, see [1].
Due to this, messages are sent too fast, causing packet loss at the switch (still my assumption!) and retransmissions, which are costly and generate more traffic.

There are 2 things that correct this:
#1 Reduce UFC.max_bytes to (say) 200K

#2 Reduce FRAG2.frag_size to 8000. Note that the MTU on cluster01-10 is 9000, so a frag_size of 8K won't lead to IP fragmentation. On other systems, such as perf01-10 where the MTU is 1500, we'd have to experiment with even smaller sizes.
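
As an illustration of point #2, a small sketch of the MTU arithmetic: the IP and UDP header sizes below are standard, but the allowance for JGroups' own headers is an assumption made only for this example.

{code:java}
// Hedged back-of-the-envelope check: does a given FRAG2.frag_size fit into a single
// UDP datagram that itself fits into one link-layer frame (i.e. no IP fragmentation)?
public class FragSizeVsMtu {
    static boolean fitsWithoutIpFragmentation(int fragSize, int mtu) {
        int ipHeader = 20, udpHeader = 8;
        int jgroupsHeaderAllowance = 200;  // rough assumption, not a JGroups constant
        return fragSize <= mtu - ipHeader - udpHeader - jgroupsHeaderAllowance;
    }

    public static void main(String[] args) {
        System.out.println("frag_size=8000, MTU=9000 (cluster01-10): "
                + fitsWithoutIpFragmentation(8000, 9000));  // true  -> no IP fragmentation
        System.out.println("frag_size=8000, MTU=1500 (perf01-10):    "
                + fitsWithoutIpFragmentation(8000, 1500));  // false -> must shrink further
    }
}
{code}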

The config file I used to run UUPerf on cluster01-04 is attached (dan.xml).

[1] https://issues.jboss.org/browse/JGRP-1053

Comment 12 Michal Linhard 2012-03-05 16:03:25 UTC
In JDG 6.0.0.ER2 the 10-minute state transfer duration was no longer observed; see the performance test results: https://docspace.corp.redhat.com/docs/DOC-94439

There are still quite long state transfer periods - around 60 seconds with 5% data load and 500 clients - but this is no longer a blocker for testing... These will be reported as findings of the elasticity performance test...

Comment 13 Michal Linhard 2012-03-16 09:38:36 UTC
Appeared in the 16-node elasticity test on hyperion:
http://www.qa.jboss.com/~mlinhard/hyperion/run26-elasticity16
When the 9th node tried to join the 8-node cluster, the join took more than 10 minutes.

The 4-node version works fine on hyperion:
http://www.qa.jboss.com/~mlinhard/hyperion/run27-elasticity4

Comment 14 Michal Linhard 2012-03-16 09:45:27 UTC
Another failed 8->9 node shift:
http://www.qa.jboss.com/~mlinhard/hyperion/run25-elasticity16/

Comment 16 JBoss JIRA Server 2012-03-20 17:42:19 UTC
Michal Linhard <mlinhard> updated the status of jira ISPN-1838 to Reopened

Comment 17 JBoss JIRA Server 2012-03-20 17:42:19 UTC
Michal Linhard <mlinhard> made a comment on jira ISPN-1838

Reappeared in JDG 6.0.0.ER4 tests

Comment 18 Dan Berindei 2012-03-20 20:46:09 UTC
Michal, it looks like your test is using standalone.xml from the master branch, which is outdated, and not the latest config in the prod-6.0.0 branch.

Comment 19 JBoss JIRA Server 2012-03-28 08:38:48 UTC
Galder Zamarreño <galder.zamarreno> made a comment on jira ISPN-1838

Michal/Dan, what's up with this?

Comment 20 JBoss JIRA Server 2012-03-30 10:23:01 UTC
Dan Berindei <dberinde> updated the status of jira ISPN-1838 to Resolved

Comment 21 JBoss JIRA Server 2012-03-30 10:23:01 UTC
Dan Berindei <dberinde> made a comment on jira ISPN-1838

I'm pretty sure this only appeared again because the test was using an old config, so I'm marking it Done.

Comment 22 mark yarborough 2012-04-04 13:08:44 UTC
Prabhat indicates he is still looking for improvement, so moving this to CR1 and taking it out of ON_QA...

Comment 23 mark yarborough 2012-04-04 13:08:44 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
CCFR - Prabhat

Comment 24 Michal Linhard 2012-04-04 13:19:31 UTC
We haven't seen this in ER6 tests.

Comment 25 Michal Linhard 2012-04-04 13:21:29 UTC
I agree with Dan's comment; this appeared due to the old config used in those particular tests.

Comment 26 JBoss JIRA Server 2012-04-13 14:05:20 UTC
Sanne Grinovero <sanne.grinovero> made a comment on jira ISPN-1838

Would it be better to mark such cases as "rejected"/"can't reproduce"/...? Seeing this one in the changelog scared me, but reading the comments I see it wasn't actually a problem.

Comment 27 JBoss JIRA Server 2012-04-13 14:31:16 UTC
Manik Surtani <manik.surtani> updated the status of jira ISPN-1838 to Reopened

Comment 28 JBoss JIRA Server 2012-04-13 14:31:32 UTC
Manik Surtani <manik.surtani> updated the status of jira ISPN-1838 to Resolved

Comment 29 Misha H. Ali 2012-06-06 03:29:24 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1 +1 @@
-CCFR - Prabhat+<remark>CCFR - Prabhat</remark>

Comment 30 prabhat jha 2012-06-08 14:27:06 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1 +1 @@
-<remark>CCFR - Prabhat</remark>+This problem has been fixed and we have not observed this behavior either with small cluster or large cluster.

Comment 31 Michal Linhard 2012-06-08 14:47:31 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1 +1 @@
-This problem has been fixed and we have not observed this behavior either with small cluster or large cluster.+This problem happened only once and we've been unable to reproduce it since then. Even in case it occurs it's only a low risk performance problem.

Comment 32 Misha H. Ali 2012-06-12 15:05:00 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1 +1,3 @@
-This problem happened only once and we've been unable to reproduce it since then. Even in case it occurs it's only a low risk performance problem.+When JBoss Data Grid operates with four nodes, if the second node crashes and then is brought up again, the subsequent state transfer takes a very long time to conclude. This occurs despite a mild data load (5% of heap and client load). The data load concludes within a minute while the state transfer unexpectedly requires more time than this.
+</para><para>
+This problem has only occurred once and has not been reproduced in tests since. It is a low risk performance problem, if it occurs within JBoss Data Grid.

Comment 33 prabhat jha 2012-06-12 15:11:27 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1,3 +1,3 @@
-When JBoss Data Grid operates with four nodes, if the second node crashes and then is brought up again, the subsequent state transfer takes a very long time to conclude. This occurs despite a mild data load (5% of heap and client load). The data load concludes within a minute while the state transfer unexpectedly requires more time than this.
+With JBoss Data Grid cluster of size 4 nodes or higher, if a node crashes and then is brought up again, the subsequent state transfer takes a very long time to conclude. This occurs despite a mild data load (5% of heap and client load). The data load concludes within a minute while the state transfer unexpectedly requires more time than this.
 </para><para>
-This problem has only occurred once and has not been reproduced in tests since. It is a low risk performance problem, if it occurs within JBoss Data Grid.+This problem has only occurred once and has not been reproduced in tests since. It is a low risk performance problem.

