EDG build: http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-QE/job/edg-60-build-edg-from-source/101/ Problematic run: http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-QE/job/edg-60-build-edg-from-source/101/ This is a 4-node failover test where we crash node2 after 5 minutes and then bring it back up again after another 5 minutes. After the restart, the resulting state transfer takes forever, even though we have only a mild data load (5% of heap) and a client load of 500. It took under 1 minute to load all the data, and the state transfer should take less than that...
I think we can see this problem again here: http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-REPORTS-RESILIENCE/job/edg-60-failover-dist-basic/23/artifact/report/serverlogs.zip It's a TRACE log and it's pure Infinispan (build http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-QE/job/edg-60-build-infinispan-from-source/45/)
On node03 (the coordinator) we can see that the preparation of cache view 8 for the testCache cache fails after a 10-minute timeout.
Michal Linhard <mlinhard> made a comment on jira ISPN-1838 I forgot: the trace logs from the client side (edg-perf06, edg-perf07 console outputs) were too big (> 2.5 GB), so I zipped them and they're available here, should you be interested in the client-side trace logs: for http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-REPORTS-RESILIENCE/job/edg-60-failover-dist-basic/23 http://www.qa.jboss.com/~mlinhard/b23c1.zip http://www.qa.jboss.com/~mlinhard/b23c2.zip for http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-REPORTS-RESILIENCE/job/edg-60-failover-dist-basic/24 http://www.qa.jboss.com/~mlinhard/b24c1.zip http://www.qa.jboss.com/~mlinhard/b24c2.zip
Dan Berindei <dan.berindei> made a comment on jira ISPN-1838 The logs are quite weird: node-01 sends an APPLY_STATE command to node-02 for view 8 which never seems to reach its destination. Each of the surviving nodes (node-01, node-03, node-04) sent an APPLY_STATE command to node-02. For the node-03 command it took 1 second to receive and unmarshal it, for the node-04 command it took 1 minute, and the node-01 command never started executing (either because it never arrived or because unmarshalling took too long).
{noformat}
2012-02-02 05:18:17,479 TRACE [RpcManagerImpl] (transport-thread-4) edg-perf03-36944 broadcasting call StateTransferControlCommand{cache=testCache, type=APPLY_STATE, sender=edg-perf03-36944, viewId=8, state=1806} to recipient list [edg-perf02-21799]
2012-02-02 05:18:18,479 TRACE [CommandAwareRpcDispatcher] (OOB-100,edg-perf02-21799) Attempting to execute command: StateTransferControlCommand{cache=testCache, type=APPLY_STATE, sender=edg-perf03-36944, viewId=8, state=1806} [sender=edg-perf03-36944]
{noformat}
{noformat}
2012-02-02 05:18:17,468 TRACE [CommandAwareRpcDispatcher] (transport-thread-5) Replication task sending StateTransferControlCommand{cache=testCache, type=APPLY_STATE, sender=edg-perf04-45788, viewId=8, state=1443} to addresses [edg-perf02-21799]
2012-02-02 05:19:16,630 TRACE [CommandAwareRpcDispatcher] (OOB-98,edg-perf02-21799) Attempting to execute command: StateTransferControlCommand{cache=testCache, type=APPLY_STATE, sender=edg-perf04-45788, viewId=8, state=1443} [sender=edg-perf04-45788]
{noformat}
{noformat}
2012-02-02 05:18:17,468 TRACE [CommandAwareRpcDispatcher] (transport-thread-5) Replication task sending StateTransferControlCommand{cache=testCache, type=APPLY_STATE, sender=edg-perf01-47003, viewId=8, state=1903} to addresses [edg-perf02-21799]
???
{noformat}
Michal, can you run the test again with TRACE enabled for org.jgroups and for org.infinispan.marshall as well?
Dan Berindei <dan.berindei> made a comment on jira ISPN-1838 For comparison, here are the logs for the next view's APPLY_STATE commands. They all take < 100ms from sending to execution on the target:
{noformat}
2012-02-02 05:28:17,689 TRACE [CommandAwareRpcDispatcher] (transport-thread-6) Replication task sending StateTransferControlCommand{cache=testCache, type=APPLY_STATE, sender=edg-perf04-45788, viewId=10, state=1443} to addresses [edg-perf02-21799]
2012-02-02 05:28:17,751 TRACE [CommandAwareRpcDispatcher] (OOB-99,edg-perf02-21799) Attempting to execute command: StateTransferControlCommand{cache=testCache, type=APPLY_STATE, sender=edg-perf04-45788, viewId=10, state=1443} [sender=edg-perf04-45788]
{noformat}
{noformat}
2012-02-02 05:28:17,724 TRACE [CommandAwareRpcDispatcher] (transport-thread-24) Replication task sending StateTransferControlCommand{cache=testCache, type=APPLY_STATE, sender=edg-perf03-36944, viewId=10, state=1806} to addresses [edg-perf02-21799]
2012-02-02 05:28:17,799 TRACE [CommandAwareRpcDispatcher] (OOB-100,edg-perf02-21799) Attempting to execute command: StateTransferControlCommand{cache=testCache, type=APPLY_STATE, sender=edg-perf03-36944, viewId=10, state=1806} [sender=edg-perf03-36944]
{noformat}
{noformat}
2012-02-02 05:28:17,792 TRACE [CommandAwareRpcDispatcher] (transport-thread-6) Replication task sending StateTransferControlCommand{cache=testCache, type=APPLY_STATE, sender=edg-perf01-47003, viewId=10, state=1903} to addresses [edg-perf02-21799]
2012-02-02 05:28:17,876 TRACE [CommandAwareRpcDispatcher] (OOB-98,edg-perf02-21799) Attempting to execute command: StateTransferControlCommand{cache=testCache, type=APPLY_STATE, sender=edg-perf01-47003, viewId=10, state=1903} [sender=edg-perf01-47003]
{noformat}
One difference that might be significant is that in view 8 all three nodes send their commands at exactly the same time (within only 1ms of each other).
In view 10 the commands are more spread out, there's about 100ms between the first and the last send.
Michal Linhard <mlinhard> made a comment on jira ISPN-1838 Hi Dan, back from PTO... I guess the TRACE log request is still relevant? Is it OK to leave out the client-side trace logs (or is DEBUG level enough)? They're quite big and Vojta (the Hudson admin) needs to delete them manually after each such run...
Reproduced for DR2: http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-QE/job/edg-60-experiments-mlinhard/198/
Dan Berindei <dan.berindei> made a comment on jira ISPN-1838 Filtered JGroups logs from this test run: http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-QE/job/edg-60-experiments-mlinhard/200/
Dan Berindei <dan.berindei> made a comment on jira ISPN-1838 This looks like a bad interaction between our state transfer behaviour on join and UNICAST2. I wrote a test called UUPerf, modeled after the existing UPerf but modeling our state transfer behaviour instead: all the nodes send very few, very large messages to a single node (the last to join the cluster). The code is here: https://github.com/danberindei/JGroups/commit/8bdbce0d0fe3e5a65caf5d28e435853f29a20ff3
I have run the test with Infinispan's jgroups-udp.xml and jgroups-tcp.xml configurations on cluster01..04, and I have attached the results as uuperf-udp.txt and uuperf-tcp.txt.
As I increased the number of nodes in the cluster, with TCP the throughput drops relatively quickly because of (I assume) collisions on the network between the switch and the joiner. With UDP, however, the effect is much more pronounced; I even got a timeout after 400 seconds (thus reproducing Michal's original report).
The problem seems to be that UNICAST2 (just like NAKACK(2)) can't detect missing messages when they are the last messages sent by a node. So if there are lots of collisions on the wire, there is a high chance that the missing messages will only be detected when the STABLE timer expires (every 50 seconds in our configurations). However, since the joiner then sends retransmission requests to all the nodes at the same time, there is again a high chance that the retransmission responses will collide and repeat the cycle.
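For reference, the UNICAST2/STABLE combination described in that comment sits in the JGroups stack roughly as sketched below. This is an illustrative fragment, not a copy of the actual jgroups-udp.xml; the only value taken from the comment is the ~50-second stability interval, everything else is an assumed placeholder:

```xml
<!-- Illustrative fragment of a JGroups UDP stack (NOT the real jgroups-udp.xml). -->
<!-- UNICAST2, like NAKACK(2), is a negative-ack protocol: a receiver only requests
     retransmission when a later message reveals a sequence gap, so a lost *last*
     message stays undetected until the stability protocol runs. -->
<UNICAST2 max_bytes="1M"/>
<!-- desired_avg_gossip="50000" corresponds to the ~50s timer mentioned above: only
     when it fires can the joiner notice the missing tail messages, and it then asks
     all senders for retransmission at once, so the responses can collide again. -->
<pbcast.STABLE stability_delay="1000"
               desired_avg_gossip="50000"
               max_bytes="4M"/>
```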
Dan Berindei <dan.berindei> made a comment on jira ISPN-1838 I discussed this with Bela yesterday and we thought of several workarounds:
1. Stagger the APPLY_STATE commands - add a random 0-500ms delay before sending the command.
2. Make the chunk size smaller to decrease the chance of collisions.
3. Use RSVP for APPLY_STATE commands - would require JGroups 3.1.
4. Use UNICAST instead of UNICAST2.
5. Decrease the STABLE timeout - it can't be lowered too much, as it is quite expensive.
6. Decrease FRAG2.frag_size (currently 60000 bytes). When there is a collision, an entire fragment has to be discarded and sent again, even if the Ethernet frame that collided was only 1500 bytes.
I have tried options #2 (500KB messages), #5 (10 seconds) and #6 (15000 bytes), but I still got a timeout after 400 seconds with 4 senders and 1 receiver. I have also tried option #4; the results are in the attached uuperf-unicast1.txt. It still scales worse than UNICAST2 over TCP, but it's much better than UNICAST2 over UDP.
I haven't tried options #1 and #3 yet, but I suspect #1 will only make the problem harder to reproduce (because once STABLE kicks in, the retransmission responses are not staggered).
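As a sketch, options #5 and #6 translate into stack tweaks along these lines. The 15000-byte and 10-second values are the ones the comment says were tried; the surrounding attributes are assumed defaults, not taken from the test configs:

```xml
<!-- Option #6 (assumed placement): smaller fragments, so one collided 1500-byte
     Ethernet frame forces a resend of at most 15000 bytes instead of 60000. -->
<FRAG2 frag_size="15000"/>
<!-- Option #5 (assumed placement): run stability every ~10s instead of ~50s so
     missing tail messages are detected sooner. STABLE rounds are expensive,
     which is why the interval cannot be lowered much further. -->
<pbcast.STABLE stability_delay="1000"
               desired_avg_gossip="10000"
               max_bytes="4M"/>
```

Per the comment, even with these values the 4-senders/1-receiver run still timed out after 400 seconds, so the tweaks mitigate the symptom rather than remove the underlying last-message detection gap.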
Bela Ban <bela> made a comment on jira ISPN-1838 I ran UUPerf on my local cluster and got good results (between 30 and 40 reqs/sec/node for a 4 node cluster). However, the same test (same config and same JGroups version) had abysmal performance on cluster01-04! I think this is due to a switch that is either not optimally configured for UDP datagrams and/or has small buffers for datagram packets. Note that I've had this suspicion before, as the MPerf numbers are also bad...
The TCP transport scales its send window down to reduce packet loss and subsequent retransmission. However, JGroups' UFC (Unicast Flow Control) protocol has a fixed window size (number of credits). (This will be changed in 3.2 or 3.3, see [1].) Due to this, messages are sent too fast, causing packet loss at the switch (still my assumption!) and retransmission, which is costly and generates more traffic.
There are 2 things that correct this:
#1 Reduce UFC.max_bytes to (say) 200K.
#2 Reduce FRAG2.frag_size to 8000. Note that the MTU on cluster01-10 is 9000, so a frag_size of 8K won't lead to IP fragmentation. On other systems, such as perf01-10 where the MTU is 1500, we'd have to experiment with even smaller sizes.
The config file I used to run UUPerf on cluster01-04 is attached (dan.xml). [1] https://issues.jboss.org/browse/JGRP-1053
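The two changes Bela suggests might look like this in a stack such as the attached dan.xml. Only the 200K and 8000 values come from the comment; the placement and everything else is an assumption for illustration:

```xml
<!-- #2: frag_size 8000 stays under the 9000-byte MTU on cluster01-10, so no IP
     fragmentation, and a single collision costs at most ~8KB of retransmission. -->
<FRAG2 frag_size="8000"/>
...
<!-- #1: cap the un-acknowledged unicast bytes per receiver at ~200K so that the
     fixed credit window cannot overrun the switch's datagram buffers. -->
<UFC max_bytes="200K"/>
```

On networks with a 1500-byte MTU (like perf01-10 above), the frag_size would need to be experimented with further downward.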
In JDG 6.0.0.ER2 the 10-minute state transfer duration was no longer observed; see the performance test results: https://docspace.corp.redhat.com/docs/DOC-94439 There are still quite long state transfer periods - around 60 seconds with 5% data load and 500 clients - but this is no longer a blocker for testing. These will be reported as findings of the elasticity performance test...
Appeared in the 16-node elasticity test on hyperion: http://www.qa.jboss.com/~mlinhard/hyperion/run26-elasticity16 When the 9th node tried to join the 8-node cluster, the join took more than 10 minutes. The 4-node version works fine on hyperion: http://www.qa.jboss.com/~mlinhard/hyperion/run27-elasticity4
Another failed 8 -> 9 node shift: http://www.qa.jboss.com/~mlinhard/hyperion/run25-elasticity16/
TRACE log: http://www.qa.jboss.com/~mlinhard/hyperion/run29-trace/report/serverlogs.zip
Michal Linhard <mlinhard> updated the status of jira ISPN-1838 to Reopened
Michal Linhard <mlinhard> made a comment on jira ISPN-1838 Reappeared in the JDG 6.0.0.ER4 tests
Michal, it looks like your test is using standalone.xml from the master branch, which is outdated, and not the latest config in the prod-6.0.0 branch.
Galder Zamarreño <galder.zamarreno> made a comment on jira ISPN-1838 Michal/Dan, what's up with this?
Dan Berindei <dberinde> updated the status of jira ISPN-1838 to Resolved
Dan Berindei <dberinde> made a comment on jira ISPN-1838 I'm pretty sure this only appeared again because the test was using an old config, so I'm marking it Done.
Prabhat indicates he is still looking for improvement, so moving to CR1 and taking out of ON_QA...
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: CCFR - Prabhat
We haven't seen this in ER6 tests.
I agree with Dan's comment: this appeared due to an old config being used in these particular tests.
Sanne Grinovero <sanne.grinovero> made a comment on jira ISPN-1838 Would it be better to mark such cases as "rejected"/"can't reproduce"/...? Seeing this one in the changelog scared me, but reading the comments I see this wasn't actually a problem.
Manik Surtani <manik.surtani> updated the status of jira ISPN-1838 to Reopened
Manik Surtani <manik.surtani> updated the status of jira ISPN-1838 to Resolved
Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. Diffed Contents: @@ -1 +1 @@ -CCFR - Prabhat+<remark>CCFR - Prabhat</remark>
Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. Diffed Contents: @@ -1 +1 @@ -<remark>CCFR - Prabhat</remark>+This problem has been fixed and we have not observed this behavior either with small cluster or large cluster.
Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. Diffed Contents: @@ -1 +1 @@ -This problem has been fixed and we have not observed this behavior either with small cluster or large cluster.+This problem happened only once and we've been unable to reproduce it since then. Even in case it occurs it's only a low risk performance problem.
Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. Diffed Contents: @@ -1 +1,3 @@ -This problem happened only once and we've been unable to reproduce it since then. Even in case it occurs it's only a low risk performance problem.+When JBoss Data Grid operates with four nodes, if the second node crashes and then is brought up again, the subsequent state transfer takes a very long time to conclude. This occurs despite a mild data load (5% of heap and client load). The data load concludes within a minute while the state transfer unexpectedly requires more time than this. +</para><para> +This problem has only occurred once and has not been reproduced in tests since. It is a low risk performance problem, if it occurs within JBoss Data Grid.
Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. Diffed Contents: @@ -1,3 +1,3 @@ -When JBoss Data Grid operates with four nodes, if the second node crashes and then is brought up again, the subsequent state transfer takes a very long time to conclude. This occurs despite a mild data load (5% of heap and client load). The data load concludes within a minute while the state transfer unexpectedly requires more time than this. +With JBoss Data Grid cluster of size 4 nodes or higher, if a node crashes and then is brought up again, the subsequent state transfer takes a very long time to conclude. This occurs despite a mild data load (5% of heap and client load). The data load concludes within a minute while the state transfer unexpectedly requires more time than this. </para><para> -This problem has only occurred once and has not been reproduced in tests since. It is a low risk performance problem, if it occurs within JBoss Data Grid.+This problem has only occurred once and has not been reproduced in tests since. It is a low risk performance problem.