Bug 1104045 - Failing elasticity tests with 32 nodes
Summary: Failing elasticity tests with 32 nodes
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: JBoss Data Grid 6
Classification: JBoss
Component: Infinispan, JGroups
Version: 6.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: CR3
Target Release: 6.3.0
Assignee: William Burns
QA Contact: Martin Gencur
URL:
Whiteboard:
Depends On: 1116963 1116965 1116967 1116969 1117948
Blocks: 1104639
 
Reported: 2014-06-03 07:23 UTC by Radim Vansa
Modified: 2015-01-26 14:05 UTC
CC List: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-01-26 14:05:17 UTC
Type: Bug
Embargoed:


Attachments
RadarGun 2.0 benchmark (2.28 KB, application/xml) - 2014-06-18 14:19 UTC, Radim Vansa
Infinispan configuration (1.98 KB, text/xml) - 2014-06-18 14:20 UTC, Radim Vansa
JGroups configuration (4.07 KB, text/xml) - 2014-06-18 14:20 UTC, Radim Vansa


Links (Red Hat Issue Tracker)
ISPN-4154 (Critical, Resolved): Cancelled segment transfer causes future entry transfer to be ignored - last updated 2015-09-11 14:04:28 UTC
ISPN-4480 (Major, Resolved): Messages sent to leavers can clog the JGroups bundler thread - last updated 2015-09-11 14:04:29 UTC
ISPN-4484 (Critical, Resolved): Outbound transfers can be cancelled by old CANCEL_STATE_TRANSFER command - last updated 2015-09-11 14:04:29 UTC
ISPN-4490 (Major, Resolved): Members can miss the rebalance cancellation on coordinator change - last updated 2015-09-11 14:04:29 UTC

Description Radim Vansa 2014-06-03 07:23:57 UTC
For the past two weeks I've been trying to set up an elasticity test with ISPN in distributed mode (2 owners, transactions enabled), with both UDP and TCP JGroups configurations. The test looks like this (a minimal configuration sketch follows the steps):

1) start 32 nodes
2) insert 10,000,000 entries into the cluster (1024 byte values)
3) stop 1 node
4) wait for a period in which no @TopologyChanged or @DataRehashed event happens for 10 seconds (more precisely, once an event with .isPre=true occurs, wait for the matching .isPre=false event and only then start counting the 10 seconds)
5) check that we still have 10,000,000 entries + the backups
6) if we have more than 16 nodes, go to 3)
7) start 1 node
8) wait for a period of no events (same as 4)
9) check that we still have 10,000,000 entries + the backups
10) if we have less than 32 nodes, go to 7)
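
For reference, a cache set up as described (distributed mode, 2 owners, transactional) would look roughly like this in the Infinispan 6.x XML schema. This is only a sketch; the cache name testCache and cluster name default are taken from the logs below, and the attached Infinispan configuration remains the authoritative one:

<infinispan xmlns="urn:infinispan:config:6.0">
   <global>
      <transport clusterName="default"/>
   </global>
   <namedCache name="testCache">
      <!-- distributed cache with 2 owners; synchronous replication assumed -->
      <clustering mode="distribution">
         <hash numOwners="2"/>
         <sync/>
      </clustering>
      <!-- transactions enabled, as described above -->
      <transaction transactionMode="TRANSACTIONAL"/>
   </namedCache>
</infinispan>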

There are two problems I'm facing whose cause I haven't been able to discover:
a) (the bigger one) The JGroups cluster sometimes breaks - some nodes get suspected, etc. The failure-detection stack was FD_ALL (15 s timeout, 3 s heartbeat period) + FD_SOCK + VERIFY_SUSPECT (5 s timeout); see the sketch below.
b) Although the cluster is under continuous load (10 threads on each node issuing put/get requests), I sometimes see that no requests complete for several seconds.
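
For reference, the failure-detection settings from a) correspond roughly to the following section of the JGroups XML stack (values taken from the description above; the attached JGroups configuration is the authoritative one):

<!-- failure detection as described: 15 s FD_ALL timeout, 3 s heartbeat, 5 s suspect verification -->
<FD_SOCK/>
<FD_ALL timeout="15000" interval="3000"/>
<VERIFY_SUSPECT timeout="5000"/>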

I have taken thread dumps during these periods, as well as JGroups dumps (number of messages not acked, messages detected as missing, etc.), but nothing gave me any clue.

We regularly run elasticity tests on clusters of 8 nodes and this does not happen there - it seems the probability grows with cluster size, so it simply was not detected on the smaller clusters.
In the past we have run similar tests on clusters of 32 nodes without encountering these problems.

Therefore, I think it would be great if the developers tried to run a similar test themselves and see whether they can get a scaling cluster to work.

Comment 2 Radim Vansa 2014-06-18 14:19:38 UTC
Created attachment 910034 [details]
RadarGun 2.0 benchmark

Comment 3 Radim Vansa 2014-06-18 14:20:18 UTC
Created attachment 910035 [details]
Infinispan configuration

Comment 4 Radim Vansa 2014-06-18 14:20:50 UTC
Created attachment 910036 [details]
JGroups configuration

Comment 5 Dan Berindei 2014-07-02 17:10:25 UTC
Alan ran the test a few more times in the Hyperion environment and provided logs. Initially it seemed like he got a different error related to https://issues.jboss.org/browse/ISPN-4154, but it turns out that all the failures were caused by JGroups spuriously suspecting a node (with or without a merge afterwards).

I'm not sure exactly why this happens, but I've noticed that the TransferQueueBundler thread seems to spend a lot of time waiting for physical addresses:

06:40:10,316 WARN  [org.radargun.utils.Utils] (pool-5-thread-1) Stack for thread TransferQueueBundler,default,apex953-14666:
java.lang.Thread.sleep(Native Method)
org.jgroups.util.Util.sleep(Util.java:1504)
org.jgroups.util.Util.sleepRandom(Util.java:1574)
org.jgroups.protocols.TP.sendToSingleMember(TP.java:1685)
org.jgroups.protocols.TP.doSend(TP.java:1670)
org.jgroups.protocols.TP$TransferQueueBundler.sendBundledMessages(TP.java:2476)
org.jgroups.protocols.TP$TransferQueueBundler.sendMessages(TP.java:2392)
org.jgroups.protocols.TP$TransferQueueBundler.run(TP.java:2383)
java.lang.Thread.run(Thread.java:744)

There are also quite a few "no physical address for x" warnings in the logs, with x not being in the latest JGroups view. I'm wondering whether our nodes may build up a backlog of messages to leavers that prevents them from sending legitimate messages (like FD_ALL heartbeats).

Comment 6 Dan Berindei 2014-07-03 11:31:52 UTC
Bela has agreed that waiting for the physical addresses of members that have left could be a problem. There are two bugs related to this that are already fixed in JGroups 3.5.0.Beta2+:

https://issues.jboss.org/browse/JGRP-1814
https://issues.jboss.org/browse/JGRP-1815

There is also a special case where the physical address could be removed from the cache too soon, exacerbating the effect of JGRP-1815:

https://issues.jboss.org/browse/JGRP-1858


There are some JGroups configuration options that should mitigate the problem:

* TP.logical_addr_cache_expiration=86400000
* TP.physical_addr_max_fetch_attempts=1
* UNICAST3.conn_close_timeout=10000

Hopefully the test failures will go away with these settings.
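
As a rough sketch, in the XML stack these map to attributes on the existing transport and UNICAST3 elements (only the relevant attributes are shown; TCP stands in for whichever transport is in use):

<!-- added to the existing transport element (TCP here); other attributes omitted -->
<TCP logical_addr_cache_expiration="86400000"
     physical_addr_max_fetch_attempts="1"/>
<!-- added to the existing UNICAST3 element; other attributes omitted -->
<UNICAST3 conn_close_timeout="10000"/>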

Comment 7 William Burns 2014-07-03 18:07:29 UTC
I have run the tests a few more times with the first two settings that Dan mentioned in the previous comment.

* TP.logical_addr_cache_expiration=86400000
* TP.physical_addr_max_fetch_attempts=1

The first attempt (with tracing) completed successfully, and so did the next 2 runs (without tracing).

However, I then ran it a 4th time and ran into an issue. Of the 16 nodes that were supposed to come back up, the 15th node didn't join the cluster properly (it formed its own) for some reason. Unfortunately tracing was not enabled, so I can't tell what happened.

09:23:52,972 INFO  [org.radargun.Slave] (main) Starting stage ServiceStart {clusterFormationTimeout=3 mins 0 secs, delayAfterFirstSlaveStarts=5.000 secs, delayBetweenStartingSlaves=0.500 secs, expectNumSlaves=31, groups=null, mayFailOn=null, reachable=null, roles=null, slaves=[ 14 ], staggerSlaveStartup=true, validateCluster=true }
09:23:52,972 INFO  [org.radargun.stages.lifecycle.ServiceStartStage] (main) Startup staggering, this is the slave with index 0, not sleeping
09:23:52,972 INFO  [org.radargun.stages.lifecycle.ServiceStartStage] (main) Ack master's StartCluster stage. Local address is: /192.168.118.83. This slave's index is: 14
09:23:52,972 INFO  [org.radargun.service.Infinispan52Lifecycle] (main) Infinispan version: Infinispan 'Infinium' 6.1.0-SNAPSHOT
09:23:52,972 DEBUG [org.radargun.service.Infinispan52Lifecycle] (main) Loading JGroups from: file:/home/field11/jenkins/home/jobs/hyperion-jenkins-job-resilience/workspace/radargun/node15/plugins/jdg63/lib/jgroups-3.4.4.Final-redhat-5.jar
09:23:52,972 INFO  [org.radargun.service.Infinispan52Lifecycle] (main) JGroups version: JGroups 3.4.4.Final-redhat-5
09:23:53,029 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (main) ISPN000078: Starting JGroups Channel
09:23:53,116 INFO  [org.jgroups.protocols.UFC] (main) UFC is not needed (and can be removed) as we're running on a TCP transport
09:23:53,123 WARN  [org.jgroups.protocols.TCP] (main) JGRP000028: bundler_type transfer-queue not known; using default bundler (new3)
09:23:56,126 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (main) ISPN000094: Received new cluster view: [apex946-35697|0] (1) [apex946-35697]
09:23:56,215 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (main) ISPN000079: Cache local address is apex946-35697, physical addresses are [192.168.86.83:7801]

I will try to run it some more with full tracing to see if I can get it to occur again.

Comment 8 William Burns 2014-07-03 21:49:14 UTC
I ran it 3 more times as well: 2 successes and 1 failure.

The logs can be found at:

https://spideroak.com/storage/OJSWI2DBORPXK43FOJPWC5LUN5PTCNRXHE2Q/shared/651705-1-1001/stdout-31.zip?90f16489bc37e206a2a5d49ab0172e47

It appears that apex958 went down unexpectedly, at the same time as the routine shutdown (of apex948). I haven't been able to look into it in further detail.

node18.log:12:16:41,285 DEBUG [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-20,default,apex949-20948) Joined: [], Left: [apex948-56418, apex958-62705]

Comment 9 William Burns 2014-07-04 01:57:07 UTC
The previous link was actually for the run when tracing was not enabled.

The log files for the run where tracing was enabled and the node went down are actually from Jenkins Run 33:

https://spideroak.com/storage/OJSWI2DBORPXK43FOJPWC5LUN5PTCNRXHE2Q/shared/651705-1-1002/stdout-33.zip?4d5c350dfc9c2235773250bbf57d31c6

Comment 10 Dan Berindei 2014-07-04 12:31:30 UTC
The problem in build 31 seems to be that the initial discovery from apex946 didn't work. We don't have any logs for that run, but the GC logs for build 33 show some Full GC pauses of almost 3 seconds, and GMS.join_timeout was set to only 3 seconds, so increasing GMS.join_timeout might avoid this problem.
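
For illustration, that would be a single attribute on the GMS element in the JGroups XML; the 6000 ms value below is only an example, not a tested recommendation:

<!-- example only: raise join_timeout above the observed full-GC pause length -->
<pbcast.GMS join_timeout="6000"/>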

The problem in build 33 seems to be more serious. apex948/node18 didn't really go down; the message posted by Will was logged after the master had started shutting down the cluster. However, it did have problems receiving all the state for its added segments:

11:56:40,247 TRACE [org.infinispan.statetransfer.StateConsumerImpl] (remote-thread-22) Segments not received yet for cache testCache: {apex959-28734=[InboundTransferTask{segments=[461, 48], finishedSegments=[48], 
unfinishedSegments=[461], source=apex959-28734, isCancelled=false, isStartedSuccessfully=true, isCompletedSuccessfully=false, topologyId=157, timeout=600000, cacheName=testCache}], apex1081-29902=[InboundTransferTask{segments=[297], finishedSegments=[], 
unfinishedSegments=[297], source=apex1081-29902, isCancelled=false, isStartedSuccessfully=false, isCompletedSuccessfully=false, topologyId=157, timeout=600000, cacheName=testCache}], apex961-24039=[InboundTransferTask{segments=[164], finishedSegments=[], 
unfinishedSegments=[164], source=apex961-24039, isCancelled=false, isStartedSuccessfully=false, isCompletedSuccessfully=false, topologyId=157, timeout=600000, cacheName=testCache}]}

The sending thread on apex959 seems to have been interrupted, but I can't tell why:

11:56:00,602 ERROR [org.infinispan.remoting.rpc.RpcManagerImpl] (transport-thread-20) ISPN000073: Unexpected error while replicating
java.lang.InterruptedException
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2188)
	at org.jgroups.blocks.Request.responsesComplete(Request.java:197)
	at org.jgroups.blocks.Request.execute(Request.java:89)
	at org.jgroups.blocks.MessageDispatcher.sendMessage(MessageDispatcher.java:399)
	at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.processSingleCall(CommandAwareRpcDispatcher.java:349)
	at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommand(CommandAwareRpcDispatcher.java:167)
	at org.infinispan.remoting.transport.jgroups.JGroupsTransport.invokeRemotely(JGroupsTransport.java:523)
	at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:281)
	at org.infinispan.statetransfer.OutboundTransferTask.sendEntries(OutboundTransferTask.java:233)
	at org.infinispan.statetransfer.OutboundTransferTask.sendEntry(OutboundTransferTask.java:189)
	at org.infinispan.statetransfer.OutboundTransferTask.run(OutboundTransferTask.java:145)

Comment 11 Dan Berindei 2014-07-04 13:15:09 UTC
I got confused in comment 10: node apex948 didn't have any problem. It was apex949 that didn't receive all its segments. This happened after the test stopped node apex947 (the last node that was supposed to be stopped).

Just as apex947 left, it started a rebalance, which apex948 dutifully cancelled as it became the new coordinator. apex949 had already requested segments from apex959, so it sent a StateRequestCommand(CANCEL_STATE_TRANSFER) asynchronously to apex959. Then apex948 started a new rebalance, and apex949 asked apex959 for the same segments. When apex959 finally received the cancel request, it didn't check the topology id and it incorrectly cancelled the outbound transfer to apex949.

The solution would be to verify the topology id in the CANCEL_STATE_TRANSFER command before cancelling the transfer. I also think we can avoid sending the cancel command altogether in this case and only send it when we are about to stop.

Comment 12 William Burns 2014-07-07 18:28:43 UTC
Okay I ran an extra 3 sets using https://github.com/infinispan/jdg/tree/BZ1104045/ISPN-4154/failing_elasticity_test as my baseline.

I had 2 successes then this failure:

https://spideroak.com/storage/OJSWI2DBORPXK43FOJPWC5LUN5PTCNRXHE2Q/shared/651705-1-1003/stdout-37.zip?108be3edf5d58aa7370817c4ce650009

It appears the first node to die had an issue right away and didn't properly replicate its data to the other nodes as it should have.

Comment 13 William Burns 2014-07-07 18:45:43 UTC
This was caused by a join issue again.

10:43:04,798 INFO  [org.jgroups.protocols.UFC] (main) UFC is not needed (and can be removed) as we're running on a TCP transport
10:43:04,821 WARN  [org.jgroups.protocols.TCP] (main) JGRP000028: bundler_type transfer-queue not known; using default bundler (new3)
10:43:04,824 DEBUG [org.jgroups.protocols.pbcast.GMS] (main) address=apex884-47061, cluster=default, physical address=192.168.86.21:7801
10:43:04,825 TRACE [org.jgroups.protocols.FD_SOCK] (FD_SOCK acceptor,default,apex884-47061) waiting for client connections on /192.168.86.21:34353
10:43:07,828 TRACE [org.jgroups.protocols.pbcast.GMS] (main) apex884-47061: initial_mbrs are apex885-63743 apex886-36329 apex887-43171 apex888-29567 apex889-5222 
10:43:07,829 TRACE [org.jgroups.protocols.pbcast.GMS] (main) apex884-47061: could not determine coordinator from responses [apex885-63743, view_id=, is_server=false, is_coord=false, logical_name=apex885-63743, physical_addrs=192.168.86.22:7801, apex886-36329, view_id=, is_server=false, is_coord=false, logical_name=apex886-36329, physical_addrs=192.168.86.23:7801, apex887-43171, view_id=, is_server=false, is_coord=false, logical_name=apex887-43171, physical_addrs=192.168.86.24:7801, apex888-29567, view_id=, is_server=false, is_coord=false, logical_name=apex888-29567, physical_addrs=192.168.86.25:7801, apex889-5222, view_id=, is_server=false, is_coord=false, logical_name=apex889-5222, physical_addrs=192.168.86.26:7801]
10:43:07,829 TRACE [org.jgroups.protocols.pbcast.GMS] (main) apex884-47061: clients to choose new coord from are: [apex884-47061, apex888-29567, apex886-36329, apex885-63743, apex889-5222, apex887-43171]
10:43:07,829 TRACE [org.jgroups.protocols.pbcast.GMS] (main) apex884-47061: I (apex884-47061) am the first of the clients, will become coordinator

Comment 14 Martin Gencur 2014-07-08 06:45:48 UTC
All linked JIRA issues have their fixes integrated in the product branch and CR2 was built with them. Setting this to ON_QA.

Comment 15 William Burns 2014-07-08 18:58:10 UTC
To prevent the latter split issue I had changed num_initial_members=10 and timeout=6000 on the MPING protocol (see the sketch below). It appears the issue was that 3 nodes came up at the same time and one of them created its own cluster since none of them were part of an existing one yet.
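
As a sketch, in the JGroups XML these are attributes on the existing MPING element (other discovery attributes omitted):

<MPING timeout="6000" num_initial_members="10"/>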

Comment 16 Mircea Markus 2014-07-09 13:56:01 UTC
The current issue under investigation is ISPN-4490. ETA today.

Comment 17 Alan Field 2014-07-15 10:29:21 UTC
Executed the elasticity test in Hyperion 3 times without a failure, and the resilience test 5 times without a failure with JDG 6.3.0 CR4. VERIFIED

Comment 18 JBoss JIRA Server 2014-07-29 09:11:04 UTC
Dan Berindei <dberinde> updated the status of jira ISPN-4480 to Resolved

