For the past 2 weeks I've been trying to set up an elasticity test with ISPN in distributed mode, 2 owners, with transactions - with both UDP and TCP JGroups configurations. The test looks like this:

1) start 32 nodes
2) insert 10,000,000 entries into the cluster (1024-byte values)
3) stop 1 node
4) wait for a period when no @TopologyChanged or @DataRehashed event happens for 10 seconds (more precisely, when an event with .isPre=true occurs, wait for the matching .isPre=false event and then the 10 seconds)
5) check that we still have 10,000,000 entries + the backups
6) if we have more than 16 nodes, go to 3)
7) start 1 node
8) wait for a period of no events (same as 4)
9) check that we still have 10,000,000 entries + the backups
10) if we have less than 32 nodes, go to 7)

There are two problems I'm facing, and I haven't been able to discover the cause of either:

a) (the bigger one) The JGroups cluster sometimes breaks - some nodes get suspected etc. FD_ALL (15s timeout, 3s heartbeat period) + FD_SOCK + VERIFY_SUSPECT (5s timeout) were used.

b) Although the cluster is under continuous load (10 threads on each node issuing put/get requests), sometimes I can see that no requests complete for several seconds. I have taken thread dumps in these periods and JGroups dumps (number of messages not acked, detected as missing, etc.), but nothing gave me any clue.

We regularly run elasticity tests on clusters of 8 nodes and this is not happening there - it seems the probability grows with cluster size, and on smaller clusters this was not detected. In the past we've run similar tests on clusters of 32 nodes without encountering these problems.

Therefore, I think it would be great if developers tried to run a similar test themselves and see if they can get a scaling cluster working.
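For reference, the failure-detection settings described above correspond roughly to the following fragment of the JGroups stack (a sketch based on the values stated here, attribute names per JGroups 3.x; the attached configuration may differ in other attributes):

    <FD_SOCK/>
    <FD_ALL timeout="15000" interval="3000"/>
    <VERIFY_SUSPECT timeout="5000"/>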
Created attachment 910034 [details] RadarGun 2.0 benchmark
Created attachment 910035 [details] Infinispan configuration
Created attachment 910036 [details] JGroups configuration
Alan ran the test a few more times in the Hyperion environment and provided logs. Initially it seemed like he got a different error related to https://issues.jboss.org/browse/ISPN-4154, but it turns out that all the failures were caused by JGroups spuriously suspecting a node (with or without a merge afterwards).

I'm not sure exactly why this happens, but I've noticed that the TransferQueueBundler thread seems to spend a lot of time waiting for physical addresses:

06:40:10,316 WARN [org.radargun.utils.Utils] (pool-5-thread-1) Stack for thread TransferQueueBundler,default,apex953-14666:
    java.lang.Thread.sleep(Native Method)
    org.jgroups.util.Util.sleep(Util.java:1504)
    org.jgroups.util.Util.sleepRandom(Util.java:1574)
    org.jgroups.protocols.TP.sendToSingleMember(TP.java:1685)
    org.jgroups.protocols.TP.doSend(TP.java:1670)
    org.jgroups.protocols.TP$TransferQueueBundler.sendBundledMessages(TP.java:2476)
    org.jgroups.protocols.TP$TransferQueueBundler.sendMessages(TP.java:2392)
    org.jgroups.protocols.TP$TransferQueueBundler.run(TP.java:2383)
    java.lang.Thread.run(Thread.java:744)

There are also quite a few "no physical address for x" warnings in the logs, with x not being in the latest JGroups view. I'm wondering if our nodes may gather a backlog of messages to leavers that could prevent them from sending legitimate messages (like FD_ALL heartbeats).
Bela has agreed that waiting for the physical addresses of members that left could be a problem. There are 2 bugs related to this already fixed in JGroups 3.5.0.Beta2+:

https://issues.jboss.org/browse/JGRP-1814
https://issues.jboss.org/browse/JGRP-1815

There is also a special case where the physical address could be removed from the cache too soon, exacerbating the effect of JGRP-1815:

https://issues.jboss.org/browse/JGRP-1858

There are some JGroups configuration options that should mitigate the problem:

* TP.logical_addr_cache_expiration=86400000
* TP.physical_addr_max_fetch_attempts=1
* UNICAST3.conn_close_timeout=10000

Hopefully the test failures will go away with these settings.
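A sketch of how these settings map onto the attached TCP-based stack (only the relevant attributes are shown; all other attributes stay as in the attached configuration, attribute names per JGroups 3.x):

    <!-- other TCP and UNICAST3 attributes unchanged -->
    <TCP logical_addr_cache_expiration="86400000"
         physical_addr_max_fetch_attempts="1" />
    <UNICAST3 conn_close_timeout="10000" />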
I have run the tests a few more times with the first 2 settings that Dan mentioned in the previous comment:

* TP.logical_addr_cache_expiration=86400000
* TP.physical_addr_max_fetch_attempts=1

The first attempt (with tracing) completed successfully, and the same for the next 2 runs (without tracing). However, I then ran it a 4th time and ran into an issue. Of the 16 nodes that were supposed to come back up, the 15th node didn't join the cluster properly (it formed its own) for some reason. Unfortunately tracing was not enabled, so I can't tell what happened.

09:23:52,972 INFO [org.radargun.Slave] (main) Starting stage ServiceStart {clusterFormationTimeout=3 mins 0 secs, delayAfterFirstSlaveStarts=5.000 secs, delayBetweenStartingSlaves=0.500 secs, expectNumSlaves=31, groups=null, mayFailOn=null, reachable=null, roles=null, slaves=[ 14 ], staggerSlaveStartup=true, validateCluster=true }
09:23:52,972 INFO [org.radargun.stages.lifecycle.ServiceStartStage] (main) Startup staggering, this is the slave with index 0, not sleeping
09:23:52,972 INFO [org.radargun.stages.lifecycle.ServiceStartStage] (main) Ack master's StartCluster stage. Local address is: /192.168.118.83. This slave's index is: 14
09:23:52,972 INFO [org.radargun.service.Infinispan52Lifecycle] (main) Infinispan version: Infinispan 'Infinium' 6.1.0-SNAPSHOT
09:23:52,972 DEBUG [org.radargun.service.Infinispan52Lifecycle] (main) Loading JGroups from: file:/home/field11/jenkins/home/jobs/hyperion-jenkins-job-resilience/workspace/radargun/node15/plugins/jdg63/lib/jgroups-3.4.4.Final-redhat-5.jar
09:23:52,972 INFO [org.radargun.service.Infinispan52Lifecycle] (main) JGroups version: JGroups 3.4.4.Final-redhat-5
09:23:53,029 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (main) ISPN000078: Starting JGroups Channel
09:23:53,116 INFO [org.jgroups.protocols.UFC] (main) UFC is not needed (and can be removed) as we're running on a TCP transport
09:23:53,123 WARN [org.jgroups.protocols.TCP] (main) JGRP000028: bundler_type transfer-queue not known; using default bundler (new3)
09:23:56,126 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (main) ISPN000094: Received new cluster view: [apex946-35697|0] (1) [apex946-35697]
09:23:56,215 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (main) ISPN000079: Cache local address is apex946-35697, physical addresses are [192.168.86.83:7801]

I will try to run it some more with full tracing to see if I can get it to occur again.
I ran it 3 more times as well: 2 successes and 1 failure. The logs can be found at:

https://spideroak.com/storage/OJSWI2DBORPXK43FOJPWC5LUN5PTCNRXHE2Q/shared/651705-1-1001/stdout-31.zip?90f16489bc37e206a2a5d49ab0172e47

It appears that apex958 went down unexpectedly, at the same time as the routine shutdown of apex948. I haven't been able to look into it in further detail.

node18.log:12:16:41,285 DEBUG [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-20,default,apex949-20948) Joined: [], Left: [apex948-56418, apex958-62705]
The previous link was actually for the run when tracing was not enabled. The log files from the run where tracing was enabled and the node went down are from Jenkins Run 33:

https://spideroak.com/storage/OJSWI2DBORPXK43FOJPWC5LUN5PTCNRXHE2Q/shared/651705-1-1002/stdout-33.zip?4d5c350dfc9c2235773250bbf57d31c6
The problem in build 31 seems to be that the initial discovery from apex946 didn't work. We don't have any logs for that run, but the GC logs for build 33 show some Full GC pauses of almost 3 seconds, and GMS.join_timeout was set to only 3 seconds, so increasing GMS.join_timeout might avoid this problem.

The problem in build 33 seems to be more serious. apex948/node18 didn't really go down; the message posted by Will appeared after the master started shutting down the cluster. But it did have problems receiving all the state for its added segments:

11:56:40,247 TRACE [org.infinispan.statetransfer.StateConsumerImpl] (remote-thread-22) Segments not received yet for cache testCache: {apex959-28734=[InboundTransferTask{segments=[461, 48], finishedSegments=[48], unfinishedSegments=[461], source=apex959-28734, isCancelled=false, isStartedSuccessfully=true, isCompletedSuccessfully=false, topologyId=157, timeout=600000, cacheName=testCache}], apex1081-29902=[InboundTransferTask{segments=[297], finishedSegments=[], unfinishedSegments=[297], source=apex1081-29902, isCancelled=false, isStartedSuccessfully=false, isCompletedSuccessfully=false, topologyId=157, timeout=600000, cacheName=testCache}], apex961-24039=[InboundTransferTask{segments=[164], finishedSegments=[], unfinishedSegments=[164], source=apex961-24039, isCancelled=false, isStartedSuccessfully=false, isCompletedSuccessfully=false, topologyId=157, timeout=600000, cacheName=testCache}]}

The sending thread on apex959 seems to have been interrupted, but I can't tell why:

11:56:00,602 ERROR [org.infinispan.remoting.rpc.RpcManagerImpl] (transport-thread-20) ISPN000073: Unexpected error while replicating
java.lang.InterruptedException
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2188)
    at org.jgroups.blocks.Request.responsesComplete(Request.java:197)
    at org.jgroups.blocks.Request.execute(Request.java:89)
    at org.jgroups.blocks.MessageDispatcher.sendMessage(MessageDispatcher.java:399)
    at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.processSingleCall(CommandAwareRpcDispatcher.java:349)
    at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommand(CommandAwareRpcDispatcher.java:167)
    at org.infinispan.remoting.transport.jgroups.JGroupsTransport.invokeRemotely(JGroupsTransport.java:523)
    at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:281)
    at org.infinispan.statetransfer.OutboundTransferTask.sendEntries(OutboundTransferTask.java:233)
    at org.infinispan.statetransfer.OutboundTransferTask.sendEntry(OutboundTransferTask.java:189)
    at org.infinispan.statetransfer.OutboundTransferTask.run(OutboundTransferTask.java:145)
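Regarding the join_timeout suggestion above, the change is a single attribute on GMS in the JGroups configuration; a minimal sketch (the 10000 ms value is only illustrative, not a value taken from the logs):

    <!-- other GMS attributes unchanged -->
    <pbcast.GMS join_timeout="10000" />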
Correction to comment 10: node apex948 didn't have any problem; it was apex949 that didn't receive all its segments.

This happened after the test stopped node apex947 (the last node that was supposed to be stopped). Just as apex947 left, it started a rebalance, which apex948 dutifully cancelled as it became the new coordinator. apex949 had already requested segments from apex959, so it sent a StateRequestCommand(CANCEL_STATE_TRANSFER) asynchronously to apex959. Then apex948 started a new rebalance, and apex949 asked apex959 for the same segments. When apex959 finally received the cancel request, it didn't check the topology id, so it incorrectly cancelled the outbound transfer to apex949.

The solution would be to verify the topology id in the CANCEL_STATE_TRANSFER command before cancelling the transfer. I also think we can avoid sending the cancel command entirely in this case, and only send it when we are about to stop.
Okay, I ran an extra 3 runs using https://github.com/infinispan/jdg/tree/BZ1104045/ISPN-4154/failing_elasticity_test as my baseline. I had 2 successes and then this failure:

https://spideroak.com/storage/OJSWI2DBORPXK43FOJPWC5LUN5PTCNRXHE2Q/shared/651705-1-1003/stdout-37.zip?108be3edf5d58aa7370817c4ce650009

It appears the first node to go down had an issue right away and didn't properly replicate its data to the other nodes as it should have.
This was caused by a join issue again.

10:43:04,798 INFO [org.jgroups.protocols.UFC] (main) UFC is not needed (and can be removed) as we're running on a TCP transport
10:43:04,821 WARN [org.jgroups.protocols.TCP] (main) JGRP000028: bundler_type transfer-queue not known; using default bundler (new3)
10:43:04,824 DEBUG [org.jgroups.protocols.pbcast.GMS] (main) address=apex884-47061, cluster=default, physical address=192.168.86.21:7801
10:43:04,825 TRACE [org.jgroups.protocols.FD_SOCK] (FD_SOCK acceptor,default,apex884-47061) waiting for client connections on /192.168.86.21:34353
10:43:07,828 TRACE [org.jgroups.protocols.pbcast.GMS] (main) apex884-47061: initial_mbrs are apex885-63743 apex886-36329 apex887-43171 apex888-29567 apex889-5222
10:43:07,829 TRACE [org.jgroups.protocols.pbcast.GMS] (main) apex884-47061: could not determine coordinator from responses [apex885-63743, view_id=, is_server=false, is_coord=false, logical_name=apex885-63743, physical_addrs=192.168.86.22:7801, apex886-36329, view_id=, is_server=false, is_coord=false, logical_name=apex886-36329, physical_addrs=192.168.86.23:7801, apex887-43171, view_id=, is_server=false, is_coord=false, logical_name=apex887-43171, physical_addrs=192.168.86.24:7801, apex888-29567, view_id=, is_server=false, is_coord=false, logical_name=apex888-29567, physical_addrs=192.168.86.25:7801, apex889-5222, view_id=, is_server=false, is_coord=false, logical_name=apex889-5222, physical_addrs=192.168.86.26:7801]
10:43:07,829 TRACE [org.jgroups.protocols.pbcast.GMS] (main) apex884-47061: clients to choose new coord from are: [apex884-47061, apex888-29567, apex886-36329, apex885-63743, apex889-5222, apex887-43171]
10:43:07,829 TRACE [org.jgroups.protocols.pbcast.GMS] (main) apex884-47061: I (apex884-47061) am the first of the clients, will become coordinator
All linked JIRA issues have their fixes integrated in the product branch and CR2 was built with them. Setting this to ON_QA.
To prevent the latter split issue, I had changed num_initial_members=10 and timeout=6000 on the MPING protocol. It appears the issue was that 3 nodes came up at the same time and one of them created its own cluster, since none of them found an existing cluster to join.
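A minimal sketch of that discovery change in the JGroups configuration (only the two changed attributes are shown; the rest of the MPING configuration stays as attached):

    <MPING timeout="6000"
           num_initial_members="10" />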
The current issue under investigation is ISPN-4490. ETA today.
Executed the elasticity test in Hyperion 3 times without a failure, and the resilience test 5 times without a failure with JDG 6.3.0 CR4. VERIFIED
Dan Berindei <dberinde> updated the status of jira ISPN-4480 to Resolved