project_key: EDG
The Hot Rod server has trouble starting up, probably because it's under load, and hits a lock acquisition timeout:
{code}
[JBoss] 06:05:03,475 ERROR [org.jboss.msc.service.fail] (MSC service thread 1-4) MSC00001: Failed to start service jboss.datagrid.endpoint: org.jboss.msc.service.StartException in service jboss.datagrid.endpoint: failed to start service
[JBoss] at org.jboss.datagrid.endpoint.EndpointService.start(EndpointService.java:101)
[JBoss] at org.jboss.msc.service.ServiceControllerImpl$StartTask.run(ServiceControllerImpl.java:1765)
[JBoss] at org.jboss.msc.service.ServiceControllerImpl$ClearTCCLTask.run(ServiceControllerImpl.java:2291)
[JBoss] at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) [:1.6.0_24]
[JBoss] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) [:1.6.0_24]
[JBoss] at java.lang.Thread.run(Thread.java:662) [:1.6.0_24]
[JBoss] Caused by: org.infinispan.CacheException: Unable to acquire lock on cache with name default
[JBoss] at org.infinispan.manager.DefaultCacheManager.getCache(DefaultCacheManager.java:456)
[JBoss] at org.infinispan.manager.DefaultCacheManager.getCache(DefaultCacheManager.java:478)
[JBoss] at org.jboss.as.clustering.infinispan.DefaultEmbeddedCacheManager.getCache(DefaultEmbeddedCacheManager.java:83)
[JBoss] at org.jboss.as.clustering.infinispan.DefaultEmbeddedCacheManager.getCache(DefaultEmbeddedCacheManager.java:72)
[JBoss] at org.jboss.as.clustering.infinispan.DefaultEmbeddedCacheManager.getCache(DefaultEmbeddedCacheManager.java:63)
[JBoss] at org.infinispan.server.core.AbstractProtocolServer.startDefaultCache(AbstractProtocolServer.scala:157)
[JBoss] at org.infinispan.server.core.AbstractProtocolServer.start(AbstractProtocolServer.scala:97)
[JBoss] at org.infinispan.server.hotrod.HotRodServer.start(HotRodServer.scala:79)
[JBoss] at org.jboss.datagrid.endpoint.EndpointService.startProtocolServer(EndpointService.java:148)
[JBoss] at org.jboss.datagrid.endpoint.EndpointService.start(EndpointService.java:94)
[JBoss] ... 5 more
{code}
The issue here is: why does a thread hold the lock for so long just to create a cache? (from DefaultCacheManager.java:456)
{code}
boolean acquired = false;
try {
   if (cacheCreateLock.tryLock(defaultConfiguration.getLockAcquisitionTimeout(), MILLISECONDS)) {
      acquired = true;
      return createCache(cacheName);
   } else {
      throw new CacheException("Unable to acquire lock on cache with name " + cacheName);
   }
} catch (InterruptedException e) {
   Thread.currentThread().interrupt();
   throw new CacheException("Interrupted while trying to get lock on cache with cache name " + cacheName, e);
} finally {
   if (acquired) cacheCreateLock.unlock();
}
{code}
btw this is the run that caused it: http://hudson.qa.jboss.com/hudson/view/EDG6/job/edg-60-resilience-size4/4/ and it's happening on perf19, which is forced to fail during the 4-node resilience (failover) test.
Michal, that lock is there to prevent concurrent cache creation problems on startup. IOW, it avoids issues with first-time getCache() calls. So, it'd appear that apart from the HotRodServer, at least one other piece of code is trying to create a cache, and for some reason this startup is not going as expected. Judging from http://hudson.mw.lab.eng.bos.redhat.com/hudson/view/EDG6/job/edg-60-resilience-size4/4/console-perf19/ , the two threads competing are "MSC service thread 1-4" and "MSC service thread 1-5". It seems like "MSC service thread 1-5" eventually manages to start up, but quite a bit later:
{code}
06:15:24,761 INFO [org.jboss.as.clustering.infinispan.subsystem] (MSC service thread 1-5) Started default cache from default container
{code}
I'm not sure what's making "MSC service thread 1-5" behave so slowly, but I'd imagine it's rehashing or state transfer. I doubt there are two threads trying to start the Hot Rod server, so maybe it's competition for the global distributed sync lock as a result of multiple caches belonging to different subsystems starting up. A thread dump could be handy here to see exactly where the threads are stuck.
Attached the trace logs I'm investigating right now. In this run I'm starting 4 instances on my laptop. They're just starting in parallel; no requests are made to them.
Attachment: Added: traced_run.zip
Clarification of the situation: there are 4 nodes using 512 virtual nodes, so there's a lot of topology info being carried around. It seems that ___hotRodTopologyCache has quite a hard time during startup, but it's still quite strange that a 10s timeout is not enough...
Link: Added: This issue depends ISPN-1316
Link: Added: This issue depends ISPN-1317
Michal, I've created a couple of JIRAs that explain some of the behaviour seen in the logs. Apart from that, I also see a few NamedCacheNotFoundException in michal-linhard-12702's log. These are due to servers starting up very close to each other, with one node sending a view in between the other node starting the channel and creating the Hot Rod cache configuration. A solution for ISPN-658, or the possibility we discussed of being able to start caches from InboundInvocationHandler, could help here, but I'm not even sure InboundInvocationHandler could start the hotRodTopologyCache on demand, because its configuration is defined programmatically during Hot Rod server startup. One way to avoid this issue in EDG is to define the Hot Rod topology cache in standalone.xml. That way the config is available from the start, even before the JGroups transport has been started in a node. This should, in theory, work just fine. You can do some testing by replicating the configuration in the standalone.xml file. The config used is built by the HotRodServer.createTopologyCacheConfig() method. The rest of the issues I see in the logs appear, at first glance, to be side effects of the state transfer issue or of the configuration not being there. I'm going to focus now on those two ISPN JIRAs and try to fix them today. If you modify the XML to have this config, we might be able to get further testing in rather quickly.
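For reference, here's a rough Java sketch of the kind of settings that method applies, i.e. what would need to be mirrored in standalone.xml. The setter names follow the legacy org.infinispan.config.Configuration API and are an assumption here, not a copy of the server code; the timeout values come from the Hot Rod topology properties quoted later in this thread.
{code}
import org.infinispan.config.Configuration;

public class TopologyCacheConfigSketch {
   // Sketch only: approximate programmatic equivalent of the Hot Rod topology cache config.
   static Configuration topologyCacheConfig(long lockTimeout, long replTimeout, boolean doStateTransfer) {
      Configuration cfg = new Configuration();
      cfg.setCacheMode(Configuration.CacheMode.REPL_SYNC); // synchronously replicated cache
      cfg.setLockAcquisitionTimeout(lockTimeout);          // topology lock timeout
      cfg.setSyncReplTimeout(replTimeout);                 // topology replication timeout
      cfg.setFetchInMemoryState(doStateTransfer);          // whether joiners pull the view via state transfer
      return cfg;
   }
}
{code}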
ok I'm on it - replicating the topology cache config in standalone.xml
Btw Michal, ISPN-1317 is a different issue, more related to concurrent state transfer requests. I'll have a chat with Manik about it later on.
Added a new traced run with a new build:
EDG 6.0.0.Alpha1-SNAPSHOT rev 76
Infinispan 5.1.0-SNAPSHOT, last commit ae5c4eb04882bc603b78ce0c5ece700b1ed889c1, Author: Manik Surtani <manik>, Date: Mon Aug 8 22:28:21 2011 +0100
AS 7.1.0.Alpha1-SNAPSHOT, last commit a1f7bbbe3ced08eb26484b120116ae01a8c5167a, Author: Paul Ferraro <paul.ferraro>, Date: Sat Aug 6 22:05:18 2011 -0400
traced_run2.zip - still seeing some timeouts.
Attachment: Added: traced_run2.zip
How about this kind of deadlock (on the "view" key in the topology cache)?
1. server1 LOCK(view) (local PUT)
2. server1 REPLICATE(view, server2)
3. server2 LOCK(view)
4. server2 REPLICATE(view, server1)
5. server1 LOCK(view) (replicated PUT from server2) - infinite wait
6. server2 LOCK(view) (replicated PUT from server1) - infinite wait
Isn't this quite likely to happen when REPLICATE takes a long time? (Which it does, when the TopologyView is quite a big item.) Attaching deadlock_stacks.zip with jstack dumps of the deadlock state that I've artificially simulated.
Attachment: Added: deadlock_stacks.zip
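The scenario described above boils down to a classic AB/BA lock-ordering deadlock. Here's a minimal, self-contained Java simulation of that ordering (plain java.util.concurrent, not Infinispan code); the 10-second tryLock stands in for the lock acquisition timeout seen in the logs, so both "nodes" give up instead of hanging forever:
{code}
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class ViewKeyDeadlockSketch {
   // One lock per node, standing in for each node's lock on the "view" key.
   static final ReentrantLock VIEW_ON_SERVER1 = new ReentrantLock();
   static final ReentrantLock VIEW_ON_SERVER2 = new ReentrantLock();

   public static void main(String[] args) throws InterruptedException {
      Thread server1 = new Thread(new Runnable() {
         public void run() { updateView("server1", VIEW_ON_SERVER1, VIEW_ON_SERVER2); }
      });
      Thread server2 = new Thread(new Runnable() {
         public void run() { updateView("server2", VIEW_ON_SERVER2, VIEW_ON_SERVER1); }
      });
      server1.start();
      server2.start();
      server1.join();
      server2.join();
   }

   // Steps 1/3: the local PUT takes the local "view" lock.
   // Steps 5/6: the replicated PUT then needs the other node's lock, which is still held.
   static void updateView(String name, ReentrantLock local, ReentrantLock remote) {
      local.lock();
      try {
         Thread.sleep(100); // replicating a big TopologyView takes a while
         if (remote.tryLock(10, TimeUnit.SECONDS)) {
            try {
               System.out.println(name + ": view updated");
            } finally {
               remote.unlock();
            }
         } else {
            System.out.println(name + ": unable to acquire lock, giving up after the timeout");
         }
      } catch (InterruptedException e) {
         Thread.currentThread().interrupt();
      } finally {
         local.unlock();
      }
   }
}
{code}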
Just realized that HotRodServer deals with this by retrying after a random time period.
And timeout exceptions from this deadlock aren't logged as ERRORs
Attaching deadlock_situation.txt, resulting from a run where all timeouts were set to 10 min. Started four nodes; only two started, the other two are stuck. jstack dumps are in the attachment.
Attachment: Added: deadlock_situation.txt
Michal, the view update failures eventually lead to errors if the update does not happen within a certain time. I have a JIRA to improve this with some ideas. I'll check the latest files you attached.
Michal, thanks for the info. So, I've looked at deadlock_situation.txt and here's what I see: test1 is applying state for the Hot Rod server view, and that's blocking the Rehasher for another cache. Both cache startups in that node are waiting on that. On test4, the Hot Rod server is waiting for the distributed cache to finish starting up, but that can't finish because of the Rehasher hold-up on test1. Let's try to move state transfer out of the equation (not sure why it takes so long to transfer the view, maybe because it's quite big with all the hash ids? Btw, I want to reduce the amount of data stored in the view significantly, but that's likely to require a new protocol version; there's a discussion about that in ISPN-1090). So, disable state transfer for the Hot Rod topology cache and force it to retrieve the view lazily. Based on what you said yesterday, you know how to enable that?
Btw, it sounds a bit silly that a rehash for cache-A needs to be held up by state transfer on cache-B. They have disjoint data and should not affect each other. I'll make sure this gets noted for the state transfer refactoring into rehashing that's going to happen.
1. Sure, I'm aware of ISPN-1090; I'm keeping it in mind should I have some comments on the current Hot Rod protocol. So far all my concerns are addressed by the proposals for version 2.
2. OK, let's move topology state transfer out of the equation for this JIRA. I'll try to configure the lazy retrieval.
3. I definitely want to explore the issues with state transfer of concurrently starting replicated caches later, but that can be part of another JIRA. Here my priority is to move forward, to be able to continue with performance tests.
deadlock-situation02.zip - run with lazy state retrieval in topology cache
Attachment: Added: deadlock-situation02.zip
deadlock-situation02-logs.zip - attached (INFO level) logs to see which threads timed out
Attachment: Added: deadlock-situation02-logs.zip
Ok, so we're down to the deadlock issue of multiple Hot Rod instances trying to update the view concurrently and failing to do so. There are several JIRAs that should help ease this. First, ISPN-1137, which should make sure that a key is only locked on its data owner. I suppose in the case of replicated caches it will choose one node in the cluster, and that way we avoid multiple nodes acquiring locks concurrently and deadlocking each other. If ISPN-1137 does not solve this issue, ISPN-448 will have to be implemented, but that's not as straightforward. So, what can be done for the moment? Clearly, the 10 minute lock and replication timeouts in the test were exaggerated, but they help show the issue better. So, right now EDG should be configured with this in mind, and to that extent we should play with the update-timeout, replication-timeout and lock-timeout elements in topology-state-transfer in order to allow retries. Some sensible defaults could be: lock-timeout=10000, replication-timeout=15000, update-timeout=60000. That'll give nodes about 4-6 chances of updating the topology, depending on how much they sleep when they back off after a lock or replication timeout.
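To make the "4-6 chances" figure concrete, here's a back-of-the-envelope sketch (illustrative only, not the HotRodServer retry code; the back-off range of up to 1s is an assumption) of how many attempts fit into a 60s update-timeout when each failed attempt costs a lock or replication timeout plus a random sleep:
{code}
import java.util.Random;

public class TopologyUpdateRetryBudget {
   public static void main(String[] args) {
      long updateTimeout = 60000; // overall budget for publishing the new view
      long lockTimeout   = 10000; // cost of an attempt that fails acquiring the lock
      long replTimeout   = 15000; // cost of an attempt that fails on replication

      System.out.println("attempts if every try hits the replication timeout: " + attempts(updateTimeout, replTimeout)); // ~4
      System.out.println("attempts if every try hits the lock timeout: " + attempts(updateTimeout, lockTimeout));        // ~6
   }

   // Pessimistically assume every attempt times out and is followed by a short random back-off.
   static int attempts(long budget, long costPerAttempt) {
      Random random = new Random();
      long elapsed = 0;
      int count = 0;
      while (elapsed < budget) {
         count++;
         elapsed += costPerAttempt + random.nextInt(1000); // timeout plus an assumed back-off of up to 1s
      }
      return count;
   }
}
{code}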
I ran with the proposed timeouts in the topology cache (keeping 10 min timeouts in other places).
1. The problem is that TimeoutException is logged as an ERROR on the remote side, even though on the HotRodServer side it's swallowed as a DEBUG message.
2. I can still see the "Unable to update topology view, so aborting startup" message.
I'll check in debug whether the update-timeout is too low or something's wrong there... attaching logs in traced_run3.zip
Attachment: Added: traced_run3.zip
There is still something dependent on that 10 min timeout, it seems. I'm analyzing traced_run4.zip atm.
Attachment: Added: traced_run4.zip
btw version I use:
EDG 6.0.0.Alpha1-SNAPSHOT rev78 with
Infinispan 5.1.0-SNAPSHOT, last commit ae5c4eb04882bc603b78ce0c5ece700b1ed889c1, Author: Manik Surtani <manik>, Date: Mon Aug 8 22:28:21 2011 +0100
AS 7.1.0.Alpha1-SNAPSHOT, last commit a1f7bbbe3ced08eb26484b120116ae01a8c5167a, Author: Paul Ferraro <paul.ferraro>, Date: Sat Aug 6 22:05:18 2011 -0400
btw the situation might differ depending on whether all AS7 instances are being run for the first time or not. Different race conditions seem to apply in each class of runs...
Galder, check out https://github.com/infinispan/infinispan/blob/master/server/hotrod/src/main/scala/org/infinispan/server/hotrod/HotRodServer.scala line 265
{code}
protected def createTopologyCacheConfig(typedProps: TypedProperties, distSyncTimeout: Long): Configuration = {
   val lockTimeout = typedProps.getLongProperty(PROP_KEY_TOPOLOGY_LOCK_TIMEOUT, TOPO_LOCK_TIMEOUT_DEFAULT, true)
   val replTimeout = typedProps.getLongProperty(PROP_KEY_TOPOLOGY_REPL_TIMEOUT, TOPO_REPL_TIMEOUT_DEFAULT, true)
   val doStateTransfer = typedProps.getBooleanProperty(PROP_KEY_TOPOLOGY_STATE_TRANSFER, TOPO_STATE_TRANSFER_DEFAULT, true)
   topologyUpdateTimeout = MILLISECONDS.toSeconds(
      typedProps.getLongProperty(PROP_KEY_TOPOLOGY_UPDATE_TIMEOUT, TOPO_UPDATE_TIMEOUT_DEFAULT, true))
{code}
Aren't we treating topologyUpdateTimeout as milliseconds later in the code (updateTopologyView)? This could be causing us to time out too soon, before we can get through any number of retries...
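If that reading is right, the effect is easy to see with a tiny example (illustrative numbers only; 60000 is just the update-timeout suggested earlier):
{code}
import java.util.concurrent.TimeUnit;

public class TopologyUpdateTimeoutUnits {
   public static void main(String[] args) {
      long configuredMillis = 60000; // update-timeout as configured, in milliseconds
      long stored = TimeUnit.MILLISECONDS.toSeconds(configuredMillis); // 60, now in seconds
      // If 'stored' is later handed to code expecting milliseconds, the effective wait
      // shrinks from 60000 ms to 60 ms - not even enough time for a single retry.
      System.out.println("effective wait if reused as millis: " + stored + " ms");
   }
}
{code}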
The time granularity is indeed wrong, fixed it in ISPN-1327
Link: Added: This issue depends ISPN-1327
Something else worth trying for the topology cache is eager locking, where locks are acquired cluster-wide. This would solve the deadlock because you have to get the locks cluster-wide before you make the modification. At the moment, each node acquires the lock locally and updates, and then the cluster-wide lock requests can cross each other on the network, leading to a deadlock. This is well worth a try. Here we'd use implicit eager locking, enabling it via configuration (see https://docs.jboss.org/author/display/ISPN/Locking+and+Concurrency#LockingandConcurrency-Explicitandimplicitdistributedeagerlocking)
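A minimal sketch of what that switch might look like programmatically, assuming the legacy Configuration setter setUseEagerLocking is available in this Infinispan version (eager locking is a transactional feature, so the topology cache would presumably also need to run transactionally):
{code}
import org.infinispan.config.Configuration;

public class EagerLockingTopologyCacheSketch {
   static Configuration eagerLockingTopologyCacheConfig(long lockTimeout, long replTimeout) {
      Configuration cfg = new Configuration();
      cfg.setCacheMode(Configuration.CacheMode.REPL_SYNC); // same replicated sync topology cache as before
      cfg.setLockAcquisitionTimeout(lockTimeout);
      cfg.setSyncReplTimeout(replTimeout);
      // Assumption: legacy switch for implicit eager locking, i.e. a write acquires the
      // key's lock cluster-wide before modifying anything, so two nodes can no longer
      // each hold their local "view" lock while waiting for the other's.
      cfg.setUseEagerLocking(true);
      return cfg;
   }
}
{code}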
I was able to reproduce the error for which Michal attached log files, though not in our resilience tests. For the resilience tests I got (verified several times) the following exception: https://hudson.qa.jboss.com/hudson/view/EDG6/job/edg-60-resilience-size4/8/console-perf19/consoleText. It seems to be the same as ISPN-1106 (my log is very similar to that in the attached file http://dl.dropbox.com/u/10929737/5.0.0.CR2/server_node3.log). This exception was thrown after a node tried to re-join a 3-node cluster (4-node including itself) a minute after the node was killed. So this basically blocks our resilience testing of EDG6.
And I also tried such a dummy resilience test on my localhost (with a 4-node cluster, no requests/stores to any cache), killing one of the nodes and then starting it again, and I saw no exception. So there needs to be some traffic.
Martin, at first glance, the latest resilience issue you're reporting here is different from what Michal and I have been using this JIRA for, which is figuring out ways to solve the lock timeout issues when updating the topology information. I do see, however, a 10 minute gap in the log you attached (between 08:26:31,149 and 08:26:31,221): https://hudson.mw.lab.eng.bos.redhat.com/hudson/view/EDG6/job/edg-60-resilience-size4/8/console-perf19/consoleText So I'm wondering what that gap is about, and whether it's related to the cluster topology update discussed earlier. The reason I ask is that Michal was playing around with a 10 minute option to highlight the deadlock issue, but I'd imagine he would have reverted that option, since this timeout should be a lot lower. Based on the logs in other nodes (i.e. https://hudson.mw.lab.eng.bos.redhat.com/hudson/view/EDG6/job/edg-60-resilience-size4/8/console-perf17/consoleText), it seems as if that gap might be more related to ongoing rehashing, which means we should probably have a different JIRA to look into this. The logs for perf19 seem to indicate that rehashing was halted while it was ongoing, which is what leads to the interrupted exception. What's unclear to me is what caused the node to be killed/stopped in that log:
{code}
[JBossERR] /tmp/hudson/jboss-datagrid-6.0.0.Alpha1-SNAPSHOT/bin/standalone.sh: line 135: 8095 Killed ...
{code}
Is this the result of some configuration setting requiring AS to start within that time? If there's a particular window within which EDG needs to start, the rehashing timeout needs to be adjusted accordingly. What the log shows is that rehashing is still ongoing when the server is halted. If it's a rehashing issue, I'm not sure what exactly the problem is. Dan might have a better idea. I'll ping him and ask him to check the logs and let us know what he thinks. In the meantime Martin, please start a new JIRA to address this rehashing issue.
Let's make this JIRA about deadlocking in ___hotRodTopologyCache during Hot Rod server startup. I'm continuing with resilience tests on the new EDG build:
EDG 6.0.0-SNAPSHOT, last commit 6d655c86ba05fe2f6a9aba5d77835dfbcb0aae4b, Author: Galder Zamarreño <galderz>, Date: Wed Aug 31 16:36:23 2011 +0000
Infinispan 5.1.0-SNAPSHOT, last commit fc94449eec33469057614a13cbfe59e4f4e855fd, Author: Michal Linhard <michal>, Date: Wed Sep 7 13:42:32 2011 +0200
AS 7.0.1.Final
I'm not able to reproduce it, or the deadlock resolves itself as expected via the random wait mechanism. I'll mark this as resolved until the deadlock in this specific scenario appears again.
Link: Added: This issue is related to JBPAPP-7173
Glad to hear that the deadlock no longer appears. I'll still be pushing for some reforms in the protocol to avoid such big topology information. I think the topology cache should limit itself to keeping the endpoint information of all nodes in the cluster and little or nothing else.
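Purely as a hypothetical illustration of what "endpoint information and little else" could mean, something along these lines per cluster member (names are made up, not an existing Infinispan/EDG class):
{code}
import java.io.Serializable;

// Hypothetical shape of a slimmed-down topology cache entry: just the endpoint a
// client needs to connect to a member, with none of the hash-id payload that makes
// the current view so large. Names are illustrative only.
public final class EndpointInfo implements Serializable {
   private final String host;
   private final int hotRodPort;

   public EndpointInfo(String host, int hotRodPort) {
      this.host = host;
      this.hotRodPort = hotRodPort;
   }

   public String getHost() { return host; }
   public int getHotRodPort() { return hotRodPort; }
}
{code}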
Actually, I still see some startup problems, but so far I can't say whether they're related to this specific case. First we have to eliminate JBPAPP-7173 to know for sure.
Docs QE Status: Removed: NEW