project_key: EDG
The Hot Rod server has trouble starting up, probably because it's under load, and hits a lock acquisition timeout:
{code}
[JBoss] 06:05:03,475 ERROR [org.jboss.msc.service.fail] (MSC service thread 1-4) MSC00001: Failed to start service jboss.datagrid.endpoint: org.jboss.msc.service.StartException in service jboss.datagrid.endpoint: failed to start service
[JBoss] at org.jboss.datagrid.endpoint.EndpointService.start(EndpointService.java:101)
[JBoss] at org.jboss.msc.service.ServiceControllerImpl$StartTask.run(ServiceControllerImpl.java:1765)
[JBoss] at org.jboss.msc.service.ServiceControllerImpl$ClearTCCLTask.run(ServiceControllerImpl.java:2291)
[JBoss] at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) [:1.6.0_24]
[JBoss] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) [:1.6.0_24]
[JBoss] at java.lang.Thread.run(Thread.java:662) [:1.6.0_24]
[JBoss] Caused by: org.infinispan.CacheException: Unable to acquire lock on cache with name default
[JBoss] at org.infinispan.manager.DefaultCacheManager.getCache(DefaultCacheManager.java:456)
[JBoss] at org.infinispan.manager.DefaultCacheManager.getCache(DefaultCacheManager.java:478)
[JBoss] at org.jboss.as.clustering.infinispan.DefaultEmbeddedCacheManager.getCache(DefaultEmbeddedCacheManager.java:83)
[JBoss] at org.jboss.as.clustering.infinispan.DefaultEmbeddedCacheManager.getCache(DefaultEmbeddedCacheManager.java:72)
[JBoss] at org.jboss.as.clustering.infinispan.DefaultEmbeddedCacheManager.getCache(DefaultEmbeddedCacheManager.java:63)
[JBoss] at org.infinispan.server.core.AbstractProtocolServer.startDefaultCache(AbstractProtocolServer.scala:157)
[JBoss] at org.infinispan.server.core.AbstractProtocolServer.start(AbstractProtocolServer.scala:97)
[JBoss] at org.infinispan.server.hotrod.HotRodServer.start(HotRodServer.scala:79)
[JBoss] at org.jboss.datagrid.endpoint.EndpointService.startProtocolServer(EndpointService.java:148)
[JBoss] at org.jboss.datagrid.endpoint.EndpointService.start(EndpointService.java:94)
[JBoss] ... 5 more
{code}
The issue here is: why does a thread hold the lock for so long just to create a cache? (from DefaultCacheManager.java:456)
{code}
boolean acquired = false;
try {
   if (cacheCreateLock.tryLock(defaultConfiguration.getLockAcquisitionTimeout(), MILLISECONDS)) {
      acquired = true;
      return createCache(cacheName);
   } else {
      throw new CacheException("Unable to acquire lock on cache with name " + cacheName);
   }
} catch (InterruptedException e) {
   Thread.currentThread().interrupt();
   throw new CacheException("Interrupted while trying to get lock on cache with cache name " + cacheName, e);
} finally {
   if (acquired) cacheCreateLock.unlock();
}
{code}
btw this is the run that caused it: http://hudson.qa.jboss.com/hudson/view/EDG6/job/edg-60-resilience-size4/4/ and it's happening on perf19, which is forced to fail during the 4-node resilience (failover) test.
Michal, that lock is there to prevent concurrent cache creation problems on startup. IOW, it avoids issues with first-time getCache() calls. So, it'd appear that apart from the HotRodServer, at least one other piece of code is trying to create a cache, and for some reason this startup is not going as expected. Judging from http://hudson.mw.lab.eng.bos.redhat.com/hudson/view/EDG6/job/edg-60-resilience-size4/4/console-perf19/ , the two threads competing are "MSC service thread 1-4" and "MSC service thread 1-5". It seems like "MSC service thread 1-5" eventually manages to start up, but quite a bit later:
{code}
06:15:24,761 INFO [org.jboss.as.clustering.infinispan.subsystem] (MSC service thread 1-5) Started default cache from default container
{code}
I'm not sure what's making "MSC service thread 1-5" behave so slowly, but I'd imagine it's rehashing or state transfer. I doubt there are two threads trying to start the Hot Rod server, so maybe it's competition for the global distributed sync lock as a result of multiple caches belonging to different subsystems starting up. A thread dump could be handy here to see exactly where the threads are stuck.
Attached the trace logs I'm investigating right now. In this run I'm starting 4 instances on my laptop. They're just starting in parallel; no requests are made to them.
Attachment: Added: traced_run.zip
Clarification of the situation: there are 4 nodes using 512 virtual nodes, so there's a lot of topology info being carried around. It seems that ___hotRodTopologyCache has quite a hard time during startup, but it's still quite strange that a 10s timeout is not enough...
Link: Added: This issue depends ISPN-1316
Link: Added: This issue depends ISPN-1317
Michal, I've created a couple of JIRAs that explain some of the behaviour seen in the logs. Apart from that, I also see a few NamedCacheNotFoundException in michal-linhard-12702's log. These are due to servers starting up very close to each other, with one node sending a view in between the other node starting the channel and creating the Hot Rod cache configuration. A solution for ISPN-658, or the possibility we discussed of being able to start caches from InboundInvocationHandler, could help here, but I'm not even sure InboundInvocationHandler could start the hotRodTopologyCache on demand, because its configuration is defined programmatically during Hot Rod server startup. One way to avoid this issue in EDG is to define the Hot Rod topology cache in standalone.xml. That way the config is available from the start, even before the JGroups transport has been started in a node. This should, in theory, work just fine. You can do some testing by replicating the configuration in the standalone.xml file. The config used is built by the HotRodServer.createTopologyCacheConfig() method. The rest of the issues I see in the logs appear, at first glance, to be side effects of the state transfer issue or of the configuration not being there. I'm going to focus now on those two ISPN JIRAs and try to fix them today. If you modify the XML to have this config, we might be able to get further testing in rather quickly.
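For reference, here's a rough Java sketch of the kind of settings that method applies, i.e. what would need to be mirrored in standalone.xml. The setter names follow the legacy org.infinispan.config.Configuration API and are an assumption here, not a copy of the server code; the timeout values come from the Hot Rod topology properties quoted later in this thread.
{code}
import org.infinispan.config.Configuration;

public class TopologyCacheConfigSketch {
   // Sketch only: approximate programmatic equivalent of the Hot Rod topology cache config.
   static Configuration topologyCacheConfig(long lockTimeout, long replTimeout, boolean doStateTransfer) {
      Configuration cfg = new Configuration();
      cfg.setCacheMode(Configuration.CacheMode.REPL_SYNC); // synchronously replicated cache
      cfg.setLockAcquisitionTimeout(lockTimeout);          // topology lock timeout
      cfg.setSyncReplTimeout(replTimeout);                 // topology replication timeout
      cfg.setFetchInMemoryState(doStateTransfer);          // whether joiners pull the view via state transfer
      return cfg;
   }
}
{code}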
ok I'm on it - replicating the topology cache config in standalone.xml
Btw Michal, ISPN-1317 is a different issue, more related to concurrent state transfer requests. I'll have a chat with Manik about it later on.
Added a new traced run with a new build:
EDG 6.0.0.Alpha1-SNAPSHOT rev 76
Infinispan 5.1.0-SNAPSHOT, last commit ae5c4eb04882bc603b78ce0c5ece700b1ed889c1, Author: Manik Surtani <manik>, Date: Mon Aug 8 22:28:21 2011 +0100
AS 7.1.0.Alpha1-SNAPSHOT, last commit a1f7bbbe3ced08eb26484b120116ae01a8c5167a, Author: Paul Ferraro <paul.ferraro>, Date: Sat Aug 6 22:05:18 2011 -0400
traced_run2.zip - still seeing some timeouts.
Attachment: Added: traced_run2.zip
How about this kind of deadlock (on the "view" key in the topology cache)?
1. server1 LOCK(view) (local PUT)
2. server1 REPLICATE(view, server2)
3. server2 LOCK(view)
4. server2 REPLICATE(view, server1)
5. server1 LOCK(view) (replicated PUT from server2) - infinite wait
6. server2 LOCK(view) (replicated PUT from server1) - infinite wait
Isn't this quite likely to happen when REPLICATE takes a long time? (Which it does, when the TopologyView is quite a big item.) Attaching deadlock_stacks.zip with jstack dumps of the deadlock state that I've artificially simulated.
Attachment: Added: deadlock_stacks.zip
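The scenario described above boils down to a classic AB/BA lock-ordering deadlock. Here's a minimal, self-contained Java simulation of that ordering (plain java.util.concurrent, not Infinispan code); the 10-second tryLock stands in for the lock acquisition timeout seen in the logs, so both "nodes" give up instead of hanging forever:
{code}
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class ViewKeyDeadlockSketch {
   // One lock per node, standing in for each node's lock on the "view" key.
   static final ReentrantLock VIEW_ON_SERVER1 = new ReentrantLock();
   static final ReentrantLock VIEW_ON_SERVER2 = new ReentrantLock();

   public static void main(String[] args) throws InterruptedException {
      Thread server1 = new Thread(new Runnable() {
         public void run() { updateView("server1", VIEW_ON_SERVER1, VIEW_ON_SERVER2); }
      });
      Thread server2 = new Thread(new Runnable() {
         public void run() { updateView("server2", VIEW_ON_SERVER2, VIEW_ON_SERVER1); }
      });
      server1.start();
      server2.start();
      server1.join();
      server2.join();
   }

   // Steps 1/3: the local PUT takes the local "view" lock.
   // Steps 5/6: the replicated PUT then needs the other node's lock, which is still held.
   static void updateView(String name, ReentrantLock local, ReentrantLock remote) {
      local.lock();
      try {
         Thread.sleep(100); // replicating a big TopologyView takes a while
         if (remote.tryLock(10, TimeUnit.SECONDS)) {
            try {
               System.out.println(name + ": view updated");
            } finally {
               remote.unlock();
            }
         } else {
            System.out.println(name + ": unable to acquire lock, giving up after the timeout");
         }
      } catch (InterruptedException e) {
         Thread.currentThread().interrupt();
      } finally {
         local.unlock();
      }
   }
}
{code}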
Just realized that HotRodServer deals with this by retrying after a random time period.
And timeout exceptions from this deadlock aren't logged as ERRORs
Attaching deadlock_situation.txt, resulting from a run where all timeouts were set to 10 min. Started four nodes; only two started, the other two are stuck. jstack dumps are in the attachment.
Attachment: Added: deadlock_situation.txt
Michal, the view update failures eventually lead to errors if the update does not happen within a certain time. I have a JIRA to improve this with some ideas. I'll check the latest files you attached.
Michal, thanks for the info. So, I've looked at deadlock_situation.txt and here's what I see: test1 is applying state for the Hot Rod server view, and that's blocking the Rehasher for another cache. Both cache startups in that node are waiting on that. On test4, the Hot Rod server is waiting for the distributed cache to finish starting up, but that can't finish because of the Rehasher hold-up on test1. Let's try to move state transfer out of the equation (not sure why it takes so long to transfer the view, maybe because it's quite big with all the hash ids? Btw, I want to reduce the amount of data stored in the view significantly, but that's likely to require a new protocol version; there's a discussion about that in ISPN-1090). So, disable state transfer for the Hot Rod topology cache and force it to retrieve the view lazily. Based on what you said yesterday, you know how to enable that?
Btw, it sounds a bit silly that a rehash for cache-A needs to be held up by state transfer on cache-B. They have disjoint data and should not affect each other. I'll make sure this gets noted for the state transfer refactoring into rehashing that's going to happen.
1. Sure, I'm aware of ISPN-1090; I'm keeping it in mind should I have some comments on the current Hot Rod protocol. So far all my concerns are addressed by the proposals for version 2.
2. OK, let's move topology state transfer out of the equation for this JIRA. I'll try to configure the lazy retrieval.
3. I definitely want to explore the issues with state transfer of concurrently starting replicated caches later, but that can be part of another JIRA. Here my priority is to move forward, to be able to continue with performance tests.
deadlock-situation02.zip - run with lazy state retrieval in topology cache
Attachment: Added: deadlock-situation02.zip
deadlock-situation02-logs.zip - attached (INFO level) logs to see which threads timed out
Attachment: Added: deadlock-situation02-logs.zip
Ok, so we're down to the deadlock issue of multiple Hot Rod instances trying to update the view concurrently and failing to do so. There are several JIRAs that should help ease this. First, ISPN-1137, which should make sure that a key is only locked on its data owner. I suppose in the case of replicated caches it will choose one node in the cluster, and that way we avoid multiple nodes acquiring locks concurrently and deadlocking each other. If ISPN-1137 does not solve this issue, ISPN-448 will have to be implemented, but that's not as straightforward. So, what can be done for the moment? Clearly, the 10 minute lock and replication timeouts in the test were exaggerated, but they help show the issue better. So, right now EDG should be configured with this in mind, and to that extent we should play with the update-timeout, replication-timeout and lock-timeout elements in topology-state-transfer in order to allow retries. Some sensible defaults could be: lock-timeout=10000, replication-timeout=15000, update-timeout=60000. That'll give nodes about 4-6 chances of updating the topology, depending on how much they sleep when they back off after a lock or replication timeout.
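To make the "4-6 chances" figure concrete, here's a back-of-the-envelope sketch (illustrative only, not the HotRodServer retry code; the back-off range of up to 1s is an assumption) of how many attempts fit into a 60s update-timeout when each failed attempt costs a lock or replication timeout plus a random sleep:
{code}
import java.util.Random;

public class TopologyUpdateRetryBudget {
   public static void main(String[] args) {
      long updateTimeout = 60000; // overall budget for publishing the new view
      long lockTimeout   = 10000; // cost of an attempt that fails acquiring the lock
      long replTimeout   = 15000; // cost of an attempt that fails on replication

      System.out.println("attempts if every try hits the replication timeout: " + attempts(updateTimeout, replTimeout)); // ~4
      System.out.println("attempts if every try hits the lock timeout: " + attempts(updateTimeout, lockTimeout));        // ~6
   }

   // Pessimistically assume every attempt times out and is followed by a short random back-off.
   static int attempts(long budget, long costPerAttempt) {
      Random random = new Random();
      long elapsed = 0;
      int count = 0;
      while (elapsed < budget) {
         count++;
         elapsed += costPerAttempt + random.nextInt(1000); // timeout plus an assumed back-off of up to 1s
      }
      return count;
   }
}
{code}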
I ran with the proposed timeouts in the topology cache (keeping 10 min timeouts in other places).
1. The problem is that TimeoutException is logged as an ERROR on the remote side, even though on the HotRodServer side it's swallowed as a DEBUG message.
2. I can still see the "Unable to update topology view, so aborting startup" message.
I'll check in debug whether the update-timeout is too low or something's wrong there... attaching logs in traced_run3.zip
Attachment: Added: traced_run3.zip
There is still something dependent on that 10 min timeout, it seems. I'm analyzing traced_run4.zip atm.
Attachment: Added: traced_run4.zip
btw version I use:
EDG 6.0.0.Alpha1-SNAPSHOT rev78 with
Infinispan 5.1.0-SNAPSHOT, last commit ae5c4eb04882bc603b78ce0c5ece700b1ed889c1, Author: Manik Surtani <manik>, Date: Mon Aug 8 22:28:21 2011 +0100
AS 7.1.0.Alpha1-SNAPSHOT, last commit a1f7bbbe3ced08eb26484b120116ae01a8c5167a, Author: Paul Ferraro <paul.ferraro>, Date: Sat Aug 6 22:05:18 2011 -0400
btw the situation might differ depending on whether all AS7 instances are being run for the first time or not. Different race conditions seem to apply in each class of runs...
Galder, check out https://github.com/infinispan/infinispan/blob/master/server/hotrod/src/main/scala/org/infinispan/server/hotrod/HotRodServer.scala line 265
{code}
protected def createTopologyCacheConfig(typedProps: TypedProperties, distSyncTimeout: Long): Configuration = {
   val lockTimeout = typedProps.getLongProperty(PROP_KEY_TOPOLOGY_LOCK_TIMEOUT, TOPO_LOCK_TIMEOUT_DEFAULT, true)
   val replTimeout = typedProps.getLongProperty(PROP_KEY_TOPOLOGY_REPL_TIMEOUT, TOPO_REPL_TIMEOUT_DEFAULT, true)
   val doStateTransfer = typedProps.getBooleanProperty(PROP_KEY_TOPOLOGY_STATE_TRANSFER, TOPO_STATE_TRANSFER_DEFAULT, true)
   topologyUpdateTimeout = MILLISECONDS.toSeconds(
      typedProps.getLongProperty(PROP_KEY_TOPOLOGY_UPDATE_TIMEOUT, TOPO_UPDATE_TIMEOUT_DEFAULT, true))
{code}
Aren't we treating topologyUpdateTimeout as milliseconds later in the code (updateTopologyView)? This could be causing us to time out too soon, before we can get through any number of retries...
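If that reading is right, the effect is easy to see with a tiny example (illustrative numbers only; 60000 is just the update-timeout suggested earlier):
{code}
import java.util.concurrent.TimeUnit;

public class TopologyUpdateTimeoutUnits {
   public static void main(String[] args) {
      long configuredMillis = 60000; // update-timeout as configured, in milliseconds
      long stored = TimeUnit.MILLISECONDS.toSeconds(configuredMillis); // 60, now in seconds
      // If 'stored' is later handed to code expecting milliseconds, the effective wait
      // shrinks from 60000 ms to 60 ms - not even enough time for a single retry.
      System.out.println("effective wait if reused as millis: " + stored + " ms");
   }
}
{code}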
The time granularity is indeed wrong, fixed it in ISPN-1327
Link: Added: This issue depends ISPN-1327
Something else worth trying for the topology cache is eager locking, where locks are acquired cluster-wide. This would solve the deadlock because you have to get the locks cluster-wide before you make the modification. At the moment, each node acquires the lock locally and updates, and then the cluster-wide lock requests can cross each other on the network, leading to a deadlock. This is well worth a try. Here we'd use implicit eager locking, enabling it via configuration (see https://docs.jboss.org/author/display/ISPN/Locking+and+Concurrency#LockingandConcurrency-Explicitandimplicitdistributedeagerlocking)
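A minimal sketch of what that switch might look like programmatically, assuming the legacy Configuration setter setUseEagerLocking is available in this Infinispan version (eager locking is a transactional feature, so the topology cache would presumably also need to run transactionally):
{code}
import org.infinispan.config.Configuration;

public class EagerLockingTopologyCacheSketch {
   static Configuration eagerLockingTopologyCacheConfig(long lockTimeout, long replTimeout) {
      Configuration cfg = new Configuration();
      cfg.setCacheMode(Configuration.CacheMode.REPL_SYNC); // same replicated sync topology cache as before
      cfg.setLockAcquisitionTimeout(lockTimeout);
      cfg.setSyncReplTimeout(replTimeout);
      // Assumption: legacy switch for implicit eager locking, i.e. a write acquires the
      // key's lock cluster-wide before modifying anything, so two nodes can no longer
      // each hold their local "view" lock while waiting for the other's.
      cfg.setUseEagerLocking(true);
      return cfg;
   }
}
{code}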
I was able to reproduce the error for which Michal attached log files, though not in our resilience tests. For the resilience tests I got (verified several times) the following exception: https://hudson.qa.jboss.com/hudson/view/EDG6/job/edg-60-resilience-size4/8/console-perf19/consoleText. It seems to be the same as ISPN-1106 (my log is very similar to that in the attached file http://dl.dropbox.com/u/10929737/5.0.0.CR2/server_node3.log). This exception was thrown after a node tried to re-join a 3-node cluster (4-node including itself) a minute after the node was killed. So this basically blocks our resilience testing of EDG6.
And I also tried such a dummy resilience test on my localhost (with a 4-node cluster, no requests/stores to any cache), killing one of the nodes and then starting it again, and I saw no exception. So there needs to be some traffic.
Martin, at first glance, the latest resilience issue you're reporting here is different from what Michal and I have been using this JIRA for, which is figuring out ways to solve the lock timeout issues when updating the topology information. I do see, however, a 10 minute gap in the log you attached (between 08:26:31,149 and 08:26:31,221): https://hudson.mw.lab.eng.bos.redhat.com/hudson/view/EDG6/job/edg-60-resilience-size4/8/console-perf19/consoleText So I'm wondering what that gap is about, and whether it's related to the cluster topology update discussed earlier. The reason I ask is that Michal was playing around with a 10 minute option to highlight the deadlock issue, but I'd imagine he would have reverted that option, since this timeout should be a lot lower. Based on the logs in other nodes (i.e. https://hudson.mw.lab.eng.bos.redhat.com/hudson/view/EDG6/job/edg-60-resilience-size4/8/console-perf17/consoleText), it seems as if that gap might be more related to ongoing rehashing, which means we should probably have a different JIRA to look into this. The logs for perf19 seem to indicate that rehashing was halted while it was ongoing, which is what leads to the interrupted exception. What's unclear to me is what caused the node to be killed/stopped in that log:
{code}
[JBossERR] /tmp/hudson/jboss-datagrid-6.0.0.Alpha1-SNAPSHOT/bin/standalone.sh: line 135: 8095 Killed ...
{code}
Is this the result of some configuration setting requiring AS to start within that time? If there's a particular window within which EDG needs to start, the rehashing timeout needs to be adjusted accordingly. What the log shows is that rehashing is still ongoing when the server is halted. If it's a rehashing issue, I'm not sure what exactly the problem is. Dan might have a better idea. I'll ping him and ask him to check the logs and let us know what he thinks. In the meantime Martin, please start a new JIRA to address this rehashing issue.
Let's make this JIRA about deadlocking in ___hotRodTopologyCache during Hot Rod server startup. I'm continuing with resilience tests on the new EDG build:
EDG 6.0.0-SNAPSHOT, last commit 6d655c86ba05fe2f6a9aba5d77835dfbcb0aae4b, Author: Galder Zamarreño <galderz>, Date: Wed Aug 31 16:36:23 2011 +0000
Infinispan 5.1.0-SNAPSHOT, last commit fc94449eec33469057614a13cbfe59e4f4e855fd, Author: Michal Linhard <michal>, Date: Wed Sep 7 13:42:32 2011 +0200
AS 7.0.1.Final
I'm not able to reproduce it, or the deadlock resolves itself as expected via the random wait mechanism. I'll mark this as resolved until the deadlock in this specific scenario appears again.
Link: Added: This issue is related to JBPAPP-7173
Glad to hear that the deadlock no longer appears. I'll still be pushing for some reforms in the protocol to avoid such big topology information. I think the topology cache should limit itself to keeping the endpoint information of all nodes in the cluster and little or nothing else.
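Purely as a hypothetical illustration of what "endpoint information and little else" could mean, something along these lines per cluster member (names are made up, not an existing Infinispan/EDG class):
{code}
import java.io.Serializable;

// Hypothetical shape of a slimmed-down topology cache entry: just the endpoint a
// client needs to connect to a member, with none of the hash-id payload that makes
// the current view so large. Names are illustrative only.
public final class EndpointInfo implements Serializable {
   private final String host;
   private final int hotRodPort;

   public EndpointInfo(String host, int hotRodPort) {
      this.host = host;
      this.hotRodPort = hotRodPort;
   }

   public String getHost() { return host; }
   public int getHotRodPort() { return hotRodPort; }
}
{code}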
Actually, I still see some startup problems, but so far I can't say whether they're related to this specific case. First we have to eliminate JBPAPP-7173 to know for sure.
Docs QE Status: Removed: NEW