| Summary: | StateTransferException: java.io.EOFException: Read past end of file | ||
|---|---|---|---|
| Product: | [JBoss] JBoss Data Grid 5 | Reporter: | Radoslav Husar <rhusar> |
| Component: | Infinispan | Assignee: | Default User <jbpapp-maint> |
| Status: | CLOSED NEXTRELEASE | QA Contact: | |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | unspecified | CC: | galder.zamarreno, rhusar |
| Target Milestone: | --- | ||
| Target Release: | EAP 5.1.0 EDG TP | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| URL: | http://jira.jboss.org/jira/browse/EDG-44 | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2011-03-15 07:42:13 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Description
Radoslav Husar
2011-03-08 14:44:02 UTC
Link: Added: This issue is incorporated by JBPAPP-6054

As explained in http://community.jboss.org/docs/DOC-13439, most likely something went wrong in the state producer which halted state transfer generation. The receiving end throws this kind of EOFException when the stream finishes unexpectedly, so check whether any problems appear in the receiver. This and JBPAPP-6053 could well be related, because they both point to the same problem: state transfer, or rehashing, is not working because the node that should reply is not doing so. Based on what the log says, that's either perf03 or perf04:

[JBoss] 08:50:05,211 INFO [RpcManagerImpl] Trying to fetch state from perf04-63783
...
[JBoss] 08:50:15,297 ERROR [JGroupsTransport] Caught while requesting or applying state
[JBoss] org.infinispan.statetransfer.StateTransferException: java.io.EOFException: Read past end of file
...
[JBoss] 08:50:15,297 INFO [RpcManagerImpl] Trying to fetch state from perf03-36672
...
[JBoss] 08:51:00,458 ERROR [JGroupsTransport] Caught while requesting or applying state
[JBoss] org.infinispan.statetransfer.StateTransferException: java.io.EOFException: Read past end of file

There's no thread information in the logs, so I cannot guarantee this 100%.

Right, here's the root cause why perf04 fails with state transfer:
{code}[JBoss] 08:50:15,234 ERROR [JGroupsTransport] Caught while responding to state transfer request
[JBoss] org.infinispan.statetransfer.StateTransferException: java.util.concurrent.TimeoutException: STREAMING_STATE_TRANSFER-sender-1,DatagridPartition-Infinispan,perf04-63783 could not obtain exclusive processing lock after 10 seconds. Locks in question are java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock@1042e091[Read locks = 8] and java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock@16a557ee[Unlocked]
[JBoss] at org.infinispan.statetransfer.StateTransferManagerImpl.generateState(StateTransferManagerImpl.java:176)
[JBoss] at org.infinispan.remoting.InboundInvocationHandlerImpl.generateState(InboundInvocationHandlerImpl.java:217)
[JBoss] at org.infinispan.remoting.transport.jgroups.JGroupsTransport.getState(JGroupsTransport.java:572)
[JBoss] at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.handleUpEvent(MessageDispatcher.java:691)
[JBoss] at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.up(MessageDispatcher.java:772)
[JBoss] at org.jgroups.JChannel.up(JChannel.java:1465)
[JBoss] at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:954)
[JBoss] at org.jgroups.protocols.pbcast.FLUSH.up(FLUSH.java:478)
[JBoss] at org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER$StateProviderHandler.process(STREAMING_STATE_TRANSFER.java:653)
[JBoss] at org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER$StateProviderThreadSpawner$1.run(STREAMING_STATE_TRANSFER.java:582)
[JBoss] at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
[JBoss] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
[JBoss] at java.lang.Thread.run(Thread.java:619)
[JBoss] Caused by: java.util.concurrent.TimeoutException: STREAMING_STATE_TRANSFER-sender-1,DatagridPartition-Infinispan,perf04-63783 could not obtain exclusive processing lock after 10 seconds. Locks in question are java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock@1042e091[Read locks = 8] and java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock@16a557ee[Unlocked]
[JBoss] at org.infinispan.remoting.transport.jgroups.JGroupsDistSync.acquireProcessingLock(JGroupsDistSync.java:80)
[JBoss] at org.infinispan.statetransfer.StateTransferManagerImpl.generateTransactionLog(StateTransferManagerImpl.java:203)
[JBoss] at org.infinispan.statetransfer.StateTransferManagerImpl.generateState(StateTransferManagerImpl.java:166)
[JBoss] ... 12 more{code}
The same thing is happening for perf03. I'll look further into this tomorrow.
Galder, what kind of thread information would you be looking for? Getting a jstack every minute or so?

Rado, the problem here appears to be that the state transfer generator thread cannot get an exclusive lock when trying to generate the state. I don't know why this is; maybe rehashing is going on at the same time for another cache and this is blocking it? This could well be the case, but why should they share the same lock if the target cache is different (I need to talk to Manik about this)? Anyway, a key thing in the logs is that the exclusive lock appears to be unlocked by the time the exception is logged: java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock@16a557ee[Unlocked]

So, bearing in mind that the processing lock timeout is 10 seconds, you should repeat the tests with this increased to either 30 or 60 seconds. In the meantime, I'll talk to Manik about the dist sync lock. Btw, you modify this timeout in the transport element, i.e.
{code}<transport distributedSyncTimeout="60000"...{code}
Actually Rado, forget what I said about the distributedSyncTimeout. State transfer does use the distributed sync lock, but the timeout it relies on is not that one; it is the state retrieval timeout. So it would appear that the state transfer timeout is set to 10 seconds? In fact, this is a hardcoded value in the Hot Rod server!
{code} protected def defineTopologyCacheConfig(cacheManager: EmbeddedCacheManager) {
val topologyCacheConfig = new Configuration
topologyCacheConfig.setCacheMode(CacheMode.REPL_SYNC)
topologyCacheConfig.setSyncReplTimeout(10000) // Milliseconds
topologyCacheConfig.setFetchInMemoryState(true) // State transfer required
topologyCacheConfig.setEvictionStrategy(EvictionStrategy.NONE); // No eviction
topologyCacheConfig.setExpirationLifespan(-1); // No maximum lifespan
topologyCacheConfig.setExpirationMaxIdle(-1); // No maximum idle time
cacheManager.defineConfiguration(TopologyCacheName, topologyCacheConfig)
}{code}
So, I need to make that configurable (of course!). I'll implement this ASAP and you can then try a different value.
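For illustration only, here is a minimal sketch of how that timeout could be made configurable via a system property. It reuses the method quoted above; the property name matches the infinispan.server.topology.repl_timeout setting mentioned below, but the exact wiring in the actual ISPN-975 fix may differ.
{code}
// Sketch only (not the actual ISPN-975 change): same method as quoted above,
// but the replication timeout is read from a system property instead of being hardcoded.
protected def defineTopologyCacheConfig(cacheManager: EmbeddedCacheManager) {
   // Assumed property name; defaults to the previous hardcoded 10000 ms.
   val replTimeout = System.getProperty("infinispan.server.topology.repl_timeout", "10000").toLong
   val topologyCacheConfig = new Configuration
   topologyCacheConfig.setCacheMode(CacheMode.REPL_SYNC)
   topologyCacheConfig.setSyncReplTimeout(replTimeout) // Milliseconds, now configurable
   topologyCacheConfig.setFetchInMemoryState(true) // State transfer required
   topologyCacheConfig.setEvictionStrategy(EvictionStrategy.NONE) // No eviction
   topologyCacheConfig.setExpirationLifespan(-1) // No maximum lifespan
   topologyCacheConfig.setExpirationMaxIdle(-1) // No maximum idle time
   cacheManager.defineConfiguration(TopologyCacheName, topologyCacheConfig)
}
{code}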
Great analysis, thanks. Luckily I didn't get around to applying those changes yet, so I shall wait for a fix.

Link: Added: This issue depends on ISPN-975

Rado, ISPN-975 is fixed now and I've uploaded a snapshot of 4.2.x to Nexus. The documentation of the new options is in http://community.jboss.org/docs/DOC-15601. Essentially, there are two settings you should try for failover testing (which means you'd need to modify https://svn.jboss.org/repos/jbossas/branches/JBPAPP_5_1_datagrid/cluster/src/installers/datagrid/datagrid-endpoint-hotrod.properties accordingly):

1. Keeping state transfer enabled, increase the state transfer timeout, i.e.
{code}infinispan.server.topology.repl_timeout=60000{code}
2. Disable state transfer so that topology information is retrieved lazily:
{code}infinispan.server.topology.state_transfer=false{code}

Btw, you probably want to add the new properties to that .properties file with default values in them and with the options commented out. I'm closing the jira for the time being.
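For reference, a sketch of what those additions to datagrid-endpoint-hotrod.properties could look like, with the options commented out as suggested above. The defaults are inferred from the hardcoded values in the Hot Rod server code quoted earlier; the comment wording is my own.
{code}
# Topology cache tuning for the Hot Rod endpoint (sketch; commented values show the defaults).
# Replication/state transfer timeout for the topology cache, in milliseconds.
#infinispan.server.topology.repl_timeout=10000
# Set to false to disable state transfer and retrieve topology information lazily.
#infinispan.server.topology.state_transfer=true
{code}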