Bug 745858 (EDG-44) - StateTransferException: java.io.EOFException: Read past end of file
Summary: StateTransferException: java.io.EOFException: Read past end of file
Keywords:
Status: CLOSED NEXTRELEASE
Alias: EDG-44
Product: JBoss Data Grid 5
Classification: JBoss
Component: Infinispan
Version: unspecified
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: EAP 5.1.0 EDG TP
Assignee: Default User
QA Contact:
URL: http://jira.jboss.org/jira/browse/EDG-44
Whiteboard:
Depends On:
Blocks:
 
Reported: 2011-03-08 14:44 UTC by Radoslav Husar
Modified: 2011-10-11 17:05 UTC
CC List: 2 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-03-15 07:42:13 UTC
Type: Bug


Links:
Red Hat Issue Tracker EDG-44 (Major, Closed): StateTransferException: java.io.EOFException: Read past end of file (last updated 2018-05-28 11:16:59 UTC)

Description Radoslav Husar 2011-03-08 14:44:02 UTC
project_key: EDG

Again, this needs more investigation. It happened 4 minutes after one node died and then started again:

http://hudson.qa.jboss.com/hudson/job/edg-51x-resilience-client-size4-hotrod/13/console-perf01/

{code}
2011/03/08 08:47:56:196 EST [INFO ][Thread-2] HOST perf10.mw.lab.eng.bos.redhat.com:rootProcess:E - Node 0 is down.

... starts again and tries to get the state

[JBoss] 08:50:15,297 ERROR [JGroupsTransport] Caught while requesting or applying state
[JBoss] org.infinispan.statetransfer.StateTransferException: java.io.EOFException: Read past end of file
[JBoss] 	at org.infinispan.statetransfer.StateTransferManagerImpl.applyState(StateTransferManagerImpl.java:332)
[JBoss] 	at org.infinispan.remoting.InboundInvocationHandlerImpl.applyState(InboundInvocationHandlerImpl.java:199)
[JBoss] 	at org.infinispan.remoting.transport.jgroups.JGroupsTransport.setState(JGroupsTransport.java:589)
[JBoss] 	at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.handleUpEvent(MessageDispatcher.java:712)
[JBoss] 	at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.up(MessageDispatcher.java:772)
[JBoss] 	at org.jgroups.JChannel.up(JChannel.java:1422)
[JBoss] 	at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:954)
[JBoss] 	at org.jgroups.protocols.pbcast.FLUSH.up(FLUSH.java:478)
[JBoss] 	at org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER.connectToStateProvider(STREAMING_STATE_TRANSFER.java:525)
[JBoss] 	at org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER.handleStateRsp(STREAMING_STATE_TRANSFER.java:464)
[JBoss] 	at org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER.up(STREAMING_STATE_TRANSFER.java:225)
[JBoss] 	at org.jgroups.protocols.FRAG2.up(FRAG2.java:190)
[JBoss] 	at org.jgroups.protocols.FC.up(FC.java:483)
[JBoss] 	at org.jgroups.protocols.pbcast.GMS.up(GMS.java:888)
[JBoss] 	at org.jgroups.protocols.VIEW_SYNC.up(VIEW_SYNC.java:171)
[JBoss] 	at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:234)
[JBoss] 	at org.jgroups.protocols.UNICAST.handleDataReceived(UNICAST.java:615)
[JBoss] 	at org.jgroups.protocols.UNICAST.up(UNICAST.java:295)
[JBoss] 	at org.jgroups.protocols.pbcast.NAKACK.up(NAKACK.java:707)
[JBoss] 	at org.jgroups.protocols.BARRIER.up(BARRIER.java:120)
[JBoss] 	at org.jgroups.protocols.VERIFY_SUSPECT.up(VERIFY_SUSPECT.java:132)
[JBoss] 	at org.jgroups.protocols.FD.up(FD.java:266)
[JBoss] 	at org.jgroups.protocols.FD_SOCK.up(FD_SOCK.java:269)
[JBoss] 	at org.jgroups.protocols.MERGE2.up(MERGE2.java:210)
[JBoss] 	at org.jgroups.protocols.Discovery.up(Discovery.java:292)
[JBoss] 	at org.jgroups.protocols.PING.up(PING.java:67)
[JBoss] 	at org.jgroups.stack.Protocol.up(Protocol.java:406)
[JBoss] 	at org.jgroups.protocols.TP.passMessageUp(TP.java:1093)
[JBoss] 	at org.jgroups.protocols.TP.access$100(TP.java:56)
[JBoss] 	at org.jgroups.protocols.TP$IncomingPacket.handleMyMessage(TP.java:1633)
[JBoss] 	at org.jgroups.protocols.TP$IncomingPacket.run(TP.java:1615)
[JBoss] 	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
[JBoss] 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
[JBoss] 	at java.lang.Thread.run(Thread.java:619)
[JBoss] Caused by: java.io.EOFException: Read past end of file
[JBoss] 	at org.jboss.marshalling.AbstractUnmarshaller.eofOnRead(AbstractUnmarshaller.java:184)
[JBoss] 	at org.jboss.marshalling.AbstractUnmarshaller.readUnsignedByteDirect(AbstractUnmarshaller.java:319)
[JBoss] 	at org.jboss.marshalling.AbstractUnmarshaller.readUnsignedByte(AbstractUnmarshaller.java:280)
[JBoss] 	at org.jboss.marshalling.river.RiverUnmarshaller.doReadObject(RiverUnmarshaller.java:207)
[JBoss] 	at org.jboss.marshalling.AbstractUnmarshaller.readObject(AbstractUnmarshaller.java:85)
[JBoss] 	at org.infinispan.marshall.jboss.GenericJBossMarshaller.objectFromObjectStream(GenericJBossMarshaller.java:182)
[JBoss] 	at org.infinispan.marshall.VersionAwareMarshaller.objectFromObjectStream(VersionAwareMarshaller.java:184)
[JBoss] 	at org.infinispan.statetransfer.StateTransferManagerImpl.processCommitLog(StateTransferManagerImpl.java:229)
[JBoss] 	at org.infinispan.statetransfer.StateTransferManagerImpl.applyTransactionLog(StateTransferManagerImpl.java:251)
[JBoss] 	at org.infinispan.statetransfer.StateTransferManagerImpl.applyState(StateTransferManagerImpl.java:321)
[JBoss] 	... 33 more
[JBoss] 08:50:15,297 INFO  [RpcManagerImpl] Trying to fetch state from perf03-36672
[JBoss] 08:51:00,458 ERROR [JGroupsTransport] Caught while requesting or applying state
[JBoss] org.infinispan.statetransfer.StateTransferException: java.io.EOFException: Read past end of file

...same thing again...

[JBoss] 08:51:00,458 WARN  [RpcManagerImpl] Could not find available peer for state, backing off and retrying
[JBoss] 08:51:01,461 INFO  [RpcManagerImpl] Trying to fetch state from perf04-63783
[JBoss] 08:51:01,466 INFO  [RpcManagerImpl] Successfully retrieved and applied state from perf04-63783
{code}

Comment 1 Radoslav Husar 2011-03-08 14:49:07 UTC
Link added: This issue is incorporated by JBPAPP-6054


Comment 2 Galder Zamarreño 2011-03-08 15:31:09 UTC
As explained in http://community.jboss.org/docs/DOC-13439, most likely something went wrong on the state producer, which halted state transfer generation. The receiving end throws this kind of EOFException when the stream finishes unexpectedly. So, check whether any problems appear on the producer side.
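
For illustration, here's a minimal Java sketch of that failure mode (hypothetical demo code, not Infinispan's; plain java.io serialization stands in for JBoss Marshalling): the producer stops writing mid-stream, and the receiver's unmarshaller reads past the end of what arrived.

{code}
import java.io.*;
import java.util.Arrays;

// Hypothetical demo, not Infinispan code: shows why a state receiver sees
// EOFException when the producer stops writing mid-stream.
public class TruncatedStreamDemo {
    public static void main(String[] args) throws Exception {
        // Producer side: serialize some state entries.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(buf);
        out.writeObject("cache entry 1");
        out.writeObject("cache entry 2");
        out.close();
        byte[] full = buf.toByteArray();

        // The producer "dies" mid-stream: only the first half of the bytes arrive.
        byte[] truncated = Arrays.copyOf(full, full.length / 2);

        // Receiver side: unmarshalling runs off the end of the truncated stream.
        ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(truncated));
        try {
            while (true) {
                System.out.println(in.readObject());
            }
        } catch (EOFException e) {
            // Same condition as "Read past end of file" in the report above.
            System.out.println("Receiver sees: " + e);
        }
    }
}
{code}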

Comment 3 Galder Zamarreño 2011-03-09 17:25:21 UTC
This and JBPAPP-6053 could well be related because they both point to the same problem: state transfer, or rehashing, not working because the node that should reply is not doing so.

Comment 4 Galder Zamarreño 2011-03-09 17:28:59 UTC
And based on what the log says, that's either perf03 or perf04:

{code}
[JBoss] 08:50:05,211 INFO  [RpcManagerImpl] Trying to fetch state from perf04-63783
...
[JBoss] 08:50:15,297 ERROR [JGroupsTransport] Caught while requesting or applying state
[JBoss] org.infinispan.statetransfer.StateTransferException: java.io.EOFException: Read past end of file
...
[JBoss] 08:50:15,297 INFO  [RpcManagerImpl] Trying to fetch state from perf03-36672
...
[JBoss] 08:51:00,458 ERROR [JGroupsTransport] Caught while requesting or applying state
[JBoss] org.infinispan.statetransfer.StateTransferException: java.io.EOFException: Read past end of file
{code}

There's no thread information in the logs, so I cannot guarantee this 100%.

Comment 5 Galder Zamarreño 2011-03-09 17:35:14 UTC
Right, here's the root cause why perf04 fails with state transfer:

{code}
[JBoss] 08:50:15,234 ERROR [JGroupsTransport] Caught while responding to state transfer request
[JBoss] org.infinispan.statetransfer.StateTransferException: java.util.concurrent.TimeoutException: STREAMING_STATE_TRANSFER-sender-1,DatagridPartition-Infinispan,perf04-63783 could not obtain exclusive processing lock after 10 seconds.  Locks in question are java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock@1042e091[Read locks = 8] and java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock@16a557ee[Unlocked]
[JBoss] 	at org.infinispan.statetransfer.StateTransferManagerImpl.generateState(StateTransferManagerImpl.java:176)
[JBoss] 	at org.infinispan.remoting.InboundInvocationHandlerImpl.generateState(InboundInvocationHandlerImpl.java:217)
[JBoss] 	at org.infinispan.remoting.transport.jgroups.JGroupsTransport.getState(JGroupsTransport.java:572)
[JBoss] 	at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.handleUpEvent(MessageDispatcher.java:691)
[JBoss] 	at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.up(MessageDispatcher.java:772)
[JBoss] 	at org.jgroups.JChannel.up(JChannel.java:1465)
[JBoss] 	at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:954)
[JBoss] 	at org.jgroups.protocols.pbcast.FLUSH.up(FLUSH.java:478)
[JBoss] 	at org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER$StateProviderHandler.process(STREAMING_STATE_TRANSFER.java:653)
[JBoss] 	at org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER$StateProviderThreadSpawner$1.run(STREAMING_STATE_TRANSFER.java:582)
[JBoss] 	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
[JBoss] 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
[JBoss] 	at java.lang.Thread.run(Thread.java:619)
[JBoss] Caused by: java.util.concurrent.TimeoutException: STREAMING_STATE_TRANSFER-sender-1,DatagridPartition-Infinispan,perf04-63783 could not obtain exclusive processing lock after 10 seconds.  Locks in question are java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock@1042e091[Read locks = 8] and java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock@16a557ee[Unlocked]
[JBoss] 	at org.infinispan.remoting.transport.jgroups.JGroupsDistSync.acquireProcessingLock(JGroupsDistSync.java:80)
[JBoss] 	at org.infinispan.statetransfer.StateTransferManagerImpl.generateTransactionLog(StateTransferManagerImpl.java:203)
[JBoss] 	at org.infinispan.statetransfer.StateTransferManagerImpl.generateState(StateTransferManagerImpl.java:166)
[JBoss] 	... 12 more
{code}

The same thing is happening for perf03. I'll look further into this tomorrow.

Comment 6 Radoslav Husar 2011-03-09 17:54:16 UTC
Galder, what kind of thread information would you be looking for? Getting a jstack every minute or so?

Comment 7 Galder Zamarreño 2011-03-10 10:16:21 UTC
Rado, so the problem here appears to be that the state transfer generator thread cannot get an exclusive lock when trying to generate the state. I don't know why this is; maybe rehashing is going on at the same time for another cache and is blocking it? This could well be the case, but why should they share the same lock if the target cache is different? (I need to talk to Manik about this.)

Anyway, a key thing in the logs is that the exclusive lock appears to be unlocked by the time the exception is logged:
java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock@16a557ee[Unlocked]
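
To make the contention concrete, here's a minimal Java sketch (hypothetical demo code, standing in for JGroupsDistSync.acquireProcessingLock): readers from in-flight operations hold the shared side of a ReentrantReadWriteLock, so the state generator's timed attempt at the exclusive side gives up after 10 seconds, just like in the log above.

{code}
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical demo, not Infinispan code: mimics the lock state in the log above.
public class ProcessingLockDemo {
    public static void main(String[] args) throws InterruptedException {
        final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

        // An in-flight operation holds the shared (read) side of the lock.
        new Thread(new Runnable() {
            public void run() {
                lock.readLock().lock();
                try {
                    Thread.sleep(15000); // still busy when the generator asks
                } catch (InterruptedException ignored) {
                } finally {
                    lock.readLock().unlock();
                }
            }
        }).start();
        Thread.sleep(100); // let the reader acquire first

        // The state generator needs the exclusive (write) side and times out,
        // just like acquireProcessingLock after its 10-second wait.
        if (!lock.writeLock().tryLock(10, TimeUnit.SECONDS)) {
            System.out.println("could not obtain exclusive processing lock after 10 seconds: " + lock);
        }
    }
}
{code}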

So, bearing in mind that the processing lock timeout is 10 seconds, you should repeat the tests with this increased to 30 or 60 seconds.

In the meantime, I'll talk to Manik about the dist sync lock.

Comment 8 Galder Zamarreño 2011-03-10 10:18:26 UTC
By the way, you modify this timeout in the transport element, e.g.:

{code}<transport distributedSyncTimeout="60000"...{code}
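
For context, a slightly fuller sketch of where that attribute sits (assuming the standard Infinispan 4.x configuration layout; the cluster name here is illustrative):

{code}
<infinispan>
   <global>
      <!-- distributedSyncTimeout is in milliseconds -->
      <transport clusterName="DatagridPartition" distributedSyncTimeout="60000"/>
   </global>
</infinispan>
{code}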

Comment 9 Galder Zamarreño 2011-03-10 10:31:12 UTC
Actually Rado, forget what I said about distributedSyncTimeout. State transfer uses the distributed sync lock, but not that timeout; it uses the state retrieval timeout instead. So it would appear that the state transfer timeout is set to 10 seconds? In fact, this is a hardcoded value in the Hot Rod server!

{code}   protected def defineTopologyCacheConfig(cacheManager: EmbeddedCacheManager) {
      val topologyCacheConfig = new Configuration
      topologyCacheConfig.setCacheMode(CacheMode.REPL_SYNC)
      topologyCacheConfig.setSyncReplTimeout(10000) // Milliseconds
      topologyCacheConfig.setFetchInMemoryState(true) // State transfer required
      topologyCacheConfig.setEvictionStrategy(EvictionStrategy.NONE); // No eviction
      topologyCacheConfig.setExpirationLifespan(-1); // No maximum lifespan
      topologyCacheConfig.setExpirationMaxIdle(-1); // No maximum idle time
      cacheManager.defineConfiguration(TopologyCacheName, topologyCacheConfig)
   }{code}


So, I need to make that configurable (of course!). I'll implement this ASAP and you can then try with a different value.
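
As a sketch of the direction such a fix could take (hypothetical code, not the actual ISPN-975 patch; the property names match the ones referenced in comment 12 below), the hardcoded values become overridable defaults:

{code}
import org.infinispan.config.Configuration;

// Hypothetical sketch, not the actual ISPN-975 patch: the 10000 ms default
// stays, but can now be overridden via system properties.
public class TopologyCacheConfigSketch {
    static Configuration defineTopologyCacheConfig() {
        Configuration topologyCacheConfig = new Configuration();
        topologyCacheConfig.setSyncReplTimeout(
                Long.getLong("infinispan.server.topology.repl_timeout", 10000L));
        topologyCacheConfig.setFetchInMemoryState(Boolean.parseBoolean(
                System.getProperty("infinispan.server.topology.state_transfer", "true")));
        return topologyCacheConfig;
    }
}
{code}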

Comment 10 Radoslav Husar 2011-03-10 10:33:37 UTC
Great analysis, thanks. Luckily I haven't got around to applying those changes yet, so I shall wait for a fix.

Comment 11 Radoslav Husar 2011-03-10 12:42:09 UTC
Link added: This issue depends on ISPN-975


Comment 12 Galder Zamarreño 2011-03-15 07:41:48 UTC
Rado, ISPN-975 is fixed now and I've uploaded a snapshot of 4.2.x to Nexus.

The documentation for the new options is at http://community.jboss.org/docs/DOC-15601.

But essentially, there are two settings you should try for failover testing (meaning you'd need to modify https://svn.jboss.org/repos/jbossas/branches/JBPAPP_5_1_datagrid/cluster/src/installers/datagrid/datagrid-endpoint-hotrod.properties accordingly):

1. Keep state transfer enabled and increase the state transfer timeout, e.g.:
{code}infinispan.server.topology.repl_timeout=60000{code}

2. Disable state transfer so that topology information is retrieved lazily:
{code}infinispan.server.topology.state_transfer=false{code}

Btw, you probably want to add the new properties to that .properties file with their default values in place and the options commented out, e.g.:
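
For example, something like this in datagrid-endpoint-hotrod.properties (assumed defaults: 10000 ms matches the previously hardcoded timeout, and state transfer enabled matches the code in comment 9):

{code}
# Topology cache tuning for the Hot Rod endpoint (defaults shown, commented out)
#infinispan.server.topology.repl_timeout=10000
#infinispan.server.topology.state_transfer=true
{code}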

I'm closing the JIRA for the time being.

