| Summary: | StateTransferException: java.io.EOFException: Read past end of file | ||
|---|---|---|---|
| Product: | [JBoss] JBoss Data Grid 5 | Reporter: | Radoslav Husar <rhusar> |
| Component: | Infinispan | Assignee: | Default User <jbpapp-maint> |
| Status: | CLOSED NEXTRELEASE | QA Contact: | |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | unspecified | CC: | galder.zamarreno, rhusar |
| Target Milestone: | --- | ||
| Target Release: | EAP 5.1.0 EDG TP | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| URL: | http://jira.jboss.org/jira/browse/EDG-44 | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2011-03-15 07:42:13 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Description
Radoslav Husar
2011-03-08 14:44:02 UTC
Link: Added: This issue is incorporated by JBPAPP-6054

As explained in http://community.jboss.org/docs/DOC-13439, most likely something went wrong in the state producer which halted state transfer generation. The receiving end throws this kind of EOFException when the stream finishes unexpectedly, so check whether any problems appear in the receiver. This and JBPAPP-6053 could well be related, because they both point to the same problem: state transfer, or rehashing, is not working because the node that should reply is not doing so. Based on what the log says, that's either perf03 or perf04:

[JBoss] 08:50:05,211 INFO [RpcManagerImpl] Trying to fetch state from perf04-63783
...
[JBoss] 08:50:15,297 ERROR [JGroupsTransport] Caught while requesting or applying state
[JBoss] org.infinispan.statetransfer.StateTransferException: java.io.EOFException: Read past end of file
...
[JBoss] 08:50:15,297 INFO [RpcManagerImpl] Trying to fetch state from perf03-36672
...
[JBoss] 08:51:00,458 ERROR [JGroupsTransport] Caught while requesting or applying state
[JBoss] org.infinispan.statetransfer.StateTransferException: java.io.EOFException: Read past end of file

There's no thread information in the logs, so I cannot guarantee this 100%.

Right, here's the root cause why perf04 fails with state transfer:
{code}[JBoss] 08:50:15,234 ERROR [JGroupsTransport] Caught while responding to state transfer request
[JBoss] org.infinispan.statetransfer.StateTransferException: java.util.concurrent.TimeoutException: STREAMING_STATE_TRANSFER-sender-1,DatagridPartition-Infinispan,perf04-63783 could not obtain exclusive processing lock after 10 seconds. Locks in question are java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock@1042e091[Read locks = 8] and java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock@16a557ee[Unlocked]
[JBoss] at org.infinispan.statetransfer.StateTransferManagerImpl.generateState(StateTransferManagerImpl.java:176)
[JBoss] at org.infinispan.remoting.InboundInvocationHandlerImpl.generateState(InboundInvocationHandlerImpl.java:217)
[JBoss] at org.infinispan.remoting.transport.jgroups.JGroupsTransport.getState(JGroupsTransport.java:572)
[JBoss] at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.handleUpEvent(MessageDispatcher.java:691)
[JBoss] at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.up(MessageDispatcher.java:772)
[JBoss] at org.jgroups.JChannel.up(JChannel.java:1465)
[JBoss] at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:954)
[JBoss] at org.jgroups.protocols.pbcast.FLUSH.up(FLUSH.java:478)
[JBoss] at org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER$StateProviderHandler.process(STREAMING_STATE_TRANSFER.java:653)
[JBoss] at org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER$StateProviderThreadSpawner$1.run(STREAMING_STATE_TRANSFER.java:582)
[JBoss] at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
[JBoss] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
[JBoss] at java.lang.Thread.run(Thread.java:619)
[JBoss] Caused by: java.util.concurrent.TimeoutException: STREAMING_STATE_TRANSFER-sender-1,DatagridPartition-Infinispan,perf04-63783 could not obtain exclusive processing lock after 10 seconds. Locks in question are java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock@1042e091[Read locks = 8] and java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock@16a557ee[Unlocked]
[JBoss] at org.infinispan.remoting.transport.jgroups.JGroupsDistSync.acquireProcessingLock(JGroupsDistSync.java:80)
[JBoss] at org.infinispan.statetransfer.StateTransferManagerImpl.generateTransactionLog(StateTransferManagerImpl.java:203)
[JBoss] at org.infinispan.statetransfer.StateTransferManagerImpl.generateState(StateTransferManagerImpl.java:166)
[JBoss] ... 12 more{code}
The same thing is happening for perf03. I'll look further into this tomorrow.
Galder, what kind of thread information would you be looking for? Getting a jstack every minute or so?

Rado, the problem here appears to be that the state transfer generator thread cannot get an exclusive lock when trying to generate the state. I don't know why this is; maybe rehashing is going on at the same time for another cache and this is blocking it? This could well be the case, but why should they share the same lock if the target cache is different (I need to talk to Manik about this)? Anyway, a key thing in the logs is that the exclusive lock appears to be unlocked by the time the exception is logged: java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock@16a557ee[Unlocked]

So, bearing in mind that the processing lock timeout is 10 seconds, you should repeat the tests with this increased to either 30 or 60 seconds. In the meantime, I'll talk to Manik about the dist sync lock. Btw, you modify this timeout in the transport element, i.e.
{code}<transport distributedSyncTimeout="60000"...{code}
Actually Rado, forget what I said about the distributedSyncTimeout. State transfer does use the distributed sync lock, but the timeout it relies on is not that one; it is the state retrieval timeout. So it would appear that the state transfer timeout is set to 10 seconds? In fact, this is a hardcoded value in the Hot Rod server!
{code} protected def defineTopologyCacheConfig(cacheManager: EmbeddedCacheManager) {
val topologyCacheConfig = new Configuration
topologyCacheConfig.setCacheMode(CacheMode.REPL_SYNC)
topologyCacheConfig.setSyncReplTimeout(10000) // Milliseconds
topologyCacheConfig.setFetchInMemoryState(true) // State transfer required
topologyCacheConfig.setEvictionStrategy(EvictionStrategy.NONE); // No eviction
topologyCacheConfig.setExpirationLifespan(-1); // No maximum lifespan
topologyCacheConfig.setExpirationMaxIdle(-1); // No maximum idle time
cacheManager.defineConfiguration(TopologyCacheName, topologyCacheConfig)
}{code}
So, I need to make that configurable (of course!). I'll implement this ASAP and you can then try a different value.
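For illustration only, here is a minimal sketch of how that timeout could be made configurable via a system property. It reuses the method quoted above; the property name matches the infinispan.server.topology.repl_timeout setting mentioned below, but the exact wiring in the actual ISPN-975 fix may differ.
{code}
// Sketch only (not the actual ISPN-975 change): same method as quoted above,
// but the replication timeout is read from a system property instead of being hardcoded.
protected def defineTopologyCacheConfig(cacheManager: EmbeddedCacheManager) {
   // Assumed property name; defaults to the previous hardcoded 10000 ms.
   val replTimeout = System.getProperty("infinispan.server.topology.repl_timeout", "10000").toLong
   val topologyCacheConfig = new Configuration
   topologyCacheConfig.setCacheMode(CacheMode.REPL_SYNC)
   topologyCacheConfig.setSyncReplTimeout(replTimeout) // Milliseconds, now configurable
   topologyCacheConfig.setFetchInMemoryState(true) // State transfer required
   topologyCacheConfig.setEvictionStrategy(EvictionStrategy.NONE) // No eviction
   topologyCacheConfig.setExpirationLifespan(-1) // No maximum lifespan
   topologyCacheConfig.setExpirationMaxIdle(-1) // No maximum idle time
   cacheManager.defineConfiguration(TopologyCacheName, topologyCacheConfig)
}
{code}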
Great analysis, thanks. Luckily I didn't get around to applying those changes yet, so I shall wait for a fix.

Link: Added: This issue depends on ISPN-975

Rado, ISPN-975 is fixed now and I've uploaded a snapshot of 4.2.x to Nexus. The documentation of the new options is in http://community.jboss.org/docs/DOC-15601. Essentially, there are two settings you should try for failover testing (which means you'd need to modify https://svn.jboss.org/repos/jbossas/branches/JBPAPP_5_1_datagrid/cluster/src/installers/datagrid/datagrid-endpoint-hotrod.properties accordingly):

1. Keeping state transfer enabled, increase the state transfer timeout, i.e.
{code}infinispan.server.topology.repl_timeout=60000{code}
2. Disable state transfer so that topology information is retrieved lazily:
{code}infinispan.server.topology.state_transfer=false{code}

Btw, you probably want to add the new properties to that .properties file with default values in them and with the options commented out. I'm closing the jira for the time being.
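For reference, a sketch of what those additions to datagrid-endpoint-hotrod.properties could look like, with the options commented out as suggested above. The defaults are inferred from the hardcoded values in the Hot Rod server code quoted earlier; the comment wording is my own.
{code}
# Topology cache tuning for the Hot Rod endpoint (sketch; commented values show the defaults).
# Replication/state transfer timeout for the topology cache, in milliseconds.
#infinispan.server.topology.repl_timeout=10000
# Set to false to disable state transfer and retrieve topology information lazily.
#infinispan.server.topology.state_transfer=true
{code}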