Bug 745845 (EDG-115) - HotRod client throwing exception when accessing data after host shutdown/failure
Summary: HotRod client throwing exception when accessing data after host shutdown/failure
Keywords:
Status: CLOSED NOTABUG
Alias: EDG-115
Product: JBoss Data Grid 5
Classification: JBoss
Component: Infinispan
Version: unspecified
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: EAP 5.1.0 EDG TP
Assignee: Default User
QA Contact:
URL: http://jira.jboss.org/jira/browse/EDG...
Whiteboard:
Depends On:
Blocks:
 
Reported: 2011-01-27 22:29 UTC by Richard Achmatowicz
Modified: 2011-10-11 17:09 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-04-13 12:04:59 UTC
Type: Bug




Links
Red Hat Issue Tracker EDG-115 - Priority: Major, Status: Closed - HotRod client throwing exception when accessing data after host shutdown/failure (last updated 2017-03-22 18:31:40 UTC)

Description Richard Achmatowicz 2011-01-27 22:29:10 UTC
project_key: EDG

The test case org.jboss.test.cluster.datagrid.test.HotRodClientFailoverTestCase is failing with the following exception:

  <testcase classname="org.jboss.test.cluster.datagrid.test.HotRodClientFailoverTestCase" name="testFailoverWithCleanShutdown" time="192.895">
    <error message="Problems writing data to stream" type="org.infinispan.client.hotrod.exceptions.TransportException">org.infinispan.client.hotrod.exceptions.TransportException: Problems writing data to stream
	at org.infinispan.client.hotrod.impl.transport.tcp.TcpTransport.writeByte(TcpTransport.java:101)
	at org.infinispan.client.hotrod.impl.operations.HotRodOperation.writeHeader(HotRodOperation.java:52)
	at org.infinispan.client.hotrod.impl.operations.AbstractKeyOperation.sendKeyOperation(AbstractKeyOperation.java:43)
	at org.infinispan.client.hotrod.impl.operations.GetOperation.executeOperation(GetOperation.java:26)
	at org.infinispan.client.hotrod.impl.operations.RetryOnFailureOperation.execute(RetryOnFailureOperation.java:38)
	at org.infinispan.client.hotrod.impl.RemoteCacheImpl.get(RemoteCacheImpl.java:283)
	at org.jboss.test.cluster.datagrid.test.HotRodClientFailoverTestCase.testFailoverWithCleanShutdown(HotRodClientFailoverTestCase.java:200)
Caused by: java.io.IOException: Broken pipe
	at sun.nio.ch.FileDispatcher.write0(Native Method)
	at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)
	at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:100)
	at sun.nio.ch.IOUtil.write(IOUtil.java:71)
	at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:334)
	at java.nio.channels.Channels.writeFullyImpl(Channels.java:59)
	at java.nio.channels.Channels.writeFully(Channels.java:81)
	at java.nio.channels.Channels.access$000(Channels.java:47)
	at java.nio.channels.Channels$1.write(Channels.java:155)
	at java.io.OutputStream.write(OutputStream.java:58)
	at java.nio.channels.Channels$1.write(Channels.java:136)
	at org.infinispan.client.hotrod.impl.transport.tcp.TcpTransport.writeByte(TcpTransport.java:96)
</error>
  </testcase>
  <system-out><![CDATA[ShellScriptExecutor: executing shell command -> bash -c ./run.sh  -c datagrid-failover-1 -b 192.168.0.101 
Server started successfully
Putting keys and values
Getting keys and values before server shutdown
shutdown successfully invoked on MBean jboss.system:type=Server; waiting for shutdown
waitForServerStop: waiting 90 seconds
Server killed successfully
Getting keys and values after server shutdown
]]></system-out>
  <system-err><![CDATA[]]></system-err>




The test does the following:
- starts two instances of EDG using the datagrid-failover-0 and datagrid-failover-1 profiles
- starts a RemoteCacheManager with a server list pointing to the two instances and gets the default cache
- performs puts using keys key1, key2, key3
- performs gets of key1, key2, key3
- shuts down one server instance (datagrid-failover-1)
- performs gets of key1, key2, key3

The exception is raised after the server is shut down, during the second round of gets.
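
For illustration, a minimal sketch of that client flow against the Infinispan Hot Rod client API follows. The addresses, port, and key/value names are assumptions for the example, not values taken from the actual test:

import java.util.Properties;

import org.infinispan.client.hotrod.RemoteCache;
import org.infinispan.client.hotrod.RemoteCacheManager;

public class HotRodFailoverSketch {
    public static void main(String[] args) throws Exception {
        // Both server instances go into the initial server list so the client
        // has somewhere to fail over to (addresses and port are assumed here).
        Properties props = new Properties();
        props.put("infinispan.client.hotrod.server_list",
                  "192.168.0.101:11222;192.168.0.102:11222");

        RemoteCacheManager rcm = new RemoteCacheManager(props);
        RemoteCache<String, String> cache = rcm.getCache();

        // First round: puts and gets while both servers are up
        cache.put("key1", "value1");
        cache.put("key2", "value2");
        cache.put("key3", "value3");
        System.out.println(cache.get("key1") + " " + cache.get("key2") + " " + cache.get("key3"));

        // ... shut down one server instance out of band (the test does this via
        // the server's shutdown MBean), then repeat the gets; the client is
        // expected to fail over to the remaining server.
        System.out.println(cache.get("key1") + " " + cache.get("key2") + " " + cache.get("key3"));

        rcm.stop();
    }
}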

Comment 1 Richard Achmatowicz 2011-01-27 22:30:08 UTC
A Hudson job showing the failure can be found at:

https://hudson.qa.jboss.com/hudson/view/EDG/job/JBoss-EAP-5.1.x-datagrid/65/

Comment 2 Galder Zamarreño 2011-02-07 10:29:14 UTC
Sorry Richard, I had missed this issue. So, it'd appear that failover in the Hot Rod client is not working as expected? I'll check the test in the EDG source code and try to replicate it in Infinispan.

Comment 3 Galder Zamarreño 2011-02-07 10:34:41 UTC
Attachment: Added: Configuration_changes_for_JBoss_mbean_server_lookup.patch


Comment 4 Galder Zamarreño 2011-02-07 10:50:53 UTC
Attachment: Removed: Configuration_changes_for_JBoss_mbean_server_lookup.patch 


Comment 5 Galder Zamarreño 2011-02-07 18:07:05 UTC
Richard, I had a look at this and I think your tests need improving to verify that a cluster actually forms between the two nodes that you start. I've replicated your issue, but it fails because the cluster does not form, and hence the client ends up knowing about only one of the servers. So, if you shut down the only server it knows about, it is not able to fail over:

datagrid-failover-0:
GMS: address=dnb-2-52617, cluster=DatagridPartition-Infinispan, physical address=127.0.0.2:55200
...
2011-02-07 18:37:08,467 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (main) Received new cluster view: [dnb-2-52617|0] [dnb-2-52617]
2011-02-07 18:37:08,817 DEBUG [org.infinispan.server.hotrod.HotRodServer] (main) First member to start, topology view is {0}
2011-02-07 18:40:40,664 DEBUG [org.infinispan.server.hotrod.HotRodServer] (JBoss Shutdown Hook) Removed {0} from topology view, new view is {1}

datagrid-failover-1:
GMS: address=dnb-2-44413, cluster=DatagridPartition-Infinispan, physical address=127.0.0.3:55200
...
2011-02-07 18:37:35,491 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (main) Received new cluster view: [dnb-2-44413|0] [dnb-2-44413]
...
GMS: address=dnb-2-3447, cluster=DatagridPartition-Infinispan, physical address=127.0.0.3:55200
...
2011-02-07 18:40:13,172 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (main) Received new cluster view: [dnb-2-3447|0] [dnb-2-3447]
2011-02-07 18:40:18,080 WARN  [org.jgroups.protocols.pbcast.NAKACK] (OOB-9,null) dnb-2-3447: dropped message from dnb-2-52617 (not in table [dnb-2-3447]), view=[dnb-2-3447|0] [dnb-2-3447]
2011-02-07 18:40:18,505 WARN  [org.jgroups.protocols.pbcast.NAKACK] (OOB-10,null) dnb-2-3447: dropped message from dnb-2-52617 (not in table [dnb-2-3447]), view=[dnb-2-3447|0] [dnb-2-3447]
2011-02-07 18:40:31,566 WARN  [org.jgroups.protocols.pbcast.NAKACK] (OOB-19,null) dnb-2-3447: dropped message from dnb-2-52617 (not in table [dnb-2-3447]), view=[dnb-2-3447|0] [dnb-2-3447]
2011-02-07 18:40:31,628 WARN  [org.jgroups.protocols.pbcast.NAKACK] (OOB-19,null) dnb-2-3447: dropped message from dnb-2-52617 (not in table [dnb-2-3447]), view=[dnb-2-3447|0] [dnb-2-3447]

I think you should base your tests on the clustering ones, which verify that the cluster forms in the first place.

In the client log, I only see:
test.log:60:2011-02-07 18:37:47,761 INFO  [org.infinispan.client.hotrod.impl.operations.HotRodOperation] New topology: {/127.0.0.3:11222=7359}
test.log:409:2011-02-07 18:40:22,068 INFO  [org.infinispan.client.hotrod.impl.operations.HotRodOperation] New topology: {/127.0.0.3:11222=6146}

So, since 127.0.0.2 and 127.0.0.3 did not cluster, the client never learns about both of them.
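
As an illustration of the kind of cluster-formation check being suggested, a test could wait for the expected number of members before running any Hot Rod operations. The sketch below polls the Infinispan CacheManager MBean's ClusterSize attribute over JMX; the JMX service URL and the exact ObjectName pattern are assumptions and would need to match the actual EDG/EAP deployment:

import java.util.Set;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class WaitForClusterSketch {

    // Polls until the CacheManager reports at least the expected number of
    // members, or the timeout elapses. Returns true if the cluster formed in time.
    public static boolean waitForClusterSize(String jmxUrl, int expectedSize,
                                             long timeoutMillis) throws Exception {
        JMXConnector connector = JMXConnectorFactory.connect(new JMXServiceURL(jmxUrl));
        try {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();
            // Assumed ObjectName pattern for the Infinispan CacheManager component
            ObjectName pattern = new ObjectName("org.infinispan:type=CacheManager,component=CacheManager,*");
            long deadline = System.currentTimeMillis() + timeoutMillis;
            while (System.currentTimeMillis() < deadline) {
                Set<ObjectName> names = mbsc.queryNames(pattern, null);
                for (ObjectName name : names) {
                    Object size = mbsc.getAttribute(name, "ClusterSize");
                    if (size instanceof Number && ((Number) size).intValue() >= expectedSize) {
                        return true;
                    }
                }
                Thread.sleep(1000);
            }
            return false;
        } finally {
            connector.close();
        }
    }
}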

Also, there's another bug in your test, highlighted by the fact that the client thinks the server list is 127.0.0.3 and 127.0.0.1:

2011-02-07 18:37:45,928 TRACE [org.infinispan.client.hotrod.impl.transport.tcp.RoundRobinBalancingStrategy] New server list is: [/127.0.0.3:11222, localhost/127.0.0.1:11222]. Resetting index to 0

That should clearly not be the case, because I had uncommented the following in local.properties:
node0=${env.MYTESTIP_1}
node1=${env.MYTESTIP_2}

and then ran:
./build.sh -Denv.MYTESTIP_1=127.0.0.2 -Denv.MYTESTIP_2=127.0.0.3 tests-clustering-datagrid-failover-tests

I think your test would have worked even if the cluster did not form, had the server list been the correct one: 127.0.0.2 and 127.0.0.3. It's up to you to decide what to test, but make sure that the test is clear about it.
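
To illustrate the server-list point, the test's client configuration could build the Hot Rod server list directly from the node addresses the testsuite resolves, rather than letting a localhost default leak in. This is a sketch only; the node0/node1 system property names and the Hot Rod port are assumptions about how the test framework exposes those addresses:

import java.util.Properties;

import org.infinispan.client.hotrod.RemoteCacheManager;

public class ServerListFromNodeProperties {

    // Builds a RemoteCacheManager whose server list comes from the addresses
    // resolved by the testsuite (assumed here to be exposed as the "node0" and
    // "node1" system properties), so no localhost default can leak in.
    public static RemoteCacheManager createRemoteCacheManager() {
        String node0 = System.getProperty("node0");
        String node1 = System.getProperty("node1");
        if (node0 == null || node1 == null) {
            throw new IllegalStateException("node0/node1 addresses are not set");
        }
        int hotRodPort = 11222; // assumed Hot Rod port
        Properties props = new Properties();
        props.put("infinispan.client.hotrod.server_list",
                  node0 + ":" + hotRodPort + ";" + node1 + ":" + hotRodPort);
        return new RemoteCacheManager(props);
    }
}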

Comment 6 Richard Achmatowicz 2011-02-07 18:38:09 UTC
I should have done some more investigation before logging this bug - when I started up the failing host, I neglected to add the -u option to specify the multicast address used by the testsuite, so the hosts would not cluster, as you rightly pointed out.

Apologies - I'll update the test case.


Comment 7 Richard Achmatowicz 2011-04-13 12:02:10 UTC
The test case has been updated and the issue no longer appears.

Comment 8 Richard Achmatowicz 2011-04-13 12:04:59 UTC
This issue was due to a test framework problem, and so was not a bug.

