Bug 745845 (EDG-115) - HotRod client throwing exception when accessing data after host shutdown/failure
Summary: HotRod client throwing exception when accessing data after host shutdown/failure
Keywords:
Status: CLOSED NOTABUG
Alias: EDG-115
Product: JBoss Data Grid 5
Classification: JBoss
Component: Infinispan
Version: unspecified
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: EAP 5.1.0 EDG TP
Assignee: Default User
QA Contact:
URL: http://jira.jboss.org/jira/browse/EDG...
Whiteboard:
Depends On:
Blocks:
 
Reported: 2011-01-27 22:29 UTC by Richard Achmatowicz
Modified: 2011-10-11 17:09 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-04-13 12:04:59 UTC
Type: Bug




Links
Red Hat Issue Tracker EDG-115 - Priority: Major, Status: Closed - HotRod client throwing exception when accessing data after host shutdown/failure (last updated 2017-03-22 18:31:40 UTC)

Description Richard Achmatowicz 2011-01-27 22:29:10 UTC
project_key: EDG

The test case org.jboss.test.cluster.datagrid.test.HotRodClientFailoverTestCase is failing with the following exception:

  <testcase classname="org.jboss.test.cluster.datagrid.test.HotRodClientFailoverTestCase" name="testFailoverWithCleanShutdown" time="192.895">
    <error message="Problems writing data to stream" type="org.infinispan.client.hotrod.exceptions.TransportException">org.infinispan.client.hotrod.exceptions.TransportException: Problems writing data to stream
	at org.infinispan.client.hotrod.impl.transport.tcp.TcpTransport.writeByte(TcpTransport.java:101)
	at org.infinispan.client.hotrod.impl.operations.HotRodOperation.writeHeader(HotRodOperation.java:52)
	at org.infinispan.client.hotrod.impl.operations.AbstractKeyOperation.sendKeyOperation(AbstractKeyOperation.java:43)
	at org.infinispan.client.hotrod.impl.operations.GetOperation.executeOperation(GetOperation.java:26)
	at org.infinispan.client.hotrod.impl.operations.RetryOnFailureOperation.execute(RetryOnFailureOperation.java:38)
	at org.infinispan.client.hotrod.impl.RemoteCacheImpl.get(RemoteCacheImpl.java:283)
	at org.jboss.test.cluster.datagrid.test.HotRodClientFailoverTestCase.testFailoverWithCleanShutdown(HotRodClientFailoverTestCase.java:200)
Caused by: java.io.IOException: Broken pipe
	at sun.nio.ch.FileDispatcher.write0(Native Method)
	at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)
	at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:100)
	at sun.nio.ch.IOUtil.write(IOUtil.java:71)
	at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:334)
	at java.nio.channels.Channels.writeFullyImpl(Channels.java:59)
	at java.nio.channels.Channels.writeFully(Channels.java:81)
	at java.nio.channels.Channels.access$000(Channels.java:47)
	at java.nio.channels.Channels$1.write(Channels.java:155)
	at java.io.OutputStream.write(OutputStream.java:58)
	at java.nio.channels.Channels$1.write(Channels.java:136)
	at org.infinispan.client.hotrod.impl.transport.tcp.TcpTransport.writeByte(TcpTransport.java:96)
</error>
  </testcase>
  <system-out><![CDATA[ShellScriptExecutor: executing shell command -> bash -c ./run.sh  -c datagrid-failover-1 -b 192.168.0.101 
Server started successfully
Putting keys and values
Getting keys and values before server shutdown
shutdown successfully invoked on MBean jboss.system:type=Server; waiting for shutdown
waitForServerStop: waiting 90 seconds
Server killed successfully
Getting keys and values after server shutdown
]]></system-out>
  <system-err><![CDATA[]]></system-err>




The test does the following:
- starts two instances of EDG using the datagrid-failover-0 and datagrid-failover-1 profiles
- starts a RemoteCacheManager with a server list pointing to the two instances and gets the default cache
- performs puts using keys key1, key2, key3
- performs gets of key1, key2, key3
- shuts down one server instance (datagrid-failover-1)
- performs gets of key1, key2, key3

The exception is raised after the server is shut down, during the second round of gets.
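
For illustration, a minimal sketch of that client flow against the Infinispan Hot Rod client API follows. The addresses, port, and key/value names are assumptions for the example, not values taken from the actual test:

import java.util.Properties;

import org.infinispan.client.hotrod.RemoteCache;
import org.infinispan.client.hotrod.RemoteCacheManager;

public class HotRodFailoverSketch {
    public static void main(String[] args) throws Exception {
        // Both server instances go into the initial server list so the client
        // has somewhere to fail over to (addresses and port are assumed here).
        Properties props = new Properties();
        props.put("infinispan.client.hotrod.server_list",
                  "192.168.0.101:11222;192.168.0.102:11222");

        RemoteCacheManager rcm = new RemoteCacheManager(props);
        RemoteCache<String, String> cache = rcm.getCache();

        // First round: puts and gets while both servers are up
        cache.put("key1", "value1");
        cache.put("key2", "value2");
        cache.put("key3", "value3");
        System.out.println(cache.get("key1") + " " + cache.get("key2") + " " + cache.get("key3"));

        // ... shut down one server instance out of band (the test does this via
        // the server's shutdown MBean), then repeat the gets; the client is
        // expected to fail over to the remaining server.
        System.out.println(cache.get("key1") + " " + cache.get("key2") + " " + cache.get("key3"));

        rcm.stop();
    }
}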

Comment 1 Richard Achmatowicz 2011-01-27 22:30:08 UTC
A Hudson job showing the failure can be found at:

https://hudson.qa.jboss.com/hudson/view/EDG/job/JBoss-EAP-5.1.x-datagrid/65/

Comment 2 Galder Zamarreño 2011-02-07 10:29:14 UTC
Sorry Richard, I had missed this issue. So, it'd appear that failover in the Hot Rod client is not working as expected? I'll check the test in the EDG source code and try to replicate it in Infinispan.

Comment 3 Galder Zamarreño 2011-02-07 10:34:41 UTC
Attachment: Added: Configuration_changes_for_JBoss_mbean_server_lookup.patch


Comment 4 Galder Zamarreño 2011-02-07 10:50:53 UTC
Attachment: Removed: Configuration_changes_for_JBoss_mbean_server_lookup.patch 


Comment 5 Galder Zamarreño 2011-02-07 18:07:05 UTC
Richard, I had a look at this and I think your tests need improving to verify that a cluster actually forms between the two nodes that you start. I've replicated your issue, but it fails because the cluster does not form, and hence the client ends up knowing about only one of the servers. So, if you shut down the only server it knows about, it is not able to fail over:

datagrid-failover-0:
GMS: address=dnb-2-52617, cluster=DatagridPartition-Infinispan, physical address=127.0.0.2:55200
...
2011-02-07 18:37:08,467 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (main) Received new cluster view: [dnb-2-52617|0] [dnb-2-52617]
2011-02-07 18:37:08,817 DEBUG [org.infinispan.server.hotrod.HotRodServer] (main) First member to start, topology view is {0}
2011-02-07 18:40:40,664 DEBUG [org.infinispan.server.hotrod.HotRodServer] (JBoss Shutdown Hook) Removed {0} from topology view, new view is {1}

datagrid-failover-1:
GMS: address=dnb-2-44413, cluster=DatagridPartition-Infinispan, physical address=127.0.0.3:55200
...
2011-02-07 18:37:35,491 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (main) Received new cluster view: [dnb-2-44413|0] [dnb-2-44413]
...
GMS: address=dnb-2-3447, cluster=DatagridPartition-Infinispan, physical address=127.0.0.3:55200
...
2011-02-07 18:40:13,172 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (main) Received new cluster view: [dnb-2-3447|0] [dnb-2-3447]
2011-02-07 18:40:18,080 WARN  [org.jgroups.protocols.pbcast.NAKACK] (OOB-9,null) dnb-2-3447: dropped message from dnb-2-52617 (not in table [dnb-2-3447]), view=[dnb-2-3447|0] [dnb-2-3447]
2011-02-07 18:40:18,505 WARN  [org.jgroups.protocols.pbcast.NAKACK] (OOB-10,null) dnb-2-3447: dropped message from dnb-2-52617 (not in table [dnb-2-3447]), view=[dnb-2-3447|0] [dnb-2-3447]
2011-02-07 18:40:31,566 WARN  [org.jgroups.protocols.pbcast.NAKACK] (OOB-19,null) dnb-2-3447: dropped message from dnb-2-52617 (not in table [dnb-2-3447]), view=[dnb-2-3447|0] [dnb-2-3447]
2011-02-07 18:40:31,628 WARN  [org.jgroups.protocols.pbcast.NAKACK] (OOB-19,null) dnb-2-3447: dropped message from dnb-2-52617 (not in table [dnb-2-3447]), view=[dnb-2-3447|0] [dnb-2-3447]

I think you should base your tests on the clustering ones, which verify that the cluster forms in the first place.

In the client log, I only see:
test.log:60:2011-02-07 18:37:47,761 INFO  [org.infinispan.client.hotrod.impl.operations.HotRodOperation] New topology: {/127.0.0.3:11222=7359}
test.log:409:2011-02-07 18:40:22,068 INFO  [org.infinispan.client.hotrod.impl.operations.HotRodOperation] New topology: {/127.0.0.3:11222=6146}

So, since 127.0.0.2 and 127.0.0.3 did not cluster, the client never learns about both of them.
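
As an illustration of the kind of cluster-formation check being suggested, a test could wait for the expected number of members before running any Hot Rod operations. The sketch below polls the Infinispan CacheManager MBean's ClusterSize attribute over JMX; the JMX service URL and the exact ObjectName pattern are assumptions and would need to match the actual EDG/EAP deployment:

import java.util.Set;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class WaitForClusterSketch {

    // Polls until the CacheManager reports at least the expected number of
    // members, or the timeout elapses. Returns true if the cluster formed in time.
    public static boolean waitForClusterSize(String jmxUrl, int expectedSize,
                                             long timeoutMillis) throws Exception {
        JMXConnector connector = JMXConnectorFactory.connect(new JMXServiceURL(jmxUrl));
        try {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();
            // Assumed ObjectName pattern for the Infinispan CacheManager component
            ObjectName pattern = new ObjectName("org.infinispan:type=CacheManager,component=CacheManager,*");
            long deadline = System.currentTimeMillis() + timeoutMillis;
            while (System.currentTimeMillis() < deadline) {
                Set<ObjectName> names = mbsc.queryNames(pattern, null);
                for (ObjectName name : names) {
                    Object size = mbsc.getAttribute(name, "ClusterSize");
                    if (size instanceof Number && ((Number) size).intValue() >= expectedSize) {
                        return true;
                    }
                }
                Thread.sleep(1000);
            }
            return false;
        } finally {
            connector.close();
        }
    }
}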

Also, there's another bug in your test, highlighted by the fact that the client thinks the server list is 127.0.0.3 and 127.0.0.1:

2011-02-07 18:37:45,928 TRACE [org.infinispan.client.hotrod.impl.transport.tcp.RoundRobinBalancingStrategy] New server list is: [/127.0.0.3:11222, localhost/127.0.0.1:11222]. Resetting index to 0

That should clearly not be the case, because I had uncommented the following in local.properties:
node0=${env.MYTESTIP_1}
node1=${env.MYTESTIP_2}

and then ran:
./build.sh -Denv.MYTESTIP_1=127.0.0.2 -Denv.MYTESTIP_2=127.0.0.3 tests-clustering-datagrid-failover-tests

I think your test would have worked even if the cluster did not form, had the server list been the correct one: 127.0.0.2 and 127.0.0.3. It's up to you to decide what to test, but make sure that the test is clear about it.
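
To illustrate the server-list point, the test's client configuration could build the Hot Rod server list directly from the node addresses the testsuite resolves, rather than letting a localhost default leak in. This is a sketch only; the node0/node1 system property names and the Hot Rod port are assumptions about how the test framework exposes those addresses:

import java.util.Properties;

import org.infinispan.client.hotrod.RemoteCacheManager;

public class ServerListFromNodeProperties {

    // Builds a RemoteCacheManager whose server list comes from the addresses
    // resolved by the testsuite (assumed here to be exposed as the "node0" and
    // "node1" system properties), so no localhost default can leak in.
    public static RemoteCacheManager createRemoteCacheManager() {
        String node0 = System.getProperty("node0");
        String node1 = System.getProperty("node1");
        if (node0 == null || node1 == null) {
            throw new IllegalStateException("node0/node1 addresses are not set");
        }
        int hotRodPort = 11222; // assumed Hot Rod port
        Properties props = new Properties();
        props.put("infinispan.client.hotrod.server_list",
                  node0 + ":" + hotRodPort + ";" + node1 + ":" + hotRodPort);
        return new RemoteCacheManager(props);
    }
}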

Comment 6 Richard Achmatowicz 2011-02-07 18:38:09 UTC
I should have done some more investigation before logging this bug - when I started up the failing host, I neglected to add the -u option to specify the multicast address used by the testsuite, so the hosts would not cluster, as you rightly pointed out.

Apologies - I'll update the test case.


Comment 7 Richard Achmatowicz 2011-04-13 12:02:10 UTC
The test case has been updated and the issue no longer appears.

Comment 8 Richard Achmatowicz 2011-04-13 12:04:59 UTC
This issue was due to a test framework problem, and so was not a bug.

