| Summary: | HotRod client throwing exception when accessing data after host shutdown/failure | ||
|---|---|---|---|
| Product: | [JBoss] JBoss Data Grid 5 | Reporter: | Richard Achmatowicz <rachmato> |
| Component: | Infinispan | Assignee: | Default User <jbpapp-maint> |
| Status: | CLOSED NOTABUG | QA Contact: | |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | unspecified | CC: | galder.zamarreno, rachmato |
| Target Milestone: | --- | ||
| Target Release: | EAP 5.1.0 EDG TP | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| URL: | http://jira.jboss.org/jira/browse/EDG-115 | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Bug Fix | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2011-04-13 12:04:59 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
A Hudson job showing the failure can be found at: https://hudson.qa.jboss.com/hudson/view/EDG/job/JBoss-EAP-5.1.x-datagrid/65/

Sorry Richard, I had missed this issue. So, it would appear that failover in the Hot Rod client is not working as expected? I'll check the test in the EDG source code and try to replicate it in Infinispan.

Attachment: Added: Configuration_changes_for_JBoss_mbean_server_lookup.patch
Attachment: Removed: Configuration_changes_for_JBoss_mbean_server_lookup.patch

Richard, I had a look at this and I think your tests need improving to verify that a cluster actually forms between the two nodes that you start. I've replicated your issue, but it fails because the cluster does not form, and hence the client ends up knowing about only one of the servers. So, if you shut down the only server it knows about, it is not going to be able to fail over:
datagrid-failover-0:
GMS: address=dnb-2-52617, cluster=DatagridPartition-Infinispan, physical address=127.0.0.2:55200
...
2011-02-07 18:37:08,467 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (main) Received new cluster view: [dnb-2-52617|0] [dnb-2-52617]
2011-02-07 18:37:08,817 DEBUG [org.infinispan.server.hotrod.HotRodServer] (main) First member to start, topology view is {0}
2011-02-07 18:40:40,664 DEBUG [org.infinispan.server.hotrod.HotRodServer] (JBoss Shutdown Hook) Removed {0} from topology view, new view is {1}
datagrid-failover-1:
GMS: address=dnb-2-44413, cluster=DatagridPartition-Infinispan, physical address=127.0.0.3:55200
...
2011-02-07 18:37:35,491 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (main) Received new cluster view: [dnb-2-44413|0] [dnb-2-44413]
...
GMS: address=dnb-2-3447, cluster=DatagridPartition-Infinispan, physical address=127.0.0.3:55200
...
2011-02-07 18:40:13,172 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (main) Received new cluster view: [dnb-2-3447|0] [dnb-2-3447]
2011-02-07 18:40:18,080 WARN [org.jgroups.protocols.pbcast.NAKACK] (OOB-9,null) dnb-2-3447: dropped message from dnb-2-52617 (not in table [dnb-2-3447]), view=[dnb-2-3447|0] [dnb-2-3447]
2011-02-07 18:40:18,505 WARN [org.jgroups.protocols.pbcast.NAKACK] (OOB-10,null) dnb-2-3447: dropped message from dnb-2-52617 (not in table [dnb-2-3447]), view=[dnb-2-3447|0] [dnb-2-3447]
2011-02-07 18:40:31,566 WARN [org.jgroups.protocols.pbcast.NAKACK] (OOB-19,null) dnb-2-3447: dropped message from dnb-2-52617 (not in table [dnb-2-3447]), view=[dnb-2-3447|0] [dnb-2-3447]
2011-02-07 18:40:31,628 WARN [org.jgroups.protocols.pbcast.NAKACK] (OOB-19,null) dnb-2-3447: dropped message from dnb-2-52617 (not in table [dnb-2-3447]), view=[dnb-2-3447|0] [dnb-2-3447]
I think you should base your tests on the clustering ones that verify that the cluster forms in the first place.
In the client log, I only see:
test.log:60:2011-02-07 18:37:47,761 INFO [org.infinispan.client.hotrod.impl.operations.HotRodOperation] New topology: {/127.0.0.3:11222=7359}
test.log:409:2011-02-07 18:40:22,068 INFO [org.infinispan.client.hotrod.impl.operations.HotRodOperation] New topology: {/127.0.0.3:11222=6146}
So, since 127.0.0.2 and 127.0.0.3 did not cluster, the client never knows about them.
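For reference, the initial server list comes from the client's configuration; a minimal sketch of a hotrod-client.properties entry naming both nodes (the property name is the standard Infinispan Hot Rod client one, and the port 11222 matches the logs above):

```properties
# Sketch: initial server list for the Hot Rod client. Both cluster
# nodes must be listed, as host:port pairs separated by ';'.
infinispan.client.hotrod.server_list=127.0.0.2:11222;127.0.0.3:11222
```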
Also, there's another bug in your test, highlighted by the fact that the test thinks the server list is 127.0.0.3 and 127.0.0.1:
2011-02-07 18:37:45,928 TRACE [org.infinispan.client.hotrod.impl.transport.tcp.RoundRobinBalancingStrategy] New server list is: [/127.0.0.3:11222, localhost/127.0.0.1:11222]. Resetting index to 0
That should clearly not be the case, because after I uncommented the following in local.properties:
node0=${env.MYTESTIP_1}
node1=${env.MYTESTIP_2}
I ran:
./build.sh -Denv.MYTESTIP_1=127.0.0.2 -Denv.MYTESTIP_2=127.0.0.3 tests-clustering-datagrid-failover-tests
I think your test should have worked even if the cluster did not form, had the server list been the correct one: 127.0.0.2 and 127.0.0.3. It's up to you to decide what to test, but make sure that the test is clear about it.
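The TRACE line above comes from the client's round-robin balancer. As a rough illustration of why a wrong entry matters, here is a simplified stdlib-only model (an assumption for illustration, not the actual Infinispan RoundRobinBalancingStrategy): the balancer cycles through whatever list it was given and resets its index when a new list arrives, so a wrong entry such as 127.0.0.1 simply stays in the rotation:

```java
import java.util.List;

// Simplified model (NOT Infinispan's actual RoundRobinBalancingStrategy)
// of the behaviour in the TRACE line: cycle through the configured server
// list, and reset the index to 0 whenever a new list arrives.
public class RoundRobinSketch {
    private List<String> servers;
    private int index;

    public RoundRobinSketch(List<String> initial) {
        this.servers = initial;
    }

    // New server list received: swap it in and reset the index, as the
    // "Resetting index to 0" TRACE message shows.
    public void setServers(List<String> newServers) {
        this.servers = newServers;
        this.index = 0;
    }

    // Hand out servers in round-robin order; a wrong entry (for example
    // 127.0.0.1 instead of 127.0.0.2) stays in the rotation.
    public String nextServer() {
        String server = servers.get(index);
        index = (index + 1) % servers.size();
        return server;
    }

    public static void main(String[] args) {
        // The list the test actually built (one entry wrong):
        RoundRobinSketch b = new RoundRobinSketch(
            List.of("/127.0.0.3:11222", "localhost/127.0.0.1:11222"));
        System.out.println(b.nextServer()); // /127.0.0.3:11222
        System.out.println(b.nextServer()); // localhost/127.0.0.1:11222
        // After the intended list is supplied, the index resets:
        b.setServers(List.of("/127.0.0.2:11222", "/127.0.0.3:11222"));
        System.out.println(b.nextServer()); // /127.0.0.2:11222
    }
}
```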
I should have done some more investigation before logging this bug: when I started up the failing host, I neglected to add the -u option to specify the multicast address used by the testsuite, so the hosts would not cluster, as you rightly pointed out. Apologies - I'll update the test case. The test case has been updated and the issue no longer appears. This issue was due to a test framework problem, and so was not a bug.
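For the record, a sketch of how the nodes might be started so that both share the testsuite's multicast address (the -c and -b usage matches the run.sh invocation in the test output; the -u multicast address value here is purely illustrative):

```sh
# Illustrative only: start each node with the same multicast address (-u)
# so the JGroups cluster can form. The address value is a placeholder.
./run.sh -c datagrid-failover-0 -b 127.0.0.2 -u 239.255.100.100
./run.sh -c datagrid-failover-1 -b 127.0.0.3 -u 239.255.100.100
```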
project_key: EDG

The test case org.jboss.test.cluster.datagrid.test.HotRodClientFailoverTestCase is failing with the following exception:

<testcase classname="org.jboss.test.cluster.datagrid.test.HotRodClientFailoverTestCase" name="testFailoverWithCleanShutdown" time="192.895">
<error message="Problems writing data to stream" type="org.infinispan.client.hotrod.exceptions.TransportException">org.infinispan.client.hotrod.exceptions.TransportException: Problems writing data to stream
at org.infinispan.client.hotrod.impl.transport.tcp.TcpTransport.writeByte(TcpTransport.java:101)
at org.infinispan.client.hotrod.impl.operations.HotRodOperation.writeHeader(HotRodOperation.java:52)
at org.infinispan.client.hotrod.impl.operations.AbstractKeyOperation.sendKeyOperation(AbstractKeyOperation.java:43)
at org.infinispan.client.hotrod.impl.operations.GetOperation.executeOperation(GetOperation.java:26)
at org.infinispan.client.hotrod.impl.operations.RetryOnFailureOperation.execute(RetryOnFailureOperation.java:38)
at org.infinispan.client.hotrod.impl.RemoteCacheImpl.get(RemoteCacheImpl.java:283)
at org.jboss.test.cluster.datagrid.test.HotRodClientFailoverTestCase.testFailoverWithCleanShutdown(HotRodClientFailoverTestCase.java:200)
Caused by: java.io.IOException: Broken pipe
at sun.nio.ch.FileDispatcher.write0(Native Method)
at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:100)
at sun.nio.ch.IOUtil.write(IOUtil.java:71)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:334)
at java.nio.channels.Channels.writeFullyImpl(Channels.java:59)
at java.nio.channels.Channels.writeFully(Channels.java:81)
at java.nio.channels.Channels.access$000(Channels.java:47)
at java.nio.channels.Channels$1.write(Channels.java:155)
at java.io.OutputStream.write(OutputStream.java:58)
at java.nio.channels.Channels$1.write(Channels.java:136)
at org.infinispan.client.hotrod.impl.transport.tcp.TcpTransport.writeByte(TcpTransport.java:96)
</error>
</testcase>
<system-out><![CDATA[ShellScriptExecutor: executing shell command -> bash -c ./run.sh -c datagrid-failover-1 -b 192.168.0.101
Server started successfully
Putting keys and values
Getting keys and values before server shutdown
shutdown successfully invoked on MBean jboss.system:type=Server; waiting for shutdown
waitForServerStop: waiting 90 seconds
Server killed successfully
Getting keys and values after server shutdown
]]></system-out>
<system-err><![CDATA[]]></system-err>

The test does the following:
- starts up two instances of EDG using the datagrid-failover profile names
- starts a RemoteCacheManager with a server list pointing to the two instances and gets the default cache
- performs puts using keys key1, key2, key3
- performs gets of key1, key2, key3
- shuts down one server instance (datagrid-failover-1)
- performs gets of key1, key2, key3

The exception is being raised after shutdown of the server, during the second round of gets.
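The stack trace shows the get going through RetryOnFailureOperation. As a minimal stdlib-only sketch of why the failure surfaces (a simplified model for illustration, not Infinispan's actual implementation): the operation can only retry against servers the client knows about, so a one-entry server list leaves nowhere to fail over to, and the exception propagates:

```java
import java.util.List;
import java.util.function.Function;

// Simplified model (NOT Infinispan's actual code) of retry-on-failure:
// try the operation against each known server in turn; only servers in
// the client's list are candidates for failover.
public class RetrySketch {
    static String executeWithRetry(List<String> servers,
                                   Function<String, String> op) {
        RuntimeException last = null;
        for (String server : servers) {       // one attempt per known server
            try {
                return op.apply(server);
            } catch (RuntimeException e) {    // stands in for TransportException
                last = e;                     // remember it, try the next server
            }
        }
        throw last;                           // every known server failed
    }

    // Fake transport for illustration: the shut-down node breaks the pipe.
    static String fakeGet(String server) {
        if (server.startsWith("127.0.0.2")) {
            throw new RuntimeException("Broken pipe");
        }
        return "value1";
    }

    public static void main(String[] args) {
        // A client that knows both nodes fails over and the get succeeds.
        System.out.println(executeWithRetry(
            List.of("127.0.0.2:11222", "127.0.0.3:11222"),
            RetrySketch::fakeGet));
        // A client that only ever learned one node (the situation in this
        // report, since the cluster never formed) has nowhere to retry,
        // so the exception reaches the caller.
        try {
            executeWithRetry(List.of("127.0.0.2:11222"), RetrySketch::fakeGet);
        } catch (RuntimeException e) {
            System.out.println("get failed: " + e.getMessage());
        }
    }
}
```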