Bug 807741 - Reopened: Invalid magic number
Reopened: Invalid magic number
Status: VERIFIED
Product: JBoss Data Grid 6
Classification: JBoss
Component: Infinispan (Show other bugs)
6.0.0
Unspecified Unspecified
high Severity high
: ER5
: 6.1.0
Assigned To: Tristan Tarrant
nobody nobody
:
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2012-03-28 11:19 EDT by Michal Linhard
Modified: 2014-03-17 00:03 EDT (History)
7 users (show)

See Also:
Fixed In Version:
Doc Type: Known Issue
Doc Text:
The issue is a race condition on the Hot Rod server, which can lead to topologies being sent erroneously as a result of addition of a new node to the cluster. When the issue appears, clients view "Invalid magic number" error messages due to unexpected data within the stream. </para> <para> In such a situation, the recommended approach is to restart the client. If the client is not restarted, the client may recover after the unexpected data is consumed but this is not guaranteed. If the client recovers with a restart, the view topology displayed does not display one of the nodes added, resulting in uneven request distribution.
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
ISPN-1948 patch (163.70 KB, patch)
2012-06-07 04:45 EDT, Michal Linhard
no flags Details | Diff


External Trackers
Tracker ID Priority Status Summary Last Updated
JBoss Issue Tracker ISPN-2209 Major Resolved HotRod client occasionally reports invalid magic number in elasticity tests 2017-06-27 04:37 EDT

  None (edit)
Description Michal Linhard 2012-03-28 11:19:43 EDT
see
https://issues.jboss.org/browse/ISPN-1948
Comment 1 JBoss JIRA Server 2012-03-29 06:08:27 EDT
Galder Zamarreño <galder.zamarreno@redhat.com> updated the status of jira ISPN-1948 to Coding In Progress
Comment 2 JBoss JIRA Server 2012-04-02 13:55:45 EDT
Michal Linhard <mlinhard@redhat.com> made a comment on jira ISPN-1948

Reappeared in ER6 tests:
http://www.qa.jboss.com/~mlinhard/hyperion/run48-elas-dist-size16/hyperion1135.log
Comment 3 JBoss JIRA Server 2012-04-03 03:24:49 EDT
Galder Zamarreño <galder.zamarreno@redhat.com> made a comment on jira ISPN-1948

First of all, thx Michal for getting trace logs for this. I must say that this is quite an odd one. I've posted the relevant bits of the logs to https://gist.github.com/2290090. Basically, the client sends the request, server handles it correctly and encodes a response without any issues, yet the client gets messed up. My gut feeling is that there's either a mix up in sockets/buffers/channels on the server, or in the client of some sort. I'm gonna add some extra logging information to both the client and the server to find out more about these situations. With this info as well, we should be able to get more targeted logging for next round of testing.
Comment 4 JBoss JIRA Server 2012-04-03 03:47:47 EDT
Galder Zamarreño <galder.zamarreno@redhat.com> made a comment on jira ISPN-1948

@Michal, what do you know about the circumstances when this happens? It happens when a node is added or removed, right?
Comment 5 JBoss JIRA Server 2012-04-03 05:38:23 EDT
Michal Linhard <mlinhard@redhat.com> made a comment on jira ISPN-1948

first occurence:
edg-perf05.log	occurences: 1113 first: 15:18:01,998 last: 15:19:00,510 first time: 203.233 last time: 261.745
this started shortly after node03 join (15:17:56,382) and shortly after cluster formed (15:18:54,651)
Comment 6 JBoss JIRA Server 2012-04-03 05:45:15 EDT
Michal Linhard <mlinhard@redhat.com> made a comment on jira ISPN-1948

occurence on size16 test in hyperion:
hyperion1135.log occurences: 9242 first: 08:50:05,420 last: 08:58:01,491 first time: 378.689 last time: 854.760

begin: shortly after node 12 started (08:49:55,463)
end: 08:58:01,259 shortly after node 3 was killed

It definitely happens when the view changes, but I can't connect it with a specific event.
Comment 7 JBoss JIRA Server 2012-04-03 12:09:27 EDT
Galder Zamarreño <galder.zamarreno@redhat.com> made a comment on jira ISPN-1948

@Michal, I've added a pull req to get more logging.

Once that's in, next tests please get the following logging:
- TRACE on org.infinispan.client.hotrod on the client
- TRACE on org.infinispan.server for the server.
Comment 8 mark yarborough 2012-04-04 09:18:26 EDT
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
CCFR - Manik
Comment 9 JBoss JIRA Server 2012-04-04 11:00:08 EDT
Mircea Markus <mmarkus@redhat.com> made a comment on jira ISPN-1948

@Galderz - your pull request with extra logging is now integrated.
Comment 10 JBoss JIRA Server 2012-04-04 11:00:09 EDT
Mircea Markus <mmarkus@redhat.com> made a comment on jira ISPN-1948

@Galderz - your pull request with extra logging is now integrated.
Comment 11 Manik Surtani 2012-04-05 09:58:43 EDT
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1 +1,5 @@
-CCFR - Manik+CCFR
+
+This is a suspected issue. We haven't completely proven or disproven its existence as yet and this is still work in progress. W have created additional logging in the upstream version for QE to test against and help gather more information on the potential bug. 
+
+Until we have more details I cannot really comment too much on the issue except that it appears to happen under load from Hot Rod clients when connected to certain servers which are then taken offline part way during an operation.
Comment 12 Misha H. Ali 2012-04-05 10:27:00 EDT
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1,5 +1,4 @@
-CCFR
+This is a suspected issue. We haven't completely proven or disproven its existence as yet and this is still work in progress. We have created additional logging in the upstream version for QE to test against and help gather more information on the potential bug. 
-
+</para>
-This is a suspected issue. We haven't completely proven or disproven its existence as yet and this is still work in progress. W have created additional logging in the upstream version for QE to test against and help gather more information on the potential bug. 
+<para>
-
+This problem occurs under load from Hot Rod clients when connected to certain servers which are then taken offline part way during an operation.-Until we have more details I cannot really comment too much on the issue except that it appears to happen under load from Hot Rod clients when connected to certain servers which are then taken offline part way during an operation.
Comment 13 Misha H. Ali 2012-04-05 10:27:19 EDT
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1,4 +1,4 @@
 This is a suspected issue. We haven't completely proven or disproven its existence as yet and this is still work in progress. We have created additional logging in the upstream version for QE to test against and help gather more information on the potential bug. 
 </para>
 <para>
-This problem occurs under load from Hot Rod clients when connected to certain servers which are then taken offline part way during an operation.+This problem occurs under load from Hot Rod clients when connected to certain servers which are then taken offline during an operation.
Comment 14 Tristan Tarrant 2012-04-13 07:58:44 EDT
Please download 
http://www.dataforte.net/data/jboss-datagrid-server-bz807741.tar.bz2

and unpack it inside an ER6 installation and rerun the tests. This should give us extra logs (trace level obviously).
Comment 15 JBoss JIRA Server 2012-04-13 11:32:39 EDT
Michal Linhard <mlinhard@redhat.com> made a comment on jira ISPN-1948

The new tracelog
http://www.qa.jboss.com/~mlinhard/hyperion/run61
Comment 16 JBoss JIRA Server 2012-04-16 10:24:59 EDT
Galder Zamarreño <galder.zamarreno@redhat.com> made a comment on jira ISPN-1948

Michal, thx for the link but do you know if there was any error related to "Invalid..." message in the clients?

I've looked at all the hyperion* files in the link and they show no error message. And what's the difference between those files and the nodeX log files in http://www.qa.jboss.com/~mlinhard/hyperion/run61/report/ ?
Comment 17 JBoss JIRA Server 2012-04-16 10:37:19 EDT
Michal Linhard <mlinhard@redhat.com> made a comment on jira ISPN-1948

in http://www.qa.jboss.com/~mlinhard/hyperion/run61 :

hyperion1132.log.gz - hyperion1150.log.gz - Client trace logs

in http://www.qa.jboss.com/~mlinhard/hyperion/run61/report :

node0001.log.gz - node0016.log.gz - Server trace logs

hyperion1096.log - hyperion1111.log - sfDaemon logs on server side
hyperion1112.log - hyperion1131.log - sfDaemons logs on unused nodes
hyperion1151.log - controller sfDaemon log
Comment 18 JBoss JIRA Server 2012-04-16 10:38:56 EDT
Michal Linhard <mlinhard@redhat.com> made a comment on jira ISPN-1948

If you haven't found anything in hyperion1132 - 1150 than it's probably not there. I'd have to try again. I haven't checked them myself yet.
Comment 19 JBoss JIRA Server 2012-04-17 04:06:38 EDT
Galder Zamarreño <galder.zamarreno@redhat.com> made a comment on jira ISPN-1948

Indeed, 32 to 51 didn't show anything:

{code}
[g@dhcp-144-239:~/Go/jira/1948i/1304]% grep Invalid hyperion1132.log    
[g@dhcp-144-239:~/Go/jira/1948i/1304]% grep Invalid hyperion1133.log
[g@dhcp-144-239:~/Go/jira/1948i/1304]% grep Invalid hyperion1134.log
[g@dhcp-144-239:~/Go/jira/1948i/1304]% grep Invalid hyperion1135.log
[g@dhcp-144-239:~/Go/jira/1948i/1304]% grep Invalid hyperion1136.log
[g@dhcp-144-239:~/Go/jira/1948i/1304]% grep Invalid hyperion1137.log
[g@dhcp-144-239:~/Go/jira/1948i/1304]% grep Invalid hyperion1138.log
[g@dhcp-144-239:~/Go/jira/1948i/1304]% grep Invalid hyperion1139.log
[g@dhcp-144-239:~/Go/jira/1948i/1304]% grep Invalid hyperion1140.log
[g@dhcp-144-239:~/Go/jira/1948i/1304]% grep Invalid hyperion1141.log
[g@dhcp-144-239:~/Go/jira/1948i/1304]% grep Invalid hyperion1142.log
[g@dhcp-144-239:~/Go/jira/1948i/1304]% grep Invalid hyperion1143.log
[g@dhcp-144-239:~/Go/jira/1948i/1304]% grep Invalid hyperion1144.log
[g@dhcp-144-239:~/Go/jira/1948i/1304]% grep Invalid hyperion1145.log
[g@dhcp-144-239:~/Go/jira/1948i/1304]% grep Invalid hyperion1146.log
[g@dhcp-144-239:~/Go/jira/1948i/1304]% grep Invalid hyperion1147.log
[g@dhcp-144-239:~/Go/jira/1948i/1304]% grep Invalid hyperion1148.log
[g@dhcp-144-239:~/Go/jira/1948i/1304]% grep Invalid hyperion1149.log
[g@dhcp-144-239:~/Go/jira/1948i/1304]% grep Invalid hyperion1150.log
[g@dhcp-144-239:~/Go/jira/1948i/1304]% grep Invalid hyperion1151.log{code}
Comment 20 JBoss JIRA Server 2012-04-17 04:06:57 EDT
Galder Zamarreño <galder.zamarreno@redhat.com> made a comment on jira ISPN-1948

Btw, these are huge files, so I unzipped them individually and checked.
Comment 21 Michal Linhard 2012-05-04 05:41:21 EDT
So far I haven't reproduced this in elasticity/resilience tests with ER7 up to cluster size 16. I have couple of 32 node tests in front of me, I'll mark VERIFIED once I've done those.
Comment 22 Michal Linhard 2012-05-08 12:38:58 EDT
Unfortunately the 32 node elasticity tests showed this again:

http://www.qa.jboss.com/~mlinhard/hyperion/run94-elas-dist-32/report/loganalysis/client/
Comment 23 JBoss JIRA Server 2012-05-08 12:41:33 EDT
Michal Linhard <mlinhard@redhat.com> made a comment on jira ISPN-1948

In hyperion1132.log there's also a special kind of GET exception:

{code}
08:32:25,567 ERROR [org.jboss.smartfrog.jdg.loaddriver.DriverNode] (Client-18) Error doing GET(key454518) to node node0025 (lastOpTime 50 ms)
org.infinispan.client.hotrod.exceptions.TransportException:: Unable to unmarshall byte stream
	at org.infinispan.client.hotrod.impl.RemoteCacheImpl.bytes2obj(RemoteCacheImpl.java:450)
	at org.infinispan.client.hotrod.impl.RemoteCacheImpl.get(RemoteCacheImpl.java:341)
	at org.jboss.qa.jdg.adapter.HotRodAdapter$HotRodRemoteCacheAdapter.get(HotRodAdapter.java:244)
	at org.jboss.smartfrog.jdg.loaddriver.DriverNodeImpl$ClientThread.makeRequest(DriverNodeImpl.java:233)
	at org.jboss.smartfrog.jdg.loaddriver.DriverNodeImpl$ClientThread.run(DriverNodeImpl.java:386)
Caused by: java.io.IOException: Unsupported protocol version 49
	at org.jboss.marshalling.river.RiverUnmarshaller.start(RiverUnmarshaller.java:1184)
	at org.infinispan.marshall.jboss.AbstractJBossMarshaller.startObjectInput(AbstractJBossMarshaller.java:148)
	at org.infinispan.marshall.jboss.AbstractJBossMarshaller.objectFromByteBuffer(AbstractJBossMarshaller.java:129)
	at org.infinispan.marshall.AbstractMarshaller.objectFromByteBuffer(AbstractMarshaller.java:90)
	at org.infinispan.client.hotrod.impl.RemoteCacheImpl.bytes2obj(RemoteCacheImpl.java:448)
	... 4 more

{code}

I'm not creating a special JIRA for this cause I think this is related to erroneously shifted data in some of the streams.
Comment 24 mark yarborough 2012-05-09 09:23:24 EDT
Per Tristan triage, moved from ER8 to ER9 target release.
Comment 25 JBoss JIRA Server 2012-05-11 05:22:08 EDT
Galder Zamarreño <galder.zamarreno@redhat.com> made a comment on jira ISPN-1948

Don't know what's causing this yet. hyperion1132.log shows some connection refused errors on the start which I think should be solved with ISPN-1995. Then, the first thing that appears is the "IOException: Unsupported protocol version 49" error. node0025 shows no errors/warns/exceptions but it appears to be in the middle of rehashing ___hotRodTopologyCache, memcachedCache and testCache.

I'd suggest repeating the test with an ER version that has ISPN-1995 fixed (in case this is some sort of side effect of that jira but I doubt it), and add some further logging:

Server: TRACE on org.infinispan.server.hotrod
Client: DEBUG on org.infinispan.client.hotrod
Comment 26 JBoss JIRA Server 2012-05-11 05:38:05 EDT
Galder Zamarreño <galder.zamarreno@redhat.com> made a comment on jira ISPN-1948

There're more oddities here:

{code}[g@:~/Go/jira/1948i]% grep -r "InvalidResponseException:: Invalid message id" . | grep 201326591
Invalid message id. Expected 743598 and received 201326591
Invalid message id. Expected 761419 and received 201326591
Invalid message id. Expected 783396 and received 201326591{code}

Seems like client receives 3 messages with the same 201326591 message id (0xBFFFFFF)? It looks more like a different part of the response is treated as the msg id, rather than the actual message id.
Comment 27 JBoss JIRA Server 2012-05-11 05:38:24 EDT
Galder Zamarreño <galder.zamarreno@redhat.com> made a comment on jira ISPN-1948

Don't know what's causing this yet. hyperion1132.log shows some connection refused errors on the start which I think should be solved with ISPN-1995. Then, the first thing that appears is the "IOException: Unsupported protocol version 49" error. node0025 shows no errors/warns/exceptions but it appears to be in the middle of rehashing ___hotRodTopologyCache, memcachedCache and testCache.

I'd suggest repeating the test with an ER version that has ISPN-1995 fixed (in case this is some sort of side effect of that jira but I doubt it), and add some further logging:

{code}Server: TRACE on org.infinispan.server.hotrod
Client: DEBUG on org.infinispan.client.hotrod{code}
Comment 28 Michal Linhard 2012-05-16 05:53:47 EDT
I did these tests with ER8.1 with fix for Netty and BZ 809631 (uneven requests):

Resilience 16 nodes (3owners, crash2)
http://www.qa.jboss.com/~mlinhard/hyperion/run105-resi-dist-16-ER8.1-nettyfix/logs/analysis/client/

Resilience 32 nodes (5owners, crash4)
http://www.qa.jboss.com/~mlinhard/hyperion/run104-resi-dist-32-ER8.1-nettyfix/logs/analysis/client/

Elasticity 8->16->8
http://www.qa.jboss.com/~mlinhard/hyperion/run106-elas-dist-16-ER8.1-nettyfix/logs/analysis/client/

Elasticity 16->32->16
http://www.qa.jboss.com/~mlinhard/hyperion/run107-elas-dist-32-ER8.1-nettyfix/logs/analysis/client/

And couldn't see the "Invalid magic number issue" in any of them. Is it possible that the netty fix solved it ?
Comment 29 JBoss JIRA Server 2012-05-17 08:52:44 EDT
Galder Zamarreño <galder.zamarreno@redhat.com> made a comment on jira ISPN-1948

Maybe... :)
Comment 30 Michal Linhard 2012-05-18 05:01:24 EDT
I haven't seen this in these tests with ER9:

http://www.qa.jboss.com/~mlinhard/hyperion/run110-resi-dist-32-ER9-own5/
http://www.qa.jboss.com/~mlinhard/hyperion/run111-resi-dist-16-ER9-own3/
http://www.qa.jboss.com/~mlinhard/hyperion/run112-elas-dist-16-ER9/

but I'd rather wait before marking this verified after one full round of testing after https://bugzilla.redhat.com/show_bug.cgi?id=809631 is fixed.
Comment 31 JBoss JIRA Server 2012-05-18 05:08:31 EDT
Galder Zamarreño <galder.zamarreno@redhat.com> made a comment on jira ISPN-1948

FYI: ISPN-2052 - I don't think you guys send a remove with force return previous value flag, but the exception is similar.
Comment 32 Michal Linhard 2012-05-22 14:35:23 EDT
Appeared again in ER10 32 node resilience tests:
http://www.qa.jboss.com/~mlinhard/hyperion/run124-resi-dist-32-own2-ER10/logs/analysis/client/
Comment 33 mark yarborough 2012-05-23 10:19:19 EDT
Decision: Not a blocker for JDG 6. Will list as known issue in Release Notes.
Comment 34 JBoss JIRA Server 2012-05-23 11:40:25 EDT
Michal Linhard <mlinhard@redhat.com> made a comment on jira ISPN-1948

recent trace logs: http://www.qa.jboss.com/~mlinhard/test_results/ispn1948tracelog2.zip
Comment 35 JBoss JIRA Server 2012-05-24 06:03:57 EDT
Galder Zamarreño <galder.zamarreno@redhat.com> made a comment on jira ISPN-1948

Michal, I don't have permissions to download that.
Comment 36 JBoss JIRA Server 2012-05-24 07:40:40 EDT
Michal Linhard <mlinhard@redhat.com> made a comment on jira ISPN-1948

Fixed, please try again.
Comment 37 JBoss JIRA Server 2012-05-25 09:57:20 EDT
Galder Zamarreño <galder.zamarreno@redhat.com> made a comment on jira ISPN-1948

A summary of what I'm seeing in the latest logs. Effectively, what seems to be happening is that the (buffered)input stream in the client is being reset/rewinded somehow. It seems to go back in the buffer to re-read host information when it should be reading the value of the get() response. Why this happens? No idea...
Comment 38 JBoss JIRA Server 2012-05-25 10:05:17 EDT
Galder Zamarreño <galder.zamarreno@redhat.com> made a comment on jira ISPN-1948

Well, either there's a reset, or the client is reading from the wrong socket input stream. IOW, if it's reading from socket stream belonging to a different thread.
Comment 39 JBoss JIRA Server 2012-05-25 10:28:37 EDT
Manik Surtani <manik.surtani@jboss.com> made a comment on jira ISPN-1948

Can't you detect if the client is reading from the wrong socket stream by logging what it is reading from each time?
Comment 40 JBoss JIRA Server 2012-05-25 10:31:01 EDT
Galder Zamarreño <galder.zamarreno@redhat.com> made a comment on jira ISPN-1948

No sign of thread safety issue. Same transport instance borrowed from the pool and returned for the client thread showing the problem, and no other thread using the same transport instance (buffer input stream is a final instance variable of transport). Attached updated summary.
Comment 41 JBoss JIRA Server 2012-05-28 06:20:47 EDT
Galder Zamarreño <galder.zamarreno@redhat.com> made a comment on jira ISPN-1948

@Manik, I'm pretty sure that's not the case here because it uses the same Transport instance throughout sending the request and reading from it, and all socket instance variables in Transport are final, and we're logging what transport is reading each time. See my updated summary.
Comment 42 JBoss JIRA Server 2012-05-28 07:59:09 EDT
Dan Berindei <dberinde@redhat.com> made a comment on jira ISPN-1948

Galder, I think it's a concurrency issue after all. The header logged by the client looks like this:

A1E1BC030400 
01 - topology changed
04 - view id
00 02 - num owners
02 - hash function
FFFFFFFF07 - hash space (MAX_INT)
1F - members size (31)
80 04 - numVirtualNodes (512)

So the number of servers written by the server is 31, but there are actually 32 server addresses afterwards. I think this is because the topology address cache modified after the server wrote the size but before it wrote the actual addresses.
Comment 43 JBoss JIRA Server 2012-05-28 08:02:33 EDT
Galder Zamarreño <galder.zamarreno@redhat.com> made a comment on jira ISPN-1948

The confusion came from lack of logging on the server side and on the client side of how many nodes came in the view. So, I reverted to doing counts rather than decoding the message. I'm adding some clearer messages for the future and will check what's wrong with the concurrency.
Comment 44 JBoss JIRA Server 2012-05-28 09:30:48 EDT
Galder Zamarreño <galder.zamarreno@redhat.com> made a comment on jira ISPN-1948

Great... we've got the most reliable set of notifications... not... the following test (https://gist.github.com/2819113) fails with:

java.lang.AssertionError: size is: 0
	at org.infinispan.notifications.cachelistener.SizeAndListenerTest.test000(SizeAndListenerTest.java:78)

How do we expect people to use these correctly if we provide no basic guarantees like cache size? Jeez...

It doesn't work with CacheEntryModified either cos the container is updated *after* all notifications have been fired. Wonderful...
Comment 45 JBoss JIRA Server 2012-05-28 09:31:08 EDT
Galder Zamarreño <galder.zamarreno@redhat.com> made a comment on jira ISPN-1948

Great... we've got the most reliable set of notifications... not... the following test (https://gist.github.com/2819113) fails with:

{code}java.lang.AssertionError: size is: 0
	at org.infinispan.notifications.cachelistener.SizeAndListenerTest.test000(SizeAndListenerTest.java:78)
{code}

How do we expect people to use these correctly if we provide no basic guarantees like cache size? Jeez...

It doesn't work with CacheEntryModified either cos the container is updated *after* all notifications have been fired. Wonderful...
Comment 46 JBoss JIRA Server 2012-05-28 09:49:46 EDT
Galder Zamarreño <galder.zamarreno@redhat.com> made a comment on jira ISPN-1948

There's no easy fix for this in the current architecture based on cache listener. The idea with ViewIdUpdater listener was to avoid the need to cache the view id itself after updating the address cache, since this would two operations to the cache and would require batching or transactions to make it atomic...etc. There is however a method that will provide the guarantees that I want, and that is interceptors, cos they can be injected anywhere we want, and they can be placed after we know the container has been updated, i.e. EntryWrappingInterceptor.

I thin there's another JIRA on cache size guarantees with listeners, but this really needs to be revisited when JSR-107 cache listeners are implemented.
Comment 47 JBoss JIRA Server 2012-05-29 05:58:39 EDT
Galder Zamarreño <galder.zamarreno@redhat.com> updated the status of jira ISPN-1948 to Reopened
Comment 48 JBoss JIRA Server 2012-05-29 05:58:39 EDT
Galder Zamarreño <galder.zamarreno@redhat.com> made a comment on jira ISPN-1948

The issue is not yet fixed. Just added some further logging yesterday.
Comment 49 JBoss JIRA Server 2012-05-30 06:04:22 EDT
Dan Berindei <dberinde@redhat.com> made a comment on jira ISPN-1948

Galder, I think saving the members list in the listener itself should work and it should be easier to implement than using an interceptor.
Comment 50 JBoss JIRA Server 2012-05-30 12:27:04 EDT
Galder Zamarreño <galder.zamarreno@redhat.com> made a comment on jira ISPN-1948

Dan, I'm not so sure but I'll investigate it. I'm not even sure that there're any guarantees that cache.get() will contents with the entry added or not from CacheEntryCreated. Thanks for the input.
Comment 51 JBoss JIRA Server 2012-05-31 13:06:14 EDT
Galder Zamarreño <galder.zamarreno@redhat.com> made a comment on jira ISPN-1948

Dan, your suggestion won't work because of https://docs.jboss.org/author/pages/viewpage.action?pageId=5832860, and to prove it, here's a pretty comprehensive test I've built: https://gist.github.com/2844743 - All tests fail.

The only guarantees you have is wrt event.getKey/event.getValue. So, for the listener solution to still work, I'd need to catch all the events in the cache and maintain a copy in the listener itself that can be guaranteed to be the expected one. Might as well use a notification bus then... ;)

I think an interceptor, located in the right place, can provide me with the right guarantees. I'll create a test to verify it.
Comment 52 JBoss JIRA Server 2012-06-04 11:08:27 EDT
Dan Berindei <dberinde@redhat.com> made a comment on jira ISPN-1948

Hmmm, you're right, I guess I was relying on the invocation context being thread-local... it may still work with transactional caches, because the listener should reuse the thread's transaction, but it wouldn't be useful for your use case anyway.
Comment 53 Michal Linhard 2012-06-05 12:01:15 EDT
Appeared in ER11 elasticity tests sizes 8->16->8 and 16->32->16
Comment 54 Misha H. Ali 2012-06-05 23:26:17 EDT
This bug is nominated for JDG 6 GA Release Notes. Assigning NEEDINFO to assignee to ensure that the technical note text still accurately reflects the problem.
Comment 55 JBoss JIRA Server 2012-06-06 06:13:49 EDT
Galder Zamarreño <galder.zamarreno@redhat.com> made a comment on jira ISPN-1948

Michal, does ER11 has the last proposed fix in it?
Comment 56 JBoss JIRA Server 2012-06-06 06:25:40 EDT
Michal Linhard <mlinhard@redhat.com> made a comment on jira ISPN-1948

Just in the process of running the tests with your patch. I'll run at least 5 different tests - similar to those that demonstrated it lately.

First one didn't show the magic number problem.
Comment 57 JBoss JIRA Server 2012-06-06 06:26:23 EDT
Michal Linhard <mlinhard@redhat.com> made a comment on jira ISPN-1948

ER11 doesn't contain the fix, it's based on 5.1.5.Final
Comment 58 mark yarborough 2012-06-06 09:20:26 EDT
Triaged to post-6.0 release: Rationale - potential fix too hard to verify, and test risk too high this late in the release cycle.
Comment 60 Michal Linhard 2012-06-07 04:45:25 EDT
Created attachment 590128 [details]
ISPN-1948 patch

attaching the patch infinispan-server-hotrod-5.1.5.FINAL-redhat-1.jar
md5: 0970619a4dfeb20440004be5d3fb02ff
Comment 61 JBoss JIRA Server 2012-06-07 06:32:42 EDT
Galder Zamarreño <galder.zamarreno@redhat.com> made a comment on jira ISPN-1948

So, the patch looking good so far? Do you plan any further testing in the near future?
Comment 62 JBoss JIRA Server 2012-06-07 08:56:52 EDT
Michal Linhard <mlinhard@redhat.com> made a comment on jira ISPN-1948

Yes. The patch looks good. The magic number problem doesn't appear, nor it causes any other problems in elasticity tests, do you think we should run more tests or different types of tests ?
Comment 63 JBoss JIRA Server 2012-06-07 09:52:49 EDT
Galder Zamarreño <galder.zamarreno@redhat.com> made a comment on jira ISPN-1948

Your elasticity tests where the best tests we had, and we'd easily see the failure if it was there. I'm confident this now works as expected since I was able to reproduce similar scenarios in smaller scale, and with an interceptor, we can avoid this type of issues. I'll make a note in the pull req to get this integrated. Thx for all your help Michal!!!
Comment 64 Galder Zamarreño 2012-06-08 04:53:05 EDT
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1,4 +1,3 @@
-This is a suspected issue. We haven't completely proven or disproven its existence as yet and this is still work in progress. We have created additional logging in the upstream version for QE to test against and help gather more information on the potential bug. 
+The issue is a race condition on the Hot Rod server, which can lead to sending topologies erroneously as a result of addition of a new node to the cluster. When the issue appears, clients will start seeing "Invalid magic number" error messages as a result of the stream containing unexpected data.
 </para>
-<para>
+When the issue appears, the safest thing to do is to restart the client, but the client might recover itself once all the unexpected data has been consumed. If the client recovers, the view topology it has will be lacking one of the added nodes, so although it would work relatively Ok, it would lead to some uneven request distribution.-This problem occurs under load from Hot Rod clients when connected to certain servers which are then taken offline during an operation.
Comment 65 JBoss JIRA Server 2012-06-08 08:43:10 EDT
Dan Berindei <dberinde@redhat.com> updated the status of jira ISPN-1948 to Resolved
Comment 66 Michal Linhard 2012-06-08 10:51:33 EDT
Occured in CR1 elasticity tests:
http://www.qa.jboss.com/~mlinhard/hyperion/run177-elas-dist-32-CR1/logs/analysis/client/
Comment 67 Misha H. Ali 2012-06-12 01:06:07 EDT
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1,3 +1,4 @@
-The issue is a race condition on the Hot Rod server, which can lead to sending topologies erroneously as a result of addition of a new node to the cluster. When the issue appears, clients will start seeing "Invalid magic number" error messages as a result of the stream containing unexpected data.
+The issue is a race condition on the Hot Rod server, which can lead to topologies being sent erroneously as a result of addition of a new node to the cluster. When the issue appears, clients will start seeing "Invalid magic number" error messages as a result of unexpected data within the stream.
 </para>
-When the issue appears, the safest thing to do is to restart the client, but the client might recover itself once all the unexpected data has been consumed. If the client recovers, the view topology it has will be lacking one of the added nodes, so although it would work relatively Ok, it would lead to some uneven request distribution.+<para>
+When this problem is encountered, the recommended approach is to restart the client. If the client is not restarted, on some occasions, the client may recover after the unexpected data is consumed but this is not guaranteed. If the client recovers with a restart, the view topology displayed does not display one of the nodes added, resulting in uneven request distribution.
Comment 68 mark yarborough 2012-08-16 09:06:27 EDT
Need Galder to look at this again. Was release noted for 6.0.0. We'd like to fix it for 6.0.1 if valid, but is not a regression so can be deferred if necessary.
Comment 69 Tristan Tarrant 2012-08-20 05:21:18 EDT
Relinked to a new Jira to avoid pollution of old issue
Comment 70 JBoss JIRA Server 2012-08-20 06:48:06 EDT
Galder Zamarreño <galder.zamarreno@redhat.com> made a comment on jira ISPN-2209

@Tristan, the fix for this is in 5.1.6.FINAL. Can you confirm the issue is present with that version? You've marked 5.1.5 as affects...
Comment 71 JBoss JIRA Server 2012-08-20 07:53:34 EDT
Tristan Tarrant <ttarrant@redhat.com> made a comment on jira ISPN-2209

5.1.5.FINAL was the version that was tested by [~mlinhard] (JDG 6.0.0.CR1 which was then renamed to JDG 6.0.0.GA). I need QE to get some runs on the 6.0.1.ER2 builds (which use 5.1.6.FINAL).
Comment 72 mark yarborough 2012-08-22 09:56:35 EDT
Not a blocker for 6.0.1.
Comment 73 JBoss JIRA Server 2012-08-28 04:30:33 EDT
Galder Zamarreño <galder.zamarreno@redhat.com> updated the status of jira ISPN-2209 to Resolved
Comment 74 JBoss JIRA Server 2012-08-28 04:30:33 EDT
Galder Zamarreño <galder.zamarreno@redhat.com> made a comment on jira ISPN-2209

I have no reason to believe this is an issue any more with the latest versions. If you think otherwise after testing 5.1.6 or later, then re-open :)
Comment 76 mark yarborough 2012-11-14 09:42:41 EST
ttarrant will add jira links as appropriate.
Comment 77 mark yarborough 2012-12-05 09:26:34 EST
ttarrant requests re-create attempt using upcoming jdg61-ER5 build.
Comment 78 Michal Linhard 2012-12-17 04:36:32 EST
I've run 2 tests with JDG 6.1.0.ER5 that previously demonstrated this issue and haven't seen it. 

ER5 build is burdened with other errors that might affect manifestation of this issue, namely bugzillas 886565, 886549, 887323

This heisenbug wasn't easy to spot even in builds where it was confirmed, however for this build, I think these two tests are enough, if the issue reappears, I'll reopen.

The tests I did were elasticity tests 8-16-8 and 16-32-16 in hyperion:
http://www.qa.jboss.com/~mlinhard/hyperion3/run0009
http://www.qa.jboss.com/~mlinhard/hyperion3/run0010

Note You need to log in before you can comment on or make changes to this bug.