| Summary: | HotRod: InvalidResponseException:: Invalid message id. Expected cfec and received 84b6 | | |
|---|---|---|---|
| Product: | [JBoss] JBoss Data Grid 5 | Reporter: | Radoslav Husar <rhusar> |
| Component: | Infinispan | Assignee: | Default User <jbpapp-maint> |
| Status: | CLOSED NEXTRELEASE | QA Contact: | |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | ||
| Version: | EAP 5.1.0 EDG TP | CC: | galder.zamarreno, mlinhard, rachmato, rhusar, trustin |
| Target Milestone: | --- | ||
| Target Release: | EAP 5.1.0 EDG TP | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| URL: | http://jira.jboss.org/jira/browse/EDG-120 | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2011-08-03 08:34:09 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Link: Added: This issue is incorporated by JBPAPP-6054

I'd suggest doing the same thing as for JBPAPP-6051 - get some TRACE logs on the client and server side to figure this out.

Some more occurrences:
http://hudson.qa.jboss.com/hudson/view/EDG/job/edg-51x-stress-client-size1-hotrod/31/console-perf09/consoleText
http://hudson.qa.jboss.com/hudson/job/edg-51x-stress-client-size8-hotrod/4/console-perf02/consoleText

Michal, it's very interesting that you see it even when the cluster size is 1. That should make it easier to debug. Let me look at those files.

@Galder Got it! http://reports.qa.atl.jboss.com/~rhusar/logs/edg-51x-resilience-client-size4-hotrod-44.zip - 40 occurrences.

Right, I think the issue is happening as a result of a failure to update the hash view on the server side. For example:
{code}2011/03/30 09:39:46:213 EDT [WARN ][Runner - 0] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Error sampling data: <org.infinispan.client.hotrod.exceptions.InvalidResponseException:: Invalid message id. Expected ce0 and received 0>
org.infinispan.client.hotrod.exceptions.InvalidResponseException:: Invalid message id. Expected ce0 and received 0{code}
It's expecting messageId=3296 (0xce0) but got something else, and on the server side you see:
{code}[JBoss] 09:39:36,192 TRACE [HotRodEncoder$] Encode msg GetResponse{messageId=3296, operation=GetResponse, status=Success, data=ByteArray{size=505, hashCode=18e5b891, array=[2, 67, 1, -12, 33, 0, 0, 0, 0, -1, ..]}}
...
[JBoss] 09:39:46,193 ERROR [InvocationContextInterceptor] Execution error:
[JBoss] org.infinispan.util.concurrent.TimeoutException: Unable to acquire lock after [10 seconds] on key [view] for requestor [Thread[OOB-15,null,5,Thread Pools]]! Lock held by [(another thread)]{code}
Rado, you guys might wanna add thread information to the logs to make things clearer, but it's fairly clear that the server could not replace the view and that caused the issues here. I'll create a JIRA on ISPN to look into this.
Corresponding ISPN issue to track this. This needs some thinking.

Link: Added: This issue depends ISPN-1016

We can include jstack by JBQA-4481.

With the snapshot uploaded last Thursday that includes a fix for the linked ISPN jira, this should be fixed.

Link: Added: This issue relates to JBPAPP-6051

Run http://hudson.qa.jboss.com/hudson/view/EDG/job/edg-51x-resilience-client-size4-hotrod/46/ shows this error again - but this time there is only one error on this client. I'll check with Michal what version it ran with. Please hold on for now.
{code}
2011-04-04 17:20:35,555 153501 ERROR [org.infinispan.client.hotrod.impl.operations.HotRodOperation] (Runner - 2:) Invalid message id. Expected 8523 and received 6820
2011/04/04 17:20:35:556 EDT [WARN ][Runner - 2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Error sampling data: <org.infinispan.client.hotrod.exceptions.InvalidResponseException:: Invalid message id. Expected 8523 and received 6820>
{code}

Hi Galder, confirmed - still present, I see
{code}2011/04/29 00:32:02:733 EDT [WARN ][Runner - 0] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Error sampling data: <org.infinispan.client.hotrod.exceptions.InvalidResponseException:: Invalid message id. Expected 2476 and received 2421>
{code}
at http://hudson.qa.jboss.com/hudson/view/EDG/job/edg-51x-resilience-client-size4-hotrod/49/ (perf03 client log).
Update: Galder, here are the compacted logs for you: http://reports.qa.atl.jboss.com/~rhusar/logs/edg-51x-resilience-client-size4-hotrod-49.zip
I gave this a brief look.
This is the sequence of events in the resilience test (for your convenience :-) ):
{code}
startup:
[JBoss] 00:18:17,649 INFO [ServerImpl] JBoss (Microcontainer) [5.1.0 (build: SVNTag=JBPAPP_5_1_0 date=201009150028)] Started in 21s:236ms (perf18)
[JBoss] 00:18:37,663 INFO [ServerImpl] JBoss (Microcontainer) [5.1.0 (build: SVNTag=JBPAPP_5_1_0 date=201009150028)] Started in 19s:83ms (perf19)
[JBoss] 00:18:56,694 INFO [ServerImpl] JBoss (Microcontainer) [5.1.0 (build: SVNTag=JBPAPP_5_1_0 date=201009150028)] Started in 18s:105ms (perf20)
[JBoss] 00:19:15,693 INFO [ServerImpl] JBoss (Microcontainer) [5.1.0 (build: SVNTag=JBPAPP_5_1_0 date=201009150028)] Started in 18s:189ms (perf17)
2011/04/29 00:22:51:508 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Failing node 0
2011/04/29 00:23:51:673 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Bringing back node 0
[JBoss] 00:24:12,150 INFO [ServerImpl] JBoss (Microcontainer) [5.1.0 (build: SVNTag=JBPAPP_5_1_0 date=201009150028)] Started in 19s:708ms
2011/04/29 00:25:12:161 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Failing node 1
2011/04/29 00:26:12:266 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Bringing back node 1
[JBoss] 00:26:33,434 INFO [ServerImpl] JBoss (Microcontainer) [5.1.0 (build: SVNTag=JBPAPP_5_1_0 date=201009150028)] Started in 20s:395ms
2011/04/29 00:27:33:445 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Failing node 2
2011/04/29 00:28:33:549 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Bringing back node 2
[JBoss] 00:28:55,110 INFO [ServerImpl] JBoss (Microcontainer) [5.1.0 (build: SVNTag=JBPAPP_5_1_0 date=201009150028)] Started in 20s:796ms
2011/04/29 00:29:55:120 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Failing node 3
2011/04/29 00:30:55:219 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Bringing back node 3
[JBoss] 00:31:17,794 INFO [ServerImpl] JBoss (Microcontainer) [5.1.0 (build: SVNTag=JBPAPP_5_1_0 date=201009150028)] Started in 21s:797ms
{code}
These are the occurrences of messageId=2476:
{code}
perf20:
[JBoss] 00:25:42,643 TRACE [HotRodDecoder$] Decoded header HotRodHeader{op=GetRequest, messageId=2476, cacheName=, flag=NoFlag, clientIntelligence=3, topologyId=7}
perf19:
[JBoss] 00:25:45,772 TRACE [HotRodDecoder$] Decoded header HotRodHeader{op=PutRequest, messageId=2476, cacheName=, flag=NoFlag, clientIntelligence=3, topologyId=7}
perf18:
[JBoss] 00:27:43,160 TRACE [HotRodDecoder$] Decoded header HotRodHeader{op=GetRequest, messageId=2476, cacheName=, flag=NoFlag, clientIntelligence=3, topologyId=9}
[JBoss] 00:29:42,741 TRACE [HotRodDecoder$] Decoded header HotRodHeader{op=PutRequest, messageId=2476, cacheName=, flag=NoFlag, clientIntelligence=3, topologyId=10}
perf17:
[JBoss] 00:19:30,338 TRACE [HotRodDecoder$] Decoded header HotRodHeader{op=PutRequest, messageId=2476, cacheName=, flag=NoFlag, clientIntelligence=3, topologyId=4}
[JBoss] 00:25:42,559 TRACE [HotRodDecoder$] Decoded header HotRodHeader{op=PutRequest, messageId=2476, cacheName=, flag=NoFlag, clientIntelligence=3, topologyId=7}
[JBoss] 00:25:43,288 TRACE [HotRodDecoder$] Decoded header HotRodHeader{op=PutRequest, messageId=2476, cacheName=, flag=NoFlag, clientIntelligence=3, topologyId=7}
{code}
BTW: Is it safe to have the same messageId used in parallel? I'm gonna check the code to understand this better...
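As a hedged aside on that question (a hypothetical illustration, not the actual HotRod client code): if every client keeps its own counter, the same messageId will naturally show up on several connections and servers in parallel, and that should be harmless as long as each id is only validated against the response read on the same connection that sent the request.
{code}
// Hypothetical sketch only - not the real org.infinispan.client.hotrod code.
// Ids are unique per client instance, not globally, so two clients (or two
// connections) can both be using messageId=2476 at the same time. Validation
// is per connection: each response is compared to the id of the request that
// was written on that same socket.
final class MessageIdSketch {
   private final java.util.concurrent.atomic.AtomicLong counter =
         new java.util.concurrent.atomic.AtomicLong();

   long nextMessageId() {
      return counter.incrementAndGet();
   }
}
{code}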
Another occurrence (no TRACE log available) in the data stress test with the HotRod client on a 10-node cluster:
http://hudson.qa.jboss.com/hudson/view/EDG/job/edg-51x-stress-data-size10-hotrod/1/console-perf03/consoleText
http://hudson.qa.jboss.com/hudson/view/EDG/job/edg-51x-stress-data-size10-hotrod/1/console-perf08/consoleText
http://hudson.qa.jboss.com/hudson/view/EDG/job/edg-51x-stress-data-size10-hotrod/1/console-perf09/consoleText
Note: the cluster is supposed to be stable in this test; no view updates happen during the test (only at the end).

Just a note for myself, you probably know this already: in all three cases I checked, the message id causing this was connected to a PUT operation that didn't complete due to org.infinispan.remoting.transport.jgroups.SuspectException.

I have a feeling that these problems are related more to the way the TCP channels are being managed than to the stuff that is being put into them. The code on either end is pretty simple (on the client side: writing magic bytes, headers, keys, values, waiting, then reading what was written by the server; on the server side: decoding magic bytes, message ids, headers, params, performing an operation on the cache, then writing the results back). On the client side there is a fairly simple TCP transport which takes a connection from a pool, uses it, and then returns it to the pool; on the server side, Netty manages channels and makes calls to AbstractProtocolDecoder. I'm guessing that, somehow, some garbage is getting written into the channel on the server side when a server gets into trouble (either slow or killed) and that garbage ends up at the client waiting for its response. Could try to show this by logging the bytes of all responses on the server side - once a node went down, compare the bytes logged with the bytes in the error message on the client side. There's probably a better way.

Another interesting occurrence: https://hudson.qa.jboss.com/hudson/view/EDG/job/edg-51x-elasticity-client-size4-hotrod/11/ - in this scenario I set up a 4-node cluster in replication mode and after a while I kill 3 nodes.

@Trustin, Hi, any update on this? This issue is critical for us. Updating to blocker (as the parent Jira). This is how this bug manifests itself in resilience performance tests: https://docspace.corp.redhat.com/docs/DOC-66047

I'm assigning this back to me since I'm more familiar with the code than Trustin at this point. I'll get back to you asap.

I looked at the logs from 29th April that Rado posted and I've found an issue - ISPN-1099. I don't know, however, where the root cause, the invalid magic ID issue, comes from, but the message ID failure is due to that. I'll use this opportunity to add some logging to get a better idea of what went wrong with the invalid message ID.

Link: Added: This issue depends ISPN-1099

Run with 4.2.2-SNAPSHOT-20110516.115014: http://hudson.qa.jboss.com/hudson/view/EDG/job/edg-51x-resilience-client-size4-hotrod/60 - we still see the "magic" problems: http://hudson.qa.jboss.com/hudson/view/EDG/job/edg-51x-resilience-client-size4-hotrod/60/console-perf02/consoleText

I read the Hot Rod decoder on the server side tonight and it looks OK to me.
So, I moved my focus to the Hot Rod client implementation.
I see the following code in PingOperation.java:
{code}
long messageId = writeHeader(transport, HotRodConstants.PING_REQUEST);
short respStatus = readHeaderAndValidate(transport, messageId, ...);
{code}
It assumes that the response read immediately after writing a ping request is always a ping response with the same messageId.
IIUC, TcpTransport is borrowed from a pool before pinging and returned to the pool immediately after a response is read. So, I guess there's no chance of a thread safety issue as long as all operations borrow and return the transport correctly.
The only cause I can think of is that something did not read the Socket InputStream completely before returning the TcpTransport to the pool, and PingOperation is reading the previous message that was left unread by the previous operation.
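A minimal sketch of that borrow/use/return pattern (placeholder names, not the actual client classes) shows where the failure mode would sit: if reading the response fails or times out after the request has been written, the server's reply can arrive later and sit unread in the socket buffer, and the next borrower of the same transport reads that stale response first.
{code}
import java.io.IOException;

// Hypothetical sketch of the borrow/use/return pattern described above.
// These placeholder types are NOT the real client classes.
interface PooledTransport { /* wraps one TCP connection */ }

interface TransportPool {
   PooledTransport borrow();
   void release(PooledTransport transport);
}

abstract class PoolUsageSketch {
   void executeOnce(TransportPool pool) throws IOException {
      PooledTransport transport = pool.borrow();
      try {
         long messageId = writeRequest(transport);      // request bytes flushed
         readAndValidateResponse(transport, messageId); // must drain the whole reply
      } finally {
         // If reading failed or timed out, the reply (or the tail of it) may
         // still be sitting unread in the socket buffer when the transport
         // goes back to the pool; the next operation on this connection then
         // reads that stale response and reports a mismatched messageId.
         pool.release(transport);
      }
   }

   abstract long writeRequest(PooledTransport transport) throws IOException;

   abstract void readAndValidateResponse(PooledTransport transport, long messageId) throws IOException;
}
{code}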
To confirm this theory, you could log the whole content of the Socket InputStream in HotRodOperation.readHeaderAndValidate() before raising InvalidResponseException. To do this, I'd add a dumpStream() method to Transport:
{code}
interface Transport {
   ...
   /** Reads the whole content of the current Socket receive buffer
       for troubleshooting. */
   byte[] dumpStream();
   ...
}

class TcpTransport {
   // don't copy and paste. review it first :-)
   byte[] dumpStream() {
      ByteArrayOutputStream buf = new ByteArrayOutputStream();
      try {
         // Stop reading once the connection has been idle for 5 seconds.
         socket.setSoTimeout(5000);
         // Read up to 32 KiB.
         for (int count = 0; count < 32768; count++) {
            int b = socketInputStream.read();
            if (b < 0) {
               break;
            }
            buf.write(b);
         }
      } catch (IOException e) {
         // not interested at the moment
      }
      return buf.toByteArray();
   }
}
{code}
and then use it in HotRodOperation.readHeaderAndValidate():
{code}
...
if (receivedMessageId != messageId) {
   ...
   log.invalidMessageId(...);
   ...
   if (log.isTraceEnabled()) {
      // We don't log what's read already, but we can get it from
      // other log messages, right?
      log.tracef("Socket dump: %s", makeHexDump(transport.dumpStream()));
   }
   ...
   throw new InvalidResponseException(...);
}
...
{code}
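The makeHexDump(...) call in the snippet above is only referenced, never defined, in these examples; a minimal stand-in (an assumption - any hex formatter would do) could be:
{code}
// Hypothetical helper - not part of the existing client code.
static String makeHexDump(byte[] bytes) {
   StringBuilder sb = new StringBuilder(bytes.length * 3);
   for (byte b : bytes) {
      sb.append(String.format("%02x ", b));
   }
   return sb.toString().trim();
}
{code}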
Then let's see what's logged in the next round. If my guess was right, the dump will contain both the partial previous response (because we read a part of it already during validation) and the response for the ping request.
Ignore this comment because it's not right.

Link: Added: This issue relates to ISPN-1122

Guys, forget about my last comment, I was looking at the wrong message ids (Runner - 0 and Runner - 1 in the logs above point to different message ids), but this is more meaningful. However, I don't think I'm far off. So, if you take the log, extract it from the last time the id=3 transport was fetched until it was returned, and grep by "Runner - 0", you see:
{code}2011-05-17 11:22:31,118 151987 TRACE [org.infinispan.client.hotrod.impl.transport.tcp.TransportObjectFactory] (Runner - 0:) Fetching from pool:TcpTransport{socket=Socket[addr=perf18/10.16.90.54,port=11222,localport=46181], serverAddress=perf18/10.16.90.54:11222, id =3}
2011-05-17 11:22:31,118 151987 TRACE [org.infinispan.client.hotrod.impl.transport.tcp.TcpTransportFactory] (Runner - 0:) For server perf18/10.16.90.54:11222: active = 1; idle = 1
2011-05-17 11:22:31,118 151987 TRACE [org.infinispan.client.hotrod.impl.transport.tcp.TcpTransport] (Runner - 0:) Wrote byte 160
2011-05-17 11:22:31,118 151987 TRACE [org.infinispan.client.hotrod.impl.transport.tcp.TcpTransport] (Runner - 0:) Wrote byte 10
2011-05-17 11:22:31,118 151987 TRACE [org.infinispan.client.hotrod.impl.transport.tcp.TcpTransport] (Runner - 0:) Wrote byte 1
2011-05-17 11:22:31,118 151987 TRACE [org.infinispan.client.hotrod.impl.transport.tcp.TcpTransport] (Runner - 0:) Wrote 0 bytes
2011-05-17 11:22:31,118 151987 TRACE [org.infinispan.client.hotrod.impl.transport.tcp.TcpTransport] (Runner - 0:) Wrote byte 3
2011-05-17 11:22:31,118 151987 TRACE [org.infinispan.client.hotrod.impl.transport.tcp.TcpTransport] (Runner - 0:) Wrote byte 0
2011-05-17 11:22:31,118 151987 TRACE [org.infinispan.client.hotrod.impl.operations.HotRodOperation] (Runner - 0:) wrote header for message 6899. Operation code: 0x01. Flags: 0x0
2011-05-17 11:22:31,118 151987 TRACE [org.infinispan.client.hotrod.impl.transport.tcp.TcpTransport] (Runner - 0:) Wrote 8 bytes
2011-05-17 11:22:31,118 151987 TRACE [org.infinispan.client.hotrod.impl.transport.tcp.TcpTransport] (Runner - 0:) Wrote 505 bytes
2011-05-17 11:22:31,118 151987 TRACE [org.infinispan.client.hotrod.impl.transport.tcp.TcpTransport] (Runner - 0:) Flushed socket: Socket[addr=perf18/10.16.90.54,port=11222,localport=46181]
2011-05-17 11:23:31,119 211988 TRACE [org.infinispan.client.hotrod.impl.operations.RetryOnFailureOperation] (Runner - 0:) Exception encountered. Retry 4 out of 40
2011-05-17 11:23:31,119 211988 TRACE [org.infinispan.client.hotrod.impl.transport.tcp.TransportObjectFactory] (Runner - 0:) Returning to pool:TcpTransport{socket=Socket[addr=perf18/10.16.90.54,port=11222,localport=46181], serverAddress=perf18/10.16.90.54:11222, id =3} {code}
Now, where on earth is the response for message id 6899 being read? Nowhere - hence no wonder that, as Trustin suggested, the next time a thread takes that TCP transport, the old message id's response is still there. What on earth happened to Runner - 0 then?? Hmmm, digging continues...
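One way to keep a connection with unread bytes from ever going back into rotation (a sketch of the general idea, continuing the placeholder types from the earlier snippet, not the actual fix that was eventually applied) is to invalidate the transport on any error instead of recycling it:
{code}
// Sketch only. invalidate() is an assumed operation that closes the socket
// and drops the connection from the pool; it is not part of the placeholder
// interface shown earlier.
interface InvalidatingTransportPool extends TransportPool {
   void invalidate(PooledTransport transport);
}

abstract class SafePoolUsageSketch {
   void executeOnceSafely(InvalidatingTransportPool pool) throws IOException {
      PooledTransport transport = pool.borrow();
      boolean success = false;
      try {
         long messageId = writeRequest(transport);
         readAndValidateResponse(transport, messageId);
         success = true;
      } finally {
         if (success) {
            pool.release(transport);    // clean connection, safe to reuse
         } else {
            pool.invalidate(transport); // never reuse a socket with unread bytes
         }
      }
   }

   abstract long writeRequest(PooledTransport transport) throws IOException;

   abstract void readAndValidateResponse(PooledTransport transport, long messageId) throws IOException;
}
{code}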
Btw, something is not right with the way RetryOnFailureOperation logs there - what exception was encountered? I've opened ISPN-1126 to fix this. Once this is in, and with Trustin's suggestion also in, I'll build a snapshot and we'll try again to see if we get more clues.

Michal/Rado, any news on this?

I cannot reproduce this with the latest EDG6 alpha build. If this reappears it will be dealt with in JBPAPP-6850.
project_key: EDG

Occurred when bringing up a failed node.
{code}
<org.infinispan.client.hotrod.exceptions.InvalidResponseException:: Invalid message id. Expected cfec and received 84b6>
org.infinispan.client.hotrod.exceptions.InvalidResponseException:: Invalid message id. Expected cfec and received 84b6
	at org.infinispan.client.hotrod.impl.operations.HotRodOperation.readHeaderAndValidate(HotRodOperation.java:90)
	at org.infinispan.client.hotrod.impl.operations.AbstractKeyValueOperation.sendPutOperation(AbstractKeyValueOperation.java:48)
	at org.infinispan.client.hotrod.impl.operations.PutOperation.executeOperation(PutOperation.java:27)
	at org.infinispan.client.hotrod.impl.operations.RetryOnFailureOperation.execute(RetryOnFailureOperation.java:38)
	at org.infinispan.client.hotrod.impl.RemoteCacheImpl.put(RemoteCacheImpl.java:178)
	at org.infinispan.CacheSupport.put(CacheSupport.java:28)
	at org.jboss.smartfrog.edg.adapter.HotRodAdapter$HotRodRemoteCacheAdapter.put(HotRodAdapter.java:150)
	at org.jboss.smartfrog.edg.loaddriver.CacheRequestProcessorFactory$InfinispanRequestProcessor.processRequest(CacheRequestProcessorFactory.java:163)
	at org.jboss.smartfrog.loaddriver.CompoundRequestProcessorFactoryImpl$CompoundRequestProcessor.processRequest(CompoundRequestProcessorFactoryImpl.java:51)
	at org.jboss.smartfrog.loaddriver.Runner.run(Runner.java:87)
	at java.lang.Thread.run(Thread.java:662)
{code}