| Summary: | HotRod: InvalidResponseException:: Invalid message id. Expected cfec and received 84b6 | | |
|---|---|---|---|
| Product: | [JBoss] JBoss Data Grid 5 | Reporter: | Radoslav Husar <rhusar> |
| Component: | Infinispan | Assignee: | Default User <jbpapp-maint> |
| Status: | CLOSED NEXTRELEASE | QA Contact: | |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | ||
| Version: | EAP 5.1.0 EDG TP | CC: | galder.zamarreno, mlinhard, rachmato, rhusar, trustin |
| Target Milestone: | --- | ||
| Target Release: | EAP 5.1.0 EDG TP | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| URL: | http://jira.jboss.org/jira/browse/EDG-120 | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2011-08-03 08:34:09 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Link: Added: This issue is incorporated by JBPAPP-6054

I'd suggest doing the same thing as for JBPAPP-6051 - get some TRACE logs on the client and server side to figure this out.

Some more occurrences:
http://hudson.qa.jboss.com/hudson/view/EDG/job/edg-51x-stress-client-size1-hotrod/31/console-perf09/consoleText
http://hudson.qa.jboss.com/hudson/job/edg-51x-stress-client-size8-hotrod/4/console-perf02/consoleText

Michal, it's very interesting that you see it even when the cluster size is 1. That should make it easier to debug. Let me look at those files.

@Galder Got it! http://reports.qa.atl.jboss.com/~rhusar/logs/edg-51x-resilience-client-size4-hotrod-44.zip - 40 occurrences.

Right, I think the issue is happening as a result of a failure to update the hash view on the server side. For example:
{code}2011/03/30 09:39:46:213 EDT [WARN ][Runner - 0] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Error sampling data: <org.infinispan.client.hotrod.exceptions.InvalidResponseException:: Invalid message id. Expected ce0 and received 0>
org.infinispan.client.hotrod.exceptions.InvalidResponseException:: Invalid message id. Expected ce0 and received 0{code}
It's expecting messageId=3296 (0xce0) but got something else, and on the server side you see:
{code}[JBoss] 09:39:36,192 TRACE [HotRodEncoder$] Encode msg GetResponse{messageId=3296, operation=GetResponse, status=Success, data=ByteArray{size=505, hashCode=18e5b891, array=[2, 67, 1, -12, 33, 0, 0, 0, 0, -1, ..]}}
...
[JBoss] 09:39:46,193 ERROR [InvocationContextInterceptor] Execution error:
[JBoss] org.infinispan.util.concurrent.TimeoutException: Unable to acquire lock after [10 seconds] on key [view] for requestor [Thread[OOB-15,null,5,Thread Pools]]! Lock held by [(another thread)]{code}
Rado, you guys might wanna add thread information to the logs to make things clearer, but it's fairly clear that the server could not replace the view and that caused the issues here. I'll create a JIRA on ISPN to look into this.
Corresponding ISPN issue to track this. This needs some thinking.

Link: Added: This issue depends ISPN-1016

We can include jstack by JBQA-4481.

With the snapshot uploaded last Thursday that includes a fix for the linked ISPN jira, this should be fixed.

Link: Added: This issue relates to JBPAPP-6051

Run http://hudson.qa.jboss.com/hudson/view/EDG/job/edg-51x-resilience-client-size4-hotrod/46/ shows this error again - but this time there is only one error on this client. I'll check with Michal what version it ran with. Please hold on for now.
{code}
2011-04-04 17:20:35,555 153501 ERROR [org.infinispan.client.hotrod.impl.operations.HotRodOperation] (Runner - 2:) Invalid message id. Expected 8523 and received 6820
2011/04/04 17:20:35:556 EDT [WARN ][Runner - 2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Error sampling data: <org.infinispan.client.hotrod.exceptions.InvalidResponseException:: Invalid message id. Expected 8523 and received 6820>
{code}

Hi Galder, confirmed - still present, I see
{code}2011/04/29 00:32:02:733 EDT [WARN ][Runner - 0] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Error sampling data: <org.infinispan.client.hotrod.exceptions.InvalidResponseException:: Invalid message id. Expected 2476 and received 2421>
{code}
at http://hudson.qa.jboss.com/hudson/view/EDG/job/edg-51x-resilience-client-size4-hotrod/49/ (perf03 client log).
Update: Galder, here are the compacted logs for you: http://reports.qa.atl.jboss.com/~rhusar/logs/edg-51x-resilience-client-size4-hotrod-49.zip
I gave this a brief look.
This is the sequence of events in the resilience test (for your convenience :-) ):
{code}
startup:
[JBoss] 00:18:17,649 INFO [ServerImpl] JBoss (Microcontainer) [5.1.0 (build: SVNTag=JBPAPP_5_1_0 date=201009150028)] Started in 21s:236ms (perf18)
[JBoss] 00:18:37,663 INFO [ServerImpl] JBoss (Microcontainer) [5.1.0 (build: SVNTag=JBPAPP_5_1_0 date=201009150028)] Started in 19s:83ms (perf19)
[JBoss] 00:18:56,694 INFO [ServerImpl] JBoss (Microcontainer) [5.1.0 (build: SVNTag=JBPAPP_5_1_0 date=201009150028)] Started in 18s:105ms (perf20)
[JBoss] 00:19:15,693 INFO [ServerImpl] JBoss (Microcontainer) [5.1.0 (build: SVNTag=JBPAPP_5_1_0 date=201009150028)] Started in 18s:189ms (perf17)
2011/04/29 00:22:51:508 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Failing node 0
2011/04/29 00:23:51:673 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Bringing back node 0
[JBoss] 00:24:12,150 INFO [ServerImpl] JBoss (Microcontainer) [5.1.0 (build: SVNTag=JBPAPP_5_1_0 date=201009150028)] Started in 19s:708ms
2011/04/29 00:25:12:161 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Failing node 1
2011/04/29 00:26:12:266 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Bringing back node 1
[JBoss] 00:26:33,434 INFO [ServerImpl] JBoss (Microcontainer) [5.1.0 (build: SVNTag=JBPAPP_5_1_0 date=201009150028)] Started in 20s:395ms
2011/04/29 00:27:33:445 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Failing node 2
2011/04/29 00:28:33:549 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Bringing back node 2
[JBoss] 00:28:55,110 INFO [ServerImpl] JBoss (Microcontainer) [5.1.0 (build: SVNTag=JBPAPP_5_1_0 date=201009150028)] Started in 20s:796ms
2011/04/29 00:29:55:120 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Failing node 3
2011/04/29 00:30:55:219 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Bringing back node 3
[JBoss] 00:31:17,794 INFO [ServerImpl] JBoss (Microcontainer) [5.1.0 (build: SVNTag=JBPAPP_5_1_0 date=201009150028)] Started in 21s:797ms
{code}
These are the occurrences of messageId=2476:
{code}
perf20:
[JBoss] 00:25:42,643 TRACE [HotRodDecoder$] Decoded header HotRodHeader{op=GetRequest, messageId=2476, cacheName=, flag=NoFlag, clientIntelligence=3, topologyId=7}
perf19:
[JBoss] 00:25:45,772 TRACE [HotRodDecoder$] Decoded header HotRodHeader{op=PutRequest, messageId=2476, cacheName=, flag=NoFlag, clientIntelligence=3, topologyId=7}
perf18:
[JBoss] 00:27:43,160 TRACE [HotRodDecoder$] Decoded header HotRodHeader{op=GetRequest, messageId=2476, cacheName=, flag=NoFlag, clientIntelligence=3, topologyId=9}
[JBoss] 00:29:42,741 TRACE [HotRodDecoder$] Decoded header HotRodHeader{op=PutRequest, messageId=2476, cacheName=, flag=NoFlag, clientIntelligence=3, topologyId=10}
perf17:
[JBoss] 00:19:30,338 TRACE [HotRodDecoder$] Decoded header HotRodHeader{op=PutRequest, messageId=2476, cacheName=, flag=NoFlag, clientIntelligence=3, topologyId=4}
[JBoss] 00:25:42,559 TRACE [HotRodDecoder$] Decoded header HotRodHeader{op=PutRequest, messageId=2476, cacheName=, flag=NoFlag, clientIntelligence=3, topologyId=7}
[JBoss] 00:25:43,288 TRACE [HotRodDecoder$] Decoded header HotRodHeader{op=PutRequest, messageId=2476, cacheName=, flag=NoFlag, clientIntelligence=3, topologyId=7}
{code}
BTW: Is it safe to have the same messageId used in parallel? I'm gonna check the code to understand this better...
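As a hedged aside on that question (a hypothetical illustration, not the actual HotRod client code): if every client keeps its own counter, the same messageId will naturally show up on several connections and servers in parallel, and that should be harmless as long as each id is only validated against the response read on the same connection that sent the request.
{code}
// Hypothetical sketch only - not the real org.infinispan.client.hotrod code.
// Ids are unique per client instance, not globally, so two clients (or two
// connections) can both be using messageId=2476 at the same time. Validation
// is per connection: each response is compared to the id of the request that
// was written on that same socket.
final class MessageIdSketch {
   private final java.util.concurrent.atomic.AtomicLong counter =
         new java.util.concurrent.atomic.AtomicLong();

   long nextMessageId() {
      return counter.incrementAndGet();
   }
}
{code}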
Another occurrence (no TRACE log available) in the data stress test with the HotRod client on a 10-node cluster:
http://hudson.qa.jboss.com/hudson/view/EDG/job/edg-51x-stress-data-size10-hotrod/1/console-perf03/consoleText
http://hudson.qa.jboss.com/hudson/view/EDG/job/edg-51x-stress-data-size10-hotrod/1/console-perf08/consoleText
http://hudson.qa.jboss.com/hudson/view/EDG/job/edg-51x-stress-data-size10-hotrod/1/console-perf09/consoleText
Note: the cluster is supposed to be stable in this test; no view updates happen during the test (only at the end).

Just a note for myself, you probably know this already: in all three cases I checked, the message id causing this was connected to a PUT operation that didn't complete due to org.infinispan.remoting.transport.jgroups.SuspectException.

I have a feeling that these problems are related more to the way the TCP channels are being managed than to the stuff that is being put into them. The code on either end is pretty simple (on the client side: writing magic bytes, headers, keys, values, waiting, then reading what was written by the server; on the server side: decoding magic bytes, message ids, headers, params, performing an operation on the cache, then writing the results back). On the client side there is a fairly simple TCP transport which takes a connection from a pool, uses it, and then returns it to the pool; on the server side, Netty manages channels and makes calls to AbstractProtocolDecoder. I'm guessing that, somehow, some garbage is getting written into the channel on the server side when a server gets into trouble (either slow or killed) and that garbage ends up at the client waiting for its response. Could try to show this by logging the bytes of all responses on the server side - once a node went down, compare the bytes logged with the bytes in the error message on the client side. There's probably a better way.

Another interesting occurrence: https://hudson.qa.jboss.com/hudson/view/EDG/job/edg-51x-elasticity-client-size4-hotrod/11/ - in this scenario I set up a 4-node cluster in replication mode and after a while I kill 3 nodes.

@Trustin, Hi, any update on this? This issue is critical for us. Updating to blocker (as the parent Jira). This is how this bug manifests itself in resilience performance tests: https://docspace.corp.redhat.com/docs/DOC-66047

I'm assigning this back to me since I'm more familiar with the code than Trustin at this point. I'll get back to you asap.

I looked at the logs from 29th April that Rado posted and I've found an issue - ISPN-1099. I don't know, however, where the root cause, the invalid magic ID issue, comes from, but the message ID failure is due to that. I'll use this opportunity to add some logging to get a better idea of what went wrong with the invalid message ID.

Link: Added: This issue depends ISPN-1099

Run with 4.2.2-SNAPSHOT-20110516.115014: http://hudson.qa.jboss.com/hudson/view/EDG/job/edg-51x-resilience-client-size4-hotrod/60 - we still see the "magic" problems: http://hudson.qa.jboss.com/hudson/view/EDG/job/edg-51x-resilience-client-size4-hotrod/60/console-perf02/consoleText

I read the Hot Rod decoder on the server side tonight and it looks OK to me.
So, I moved my focus to the Hot Rod client implementation.
I see the following code in PingOperation.java:
{code}
long messageId = writeHeader(transport, HotRodConstants.PING_REQUEST);
short respStatus = readHeaderAndValidate(transport, messageId, ...);
{code}
It assumes that the response read immediately after writing a ping request is always a ping response with the same messageId.
IIUC, TcpTransport is borrowed from a pool before pinging and returned to the pool immediately after a response is read. So, I guess there's no chance of a thread safety issue as long as all operations borrow and return the transport correctly.
The only cause I can think of is that something did not read the Socket InputStream completely before returning the TcpTransport to the pool, and PingOperation is reading the previous message that was left unread by the previous operation.
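A minimal sketch of that borrow/use/return pattern (placeholder names, not the actual client classes) shows where the failure mode would sit: if reading the response fails or times out after the request has been written, the server's reply can arrive later and sit unread in the socket buffer, and the next borrower of the same transport reads that stale response first.
{code}
import java.io.IOException;

// Hypothetical sketch of the borrow/use/return pattern described above.
// These placeholder types are NOT the real client classes.
interface PooledTransport { /* wraps one TCP connection */ }

interface TransportPool {
   PooledTransport borrow();
   void release(PooledTransport transport);
}

abstract class PoolUsageSketch {
   void executeOnce(TransportPool pool) throws IOException {
      PooledTransport transport = pool.borrow();
      try {
         long messageId = writeRequest(transport);      // request bytes flushed
         readAndValidateResponse(transport, messageId); // must drain the whole reply
      } finally {
         // If reading failed or timed out, the reply (or the tail of it) may
         // still be sitting unread in the socket buffer when the transport
         // goes back to the pool; the next operation on this connection then
         // reads that stale response and reports a mismatched messageId.
         pool.release(transport);
      }
   }

   abstract long writeRequest(PooledTransport transport) throws IOException;

   abstract void readAndValidateResponse(PooledTransport transport, long messageId) throws IOException;
}
{code}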
To confirm this theory, you could log the whole content of the Socket InputStream in HotRodOperation.readHeaderAndValidate() before raising InvalidResponseException. To do this, I'd add a dumpStream() method to Transport:
{code}
interface Transport {
   ...
   /** Reads the whole content of the current Socket receive buffer
       for troubleshooting. */
   byte[] dumpStream();
   ...
}

class TcpTransport {
   // don't copy and paste. review it first :-)
   byte[] dumpStream() {
      ByteArrayOutputStream buf = new ByteArrayOutputStream();
      try {
         // Stop reading once the connection has been idle for 5 seconds.
         socket.setSoTimeout(5000);
         // Read up to 32 KiB.
         for (int count = 0; count < 32768; count++) {
            int b = socketInputStream.read();
            if (b < 0) {
               break;
            }
            buf.write(b);
         }
      } catch (IOException e) {
         // not interested at the moment
      }
      return buf.toByteArray();
   }
}
{code}
and then use it in HotRodOperation.readHeaderAndValidate():
{code}
...
if (receivedMessageId != messageId) {
   ...
   log.invalidMessageId(...);
   ...
   if (log.isTraceEnabled()) {
      // We don't log what's read already, but we can get it from
      // other log messages, right?
      log.tracef("Socket dump: %s", makeHexDump(transport.dumpStream()));
   }
   ...
   throw new InvalidResponseException(...);
}
...
{code}
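The makeHexDump(...) call in the snippet above is only referenced, never defined, in these examples; a minimal stand-in (an assumption - any hex formatter would do) could be:
{code}
// Hypothetical helper - not part of the existing client code.
static String makeHexDump(byte[] bytes) {
   StringBuilder sb = new StringBuilder(bytes.length * 3);
   for (byte b : bytes) {
      sb.append(String.format("%02x ", b));
   }
   return sb.toString().trim();
}
{code}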
Then let's see what's logged in the next round. If my guess was right, the dump will contain both the partial previous response (because we read a part of it already during validation) and the response for the ping request.
Ignore this comment because it's not right.

Link: Added: This issue relates to ISPN-1122

Guys, forget about my last comment, I was looking at the wrong message ids (Runner - 0 and Runner - 1 in the logs above point to different message ids), but this is more meaningful. However, I don't think I'm far off. So, if you take the log, extract it from the last time the id=3 transport was fetched until it was returned, and grep by "Runner - 0", you see:
{code}2011-05-17 11:22:31,118 151987 TRACE [org.infinispan.client.hotrod.impl.transport.tcp.TransportObjectFactory] (Runner - 0:) Fetching from pool:TcpTransport{socket=Socket[addr=perf18/10.16.90.54,port=11222,localport=46181], serverAddress=perf18/10.16.90.54:11222, id =3}
2011-05-17 11:22:31,118 151987 TRACE [org.infinispan.client.hotrod.impl.transport.tcp.TcpTransportFactory] (Runner - 0:) For server perf18/10.16.90.54:11222: active = 1; idle = 1
2011-05-17 11:22:31,118 151987 TRACE [org.infinispan.client.hotrod.impl.transport.tcp.TcpTransport] (Runner - 0:) Wrote byte 160
2011-05-17 11:22:31,118 151987 TRACE [org.infinispan.client.hotrod.impl.transport.tcp.TcpTransport] (Runner - 0:) Wrote byte 10
2011-05-17 11:22:31,118 151987 TRACE [org.infinispan.client.hotrod.impl.transport.tcp.TcpTransport] (Runner - 0:) Wrote byte 1
2011-05-17 11:22:31,118 151987 TRACE [org.infinispan.client.hotrod.impl.transport.tcp.TcpTransport] (Runner - 0:) Wrote 0 bytes
2011-05-17 11:22:31,118 151987 TRACE [org.infinispan.client.hotrod.impl.transport.tcp.TcpTransport] (Runner - 0:) Wrote byte 3
2011-05-17 11:22:31,118 151987 TRACE [org.infinispan.client.hotrod.impl.transport.tcp.TcpTransport] (Runner - 0:) Wrote byte 0
2011-05-17 11:22:31,118 151987 TRACE [org.infinispan.client.hotrod.impl.operations.HotRodOperation] (Runner - 0:) wrote header for message 6899. Operation code: 0x01. Flags: 0x0
2011-05-17 11:22:31,118 151987 TRACE [org.infinispan.client.hotrod.impl.transport.tcp.TcpTransport] (Runner - 0:) Wrote 8 bytes
2011-05-17 11:22:31,118 151987 TRACE [org.infinispan.client.hotrod.impl.transport.tcp.TcpTransport] (Runner - 0:) Wrote 505 bytes
2011-05-17 11:22:31,118 151987 TRACE [org.infinispan.client.hotrod.impl.transport.tcp.TcpTransport] (Runner - 0:) Flushed socket: Socket[addr=perf18/10.16.90.54,port=11222,localport=46181]
2011-05-17 11:23:31,119 211988 TRACE [org.infinispan.client.hotrod.impl.operations.RetryOnFailureOperation] (Runner - 0:) Exception encountered. Retry 4 out of 40
2011-05-17 11:23:31,119 211988 TRACE [org.infinispan.client.hotrod.impl.transport.tcp.TransportObjectFactory] (Runner - 0:) Returning to pool:TcpTransport{socket=Socket[addr=perf18/10.16.90.54,port=11222,localport=46181], serverAddress=perf18/10.16.90.54:11222, id =3} {code}
Now, where on earth is the response for message id 6899 being read? Nowhere - hence no wonder that, as Trustin suggested, the next time a thread takes that TCP transport, the old message id's response is still there. What on earth happened to Runner - 0 then?? Hmmm, digging continues...
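One way to keep a connection with unread bytes from ever going back into rotation (a sketch of the general idea, continuing the placeholder types from the earlier snippet, not the actual fix that was eventually applied) is to invalidate the transport on any error instead of recycling it:
{code}
// Sketch only. invalidate() is an assumed operation that closes the socket
// and drops the connection from the pool; it is not part of the placeholder
// interface shown earlier.
interface InvalidatingTransportPool extends TransportPool {
   void invalidate(PooledTransport transport);
}

abstract class SafePoolUsageSketch {
   void executeOnceSafely(InvalidatingTransportPool pool) throws IOException {
      PooledTransport transport = pool.borrow();
      boolean success = false;
      try {
         long messageId = writeRequest(transport);
         readAndValidateResponse(transport, messageId);
         success = true;
      } finally {
         if (success) {
            pool.release(transport);    // clean connection, safe to reuse
         } else {
            pool.invalidate(transport); // never reuse a socket with unread bytes
         }
      }
   }

   abstract long writeRequest(PooledTransport transport) throws IOException;

   abstract void readAndValidateResponse(PooledTransport transport, long messageId) throws IOException;
}
{code}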
Btw, something is not right with the way RetryOnFailureOperation logs there - what exception was encountered? I've opened ISPN-1126 to fix this. Once this is in, and with Trustin's suggestion also in, I'll build a snapshot and we'll try again to see if we get more clues.

Michal/Rado, any news on this?

I cannot reproduce this with the latest EDG6 alpha build. If this reappears it will be dealt with in JBPAPP-6850.
project_key: EDG

Occurred when bringing up a failed node.
{code}
<org.infinispan.client.hotrod.exceptions.InvalidResponseException:: Invalid message id. Expected cfec and received 84b6>
org.infinispan.client.hotrod.exceptions.InvalidResponseException:: Invalid message id. Expected cfec and received 84b6
	at org.infinispan.client.hotrod.impl.operations.HotRodOperation.readHeaderAndValidate(HotRodOperation.java:90)
	at org.infinispan.client.hotrod.impl.operations.AbstractKeyValueOperation.sendPutOperation(AbstractKeyValueOperation.java:48)
	at org.infinispan.client.hotrod.impl.operations.PutOperation.executeOperation(PutOperation.java:27)
	at org.infinispan.client.hotrod.impl.operations.RetryOnFailureOperation.execute(RetryOnFailureOperation.java:38)
	at org.infinispan.client.hotrod.impl.RemoteCacheImpl.put(RemoteCacheImpl.java:178)
	at org.infinispan.CacheSupport.put(CacheSupport.java:28)
	at org.jboss.smartfrog.edg.adapter.HotRodAdapter$HotRodRemoteCacheAdapter.put(HotRodAdapter.java:150)
	at org.jboss.smartfrog.edg.loaddriver.CacheRequestProcessorFactory$InfinispanRequestProcessor.processRequest(CacheRequestProcessorFactory.java:163)
	at org.jboss.smartfrog.loaddriver.CompoundRequestProcessorFactoryImpl$CompoundRequestProcessor.processRequest(CompoundRequestProcessorFactoryImpl.java:51)
	at org.jboss.smartfrog.loaddriver.Runner.run(Runner.java:87)
	at java.lang.Thread.run(Thread.java:662)
{code}