Bug 851574 - Java clients exceptions, qpidd "No protocol received closing"
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-java
Version: 2.2
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: 2.2
Assigned To: Rajith Attapattu
QA Contact: Leonid Zhaldybin
Reported: 2012-08-24 09:48 EDT by Leonid Zhaldybin
Modified: 2014-11-09 17:38 EST (History)
CC: 6 users

Fixed In Version: qpid-java-0.18-2
Doc Type: Bug Fix
Last Closed: 2012-12-07 12:41:30 EST
Type: Bug

Attachments
reproducer script (2.41 KB, application/x-shellscript)
2012-08-24 09:48 EDT, Leonid Zhaldybin
qpidd log file with trace+ log level (292.26 KB, application/x-bzip)
2012-08-24 09:59 EDT, Leonid Zhaldybin
Protocol trace and client log, i386 (185.65 KB, application/x-bzip)
2012-08-28 04:27 EDT, Leonid Zhaldybin
Protocol trace and client log, x86_64 (389.39 KB, application/x-bzip)
2012-08-28 04:28 EDT, Leonid Zhaldybin
Strace output. (7.43 KB, application/x-bzip)
2012-08-28 12:02 EDT, Leonid Zhaldybin
Complete client's strace. (126.34 KB, application/x-bzip)
2012-08-29 02:54 EDT, Leonid Zhaldybin

Description Leonid Zhaldybin 2012-08-24 09:48:36 EDT
Created attachment 606864 [details]
reproducer script

Description of problem:
While running our automated tests, I came across the following problem: when running a Java client (the standard Drain or Spout from the qpid-java-example package), an exception is sometimes thrown. The most common one I see looks like this:

IoSender - localhost/127.0.0.1:5672 2012-08-24 07:51:20,314 ERROR [transport.network.io.IoSender] error in write thread
java.net.SocketException: Broken pipe
        at java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
        at org.apache.qpid.transport.network.io.IoSender.run(IoSender.java:313)
        at java.lang.Thread.run(Thread.java:662)
Exception in thread "main" org.apache.qpid.AMQConnectionFailureException: connect() aborted [error code 200: reply success]
        at org.apache.qpid.client.AMQConnection.<init>(AMQConnection.java:438)
        at org.apache.qpid.client.AMQConnection.<init>(AMQConnection.java:211)
        at org.apache.qpid.example.OptionParser.createConnection(OptionParser.java:281)
        at org.apache.qpid.example.Spout.<init>(Spout.java:87)
        at org.apache.qpid.example.Spout.main(Spout.java:146)
Caused by: org.apache.qpid.AMQException: Cannot connect to broker: connect() aborted [error code 200: reply success]
        at org.apache.qpid.client.AMQConnectionDelegate_0_10.makeBrokerConnection(AMQConnectionDelegate_0_10.java:241)
        at org.apache.qpid.client.AMQConnection.makeBrokerConnection(AMQConnection.java:581)
        at org.apache.qpid.client.AMQConnection.<init>(AMQConnection.java:361)
        ... 4 more
Caused by: org.apache.qpid.transport.ConnectionException: connect() aborted
        at org.apache.qpid.transport.Connection.connect(Connection.java:296)
        at org.apache.qpid.client.AMQConnectionDelegate_0_10.makeBrokerConnection(AMQConnectionDelegate_0_10.java:222)
        ... 6 more

but I also saw this one:

Exception in thread "main" org.apache.qpid.AMQConnectionFailureException: connect() aborted [error code 200: reply success]
        at org.apache.qpid.client.AMQConnection.<init>(AMQConnection.java:444)
        at org.apache.qpid.client.AMQConnection.<init>(AMQConnection.java:215)
        at org.apache.qpid.example.OptionParser.createConnection(OptionParser.java:283)
        at org.apache.qpid.example.Drain.<init>(Drain.java:64)
        at org.apache.qpid.example.Drain.main(Drain.java:101)
Caused by: org.apache.qpid.AMQException: Cannot connect to broker: connect() aborted [error code 200: reply success]
        at org.apache.qpid.client.AMQConnectionDelegate_0_10.makeBrokerConnection(AMQConnectionDelegate_0_10.java:239)
        at org.apache.qpid.client.AMQConnection.makeBrokerConnection(AMQConnection.java:588)
        at org.apache.qpid.client.AMQConnection.<init>(AMQConnection.java:367)
        ... 4 more
Caused by: org.apache.qpid.transport.ConnectionException: connect() aborted
        at org.apache.qpid.transport.Connection.connect(Connection.java:282)
        at org.apache.qpid.client.AMQConnectionDelegate_0_10.makeBrokerConnection(AMQConnectionDelegate_0_10.java:219)
        ... 6 more

There is a corresponding error message in qpidd log:
"error Connection 127.0.0.1:5672-127.0.0.1:51804 No protocol received closing"

This happens with java clients only, as far as I can tell.
Comment 3 Leonid Zhaldybin 2012-08-24 09:59:27 EDT
Created attachment 606866 [details]
qpidd log file with trace+ log level

qpidd broker log file attached. The log level was set to the highest possible level: trace+.
Comment 4 Leonid Zhaldybin 2012-08-24 10:39:19 EDT
I've tried to set "worker-threads=1" in qpidd.conf file. This doesn't seem to have any impact on the issue. The same exception (the Broken pipe one) was raised in less than a minute.
Comment 5 Leonid Zhaldybin 2012-08-26 04:12:24 EDT
The exact same reproducer script, but using python clients, is running on two machines (RHEL6.3 i386 and x86_64) for over 40 hours now without triggering this issue even once. So, the problem occurs if, and only if, *java* clients are used.
Comment 6 Justin Ross 2012-08-27 14:31:20 EDT
Some info that would help: a wire-level trace of the interaction (say, from Wireshark). Less good but still helpful would be a trace of the client.

Andrew looked at the broker trace, and it doesn't appear that the broker (a) is particularly busy or (b) doing anything obviously wrong.

Finally, we recommend testing with --max-negotiate-time turned up to 60, as you mentioned above.
Comment 7 Rajith Attapattu 2012-08-27 15:22:17 EDT
Leonid,

I don't know how practical it is, but it would be good if you could get a protocol trace to see what exactly is happening.

It seems you can reproduce this issue fairly often with the test suite. I wonder if you could try to get a protocol trace at the point of capture.

Regards,

Rajith
Comment 8 Leonid Zhaldybin 2012-08-28 04:27:57 EDT
Created attachment 607439 [details]
Protocol trace and client log, i386
Comment 9 Leonid Zhaldybin 2012-08-28 04:28:27 EDT
Created attachment 607440 [details]
Protocol trace and client log, x86_64
Comment 10 Leonid Zhaldybin 2012-08-28 04:41:21 EDT
Justin, Rajith,

I've attached protocol traces from two machines, together with clients' logs, see attachments above.

Also, setting 'max-negotiate-time=60000' in the qpidd.conf file seems to be a very reliable workaround. After adding this to the broker's configuration file, my reproducer script ran for about 24 hours without triggering the issue.
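For reference, the workaround corresponds to a one-line broker configuration entry like the following (the value is in milliseconds; treat the surrounding settings as illustrative):

```ini
# /etc/qpid/qpidd.conf (sketch)
# Give slow clients up to 60 seconds to finish the initial
# AMQP protocol negotiation before the broker closes the socket.
max-negotiate-time=60000
```

Note this only masks the client-side delay; the broker's default timeout is what exposed the bug in the first place.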
Comment 11 Leonid Zhaldybin 2012-08-28 11:58:20 EDT
The Java version I'm using for my tests:

RHEL6.3 i386
java-1.6.0-sun-1.6.0.33-1jpp.1.el6_3.i686
java-1.6.0-sun-devel-1.6.0.33-1jpp.1.el6_3.i686

RHEL6.3 x86_64
java-1.6.0-sun-1.6.0.33-1jpp.1.el6_3.x86_64
java-1.6.0-sun-devel-1.6.0.33-1jpp.1.el6_3.x86_64

I've also tried using java-1.6.0-openjdk; it did not make any difference.
Comment 12 Leonid Zhaldybin 2012-08-28 12:02:31 EDT
Created attachment 607644 [details]
Strace output.

I modified the reproducer slightly to run strace each time the client starts. The attachment contains the logs of these test runs.
Comment 13 Andrew Stitcher 2012-08-28 14:34:25 EDT
Examining the first attached protocol trace (i386):

Initially there are a number of complete, successful AMQP interchanges (client ports 56398, 56400, ... 56582).

One odd feature of these successful interchanges is that there is a small gap, about 0.2s, between the server sending connection.start and the client responding with connection.startOk.

Another very odd feature is that the client sends back the connection.startOk as two separate packets, one of 3 bytes and one of 13 bytes. I guess this means there are two individual writes here which are not coalesced for some reason.

Then there is the protocol interchange that fails due to the broker timing out the initial protocol exchange (client port 56584).

What happens here is:

Client sends initial Protocol Header.
Server sends initial Protocol Header.
Server sends connection.start control.
(2s gap)
Server closes the connection.
(3.2s gap)
Client sends 3 bytes.
Server sends a reset, as the connection has already been closed.

That is, the client does send the initial Protocol Header, but then fails to completely respond to the connection.start control.

So it seems that the client may not be responding quickly enough to the incoming connection.start (in the failing case it actually takes 5.2s to respond), and moreover it hasn't noticed the earlier FIN from the server either, which is pretty strange.

In addition, it's pretty strange that connection.startOk is being split into two writes in all of the captured traces (although nothing is actually wrong there, it's just a possible pointer to an issue).
Comment 14 Leonid Zhaldybin 2012-08-29 02:53:15 EDT
(In reply to comment #13)
> So it seems that the client is maybe not responding quickly enough to the
> incoming connection.start taking in the failing case actually 5.2s to
> respond, and moreover it hasn't noticed the previous fin from the server
> either, which is pretty strange.

So this does look like a problem on the client's side. Good to know, Andrew, thank you for your input.

I'm attaching the complete strace (with -f option set to watch all child processes) for a client failure.
Comment 15 Leonid Zhaldybin 2012-08-29 02:54:23 EDT
Created attachment 607786 [details]
Complete client's strace.
Comment 16 Rajith Attapattu 2012-08-30 22:54:23 EDT
Leonid,

Could you please provide the JDK version that you used for this test? (I assume you are using OpenJDK?)

Also, could you please run this on the latest Sun JDK to see if you can reproduce this?

Rajith
Comment 17 Leonid Zhaldybin 2012-08-31 03:39:53 EDT
(In reply to comment #16)
> Leonid,
> 
> Could you please provide the jdk version that you used for this test ?
> (I assume you are using open jdk ?)
> 
> Also could you please run this on the latest sun jdk to see if you can
> reproduce this?
> 
> Rajith

Rajith,

The issue was first discovered on machines with OpenJDK installed. Later, I did massive test runs using all four combinations (RHEL5+openjdk, RHEL5+sunjdk, RHEL6+openjdk, RHEL6+sunjdk). The issue was fairly reproducible on RHEL6 with either JDK (open or Sun, it does not matter). I actually mentioned this in comment 11 but probably did not make it entirely clear, sorry.

I currently use sun jdk for my testing. Here is the output of 'java -version':
java version "1.6.0_33"
Java(TM) SE Runtime Environment (build 1.6.0_33-b03)
Java HotSpot(TM) 64-Bit Server VM (build 20.8-b03, mixed mode)

I'll install the latest available java version and give it a try.

Leonid.
Comment 18 Leonid Zhaldybin 2012-09-04 12:09:39 EDT
Rajith,

I gathered some statistics over the weekend. I was using a simple Java client which establishes connections to the broker in an infinite loop. Running the tests with this client for a couple of days shows that:

1. The problem is absolutely random. A connection attempt from the client can fail after just a few successful connections, or a client might connect to the broker thousands of times (but after that, the next connection attempt fails anyway).
2. The average number of successful connections before a failure is around 3000, but I've seen cases where, for example, the 10th connection attempt fails.

These two points concern RHEL6 only. On RHEL5, the same Java client is able to connect to the broker tens of millions of times. So, this is definitely a RHEL6-specific bug.

I can share the exact numbers from my test runs showing the number of successful connections before the error is triggered. If you need them, let me know.

Leonid.
Comment 20 Rajith Attapattu 2012-09-04 20:37:23 EDT
After additional testing I was able to get a bit more insight into the issue.
Using Leonid's test of creating connections in a loop, it appears that two things make the issue more likely to happen:

1. Getting the pid: this seems to cause the issue to happen anywhere between 100 and 3000 connections.

2. Getting the system info (this can be improved): this seems to cause the issue anywhere between 3000 and 15000 connection attempts.

3. If I remove the above two calls, the failure happens around 30k to 40k connections.

My recommendations are as follows.

1. We could make getting the pid optional (via a system property).

2. Improve the getPlatformInfo() method to construct the info once, instead of doing it for each connection.
Comment 21 Andrew Stitcher 2012-09-05 09:41:31 EDT
From comment #20:

I'd say the evidence here is that the cause is irregular kernel scheduling of the Java client.

Getting the pid of the process will call the getpid() system call, and presumably getting the platform info also enters the kernel a number of times to gather its information.

Once you've entered the kernel, there is some doubt about how long you will be there before the kernel schedules you again on return.

I'd suggest doing the getpid() and getPlatformInfo() calls before initiating the connection; that way the connection attempt can't be interrupted by them.

Even better, do them once at process start, as they don't change at all once the process is running.
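The once-at-process-start caching suggested here could be sketched roughly as follows. The class name and the pid-extraction trick are illustrative assumptions, not the actual qpid-java change:

```java
// Sketch of the "compute once at class load" pattern (hypothetical class
// name; the real fix lives inside the qpid-java client code).
import java.lang.management.ManagementFactory;

public class ClientProperties {
    private static final String PID;
    private static final String PLATFORM_INFO;

    // The static initializer runs exactly once, when the class is loaded,
    // so the getpid()-style lookup and the system-property reads no longer
    // happen inside the time-critical connection negotiation.
    static {
        // On HotSpot JVMs the RuntimeMXBean name is "pid@hostname".
        String jvmName = ManagementFactory.getRuntimeMXBean().getName();
        PID = jvmName.split("@")[0];
        PLATFORM_INFO = System.getProperty("os.name") + " "
                + System.getProperty("os.version") + " "
                + System.getProperty("os.arch");
    }

    public static String getPid() { return PID; }

    public static String getPlatformInfo() { return PLATFORM_INFO; }
}
```

Every connection attempt then returns the cached values instead of re-entering the kernel, which is exactly the behaviour comment 23 describes.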
Comment 22 Andrew Stitcher 2012-09-05 09:49:21 EDT
A second thought: is it possible that the JRE is just doing a long garbage collection at a bad time?
Comment 23 Rajith Attapattu 2012-09-06 10:49:34 EDT
Andrew,

The delay around 30k to 40k connections could very well be due to a garbage collection pause.

I now fetch the pid and system info in a static block and then return the cached value.

Rajith
Comment 24 Rajith Attapattu 2012-09-06 10:51:54 EDT
The suggested patches are now in upstream.
http://svn.apache.org/viewvc?rev=1381625&view=rev
Comment 25 Leonid Zhaldybin 2012-09-10 02:21:46 EDT
After running extensive tests on the qpid-java-0.18-2 build, I can report with confidence that this problem has been fixed. My tests did not trigger it even once while running for a few hours and opening millions of connections.
NOTE: the max-negotiate-time option was set to 2s in the qpidd configuration file to imitate the qpid-cpp-0.14.21 behaviour and to prevent the delay on the Java client side from being masked.

Packages used for testing:

qpid-cpp-client-0.14-22.el6_3
qpid-cpp-client-devel-0.14-22.el6_3
qpid-cpp-client-devel-docs-0.14-22.el6_3
qpid-cpp-client-rdma-0.14-22.el6_3
qpid-cpp-client-ssl-0.14-22.el6_3
qpid-cpp-debuginfo-0.14-22.el6_3
qpid-cpp-server-0.14-22.el6_3
qpid-cpp-server-cluster-0.14-22.el6_3
qpid-cpp-server-devel-0.14-22.el6_3
qpid-cpp-server-rdma-0.14-22.el6_3
qpid-cpp-server-ssl-0.14-22.el6_3
qpid-cpp-server-store-0.14-22.el6_3
qpid-cpp-server-xml-0.14-22.el6_3
qpid-java-client-0.18-2.el6
qpid-java-common-0.18-2.el6
qpid-java-example-0.18-2.el6
qpid-qmf-0.14-14.el6_3
qpid-qmf-debuginfo-0.14-14.el6_3
qpid-qmf-devel-0.14-14.el6_3
qpid-tools-0.14-6.el6_3

-> VERIFIED
