Bug 1255213 - Retry failed saying "Failed to connect" under high load
Product: JBoss Data Grid 6
Classification: JBoss
Component: CPP Client
Hardware: Unspecified OS: Unspecified
Priority: unspecified Severity: unspecified
Target Milestone: CR1
Target Release: 6.5.1
Assigned To: Alan Field
Alan Field
Depends On:
Blocks: 1258047 1259639
Reported: 2015-08-19 22:02 EDT by Osamu Nagano
Modified: 2018-01-29 20:45 EST

Doc Type: Bug Fix
Doc Text:
The Hot Rod C++ client's failover was failing sporadically under high load. This was caused by the <classname>Transport</classname> object throwing a <classname>TransportException</classname> during an operation, resulting in a call to <literal>ConnectionPool::invalidateObject</literal>, which removed the <classname>Transport</classname> from the busy queue, destroyed the <classname>Transport</classname>, and then added a new <classname>Transport</classname> to the idle queue. When the new <classname>Transport</classname> was created, it tried to connect to the server socket, and that connection attempt failed. The failure prevented retry attempts and client failover from happening. This issue is resolved as of Red Hat JBoss Data Grid 6.5.1.
Clones: 1258047 1259639
Type: Bug

Attachments
hotrodwrapper.cpp (4.08 KB, text/x-csrc)
2015-08-26 22:16 EDT, Osamu Nagano

External Trackers
Tracker: JBoss Issue Tracker
ID: HRCPP-197
Priority: Major
Status: Resolved
Summary: C++ client failover does not work consistently
Last Updated: 2017-07-10 22:43 EDT

Description Osamu Nagano 2015-08-19 22:02:10 EDT
Description of problem:
A Hot Rod C++ client continuously performs get and put operations against a 2-node JDG cluster.  When one node is killed, the client sometimes fails to fail over and leaves an error message like "Failed to connect (host: port: 11222) Operation now in progress".

How reproducible:
The customer is the same as in Bug 1228026.  Their client is Nginx/LuaJIT, and a reproducing environment is attached to that bug as "hotrodwrapper.zip".  One Siege session is 500 requests from 100 concurrent users (about 50 req/s on my machine), and a node is killed during the session.  In about one session out of 10, an HTTP 500 is returned, which means a retry failed.

Steps to Reproduce:
1. Prepare a 2-node JDG cluster and the environment from "hotrodwrapper.zip" in Bug 1228026.
2. Run "make siege".  This runs one Siege session (siege -r5 -c100 -lsiege.log -i -furls.txt).
3. Kill one node during the session.
4. Check the error log, ./nginx/logs/error.log, for the error message.

Actual results:
Siege reports that HTTP 500 responses were returned, and error.log contains the following line.

  2015/08/11 16:53:39 [error] 31974#0: *600 [lua] hotrod.lua:20: Failed to connect (host: port: 11222) Operation now in progress, client:, server: , request: "GET /hotrod/default/foo/foovalue HTTP/1.1", host: ""

Expected results:
No HTTP 500 responses and no error messages.

Additional info:
The error message was generated when "connect" failed.  The customer said "send" also fails sometimes.
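
The Doc Text field above attributes the failed retry to the connection pool creating a replacement Transport while invalidating a broken one, with the new connect attempt failing in turn.  The following is a minimal sketch of that situation, assuming a much simplified pool.  SketchPool, FakeTransport, TransportError and the factory callback are hypothetical names for illustration; they are not the Hot Rod C++ client's classes, and this is not necessarily how HRCPP-197 was fixed.  It shows one way to keep a connect failure during invalidation from escaping and masking the original error.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <memory>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical stand-ins; these are NOT the real Hot Rod C++ client classes.
struct TransportError : std::runtime_error {
    using std::runtime_error::runtime_error;
};

struct FakeTransport {
    int id;
};

class SketchPool {
public:
    // The factory may throw TransportError, like connect() against a dead node.
    using Factory = std::function<std::unique_ptr<FakeTransport>()>;

    explicit SketchPool(Factory factory) : factory_(std::move(factory)) {}

    // Hand out an idle transport, or create one on demand. A connect failure
    // surfaces here, where the caller's retry logic can pick another server.
    FakeTransport* borrow() {
        std::unique_ptr<FakeTransport> t;
        if (!idle_.empty()) {
            t = std::move(idle_.front());
            idle_.pop_front();
        } else {
            t = factory_();
        }
        FakeTransport* raw = t.get();
        busy_.push_back(std::move(t));
        return raw;
    }

    // Drop a broken transport. Creating the replacement is best effort: a
    // failed connect is swallowed so it cannot mask the original error.
    void invalidate(FakeTransport* broken) {
        for (auto it = busy_.begin(); it != busy_.end(); ++it) {
            if (it->get() == broken) {
                busy_.erase(it);  // destroys the broken transport
                break;
            }
        }
        try {
            idle_.push_back(factory_());
        } catch (const TransportError&) {
            // Server unreachable: leave the idle queue as it is.
        }
    }

    std::size_t idleCount() const { return idle_.size(); }
    std::size_t busyCount() const { return busy_.size(); }

private:
    Factory factory_;
    std::deque<std::unique_ptr<FakeTransport>> idle_;
    std::vector<std::unique_ptr<FakeTransport>> busy_;
};
```

With a factory that throws while the server is down, invalidate() returns normally and leaves both queues empty, so the caller is free to retry against another node.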
Comment 2 Osamu Nagano 2015-08-26 22:16:33 EDT
Created attachment 1067483

Replace hotrodwrapper.cpp in "hotrodwrapper.zip" of Bug 1228026 with the attachment, and modify the "run" target in the Makefile as follows.

  run: main

Then "make run" will run the standalone main function.  This program seems to fail to fail over every time.

  TRACE [Socket.cpp:107] Trying to connect to (
  DEBUG [Socket.cpp:134] Attempting connection to
  DEBUG [Socket.cpp:147] Failed to connect to
  terminate called after throwing an instance of 'infinispan::hotrod::TransportException'
    what():  Failed to connect (host: port: 11222) Operation now in progress
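
The "terminate called" line in the trace above means the exception reached the top of the program without any retry against the surviving node.  For comparison, the following is a minimal sketch of the behaviour the report expects, assuming a plain list of server addresses and a caller-supplied operation.  getWithFailover and SketchTransportException are hypothetical names for illustration; the real client performs its retries internally, and this is not its API.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical type standing in for infinispan::hotrod::TransportException.
struct SketchTransportException : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Try the operation against the servers in round-robin order, for at most
// maxRetries attempts. Only transport-level failures trigger a failover.
// If every attempt fails, an exception carrying the last error is thrown.
std::string getWithFailover(
        const std::vector<std::string>& servers,
        std::size_t maxRetries,
        const std::function<std::string(const std::string&)>& op) {
    if (servers.empty() || maxRetries == 0) {
        throw std::invalid_argument("need at least one server and one attempt");
    }
    std::string lastError;
    for (std::size_t attempt = 0; attempt < maxRetries; ++attempt) {
        const std::string& server = servers[attempt % servers.size()];
        try {
            return op(server);  // success: stop retrying
        } catch (const SketchTransportException& e) {
            lastError = e.what();  // remember the error, move to the next server
        }
    }
    throw SketchTransportException("all attempts failed; last error: " + lastError);
}
```

A caller would pass the node addresses and a function that performs the get or put against one address.  If the first node has been killed, the second attempt goes to the other node instead of terminating the program.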
Comment 3 JBoss JIRA Server 2015-08-28 11:52:30 EDT
Alan Field <afield@redhat.com> updated the status of jira HRCPP-197 to Coding In Progress
Comment 5 Alan Field 2015-09-08 16:11:21 EDT
Waiting to hear if Osamu's customer is satisfied with the fix.
Comment 6 Osamu Nagano 2015-09-08 20:36:24 EDT
(In reply to Alan Field from comment #5)
> Waiting to hear if Osamu's customer is satisfied with the fix.

As mailed, the customer will test with 6.5.1.GA.  Since the reproducer works well now, this BZ can be closed.
Comment 7 Alan Field 2015-09-09 09:09:52 EDT
Verified with JDG 6.5.1 CR1
