Bug 1255213

Summary: Retry failed saying "Failed to connect" under high load
Product: [JBoss] JBoss Data Grid 6 Reporter: Osamu Nagano <onagano>
Component: CPP Client Assignee: Alan Field <afield>
Status: CLOSED UPSTREAM QA Contact: Alan Field <afield>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 6.4.0 CC: afield, chuffman, mgencur, onagano, wfink
Target Milestone: CR1   
Target Release: 6.5.1   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
The Hot Rod C++ client's failover failed sporadically under high load. When a <classname>Transport</classname> object threw a <classname>TransportException</classname> during an operation, the client called <literal>ConnectionPool::invalidateObject</literal>, which removed the <classname>Transport</classname> from the busy queue, destroyed it, and then added a new <classname>Transport</classname> to the idle queue. Creating the new <classname>Transport</classname> involved connecting to the server socket, and that connection attempt failed, which prevented retry attempts and client failover from happening. This issue is resolved as of Red Hat JBoss Data Grid 6.5.1.
Story Points: ---
Clone Of:
: 1258047 1259639 Environment:
Last Closed: 2025-02-10 03:48:03 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1258047, 1259639    
Attachments:
Description Flags
hotrodwrapper.cpp none

Description Osamu Nagano 2015-08-20 02:02:10 UTC
Description of problem:
A Hot Rod C++ client continuously performs get and put operations against a 2-node JDG cluster.  When one node is killed, the client sometimes fails to fail over and leaves an error message like "Failed to connect (host: 127.0.1.1 port: 11222) Operation now in progress".


How reproducible:
The customer is the same as in Bug 1228026.  Their client is Nginx/LuaJIT, and a reproducing environment is attached there as "hotrodwrapper.zip".  One Siege session consists of 500 requests from 100 concurrent users (about 50 req/s on my machine), and a node is killed during the session.  In about one session out of 10, an HTTP 500 is returned, which means a retry failed.


Steps to Reproduce:
1. Prepare a 2-node JDG cluster and the environment from "hotrodwrapper.zip" in Bug 1228026.
2. Run "make siege".  This runs one Siege session (siege -r5 -c100 -lsiege.log -i -furls.txt).
3. Kill one node during the session.
4. Check the error log, ./nginx/logs/error.log, for the error message.


Actual results:
Siege reports HTTP 500 responses, and error.log contains the following line.

  2015/08/11 16:53:39 [error] 31974#0: *600 [lua] hotrod.lua:20: Failed to connect (host: 127.0.1.1 port: 11222) Operation now in progress, client: 127.0.0.1, server: , request: "GET /hotrod/default/foo/foovalue HTTP/1.1", host: "127.0.0.1:8000"


Expected results:
No HTTP 500 responses and no error messages.


Additional info:
The error message was generated when "connect" failed.  The customer said "send" also fails sometimes.
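
For reference, "Operation now in progress" is the text glibc's strerror() returns for EINPROGRESS, the errno that connect() sets on a non-blocking socket while the handshake is still running.  Whether the client's socket code uses a non-blocking connect is an assumption here, not something confirmed in this report.  Under that assumption, a minimal sketch of how EINPROGRESS can be separated from a real connection failure looks like the following.  The function name connect_with_timeout and its parameters are hypothetical and are not part of the Hot Rod client API.

```cpp
#include <arpa/inet.h>
#include <cerrno>
#include <cstring>
#include <fcntl.h>
#include <netinet/in.h>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

// Hypothetical helper, not taken from the Hot Rod C++ client.
// Returns 0 and stores the connected descriptor in *out_fd on success,
// otherwise returns the errno value that describes the failure.
int connect_with_timeout(const char* ip, unsigned short port,
                         int timeout_ms, int* out_fd) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return errno;

    // Switch the socket to non-blocking mode (assumed client behavior).
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) {
        int e = errno; close(fd); return e;
    }

    sockaddr_in addr;
    std::memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1) {
        close(fd); return EINVAL;
    }

    int rc = connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
    if (rc != 0 && errno != EINPROGRESS) {
        int e = errno; close(fd); return e;   // immediate, real failure
    }
    if (rc != 0) {
        // EINPROGRESS is not a failure yet: wait until the socket becomes
        // writable, then read the final result of the handshake.
        pollfd pfd;
        pfd.fd = fd;
        pfd.events = POLLOUT;
        pfd.revents = 0;
        int ready = poll(&pfd, 1, timeout_ms);
        if (ready == 0) { close(fd); return ETIMEDOUT; }
        if (ready < 0) { int e = errno; close(fd); return e; }

        int so_error = 0;
        socklen_t len = sizeof(so_error);
        if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &so_error, &len) < 0) {
            int e = errno; close(fd); return e;
        }
        if (so_error != 0) { close(fd); return so_error; }
    }
    *out_fd = fd;
    return 0;
}
```

With this shape, a killed node would normally surface as a specific errno such as ECONNREFUSED or ETIMEDOUT rather than as EINPROGRESS, which may make the log line easier to interpret.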

Comment 2 Osamu Nagano 2015-08-27 02:16:33 UTC
Created attachment 1067483 [details]
hotrodwrapper.cpp

Replace hotrodwrapper.cpp in "hotrodwrapper.zip" of Bug 1228026 with the attachment, and modify the "run" target in the Makefile as follows.

  run: main
          LD_LIBRARY_PATH=$(LUAJIT_LIB):$(HOTROD_LIB) HOTROD_LOG_LEVEL="TRACE" ./$(PROGRAM) >run.log 2>&1

Then "make run" will run the standalone main function.  This program seems to fail to fail over every time.

  TRACE [Socket.cpp:107] Trying to connect to 127.0.0.1 (127.0.0.1).
  DEBUG [Socket.cpp:134] Attempting connection to 127.0.0.1:11222
  DEBUG [Socket.cpp:147] Failed to connect to 127.0.0.1:11222
  terminate called after throwing an instance of 'infinispan::hotrod::TransportException'
    what():  Failed to connect (host: 127.0.0.1 port: 11222) Operation now in progress
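
The trace ends in an uncaught TransportException thrown while connecting, which matches the sequence in the Doc Text: invalidating a Transport removes it from the busy queue, destroys it, and creates a replacement whose connect attempt fails.  The following is a minimal sketch of that sequence and of one possible way to keep the failure from escaping.  SketchPool, SketchTransport, invalidateEager and invalidateGuarded are hypothetical names invented for illustration.  This is not the client's ConnectionPool and not necessarily how HRCPP-197 was fixed.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <memory>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical stand-in for a connected transport.
struct SketchTransport { int id; };

// Factory that creates (and, in a real client, connects) a transport.
// It may throw std::runtime_error when the server is unreachable.
using Factory = std::function<std::unique_ptr<SketchTransport>()>;

class SketchPool {
public:
    explicit SketchPool(Factory f) : factory_(std::move(f)) {}

    // Move a transport from the idle queue to the busy list.
    SketchTransport* borrow() {
        if (idle_.empty()) idle_.push_back(factory_());
        std::unique_ptr<SketchTransport> t = std::move(idle_.front());
        idle_.pop_front();
        SketchTransport* raw = t.get();
        busy_.push_back(std::move(t));
        return raw;
    }

    // Eager variant, mirroring the behavior described in the Doc Text:
    // the replacement is created immediately, so a connect failure
    // escapes from the invalidation call itself.
    void invalidateEager(SketchTransport* t) {
        removeBusy(t);
        idle_.push_back(factory_());   // may throw
    }

    // Guarded variant (an assumption, not the confirmed fix): a failed
    // replacement is reported to the caller instead of thrown, so the
    // caller's retry logic can still try another server.
    bool invalidateGuarded(SketchTransport* t) {
        removeBusy(t);
        try {
            idle_.push_back(factory_());
            return true;
        } catch (const std::runtime_error&) {
            return false;
        }
    }

    std::size_t idleCount() const { return idle_.size(); }
    std::size_t busyCount() const { return busy_.size(); }

private:
    // Remove the transport from the busy list and destroy it.
    void removeBusy(SketchTransport* t) {
        for (std::size_t i = 0; i < busy_.size(); ++i) {
            if (busy_[i].get() == t) {
                busy_.erase(busy_.begin() + static_cast<std::ptrdiff_t>(i));
                return;
            }
        }
    }

    Factory factory_;
    std::deque<std::unique_ptr<SketchTransport>> idle_;
    std::vector<std::unique_ptr<SketchTransport>> busy_;
};
```

In the eager variant the exception leaves the pool before any retry can run, which is consistent with the "terminate called" line in the trace.  The guarded variant only illustrates the general idea of containing that failure.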

Comment 3 JBoss JIRA Server 2015-08-28 15:52:30 UTC
Alan Field <afield> updated the status of jira HRCPP-197 to Coding In Progress

Comment 5 Alan Field 2015-09-08 20:11:21 UTC
Waiting to hear if Osamu's customer is satisfied with the fix.

Comment 6 Osamu Nagano 2015-09-09 00:36:24 UTC
(In reply to Alan Field from comment #5)
> Waiting to hear if Osamu's customer is satisfied with the fix.

As mailed, the customer will test with 6.5.1.GA.  Since the reproducer works well now, this BZ can be closed.

Comment 7 Alan Field 2015-09-09 13:09:52 UTC
Verified with JDG 6.5.1 CR1

Comment 10 Red Hat Bugzilla 2025-02-10 03:48:03 UTC
This product has been discontinued or is no longer tracked in Red Hat Bugzilla.