The HotRod C++ client's failover failed sporadically under high load. The cause was a <classname>Transport</classname> object throwing a <classname>TransportException</classname> during an operation, resulting in a call to <literal>ConnectionPool::invalidateObject</literal>, which removed the <classname>Transport</classname> from the busy queue, destroyed it, and then added a new <classname>Transport</classname> to the idle queue. Creating the new <classname>Transport</classname> triggered a connection attempt to the server socket, and that attempt failed. This second failure prevented retry attempts and client failover from happening.
This issue is resolved as of Red Hat JBoss Data Grid 6.5.1.
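The failure mode described above can be illustrated with a minimal sketch. It assumes hypothetical stand-in types (`Server`, `Transport`, `executeWithFailover`) that only borrow names from the description; it is not the client's actual code and not the fix shipped in 6.5.1. The point it shows is that when creating a transport can itself throw, the exception has to be caught at the place where the next server can still be tried.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration; these are not the real client classes.
struct TransportException : std::runtime_error {
    using std::runtime_error::runtime_error;
};

struct Server {
    std::string host;
    int port;
    bool alive;  // simulates whether a connect to this node would succeed
};

struct Transport {
    explicit Transport(const Server& s) : server(s) {
        // Connecting in the constructor means creation itself can throw.
        if (!s.alive) {
            throw TransportException("Failed to connect (host: " + s.host +
                                     " port: " + std::to_string(s.port) + ")");
        }
    }
    Server server;
};

// Try each known server in turn. A connect failure is caught and treated as
// "move on to the next server" instead of being allowed to escape.
// Returns the host that served the operation; throws only if every server failed.
std::string executeWithFailover(const std::vector<Server>& servers) {
    std::string lastError = "no servers configured";
    for (const Server& s : servers) {
        try {
            Transport t(s);        // may throw TransportException
            return t.server.host;  // stand-in for running the get/put
        } catch (const TransportException& e) {
            lastError = e.what();  // remember the reason, then fail over
        }
    }
    throw TransportException("all servers failed; last error: " + lastError);
}
```

With a two-node list in which the first node is marked as not alive, `executeWithFailover` returns the second node's host. If the exception from the `Transport` constructor were thrown outside the `try` block, it would propagate to the caller and no other node would be tried, which is the behavior reported here.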
Description of problem:
A Hot Rod C++ client continuously performs get and put operations against a 2-node JDG cluster. When 1 node is killed, the client sometimes fails to fail over and leaves an error message like "Failed to connect (host: 127.0.1.1 port: 11222) Operation now in progress".
How reproducible:
The customer is the same as in Bug 1228026. Their client is Nginx/LuaJIT, and a reproducing environment is attached there as "hotrodwrapper.zip". One Siege session is 500 requests from 100 concurrent users (50 req/s on my machine), and a node is killed during the session. About once per 10 sessions, a 500 response is returned, which means a retry failed.
Steps to Reproduce:
1. Prepare a 2-node JDG cluster and the environment from "hotrodwrapper.zip" in Bug 1228026.
2. Run "make siege". This runs one Siege session (siege -r5 -c100 -lsiege.log -i -furls.txt).
3. Kill one node during the session.
4. Check the error log, ./nginx/logs/error.log, for the error message.
Actual results:
Siege reports that a 500 response was returned, and error.log contains the following line:
2015/08/11 16:53:39 [error] 31974#0: *600 [lua] hotrod.lua:20: Failed to connect (host: 127.0.1.1 port: 11222) Operation now in progress, client: 127.0.0.1, server: , request: "GET /hotrod/default/foo/foovalue HTTP/1.1", host: "127.0.0.1:8000"
Expected results:
No 500 responses and no error messages.
Additional info:
The error message was generated when "connect" failed. The customer said "send" also fails sometimes.
Created attachment 1067483
hotrodwrapper.cpp
Replace hotrodwrapper.cpp in "hotrodwrapper.zip" of Bug 1228026 with the attachment. Then modify the "run" target in the Makefile as follows.
run: main
	LD_LIBRARY_PATH=$(LUAJIT_LIB):$(HOTROD_LIB) HOTROD_LOG_LEVEL="TRACE" ./$(PROGRAM) >run.log 2>&1
Then "make run" will run the standalone main function. It seems this program always fails to fail over.
TRACE [Socket.cpp:107] Trying to connect to 127.0.0.1 (127.0.0.1).
DEBUG [Socket.cpp:134] Attempting connection to 127.0.0.1:11222
DEBUG [Socket.cpp:147] Failed to connect to 127.0.0.1:11222
terminate called after throwing an instance of 'infinispan::hotrod::TransportException'
what(): Failed to connect (host: 127.0.0.1 port: 11222) Operation now in progress
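"Operation now in progress" is the message text for EINPROGRESS on Linux (glibc). That suggests, although nothing in this report confirms it, that the connect was attempted on a non-blocking socket or with a timeout, and that the pending state was reported as a failure. The sketch below shows one common way to treat EINPROGRESS as "still connecting" using POSIX calls. The helper name `connectWithTimeout` and its parameters are hypothetical and not part of the HotRod C++ client.

```cpp
#include <arpa/inet.h>
#include <cassert>
#include <cerrno>
#include <cstdint>
#include <fcntl.h>
#include <netinet/in.h>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

// Hypothetical helper, not part of the HotRod C++ client API.
// Returns a connected socket descriptor, or -1 with errno set to the real cause.
int connectWithTimeout(const char* ipv4, int port, int timeoutMs) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return -1;

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(static_cast<uint16_t>(port));
    if (inet_pton(AF_INET, ipv4, &addr.sin_addr) != 1) {
        close(fd);
        errno = EINVAL;  // not a valid dotted-quad address
        return -1;
    }

    // Switch to non-blocking mode so connect() returns immediately.
    int flags = fcntl(fd, F_GETFL, 0);
    fcntl(fd, F_SETFL, flags | O_NONBLOCK);

    int rc = connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
    if (rc == 0) return fd;      // connected immediately
    if (errno != EINPROGRESS) {  // immediate, genuine failure
        int saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }

    // EINPROGRESS: the handshake is still running. Wait until the socket
    // becomes writable, then read the final result from SO_ERROR.
    pollfd pfd{fd, POLLOUT, 0};
    int ready = poll(&pfd, 1, timeoutMs);
    int err = ETIMEDOUT;  // used when poll() reports no event in time
    if (ready > 0) {
        socklen_t len = sizeof(err);
        if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0) err = errno;
    } else if (ready < 0) {
        err = errno;
    }
    if (err != 0) {
        close(fd);
        errno = err;
        return -1;
    }
    return fd;
}
```

With this approach a killed node would typically surface as ECONNREFUSED or ETIMEDOUT rather than EINPROGRESS, which gives the caller an unambiguous signal to fail over to another server.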
Comment 3 JBoss JIRA Server
2015-08-28 15:52:30 UTC
Alan Field <afield> updated the status of jira HRCPP-197 to Coding In Progress
(In reply to Alan Field from comment #5)
> Waiting to hear if Osamu's customer is satisfied with the fix.
As mailed, the customer will test with 6.5.1.GA. Since the reproducer works well now, this BZ can be closed.