Created attachment 1015379 [details]
Description of problem:
Given a default server, storage node, and agent installation, the server should reconnect to the storage node after a brief storage node outage.
0. Install JON server as usual
1. ./rhqctl stop --storage
2. wait 3 minutes
3. ./rhqctl start --storage
4. Observe server.log for "NoHostAvailableException"
See attached server and storage node logs. Storage node was shutdown at around 2015-04-16 18:03.
Version-Release number of selected component (if applicable):
- reproduced in 3.3.0 GA and 3.3.2 ER1
Created attachment 1015380 [details]
Another way to reproduce is to drop packets going to the Cassandra port 9142:
# iptables -A INPUT -p tcp --dport 9142 -m statistic --mode random --probability 0.50 -j DROP
Let it run for a few minutes until NoHostAvailableException appears in server.log, then delete the rule:
# iptables -D INPUT -p tcp --dport 9142 -m statistic --mode random --probability 0.50 -j DROP
The server never seems to be able to recover from the exception.
This also seems to happen to me these days (I can't remember it happening in the past, though; perhaps it is environmental?)
I do not think this is an environment issue. I believe it has more to do with the Cassandra driver, as demonstrated by this test code: https://gist.github.com/jsanda/95409e8f4956730d58a8. I perform the following steps with that test to reproduce the problem:
1) Start Cassandra
2) Run test (which loops indefinitely)
3) Stop Cassandra
4) Driver reports exceptions
5) Start Cassandra
The Host.StateListener never gets called. I think this is a bug or a limitation in the version of the driver being used, because I ran the same test with version 2.1.5, and the driver does reconnect and notify the listener after the Storage Node is restarted. I am not necessarily pointing this out to suggest we need to upgrade the driver; I am pointing it out to show that my understanding of the driver's behavior in this regard was wrong.
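For reference, the callback in question is the listener the driver notifies on host state changes. The snippet below is a minimal self-contained mock (the interface and notifier are illustrative stand-ins, not the real com.datastax.driver.core API) showing the onUp notification we expected to fire after the Storage Node restart:

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for the driver's Host.StateListener contract (names are
// illustrative; the real interface lives in com.datastax.driver.core).
interface HostStateListener {
    void onUp(String host);
    void onDown(String host);
}

// Minimal mock cluster that fans host state changes out to registered
// listeners, which is what we expected the 2.1.5 driver to do on restart.
public class MockCluster {
    private final List<HostStateListener> listeners = new ArrayList<>();

    public void register(HostStateListener listener) {
        listeners.add(listener);
    }

    public void markDown(String host) {
        for (HostStateListener l : listeners) l.onDown(host);
    }

    public void markUp(String host) {
        for (HostStateListener l : listeners) l.onUp(host);
    }
}
```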
We are going to have to shut down and recreate the session, particularly with single node deployments. I will have to do some additional testing to see if it is also an issue with multi-node deployments.
It is not as simple as recreating the session whenever we see a NoHostAvailableException, for a couple of reasons. First, the driver can report a NoHostAvailableException when the Storage Node is under heavy load; a burst in requests can trigger the exception, and after the burst subsides we might not have any problem connecting. Secondly, when we store raw data, writes are done asynchronously in parallel; in effect they are pipelined. This means that if we see one NoHostAvailableException, there is a pretty good chance we will see several of them.
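To illustrate the pipelining point, here is a simplified sketch (plain CompletableFuture standing in for the driver's async write futures, with a RuntimeException simulating NoHostAvailableException): a batch of parallel writes against a down node fails wholesale, so a single outage surfaces as many exceptions at once.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class PipelinedWrites {

    // Simulated async write: completes exceptionally when the node is
    // down, the way the driver would fail a write future.
    static CompletableFuture<Void> writeAsync(boolean nodeUp) {
        CompletableFuture<Void> f = new CompletableFuture<>();
        if (nodeUp) {
            f.complete(null);
        } else {
            f.completeExceptionally(
                new RuntimeException("NoHostAvailableException (simulated)"));
        }
        return f;
    }

    // Fire a batch of raw-data writes in parallel and count the failures.
    public static int failedWrites(int batchSize, boolean nodeUp) {
        List<CompletableFuture<Void>> batch = new ArrayList<>();
        for (int i = 0; i < batchSize; i++) {
            batch.add(writeAsync(nodeUp));
        }
        int failures = 0;
        for (CompletableFuture<Void> f : batch) {
            if (f.isCompletedExceptionally()) failures++;
        }
        return failures;
    }
}
```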
I think we need to set up a scheduled job that takes action if there has been a NoHostAvailableException within the past N minutes or seconds. If we are not in maintenance mode, then presumably everything is fine and there is nothing else to do. If we are in maintenance mode, though, we should try executing a simple query. If we still get a NoHostAvailableException, then we shut down and recreate the Session object and try again.
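A sketch of that scheduled-job decision logic (the Storage interface and its probeQuery/recreateSession methods are hypothetical names for this sketch; the real fix would hang off the server's session management code):

```java
public class SessionRecoveryJob {

    // Hypothetical collaborators, injected so the decision logic is testable.
    interface Storage {
        boolean probeQuery();     // run a trivial query against the node
        void recreateSession();   // shut down the Session and build a new one
    }

    private final Storage storage;
    private final long windowMillis;
    private long lastNoHostAvailableAt = -1;

    SessionRecoveryJob(Storage storage, long windowMillis) {
        this.storage = storage;
        this.windowMillis = windowMillis;
    }

    // Called wherever a NoHostAvailableException is caught.
    void reportNoHostAvailable(long now) {
        lastNoHostAvailableAt = now;
    }

    // Runs on a schedule; returns true when the session was recreated.
    boolean runOnce(long now) {
        if (lastNoHostAvailableAt < 0
                || now - lastNoHostAvailableAt > windowMillis) {
            return false;          // no recent exception: nothing to do
        }
        if (storage.probeQuery()) {
            return false;          // the burst subsided; the connection is fine
        }
        storage.recreateSession(); // still failing: shut down and rebuild
        return true;
    }
}
```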
Unfortunately we cannot simply check the Storage Node's availability to determine whether it is down. When the server is in maintenance mode, we refuse all agent requests. If the storage node was down and later restarted, the agent will report it as UP, but the server will reject the availability report, which means we cannot rely on the last reported availability.
Is there any chance to get this in JON 3.3.3?
Merge: 91de3c4 bb867e3
Author: jsanda <firstname.lastname@example.org>
Date: Wed Jun 24 07:44:55 2015 -0400
Merge pull request #178 from burmanm/reconnect
[BZ 1212627] Recreate storage node sessions if connections are down
Author: Michael Burman <email@example.com>
Date: Wed Jun 24 13:31:21 2015 +0300
Set name for the AliveChecker for easier debugging and catch all the exceptions in the aliveChecker thread
Author: Michael Burman <firstname.lastname@example.org>
Date: Tue Jun 23 16:37:53 2015 +0300
[BZ 1212627] Check storage node connection aliveness every 4s and recreate session if check failed twice in a row.
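The core rule of the fix ("check aliveness every 4 seconds and recreate the session if the check fails twice in a row") boils down to a consecutive-failure counter. The class below is a simplification of the AliveChecker thread described in the commits above; the class and method names are illustrative, not the committed code:

```java
public class AliveCheck {
    private int consecutiveFailures = 0;

    // Called every 4 seconds with the result of a trivial liveness query.
    // Returns true when the session should be shut down and recreated.
    public boolean record(boolean checkPassed) {
        if (checkPassed) {
            consecutiveFailures = 0;   // any success resets the streak
            return false;
        }
        consecutiveFailures++;
        if (consecutiveFailures >= 2) {
            consecutiveFailures = 0;   // reset after triggering a recreate
            return true;
        }
        return false;
    }
}
```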
Available for testing with the 3.3.3 ER01 build.
*Note: jon-server-patch-3.3.0.GA.zip maps to the ER01 build of JON 3.3.3.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.
*** Bug 1339586 has been marked as a duplicate of this bug. ***