Created attachment 1015379 [details]
Description of problem:
Given a default server, storage node, and agent installation, the server should reconnect to the storage node after a brief storage node outage.
0. Install JON server as usual
1. ./rhqctl stop --storage
2. wait 3 minutes
3. ./rhqctl start --storage
4. Observe server.log for "NoHostAvailableException"
See attached server and storage node logs. Storage node was shutdown at around 2015-04-16 18:03.
Version-Release number of selected component (if applicable):
- reproduced in 3.3.0 GA and 3.3.2 ER1
Created attachment 1015380 [details]
Another way to reproduce is to drop packets going to the Cassandra port 9142:
# iptables -A INPUT -p tcp --dport 9142 -m statistic --mode random --probability 0.50 -j DROP
Let it run for a few minutes until NoHostAvailableException appears in server.log, then delete the rule:
# iptables -D INPUT -p tcp --dport 9142 -m statistic --mode random --probability 0.50 -j DROP
The server never seems to be able to recover from the exception.
This also seems to happen to me these days (I can't remember it happening in the past, though; perhaps it is environmental?)
I do not think this is an environment issue. I believe it has more to do with the Cassandra driver, as demonstrated by this test code: https://gist.github.com/jsanda/95409e8f4956730d58a8. I perform the following steps with that test to reproduce the problem:
1) Start Cassandra
2) Run test (which loops indefinitely)
3) Stop Cassandra
4) Driver reports exceptions
5) Start Cassandra
The Host.StateListener never gets called. I think this is a bug or a limitation in the version of the driver being used, because I ran the same test with version 2.1.5, and the driver does reconnect and notify the listener after the Storage Node is restarted. I am not necessarily pointing this out to suggest we need to upgrade the driver; I am pointing it out to show that my understanding of the driver's behavior in this regard was wrong.
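For reference, the callback in question is the listener the driver notifies on host state changes. The snippet below is a minimal self-contained mock (the interface and notifier are illustrative stand-ins, not the real com.datastax.driver.core API) showing the onUp notification we expected to fire after the Storage Node restart:

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for the driver's Host.StateListener contract (names are
// illustrative; the real interface lives in com.datastax.driver.core).
interface HostStateListener {
    void onUp(String host);
    void onDown(String host);
}

// Minimal mock cluster that fans host state changes out to registered
// listeners, which is what we expected the 2.1.5 driver to do on restart.
public class MockCluster {
    private final List<HostStateListener> listeners = new ArrayList<>();

    public void register(HostStateListener listener) {
        listeners.add(listener);
    }

    public void markDown(String host) {
        for (HostStateListener l : listeners) l.onDown(host);
    }

    public void markUp(String host) {
        for (HostStateListener l : listeners) l.onUp(host);
    }
}
```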
We are going to have to shut down and recreate the session, particularly with single node deployments. I will have to do some additional testing to see if it is also an issue with multi-node deployments.
It is not as simple as recreating the session whenever we see a NoHostAvailableException, for a couple of reasons. First, the driver can report a NoHostAvailableException when the Storage Node is under heavy load; a burst in requests can trigger the exception, and after the burst subsides we might not have any problem connecting. Secondly, when we store raw data, writes are done asynchronously in parallel; in effect they are pipelined. This means that if we see one NoHostAvailableException, there is a pretty good chance we will see several of them.
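To illustrate the pipelining point, here is a simplified sketch (plain CompletableFuture standing in for the driver's async write futures, with a RuntimeException simulating NoHostAvailableException): a batch of parallel writes against a down node fails wholesale, so a single outage surfaces as many exceptions at once.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class PipelinedWrites {

    // Simulated async write: completes exceptionally when the node is
    // down, the way the driver would fail a write future.
    static CompletableFuture<Void> writeAsync(boolean nodeUp) {
        CompletableFuture<Void> f = new CompletableFuture<>();
        if (nodeUp) {
            f.complete(null);
        } else {
            f.completeExceptionally(
                new RuntimeException("NoHostAvailableException (simulated)"));
        }
        return f;
    }

    // Fire a batch of raw-data writes in parallel and count the failures.
    public static int failedWrites(int batchSize, boolean nodeUp) {
        List<CompletableFuture<Void>> batch = new ArrayList<>();
        for (int i = 0; i < batchSize; i++) {
            batch.add(writeAsync(nodeUp));
        }
        int failures = 0;
        for (CompletableFuture<Void> f : batch) {
            if (f.isCompletedExceptionally()) failures++;
        }
        return failures;
    }
}
```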
I think we need to set up a scheduled job that takes action if there has been a NoHostAvailableException within the past N minutes or seconds. If we are not in maintenance mode, then presumably everything is fine and there is nothing else to do. If we are in maintenance mode, though, we should try executing a simple query. If we still get a NoHostAvailableException, then we shut down and recreate the Session object and try again.
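A sketch of that scheduled-job decision logic (the Storage interface and its probeQuery/recreateSession methods are hypothetical names for this sketch; the real fix would hang off the server's session management code):

```java
public class SessionRecoveryJob {

    // Hypothetical collaborators, injected so the decision logic is testable.
    interface Storage {
        boolean probeQuery();     // run a trivial query against the node
        void recreateSession();   // shut down the Session and build a new one
    }

    private final Storage storage;
    private final long windowMillis;
    private long lastNoHostAvailableAt = -1;

    SessionRecoveryJob(Storage storage, long windowMillis) {
        this.storage = storage;
        this.windowMillis = windowMillis;
    }

    // Called wherever a NoHostAvailableException is caught.
    void reportNoHostAvailable(long now) {
        lastNoHostAvailableAt = now;
    }

    // Runs on a schedule; returns true when the session was recreated.
    boolean runOnce(long now) {
        if (lastNoHostAvailableAt < 0
                || now - lastNoHostAvailableAt > windowMillis) {
            return false;          // no recent exception: nothing to do
        }
        if (storage.probeQuery()) {
            return false;          // the burst subsided; the connection is fine
        }
        storage.recreateSession(); // still failing: shut down and rebuild
        return true;
    }
}
```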
Unfortunately we cannot simply check the Storage Node's availability to determine whether it is down. When the server is in maintenance mode, we refuse all agent requests. If the storage node was down and later restarted, the agent will report it as UP, but the server will reject the availability report, which means we cannot rely on the last reported availability.
Is there any chance to get this in JON 3.3.3?
Merge: 91de3c4 bb867e3
Author: jsanda <firstname.lastname@example.org>
Date: Wed Jun 24 07:44:55 2015 -0400
Merge pull request #178 from burmanm/reconnect
[BZ 1212627] Recreate storage node sessions if connections are down
Author: Michael Burman <email@example.com>
Date: Wed Jun 24 13:31:21 2015 +0300
Set name for the AliveChecker for easier debugging and catch all the exceptions in the aliveChecker thread
Author: Michael Burman <firstname.lastname@example.org>
Date: Tue Jun 23 16:37:53 2015 +0300
[BZ 1212627] Check storage node connection aliveness every 4s and recreate session if check failed twice in a row.
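The core rule of the fix ("check aliveness every 4 seconds and recreate the session if the check fails twice in a row") boils down to a consecutive-failure counter. The class below is a simplification of the AliveChecker thread described in the commits above; the class and method names are illustrative, not the committed code:

```java
public class AliveCheck {
    private int consecutiveFailures = 0;

    // Called every 4 seconds with the result of a trivial liveness query.
    // Returns true when the session should be shut down and recreated.
    public boolean record(boolean checkPassed) {
        if (checkPassed) {
            consecutiveFailures = 0;   // any success resets the streak
            return false;
        }
        consecutiveFailures++;
        if (consecutiveFailures >= 2) {
            consecutiveFailures = 0;   // reset after triggering a recreate
            return true;
        }
        return false;
    }
}
```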
Available for testing with the 3.3.3 ER01 build.
*Note: jon-server-patch-3.3.0.GA.zip maps to the ER01 build of JON 3.3.3.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.
*** Bug 1339586 has been marked as a duplicate of this bug. ***