Bug 1212627 - Server fails to reconnect to storage node
Summary: Server fails to reconnect to storage node
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: JBoss Operations Network
Classification: JBoss
Component: Storage Node
Version: JON 3.3.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ER01
Target Release: JON 3.3.3
Assignee: Michael Burman
QA Contact: Matt Mahoney
URL:
Whiteboard:
Duplicates: 1339586
Depends On:
Blocks:
 
Reported: 2015-04-16 20:32 UTC by Viet Nguyen
Modified: 2019-07-11 08:58 UTC
CC List: 6 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-07-30 16:41:55 UTC
Type: Bug
Embargoed:


Attachments
server.log (3.91 MB, text/plain) - attached 2015-04-16 20:32 UTC by Viet Nguyen
rhq-storage.log (188.85 KB, text/plain) - attached 2015-04-16 20:32 UTC by Viet Nguyen


Links
Red Hat Bugzilla 1235974 (high, CLOSED) - JBoss ON Agent cannot recover the connection after its storage node is restarted - last updated 2021-02-22 00:41:40 UTC
Red Hat Product Errata RHSA-2015:1525 (normal, SHIPPED_LIVE) - Moderate: Red Hat JBoss Operations Network 3.3.3 update - last updated 2015-07-30 20:41:08 UTC

Internal Links: 1235974

Description Viet Nguyen 2015-04-16 20:32:12 UTC
Created attachment 1015379 [details]
server.log

Description of problem:

Given a default server, storage node, and agent installation, the server should reconnect to the storage node after a brief storage node outage.

To reproduce:

0. Install JON server as usual

1. ./rhqctl stop --storage

2. wait 3 minutes

3. ./rhqctl start --storage

4. Observe server.log for "NoHostAvailableException"


See the attached server and storage node logs. The storage node was shut down at around 2015-04-16 18:03.


Version-Release number of selected component (if applicable):
- reproduced in 3.3.0 GA and 3.3.2 ER1

How reproducible:
100%

Comment 1 Viet Nguyen 2015-04-16 20:32:40 UTC
Created attachment 1015380 [details]
rhq-storage.log

Comment 2 Viet Nguyen 2015-04-17 17:36:05 UTC
Another way to reproduce is dropping packets going to Cassandra port 9142

# iptables -A INPUT -p tcp --dport 9142 -m statistic --mode random --probability 0.50 -j DROP

Let it run for a few minutes until NoHostAvailableException appears in server.log, then delete the rule:

# iptables -D INPUT -p tcp --dport 9142 -m statistic --mode random --probability 0.50 -j DROP

The server never seems to be able to recover from the exception.

Comment 3 Michael Burman 2015-04-20 12:27:57 UTC
This seems to be happening to me as well these days (I can't remember it happening in the past though - perhaps environmental?)

Comment 4 John Sanda 2015-06-03 19:52:30 UTC
I do not think that this is an environment issue. I believe it has more to do with the Cassandra driver, as demonstrated in this test code: https://gist.github.com/jsanda/95409e8f4956730d58a8. I perform the following steps with that test to produce the problem:

1) Start Cassandra
2) Run test (which loops indefinitely)
3) Stop Cassandra
4) Driver reports exceptions
5) Start Cassandra

The Host.StateListener never gets called. I think that this is a bug or a limitation in the version of the driver being used, because I ran the same test with version 2.1.5, and the driver does reconnect and notify the listener after the Storage Node is restarted. I am not necessarily pointing this out to suggest we need to upgrade the driver. I am pointing it out to show that my understanding of the driver's behavior in this regard was wrong.
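
For reference, a minimal sketch of this kind of test against the DataStax Java driver 2.1.x (not the linked gist itself; the listener callback set varies between driver versions - onSuspected exists in 2.1.x but not in later majors - so treat the exact methods as illustrative):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Host;
    import com.datastax.driver.core.Session;

    public class StateListenerTest {

        // Log every host state transition the driver reports.
        static class LoggingListener implements Host.StateListener {
            public void onAdd(Host host)       { System.out.println("ADD       " + host); }
            public void onUp(Host host)        { System.out.println("UP        " + host); }
            public void onSuspected(Host host) { System.out.println("SUSPECTED " + host); }
            public void onDown(Host host)      { System.out.println("DOWN      " + host); }
            public void onRemove(Host host)    { System.out.println("REMOVE    " + host); }
        }

        public static void main(String[] args) throws InterruptedException {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            cluster.register(new LoggingListener());
            Session session = cluster.connect();

            // Loop indefinitely; stop and restart Cassandra while this runs and
            // watch whether onDown/onUp are ever reported.
            while (true) {
                try {
                    session.execute("SELECT release_version FROM system.local");
                } catch (Exception e) {
                    System.out.println("query failed: " + e);
                }
                Thread.sleep(1000);
            }
        }
    }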

We are going to have to shut down and recreate the session, particularly with single node deployments. I will have to do some additional testing to see if it is also an issue with multi-node deployments. 

It is not as simple as recreating the session when we see a NoHostAvailableException, for a couple of reasons. First, the driver can report a NoHostAvailableException when the Storage Node is under heavy load. There could be a burst in requests that triggers the exception, and after that burst subsides we might not have any problem connecting. Secondly, when we store raw data, writes are done asynchronously in parallel; in effect they are pipelined. This means that if we see one NoHostAvailableException, there is a pretty good chance we will see several of them.
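
To illustrate the pipelining point, a hedged sketch (not the actual RHQ raw-data code path; storeRawData and insertCql are made-up names): several asynchronous writes are in flight at once, so when the node becomes unreachable each outstanding future fails with its own NoHostAvailableException rather than producing a single error.

    import com.datastax.driver.core.ResultSetFuture;
    import com.datastax.driver.core.Session;
    import java.util.ArrayList;
    import java.util.List;

    public class PipelinedWrites {
        static void storeRawData(Session session, List<String> insertCql) {
            List<ResultSetFuture> inFlight = new ArrayList<ResultSetFuture>();
            for (String cql : insertCql) {
                inFlight.add(session.executeAsync(cql)); // fire all writes without waiting
            }
            for (ResultSetFuture f : inFlight) {
                try {
                    f.getUninterruptibly();              // each failed write surfaces here
                } catch (Exception e) {
                    System.out.println("write failed: " + e);
                }
            }
        }
    }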

I think we need to set up a scheduled job that takes action if there has been a NoHostAvailableException within the past N minutes or seconds. If we are not in maintenance mode, then presumably everything is fine and there is nothing else to do. If we are in maintenance mode though, we should try executing a simple query. If we still get a NoHostAvailableException, then we shut down and recreate the Session object and try again.

Unfortunately we cannot simply check the Storage Node's availability to determine whether or not it is down. When the server is in maintenance mode, we refuse all agent requests. If the storage node was down and later restarted, the agent will report it as being UP, but the server will reject the availability report, which means we cannot rely on the last reported availability.
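
A hypothetical sketch of that strategy (the class and field names here are illustrative, not the AliveChecker later added in PR #178, which per comment 7 probes the connection every 4 seconds and recreates the session after two consecutive failures):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.exceptions.NoHostAvailableException;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class SessionAliveChecker {
        private final Cluster cluster;
        private volatile Session session;
        private int consecutiveFailures = 0;

        public SessionAliveChecker(Cluster cluster, Session session) {
            this.cluster = cluster;
            this.session = session;
        }

        public void start() {
            ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
            scheduler.scheduleAtFixedRate(new Runnable() {
                public void run() {
                    try {
                        // Trivial probe query; succeeds only if a host is reachable.
                        session.execute("SELECT release_version FROM system.local");
                        consecutiveFailures = 0;
                    } catch (NoHostAvailableException e) {
                        // A single failure may just be a load burst; require two in a row.
                        if (++consecutiveFailures >= 2) {
                            session.close();
                            session = cluster.connect();   // recreate the session
                            consecutiveFailures = 0;
                        }
                    } catch (Exception e) {
                        // Swallow anything else so the checker thread never dies.
                    }
                }
            }, 4, 4, TimeUnit.SECONDS);
        }
    }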

Comment 5 Filip Brychta 2015-06-18 15:23:54 UTC
Is there any chance to get this in JON 3.3.3?

Comment 6 Michael Burman 2015-06-23 13:43:57 UTC
PR #178

Comment 7 Michael Burman 2015-06-24 16:51:56 UTC
In master:

commit 67bbfa0026755659762b54e69e65c3354447f914
Merge: 91de3c4 bb867e3
Author: jsanda <jsanda>
Date:   Wed Jun 24 07:44:55 2015 -0400

    Merge pull request #178 from burmanm/reconnect
    
    [BZ 1212627] Recreate storage node sessions if connections are down

commit bb867e35169891e134b80e218eb636dfdeb35e90
Author: Michael Burman <miburman>
Date:   Wed Jun 24 13:31:21 2015 +0300

    Set name for the AliveChecker for easier debugging and catch all the exceptions in the aliveChecker thread

commit aa63682195ea1fc2a7ae06f11023b6cc05286c19
Author: Michael Burman <miburman>
Date:   Tue Jun 23 16:37:53 2015 +0300

    [BZ 1212627] Check storage node connection aliveness every 4s and recreate session if check failed twice in a row.

Comment 9 Simeon Pinder 2015-07-10 18:55:30 UTC
Available for test with 3.3.3 ER01 build: 
https://brewweb.devel.redhat.com/buildinfo?buildID=446732
 *Note: jon-server-patch-3.3.0.GA.zip maps to ER01 build of
 jon-server-3.3.0.GA-update-03.zip.

Comment 12 errata-xmlrpc 2015-07-30 16:41:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-1525.html

Comment 13 Michael Burman 2016-05-25 12:50:02 UTC
*** Bug 1339586 has been marked as a duplicate of this bug. ***

