Created attachment 1015379 [details]
Description of problem:
Given a default server, storage node, and agent installation, the server should reconnect to the storage node after a brief storage node outage.
0. Install JON server as usual
1. ./rhqctl stop --storage
2. wait 3 minutes
3. ./rhqctl start --storage
4. Observe server.log for "NoHostAvailableException"
See attached server and storage node logs. Storage node was shutdown at around 2015-04-16 18:03.
Version-Release number of selected component (if applicable):
- reproduced in 3.3.0 GA and 3.3.2 ER1
Created attachment 1015380 [details]
Another way to reproduce is to drop packets going to the Cassandra port 9142:
# iptables -A INPUT -p tcp --dport 9142 -m statistic --mode random --probability 0.50 -j DROP
Let it run for a few minutes until NoHostAvailableException appears in server.log, then delete the rule:
# iptables -D INPUT -p tcp --dport 9142 -m statistic --mode random --probability 0.50 -j DROP
The server never seems to be able to recover from the exception.
This also seems to happen to me these days (I can't remember it happening in the past, though; perhaps it is environmental?)
I do not think this is an environment issue. I believe it has more to do with the Cassandra driver, as demonstrated by this test code: https://gist.github.com/jsanda/95409e8f4956730d58a8. I perform the following steps with that test to reproduce the problem:
1) Start Cassandra
2) Run test (which loops indefinitely)
3) Stop Cassandra
4) Driver reports exceptions
5) Start Cassandra
The Host.StateListener never gets called. I think this is a bug or a limitation in the version of the driver being used, because I ran the same test with version 2.1.5, and the driver does reconnect and notify the listener after the Storage Node is restarted. I am not necessarily pointing this out to suggest we need to upgrade the driver; I am pointing it out to show that my understanding of the driver's behavior in this regard was wrong.
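For reference, the callback in question is the listener the driver notifies on host state changes. The snippet below is a minimal self-contained mock (the interface and notifier are illustrative stand-ins, not the real com.datastax.driver.core API) showing the onUp notification we expected to fire after the Storage Node restart:

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for the driver's Host.StateListener contract (names are
// illustrative; the real interface lives in com.datastax.driver.core).
interface HostStateListener {
    void onUp(String host);
    void onDown(String host);
}

// Minimal mock cluster that fans host state changes out to registered
// listeners, which is what we expected the 2.1.5 driver to do on restart.
public class MockCluster {
    private final List<HostStateListener> listeners = new ArrayList<>();

    public void register(HostStateListener listener) {
        listeners.add(listener);
    }

    public void markDown(String host) {
        for (HostStateListener l : listeners) l.onDown(host);
    }

    public void markUp(String host) {
        for (HostStateListener l : listeners) l.onUp(host);
    }
}
```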
We are going to have to shut down and recreate the session, particularly with single node deployments. I will have to do some additional testing to see if it is also an issue with multi-node deployments.
It is not as simple as recreating the session whenever we see a NoHostAvailableException, for a couple of reasons. First, the driver can report a NoHostAvailableException when the Storage Node is under heavy load; a burst in requests can trigger the exception, and after the burst subsides we might not have any problem connecting. Secondly, when we store raw data, writes are done asynchronously in parallel; in effect they are pipelined. This means that if we see one NoHostAvailableException, there is a pretty good chance we will see several of them.
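To illustrate the pipelining point, here is a simplified sketch (plain CompletableFuture standing in for the driver's async write futures, with a RuntimeException simulating NoHostAvailableException): a batch of parallel writes against a down node fails wholesale, so a single outage surfaces as many exceptions at once.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class PipelinedWrites {

    // Simulated async write: completes exceptionally when the node is
    // down, the way the driver would fail a write future.
    static CompletableFuture<Void> writeAsync(boolean nodeUp) {
        CompletableFuture<Void> f = new CompletableFuture<>();
        if (nodeUp) {
            f.complete(null);
        } else {
            f.completeExceptionally(
                new RuntimeException("NoHostAvailableException (simulated)"));
        }
        return f;
    }

    // Fire a batch of raw-data writes in parallel and count the failures.
    public static int failedWrites(int batchSize, boolean nodeUp) {
        List<CompletableFuture<Void>> batch = new ArrayList<>();
        for (int i = 0; i < batchSize; i++) {
            batch.add(writeAsync(nodeUp));
        }
        int failures = 0;
        for (CompletableFuture<Void> f : batch) {
            if (f.isCompletedExceptionally()) failures++;
        }
        return failures;
    }
}
```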
I think we need to set up a scheduled job that takes action if there has been a NoHostAvailableException within the past N minutes or seconds. If we are not in maintenance mode, then presumably everything is fine and there is nothing else to do. If we are in maintenance mode, though, we should try executing a simple query. If we still get a NoHostAvailableException, then we shut down and recreate the Session object and try again.
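A sketch of that scheduled-job decision logic (the Storage interface and its probeQuery/recreateSession methods are hypothetical names for this sketch; the real fix would hang off the server's session management code):

```java
public class SessionRecoveryJob {

    // Hypothetical collaborators, injected so the decision logic is testable.
    interface Storage {
        boolean probeQuery();     // run a trivial query against the node
        void recreateSession();   // shut down the Session and build a new one
    }

    private final Storage storage;
    private final long windowMillis;
    private long lastNoHostAvailableAt = -1;

    SessionRecoveryJob(Storage storage, long windowMillis) {
        this.storage = storage;
        this.windowMillis = windowMillis;
    }

    // Called wherever a NoHostAvailableException is caught.
    void reportNoHostAvailable(long now) {
        lastNoHostAvailableAt = now;
    }

    // Runs on a schedule; returns true when the session was recreated.
    boolean runOnce(long now) {
        if (lastNoHostAvailableAt < 0
                || now - lastNoHostAvailableAt > windowMillis) {
            return false;          // no recent exception: nothing to do
        }
        if (storage.probeQuery()) {
            return false;          // the burst subsided; the connection is fine
        }
        storage.recreateSession(); // still failing: shut down and rebuild
        return true;
    }
}
```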
Unfortunately we cannot simply check the Storage Node's availability to determine whether it is down. When the server is in maintenance mode, we refuse all agent requests. If the storage node was down and later restarted, the agent will report it as UP, but the server will reject the availability report, which means we cannot rely on the last reported availability.
Is there any chance to get this in JON 3.3.3?
Merge: 91de3c4 bb867e3
Author: jsanda <firstname.lastname@example.org>
Date: Wed Jun 24 07:44:55 2015 -0400
Merge pull request #178 from burmanm/reconnect
[BZ 1212627] Recreate storage node sessions if connections are down
Author: Michael Burman <email@example.com>
Date: Wed Jun 24 13:31:21 2015 +0300
Set name for the AliveChecker for easier debugging and catch all the exceptions in the aliveChecker thread
Author: Michael Burman <firstname.lastname@example.org>
Date: Tue Jun 23 16:37:53 2015 +0300
[BZ 1212627] Check storage node connection aliveness every 4s and recreate session if check failed twice in a row.
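The core rule of the fix ("check aliveness every 4 seconds and recreate the session if the check fails twice in a row") boils down to a consecutive-failure counter. The class below is a simplification of the AliveChecker thread described in the commits above; the class and method names are illustrative, not the committed code:

```java
public class AliveCheck {
    private int consecutiveFailures = 0;

    // Called every 4 seconds with the result of a trivial liveness query.
    // Returns true when the session should be shut down and recreated.
    public boolean record(boolean checkPassed) {
        if (checkPassed) {
            consecutiveFailures = 0;   // any success resets the streak
            return false;
        }
        consecutiveFailures++;
        if (consecutiveFailures >= 2) {
            consecutiveFailures = 0;   // reset after triggering a recreate
            return true;
        }
        return false;
    }
}
```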
Available for testing with the 3.3.3 ER01 build.
*Note: jon-server-patch-3.3.0.GA.zip maps to the ER01 build of JON 3.3.3.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.
*** Bug 1339586 has been marked as a duplicate of this bug. ***