Bug 588383

Summary:	make sure server sleeps a given amount of time at startup to ensure agents know it was down
Product:	[Other] RHQ Project	Reporter:	John Mazzitelli <mazz>
Component:	Core Server	Assignee:	John Mazzitelli <mazz>
Status:	CLOSED CURRENTRELEASE	QA Contact:	Corey Welton <cwelton>
Severity:	low	Docs Contact:
Priority:	low
Version:	1.3
Target Milestone:	---
Target Release:	---
Hardware:	All
OS:	All
Whiteboard:
Fixed In Version:	2.4	Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2010-08-12 16:57:01 UTC	Type:	---
Bug Blocks:	584435

Description John Mazzitelli 2010-05-03 15:50:14 UTC

Description of problem:

If a server shuts down and immediately starts back up, and the restart is so fast that the agent never notices the server was down, the new server's caches are never refreshed because the agent won't know to re-send a connect message. This has bad side effects: for example, alerts will get lost.

How reproducible:

Set the agent's "rhq.agent.client.server-polling-interval-msecs" setting to 5 minutes (300000 ms). Start the server and agent and let them reach steady state. Shut down and immediately restart the server, making sure the server starts up in under 5 minutes. Notice that the server log never indicates the agent re-sent a connect message to it. This is bad: the agent needed to reconnect to the restarted server (i.e. the server log should have indicated that it received a connect message from the agent).

Additional info:

This is a very rare problem, so rare it probably will never occur in production. It is rare that a server will be able to restart so fast that the agents connected to it never notice. However, the chance is greater than 0%, thus we should do something to prevent this.

The solution is to make sure the server's startup time is slow enough for the agents to notice. Again, most of the time this happens naturally (the server startup time is rarely less than 60 seconds). But we should have some code in StartupServlet that checks how much time has passed since the server started. If startup has been too fast, then before we start up the comm layer, the server should sleep for the additional number of seconds needed to take it over the threshold of the agents' polling times. By default, we ensure the server startup time is 70 seconds or more. If the server was started 50 seconds ago, it will sleep an additional 20 seconds just to make sure the agents detected the downed server. If the server was started 80 seconds ago, no additional sleep is required and the server continues on as it normally would.
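
For illustration only, here is a minimal sketch of the check described above. It is not the code that was committed to StartupServlet; the class and method names (EnsureDownTime, remainingSleepSecs, sleepIfStartedTooFast) are hypothetical, and it assumes the caller recorded the server start time with System.nanoTime().

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not the actual RHQ StartupServlet code.
final class EnsureDownTime {

    /** Default minimum startup duration in seconds, per the description. */
    static final long DEFAULT_ENSURE_DOWN_TIME_SECS = 70L;

    private EnsureDownTime() {
    }

    /** Returns how many more seconds to wait; 0 if startup already took long enough. */
    static long remainingSleepSecs(long elapsedSecs, long ensureDownTimeSecs) {
        return Math.max(0L, ensureDownTimeSecs - elapsedSecs);
    }

    /**
     * Sleeps until at least ensureDownTimeSecs have passed since startNanos
     * (a value previously read from System.nanoTime()). Meant to be called
     * just before the comm layer is started. Returns the seconds it chose to sleep.
     */
    static long sleepIfStartedTooFast(long startNanos, long ensureDownTimeSecs) {
        long elapsedSecs = TimeUnit.NANOSECONDS.toSeconds(System.nanoTime() - startNanos);
        long remaining = remainingSleepSecs(elapsedSecs, ensureDownTimeSecs);
        if (remaining > 0) {
            try {
                TimeUnit.SECONDS.sleep(remaining);
            } catch (InterruptedException e) {
                // Preserve the interrupt status and let startup continue.
                Thread.currentThread().interrupt();
            }
        }
        return remaining;
    }
}
```

With the 70 second default, remainingSleepSecs(50, 70) gives 20 and remainingSleepSecs(80, 70) gives 0, matching the two cases in the description.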

Comment 1 John Mazzitelli 2010-05-03 15:52:30 UTC

gwt branch - commit 28f0e62acee77b6f439ab2b114cdf9ee2fc767ae - we need to cherry pick that over to master.

Note that the default is 70s, but it's configurable by setting the system property "rhq.server.ensure-down-time-secs" - you can add this to rhq-server.properties.
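
As a rough sketch of how such a property could be read (not the committed code), something like the following would work. The class name EnsureDownTimeConfig is hypothetical, and falling back to the 70s default on a missing, unparsable, or negative value is an assumption of this sketch, not something this bug specifies.

```java
// Hypothetical helper, not the actual RHQ code.
final class EnsureDownTimeConfig {

    /** System property named in this bug. */
    static final String PROP_NAME = "rhq.server.ensure-down-time-secs";

    /** Default from this bug: 70 seconds. */
    static final long DEFAULT_SECS = 70L;

    private EnsureDownTimeConfig() {
    }

    /**
     * Reads the configured minimum startup time in seconds.
     * Assumption: missing, unparsable, or negative values fall back to the default.
     */
    static long ensureDownTimeSecs() {
        String raw = System.getProperty(PROP_NAME);
        if (raw == null) {
            return DEFAULT_SECS;
        }
        try {
            long value = Long.parseLong(raw.trim());
            return value >= 0 ? value : DEFAULT_SECS;
        } catch (NumberFormatException e) {
            return DEFAULT_SECS;
        }
    }
}
```

For example, a line such as rhq.server.ensure-down-time-secs=120 in rhq-server.properties would make this return 120, assuming the entries in that file are exposed to the server as system properties.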

Comment 2 John Mazzitelli 2010-05-03 16:00:34 UTC

We need to confirm that this sleep happens in the proper place.

For example, just because our comm layer isn't up doesn't mean the Tomcat connector isn't accepting connections. The Tomcat connectors may be up, and therefore the agents may still be able to connect; they just won't be able to get their messages processed successfully. If that is the case, this solution won't work (because IIRC the agent will only think the server is truly down if it receives a CannotConnect exception, and that would not be the exception it received, since it did connect and its message just failed to be processed).

BTW: this has been cherry picked to master - commit 30f79902c4ff5f0ef03a16c5c0319befc697bcce

Comment 3 John Mazzitelli 2010-05-03 16:13:33 UTC

(In reply to comment #2)
> for example, just because our comm layer isn't up, doesn't mean the tomcat
> connector isn't accepting connections. The tomcat connectors may be up and
> therefore the agents may still be able to connect, they just won't be able to
> successfully get their messages processed. If that is the case, this solution
> won't work (because IIRC the agent will only think the server is truly down if
> it receives a CannotConnect exception, and that would not be the exception it
> would have received since it did connect, its message just failed to be
> processed).

We should be OK. I just ran a test and looked at the code, and a
"failover-able" exception includes the exception you get if our comm layer
isn't started but the Tomcat connectors are - that being a WebServerError
exception. See CommUtils:

    public static boolean isExceptionFailoverable(Throwable t) {
        return (t instanceof CannotConnectException || t instanceof ConnectException
            || t instanceof NotProcessedException || t instanceof WebServerError);
    }

Therefore, I think this solution will be OK. Of course, we need more QA testing
to confirm.

Comment 4 Corey Welton 2010-05-21 19:09:29 UTC

QA verified.

Comment 5 Corey Welton 2010-08-12 16:57:01 UTC

Mass-closure of verified bugs against JON.