1152154 – have option to allow agent to wait forever for registration server to come online

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	JBoss Operations Network
Classification:	JBoss
Component:	Agent
Sub Component:
Version:	JON 3.3.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	unspecified
Target Milestone:	CR01
Target Release:	JON 3.3.0
Assignee:	John Mazzitelli
QA Contact:	Garik Khachikyan
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2014-10-13 13:45 UTC by John Mazzitelli
Modified:	2017-04-03 11:56 UTC
CC List:	4 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2014-12-11 14:00:13 UTC
Type:	Bug
Embargoed:

Attachments
agent log with log wait (31.54 KB, text/plain) 2014-10-16 03:05 UTC, Viet Nguyen	no flags

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	1145327	0	unspecified	CLOSED	Agent died on start up due to unreachable Server public endpoint	2021-02-22 00:41:40 UTC
Red Hat Bugzilla	1420060	0	medium	CLOSED	Agent sometimes fails to register	2021-02-22 00:41:40 UTC

Internal Links: 1145327 1420060

Description John Mazzitelli 2014-10-13 13:45:30 UTC

See bug #1145327 for why this has been requested.

Provide an option to tell the agent to wait forever when performing its very first registration with its registration server.

This will help in cases where you install the agent first, then the server. It is desirable for the agent to wait indefinitely (or at least some long amount of time) for the first server to come online.

There actually might already be an option like this, I'll research that first. If there is, I will add a comment here and mark as CLOSED/NOTABUG. But off the top of my head, I don't think there is such a feature, in which case something will need to be implemented.

Comment 1 John Mazzitelli 2014-10-13 14:41:42 UTC

Well, this is interesting. We already do this. If the agent has never registered before, it will wait indefinitely. There is nothing you have to do.

I might have to close this as NOTABUG and ask more about the use case.

Comment 2 John Mazzitelli 2014-10-13 18:40:31 UTC

there is a setting rhq.agent.wait-for-server-at-startup-msecs that does what I think is needed. Just set it to a very large number and it will wait that long for the server's public IP to become available.
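
Since the setting is expressed in milliseconds, a "very large number" is easiest to get right by computing it. A minimal sketch in Java; the class and method names are hypothetical, and only the setting name comes from this comment:

```java
import java.util.concurrent.TimeUnit;

class WaitSettingSketch {
    // Setting name as quoted in this comment; everything else here is illustrative.
    static final String WAIT_KEY = "rhq.agent.wait-for-server-at-startup-msecs";

    // Converts a wait expressed in hours into the millisecond value the setting expects.
    static long hoursToMsecs(long hours) {
        return TimeUnit.HOURS.toMillis(hours);
    }

    public static void main(String[] args) {
        // Print candidate values for a one-hour and a one-day wait.
        System.out.println(WAIT_KEY + "=" + hoursToMsecs(1));
        System.out.println(WAIT_KEY + "=" + hoursToMsecs(24));
    }
}
```

A one-hour wait works out to 3600000 and a one-day wait to 86400000.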

Comment 3 John Mazzitelli 2014-10-13 20:05:53 UTC

viet - please see comment #2 and try customizing that setting in your agent prior to starting the agent for the first time. I think this will actually do what you need.

Comment 4 Viet Nguyen 2014-10-16 03:05:58 UTC

Created attachment 947398
agent log with log wait

Comment 5 Viet Nguyen 2014-10-16 03:13:05 UTC

Longer wait didn't help. The main complaint seemed to be "Cause: java.net.UnknownHostException: rhqserver". I could only reproduce this with the agent that manages the storage node.

1. Launch docker container without calling jon install script
2. ./rhqctl install --server --storage
3. rm -rf /root/.java
4. Edit /opt/rhq-agent/conf/agent-configuration.xml. Set rhq.agent.wait-for-server-at-startup-msecs to 1 hour (3600000)
5. ./rhqctl start --server --storage --agent

Comment 6 John Mazzitelli 2014-10-16 13:12:32 UTC

this is really odd - after the agent started shutting down, the registration thread actually did end up registering the agent!

2014-10-16 02:50:38,485 INFO  [RHQ Agent Registration Thread] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.agent-registration-results}Agent has successfully registered with the server. The results are: [AgentRegistrationResults: [agent-token=a8VtCuU5ugfJT/BVxQkMXaCElwQ2jCRZCVV23akvj+cPz4O6yjdb94WcK+N5Ws+DToI=]]

Comment 7 John Mazzitelli 2014-10-16 15:04:37 UTC

so, it looks like the server is accessible on that local IP (172.17.0.241 port 7080), because I see this:

=====
2014-10-16 02:46:17,723 DEBUG [RHQ Server Polling Thread] (enterprise.communications.command.client.ServerPollingThread)- {ServerPollingThread.server-poll-failure}Failed to successfully poll the server. This is normally due to the server not being up yet. You can usually ignore this message since it will be tried again later, however, you should ensure this failure was not really caused by a misconfiguration. Cause: org.jboss.remoting.transport.http.WebServerError:<html><head><title>JBoss Web/7.4.8.Final-redhat-4 - JBWEB000064: Error report</title><style><!--H1 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:22px;} H2 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:16px;} H3 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:14px;} BODY {font-family:Tahoma,Arial,sans-serif;color:black;background-color:white;} B {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;} P {font-family:Tahoma,Arial,sans-serif;background:white;color:black;font-size:12px;}A {color : black;}A.name {color : black;}HR {color : #525D76;}--></style> </head><body><h1>JBWEB000065: HTTP Status 404 - /jboss-remoting-servlet-invoker/ServerInvokerServlet</h1><HR size="1" noshade="noshade"><p><b>JBWEB000309: type</b> JBWEB000067: Status report</p><p><b>JBWEB000068: message</b> <u>/jboss-remoting-servlet-invoker/ServerInvokerServlet</u></p><p><b>JBWEB000069: description</b> <u>JBWEB000124: The requested resource is not available.</u></p><HR size="1" noshade="noshade"><h3>JBoss Web/7.4.8.Final-redhat-4</h3></body></html>
=====

That's the 404 HTML error page from the JON Server, which means it isn't ready yet (clearly, the agent connected to SOMETHING at 172.17.0.241:7080).

But then for some reason, the agent doesn't wait long enough for the subsequent registration requests to complete successfully before aborting.

I'm not sure why it aborted so fast, when we can clearly see that the registration request succeeded soon afterwards and the agent got a token and failover list.

Still investigating what this particular use-case is triggering inside the agent.

Comment 8 John Mazzitelli 2014-10-16 15:13:50 UTC

I see this code in AgentMain, and it is the code that throws the exception shown in the log attached to this BZ. The comment in the code seems to describe what the log says happened: the server was up (we know it is up because we got a response; it was a 404 from the comm servlet, but technically it was up), yet we weren't registered.

I'm not sure why this is getting triggered in this BZ use-case but not seen in my other tests. But there might be something we have to do here to allow for the use-case that is failing.


                        } else if (mustRegister && !agentIsRegistered) {
                            // If we got here, we know a server is up and the agent needs to be registered, but it isn't registered.
                            // This usually means an unrecoverable registration error occurred, so abort.
                            throw new AgentRegistrationException(
                                MSG.getMsg(AgentI18NResourceKeys.AGENT_CANNOT_REGISTER));

Comment 9 John Mazzitelli 2014-10-16 15:26:37 UTC

I really think this broke as part of the "fix" to bug #987628. It used to be that the agent would always retry the registration, which meant the problem in this BZ couldn't happen because the registration would eventually have succeeded. But now there seems to be a race condition, and the agent aborts before the registration thread has a chance to finish what it's doing.

Will look to see if I can add back some type of retry loop but not have the agent wait an infinite amount of time before aborting.
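
The suspected race can be illustrated with a small, self-contained sketch. This is not the AgentMain code: the class, the latch, and the thread name are hypothetical, and the latch exists only to force the "check first, registration completes later" ordering that the attached log suggests.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

class RegistrationRaceSketch {
    /**
     * Returns two flags: [0] what a one-shot startup check sees,
     * [1] the registration state once the registration thread has finished.
     */
    static boolean[] simulate() {
        AtomicBoolean agentIsRegistered = new AtomicBoolean(false);
        CountDownLatch startupCheckDone = new CountDownLatch(1);

        // Stand-in for the registration thread: it only succeeds after the startup check ran.
        Thread registration = new Thread(() -> {
            try {
                startupCheckDone.await();
                agentIsRegistered.set(true);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "registration-sketch");
        registration.start();

        // One-shot check: the server answered, but registration has not finished yet.
        boolean seenAtCheck = agentIsRegistered.get();
        startupCheckDone.countDown();

        try {
            registration.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new boolean[] { seenAtCheck, agentIsRegistered.get() };
    }

    public static void main(String[] args) {
        boolean[] result = simulate();
        System.out.println("startup check saw registered=" + result[0]);
        System.out.println("after registration thread finished, registered=" + result[1]);
    }
}
```

A one-shot check that aborts on the first flag would kill an agent that is about to register successfully, which is why some bounded form of retry looks preferable.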

Comment 10 John Mazzitelli 2014-10-16 20:06:52 UTC

I have not been able to replicate this, but I added some code to let the agent wait a few more times before aborting completely. The abort is very drastic (the agent VM will die), so we should only do that as a last resort. The new code will now, by default, wait 5 different times before aborting. The "5" is actually configurable through a backdoor system property, "rhq.agent.startup-registration-waits" (you can set this in RHQ_AGENT_ADDITIONAL_JAVA_OPTS or add it to agent-configuration.xml). I didn't make this an actual agent preference (i.e. it is not documented in the .xml and AgentConfiguration doesn't know about it), but at least it's here in case some odd use case crops up where it's needed.
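
Roughly, the idea is a bounded number of wait rounds before the abort. A minimal sketch, not the committed code: only the property name and the default of 5 come from this comment; the class, the method names, and the BooleanSupplier check are assumptions for illustration.

```java
import java.util.function.BooleanSupplier;

class StartupRegistrationWaitSketch {
    // Property name and default taken from this comment; the rest is hypothetical.
    static final String WAITS_PROPERTY = "rhq.agent.startup-registration-waits";
    static final int DEFAULT_WAITS = 5;

    // Reads the number of wait rounds from the system property, falling back to the default.
    static int configuredWaits() {
        return Integer.getInteger(WAITS_PROPERTY, DEFAULT_WAITS);
    }

    /**
     * Checks the registration state up to maxWaits times, sleeping waitMillis between checks.
     * Returns true as soon as the agent is registered; false means the caller should abort.
     */
    static boolean waitForRegistration(BooleanSupplier isRegistered, int maxWaits, long waitMillis) {
        for (int attempt = 1; attempt <= maxWaits; attempt++) {
            if (isRegistered.getAsBoolean()) {
                return true;
            }
            try {
                Thread.sleep(waitMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break; // stop waiting early, but still do the final check below
            }
        }
        // One last look before giving up, since aborting is drastic.
        return isRegistered.getAsBoolean();
    }

    public static void main(String[] args) {
        // Simulated registration that only becomes visible on the third check.
        int[] checks = {0};
        boolean registered = waitForRegistration(() -> ++checks[0] >= 3, configuredWaits(), 10L);
        System.out.println(registered ? "registered" : "abort");
    }
}
```

In this sketch, passing -Drhq.agent.startup-registration-waits=10 through RHQ_AGENT_ADDITIONAL_JAVA_OPTS would raise the limit to 10 rounds.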

commit to master:

commit 6b97b5ce9c146d7fd14bcd0da50e165ccb85e9ea
Author: John Mazzitelli <mazz>
Date:   Thu Oct 16 16:06:09 2014 -0400

    BZ 1152154 - wait a few times during startup to ensure the agent has been given a fair chance to register

Comment 11 John Mazzitelli 2014-10-20 19:05:35 UTC

Cherry picking over so it makes it in the next QE build for testing.

commit 19ae6a0031ba949349d0ddfc8330d1db2604f790
Author: John Mazzitelli <mazz>
Date:   Thu Oct 16 16:06:09 2014 -0400

    BZ 1152154 - wait a few times during startup to ensure the agent has been given a fair chance to register
    (cherry picked from commit 6b97b5ce9c146d7fd14bcd0da50e165ccb85e9ea)

Comment 12 Simeon Pinder 2014-11-03 19:03:33 UTC

Moving to ON_QA as available to test with latest brew build:
https://brewweb.devel.redhat.com//buildinfo?buildID=396547

Comment 13 Garik Khachikyan 2014-11-04 14:16:54 UTC

Taking QA contact, with the following scenarios in mind for verification:

===
Scenario1: exit/die agent after default attempts (leave rhq.agent.startup-registration-waits not set; rhq.agent.wait-for-server-at-startup-msecs=60000)
Description1: configure the agent with a non-existent server hostname/address and define the properties as above. The agent should die after 5 min.

Scenario2: exit/die agent after some attempts (rhq.agent.startup-registration-waits=2; rhq.agent.wait-for-server-at-startup-msecs=60000)
Description2: configure the agent with a non-existent server hostname/address and define the properties as above. The agent should die after 2 min.

Scenario3: succeed during default attempts (leave all as defaults)
Description3: perform agent configuration to a server to be registered afterwards (but really quick). Agent should get registered.
===

Comment 16 Garik Khachikyan 2014-11-04 14:52:45 UTC

Scenario3: succeed during default attempts (leave all as defaults)
Description3: perform agent configuration to a server to be registered afterwards (but really quick). Agent should get registered.

The other 2 scenarios were eliminated after discussion with John.

Comment 17 Garik Khachikyan 2014-11-05 14:04:14 UTC

Scenario3 succeeded on Linux.

Trying Windows now.

Comment 18 Garik Khachikyan 2014-11-05 15:05:59 UTC

The Windows case works fine too. Verifying.
