1145327 – Agent died on start up due to unreachable Server public endpoint

Summary: Agent died on start up due to unreachable Server public endpoint

Keywords:
Status:	CLOSED WORKSFORME
Alias:	None
Product:	JBoss Operations Network
Classification:	JBoss
Component:	Agent
Sub Component:
Version:	JON 3.3.0
Hardware:	All
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	ER05
Target Release:	JON 3.3.0
Assignee:	RHQ Project Maintainer
QA Contact:	Mike Foley
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2014-09-22 20:25 UTC by Viet Nguyen
Modified:	2014-10-13 19:13 UTC
CC List:	3 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2014-10-10 14:17:44 UTC
Type:	Bug
Embargoed:

Attachments
agent.log (4.34 KB, text/plain) 2014-09-22 20:25 UTC, Viet Nguyen
dying agent with debugging on (34.42 KB, text/plain) 2014-10-13 19:13 UTC, Viet Nguyen

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	1152154	0	high	CLOSED	have option to allow agent to wait forever for registration server to come online	2021-02-22 00:41:40 UTC

Internal Links: 1152154

Description Viet Nguyen 2014-09-22 20:25:42 UTC

Created attachment 940172 [details]
agent.log

Description of problem:

I misconfigured the server public endpoint port, and as a result the default storage-node agent died upon registration.  See attachment.

Version-Release number of selected component (if applicable):
JON 3.3.0 ER03

How reproducible:
50%

Steps to Reproduce:
1. In rhq-server.properties, set rhq.communications.connector.bind-port to a random port
2. ./rhqctl install --server --storage
3. ./rhqctl start

Comment 1 Viet Nguyen 2014-09-22 20:59:39 UTC

Correction to reproducing step 1.  Set rhq.communications.connector.bind-address to an invalid address.  Setting bind-port has no effect (BZ 1145338).

Comment 2 John Mazzitelli 2014-09-29 21:00:46 UTC

This is working as expected.

I'm not sure what the expected results would be, other than what it is doing.

The agent log says:

2014-09-22 18:41:20,463 FATAL [main] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.startup-error}The agent encountered an error during startup and must abort org.rhq.core.clientapi.server.core.AgentRegistrationException: The agent cannot register with the server. Admin intervention needed! 

Which is true. You gave a bad server endpoint - the agent can't talk to the server. The agent didn't have a failover list (from a previous registration with that server), so the agent has no idea where any server is, and so it is dead in the water. It can't do anything until it's told where a server is.

That's why it is telling you "Admin intervention needed" - you have to tell the agent where a server is since your initial config was bad and it had no failover list to try.

Is there something I am misunderstanding? What would you think the expected behavior to be in this circumstance?

Comment 4 Viet Nguyen 2014-09-30 06:05:52 UTC

I think aborting the registration process is a bit heavy-handed. For example, could the agent go into wait-until-server-ready-with-timeout mode? That would definitely make the agent more resilient and improve the user experience.  It's like seeing your email app die on start up because you haven't plugged in the network cable.  In a container-based setup it's possible to have the app up and running inside the container, with the network wiring to follow as a separate step.

Comment 5 John Mazzitelli 2014-09-30 13:37:55 UTC

(In reply to Viet Nguyen from comment #4)
> I think aborting the registration process is a bit heavy handed. For example
> could the agent go into wait-until-server-ready-with-timeout mode?
> Definitely makes the agent more resilient and improves user experience. 
> It's like seeing your email app die on start up because you haven't plugged
> in the network cable.  In a container-based setup it's possible to have the
> app up and running inside the container and then network wiring to follow as
> a separate step.

But what would it be waiting for in this case? The IP address was bad in this case. It will never get a server connection - so the agent would wait indefinitely and essentially be dead in the water. It will need some admin intervention to tell it the real IP address. 

Now, in the case that you DID give it a good IP address and it's just that the server is not up yet, we actually DO have a wait-until-server-ready-with-timeout. The agent is given 1 minute to retry:

From AgentMain:

    private class RegisterStateListener implements ClientCommandSenderStateListener {
        public boolean startedSending(ClientCommandSender sender) {
            // spawn the register thread - wait for it to complete because once we return, the other state listeners will
            // be called and if they want to talk to the server, we must first be registered in order to be able to do that
            registerWithServer(60000L, false);

In that registerWithServer, we do some simple heuristics to try to wait for the server. We'll retry at most 5 times, and adjust our wait intervals to help avoid log jams of multiple agents clobbering the server with requests:

                        if (!got_registered) {
                            retry_interval = (retry_interval < 60000L) ? (retry_interval * 2) : 60000L;
                            LOG.warn(t, AgentI18NResourceKeys.AGENT_REGISTRATION_FAILURE, retry, retry_interval,
                                ThrowableUtil.getAllMessages(t));

So, in short, we do attempt to retry with a timeout - but if the agent can't connect in 5 attempts over a span of tens of seconds, the agent will still abort.
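
For illustration only, the interval adjustment quoted above can be isolated into a small, self-contained sketch. The doubling expression and the 60000 ms cap come from the AgentMain excerpt; the class name, the method names and the 1000 ms starting interval are assumptions made for this example and are not taken from the agent source.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: only the doubling expression mirrors the quoted excerpt.
class RegistrationBackoffSketch {
    static final long MAX_INTERVAL_MILLIS = 60000L;

    // Same expression as in the excerpt: double while below the cap, else use the cap.
    static long nextInterval(long retryInterval) {
        return (retryInterval < MAX_INTERVAL_MILLIS) ? (retryInterval * 2) : MAX_INTERVAL_MILLIS;
    }

    // Collects the interval used after each of the given number of failed attempts.
    static List<Long> intervals(long initialMillis, int failures) {
        List<Long> result = new ArrayList<>();
        long interval = initialMillis;
        for (int i = 0; i < failures; i++) {
            interval = nextInterval(interval);
            result.add(interval);
        }
        return result;
    }

    public static void main(String[] args) {
        // The 1000 ms starting interval is an assumption for this example.
        System.out.println(intervals(1000L, 5)); // prints [2000, 4000, 8000, 16000, 32000]
    }
}
```

With the assumed 1000 ms start, five failures add up to roughly a minute of waiting, which is in line with the "tens of seconds" described above. Because the comparison happens before the doubling, an interval just under the cap (for example 32000) doubles past it once before settling at 60000.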

The only alternative I could recommend is to have some agent configuration setting with which you can say, "Don't ever abort registration - keep trying until you succeed". If that was implemented, and you had that turned on, your agent would essentially be spinning its wheels for an infinitely long time because you gave it a bad IP address and the agent would NEVER succeed. But if the IP is good and you just need to wait for the server to start, you could start the server any time in the future and the agent would connect since it would never abort.

Let me know if that is something you think you need.

Comment 6 John Mazzitelli 2014-09-30 13:45:20 UTC

Oh! And you know what - I just looked closer at the log. This is different than what I thought.

You gave the SERVER a bad public endpoint in its configuration (Admin>Servers). So the agent did appear to register OK because you gave the agent a good registration IP, but when it got the failover list (which replaces the server endpoint used by the agent to do that initial registration/connect) it was bad! So the agent now doesn't know what's going on, because its failover list contains only bad IP addresses and it can't connect anymore.

This is slightly different than what I thought, though I think my earlier comment still holds true in its analysis.

Comment 7 Viet Nguyen 2014-10-01 17:29:11 UTC

The server public endpoint can be updated/corrected post-install in the JON UI, which makes this a recoverable error, not a catastrophic one that warrants an agent shutdown. A configurable option would be great!

Comment 8 John Mazzitelli 2014-10-01 20:25:54 UTC

Something else must have gone on, because I just tried the following and I'm not seeing the agent abort. It just keeps trying and trying to register:

1) install everything, but only start server and storage node
2) Go in Admin>Server GUI page and change server public IP to a bogus IP
3) Start agent and point it to the actual server IP (so it can register)
4) Notice the server gets the registration for the agent and sends down the failover list in the registration response, but the agent verifies the failover list and sees that it's got the bad IP in it. That's the message you see:

2014-09-22 18:42:03,280 WARN  [RHQ Agent Registration Thread] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.failover-list-check-failed}!!! There are [1] servers that are potentially unreachable by this agent.
Please double check all public endpoints of your servers and ensure
they are all reachable by this agent. The failed server endpoints are:
[10.16.23.108:7080/7443]
See the Administration (Topology) > Servers in the server GUI
to change the public endpoint of a server.
THIS AGENT WILL WAIT UNTIL ONE OF ITS SERVERS BECOMES REACHABLE!

But in my tests, agent just keeps trying and trying and never aborts.

I'm wondering what else is different that is causing your agent to abort. 

Can you provide more specific and more detailed replication steps? Because the steps I tried above don't cause the problem being reported.

Comment 9 John Mazzitelli 2014-10-01 20:28:03 UTC

Oh, and when I went into the GUI and changed the server public IP back to the real IP, the agent's next registration request fully succeeded. This is what the logs show:


... series of warnings about the bad server IP in the failover list...
2014-10-01 16:25:26,312 WARN  [RHQ Agent Registration Thread] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.failover-list-unreachable-host}Failover list has an unreachable host [192.168.1.222] (tested ports [7080] and [7443]). Cause: java.net.NoRouteToHostException:No route to host
2014-10-01 16:26:02,408 WARN  [RHQ Agent Registration Thread] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.failover-list-unreachable-host}Failover list has an unreachable host [192.168.1.222] (tested ports [7080] and [7443]). Cause: java.net.NoRouteToHostException:No route to host
2014-10-01 16:26:32,503 INFO  [RHQ Agent Registration Thread] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.agent-registration-results}Agent has successfully registered with the server. The results are: [AgentRegistrationResults: [agent-token=dgQwfPj61GioLf/CwmHSVAvKZB64lrQoOq8u25+OMHE7DrJsrxeB9bN3+wo7+2xNpHM=]]
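
As a rough illustration of the kind of check behind the "unreachable host" warnings above, the sketch below tries a plain TCP connect to a host on two ports. It is not the agent's actual implementation: the class and method names are hypothetical, and only the port numbers 7080 and 7443 are taken from the log lines.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical sketch of a failover-list style reachability test.
class EndpointCheckSketch {
    // Returns true if a TCP connection to host:port succeeds within the timeout.
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Covers NoRouteToHostException, ConnectException, timeouts and unknown hosts.
            return false;
        }
    }

    // Treats the endpoint as reachable if either the plain or the secure port answers.
    static boolean isEndpointReachable(String host, int port, int securePort, int timeoutMillis) {
        return canConnect(host, port, timeoutMillis) || canConnect(host, securePort, timeoutMillis);
    }

    public static void main(String[] args) {
        String host = args.length > 0 ? args[0] : "127.0.0.1";
        // 7080 and 7443 are the ports named in the log output above.
        System.out.println(host + " reachable: " + isEndpointReachable(host, 7080, 7443, 3000));
    }
}
```

Running a check like this from the agent machine against the public endpoint shown under Administration > Servers is a quick way to confirm whether the address in the failover list is usable before restarting the agent.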

Comment 10 John Mazzitelli 2014-10-03 19:36:22 UTC

After more investigation, I'm fairly confident that what happened in this case is that the agent was given a bad server IP at its initialization. So it was never able to connect to the server at startup, which is why it eventually aborted and the VM exited. I think this is the right thing to do - because in this case, the agent is just dead in the water - it has no good server IP address to try, so it will never actually talk to any server.

If you give the agent a valid registration IP such that it can initially register with the server, but the SERVER has a bad public endpoint IP, the agent will spin and wait for a good failover list - repeatedly registering until it gets that good failover list.

Comment 12 Heiko W. Rupp 2014-10-10 14:17:44 UTC

The current agent registration process works as Mazz has pointed out, and it has worked like this for a long time.

Comment 14 John Mazzitelli 2014-10-13 14:43:50 UTC

Viet, it turns out the agent WILL wait indefinitely if it has never registered before. So something else is going on with your use case that I haven't been able to figure out.

Can you turn on agent debug so I can see DEBUG messages, and post the agent log either in this BZ or in the RFE BZ I created (bug #1152154)?

Comment 15 John Mazzitelli 2014-10-13 18:40:37 UTC

There is a setting, rhq.agent.wait-for-server-at-startup-msecs, that does what I think is needed. Just set it to a very large number and the agent will wait that long for the server's public IP to become available.
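
To make the idea of a bounded wait concrete, here is a minimal sketch of waiting up to a configured number of milliseconds for a server to become reachable. Only the setting name rhq.agent.wait-for-server-at-startup-msecs comes from this thread; the class, the method and the polling approach are assumptions for illustration and do not describe how the agent implements the setting.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Hypothetical sketch of "wait up to N milliseconds for the server".
class WaitForServerSketch {
    // Polls the check until it passes or waitMillis has elapsed; returns the final outcome.
    static boolean waitForServer(BooleanSupplier serverReachable, long waitMillis, long pollMillis) {
        // toNanos saturates at Long.MAX_VALUE, so a very large wait does not overflow.
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(waitMillis);
        while (true) {
            if (serverReachable.getAsBoolean()) {
                return true;
            }
            long remainingNanos = deadline - System.nanoTime();
            if (remainingNanos <= 0) {
                return false; // waited the full period without success
            }
            long remainingMillis = Math.max(1L, TimeUnit.NANOSECONDS.toMillis(remainingNanos));
            try {
                Thread.sleep(Math.min(pollMillis, remainingMillis));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
    }

    public static void main(String[] args) {
        // Stand-in check that never succeeds, with a short 200 ms wait for demonstration.
        System.out.println(waitForServer(() -> false, 200L, 50L)); // prints false
    }
}
```

A very large wait value makes a loop like this behave, for practical purposes, like "wait until the server shows up", which is the effect suggested above.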

Comment 16 Viet Nguyen 2014-10-13 19:13:00 UTC

Created attachment 946554 [details]
dying agent with debugging on
