Created attachment 940172 [details]
agent.log

Description of problem:
I misconfigured the server public endpoint port, and as a result the default storage-node agent died upon registration. See attachment.

Version-Release number of selected component (if applicable):
JON 3.3.0 ER03

How reproducible:
50%

Steps to Reproduce:
1. Set the rhq-server.properties property rhq.communications.connector.bind-port to a random port
2. ./rhqctl install --server --storage
3. ./rhqctl start
Correction to reproduction step 1: set rhq.communications.connector.bind-address to an invalid address. Setting bind-port has no effect (BZ 1145338).
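For example, the corrected step 1 might look like this in rhq-server.properties (the address is only an illustrative unreachable value, not one from the actual report):

# rhq-server.properties - set the public endpoint to an address agents cannot reach
# (10.0.0.99 is just an example of an unreachable address)
rhq.communications.connector.bind-address=10.0.0.99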
This is working as expected. I'm not sure what the expected results would be, other than what it is doing. The agent log says:

2014-09-22 18:41:20,463 FATAL [main] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.startup-error}The agent encountered an error during startup and must abort
org.rhq.core.clientapi.server.core.AgentRegistrationException: The agent cannot register with the server. Admin intervention needed!

Which is true. You gave a bad server endpoint - the agent can't talk to the server. The agent didn't have a failover list (from a previous registration with that server), so the agent has no idea where any server is, and thus it is dead in the water. It can't do anything until it's told where a server is. That's why it is telling you "Admin intervention needed" - you have to tell the agent where a server is, since your initial config was bad and it had no failover list to try.

Is there something I am misunderstanding? What would you think the expected behavior should be in this circumstance?
I think aborting the registration process is a bit heavy-handed. For example, could the agent go into a wait-until-server-ready-with-timeout mode? That would definitely make the agent more resilient and improve the user experience. It's like seeing your email app die on startup because you haven't plugged in the network cable. In a container-based setup it's possible to have the app up and running inside the container, with the network wiring to follow as a separate step.
(In reply to Viet Nguyen from comment #4)
> I think aborting the registration process is a bit heavy-handed. For
> example, could the agent go into a wait-until-server-ready-with-timeout
> mode? That would definitely make the agent more resilient and improve the
> user experience. It's like seeing your email app die on startup because
> you haven't plugged in the network cable. In a container-based setup it's
> possible to have the app up and running inside the container, with the
> network wiring to follow as a separate step.

But what would it be waiting for in this case? The IP address was bad. It will never get a server connection - the agent would wait indefinitely and essentially be dead in the water. It needs some admin intervention to tell it the real IP address.

Now, in the case that you DID give it a good IP address and it's just that the server is not up yet, we actually DO have a wait-until-server-ready-with-timeout. The agent is given 1 minute to retry. From AgentMain:

private class RegisterStateListener implements ClientCommandSenderStateListener {
    public boolean startedSending(ClientCommandSender sender) {
        // spawn the register thread - wait for it to complete because once we return, the other state listeners will
        // be called and if they want to talk to the server, we must first be registered in order to be able to do that
        registerWithServer(60000L, false);

In that registerWithServer, we use some simple heuristics to wait for the server. We'll retry at most 5 times, and adjust our wait intervals to help avoid log jams of multiple agents clobbering the server with requests:

if (!got_registered) {
    retry_interval = (retry_interval < 60000L) ? (retry_interval * 2) : 60000L;
    LOG.warn(t, AgentI18NResourceKeys.AGENT_REGISTRATION_FAILURE, retry, retry_interval,
        ThrowableUtil.getAllMessages(t));

So, in short, we do attempt to retry with a timeout - but if the agent can't connect in 5 attempts over a span of tens of seconds, the agent will still abort. The only alternative I could recommend is an agent configuration setting that says, "Don't ever abort registration - keep trying until you succeed." If that were implemented and you had it turned on, your agent would essentially spin its wheels indefinitely, because you gave it a bad IP address and the agent would NEVER succeed. But if the IP is good and you just need to wait for the server to start, you could start the server any time in the future and the agent would connect, since it would never abort. Let me know if that is something you think you need.
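For readers following along, here is a minimal, hypothetical sketch of the capped-exponential-backoff retry pattern described above. tryRegister() is a stand-in for a single registration attempt; this is NOT the actual AgentMain code:

// Hypothetical sketch of the retry-with-backoff behavior described above.
boolean registerWithRetries() throws InterruptedException {
    long retryInterval = 1000L; // grows up to a 60-second cap
    for (int retry = 1; retry <= 5; retry++) { // the agent gives up after 5 attempts
        if (tryRegister()) { // hypothetical stand-in for one registration attempt
            return true;
        }
        // double the wait, capped at 60s, so multiple agents don't clobber the server at once
        retryInterval = (retryInterval < 60000L) ? (retryInterval * 2) : 60000L;
        Thread.sleep(retryInterval);
    }
    return false; // caller aborts the agent, as described above
}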
Oh! And you know what - I just looked closer at the log. This is different from what I thought. You gave the SERVER a bad public endpoint in its configuration (Admin > Servers). So the agent did register OK, because you gave the agent a good registration IP, but the failover list it got back (which replaces the server endpoint the agent used for that initial registration/connect) was bad! So now the agent doesn't know what's going on, because its failover list contains only bad IP addresses and it can't connect anymore. This is slightly different from what I thought, though I think the analysis in my earlier comment still holds true.
Since the server public endpoint can be updated/corrected post-install in the JON UI, this is a recoverable error, not a catastrophic one that warrants an agent shutdown. A configurable option would be great!
Something else must have gone on, because I just tried the following and I'm not seeing the agent abort. It just keeps trying and trying to register:

1) Install everything, but only start the server and storage node.
2) Go to the Admin > Servers GUI page and change the server public IP to a bogus IP.
3) Start the agent and point it at the actual server IP (so it can register).
4) Notice the server gets the registration for the agent and sends down the failover list in the registration response, but the agent verifies the failover list and sees that it's got the bad IP in it. That's the message you see:

2014-09-22 18:42:03,280 WARN [RHQ Agent Registration Thread] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.failover-list-check-failed}!!! There are [1] servers that are potentially unreachable by this agent. Please double check all public endpoints of your servers and ensure they are all reachable by this agent. The failed server endpoints are: [10.16.23.108:7080/7443] See the Administration (Topology) > Servers in the server GUI to change the public endpoint of a server. THIS AGENT WILL WAIT UNTIL ONE OF ITS SERVERS BECOMES REACHABLE!

But in my tests, the agent just keeps trying and trying and never aborts. I'm wondering what else is different that is causing your agent to abort. Can you provide more specific and detailed replication steps? The steps I tried above don't cause the problem being reported.
Oh, and when I went into the GUI and changed the server public IP back to the real IP, the agent's next registration request fully succeeded. This is what the logs show:

... series of warnings about the bad server IP in the failover list ...

2014-10-01 16:25:26,312 WARN [RHQ Agent Registration Thread] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.failover-list-unreachable-host}Failover list has an unreachable host [192.168.1.222] (tested ports [7080] and [7443]). Cause: java.net.NoRouteToHostException:No route to host
2014-10-01 16:26:02,408 WARN [RHQ Agent Registration Thread] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.failover-list-unreachable-host}Failover list has an unreachable host [192.168.1.222] (tested ports [7080] and [7443]). Cause: java.net.NoRouteToHostException:No route to host
2014-10-01 16:26:32,503 INFO [RHQ Agent Registration Thread] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.agent-registration-results}Agent has successfully registered with the server. The results are: [AgentRegistrationResults: [agent-token=dgQwfPj61GioLf/CwmHSVAvKZB64lrQoOq8u25+OMHE7DrJsrxeB9bN3+wo7+2xNpHM=]]
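For illustration, a hypothetical sketch of the kind of reachability test those log lines reflect (the real check lives inside the agent; the 5-second timeout and method name are assumptions, not the actual implementation). An endpoint is flagged unreachable only if connecting to both public ports fails:

// Hypothetical sketch of a host:port reachability probe, not the agent's actual code.
boolean isReachable(String host, int port) {
    try (java.net.Socket socket = new java.net.Socket()) {
        socket.connect(new java.net.InetSocketAddress(host, port), 5000); // 5s timeout (assumed)
        return true;
    } catch (java.io.IOException e) {
        return false; // e.g. java.net.NoRouteToHostException: No route to host
    }
}

// tests both public ports, mirroring "tested ports [7080] and [7443]" in the log
boolean serverReachable = isReachable("192.168.1.222", 7080) || isReachable("192.168.1.222", 7443);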
After more investigation, I'm fairly confident about what happened in this case: the agent was given a bad server IP at its initialization. So it was never able to connect to the server at startup, which is why it eventually aborted and the VM exited. I think this is the right thing to do, because in this case the agent is just dead in the water - it has no good server IP address to try, so it will never actually talk to any server. If you give the agent a valid registration IP such that it can initially register with the server, but the SERVER has a bad public endpoint IP, the agent will spin and wait for a good failover list - repeatedly registering until it gets that good failover list.
The current agent registration process works, as Mazz has pointed out, and it has worked like this for a long time.
Viet - it turns out the agent WILL wait indefinitely if it has never registered before. So something else is going on with your use case that I haven't been able to figure out. Can you turn on agent debug so I can see DEBUG messages, and post the agent log either in this BZ or in the RFE BZ I created: bug #1152154
There is a setting, rhq.agent.wait-for-server-at-startup-msecs, that does what I think is needed. Just set it to a very large number and the agent will wait that long for the server's public IP to become available.
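A sketch of what that might look like, assuming the usual agent-configuration.xml <entry> format (the one-hour value is just an example, not a recommendation):

<entry key="rhq.agent.wait-for-server-at-startup-msecs" value="3600000" />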
Created attachment 946554 [details]
dying agent with debugging on