Install everything. Give the server a bogus public endpoint. Register a new agent (make sure rhq.agent.test-failover-list-at-startup remains "true" - this is the default, so there is nothing to do here, I just wanted to point it out). See that the server registers the agent successfully, assigns it a token and sends the token to the agent. The agent, however, fails the server endpoint ping test and retries. By now the token has been assigned on the server, but the agent hasn't persisted it yet, so the retry fails. I haven't tested this, but this is the behavior reported to me, and the code I see seems to confirm this is what will happen. We should persist the token as soon as we get it.
I think the code to fix is in here: org.rhq.enterprise.agent.AgentMain.registerWithServer(long, boolean)
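To illustrate the proposed ordering (persist the token first, then run the endpoint ping test), here is a minimal self-contained sketch. It is not the actual AgentMain code: TokenStore, FakeServer and their methods are hypothetical stand-ins invented for this example, and the only name taken from this report is the rhq.agent.security-token property.

```java
import java.util.HashMap;
import java.util.Map;

final class TokenPersistSketch {

    /** Hypothetical stand-in for the agent's persisted configuration. */
    static final class TokenStore {
        private final Map<String, String> prefs = new HashMap<>();

        void saveToken(String token) {
            prefs.put("rhq.agent.security-token", token);
        }

        String loadToken() {
            return prefs.get("rhq.agent.security-token");
        }
    }

    /** Hypothetical stand-in for the server; its public endpoint is never reachable. */
    static final class FakeServer {
        private final Map<String, String> tokensByAgent = new HashMap<>();

        String register(String agentName, String presentedToken) {
            String known = tokensByAgent.get(agentName);
            if (known == null) {
                // New agent: assign a token and hand it back.
                String fresh = "token-for-" + agentName;
                tokensByAgent.put(agentName, fresh);
                return fresh;
            }
            if (!known.equals(presentedToken)) {
                // Existing agent without its token: reject, as in this report.
                throw new IllegalStateException(
                    "agent [" + agentName + "] is attempting to re-register without a security token");
            }
            return known; // Existing agent with the right token: no new token.
        }

        boolean pingPublicEndpoint() {
            return false; // Simulates the bogus public endpoint.
        }
    }

    /** Persists the token immediately, before the endpoint ping test can fail. */
    static boolean registerWithServer(FakeServer server, TokenStore store, String agentName) {
        String token = server.register(agentName, store.loadToken());
        store.saveToken(token);
        return server.pingPublicEndpoint();
    }

    public static void main(String[] args) {
        TokenStore store = new TokenStore();
        FakeServer server = new FakeServer();
        // First start: registration succeeds, ping fails, token is kept anyway.
        System.out.println("first start reachable: " + registerWithServer(server, store, "agent1"));
        // Second start: the persisted token is presented, so the server does not reject.
        System.out.println("second start reachable: " + registerWithServer(server, store, "agent1"));
    }
}
```

If saveToken were called only after a successful ping, the second call in main would hit the IllegalStateException, which mirrors the "re-register without a security token" rejection.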
I verified this is a problem. Here is the replication procedure.

1) Install server and agent fresh.
2) Change the server's public endpoint address to some value that is not valid (for example, go to Administration>Servers and edit your server, changing the endpoint address to some IP or hostname that isn't valid).
3) Start the new agent clean (rhq-agent.sh -L), giving it the server's good endpoint address so it can connect successfully. Then watch as it says the public endpoint is invalid:

!!! There are [1] servers that are potentially unreachable by this agent.
Please double check all public endpoints of your servers and ensure they are all reachable by this agent.
The failed server endpoints are: [192.168.0.100:7080/7443]
See the Administration (Topology) > Servers in the server GUI to change the public endpoint of a server.
THIS AGENT WILL WAIT UNTIL ONE OF ITS SERVERS BECOMES REACHABLE!

4) Shut down the agent.
5) Now immediately restart the agent with NO command line arguments (specifically, not -L or -l). You will now see the agent being told it is missing a token when it should have one:

----------
The server has rejected the agent registration request. Cause: [org.rhq.core.clientapi.server.core.AgentRegistrationException:The agent [mazztower] is attempting to re-register without a security token. Please consult an administrator to obtain the agent's proper security token and restart the agent with the option "-Drhq.agent.security-token=<the valid security token>". An administrator can find the agent's security token by navigating to the GUI page "Administration (Topology) > Agents" and drilling down to this specific agent. You will see the long security token string there. For more information, read: https://docs.jboss.org/author/display/RHQ/Agent+Registration]
Will retry the agent registration request soon...
----------

This is because the first time you started the agent, it registered OK and got a token. However, it failed to persist the token.
What should have happened is that the agent failed the second time the same way it failed the first time (with a note about the server endpoints being bad). Once the server endpoint is fixed, the agent should be able to reconnect fine - however, it can't, because it never persisted the token.
When testing the fix, you know it works when you see something like this in the server logs:

16:46:42,329 INFO [org.rhq.enterprise.server.core.CoreServerServiceImpl] (http-/0.0.0.0:7080-4) Got agent registration request for new agent: mazztower[192.168.1.2:16163][4.8.0-SNAPSHOT(b9ce1ee)]
16:47:00,335 INFO [org.rhq.enterprise.server.core.CoreServerServiceImpl] (http-/0.0.0.0:7080-4) Agent [mazztower][4.8.0-SNAPSHOT(b9ce1ee)] would like to connect to this server
16:47:00,503 INFO [org.rhq.enterprise.server.core.CoreServerServiceImpl] (http-/0.0.0.0:7080-4) Agent [mazztower] has connected to this server at Fri May 17 16:47:00 EDT 2013
16:47:01,631 INFO [org.rhq.enterprise.server.core.CoreServerServiceImpl] (http-/0.0.0.0:7080-4) Got agent registration request for existing agent: mazztower[192.168.1.2:16163][4.8.0-SNAPSHOT(b9ce1ee)] - Will not regenerate a new token

Notice it says "Got agent registration request for existing agent" - that's what you want to see the second time you start the agent.

If you leave the server endpoint with a bad IP/name, the agent will fail to start both times with the same error:

!!! There are [1] servers that are potentially unreachable by this agent.
Please double check all public endpoints of your servers and ensure they are all reachable by this agent.
The failed server endpoints are: [bogusname:7080/7443]
See the Administration (Topology) > Servers in the server GUI to change the public endpoint of a server.
THIS AGENT WILL WAIT UNTIL ONE OF ITS SERVERS BECOMES REACHABLE!

If you change the server's public endpoint to a correct IP/hostname and restart the agent (withOUT any cmdline arguments, specifically without -L or -l), the agent should start successfully and begin to run normally.
git commit to master: edaffb8