Bug 963982 - if server endpoint check test fails, agent loses token
Summary: if server endpoint check test fails, agent loses token
Keywords:
Status: ON_QA
Alias: None
Product: RHQ Project
Classification: Other
Component: Agent
Version: 4.7
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
Assignee: John Mazzitelli
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 963983
Reported: 2013-05-16 23:40 UTC by John Mazzitelli
Modified: 2022-03-31 04:27 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 963983 (view as bug list)
Environment:
Last Closed:
Embargoed:



Description John Mazzitelli 2013-05-16 23:40:27 UTC
Install everything. Give a bogus public endpoint to the server.

Register a new agent. Make sure the rhq.agent.test-failover-list-at-startup setting remains "true". Since this is the default, there is nothing to do here; I just wanted to point that out.

See that the server registered the agent successfully, assigned it a token, and sent the token to the agent. The agent, however, then fails the server endpoint ping test and retries. By this point the server has assigned the token, but the agent hasn't persisted it yet, so the retry fails.

I haven't tested this, but this is the behavior reported to me and the code I see seems to tell me this is what will happen.

We should persist the token as soon as we get it.
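
To illustrate the proposed ordering, here is a minimal sketch, not the actual AgentMain code. The TokenStore and FakeServer classes and the registerWithServer signature shown here are hypothetical stand-ins invented for this example. The only point is that the token is saved before the endpoint check gets a chance to fail.

```java
import java.util.HashMap;
import java.util.Map;

class TokenPersistenceSketch {

    /** Hypothetical stand-in for the agent's persisted configuration. */
    static final class TokenStore {
        private String token; // null until a token has been persisted
        void save(String newToken) { token = newToken; }
        String load() { return token; }
    }

    /** Hypothetical stand-in for the server: issues one token per agent name. */
    static final class FakeServer {
        private final Map<String, String> tokens = new HashMap<>();
        boolean endpointReachable = false; // simulates the bogus public endpoint

        String register(String agentName, String presentedToken) {
            String known = tokens.get(agentName);
            if (known == null) {
                // First registration: assign a token and send it back.
                known = "token-for-" + agentName;
                tokens.put(agentName, known);
                return known;
            }
            if (!known.equals(presentedToken)) {
                // Mirrors the rejection described in this report.
                throw new IllegalStateException("The agent [" + agentName
                        + "] is attempting to re-register without a security token");
            }
            return known;
        }
    }

    /** Registers, persists the token right away, then runs the endpoint check. */
    static boolean registerWithServer(FakeServer server, TokenStore store, String agentName) {
        String token = server.register(agentName, store.load());
        store.save(token); // persist before anything else can fail
        return server.endpointReachable; // the endpoint test may still fail
    }

    public static void main(String[] args) {
        FakeServer server = new FakeServer();
        TokenStore store = new TokenStore();
        // First start: registration works, endpoint test fails, token is kept.
        System.out.println("first start ok: " + registerWithServer(server, store, "myagent"));
        // Endpoint fixed, agent restarted: the persisted token is presented.
        server.endpointReachable = true;
        System.out.println("second start ok: " + registerWithServer(server, store, "myagent"));
    }
}
```

If the save call were moved after the endpoint check and skipped on failure, the second registration in this sketch would be rejected in the same way the agent is rejected in this report.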

Comment 1 John Mazzitelli 2013-05-16 23:43:07 UTC
I think the code to fix is in here:

org.rhq.enterprise.agent.AgentMain.registerWithServer(long, boolean)

Comment 2 John Mazzitelli 2013-05-17 20:38:29 UTC
I verified this is a problem. Here is the replication procedure.

1) Install server and agent fresh
2) Change the server's public endpoint address to a value that is not valid (for example, go to Administration>Servers and edit your server, changing the endpoint address to an IP or hostname that isn't valid).
3) Start the new agent clean (rhq-agent.sh -L) and provide the server's good endpoint address so it can connect successfully. Then watch as it reports that the public endpoint is invalid:

  !!! There are [1] servers that are potentially unreachable by this agent.
  Please double check all public endpoints of your servers and ensure
  they are all reachable by this agent. The failed server endpoints are:
  [192.168.0.100:7080/7443]
  See the Administration (Topology) > Servers in the server GUI
  to change the public endpoint of a server.
  THIS AGENT WILL WAIT UNTIL ONE OF ITS SERVERS BECOMES REACHABLE!

4) Shutdown the agent.
5) Now immediately restart the agent with NO command line arguments (specifically, not -L or -l).

You will now see the agent being told it is missing a token when it should have one:

----------
The server has rejected the agent registration request. Cause: [org.rhq.core.clientapi.server.core.AgentRegistrationException:The agent [mazztower] is attempting to re-register without a security token. Please consult an administrator to obtain the agent's proper security token and restart the agent with the option "-Drhq.agent.security-token=<the valid security token>". An administrator can find the agent's security token by navigating to the GUI page "Administration (Topology) > Agents" and drilling down to this specific agent. You will see the long security token string there. For more information, read: https://docs.jboss.org/author/display/RHQ/Agent+Registration]
Will retry the agent registration request soon...
----------

This is because the first time you started the agent, it registered OK, and it got a token. However, it failed to persist the token.

What should have happened is that the agent failed the same way the second time as it did the first time (with a note about the server endpoints being bad). Once the server endpoint was fixed, the agent should have been able to reconnect fine. It can't, however, because it never persisted the token.

Comment 3 John Mazzitelli 2013-05-17 20:49:34 UTC
When testing the fix, you know it works when you see something like this in the server logs:

16:46:42,329 INFO  [org.rhq.enterprise.server.core.CoreServerServiceImpl] (http-/0.0.0.0:7080-4) Got agent registration request for new agent: mazztower[192.168.1.2:16163][4.8.0-SNAPSHOT(b9ce1ee)]
16:47:00,335 INFO  [org.rhq.enterprise.server.core.CoreServerServiceImpl] (http-/0.0.0.0:7080-4) Agent [mazztower][4.8.0-SNAPSHOT(b9ce1ee)] would like to connect to this server
16:47:00,503 INFO  [org.rhq.enterprise.server.core.CoreServerServiceImpl] (http-/0.0.0.0:7080-4) Agent [mazztower] has connected to this server at Fri May 17 16:47:00 EDT 2013
16:47:01,631 INFO  [org.rhq.enterprise.server.core.CoreServerServiceImpl] (http-/0.0.0.0:7080-4) Got agent registration request for existing agent: mazztower[192.168.1.2:16163][4.8.0-SNAPSHOT(b9ce1ee)] - Will not regenerate a new token

Notice it says "Got agent registration request for existing agent" - that's what you want to see the second time you start the agent.
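
For anyone who wants to script this check, here is a rough sketch that scans server log lines for the two messages shown above. The RegistrationLogCheck class and its fixVerified method are made up for this example and are not part of RHQ. It assumes the message text matches the sample log exactly.

```java
import java.util.List;

class RegistrationLogCheck {

    // Message prefixes copied from the sample server log in this comment.
    static final String NEW_AGENT = "Got agent registration request for new agent: ";
    static final String EXISTING_AGENT = "Got agent registration request for existing agent: ";

    /**
     * Returns true if the log shows a "new agent" registration for the given
     * agent followed later by an "existing agent" registration for it.
     */
    static boolean fixVerified(List<String> logLines, String agentName) {
        boolean sawNew = false;
        for (String line : logLines) {
            if (line.contains(NEW_AGENT + agentName + "[")) {
                sawNew = true;
            } else if (sawNew && line.contains(EXISTING_AGENT + agentName + "[")) {
                return true;
            }
        }
        return false;
    }
}
```

Calling fixVerified with the lines of the server log and the agent name (mazztower in the sample above) returns true only when the second start was recognized as an existing agent.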

If you leave the server endpoint with a bad IP/name, the agent will fail to start both times with the same error:

!!! There are [1] servers that are potentially unreachable by this agent.
Please double check all public endpoints of your servers and ensure
they are all reachable by this agent. The failed server endpoints are:
[bogusname:7080/7443]
See the Administration (Topology) > Servers in the server GUI
to change the public endpoint of a server.
THIS AGENT WILL WAIT UNTIL ONE OF ITS SERVERS BECOMES REACHABLE!

If you change the server's public endpoint to a correct IP/hostname and restart the agent (withOUT any cmdline arguments, specifically without -L or -l), the agent should start successfully and begin to run normally.

Comment 4 John Mazzitelli 2013-05-17 21:10:55 UTC
git commit to master: edaffb8

