Bug 585334

Summary: support having agent change hostname without human intervention
Product: [Other] RHQ Project
Reporter: Charles Crouch <ccrouch>
Component: No Component
Assignee: John Sanda <jsanda>
Status: CLOSED WONTFIX
QA Contact: Mike Foley <mfoley>
Severity: medium
Docs Contact:
Priority: low
Version: unspecified
CC: bmozaffa, hbrock, jshaughn, mazz, sreichar, twilkins
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2014-05-29 15:41:10 UTC Type: ---
Attachments:
Description                          Flags
agent-configuration.xml_ra-130-249   none
rhq-agent-wrapper.sh_ra-130-249      none

Description Charles Crouch 2010-04-23 18:20:20 UTC
From Tim Wilkinson:

"We are currently trying to get Satellite to provision virtual machines with JBoss EAP and the rhq-agent preconfigured
(via post install scripting) so that when booted, JON would be able to 
discover each VM. Although we saw this work by hand, our attempts at 
automating it have hit a snag or two. These VMs can reboot at will and 
using a DHCP naming convention similar to that of RH (e.g., ra-129.230), 
the hostname can change and JON complains that this agent was previously 
registered under a different name and can't re-register under a new name. 
We had no luck at engaging the cleanconfig switch to get around this. 
That's when we turned to ask someone who knows JON if there was a way to 
get cleanconfig working for us (if it will indeed resolve the issue) or if 
there is another possible workaround."

Comment 2 Charles Crouch 2010-04-23 19:03:32 UTC
So to clarify how the agent/server registration should work:

Assume the agent has initially registered successfully with the server.
If the agent starts up with the same "agent name" and same "agent token" as when it first registered, but it has a different hostname, then the agent should be able to negotiate with the server to update the hostname and maintain correct communication.

The key part of this is *not* to run cleanconfig when the agent starts up with the new hostname. When the agent starts up the first time and registers with the server, it writes its entire configuration into Java preferences, stored in the home directory of the user who started the agent. [Note: running agents as a user whose home directory is on a shared drive will therefore not work by default, since each agent will overwrite the configuration of the other.] Also, the next time the agent starts it will not read any settings from ./conf/agent-configuration.xml; instead it will get all its settings from these Java preferences.

If you run cleanconfig on the agent then these Java preferences, in particular the "agent token", will get blown away. Without the token the server will think this is an entirely new agent, not the old agent trying to reconnect.

It sounds like, in this automated setup, you should be running with a pre-configured agent-configuration.xml that sets a unique agent name (obviously not the hostname, which will change) and points to where your JON server is located.
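
As a rough illustration of the "unique agent name that is not the hostname" idea above, a post-install script could generate a name once and reuse it on every boot. This is only a sketch: the function name stable_agent_name, the storage path passed to it, and the agent-<timestamp>-<pid> format are all hypothetical, and how the name is then written into agent-configuration.xml is left out.

```shell
#!/bin/sh
# Hypothetical helper: print an agent name that does not depend on the hostname.
# The name is generated on the first call and stored in the given file;
# later calls read the stored value back unchanged.
stable_agent_name() {
    name_file="$1"   # where the generated name is kept (caller's choice)
    if [ ! -s "$name_file" ]; then
        # Illustrative format only: current epoch seconds plus this shell's PID.
        printf 'agent-%s-%s\n' "$(date +%s)" "$$" > "$name_file"
    fi
    cat "$name_file"
}
```

Because the name is read back from the file on later calls, a DHCP-driven hostname change does not affect it. A name file baked into a VM template would be copied to every clone, so this covers the renamed-host case only, not the cloning case.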

Can you attach the agent-configuration.xml you are using and the *exact* commands you give to the agent to start it initially (when it has never been run on this machine) and when it is started subsequently with a different hostname (if different)?

Comment 3 Tim Wilkinson 2010-04-23 19:52:15 UTC
Created attachment 408717 [details]
agent-configuration.xml_ra-130-249

Comment 4 Tim Wilkinson 2010-04-23 19:53:20 UTC
Created attachment 408718 [details]
rhq-agent-wrapper.sh_ra-130-249

Comment 5 Tim Wilkinson 2010-04-23 19:54:15 UTC
The agent-configuration.xml file and the calling script /etc/init.d/rhq-agent-wrapper.sh have been attached.

Comment 6 Tim Wilkinson 2010-04-23 20:05:29 UTC
There's a little bit more to this attempt that may still be a factor. The issue was initially noticed in our first attempts to automate the registration at startup. The agent name somehow defaulted to localhost.localdomain and registered with the JON server with that agent name rather than the hostname we wanted. Subsequent attempts to re-register the node have led to this issue. We have yet to get the agent name to default to the hostname as expected.

Comment 7 John Mazzitelli 2010-04-23 21:35:48 UTC
this feature might be useful to you:

https://bugzilla.redhat.com/show_bug.cgi?id=535783

"by default, the agent's registration server (that is the server it registers
with at startup) used to be 127.0.0.1 unless you preconfigured the agent.

Now it will first perform a DNS lookup for a machine called "rhqserver" - if
there is one defined, THAT will be the server it will connect to (you still
have to define the port and transport - the defaults remain the same - servlet
and 7080).

If there is no "rhqserver", the default remains the localhost."

I can't remember if this made it into the latest JON release or not.

Comment 8 Tim Wilkinson 2010-04-26 15:59:38 UTC
It must have made it in because it worked. We had been trying to preconfigure the agent but that continued to fail until rhqserver was defined, thanks for mentioning that. I assumed specifying the server using rhq.agent.server.bind-address would have been the answer but it did not succeed.

So now a newly provisioned jboss system can be discovered by the JON server as the jboss system boots. We still have two hurdles to clear ...

1) What will occur if a jboss system is down for a while, its IP lease expires, another system acquires that IP, and the jboss server boots to obtain a different IP?

2) Our next goal is to duplicate the newly registered jboss server via templates so I will have to 'sysprep', if you will, the system to be ready for cloning. Will the cleanconfig option do what I need in this instance?

Comment 9 Tim Wilkinson 2010-04-27 19:41:28 UTC
So far any VMs cloned from template do indeed fail to be discovered by JON. JBoss and the JON agent are running so I'm assuming that my attempts to prep the VM for templating (stop agent, run the agent as the same user who registered it only with --cleanconfig) are not succeeding. This is almost expected because my attempts at cleanconfig ...

  rhq-agent.sh --cleanconfig -c ../conf/agent-configuration.xml

... do not return to the command line prompt but instead sit with a prompt as if the command is incomplete ...

> [root@ra-131-247 ~]# rhq-agent/bin/rhq-agent.sh --cleanconfig -c /root/rhq-agent/conf/agent-configuration.xml
> RHQ 1.3.1.GA [5295] (Wed Feb 24 18:46:23 EST 2010)
>>

... until I eventually ^C out of it and the agent shuts down ...

> RHQ 1.3.1.GA [5295] (Wed Feb 24 18:46:23 EST 2010)
>>
>> Shutting down...
> The agent will wait for [0] threads to die
> Shutdown complete - agent will now exit.

It is assumed that --cleanconfig never actually occurs. What am I doing wrong regarding that option?

Comment 10 John Mazzitelli 2010-04-27 19:53:22 UTC
see the second yellow box at:

http://rhq-project.org/display/JOPR2/RHQ+Agent+Installation#RHQAgentInstallation-ConfiguretheRHQAgent

that starts with "If the agent fails to register with the server..."

If the agent seems to just "hang", look at the agent log file and it will probably tell you what it's waiting for. I suspect it's trying to register with a Server that it cannot communicate with. This is the Server specified in the configuration preferences rhq.agent.server.* - the agent log should tell you more.
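
One way to check the agent log for the registration outcome from a script is sketched below. It assumes only the two message keys that appear in log excerpts elsewhere in this report (AgentMain.agent-registration-results and AgentMain.agent-registration-rejected). The function name registration_status is hypothetical, and the log path is passed in by the caller rather than assumed.

```shell
#!/bin/sh
# Hypothetical helper: report the most recent registration outcome in an agent log.
# Prints "registered", "rejected", or "unknown" if neither message key is present.
registration_status() {
    log_file="$1"
    # Keep only registration-related lines, then look at the last one.
    last=$(grep -E 'AgentMain\.agent-registration-(results|rejected)' "$log_file" | tail -n 1)
    case "$last" in
        *agent-registration-rejected*) echo rejected ;;
        *agent-registration-results*)  echo registered ;;
        *)                             echo unknown ;;
    esac
}
```

Looking at the last matching line rather than the first means an earlier rejection followed by a successful retry is reported as registered.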

Comment 11 Tim Wilkinson 2010-04-27 20:20:50 UTC
Actually, it looks like I've been put into a shell for rhq-agent ...

> help
    avail: Get availability of inventoried resources
   config: Manages the agent configuration
    debug: Provides features to help debug the agent.
discovery: Asks a plugin to run a server scan discovery
 download: Downloads a file from the RHQ Server
dumpspool: Shows the entries found in the command spool file
     exit: Shuts down the agent's communications services and kills the agent
 failover: Provides HA failover functionality
getconfig: Displays one, several or all agent configuration preferences
     help: Shows help for a given command
 identify: Asks to identify a remote server
inventory: Provides information about the current inventory of resources
      log: Configures some settings for the log messages
  metrics: Shows the agent metrics
   native: Obtains native system information
       pc: Starts and stops the plugin container and all deployed plugins
     ping: Pings the RHQ Server
     piql: Executes a PIQL query to search for running processes
  plugins: Updates the agent plugins with the latest versions from the server
     quit: Shuts down the agent's communications services and kills the agent
 register: Registers this agent with the RHQ Server
   sender: Controls the command sender to start or stop sending commands
setconfig: Sets an agent configuration preference
    setup: Sets up the agent configuration by asking a series of questions
 shutdown: Shuts down all communications services without killing the agent
    sleep: Puts the agent prompt to sleep for a given amount of seconds.
    start: Starts the agent comm services so it can accept remote requests
    timer: Times how long it takes to execute another prompt command
   update: Provides agent update functionality
  version: Shows information on agent version and agent environment


The agent log looks to have connected to the server successfully and updated
its plugins ...

2010-04-27 16:09:04,213 INFO  [main]
(org.rhq.enterprise.communications.ServiceContainer)-
{ServiceContainer.started}Service container started - ready to accept incoming
commands
2010-04-27 16:09:05,361 INFO  [RHQ Agent Registration Thread]
(org.rhq.enterprise.agent.AgentMain)-
{AgentMain.agent-registration-results}Agent has successfully registered with
the server. The results are: [AgentRegistrationResults:
[agent-token=1272308665969-1833291284-4956338588785322629]]
2010-04-27 16:09:05,485 ERROR [ClientCommandSenderTask Timer Thread #0]
(org.rhq.enterprise.agent.AgentMain)- {AgentMain.time-not-synced}The server and
agent clocks are not in sync. Server=[1272413394595][April 27, 2010 8:09:54 PM
EDT], Agent=[1272398945484][April 27, 2010 4:09:05 PM EDT]
2010-04-27 16:09:05,492 INFO  [RHQ Server Polling Thread]
(org.rhq.enterprise.agent.PluginUpdate)-
{PluginUpdate.updating-complete}Completed updating the plugins to their latest
versions.
2010-04-27 16:09:05,493 INFO  [RHQ Server Polling Thread]
(enterprise.communications.command.client.ServerPollingThread)-
{ServerPollingThread.server-online}The server has come back online; client has
been told to start sending commands again
2010-04-27 16:09:07,523 INFO  [main] (org.rhq.core.pc.PluginContainer)-
Initializing Plugin Container v1.3.1.GA...
> 2010-04-27 16:09:09,821 INFO  [main] (rhq.core.pc.inventory.InventoryManager)- Initializing Inventory Manager...
2010-04-27 16:09:09,841 INFO  [main] (rhq.core.pc.inventory.InventoryManager)-
Detected new Platform [Resource[id=0, type=Linux, key=ra-130-249.ra.rh.com,
name=ra-130-249.ra.rh.com, parent=<null>, version=Linux 2.6.18-191.el5]] -
adding to local inventory...
2010-04-27 16:09:09,846 INFO  [main] (rhq.core.pc.inventory.InventoryManager)-
Inventory Manager initialized.
2010-04-27 16:09:09,850 INFO  [main]
(rhq.core.pc.inventory.ResourceFactoryManager)- Initializing
2010-04-27 16:09:09,850 INFO  [main] (rhq.core.pc.content.ContentManager)-
Initializing Content Manager...
2010-04-27 16:09:09,851 INFO  [main] (rhq.core.pc.content.ContentManager)-
Initializing scheduled content discovery...
2010-04-27 16:09:09,851 INFO  [main] (rhq.core.pc.content.ContentManager)-
Content Manager initialized...
2010-04-27 16:09:09,852 INFO  [main] (org.rhq.core.pc.PluginContainer)- Plugin
Container initialized.
2010-04-27 16:09:09,854 INFO  [RHQ Primary Server Switchover Thread]
(org.rhq.enterprise.agent.AgentMain)-
{PrimaryServerSwitchoverThread.started}The primary server switchover thread has
started.
2010-04-27 16:09:19,849 INFO  [InventoryManager.discovery-1]
(rhq.core.pc.inventory.AutoDiscoveryExecutor)- Executing server discovery
scan...
2010-04-27 16:09:19,931 INFO  [ResourceDiscoveryComponent.invoker.daemon-1]
(org.rhq.plugins.agent.AgentDiscoveryComponent)- Discovering RHQ Agent...
2010-04-27 16:09:19,940 INFO  [InventoryManager.discovery-1]
(rhq.core.pc.inventory.InventoryManager)- Detected new Server [Resource[id=0,
type=RHQ Agent, key=ra-130-249.ra.rh.com RHQ Agent, name=ra-130-249.ra.rh.com
RHQ Agent, parent=<null>, version=1.3.1.GA]] - adding to local inventory...
2010-04-27 16:09:19,982 INFO  [InventoryManager.discovery-1]
(rhq.core.pc.inventory.InventoryManager)- Sending [server] inventory report to
Server...
2010-04-27 16:09:20,052 INFO  [InventoryManager.discovery-1]
(rhq.core.pc.inventory.InventoryManager)- Syncing local inventory with Server
inventory...
2010-04-27 16:09:20,052 INFO  [InventoryManager.discovery-1]
(rhq.core.pc.inventory.InventoryManager)- Got unknown resource: 10281
2010-04-27 16:09:20,171 INFO  [ResourceContainer.invoker.daemon-1]
(org.rhq.plugins.platform.LinuxPlatformComponent)- Internal yum server is
disabled.
2010-04-27 16:09:20,687 INFO  [ResourceContainer.invoker.daemon-2]
(org.rhq.plugins.jmx.JMXServerComponent)- Starting connection to JMX Server
ra-130-249.ra.rh.com RHQ Agent
2010-04-27 16:09:20,706 INFO  [ResourceContainer.invoker.daemon-2]
(ems.impl.jmx.connection.DConnection)- Querying MBeanServer for all MBeans
2010-04-27 16:09:20,707 INFO  [ResourceContainer.invoker.daemon-2]
(ems.impl.jmx.connection.DConnection)- Found 28 MBeans, starting load
2010-04-27 16:09:20,722 INFO  [ResourceContainer.invoker.daemon-2]
(org.rhq.plugins.jmx.JMXServerComponent)- Starting connection to JMX Server
InternalVM
2010-04-27 16:09:20,723 INFO  [ResourceContainer.invoker.daemon-2]
(ems.impl.jmx.connection.DConnection)- Querying MBeanServer for all MBeans
2010-04-27 16:09:20,723 INFO  [ResourceContainer.invoker.daemon-2]
(ems.impl.jmx.connection.DConnection)- Found 28 MBeans, starting load
2010-04-27 16:09:20,814 INFO  [InventoryManager.discovery-1]
(rhq.core.pc.inventory.AutoDiscoveryExecutor)- Found 1 servers.
2010-04-27 16:09:20,832 INFO  [InventoryManager.availability-1]
(rhq.core.pc.inventory.InventoryManager)- Sending availability report to
Server...
2010-04-27 16:09:25,815 INFO  [InventoryManager.discovery-1]
(rhq.core.pc.inventory.RuntimeDiscoveryExecutor)- Running runtime discovery
scan rooted at [platform]
2010-04-27 16:09:25,850 INFO  [InventoryManager.discovery-1]
(rhq.core.pc.inventory.InventoryManager)- Version of [Resource[id=10298,
type=CPU, key=1, name=CPU 1, parent=ra-130-249.ra.rh.com, version=QEMU Virtual
CPU version 0.9.1]] changed from [] to [QEMU Virtual CPU version 0.9.1]
2010-04-27 16:09:25,857 INFO  [InventoryManager.discovery-1]
(rhq.core.pc.inventory.InventoryManager)- Version of [Resource[id=10297,
type=CPU, key=0, name=CPU 0, parent=ra-130-249.ra.rh.com, version=QEMU Virtual
CPU version 0.9.1]] changed from [] to [QEMU Virtual CPU version 0.9.1]
2010-04-27 16:09:25,869 INFO  [InventoryManager.discovery-1]
(rhq.core.pc.inventory.InventoryManager)- Version of [Resource[id=10301,
type=RHQ Agent JVM, key=InternalVM, name=RHQ Agent JVM,
parent=ra-130-249.ra.rh.com RHQ Agent, version=1.6.0]] changed from [] to
[1.6.0]
2010-04-27 16:09:25,903 INFO  [ResourceDiscoveryComponent.invoker.daemon-1]
(org.rhq.plugins.agent.AgentEnvironmentScriptDiscoveryComponent)- Discovering
RHQ Agent's environment setup script...
2010-04-27 16:09:25,917 INFO  [InventoryManager.discovery-1]
(rhq.core.pc.inventory.InventoryManager)- Version of [Resource[id=10304,
type=Environment Setup Script, key=environment-setup-script,
name=rhq-agent-env.sh, parent=ra-130-249.ra.rh.com RHQ Agent,
version=1.3.1.GA]] changed from [] to [1.3.1.GA]
2010-04-27 16:09:25,917 INFO  [ResourceDiscoveryComponent.invoker.daemon-1]
(org.rhq.plugins.agent.AgentLauncherScriptDiscoveryComponent)- Discovering RHQ
Agent's launcher script service...
2010-04-27 16:09:25,925 INFO  [InventoryManager.discovery-1]
(rhq.core.pc.inventory.InventoryManager)- Version of [Resource[id=10305,
type=Launcher Script, key=launcherscript, name=RHQ Agent Launcher Script,
parent=ra-130-249.ra.rh.com RHQ Agent, version=1.3.1.GA]] changed from [] to
[1.3.1.GA]
2010-04-27 16:09:25,926 INFO  [ResourceDiscoveryComponent.invoker.daemon-1]
(org.rhq.plugins.agent.AgentJavaServiceWrapperDiscoveryComponent)- Discovering
RHQ Agent's JSW service...
2010-04-27 16:09:25,929 INFO  [InventoryManager.discovery-1]
(rhq.core.pc.inventory.RuntimeDiscoveryExecutor)- Scanned [0] servers and found
[0] total descendant Resources.
2010-04-27 16:09:25,929 INFO  [InventoryManager.discovery-1]
(rhq.core.pc.inventory.InventoryManager)- Sending [runtime] inventory report to
Server...
2010-04-27 16:09:25,951 INFO  [InventoryManager.discovery-1]
(rhq.core.pc.inventory.InventoryManager)- Syncing local inventory with Server
inventory...
2010-04-27 16:09:29,849 INFO  [InventoryManager.discovery-1]
(rhq.core.pc.inventory.RuntimeDiscoveryExecutor)- Running runtime discovery
scan rooted at [platform]
2010-04-27 16:09:29,905 INFO  [ResourceDiscoveryComponent.invoker.daemon-1]
(org.rhq.plugins.agent.AgentEnvironmentScriptDiscoveryComponent)- Discovering
RHQ Agent's environment setup script...
2010-04-27 16:09:29,906 INFO  [ResourceDiscoveryComponent.invoker.daemon-1]
(org.rhq.plugins.agent.AgentLauncherScriptDiscoveryComponent)- Discovering RHQ
Agent's launcher script service...
2010-04-27 16:09:29,906 INFO  [ResourceDiscoveryComponent.invoker.daemon-1]
(org.rhq.plugins.agent.AgentJavaServiceWrapperDiscoveryComponent)- Discovering
RHQ Agent's JSW service...
2010-04-27 16:09:29,910 INFO  [InventoryManager.discovery-1]
(rhq.core.pc.inventory.RuntimeDiscoveryExecutor)- Scanned [0] servers and found
[0] total descendant Resources.
2010-04-27 16:09:29,910 INFO  [InventoryManager.discovery-1]
(rhq.core.pc.inventory.InventoryManager)- Sending [runtime] inventory report to
Server...
2010-04-27 16:09:29,934 INFO  [InventoryManager.discovery-1]
(rhq.core.pc.inventory.InventoryManager)- Syncing local inventory with Server
inventory...
2010-04-27 16:10:05,517 ERROR [RHQ Server Polling Thread]
(org.rhq.enterprise.agent.AgentMain)- {AgentMain.time-not-synced}The server and
agent clocks are not in sync. Server=[1272413454627][April 27, 2010 8:10:54 PM
EDT], Agent=[1272399005516][April 27, 2010 4:10:05 PM EDT]
2010-04-27 16:10:09,855 INFO  [MeasurementManager.sender-1]
(rhq.core.pc.measurement.MeasurementCollectorRunner)- Measurement collection
for [123] metrics took 551ms - sending report to Server...
  2010-04-27 16:10:39,855 INFO  [MeasurementManager.sender-2]
(rhq.core.pc.measurement.MeasurementCollectorRunner)- Measurement collection
for [5] metrics took 3ms - sending report to Server...
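
The two time-not-synced errors in the log above carry epoch-millisecond timestamps for the server and the agent. A quick way to see the size of the skew is sketched here, using the values from the first error. The function name clock_skew_seconds is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the server-minus-agent clock skew in whole seconds.
# Both arguments are epoch timestamps in milliseconds.
clock_skew_seconds() {
    server_ms="$1"
    agent_ms="$2"
    echo $(( (server_ms - agent_ms) / 1000 ))
}

# Values copied from the first time-not-synced error in the log above.
clock_skew_seconds 1272413394595 1272398945484   # prints 14449
```

14449 seconds is 4 hours and 49 seconds, which matches the 8:09:54 PM versus 4:09:05 PM wall-clock times printed in the same message.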


If the functionality of --cleanconfig is to restart the agent immediately after
having cleared its persistent preferences, then this will not fit the bill for
our cloning by template needs. The wiped config would need to leave the
rhq-agent stopped and force itself into setup mode (which is hopefully
preconfigured) when the cloned guest is started and not before. 

If this is so, we can change how we invoke cleanconfig to run it immediately
after the newly cloned VM has started for the first time but we have not found
a way out of the apparent shell we enter by executing the command as shown
above. 

We tried the following to see if we could exit the shell ...

[root@ra-130-249 ~]# /root/rhq-agent/bin/rhq-agent.sh --cleanconfig << eof
> quit
> eof
RHQ 1.3.1.GA [5295] (Wed Feb 24 18:46:23 EST 2010)
> Agent no longer accepting input at prompt.
Shutting down...
The agent will wait for [0] threads to die
Shutdown complete - agent will now exit.

... but this shuts down the agent when done and I'm not sure the cleanconfig
has occurred. I will test this further.

Still digging.

Comment 12 John Mazzitelli 2010-04-27 20:43:53 UTC
You are using rhq-agent.sh - it assumes the agent is to be run in the foreground as a console app.  Thus, the agent will provide a "shell" prompt (i.e. it will read stdin and you can enter keyboard input, such as you did there with the "help" option).

If you want to run the agent in the background, you typically will use the init.d script that we provide - rhq-agent-wrapper.sh. You can use that as-is (execute "rhq-agent-wrapper.sh" without args for syntax help) or you can install it as an init.d script for boot-time launching. You can export the env var RHQ_AGENT_CMDLINE_OPTS if you want to pass cmd line args to the agent via rhq-agent-wrapper.sh. Read the comments at the top of rhq-agent-wrapper.sh (and rhq-agent.sh) for more info on all the different environment variables they accept.

Note that if you want to run rhq-agent.sh and put it into background via "&" (as opposed to rhq-agent-wrapper.sh), make sure you at least pass in --daemon as a cmdline argument (this tells the Java agent it will be in background and to not listen for stdin input). The rhq-agent-wrapper.sh does this for you. If you want to see help on the cmdline args, "rhq-agent.sh help" provides you some help. Or see this: http://rhq-project.org/display/JOPR2/RHQ+Agent+Command+Line+Options - specifically you'll see "--daemon" and its description there.
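
Putting the two paragraphs above together, a boot-time script might export RHQ_AGENT_CMDLINE_OPTS and then call the wrapper. This is a minimal sketch with several assumptions: the install path /root/rhq-agent is taken from the transcripts in this report, "start" is assumed to be an accepted wrapper argument as for a typical init.d script, and --daemon is listed explicitly on the assumption that setting the variable replaces whatever default options the wrapper would otherwise pass.

```shell
#!/bin/sh
# Assumed install location; override by setting RHQ_AGENT_HOME beforehand.
RHQ_AGENT_HOME="${RHQ_AGENT_HOME:-/root/rhq-agent}"

# Options handed to the agent by the wrapper. --daemon keeps the agent from
# reading stdin; --cleanconfig is the option discussed in this report.
RHQ_AGENT_CMDLINE_OPTS="--daemon --cleanconfig"
export RHQ_AGENT_CMDLINE_OPTS

wrapper="$RHQ_AGENT_HOME/bin/rhq-agent-wrapper.sh"
if [ -x "$wrapper" ]; then
    # "start" is assumed here; run the wrapper without args to see its real syntax.
    "$wrapper" start
else
    echo "wrapper not found or not executable: $wrapper" >&2
fi
```

Run the wrapper without arguments first, as suggested above, to confirm the syntax it actually accepts.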

Have you read the docs on how to start and configure the agent such that it can run in the background the first time you start it? See these docs, these explain a lot of this:

http://rhq-project.org/display/JOPR2/Running+the+RHQ+Agent#RunningtheRHQAgent-RunningonUnix

http://rhq-project.org/display/JOPR2/RHQ+Agent+Installation

Comment 13 Tim Wilkinson 2010-04-28 14:24:18 UTC
> You are using rhq-agent.sh - it assumes the agent is to be run in the
> foreground as a console app. 

Thanks. I had tried using the wrapper first but did not see the RHQ_AGENT_CMDLINE_OPTS var to help with engaging cleanconfig.

Is it possible that cleanconfig is not doing what we require? Although --cleanconfig is included in RHQ_AGENT_CMDLINE_OPTS, I am not sure it is engaging unless this line in the log is that action ... 

   Agent has been asked to start up clean - cleaning out the data
   directory: data

I'm still trying to determine if the attempt to cleanconfig should occur on the original VM (before a template is produced) OR if the agent config can remain in place, be template cloned into a new VM and then have that new VM run cleanconfig on its first boot so it can register. So far I am still seeing remnants of the first incarnation of the guest [ra-131-247] in JON preventing its new cloned identity [ra-130-223] from registering.

2010-04-28 09:56:56,365 ERROR [RHQ Agent Registration Thread] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.agent-registration-rejected}The server has rejected the agent registration request. Cause: [org.rhq.core.clientapi.server.core.AgentRegistrationException:The agent asking for registration is trying to register the same address/port [172.20.130.223:16163] that is already registered under a different name [ra-131-247.ra.rh.com]; if this new agent is actually the same as the original, then re-register with the same name]
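
To see at a glance which address and which previously registered name are in conflict, the rejection message above can be picked apart with sed. This is a sketch that relies only on the message wording shown in this log line. The function name conflict_details is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: from an agent log, print "<address:port> <registered-name>"
# for each registration rejection caused by an address/port already in use.
conflict_details() {
    log_file="$1"
    # Capture the two bracketed values: the address/port, then the existing name.
    sed -n 's|.*address/port \[\([^]]*\)\] that is already registered under a different name \[\([^]]*\)\].*|\1 \2|p' "$log_file"
}
```

For the error above this would print the address 172.20.130.223:16163 followed by the name ra-131-247.ra.rh.com, that is, the new clone's address paired with the name of the original guest.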


> 
> Have you read the docs on how to start and configure the agent such that it
> can run in the background the first time you start it?

Yes, at least by use of the rhq.agent.configuration-setup-flag so the activity is not interactive, if that's what you mean... but not via the same URLs you provided ...

http://www.redhat.com/docs/en-US/JBoss_ON/2.3/html/Installation_Guide/Installation_Guide-JON_Agent_Installation_Guide-Preconfiguring_the_JON_Agent.html

http://www.redhat.com/docs/en-US/JBoss_ON/html/JON_Agent_Guide/index.html