Bug 534197 (RHQ-1016) - have agent be more tolerant of IP changes
Summary: have agent be more tolerant of IP changes
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: RHQ-1016
Product: RHQ Project
Classification: Other
Component: Agent
Version: unspecified
Hardware: All
OS: All
high
medium
Target Milestone: ---
: ---
Assignee: John Mazzitelli
QA Contact:
URL: http://jira.rhq-project.org/browse/RH...
Whiteboard:
Depends On: RHQ-1223
Blocks:
Reported: 2008-10-22 19:10 UTC by John Mazzitelli
Modified: 2013-09-03 16:56 UTC (History)

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-09-03 16:56:51 UTC
Embargoed:



Description John Mazzitelli 2008-10-22 19:10:00 UTC
If the agent's IP changes underneath it, we'd like a simple agent restart to help it recover. That doesn't happen today: you have to tell the agent explicitly to change its endpoint IP (via the setup questions, for example).

Here is an example of what we want to work but that doesn't work today:

Log onto the VPN and start the agent with your VPN address as its endpoint. Kill the VPN. Restart the agent: you now have to tell the agent to advertise its non-VPN IP. It won't know enough to say, "my VPN address went away, I need to register with my other IP".

The agent should, at some point during startup, check its connector's remote endpoint (the IP the server uses to communicate with the agent). If it finds that its connector bind-address is no longer valid for this agent, the agent should fall back to its default value, which is: InetAddress.getLocalHost().getHostAddress().
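
The startup check described above could look roughly like the following. This is only a sketch and not the agent's actual code: the class and method names (BindAddressCheck, isLocalAddress, resolveBindAddress) are made up for illustration, and treating the loopback and wildcard addresses as always valid is an assumption of this sketch.

```java
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.net.UnknownHostException;

// Hypothetical helper, not part of the RHQ code base.
class BindAddressCheck {

    /** Returns true if the address is currently assigned to a local network interface. */
    public static boolean isLocalAddress(String address) {
        try {
            InetAddress candidate = InetAddress.getByName(address);
            // Assumption: wildcard (0.0.0.0) and loopback are always bindable.
            if (candidate.isAnyLocalAddress() || candidate.isLoopbackAddress()) {
                return true;
            }
            // Returns null when no local interface currently owns this address.
            return NetworkInterface.getByInetAddress(candidate) != null;
        } catch (UnknownHostException | SocketException e) {
            return false;
        }
    }

    /** Keeps the configured address if it is still valid, else falls back to the default. */
    public static String resolveBindAddress(String configured) {
        if (configured != null && isLocalAddress(configured)) {
            return configured;
        }
        try {
            // The default value named in this report.
            return InetAddress.getLocalHost().getHostAddress();
        } catch (UnknownHostException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // 192.0.2.1 is a documentation-only address, so it should not be local.
        System.out.println("127.0.0.1 local? " + isLocalAddress("127.0.0.1"));
        System.out.println("192.0.2.1 local? " + isLocalAddress("192.0.2.1"));
        try {
            System.out.println("effective: " + resolveBindAddress("192.0.2.1"));
        } catch (UncheckedIOException e) {
            System.out.println("could not determine the local host address: " + e.getMessage());
        }
    }
}
```

In the VPN example, the stale VPN address would fail the isLocalAddress check after the VPN goes away, so a restart would pick up whatever getLocalHost() reports at that time.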

Comment 1 John Mazzitelli 2008-10-24 13:50:27 UTC
This can be really bad, especially in the situation I just had.

My wireless network blipped.  After it came back on, my machines got assigned different IPs when they performed their DHCP handshake.

In fact, two of my agents swapped IP addresses.  Originally, I had:

agent1 = 192.168.0.5
agent2 = 192.168.0.6

After my wireless router reboot, I had this:

agent1 = 192.168.0.6
agent2 = 192.168.0.5

How to recover?  You would think I could go to agent1, do a --cleanconfig, and enter its new .6 address as the setup answer for the agent IP. Nope! --cleanconfig will purge the security token, and the server will then disallow an "unknown" agent from stealing another agent's IP.  So, I went to agent2 and, instead of --cleanconfig, I ran the agent with --nostart (so it doesn't try to bind to its old, now invalid, IP) and I did:

setconfig rhq.communications.connector.bind-address=192.168.0.5

This set the agent's IP to its correct one. I then started the agent.  But this failed to register because in the database there is ALREADY an agent on this IP (agent1 had it originally and I haven't moved it off that IP yet), so we get a constraint violation as expected.  It's a chicken-and-egg problem: I can't set the IP on agent2 without freeing that IP from agent1, and I can't set the IP on agent1 without freeing that IP from agent2 (because agent2 had that IP before).

I fixed it by running --cleanconfig and setting agent1's IP to its correct IP but with a DIFFERENT port than the one agent2 was using.  Because the constraint is on both IP and port, this saved to the database successfully. I then went to the other agent and registered it with the correct IP and original port, and it worked.  Then I went back to the first agent and reconfigured it to go back to the port it was originally on.  Phew.  Swapped the IPs.

This is ugly; we need to handle IP changes on agent machines more gracefully.
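
To make the ordering of that workaround easier to follow, here is a small in-memory model of it. It is only an illustration: AgentRegistry is a hypothetical stand-in for the server's uniqueness constraint on IP and port, not the real server code or schema, and the port numbers are arbitrary example values.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of a unique constraint on (address, port); not RHQ code.
class AgentRegistry {
    // Maps "address:port" to the name of the agent registered on that endpoint.
    private final Map<String, String> endpoints = new HashMap<>();

    /** Moves an agent to a new endpoint; fails if another agent already holds it. */
    public void register(String agent, String address, int port) {
        String key = address + ":" + port;
        String owner = endpoints.get(key);
        if (owner != null && !owner.equals(agent)) {
            throw new IllegalStateException("constraint violation: " + key + " is held by " + owner);
        }
        endpoints.values().remove(agent); // release the agent's previous endpoint, if any
        endpoints.put(key, agent);
    }

    public String ownerOf(String address, int port) {
        return endpoints.get(address + ":" + port);
    }

    public static void main(String[] args) {
        AgentRegistry registry = new AgentRegistry();
        registry.register("agent1", "192.168.0.5", 16163);
        registry.register("agent2", "192.168.0.6", 16163);

        // A direct swap fails: agent1 still holds 192.168.0.5 on the same port.
        try {
            registry.register("agent2", "192.168.0.5", 16163);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }

        // Workaround: park agent1 on a temporary port, move agent2, then restore the port.
        registry.register("agent1", "192.168.0.6", 16164);
        registry.register("agent2", "192.168.0.5", 16163);
        registry.register("agent1", "192.168.0.6", 16163);

        System.out.println("192.168.0.5 -> " + registry.ownerOf("192.168.0.5", 16163));
        System.out.println("192.168.0.6 -> " + registry.ownerOf("192.168.0.6", 16163));
    }
}
```

The temporary port works only because the constraint covers the address and port together; a constraint on the address alone would leave no free slot to park the first agent in.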

Comment 2 John Mazzitelli 2008-12-10 03:44:11 UTC
Lowering to minor. The subtask makes it easier to tolerate IP changes. You have to explicitly set the IP to enter a use case where it's harder to recover for that subtask fix to help.

For this particular issue, it's still a problem but:

a) it will be a rare occurrence now that our default is to pick up the new IP when it changes
b) the only way the problem happens is if two registered agents swap IP addresses - another rare occurrence

Comment 3 Red Hat Bugzilla 2009-11-10 20:21:55 UTC
This bug was previously known as http://jira.rhq-project.org/browse/RHQ-1016


Comment 4 wes hayutin 2010-02-16 17:08:44 UTC
mass add of key word FutureFeature to help track

Comment 5 John Sanda 2010-08-18 17:02:13 UTC
I initially registered an agent with my server. Then while the agent was running, I changed the IP address on the agent machine. When I restarted my agent without any arguments or flags, I got the following error,

FATAL [main] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.start-failure}Failed to start the agent
java.net.BindException: Cannot assign requested address
        at java.net.PlainSocketImpl.socketBind(Native Method)
        at java.net.PlainSocketImpl.bind(PlainSocketImpl.java:365)
        at java.net.ServerSocket.bind(ServerSocket.java:319)
        at java.net.ServerSocket.<init>(ServerSocket.java:185)
        at javax.net.DefaultServerSocketFactory.createServerSocket(ServerSocketFactory.java:170)
        at org.jboss.remoting.transport.socket.SocketServerInvoker.createServerSocket(SocketServerInvoker.java:264)
        at org.jboss.remoting.transport.socket.SocketServerInvoker.start(SocketServerInvoker.java:193)
        at org.jboss.remoting.transport.Connector.start(Connector.java:324)
        at org.rhq.enterprise.communications.ServiceContainer.setupServerConnector(ServiceContainer.java:1226)
        at org.rhq.enterprise.communications.ServiceContainer.start(ServiceContainer.java:550)
        at org.rhq.enterprise.communications.ServiceContainer.start(ServiceContainer.java:468)
        at org.rhq.enterprise.agent.AgentMain.startCommServices(AgentMain.java:2172)
        at org.rhq.enterprise.agent.AgentMain.start(AgentMain.java:638)
        at org.rhq.enterprise.agent.AgentMain.main(AgentMain.java:415)

Then I tried restarting my agent with --cleanconfig, and it seems that the agent is able to register with the server. The last thing I see in my agent log is,

INFO  [main] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.waiting-to-be-registered-begin}The agent will now wait until it has registered with the server...

Next I restarted my agent with --nostart and changed the registered IP address with the setconfig command as described in comment 1. Finally, I restarted my agent with --cleanconfig, and I was able to register with and connect to my server.

Going through and reproducing this helped me better understand the issue, although I am not clear how we want to handle the IP address change.

Comment 6 John Mazzitelli 2010-08-23 15:24:54 UTC
With bug #534427 implemented, we now support dynamic IP lookup at agent startup/registration. So the agent can tolerate IP changes by simply being recycled: you should be able to do a "shutdown" then "start" (or the restart operation) without killing the agent VM and still get it to pick up the new IP.

There is a rare situation that will need more manual intervention:

If two registered agents want to swap IP addresses, you'd have to do the manual steps described in earlier comments here (or do some basic SQL manipulation, if you want to tweak the RHQ_AGENT table under the covers).

This can be closed since we are now more tolerant of IP changes given the 534427 fix (which essentially leaves the bind-address preference set to null, which tells the agent to detect its IP dynamically at startup rather than requiring it to be explicitly set).
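
As a rough illustration of that "unset means detect at startup" behavior, the lookup can be pictured as below. This is a sketch under assumptions, not the agent's implementation: it uses a plain java.util.Properties object in place of the agent's real preference store, and the DynamicBindAddress class and its methods are hypothetical. Only the preference name rhq.communications.connector.bind-address is taken from this report.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Properties;
import java.util.function.Supplier;

// Hypothetical sketch; the real agent stores its configuration differently.
class DynamicBindAddress {
    static final String BIND_ADDRESS = "rhq.communications.connector.bind-address";

    /** Uses the explicit setting if present, otherwise asks the detector at call time. */
    public static String effectiveBindAddress(Properties config, Supplier<String> detector) {
        String explicit = config.getProperty(BIND_ADDRESS);
        if (explicit == null || explicit.trim().isEmpty()) {
            return detector.get(); // nothing pinned: detect on every startup
        }
        return explicit.trim();
    }

    /** Default detector: whatever address the local host resolves to right now. */
    public static String detectLocalAddress() {
        try {
            return InetAddress.getLocalHost().getHostAddress();
        } catch (UnknownHostException e) {
            throw new IllegalStateException("cannot determine the local address", e);
        }
    }

    public static void main(String[] args) {
        Properties config = new Properties();
        try {
            // Unset preference: the address is looked up dynamically.
            System.out.println(effectiveBindAddress(config, DynamicBindAddress::detectLocalAddress));
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
        // Explicit preference: the pinned value wins, even if it has gone stale.
        config.setProperty(BIND_ADDRESS, "192.168.0.5");
        System.out.println(effectiveBindAddress(config, DynamicBindAddress::detectLocalAddress));
    }
}
```

Because nothing is stored when the preference is unset, a restart after an IP change has no stale value to trip over; only an explicitly pinned address can become invalid.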

Comment 7 Rajan Timaniya 2010-10-25 09:59:32 UTC
Tested on RHQ-Master build #500

Steps:
1) Installed RHQ server and agent
2) Import all resources and checked agent 'Agent Bind Address'
3) Changed agent IP address (connect through VPN)
5) 'shutdown' and 'start' agent

Observation:
Agent shutdown successfully but start operation gives BindException

> shutdown
Shutting down...
Shutdown complete.
> start
Starting the agent...
Failed to start the agent
java.net.BindException: Cannot assign requested address
	at java.net.PlainSocketImpl.socketBind(Native Method)
	at java.net.PlainSocketImpl.bind(PlainSocketImpl.java:365)
	at java.net.ServerSocket.bind(ServerSocket.java:319)
	at java.net.ServerSocket.<init>(ServerSocket.java:185)
	at javax.net.DefaultServerSocketFactory.createServerSocket(ServerSocketFactory.java:170)
	at org.jboss.remoting.transport.socket.SocketServerInvoker.createServerSocket(SocketServerInvoker.java:264)
	at org.jboss.remoting.transport.socket.SocketServerInvoker.start(SocketServerInvoker.java:193)
	at org.jboss.remoting.transport.Connector.start(Connector.java:324)
	at org.rhq.enterprise.communications.ServiceContainer.setupServerConnector(ServiceContainer.java:1226)
	at org.rhq.enterprise.communications.ServiceContainer.start(ServiceContainer.java:550)
	at org.rhq.enterprise.communications.ServiceContainer.start(ServiceContainer.java:468)
	at org.rhq.enterprise.agent.AgentMain.startCommServices(AgentMain.java:2178)
	at org.rhq.enterprise.agent.AgentMain.start(AgentMain.java:639)
	at org.rhq.enterprise.agent.promptcmd.StartPromptCommand.execute(StartPromptCommand.java:57)
	at org.rhq.enterprise.agent.AgentMain.executePromptCommand(AgentMain.java:2768)
	at org.rhq.enterprise.agent.AgentMain$4.run(AgentMain.java:2676)
	at java.lang.Thread.run(Thread.java:619)

Comment 8 Rajan Timaniya 2010-10-26 07:12:13 UTC
Retested with following steps:
1) Installed RHQ server and agent
2) Import all resources and checked agent 'Agent Bind Address'
3) Changed agent IP address (connect through VPN)
5) Shutdown agent
6) Start agent with --nostart
7) Check bind address (getconfig rhq.communications.connector.bind-address)

Observation:
> shutdown
Shutting down...
Shutdown complete.

> [root@dhcp1-1 bin]# ./rhq-agent.sh --nostart
RHQ 4.0.0-SNAPSHOT [e17fe82] (Mon Oct 25 08:16:41 IST 2010)

> getconfig rhq.communications.conector.bind-address
rhq.communications.conector.bind-address=<unknown>

Bind address shows unknown.

Comment 9 John Mazzitelli 2011-02-16 14:23:21 UTC
(In reply to comment #8)
> Retested with following steps:
> 1) Installed RHQ server and agent
> 2) Import all resources and checked agent 'Agent Bind Address'
> 3) Changed agent IP address (connect through VPN)
> 5) Shutdown agent
> 6) Start agent with --nostart
> 7) Check bind address (getconfig rhq.communications.connector.bind-address)
> 
> Observation:
> > shutdown
> Shutting down...
> Shutdown complete.
> 
> > [root@dhcp1-1 bin]# ./rhq-agent.sh --nostart
> RHQ 4.0.0-SNAPSHOT [e17fe82] (Mon Oct 25 08:16:41 IST 2010)
> 
> > getconfig rhq.communications.conector.bind-address
> rhq.communications.conector.bind-address=<unknown>
> 
> Bind address shows unknown.

That is correct. The agent's own bind address will not be explicitly set in the config but rather determined at run time. Did you start the agent? Can it connect and register with the server properly? Can you then ping the server?

In order to confirm this issue is fixed, you have to start the agent and see it communicating with the server (see bug #534427).

Comment 10 Mike Foley 2011-06-01 19:22:51 UTC
this is important for ec2/cloud

Comment 11 Mike Foley 2011-12-13 20:34:58 UTC
I also have a router which gives me new IP addresses when it disconnects ...

and I am verifying that I can restart the agent with --cleanconfig and connect to my server

Comment 12 Viet Nguyen 2012-11-15 16:36:20 UTC
I had the opposite issue.  JON server IP lease expired over night.

Workaround
1.change "rhq.agent.server.bind-address" in  ~/.java/.userPrefs/rhq-agent/default/prefs.xml 

2. restart Agent.  Got this message but after awhile Agent was up again in JON UI

RHQ 4.4.0.JON311GA [35e0244] (Wed Sep 19 20:38:03 EDT 2012)
!!! There are [1] servers that are potentially unreachable by this agent.
Please double check all public endpoints of your servers and ensure
they are all reachable by this agent. The failed server endpoints are:
[10-16-120-134.dhcp.rhq.lab.eng.bos.redhat.com:80/7443]
See the Administration (Topology) > Servers in the server GUI
to change the public endpoint of a server.
THIS AGENT WILL WAIT UNTIL ONE OF ITS SERVERS BECOMES REACHABLE!

Comment 13 John Mazzitelli 2012-11-15 16:44:49 UTC
(In reply to comment #12)
> I had the opposite issue.  JON server IP lease expired over night.
> 
> Workaround
> 1.change "rhq.agent.server.bind-address" in 
> ~/.java/.userPrefs/rhq-agent/default/prefs.xml 
> 
> 2. restart Agent.  Got this message but after awhile Agent was up again in
> JON UI
> 
> RHQ 4.4.0.JON311GA [35e0244] (Wed Sep 19 20:38:03 EDT 2012)
> !!! There are [1] servers that are potentially unreachable by this agent.
> Please double check all public endpoints of your servers and ensure
> they are all reachable by this agent. The failed server endpoints are:
> [10-16-120-134.dhcp.rhq.lab.eng.bos.redhat.com:80/7443]
> See the Administration (Topology) > Servers in the server GUI
> to change the public endpoint of a server.
> THIS AGENT WILL WAIT UNTIL ONE OF ITS SERVERS BECOMES REACHABLE!

This is unrelated to this BZ and is a very different issue than what this BZ is describing. This BZ is related to the issue when the AGENT IP changes, not when the server IP changes.

Comment 14 Heiko W. Rupp 2013-09-03 16:56:51 UTC
Bulk closing of old issues that are in VERIFIED state.

