See bug #1145327 for why this has been requested. Provide an option to tell the agent to wait forever when performing its very first registration with its registration server. This will help in cases where you install the agent first, then the server. It is desirable for the agent to wait indefinitely (or at least some long amount of time) for the first server to come online. There actually might already be an option like this; I'll research that first. If there is, I will add a comment here and mark as CLOSED/NOTABUG. But off the top of my head, I don't think there is such a feature, in which case something will need to be implemented.
well, this is interesting. We already do do this. If the agent has never registered before, the agent will wait indefinitely. Nothing you have to do. I might have to close this as NOTABUG and ask more about the use case.
there is a setting rhq.agent.wait-for-server-at-startup-msecs that does what I think is needed. Just set it to a very large number and it will wait that long for the server's public IP to become available.
viet - please see comment #2 and try customizing that setting in your agent prior to starting the agent for the first time. I think this will actually do what you need.
Created attachment 947398: agent log with log wait
Longer wait didn't help. The main complaint seemed to be "Cause: java.net.UnknownHostException: rhqserver". I could only reproduce with the storage node managing agent.
1. Launch docker container without calling jon install script
2. ./rhqctl install --server --storage
3. rm -rf /root/.java
4. edit /opt/rhq-agent/conf/agent-configuration.xml. Set rhq.agent.wait-for-server-at-startup-msecs to 1 hour
5. ./rhqctl start --server --storage --agent
this is really odd - after the agent started shutting down, the registration thread did actually end up getting registered!
2014-10-16 02:50:38,485 INFO [RHQ Agent Registration Thread] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.agent-registration-results}Agent has successfully registered with the server. The results are: [AgentRegistrationResults: [agent-token=a8VtCuU5ugfJT/BVxQkMXaCElwQ2jCRZCVV23akvj+cPz4O6yjdb94WcK+N5Ws+DToI=]]
so, it looks like the server is accessible on that local IP (172.17.0.241 port 7080), because I see this:
=====
2014-10-16 02:46:17,723 DEBUG [RHQ Server Polling Thread] (enterprise.communications.command.client.ServerPollingThread)- {ServerPollingThread.server-poll-failure}Failed to successfully poll the server. This is normally due to the server not being up yet. You can usually ignore this message since it will be tried again later, however, you should ensure this failure was not really caused by a misconfiguration. Cause: org.jboss.remoting.transport.http.WebServerError:<html><head><title>JBoss Web/7.4.8.Final-redhat-4 - JBWEB000064: Error report</title><style><!--H1 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:22px;} H2 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:16px;} H3 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:14px;} BODY {font-family:Tahoma,Arial,sans-serif;color:black;background-color:white;} B {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;} P {font-family:Tahoma,Arial,sans-serif;background:white;color:black;font-size:12px;}A {color : black;}A.name {color : black;}HR {color : #525D76;}--></style> </head><body><h1>JBWEB000065: HTTP Status 404 - /jboss-remoting-servlet-invoker/ServerInvokerServlet</h1><HR size="1" noshade="noshade"><p><b>JBWEB000309: type</b> JBWEB000067: Status report</p><p><b>JBWEB000068: message</b> <u>/jboss-remoting-servlet-invoker/ServerInvokerServlet</u></p><p><b>JBWEB000069: description</b> <u>JBWEB000124: The requested resource is not available.</u></p><HR size="1" noshade="noshade"><h3>JBoss Web/7.4.8.Final-redhat-4</h3></body></html>
=====
That's the 404 HTML error page from the JON Server, which isn't ready yet (clearly, the agent connected to SOMETHING at 172.17.0.241:7080).
But then for some reason, the agent doesn't wait long enough for following registration requests to successfully complete before aborting. I'm not sure why it aborted so fast, when we can clearly see soon after the registration request succeeded and the agent got a token and failover list. Still investigating what this particular use-case is triggering inside the agent.
I see the code below in AgentMain, and this is the code that is spitting out the exception indicated in the log attached to this BZ. The comment in this code seems to describe what the log shows happened: the server was up (we know it is up because we got a response; it was a 404 from the comm servlet, but technically it was up), yet we weren't registered. I'm not sure why this is getting triggered in this BZ use-case but not seen in my other tests. But there might be something we have to do here to allow for the use-case that is failing.
    } else if (mustRegister && !agentIsRegistered) {
        // If we got here, we know a server is up and the agent needs to be registered, but it isn't registered.
        // This usually means an unrecoverable registration error occurred, so abort.
        throw new AgentRegistrationException(
            MSG.getMsg(AgentI18NResourceKeys.AGENT_CANNOT_REGISTER));
I really think this broke as part of the "fix" to bug #987628. It used to be that the agent would just always retry the registration, which would mean the problem in this BZ wouldn't happen because eventually the registration would have succeeded. But now there seems to be a race condition and the agent aborts before the registration thread has a chance to finish doing what it's doing. Will look to see if I can add back some type of retry loop but not have the agent wait an infinite amount of time before aborting.
I have not been able to replicate this. But I added some code to let the agent wait a few more times before aborting completely. The abort is very drastic (the agent VM will die) so we should only do that as a last resort. The new code will now, by default, wait 5 different times before aborting. The "5" is actually configurable by a backdoor system property "rhq.agent.startup-registration-waits" (you can set this in RHQ_AGENT_ADDITIONAL_JAVA_OPTS or add it to agent-configuration.xml). I didn't make this an actual agent preference (i.e. it is not documented in the .xml and AgentConfiguration doesn't know about it), but at least it's here in case some odd use case crops up where it's needed.
commit to master:
commit 6b97b5ce9c146d7fd14bcd0da50e165ccb85e9ea
Author: John Mazzitelli <mazz>
Date: Thu Oct 16 16:06:09 2014 -0400
    BZ 1152154 - wait a few times during startup to ensure the agent has been given a fair chance to register
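For illustration only, here is a rough sketch of what a bounded "wait a few times" loop of this kind could look like. This is not the code from the commit above. The class name, the waitForRegistration helper and the isRegistered callback are all hypothetical; only the system property name and the default of 5 come from the comment above.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

public class StartupRegistrationWait {
    // Property name and default taken from the comment above.
    static final String WAITS_PROP = "rhq.agent.startup-registration-waits";
    static final int DEFAULT_WAITS = 5;

    /**
     * Hypothetical helper (not the real AgentMain logic): checks the registration
     * state up to N times, sleeping waitMillis between checks. Returns true as soon
     * as the agent is registered, or false once all waits are used up, at which
     * point the caller would abort as a last resort.
     */
    static boolean waitForRegistration(BooleanSupplier isRegistered, long waitMillis)
            throws InterruptedException {
        // Fall back to the default if the property is unset or not a number.
        int maxWaits = Math.max(1, Integer.getInteger(WAITS_PROP, DEFAULT_WAITS));
        for (int attempt = 1; attempt <= maxWaits; attempt++) {
            if (isRegistered.getAsBoolean()) {
                return true;
            }
            Thread.sleep(waitMillis);
        }
        // One last check after the final wait.
        return isRegistered.getAsBoolean();
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulate a registration thread that finishes by the third check.
        AtomicInteger checks = new AtomicInteger();
        boolean registered = waitForRegistration(() -> checks.incrementAndGet() >= 3, 10L);
        System.out.println("registered=" + registered);
    }
}
```

In a sketch like this, passing -Drhq.agent.startup-registration-waits=2 on the JVM command line would reduce the number of waits to 2. The point of bounding the loop is to give the registration thread a fair chance to finish without going back to waiting forever.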
Cherry picking over so it makes it in the next QE build for testing.
commit 19ae6a0031ba949349d0ddfc8330d1db2604f790
Author: John Mazzitelli <mazz>
Date: Thu Oct 16 16:06:09 2014 -0400
    BZ 1152154 - wait a few times during startup to ensure the agent has been given a fair chance to register
    (cherry picked from commit 6b97b5ce9c146d7fd14bcd0da50e165ccb85e9ea)
Moving to ON_QA as available to test with latest brew build: https://brewweb.devel.redhat.com//buildinfo?buildID=396547
taking QA contact with the following scenarios in mind for verification:
===
Scenario1: exit/die agent after default attempts (leave rhq.agent.startup-registration-waits not set; rhq.agent.wait-for-server-at-startup-msecs=60000)
Description1: configure the agent with a non-existing server hostname/address and define properties as above. Agent should die after 5 min.
Scenario2: exit/die agent after some attempts (rhq.agent.startup-registration-waits=2; rhq.agent.wait-for-server-at-startup-msecs=60000)
Description2: configure the agent with a non-existing server hostname/address and define properties as above. Agent should die after 2 min.
Scenario3: succeed during default attempts (leave all as defaults)
Description3: configure the agent for a server that is brought up afterwards (but really quickly). Agent should get registered.
===
Scenario3: succeed during default attempts (leave all as defaults)
Description3: configure the agent for a server that is brought up afterwards (but really quickly). Agent should get registered.
The other 2 scenarios were eliminated after discussion with John.
Scenario3 succeeded on Linux. Trying Windows now.
Windows case works fine too. verifying.