Created attachment 575321 [details]
excerpt of agent.log illustrating the issue

Description of problem:
This is a rare occurrence, but it has happened. The agent registration, due to environmental reasons (such as a slow network link between the agent and the server), took more than a minute. This causes the plugins to fail to update at startup, even if the agent is configured to do so.

The reasons for this behavior are somewhat convoluted:
1) agent starts comms
2) agent queues 3 listeners for when the comms are up, which are executed in this order in a separate thread from the main startup thread:
   a) register with the server, with a timeout of 60s
   b) update plugins
   c) set up the conditional restart of the plugin container in case the comms flip down
3) agent waits for the server to be up
4) agent waits to be registered with the server
5) agent starts up the plugin container
6) normal operation

The above workflow allows for the following "race condition":
0) comms are up
1) registration starts
2) 60 seconds pass
3) attempt to update plugins fails
4) registration completes
5) startup continues

IMHO, the attempt to update the plugins should be moved to after the wait for successful registration so that the agent doesn't miss the opportunity to update them just because it was "impatient".

Version-Release number of selected component (if applicable):
all

How reproducible:
rarely

Steps to Reproduce:
1. Make sure that the connection to at least 1 server in the HA cluster is extremely slow.
2. Start up the agent.

Actual results:
The log shows a failure to upgrade plugins followed by a successful registration message.

Expected results:
Plugins are updated once the registration completes, no matter how long it takes.

Additional info:
Attached is an excerpt from the agent.log that illustrates this issue.
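A minimal sketch of the proposed reordering, assuming hypothetical method names that stand in for the real AgentMain internals (none of these names are from the actual code):

// Hypothetical sketch only; method names do not match the real AgentMain internals.
public class StartupOrderSketch {

    public void startupSequence() throws InterruptedException {
        startComms();                       // 1) start comms
        registerWithServerAsync(60_000L);   // 2a) register asynchronously, 60s timeout
        waitForServerUp();                  // 3) wait for the server to be up
        waitUntilRegistered();              // 4) wait until registration has completed

        // Proposed change: attempt the plugin update only after registration,
        // so a slow registration can no longer cause the update to be skipped.
        updatePluginsAtStartup();

        startPluginContainer();             // 5) start the plugin container
    }

    // No-op stubs standing in for the real agent logic.
    private void startComms() {}
    private void registerWithServerAsync(long timeoutMillis) {}
    private void waitForServerUp() throws InterruptedException {}
    private void waitUntilRegistered() throws InterruptedException {}
    private void updatePluginsAtStartup() {}
    private void startPluginContainer() {}
}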
I have a feeling the problem is caused by the testing of the failover list at startup. When the agent starts up, it attempts to do a quick test connect to all the servers in the failover list and logs warnings if one or more servers cannot be connected to. If one or more RHQ Servers are down or otherwise cannot be connected to by the agent, it's possible those test connections will hang (this could explain the long 6-minute delay).

I have a feeling that implementing bug #810302 will help work around this problem. If you have a slow network and/or a lot of RHQ Servers in the RHQ HA server cloud and/or you have one or more RHQ Servers consistently down or out of service, then you should use the new setting that bug #810302 will provide to turn off this startup testing of the connectivity to all servers in the failover list. Rather than changing the delicate internals of the AgentMain code, I would wait to see if this new feature solves the problem.
Here's a note on what is happening: when the agent starts up, it downloads its failover list and then loops through all of its entries. If the agent can connect to at least one of them, it logs a warning about any it could not reach but continues the agent startup. HOWEVER, if ALL of the entries in the failover list are down (that is, the agent cannot connect to any servers in its failover list), the agent will sleep for 30 seconds and then retry. It will continue to do this until at least one server is ready and can be connected to.

The more I look at this, the more I think this is what the problem may have been. 6 minutes is 12 passes through this loop - so I suspect what may have happened is that the network was out or the server(s) in the agent's failover list were not accessible during that time. This is just a guess, but it makes sense.
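A minimal sketch of that retry loop, under the assumption that the real code behaves as described above (the ServerEntry type and canConnect helper are hypothetical stand-ins, not the actual agent classes):

import java.util.List;

// Hypothetical sketch of the startup failover-list test loop described above.
public class FailoverTestSketch {

    record ServerEntry(String address, int port) {}

    void waitForAnyReachableServer(List<ServerEntry> failoverList) throws InterruptedException {
        while (true) {
            boolean anyReachable = false;
            for (ServerEntry server : failoverList) {
                if (canConnect(server)) {
                    anyReachable = true;
                } else {
                    System.out.println("WARN: cannot connect to " + server.address() + ":" + server.port());
                }
            }
            if (anyReachable) {
                return; // at least one server is up; continue agent startup
            }
            // All servers are down: sleep 30s and try again, indefinitely.
            Thread.sleep(30_000L);
        }
    }

    private boolean canConnect(ServerEntry server) {
        try (java.net.Socket socket = new java.net.Socket(server.address(), server.port())) {
            return true;
        } catch (java.io.IOException e) {
            return false;
        }
    }
}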
bug #810302 has been implemented in master.
Although the option added by Bug 810302 may provide a faster start-up for the agent, it does not sound like an adequate solution for this problem. The agent cannot control a network partition event, and such an event could result in the check taking a long time to actually time out. For example, on a Solaris machine this could be 2 minutes for each server that is on the other side of the partition.

Perhaps this check could happen in a completely unrelated thread? Agent start-up should continue as normal, and these timers should not start until the agent has made its connection to the server.

So, in the end, a separate thread attempts to check each known server in the fail-over list (this should probably happen periodically anyway, and I am assuming it does). Unrelated to checking the fail-over list, we start the agent and attempt to connect to one of the servers from the previously known fail-over list (or, if that is not available, the server URL used in the configuration). If the connect fails or times out, we try another server from the "known" fail-over list. If we (eventually) connect, the agent update, plug-in update, etc. tasks should be started. If the connection fails to all known servers, we perform the normal loop of retrying after a delay. (A rough sketch follows.)
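A minimal sketch of that idea, assuming a background executor runs the fail-over list check off the startup path; the method names are hypothetical and do not reflect the real AgentMain code:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: run the fail-over list connectivity check in its own
// thread so it cannot delay registration or the plugin update.
public class AsyncFailoverCheckSketch {

    private final ExecutorService background = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "FailoverListCheck");
        t.setDaemon(true);
        return t;
    });

    public void start() {
        // Fire-and-forget: warnings about unreachable servers are purely
        // informational and should never block startup.
        background.submit(this::testFailoverListConnectivity);

        connectToAnyKnownServer();   // retry loop with delays, as today
        registerWithServer();        // start the registration/update tasks only now
        updatePluginsAtStartup();
        startPluginContainer();
    }

    // No-op stubs standing in for the real agent logic.
    private void testFailoverListConnectivity() {}
    private void connectToAnyKnownServer() {}
    private void registerWithServer() {}
    private void updatePluginsAtStartup() {}
    private void startPluginContainer() {}
}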
Bug 810302 added the feature to turn the failover check on or off. I just committed a change so that the default for that feature is now "false", meaning we will not check the failover list by default. You have to enable it explicitly to do the checks.
assigning a target release of 3.1.1 for us to revisit this one.
As suspected, this bug is evident when using SSL while the JON server host machine is configured to silently block all traffic directed at the non-SSL port. Blocking non-secure or unneeded traffic is a common practice within an enterprise configuration.

Here is how the issue can be reproduced:

1. Extract the JON server installation

2. Configure the JON server for SSL

# Set the TESTENV_DIR environment variable to the directory where the jon-server directory exists
sed -i 's/^rhq\.communications\.connector\.transport=.*$/rhq.communications.connector.transport=sslservlet/' "${TESTENV_DIR}/jon-server/bin/rhq-server.properties"
sed -i 's/^rhq\.server\.tomcat\.security\.client-auth-mode=.*$/rhq.server.tomcat.security.client-auth-mode=false/' "${TESTENV_DIR}/jon-server/bin/rhq-server.properties"
sed -i 's/^rhq\.server\.client\.security\.server-auth-mode-enabled=.*$/rhq.server.client.security.server-auth-mode-enabled=false/' "${TESTENV_DIR}/jon-server/bin/rhq-server.properties"

3. Start and install the JON server

4. Install the JON agent but do not start it

5. Configure the JON agent to use SSL

# Set the TESTENV_DIR environment variable to the directory where the rhq-agent directory exists
sed -i 's/^.*<entry key="rhq\.communications\.connector\.transport" value=".*" \/>/ <entry key="rhq.communications.connector.transport" value="sslsocket" \/>/' "${TESTENV_DIR}/rhq-agent/conf/agent-configuration.xml"
sed -i 's/^.*<entry key="rhq\.agent\.server\.transport" value=".*" \/>/ <entry key="rhq.agent.server.transport" value="sslservlet" \/>/' "${TESTENV_DIR}/rhq-agent/conf/agent-configuration.xml"
sed -i 's/^.*<entry key="rhq\.agent\.server\.bind-port" value=".*" \/>/ <entry key="rhq.agent.server.bind-port" value="7443" \/>/' "${TESTENV_DIR}/rhq-agent/conf/agent-configuration.xml"

6. Start iptables if not already running and add the following rule to DROP packets with a destination port of 7080

sudo iptables -I INPUT 1 -m state --state NEW -m tcp -p tcp --dport 7080 -j DROP

7. Start the JON agent using the --cleanconfig argument

* A minute or so after the agent was started, the following WARN message appears in agent.log:

WARN [RHQ Server Polling Thread] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.plugin-update-failure}Failed to update the plugins.. Cause: org.rhq.enterprise.communications.command.server.AuthenticationException: Command failed to be authenticated! This command will be ignored and not processed: Command: type=[remotepojo]; cmd-in-response=[false]; config=[{rhq.agent-name=localhost.localdomain, rhq.externalizable-strategy=AGENT, rhq.send-throttle=true}]; params=[{invocation=NameBasedInvocation[getLatestPlugins], targetInterfaceName=org.rhq.core.clientapi.server.core.CoreServerService}]

* About three minutes later, the agent finally comes online:

INFO [RHQ Agent Registration Thread] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.agent-registration-results}Agent has successfully registered with the server. The results are: [AgentRegistrationResults: [agent-token=QLxZEJ+iU+hFd9LpUntvMrsdF7u27NAgIkJFgMnIWn13/qol6irUhymbJxiT8kF4fuo=]]

This is because the fail-over test is first using the non-secure port and then the secure port. Additionally, the fail-over list sent to the agent from the server also includes both the non-secure and secure ports:

localhost.localdomain:7080/7443

The result is the agent first attempting to connect to port 7080 and when that fails, to port 7443.
Although Bug 810302 makes this issue less visible since the default will be not to test the fail-over list, my suggestion is that we also modify the Socket creation to use a short time-out for the fail-over check and accurately report this at DEBUG level. For example, a 5-second time-out could be used for this initial check:

Connection to {host}:{port} failed due to timeout after {timeout} seconds

This still does not fully resolve the issue in a larger HA environment. For example, if server1 was completely down and the network proxy did not return a socket error, the result would be a 10-second delay for server1 (5 seconds for each of its two ports). And if server2 did the same thing, another 10 seconds. Now we are at server3, and perhaps only 7080 makes us wait but 7443 is okay. Our total wait is 25 seconds. This is still short enough to allow the other agent startup tasks to begin within their timeouts but, as you can see, the solution is not perfect.
(In reply to comment #13)
> This is because the fail-over test is first using the non-secure port and
> then the secure port.

Correct. AgentMain.testFailoverList does indeed try to connect to the unsecured port and, if that fails for some reason, it tries the secured port. What we are trying to do here is NOT confirm the port numbers; rather, we are trying to confirm the host/IP of the server - so we don't care which port accepts our simple socket connection, we just want it to connect.

The entire reason for this failover list test was a prior RFE asking that we (as early as we can in agent startup) yell at the user if they entered a bad public server endpoint - specifically, when they use "localhost" (IIRC, this request came from one of our developers - it is developers who are most likely to take the default "localhost" in the dev builds, and when they try to connect a remote agent to it, it fails for no apparent reason - this failover list test was to help inform the user when this happens). Note that if all the servers are unreachable, we retry the registration. We do this indefinitely.

> Additionally, the fail-over list sent to the agent
> from the server also includes both the non-secure and secure ports:
>
> localhost.localdomain:7080/7443

This is correct as well. The failover list provides both a secure and a non-secure port. It is the agent configuration that determines which port the agent uses to talk to the server.

> The result is the agent first attempting to connect to port 7080 and when
> that fails, to port 7443.

We could see if we can change the code to only test the port that the agent will really be using - e.g. if it sees it is to talk over "sslservlet", it should only test the secure port listed in the failover list.

> Although Bug 810302 makes this issue less visible since the default will be
> not to test the fail-over list, my suggestion is that we also modify the
> Socket creation to use a short time-out for the fail-over check and
> accurately report this at DEBUG level. For example, a 5-second time-out
> could be used for this initial check:
>
> Connection to {host}:{port} failed due to timeout after {timeout} seconds

That's another thing we can do. I'll see if we can set some kind of timeout so the call to socket connect fails fast. The actual code we use to test the connection is this:

socket = new Socket(server.address, server.port);

So the mere construction of the socket with that constructor does the test. We'll need to configure that with some timeout.
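A minimal sketch of that change, keeping the same connectivity test but bounding it with a connect timeout (the 5-second value is taken from the earlier suggestion; the surrounding class and method names are hypothetical):

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical sketch of a fail-fast connectivity test with a bounded timeout.
public class FailoverPortTestSketch {

    private static final int CONNECT_TIMEOUT_MILLIS = 5_000; // 5s per port test

    boolean canConnect(String address, int port) {
        // Use the no-arg constructor plus connect(...) so we can pass a timeout,
        // instead of new Socket(address, port), which blocks until the OS gives up.
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(address, port), CONNECT_TIMEOUT_MILLIS);
            return true;
        } catch (IOException e) {
            // Includes SocketTimeoutException; at DEBUG level one could log, e.g.:
            // "Connection to {host}:{port} failed due to timeout after {timeout} seconds"
            return false;
        }
    }
}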
FWIW: I replicated this problem but ONLY if I explicitly switched from the default of false to true for the new agent setting: rhq.agent.test-failover-list-at-startup. So at least that setting's new default will help avoid this in the future.
(In reply to comment #15)
> FWIW: I replicated this problem but ONLY if I explicitly switched from the
> default of false to true for the new agent setting:
> rhq.agent.test-failover-list-at-startup. So at least that setting's new
> default will help avoid this in the future.

Correct, which is why I mentioned that Bug 810302 takes care of most of this. However, it does not make sense to have a test option that can be enabled if the test isn't really valid within a production setting. Even as a diagnostic tool, we want to know right away that it is taking 5+ seconds for the expected host/port to respond.

So perhaps the real solution here is to remove the functionality altogether? From a diagnostic perspective, we already ask the user to verify connectivity to the expected host/port using netcat, ping, or telnet. The validation and logging of the agent's or server's listener socket being a local address should be part of the actual connection. If we still think this process, which seems to have no value, is necessary, perhaps we can move it to its own thread and only have it evaluated when DEBUG is enabled, removing the need for Bug 810302. I just don't want to throw a new configuration property at something that has no value or meaning from a user or diagnostic perspective.
git commit to master: 5bd129d

This sets the timeout to 5s per port test. I also changed the out-of-box setting of that agent configuration property - it is true now, so out of the box we will test the failover list again, since we now time out quickly in cases where the socket won't outright fail the connection. The internal default is still false, but our agent-configuration.xml now explicitly sets it to true. We can decide what to do with all of this in the future, but I at least wanted to get this 5s timeout in now.
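For reference, the out-of-box entry in agent-configuration.xml would presumably look something like the following (the entry format mirrors the other entries shown in the reproduction steps above; the exact line in the shipped file is not reproduced here):

<entry key="rhq.agent.test-failover-list-at-startup" value="true" />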
This fix is in master only right now, so setting the Target Release field accordingly. I've cloned a new issue for getting the fix into JON311: https://bugzilla.redhat.com/show_bug.cgi?id=828791
cherry picked to release/jon3.1.x branch : aa5577b053712ed037afcaceddbb67b9c597fb20
Bulk closing of items that are on_qa and in old RHQ releases, which have been out for a long time and where the issue has not been re-opened since.