Description of problem: If using the *Platform Utilization* portlet or viewing the *Platform Utilization* report page, the following error pop-up appears 30 seconds later: Server returned FAILURE with no error message The message center contains the following message: Failed to load platform utilization data. This occurred because the server is taking a long time to complete this request. Please be aware that the server may still be processing your request and it may complete shortly. You can check the server logs to see if any abnormal errors occurred. This only occurs if one or more agents are slow or unable to provide the requested metric data necessary. Version-Release number of selected component (if applicable): 4.4.0.JON310GA How reproducible: Always Steps to Reproduce: 1. Install ON 3.1 system 2. Start ON 3.1 system 3. Register agent and import platform 4. From ON UI, add Platform Utilization portlet to default dashboard 5. Confirm portlet is displaying platform's CPU load, memory usage, etc. 6. Logout of ON UI 7. Use firewall rule to silently ignore TCP connection requests to ON agent's port sudo iptables -I INPUT 1 -m state --state NEW -m tcp -p tcp --dport 16163 -j DROP You can confirm this worked by attempting to telnet to the agent using its listening address and port: telnet localhost 16163 Trying 127.0.0.1... This will eventually fail but will take a very long time (on Fedora 14, 3 minutes). The goal is that it should take more then 30 seconds. This is needed to simulate a slow agent or an agent that is shutdown while network is in stealth mode. 8. Connect to ON UI The default page should be the Dashboard page that contains the Platform Utilization portlet. Approximately 30 seconds after connecting, an error should apper in the UI. 9. Navigate to the Reports > Inventory / Platform Utilization page 10. Navigate to another page Approximately 30 seconds after first navigating to the Platform Utilization page the error will reoccur 11. Navigate back to the dashbaord and wait 5 to 10 seconds and click refresh 30 seconds after first navigating to the dashboard the error will reoccur and then again approximately 5 to 10 seconds later. 12. Now, refresh the dashboard again a few times with a couple seconds between each refresh 13. Navigate to the Inventory > Servers page (or any other page that does not contain the platform utilization data No matter what page you are on, the error will appear -- and seem random -- but is 30 seconds after each time you clicked the refresh when on the dashboard page. Actual results: UI errors appear and platform utilization portlet or report are blank/empty Expected results: No errors and platform utilization portlet or report should contain data for agents that have returned it. A visual indication should also appear for agents that have not returned such data. Additional info: When the server is attempting to get the platform metrics from the agent, it appears that two requests are sent at a the same time resulting in the following two socket connection attempts: tcp 0 1 127.0.0.1:42709 127.0.0.1:16163 SYN_SENT 22603/java tcp 0 1 127.0.0.1:42711 127.0.0.1:16163 SYN_SENT 22603/java 127.0.0.1:42709 and 127.0.0.1:42711 both belong to the server and 127.0.0.1:16163 is the agent. This connection will time out eventually due to our firewall setting which is delaying the server's ability to have the agent execute the request. When the server's attempt to execute the request fails (on Fedora 14 and probably most Linux distributions, 3 minutes) the following two error messages are logged corresponding to the above connection attempts: 2012-11-02 14:50:33,601 ERROR [org.rhq.enterprise.communications.command.client.ClientCommandSenderTask] {ClientCommandSenderTask.send-failed}Failed to send command [Command: type=[remotepojo]; cmd-in-response=[false]; config=[{rhq.send-throttle=true}]; params=[{invocation=NameBasedInvocation[getRealTimeMeasurementValue], targetInterfaceName=org.rhq.core.clientapi.agent.measurement.MeasurementAgentService}]]. Cause: org.jboss.remoting.CannotConnectException:Can not get connection to server. Problem establishing socket connection for InvokerLocator [socket://127.0.0.1:16163/?backlog=200&clientMaxPoolSize=304&enableTcpNoDelay=true&maxPoolSize=303&numAcceptThreads=1&rhq.communications.connector.rhqtype=agent&socketTimeout=60000] -> java.net.ConnectException:Connection timed out. Cause: org.jboss.remoting.CannotConnectException: Can not get connection to server. Problem establishing socket connection for InvokerLocator [socket://127.0.0.1:16163/?backlog=200&clientMaxPoolSize=304&enableTcpNoDelay=true&maxPoolSize=303&numAcceptThreads=1&rhq.communications.connector.rhqtype=agent&socketTimeout=60000] 2012-11-02 14:50:33,854 ERROR [org.rhq.enterprise.communications.command.client.ClientCommandSenderTask] {ClientCommandSenderTask.send-failed}Failed to send command [Command: type=[remotepojo]; cmd-in-response=[false]; config=[{rhq.send-throttle=true}]; params=[{invocation=NameBasedInvocation[getRealTimeMeasurementValue], targetInterfaceName=org.rhq.core.clientapi.agent.measurement.MeasurementAgentService}]]. Cause: org.jboss.remoting.CannotConnectException:Can not get connection to server. Problem establishing socket connection for InvokerLocator [socket://127.0.0.1:16163/?backlog=200&clientMaxPoolSize=304&enableTcpNoDelay=true&maxPoolSize=303&numAcceptThreads=1&rhq.communications.connector.rhqtype=agent&socketTimeout=60000] -> java.net.ConnectException:Connection timed out. Cause: org.jboss.remoting.CannotConnectException: Can not get connection to server. Problem establishing socket connection for InvokerLocator [socket://127.0.0.1:16163/?backlog=200&clientMaxPoolSize=304&enableTcpNoDelay=true&maxPoolSize=303&numAcceptThreads=1&rhq.communications.connector.rhqtype=agent&socketTimeout=60000] However, regardless of how long those requests take (assuming they take at least 30 seconds) the UI will timeout waiting for the data and the user will see a pop-up saying "Server returned FAILURE with no error message" along with two message center messages: Message : Failed to load platform utilization data. This occurred because the server is taking a long time to complete this request. Please be aware that the server may still be processing your request and it may complete shortly. You can check the server logs to see if any abnormal errors occurred. Severity : Warning Time : Friday, November 2, 2012 2:47:54 PM UTC-5 Detail : com.google.gwt.http.client.RequestTimeoutException:A request timeout has expired after 30000 ms --- STACK TRACE FOLLOWS --- A request timeout has expired after 30000 ms at Unknown.java_lang_Exception_Exception__Ljava_lang_String_2V(Unknown Source) at Unknown.com_google_gwt_http_client_RequestTimeoutException_RequestTimeoutException__Lcom_google_gwt_http_client_Request_2IV(Unknown Source) at Unknown.com_google_gwt_http_client_Request_$fireOnTimeout__Lcom_google_gwt_http_client_Request_2Lcom_google_gwt_http_client_RequestCallback_2V(Unknown Source) at Unknown.com_google_gwt_http_client_Request$3_run__V(Unknown Source) at Unknown.com_google_gwt_user_client_Timer_fire__V(Unknown Source) at Unknown.anonymous(Unknown Source) at Unknown.com_google_gwt_core_client_impl_Impl_apply__Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2(Unknown Source) at Unknown.com_google_gwt_core_client_impl_Impl_entry0__Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2(Unknown Source) at Unknown.anonymous(Unknown Source) at Unknown.anonymous(Unknown Source) Message : Failed to load platform utilization data. This occurred because the server is taking a long time to complete this request. Please be aware that the server may still be processing your request and it may complete shortly. You can check the server logs to see if any abnormal errors occurred. Severity : Warning Time : Friday, November 2, 2012 2:47:54 PM UTC-5 Detail : com.google.gwt.http.client.RequestTimeoutException:A request timeout has expired after 30000 ms --- STACK TRACE FOLLOWS --- A request timeout has expired after 30000 ms at Unknown.java_lang_Exception_Exception__Ljava_lang_String_2V(Unknown Source) at Unknown.com_google_gwt_http_client_RequestTimeoutException_RequestTimeoutException__Lcom_google_gwt_http_client_Request_2IV(Unknown Source) at Unknown.com_google_gwt_http_client_Request_$fireOnTimeout__Lcom_google_gwt_http_client_Request_2Lcom_google_gwt_http_client_RequestCallback_2V(Unknown Source) at Unknown.com_google_gwt_http_client_Request$3_run__V(Unknown Source) at Unknown.com_google_gwt_user_client_Timer_fire__V(Unknown Source) at Unknown.anonymous(Unknown Source) at Unknown.com_google_gwt_core_client_impl_Impl_apply__Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2(Unknown Source) at Unknown.com_google_gwt_core_client_impl_Impl_entry0__Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2(Unknown Source) at Unknown.anonymous(Unknown Source) at Unknown.anonymous(Unknown Source) These always occur in pairs and will occur 30 seconds after any page containing platform utilization data (dashboard portlet or platform utilization report) is viewed or refreshed. Multiple page views or refreshes will result in additional errors (2 each time). Another aspect of this issue to consider is the load this places on each agent. If each click on the report or refresh generates a new request even while the previous one is still running, one could bring agents down by repeatedly asking for the data.
Created attachment 637243 [details] Error Message Pop-up screenshot1.png - Shows pop-up error message that appears on any page of the UI 30 seconds after viewing the dashboard or the platform utilization report.
Created attachment 637244 [details] Message Center messages two at a time screenshot2.png - Shows *Message Center* with two error messages for each attempt of accessing dashboard or platform utilization report.
We need to get to the bottom of what is going on here and make sure agents being slow doesn't impact the server UI rendering.
Some initial thoughts without really looking. There are several things going here, I think. - I think we have a 30s timeout on server RPC calls from the GUI. That means any call that takes longer than 30s will get get the onFailure block called for the async call. The cause will be a timeout, the service may still succeed. - I don't know if that timeout can be changed or disabled, but the onFailure code can do some non-standard handling. - a down agent can sink the whole thing because we try and gather live metrics from each agent. But there is a long wait before giving up on an agent connect. But we should be able to avoid DOWN platforms. - Slow platforms we likely can't avoid unless we can have a fast timeout on the agent request. - It seems that we don't protect against the 30s/slow server response message if the user navigates away. Maybe we should (if it's somehow doable)
master commit 1b7d2edf904bccff5ea848076d333961a5836117 Jay Shaughnessy <jshaughn> Tue Apr 9 16:21:45 2013 -0400 - Fix a bug with overriding the RPC timeout on a GWT service. It now allows 0, which disables the RPC timeout. - Disabled the timeout on the call to get platform utilization data. Even if this is working and all agents are reporting, it could still take time due to the number of agents, or slow agents. - Only try to contact agents whose platform resources are UP. trying to contact a DOWN agent will likely just slow things down. - Added an Availability column to the portlet/report view. So now the user can maybe see why certain platforms didn't report any data. Also, make sure we return a row for each platform. - Change some "getLiveData" handling when we can't make an agent connection. Instead of throwing an exception and spewing all over the logs, just return an empty set of data. Update jdoc and callers as needed. - Fix bug to stop a redundant fetch for platform util data in the portlet/report views. This was certainly wasteful and exacerbated the reported issue. - minor: rename local SLSB method loadLiveMetricsForPlatform() to loadLiveMetricsForPlatformInNewTransaction, for clarity and convention - minor: remove a TODO comment in CannotConnectToAgentException. Leave as ApplicationException so the exception does not get wrapped as an EJBException. So, basically, the "RPC timeout" type exceptions should be gone in this use case. I did not try to implement any sort of disabling of "RPC timeout" messages when switching views. In general I think these are still useful messages and should be displayed. But in this "reporting only" use-case it's not necessary.
commit cd3d18bf347518347209afe58519c6f332e0df81 Author: Jay Shaughnessy <jshaughn> Date: Wed Apr 10 16:56:45 2013 -0400 Using the new feature allowing timeouts on agent client calls, limit individual live data queries to the agent to 10s, making this more likely to deal with a slow agent. Also, fix up the MeasurementDataManager Local and Remote. Deprecated the live data remote methods, which should not be exposed due to the impact on agents. Fixed some jdoc for the remote.
As this is MODIFIED or ON_QA, setting milestone to ER1.