+++ This bug was initially created as a clone of JBoss ON Bug #872731 +++ Description of problem: If using the *Platform Utilization* portlet or viewing the *Platform Utilization* report page, the following error pop-up appears 30 seconds later: Server returned FAILURE with no error message The message center contains the following message: Failed to load platform utilization data. This occurred because the server is taking a long time to complete this request. Please be aware that the server may still be processing your request and it may complete shortly. You can check the server logs to see if any abnormal errors occurred. This only occurs if one or more agents are slow or unable to provide the requested metric data necessary. Version-Release number of selected component (if applicable): 4.4.0.JON310GA How reproducible: Always Steps to Reproduce: 1. Install ON 3.1 system 2. Start ON 3.1 system 3. Register agent and import platform 4. From ON UI, add Platform Utilization portlet to default dashboard 5. Confirm portlet is displaying platform's CPU load, memory usage, etc. 6. Logout of ON UI 7. Use firewall rule to silently ignore TCP connection requests to ON agent's port sudo iptables -I INPUT 1 -m state --state NEW -m tcp -p tcp --dport 16163 -j DROP You can confirm this worked by attempting to telnet to the agent using its listening address and port: telnet localhost 16163 Trying 127.0.0.1... This will eventually fail but will take a very long time (on Fedora 14, 3 minutes). The goal is that it should take more then 30 seconds. This is needed to simulate a slow agent or an agent that is shutdown while network is in stealth mode. 8. Connect to ON UI The default page should be the Dashboard page that contains the Platform Utilization portlet. Approximately 30 seconds after connecting, an error should apper in the UI. 9. Navigate to the Reports > Inventory / Platform Utilization page 10. Navigate to another page Approximately 30 seconds after first navigating to the Platform Utilization page the error will reoccur 11. Navigate back to the dashbaord and wait 5 to 10 seconds and click refresh 30 seconds after first navigating to the dashboard the error will reoccur and then again approximately 5 to 10 seconds later. 12. Now, refresh the dashboard again a few times with a couple seconds between each refresh 13. Navigate to the Inventory > Servers page (or any other page that does not contain the platform utilization data No matter what page you are on, the error will appear -- and seem random -- but is 30 seconds after each time you clicked the refresh when on the dashboard page. Actual results: UI errors appear and platform utilization portlet or report are blank/empty Expected results: No errors and platform utilization portlet or report should contain data for agents that have returned it. A visual indication should also appear for agents that have not returned such data. Additional info: When the server is attempting to get the platform metrics from the agent, it appears that two requests are sent at a the same time resulting in the following two socket connection attempts: tcp 0 1 127.0.0.1:42709 127.0.0.1:16163 SYN_SENT 22603/java tcp 0 1 127.0.0.1:42711 127.0.0.1:16163 SYN_SENT 22603/java 127.0.0.1:42709 and 127.0.0.1:42711 both belong to the server and 127.0.0.1:16163 is the agent. This connection will time out eventually due to our firewall setting which is delaying the server's ability to have the agent execute the request. When the server's attempt to execute the request fails (on Fedora 14 and probably most Linux distributions, 3 minutes) the following two error messages are logged corresponding to the above connection attempts: 2012-11-02 14:50:33,601 ERROR [org.rhq.enterprise.communications.command.client.ClientCommandSenderTask] {ClientCommandSenderTask.send-failed}Failed to send command [Command: type=[remotepojo]; cmd-in-response=[false]; config=[{rhq.send-throttle=true}]; params=[{invocation=NameBasedInvocation[getRealTimeMeasurementValue], targetInterfaceName=org.rhq.core.clientapi.agent.measurement.MeasurementAgentService}]]. Cause: org.jboss.remoting.CannotConnectException:Can not get connection to server. Problem establishing socket connection for InvokerLocator [socket://127.0.0.1:16163/?backlog=200&clientMaxPoolSize=304&enableTcpNoDelay=true&maxPoolSize=303&numAcceptThreads=1&rhq.communications.connector.rhqtype=agent&socketTimeout=60000] -> java.net.ConnectException:Connection timed out. Cause: org.jboss.remoting.CannotConnectException: Can not get connection to server. Problem establishing socket connection for InvokerLocator [socket://127.0.0.1:16163/?backlog=200&clientMaxPoolSize=304&enableTcpNoDelay=true&maxPoolSize=303&numAcceptThreads=1&rhq.communications.connector.rhqtype=agent&socketTimeout=60000] 2012-11-02 14:50:33,854 ERROR [org.rhq.enterprise.communications.command.client.ClientCommandSenderTask] {ClientCommandSenderTask.send-failed}Failed to send command [Command: type=[remotepojo]; cmd-in-response=[false]; config=[{rhq.send-throttle=true}]; params=[{invocation=NameBasedInvocation[getRealTimeMeasurementValue], targetInterfaceName=org.rhq.core.clientapi.agent.measurement.MeasurementAgentService}]]. Cause: org.jboss.remoting.CannotConnectException:Can not get connection to server. Problem establishing socket connection for InvokerLocator [socket://127.0.0.1:16163/?backlog=200&clientMaxPoolSize=304&enableTcpNoDelay=true&maxPoolSize=303&numAcceptThreads=1&rhq.communications.connector.rhqtype=agent&socketTimeout=60000] -> java.net.ConnectException:Connection timed out. Cause: org.jboss.remoting.CannotConnectException: Can not get connection to server. Problem establishing socket connection for InvokerLocator [socket://127.0.0.1:16163/?backlog=200&clientMaxPoolSize=304&enableTcpNoDelay=true&maxPoolSize=303&numAcceptThreads=1&rhq.communications.connector.rhqtype=agent&socketTimeout=60000] However, regardless of how long those requests take (assuming they take at least 30 seconds) the UI will timeout waiting for the data and the user will see a pop-up saying "Server returned FAILURE with no error message" along with two message center messages: Message : Failed to load platform utilization data. This occurred because the server is taking a long time to complete this request. Please be aware that the server may still be processing your request and it may complete shortly. You can check the server logs to see if any abnormal errors occurred. Severity : Warning Time : Friday, November 2, 2012 2:47:54 PM UTC-5 Detail : com.google.gwt.http.client.RequestTimeoutException:A request timeout has expired after 30000 ms --- STACK TRACE FOLLOWS --- A request timeout has expired after 30000 ms at Unknown.java_lang_Exception_Exception__Ljava_lang_String_2V(Unknown Source) at Unknown.com_google_gwt_http_client_RequestTimeoutException_RequestTimeoutException__Lcom_google_gwt_http_client_Request_2IV(Unknown Source) at Unknown.com_google_gwt_http_client_Request_$fireOnTimeout__Lcom_google_gwt_http_client_Request_2Lcom_google_gwt_http_client_RequestCallback_2V(Unknown Source) at Unknown.com_google_gwt_http_client_Request$3_run__V(Unknown Source) at Unknown.com_google_gwt_user_client_Timer_fire__V(Unknown Source) at Unknown.anonymous(Unknown Source) at Unknown.com_google_gwt_core_client_impl_Impl_apply__Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2(Unknown Source) at Unknown.com_google_gwt_core_client_impl_Impl_entry0__Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2(Unknown Source) at Unknown.anonymous(Unknown Source) at Unknown.anonymous(Unknown Source) Message : Failed to load platform utilization data. This occurred because the server is taking a long time to complete this request. Please be aware that the server may still be processing your request and it may complete shortly. You can check the server logs to see if any abnormal errors occurred. Severity : Warning Time : Friday, November 2, 2012 2:47:54 PM UTC-5 Detail : com.google.gwt.http.client.RequestTimeoutException:A request timeout has expired after 30000 ms --- STACK TRACE FOLLOWS --- A request timeout has expired after 30000 ms at Unknown.java_lang_Exception_Exception__Ljava_lang_String_2V(Unknown Source) at Unknown.com_google_gwt_http_client_RequestTimeoutException_RequestTimeoutException__Lcom_google_gwt_http_client_Request_2IV(Unknown Source) at Unknown.com_google_gwt_http_client_Request_$fireOnTimeout__Lcom_google_gwt_http_client_Request_2Lcom_google_gwt_http_client_RequestCallback_2V(Unknown Source) at Unknown.com_google_gwt_http_client_Request$3_run__V(Unknown Source) at Unknown.com_google_gwt_user_client_Timer_fire__V(Unknown Source) at Unknown.anonymous(Unknown Source) at Unknown.com_google_gwt_core_client_impl_Impl_apply__Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2(Unknown Source) at Unknown.com_google_gwt_core_client_impl_Impl_entry0__Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2(Unknown Source) at Unknown.anonymous(Unknown Source) at Unknown.anonymous(Unknown Source) These always occur in pairs and will occur 30 seconds after any page containing platform utilization data (dashboard portlet or platform utilization report) is viewed or refreshed. Multiple page views or refreshes will result in additional errors (2 each time). Another aspect of this issue to consider is the load this places on each agent. If each click on the report or refresh generates a new request even while the previous one is still running, one could bring agents down by repeatedly asking for the data. --- Additional comment from loleary on 2012-11-02 16:54:53 EDT --- Created attachment 637243 [details] Error Message Pop-up screenshot1.png - Shows pop-up error message that appears on any page of the UI 30 seconds after viewing the dashboard or the platform utilization report. --- Additional comment from loleary on 2012-11-02 16:56:03 EDT --- Created attachment 637244 [details] Message Center messages two at a time screenshot2.png - Shows *Message Center* with two error messages for each attempt of accessing dashboard or platform utilization report.