Bug 872750 - Platform utilization data fails to be retrieved if agent is slow or down and results in many UI errors Server returned FAILURE with no error message
Summary: Platform utilization data fails to be retrieved if agent is slow or down and ...
Keywords:
Status: NEW
Alias: None
Product: RHQ Project
Classification: Other
Component: Core UI, Monitoring
Version: 4.4
Hardware: All
OS: All
unspecified
high
Target Milestone: ---
: ---
Assignee: Nobody
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 872731
TreeView+ depends on / blocked
 
Reported: 2012-11-02 21:49 UTC by Larry O'Leary
Modified: 2022-03-31 04:28 UTC (History)
1 user (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of: 872731
Environment:
Last Closed:
Embargoed:


Attachments (Terms of Use)

Description Larry O'Leary 2012-11-02 21:49:01 UTC
+++ This bug was initially created as a clone of JBoss ON Bug #872731 +++

Description of problem:
If using the *Platform Utilization* portlet or viewing the *Platform Utilization* report page, the following error pop-up appears 30 seconds later:

    Server returned FAILURE with no error message


The message center contains the following message:

    Failed to load platform utilization data. This occurred because the server is taking a long time to complete this request. Please be aware that the server may still be processing your request and it may complete shortly. You can check the server logs to see if any abnormal errors occurred.


This only occurs if one or more agents are slow or unable to provide the requested metric data necessary.

Version-Release number of selected component (if applicable):
4.4.0.JON310GA

How reproducible:
Always

Steps to Reproduce:
1.  Install ON 3.1 system
2.  Start ON 3.1 system
3.  Register agent and import platform
4.  From ON UI, add Platform Utilization portlet to default dashboard
5.  Confirm portlet is displaying platform's CPU load, memory usage, etc.
6.  Logout of ON UI
7.  Use firewall rule to silently ignore TCP connection requests to ON agent's port

        sudo iptables -I INPUT 1 -m state --state NEW -m tcp -p tcp --dport 16163 -j DROP
        
    You can confirm this worked by attempting to telnet to the agent using its listening address and port:
    
        telnet localhost 16163

        Trying 127.0.0.1...

    This will eventually fail but will take a very long time (on Fedora 14, 3 minutes). The goal is that it should take more then 30 seconds.
    
    This is needed to simulate a slow agent or an agent that is shutdown while network is in stealth mode.
            
8.  Connect to ON UI

    The default page should be the Dashboard page that contains the Platform Utilization portlet. Approximately 30 seconds after connecting, an error should apper in the UI.
    
9.  Navigate to the Reports > Inventory / Platform Utilization page
10. Navigate to another page

    Approximately 30 seconds after first navigating to the Platform Utilization page the error will reoccur
    
11. Navigate back to the dashbaord and wait 5 to 10 seconds and click refresh

    30 seconds after first navigating to the dashboard the error will reoccur and then again approximately 5 to 10 seconds later.
    
12. Now, refresh the dashboard again a few times with a couple seconds between each refresh
13. Navigate to the Inventory > Servers page (or any other page that does not contain the platform utilization data

    No matter what page you are on, the error will appear -- and seem random -- but is 30 seconds after each time you clicked the refresh when on the dashboard page.

  
Actual results:
UI errors appear and platform utilization portlet or report are blank/empty

Expected results:
No errors and platform utilization portlet or report should contain data for agents that have returned it. A visual indication should also appear for agents that have not returned such data.

Additional info:
When the server is attempting to get the platform metrics from the agent, it appears that two requests are sent at a the same time resulting in the following two socket connection attempts:

tcp        0      1 127.0.0.1:42709             127.0.0.1:16163             SYN_SENT    22603/java          
tcp        0      1 127.0.0.1:42711             127.0.0.1:16163             SYN_SENT    22603/java        


127.0.0.1:42709 and 127.0.0.1:42711 both belong to the server and 127.0.0.1:16163 is the agent. This connection will time out eventually due to our firewall setting which is delaying the server's ability to have the agent execute the request.

When the server's attempt to execute the request fails (on Fedora 14 and probably most Linux distributions, 3 minutes) the following two error messages are logged corresponding to the above connection attempts:

2012-11-02 14:50:33,601 ERROR [org.rhq.enterprise.communications.command.client.ClientCommandSenderTask] {ClientCommandSenderTask.send-failed}Failed to send command [Command: type=[remotepojo]; cmd-in-response=[false]; config=[{rhq.send-throttle=true}]; params=[{invocation=NameBasedInvocation[getRealTimeMeasurementValue], targetInterfaceName=org.rhq.core.clientapi.agent.measurement.MeasurementAgentService}]]. Cause: org.jboss.remoting.CannotConnectException:Can not get connection to server. Problem establishing socket connection for InvokerLocator [socket://127.0.0.1:16163/?backlog=200&clientMaxPoolSize=304&enableTcpNoDelay=true&maxPoolSize=303&numAcceptThreads=1&rhq.communications.connector.rhqtype=agent&socketTimeout=60000] -> java.net.ConnectException:Connection timed out. Cause: org.jboss.remoting.CannotConnectException: Can not get connection to server. Problem establishing socket connection for InvokerLocator [socket://127.0.0.1:16163/?backlog=200&clientMaxPoolSize=304&enableTcpNoDelay=true&maxPoolSize=303&numAcceptThreads=1&rhq.communications.connector.rhqtype=agent&socketTimeout=60000]
2012-11-02 14:50:33,854 ERROR [org.rhq.enterprise.communications.command.client.ClientCommandSenderTask] {ClientCommandSenderTask.send-failed}Failed to send command [Command: type=[remotepojo]; cmd-in-response=[false]; config=[{rhq.send-throttle=true}]; params=[{invocation=NameBasedInvocation[getRealTimeMeasurementValue], targetInterfaceName=org.rhq.core.clientapi.agent.measurement.MeasurementAgentService}]]. Cause: org.jboss.remoting.CannotConnectException:Can not get connection to server. Problem establishing socket connection for InvokerLocator [socket://127.0.0.1:16163/?backlog=200&clientMaxPoolSize=304&enableTcpNoDelay=true&maxPoolSize=303&numAcceptThreads=1&rhq.communications.connector.rhqtype=agent&socketTimeout=60000] -> java.net.ConnectException:Connection timed out. Cause: org.jboss.remoting.CannotConnectException: Can not get connection to server. Problem establishing socket connection for InvokerLocator [socket://127.0.0.1:16163/?backlog=200&clientMaxPoolSize=304&enableTcpNoDelay=true&maxPoolSize=303&numAcceptThreads=1&rhq.communications.connector.rhqtype=agent&socketTimeout=60000]



However, regardless of how long those requests take (assuming they take at least 30 seconds) the UI will timeout waiting for the data and the user will see a pop-up saying "Server returned FAILURE with no error message" along with two message center messages:

Message :	
Failed to load platform utilization data. This occurred because the server is taking a long time to complete this request. Please be aware that the server may still be processing your request and it may complete shortly. You can check the server logs to see if any abnormal errors occurred.
Severity :	
Warning
	
Time :	
Friday, November 2, 2012 2:47:54 PM UTC-5
Detail :	

com.google.gwt.http.client.RequestTimeoutException:A request timeout has expired after 30000 ms
--- STACK TRACE FOLLOWS ---
A request timeout has expired after 30000 ms
   at Unknown.java_lang_Exception_Exception__Ljava_lang_String_2V(Unknown Source)
   at Unknown.com_google_gwt_http_client_RequestTimeoutException_RequestTimeoutException__Lcom_google_gwt_http_client_Request_2IV(Unknown Source)
   at Unknown.com_google_gwt_http_client_Request_$fireOnTimeout__Lcom_google_gwt_http_client_Request_2Lcom_google_gwt_http_client_RequestCallback_2V(Unknown Source)
   at Unknown.com_google_gwt_http_client_Request$3_run__V(Unknown Source)
   at Unknown.com_google_gwt_user_client_Timer_fire__V(Unknown Source)
   at Unknown.anonymous(Unknown Source)
   at Unknown.com_google_gwt_core_client_impl_Impl_apply__Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2(Unknown Source)
   at Unknown.com_google_gwt_core_client_impl_Impl_entry0__Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2(Unknown Source)
   at Unknown.anonymous(Unknown Source)
   at Unknown.anonymous(Unknown Source)
   
   
   
Message :	
Failed to load platform utilization data. This occurred because the server is taking a long time to complete this request. Please be aware that the server may still be processing your request and it may complete shortly. You can check the server logs to see if any abnormal errors occurred.
Severity :	
Warning
	
Time :	
Friday, November 2, 2012 2:47:54 PM UTC-5
Detail :	

com.google.gwt.http.client.RequestTimeoutException:A request timeout has expired after 30000 ms
--- STACK TRACE FOLLOWS ---
A request timeout has expired after 30000 ms
   at Unknown.java_lang_Exception_Exception__Ljava_lang_String_2V(Unknown Source)
   at Unknown.com_google_gwt_http_client_RequestTimeoutException_RequestTimeoutException__Lcom_google_gwt_http_client_Request_2IV(Unknown Source)
   at Unknown.com_google_gwt_http_client_Request_$fireOnTimeout__Lcom_google_gwt_http_client_Request_2Lcom_google_gwt_http_client_RequestCallback_2V(Unknown Source)
   at Unknown.com_google_gwt_http_client_Request$3_run__V(Unknown Source)
   at Unknown.com_google_gwt_user_client_Timer_fire__V(Unknown Source)
   at Unknown.anonymous(Unknown Source)
   at Unknown.com_google_gwt_core_client_impl_Impl_apply__Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2(Unknown Source)
   at Unknown.com_google_gwt_core_client_impl_Impl_entry0__Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2(Unknown Source)
   at Unknown.anonymous(Unknown Source)
   at Unknown.anonymous(Unknown Source)




These always occur in pairs and will occur 30 seconds after any page containing platform utilization data (dashboard portlet or platform utilization report) is viewed or refreshed. Multiple page views or refreshes will result in additional errors (2 each time).


Another aspect of this issue to consider is the load this places on each agent. If each click on the report or refresh generates a new request even while the previous one is still running, one could bring agents down by repeatedly asking for the data.

--- Additional comment from loleary on 2012-11-02 16:54:53 EDT ---

Created attachment 637243 [details]
Error Message Pop-up

screenshot1.png - Shows pop-up error message that appears on any page of the UI 30 seconds after viewing the dashboard or the platform utilization report.

--- Additional comment from loleary on 2012-11-02 16:56:03 EDT ---

Created attachment 637244 [details]
Message Center messages two at a time

screenshot2.png - Shows *Message Center* with two error messages for each attempt of accessing dashboard or platform utilization report.


Note You need to log in before you can comment on or make changes to this bug.