Bug 872731

Summary: Platform utilization data fails to be retrieved if agent is slow or down and results in many UI errors Server returned FAILURE with no error message
Product: [JBoss] JBoss Operations Network Reporter: Larry O'Leary <loleary>
Component: Monitoring -- Other, UIAssignee: Jay Shaughnessy <jshaughn>
Status: CLOSED CURRENTRELEASE QA Contact: Mike Foley <mfoley>
Severity: medium Docs Contact:
Priority: medium    
Version: JON 3.1.0CC: myarboro
Target Milestone: ER01   
Target Release: JON 3.2.0   
Hardware: All   
OS: All   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 872750 (view as bug list) Environment:
Last Closed: 2014-01-02 20:42:53 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 872750    
Bug Blocks:    
Attachments:
Description Flags
Error Message Pop-up
none
Message Center messages two at a time none

Description Larry O'Leary 2012-11-02 20:27:22 UTC
Description of problem:
If using the *Platform Utilization* portlet or viewing the *Platform Utilization* report page, the following error pop-up appears 30 seconds later:

    Server returned FAILURE with no error message


The message center contains the following message:

    Failed to load platform utilization data. This occurred because the server is taking a long time to complete this request. Please be aware that the server may still be processing your request and it may complete shortly. You can check the server logs to see if any abnormal errors occurred.


This only occurs if one or more agents are slow or unable to provide the requested metric data necessary.

Version-Release number of selected component (if applicable):
4.4.0.JON310GA

How reproducible:
Always

Steps to Reproduce:
1.  Install ON 3.1 system
2.  Start ON 3.1 system
3.  Register agent and import platform
4.  From ON UI, add Platform Utilization portlet to default dashboard
5.  Confirm portlet is displaying platform's CPU load, memory usage, etc.
6.  Logout of ON UI
7.  Use firewall rule to silently ignore TCP connection requests to ON agent's port

        sudo iptables -I INPUT 1 -m state --state NEW -m tcp -p tcp --dport 16163 -j DROP
        
    You can confirm this worked by attempting to telnet to the agent using its listening address and port:
    
        telnet localhost 16163

        Trying 127.0.0.1...

    This will eventually fail but will take a very long time (on Fedora 14, 3 minutes). The goal is that it should take more then 30 seconds.
    
    This is needed to simulate a slow agent or an agent that is shutdown while network is in stealth mode.
            
8.  Connect to ON UI

    The default page should be the Dashboard page that contains the Platform Utilization portlet. Approximately 30 seconds after connecting, an error should apper in the UI.
    
9.  Navigate to the Reports > Inventory / Platform Utilization page
10. Navigate to another page

    Approximately 30 seconds after first navigating to the Platform Utilization page the error will reoccur
    
11. Navigate back to the dashbaord and wait 5 to 10 seconds and click refresh

    30 seconds after first navigating to the dashboard the error will reoccur and then again approximately 5 to 10 seconds later.
    
12. Now, refresh the dashboard again a few times with a couple seconds between each refresh
13. Navigate to the Inventory > Servers page (or any other page that does not contain the platform utilization data

    No matter what page you are on, the error will appear -- and seem random -- but is 30 seconds after each time you clicked the refresh when on the dashboard page.

  
Actual results:
UI errors appear and platform utilization portlet or report are blank/empty

Expected results:
No errors and platform utilization portlet or report should contain data for agents that have returned it. A visual indication should also appear for agents that have not returned such data.

Additional info:
When the server is attempting to get the platform metrics from the agent, it appears that two requests are sent at a the same time resulting in the following two socket connection attempts:

tcp        0      1 127.0.0.1:42709             127.0.0.1:16163             SYN_SENT    22603/java          
tcp        0      1 127.0.0.1:42711             127.0.0.1:16163             SYN_SENT    22603/java        


127.0.0.1:42709 and 127.0.0.1:42711 both belong to the server and 127.0.0.1:16163 is the agent. This connection will time out eventually due to our firewall setting which is delaying the server's ability to have the agent execute the request.

When the server's attempt to execute the request fails (on Fedora 14 and probably most Linux distributions, 3 minutes) the following two error messages are logged corresponding to the above connection attempts:

2012-11-02 14:50:33,601 ERROR [org.rhq.enterprise.communications.command.client.ClientCommandSenderTask] {ClientCommandSenderTask.send-failed}Failed to send command [Command: type=[remotepojo]; cmd-in-response=[false]; config=[{rhq.send-throttle=true}]; params=[{invocation=NameBasedInvocation[getRealTimeMeasurementValue], targetInterfaceName=org.rhq.core.clientapi.agent.measurement.MeasurementAgentService}]]. Cause: org.jboss.remoting.CannotConnectException:Can not get connection to server. Problem establishing socket connection for InvokerLocator [socket://127.0.0.1:16163/?backlog=200&clientMaxPoolSize=304&enableTcpNoDelay=true&maxPoolSize=303&numAcceptThreads=1&rhq.communications.connector.rhqtype=agent&socketTimeout=60000] -> java.net.ConnectException:Connection timed out. Cause: org.jboss.remoting.CannotConnectException: Can not get connection to server. Problem establishing socket connection for InvokerLocator [socket://127.0.0.1:16163/?backlog=200&clientMaxPoolSize=304&enableTcpNoDelay=true&maxPoolSize=303&numAcceptThreads=1&rhq.communications.connector.rhqtype=agent&socketTimeout=60000]
2012-11-02 14:50:33,854 ERROR [org.rhq.enterprise.communications.command.client.ClientCommandSenderTask] {ClientCommandSenderTask.send-failed}Failed to send command [Command: type=[remotepojo]; cmd-in-response=[false]; config=[{rhq.send-throttle=true}]; params=[{invocation=NameBasedInvocation[getRealTimeMeasurementValue], targetInterfaceName=org.rhq.core.clientapi.agent.measurement.MeasurementAgentService}]]. Cause: org.jboss.remoting.CannotConnectException:Can not get connection to server. Problem establishing socket connection for InvokerLocator [socket://127.0.0.1:16163/?backlog=200&clientMaxPoolSize=304&enableTcpNoDelay=true&maxPoolSize=303&numAcceptThreads=1&rhq.communications.connector.rhqtype=agent&socketTimeout=60000] -> java.net.ConnectException:Connection timed out. Cause: org.jboss.remoting.CannotConnectException: Can not get connection to server. Problem establishing socket connection for InvokerLocator [socket://127.0.0.1:16163/?backlog=200&clientMaxPoolSize=304&enableTcpNoDelay=true&maxPoolSize=303&numAcceptThreads=1&rhq.communications.connector.rhqtype=agent&socketTimeout=60000]



However, regardless of how long those requests take (assuming they take at least 30 seconds) the UI will timeout waiting for the data and the user will see a pop-up saying "Server returned FAILURE with no error message" along with two message center messages:

Message :	
Failed to load platform utilization data. This occurred because the server is taking a long time to complete this request. Please be aware that the server may still be processing your request and it may complete shortly. You can check the server logs to see if any abnormal errors occurred.
Severity :	
Warning
	
Time :	
Friday, November 2, 2012 2:47:54 PM UTC-5
Detail :	

com.google.gwt.http.client.RequestTimeoutException:A request timeout has expired after 30000 ms
--- STACK TRACE FOLLOWS ---
A request timeout has expired after 30000 ms
   at Unknown.java_lang_Exception_Exception__Ljava_lang_String_2V(Unknown Source)
   at Unknown.com_google_gwt_http_client_RequestTimeoutException_RequestTimeoutException__Lcom_google_gwt_http_client_Request_2IV(Unknown Source)
   at Unknown.com_google_gwt_http_client_Request_$fireOnTimeout__Lcom_google_gwt_http_client_Request_2Lcom_google_gwt_http_client_RequestCallback_2V(Unknown Source)
   at Unknown.com_google_gwt_http_client_Request$3_run__V(Unknown Source)
   at Unknown.com_google_gwt_user_client_Timer_fire__V(Unknown Source)
   at Unknown.anonymous(Unknown Source)
   at Unknown.com_google_gwt_core_client_impl_Impl_apply__Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2(Unknown Source)
   at Unknown.com_google_gwt_core_client_impl_Impl_entry0__Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2(Unknown Source)
   at Unknown.anonymous(Unknown Source)
   at Unknown.anonymous(Unknown Source)
   
   
   
Message :	
Failed to load platform utilization data. This occurred because the server is taking a long time to complete this request. Please be aware that the server may still be processing your request and it may complete shortly. You can check the server logs to see if any abnormal errors occurred.
Severity :	
Warning
	
Time :	
Friday, November 2, 2012 2:47:54 PM UTC-5
Detail :	

com.google.gwt.http.client.RequestTimeoutException:A request timeout has expired after 30000 ms
--- STACK TRACE FOLLOWS ---
A request timeout has expired after 30000 ms
   at Unknown.java_lang_Exception_Exception__Ljava_lang_String_2V(Unknown Source)
   at Unknown.com_google_gwt_http_client_RequestTimeoutException_RequestTimeoutException__Lcom_google_gwt_http_client_Request_2IV(Unknown Source)
   at Unknown.com_google_gwt_http_client_Request_$fireOnTimeout__Lcom_google_gwt_http_client_Request_2Lcom_google_gwt_http_client_RequestCallback_2V(Unknown Source)
   at Unknown.com_google_gwt_http_client_Request$3_run__V(Unknown Source)
   at Unknown.com_google_gwt_user_client_Timer_fire__V(Unknown Source)
   at Unknown.anonymous(Unknown Source)
   at Unknown.com_google_gwt_core_client_impl_Impl_apply__Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2(Unknown Source)
   at Unknown.com_google_gwt_core_client_impl_Impl_entry0__Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2(Unknown Source)
   at Unknown.anonymous(Unknown Source)
   at Unknown.anonymous(Unknown Source)




These always occur in pairs and will occur 30 seconds after any page containing platform utilization data (dashboard portlet or platform utilization report) is viewed or refreshed. Multiple page views or refreshes will result in additional errors (2 each time).


Another aspect of this issue to consider is the load this places on each agent. If each click on the report or refresh generates a new request even while the previous one is still running, one could bring agents down by repeatedly asking for the data.

Comment 1 Larry O'Leary 2012-11-02 20:54:53 UTC
Created attachment 637243 [details]
Error Message Pop-up

screenshot1.png - Shows pop-up error message that appears on any page of the UI 30 seconds after viewing the dashboard or the platform utilization report.

Comment 2 Larry O'Leary 2012-11-02 20:56:03 UTC
Created attachment 637244 [details]
Message Center messages two at a time

screenshot2.png - Shows *Message Center* with two error messages for each attempt of accessing dashboard or platform utilization report.

Comment 3 Charles Crouch 2013-03-11 14:33:31 UTC
We need to get to the bottom of what is going on here and make sure agents being slow doesn't impact the server UI rendering.

Comment 4 Jay Shaughnessy 2013-04-09 01:06:33 UTC
Some initial thoughts without really looking. There are several things going here, I think. 

- I think we have a 30s timeout on server RPC calls from the GUI.  That means any call that takes longer than 30s will get get the onFailure block called for the async call. The cause will be a timeout, the service may still succeed.

- I don't know if that timeout can be changed or disabled, but the onFailure code can do some non-standard handling.

- a down agent can sink the whole thing because we try and gather live metrics from each agent. But there is a long wait before giving up on an agent connect. But we should be able to avoid DOWN platforms.

- Slow platforms we likely can't avoid unless we can have a fast timeout on the agent request.

- It seems that we don't protect against the 30s/slow server response message if the user navigates away. Maybe we should (if it's somehow doable)

Comment 5 Jay Shaughnessy 2013-04-09 20:39:53 UTC
master commit 1b7d2edf904bccff5ea848076d333961a5836117
Jay Shaughnessy <jshaughn>
Tue Apr 9 16:21:45 2013 -0400

- Fix a bug with overriding the RPC timeout on a GWT service. It now allows
  0, which disables the RPC timeout.
- Disabled the timeout on the call to get platform utilization data. Even if
  this is working and all agents are reporting, it could still take time due
  to the number of agents, or slow agents.
- Only try to contact agents whose platform resources are UP.  trying to contact
  a DOWN agent will likely just slow things down.
- Added an Availability column to the portlet/report view.  So now the user
  can maybe see why certain platforms didn't report any data.  Also, make sure
  we return a row for each platform.
- Change some "getLiveData" handling when we can't make an agent connection.
  Instead of throwing an exception and spewing all over the logs, just
  return an empty set of data. Update jdoc and callers as needed.
- Fix bug to stop a redundant fetch for platform util data in the
  portlet/report views. This was certainly wasteful and exacerbated the
  reported issue.
- minor: rename local SLSB method loadLiveMetricsForPlatform() to
  loadLiveMetricsForPlatformInNewTransaction, for clarity and convention
- minor: remove a TODO comment in CannotConnectToAgentException.  Leave as
  ApplicationException so the exception does not get wrapped as an EJBException.

So, basically, the "RPC timeout" type exceptions should be gone in this
use case. I did not try to implement any sort of disabling of "RPC timeout"
messages when switching views. In general I think these are still useful
messages and should be displayed. But in this "reporting only" use-case
it's not necessary.

Comment 6 Jay Shaughnessy 2013-04-10 21:36:29 UTC
commit cd3d18bf347518347209afe58519c6f332e0df81
Author: Jay Shaughnessy <jshaughn>
Date:   Wed Apr 10 16:56:45 2013 -0400

    Using the new feature allowing timeouts on agent client calls, limit
    individual live data queries to the agent to 10s, making this more
    likely to deal with a slow agent.

    Also, fix up the MeasurementDataManager Local and Remote. Deprecated the
    live data remote methods, which should not be exposed due to the impact
    on agents. Fixed some jdoc for the remote.

Comment 7 Larry O'Leary 2013-09-06 14:30:16 UTC
As this is MODIFIED or ON_QA, setting milestone to ER1.