Bug 872731
Summary: Platform utilization data fails to be retrieved if agent is slow or down and results in many UI errors Server returned FAILURE with no error message
Product: [JBoss] JBoss Operations Network
Reporter: Larry O'Leary <loleary>
Component: Monitoring -- Other, UI
Assignee: Jay Shaughnessy <jshaughn>
Status: CLOSED CURRENTRELEASE
QA Contact: Mike Foley <mfoley>
Severity: medium
Priority: medium
Version: JON 3.1.0
CC: myarboro
Target Milestone: ER01
Target Release: JON 3.2.0
Hardware: All
OS: All
Doc Type: Bug Fix
Clone Of:
: 872750
Last Closed: 2014-01-02 20:42:53 UTC
Type: Bug
Bug Depends On: 872750

Description
Larry O'Leary
2012-11-02 20:27:22 UTC
Created attachment 637243
Error Message Pop-up
screenshot1.png - Shows the pop-up error message that appears on any page of the UI 30 seconds after viewing the dashboard or the platform utilization report.

Created attachment 637244
Message Center messages two at a time
screenshot2.png - Shows *Message Center* with two error messages for each attempt at accessing the dashboard or platform utilization report.

We need to get to the bottom of what is going on here and make sure slow agents don't impact the server UI rendering.

Some initial thoughts without really looking. There are several things going on here, I think.

- I think we have a 30s timeout on server RPC calls from the GUI. That means any call that takes longer than 30s will get the onFailure block called for the async call. The cause will be a timeout; the service may still succeed.
- I don't know if that timeout can be changed or disabled, but the onFailure code can do some non-standard handling.
- A down agent can sink the whole thing because we try to gather live metrics from each agent, and there is a long wait before giving up on an agent connect. But we should be able to avoid DOWN platforms.
- Slow platforms we likely can't avoid unless we can have a fast timeout on the agent request.
- It seems that we don't protect against the 30s/slow server response message if the user navigates away. Maybe we should (if it's somehow doable).

master commit 1b7d2edf904bccff5ea848076d333961a5836117
Jay Shaughnessy <jshaughn>
Tue Apr 9 16:21:45 2013 -0400

- Fix a bug with overriding the RPC timeout on a GWT service. It now allows 0, which disables the RPC timeout.
- Disabled the timeout on the call to get platform utilization data. Even if this is working and all agents are reporting, it could still take time due to the number of agents, or slow agents.
- Only try to contact agents whose platform resources are UP. Trying to contact a DOWN agent will likely just slow things down.
- Added an Availability column to the portlet/report view, so the user can now see why certain platforms didn't report any data. Also, make sure we return a row for each platform.
- Change some "getLiveData" handling when we can't make an agent connection. Instead of throwing an exception and spewing all over the logs, just return an empty set of data. Update jdoc and callers as needed.
- Fix bug to stop a redundant fetch for platform util data in the portlet/report views. This was certainly wasteful and exacerbated the reported issue.
- minor: rename local SLSB method loadLiveMetricsForPlatform() to loadLiveMetricsForPlatformInNewTransaction, for clarity and convention.
- minor: remove a TODO comment in CannotConnectToAgentException. Leave as ApplicationException so the exception does not get wrapped as an EJBException.

So, basically, the "RPC timeout" type exceptions should be gone in this use case. I did not try to implement any sort of disabling of "RPC timeout" messages when switching views. In general I think these are still useful messages and should be displayed. But in this "reporting only" use case it's not necessary.

commit cd3d18bf347518347209afe58519c6f332e0df81
Author: Jay Shaughnessy <jshaughn>
Date: Wed Apr 10 16:56:45 2013 -0400

Using the new feature allowing timeouts on agent client calls, limit individual live data queries to the agent to 10s, making this more likely to deal with a slow agent. Also, fix up the MeasurementDataManager Local and Remote. Deprecated the live data remote methods, which should not be exposed due to the impact on agents. Fixed some jdoc for the remote.

As this is MODIFIED or ON_QA, setting milestone to ER1.
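
For anyone reading this later, the server-side behavior described in the two commit messages (skip DOWN platforms, bound each live data query, return empty data instead of throwing, still return a row per platform) can be sketched roughly as below. This is only an illustration using plain `java.util.concurrent`, not the actual RHQ/JON code: `PlatformUtilizationSketch`, `Platform`, `AgentClient` and `loadUtilization` are hypothetical names, and the real fix uses the agent client's own timeout feature rather than a `Future`.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class PlatformUtilizationSketch {

    /** Hypothetical stand-in for a platform resource and its last known availability. */
    public static final class Platform {
        public final String name;
        public final boolean up;

        public Platform(String name, boolean up) {
            this.name = name;
            this.up = up;
        }
    }

    /** Hypothetical stand-in for the call that asks an agent for live metric values. */
    public interface AgentClient {
        Set<String> getLiveData(Platform platform) throws Exception;
    }

    /**
     * Returns one row per platform, in input order. DOWN platforms are never
     * contacted, and an agent that is slow or unreachable yields an empty set
     * instead of an exception.
     */
    public static Map<String, Set<String>> loadUtilization(
            List<Platform> platforms, AgentClient agent, long timeoutMillis) {
        Map<String, Set<String>> rows = new LinkedHashMap<>();
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            for (Platform p : platforms) {
                Set<String> data = Collections.emptySet();
                if (p.up) { // only contact agents whose platform is UP
                    Future<Set<String>> call = pool.submit(() -> agent.getLiveData(p));
                    try {
                        // bound the wait per agent (the commit uses 10s)
                        data = call.get(timeoutMillis, TimeUnit.MILLISECONDS);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt(); // preserve interrupt status
                        call.cancel(true);
                    } catch (ExecutionException | TimeoutException e) {
                        call.cancel(true); // slow or broken agent: keep the empty set
                    }
                }
                rows.put(p.name, data); // always emit a row, even with no data
            }
        } finally {
            pool.shutdownNow(); // interrupt any call still running
        }
        return rows;
    }
}
```

With a 10s limit this would be called as `loadUtilization(platforms, agent, 10_000L)`. An empty set next to a DOWN platform is what the new Availability column is meant to explain to the user.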