1380471 – UI's live availability check results in repeated "A request timeout has expired after 30000 ms"

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	JBoss Operations Network
Classification:	JBoss
Component:	UI
Sub Component:
Version:	JON 3.3.7
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	low
Target Milestone:	CR01
Target Release:	JON 3.3.9
Assignee:	Josejulio Martínez
QA Contact:	Prachi
Docs Contact:
URL:	https://developer.jboss.org/message/9...
Whiteboard:
Depends On:
Blocks:

Reported:	2016-09-29 16:50 UTC by Larry O'Leary
Modified:	2017-10-02 17:21 UTC
CC List:	4 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2017-10-02 17:21:51 UTC
Type:	Bug
Embargoed:

Attachments
screen-shot (176.00 KB, image/png) 2017-09-25 10:25 UTC, Prachi	no flags

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHEA-2017:2846	0	normal	SHIPPED_LIVE	Red Hat JBoss Operations Network 3.3.9 bug fix update	2017-10-02 21:21:28 UTC

Description Larry O'Leary 2016-09-29 16:50:25 UTC

Description of problem:
When viewing resources in the JBoss ON UI, a warning is displayed in the UI every 30 seconds or so:

> Resource with id [12345] does not exist or is not accessible. This occurred because the server is taking a long time to complete this request. Please be aware that the server may still be processing your request and it may complete shortly. You can check the server logs to see if any abnormal errors occurred.
> Root Cause: A request timeout has expired after 30000 ms

Version-Release number of selected component (if applicable):
3.3.7

How reproducible:
Always -- With slow network or agent

Steps to Reproduce:
1. Start JBoss ON system and ensure agent is properly reporting availability of managed resources.
2. Configure firewall on agent host to silently drop packets destined to the agent's listen port of 16163.
3. In the JBoss ON UI, navigate to a resource managed by the agent with its firewall silently dropping packets.
4. Wait a minute or more.

Actual results:
After about 30 to 45 seconds, a warning about the request timeout appears and continues to appear every 30 seconds thereafter.

Expected results:
No warnings, and the availability icon reflects the last known availability state of the resource based on the last received "real" or "live" availability report (whichever is newer).

It may be necessary to reflect that the live availability is unknown, and perhaps a notice could be displayed to that effect. For example: "Unable to obtain this resource's current availability state. Availability state reported is based on last known state from date/time."

Additional info:
In the user's scenario it is not known what is causing the delay. It is suspected that the agent is just slow to respond to the live availability request due to other operations being performed concurrently.

The original report of this issue was received from the Red Hat Developer discussion thread https://developer.jboss.org/message/963465 .

Comment 1 Josejulio Martínez 2017-08-21 22:59:00 UTC

I couldn't reproduce this bug using the "Steps to Reproduce".
The reason is that the server tries to ping the agent before trying to get the resource's availability.
It pings the agent with a timeout of 5s; if the agent doesn't ping back, it reports AvailabilityType.UNKNOWN (or AvailabilityType.DOWN if the resource is of type Platform).

Because port 16163 is configured to DROP, the ping check fails.

The way I could reproduce this was by setting a breakpoint [1] and keeping it waiting there until the timeout.

[1] https://github.com/josejulio/rhq/blob/012b4f48f0072a4df3995cc3279cdd0cabde6361/modules/enterprise/server/jar/src/main/java/org/rhq/enterprise/server/resource/ResourceManagerBean.java#L2710-L2710
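The ping-first behaviour described in this comment can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in ResourceManagerBean: the `PingGate` class, the `Agent` interface, its `ping(long)` signature, and the local `Avail` enum are all hypothetical names invented for the sketch; only the 5s timeout and the UNKNOWN/DOWN mapping come from the comment.

```java
/** Hypothetical sketch of a "ping before live availability" gate. */
class PingGate {
    /** Local stand-in for the availability types named in the comment. */
    enum Avail { UP, DOWN, UNKNOWN }

    /** Hypothetical view of an agent; not an RHQ interface. */
    interface Agent {
        /** Returns true if the agent answered within timeoutMs. */
        boolean ping(long timeoutMs);

        /** Asks the agent for the resource's live availability. */
        Avail liveAvailability();
    }

    /** 5s ping timeout, as stated in the comment. */
    static final long PING_TIMEOUT_MS = 5000L;

    static Avail check(Agent agent, boolean isPlatform) {
        // If the agent does not ping back, never issue the live request.
        if (!agent.ping(PING_TIMEOUT_MS)) {
            // Platforms are reported DOWN, everything else UNKNOWN.
            return isPlatform ? Avail.DOWN : Avail.UNKNOWN;
        }
        return agent.liveAvailability();
    }
}
```

Under this model, a firewall that silently drops packets to port 16163 makes the ping fail first, so the slower live availability request is never sent and the 30000 ms UI timeout is not reached.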

Comment 2 Larry O'Leary 2017-08-22 15:26:36 UTC

Interesting.

I am not able to reproduce this using my original steps either. I also attempted the following:

1. Confirm no sockets/connections exist from server to agent.
2. Navigate to RHQ Server resource.

--

1. Confirm no sockets/connections exist from server to agent.
2. Navigate to RHQ Agent resource.

--

1. Unblock port 16163.
2. Import EAP 6 server into inventory using port 9990.
3. Confirm EAP 6 reports available (i.e. connection settings are valid).
4. Navigate away from EAP 6 resource in UI.
5. Block (DROP) packets to port 9990.
6. Navigate to EAP 6 resource and wait a couple minutes.

^^ Repeat steps 1 to 5

6. Navigate to EAP 6 resource's metric table to confirm live metrics aren't causing the UI timeout.

In all instances this worked without any warning/error to the UI.

I would be okay with CLOSED/WORKSFORME unless you want to address the general timeout/hang issue you were able to reproduce by adding the breakpoint. Basically, my concern is that if it is taking longer than 15 seconds to get the live availability for a resource, then we should treat the availability as UNKNOWN without waiting for the 30-second generic UI timeout. The primary reason is that if we wait 30 seconds, we now have another 1 or 2 availability checks queued up that will also result in the same UI warning, potentially after the user has already navigated away.

I'll defer to dev's expertise as it relates to simplicity of a fix and risk assessment.

Comment 3 Josejulio Martínez 2017-08-22 16:04:00 UTC

I already have a fix; I just need to do a bit more testing.

I'll try to change the timeout to 15s. Currently, on timeout, it shows the previous availability. It makes sense to change it to UNKNOWN (or DOWN if the resource is a Platform [1]).

[1] https://github.com/josejulio/rhq/blob/012b4f48f0072a4df3995cc3279cdd0cabde6361/modules/enterprise/server/jar/src/main/java/org/rhq/enterprise/server/resource/ResourceManagerBean.java#L2712-L2716
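A minimal sketch of the idea in this comment, bounding the live availability call and reporting UNKNOWN (or DOWN for a platform) when the bound is exceeded, might look like the following. It is not the actual fix: the class name, the `Callable` standing in for the agent request, and the local `Avail` enum are assumptions made for illustration, and the timeout is passed as a parameter rather than fixed at 15s.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical sketch: bound a live availability request with a timeout. */
class LiveAvailabilityWithTimeout {
    /** Local stand-in for the availability types named in the thread. */
    enum Avail { UP, DOWN, UNKNOWN }

    static Avail liveAvailability(Callable<Avail> agentCall, boolean isPlatform, long timeoutMs) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<Avail> pending = pool.submit(agentCall);
        try {
            // Wait for the agent, but never longer than timeoutMs.
            return pending.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Do not keep showing the previous state; report the timeout.
            pending.cancel(true);
            return isPlatform ? Avail.DOWN : Avail.UNKNOWN;
        } catch (ExecutionException e) {
            // The request itself failed; the live state is not known.
            return Avail.UNKNOWN;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Avail.UNKNOWN;
        } finally {
            pool.shutdownNow();
        }
    }
}
```

The design point is that the caller always gets an answer within `timeoutMs`, so a slow agent surfaces as an availability state instead of a generic UI request timeout.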

Comment 4 Josejulio Martínez 2017-08-22 16:33:52 UTC

Larry,

I don't think it is necessary to lower the timeout to 15s; there is a countdown latch [1] that only allows one refresh at a given time.

[1] https://github.com/josejulio/rhq/blob/91291e00a58349c1c36166ac8d3a3c3bfc3bdc2f/modules/enterprise/gui/coregui/src/main/java/org/rhq/coregui/client/inventory/resource/detail/ResourceTitleBar.java#L150-L153
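The "only one refresh at a time" guard mentioned here can be illustrated with a small sketch. Note that this is a swapped-in technique: it uses a plain `AtomicBoolean` flag rather than the countdown latch in ResourceTitleBar, and the class and method names are hypothetical.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sketch: skip a refresh while a previous one is still pending. */
class SingleFlightRefresher {
    private final AtomicBoolean inFlight = new AtomicBoolean(false);
    private int started = 0;

    /** Returns true if a refresh was started, false if one is already pending. */
    boolean tryStartRefresh() {
        // Only the caller that flips false -> true may start a request.
        if (!inFlight.compareAndSet(false, true)) {
            return false;
        }
        started++;
        return true;
    }

    /** Call when the pending request succeeds, fails, or times out. */
    void refreshFinished() {
        inFlight.set(false);
    }

    /** Number of refreshes actually started. */
    int startedCount() {
        return started;
    }
}
```

With a guard of this kind, a timer that fires while an earlier request is still waiting simply skips that tick, so slow responses do not pile up additional pending refreshes from the same view.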

Comment 5 Larry O'Leary 2017-08-22 16:43:38 UTC

Okay.

I think the reason I was concerned is because I can see a new socket being created every 15 seconds. If the socket is waiting for a connect or has hung, it seems to remain around for a couple of minutes. This resulted in 8 sockets to this single agent.

Comment 6 Josejulio Martínez 2017-08-31 00:03:17 UTC

commit f1bbd51c69a67a9ff59e42b6d1c515d25526bd6f
Merge: 3f5df89 803dbc9
Author: Michael Burman <yak>
Date:   Wed Aug 30 21:22:48 2017 +0300

    Merge pull request #317 from josejulio/bugs/1380471
    
    Bug 1380471 - Check for timeouts when getting live availability

commit 803dbc93d70b5b9136ebb2440a879eff621340ee
Author: Josejulio Martínez <jmartine>
Date:   Tue Aug 22 13:30:53 2017 -0500

    Bug 1380471 - Check for timeouts when getting live availability
     - Set availability to unknown on timeout

Comment 8 Simeon Pinder 2017-09-19 11:33:20 UTC

Moving to ON_QA.

JON 3.3.9 CR01 artifacts are available for test from here:
http://download.eng.bos.redhat.com/brewroot/packages/org.jboss.on-jboss-on-parent/3.3.0.GA/135/maven/org/jboss/on/jon-server-patch/3.3.0.GA/jon-server-patch-3.3.0.GA.zip
 *Note: jon-server-patch-3.3.0.GA.zip maps to CR01 build of
 jon-server-3.3.0.GA-update-09.zip.

Comment 9 Prachi 2017-09-25 10:24:05 UTC

Verified
Version :	
3.3.0.GA Update 09
Build Number :	
fcb34f1:80f74f5

Comment 10 Prachi 2017-09-25 10:25:08 UTC

Created attachment 1330473
screen-shot

Comment 11 errata-xmlrpc 2017-10-02 17:21:51 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:2846
