Description of problem:
When viewing resources in the JBoss ON UI, every 30 seconds or so a warning is displayed in the UI:

> Resource with id [12345] does not exist or is not accessible. This occurred because the server is taking a long time to complete this request. Please be aware that the server may still be processing your request and it may complete shortly. You can check the server logs to see if any abnormal errors occurred.
> Root Cause: A request timeout has expired after 30000 ms

Version-Release number of selected component (if applicable):
3.3.7

How reproducible:
Always -- With slow network or agent

Steps to Reproduce:
1. Start JBoss ON system and ensure agent is properly reporting availability of managed resources.
2. Configure firewall on agent host to silently drop packets destined to the agent's listen port of 16163.
3. In the JBoss ON UI, navigate to a resource managed by the agent with its firewall silently dropping packets.
4. Wait a minute or more.

Actual results:
After about 30 to 45 seconds, a warning appears regarding request timeout and continues to appear every 30 seconds after.

Expected results:
No warnings, and the availability icon reflects the last known availability state of the resource based on the last received "real" or "live" availability report (whichever is newer).

It may be necessary to reflect that the live availability is unknown, and perhaps a notice could be displayed to that effect. For example, "Unable to obtain this resource's current availability state. Availability state reported is based on last known state from date/time."

Additional info:
In the user's scenario it is not known what is causing the delay. It is suspected that the agent is just slow to respond to the live availability request due to other operations being performed concurrently.

The original report of this issue was received from the Red Hat Developer discussion thread https://developer.jboss.org/message/963465 .
I couldn't reproduce this bug using the "Steps to Reproduce". The reason is that the server tries to ping the agent before trying to get the resource availability. It pings the agent with a timeout of 5s; if the agent doesn't respond, it reports AvailabilityType.UNKNOWN (or AvailabilityType.DOWN if the resource is of type Platform). As port 16163 is configured to DROP, the ping check fails.

I could reproduce this by setting a breakpoint [1] and holding execution there until the timeout.

[1] https://github.com/josejulio/rhq/blob/012b4f48f0072a4df3995cc3279cdd0cabde6361/modules/enterprise/server/jar/src/main/java/org/rhq/enterprise/server/resource/ResourceManagerBean.java#L2710-L2710
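To illustrate the ordering described in this comment (ping first, ask for availability only if the ping succeeds), here is a minimal sketch. It is not the RHQ code: `liveAvailability`, `pingAgent`, and `askAgent` are hypothetical names, and the nested `AvailabilityType` enum is a local stand-in for the real type.

```java
import java.util.function.BooleanSupplier;
import java.util.function.Supplier;

public class PingThenAvailabilitySketch {

    // Local stand-in for the server's AvailabilityType (assumption: only these values matter here).
    enum AvailabilityType { UP, DOWN, UNKNOWN }

    // Hypothetical helper: pingAgent represents the short ping check (5s in the comment above),
    // askAgent represents the live availability request that is sent only after a successful ping.
    static AvailabilityType liveAvailability(boolean isPlatform,
                                             BooleanSupplier pingAgent,
                                             Supplier<AvailabilityType> askAgent) {
        if (!pingAgent.getAsBoolean()) {
            // Agent unreachable: platforms are reported DOWN, everything else UNKNOWN.
            return isPlatform ? AvailabilityType.DOWN : AvailabilityType.UNKNOWN;
        }
        return askAgent.get();
    }

    public static void main(String[] args) {
        Supplier<AvailabilityType> neverReached = () -> {
            throw new IllegalStateException("agent must not be queried after a failed ping");
        };
        System.out.println(liveAvailability(false, () -> false, neverReached)); // UNKNOWN
        System.out.println(liveAvailability(true, () -> false, neverReached));  // DOWN
        System.out.println(liveAvailability(false, () -> true, () -> AvailabilityType.UP)); // UP
    }
}
```

With packets to port 16163 dropped, the ping fails first, so the slow availability request is never sent. That would explain why the original steps do not trigger the 30-second UI timeout.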
Interesting. I am not able to reproduce this using my original steps either. I also attempted the following:

1. Confirm no sockets/connections exist from server to agent.
2. Navigate to RHQ Server resource.
--
1. Confirm no sockets/connections exist from server to agent.
2. Navigate to RHQ Agent resource.
--
1. Unblock port 16163.
2. Import EAP 6 server into inventory using port 9990.
3. Confirm EAP 6 reports available (i.e. connection settings are valid).
4. Navigate away from EAP 6 resource in UI.
5. Block (DROP) packets to port 9990.
6. Navigate to EAP 6 resource and wait a couple minutes.
--
^^ Repeat steps 1 to 5
6. Navigate to EAP 6 resource's metric table to confirm live metrics aren't causing the UI timeout.

In all instances this worked without any warning/error in the UI.

I would be okay with CLOSED/WORKSFORME unless you want to address the general timeout/hang issue you were able to reproduce by adding the breakpoint.

Basically, my concern is that if it is taking longer than 15 seconds to get the live availability for a resource, then we should treat the availability as UNKNOWN without waiting for the 30 second generic UI timeout. The primary reason is that if we wait 30 seconds, we now have another 1 or 2 availability checks queued up that will also result in the same UI warning -- potentially after the user has already navigated away.

I'll defer to dev's expertise as it relates to simplicity of a fix and risk assessment.
I already have a fix; I just need to do a bit more testing. I'll try changing the timeout to 15s.

Currently, on timeout, it shows the previous availability. It makes sense to change it to UNKNOWN (or DOWN if the resource is a Platform [1]).

[1] https://github.com/josejulio/rhq/blob/012b4f48f0072a4df3995cc3279cdd0cabde6361/modules/enterprise/server/jar/src/main/java/org/rhq/enterprise/server/resource/ResourceManagerBean.java#L2712-L2716
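The behavior proposed here (bound the live availability call and fall back to UNKNOWN, or DOWN for a Platform, when it times out) could look roughly like the sketch below. This is an assumption-laden illustration using `java.util.concurrent`, not the actual change in ResourceManagerBean: `getWithTimeout`, the thread pool, and the nested enum are all hypothetical.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class LiveAvailabilityTimeoutSketch {

    // Local stand-in for the server's AvailabilityType.
    enum AvailabilityType { UP, DOWN, UNKNOWN }

    // Daemon threads so a hung agent call cannot keep the JVM alive in this demo.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "live-avail-sketch");
        t.setDaemon(true);
        return t;
    });

    // Hypothetical helper: agentCall stands for the remote live availability request.
    static AvailabilityType getWithTimeout(Callable<AvailabilityType> agentCall,
                                           long timeoutMillis,
                                           boolean isPlatform) {
        Future<AvailabilityType> future = POOL.submit(agentCall);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // stop waiting on the slow agent
            return isPlatform ? AvailabilityType.DOWN : AvailabilityType.UNKNOWN;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            future.cancel(true);
            return AvailabilityType.UNKNOWN;
        } catch (ExecutionException e) {
            return AvailabilityType.UNKNOWN; // the call itself failed
        }
    }

    public static void main(String[] args) {
        // Simulated slow agent: answers after 2s, but we only wait 100ms.
        Callable<AvailabilityType> slowAgent = () -> {
            Thread.sleep(2000);
            return AvailabilityType.UP;
        };
        System.out.println(getWithTimeout(slowAgent, 100, false)); // UNKNOWN
        System.out.println(getWithTimeout(slowAgent, 100, true));  // DOWN
        System.out.println(getWithTimeout(() -> AvailabilityType.UP, 1000, false)); // UP
    }
}
```

The key point is that the caller gets a definite answer once the chosen timeout elapses, instead of surfacing the generic 30000 ms request timeout to the UI.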
Larry, I don't think it is necessary to lower the timeout to 15s; there is a countdown latch [1] that only allows one refresh at a time.

[1] https://github.com/josejulio/rhq/blob/91291e00a58349c1c36166ac8d3a3c3bfc3bdc2f/modules/enterprise/gui/coregui/src/main/java/org/rhq/coregui/client/inventory/resource/detail/ResourceTitleBar.java#L150-L153
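For readers unfamiliar with the idea, the "only one refresh at a time" guard can be sketched as below. Note that this swaps in a plain `AtomicBoolean` flag rather than the countdown latch used in ResourceTitleBar, and the class and method names are hypothetical; it only illustrates why a slow request does not cause further requests to pile up behind it.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class SingleRefreshGuardSketch {

    // True while an availability refresh is pending.
    private final AtomicBoolean refreshInFlight = new AtomicBoolean(false);

    // Returns true if the caller may start a refresh, false if one is already pending.
    boolean tryStartRefresh() {
        return refreshInFlight.compareAndSet(false, true);
    }

    // Must be called when the pending refresh completes, fails, or times out.
    void refreshFinished() {
        refreshInFlight.set(false);
    }

    public static void main(String[] args) {
        SingleRefreshGuardSketch guard = new SingleRefreshGuardSketch();
        System.out.println(guard.tryStartRefresh()); // true: first refresh starts
        System.out.println(guard.tryStartRefresh()); // false: timer fired again while pending
        guard.refreshFinished();
        System.out.println(guard.tryStartRefresh()); // true: next refresh may start
    }
}
```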
Okay. I think the reason I was concerned is because I can see a new socket being created every 15 seconds. If the socket is waiting for a connect or has hung, it seems to remain around for a couple of minutes. This resulted in 8 sockets to this single agent.
commit f1bbd51c69a67a9ff59e42b6d1c515d25526bd6f
Merge: 3f5df89 803dbc9
Author: Michael Burman <yak>
Date:   Wed Aug 30 21:22:48 2017 +0300

    Merge pull request #317 from josejulio/bugs/1380471

    Bug 1380471 - Check for timeouts when getting live availability

commit 803dbc93d70b5b9136ebb2440a879eff621340ee
Author: Josejulio Martínez <jmartine>
Date:   Tue Aug 22 13:30:53 2017 -0500

    Bug 1380471 - Check for timeouts when getting live availability

    - Set availability to unknown on timeout
Moving to ON_QA.

JON 3.3.9 CR01 artifacts are available for test from here:
http://download.eng.bos.redhat.com/brewroot/packages/org.jboss.on-jboss-on-parent/3.3.0.GA/135/maven/org/jboss/on/jon-server-patch/3.3.0.GA/jon-server-patch-3.3.0.GA.zip

*Note: jon-server-patch-3.3.0.GA.zip maps to CR01 build of jon-server-3.3.0.GA-update-09.zip.
Verified

Version : 3.3.0.GA Update 09
Build Number : fcb34f1:80f74f5
Created attachment 1330473 [details] screen-shot
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:2846