Red Hat Bugzilla – Bug 850860
JON Drools knowledge session table has invalid values
Last modified: 2013-09-11 07:03:04 EDT
Created attachment 606302
When monitoring the Drools example application (BRMS 5.2.0.GA, BRMS 5.3.0.GA) with JON 3.1.1, the knowledge session monitoring table shows different values than JConsole.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Install the JON 3.1.1.ER2 server with the BRMS plugin.
2. Run the drools-jon-example application.
3. Discover the Drools application.
4. Go to the Drools Knowledge > Session-1 > Monitoring tables and compare them with JConsole.
Values should be the same.
Steps to get Drools application running:
1) Unpack drools_application.tar.gz
2) Download the BRMS deployable zip file:
3) unzip brms-p-5.*-deployable.zip
4) unzip jboss-brms-engine.zip
5) Create a link from the binaries to drools-jon-example/drools
6) Run the application:
ant compile; ant server
ant client.phase1; ant client.phase2
7) Try to monitor the application with JConsole and JON
Created attachment 606304
JON log files
Created attachment 606306
JON monitoring page
Created attachment 606307
JConsole monitoring page
If the Drools team are going to be able to get to the bottom of the issue, I expect they will need DEBUG level logs from the JON server and agent.
Created attachment 609647
JON 3.1.1.CR1 DEBUG logs for BRMS 5.2.0
did not fix the bug for BRMS 5.2.0/5.3.0.
Adding JON debug logs.
Created attachment 609649
JON 3.1.1.CR1 DEBUG logs for BRMS 5.3.0
Created attachment 610014
Drools session 0 and NaN values
Created attachment 610015
Drools session 0 and NaN values BRMS 5.3.0
First, we should be clear that the RHQ tables and graphs are not expected to always display the same data as JConsole. JConsole shows the current runtime values, while RHQ only takes snapshots of the data at an interval determined by the customer. If the default collection interval for the attributes is 10-20 minutes, it is entirely possible that the NaN values are valid, since RHQ has not yet been asked to collect them. A quick test is to select a metric and use the 'Get Live Value' button to see whether data exists but simply has not yet been collected for the tables and graphs.
You should also start with the plain value metrics rather than the "Per Minute" metrics, as the latter take a bit longer to display.
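The difference between a live reading and a scheduled snapshot can be sketched in a few lines. This is a toy model, not RHQ code: `table_value`, `facts_inserted`, and the numbers used are hypothetical, and the model assumes the first collection happens one full interval after the schedule is enabled.

```python
import math

def table_value(live_value, interval_s, elapsed_s):
    """Value a snapshot-based table would show elapsed_s seconds after
    the schedule was enabled (hypothetical model, not RHQ internals)."""
    collections_done = elapsed_s // interval_s
    if collections_done == 0:
        # Nothing has been collected yet, so there is nothing to display.
        return math.nan
    # The table shows the value as of the most recent collection.
    return live_value(collections_done * interval_s)

def facts_inserted(t):
    # Hypothetical live counter: one fact inserted every 10 seconds.
    return t // 10

# 10-minute interval, 5 minutes elapsed: the table has no data yet.
print(table_value(facts_inserted, 600, 300))   # nan
# 15 minutes elapsed: the table lags the live value (60 vs 90).
print(table_value(facts_inserted, 600, 900), facts_inserted(900))
```

In this model the table never matches the live value except at the instant of a collection, which is why an occasional mismatch with JConsole is not by itself a sign of a defect.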
That said, what were the scheduled metric collection intervals for the properties displayed in the graph? By default, some of these metrics are not collected for 10-20 minutes. The second thing to check is that the time range you are displaying is valid. For newly imported resources, the default 'Time Range' may not cover the period you expect.
Another thing to check is that the clocks on the agent and server are synchronized. If you import the RHQ-Agent resource into your inventory, the default metrics for the agent will help you track that.
That said, with a JON 3.0.1 install I am still seeing some disparity in the displayed metric values. They all appear to show up after some time, but not initially. I'll continue to look into the timing disparity, but some delay is expected here.
Hi Pavel, can you weigh in on the previous comment? It would help to know which scheduled collection intervals were used in your reproduction steps, and whether the data does show up for you once enough time has passed. Thanks.
Hi Simeon, I used a 1-minute collection interval.
After the collection interval passed there were still NaN and 0 values.
Do the values show up after, say, 10 or 15 minutes? I'm trying to figure out whether the data never loads at all or whether this is only a timing issue. What happens when you select a specific metric and do getLiveData? Does that work?
After 15 minutes the NaN values change to 0 values.
I spent quite a bit of time on this but only found a small unexpected delay on the very first metric collection (which is actually attributed to closed BZs 751231 and 536181). I initially thought this was a sign of a deeper problem, but the delay is intentional and ensures that metric collections happen consistently. After accounting for it, I no longer see any significant failures or delays collecting metric values for the "Drools Knowledge Session" with your test application.
That said, the customer should expect the following initialization delays before seeing valid data:
i) The Drools application resources are not discoverable until the [JMX Server] node has been discovered and imported, and valid connection settings have been entered.
ii) After step i), allow some time for another discovery scan to run, or initiate one, so that the child Drools resources are discovered.
iii) Then wait for, or initiate, an availability scan, because metric collection does not happen for 'down' resources.
iv) Metric collection intervals need to be set to appropriate values, as by default no collection will occur for 10-20 minutes. As currently configured, you will need to change all the metric schedules for the specific session to 30 seconds to ensure prompt collection.
v) Expect the very first collection to take a bit longer than 30s, and account for the delay between the UI request and the start of the actual agent collection.
vi) Expect all initial "per Minute" metrics to take twice as long to show up, as RHQ needs at least two collections before it can report valid data.
Additionally, because of how the metrics are defined in this plugin, it is primarily the "per Minute" metrics that are enabled by default at discovery. For the reasons above, at least two collections, plus the delays described, are needed before valid data is seen.
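Why a "per Minute" metric needs two collections can be illustrated with a small sketch. It assumes the metric is derived from the difference between two consecutive readings of a cumulative counter. That is an assumption made for illustration rather than a description of RHQ internals, and `per_minute_rate` is a hypothetical helper.

```python
import math

def per_minute_rate(samples):
    """Rate per minute from the two most recent samples.

    samples is a list of (timestamp_seconds, cumulative_count) tuples,
    oldest first. Hypothetical helper, not an RHQ API.
    """
    if len(samples) < 2:
        # A single reading gives no interval to compute a rate over.
        return math.nan
    (t0, c0), (t1, c1) = samples[-2], samples[-1]
    if t1 <= t0:
        # Guard against duplicate or out-of-order timestamps.
        return math.nan
    return (c1 - c0) * 60.0 / (t1 - t0)

# After the first collection there is still nothing to report.
print(per_minute_rate([(0, 100)]))             # nan
# After the second collection, 12 events in 30 s is 24 per minute.
print(per_minute_rate([(0, 100), (30, 112)]))  # 24.0
```

Under this assumption, a NaN in a "per Minute" column after only one collection is expected behavior.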
Given the above explanation, we should retest this BZ with the following to see if this is still considered a problem:
- set/enable all metrics on the "Drools Knowledge Sessions" to 30s.
- allow a minute or more for the very first collection, depending on whether the collection request is honored immediately or on the next request.
- expect the "per Minute" metrics to take longer to be collected than the other metrics.
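For the retest, a rough lower bound on how long to wait before treating NaN as a failure can be estimated as follows. This is a simplified sketch: `earliest_valid_data_s` is a hypothetical helper, and it assumes that the first collection completes after a fixed initial delay and that later collections follow at the configured interval. Real timings will vary with the agent and server.

```python
def earliest_valid_data_s(interval_s, first_collection_delay_s, per_minute):
    """Seconds to wait, after enabling a schedule, before valid data can
    appear (hypothetical estimate, not an RHQ guarantee)."""
    # "per Minute" metrics need two collections; plain values need one.
    collections_needed = 2 if per_minute else 1
    return first_collection_delay_s + (collections_needed - 1) * interval_s

# 30 s schedule, first collection assumed to land about 60 s in.
print(earliest_valid_data_s(30, 60, per_minute=False))  # 60
print(earliest_valid_data_s(30, 60, per_minute=True))   # 90
```

Values that are still NaN well beyond these estimates would be worth reporting back on this BZ.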
Moving this to ON_QA for retesting.
added len to :cc list.
len ... can someone on your team verify this? thanks!
Verified. JON 3.1.2.ER4 and BRMS 5.2.0.GA/5.3.0.GA/5.3.1.CR1.