Description of problem: The Measurement Cache Element Count is sometimes incorrect. With an Alert Template Definition that generates 5 conditions per agent across three agents, the Measurement Cache Element Count during correct operation is 15. Over time the count fluctuated as high as 20 even though no new alert conditions were added to the system. At some point during alert operations the cache is not being cleared out correctly.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Load three agents into a JON server and remove all Alert conditions for all agents, and verify by navigating to the JBoss AS : RHQ Server instance > Alert Subsystem that Measurement Cache Element Count is 0.
2. Import JBoss AS instances that have DefaultDS datasources (for example) into inventory, then modify the Alert Templates for this same datasource to have three different alert conditions such that: i) one alert always fires; ii) the total Measurement Cache Element Count for all three is 15 or some verifiable multiple of the number of agents.
3. Monitor the Cache Element Counts over several hours to reproduce the behavior. You may need to restart agents and servers to reproduce it.
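The monitoring in step 3 can be sketched as a small polling script. This is only an illustration under stated assumptions: `read_count` is a hypothetical callable that returns the current Measurement Cache Element Count (how it obtains the value from the server is not covered here), and the `Sample`, `expected_count`, `poll`, and `find_deviations` names are made up for this example.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Sample:
    timestamp: float  # seconds since the epoch when the value was read
    count: int        # observed Measurement Cache Element Count


def expected_count(agents: int, conditions_per_agent: int) -> int:
    """Expected count when every agent carries the same number of conditions."""
    return agents * conditions_per_agent


def poll(read_count: Callable[[], int], interval_s: float, iterations: int,
         sleep: Callable[[float], None] = time.sleep,
         clock: Callable[[], float] = time.time) -> List[Sample]:
    """Read the count `iterations` times, waiting `interval_s` between reads.

    `read_count` is a hypothetical hook supplied by the caller.
    """
    samples = []
    for i in range(iterations):
        samples.append(Sample(clock(), read_count()))
        if i < iterations - 1:
            sleep(interval_s)
    return samples


def find_deviations(samples: Iterable[Sample], expected: int) -> List[Sample]:
    """Return every sample whose count differs from the expected value."""
    return [s for s in samples if s.count != expected]
```

With three agents and 5 conditions per agent, `expected_count(3, 5)` gives the 15 used in this report, and any sample returned by `find_deviations` marks a point where the cache count drifted.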
Actual results: Cache count rose to 20 and stayed there. The correct number of alerts was not being generated either.
Expected results: Consistent cache count.
Additional info: The alerts used above were based on alert conditions relative to baseline values. A baseline value must be set for successful alert generation.
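To illustrate why a baseline has to exist before such a condition can fire, here is a minimal sketch of a baseline-relative check. It is not the server's implementation: the function name, the percentage-of-baseline comparison, and the use of `None` for "no baseline yet" are all assumptions made for this example.

```python
from typing import Optional


def baseline_condition_met(value: float, baseline: Optional[float],
                           percent: float) -> Optional[bool]:
    """Hypothetical check: does `value` exceed `percent`% of the baseline?

    Returns None when no baseline has been set, because the condition
    cannot be evaluated at all in that case.
    """
    if baseline is None:
        return None
    return value > baseline * percent / 100.0
```

In this sketch a missing baseline yields `None` rather than `False`, which mirrors the observation above that no alerts can be generated until a baseline is set.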
Created attachment 515056
Measurement Cache Element should never go above 15, but hit 20 here.
The Measurement Cache Element Count was at 15 during correct alert generation and no new conditions were added, but the cache count rose to 20. The logs show cache recalculation in which the additional cache elements are still being found.
Extra cache elements could cause extra or missing alert notifications.
(4:10:22 PM) jshaughn: ccrouch: the spinder alert bzs are basically done. Today we closed one (725320) as a duplicate of yesterday's resolved issue. Also, we created 726202 for another behavior he was tracking and I've got that fixed (you can review).
(4:11:17 PM) jshaughn: it leaves only 725429 and whatever issues may still get reported from the customer.
(4:11:21 PM) ccrouch: good stuff jshaughn spinder
(4:12:54 PM) ccrouch: nice catch on https://bugzilla.redhat.com/show_bug.cgi?id=726202
(4:13:10 PM) jshaughn: as for 725429, I'm not sure about that one. I'd suggest we defer work there as I have not been able to recreate it, nor is the observed behavior, I think, obviously linked to an alerting issue.
(4:14:25 PM) ccrouch: jshaughn: the customer has seen it though right? Every 24hrs? Or was that a graph of something else?
(4:14:44 PM) jshaughn: as for the customer's issue, I never really saw that exact behavior. So, between 725429 and their report there may be something lurking. On the other hand, both could be innocuous.
(4:15:07 PM) jshaughn: or, it could have been related to yesterday's issue.
(4:15:15 PM) jshaughn: resolved by the slowed restart
(4:15:30 PM) ccrouch: are you saying that it's perfectly ok for the cache element count to vary?
(4:15:35 PM) jshaughn: I'm not sure. I think we need them to come back to us
(4:15:54 PM) ccrouch: ...for the cache element count to vary?
(4:15:55 PM) jshaughn: after the whole perf issue is resolved
(4:16:32 PM) ccrouch: they were certainly in a pretty bad state
(4:17:02 PM) jshaughn: the cache element count maybe should not vary but I'm not sure. As the agent caches are reloaded at db maintenance time, perhaps the dip is related to the reload.
(4:17:34 PM) ccrouch: jshaughn: but you've not been able to trigger it?
(4:17:46 PM) jshaughn: I've not seen it yet
(4:18:21 PM) jshaughn: but my reloads are very fast because I'm not built out like they are, and I don't generate millions of alerts and crush my db
(4:18:38 PM) ccrouch: right, but then neither was spinder
(4:18:49 PM) ccrouch: so i guess its a 1-1 draw so far
(4:18:50 PM) jshaughn: spinder did not report that issue in his bzs
(4:19:14 PM) jshaughn: if he has seen it then we should definitely pursue it further
(4:19:38 PM) ccrouch: i'm sorry i'm talking about the cache element count changes
(4:19:43 PM) jshaughn: or, if they come back with that issue again, after the server restart tweak and resolved perf
(4:20:11 PM) jshaughn: the customer complained about a dip, and of missing alerts
(4:20:36 PM) jshaughn: simeon claimed to see a higher than expected cache size.
(4:21:13 PM) spinder: yep. I'm not sure how much of that was related to my agent<->server mismatch though.
(4:21:25 PM) ccrouch: what agent/server mismatch?
(4:21:41 PM) jshaughn: the customer dip, maybe, was a product of agents that had been lost due to a server restart. I don't know. Or, it may certainly be another, real, issue.
(4:22:16 PM) ccrouch: jshaughn: i see, you are differentiating between going up and going down. I was merging them together into "count changed"
(4:22:17 PM) ccrouch: i see your point
(4:22:52 PM) ccrouch: jshaughn: but regardless counts were steady for you?
(4:22:57 PM) spinder: ccrouch: 725445. Basically if you ever find your agent count not correct ... it could affect your agent cache count numbers I believe.
(4:24:38 PM) jshaughn: ccrouch: for me, I have yet to see an unexpected cache size, other than due to the problems we've resolved.
(4:24:50 PM) ccrouch: great ok
(4:25:21 PM) ccrouch: so do you want to close as cannot repro ?
(4:26:21 PM) ccrouch: 725429 i mean
(4:27:07 PM) ccrouch: also https://bugzilla.redhat.com/show_bug.cgi?id=725445 can presumably be closed as wontfix? and we'll pick everything up in https://bugzilla.redhat.com/show_bug.cgi?id=725881 ?
(4:30:55 PM) ccrouch: jshaughn: ^ ?
(4:32:09 PM) jshaughn: we can leave 725429, I think. It's basically unexplained. Or, have simeon try to reproduce it again.
(4:32:40 PM) jshaughn: but I recommend we don't work on it now as I'm not sure it's related to any actual customer or alerting issue.
(4:33:07 PM) ccrouch: right, thats why i'm hesitant to keep it open, if it cant be reproduced
(4:33:24 PM) jshaughn: ask spinder what he wants to do
(4:33:48 PM) jshaughn: 725445 can be closed as wontfix but yeah, the one I added should be done for jon3
(4:34:09 PM) spinder: close it. I've reinstalled three times already. I'm sure it happened. Mazz witnessed it. I'm just not sure what the repro steps are.