Red Hat Bugzilla – Bug 980765
RHQ 4.8 Graphs do not show in new charts
Last modified: 2014-04-23 11:22:39 EDT
Created attachment 768131
Logs and stuff
Description of problem:
The new graphs do not show for several resources, especially for grouped resources. Either graphs do not show at all, or only one of a group shows.
Version-Release number of selected component (if applicable):
RHQ 4.7, 4.8
How reproducible:
Always, but to varying degrees depending on the resource. It mostly happens with grouped resources. It also seems to affect mostly the line charts rather than the bar charts, since the bar charts are shown for single, ungrouped resources.
Steps to Reproduce:
1. Log in to RHQ
2. Open RHQ UI to a group or resource
3. Select to display a metric graph from Summary/Metrics
Actual results:
Graph subwindow appears but is empty apart from the timeline selection.
OR: One graph is shown in a group of two. When deselecting the one that is shown, the other graph appears.
Expected results:
Expect to see the graphs for the metric for all resources in the group.
Additional info:
Going to Monitoring > Graphs, the bar charts show, but I guess only an average?
On client side using:
Microsoft Windows Server 2008 R2 Standard
Version 6.1.7601 Service Pack 1 Build 7601
Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz, 2600 MHz
Reproduced in Chrome 25.0.1364.97 and Firefox 21.0.
On RHQ side:
Red Hat Enterprise Linux Server release 6.4 (Santiago)
Linux d26apvl007.test.local 2.6.32-358.6.1.el6.x86_64 #1 SMP Fri Mar 29 16:51:51 EDT 2013 x86_64 x86_64 x86_64 GNU/Linux
Attaching logs from Firebug, Chrome Devtools and
Devtools-1/Firebug-1: Only one graph of group of two shows
Devtools-2/Firebug-2: No graphs show of group
Attaching screenshot of first situation.
Thanks Stian for this information. Investigating...
By the way, thanks for all the wonderful logs and screenshots -- they help greatly.
The strange thing is that the data is getting queried and loaded and drawn as shown by:
FINE: # of child resources: 2
FINE: Adding child composite: AMelding (10590)16431
FINE: Adding child composite: AMelding (10590)16432
Draw nvd3 charts for composite multiline graph
So it doesn't make sense that only one of the lines in the graph is showing, which made me think that it was a timing issue. But the countdown latch ensures that the graph is not drawn until all resources have been loaded. I was also looking for an off-by-one error in the countdown latch code. So perhaps the problem is in the nvd3 library itself. Since I'm getting rid of that code, we can try it out with the new pure d3 version and hopefully all of this disappears.
I will have a new release for you to try out in a couple of days if you are willing to take it for a spin. At any rate, I'm adding extra logging to help diagnose. A most interesting problem ;-)
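For readers following along, here is a minimal sketch of the countdown-latch idea described above. It is written in Python for brevity and is not the RHQ code, which is GWT/Java and is not shown in this thread. The names CountdownLatch, load_then_draw, fetch and draw are hypothetical, and the sketch assumes a single-threaded, callback-driven environment like the browser.

```python
from typing import Callable, Dict, List, Sequence

Points = List[float]


class CountdownLatch:
    """Calls on_done exactly once, after count_down() has run `count` times."""

    def __init__(self, count: int, on_done: Callable[[], None]) -> None:
        if count < 0:
            raise ValueError("count must be non-negative")
        self._remaining = count
        self._on_done = on_done
        if count == 0:
            # Nothing to wait for, so fire immediately.
            on_done()

    def count_down(self) -> None:
        if self._remaining == 0:
            # An extra call is exactly the off-by-one mistake worth catching.
            raise RuntimeError("count_down() called more often than expected")
        self._remaining -= 1
        if self._remaining == 0:
            self._on_done()


def load_then_draw(
    resource_ids: Sequence[str],
    fetch: Callable[[str, Callable[[Points], None]], None],
    draw: Callable[[Dict[str, Points]], None],
) -> None:
    """Start one fetch per resource; draw only after every fetch has reported."""
    loaded: Dict[str, Points] = {}
    latch = CountdownLatch(len(resource_ids), lambda: draw(loaded))

    for rid in resource_ids:
        # Bind rid as a default argument so each callback keeps its own id.
        def on_loaded(points: Points, rid: str = rid) -> None:
            loaded[rid] = points
            latch.count_down()

        fetch(rid, on_loaded)
```

With a latch like this, draw cannot run before the last callback arrives, so a chart that shows only one of two lines would point at the drawing step rather than at the loading step.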
I'll be happy to test anything you throw my way :)
Be aware I am not able to compile code but I can test a new .war or .jar for instance, or replace some files.
Awesome! The power of the community.
any update on this one Mike?
Have you been able to repro the problem on your end?
I am still very much willing to test any build/debugs you have for me :)
Sorry for the long delay. I thought we would have the actual 4.9 release out a couple of weeks ago. I checked in the fixes/rewrite for the multiline chart on 7/9/13 to master (e4acf0f), so if you want to build the code, it is there. As for deployable artifacts, the build process (called 'brew' at Red Hat) is quite involved, as it builds from known, certified artifacts. RHQ 4.9 will be out soon (that's all I can say). In the meantime, I will check whether QE has an alpha build with the desired commits in it.
WRT being able to reproduce the bug on my end: I have not been able to with my VM running Windows 7. So it is important to get this into your hands for testing.
Thank you for your patience!
Here is a link to an RHQ 4.9 snapshot (332 MB), August 8, 2013 build:
I am having problems installing the snapshot :(
Hey - Jay pushed a fix of this to master - could I please get another snapshot to try? Really keen on seeing if this bug is fixed.
Here is another snapshot that you requested. It is the Aug 12, 2013 1:43pm snapshot.
Thanks Mike for the snapshot.
Here's an update after trying it.
- I was able to graph one group of Linux servers, however the lines go outside the graphed area. See attachment.
- Also the hostnames are not selectable like I thought they would be?
- Other graphs do not show like before, and have the same behaviour as reported, i.e. just blank.
Other issues I have with 4.9 that are quite serious:
- I seem to have lost all metric history data since upgrade, even though /opt/rhq/rhq-storage/data and commitlog are large. Only new metrics are shown. Possibly baseline calculation fails, there are entries in log of Overflow Exception trying to bind NaN.
- All JBoss servers are Unavailable; the Agent throws an NPE on ASConnection.NewInstanceFor(...), so I have no JBoss servers to try graphs on. This includes the RHQ Server instance.
- Chrome throws a nasty JS error popup when loading the start page, something about compile time user.agent value (Gecko) not matching runtime user.agent value (Safari)? I can still log in though. Firefox is OK but very slow.
- Log throws WARN on invalid module options on the LDAP module (minor issue)
Basically there is so much wrong with this at the moment I am not sure I want to risk going further, not to mention reporting the numerous other issues I discovered with 4.9 :(
Created attachment 786468
RHQ 4.9 Linux Load Graph test
(In reply to Stian Lund from comment #11)
> Thanks Mike for the snapshot.
> Here's an update after trying it.
> - I was able to graph one groups of Linux servers, however the lines go
> outside the graphed area. See attachment.
Strange. I have never seen this before. I'm not sure how this is even possible with the graph library.
> - Also the hostnames are not selectable like I thought they would be?
I will have to add this back in once we are happy with this multi-line chart solution (which is a new rewrite of old solution).
> - Other graphs do not show like before, and have the same behaviour as
> reported, i.e. just blank.
> Other issues I have with 4.9 that are quite serious:
> - I seem to have lost all metric history data since upgrade, even though
> /opt/rhq/rhq-storage/data and commitlog are large. Only new metrics are
> shown. Possibly baseline calculation fails, there are entries in log of
> Overflow Exception trying to bind NaN.
I'm not sure which upgrade process you went through, and I'm not the expert on this issue, so I can't answer this question. For this dev build I would assume a fresh install, to avoid any other issues.
> - All Jboss servers are Unavailable, Agent throws NPE on
> ASConnection.NewInstanceFor(...) So I have no Jboss servers to try graphs
> on.This includes the RHQ Server instance.
Something very wrong with this build. All bets off at this point.
> - Chrome throws a nasty JS error popup when loading the start page,
> something about compile time user.agent value (Gecko) not matching runtime
> user.agent value (Safari)? I can still log in though. Firefox is OK but very
Not an error. We only build for Firefox (to minimize the build times).
> - Log throws WARN on invalid module options on the LDAP module (minor issue)
> Basically there is so much wrong with this at the moment I am not sure I
> want to risk going further, not to mention reporting the numerous other
> issues I discovered with 4.9 :(
I don't think we will do any more testing with the dev builds, as it causes more confusion than it's worth. Perhaps the build was pulled at a bad time.
We appreciate you taking the time to do this testing and highlighting some issues. I will continue to try to reproduce your strange graph issues.
No worries Mike, I will try some stuff on my side as well.
For instance clearing all metric data, it might work to start from a fresh set - do you know how to do this with Cassandra? I suspect we might have some bad data in there.
I will try to get some more "weird" examples for you, as well as seeing if I get some understandable message in Firebug, and seeing if there are some things to do on the browser side, trying Opera for instance.
You say:
> Not an error. We only build for FireFox (to minimize the build times).
Ok, but do you test with IE9? According to what I have read, IE9 is supported (well, as far as it goes)? The error with the msgbox is there as well. Firefox probably has the same error, it just does not pop up an error dialog.
Also - Do you think it's an idea to post some of the stuff I'm seeing with the dev build on BZ? I can't find any of them with a quick search, and I would not want them to be in the final if no-one catches them.
(In reply to Stian Lund from comment #14)
> No worries Mike, I will try some stuff on my side as well.
> For instance clearing all metric data, it might work to start from a fresh
> set - do you know how to do this with Cassandra? I suspect we might have
> some bad data in there.
Let me ask a coworker the best way to clear out the metrics data. Will get back to you on this...
> I will try to get some more "weird" examples for you, as well as seeing if I
> get some understandable message in Firebug, and seeing if there are some
> things to do on the browser side, trying Opera for instance.
Yes, the Firebug logs help a whole lot! Please include information about the environment so we can get as close as possible to reproducing it.
> You say:
> > Not an error. We only build for FireFox (to minimize the build times).
> Ok, but do you test with IE9? According to stuff I read IE9 is supported
> (well far as it goes)? The error with the msgbox is there as well. Firefox
> probably has the same error, just does not pop up an error dialog.
IE9 is supported. This Hudson build is just a quick check of things, and since it is done many times a day, it only compiles the GWT stuff for Firefox to keep it quick. So IE9 will not work on this build.
I'm talking with QE to get a different build (maybe not built so often) that supports more browsers. Stay tuned.
> Also - Do you think it's an idea to post some of the stuff I'm seeing with
> the dev build on BZ? I can't find any of them with a quick search, and I
> would not want them to be in the final if no-one catches them.
By all means, go ahead and add that stuff to this BZ.
Thanks for being patient and especially helping make RHQ better!
Regarding the clearing of metric data, I spoke with a coworker who informed me that this is not there yet. For now, you need to reinstall:
1) Remove rhq-data/
2) bin/rhqctl install
It has come to my attention via Heiko that one of the issues you were seeing:
"Possibly baseline calculation fails, there are entries in log of Overflow Exception trying to bind NaN."
was most likely:
This is currently being worked on.
Thanks Mike- I did a search of BZ but wasn't able to find that one. It very much looks like the issue I have, every hour, but can't remember seeing it in 4.8.
The AS7 issue I'm seeing is very serious, but I figure it is so obvious someone else must have seen it, right?
I might post the LDAP issue but it's more a minor annoyance that it gives a WARN.
I will get a chance to do a complete reinstall early next week - I will clear all metric data (just delete the folder), and see if it helps.
Thanks again Mike,
have a good weekend :)
Hi Mike -
an update: I tried doing a clean install with clearing all metric data under rhq-storage. It still doesn't seem to display graphs for grouped resources unfortunately... :(
I also made a BZ for the bug with AS7 plugin - I stated the RHQ server also was Unavailable but apparently it does show up, and it is running EAP, which apparently works. But all JBoss AS7 fail to connect.
I think this one is quite important so if you could hint at someone to have a look at it that would be appreciated ;)
Hello Again -
I have now had help from Thomas to fix the bug with AS7 servers, so I have more resources to test with.
I have tried a lot of different things now;
- Resetting all metric data with a clean install
- Clearing Agents persistent data (for BZ 998842)
- Trying different browsers (IE, Firefox, Chrome)
- Logging in through localhost on Linux to avoid potential firewall/network issues
- Logging in with rhqadmin to see if it is an auth issue (we normally use LDAP)
For some reason, the only charts I am able to create are at the top level of groups, i.e.
- I have a group of 19 Linux hosts and I can chart stuff like User Load, Free Memory etc (you saw an example earlier)
- I can go to the top level of a group of two AS7 and chart Max. Request Time. But any lower level resources fail to chart.
Could this be a clue - that top-level grouped resources do work (albeit with errors) while lower-level ones (children) do not?
Also, it seems there's a limit to the number of resources I can chart here. I have a group of 40 JBoss AS7 servers, a couple of them Down, and I get an uncaught exception in the background UI when I tried to load the chart.
I am attaching some more logs from Firebug as well as logs from the server.
I find it really strange that no-one else has been able to repro this.
Hope this helps a bit to nail this one.
Created attachment 789503
RHQ Server logs
Created attachment 789504
Ok great! This info should help nail this one down. I will update you next week. Thanks for your legwork here, Stian!
Hello Mike & all :)
According to "rumour", 4.9 is "just outside the door" - so I am wondering if there's been any progress on this issue? Have you been able to repro on your side?
Honestly no, between cramming for the release and vacation time on my part :-0. As 4.9 is released today, those changes did not get into that release. I will be able to devote time to it shortly, now that the release is out the door.
Thanks for the update Mike, I can live with that :)
And I will probably test in 4.9 during the next week anyway, maybe I get lucky and something Magic happens ;)
Let me know once you got something to test, the easiest would be to replace a jar/war or something instead of building a full snapshot, if possible.
Hey, a little update.
I have (finally) been able to get 4.9 running, and it seems the same problem is still there.
However, I guess 4.10 is being worked on and should contain some fixes so hopefully once it's out I would be able to test.
If there was a way to test just by replacing relevant bits of the rhq.ear then I could try that - however I don't think I will risk using a snapshot ;)
Thanks for your help and patience in testing this!
Picking up from previous conversation:
"Could this be a clue - that top-level grouped resources do work (albeit with errors) while lower-level ones (children) do not?"
Is the group a compatible group? All resources in the group are of exactly the same type (and have exactly the same metrics)?
"Also, it seems there's a limit to the number of resources I can chart here. I have a group of 40 JBoss AS7 servers, a couple of them Down, and I get an uncaught exception in the background UI when I tried to load the chart."
This could be an issue, as there is a color palette of 20. I will investigate this more.
Also, when looking through the logs that you provided, I noticed there were many baseline recalculation errors on the server. That code has been totally rewritten now, and it would be great to retest it in 4.10.
> Is the group a compatible group?
Yes - I usually create logical groups consisting of JBoss AS7 servers, Linux platforms, Tomcat servers. They are not necessarily exactly the same since different AS7 servers could have different deployments etc. But they are Compatible groups consisting of the same type of resources.
> This could be an issue, as there is a color palette of 20. I will investigate this more.
Maybe you could just limit the max displayed to 20 like it's done in 4.5.1. And then allow checkboxes to select the ones you want if it's a large group of resources.
For RHQ 4.5.1 I think it just throws an exception if you try to chart very large groups. Probably for performance reasons, and a huge number of lines in a single graph just looks messy anyway...
Author: Mike Thompson <firstname.lastname@example.org>
Date: Tue Mar 11 10:22:06 2014 -0700
[BZ 980765] Groups with more than 20 resources don't display in Multi-line composite graph.
The d3 color palette gives us 20 shades of colors; this, along with the fact that waiting for more than n + 1 queries to execute (even though they are in parallel, the browser is still single-threaded), makes graphing groups larger than 20 resources prohibitive. This fix places a maximum bound of 20 resources on the multi-line graph.
@Stian, you will be happy to know that this makes the 4.10 release.
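As an illustration only, the bound described in the commit message, combined with the earlier suggestion of letting the user pick which resources to chart, could look roughly like the sketch below. It is Python rather than the actual GWT/d3 code, and MAX_SERIES, select_series, assign_colors and the placeholder palette are all hypothetical names, not anything taken from the RHQ source.

```python
from typing import Dict, Iterable, List, Optional, Sequence, Tuple

MAX_SERIES = 20  # assumed to equal the palette size named in the commit message


def select_series(
    resource_ids: Sequence[int],
    selected: Optional[Iterable[int]] = None,
    limit: int = MAX_SERIES,
) -> Tuple[List[int], List[int]]:
    """Split resource ids into (charted, skipped), charting at most `limit`."""
    if selected is None:
        candidates = list(resource_ids)
    else:
        # Keep the group's own ordering, restricted to the user's picks.
        wanted = set(selected)
        candidates = [rid for rid in resource_ids if rid in wanted]
    charted = candidates[:limit]
    charted_set = set(charted)
    skipped = [rid for rid in resource_ids if rid not in charted_set]
    return charted, skipped


def assign_colors(charted: Sequence[int], palette: Sequence[str]) -> Dict[int, str]:
    """Give each charted resource its own palette entry, failing loudly if short."""
    if len(charted) > len(palette):
        raise ValueError("more series than palette colors")
    return dict(zip(charted, palette))
```

Returning the skipped ids as well lets a UI tell the user which members of a large group were left off the chart instead of failing with an uncaught exception.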
Absolutely happy to hear that Mike :)
Does this fix also affect the problem with graphs of groups consisting of fewer resources? It affects all sub-resources of groups, even if they consist of only two members. Top-level shows, sub-levels do not.
Happy to test something if it is possible to just replace some WAR/JAR files :)
WRT: "Top-level shows, sub-levels do not" - I'm assuming that you are referring to recursive compatible groups? If that is the case, I was not aware of that as a requirement.
The 4.10 release is imminent, so please test that and file a new BZ for this sub-level issue if you still find it, as this BZ is getting long and has multiple things in it that have been fixed (you can link this BZ to the new BZ).
WRT testing artifacts: our QE is close to having artifacts continuously built from dev builds. Here is the in-progress URL for that: https://drone.io/github.com/ahovsepy/test-drone/11
We will announce that on the mailing list once complete.
I've now had a chance to test with RHQ 4.10 and it seems we still have the problem with recursive groups, where the sub-levels are not able to be graphed.
I am really puzzled by this, that no-one else has experienced it, because it seems to be consistent and I can't figure out what is wrong with our installation that might be causing an error like this.
I will open a new BZ for grouped resources (even though this BZ was originally about the same problem :)
Thanks Stian! I know you put a lot of effort into testing and documenting this.
This way it will get prioritised.
Bulk closing of 4.10 issues.
If an issue is not solved for you, please open a new BZ (or clone the existing one) with a version designator of 4.10.