Created attachment 817112 [details]
PM402 test results for 3.1.2

Description of problem:
In 3.1.2 it is possible to inventory 75,000 perftest resources with an agent max heap of 1G [1]. The agent now requires a larger heap size.

[1] Perftest configuration: 100 servers x 750 services
-Drhq.perftest.scenario=configurable-1
-Drhq.perftest.server-a-count=100
-Drhq.perftest.service-a-count=750
Agent JRE: openjdk 1.6
VM -Xms64m -Xmx1024m

Version-Release number of selected component (if applicable):
3.2 ER4

How reproducible:

Steps to Reproduce:
1. Install perftest plugin
2. See [1]
3. From the JON UI, import all
heapdump, logs: hudson.23.50:/agent_oom_bz1024397.tar.gz
Correction to #c1: hudson.23.50:/home/hudson/agent_oom_bz1024397.tar.gz
We need to determine whether this is a valid use case. 100 servers, each with 750 children: do we even have known cases where customers have 100 servers managed by one agent, each with 750 direct children? Such a large number of children is a known worst case for the merge/sync code (which was built for speed at the expense of memory usage). However, I'm not sure this test case is something we should care about. We need to test real-world scenarios before concluding that we have performance issues here that should block GA. I would hate to block GA for this, because it appears to be an artificial test that doesn't mimic a real-world scenario. I will run this test as described to see if I can reproduce the OOM. I will then tweak the perftest plugin configuration, keeping the number of resources the same but using a more realistic resource hierarchy.
Larry, can you comment on the validity of this test case? Is this a real-world scenario, in line with the requirements in PM-402, or an unsupported edge case that just needs to be documented?
what concerns me is that it is a regression from JON 3.1.2. there is an expectation that 3.2 will be more robust and scalable.
(In reply to Mike Foley from comment #6)
> what concerns me is that it is a regression from JON 3.1.2. there is an
> expectation that 3.2 will be more robust and scalable.

Yes, more robust. It might have "worked" in 3.1.2, but how long did it take? This BZ's description only says "it used to work" but doesn't say how well it worked. For tens of thousands of resources, the prior code could literally take on the order of hours to complete the import. Now it's faster.
The yield of at least 75,000 resources from a single agent is correct. The test case of 100 servers with 750 services each is not completely valid. The test case that was originally defined should have been:

10 Servers (10 resources) -> 10 Services (100 resources) -> 750 Services (75,000 resources)

Platform
|-- Server01
|   |-- Service01
|   |   |-- ChildService001
|   |   |-- ChildService...
|   |   `-- ChildService750
|   |-- Service...
|   |   |-- ChildService001
|   |   |-- ChildService...
|   |   `-- ChildService750
|   `-- Service10
|       |-- ChildService001
|       |-- ChildService...
|       `-- ChildService750
|-- Server...
|   |-- Service01
|   |   |-- ChildService001
|   |   |-- ChildService...
|   |   `-- ChildService750
|   |-- Service...
|   |   |-- ChildService001
|   |   |-- ChildService...
|   |   `-- ChildService750
|   `-- Service10
|       |-- ChildService001
|       |-- ChildService...
|       `-- ChildService750
`-- Server10
    |-- Service01
    |   |-- ChildService001
    |   |-- ChildService...
    |   `-- ChildService750
    |-- Service...
    |   |-- ChildService001
    |   |-- ChildService...
    |   `-- ChildService750
    `-- Service10
        |-- ChildService001
        |-- ChildService...
        `-- ChildService750
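For completeness, the corrected hierarchy's counts multiply out to the stated totals; a quick sketch (variable names are mine):

```shell
# Sketch: verify the resource counts in the corrected test-case definition.
servers=10
services_per_server=10
children_per_service=750
echo $(( servers ))                                               # servers
echo $(( servers * services_per_server ))                         # mid-level services
echo $(( servers * services_per_server * children_per_service ))  # child services
```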
(In reply to Larry O'Leary from comment #8)
> 10 Servers (10 resources) -> 10 Services (100 resources) -> 750 Services

To mimic this use case, I will run a test using these perftest plugin settings:

-Drhq.perftest.scenario=configurable-deep
-Drhq.perftest.server-deep1-count=10
-Drhq.perftest.sevice-deep2-count=10
-Drhq.perftest.sevice-deep3-count=750
-Drhq.perftest.sevice-deep4-count=0
-Drhq.perftest.sevice-deep5-count=0
-Drhq.perftest.sevice-deep6-count=0

There are probably other perftest settings that could do the same thing, but I'll use those.
Just for completeness, there were typos in the previous comment - it's "-service-", not "-sevice-":

-Drhq.perftest.scenario=configurable-deep
-Drhq.perftest.server-deep1-count=10
-Drhq.perftest.service-deep2-count=10
-Drhq.perftest.service-deep3-count=750
-Drhq.perftest.service-deep4-count=0
-Drhq.perftest.service-deep5-count=0
-Drhq.perftest.service-deep6-count=0
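These settings are typically wired into the agent JVM through its env script; a minimal sketch, assuming RHQ_AGENT_ADDITIONAL_JAVA_OPTS is the hook rhq-agent-env.sh uses for extra JVM arguments (adjust to however your agent script passes -D options):

```shell
# rhq-agent-env.sh (sketch) -- pass the corrected perftest settings to the agent JVM.
# RHQ_AGENT_ADDITIONAL_JAVA_OPTS is an assumption about the env hook name.
RHQ_AGENT_ADDITIONAL_JAVA_OPTS="\
 -Drhq.perftest.scenario=configurable-deep \
 -Drhq.perftest.server-deep1-count=10 \
 -Drhq.perftest.service-deep2-count=10 \
 -Drhq.perftest.service-deep3-count=750 \
 -Drhq.perftest.service-deep4-count=0 \
 -Drhq.perftest.service-deep5-count=0 \
 -Drhq.perftest.service-deep6-count=0"
export RHQ_AGENT_ADDITIONAL_JAVA_OPTS
```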
Wow! I hit OOM almost immediately. Well, I can tell you that I was able to do this many, many times in the past without any issues. Something clearly got broken in the past several months since the last time I was in this code.
Some info from Viet...

- Ran 2 cycles of the 75K resource import. The time was 3:42 (hours:minutes), almost 1 hour longer than the previous 3.1.2 baseline of 2:44.
- At a 2G heap size, the agent OOM was no longer an issue (bz1024397). http://10.16.23.1/perf-dashboard/g-summary.html?172,173

To confirm these numbers, today I'm rerunning the 100x750 configuration against the 3.1.2 build on the same VMs, as well as the new 10x10x750 configuration suggested by Larry in the BZ.
(In reply to John Mazzitelli from comment #11)
> Wow! I hit OOM almost immediately. Well, I can tell you that I was able to
> do this many, many times in the past without any issues. Something clearly
> got broken in the past several months since the last time I was in this code.

OK, I need to step back. I think my problem is that I left the VM heap configuration at the out-of-box max of -Xmx128m. IIRC, back when Jay and I fixed up the import stuff, we increased this. I now notice that this BZ's description shows "VM -Xms64m -Xmx1024m". So I'm going to retest with a bumped-up heap. Clearly, 128MB is not enough for all these resources - for some reason I thought we shipped the agent with a high heap size out of the box, but that isn't the case.
Hit a server OOM before the agent could complete the import. This is bug #1023451, I think.
JProfiler is telling me that on the server we are getting abnormally large java.io.ObjectInputStream objects. Looking at it, these are coming from JBoss/Remoting. We need to look for a problem in the latest version of Remoting that we upgraded to - I know we've moved up several versions from what was in JON 3.1.2. It's possible this is the cause.

I also noticed that when I refresh the dashboard "inventory summary" portlet every second or two, the service totals tick up in blocks of a couple hundred. In the past, this would tick up in the tens, not hundreds. It's possible something changed in the transaction demarcations in the server SLSBs that is causing Hibernate to bloat memory (I did see a lot of Hibernate objects in one JProfiler report).

I haven't yet looked at the agent with JProfiler, but it's interesting that they both OOM in my tests (even after increasing heap). The common denominator? JBoss/Remoting.
I installed RHQ 4.8 and ran the same test (agent and the 10/10/750 perftest plugin config). The memory footprint looks much different: it's steadily using just about 1 gig, with a sawtooth memory usage pattern (spikes to over 1 gig but then comes back down under it, and back and forth - the pattern to be expected when merging large inventories; it's our batch inserting). The same thing occurs in RHQ 4.9. I will attach a JProfiler memory screenshot to show what I mean. This tells me that something broke post-4.9. I will continue to look to see when it broke.
Created attachment 818066 [details] RHQ 4.9 memory usage
(In reply to John Mazzitelli from comment #16) > I installed RHQ 4.8 and ran the same test (agent and the 10/10/750 perftest > plugin config). The memory footprint looks much different. Its steadily > using just about 1gig with a sawtooth memory usage pattern (spikes to over > 1gig but then comes down to under it, and back and forth - this is the > pattern to be expected when merging large inventories - its our batch > inserting). > > The same thing in RHQ 4.9 occurs. I will attach a memory screenshot of > JProfiler to show what I mean. > > This tells me that something broken post-4.9. I will continue to look to see > when it broke. The mystery continues. I tested with JON 3.2 ER2, ER3 and ER4 builds and still I cannot get it to blow up like I did before. I have the server's heap set to 2gig and the agent's heap set to 1gig. When I was seeing the OOMs, it was with a local build of the release/jon3.2.x branch, but I can't believe that would be the difference. I must be doing something differently but I can't think of what it could be. Perhaps I need to deploy all of the many community plugins not just those that ship in the JON ER builds? Not sure why that would affect things.
> When I was seeing the OOMs, it was with a local build of the
> release/jon3.2.x branch, but I can't believe that would be the difference. I
> must be doing something differently but I can't think of what it could be.
> Perhaps I need to deploy all of the many community plugins not just those
> that ship in the JON ER builds? Not sure why that would affect things.

I rebuilt on my box, and I was able to import all 75k. I don't know what happened, but perhaps I just had a bad local build for that one earlier test of mine. As long as I give the server 2 gig and the agent 1 gig of heap, it runs fine no matter what release or build I use.
Something must be amiss. I ran the test 4 more times against ER4 in the 10x10x750 and 100x750 configurations and didn't see an OOM. There were multiple log messages on the agent side indicating the VM health check was activated and restarted the agent. I suspect that in the OOM cases, heap consumption grew too fast for the VM health thread to start the self-healing.

2013-11-05 04:57:15,873 WARN [RHQ VM Health Check Thread] (org.rhq.enterprise.agent.VMHealthCheckThread)- {VMHealthCheckThread.mem-low}VM health check thread has detected [VM heap] memory has crossed the threshold [0.9] and is low: memory-usage=[init = 1073741824(1048576K) used = 879092136(858488K) committed = 955252736(932864K) max = 955252736(932864K)]
2013-11-05 04:57:15,873 INFO [ConfigurationManager.threadpool-1] (rhq.core.pc.configuration.ConfigurationCheckExecutor)- New configuration version detected on resource: Resource[id=32203, uuid=8f5a3be7-fb37-4838-a74c-77f81f14a4a0, type={PerfTest}service-a, key=service-a-613, name=service-a-613, parent=server-a-56, version=1.0]
2013-11-05 04:57:15,874 WARN [RHQ VM Health Check Thread] (org.rhq.enterprise.agent.VMHealthCheckThread)- {VMHealthCheckThread.gc}VM health check thread is invoking the garbage collector to see if more memory can be freed
2013-11-05 04:57:15,874 INFO [ConfigurationManager.threadpool-1] (rhq.core.pc.configuration.ConfigurationCheckExecutor)- New configuration version detected on resource: Resource[id=32204, uuid=f1f4b6e8-bcdc-46b1-9e6d-cc9337259106, type={PerfTest}service-a, key=service-a-659, name=service-a-659, parent=server-a-56, version=1.0]
2013-11-05 04:57:18,791 WARN [RHQ VM Health Check Thread] (org.rhq.enterprise.agent.VMHealthCheckThread)- {VMHealthCheckThread.mem-low}VM health check thread has detected [VM heap] memory has crossed the threshold [0.9] and is low: memory-usage=[init = 1073741824(1048576K) used = 878649720(858056K) committed = 955252736(932864K) max = 955252736(932864K)]
2013-11-05 04:57:18,791 FATAL [RHQ VM Health Check Thread] (org.rhq.enterprise.agent.VMHealthCheckThread)- {VMHealthCheckThread.mem-problem}VM health check thread sees that memory is critically low and will try to reboot the agent
2013-11-05 04:57:18,792 INFO [RHQ VM Health Check Thread] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.shutting-down}Agent is being shut down...
2013-11-05 04:57:18,792 INFO [RHQ Primary Server Switchover Thread] (org.rhq.enterprise.agent.AgentMain)- {PrimaryServerSwitchoverThread.stopped}The primary server switchover thread has stopped.
2013-11-05 04:57:18,792 INFO [RHQ VM Health Check Thread] (org.rhq.core.pc.PluginContainer)- Plugin container is being shutdown...
2013-11-05 04:57:18,792 INFO [RHQ VM Health Check Thread] (org.rhq.core.pc.PluginContainer)- Plugin container shutdown will wait up to 300 seconds for shut down background threads to terminate.
2013-11-05 04:57:18,792 INFO [ConfigurationManager.threadpool-1] (rhq.core.pc.configuration.ConfigurationCheckExecutor)- New configuration version detected on resource: Resource[id=32205, uuid=83f77336-48a1-460b-8dca-e411b652942d, type={PerfTest}service-a, key=service-a-230, name=service-a-230, parent=server-a-56, version=1.0]
2013-11-05 04:57:18,793 INFO
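A quick way to pull these health-check events out of an agent log is a plain grep; a sketch (the agent.log file name is an assumption, demonstrated here against an inline sample line rather than a real log):

```shell
# Sketch: count VM health-check warnings in an agent log.
# The log file name is an assumption; sample lines stand in for a real log.
printf '%s\n' \
  '2013-11-05 04:57:15,873 WARN [RHQ VM Health Check Thread] (org.rhq.enterprise.agent.VMHealthCheckThread)- {VMHealthCheckThread.mem-low}VM heap memory is low' \
  '2013-11-05 04:57:15,873 INFO [ConfigurationManager.threadpool-1] some other line' \
  > agent.log
grep -c 'VMHealthCheckThread' agent.log   # prints 1
```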
We need to start documenting which JVMs we are using and which VM arguments we are passing, because this smells like different behavior due to different VM implementations. When starting the agent, have the RHQ_AGENT_DEBUG=true env var set and capture the command used to launch the agent VM. This will tell us which -XX: VM args are passed in. Also, make sure we document the vendor and version of the VM.

For my case, I'm using:

java version "1.7.0_11"
Java(TM) SE Runtime Environment (build 1.7.0_11-b21)
Java HotSpot(TM) 64-Bit Server VM (build 23.6-b04, mixed mode)

I did not change rhq-agent-env.sh except for adding the -D options for the perftest plugin and changing -Xmx and -Xms; all other options were left as is. Same with the server VM.
Based on the following evidence:
comment 18
comment 19
comment 20

I am removing the blocks-jon-3.2 flag.
Created attachment 820017 [details]
100x750 scenario OOM logs

I know 100 servers x 750 services is not considered a valid scenario. Attaching log files for documentation purposes. The OOM occurred after 75,000 resources were inventoried.
Larry... in comment #22 I removed the GA blocker because both mazz and Viet can run PM-402 as defined. BUT then Viet confirms that he can OOM the product in comment #23. To be pedantic, the scenario he is running is *not* PM-402, but still, it is an OOM. I think it may need to be investigated more fully. I am going to go ahead and put the BZ back on the GA blocker... but this may require a team decision (you/Heiko/Mark/Alan) on whether you want this to block GA or not.
Can we confirm that what is described in comment 23 is a regression from JON 3.1.2?
Just to deconstruct this a bit ... 2 issues here: 1) is the OOM a regression? 2) is the OOM an issue? Viet can speak to question #1.
Re #1: the 100x750 and 10x10x750 tests against 3.1.2.GA are in progress. I'm having an issue with PostgreSQL timing out. Will report back with results.
I was able to run 2 import scenarios against JON build id 263443 (post 3.1.2.GA). The agent ran OOM hours after inventory completed.

Agent JVM: openjdk 1.6.0_24
Agent heap: -Xms64m -Xmx1024m
JON heap: -Xms4G -Xmx4G

A. 10x10x750 test (JON JVM = openjdk 1.7)
Inventory - start 17:49 - end 20:35 (2:46 elapsed time) - OOM 23:49

B. 100x750 test
Inventory - start 15:58 - end 18:12 (2:14 elapsed time) - OOM 19:52

I recommend 1. running the same tests against the latest ER build and observing memory behavior after inventory completion, and 2. giving more heap. The ER build will complete both topologies without OOM, but it takes significantly longer (30-45 minutes). This is worth looking into.
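The elapsed times above can be sanity-checked with a small helper; a sketch (the function name is mine, and it assumes both timestamps fall on the same day):

```shell
# Sketch: compute H:MM elapsed time between two same-day H:MM timestamps.
elapsed() {
  s=$(( $(echo "$1" | cut -d: -f1) * 60 + $(echo "$1" | cut -d: -f2) ))
  e=$(( $(echo "$2" | cut -d: -f1) * 60 + $(echo "$2" | cut -d: -f2) ))
  d=$(( e - s ))
  printf '%d:%02d\n' $(( d / 60 )) $(( d % 60 ))
}
elapsed 17:49 20:35   # 2:46
elapsed 15:58 18:12   # 2:14
```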
Documenting from the 11/8/2013 meeting (Foley, Viet, Spinder, Heiko, Larry), paraphrasing...

Larry: a performance regression exists, but is not important for the customer.
Heiko/Spinder: valid reasons due to changes in query criteria can explain the performance change from JON 3.1.2 with the hot patch.
Viet: OOMs do occur, sporadically, after the 75K goal is met. Minimally, PM-402 is met in both the deep resource hierarchy and the horizontal resource hierarchy.

Consensus agreement: PM-402 is met for JON 3.2. Mark this complete. Remove the JON 3.2 blocker flag.
Both tests are passing per the above. The slowdown (wrt 3.1.2) may come from other changes, like:

- a different HTTP connection provider implementation in the as7-plugin
- chunking of reports sent to the server
- some query changes on the server side
- performance enhancements for paging through results

These could legitimately cause the import to slow down and require a larger heap. More specifically, fixes for the following are in JON 3.2.0 and produce more complete resource lists:
i) In many places in the code we only parsed the first 200 results
ii) Some paged results returned duplicate resource ids
This is now a doc issue... we need to document the 2GB heap size for the agent when importing a large number of resources.
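For the docs, the agent heap is typically raised in rhq-agent-env.sh; a minimal sketch, assuming RHQ_AGENT_JAVA_OPTS is the variable that controls the agent JVM's heap settings (any other default options in that variable should be preserved):

```shell
# rhq-agent-env.sh (sketch) -- raise the agent heap for large inventory imports.
# Only -Xms/-Xmx are the point here; keep whatever other defaults ship in the file.
RHQ_AGENT_JAVA_OPTS="-Xms512m -Xmx2048m"
export RHQ_AGENT_JAVA_OPTS
```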
I tried to summarize the different inventory structures and performance here: https://access.redhat.com/site/documentation/en-US/Red_Hat_JBoss_Operations_Network/3.2/html/Admin_and_Config/performance.html#inventory-baselines

Tuning the heap is listed as an optional performance tuning procedure here: https://access.redhat.com/site/documentation/en-US/Red_Hat_JBoss_Operations_Network/3.2/html/Admin_and_Config/agent-heap-size.html