Bug 1024397 - Agent OOM at 75K perftest resources -- PM-402 retest with JON 3.2 ER4
Status: CLOSED CURRENTRELEASE
Product: JBoss Operations Network
Classification: JBoss
Component: Documentation
JON 3.2
x86_64 Linux
Priority: high  Severity: high
Target Milestone: ER06
Target Release: JON 3.2.0
Assigned To: Deon Ballard
Mike Foley
: Documentation
Depends On: 1023451 1029688
Blocks: 1012435 1030066
Reported: 2013-10-29 11:10 EDT by Viet Nguyen
Modified: 2014-09-05 11:40 EDT (History)
CC: 7 users

Doc Type: Bug Fix
Last Closed: 2014-09-05 11:40:32 EDT
Type: Bug


Attachments
PM402 test results for 3.1.2 (24.12 KB, image/png), 2013-10-29 11:10 EDT, Viet Nguyen
RHQ 4.9 memory usage (103.76 KB, image/jpeg), 2013-10-31 16:39 EDT, John Mazzitelli
100x750 scenario OOM logs (1.10 MB, application/zip), 2013-11-05 16:24 EST, Viet Nguyen

Description Viet Nguyen 2013-10-29 11:10:20 EDT
Created attachment 817112 [details]
PM402 test results for 3.1.2

Description of problem:

In 3.1.2 it was possible to inventory 75,000 perftest resources with an agent max heap of 1 GB [1]. In 3.2 ER4, the agent now requires a larger heap.


[1] Perftest configuration:

100 servers x 750 services

-Drhq.perftest.scenario=configurable-1 -Drhq.perftest.server-a-count=100 -Drhq.perftest.service-a-count=750


Agent JRE: openjdk 1.6
VM -Xms64m -Xmx1024m
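For reference, this scenario can be wired into the agent's environment script. A minimal sketch, assuming the standard rhq-agent-env.sh variable names (RHQ_AGENT_JAVA_OPTS and RHQ_AGENT_ADDITIONAL_JAVA_OPTS); treat those names as assumptions if your build differs:

```shell
# rhq-agent-env.sh (sketch): heap settings and perftest scenario for this test.
# Variable names are assumed from the standard RHQ agent environment script.
RHQ_AGENT_JAVA_OPTS="-Xms64m -Xmx1024m"
RHQ_AGENT_ADDITIONAL_JAVA_OPTS="-Drhq.perftest.scenario=configurable-1 -Drhq.perftest.server-a-count=100 -Drhq.perftest.service-a-count=750"
export RHQ_AGENT_JAVA_OPTS RHQ_AGENT_ADDITIONAL_JAVA_OPTS
```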




Version-Release number of selected component (if applicable):

3.2 ER4

How reproducible:


Steps to Reproduce:
1. Install perftest plugin
2. See [1]
3. From JON UI import all
Comment 1 Viet Nguyen 2013-10-29 11:11:23 EDT
heapdump, logs:  hudson@10.16.23.50:/agent_oom_bz1024397.tar.gz
Comment 3 Viet Nguyen 2013-10-29 12:00:38 EDT
Correction to comment 1:

 hudson@10.16.23.50:/home/hudson/agent_oom_bz1024397.tar.gz
Comment 4 John Mazzitelli 2013-10-29 13:41:00 EDT
We need to determine if this is a valid use-case or not. 100 servers each with 750 children. Do we even have known cases where customers have 100 servers managed by an agent each with 750 direct children?

Such a large number of children is known to be a worst case for the merge/sync code (which was built for speed at the expense of memory usage) - however, I'm not sure this test case is something we should care about.

We need to test real-world scenarios before concluding that we have performance issues here that should block GA. I would hate to block GA for this, because this appears to be an artificial test that doesn't mimic a real-world scenario.

I will run this test as described to see if I can reproduce the OOM. I will then tweak the perftest plugin configuration, keeping the number of resources the same but using a more realistic resource hierarchy.
Comment 5 Mike Foley 2013-10-29 13:45:57 EDT
larry, can you comment on the validity of this test case?  is this a real-world scenario?  in line with the requirements in pm-402?  or an unsupported edge case that just needs to be documented?
Comment 6 Mike Foley 2013-10-29 14:07:42 EDT
what concerns me is that it is a regression from JON 3.1.2.  there is an expectation that 3.2 will be more robust and scalable.
Comment 7 John Mazzitelli 2013-10-29 14:14:33 EDT
(In reply to Mike Foley from comment #6)
> what concerns me is that it is a regression from JON 3.1.2.  there is an
> expectation that 3.2 will be more robust and scalable.

Yes, more robust. It might have "worked" in 3.1.2, but how long did it take? This BZ's description only says "it used to work" but doesn't say how well it worked.

For tens of thousands of resources, in our prior code it literally could take on the order of hours to completely import. Now it's faster.
Comment 8 Larry O'Leary 2013-10-29 19:23:11 EDT
The yield of at least 75,000 resources from a single agent is correct. The test case of 100 servers with 750 services each is not completely valid. The test case should originally have been defined as:

10 Servers (10 resources) -> 10 Services (100 resources) -> 750 Services (75,000 resources)

Platform
|-- Server01
|   |-- Service01
|   |   |-- ChildService001
|   |   |-- ChildService...
|   |   `-- ChildService750
|   |-- Service...
|   |   |-- ChildService001
|   |   |-- ChildService...
|   |   `-- ChildService750
|   `-- Service10
|       |-- ChildService001
|       |-- ChildService...
|       `-- ChildService750
|-- Server...
|   |-- Service01
|   |   |-- ChildService001
|   |   |-- ChildService...
|   |   `-- ChildService750
|   |-- Service...
|   |   |-- ChildService001
|   |   |-- ChildService...
|   |   `-- ChildService750
|   `-- Service10
|       |-- ChildService001
|       |-- ChildService...
|       `-- ChildService750
`-- Server10
    |-- Service01
    |   |-- ChildService001
    |   |-- ChildService...
    |   `-- ChildService750
    |-- Service...
    |   |-- ChildService001
    |   |-- ChildService...
    |   `-- ChildService750
    `-- Service10
        |-- ChildService001
        |-- ChildService...
        `-- ChildService750
Comment 9 John Mazzitelli 2013-10-30 06:23:24 EDT
(In reply to Larry O'Leary from comment #8)
> 10 Servers (10 resources) -> 10 Services (100 resources) -> 750 Services

To mimic this use-case, I will run a test using these perftest plugin settings:

        -Drhq.perftest.scenario=configurable-deep
        -Drhq.perftest.server-deep1-count=10
        -Drhq.perftest.sevice-deep2-count=10
        -Drhq.perftest.sevice-deep3-count=750
        -Drhq.perftest.sevice-deep4-count=0
        -Drhq.perftest.sevice-deep5-count=0
        -Drhq.perftest.sevice-deep6-count=0
 
There are probably other perftest settings that could do the same thing, but I'll use those.
Comment 10 John Mazzitelli 2013-10-30 11:35:59 EDT
Just for completeness: there were typos in the previous comment - it's "-service-", not "-sevice-":

-Drhq.perftest.scenario=configurable-deep
-Drhq.perftest.server-deep1-count=10
-Drhq.perftest.service-deep2-count=10
-Drhq.perftest.service-deep3-count=750
-Drhq.perftest.service-deep4-count=0
-Drhq.perftest.service-deep5-count=0
-Drhq.perftest.service-deep6-count=0
Comment 11 John Mazzitelli 2013-10-30 11:38:34 EDT
Wow! I hit OOM almost immediately. Well, I can tell you that I was able to do this many, many times in the past without any issues. Something clearly got broken in the past several months since the last time I was in this code.
Comment 12 Mike Foley 2013-10-30 12:03:19 EDT
some info from Viet ...



- Ran 2 cycles of the 75K resource import. The times were 3:42 (hours:minutes), almost 1 hour longer than the previous 3.1.2 baseline of 2:44. At a 2G heap size, agent OOM was no longer an issue (bz1024397)

http://10.16.23.1/perf-dashboard/g-summary.html?172,173

To confirm these numbers, today I'm rerunning the 100x750 configuration against the 3.1.2 build on the same VMs, as well as the new 10x10x750 configuration suggested by Larry in the BZ.
Comment 13 John Mazzitelli 2013-10-30 12:32:43 EDT
(In reply to John Mazzitelli from comment #11)
> Wow! I hit OOM almost immediately. Well, I can tell you that I was able to
> do this many, many times in the past without any issues. Something clearly
> got broken in the past several months since the last time I was in this code.

OK, I need to step back. I think my problem is that I left the VM heap configuration at the out-of-box heap max of -Xmx128m. IIRC, back when Jay and I fixed up the import code, we increased this. I now notice that this BZ's description shows "VM -Xms64m -Xmx1024m". So I'm going to retest with a bumped-up heap. Clearly, 128MB is not enough for all these resources - for some reason I thought we shipped the agent with a high heap size out of the box, but that isn't the case.
Comment 14 John Mazzitelli 2013-10-30 12:52:18 EDT
I hit a server OOM before the agent could complete the import. I think this is bug #1023451.
Comment 15 John Mazzitelli 2013-10-30 19:46:34 EDT
JProfiler is telling me that on the server we are getting abnormally large java.io.ObjectInputStream objects. Looking at it, these are coming from JBoss/Remoting. We need to look into a problem with the latest version of Remoting that we upgraded to - we've moved up several versions from what was in JON 3.1.2. It's possible this is the cause.

I also noticed that when I refresh the dashboard "inventory summary" portlet every second or two, the service totals tick up in blocks of a couple hundred. In the past, this would tick up in the tens, not hundreds. It's possible something changed in our transaction demarcations in the server SLSBs that is causing Hibernate to bloat memory (I did see a lot of Hibernate objects in one JProfiler report).

I haven't yet looked at the agent with JProfiler, but it's interesting that they both OOM in my tests (even after increasing heap). The common denominator? JBoss/Remoting.
Comment 16 John Mazzitelli 2013-10-31 16:38:26 EDT
I installed RHQ 4.8 and ran the same test (agent with the 10/10/750 perftest plugin config). The memory footprint looks much different. It steadily uses just about 1 GB with a sawtooth memory usage pattern (it spikes to over 1 GB but then comes back down under it, back and forth; this is the pattern expected when merging large inventories - it's our batch inserting).

The same thing occurs in RHQ 4.9. I will attach a JProfiler memory screenshot to show what I mean.

This tells me that something broke post-4.9. I will continue looking to see when it broke.
Comment 17 John Mazzitelli 2013-10-31 16:39:01 EDT
Created attachment 818066 [details]
RHQ 4.9 memory usage
Comment 18 John Mazzitelli 2013-10-31 20:44:11 EDT
(In reply to John Mazzitelli from comment #16)
> I installed RHQ 4.8 and ran the same test (agent and the 10/10/750 perftest
> plugin config). The memory footprint looks much different. Its steadily
> using just about 1gig with a sawtooth memory usage pattern (spikes to over
> 1gig but then comes down to under it, and back and forth - this is the
> pattern to be expected when merging large inventories - its our batch
> inserting).
> 
> The same thing in RHQ 4.9 occurs. I will attach a memory screenshot of
> JProfiler to show what I mean.
> 
> This tells me that something broken post-4.9. I will continue to look to see
> when it broke.

The mystery continues. I tested with the JON 3.2 ER2, ER3, and ER4 builds and still cannot get it to blow up like I did before. I have the server's heap set to 2 GB and the agent's heap set to 1 GB.

When I was seeing the OOMs, it was with a local build of the release/jon3.2.x branch, but I can't believe that would be the difference. I must be doing something differently, but I can't think of what it could be. Perhaps I need to deploy all of the many community plugins, not just those that ship in the JON ER builds? Though I'm not sure why that would affect things.
Comment 19 John Mazzitelli 2013-11-04 09:36:48 EST
> When I was seeing the OOMs, it was with a local build of the
> release/jon3.2.x branch, but I can't believe that would be the difference. I
> must be doing something differently but I can't think of what it could be.
> Perhaps I need to deploy all of the many community plugins not just those
> that ship in the JON ER builds? Not sure why that would affect things.

I rebuilt on my box, and I was able to import all 75K. I don't know what happened; perhaps I just had a bad local build for that one earlier test of mine. As long as I give the server 2 GB and the agent 1 GB of heap, it runs fine no matter what release or build I use.
Comment 20 Viet Nguyen 2013-11-05 11:37:33 EST
Something must be amiss. I ran the test 4 more times against ER4 in the 10x10x750 and 100x750 configurations and didn't see an OOM. There were multiple log messages on the agent side indicating the VM health check was activated and restarted the agent. I suspect that in the OOM cases heap consumption grew too fast for the VM health thread to start the self-healing.


2013-11-05 04:57:15,873 WARN  [RHQ VM Health Check Thread] (org.rhq.enterprise.agent.VMHealthCheckThread)- {VMHealthCheckThread.mem-low}VM health check thread has detected [VM heap] memory has crossed the threshold [0.9] and is low: memory-usage=[init = 1073741824(1048576K) used = 879092136(858488K) committed = 955252736(932864K) max = 955252736(932864K)]
2013-11-05 04:57:15,873 INFO  [ConfigurationManager.threadpool-1] (rhq.core.pc.configuration.ConfigurationCheckExecutor)- New configuration version detected on resource: Resource[id=32203, uuid=8f5a3be7-fb37-4838-a74c-77f81f14a4a0, type={PerfTest}service-a, key=service-a-613, name=service-a-613, parent=server-a-56, version=1.0]
2013-11-05 04:57:15,874 WARN  [RHQ VM Health Check Thread] (org.rhq.enterprise.agent.VMHealthCheckThread)- {VMHealthCheckThread.gc}VM health check thread is invoking the garbage collector to see if more memory can be freed
2013-11-05 04:57:15,874 INFO  [ConfigurationManager.threadpool-1] (rhq.core.pc.configuration.ConfigurationCheckExecutor)- New configuration version detected on resource: Resource[id=32204, uuid=f1f4b6e8-bcdc-46b1-9e6d-cc9337259106, type={PerfTest}service-a, key=service-a-659, name=service-a-659, parent=server-a-56, version=1.0]
2013-11-05 04:57:18,791 WARN  [RHQ VM Health Check Thread] (org.rhq.enterprise.agent.VMHealthCheckThread)- {VMHealthCheckThread.mem-low}VM health check thread has detected [VM heap] memory has crossed the threshold [0.9] and is low: memory-usage=[init = 1073741824(1048576K) used = 878649720(858056K) committed = 955252736(932864K) max = 955252736(932864K)]
2013-11-05 04:57:18,791 FATAL [RHQ VM Health Check Thread] (org.rhq.enterprise.agent.VMHealthCheckThread)- {VMHealthCheckThread.mem-problem}VM health check thread sees that memory is critically low and will try to reboot the agent
2013-11-05 04:57:18,792 INFO  [RHQ VM Health Check Thread] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.shutting-down}Agent is being shut down...
2013-11-05 04:57:18,792 INFO  [RHQ Primary Server Switchover Thread] (org.rhq.enterprise.agent.AgentMain)- {PrimaryServerSwitchoverThread.stopped}The primary server switchover thread has stopped.
2013-11-05 04:57:18,792 INFO  [RHQ VM Health Check Thread] (org.rhq.core.pc.PluginContainer)- Plugin container is being shutdown...
2013-11-05 04:57:18,792 INFO  [RHQ VM Health Check Thread] (org.rhq.core.pc.PluginContainer)- Plugin container shutdown will wait up to 300 seconds for shut down background threads to terminate.
2013-11-05 04:57:18,792 INFO  [ConfigurationManager.threadpool-1] (rhq.core.pc.configuration.ConfigurationCheckExecutor)- New configuration version detected on resource: Resource[id=32205, uuid=83f77336-48a1-460b-8dca-e411b652942d, type={PerfTest}service-a, key=service-a-230, name=service-a-230, parent=server-a-56, version=1.0]
2013-11-05 04:57:18,793 INFO
Comment 21 John Mazzitelli 2013-11-05 11:41:38 EST
We need to start documenting which JVMs we are using and which VM arguments we are passing, because this smells like different behavior due to different VM implementations.

When starting the agent, have the RHQ_AGENT_DEBUG=true env var set, and capture the command used to launch the agent VM. This will tell us what -XX: VM args are passed in. Also, make sure we document the vendor and version of the VM.
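A minimal sketch of that setup (RHQ_AGENT_DEBUG is the variable named above; setting and exporting it in the shell before running the agent start script is one way to do it):

```shell
# Enable launch-script debugging so the full java command line (including
# the -XX: options) is echoed when the agent starts. Set this in the shell
# (or in rhq-agent-env.sh) before running the agent start script.
RHQ_AGENT_DEBUG=true
export RHQ_AGENT_DEBUG

# Also record the JVM vendor and version alongside the bug report,
# e.g. with `java -version`.
```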

For my case, I'm using:

Java version "1.7.0_11"
Java(TM) SE Runtime Environment (build 1.7.0_11-b21)
Java HotSpot(TM) 64-Bit Server VM (build 23.6-b04, mixed mode)

I did not change rhq-agent-env.sh except for adding the -D options for the perftest plugin and changing -Xmx and -Xms; all other options were left as-is. Same with the server VM.
Comment 22 Mike Foley 2013-11-05 13:17:54 EST
based on the following evidence:
comment 18
comment 19
comment 20

i am removing the blocks jon 3.2 flag
Comment 23 Viet Nguyen 2013-11-05 16:24:42 EST
Created attachment 820017 [details]
100x750 scenario OOM logs

I know 100 servers x 750 services is not considered a valid scenario. Attaching log files for documentation purposes. The OOM occurred after 75,000 resources were inventoried.
Comment 24 Mike Foley 2013-11-06 10:34:42 EST
Larry ... in comment #22 I removed the GA blocker because both mazz and viet can run pm-402 as defined.

BUT

then Viet confirms in comment #23 that he can OOM the product. To be pedantic, the scenario he is running is *not* pm-402, but it is still an OOM. I think it may need to be investigated more fully. I am going to go ahead and put the BZ back on the GA blocker ... but this may require a team decision (you/heiko/mark/alan) on whether you want this to block GA or not.
Comment 25 Larry O'Leary 2013-11-06 11:34:38 EST
Can we confirm that what is described in comment 23 is a regression from JON 3.1.2?
Comment 26 Mike Foley 2013-11-06 13:39:52 EST
Just to deconstruct this a bit ... there are 2 issues here:
1) is the OOM a regression?
2) is the OOM an issue?

Viet can speak to question #1.
Comment 27 Viet Nguyen 2013-11-06 15:02:53 EST
Re #1: the 100x750 and 10x10x750 tests against 3.1.2.GA are in progress. I'm having an issue with PostgreSQL timing out. Will report back with results.
Comment 28 Viet Nguyen 2013-11-08 10:22:33 EST
I was able to run 2 import scenarios against JON build id 263443 (post 3.1.2.GA). The agent ran OOM hours after inventory completed.

Agent JVM:  open-jdk 1.6.0_24
Agent heap -Xms64m -Xmx1024m
JON heap -Xms4G -Xmx4G


A. 10x10x750 test (JON jvm=openjdk1.7)
Inventory 
- start 17:49
- end   20:35 (2:46 elapsed time)
- OOM   23:49

B. 100x750 test
Inventory
- start 15:58
- end   18:12 (2:14 elapsed time)
- OOM   19:52

I recommend: 1. running the same tests against the latest ER build and observing memory behavior after inventory completion, and 2. noting that, given more heap, the ER build will complete both topologies without OOM, but it takes significantly longer (30-45 minutes). This is worth looking into.
Comment 29 Mike Foley 2013-11-08 11:27:36 EST
documenting from 11/8/2013 meeting foley, viet, spinder, heiko, larry.

paraphrasing....

larry: performance regression exists, but is not important for the customer.

heiko/spinder:  valid reasons due to changes in query criteria can explain the performance change from JON 3.1.2 with hot patch.

Viet:  OOMs do occur, sporadically, after the 75K goal is met.

minimally, PM-402 is met in both the deep resource hierarchy and the horizontal resource hierarchy.

consensus agreement:  PM-402 is met for JON 3.2.  mark this complete. remove the JON 3.2 blocker flag.
Comment 30 Heiko W. Rupp 2013-11-08 11:32:18 EST
Both tests are passing per the above.

The slowdown (wrt 3.1.2) may come from other changes, like:
- a different HTTP connection provider implementation in the as7-plugin
- chunked sending of reports to the server
- some query changes on the server side
- performance enhancements for paging through results could legitimately cause import slowdown and require a larger heap. More specifically, fixes for the following are in JON 3.2.0 and produce more complete resource lists:
  i) in many places the code only parsed the first 200 results
  ii) some paging results returned duplicate resource ids
Comment 31 Mike Foley 2013-11-08 11:38:06 EST
This is now a doc issue ... we need to document the 2GB heap size for the agent for the import of a large number of resources.
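A sketch of what the documented setting might look like in rhq-agent-env.sh (the variable name is assumed from the standard agent environment script; -Xmx2048m reflects the 2GB recommendation above):

```shell
# rhq-agent-env.sh (sketch): raise the agent's max heap to 2 GB for
# large-inventory imports (e.g. 75K resources), per this BZ's findings.
# Variable name is assumed from the standard RHQ agent environment script.
RHQ_AGENT_JAVA_OPTS="-Xms64m -Xmx2048m"
export RHQ_AGENT_JAVA_OPTS
```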
