525549 – Java OOM (OutOfMemory) when removing a package via SSM (System Set Manager)

Bug 525549 - Java OOM (OutOfMemory) when removing a package via SSM (System Set Manager)

Summary: Java OOM (OutOfMemory) when removing a package via SSM (System Set Manager)

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Satellite 5
Classification:	Red Hat
Component:	WebUI
Sub Component:
Version:	530
Hardware:	All
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	---
Assignee:	Justin Sherrill
QA Contact:	Pavel Kralik
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	sat531-blockers
TreeView+	depends on / blocked

Reported:	2009-09-24 19:05 UTC by Xixi
Modified:	2018-10-27 16:00 UTC (History)
CC List:	7 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2010-01-08 10:02:34 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2010:0021	0	normal	SHIPPED_LIVE	Red Hat Network Satellite bug fix update	2010-01-08 10:02:22 UTC

Description Xixi 2009-09-24 19:05:00 UTC

Description of problem:
Customer tries to remove a package from a number of systems (~11.000) via SSM on upgraded 5.3.0 satellite. After confirm the uninstallation WebUI runs into an error and some time later we found java core and heapdump in /tmp on satellite server.

Version-Release number of selected component (if applicable):
Red Hat Network (RHN) Satellite 5.3.0

System architecture(s):
arch=x86_64
release=5
flavor=server
base=rhel-x86_64-server-5

How reproducible:
Always.

Steps to Reproduce:
1. Have a satellite with at least 11000 systems registered and some RHEL base channels sync'd (we should have a customer db-control backup dump soon).
2. Select a number of Systems for use in SSM (~11.000)
3. Click on https://SATELLITEFQDN/rhn/ssm/PackageRemove.do
4. Select at least one package to remove (e.g., acpid) and click on Remove Selected Packages
5. On the next page see the first pageful of list of affected systems. Here click on "Confirm"
6. Wait.

Actual results:
After some time ( ~20 seconds or so) run into an internal server error but there are still active oracle process active with ~ 200 % CPU Load on a 8 core system. After a time for a very short moment a java process appears.
Even the web UI runs in an error the event is in the scheduled overview some times later.  Java core and heapdumps are created.

Expected results:
Action completes fast, no internal server error due to java OOM.

Additional info:
Already being worked with Engineering.

Comment 2 Xixi 2009-09-24 19:14:10 UTC

GPS consultant on-site has taken a look at the trace files, and found that this was running at the time of the out of memory error:

"
3XMTHREADINFO      "RHN Message Dispatcher" TID:0x0000000014D3F300, j9thread_t:0x0000000014D20350, state:CW, prio=5
3XMTHREADINFO1            (native thread ID:0x5E7E, native priority:0x5, native policy:UNKNOWN)
4XESTACKTRACE          at org/hibernate/util/IdentityMap.entryList(IdentityMap.java:187(Compiled Code))
4XESTACKTRACE          at org/hibernate/util/IdentityMap.entries(IdentityMap.java:63(Compiled Code))
4XESTACKTRACE          at org/hibernate/event/def/AbstractFlushingEventListener.flushCollections(AbstractFlushingEventListener.java:226(Compiled Code))
4XESTACKTRACE          at org/hibernate/event/def/AbstractFlushingEventListener.flushEverythingToExecutions(AbstractFlushingEventListener.java:77(Compiled Code))
4XESTACKTRACE          at org/hibernate/event/def/DefaultFlushEventListener.onFlush(DefaultFlushEventListener.java:26(Compiled Code))
4XESTACKTRACE          at org/hibernate/impl/SessionImpl.flush(SessionImpl.java:1000(Compiled Code))
4XESTACKTRACE          at com/redhat/rhn/common/db/datasource/CachedStatement.stealConnection(CachedStatement.java(Compiled Code))
4XESTACKTRACE          at com/redhat/rhn/common/db/datasource/CachedStatement.execute(CachedStatement.java:441(Compiled Code))
4XESTACKTRACE          at com/redhat/rhn/common/db/datasource/CachedStatement.execute(CachedStatement.java:431(Compiled Code))
4XESTACKTRACE          at com/redhat/rhn/common/db/datasource/CachedStatement.executeUpdate(CachedStatement.java:269(Compiled Code))
4XESTACKTRACE          at com/redhat/rhn/common/db/datasource/WriteMode.executeUpdate(WriteMode.java:33(Compiled Code))
4XESTACKTRACE          at com/redhat/rhn/manager/action/ActionManager.schedulePackageAction(ActionManager.java:1670)
4XESTACKTRACE          at com/redhat/rhn/manager/action/ActionManager.schedulePackageRemoval(ActionManager.java:1131)
4XESTACKTRACE          at com/redhat/rhn/frontend/events/SsmRemovePackagesAction.scheduleDeletes(SsmRemovePackagesAction.java:130)
4XESTACKTRACE          at com/redhat/rhn/frontend/events/SsmRemovePackagesAction.doExecute(SsmRemovePackagesAction.java:59)
4XESTACKTRACE          at com/redhat/rhn/frontend/events/AbstractDatabaseAction.execute(AbstractDatabaseAction.java:45)
4XESTACKTRACE          at com/redhat/rhn/common/messaging/ActionExecutor.run(ActionExecutor.java:72)
4XESTACKTRACE          at com/redhat/rhn/common/messaging/MessageDispatcher.run(MessageDispatcher.java:58)
4XESTACKTRACE          at java/lang/Thread.run(Thread.java:735)

Also analyzed the Java heap dump using the IBM HeapAnalyzer from http://www.alphaworks.ibm.com/tech/heapanalyzer/

This showed that the most memory is consumed by Hibernate, more precisely by a HashMap of RHN Server objects. See the attached heapanalyzer.png screenshot.

This looks like the action is scheduled for all selected systems at once, and stored in memory before it's committed to the database.

[snip]

What we are seeing now is different from what was originally described in the ticket.

When we click Confirm, the Web UI comes back without an internal server error.

With JConsole and "top" we can see that the Tomcat process is doing some work in the background, and eventually encounters an Out of Memory error. The background work is probably the RHN Message Dispatcher thread.

For testing, I modified the JAVA_OPTS parameter in /etc/tomcat5/tomcat5.conf to set the heap size to 1024 MByte (-Xms1024m -Xmx1024m).

With this, scheduling a package removal for 10850 systems worked. It took about 3 minutes, running in the background. "

Comment 4 Xixi 2009-09-24 19:17:04 UTC

Per IRC with Justin:
There must be something we can do to decrease the memory usage for that action. In the meantime can continue to used the increased java heapsize as a workaround.

Comment 7 Justin Sherrill 2009-09-30 15:12:00 UTC

couple more related fixes:

5073c7dd19a65d1b0cc15854d6e88e1f65eef003
0f97ba4eb098c579dfb0488f5283451490d0b5d8

Comment 13 Justin Sherrill 2009-12-08 20:00:19 UTC

Actually that initial commit should be 98bc6fc9f6dd8d5c41f734e77a100bdf1197505f

Gonna try to get this into an errata

Comment 16 Pavel Kralik 2009-12-23 21:16:16 UTC

tested with: sat530, x8664, rhel54 (rhel48)
11K system loaded

failed with old, spacewalk-java*-0.5.44-50
Internal server error

passed with
spacewalk-taskomatic-0.5.44-61.el5sat
spacewalk-java-config-0.5.44-61.el5sat
spacewalk-java-0.5.44-61.el5sat
spacewalk-java-lib-0.5.44-61.el5sat

moving to verified

Comment 17 Pavel Kralik 2010-01-04 18:46:56 UTC

Retested with the packages from latest respin.
sat530, x8664, rhel54, 11,825 systems
removed 1 and 4 packages 

passed with:
spacewalk-java-0.5.44-62.el5sat.noarch.rpm
spacewalk-taskomatic-0.5.44-62.el5sat.noarch.rpm
spacewalk-java-config-0.5.44-62.el5sat.noarch.rpm
spacewalk-java-lib-0.5.44-62.el5sat.noarch.rpm

No Internal server error experienced and no exceptional load.

Comment 19 errata-xmlrpc 2010-01-08 10:02:34 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2010-0021.html

Note You need to log in before you can comment on or make changes to this bug.