Description of problem: Customer tries to remove a package from a number of systems (~11.000) via SSM on upgraded 5.3.0 satellite. After confirm the uninstallation WebUI runs into an error and some time later we found java core and heapdump in /tmp on satellite server. Version-Release number of selected component (if applicable): Red Hat Network (RHN) Satellite 5.3.0 System architecture(s): arch=x86_64 release=5 flavor=server base=rhel-x86_64-server-5 How reproducible: Always. Steps to Reproduce: 1. Have a satellite with at least 11000 systems registered and some RHEL base channels sync'd (we should have a customer db-control backup dump soon). 2. Select a number of Systems for use in SSM (~11.000) 3. Click on https://SATELLITEFQDN/rhn/ssm/PackageRemove.do 4. Select at least one package to remove (e.g., acpid) and click on Remove Selected Packages 5. On the next page see the first pageful of list of affected systems. Here click on "Confirm" 6. Wait. Actual results: After some time ( ~20 seconds or so) run into an internal server error but there are still active oracle process active with ~ 200 % CPU Load on a 8 core system. After a time for a very short moment a java process appears. Even the web UI runs in an error the event is in the scheduled overview some times later. Java core and heapdumps are created. Expected results: Action completes fast, no internal server error due to java OOM. Additional info: Already being worked with Engineering.
GPS consultant on-site has taken a look at the trace files, and found that this was running at the time of the out of memory error: " 3XMTHREADINFO "RHN Message Dispatcher" TID:0x0000000014D3F300, j9thread_t:0x0000000014D20350, state:CW, prio=5 3XMTHREADINFO1 (native thread ID:0x5E7E, native priority:0x5, native policy:UNKNOWN) 4XESTACKTRACE at org/hibernate/util/IdentityMap.entryList(IdentityMap.java:187(Compiled Code)) 4XESTACKTRACE at org/hibernate/util/IdentityMap.entries(IdentityMap.java:63(Compiled Code)) 4XESTACKTRACE at org/hibernate/event/def/AbstractFlushingEventListener.flushCollections(AbstractFlushingEventListener.java:226(Compiled Code)) 4XESTACKTRACE at org/hibernate/event/def/AbstractFlushingEventListener.flushEverythingToExecutions(AbstractFlushingEventListener.java:77(Compiled Code)) 4XESTACKTRACE at org/hibernate/event/def/DefaultFlushEventListener.onFlush(DefaultFlushEventListener.java:26(Compiled Code)) 4XESTACKTRACE at org/hibernate/impl/SessionImpl.flush(SessionImpl.java:1000(Compiled Code)) 4XESTACKTRACE at com/redhat/rhn/common/db/datasource/CachedStatement.stealConnection(CachedStatement.java(Compiled Code)) 4XESTACKTRACE at com/redhat/rhn/common/db/datasource/CachedStatement.execute(CachedStatement.java:441(Compiled Code)) 4XESTACKTRACE at com/redhat/rhn/common/db/datasource/CachedStatement.execute(CachedStatement.java:431(Compiled Code)) 4XESTACKTRACE at com/redhat/rhn/common/db/datasource/CachedStatement.executeUpdate(CachedStatement.java:269(Compiled Code)) 4XESTACKTRACE at com/redhat/rhn/common/db/datasource/WriteMode.executeUpdate(WriteMode.java:33(Compiled Code)) 4XESTACKTRACE at com/redhat/rhn/manager/action/ActionManager.schedulePackageAction(ActionManager.java:1670) 4XESTACKTRACE at com/redhat/rhn/manager/action/ActionManager.schedulePackageRemoval(ActionManager.java:1131) 4XESTACKTRACE at com/redhat/rhn/frontend/events/SsmRemovePackagesAction.scheduleDeletes(SsmRemovePackagesAction.java:130) 4XESTACKTRACE at com/redhat/rhn/frontend/events/SsmRemovePackagesAction.doExecute(SsmRemovePackagesAction.java:59) 4XESTACKTRACE at com/redhat/rhn/frontend/events/AbstractDatabaseAction.execute(AbstractDatabaseAction.java:45) 4XESTACKTRACE at com/redhat/rhn/common/messaging/ActionExecutor.run(ActionExecutor.java:72) 4XESTACKTRACE at com/redhat/rhn/common/messaging/MessageDispatcher.run(MessageDispatcher.java:58) 4XESTACKTRACE at java/lang/Thread.run(Thread.java:735) Also analyzed the Java heap dump using the IBM HeapAnalyzer from http://www.alphaworks.ibm.com/tech/heapanalyzer/ This showed that the most memory is consumed by Hibernate, more precisely by a HashMap of RHN Server objects. See the attached heapanalyzer.png screenshot. This looks like the action is scheduled for all selected systems at once, and stored in memory before it's committed to the database. [snip] What we are seeing now is different from what was originally described in the ticket. When we click Confirm, the Web UI comes back without an internal server error. With JConsole and "top" we can see that the Tomcat process is doing some work in the background, and eventually encounters an Out of Memory error. The background work is probably the RHN Message Dispatcher thread. For testing, I modified the JAVA_OPTS parameter in /etc/tomcat5/tomcat5.conf to set the heap size to 1024 MByte (-Xms1024m -Xmx1024m). With this, scheduling a package removal for 10850 systems worked. It took about 3 minutes, running in the background. "
Per IRC with Justin: There must be something we can do to decrease the memory usage for that action. In the meantime can continue to used the increased java heapsize as a workaround.
couple more related fixes: 5073c7dd19a65d1b0cc15854d6e88e1f65eef003 0f97ba4eb098c579dfb0488f5283451490d0b5d8
Actually that initial commit should be 98bc6fc9f6dd8d5c41f734e77a100bdf1197505f Gonna try to get this into an errata
tested with: sat530, x8664, rhel54 (rhel48) 11K system loaded failed with old, spacewalk-java*-0.5.44-50 Internal server error passed with spacewalk-taskomatic-0.5.44-61.el5sat spacewalk-java-config-0.5.44-61.el5sat spacewalk-java-0.5.44-61.el5sat spacewalk-java-lib-0.5.44-61.el5sat moving to verified
Retested with the packages from latest respin. sat530, x8664, rhel54, 11,825 systems removed 1 and 4 packages passed with: spacewalk-java-0.5.44-62.el5sat.noarch.rpm spacewalk-taskomatic-0.5.44-62.el5sat.noarch.rpm spacewalk-java-config-0.5.44-62.el5sat.noarch.rpm spacewalk-java-lib-0.5.44-62.el5sat.noarch.rpm No Internal server error experienced and no exceptional load.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2010-0021.html