Bug 1025767
| Summary: | JON may require non-default number of processes available for user running it | | |
|---|---|---|---|
| Product: | [JBoss] JBoss Operations Network | Reporter: | Lukas Krejci <lkrejci> |
| Component: | Core Server, Launch Scripts | Assignee: | Stefan Negrea <snegrea> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Mike Foley <mfoley> |
| Severity: | urgent | Priority: | unspecified |
| Version: | JON 3.2 | CC: | asantos, fbrychta, hrupp, loleary, theute, vnguyen |
| Target Milestone: | ER07 | Target Release: | JON 3.2.0 |
| Hardware: | Unspecified | OS: | Unspecified |
| Type: | Bug | Doc Type: | Bug Fix |
| Bug Depends On: | 1018233 | Bug Blocks: | 1012435 |
Description — Lukas Krejci, 2013-11-01 14:02:02 UTC
Tentatively setting this to block JON 3.2.0, because it is related to BZ 1018233, which is a blocker too. Please revise if needed.

Beta was released on ~2013-10-17; Lukas implemented a fix for the model controller client finalizer issue on 10-31, so we should re-test that with ER5.

We have identified the potential leak position in the code; the steps above make it easy to reproduce. The attachments show the thread list from the server: every minute, one reconnection thread and two metrics-meter-tick threads are started. The reconnection thread finishes quickly, but the other two linger, and they remain even once the storage node is reachable again.

Created attachment 821711 [details]
Threads when storage is unavailable

Created attachment 821712 [details]
Threads when storage is finally reachable. Note that the grey one is an old metrics-meter-tick thread that is still around even now that storage is up.
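A leak like the one in the attachments can be spotted from the shell by sampling the server's thread count over time. A minimal sketch on Linux, where each entry under `/proc/<pid>/task` is one thread; the PID and the short sleep interval are placeholders, not part of this bug's procedure (point it at the JON server process and sample once a minute):

```shell
# Sample a process's thread count repeatedly to spot steady growth
# (here, 1 reconnection + 2 metrics-meter-tick threads appeared per minute).
pid=$$    # stand-in: the current shell; substitute the JON server PID
for i in 1 2 3; do
    # each directory under /proc/<pid>/task is one kernel-visible thread
    count=$(ls /proc/"$pid"/task | wc -l)
    echo "sample $i: $count threads"
    sleep 1    # use 'sleep 60' to match the one-minute cadence seen here
done
```

A steadily climbing count across samples, with storage down, would reproduce the behaviour described above.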
The thread leak was caused by the Cassandra driver: the driver does not correctly dispose of its internal metrics-reporting classes when a connection fails. Steps taken to address this:

1) Disable driver metrics, since they are not used by the application.
2) Call cluster shutdown as insurance.
3) Remove a superfluous call that initialized a second session when one had already been created.

Note: on every single connection attempt, the server code retrieves the full list of storage nodes from the database. That list is then passed, processed, and copied around a few times until it reaches the driver. Because of this, JVM memory usage increases between GC runs while a connection to the storage cluster cannot be established. When the GC runs, a huge amount of space is freed from the "PS Eden Space", so there is no memory leak; the GC reclaims all of it. Loading the storage node information from the database on every attempt is desirable because storage nodes can be added to the database by other HA server installations or by storage installers; having a fresh copy of the data prior to attempting a connection increases the chance of success.

release/jon3.2.x branch commit: https://git.fedorahosted.org/cgit/rhq/rhq.git/commit/?h=release/jon3.2.x&id=84bdc0f727dee089e4e5e6d19e09000111a4a5f1

So as far as I can tell, during the course of investigating this bug we determined that the RHEL default of 1024 processes/threads per user should be enough to start a basic install of JON. However, I am still missing sizing information for servers that need to handle a bigger load. It might be as simple as saying that if users run into OOMs, it may be necessary to increase the ulimit nproc parameter in addition to increasing the server's maximum heap size.
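The nproc limit discussed above can be checked, and raised, from the shell. A hedged sketch: the `jonuser` name and the value 4096 are illustrative assumptions, not sizing recommendations from this bug.

```shell
# Show the current per-user process/thread limit (RHEL default: 1024).
ulimit -u

# To raise it persistently for the user running JON, lines like the
# following can be added to /etc/security/limits.conf (illustrative values):
#   jonuser  soft  nproc  4096
#   jonuser  hard  nproc  4096
```

The limits.conf change takes effect on the user's next login session, so the JON server would need a restart from a fresh session to pick it up.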
Creating a doc bug where we can track that requirement. Doc bug: BZ 1028639.

Moving to ON_QA as available for testing with the new brew build.

Mass-moving all of these from ER6 to target milestone ER07, since the ER6 build was bad and QE was halted for that reason.

Verified on:
Version: 3.2.0.CR1
Build Number: 6ecd678:d0dc0b6

I had the following two environments running over the weekend:
1. JON server with 2 agents and one storage node; the storage node was DOWN.
2. JON server with 5 agents and 2 storage nodes; one of the storage nodes was DOWN.

JON servers in both environments were using ~250 threads.
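A one-shot thread count like the ~250-thread verification above can be taken with `ps`; NLWP ("number of light-weight processes") is the process's thread count. A minimal sketch, using the current shell's PID as a stand-in for the JON server's:

```shell
# Print the thread count (NLWP) for a single PID; the '=' suppresses
# the column header. Substitute the JON server PID for a real check.
pid=$$
ps -o nlwp= -p "$pid"
```

On a healthy post-fix server, this number should stay roughly stable even while a storage node is down, rather than growing every minute.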