Description of problem:
Bug 1018233 brought to light the problems the JON server can run into when run by a user limited to at most 1024 processes/threads. We should do the following:
- the rhqctl script should take care of setting the user process limit: ulimit -u <NUMBER>
- the container configuration should be modified to keep thread usage within the expected process limit
- document that the JON server requires a user process limit of 'n' and that the user process limit can be increased (on Linux) in /etc/security/limits.conf

The most important part of this BZ is determining <NUMBER>. This is quite difficult, because it is proportional to the expected load on the server. The default in RHEL (1024) is enough for basic operation under light load, but the server can run out of available processes under heavy load.

Version-Release number of selected component (if applicable):
JON 3.2.0-ER4

How reproducible: always

Steps to Reproduce:
1. prlimit --nproc=1024 rhqctl start
2. generate heavy load on the server and access the .../rest/status URI periodically

Actual results: the server runs out of memory fairly quickly

Expected results: no OOMs

Additional info: Note that this BZ tracks additional work that needs to accompany BZ 1018233. It is handled separately from BZ 1018233 because it is more of a configuration job than a code change, and it might not result in anything more than a documentation change if we determine that to be sufficient.
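As a sketch of what the documentation change might recommend: on Linux the per-user process limit can be raised in /etc/security/limits.conf. The entry below is only illustrative — the user name "jonuser" is hypothetical and the value 4096 is a placeholder, not the <NUMBER> this BZ is meant to determine:

```
# /etc/security/limits.conf — illustrative fragment only;
# the actual nproc value is still to be determined by this BZ
jonuser  soft  nproc  4096
jonuser  hard  nproc  4096
```

The new limit takes effect on the next login session of that user and can be checked from a shell with `ulimit -u`.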
Tentatively setting this to block JON 320, because it is related to BZ 1018233 which is a blocker, too. Please revise if needed.
Beta was released on ~2013/10/17, and Lukas implemented a fix for the model controller client finalizer issue on 10/31, so we should re-test this with ER5.
We have identified the potential leak position in the code - the above steps make it easy to reproduce. The attachments show the thread list from the server: every minute, a reconnection thread + 2 metrics-meter-tick threads are started. The reconnection thread finishes quickly, but the other two linger. They also remain after the storage node finally becomes reachable.
Created attachment 821711 [details] Threads when storage is unavailable
Created attachment 821712 [details] Threads when storage is finally reachable. Note that the grey one is an old metrics-meter-tick thread that is still lingering even now that storage is up.
The thread leak was caused by the Cassandra driver: the driver does not correctly dispose of its internal metrics reporting classes when a connection fails.

Steps taken to address this:
1) Disable driver metrics, since they are not used by the application
2) Call cluster shutdown as insurance
3) Remove a superfluous call that initialized a second session when one was already created

Note: On every single connection attempt, the server code retrieves the full list of storage nodes from the database. This list is then passed, processed, and copied around a few times until it reaches the driver. Because of this, JVM memory usage grows until the GC runs whenever a connection to the storage cluster cannot be established. When the GC runs, a huge amount of space is completely freed from the "PS Eden Space", so there is no memory leak — the GC reclaims all of it. However, because those entities are loaded into memory, overall JVM memory usage increases between two GC events while there is no connectivity to the storage cluster.

It is desirable to load the storage node information from the database every time, because storage nodes can be added to the database by other HA server installations or by storage installers. Having a fresh copy of the data prior to attempting a connection increases the chance of success.

release/jon3.2.x branch commit:
https://git.fedorahosted.org/cgit/rhq/rhq.git/commit/?h=release/jon3.2.x&id=84bdc0f727dee089e4e5e6d19e09000111a4a5f1
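The cleanup pattern behind steps 1) and 2) can be illustrated with a self-contained sketch. FakeCluster and its metrics executor below are hypothetical stand-ins for the driver's internal metrics-meter-tick threads, not actual driver code; the point is that a failed connection attempt must still dispose of the threads it started, or they linger:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ConnectionSketch {

    // Hypothetical stand-in for the driver: every connection attempt
    // starts background metrics threads, even if the attempt fails.
    static final class FakeCluster {
        final ScheduledExecutorService metrics =
                Executors.newScheduledThreadPool(2);

        void connect(boolean storageUp) {
            // simulates the metrics-meter-tick threads starting up
            metrics.scheduleAtFixedRate(() -> { }, 0, 1, TimeUnit.SECONDS);
            if (!storageUp) {
                throw new IllegalStateException("storage node unreachable");
            }
        }

        // the "cluster shutdown as insurance" step from the fix
        void shutdown() {
            metrics.shutdownNow();
        }
    }

    public static void main(String[] args) {
        FakeCluster cluster = new FakeCluster();
        try {
            cluster.connect(false); // storage is down
        } catch (IllegalStateException e) {
            // without this call, the two metrics threads would linger
            // after every failed reconnection attempt
            cluster.shutdown();
        }
        System.out.println("metrics executor shut down: "
                + cluster.metrics.isShutdown());
    }
}
```

Disabling the metrics entirely (step 1) avoids starting these threads in the first place; the shutdown call (step 2) guarantees cleanup even if the driver starts them anyway.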
So as far as I can tell, during the course of investigating this bug we determined that the RHEL default of 1024 processes/threads per user is enough to start a basic install of JON. However, we still lack sizing information for servers that need to handle a bigger load. It might be as simple as saying that if users run into OOMs, it may be necessary to increase the ulimit nproc parameter in addition to increasing the maximum heap size of the server. I am creating a doc bug where we can track that requirement.
doc bug: BZ 1028639
Moving to ON_QA as available for testing with new brew build.
Mass moving all of these from ER6 to target milestone ER07 since the ER6 build was bad and QE was halted for the same reason.
Verified on
Version: 3.2.0.CR1
Build Number: 6ecd678:d0dc0b6

I had the following 2 environments running over the weekend:
1. JON server with 2 agents and one storage node; the storage node was DOWN
2. JON server with 5 agents and 2 storage nodes; one of the storage nodes was DOWN

The JON servers in both environments were using ~250 threads.