Bug 1025767 - JON may require non-default number of processes available for user running it
Status: CLOSED CURRENTRELEASE
Product: JBoss Operations Network
Classification: JBoss
Component: Core Server, Launch Scripts
Version: JON 3.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ER07
Target Release: JON 3.2.0
Assigned To: Stefan Negrea
QA Contact: Mike Foley
Depends On: 1018233
Blocks: 1012435
 
Reported: 2013-11-01 10:02 EDT by Lukas Krejci
Modified: 2014-01-02 15:38 EST (History)

Doc Type: Bug Fix
Type: Bug


Attachments
Threads when storage is unavailable (195.57 KB, image/png)
2013-11-08 12:01 EST, Heiko W. Rupp
Threads when storage is finally reachable. Note that the grey one is an old metrics-meter-tick thread that is still around even now that storage is up. (205.03 KB, image/png)
2013-11-08 12:02 EST, Heiko W. Rupp

Description Lukas Krejci 2013-11-01 10:02:02 EDT
Description of problem:

Bug 1018233 brought to light the problems the JON server can run into when it is run by a user who is limited to at most 1024 processes/threads.

We should do the following:
- the rhqctl script should take care of setting the user process limit - ulimit -u <NUMBER>
- the container configuration should be modified to limit the threads to stay within the expected process limit
- document that the JON server requires a user process limit of 'n' and that the user process limit can be increased (on Linux) in /etc/security/limits.conf.

The most important part of this BZ is to determine the <NUMBER>. This is quite difficult, because it is proportional to the expected load on the server. The default in RHEL (1024) is enough for basic operation under light load but we can run out of available processes under heavy load.

Version-Release number of selected component (if applicable):
JON 3.2.0-ER4

How reproducible:
always

Steps to Reproduce:
1. prlimit --nproc=1024 rhqctl start
2. generate heavy load on the server and access .../rest/status URI periodically

Actual results:
the server runs out of memory fairly quickly

Expected results:
no OOMs

Additional info:
Note that this BZ tracks additional work that needs to accompany BZ 1018233. It is handled separately from BZ 1018233 because it is more of a configuration job than a code change, and it might not result in anything more than a documentation change if we determine that to be sufficient.
Comment 1 Lukas Krejci 2013-11-01 10:04:25 EDT
Tentatively setting this to block JON 320, because it is related to BZ 1018233 which is a blocker, too.

Please revise if needed.
Comment 5 Heiko W. Rupp 2013-11-08 09:16:28 EST
Beta was released on ~2013/10/17 and Lukas implemented a fix for the model controller client finalizer issue on 10/31, so we should re-test that with ER5.
Comment 7 Heiko W. Rupp 2013-11-08 12:00:19 EST
We have identified the potential leak position in the code. The steps above make it easy to reproduce.
The attachments show the thread list from the server: every minute, a reconnection thread and 2 metrics-meter-tick threads are started. The reconnection thread finishes quickly, but the other two linger.

They also stay around once the storage node is finally reachable.
Comment 8 Heiko W. Rupp 2013-11-08 12:01:12 EST
Created attachment 821711 [details]
Threads when storage is unavailable
Comment 9 Heiko W. Rupp 2013-11-08 12:02:28 EST
Created attachment 821712 [details]
Threads when storage is finally reachable. Note that the grey one is an old metrics-meter-tick thread that is still around even now that storage is up.
Comment 10 Stefan Negrea 2013-11-08 15:25:36 EST
The thread leak was caused by the Cassandra driver. The driver does not correctly dispose of its internal metrics reporting classes when a connection fails.

Steps taken to address this:
1) Disable driver metrics since they are not used by the application
2) Call cluster shutdown as insurance
3) Remove a superfluous call that initializes a second session when one was already created
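
The failure mode and the "shutdown as insurance" step can be illustrated with a small stand-alone Java sketch. It does not use the Cassandra driver and is not the code from the commit: FakeCluster, connectOrCleanUp, and the thread name are hypothetical stand-ins for a client that starts a background reporter thread on every connection attempt.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ReporterLeakSketch {

    /** Hypothetical stand-in for a client that starts a reporter thread when it is created. */
    public static final class FakeCluster {
        private final ScheduledExecutorService reporter =
                Executors.newSingleThreadScheduledExecutor(r -> {
                    Thread t = new Thread(r, "metrics-meter-tick-sketch");
                    t.setDaemon(true);
                    return t;
                });

        public FakeCluster() {
            // Scheduling the first task is what actually starts the worker thread.
            reporter.scheduleAtFixedRate(() -> { }, 0, 5, TimeUnit.SECONDS);
        }

        /** Simulates an unreachable storage node: the attempt always fails. */
        public void connect() {
            throw new IllegalStateException("storage node unreachable (simulated)");
        }

        /** Stops the reporter thread; without this call the thread outlives the failed attempt. */
        public void shutdown() {
            reporter.shutdownNow();
        }

        public boolean reporterStopped() {
            return reporter.isShutdown();
        }
    }

    /** Returns true if connected; on failure, shuts the cluster down so no thread lingers. */
    public static boolean connectOrCleanUp(FakeCluster cluster) {
        try {
            cluster.connect();
            return true;
        } catch (IllegalStateException e) {
            cluster.shutdown();
            return false;
        }
    }

    public static void main(String[] args) {
        // Repeated failed attempts, as in the once-a-minute reconnection described above.
        for (int attempt = 1; attempt <= 3; attempt++) {
            FakeCluster cluster = new FakeCluster();
            boolean connected = connectOrCleanUp(cluster);
            System.out.println("attempt " + attempt + ": connected=" + connected
                    + ", reporter stopped=" + cluster.reporterStopped());
        }
    }
}
```

Removing the cluster.shutdown() call from the catch block reproduces the pattern from the attachments in miniature: each failed attempt leaves one more reporter thread behind.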


Note: On every connection attempt, the server code retrieves the full list of storage nodes from the database. The list is then passed around, processed, and copied a few times before it reaches the driver. As a result, while a connection to the storage cluster cannot be established, JVM memory usage grows until the GC runs. When the GC runs, a large amount of space is freed from the "PS Eden Space", so this is not a memory leak: the GC reclaims all of that memory. However, because those entities are loaded into memory, overall JVM memory usage increases between two GC events while there is no connectivity to the storage cluster.

It is desirable to load the storage node information from the database every time because storage nodes can be added to the database by other HA server installations or just storage installers. Having a fresh copy of the data prior to attempting a connection increases the chance of success.



release/jon3.2.x branch commit:

https://git.fedorahosted.org/cgit/rhq/rhq.git/commit/?h=release/jon3.2.x&id=84bdc0f727dee089e4e5e6d19e09000111a4a5f1
Comment 11 Lukas Krejci 2013-11-08 18:33:54 EST
So as far as I can tell, during the course of investigating this bug we determined that the RHEL default of 1024 processes/threads per user should be enough to start the basic install of JON.

However, I am still missing sizing information for servers that need to handle a bigger load. It might be as simple as saying that if users run into OOMs, it might be necessary to increase the ulimit nproc parameter in addition to increasing the maximum heap size of the server.

Creating a doc bug for that where we can track that requirement.
Comment 12 Lukas Krejci 2013-11-08 18:40:00 EST
doc bug: BZ 1028639
Comment 13 Simeon Pinder 2013-11-19 10:48:59 EST
Moving to ON_QA as available for testing with new brew build.
Comment 14 Simeon Pinder 2013-11-22 00:14:25 EST
Mass moving all of these from ER6 to target milestone ER07 since the ER6 build was bad and QE was halted for the same reason.
Comment 15 Filip Brychta 2013-12-09 03:46:01 EST
Verified on:
Version: 3.2.0.CR1
Build Number: 6ecd678:d0dc0b6

I had the following 2 environments running over the weekend:
1- JON server with 2 agents and one storage node. Storage node was DOWN
2- JON server with 5 agents and 2 storage nodes. One of the storage nodes was DOWN

The JON servers in both environments were using ~250 threads.
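
To compare a figure like this against a per-user process limit, the live thread count of a JVM can be read through the standard ThreadMXBean. The following is a minimal sketch, not part of JON: the class name ThreadHeadroom, the headroom helper, and the hard-coded limit of 1024 are illustrative assumptions. It reports only the JVM it runs in, whereas the nproc limit counts all processes and threads of the user.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class ThreadHeadroom {

    /** Threads still available before the given limit is reached; never negative. */
    public static int headroom(int liveThreads, int userProcessLimit) {
        return Math.max(0, userProcessLimit - liveThreads);
    }

    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        int live = threads.getThreadCount();      // currently live threads in this JVM
        int peak = threads.getPeakThreadCount();  // highest live count since JVM start
        int limit = 1024;                         // assumed limit; check `ulimit -u` for the real value

        System.out.println("live threads:  " + live);
        System.out.println("peak threads:  " + peak);
        System.out.println("headroom left: " + headroom(peak, limit));
    }
}
```

With the ~250 threads reported above and a limit of 1024, headroom(250, 1024) leaves 774, provided the user runs nothing else that counts against the limit.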
