Description of problem:
Bug 1018233 brought to light the problems the JON server can run into when run by a user limited to at most 1024 processes/threads. We should do the following:
- the rhqctl script should take care of setting the user process limit: ulimit -u <NUMBER>
- the container configuration should be modified to keep thread usage within the expected process limit
- document that the JON server requires a user process limit of 'n' and that the user process limit can be increased (on Linux) in /etc/security/limits.conf

The most important part of this BZ is determining <NUMBER>. This is quite difficult, because it is proportional to the expected load on the server. The default in RHEL (1024) is enough for basic operation under light load, but the server can run out of available processes under heavy load.

Version-Release number of selected component (if applicable):
JON 3.2.0-ER4

How reproducible: always

Steps to Reproduce:
1. prlimit --nproc=1024 rhqctl start
2. generate heavy load on the server and access the .../rest/status URI periodically

Actual results: the server runs out of memory fairly quickly

Expected results: no OOMs

Additional info: Note that this BZ tracks additional work that needs to accompany BZ 1018233. It is handled separately from BZ 1018233 because it is more of a configuration job than a code change, and it might not result in anything more than a documentation change if we determine that to be sufficient.
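As a sketch of what the documentation change might recommend: on Linux the per-user process limit can be raised in /etc/security/limits.conf. The entry below is only illustrative — the user name "jonuser" is hypothetical and the value 4096 is a placeholder, not the <NUMBER> this BZ is meant to determine:

```
# /etc/security/limits.conf — illustrative fragment only;
# the actual nproc value is still to be determined by this BZ
jonuser  soft  nproc  4096
jonuser  hard  nproc  4096
```

The new limit takes effect on the next login session of that user and can be checked from a shell with `ulimit -u`.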
Tentatively setting this to block JON 320, because it is related to BZ 1018233 which is a blocker, too. Please revise if needed.
Beta was released on ~2013/10/17, and Lukas implemented a fix for the model controller client finalizer issue on 10/31, so we should re-test this with ER5.
We have identified the potential leak position in the code - the above steps make it easy to reproduce. The attachments show the thread list from the server: every minute, a reconnection thread + 2 metrics-meter-tick threads are started. The reconnection thread finishes quickly, but the other two linger. They also remain after the storage node finally becomes reachable.
Created attachment 821711 [details] Threads when storage is unavailable
Created attachment 821712 [details] Threads when storage is finally reachable. Note that the grey one is an old metrics-meter-tick thread that is still lingering even now that storage is up.
The thread leak was caused by the Cassandra driver: the driver does not correctly dispose of its internal metrics reporting classes when a connection fails.

Steps taken to address this:
1) Disable driver metrics, since they are not used by the application
2) Call cluster shutdown as insurance
3) Remove a superfluous call that initialized a second session when one was already created

Note: On every single connection attempt, the server code retrieves the full list of storage nodes from the database. This list is then passed, processed, and copied around a few times until it reaches the driver. Because of this, JVM memory usage grows until the GC runs whenever a connection to the storage cluster cannot be established. When the GC runs, a huge amount of space is completely freed from the "PS Eden Space", so there is no memory leak — the GC reclaims all of it. However, because those entities are loaded into memory, overall JVM memory usage increases between two GC events while there is no connectivity to the storage cluster.

It is desirable to load the storage node information from the database every time, because storage nodes can be added to the database by other HA server installations or by storage installers. Having a fresh copy of the data prior to attempting a connection increases the chance of success.

release/jon3.2.x branch commit:
https://git.fedorahosted.org/cgit/rhq/rhq.git/commit/?h=release/jon3.2.x&id=84bdc0f727dee089e4e5e6d19e09000111a4a5f1
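The cleanup pattern behind steps 1) and 2) can be illustrated with a self-contained sketch. FakeCluster and its metrics executor below are hypothetical stand-ins for the driver's internal metrics-meter-tick threads, not actual driver code; the point is that a failed connection attempt must still dispose of the threads it started, or they linger:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ConnectionSketch {

    // Hypothetical stand-in for the driver: every connection attempt
    // starts background metrics threads, even if the attempt fails.
    static final class FakeCluster {
        final ScheduledExecutorService metrics =
                Executors.newScheduledThreadPool(2);

        void connect(boolean storageUp) {
            // simulates the metrics-meter-tick threads starting up
            metrics.scheduleAtFixedRate(() -> { }, 0, 1, TimeUnit.SECONDS);
            if (!storageUp) {
                throw new IllegalStateException("storage node unreachable");
            }
        }

        // the "cluster shutdown as insurance" step from the fix
        void shutdown() {
            metrics.shutdownNow();
        }
    }

    public static void main(String[] args) {
        FakeCluster cluster = new FakeCluster();
        try {
            cluster.connect(false); // storage is down
        } catch (IllegalStateException e) {
            // without this call, the two metrics threads would linger
            // after every failed reconnection attempt
            cluster.shutdown();
        }
        System.out.println("metrics executor shut down: "
                + cluster.metrics.isShutdown());
    }
}
```

Disabling the metrics entirely (step 1) avoids starting these threads in the first place; the shutdown call (step 2) guarantees cleanup even if the driver starts them anyway.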
So as far as I can tell, during the course of investigating this bug we determined that the RHEL default of 1024 processes/threads per user is enough to start a basic install of JON. However, we still lack sizing information for servers that need to handle a bigger load. It might be as simple as saying that if users run into OOMs, it may be necessary to increase the ulimit nproc parameter in addition to increasing the maximum heap size of the server. I am creating a doc bug where we can track that requirement.
doc bug: BZ 1028639
Moving to ON_QA as available for testing with new brew build.
Mass moving all of these from ER6 to target milestone ER07 since the ER6 build was bad and QE was halted for the same reason.
Verified on
Version: 3.2.0.CR1
Build Number: 6ecd678:d0dc0b6

I had the following 2 environments running over the weekend:
1. JON server with 2 agents and one storage node; the storage node was DOWN
2. JON server with 5 agents and 2 storage nodes; one of the storage nodes was DOWN

The JON servers in both environments were using ~250 threads.