Bug 1025767
| Summary: | JON may require non-default number of processes available for user running it | | |
|---|---|---|---|
| Product: | [JBoss] JBoss Operations Network | Reporter: | Lukas Krejci <lkrejci> |
| Component: | Core Server, Launch Scripts | Assignee: | Stefan Negrea <snegrea> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Mike Foley <mfoley> |
| Severity: | urgent | Priority: | unspecified |
| Version: | JON 3.2 | CC: | asantos, fbrychta, hrupp, loleary, theute, vnguyen |
| Target Milestone: | ER07 | Target Release: | JON 3.2.0 |
| Hardware: | Unspecified | OS: | Unspecified |
| Type: | Bug | Doc Type: | Bug Fix |
| Bug Depends On: | 1018233 | Bug Blocks: | 1012435 |
Description — Lukas Krejci, 2013-11-01 14:02:02 UTC
Tentatively setting this to block JON 3.2.0, because it is related to BZ 1018233, which is a blocker too. Please revise if needed.

Beta was released on ~2013-10-17; Lukas implemented a fix for the model controller client finalizer issue on 10-31, so we should re-test that with ER5.

We have identified the potential leak position in the code; the steps above make it easy to reproduce. The attachments show the thread list from the server: every minute, one reconnection thread and two metrics-meter-tick threads are started. The reconnection thread finishes quickly, but the other two linger, and they remain even once the storage node is reachable again.

Created attachment 821711 [details]
Threads when storage is unavailable

Created attachment 821712 [details]
Threads when storage is finally reachable. Note that the grey one is an old metrics-meter-tick thread that is still around even now that storage is up.
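A leak like the one in the attachments can be spotted from the shell by sampling the server's thread count over time. A minimal sketch on Linux, where each entry under `/proc/<pid>/task` is one thread; the PID and the short sleep interval are placeholders, not part of this bug's procedure (point it at the JON server process and sample once a minute):

```shell
# Sample a process's thread count repeatedly to spot steady growth
# (here, 1 reconnection + 2 metrics-meter-tick threads appeared per minute).
pid=$$    # stand-in: the current shell; substitute the JON server PID
for i in 1 2 3; do
    # each directory under /proc/<pid>/task is one kernel-visible thread
    count=$(ls /proc/"$pid"/task | wc -l)
    echo "sample $i: $count threads"
    sleep 1    # use 'sleep 60' to match the one-minute cadence seen here
done
```

A steadily climbing count across samples, with storage down, would reproduce the behaviour described above.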
The thread leak was caused by the Cassandra driver: the driver does not correctly dispose of its internal metrics-reporting classes when a connection fails. Steps taken to address this:

1) Disable driver metrics, since they are not used by the application.
2) Call cluster shutdown as insurance.
3) Remove a superfluous call that initialized a second session when one had already been created.

Note: on every single connection attempt, the server code retrieves the full list of storage nodes from the database. That list is then passed, processed, and copied around a few times until it reaches the driver. Because of this, JVM memory usage increases between GC runs while a connection to the storage cluster cannot be established. When the GC runs, a huge amount of space is freed from the "PS Eden Space", so there is no memory leak; the GC reclaims all of it. Loading the storage node information from the database on every attempt is desirable because storage nodes can be added to the database by other HA server installations or by storage installers; having a fresh copy of the data prior to attempting a connection increases the chance of success.

release/jon3.2.x branch commit: https://git.fedorahosted.org/cgit/rhq/rhq.git/commit/?h=release/jon3.2.x&id=84bdc0f727dee089e4e5e6d19e09000111a4a5f1

So as far as I can tell, during the course of investigating this bug we determined that the RHEL default of 1024 processes/threads per user should be enough to start a basic install of JON. However, I am still missing sizing information for servers that need to handle a bigger load. It might be as simple as saying that if users run into OOMs, it may be necessary to increase the ulimit nproc parameter in addition to increasing the server's maximum heap size.
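The nproc limit discussed above can be checked, and raised, from the shell. A hedged sketch: the `jonuser` name and the value 4096 are illustrative assumptions, not sizing recommendations from this bug.

```shell
# Show the current per-user process/thread limit (RHEL default: 1024).
ulimit -u

# To raise it persistently for the user running JON, lines like the
# following can be added to /etc/security/limits.conf (illustrative values):
#   jonuser  soft  nproc  4096
#   jonuser  hard  nproc  4096
```

The limits.conf change takes effect on the user's next login session, so the JON server would need a restart from a fresh session to pick it up.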
Creating a doc bug where we can track that requirement. Doc bug: BZ 1028639.

Moving to ON_QA as available for testing with the new brew build.

Mass-moving all of these from ER6 to target milestone ER07, since the ER6 build was bad and QE was halted for that reason.

Verified on:
Version: 3.2.0.CR1
Build Number: 6ecd678:d0dc0b6

I had the following two environments running over the weekend:
1. JON server with 2 agents and one storage node; the storage node was DOWN.
2. JON server with 5 agents and 2 storage nodes; one of the storage nodes was DOWN.

JON servers in both environments were using ~250 threads.
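A one-shot thread count like the ~250-thread verification above can be taken with `ps`; NLWP ("number of light-weight processes") is the process's thread count. A minimal sketch, using the current shell's PID as a stand-in for the JON server's:

```shell
# Print the thread count (NLWP) for a single PID; the '=' suppresses
# the column header. Substitute the JON server PID for a real check.
pid=$$
ps -o nlwp= -p "$pid"
```

On a healthy post-fix server, this number should stay roughly stable even while a storage node is down, rather than growing every minute.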