956442 – Server script rhq-server.sh 'stop' command creates many shutdown threads that can exhaust user process limits

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	JBoss Operations Network
Classification:	JBoss
Component:	Launch Scripts
Sub Component:
Version:	JON 3.1.2
Hardware:	All
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	ER01
Target Release:	JON 3.2.0
Assignee:	Thomas Segismont
QA Contact:	Mike Foley
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2013-04-24 23:30 UTC by Larry O'Leary
Modified:	2018-12-02 18:47 UTC
CC List:	2 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2014-01-02 20:35:45 UTC
Type:	Bug
Embargoed:

Attachments
Measuring the number of threads of RHQ server before and during its stopping. (272.51 KB, image/png) 2013-10-29 13:01 UTC, jvlasak

Description Larry O'Leary 2013-04-24 23:30:30 UTC

Description of problem:
The shutdown script:

    rhq-server.sh stop

generates many shutdown threads in the server process, which can result in the following JVM error:

    java.lang.OutOfMemoryError: unable to create new native thread

At that point, the user who started the ON server may not be able to log in to the machine or execute any commands. The result can be catastrophic.

Version-Release number of selected component (if applicable):
4.4.0.JON312GA

How reproducible:
Always

Steps to Reproduce:
1. Start JBoss ON server
2. Load it up with some task
3. Use the rhq-server.sh stop command
  
Actual results:
During shutdown, monitoring the server's threads shows that multiple shutdown threads are created (one every two seconds until the process has completely shut down).

The thread stack will look similar to:

    Thread 5977: (state = BLOCKED)
     - java.lang.Shutdown.exit(int) @bci=96, line=168 (Interpreted frame)
     - java.lang.Terminator$1.handle(sun.misc.Signal) @bci=8, line=35 (Interpreted frame)
     - sun.misc.Signal$1.run() @bci=8, line=195 (Interpreted frame)
     - java.lang.Thread.run() @bci=11, line=662 (Interpreted frame)


Expected results:
Only one shutdown thread should be invoked.

Additional info:
This is the result of the kill loop in the stop command of rhq-server.sh, which resends TERM every two seconds:

    'stop')
            ...
            while [ "$_SERVER_RUNNING" = "1"  ]; do
               kill -TERM $_SERVER_PID
               sleep 2
               check_status "stopping..."
            done
            ...



Each time kill -TERM is invoked, a new thread is created. We should send the TERM only once, wait a while, and finally send a KILL if a configurable timeout is exceeded. The following example actually sends TERM up to 5 times and follows it with a KILL if the server is still running after 20 minutes. It also sends up to 11 QUIT signals if debug is enabled:

    'stop')
            ...
            wait=2
            maxWaitTime=1200
            waited=0
            while [ "$_SERVER_RUNNING" = "1"  ]; do
               [ -n "$RHQ_SERVER_DEBUG" ] && 
                   [ "$RHQ_SERVER_DEBUG" != "false" ] && 
                       [ $(( ($waited / $wait) % (($maxWaitTime / $wait) / 10) )) -eq 0 ] && 
                           kill -QUIT $_JVM_PID
               
               [ $(( ($waited / $wait) % (($maxWaitTime / $wait) / 4) )) -eq 0 ] && 
                   kill -TERM $_SERVER_PID
                   
               [ $waited -ge $maxWaitTime ] &&
                   kill -KILL $_SERVER_PID &&
                       break

               waited=$(( waited + wait ))
               sleep $wait
               check_status "stopping..."
            done

            check_status "stopping..."
            if [ "$_SERVER_RUNNING" = "1"  ]; then
                debug_msg "RHQ Server did not stop within $waited seconds."
                echo "Timed out waiting for RHQ Server to stop."
                exit 127
            fi
            remove_pid_files
            echo "RHQ Server has stopped."
            exit 0
            ;;
            ...


Please note that it may not be ideal to issue the KILL signal at all. Perhaps after waiting 5 minutes, we should just return, indicating that the server did not stop within 5 minutes (or whatever our default value should be). That way, it is left up to the user to either adjust the timeout or add their own KILL signal via an external process based on the exit code of 127.
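
The signal counts quoted for the example loop (up to 5 TERM, up to 11 QUIT over 20 minutes) can be checked by replaying only the loop's counters. The following is a dry-run sketch that sends no signals and does not sleep; the names term_count and quit_count are introduced here for illustration and do not exist in rhq-server.sh.

```shell
#!/bin/sh
# Dry run of the example stop loop's arithmetic: no signals are sent, no sleeping.
# term_count and quit_count are illustrative names, not part of rhq-server.sh.
wait=2
maxWaitTime=1200
waited=0
term_count=0
quit_count=0

while :; do
    step=$(( waited / wait ))

    # QUIT is sent every (maxWaitTime / wait) / 10 iterations (debug mode only).
    [ $(( step % ((maxWaitTime / wait) / 10) )) -eq 0 ] &&
        quit_count=$(( quit_count + 1 ))

    # TERM is sent every (maxWaitTime / wait) / 4 iterations.
    [ $(( step % ((maxWaitTime / wait) / 4) )) -eq 0 ] &&
        term_count=$(( term_count + 1 ))

    # The real loop sends KILL and breaks at this point.
    [ "$waited" -ge "$maxWaitTime" ] && break

    waited=$(( waited + wait ))
done

echo "TERM signals: $term_count, QUIT signals: $quit_count, waited: ${waited}s"
```

With wait=2 and maxWaitTime=1200, TERM fires on iterations 0, 150, 300, 450 and 600, and QUIT fires on every 60th iteration from 0 through 600.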

Comment 1 Thomas Segismont 2013-05-28 16:01:23 UTC

Is there a reason to send multiple QUIT or TERM signals? Can't we just send one TERM, wait X minutes, and then send KILL?

Comment 2 Thomas Segismont 2013-05-28 16:19:24 UTC

I'm thinking of having two variables in the script:
* one to set the number of minutes to wait for the server to shut down after the TERM signal
* another to decide whether the server should be killed if it is still up after the waiting period
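
A stop routine driven by those two settings might look roughly like the sketch below. It rests on assumptions: the function name stop_once and its three parameters are hypothetical, the timeout is given in seconds rather than minutes to keep the example short, and nothing here is taken from the committed fix.

```shell
#!/bin/sh
# Hypothetical helper; names and parameters are illustrative only.
# Usage: stop_once PID TIMEOUT_SECONDS KILL_AFTER_TIMEOUT(true|false)
# Returns 0 if the process is gone (or was killed), 127 if it is still running.
stop_once() {
    pid=$1
    timeout=$2
    kill_after=$3
    waited=0

    # Send TERM exactly once so only one shutdown thread is started.
    kill -TERM "$pid" 2>/dev/null || return 0   # process already gone

    # Signal 0 only tests whether the process exists; nothing is delivered.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            if [ "$kill_after" = "true" ]; then
                kill -KILL "$pid" 2>/dev/null
                return 0
            fi
            return 127
        fi
        sleep 1
        waited=$(( waited + 1 ))
    done
    return 0
}
```

For example, `stop_once "$_SERVER_PID" 300 false` would leave a stuck server running and return 127, so the caller can decide whether to raise the timeout or kill the process itself.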

Comment 3 Larry O'Leary 2013-05-28 16:32:50 UTC

(In reply to Thomas Segismont from comment #1)
> Is there a reason to send multiple QUIT or TERM signals? Can't we just send
> one TERM, wait X minutes, and then send KILL?

The QUIT is for thread dumps in the event RHQ_SERVER_DEBUG is enabled. This should be sent multiple times to generate a thread dump.

As for the TERM, I really don't think so. I'm not sure why we were sending multiple in the first place. I felt that sending four during the total wait was better than sending one every two seconds. But my original solution only included one TERM, as this should be sufficient to trigger the shutdown hook.


(In reply to Thomas Segismont from comment #2)
> I'm thinking of having two variables in the script:
> * one to set the number of minutes to wait for the server to shut down after
> the TERM signal
> * another to decide whether the server should be killed if it is still up
> after the waiting period

This sounds like a good idea. It was actually what I was intending as well. My original thought was that if stop failed, the user could manually invoke rhq-server.sh kill to kill the server using SIGKILL. But we could provide an option to do this automatically if the user chooses to.

Comment 4 Thomas Segismont 2013-06-05 09:26:02 UTC

Fixed in master

commit a704d926b0adafa8f197ba4eaf87b3d3ca858187
Author: Thomas Segismont <tsegismo>
Date:   Fri May 31 11:21:40 2013 +0200

    Bug 956442 - Server script rhq-server.sh 'stop' command creates many shutdown threads that can exhaust user process limits
    
    Added two variables in the script:
    * one to set the number of minutes to wait for the server to shutdown after the TERM signal
    * another to decide if the server should be killed if still being up after the waiting period
    
    Stop and kill cases updated accordingly

Comment 5 Larry O'Leary 2013-09-06 14:32:17 UTC

As this is MODIFIED or ON_QA, setting milestone to ER1.

Comment 6 jvlasak 2013-10-29 13:01:08 UTC

Created attachment 817083
Measuring the number of threads of RHQ server before and during its stopping.

Verified.
The number of threads of the RHQ server decreased while it was stopping. Before stopping the RHQ server, the number of its threads was determined with the command (ps -e -T | grep -c <SERVER_PID>), and 2 operations on the platform (Manual Autodiscovery and view Process List) were executed. While the server was stopping, its thread count was measured several times. The results can be seen in the attached screenshot.
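
A measurement like this can also be scripted. The sketch below is a Linux-only alternative to the ps/grep command and is not the procedure used for this verification; the function names thread_count and watch_threads are made up for illustration. It counts the entries under /proc/<pid>/task, which avoids grep also matching the PID where it appears in other columns or as part of a longer number.

```shell
#!/bin/sh
# Illustrative helpers (hypothetical names), Linux only: they rely on /proc.

# thread_count PID: print the number of threads of the given process.
thread_count() {
    # Every thread of a process has one directory under /proc/<pid>/task.
    ls "/proc/$1/task" 2>/dev/null | wc -l
}

# watch_threads PID INTERVAL_SECONDS: print a timestamped thread count
# until the process no longer exists.
watch_threads() {
    while [ -d "/proc/$1" ]; do
        echo "$(date +%H:%M:%S) $(thread_count "$1")"
        sleep "$2"
    done
}
```

Running `watch_threads <SERVER_PID> 2` in one terminal while issuing `rhq-server.sh stop` in another would show whether the count climbs or falls during shutdown.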
