Description of problem:
The shutdown script command

    rhq-server.sh stop

generates many shutdown threads in the server process, which can result in the following JVM error:

    java.lang.OutOfMemoryError: unable to create new native thread

At that point, the user who started the ON server may be unable to log in to the machine or execute any commands. The result can be catastrophic.

Version-Release number of selected component (if applicable):
4.4.0.JON312GA

How reproducible:
Always

Steps to Reproduce:
1. Start JBoss ON server
2. Load it up with some task
3. Use the rhq-server.sh stop command

Actual results:
During shutdown, monitoring the threads shows that multiple shutdown threads are created (one every two seconds until the process has completely shut down). The thread stack looks similar to:

    Thread 5977: (state = BLOCKED)
     - java.lang.Shutdown.exit(int) @bci=96, line=168 (Interpreted frame)
     - java.lang.Terminator$1.handle(sun.misc.Signal) @bci=8, line=35 (Interpreted frame)
     - sun.misc.Signal$1.run() @bci=8, line=195 (Interpreted frame)
     - java.lang.Thread.run() @bci=11, line=662 (Interpreted frame)

Expected results:
Only one shutdown thread should be created.

Additional info:
This is the result of a kill loop in the stop command of rhq-server.sh:

    'stop')
        ...
        while [ "$_SERVER_RUNNING" = "1" ]; do
            kill -TERM $_SERVER_PID
            sleep 2
            check_status "stopping..."
        done
        ...

Each time kill -TERM is invoked, a new thread is created. We should send the TERM only once, wait a while, and finally send a KILL if a configurable timeout is exceeded.

The following example actually sends the TERM up to 5 times and follows it with a KILL if the server is still running after 20 minutes. It also sends up to 11 QUIT signals if debug is enabled:

    'stop')
        ...
        wait=2
        maxWaitTime=1200
        waited=0
        while [ "$_SERVER_RUNNING" = "1" ]; do
            [ -n "$RHQ_SERVER_DEBUG" ] && [ "$RHQ_SERVER_DEBUG" != "false" ] && [ $(( ($waited / $wait) % (($maxWaitTime / $wait) / 10) )) -eq 0 ] && kill -QUIT $_JVM_PID
            [ $(( ($waited / $wait) % (($maxWaitTime / $wait) / 4) )) -eq 0 ] && kill -TERM $_SERVER_PID
            [ $waited -ge $maxWaitTime ] && kill -KILL $_SERVER_PID && break
            ((waited += $wait))
            sleep $wait
            check_status "stopping..."
        done
        check_status "stopping..."
        if [ "$_SERVER_RUNNING" = "1" ]; then
            debug_msg "RHQ Server did not stop within $waited seconds."
            echo "Timed out waiting for RHQ Server to stop."
            exit 127
        fi
        remove_pid_files
        echo "RHQ Server has stopped."
        exit 0
        ;;
    ...

Please note that it may not be ideal to issue the KILL signal at all. Perhaps after waiting 5 minutes, we should just return, indicating that the server did not stop within 5 minutes (or whatever our default value should be). That way, it is left to the user either to adjust the timeout or to add their own KILL signal via an external process, based on the exit code of 127.
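As a check on the signal counts quoted above, the loop's modulo arithmetic can be replayed without sending any signals. This is only an illustrative sketch: count_signals is a hypothetical helper, not part of rhq-server.sh, and it assumes the server never stops, so the loop runs until waited reaches maxWaitTime.

```shell
# Hypothetical helper (not part of rhq-server.sh): replays the loop's
# arithmetic and counts the signals instead of sending them.
# Assumes the server never stops, so the loop runs to the timeout.
count_signals() {
    local wait=$1 maxWaitTime=$2
    local waited=0 term=0 quit=0 tick
    local ticks=$(( maxWaitTime / wait ))
    while :; do
        tick=$(( waited / wait ))
        # QUIT (thread dump) every 1/10 of the total wait, when debug is enabled
        if [ $(( tick % (ticks / 10) )) -eq 0 ]; then quit=$(( quit + 1 )); fi
        # TERM every 1/4 of the total wait
        if [ $(( tick % (ticks / 4) )) -eq 0 ]; then term=$(( term + 1 )); fi
        # The timeout check comes after both sends, as in the proposed loop
        if [ "$waited" -ge "$maxWaitTime" ]; then break; fi
        waited=$(( waited + wait ))
    done
    echo "TERM=$term QUIT=$quit"
}

count_signals 2 1200   # prints: TERM=5 QUIT=11
```

The fifth TERM and the eleventh QUIT both fall on the final iteration (waited = 1200), immediately before the KILL.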
Is there a reason to send multiple QUIT or TERM signals? Can't we just send one TERM, wait X minutes, and then send KILL?
I'm thinking of having two variables in the script:
* one to set the number of minutes to wait for the server to shut down after the TERM signal
* another to decide whether the server should be killed if it is still up after the waiting period
(In reply to Thomas Segismont from comment #1)
> Is there a reason to send multiple QUIT or TERM signals ? Can't we just send
> one TERM, wait X minutes and then send KILL ?

The QUIT is for thread dumps in the event RHQ_SERVER_DEBUG is enabled. It should be sent multiple times to generate thread dumps.

As for the TERM, I really don't think so. I'm not sure why we were sending multiple in the first place. I felt that sending four during the total wait was better than sending one every two seconds. But my original solution included only one TERM, as this should be sufficient to trigger the shutdown hook.

(In reply to Thomas Segismont from comment #2)
> I'm thinking of having two variables in the script:
> * one to set the number of minutes to wait for the server to shutdown after
> the TERM signal
> * another to decide if the server should be killed if still being up after
> the waiting period

This sounds like a good idea, and is actually what I was intending as well. My original thought was that if stop failed, the user could manually invoke rhq-server.sh kill to kill the server using SIGKILL. But we could provide an option to do this automatically if the user chooses to.
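The approach agreed on above (one TERM, a bounded wait, and an optional KILL) could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the code that was committed: stop_process and its three parameters are hypothetical names, the timeout is given in seconds rather than minutes to keep the example short, and liveness is checked with kill -0 instead of the script's own check_status.

```shell
# Hypothetical helper, not the actual rhq-server.sh code.
# Usage: stop_process PID TIMEOUT_SECONDS KILL_AFTER_TIMEOUT(true|false)
# Returns 0 if the process is gone (or was sent KILL), 127 on timeout.
stop_process() {
    local pid=$1 timeout=$2 kill_after=$3
    local poll=1 waited=0

    # Send TERM exactly once; if that fails, the process is already gone
    # (or not ours to signal), so there is nothing to wait for.
    kill -TERM "$pid" 2>/dev/null || return 0

    # kill -0 sends no signal; it only tests whether the process exists.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            if [ "$kill_after" = "true" ]; then
                kill -KILL "$pid" 2>/dev/null
                return 0
            fi
            # Leave the decision to the caller, as suggested in the report.
            return 127
        fi
        sleep "$poll"
        waited=$(( waited + poll ))
    done
    return 0
}
```

With the second flag set to false, a caller can react to exit status 127 itself, for example by extending the timeout or by invoking rhq-server.sh kill.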
Fixed in master

commit a704d926b0adafa8f197ba4eaf87b3d3ca858187
Author: Thomas Segismont <tsegismo>
Date:   Fri May 31 11:21:40 2013 +0200

    Bug 956442 - Server script rhq-server.sh 'stop' command creates many shutdown threads that can exhaust user process limits

    Added two variables in the script:
    * one to set the number of minutes to wait for the server to shutdown after the TERM signal
    * another to decide if the server should be killed if still being up after the waiting period

    Stop and kill cases updated accordingly
As this is MODIFIED or ON_QA, setting milestone to ER1.
Created attachment 817083
Measuring the number of threads of RHQ server before and during its stopping.

Verified. The number of RHQ server threads decreased while the server was stopping.

Before stopping the RHQ server, its thread count was determined with the command (ps -e -T | grep -c <SERVER_PID>), and 2 operations on the platform (Manual Autodiscovery and view Process List) were executed. While the server was stopping, its thread count was measured a few more times. The result of the whole job can be seen in the attached screenshot.
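For anyone repeating this measurement, a slightly stricter count can be taken from /proc, since grep -c on the bare PID also counts lines of unrelated processes whose PID or thread ID happens to contain the same digits. The sketch below assumes Linux with /proc mounted; thread_count and watch_threads are hypothetical helper names, not part of RHQ.

```shell
# Hypothetical helpers; assume Linux with /proc mounted.

# Print the kernel's thread count for the given PID.
thread_count() {
    awk '/^Threads:/ { print $2 }' "/proc/$1/status" 2>/dev/null
}

# Print a timestamped thread count once per second until the process exits.
watch_threads() {
    while [ -d "/proc/$1" ]; do
        echo "$(date +%T) threads=$(thread_count "$1")"
        sleep 1
    done
}
```

Running watch_threads <SERVER_PID> in one terminal while issuing rhq-server.sh stop in another should show whether the count rises or falls during shutdown.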