Bug 893802

Summary: [as7] Restart operation fails to start AS when shutdown hasn't completed when start script is invoked
Product: [JBoss] JBoss Operations Network
Reporter: Larry O'Leary <loleary>
Component: Plugin -- JBoss EAP 6
Assignee: Jirka Kremser <jkremser>
Status: CLOSED CURRENTRELEASE
QA Contact: Mike Foley <mfoley>
Severity: high
Priority: unspecified
Version: JON 3.1.1
CC: bkramer, jkremser, lzoubek, myarboro, rhatlapa
Target Milestone: ER01   
Target Release: JON 3.2.0   
Hardware: All   
OS: Linux   
Doc Type: Bug Fix
Type: Bug
Attachments:
Description Flags
start script with the pid check at the beginning none

Description Larry O'Leary 2013-01-09 23:33:31 UTC
Description of problem:
Although this primarily happens when using a System V init script to start the EAP server, it could also happen when using customized start scripts found in the AS installation's bin directory.

If the start script checks whether the existing process is still running, or whether its process ID file or run lock is still in place, it will fail to start the server because the shutdown has not yet completed. Even though shutdown has been executed and the management interface is no longer accepting new commands, the AS instance is still running. However, the restart operation of the AS7 plug-in and its BaseServerComponent.waitUntilDown() method only check whether new commands can be executed via the management interface. Obviously they cannot, considering that we just shut the server down and received a success message.

Instead, the restart operation needs to make certain that the process has shut down prior to invoking the startServer method.

Version-Release number of selected component (if applicable):
4.4.0.JON311GA

How reproducible:
Frequently

Steps to Reproduce:
1.  Create a custom start script and an init script wrapper:

    The goal here is to simply have the start operation verify that the server is not currently running and fail if it is. This is what the shipped init script does in its start function. However, this can be simulated using the standalone.sh script as well and adding something similar to:
    
          if [ -f $JBOSS_PIDFILE ]; then
            read ppid < $JBOSS_PIDFILE
            if [ `ps --pid $ppid 2> /dev/null | grep -c $ppid 2> /dev/null` -eq '1' ]; then
              echo -n "$prog is already running"
              failure
              echo
              return 1 
            else
              rm -f $JBOSS_PIDFILE
            fi
          fi

2.  Start EAP 6 standalone server using custom start script or shipped init script
3.  Start ON system
4.  Import EAP 6 server into inventory and make sure it is configured and reports as available
5.  Update EAP 6 resource's connection settings to use custom start script for starting EAP resource
6.  Connect to EAP server at http://localhost:8080 and also connect to its HTTP management console
7.  From the EAP resource in inventory, invoke the restart operation.
  
Actual results:
EAP server is shut down but is not started back up. If running with agent debug logging enabled, the output will indicate that the start was not performed because the server was already running:

    $prog is already running

Expected results:
EAP server should be restarted without any error.

Additional info:
This issue is all about timing. The existing plug-in code does not wait for the EAP process to exit. Therefore, if the start script checks whether the server is already running, either by using an existing/known PID or by performing a process scan, it will fail to start. The delay between the time the shutdown operation is invoked and the time the process actually exits depends on the work that the EAP server is doing. Basically, the shutdown operation as implemented in the AS7 plug-in simply tells the server to shut down, at which point the server begins its shutdown process. It first stops new work from being submitted and then finishes existing requests and closes out existing connections that are no longer needed. However, because the server is no longer accepting new work, our avail pings fail fast, so the plug-in treats the server as down while its process is still running.
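
To illustrate the kind of check that is missing, here is a minimal shell sketch that polls once a second until a given PID has exited or a timeout is reached. It is not the plug-in's code (the plug-in is written in Java); the function name wait_for_exit and the default timeout are assumptions made for this example only. It also assumes the caller is allowed to signal the target process, since kill -0 cannot distinguish a missing process from one it has no permission to signal.

```shell
#!/bin/sh
# Hypothetical helper, for illustration only.
# Usage: wait_for_exit PID [TIMEOUT_SECONDS]
# Returns 0 once no process with that PID can be signalled,
# or 1 if the process is still alive after TIMEOUT_SECONDS.
wait_for_exit() {
    # 'local' is supported by bash and dash, although it is not strict POSIX.
    local pid timeout elapsed
    pid=$1
    timeout=${2:-20}   # assumed default for this sketch
    elapsed=0
    # kill -0 delivers no signal; it only tests whether the PID exists
    # and can be signalled by the current user.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1   # still running: give up
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0           # process is gone
}
```

A start wrapper could read the old PID from $JBOSS_PIDFILE and call wait_for_exit "$ppid" before running its "already running" check, so that a server that is still finishing its shutdown is given time to exit instead of causing an immediate failure.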

Comment 1 Jirka Kremser 2013-02-08 14:27:36 UTC
master
http://git.fedorahosted.org/cgit/rhq/rhq.git/diff/?id=c17aa632d

time:    Fri Feb 8 15:18:40 2013 +0100
commit:  c17aa632dca792707a4c84ba16245efb9c4ed99a
author:  Jirka Kremser - jkremser
message: [BZ Bug 893802 - [as7] Restart operation fails to start AS when shutdown hasn't completed when start script is invoked] If the AS is deployed on the same node as the agent, the check whether the process is running or not is done in a loop once a second. This new check was added after the original (using the DMR api polling) check. In total, the AS has up to 20 second to finish its work and completely shut down.

Changing status to MODIFIED.

Comment 2 Jirka Kremser 2013-02-08 15:06:55 UTC
Created attachment 695128 [details]
start script with the pid check at the beginning

Adding the complete script for easier verification. (used on Fedora17)

Comment 3 Jirka Kremser 2013-02-08 15:07:46 UTC
As discussed during today's hangout, bugs with 3.2.0 release should go to ON_QA, so I am moving it to ON_QA.

Comment 4 Jirka Kremser 2013-02-12 16:12:09 UTC
master
http://git.fedorahosted.org/cgit/rhq/rhq.git/diff/?id=da8012a65

time:    Tue Feb 12 16:58:31 2013 +0100
commit:  da8012a65541b86ba02ed2405b983f9b6cd0cb3f
author:  Jirka Kremser - jkremser
message: [BZ 893802 [as7] Restart operation fails to start AS when shutdown hasn't completed when start script is invoked] I have added NPE check, because I got NPE in certain circumstances.

Comment 7 Radim Hatlapatka 2013-03-01 14:41:35 UTC
Setting state to MODIFIED. Please put it back to ON_QA when testable builds that incorporate this fix become available.

Comment 8 Larry O'Leary 2013-09-06 14:30:35 UTC
As this is MODIFIED or ON_QA, setting milestone to ER1.

Comment 9 Libor Zoubek 2013-11-18 18:31:44 UTC
verified on ER05