Description of problem:
Although this primarily happens when using a System V init script to start an EAP server, it can also happen when using customized start scripts found in the AS installation's bin directory. If the start script checks whether the existing process is still running, or whether its process ID file or run lock is in place, it will fail to start because the shutdown operation has not yet completed. Even though shutdown has been executed and the management interface is no longer accepting new commands, the AS instance is still running. However, the restart operation of the AS7 plug-in and its BaseServerComponent.waitUntilDown() method only check whether new commands can be executed via the management interface. Obviously they cannot, considering that we just shut the server down and received a success message. Instead, the restart operation needs to make certain that the process has shut down prior to invoking the startServer method.

Version-Release number of selected component (if applicable):
4.4.0.JON311GA

How reproducible:
Frequently

Steps to Reproduce:
1. Create a custom start script and an init script wrapper:
   The goal here is simply to have the start operation verify that the server is not currently running, and fail if it is. This is what the shipped init script does in its start function. However, this can be simulated using the standalone.sh script as well by adding something similar to:

     if [ -f $JBOSS_PIDFILE ]; then
       read ppid < $JBOSS_PIDFILE
       if [ `ps --pid $ppid 2> /dev/null | grep -c $ppid 2> /dev/null` -eq '1' ]; then
         echo -n "$prog is already running"
         failure
         echo
         return 1
       else
         rm -f $JBOSS_PIDFILE
       fi
     fi

2. Start the EAP 6 standalone server using the custom start script or the shipped init script
3. Start the ON system
4. Import the EAP 6 server into inventory and make sure it is configured and reports as available
5. Update the EAP 6 resource's connection settings to use the custom start script for starting the EAP resource
6. Connect to the EAP server at http://localhost:8080 and also connect to its HTTP management console
7. From the EAP resource in inventory, invoke the restart operation.

Actual results:
The EAP server is shut down but is not started back up. If running with agent debug logging enabled, the output will indicate that the start was not performed because the server was already running:
$prog is already running

Expected results:
The EAP server should be restarted without any error.

Additional info:
This issue is all about timing. In the existing plug-in code we are not waiting for EAP to shut down. Therefore, if the start script checks whether the server is already running, either with an existing/known PID or by performing a process scan, it will fail to start.

The delay between the time the shutdown operation is invoked and the time the process actually exits depends on the work that the EAP server is doing. Basically, the shutdown operation as implemented in the AS7 plug-in simply tells the server to shut down, at which point the server begins the shutdown process. It first stops new work from being submitted and then finishes existing requests and closes out existing connections that are no longer needed. However, because the server is no longer accepting new work, our avail pings fail fast even though the process is still running.
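For anyone reproducing this, the start-script guard described in step 1 can be sketched as a standalone function. This is only an illustration, not the shipped init script: the function name is_server_running and its variable names are made up here, and it uses kill -0 instead of the ps/grep pipeline, which assumes the caller is allowed to signal the server process. While the old process is still shutting down, this check keeps reporting "running", which is exactly why a start issued too early is refused.

```shell
#!/bin/sh
# Hypothetical helper (not part of the shipped init script).
# Returns 0 if the PID stored in the given pid file belongs to a live
# process, 1 otherwise. A stale pid file is removed, as in the report.
is_server_running() {
    isr_pidfile=$1

    # No pid file means nothing is recorded as running.
    [ -f "$isr_pidfile" ] || return 1

    isr_pid=
    # read may return non-zero if the file lacks a trailing newline.
    read -r isr_pid < "$isr_pidfile" || true

    # Reject empty or non-numeric content before calling kill.
    case $isr_pid in
        ''|*[!0-9]*) rm -f "$isr_pidfile"; return 1 ;;
    esac

    # kill -0 delivers no signal; it only tests whether the PID exists
    # and can be signalled by the caller (assumption: same user or root).
    if kill -0 "$isr_pid" 2>/dev/null; then
        return 0
    fi

    # Process is gone: clean up the stale pid file.
    rm -f "$isr_pidfile"
    return 1
}
```

A start function would call is_server_running "$JBOSS_PIDFILE" first and return 1 with the "already running" message when it succeeds.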
master
http://git.fedorahosted.org/cgit/rhq/rhq.git/diff/?id=c17aa632d
time: Fri Feb 8 15:18:40 2013 +0100
commit: c17aa632dca792707a4c84ba16245efb9c4ed99a
author: Jirka Kremser - jkremser
message: [BZ Bug 893802 - [as7] Restart operation fails to start AS when shutdown hasn't completed when start script is invoked]
If the AS is deployed on the same node as the agent, the check whether the process is running or not is done in a loop once a second. This new check was added after the original (using the DMR api polling) check. In total, the AS has up to 20 seconds to finish its work and completely shut down.

Changing status to MODIFIED.
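The once-a-second process check described in this commit can be illustrated with a small shell loop. This is a sketch only and not the plug-in code, which is Java: the function name wait_for_exit is hypothetical, the default of 20 attempts is an assumption chosen to mirror the overall 20 second budget mentioned above, and kill -0 assumes the caller may signal the server process.

```shell
#!/bin/sh
# Hypothetical helper (illustration only, not the plug-in implementation).
# Polls once a second until the given PID has exited.
#   $1 = PID to watch
#   $2 = maximum number of one-second waits (assumed default: 20)
# Returns 0 once the process is gone, 1 if it is still alive afterwards.
wait_for_exit() {
    wfe_pid=$1
    wfe_attempts=${2:-20}

    while [ "$wfe_attempts" -gt 0 ]; do
        # kill -0 sends no signal; failure means the PID no longer exists
        # (or cannot be signalled by this user, which is assumed away here).
        kill -0 "$wfe_pid" 2>/dev/null || return 0
        sleep 1
        wfe_attempts=$((wfe_attempts - 1))
    done

    # One last check after the final sleep.
    kill -0 "$wfe_pid" 2>/dev/null || return 0
    return 1
}
```

A restart wrapper could call wait_for_exit with the old PID between the shutdown and the start, and only invoke the start script when it returns 0.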
Created attachment 695128
start script with the pid check at the beginning

Adding the complete script for easier verification (used on Fedora 17).
As discussed during today's hangout, bugs with 3.2.0 release should go to ON_QA, so I am moving it to ON_QA.
master
http://git.fedorahosted.org/cgit/rhq/rhq.git/diff/?id=da8012a65
time: Tue Feb 12 16:58:31 2013 +0100
commit: da8012a65541b86ba02ed2405b983f9b6cd0cb3f
author: Jirka Kremser - jkremser
message: [BZ 893802 [as7] Restart operation fails to start AS when shutdown hasn't completed when start script is invoked]
I have added an NPE check, because I got an NPE in certain circumstances.
Setting state to MODIFIED. Please put it back to ON_QA once testable builds incorporating this fix become available.
As this is MODIFIED or ON_QA, setting milestone to ER1.
verified on ER05