Bug 1089757 - Install fails if shutdown process takes more then 10 seconds
Summary: Install fails if shutdown process takes more then 10 seconds
Status: CLOSED CURRENTRELEASE
Alias: None
Product: JBoss Operations Network
Classification: JBoss
Component: Installer
Version: JON 3.2
Hardware: Unspecified
OS: Unspecified
unspecified
high
Target Milestone: DR01
: JON 3.3.0
Assignee: John Mazzitelli
QA Contact: Mike Foley
URL:
Whiteboard:
Keywords:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2014-04-21 21:53 UTC by Larry O'Leary
Modified: 2018-12-05 18:16 UTC (History)
2 users (show)

(edit)
During Installer execution, a shutdown is performed on the newly installed component(s). 

The shutdown delay was hard-coded to ten seconds. While ten seconds was adequate for development environments with no load and the database running locally, a production environment may take anywhere from a few seconds to upwards of a minute to shut down. If the target component took longer than 10 seconds to shutdown, the entire install was rolled back and aborted.

The fix implements a number of improvements to the Installer shutdown process. The timeout has been increased to 30 seconds, from 10 seconds. Additionally, a system property can be set to customize the timeout value through RHQ_CONTROL_ADDITIONAL_JAVA_OPTS using the -D java option. The sysprop name is "rhqctl.wait-for-process-to-stop-timeout-secs". Additional logging is also available to assist with tracking down issues in this area.
Clone Of:
(edit)
Last Closed: 2014-12-11 14:04:11 UTC


Attachments (Terms of Use)
rhqctl.log (6.92 KB, text/plain)
2014-08-21 15:14 UTC, Viet Nguyen
no flags Details


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 796083 None None None Never
Red Hat Bugzilla 1096923 None None None Never

Internal Trackers: 1096923

Description Larry O'Leary 2014-04-21 21:53:45 UTC
Description of problem:
During the execution of the installer, a shutdown is performed on the newly installed component(s). If the target component takes longer then 10 seconds to shutdown, the entire install is rolled back and aborted.

The shutdown delay seems to be hard-coded and therefore can not be overridden by the user. Although 10 seconds may be optimal on a development machine with no load and the database running locally, a production environment could see a shutdown taking anywhere from a few seconds to upwards of a minute.

Version-Release number of selected component (if applicable):
3.2.0.GA

How reproducible:
Always

Steps to Reproduce:
1.  From Solaris 10 machine, perform install:

        rhqctl install

Actual results:
Install fails and the following is printed to the screen:

16:36:51,060 INFO  [org.rhq.server.control.command.Install] Finished configuring the agent
Trying to stop the RHQ Server...
RHQ Server (pid=3287) is stopping...
RHQ Server has stopped.
Stopping RHQ storage node...
RHQ storage node (pid=3277) is stopping...
16:37:03,231 ERROR [org.rhq.server.control.RHQControl] Process [3277] did not finish yet. Terminate it manually and retry.
16:37:03,232 WARN  [org.rhq.server.control.command.Install] UNDO: Removing agent install directory
16:37:03,295 WARN  [org.rhq.server.control.command.Install] UNDO: Removing server-installed marker file and management user
16:37:03,301 WARN  [org.rhq.server.control.command.Install] UNDO: Stopping component: --server
16:37:03,302 WARN  [org.rhq.server.control.command.Install] UNDO: Stopping component: --storage
16:37:03,302 WARN  [org.rhq.server.control.command.Install] UNDO: Removing storage node data and install directories
16:37:03,812 WARN  [org.rhq.server.control.command.Install] UNDO: Reverting server properties file


Expected results:
Install should not fail and installer should wait longer for rhq-storage node to finish its shutdown process.

Additional info:
The offending code is in ControlCommand.waitForProcessToStop[1]:

            int tries = 5;
            while (tries > 0) {
                log.debug(".");
                if (!isUnixPidRunning(pid)) {
                    break;
                }
                Thread.sleep(2*1000L);
                tries--;
            }


This code is executed during the shutdown request and because a hard-coded retry count and sleep are being used, the maximum amount of time that the method will wait and report a successful shutdown is 10 seconds. Because a shutdown can take longer then 10 seconds under normal circumstance, the installer will interpret this "failed shutdown" as a "failed install" and then undone everything that it had already successfully done.

It isn't clear if waitForProcessToStop should wait longer and provide the ability to override the wait time or if the installer should wait an additional amount of time and then use the status check to verify the processes have been shutdown. Perhaps a combination of the two should be used:

 1. Use properties that can be passed via environment variables or system properties to determine how long ControlCommand waits before reporting success or failure. Such property values would only be used in circumstances that ControlCommand can't do the right thing on its own.

 2. Increase the default wait time to 30 seconds or perhaps a minute.

 3. The installer should treat a "failed shutdown" as a soft indication that something may have went wrong but should check the state on its own for an additional amount of time. Or perhaps if in interactive, ask the user if it should continue to wait.


[1]: https://github.com/rhq-project/rhq/blob/RHQ_4_9_0_JON320GA/modules/enterprise/server/server-control/src/main/java/org/rhq/server/control/ControlCommand.java#L437

Comment 1 Larry O'Leary 2014-04-21 23:21:50 UTC
After further analysis of the timeout scenario, I have discovered that even though the process is no longer running and kill -0 reports an exit code of 1, the waitForProcessToStop method continues to wait. This issue seems like something is wrong with the code logic in isUnixPidRunning. Unfortunately there is no debug/trace output option here so what the actual exit code is or exception is unknown.

Comment 2 John Mazzitelli 2014-05-05 20:56:21 UTC
for the record, "kill -0" on Solaris 10 works like more modern OSes. I was concerned "kill -0" (which is what the code uses to determine if a pid is alive) doesn't work on Solaris 10. But I did confirm that it works as expected. So that isn't the source of the problem.

Comment 3 John Mazzitelli 2014-05-05 21:44:05 UTC
Interestingly, it looks like code in master is no longer using waitForProcessToStop - I dont see anyplace where that is called now.

Now isStorageRunning just calls isUnixPidRunning.

I will try this on Solaris 10 and see if master exhibits any problems.

Comment 4 John Mazzitelli 2014-05-05 21:45:57 UTC
(In reply to John Mazzitelli from comment #3)
> Interestingly, it looks like code in master is no longer using
> waitForProcessToStop - I dont see anyplace where that is called now.
> 
> Now isStorageRunning just calls isUnixPidRunning.
> 
> I will try this on Solaris 10 and see if master exhibits any problems.

never mind - stupid eclipse for some reason failed to find references. I see this called via Stop and Upgrade classes.

Comment 5 John Mazzitelli 2014-05-05 21:54:23 UTC
We need to purge some code - for some reason

org.rhq.server.control.command.AbstractInstall.waitForProcessToStop(String)

overloads

org.rhq.server.control.ControlCommand.waitForProcessToStop(String)

but it looks to be identical code. No idea why its like this. I think someone must have refactored this and pulled up waitForProcessToStop but never removed the overloading method.

Comment 6 John Mazzitelli 2014-05-06 19:30:30 UTC
git commit to master: 3a7cec63a9f365c3679792c1832605c7bc95a03c

THIS IS NOT currently in any release branch.

This fix does a few things:

1) just added some additional logging to help track issues down in this area.
2) the timeout is now 30s, not 10s
3) there is now a sysprop that you can set in RHQ_CONTROL_ADDITIONAL_JAVA_OPTS via the -D java option to set the timeout to whatever you want (in case you need it longer). The sysprop name is: rhqctl.wait-for-process-to-stop-timeout-secs

Comment 7 John Mazzitelli 2014-05-07 14:57:26 UTC
this is still broken. will work on this again. here's the new debug logs that I added to show it is broken:

Stopping RHQ storage node...
RHQ storage node (pid=22820) is stopping...
10:54:32,022 DEBUG [org.rhq.server.control.command.Stop] Waiting for pid [22820] to stop. Try #1
10:54:32,027 DEBUG [org.rhq.server.control.command.Stop] unix pid [22820] is running
10:54:38,028 DEBUG [org.rhq.server.control.command.Stop] Waiting for pid [22820] to stop. Try #2
10:54:38,030 DEBUG [org.rhq.server.control.command.Stop] kill -0 for pid [22820] threw exception with exit value [1]
10:54:38,031 DEBUG [org.rhq.server.control.command.Stop] unix pid [22820] is NOT running
10:54:44,031 DEBUG [org.rhq.server.control.command.Stop] Waiting for pid [22820] to stop. Try #3
10:54:44,034 DEBUG [org.rhq.server.control.command.Stop] kill -0 for pid [22820] threw exception with exit value [1]
10:54:44,034 DEBUG [org.rhq.server.control.command.Stop] unix pid [22820] is NOT running
10:54:50,034 DEBUG [org.rhq.server.control.command.Stop] Waiting for pid [22820] to stop. Try #4
10:54:50,038 DEBUG [org.rhq.server.control.command.Stop] kill -0 for pid [22820] threw exception with exit value [1]
10:54:50,038 DEBUG [org.rhq.server.control.command.Stop] unix pid [22820] is NOT running
10:54:56,039 DEBUG [org.rhq.server.control.command.Stop] Waiting for pid [22820] to stop. Try #5
10:54:56,041 DEBUG [org.rhq.server.control.command.Stop] kill -0 for pid [22820] threw exception with exit value [1]
10:54:56,042 DEBUG [org.rhq.server.control.command.Stop] unix pid [22820] is NOT running
10:55:02,043 ERROR [org.rhq.server.control.RHQControl] Process [22820] did not finish yet. Terminate it manually and retry.: org.rhq.server.control.RHQControlException: Process [22820] did not finish yet. Terminate it manually and retry.
        at org.rhq.server.control.ControlCommand.waitForProcessToStop(ControlCommand.java:470) [rhq-server-control-4.12.0-SNAPSHOT.jar:4.12.0-SNAPSHOT]
        at org.rhq.server.control.command.Stop.stopStorage(Stop.java:146) [rhq-server-control-4.12.0-SNAPSHOT.jar:4.12.0-SNAPSHOT]
        at org.rhq.server.control.command.Stop.exec(Stop.java:86) [rhq-server-control-4.12.0-SNAPSHOT.jar:4.12.0-SNAPSHOT]

Comment 8 John Mazzitelli 2014-05-07 15:03:19 UTC
git commit to master that fixes the problem: 845cc38

Comment 9 Larry O'Leary 2014-05-07 16:02:28 UTC
Wow... completely overlooked that. Good find!

Comment 10 John Mazzitelli 2014-05-15 13:26:22 UTC
I commited the following to release/3.2.x to fix the problem:

72d629c145260ccf48516116336b8c5501a97251
92e3e062f26a9fce7b82753b6d64763937bca756
87c31fc24abea9b281f4852f43e276046535eb09

Comment 11 Simeon Pinder 2014-07-31 15:51:50 UTC
Moving to ON_QA as available to test with brew build of DR01: https://brewweb.devel.redhat.com//buildinfo?buildID=373993

Comment 12 Viet Nguyen 2014-08-21 15:14:06 UTC
Created attachment 929235 [details]
rhqctl.log

Comment 13 Viet Nguyen 2014-08-21 15:16:34 UTC
Set time out to 2 seconds and saw failure. 

RHQ_CONTROL_ADDITIONAL_JAVA_OPTS="-Drhqctl.wait-for-process-to-stop-timeout-secs=2"


Note You need to log in before you can comment on or make changes to this bug.