Description of problem: During the execution of the installer, a shutdown is performed on the newly installed component(s). If the target component takes longer then 10 seconds to shutdown, the entire install is rolled back and aborted. The shutdown delay seems to be hard-coded and therefore can not be overridden by the user. Although 10 seconds may be optimal on a development machine with no load and the database running locally, a production environment could see a shutdown taking anywhere from a few seconds to upwards of a minute. Version-Release number of selected component (if applicable): 3.2.0.GA How reproducible: Always Steps to Reproduce: 1. From Solaris 10 machine, perform install: rhqctl install Actual results: Install fails and the following is printed to the screen: 16:36:51,060 INFO [org.rhq.server.control.command.Install] Finished configuring the agent Trying to stop the RHQ Server... RHQ Server (pid=3287) is stopping... RHQ Server has stopped. Stopping RHQ storage node... RHQ storage node (pid=3277) is stopping... 16:37:03,231 ERROR [org.rhq.server.control.RHQControl] Process [3277] did not finish yet. Terminate it manually and retry. 16:37:03,232 WARN [org.rhq.server.control.command.Install] UNDO: Removing agent install directory 16:37:03,295 WARN [org.rhq.server.control.command.Install] UNDO: Removing server-installed marker file and management user 16:37:03,301 WARN [org.rhq.server.control.command.Install] UNDO: Stopping component: --server 16:37:03,302 WARN [org.rhq.server.control.command.Install] UNDO: Stopping component: --storage 16:37:03,302 WARN [org.rhq.server.control.command.Install] UNDO: Removing storage node data and install directories 16:37:03,812 WARN [org.rhq.server.control.command.Install] UNDO: Reverting server properties file Expected results: Install should not fail and installer should wait longer for rhq-storage node to finish its shutdown process. Additional info: The offending code is in ControlCommand.waitForProcessToStop[1]: int tries = 5; while (tries > 0) { log.debug("."); if (!isUnixPidRunning(pid)) { break; } Thread.sleep(2*1000L); tries--; } This code is executed during the shutdown request and because a hard-coded retry count and sleep are being used, the maximum amount of time that the method will wait and report a successful shutdown is 10 seconds. Because a shutdown can take longer then 10 seconds under normal circumstance, the installer will interpret this "failed shutdown" as a "failed install" and then undone everything that it had already successfully done. It isn't clear if waitForProcessToStop should wait longer and provide the ability to override the wait time or if the installer should wait an additional amount of time and then use the status check to verify the processes have been shutdown. Perhaps a combination of the two should be used: 1. Use properties that can be passed via environment variables or system properties to determine how long ControlCommand waits before reporting success or failure. Such property values would only be used in circumstances that ControlCommand can't do the right thing on its own. 2. Increase the default wait time to 30 seconds or perhaps a minute. 3. The installer should treat a "failed shutdown" as a soft indication that something may have went wrong but should check the state on its own for an additional amount of time. Or perhaps if in interactive, ask the user if it should continue to wait. [1]: https://github.com/rhq-project/rhq/blob/RHQ_4_9_0_JON320GA/modules/enterprise/server/server-control/src/main/java/org/rhq/server/control/ControlCommand.java#L437
After further analysis of the timeout scenario, I have discovered that even though the process is no longer running and kill -0 reports an exit code of 1, the waitForProcessToStop method continues to wait. This issue seems like something is wrong with the code logic in isUnixPidRunning. Unfortunately there is no debug/trace output option here so what the actual exit code is or exception is unknown.
for the record, "kill -0" on Solaris 10 works like more modern OSes. I was concerned "kill -0" (which is what the code uses to determine if a pid is alive) doesn't work on Solaris 10. But I did confirm that it works as expected. So that isn't the source of the problem.
Interestingly, it looks like code in master is no longer using waitForProcessToStop - I dont see anyplace where that is called now. Now isStorageRunning just calls isUnixPidRunning. I will try this on Solaris 10 and see if master exhibits any problems.
(In reply to John Mazzitelli from comment #3) > Interestingly, it looks like code in master is no longer using > waitForProcessToStop - I dont see anyplace where that is called now. > > Now isStorageRunning just calls isUnixPidRunning. > > I will try this on Solaris 10 and see if master exhibits any problems. never mind - stupid eclipse for some reason failed to find references. I see this called via Stop and Upgrade classes.
We need to purge some code - for some reason org.rhq.server.control.command.AbstractInstall.waitForProcessToStop(String) overloads org.rhq.server.control.ControlCommand.waitForProcessToStop(String) but it looks to be identical code. No idea why its like this. I think someone must have refactored this and pulled up waitForProcessToStop but never removed the overloading method.
git commit to master: 3a7cec63a9f365c3679792c1832605c7bc95a03c THIS IS NOT currently in any release branch. This fix does a few things: 1) just added some additional logging to help track issues down in this area. 2) the timeout is now 30s, not 10s 3) there is now a sysprop that you can set in RHQ_CONTROL_ADDITIONAL_JAVA_OPTS via the -D java option to set the timeout to whatever you want (in case you need it longer). The sysprop name is: rhqctl.wait-for-process-to-stop-timeout-secs
this is still broken. will work on this again. here's the new debug logs that I added to show it is broken: Stopping RHQ storage node... RHQ storage node (pid=22820) is stopping... 10:54:32,022 DEBUG [org.rhq.server.control.command.Stop] Waiting for pid [22820] to stop. Try #1 10:54:32,027 DEBUG [org.rhq.server.control.command.Stop] unix pid [22820] is running 10:54:38,028 DEBUG [org.rhq.server.control.command.Stop] Waiting for pid [22820] to stop. Try #2 10:54:38,030 DEBUG [org.rhq.server.control.command.Stop] kill -0 for pid [22820] threw exception with exit value [1] 10:54:38,031 DEBUG [org.rhq.server.control.command.Stop] unix pid [22820] is NOT running 10:54:44,031 DEBUG [org.rhq.server.control.command.Stop] Waiting for pid [22820] to stop. Try #3 10:54:44,034 DEBUG [org.rhq.server.control.command.Stop] kill -0 for pid [22820] threw exception with exit value [1] 10:54:44,034 DEBUG [org.rhq.server.control.command.Stop] unix pid [22820] is NOT running 10:54:50,034 DEBUG [org.rhq.server.control.command.Stop] Waiting for pid [22820] to stop. Try #4 10:54:50,038 DEBUG [org.rhq.server.control.command.Stop] kill -0 for pid [22820] threw exception with exit value [1] 10:54:50,038 DEBUG [org.rhq.server.control.command.Stop] unix pid [22820] is NOT running 10:54:56,039 DEBUG [org.rhq.server.control.command.Stop] Waiting for pid [22820] to stop. Try #5 10:54:56,041 DEBUG [org.rhq.server.control.command.Stop] kill -0 for pid [22820] threw exception with exit value [1] 10:54:56,042 DEBUG [org.rhq.server.control.command.Stop] unix pid [22820] is NOT running 10:55:02,043 ERROR [org.rhq.server.control.RHQControl] Process [22820] did not finish yet. Terminate it manually and retry.: org.rhq.server.control.RHQControlException: Process [22820] did not finish yet. Terminate it manually and retry. at org.rhq.server.control.ControlCommand.waitForProcessToStop(ControlCommand.java:470) [rhq-server-control-4.12.0-SNAPSHOT.jar:4.12.0-SNAPSHOT] at org.rhq.server.control.command.Stop.stopStorage(Stop.java:146) [rhq-server-control-4.12.0-SNAPSHOT.jar:4.12.0-SNAPSHOT] at org.rhq.server.control.command.Stop.exec(Stop.java:86) [rhq-server-control-4.12.0-SNAPSHOT.jar:4.12.0-SNAPSHOT]
git commit to master that fixes the problem: 845cc38
Wow... completely overlooked that. Good find!
I commited the following to release/3.2.x to fix the problem: 72d629c145260ccf48516116336b8c5501a97251 92e3e062f26a9fce7b82753b6d64763937bca756 87c31fc24abea9b281f4852f43e276046535eb09
Moving to ON_QA as available to test with brew build of DR01: https://brewweb.devel.redhat.com//buildinfo?buildID=373993
Created attachment 929235 [details] rhqctl.log
Set time out to 2 seconds and saw failure. RHQ_CONTROL_ADDITIONAL_JAVA_OPTS="-Drhqctl.wait-for-process-to-stop-timeout-secs=2"