Bug 1089757
Summary: | Install fails if shutdown process takes more then 10 seconds | ||||||
---|---|---|---|---|---|---|---|
Product: | [JBoss] JBoss Operations Network | Reporter: | Larry O'Leary <loleary> | ||||
Component: | Installer | Assignee: | John Mazzitelli <mazz> | ||||
Status: | CLOSED CURRENTRELEASE | QA Contact: | Mike Foley <mfoley> | ||||
Severity: | high | Docs Contact: | |||||
Priority: | unspecified | ||||||
Version: | JON 3.2 | CC: | mazz, vnguyen | ||||
Target Milestone: | DR01 | ||||||
Target Release: | JON 3.3.0 | ||||||
Hardware: | Unspecified | ||||||
OS: | Unspecified | ||||||
Whiteboard: | |||||||
Fixed In Version: | Doc Type: | Bug Fix | |||||
Doc Text: |
During Installer execution, a shutdown is performed on the newly installed component(s).
The shutdown delay was hard-coded to ten seconds. While ten seconds was adequate for development environments with no load and the database running locally, a production environment may take anywhere from a few seconds to upwards of a minute to shut down. If the target component took longer than 10 seconds to shutdown, the entire install was rolled back and aborted.
The fix implements a number of improvements to the Installer shutdown process. The timeout has been increased to 30 seconds, from 10 seconds. Additionally, a system property can be set to customize the timeout value through RHQ_CONTROL_ADDITIONAL_JAVA_OPTS using the -D java option. The sysprop name is "rhqctl.wait-for-process-to-stop-timeout-secs". Additional logging is also available to assist with tracking down issues in this area.
|
Story Points: | --- | ||||
Clone Of: | Environment: | ||||||
Last Closed: | 2014-12-11 14:04:11 UTC | Type: | Bug | ||||
Regression: | --- | Mount Type: | --- | ||||
Documentation: | --- | CRM: | |||||
Verified Versions: | Category: | --- | |||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
Cloudforms Team: | --- | Target Upstream Version: | |||||
Embargoed: | |||||||
Attachments: |
|
Description
Larry O'Leary
2014-04-21 21:53:45 UTC
After further analysis of the timeout scenario, I have discovered that even though the process is no longer running and kill -0 reports an exit code of 1, the waitForProcessToStop method continues to wait. This issue seems like something is wrong with the code logic in isUnixPidRunning. Unfortunately there is no debug/trace output option here so what the actual exit code is or exception is unknown. for the record, "kill -0" on Solaris 10 works like more modern OSes. I was concerned "kill -0" (which is what the code uses to determine if a pid is alive) doesn't work on Solaris 10. But I did confirm that it works as expected. So that isn't the source of the problem. Interestingly, it looks like code in master is no longer using waitForProcessToStop - I dont see anyplace where that is called now. Now isStorageRunning just calls isUnixPidRunning. I will try this on Solaris 10 and see if master exhibits any problems. (In reply to John Mazzitelli from comment #3) > Interestingly, it looks like code in master is no longer using > waitForProcessToStop - I dont see anyplace where that is called now. > > Now isStorageRunning just calls isUnixPidRunning. > > I will try this on Solaris 10 and see if master exhibits any problems. never mind - stupid eclipse for some reason failed to find references. I see this called via Stop and Upgrade classes. We need to purge some code - for some reason org.rhq.server.control.command.AbstractInstall.waitForProcessToStop(String) overloads org.rhq.server.control.ControlCommand.waitForProcessToStop(String) but it looks to be identical code. No idea why its like this. I think someone must have refactored this and pulled up waitForProcessToStop but never removed the overloading method. git commit to master: 3a7cec63a9f365c3679792c1832605c7bc95a03c THIS IS NOT currently in any release branch. This fix does a few things: 1) just added some additional logging to help track issues down in this area. 2) the timeout is now 30s, not 10s 3) there is now a sysprop that you can set in RHQ_CONTROL_ADDITIONAL_JAVA_OPTS via the -D java option to set the timeout to whatever you want (in case you need it longer). The sysprop name is: rhqctl.wait-for-process-to-stop-timeout-secs this is still broken. will work on this again. here's the new debug logs that I added to show it is broken: Stopping RHQ storage node... RHQ storage node (pid=22820) is stopping... 10:54:32,022 DEBUG [org.rhq.server.control.command.Stop] Waiting for pid [22820] to stop. Try #1 10:54:32,027 DEBUG [org.rhq.server.control.command.Stop] unix pid [22820] is running 10:54:38,028 DEBUG [org.rhq.server.control.command.Stop] Waiting for pid [22820] to stop. Try #2 10:54:38,030 DEBUG [org.rhq.server.control.command.Stop] kill -0 for pid [22820] threw exception with exit value [1] 10:54:38,031 DEBUG [org.rhq.server.control.command.Stop] unix pid [22820] is NOT running 10:54:44,031 DEBUG [org.rhq.server.control.command.Stop] Waiting for pid [22820] to stop. Try #3 10:54:44,034 DEBUG [org.rhq.server.control.command.Stop] kill -0 for pid [22820] threw exception with exit value [1] 10:54:44,034 DEBUG [org.rhq.server.control.command.Stop] unix pid [22820] is NOT running 10:54:50,034 DEBUG [org.rhq.server.control.command.Stop] Waiting for pid [22820] to stop. Try #4 10:54:50,038 DEBUG [org.rhq.server.control.command.Stop] kill -0 for pid [22820] threw exception with exit value [1] 10:54:50,038 DEBUG [org.rhq.server.control.command.Stop] unix pid [22820] is NOT running 10:54:56,039 DEBUG [org.rhq.server.control.command.Stop] Waiting for pid [22820] to stop. Try #5 10:54:56,041 DEBUG [org.rhq.server.control.command.Stop] kill -0 for pid [22820] threw exception with exit value [1] 10:54:56,042 DEBUG [org.rhq.server.control.command.Stop] unix pid [22820] is NOT running 10:55:02,043 ERROR [org.rhq.server.control.RHQControl] Process [22820] did not finish yet. Terminate it manually and retry.: org.rhq.server.control.RHQControlException: Process [22820] did not finish yet. Terminate it manually and retry. at org.rhq.server.control.ControlCommand.waitForProcessToStop(ControlCommand.java:470) [rhq-server-control-4.12.0-SNAPSHOT.jar:4.12.0-SNAPSHOT] at org.rhq.server.control.command.Stop.stopStorage(Stop.java:146) [rhq-server-control-4.12.0-SNAPSHOT.jar:4.12.0-SNAPSHOT] at org.rhq.server.control.command.Stop.exec(Stop.java:86) [rhq-server-control-4.12.0-SNAPSHOT.jar:4.12.0-SNAPSHOT] git commit to master that fixes the problem: 845cc38 Wow... completely overlooked that. Good find! I commited the following to release/3.2.x to fix the problem: 72d629c145260ccf48516116336b8c5501a97251 92e3e062f26a9fce7b82753b6d64763937bca756 87c31fc24abea9b281f4852f43e276046535eb09 Moving to ON_QA as available to test with brew build of DR01: https://brewweb.devel.redhat.com//buildinfo?buildID=373993 Created attachment 929235 [details]
rhqctl.log
Set time out to 2 seconds and saw failure. RHQ_CONTROL_ADDITIONAL_JAVA_OPTS="-Drhqctl.wait-for-process-to-stop-timeout-secs=2" |