Description of problem:
Systems that share the same management interface can sometimes stop each other from running jobs if a blocking process goes AWOL.

Version-Release number of selected component (if applicable):
0.14.1

How reproducible:

Steps to Reproduce:
1. ?
2.
3.

Actual results:
Job was not running.

Expected results:
beaker-provision should have timed out the 'Waiting' command.

Additional info:
The configure_netboot entries etc. were still in the Queued state (and had been for a couple of days). beaker-provision et al. were running, so there was no problem with Beaker per se.
Beaker-provision already enforces a timeout on power commands and kills the script if the timeout is exceeded. It sounds like the problem here is that the power script spawned a child process (telnet) which wasn't cleaned up. Beaker-provision should make sure each power command is run in its own progress group and then kill the entire process group on timeout.
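For illustration only, here is a minimal Python 3 sketch of the approach described above. It is not beaker-provision's actual code; run_power_command is a hypothetical helper name. It assumes a POSIX system: the command is started in a new session (and therefore a new process group whose ID equals the child's PID), and on timeout the whole group is sent SIGKILL, so a stray telnet spawned by the power script dies along with the script itself.

```python
import os
import signal
import subprocess


def run_power_command(argv, timeout):
    """Hypothetical helper: run argv in its own process group and kill the
    whole group if it runs longer than `timeout` seconds.

    Returns the command's exit status, or None if it timed out.
    """
    # start_new_session=True makes the child call setsid(), so the child
    # and everything it spawns (e.g. telnet) share a new process group
    # whose ID is the child's PID.
    proc = subprocess.Popen(argv, start_new_session=True)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Signal the entire process group, not just the direct child, so
        # grandchildren are not left running.
        os.killpg(proc.pid, signal.SIGKILL)
        proc.wait()  # reap the direct child
        return None
```

Killing only proc.pid (as Popen.kill() does) would leave the telnet child orphaned and still holding the management interface, which matches the symptom in this bug.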
(In reply to Dan Callaghan from comment #4)
> Beaker-provision should make sure each power command is run in its own
> progress group

process group
This is the kind of provisioning reliability fix I'd like us to focus on in 0.16 :)
The other case beaker-provision should handle better is when the power script crashes or is killed, and leaves behind child processes. So really it should kill the process group in all cases, not just when timeouts occur.
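A minimal Python 3 sketch of "kill the process group in all cases", again using a hypothetical run_power_command helper rather than Beaker's real implementation and assuming a POSIX system. The group is killed in a finally clause, so leftover children are cleaned up whether the script exits normally, crashes, is killed, or times out.

```python
import os
import signal
import subprocess


def run_power_command(argv, timeout):
    """Hypothetical helper: run argv in its own process group and always
    kill that group afterwards.

    Returns the exit status (negative signal number if the script was
    killed), or None if it timed out.
    """
    proc = subprocess.Popen(argv, start_new_session=True)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        return None
    finally:
        # The direct child may already have exited, but any children it
        # left behind are still members of its process group.
        try:
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # nothing left in the group
        proc.wait()  # reap the direct child if it was still running
```

One caveat with this sketch: once the group is empty its ID may eventually be reused, so the killpg call should happen promptly after the command finishes, as it does here.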
http://gerrit.beaker-project.org/#/c/2322/
beaker 0.15.1 has been released.
This change has been nominated to be backported to the 0.14 branch, to be released as part of the next maintenance release, 0.14.2.
Adjusting the target milestone to make the changes backported to 0.14.2 easier to identify. 0.15.0 has enough significant regressions that it shouldn't be used, so with this change 0.15.1 can effectively be identified as the union of that tag and the 0.14.2 target milestone.
Verified as per the original instructions in comment #13.
Closing as addressed in Beaker 0.14.2.