Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 498497

Summary: Startd crash can leak VMs
Product: Red Hat Enterprise MRG Reporter: Matthew Farrellee <matt>
Component: condor Assignee: Robert Rati <rrati>
Status: CLOSED ERRATA QA Contact: Luigi Toscano <ltoscano>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 1.1.1 CC: iboverma, lans.carstensen, lbrindle, ltoscano, tao
Target Milestone: 1.2   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: 7.3.2-0.4 Doc Type: Bug Fix
Doc Text:
Grid bug fix
C: The condor_startd crashes while running a virtual machine universe job
C: When the startd is restarted, the VM job has been forgotten and is not restarted
F: The VM job's status changes to idle until the startd comes back up, at which time it resumes running
R: VM jobs are no longer forgotten after a startd crash.
If the condor_startd crashed while running a virtual machine universe job, the VM job was forgotten and not restarted when the startd came back up. This behaviour has been changed, so that the VM job's status changes to idle until the startd comes back up, at which time it resumes running, so that VM jobs are no longer forgotten after a startd crash.
Story Points: ---
Clone Of: Environment:
Last Closed: 2009-12-03 09:17:33 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 527551    

Description Matthew Farrellee 2009-04-30 18:41:40 UTC
1. submit vm job
2. see vm running
3. kill -9 condor_startd
4. watch condor_master kill condor_startd's children: condor_starter and condor_vm-gahp
5. watch condor_master restart condor_startd with no memory of the vm
6. watch vm go back to idle and be re-run

The proposed fix is to have the startd leave a trail so that it can clean up the VM when it is restarted. This solution fails if the startd is never restarted, e.g. if the master is also killed.

If aiming for run-at-most-once semantics, job policy should prevent the job from re-running.

Comment 4 Matthew Farrellee 2009-07-16 20:47:18 UTC
Addressed upstream by Jaime Frey

Comment 6 Irina Boverman 2009-10-28 18:07:57 UTC
Release note added. If any revisions are required, please set the 
"requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

New Contents:
Startd crash no longer leaks VMs (498497)

Comment 7 Matthew Farrellee 2009-11-06 18:11:21 UTC
Luigi found a regression here.

Jaime's fix writes a %s.recover file for each job that is run, including VMs. When the startd starts, it removes everything under the EXECUTE directory, but first checks for .recover files. Each .recover file contains the job ad of the job that was running. If the job was a VM job, the startd invokes the condor_vm-gahp in "KillMode". KillMode eventually invokes VirshType::killVMFast to terminate the VM. However, on 26 Aug 09 (155e821), the killVMFast function was commented out. As a result, the startd's kill attempt is effectively ignored.

We know the startd is accessing the .recover files and invoking the vm-gahp because the VMGahpLog contains messages to that effect.

VirshType has a killVM function that used to call killVMFast. The two functions can likely share some code. Testing should include inducing a failure during the kill attempt, to verify that the startd cannot report the operation as complete and remove the .recover file while the VM is still running.
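
The recovery pass and the failure-injection test described in this comment can be sketched roughly as below. This is an illustrative Python model, not the actual C++ startd code. The location of the .recover files directly under the EXECUTE directory, the simplified "Attr = value" ad format, the JobUniverse value 13 used to recognise VM jobs, and the names parse_ad, recover, kill_vm and demo are all assumptions made for the example; kill_vm merely stands in for running the condor_vm-gahp in kill mode.

```python
import os
import shutil
import tempfile

VM_UNIVERSE = "13"  # assumed JobUniverse value for VM universe jobs


def parse_ad(text):
    """Parse a simplified 'Attr = value' job ad (hypothetical format)."""
    ad = {}
    for line in text.splitlines():
        if "=" in line:
            key, value = line.split("=", 1)
            ad[key.strip()] = value.strip()
    return ad


def recover(execute_dir, kill_vm):
    """Process .recover files, then clear everything else in execute_dir.

    kill_vm(ad) is a stand-in for running the vm-gahp in kill mode and must
    return True only when the VM is confirmed gone. Returns the names of
    the .recover files kept because the kill did not succeed.
    """
    kept = []
    entries = sorted(os.listdir(execute_dir))
    for name in entries:
        if not name.endswith(".recover"):
            continue
        path = os.path.join(execute_dir, name)
        with open(path) as f:
            ad = parse_ad(f.read())
        if ad.get("JobUniverse") == VM_UNIVERSE and not kill_vm(ad):
            kept.append(name)  # VM may still be running: keep the trail
        else:
            os.remove(path)
    # Only after the .recover pass is the rest of the directory removed.
    for name in entries:
        if name.endswith(".recover"):
            continue
        path = os.path.join(execute_dir, name)
        if os.path.isdir(path):
            shutil.rmtree(path)
        else:
            os.remove(path)
    return kept


def demo():
    """Induce a kill failure first, then let the kill succeed."""
    with tempfile.TemporaryDirectory() as d:
        for name, universe in [("a.recover", "13"), ("b.recover", "5")]:
            with open(os.path.join(d, name), "w") as f:
                f.write("JobUniverse = %s\nClusterId = 1\n" % universe)
        os.mkdir(os.path.join(d, "dir_1234"))  # leftover scratch directory
        failed = recover(d, kill_vm=lambda ad: False)  # induced failure
        after_fail = sorted(os.listdir(d))
        succeeded = recover(d, kill_vm=lambda ad: True)
        after_ok = sorted(os.listdir(d))
    return failed, after_fail, succeeded, after_ok
```

Calling demo() runs the pass twice over the same directory. The property of interest is that the .recover file of the VM job survives the pass in which the kill fails and is only removed once a later pass reports the VM gone.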

Comment 8 Robert Rati 2009-11-09 16:14:06 UTC
Regression fixed in:
7.4.1-0.5

Comment 9 Luigi Toscano 2009-11-10 18:20:54 UTC
The job is correctly recovered when condor_startd is killed and restarted by condor_master: the VM is shut down, the job keeps the running status until the lease time has elapsed, then its status changes to idle, and finally the VM is restarted and the status is set back to running.
The execution of condor_rm on the job is correctly handled.
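
The status sequence observed above can be modelled with a small function. This is only a toy model under stated assumptions: the function name job_status and its parameters are hypothetical, times are in seconds, and the lease is counted from the moment of the crash (in practice it would run from the last lease renewal).

```python
def job_status(t, crash_time, lease, rerun_time):
    """Status of the VM job as seen in the queue at time t (toy model)."""
    lease_expiry = crash_time + lease
    if rerun_time < lease_expiry:
        raise ValueError("job cannot be re-run before the lease has expired")
    if t < lease_expiry:
        return "running"  # before the crash, and while the lease still holds
    if t < rerun_time:
        return "idle"     # lease expired, job waiting to be run again
    return "running"      # VM restarted
```

For example, with a crash at t=100, a lease of 1200 and a re-run at t=1500, the model reports running at t=600, idle at t=1300 and running again at t=1500.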

Tested on RHEL 5.4, i386 Xen, x86_64 Xen, x86_64 KVM.

condor-7.4.1-0.5
condor-vm-gahp-7.4.1-0.5

Changing the status to VERIFIED.

Comment 10 Lana Brindley 2009-11-11 20:41:17 UTC
Release note updated. If any revisions are required, please set the 
"requires_release_notes"  flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1,8 @@
-Startd crash no longer leaks VMs (498497)
+Grid bug fix
+
+C: The condor_startd crashes while running a virtual machine universe job
+C: When the startd is restarted, the VM job has been forgotten and is not restarted
+F: The VM job's status changes to idle until the startd comes back up, at which time it resumes running
+R: VM jobs are no longer forgotten after a startd crash.
+
+If the condor_startd crashed while running a virtual machine universe job, the VM job was forgotten and not restarted when the startd came back up. This behaviour has been changed, so that the VM job's status changes to idle until the startd comes back up, at which time it resumes running, so that VM jobs are no longer forgotten after a startd crash.

Comment 13 errata-xmlrpc 2009-12-03 09:17:33 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-1633.html