Bug 498497
| Summary: | Startd crash can leak VMs | ||
|---|---|---|---|
| Product: | Red Hat Enterprise MRG | Reporter: | Matthew Farrellee <matt> |
| Component: | condor | Assignee: | Robert Rati <rrati> |
| Status: | CLOSED ERRATA | QA Contact: | Luigi Toscano <ltoscano> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | ||
| Version: | 1.1.1 | CC: | iboverma, lans.carstensen, lbrindle, ltoscano, tao |
| Target Milestone: | 1.2 | ||
| Target Release: | --- | ||
| Hardware: | All | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | 7.3.2-0.4 | Doc Type: | Bug Fix |
| Doc Text: |
Grid bug fix
C: The condor_startd crashes while running a virtual machine universe job
C: When the startd is restarted, the VM job has been forgotten and is not restarted
F: The VM job's status changes to idle until the startd comes back up, at which time it resumes running
R: VM jobs are no longer forgotten after a startd crash.
If the condor_startd crashed while running a virtual machine universe job, the VM job was forgotten and not restarted when the startd came back up. This behaviour has been changed: the VM job's status now changes to idle until the startd comes back up, at which time it resumes running, so VM jobs are no longer forgotten after a startd crash.
|
Story Points: | --- |
| Clone Of: | Environment: | ||
| Last Closed: | 2009-12-03 09:17:33 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 527551 | ||
|
Description
Matthew Farrellee
2009-04-30 18:41:40 UTC
Addressed upstream by Jaime Frey

Release note added. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Startd crash no longer leaks VMs (498497)

Luigi found a regression here.

Jaime's fix writes a %s.recover file for each job that is run, including VMs. When the Startd starts it removes everything under the EXECUTE directory, but first checks for .recover files. The .recover file contains the job ad of the running job. If the job was a VM job, the startd will invoke the condor_vm-gahp in "KillMode". KillMode eventually invokes VirshType::killVMFast to terminate the VMs.

However, on 26 Aug 09 (155e821), the killVMFast function was commented out. As a result, the Startd's kill attempt is effectively ignored.

We know the startd is accessing the .recover files and invoking the vm-gahp because the VMGahpLog contains messages to that effect.

The VirshType has a killVM function that used to call killVMFast. The two functions can likely share some code.

Testing should include inducing a failure during the kill attempt, to verify that the startd does not treat the operation as complete, and does not remove the .recover file, while the VM is still running.

Regression fixed in: 7.4.1-0.5

The job is correctly recovered when condor_startd is killed and restarted by condor_master: the VM is closed, the job keeps the running status until the lease time has elapsed, then it changes its status to idle, and finally the VM is restarted and the status is put back to running. The execution of condor_rm on the job is correctly handled.

Tested on RHEL 5.4, i386 Xen, x86_64 Xen, x86_64 KVM.
condor-7.4.1-0.5
condor-vm-gahp-7.4.1-0.5

Changing the status to VERIFIED.

Release note updated. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1,8 @@
-Startd crash no longer leaks VMs (498497)
+Grid bug fix
+
+C: The condor_startd crashes while running a virtual machine universe job
+C: When the startd is restarted, the VM job has been forgotten and is not restarted
+F: The VM job's status changes to idle until the startd comes back up, at which time it resumes running
+R: VM jobs are no longer forgotten after a startd crash.
+
+If the condor_startd crashed while running a virtual machine universe job, the VM job was forgotten and not restarted when the startd came back up. This behaviour has been changed, so that the VM job's status changes to idle until the startd comes back up, at which time it resumes running, so that VM jobs are no longer forgotten after a startd crash.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-1633.html
|
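The recovery sequence described in the regression comment (check the EXECUTE directory for .recover files, kill any leftover VM, and discard the .recover file only if the kill succeeded) can be sketched as follows. This is a minimal Python illustration, not the startd's actual C++ code: `recover_execute_dir`, `parse_job_ad`, the `kill_vm` callback, the one-attribute-per-line ad format, and the use of `JobUniverse = 13` to identify VM universe jobs are all assumptions made for the example.

```python
import os
import shutil
from typing import Callable, Dict, List

# Assumption for this sketch: VM universe jobs carry JobUniverse = 13.
VM_UNIVERSE = "13"


def parse_job_ad(text: str) -> Dict[str, str]:
    """Parse 'Name = value' lines into a dict (a simplified stand-in for a job ad)."""
    ad: Dict[str, str] = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep:
            ad[name.strip()] = value.strip().strip('"')
    return ad


def recover_execute_dir(
    execute_dir: str, kill_vm: Callable[[Dict[str, str]], bool]
) -> List[str]:
    """Process leftover .recover files, then clear the rest of execute_dir.

    kill_vm is a hypothetical callback standing in for invoking the vm-gahp
    in kill mode; it returns True only if the VM is confirmed gone.
    Returns the names of .recover files kept because the kill failed.
    """
    kept: List[str] = []
    entries = sorted(os.listdir(execute_dir))

    # Step 1: look at .recover files before anything is deleted.
    for name in entries:
        if not name.endswith(".recover"):
            continue
        path = os.path.join(execute_dir, name)
        with open(path) as fh:
            ad = parse_job_ad(fh.read())
        if ad.get("JobUniverse") == VM_UNIVERSE and not kill_vm(ad):
            # The VM may still be running: keep the file so a later
            # startup can retry instead of leaking the VM.
            kept.append(name)
            continue
        os.remove(path)

    # Step 2: remove everything else under the execute directory.
    for name in entries:
        if name.endswith(".recover"):
            continue  # handled in step 1
        path = os.path.join(execute_dir, name)
        if os.path.isdir(path) and not os.path.islink(path):
            shutil.rmtree(path)
        else:
            os.remove(path)
    return kept
```

Passing a `kill_vm` that always returns `False` reproduces the failure case the testing note asks for: the VM job's .recover file should still be present afterwards. Whether the job's scratch directory should also be preserved in that case is not stated in the report; the sketch removes it.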