474845 – EC2E AMIs shutting down and restarting

Summary: EC2E AMIs shutting down and restarting

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise MRG
Classification:	Red Hat
Component:	grid
Sub Component:
Version:	1.0
Hardware:	All
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	1.1
Target Release:	---
Assignee:	Robert Rati
QA Contact:	Jeff Needle
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2008-12-05 16:05 UTC by Robert Rati
Modified:	2009-02-04 16:04 UTC
CC List:	1 user
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2009-02-04 16:04:58 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2009:0036	0	normal	SHIPPED_LIVE	Red Hat Enterprise MRG Grid 1.1 Release	2009-02-04 16:03:49 UTC

Description Robert Rati 2008-12-05 16:05:43 UTC

Description of problem:
It's been observed a few times that EC2E AMIs start and apparently run jobs, but the source jobs move from R->C->R.  The AMIs shut down and then restart (possibly because the job is being re-routed?).  When this occurs, the thrashing never seems to end: no jobs complete and AMIs keep being restarted.


Comment 1 Robert Rati 2008-12-10 18:37:06 UTC

This is caused by a failure in the finalization process.  The exit hook is executed, but condor then fails to find a file it expects in the spool directory, so finalization fails and condor sets JobStatus = 1 (idle).  This causes the job to be re-routed and a new AMI to be started.  The finalization process can't find the files it is looking for because the exit hook is told that the spool directory is in a different location than it actually is.  In the case I am seeing, a job is submitted from /home/testmonkey/ec2e and the routed job has:

SUBMIT_Iwd = "/home/testmonkey/ec2e"
Iwd = "/mnt/sharedfs/condor_ha_schedd/cluster3142.proc0.subproc0"

The spool directory that the exit hook is given is "/home/testmonkey/ec2e", not "/mnt/sharedfs/condor_ha_schedd/cluster3142.proc0.subproc0".  So the exit hook extracts the tarball from S3 into "/home/testmonkey/ec2e" instead of the ...cluster3142... dir, and the finalization process can't find the files it needs to complete.
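
To make the mismatch concrete, here is a minimal Python sketch that reads the two attributes quoted above and shows which directory finalization expects versus which one the exit hook was handed. The helper names (parse_attrs, finalization_dir) are hypothetical and are not taken from the hooks package, and the parser only handles simple Name = "value" lines, not full ClassAd syntax.

```python
def parse_attrs(text):
    """Parse simple `Name = "value"` lines into a dict.

    Illustrative only: this is not a full ClassAd parser.
    """
    attrs = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if not sep:
            continue  # skip blank lines and anything without '='
        attrs[name.strip()] = value.strip().strip('"')
    return attrs


def finalization_dir(attrs):
    """Directory where finalization looks for output: the routed job's Iwd."""
    return attrs["Iwd"]


# Attributes of the routed job, as quoted in this comment.
ROUTED_JOB = '''
SUBMIT_Iwd = "/home/testmonkey/ec2e"
Iwd = "/mnt/sharedfs/condor_ha_schedd/cluster3142.proc0.subproc0"
'''

attrs = parse_attrs(ROUTED_JOB)
expected = finalization_dir(attrs)   # where condor looks for the files
given = attrs["SUBMIT_Iwd"]          # what the exit hook was told

if given != expected:
    print("exit hook extracts into", given, "but finalization expects", expected)
```

With the values from this report the two paths differ, which is the condition that leaves finalization without its files.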

Comment 2 Robert Rati 2008-12-17 02:49:12 UTC

The jobs are no longer forcibly spooled by the hooks, and the finalize hook does the extracting/remapping of files back to the original job's iwd, as opposed to extracting the files from AWS into the routed job's spool directory.
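
The remapping step described above could look roughly like the following Python sketch. It assumes the job output has already been downloaded from AWS to a local tarball; the function name restore_output and its parameters are hypothetical and do not come from condor-ec2-enhanced-hooks. It unpacks regular files directly into the original job's iwd and refuses entries that would land outside that directory.

```python
import os
import shutil
import tarfile


def restore_output(tarball_path, original_iwd):
    """Unpack regular files from tarball_path into original_iwd.

    Returns the list of paths written. Hypothetical helper, not the
    actual finalize hook.
    """
    written = []
    root = os.path.realpath(original_iwd)
    with tarfile.open(tarball_path) as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue  # ignore directories, links, and device entries
            dest = os.path.realpath(os.path.join(root, member.name))
            # Reject entries that would escape the original iwd.
            if os.path.commonpath([root, dest]) != root:
                raise ValueError("unsafe path in tarball: " + member.name)
            os.makedirs(os.path.dirname(dest), exist_ok=True)
            with tar.extractfile(member) as src, open(dest, "wb") as out:
                shutil.copyfileobj(src, out)
            written.append(dest)
    return written
```

Writing straight into the original job's iwd means the result no longer depends on which spool directory the routed job was given.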

Fixed in:
condor-ec2-enhanced-1.0-7
condor-ec2-enhanced-hooks-1.0-8

Comment 5 errata-xmlrpc 2009-02-04 16:04:58 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-0036.html
