Bug 474845 - EC2E AMIs shutting down and restarting
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: grid
Hardware: All
OS: Linux
Priority: high
Severity: high
Version: 1.1
Assigned To: Robert Rati
QA Contact: Jeff Needle
Reported: 2008-12-05 11:05 EST by Robert Rati
Modified: 2009-02-04 11:04 EST (History)

Doc Type: Bug Fix
Last Closed: 2009-02-04 11:04:58 EST


External Trackers
Tracker ID: Red Hat Product Errata RHBA-2009:0036
Priority: normal
Status: SHIPPED_LIVE
Summary: Red Hat Enterprise MRG Grid 1.1 Release
Last Updated: 2009-02-04 11:03:49 EST

Description Robert Rati 2008-12-05 11:05:43 EST
Description of problem:
It has been observed a few times that EC2E AMIs start and apparently run jobs, but the source jobs cycle from R->C->R.  The AMIs shut down and then restart (possibly because the job is being re-routed?).  When this occurs, the thrashing never seems to end: no jobs complete, and AMIs keep being restarted.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
Actual results:

Expected results:

Additional info:
Comment 1 Robert Rati 2008-12-10 13:37:06 EST
This is caused by a failure in the finalization process.  The exit hook executes, but condor then fails to find a file it expects in the spool directory, so the finalization process fails and condor sets JobStatus = 1 (idle).  This causes the job to be re-routed and a new AMI to be started.  The finalization process can't find the files it is looking for because the exit hook is being told the spool directory is in a different location than it actually is.  In the case I am seeing, a job is submitted from /home/testmonkey/ec2e and the routed job has:

SUBMIT_Iwd = "/home/testmonkey/ec2e"
Iwd = "/mnt/sharedfs/condor_ha_schedd/cluster3142.proc0.subproc0"

The spool directory the exit hook is being given is "/home/testmonkey/ec2e" (the SUBMIT_Iwd), not "/mnt/sharedfs/condor_ha_schedd/cluster3142.proc0.subproc0" (the Iwd).  The exit hook therefore extracts the tarball from S3 into "/home/testmonkey/ec2e" instead of the ...cluster3142... dir, so the finalization process can't find the files it needs to complete.
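
For illustration, a minimal Python sketch of the directory selection the exit hook should be doing.  This is not the shipped EC2E hook: it assumes the routed job's ClassAd arrives on stdin as flat "Attr = Value" lines and that the results tarball is named results.tar.gz, both purely for the example.

#!/usr/bin/env python
# Hypothetical sketch only, not the shipped EC2E exit hook.
# Assumes the routed job's ClassAd is fed to the hook on stdin as
# flat "Attr = Value" lines; a real hook would use a ClassAd parser.
import sys
import tarfile

def read_string_attrs(stream):
    # Naively collect attributes from a flat ClassAd dump.
    attrs = {}
    for line in stream:
        if "=" in line:
            key, _, value = line.partition("=")
            attrs[key.strip()] = value.strip().strip('"')
    return attrs

attrs = read_string_attrs(sys.stdin)

# The bug: the hook was handed SUBMIT_Iwd (the original submit dir,
# /home/testmonkey/ec2e) and extracted there.  Finalization looks in
# Iwd, the routed job's spool directory, so extract there instead.
spool_dir = attrs["Iwd"]
with tarfile.open("results.tar.gz") as results:  # tarball name assumed
    results.extractall(path=spool_dir)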
Comment 2 Robert Rati 2008-12-16 21:49:12 EST
The jobs are no longer forcibly spooled by the hooks, and the finalize hook now extracts and remaps the files back to the original job's iwd, as opposed to extracting the files from AWS into the routed job's spool directory.
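
As a rough sketch of the revised behavior (again hypothetical, with the same assumed stdin format and tarball name as the sketch in comment 1), the finalize hook unpacks the results fetched from S3 straight into the original job's iwd:

#!/usr/bin/env python
# Hypothetical sketch of the fixed finalize hook behavior.
import sys
import tarfile

def read_string_attrs(stream):
    # Same naive flat-ClassAd parsing as the sketch above.
    attrs = {}
    for line in stream:
        if "=" in line:
            key, _, value = line.partition("=")
            attrs[key.strip()] = value.strip().strip('"')
    return attrs

attrs = read_string_attrs(sys.stdin)

# Extract/remap the results directly back to the original job's iwd
# (e.g. /home/testmonkey/ec2e) instead of the routed job's spool.
original_iwd = attrs["SUBMIT_Iwd"]
with tarfile.open("results.tar.gz") as results:  # tarball name assumed
    results.extractall(path=original_iwd)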

Fixed in:
Comment 5 errata-xmlrpc 2009-02-04 11:04:58 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

