Bug 451797
Summary: | failure in one ec2 job can block other jobs | ||
---|---|---|---|
Product: | Red Hat Enterprise MRG | Reporter: | Matthew Farrellee <matt> |
Component: | grid | Assignee: | Matthew Farrellee <matt> |
Status: | CLOSED WONTFIX | QA Contact: | Kim van der Riet <kim.vdriet> |
Severity: | low | Docs Contact: | |
Priority: | low | ||
Version: | 1.0 | CC: | jfrey |
Target Milestone: | 1.1 | ||
Target Release: | --- | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2008-11-04 15:00:51 UTC | Type: | --- |
Description
Matthew Farrellee
2008-06-17 13:40:55 UTC
The EC2 resource is marked as down only if an AMAZON_VM_STATUS_ALL gahp command from the AmazonResource object in the gridmanager fails. There's a separate AmazonResource object for each set of key files used by jobs, so if a command fails, it should only affect other jobs using the same set of key files.

The issue may be the credentials used to service the AMAZON_VM_STATUS_ALL command. If they are somehow invalid, the resource will be marked as down unnecessarily. I unfortunately do not remember the specifics of the issue. A test containing valid and invalid credentials submitted by a single user, or even a mix of users, may flush it out though. The invalid credentials should be tested as the first credentials available.

Verified that this happens in the following sequence: a job is submitted with a bad cert, the grid resource is marked as down, the cert is then changed (the cert file's contents, not its name), and a new job is submitted using the same cert filename. This is because the two jobs are considered to be using the same resource: same cert filename plus other similarities. If the second job uses a different file name for the cert, it will not be held up by the failure from the first job.

This could also be addressed by fingerprinting the file in some way other than just its name. However, since this is an obscure issue that only occurs when modifying files that arguably shouldn't be modified during the execution of a job, I'm closing this as WONTFIX. It's also possible to argue that Condor should make its own copies of the files to avoid this issue more generally. This issue could be reopened if it becomes important enough.
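
For reference, the fingerprinting idea could look roughly like the sketch below. This is not the gridmanager's code (which is C++), and every function name here is hypothetical. It only illustrates the difference between identifying a resource by the names of its credential files and identifying it by a hash of their contents, under the assumption that a resource is keyed on a (cert, key) pair.

```python
import hashlib
from pathlib import Path


def file_fingerprint(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def resource_key_by_name(cert_path, key_path):
    # Name-based identity (hypothetical stand-in for the behaviour described
    # in this bug): editing a file in place leaves the key unchanged, so a
    # new job is grouped with the resource already marked as down.
    return (str(cert_path), str(key_path))


def resource_key_by_content(cert_path, key_path):
    # Content-based identity: if either file's contents change, the key
    # changes too, so the new job would map to a different resource.
    return (file_fingerprint(cert_path), file_fingerprint(key_path))
```

With the content-based key, replacing a bad cert under the same filename would yield a new resource rather than reusing the one marked as down. The trade-off is that the files have to be re-read and re-hashed at submit time, and it still does not protect a running job from having its files modified underneath it, which is the case that copying the files would cover.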