Bug 489006

Summary: Cannot distinguish between completion and other termination of AMQP submitted work
Product: Red Hat Enterprise MRG
Component: grid
Version: 1.1
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Target Milestone: 1.1.1
Target Release: ---
Hardware: All
OS: Linux
Reporter: Matthew Farrellee <matt>
Assignee: Robert Rati <rrati>
QA Contact: Jan Sarenik <jsarenik>
CC: jsarenik, mkudlej
Doc Type: Bug Fix
Last Closed: 2009-04-21 16:17:27 UTC
Bug Depends On: 459615

Description Matthew Farrellee 2009-03-06 17:37:42 UTC
1) AMQP work message submitted
2) condor picks up work
3) condor restarted
4) condor picks up work (again)
5) work completes

In steps 3 and 5 a message with JobState="Exited" is sent to the submitter. There is no way to tell the two situations apart, at least in condor-low-latency-1.0-9.el5.

In condor-low-latency-1.0-10.el5, the JobStatus is also set, so the message in step (3) is JobState="Exited" and JobStatus=1 whereas in (5) it is JobState="Exited" and JobStatus=4.
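
A minimal sketch of how a submitter might tell the two cases apart once JobStatus is present. The message is assumed to be already decoded into a Python dict of attributes; the function name and the returned labels are hypothetical and not part of condor-low-latency. The numeric codes are Condor's usual JobStatus values (1 = Idle, 4 = Completed).

```python
# Condor JobStatus codes (1 = Idle, 4 = Completed are the two seen in this bug).
IDLE, RUNNING, REMOVED, COMPLETED, HELD = 1, 2, 3, 4, 5


def classify_exit(msg):
    """Classify a reply message given as a dict of attributes (assumed shape)."""
    if msg.get("JobState") != "Exited":
        return "not-exited"
    status = msg.get("JobStatus")
    if status is None:
        # Older packages (e.g. 1.0-9) send no JobStatus, so the cases are ambiguous.
        return "unknown"
    status = int(status)
    if status == COMPLETED:
        return "completed"   # step (5): the work actually finished
    if status == IDLE:
        return "requeued"    # step (3): condor went away, job is idle again
    return "other"


print(classify_exit({"JobState": "Exited", "JobStatus": 1}))
print(classify_exit({"JobState": "Exited", "JobStatus": 4}))
```

A submitter would treat only the "completed" result as final and keep waiting on "requeued".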

Comment 1 Robert Rati 2009-03-06 20:16:48 UTC
Different symptom of BZ459615

Comment 3 Jan Sarenik 2009-03-19 14:20:38 UTC
Should I just verify that condor-low-latency-1.0-10 and higher
return the JobStatus as mentioned above?

Comment 4 Matthew Farrellee 2009-03-19 17:33:27 UTC
And the JobState. You may want to test a situation where Condor runs the job without interruption, and one where the run is interrupted by a restart and maybe a kill -9, including a kill of the carod (service condor-low-latency) process.

Comment 5 Jan Sarenik 2009-04-01 13:54:43 UTC
Jobs submitted via AMQP do not get run. Condor's StartLog says:

Slot requirements not satisfied.
Job requirements not satisfied.

When I put the dump into a job.submit file, change Cmd to Executable
and '5' to 'vanilla', and add Queue at the end, the job runs flawlessly
with condor_submit (with just a few lines of WARNINGs, because I include
the full dump, including parameters that are probably unknown to
condor_submit).
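
The conversion described above can be sketched roughly as follows. This assumes the dump is plain "Name = value" lines and that the '5' in question is the value of JobUniverse; both are guesses about the dump format, and the function name is hypothetical.

```python
def dump_to_submit(dump_text):
    """Turn a job attribute dump into submit-file text (assumed line format)."""
    out = []
    for line in dump_text.splitlines():
        name, sep, value = line.partition("=")
        if not sep:
            continue  # skip blank lines and anything that is not an assignment
        name, value = name.strip(), value.strip()
        if name == "Cmd":
            name = "Executable"          # condor_submit expects Executable
        elif name == "JobUniverse" and value == "5":
            value = "vanilla"            # assumption: this is where the '5' lives
        out.append(f"{name} = {value}")
    out.append("Queue")                  # submit files end with a Queue statement
    return "\n".join(out) + "\n"


print(dump_to_submit("Cmd = /bin/echo\nJobUniverse = 5\nArgs = hello\n"))
```

Attributes that condor_submit does not recognise are passed through unchanged, which matches the WARNINGs mentioned above.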

This Condor installation runs all the vanilla jobs via condor_submit with
no problems. Low-latency is configured by adding these lines to
/etc/condor/condor_config:

--------------------------------------------------------------------------
LOW_LATENCY_HOOK_FETCH_WORK = $(LIBEXEC)/hooks/hook_fetch_work.py
LOW_LATENCY_HOOK_REPLY_FETCH = $(LIBEXEC)/hooks/hook_reply_fetch.py

# Starter hooks
LOW_LATENCY_JOB_HOOK_PREPARE_JOB = $(LIBEXEC)/hooks/hook_prepare_job.py
LOW_LATENCY_JOB_HOOK_UPDATE_JOB_INFO = $(LIBEXEC)/hooks/hook_update_job_status.py
LOW_LATENCY_JOB_HOOK_JOB_EXIT = $(LIBEXEC)/hooks/hook_job_exit.py

STARTD_JOB_HOOK_KEYWORD = LOW_LATENCY

FetchWorkDelay = 10 * (Activity == "Idle")
STARTER_UPDATE_INTERVAL = 30
--------------------------------------------------------------------------

condor-7.2.2-0.9.el5
condor-job-hooks-1.0-5.el5
condor-job-hooks-common-1.0-5.el5
condor-low-latency-1.0-12.el5

I was mainly using the cmd_args.py test from the low-latency branch of
the mrg-grid.git repo. Can you enlighten me, please?

Comment 6 Jan Sarenik 2009-04-02 13:59:28 UTC
Works as expected.

Comment 8 errata-xmlrpc 2009-04-21 16:17:27 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-0434.html