Bug 489006 - Cannot distinguish between completion and other termination of AMQP submitted work
Summary: Cannot distinguish between completion and other termination of AMQP submitted...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: grid
Version: 1.1
Hardware: All
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: 1.1.1
Target Release: ---
Assignee: Robert Rati
QA Contact: Jan Sarenik
URL:
Whiteboard:
Depends On: 459615
Blocks:
Reported: 2009-03-06 17:37 UTC by Matthew Farrellee
Modified: 2009-04-21 16:17 UTC (History)
CC List: 2 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-04-21 16:17:27 UTC
Target Upstream Version:
Embargoed:


Links
System ID: Red Hat Product Errata RHEA-2009:0434
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Red Hat Enterprise MRG Messaging and Grid Version 1.1.1
Last Updated: 2009-04-21 16:15:50 UTC

Description Matthew Farrellee 2009-03-06 17:37:42 UTC
1) AMQP work message submitted
2) condor picks up work
3) condor restarted
4) condor picks up work (again)
5) work completes

In steps 3 and 5 a message with JobState="Exited" is sent to the submitter. There is no way to tell the two situations apart, at least in condor-low-latency-1.0-9.el5.

In condor-low-latency-1.0-10.el5, the JobStatus is also set, so the message in step (3) is JobState="Exited" and JobStatus=1 whereas in (5) it is JobState="Exited" and JobStatus=4.
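A minimal sketch of how a submitter might apply the distinction described above. It assumes the status message has already been decoded into a Python dict keyed by attribute name; the function name `classify_exit` and the returned labels are hypothetical and not part of condor-low-latency. The JobStatus codes used are the standard Condor values (1 = Idle, 4 = Completed).

```python
# Standard Condor JobStatus codes relevant to this report.
JOB_STATUS_IDLE = 1       # job is back in the queue, e.g. after a restart (step 3)
JOB_STATUS_COMPLETED = 4  # job ran to completion (step 5)


def classify_exit(message):
    """Classify a status message given as a dict of attributes.

    The dict shape and the returned labels are assumptions made for
    illustration; only JobState and JobStatus come from the report.
    """
    if message.get("JobState") != "Exited":
        return "not-exited"
    status = message.get("JobStatus")
    if status is None:
        # Behaviour of condor-low-latency-1.0-9.el5: JobStatus is not set,
        # so the two situations cannot be told apart.
        return "unknown"
    status = int(status)
    if status == JOB_STATUS_COMPLETED:
        return "completed"
    if status == JOB_STATUS_IDLE:
        return "interrupted"
    return "other"
```

With this sketch, the step (3) message maps to "interrupted" and the step (5) message to "completed", while a message without JobStatus stays "unknown".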

Comment 1 Robert Rati 2009-03-06 20:16:48 UTC
Different symptom of BZ459615

Comment 3 Jan Sarenik 2009-03-19 14:20:38 UTC
Should I just verify that condor-low-latency-1.0-10 and higher
return the JobStatus as mentioned above?

Comment 4 Matthew Farrellee 2009-03-19 17:33:27 UTC
And the JobState. You may want to test one run where Condor executes the job without interruption, and other runs where it is interrupted by a restart and perhaps by kill -9, including of the carod (service condor-low-latency) process.

Comment 5 Jan Sarenik 2009-04-01 13:54:43 UTC
Jobs submitted via AMQP do not get run. Condor's StartLog says:

Slot requirements not satisfied.
Job requirements not satisfied.

When I put the dump into a job.submit file, change Cmd to Executable
and '5' to 'vanilla', and add Queue at the end, the job runs flawlessly
with condor_submit (with just a few lines of WARNINGs, because I include
the full dump, including parameters that are probably unknown to
condor_submit).
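The conversion described above could be scripted roughly as follows. This is only a sketch under assumptions: it supposes the dump is plain text with one `Name = Value` pair per line, and that the '5' in question is the value of a `JobUniverse` attribute, which is rewritten here to the `Universe` submit keyword. The function name `dump_to_submit` is hypothetical.

```python
def dump_to_submit(dump_text):
    """Turn a 'Name = Value' job dump into condor_submit file text.

    Assumptions (not confirmed by the report): one attribute per line,
    and the universe is given as 'JobUniverse = 5'.
    """
    out = []
    for line in dump_text.splitlines():
        if "=" not in line:
            # Keep non-attribute lines as they are, but drop blank ones.
            if line.strip():
                out.append(line)
            continue
        key, value = [part.strip() for part in line.split("=", 1)]
        if key == "Cmd":
            key = "Executable"
        elif key == "JobUniverse" and value == "5":
            key, value = "Universe", "vanilla"
        out.append("{0} = {1}".format(key, value))
    # condor_submit only queues a job once it sees a Queue statement.
    out.append("Queue")
    return "\n".join(out) + "\n"
```

Attributes that condor_submit does not recognize are passed through unchanged, which matches the WARNINGs mentioned above.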

This condor runs all the vanilla jobs via condor_submit with no
problems. Low-latency is configured by adding these lines to
/etc/condor/condor_config

--------------------------------------------------------------------------
LOW_LATENCY_HOOK_FETCH_WORK = $(LIBEXEC)/hooks/hook_fetch_work.py
LOW_LATENCY_HOOK_REPLY_FETCH = $(LIBEXEC)/hooks/hook_reply_fetch.py

# Starter hooks
LOW_LATENCY_JOB_HOOK_PREPARE_JOB = $(LIBEXEC)/hooks/hook_prepare_job.py
LOW_LATENCY_JOB_HOOK_UPDATE_JOB_INFO = $(LIBEXEC)/hooks/hook_update_job_status.py
LOW_LATENCY_JOB_HOOK_JOB_EXIT = $(LIBEXEC)/hooks/hook_job_exit.py

STARTD_JOB_HOOK_KEYWORD = LOW_LATENCY

FetchWorkDelay = 10 * (Activity == "Idle")
STARTER_UPDATE_INTERVAL = 30
--------------------------------------------------------------------------

condor-7.2.2-0.9.el5
condor-job-hooks-1.0-5.el5
condor-job-hooks-common-1.0-5.el5
condor-low-latency-1.0-12.el5

I was mainly using the cmd_args.py test from the low-latency branch
of the mrg-grid.git repo. Can you enlighten me, please?

Comment 6 Jan Sarenik 2009-04-02 13:59:28 UTC
Works as expected.

Comment 8 errata-xmlrpc 2009-04-21 16:17:27 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-0434.html

