Bug 610833 - Receiving occasional failures on qmf method calls HoldJob and ReleaseJob
Summary: Receiving occasional failures on qmf method calls HoldJob and ReleaseJob
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor-qmf
Version: Development
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: 1.3
Target Release: ---
Assignee: Pete MacKinnon
QA Contact: Martin Kudlej
URL:
Whiteboard:
Depends On: 612636
Blocks:
Reported: 2010-07-02 14:33 UTC by Ernie
Modified: 2012-03-15 12:12 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-10-21 18:44:51 UTC
Target Upstream Version:
Embargoed:



Description Ernie 2010-07-02 14:33:37 UTC
Calling the HoldJob and ReleaseJob qmf methods on Scheduler occasionally returns an error response.

An example of the error text and code:
Failed to release job (65537) - {}

In this case the job was in a Held state. The GlobalJobId was schedd7@#12082.0#1277835264

The submission name was Thwump!

This may also happen on the RemoveJob call, but I don't want to experiment.

Comment 1 Pete MacKinnon 2010-07-02 14:39:55 UTC
Is this from qpid-tool or cumin?

Comment 2 Ernie 2010-07-06 14:39:22 UTC
This is from cumin. Finding the correct scheduler given a submission is difficult using qpid-tool.

Comment 3 Ernie 2010-07-06 14:53:47 UTC
This may be an issue with finding the correct Scheduler. 
The linkage from submission to scheduler is currently this:

submission -> jobserver using submission.jobServerRef

jobserver -> scheduler using jobserver.Machine == scheduler.Machine

However, there are multiple schedulers that have the same Machine.

If the HoldJob method is called on the wrong scheduler, will the method fail like this?
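
To illustrate the ambiguity, here is a minimal sketch in plain Python with no QMF calls. The class names, field names, and host values are hypothetical stand-ins for the objects described above, not the actual cumin or condor-qmf schema. It only shows that matching on Machine alone can return more than one scheduler.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scheduler:
    name: str      # hypothetical: scheduler name
    machine: str   # stands in for scheduler.Machine


@dataclass(frozen=True)
class JobServer:
    name: str      # hypothetical: jobserver name
    machine: str   # stands in for jobserver.Machine


def schedulers_for(jobserver, schedulers):
    """Return every scheduler whose Machine equals the jobserver's Machine."""
    return [s for s in schedulers if s.machine == jobserver.machine]


# Two schedulers on the same host, as in a multi-schedd pool (made-up values).
schedulers = [
    Scheduler("schedd1@host.example.com", "host.example.com"),
    Scheduler("schedd7@host.example.com", "host.example.com"),
]
jobserver = JobServer("jobserver@host.example.com", "host.example.com")

candidates = schedulers_for(jobserver, schedulers)
# More than one candidate means the Machine match cannot pick a single scheduler.
ambiguous = len(candidates) > 1
```

With more than one candidate, a caller that simply takes the first match may end up invoking HoldJob or ReleaseJob on a scheduler that does not own the job.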

Comment 4 Pete MacKinnon 2010-08-03 14:15:51 UTC
Added jobserver->scheduler linkage, which seems to work for accurate job control in multi-schedd pools.

Comment 5 Martin Kudlej 2010-09-08 12:17:50 UTC
In which version can I reproduce this issue? Do I understand correctly that there can be just one jobserver in the pool, with many schedulers connected to it?

Comment 6 Pete MacKinnon 2010-09-08 12:45:30 UTC
Yes, this can be tested in a multi-schedd environment, particularly with cumin. condor 7.4.4-0.7 should exhibit this problem.

Comment 7 Martin Kudlej 2010-10-04 10:53:45 UTC
I've tested this with
condor-7.4.4-0.16
condor-debuginfo-7.4.4-0.16
condor-qmf-7.4.4-0.16
qpid-cpp-client-0.7.946106-17
qpid-tools-0.7.946106-11
python-qpid-0.7.946106-14
qpid-cpp-server-0.7.946106-17

in a multi-scheduler, single-jobserver environment on RHEL 5.5 and 4.8 (i386 and x86_64) with qpid-tool, and there is a correct link from jobserver to scheduler. --> VERIFIED

