610833 – Receiving occasional failures on qmf method calls HoldJob and ReleaseJob

Summary: Receiving occasional failures on qmf method calls HoldJob and ReleaseJob

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Enterprise MRG
Classification:	Red Hat
Component:	condor-qmf
Sub Component:
Version:	Development
Hardware:	All
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	1.3
Target Release:	---
Assignee:	Pete MacKinnon
QA Contact:	Martin Kudlej
Docs Contact:
URL:
Whiteboard:
Depends On:	612636
Blocks:

Reported:	2010-07-02 14:33 UTC by Ernie
Modified:	2012-03-15 12:12 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2010-10-21 18:44:51 UTC
Target Upstream Version:
Embargoed:

Description Ernie 2010-07-02 14:33:37 UTC

Calling the HoldJob and ReleaseJob qmf methods on Scheduler occasionally returns an error response.

An example of the error text and code:
Failed to release job (65537) - {}

In this case the job was in a Held state. The GlobalJobId was schedd7@#12082.0#1277835264

The submission name was Thwump!

This may also happen on the RemoveJob call, but I don't want to experiment.

Comment 1 Pete MacKinnon 2010-07-02 14:39:55 UTC

Is this from qpid-tool or cumin?

Comment 2 Ernie 2010-07-06 14:39:22 UTC

This is from cumin. Finding the correct scheduler given a submission is difficult using qpid-tool.

Comment 3 Ernie 2010-07-06 14:53:47 UTC

This may be an issue with finding the correct Scheduler. 
The linkage from submission to scheduler is currently this:

submission -> jobserver using submission.jobServerRef

jobserver -> scheduler using jobserver.Machine == scheduler.Machine

However, there are multiple schedulers that have the same Machine.

If the HoldJob method is called on the wrong scheduler, will the method fail like this?
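
For illustration, the lookup described above can be sketched in plain Python. This is not the cumin or qmf API: the classes, the attribute names other than the Machine comparison, and the host and scheduler names are hypothetical stand-ins. It only shows why matching on Machine can return more than one scheduler.

```python
from dataclasses import dataclass


@dataclass
class Scheduler:
    name: str      # hypothetical scheduler name
    machine: str   # stands in for scheduler.Machine


@dataclass
class JobServer:
    object_id: str  # hypothetical id that submission.jobServerRef points at
    machine: str    # stands in for jobserver.Machine


@dataclass
class Submission:
    name: str
    job_server_ref: str  # stands in for submission.jobServerRef


def schedulers_for(submission, job_servers, schedulers):
    """Follow submission -> jobserver, then match schedulers on Machine."""
    job_server = job_servers[submission.job_server_ref]
    return [s for s in schedulers if s.machine == job_server.machine]


# Hypothetical pool: two schedulers share one machine.
job_servers = {"js-1": JobServer("js-1", "host-a.example.com")}
schedulers = [
    Scheduler("schedd-a", "host-a.example.com"),
    Scheduler("schedd-b", "host-a.example.com"),
    Scheduler("schedd-c", "host-b.example.com"),
]
submission = Submission("Thwump!", "js-1")

candidates = schedulers_for(submission, job_servers, schedulers)
print([s.name for s in candidates])  # more than one candidate: ambiguous
```

With two candidates, the caller has no way to tell from Machine alone which scheduler owns the job.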

Comment 4 Pete MacKinnon 2010-08-03 14:15:51 UTC

Added a jobserver->scheduler linkage, which seems to work for accurate job control in multi-schedd pools.
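
A minimal sketch of how a direct linkage removes the ambiguity, assuming the jobserver carries an explicit reference to its scheduler. The attribute name `scheduler_ref` and the ids are hypothetical; the actual property added by the fix is not named in this report.

```python
from dataclasses import dataclass


@dataclass
class Scheduler:
    object_id: str  # hypothetical unique id
    machine: str


@dataclass
class JobServer:
    machine: str
    scheduler_ref: str  # hypothetical: id of the owning scheduler


def scheduler_for(job_server, schedulers_by_id):
    """Resolve exactly one scheduler through the explicit reference."""
    return schedulers_by_id[job_server.scheduler_ref]


# Hypothetical pool: two schedulers on the same machine.
schedulers_by_id = {
    "sched-1": Scheduler("sched-1", "host-a.example.com"),
    "sched-2": Scheduler("sched-2", "host-a.example.com"),
}
job_server = JobServer("host-a.example.com", "sched-2")

target = scheduler_for(job_server, schedulers_by_id)
print(target.object_id)
```

Because the lookup is by id rather than by Machine, schedulers sharing a host no longer collide.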

Comment 5 Martin Kudlej 2010-09-08 12:17:50 UTC

In which version can I reproduce this issue? Do I understand correctly that there can be just one jobserver in the pool and many schedulers connected to it?

Comment 6 Pete MacKinnon 2010-09-08 12:45:30 UTC

Yes, this can be tested in a multi-schedd environment, particularly with cumin. condor 7.4.4-0.7 should exhibit this problem.

Comment 7 Martin Kudlej 2010-10-04 10:53:45 UTC

I've tested this with
condor-7.4.4-0.16
condor-debuginfo-7.4.4-0.16
condor-qmf-7.4.4-0.16
qpid-cpp-client-0.7.946106-17
qpid-tools-0.7.946106-11
python-qpid-0.7.946106-14
qpid-cpp-server-0.7.946106-17

in a multi-scheduler, single-jobserver environment on RHEL 5.5/4.8 x i386/x86_64 with qpid-tool, and there is a correct link from jobserver to scheduler. --> VERIFIED
