Bug 610833

Summary: Receiving occasional failures on qmf method calls HoldJob and ReleaseJob
Product: Red Hat Enterprise MRG Reporter: Ernie <eallen>
Component: condor-qmf Assignee: Pete MacKinnon <pmackinn>
Status: CLOSED CURRENTRELEASE QA Contact: Martin Kudlej <mkudlej>
Severity: high Docs Contact:
Priority: high    
Version: Development CC: matt, mkudlej, pmackinn
Target Milestone: 1.3   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2010-10-21 18:44:51 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 612636    
Bug Blocks:    

Description Ernie 2010-07-02 14:33:37 UTC
Calls to the HoldJob and ReleaseJob qmf methods on Scheduler occasionally return an error response.

An example of the error text and code is:
Failed to release job (65537) - {}

In this case the job was in a Held state. The GlobalJobId was schedd7@#12082.0#1277835264

The submission name was Thwump!

This may also happen on the RemoveJob call, but I don't want to experiment.

Comment 1 Pete MacKinnon 2010-07-02 14:39:55 UTC
Is this from qpid-tool or cumin?

Comment 2 Ernie 2010-07-06 14:39:22 UTC
This is from cumin. Finding the correct scheduler given a submission is difficult using qpid-tool.

Comment 3 Ernie 2010-07-06 14:53:47 UTC
This may be an issue with finding the correct Scheduler. 
The linkage from submission to scheduler is currently this:

submission -> jobserver using submission.jobServerRef

jobserver -> scheduler using jobserver.Machine == scheduler.Machine

However, there are multiple schedulers that have the same Machine.

If the HoldJob method is called on the wrong scheduler, will the method fail like this?
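
The ambiguity described above can be illustrated with a minimal Python sketch. The dataclasses are plain stand-ins for QMF objects, not the condor-qmf API. Only `jobServerRef` and `Machine` come from the linkage described in this comment; the `name` fields, the host names, and the helper `schedulers_for_submission` are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Scheduler:
    name: str      # hypothetical identifier, for illustration only
    Machine: str


@dataclass
class JobServer:
    name: str      # hypothetical identifier, for illustration only
    Machine: str


@dataclass
class Submission:
    name: str
    jobServerRef: JobServer


def schedulers_for_submission(submission, schedulers):
    """Follow submission -> jobserver -> scheduler by matching on Machine."""
    jobserver = submission.jobServerRef
    return [s for s in schedulers if s.Machine == jobserver.Machine]


# Two schedulers share one Machine value, as in a multi-schedd pool.
jobserver = JobServer(name="jobserver1", Machine="host.example.com")
schedulers = [
    Scheduler(name="schedd1", Machine="host.example.com"),
    Scheduler(name="schedd7", Machine="host.example.com"),
    Scheduler(name="other", Machine="other.example.com"),
]
submission = Submission(name="Thwump!", jobServerRef=jobserver)

candidates = schedulers_for_submission(submission, schedulers)
print([s.name for s in candidates])
```

Because the Machine match returns more than one candidate, a caller that simply takes the first result may address a scheduler that does not own the job.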

Comment 4 Pete MacKinnon 2010-08-03 14:15:51 UTC
Added jobserver->scheduler linkage which seems to work for accurate job control in multi-schedd pools
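
A rough sketch of how a direct jobserver->scheduler link removes the ambiguity, assuming each jobserver record carries a reference to exactly one scheduler. The property name `schedulerRef`, the `name` fields, and the helper `scheduler_for_submission` are hypothetical; this report does not state the actual name of the added linkage.

```python
from dataclasses import dataclass


@dataclass
class Scheduler:
    name: str      # hypothetical identifier, for illustration only
    Machine: str


@dataclass
class JobServer:
    Machine: str
    schedulerRef: Scheduler   # hypothetical name for the added linkage


@dataclass
class Submission:
    name: str
    jobServerRef: JobServer


def scheduler_for_submission(submission):
    """Follow submission -> jobserver -> scheduler by reference, not by Machine."""
    return submission.jobServerRef.schedulerRef


# Both schedulers report the same Machine, but the reference picks one.
schedd1 = Scheduler(name="schedd1", Machine="host.example.com")
schedd7 = Scheduler(name="schedd7", Machine="host.example.com")
jobserver = JobServer(Machine="host.example.com", schedulerRef=schedd7)
submission = Submission(name="Thwump!", jobServerRef=jobserver)

target = scheduler_for_submission(submission)
print(target.name)
```

With a reference in place of the Machine comparison, the lookup yields a single scheduler even when several schedulers run on the same machine.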

Comment 5 Martin Kudlej 2010-09-08 12:17:50 UTC
In which version can I reproduce this issue? Do I understand correctly that there can be just one jobserver in the pool and many schedulers connected to it?

Comment 6 Pete MacKinnon 2010-09-08 12:45:30 UTC
Yes, this can be tested in a multi-schedd environment, particularly with cumin. condor 7.4.4-0.7 should exhibit this problem.

Comment 7 Martin Kudlej 2010-10-04 10:53:45 UTC
I've tested this with
condor-7.4.4-0.16
condor-debuginfo-7.4.4-0.16
condor-qmf-7.4.4-0.16
qpid-cpp-client-0.7.946106-17
qpid-tools-0.7.946106-11
python-qpid-0.7.946106-14
qpid-cpp-server-0.7.946106-17

in a multi-scheduler, one-jobserver environment on RHEL 5.5/4.8 x i386/x86_64 with qpid-tool, and there is a correct link from jobserver to scheduler. --> VERIFIED