Bug 700452

Summary: "condor_q -bet" output should be more informative
Product: Red Hat Enterprise MRG
Component: condor
Version: 2.0
Target Milestone: 2.1
Status: CLOSED NOTABUG
Keywords: Regression
Severity: medium
Priority: medium
Hardware: Unspecified
OS: Unspecified
Reporter: Martin Kudlej <mkudlej>
Assignee: Erik Erlandson <eerlands>
QA Contact: MRG Quality Engineering <mrgqe-bugs>
CC: matt
Doc Type: Bug Fix
Last Closed: 2011-05-09 23:47:17 UTC
Attachments:
- host1 configuration and logs
- host2 configuration and logs

Description Martin Kudlej 2011-04-28 12:51:05 UTC
Description of problem: "condor_q -better-analyze" used to print a useful summary of job matching:

2770.000:  Run analysis summary.  Of 12 machines,
     0 are rejected by your job's requirements
     4 reject your job because of their own requirements
     0 match but are serving users with a better priority in the pool
     8 match but reject the job for unknown reasons
     0 match but will not currently preempt their existing job
     0 match but are currently offline
     0 are available to run your job

The following attributes are missing from the job ClassAd:
 
CheckpointPlatform 

whereas now the output is just:
Request 7.0 did not match any resource's constraints

The output of -analyze is now identical to that of -better-analyze, so I assume the two options have become equivalent. There should still be a way to get the old, detailed information about job matching.

Version-Release number of selected component (if applicable):
condor-7.6.1-0.4.el6.i686

How reproducible:
100%

Steps to Reproduce:
1. Install a condor pool.
2. Submit a job that does not match any slot.
3. Run condor_q -bet.
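For step 2, a submit description along these lines would produce an unmatchable job (a sketch only; the reporter's actual submit file is not shown here, and the "unobtainium" arch is borrowed from comment 3):

```
universe     = vanilla
executable   = /bin/sleep
arguments    = 120
# no machine advertises this Arch, so the job can never match
requirements = ( TARGET.Arch == "unobtainium" )
queue
```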

Comment 1 Martin Kudlej 2011-04-28 12:58:05 UTC
output of condor_q -l:
RECIPE = "SECRET_SAUCE"
LastJobStatus = 0
ImageSize_RAW = 0
Submission = "python_test_submit"
Cmd = "/bin/echo"
ImageSize = 0
PeriodicRemove = false
Iwd = "/tmp"
PeriodicHold = false
JobStatus = 1
ClusterId = 7
RemoteUserCpu = 0.0
MinHosts = 1
JobUniverse = 5
PeriodicRelease = false
ScheddBday = 1303994942
Requirements = ( FileSystemDomain =!= undefined && Arch =!= undefined )
ShouldTransferFiles = "NO"
GlobalJobId = "_hostname_#7.0#1303990674"
LastRejMatchReason = "no match found"
MaxHosts = 1
ServerTime = 1303995411
ProcId = 0
CurrentHosts = 0
OnExitRemove = true
AutoClusterAttrs = "ImageSize,JobUniverse,LastCheckpointPlatform,NumCkpts,JobStart,RequestCpus,RequestDisk,RequestMemory,LastPeriodicCheckpoint,Requirements,NiceUser,ConcurrencyLimits"
AutoClusterId = 2
TargetType = "Machine"
QDate = 1303990674
OnExitHold = false
JobPrio = 0
Args = "test_hu"
CurrentTime = time()
User = "root@_hostname_"
LastRejMatchTime = 1303995394
MyType = "Job"
Owner = "root"

Comment 2 Martin Kudlej 2011-04-28 13:03:22 UTC
For another job:
RECIPE = "SECRET_SAUCE"
LastJobStatus = 0
ImageSize_RAW = 0
Submission = "python_test_submit"
ImageSize = 0
Cmd = "/bin/sleep"
PeriodicRemove = false
Iwd = "/tmp"
PeriodicHold = false
JobStatus = 1
ClusterId = 8
RemoteUserCpu = 0.0
MinHosts = 1
JobUniverse = 5
PeriodicRelease = false
Requirements = ( TARGET.Arch =!= undefined ) && ( TARGET.OpSys == "LINUX" ) && ( TARGET.Disk >= 0 ) && ( ( TARGET.Memory * 1024 ) >= 0 ) && ( TARGET.FileSystemDomain =!= undefined )
ShouldTransferFiles = "NO"
GlobalJobId = "_hostname_#8.0#1303995376"
LastRejMatchReason = "no match found"
MaxHosts = 1
ServerTime = 1303995695
ProcId = 0
CurrentHosts = 0
OnExitRemove = true
AutoClusterAttrs = "ImageSize,JobUniverse,LastCheckpointPlatform,NumCkpts,JobStart,RequestCpus,RequestDisk,RequestMemory,LastPeriodicCheckpoint,Requirements,NiceUser,ConcurrencyLimits"
AutoClusterId = 4
TargetType = "Machine"
QDate = 1303995376
OnExitHold = false
JobPrio = 0
Args = "120"
CurrentTime = time()
User = "root@_hostname_"
LastRejMatchTime = 1303995684
MyType = "Job"
Owner = "root"

condor_q -bet produces confusing output for it:

Reason for last match failure: no match found

The Requirements expression for your job is:

( TARGET.Arch isnt undefined ) && ( TARGET.OpSys == "LINUX" ) &&
( TARGET.Disk >= 0 ) && ( ( TARGET.Memory * 1024 ) >= 0 ) &&
( TARGET.FileSystemDomain isnt undefined )

    Condition                         Machines Matched    Suggestion
    ---------                         ----------------    ----------
1   ( TARGET.OpSys == "LINUX" )       2                    
2   ( TARGET.Arch isnt undefined )    17                   
3   ( TARGET.Disk >= 0 )              17                   
4   ( ( 1024 * TARGET.Memory ) >= 0 ) 17                   
5   ( TARGET.FileSystemDomain isnt undefined )
                                      17

Does this mean that 2 machines matched, or that there was "no match found"?
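One plausible reading (an assumption based on the table's layout, not on the condor source) is that each row counts the machines satisfying that one clause in isolation, so the "2" refers only to the OpSys clause, while the full conjunction can still match nothing. A minimal Python sketch with invented machine ads:

```python
# Invented machine ads, chosen so the per-clause counts resemble the table
# above while the conjunction matches nothing.  Not the reporter's real pool.
machines = (
    [{"OpSys": "LINUX", "Arch": None} for _ in range(2)]          # LINUX, Arch undefined
    + [{"OpSys": "WINDOWS", "Arch": "X86_64"} for _ in range(15)]
)

clauses = [
    ('( TARGET.OpSys == "LINUX" )', lambda m: m["OpSys"] == "LINUX"),
    ("( TARGET.Arch isnt undefined )", lambda m: m["Arch"] is not None),
]

# Per-clause counts, as in the "Machines Matched" column:
for name, pred in clauses:
    print(name, sum(pred(m) for m in machines))

# The job only runs where *every* clause holds -- here, nowhere:
full = sum(all(pred(m) for _, pred in clauses) for m in machines)
print("machines matching the full Requirements:", full)   # 0
```

Under that reading, nonzero counts in every row are consistent with an overall "no match found".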

Comment 3 Erik Erlandson 2011-05-03 18:41:11 UTC
I am getting what appears to be correct and reasonable output from condor_q -bet:

$ condor_q -bet


-- Submitter: localhost.localdomain : <192.168.1.2:59694> : localhost.localdomain
---
009.000:  Run analysis summary.  Of 20 machines,
     20 are rejected by your job's requirements 
      0 reject your job because of their own requirements 
      0 match but are serving users with a better priority in the pool 
      0 match but reject the job for unknown reasons 
      0 match but will not currently preempt their existing job 
      0 match but are currently offline 
      0 are available to run your job
	No successful match recorded.
	Last failed match: Tue May  3 11:36:02 2011
	Reason for last match failure: no match found

WARNING:  Be advised:
   No resources matched request's constraints

The Requirements expression for your job is:

( ( TARGET.Arch is "unobtainium" ) ) && ( TARGET.OpSys == "LINUX" ) &&
( TARGET.Disk >= DiskUsage ) && ( ( TARGET.Memory * 1024 ) >= ImageSize ) &&
( ( RequestMemory * 1024 ) >= ImageSize ) &&
( ( TARGET.HasFileTransfer ) || ( TARGET.FileSystemDomain == MY.FileSystemDomain ) )

    Condition                         Machines Matched    Suggestion
    ---------                         ----------------    ----------
1   ( ( TARGET.Arch is "unobtainium" ) )
                                      0                   REMOVE
2   ( TARGET.OpSys == "LINUX" )       20                   
3   ( TARGET.Disk >= 30 )             20                   
4   ( ( 1024 * TARGET.Memory ) >= 30 )
                                      20
5   ( ( 1024 * ceiling(ifThenElse(JobVMMemory isnt undefined,JobVMMemory,2.929687500000000E-02)) ) >= 30 )
                                      20                   
6   ( ( TARGET.HasFileTransfer ) || ( TARGET.FileSystemDomain == "localhost.localdomain" ) )
                                      20

Comment 4 Erik Erlandson 2011-05-03 18:49:40 UTC
(In reply to comment #0)

What I'm seeing looks correct.   Can you attach your config and the job submission file you used so that I can try a more precise repro?

Comment 5 Martin Kudlej 2011-05-06 14:28:14 UTC
Another example with user condor:
Out = "/tmp/mrg_$(Cluster).$(Process).out"
LastJobStatus = 0
ImageSize_RAW = 0
Submission = "host1#30"
ImageSize = 0
cmd = "/bin/sleep"
PeriodicRemove = false
iwd = "/tmp"
PeriodicHold = false
JobStatus = 1
ClusterId = 30
RemoteUserCpu = 0.0
MinHosts = 1
JobUniverse = 5
PeriodicRelease = false
requirements = true
ShouldTransferFiles = "NO"
GlobalJobId = "host1#30.0#1304691827"
UserLog = "/tmp/mrg_$(Cluster).$(Process).log"
MaxHosts = 1
ServerTime = 1304691941
ProcId = 0
Err = "/tmp/mrg_$(Cluster).$(Process).err"
CurrentHosts = 0
OnExitRemove = true
AutoClusterAttrs = "ImageSize,JobUniverse,LastCheckpointPlatform,NumCkpts,JobStart,RequestCpus,RequestDisk,RequestMemory,LastPeriodicCheckpoint,Requirements,NiceUser,ConcurrencyLimits"
AutoClusterId = 1
TargetType = "Machine"
QDate = 1304691827
OnExitHold = false
JobPrio = 0
args = "1"
CurrentTime = time()
User = "condor@host2"
MyType = "Job"
owner = "condor"


QMF submit dictionary:
{'iwd': '/tmp', 'requirements': 'TRUE', '!!descriptors': {'requirements': 'com.redhat.grid.Expression'}, 'args': '1', 'cmd': '/bin/sleep', 'Err': '/tmp/mrg_$(Cluster).$(Process).err', 'UserLog': '/tmp/mrg_$(Cluster).$(Process).log', 'JobUniverse': 5, 'owner': 'condor', 'Out': '/tmp/mrg_$(Cluster).$(Process).out'}

And the output from condor_q -bet is just:

-- Submitter: host1 : <ip1:35215> : host1
   Request 30.0 did not match any resource's constraints

Comment 7 Erik Erlandson 2011-05-06 18:49:11 UTC
It may help if you set NEGOTIATOR_DEBUG = D_FULLDEBUG | D_MATCH, and attach the log output of the negotiation cycle attempting to match one of these idle jobs.

Also, set TOOL_DEBUG = D_FULLDEBUG and attach the output from condor_q -bet.
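In configuration terms, the two settings from this comment would look like this (the knob names and values are exactly as given above; only the file placement is an assumption):

```
# e.g. appended to the local configuration file (condor_config.local)
NEGOTIATOR_DEBUG = D_FULLDEBUG | D_MATCH
TOOL_DEBUG = D_FULLDEBUG
```

A condor_reconfig is typically needed for the negotiator setting to take effect; TOOL_DEBUG applies to the next condor_q invocation.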

Comment 8 Martin Kudlej 2011-05-09 10:49:32 UTC
Created attachment 497756 [details]
host1 configuration and logs

Comment 9 Martin Kudlej 2011-05-09 10:50:05 UTC
Created attachment 497757 [details]
host2 configuration and logs