| Field | Value |
|---|---|
| Summary | "condor_q -bet" output should be more informative |
| Product | Red Hat Enterprise MRG |
| Component | condor |
| Version | 2.0 |
| Status | CLOSED NOTABUG |
| Severity | medium |
| Priority | medium |
| Target Milestone | 2.1 |
| Keywords | Regression |
| Hardware | Unspecified |
| OS | Unspecified |
| Doc Type | Bug Fix |
| Reporter | Martin Kudlej <mkudlej> |
| Assignee | Erik Erlandson <eerlands> |
| QA Contact | MRG Quality Engineering <mrgqe-bugs> |
| CC | matt |
| Last Closed | 2011-05-09 23:47:17 UTC |

Output of condor_q -l:
RECIPE = "SECRET_SAUCE"
LastJobStatus = 0
ImageSize_RAW = 0
Submission = "python_test_submit"
Cmd = "/bin/echo"
ImageSize = 0
PeriodicRemove = false
Iwd = "/tmp"
PeriodicHold = false
JobStatus = 1
ClusterId = 7
RemoteUserCpu = 0.0
MinHosts = 1
JobUniverse = 5
PeriodicRelease = false
ScheddBday = 1303994942
Requirements = ( FileSystemDomain =!= undefined && Arch =!= undefined )
ShouldTransferFiles = "NO"
GlobalJobId = "_hostname_#7.0#1303990674"
LastRejMatchReason = "no match found"
MaxHosts = 1
ServerTime = 1303995411
ProcId = 0
CurrentHosts = 0
OnExitRemove = true
AutoClusterAttrs = "ImageSize,JobUniverse,LastCheckpointPlatform,NumCkpts,JobStart,RequestCpus,RequestDisk,RequestMemory,LastPeriodicCheckpoint,Requirements,NiceUser,ConcurrencyLimits"
AutoClusterId = 2
TargetType = "Machine"
QDate = 1303990674
OnExitHold = false
JobPrio = 0
Args = "test_hu"
CurrentTime = time()
User = "root@_hostname_"
LastRejMatchTime = 1303995394
MyType = "Job"
Owner = "root"
For another job:
RECIPE = "SECRET_SAUCE"
LastJobStatus = 0
ImageSize_RAW = 0
Submission = "python_test_submit"
ImageSize = 0
Cmd = "/bin/sleep"
PeriodicRemove = false
Iwd = "/tmp"
PeriodicHold = false
JobStatus = 1
ClusterId = 8
RemoteUserCpu = 0.0
MinHosts = 1
JobUniverse = 5
PeriodicRelease = false
Requirements = ( TARGET.Arch =!= undefined ) && ( TARGET.OpSys == "LINUX" ) && ( TARGET.Disk >= 0 ) && ( ( TARGET.Memory * 1024 ) >= 0 ) && ( TARGET.FileSystemDomain =!= undefined )
ShouldTransferFiles = "NO"
GlobalJobId = "_hostname_#8.0#1303995376"
LastRejMatchReason = "no match found"
MaxHosts = 1
ServerTime = 1303995695
ProcId = 0
CurrentHosts = 0
OnExitRemove = true
AutoClusterAttrs = "ImageSize,JobUniverse,LastCheckpointPlatform,NumCkpts,JobStart,RequestCpus,RequestDisk,RequestMemory,LastPeriodicCheckpoint,Requirements,NiceUser,ConcurrencyLimits"
AutoClusterId = 4
TargetType = "Machine"
QDate = 1303995376
OnExitHold = false
JobPrio = 0
Args = "120"
CurrentTime = time()
User = "root@_hostname_"
LastRejMatchTime = 1303995684
MyType = "Job"
Owner = "root"
For this job, condor_q -bet gives strange output:
Reason for last match failure: no match found
The Requirements expression for your job is:
( TARGET.Arch isnt undefined ) && ( TARGET.OpSys == "LINUX" ) &&
( TARGET.Disk >= 0 ) && ( ( TARGET.Memory * 1024 ) >= 0 ) &&
( TARGET.FileSystemDomain isnt undefined )
    Condition                                   Machines Matched  Suggestion
    ---------                                   ----------------  ----------
1   ( TARGET.OpSys == "LINUX" )                 2
2   ( TARGET.Arch isnt undefined )              17
3   ( TARGET.Disk >= 0 )                        17
4   ( ( 1024 * TARGET.Memory ) >= 0 )           17
5   ( TARGET.FileSystemDomain isnt undefined )  17
Does this mean that 2 machines matched, or that no match was found?
I am getting what appears to be correct and reasonable output from condor_q -bet:
$ condor_q -bet
-- Submitter: localhost.localdomain : <192.168.1.2:59694> : localhost.localdomain
---
009.000: Run analysis summary. Of 20 machines,
20 are rejected by your job's requirements
0 reject your job because of their own requirements
0 match but are serving users with a better priority in the pool
0 match but reject the job for unknown reasons
0 match but will not currently preempt their existing job
0 match but are currently offline
0 are available to run your job
No successful match recorded.
Last failed match: Tue May 3 11:36:02 2011
Reason for last match failure: no match found
WARNING: Be advised:
No resources matched request's constraints
The Requirements expression for your job is:
( ( TARGET.Arch is "unobtainium" ) ) && ( TARGET.OpSys == "LINUX" ) &&
( TARGET.Disk >= DiskUsage ) && ( ( TARGET.Memory * 1024 ) >= ImageSize ) &&
( ( RequestMemory * 1024 ) >= ImageSize ) &&
( ( TARGET.HasFileTransfer ) || ( TARGET.FileSystemDomain == MY.FileSystemDomain ) )
    Condition                             Machines Matched  Suggestion
    ---------                             ----------------  ----------
1   ( ( TARGET.Arch is "unobtainium" ) )  0                 REMOVE
2   ( TARGET.OpSys == "LINUX" )           20
3   ( TARGET.Disk >= 30 )                 20
4   ( ( 1024 * TARGET.Memory ) >= 30 )    20
5   ( ( 1024 * ceiling(ifThenElse(JobVMMemory isnt undefined,JobVMMemory,2.929687500000000E-02)) ) >= 30 )
                                          20
6   ( ( TARGET.HasFileTransfer ) || ( TARGET.FileSystemDomain == "localhost.localdomain" ) )
                                          20
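The way I read the table above (this is my understanding, not something confirmed in this report) is that each condition is counted on its own, while the run analysis summary counts machines that satisfy the whole Requirements expression. A minimal Python sketch of that reading follows. The machine ads, the `clauses` dictionary, and both helper functions are made up for illustration and are not part of condor:

```python
# Hypothetical pool: 20 Linux machines, none of which has Arch "unobtainium".
machines = [
    {"Arch": "X86_64", "OpSys": "LINUX", "Disk": 1000, "Memory": 2048}
    for _ in range(20)
]

# Each clause of the job's Requirements, written as a separate predicate
# over one machine ad (a plain dict here, not a real ClassAd).
clauses = {
    'TARGET.Arch is "unobtainium"': lambda m: m["Arch"] == "unobtainium",
    'TARGET.OpSys == "LINUX"': lambda m: m["OpSys"] == "LINUX",
    "TARGET.Disk >= 30": lambda m: m["Disk"] >= 30,
    "( 1024 * TARGET.Memory ) >= 30": lambda m: 1024 * m["Memory"] >= 30,
}


def per_clause_counts(machines, clauses):
    """Count, for each clause alone, how many machines satisfy it."""
    return {
        text: sum(1 for m in machines if pred(m))
        for text, pred in clauses.items()
    }


def full_match_count(machines, clauses):
    """Count machines that satisfy every clause at once (the conjunction)."""
    return sum(1 for m in machines if all(pred(m) for pred in clauses.values()))


counts = per_clause_counts(machines, clauses)
total = full_match_count(machines, clauses)

for text, n in counts.items():
    print(f"{text:40} {n}")
print("machines matching the whole expression:", total)
```

Under this reading, a single clause that matches 0 machines is enough to make the whole expression match nothing, even when every other clause matches all 20, and the smallest per-condition count is only an upper bound on the number of machines that can match the job.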
(In reply to comment #0)
What I'm seeing looks correct. Can you attach your config and the job submission file you used, so that I can try a more precise repro?
Another example, with user condor:
Out = "/tmp/mrg_$(Cluster).$(Process).out"
LastJobStatus = 0
ImageSize_RAW = 0
Submission = "host1#30"
ImageSize = 0
cmd = "/bin/sleep"
PeriodicRemove = false
iwd = "/tmp"
PeriodicHold = false
JobStatus = 1
ClusterId = 30
RemoteUserCpu = 0.0
MinHosts = 1
JobUniverse = 5
PeriodicRelease = false
requirements = true
ShouldTransferFiles = "NO"
GlobalJobId = "host1#30.0#1304691827"
UserLog = "/tmp/mrg_$(Cluster).$(Process).log"
MaxHosts = 1
ServerTime = 1304691941
ProcId = 0
Err = "/tmp/mrg_$(Cluster).$(Process).err"
CurrentHosts = 0
OnExitRemove = true
AutoClusterAttrs = "ImageSize,JobUniverse,LastCheckpointPlatform,NumCkpts,JobStart,RequestCpus,RequestDisk,RequestMemory,LastPeriodicCheckpoint,Requirements,NiceUser,ConcurrencyLimits"
AutoClusterId = 1
TargetType = "Machine"
QDate = 1304691827
OnExitHold = false
JobPrio = 0
args = "1"
CurrentTime = time()
User = "condor@host2"
MyType = "Job"
owner = "condor"
QMF submit dictionary:
{'iwd': '/tmp', 'requirements': 'TRUE', '!!descriptors': {'requirements': 'com.redhat.grid.Expression'}, 'args': '1', 'cmd': '/bin/sleep', 'Err': '/tmp/mrg_$(Cluster).$(Process).err', 'UserLog': '/tmp/mrg_$(Cluster).$(Process).log', 'JobUniverse': 5, 'owner': 'condor', 'Out': '/tmp/mrg_$(Cluster).$(Process).out'}
And the output from condor_q -bet is just:
-- Submitter: host1 : <ip1:35215> : host1
Request 30.0 did not match any resource's constraints
It may help if you set NEGOTIATOR_DEBUG = D_FULLDEBUG | D_MATCH and attach the log output of the negotiation cycle attempting to match one of these idle jobs. Also, set TOOL_DEBUG = D_FULLDEBUG and attach the output from condor_q -bet.
Created attachment 497756
host1 configuration and logs
Created attachment 497757
host2 configuration and logs
Description of problem:
"condor_q -better-analyze" used to give a useful description of job matching:
2770.000: Run analysis summary. Of 12 machines,
0 are rejected by your job's requirements
4 reject your job because of their own requirements
0 match but are serving users with a better priority in the pool
8 match but reject the job for unknown reasons
0 match but will not currently preempt their existing job
0 match but are currently offline
0 are available to run your job
The following attributes are missing from the job ClassAd:
CheckpointPlatform
and now there is just:
Request 7.0 did not match any resource's constraints
The output from -analyze is the same as from -better-analyze, so I think the two options now behave the same. But I think it should still be possible to get job matching information in the old format.
Version-Release number of selected component (if applicable):
condor-7.6.1-0.4.el6.i686
How reproducible:
100%
Steps to Reproduce:
1. Install a condor pool.
2. Submit a job that doesn't match any slot.
3. Run condor_q -bet.