Bug 915210
| Summary: | Suspend and continue of parallel universe job | | |
|---|---|---|---|
| Product: | Red Hat Enterprise MRG | Reporter: | Daniel Horák <dahorak> |
| Component: | condor | Assignee: | grid-maint-list <grid-maint-list> |
| Status: | CLOSED WONTFIX | QA Contact: | MRG Quality Engineering <mrgqe-bugs> |
| Severity: | high | Docs Contact: | |
| Priority: | medium | | |
| Version: | 2.2 | CC: | matt, tstclair |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-05-26 19:51:31 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
MRG-Grid is in maintenance and only customer escalations will be considered. This issue can be reopened if a customer escalation associated with it occurs.
Created attachment 702243 [details]
Parallel universe job.

Description of problem:
Suspending a job doesn't work with a parallel universe job.

Version-Release number of selected component (if applicable):
MRG 2.3: condor-7.8.8-0.4.1
MRG 2.2: condor-7.6.5-0.22

How reproducible:
100%

Steps to Reproduce:
1. Configure condor to run parallel universe jobs:

```
DedicatedScheduler="DedicatedScheduler@HOSTNAME"
STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler
```

2. Prepare a parallel universe job (job.py is attached):

```
$ cat /tmp/parallel.job
universe = parallel
executable = /tmp/job.py
args = 120
log = /tmp/parallel_job.$(cluster).$(process).log
output = /tmp/parallel_job.$(cluster).$(process)-$(NODE).out
error = /tmp/parallel_job.$(cluster).$(process)-$(NODE).err
machine_count = 1
should_transfer_files = yes
when_to_transfer_output = on_exit
requirements = (FileSystemDomain =!= UNDEFINED && Arch =!= UNDEFINED)
+ParallelShutdownPolicy = "WAIT_FOR_ALL"
queue
```

3. Submit the job, wait for it to start, then try to suspend it:

```
$ condor_submit /tmp/parallel.job
Submitting job(s).
1 job(s) submitted to cluster 1.

$ condor_q
-- Submitter: HOSTNAME : <IP:47117> : HOSTNAME
 ID      OWNER        SUBMITTED     RUN_TIME ST PRI SIZE CMD
 1.0     test        2/25 09:43   0+00:00:00 I  0   0.0  job2.py
1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended

$ condor_q
-- Submitter: HOSTNAME : <IP:47117> : HOSTNAME
 ID      OWNER        SUBMITTED     RUN_TIME ST PRI SIZE CMD
 1.0     test        2/25 09:43   0+00:00:15 R  0   0.0  job2.py
1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended

$ condor_suspend 1.0
Job 1.0 suspended

$ condor_q
-- Submitter: HOSTNAME : <IP:47117> : HOSTNAME
 ID      OWNER        SUBMITTED     RUN_TIME ST PRI SIZE CMD
 1.0     test        2/25 09:43   0+00:00:21 I  0   4.2  job2.py
1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
```

Actual results:
The job is not suspended; it is moved back to Idle instead.

Expected results:
The job is correctly suspended.

Additional info:
Suspending vanilla and Java universe jobs works correctly.
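The contents of the attached job.py (attachment 702243) are not reproduced in this report. A minimal sketch of what such a test payload might look like, assuming the script simply sleeps for the number of seconds passed as its first argument (consistent with `args = 120` in the submit file, which gives condor_suspend a running job to act on):

```python
#!/usr/bin/env python
# Hypothetical sketch of the attached job.py (actual attachment contents
# not shown in this report): sleep for the duration given on the command
# line so the job stays in the Running state long enough to be suspended.
import sys
import time


def run(seconds):
    """Sleep for the given number of seconds and return the duration slept."""
    time.sleep(seconds)
    return seconds


if __name__ == "__main__":
    # e.g. "args = 120" in the submit file -> sleep for 120 seconds
    run(int(sys.argv[1]))
```

This is only an illustration of the reproduction setup; the real attachment may differ.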