Bug 862550
| Summary: | schedd crash on local universe condor_suspend+condor_continue job | | |
|---|---|---|---|
| Product: | Red Hat Enterprise MRG | Reporter: | Daniel Horák <dahorak> |
| Component: | condor | Assignee: | Timothy St. Clair <tstclair> |
| Status: | CLOSED ERRATA | QA Contact: | Daniel Horák <dahorak> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 2.2 | CC: | ltoscano, matt, sgraf, tstclair |
| Target Milestone: | 2.3 | Keywords: | Reopened |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | condor-7.8.6-0.2 | Doc Type: | Bug Fix |
| Doc Text: | Cause: Trying to suspend a local universe job. Consequence: The schedd will crash. Fix: Properly ignore requests to suspend and continue scheduler and local universe jobs. Result: The schedd continues normally, and an error is reported to the user when they try to suspend a local universe job. | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2013-03-06 18:46:59 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
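The fix described in the Doc Text amounts to a guard in the schedd's action handling: instead of letting an unexpected suspend/continue request reach an `EXCEPT()` (which aborts the daemon, producing the "exited with status 4" seen below), the request for a scheduler or local universe job is logged, ignored, and reported back to the user. A minimal sketch of that pattern, assuming hypothetical enum and function names modeled on the log output ("Ignoring unsupported action (8 Suspend)") rather than the actual `schedd.cpp` source:

```cpp
#include <cstdio>
#include <string>

// Hypothetical names/values, modeled on the log output, not the real source.
enum JobAction   { JA_SUSPEND_JOBS = 8, JA_CONTINUE_JOBS = 9 };
enum JobUniverse { UNIVERSE_VANILLA, UNIVERSE_SCHEDULER, UNIVERSE_LOCAL };

// Returns true if the action can be dispatched, false if it was ignored.
bool try_job_action(JobUniverse universe, JobAction action, std::string &error) {
    bool no_startd    = (universe == UNIVERSE_LOCAL ||
                         universe == UNIVERSE_SCHEDULER);
    bool suspend_like = (action == JA_SUSPEND_JOBS ||
                         action == JA_CONTINUE_JOBS);
    if (no_startd && suspend_like) {
        // Pre-fix behavior: this case fell through to EXCEPT("unknown
        // action ..."), killing the schedd. Post-fix behavior: log, set an
        // error for the submitting tool, and keep the daemon running.
        std::fprintf(stderr,
                     "Local universe: Ignoring unsupported action (%d)\n",
                     static_cast<int>(action));
        error = "suspend/continue is not supported for this job's universe";
        return false;
    }
    // Supported universes run under a startd, which can deliver the
    // suspend/continue request to the running job.
    return true;
}
```

The design point is that an unsupported user request must never be fatal to a long-running daemon: rejecting it with an error message is recoverable, while `EXCEPT()` takes down the schedd for every user.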
The local universe is currently unsupported for this: suspend and continue require a startd, and any universe whose jobs run through a startd should be supported. With the new packages, condor_suspend behaves differently for a local universe job depending on how the job is identified (CLUSTER, CLUSTER.JOB, USER):
```
$ cat local.job
universe = local
executable = /bin/sleep
arguments = 120
iwd = /tmp
requirements = (FileSystemDomain =!= UNDEFINED && Arch =!= UNDEFINED)
queue
$ condor_submit local.job
Submitting job(s).
1 job(s) submitted to cluster 1.
$ condor_q
-- Submitter: dhcp-37-141.lab.eng.brq.redhat.com : <10.34.37.141:58377> : dhcp-37-141.lab.eng.brq.redhat.com
 ID    OWNER    SUBMITTED     RUN_TIME ST PRI SIZE CMD
 1.0   test     12/7  08:58   0+00:00:02 R  0   0.0  sleep 120
1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
$ condor_suspend 1
Couldn't find/suspend all jobs in cluster 1.
$ condor_suspend 1.0
Job 1.0 suspended
$ condor_suspend test
Couldn't find/suspend all of user test's job(s).
```
The following log record corresponds to the command `condor_suspend 1.0`:
```
# tail -F SchedLog | grep -v "Number of Active Workers"
12/07/12 08:58:55 (pid:12018) Local universe: Ignoring unsupported action (8 Suspend)
# rpm -qa | grep condor
condor-classads-7.8.7-0.6.el5.i386
condor-7.8.7-0.6.el5.i386
```
Is this behaviour expected? >>> NEEDINFO
(In reply to comment #6)
> Is this behaviour expected? >>> NEEDINFO

Yes.

Tested and verified on RHEL 5.9/6.4, i386/x86_64 (output from RHEL 6.4 i386):
```
# rpm -q condor
condor-7.8.8-0.1.el6.i686
$ cat local.job
universe = local
executable = /bin/sleep
arguments = 300
iwd = /tmp
requirements = (FileSystemDomain =!= UNDEFINED && Arch =!= UNDEFINED)
queue
$ condor_submit local.job
Submitting job(s).
1 job(s) submitted to cluster 1.
$ condor_q
-- Submitter: HOSTNAME : <IP:43481> : HOSTNAME
 ID    OWNER    SUBMITTED     RUN_TIME ST PRI SIZE CMD
 1.0   test     12/20 13:56   0+00:00:11 R  0   0.0  sleep 300
1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
$ condor_suspend 1
Couldn't find/suspend all jobs in cluster 1.
$ condor_suspend 1.0
Job 1.0 suspended
$ condor_suspend test
Couldn't find/suspend all of user test's job(s).
$ condor_q
-- Submitter: HOSTNAME : <IP:43481> : HOSTNAME
 ID    OWNER    SUBMITTED     RUN_TIME ST PRI SIZE CMD
 1.0   test     12/20 13:56   0+00:00:46 R  0   0.0  sleep 300
1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
```
```
# tail -F /var/log/condor/SchedLog | grep -v "Number of Active Workers"
...
12/20/12 13:56:31 (pid:2086) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
12/20/12 13:56:31 (pid:2086) Sent ad to central manager for test@HOSTNAME
12/20/12 13:56:31 (pid:2086) Sent ad to 1 collectors for test@HOSTNAME
12/20/12 13:56:31 (pid:2086) Starting add_shadow_birthdate(1.0)
12/20/12 13:56:31 (pid:2086) Spawned local starter (pid 2467) for job 1.0
12/20/12 13:56:58 (pid:2086) Local universe: Ignoring unsupported action (8 Suspend)
...
```
>>> VERIFIED
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHSA-2013-0564.html
Description of problem:
When I try to suspend a local universe job, condor_schedd fails ("exit with status 4"). Which job universes are supported for the suspend and continue functionality?

Version-Release number of selected component (if applicable):
condor-7.6.5-0.22.el5.i386

How reproducible:
100%

Steps to Reproduce:
1. Prepare a simple local job:

```
$ cat local.job
universe = local
executable = /bin/sleep
arguments = 30
iwd = /tmp
requirements = (FileSystemDomain =!= UNDEFINED && Arch =!= UNDEFINED)
queue
```

2. Submit the prepared job and wait for it to reach the running state:

```
$ condor_submit local.job
$ condor_q
```

3. Try to suspend the local job:

```
$ condor_suspend 15
```

Actual results:

```
$ condor_submit local.job
Submitting job(s).
1 job(s) submitted to cluster 15.
$ condor_q
-- Submitter: dhcp-37-195.lab.eng.brq.redhat.com : <10.34.37.195:36995> : dhcp-37-195.lab.eng.brq.redhat.com
 ID    OWNER    SUBMITTED     RUN_TIME ST PRI SIZE CMD
 15.0  test     10/3  10:17   0+00:00:07 R  0   0.0  sleep 30
1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
$ condor_suspend 15
Cluster 15 suspended.
```

```
# tail -F SchedLog
...
10/03/12 10:06:02 (pid:26936) ERROR "unknown action (8 Suspend) in abort_job_myself()" at line 1870 in file /builddir/build/BUILD/condor-7.6.4/src/condor_schedd.V6/schedd.cpp
...
```

```
# tail -F MasterLog
10/03/12 10:06:02 DaemonCore: No more children processes to reap.
10/03/12 10:06:02 The SCHEDD (pid 26936) exited with status 4
10/03/12 10:06:02 ProcAPI::buildFamily() Parent pid 26936 is gone. Found descendant 26937 via ancestor environment tracking and assigning as new "parent".
10/03/12 10:06:02 ProcAPI::getProcInfo() pid 26936 does not exist.
10/03/12 10:06:02 ProcAPI::getProcInfo() pid 26936 does not exist.
10/03/12 10:06:02 ProcAPI::getProcInfo() pid 26936 does not exist.
10/03/12 10:06:02 ProcAPI::getProcInfo() pid 26936 does not exist.
10/03/12 10:06:02 ProcAPI::getProcInfo() pid 26936 does not exist.
10/03/12 10:06:02 ProcAPI::getProcInfo() pid 27015 does not exist.
10/03/12 10:06:02 ProcAPI::getProcInfo() pid 27015 does not exist.
10/03/12 10:06:02 ProcAPI::getProcInfo() pid 27015 does not exist.
10/03/12 10:06:02 ProcAPI::getProcInfo() pid 27015 does not exist.
10/03/12 10:06:02 ProcAPI::getProcInfo() pid 27015 does not exist.
10/03/12 10:06:02 Sending obituary for "/usr/sbin/condor_schedd"
10/03/12 10:06:02 Forking Mailer process...
10/03/12 10:06:02 restarting /usr/sbin/condor_schedd in 10 seconds
10/03/12 10:06:02 enter Daemons::UpdateCollector
10/03/12 10:06:02 Trying to update collector <10.34.37.195:9618>
10/03/12 10:06:02 Attempting to send update via UDP to collector dhcp-37-195.lab.eng.brq.redhat.com <10.34.37.195:9618>
10/03/12 10:06:02 MgmtMasterPlugin: calling update
10/03/12 10:06:02 exit Daemons::UpdateCollector
10/03/12 10:06:02 DaemonCore: No more children processes to reap.
10/03/12 10:06:12 ::RealStart; SCHEDD on_hold=0
10/03/12 10:06:12 SharedPortEndpoint: Inside destructor.
10/03/12 10:06:12 start recover timer (26)
10/03/12 10:06:12 Started DaemonCore process "/usr/sbin/condor_schedd -f", pid and pgroup = 27259
10/03/12 10:06:12 enter Daemons::UpdateCollector
10/03/12 10:06:12 Trying to update collector <10.34.37.195:9618>
10/03/12 10:06:12 Attempting to send update via UDP to collector dhcp-37-195.lab.eng.brq.redhat.com <10.34.37.195:9618>
10/03/12 10:06:12 MgmtMasterPlugin: calling update
10/03/12 10:06:12 exit Daemons::UpdateCollector
10/03/12 10:06:12 Received TCP command 60008 (DC_CHILDALIVE) from unauthenticated@unmapped <10.34.37.195:57343>, access level DAEMON
```

Expected results:
At the very least, condor_schedd should not exit with an error. If the local universe is supported for suspending, jobs have to be correctly suspended.

Additional info: