Red Hat Bugzilla – Bug 862550
schedd crash on local universe condor_suspend+condor_continue job
Last modified: 2013-03-06 13:46:59 EST
Description of problem:
When I try to suspend a local universe job, condor_schedd fails ("exit with status 4"). Which job universes are supported for suspend and continue functionality?

Version-Release number of selected component (if applicable):
condor-7.6.5-0.22.el5.i386

How reproducible:
100%

Steps to Reproduce:
1. Prepare a simple local job:
$ cat local.job
universe = local
executable = /bin/sleep
arguments = 30
iwd = /tmp
requirements = (FileSystemDomain =!= UNDEFINED && Arch =!= UNDEFINED)
queue

2. Submit the prepared job and wait for the running state:
$ condor_submit local.job
$ condor_q

3. Try to suspend the local job:
$ condor_suspend 15

Actual results:
$ condor_submit local.job
Submitting job(s).
1 job(s) submitted to cluster 15.

$ condor_q
-- Submitter: dhcp-37-195.lab.eng.brq.redhat.com : <10.34.37.195:36995> : dhcp-37-195.lab.eng.brq.redhat.com
 ID      OWNER   SUBMITTED     RUN_TIME ST PRI SIZE CMD
 15.0    test    10/3  10:17   0+00:00:07 R  0   0.0  sleep 30
1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended

$ condor_suspend 15
Cluster 15 suspended.

# tail -F SchedLog
...
10/03/12 10:06:02 (pid:26936) ERROR "unknown action (8 Suspend) in abort_job_myself()" at line 1870 in file /builddir/build/BUILD/condor-7.6.4/src/condor_schedd.V6/schedd.cpp
...

# tail -F MasterLog
10/03/12 10:06:02 DaemonCore: No more children processes to reap.
10/03/12 10:06:02 The SCHEDD (pid 26936) exited with status 4
10/03/12 10:06:02 ProcAPI::buildFamily() Parent pid 26936 is gone. Found descendant 26937 via ancestor environment tracking and assigning as new "parent".
10/03/12 10:06:02 ProcAPI::getProcInfo() pid 26936 does not exist.
10/03/12 10:06:02 ProcAPI::getProcInfo() pid 26936 does not exist.
10/03/12 10:06:02 ProcAPI::getProcInfo() pid 26936 does not exist.
10/03/12 10:06:02 ProcAPI::getProcInfo() pid 26936 does not exist.
10/03/12 10:06:02 ProcAPI::getProcInfo() pid 26936 does not exist.
10/03/12 10:06:02 ProcAPI::getProcInfo() pid 27015 does not exist.
10/03/12 10:06:02 ProcAPI::getProcInfo() pid 27015 does not exist.
10/03/12 10:06:02 ProcAPI::getProcInfo() pid 27015 does not exist.
10/03/12 10:06:02 ProcAPI::getProcInfo() pid 27015 does not exist.
10/03/12 10:06:02 ProcAPI::getProcInfo() pid 27015 does not exist.
10/03/12 10:06:02 Sending obituary for "/usr/sbin/condor_schedd"
10/03/12 10:06:02 Forking Mailer process...
10/03/12 10:06:02 restarting /usr/sbin/condor_schedd in 10 seconds
10/03/12 10:06:02 enter Daemons::UpdateCollector
10/03/12 10:06:02 Trying to update collector <10.34.37.195:9618>
10/03/12 10:06:02 Attempting to send update via UDP to collector dhcp-37-195.lab.eng.brq.redhat.com <10.34.37.195:9618>
10/03/12 10:06:02 MgmtMasterPlugin: calling update
10/03/12 10:06:02 exit Daemons::UpdateCollector
10/03/12 10:06:02 DaemonCore: No more children processes to reap.
10/03/12 10:06:12 ::RealStart; SCHEDD on_hold=0
10/03/12 10:06:12 SharedPortEndpoint: Inside destructor.
10/03/12 10:06:12 start recover timer (26)
10/03/12 10:06:12 Started DaemonCore process "/usr/sbin/condor_schedd -f", pid and pgroup = 27259
10/03/12 10:06:12 enter Daemons::UpdateCollector
10/03/12 10:06:12 Trying to update collector <10.34.37.195:9618>
10/03/12 10:06:12 Attempting to send update via UDP to collector dhcp-37-195.lab.eng.brq.redhat.com <10.34.37.195:9618>
10/03/12 10:06:12 MgmtMasterPlugin: calling update
10/03/12 10:06:12 exit Daemons::UpdateCollector
10/03/12 10:06:12 Received TCP command 60008 (DC_CHILDALIVE) from unauthenticated@unmapped <10.34.37.195:57343>, access level DAEMON

Expected results:
At the very least, condor_schedd should not exit with an error. If the local universe is supported for suspending, jobs have to be suspended correctly.

Additional info:
Local universe is currently unsupported: suspend/continue requires a startd. Jobs in any universe that is run through a startd should be supported.
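The rule stated in this comment, combined with the expectation that the schedd should not exit on an unsupported request, can be modelled roughly as below. This is an illustrative Python sketch only, not the schedd.cpp code: the function name `handle_job_action` and the `startd_universes` parameter are hypothetical, and which universes go through a startd is left to the caller. Only the action code 8 (Suspend) and the wording of the "Ignoring unsupported action" message come from the logs in this report.

```python
SUSPEND = 8  # action code as printed in the SchedLog in this report


def handle_job_action(universe, action, action_name, startd_universes):
    """Hypothetical model: decide what to do with a suspend/continue request.

    Jobs in universes that run through a startd are forwarded; anything
    else is reported and ignored rather than treated as a fatal error.
    """
    if universe in startd_universes:
        return f"forwarded {action_name} to startd"
    # The 7.6.x schedd hit a fatal "unknown action" error at this point.
    return (
        f"{universe.capitalize()} universe: "
        f"Ignoring unsupported action ({action} {action_name})"
    )


if __name__ == "__main__":
    # "vanilla" is used here only as an example of a startd-run universe.
    print(handle_job_action("local", SUSPEND, "Suspend", {"vanilla"}))
    print(handle_job_action("vanilla", SUSPEND, "Suspend", {"vanilla"}))
```

The point of the sketch is the design choice: an unsupported action becomes a logged no-op instead of a daemon exit.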
With the new packages, condor_suspend behaves differently for the local universe depending on how the job is identified (CLUSTER, CLUSTER.JOB, USER):

$ cat local.job
universe = local
executable = /bin/sleep
arguments = 120
iwd = /tmp
requirements = (FileSystemDomain =!= UNDEFINED && Arch =!= UNDEFINED)
queue

$ condor_submit local.job
Submitting job(s).
1 job(s) submitted to cluster 1.

$ condor_q
-- Submitter: dhcp-37-141.lab.eng.brq.redhat.com : <10.34.37.141:58377> : dhcp-37-141.lab.eng.brq.redhat.com
 ID      OWNER   SUBMITTED     RUN_TIME ST PRI SIZE CMD
 1.0     test    12/7  08:58   0+00:00:02 R  0   0.0  sleep 120
1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended

$ condor_suspend 1
Couldn't find/suspend all jobs in cluster 1.
$ condor_suspend 1.0
Job 1.0 suspended
$ condor_suspend test
Couldn't find/suspend all of user test's job(s).

The following log record is for the command "condor_suspend 1.0":

# tail -F SchedLog | grep -v "Number of Active Workers"
12/07/12 08:58:55 (pid:12018) Local universe: Ignoring unsupported action (8 Suspend)

# rpm -qa | grep condor
condor-classads-7.8.7-0.6.el5.i386
condor-7.8.7-0.6.el5.i386

Is this behaviour expected?

>>> NEEDINFO
(In reply to comment #6)
> Is this behaviour expected? >>> NEEDINFO

Yes
Tested and verified on RHEL 5.9/6.4, i386/x86_64 (output from RHEL 6.4 i386):

# rpm -q condor
condor-7.8.8-0.1.el6.i686

$ cat local.job
universe = local
executable = /bin/sleep
arguments = 300
iwd = /tmp
requirements = (FileSystemDomain =!= UNDEFINED && Arch =!= UNDEFINED)
queue

$ condor_submit local.job
Submitting job(s).
1 job(s) submitted to cluster 1.

$ condor_q
-- Submitter: HOSTNAME : <IP:43481> : HOSTNAME
 ID      OWNER   SUBMITTED     RUN_TIME ST PRI SIZE CMD
 1.0     test    12/20 13:56   0+00:00:11 R  0   0.0  sleep 300
1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended

$ condor_suspend 1
Couldn't find/suspend all jobs in cluster 1.
$ condor_suspend 1.0
Job 1.0 suspended
$ condor_suspend test
Couldn't find/suspend all of user test's job(s).

$ condor_q
-- Submitter: HOSTNAME : <IP:43481> : HOSTNAME
 ID      OWNER   SUBMITTED     RUN_TIME ST PRI SIZE CMD
 1.0     test    12/20 13:56   0+00:00:46 R  0   0.0  sleep 300
1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended

# tail -F /var/log/condor/SchedLog | grep -v "Number of Active Workers"
. . .
12/20/12 13:56:31 (pid:2086) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
12/20/12 13:56:31 (pid:2086) Sent ad to central manager for test@HOSTNAME
12/20/12 13:56:31 (pid:2086) Sent ad to 1 collectors for test@HOSTNAME
12/20/12 13:56:31 (pid:2086) Starting add_shadow_birthdate(1.0)
12/20/12 13:56:31 (pid:2086) Spawned local starter (pid 2467) for job 1.0
12/20/12 13:56:58 (pid:2086) Local universe: Ignoring unsupported action (8 Suspend)
. . .

>>> VERIFIED
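For anyone repeating this verification, the SchedLog check can be scripted. The snippet below is a minimal sketch that assumes the log line format shown in the output above; the helper name `find_ignored_actions` and the regular expression are illustrative and not part of Condor.

```python
import re

# Matches lines such as:
# 12/20/12 13:56:58 (pid:2086) Local universe: Ignoring unsupported action (8 Suspend)
IGNORED_RE = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"\(pid:(?P<pid>\d+)\) (?P<universe>\w+) universe: "
    r"Ignoring unsupported action \((?P<code>\d+) (?P<name>\w+)\)$"
)


def find_ignored_actions(lines):
    """Return (universe, action_code, action_name) for each matching line."""
    hits = []
    for line in lines:
        match = IGNORED_RE.match(line.strip())
        if match:
            hits.append(
                (match["universe"], int(match["code"]), match["name"])
            )
    return hits


if __name__ == "__main__":
    sample = [
        "12/20/12 13:56:31 (pid:2086) Spawned local starter (pid 2467) for job 1.0",
        "12/20/12 13:56:58 (pid:2086) Local universe: Ignoring unsupported action (8 Suspend)",
    ]
    print(find_ignored_actions(sample))
```

A non-empty result, together with the schedd still running, corresponds to the verified behaviour: the suspend request is logged and ignored instead of crashing the daemon.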
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHSA-2013-0564.html