Bug 862550
| Summary: | schedd crash on local universe condor_suspend+condor_continue job | | |
|---|---|---|---|
| Product: | Red Hat Enterprise MRG | Reporter: | Daniel Horák <dahorak> |
| Component: | condor | Assignee: | Timothy St. Clair <tstclair> |
| Status: | CLOSED ERRATA | QA Contact: | Daniel Horák <dahorak> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 2.2 | CC: | ltoscano, matt, sgraf, tstclair |
| Target Milestone: | 2.3 | Keywords: | Reopened |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | condor-7.8.6-0.2 | Doc Type: | Bug Fix |
| Doc Text: | Cause: Trying to suspend a local universe job. Consequence: The schedd will crash. Fix: Properly ignore requests to suspend and continue scheduler and local universe jobs. Result: The schedd continues normally, and an error is reported to the user when they try to suspend a local universe job. | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2013-03-06 18:46:59 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
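The fix described in the Doc Text amounts to a guard in the schedd's action handling: instead of letting an unexpected suspend/continue request reach an `EXCEPT()` (which aborts the daemon, producing the "exited with status 4" seen below), the request for a scheduler or local universe job is logged, ignored, and reported back to the user. A minimal sketch of that pattern, assuming hypothetical enum and function names modeled on the log output ("Ignoring unsupported action (8 Suspend)") rather than the actual `schedd.cpp` source:

```cpp
#include <cstdio>
#include <string>

// Hypothetical names/values, modeled on the log output, not the real source.
enum JobAction   { JA_SUSPEND_JOBS = 8, JA_CONTINUE_JOBS = 9 };
enum JobUniverse { UNIVERSE_VANILLA, UNIVERSE_SCHEDULER, UNIVERSE_LOCAL };

// Returns true if the action can be dispatched, false if it was ignored.
bool try_job_action(JobUniverse universe, JobAction action, std::string &error) {
    bool no_startd    = (universe == UNIVERSE_LOCAL ||
                         universe == UNIVERSE_SCHEDULER);
    bool suspend_like = (action == JA_SUSPEND_JOBS ||
                         action == JA_CONTINUE_JOBS);
    if (no_startd && suspend_like) {
        // Pre-fix behavior: this case fell through to EXCEPT("unknown
        // action ..."), killing the schedd. Post-fix behavior: log, set an
        // error for the submitting tool, and keep the daemon running.
        std::fprintf(stderr,
                     "Local universe: Ignoring unsupported action (%d)\n",
                     static_cast<int>(action));
        error = "suspend/continue is not supported for this job's universe";
        return false;
    }
    // Supported universes run under a startd, which can deliver the
    // suspend/continue request to the running job.
    return true;
}
```

The design point is that an unsupported user request must never be fatal to a long-running daemon: rejecting it with an error message is recoverable, while `EXCEPT()` takes down the schedd for every user.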
The local universe is currently unsupported for this: suspend and continue require a startd, and any universe whose jobs run through a startd should be supported. With the new packages, condor_suspend behaves differently for a local universe job depending on how the job is identified (CLUSTER, CLUSTER.JOB, USER):
```
$ cat local.job
universe = local
executable = /bin/sleep
arguments = 120
iwd = /tmp
requirements = (FileSystemDomain =!= UNDEFINED && Arch =!= UNDEFINED)
queue
$ condor_submit local.job
Submitting job(s).
1 job(s) submitted to cluster 1.
$ condor_q
-- Submitter: dhcp-37-141.lab.eng.brq.redhat.com : <10.34.37.141:58377> : dhcp-37-141.lab.eng.brq.redhat.com
 ID    OWNER    SUBMITTED     RUN_TIME ST PRI SIZE CMD
 1.0   test     12/7  08:58   0+00:00:02 R  0   0.0  sleep 120
1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
$ condor_suspend 1
Couldn't find/suspend all jobs in cluster 1.
$ condor_suspend 1.0
Job 1.0 suspended
$ condor_suspend test
Couldn't find/suspend all of user test's job(s).
```
The following log record corresponds to the command `condor_suspend 1.0`:
```
# tail -F SchedLog | grep -v "Number of Active Workers"
12/07/12 08:58:55 (pid:12018) Local universe: Ignoring unsupported action (8 Suspend)
# rpm -qa | grep condor
condor-classads-7.8.7-0.6.el5.i386
condor-7.8.7-0.6.el5.i386
```
Is this behaviour expected? >>> NEEDINFO
(In reply to comment #6)
> Is this behaviour expected? >>> NEEDINFO

Yes.

Tested and verified on RHEL 5.9/6.4, i386/x86_64 (output from RHEL 6.4 i386):
```
# rpm -q condor
condor-7.8.8-0.1.el6.i686
$ cat local.job
universe = local
executable = /bin/sleep
arguments = 300
iwd = /tmp
requirements = (FileSystemDomain =!= UNDEFINED && Arch =!= UNDEFINED)
queue
$ condor_submit local.job
Submitting job(s).
1 job(s) submitted to cluster 1.
$ condor_q
-- Submitter: HOSTNAME : <IP:43481> : HOSTNAME
 ID    OWNER    SUBMITTED     RUN_TIME ST PRI SIZE CMD
 1.0   test     12/20 13:56   0+00:00:11 R  0   0.0  sleep 300
1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
$ condor_suspend 1
Couldn't find/suspend all jobs in cluster 1.
$ condor_suspend 1.0
Job 1.0 suspended
$ condor_suspend test
Couldn't find/suspend all of user test's job(s).
$ condor_q
-- Submitter: HOSTNAME : <IP:43481> : HOSTNAME
 ID    OWNER    SUBMITTED     RUN_TIME ST PRI SIZE CMD
 1.0   test     12/20 13:56   0+00:00:46 R  0   0.0  sleep 300
1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
```
```
# tail -F /var/log/condor/SchedLog | grep -v "Number of Active Workers"
...
12/20/12 13:56:31 (pid:2086) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
12/20/12 13:56:31 (pid:2086) Sent ad to central manager for test@HOSTNAME
12/20/12 13:56:31 (pid:2086) Sent ad to 1 collectors for test@HOSTNAME
12/20/12 13:56:31 (pid:2086) Starting add_shadow_birthdate(1.0)
12/20/12 13:56:31 (pid:2086) Spawned local starter (pid 2467) for job 1.0
12/20/12 13:56:58 (pid:2086) Local universe: Ignoring unsupported action (8 Suspend)
...
```
>>> VERIFIED
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHSA-2013-0564.html
Description of problem:
When I try to suspend a local universe job, condor_schedd fails ("exit with status 4"). Which job universes are supported for the suspend and continue functionality?

Version-Release number of selected component (if applicable):
condor-7.6.5-0.22.el5.i386

How reproducible:
100%

Steps to Reproduce:
1. Prepare a simple local job:

```
$ cat local.job
universe = local
executable = /bin/sleep
arguments = 30
iwd = /tmp
requirements = (FileSystemDomain =!= UNDEFINED && Arch =!= UNDEFINED)
queue
```

2. Submit the prepared job and wait for it to reach the running state:

```
$ condor_submit local.job
$ condor_q
```

3. Try to suspend the local job:

```
$ condor_suspend 15
```

Actual results:

```
$ condor_submit local.job
Submitting job(s).
1 job(s) submitted to cluster 15.
$ condor_q
-- Submitter: dhcp-37-195.lab.eng.brq.redhat.com : <10.34.37.195:36995> : dhcp-37-195.lab.eng.brq.redhat.com
 ID    OWNER    SUBMITTED     RUN_TIME ST PRI SIZE CMD
 15.0  test     10/3  10:17   0+00:00:07 R  0   0.0  sleep 30
1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
$ condor_suspend 15
Cluster 15 suspended.
```

```
# tail -F SchedLog
...
10/03/12 10:06:02 (pid:26936) ERROR "unknown action (8 Suspend) in abort_job_myself()" at line 1870 in file /builddir/build/BUILD/condor-7.6.4/src/condor_schedd.V6/schedd.cpp
...
```

```
# tail -F MasterLog
10/03/12 10:06:02 DaemonCore: No more children processes to reap.
10/03/12 10:06:02 The SCHEDD (pid 26936) exited with status 4
10/03/12 10:06:02 ProcAPI::buildFamily() Parent pid 26936 is gone. Found descendant 26937 via ancestor environment tracking and assigning as new "parent".
10/03/12 10:06:02 ProcAPI::getProcInfo() pid 26936 does not exist.
10/03/12 10:06:02 ProcAPI::getProcInfo() pid 26936 does not exist.
10/03/12 10:06:02 ProcAPI::getProcInfo() pid 26936 does not exist.
10/03/12 10:06:02 ProcAPI::getProcInfo() pid 26936 does not exist.
10/03/12 10:06:02 ProcAPI::getProcInfo() pid 26936 does not exist.
10/03/12 10:06:02 ProcAPI::getProcInfo() pid 27015 does not exist.
10/03/12 10:06:02 ProcAPI::getProcInfo() pid 27015 does not exist.
10/03/12 10:06:02 ProcAPI::getProcInfo() pid 27015 does not exist.
10/03/12 10:06:02 ProcAPI::getProcInfo() pid 27015 does not exist.
10/03/12 10:06:02 ProcAPI::getProcInfo() pid 27015 does not exist.
10/03/12 10:06:02 Sending obituary for "/usr/sbin/condor_schedd"
10/03/12 10:06:02 Forking Mailer process...
10/03/12 10:06:02 restarting /usr/sbin/condor_schedd in 10 seconds
10/03/12 10:06:02 enter Daemons::UpdateCollector
10/03/12 10:06:02 Trying to update collector <10.34.37.195:9618>
10/03/12 10:06:02 Attempting to send update via UDP to collector dhcp-37-195.lab.eng.brq.redhat.com <10.34.37.195:9618>
10/03/12 10:06:02 MgmtMasterPlugin: calling update
10/03/12 10:06:02 exit Daemons::UpdateCollector
10/03/12 10:06:02 DaemonCore: No more children processes to reap.
10/03/12 10:06:12 ::RealStart; SCHEDD on_hold=0
10/03/12 10:06:12 SharedPortEndpoint: Inside destructor.
10/03/12 10:06:12 start recover timer (26)
10/03/12 10:06:12 Started DaemonCore process "/usr/sbin/condor_schedd -f", pid and pgroup = 27259
10/03/12 10:06:12 enter Daemons::UpdateCollector
10/03/12 10:06:12 Trying to update collector <10.34.37.195:9618>
10/03/12 10:06:12 Attempting to send update via UDP to collector dhcp-37-195.lab.eng.brq.redhat.com <10.34.37.195:9618>
10/03/12 10:06:12 MgmtMasterPlugin: calling update
10/03/12 10:06:12 exit Daemons::UpdateCollector
10/03/12 10:06:12 Received TCP command 60008 (DC_CHILDALIVE) from unauthenticated@unmapped <10.34.37.195:57343>, access level DAEMON
```

Expected results:
At the very least, condor_schedd should not exit with an error. If the local universe is supported for suspending, jobs have to be correctly suspended.

Additional info: