Bug 862550 - schedd crash on local universe condor_suspend+condor_continue job
Status: CLOSED ERRATA
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: 2.2
Hardware: All Linux
Priority: high
Severity: high
Target Milestone: 2.3
Target Release: ---
Assigned To: Timothy St. Clair
QA Contact: Daniel Horák
Keywords: Reopened
Reported: 2012-10-03 04:28 EDT by Daniel Horák
Modified: 2013-03-06 13:46 EST (History)
Fixed In Version: condor-7.8.6-0.2
Doc Type: Bug Fix
Doc Text:
Cause: A user attempts to suspend a local universe job.
Consequence: The schedd crashes.
Fix: Requests to suspend or continue scheduler and local universe jobs are now properly ignored.
Result: The schedd continues to run normally, and an error is reported to the user who tries to suspend a local universe job.
Last Closed: 2013-03-06 13:46:59 EST
Type: Bug
External Trackers
Tracker ID Priority Status Summary Last Updated
Condor 3259 None None None 2012-10-09 09:55:26 EDT

Description Daniel Horák 2012-10-03 04:28:48 EDT
Description of problem:
  When I try to suspend a local universe job, condor_schedd fails (the master log reports "exited with status 4").

  Which job universes are supported for suspend and continue functionality?

Version-Release number of selected component (if applicable):
  condor-7.6.5-0.22.el5.i386

How reproducible:
  100%

Steps to Reproduce:
1. Prepare a simple local universe job:
  $ cat local.job 
    universe = local
    executable = /bin/sleep
    arguments = 30
    iwd = /tmp
    requirements = (FileSystemDomain =!= UNDEFINED && Arch =!= UNDEFINED)
    queue

2. Submit the prepared job and wait until it is running:
  $ condor_submit local.job 
  $ condor_q

3. Try to suspend the local universe job:
  $ condor_suspend 15
  

Actual results:
  $ condor_submit local.job 
    Submitting job(s).
    1 job(s) submitted to cluster 15.

  $ condor_q
    -- Submitter: dhcp-37-195.lab.eng.brq.redhat.com : <10.34.37.195:36995> : dhcp-37-195.lab.eng.brq.redhat.com
     ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
      15.0   test           10/3  10:17   0+00:00:07 R  0   0.0  sleep 30          
  
    1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
  $ condor_suspend 15
    Cluster 15 suspended.

  # tail -F SchedLog 
    ...
    10/03/12 10:06:02 (pid:26936) ERROR "unknown action (8 Suspend) in abort_job_myself()" at line 1870 in file /builddir/build/BUILD/condor-7.6.4/src/condor_schedd.V6/schedd.cpp
    ...
    
  # tail -F MasterLog
    10/03/12 10:06:02 DaemonCore: No more children processes to reap.
    10/03/12 10:06:02 The SCHEDD (pid 26936) exited with status 4
    10/03/12 10:06:02 ProcAPI::buildFamily() Parent pid 26936 is gone. Found descendant 26937 via ancestor environment tracking and assigning as new "parent".
    10/03/12 10:06:02 ProcAPI::getProcInfo() pid 26936 does not exist.
    10/03/12 10:06:02 ProcAPI::getProcInfo() pid 26936 does not exist.
    10/03/12 10:06:02 ProcAPI::getProcInfo() pid 26936 does not exist.
    10/03/12 10:06:02 ProcAPI::getProcInfo() pid 26936 does not exist.
    10/03/12 10:06:02 ProcAPI::getProcInfo() pid 26936 does not exist.
    10/03/12 10:06:02 ProcAPI::getProcInfo() pid 27015 does not exist.
    10/03/12 10:06:02 ProcAPI::getProcInfo() pid 27015 does not exist.
    10/03/12 10:06:02 ProcAPI::getProcInfo() pid 27015 does not exist.
    10/03/12 10:06:02 ProcAPI::getProcInfo() pid 27015 does not exist.
    10/03/12 10:06:02 ProcAPI::getProcInfo() pid 27015 does not exist.
    10/03/12 10:06:02 Sending obituary for "/usr/sbin/condor_schedd"
    10/03/12 10:06:02 Forking Mailer process...
    10/03/12 10:06:02 restarting /usr/sbin/condor_schedd in 10 seconds
    10/03/12 10:06:02 enter Daemons::UpdateCollector
    10/03/12 10:06:02 Trying to update collector <10.34.37.195:9618>
    10/03/12 10:06:02 Attempting to send update via UDP to collector dhcp-37-195.lab.eng.brq.redhat.com <10.34.37.195:9618>
    10/03/12 10:06:02 MgmtMasterPlugin: calling update
    10/03/12 10:06:02 exit Daemons::UpdateCollector
    10/03/12 10:06:02 DaemonCore: No more children processes to reap.
    10/03/12 10:06:12 ::RealStart; SCHEDD on_hold=0
    10/03/12 10:06:12 SharedPortEndpoint: Inside destructor.
    10/03/12 10:06:12 start recover timer (26)
    10/03/12 10:06:12 Started DaemonCore process "/usr/sbin/condor_schedd -f", pid and pgroup = 27259
    10/03/12 10:06:12 enter Daemons::UpdateCollector
    10/03/12 10:06:12 Trying to update collector <10.34.37.195:9618>
    10/03/12 10:06:12 Attempting to send update via UDP to collector dhcp-37-195.lab.eng.brq.redhat.com <10.34.37.195:9618>
    10/03/12 10:06:12 MgmtMasterPlugin: calling update
    10/03/12 10:06:12 exit Daemons::UpdateCollector
    10/03/12 10:06:12 Received TCP command 60008 (DC_CHILDALIVE) from unauthenticated@unmapped <10.34.37.195:57343>, access level DAEMON
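
The two key records above (the SchedLog ERROR and the MasterLog exit line) share a pid, so a crash like this one can be spotted by pairing them. The following is only a minimal sketch: the regular expressions are modelled on the exact lines quoted in this report, and the function name find_schedd_crash is hypothetical, not part of Condor.

```python
import re

# Patterns modelled on the SchedLog and MasterLog lines quoted in this report.
# Other Condor versions may format these records differently (assumption).
SCHEDD_ERROR = re.compile(
    r'\(pid:(?P<pid>\d+)\) ERROR "(?P<message>[^"]*)" '
    r'at line (?P<line>\d+) in file (?P<file>\S+)'
)
MASTER_EXIT = re.compile(
    r'The (?P<daemon>\w+) \(pid (?P<pid>\d+)\) exited with status (?P<status>\d+)'
)


def find_schedd_crash(sched_log_lines, master_log_lines):
    """Pair each SchedLog ERROR with a MasterLog exit record for the same pid.

    Returns a list of (pid, exit_status, error_message) tuples.
    """
    errors = {}
    for text in sched_log_lines:
        match = SCHEDD_ERROR.search(text)
        if match:
            errors[int(match.group("pid"))] = match.group("message")

    crashes = []
    for text in master_log_lines:
        match = MASTER_EXIT.search(text)
        if match and int(match.group("pid")) in errors:
            pid = int(match.group("pid"))
            crashes.append((pid, int(match.group("status")), errors[pid]))
    return crashes


# Sample input copied from the log excerpts in this report.
sched_log = [
    '10/03/12 10:06:02 (pid:26936) ERROR "unknown action (8 Suspend) in '
    'abort_job_myself()" at line 1870 in file '
    '/builddir/build/BUILD/condor-7.6.4/src/condor_schedd.V6/schedd.cpp',
]
master_log = [
    '10/03/12 10:06:02 DaemonCore: No more children processes to reap.',
    '10/03/12 10:06:02 The SCHEDD (pid 26936) exited with status 4',
]

for pid, status, message in find_schedd_crash(sched_log, master_log):
    print("schedd pid %d exited with status %d: %s" % (pid, status, message))
```

Matching on the pid rather than on the timestamp avoids false pairings when several daemons log in the same second.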


Expected results:
  At a minimum, condor_schedd should not exit with an error. If the local universe supports suspension, the job should be suspended correctly.

Additional info:
Comment 1 Timothy St. Clair 2012-10-08 10:40:15 EDT
The local universe is currently unsupported: suspend/continue requires a startd.

Any job in a universe that is run through a startd should be supported.
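
To illustrate the rule above together with the fix described in the Doc Text, here is a rough sketch of a dispatcher that ignores suspend/continue for scheduler and local universe jobs instead of treating the action as a fatal error. This is not the schedd's actual code. The function name handle_job_action is hypothetical, the JobUniverse numbers (7 for scheduler, 12 for local, 5 for vanilla) are assumed to match the installed Condor version, and the wording of the scheduler universe message is an assumption; only the local universe message appears in this report.

```python
# JobUniverse codes (assumed to match the installed Condor version).
VANILLA_UNIVERSE = 5
SCHEDULER_UNIVERSE = 7
LOCAL_UNIVERSE = 12

# Universes whose jobs are run by the schedd itself rather than through a startd.
SCHEDD_RUN_UNIVERSES = {
    SCHEDULER_UNIVERSE: "Scheduler",
    LOCAL_UNIVERSE: "Local",
}

UNSUPPORTED_ACTIONS = ("Suspend", "Continue")


def handle_job_action(job_universe, action_code, action_name):
    """Return a log message when the action is ignored, or None when the
    action can be passed on to the startd as usual.

    Before the fix, the equivalent situation ended in a fatal ERROR and the
    schedd exited; this sketch only logs and carries on.
    """
    universe_name = SCHEDD_RUN_UNIVERSES.get(job_universe)
    if universe_name is not None and action_name in UNSUPPORTED_ACTIONS:
        return "%s universe: Ignoring unsupported action (%d %s)" % (
            universe_name, action_code, action_name)
    return None


# Action code 8 for Suspend is taken from the log lines in this report.
print(handle_job_action(LOCAL_UNIVERSE, 8, "Suspend"))
print(handle_job_action(VANILLA_UNIVERSE, 8, "Suspend"))
```

The first call returns a message to log, and the second returns None because a vanilla universe job runs through a startd.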
Comment 6 Daniel Horák 2012-12-07 03:11:45 EST
With the new packages, condor_suspend behaves differently for a local universe job depending on how the job is identified (CLUSTER, CLUSTER.JOB, USER):

$ cat local.job 
  universe = local
  executable = /bin/sleep
  arguments = 120
  iwd = /tmp
  requirements = (FileSystemDomain =!= UNDEFINED && Arch =!= UNDEFINED)
  queue

$ condor_submit local.job 
  Submitting job(s).
  1 job(s) submitted to cluster 1.

$ condor_q
  -- Submitter: dhcp-37-141.lab.eng.brq.redhat.com : <10.34.37.141:58377> : dhcp-37-141.lab.eng.brq.redhat.com
   ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
     1.0   test           12/7  08:58   0+00:00:02 R  0   0.0  sleep 120         

  1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended

$ condor_suspend 1

  Couldn't find/suspend all jobs in cluster 1.

$ condor_suspend 1.0
  Job 1.0 suspended

$ condor_suspend test

  Couldn't find/suspend all of user test's job(s).


The following log record is for the command "condor_suspend 1.0":
# tail -F SchedLog | grep -v "Number of Active Workers"
  12/07/12 08:58:55 (pid:12018) Local universe: Ignoring unsupported action (8 Suspend)

# rpm -qa | grep condor
  condor-classads-7.8.7-0.6.el5.i386
  condor-7.8.7-0.6.el5.i386

Is this behaviour expected? >>> NEEDINFO
Comment 7 Matthew Farrellee 2012-12-12 08:53:33 EST
(In reply to comment #6)

> Is this behaviour expected? >>> NEEDINFO

Yes
Comment 8 Daniel Horák 2012-12-20 08:06:39 EST
Tested and verified on RHEL 5.9/6.4, i386/x86_64 (output from RHEL 6.4 i386):

# rpm -q condor
  condor-7.8.8-0.1.el6.i686

$ cat local.job 
  universe = local
  executable = /bin/sleep
  arguments = 300
  iwd = /tmp
  requirements = (FileSystemDomain =!= UNDEFINED && Arch =!= UNDEFINED)
  queue

$ condor_submit local.job 
  Submitting job(s).
  1 job(s) submitted to cluster 1.

$ condor_q
  -- Submitter: HOSTNAME : <IP:43481> : HOSTNAME
   ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
     1.0   test           12/20 13:56   0+00:00:11 R  0   0.0  sleep 300         

  1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended

$ condor_suspend 1

  Couldn't find/suspend all jobs in cluster 1.

$ condor_suspend 1.0
  Job 1.0 suspended

$ condor_suspend test

  Couldn't find/suspend all of user test's job(s).

$ condor_q
  -- Submitter: HOSTNAME : <IP:43481> : HOSTNAME
   ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
     1.0   test           12/20 13:56   0+00:00:46 R  0   0.0  sleep 300         

  1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended

# tail -F /var/log/condor/SchedLog | grep -v "Number of Active Workers"
    . . .
  12/20/12 13:56:31 (pid:2086) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
  12/20/12 13:56:31 (pid:2086) Sent ad to central manager for test@HOSTNAME
  12/20/12 13:56:31 (pid:2086) Sent ad to 1 collectors for test@HOSTNAME
  12/20/12 13:56:31 (pid:2086) Starting add_shadow_birthdate(1.0)
  12/20/12 13:56:31 (pid:2086) Spawned local starter (pid 2467) for job 1.0
  12/20/12 13:56:58 (pid:2086) Local universe: Ignoring unsupported action (8 Suspend)
    . . .

>>> VERIFIED
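
The check performed above (the job is still running and nothing is suspended after condor_suspend) could be automated by parsing the condor_q summary line. The following is a minimal sketch that assumes the summary format shown in this report; parse_summary and still_running are hypothetical helper names.

```python
import re

# Matches the condor_q summary line as it appears in this report, e.g.
# "1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended"
SUMMARY = re.compile(r'^\s*(\d+) jobs; (.*)$')


def parse_summary(line):
    """Turn a condor_q summary line into a dict of state -> count.

    Returns None when the line is not a summary line.
    """
    match = SUMMARY.match(line)
    if match is None:
        return None
    counts = {"jobs": int(match.group(1))}
    for part in match.group(2).split(","):
        number, state = part.split()
        counts[state] = int(number)
    return counts


def still_running(counts):
    """True when every job is running and none is suspended."""
    return counts["running"] == counts["jobs"] and counts["suspended"] == 0


summary = parse_summary(
    "  1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended")
print(summary)
print(still_running(summary))
```

In a test script, the summary line would come from the last line of the condor_q output captured after the condor_suspend calls.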
Comment 11 errata-xmlrpc 2013-03-06 13:46:59 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0564.html
