Bug 626362

Summary: low latency hook_update_job_status exits with 1
Product: Red Hat Enterprise MRG Reporter: Martin Kudlej <mkudlej>
Component: gridAssignee: Robert Rati <rrati>
Status: CLOSED ERRATA QA Contact: Martin Kudlej <mkudlej>
Severity: medium Docs Contact:
Priority: medium    
Version: betaCC: matt
Target Milestone: 1.3   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Martin Kudlej 2010-08-23 10:58:02 UTC
Description of problem:
I've run "update status" low-latency test with job:
Cmd = /bin/sleep
Arguments = 60
Iwd = /tmp
Owner = nobody
JobUniverse = 5

and status update hook exits with result 1.
Nobody is valid account in system, but he hasn't shell.
$ grep nobody /etc/passwd
nobody:x:99:99:Nobody:/:/sbin/nologin

So status update hook cannot run condor_config_val(reading
configuration in hook), it cannot write log into syslog(there is no log entry in /var/log/messages from nobody's update hook - ordinary python script run by ordinary user can write to syslog) and exits with
1.


Version-Release number of selected component (if applicable):
condor-low-latency-1.1-0.2.el4
python-condorutils-1.4-4.el4
condor-job-hooks-1.4-4.el4

How reproducible:
100%

Steps to Reproduce:
1. install condor pool with low-latency support
2. run long term job
3. set up full debug for starter
4. watch exit status of status update hook in StarterLog
  
Actual results:
It doesn't send update status message from Carod to low-latency client.

Expected results:
It will send update status message from Carod to low-latency client.

Additional info:

08/23/10 06:11:50 ******************************************************
08/23/10 06:11:50 ** condor_starter (CONDOR_STARTER) STARTING UP
08/23/10 06:11:50 ** /usr/sbin/condor_starter
08/23/10 06:11:50 ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
08/23/10 06:11:50 ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
08/23/10 06:11:50 ** $CondorVersion: 7.4.4 Aug  9 2010 BuildID: RH-7.4.4-0.9.el4 PRE-RELEASE $
08/23/10 06:11:50 ** $CondorPlatform: X86_64-LINUX_RHEL4 $
08/23/10 06:11:50 ** PID = 1129
08/23/10 06:11:50 ** Log last touched 8/23 06:10:44
08/23/10 06:11:50 ******************************************************
08/23/10 06:11:50 Using config source: /etc/condor/condor_config
08/23/10 06:11:50 Using local config sources:
08/23/10 06:11:50    /etc/condor/config.d/00personal_condor.config
08/23/10 06:11:50    /etc/condor/config.d/zzz_condor_config.test
08/23/10 06:11:50 DaemonCore: Command Socket at <:41341>
08/23/10 06:11:50 Setting maximum accepts per cycle 4.
08/23/10 06:11:50 Will use UDP to update collector  <:9618>
08/23/10 06:11:50 50807556 kbytes available for "."
08/23/10 06:11:50 Done setting resource limits
08/23/10 06:11:50 Starter running a local job with no shadow
08/23/10 06:11:50 Reading job ClassAd from "STDIN"
08/23/10 06:11:50 Found ClassAd data in "STDIN"
08/23/10 06:11:50 Instantiating a StarterHookMgr
08/23/10 06:11:50 Using HookKeyword value from job ClassAd: "LOW_LATENCY_JOB"
08/23/10 06:11:50 setting the orig job name in starter
08/23/10 06:11:50 setting the orig job iwd in starter
08/23/10 06:11:50 Job's cluster ID not specified in ClassAd or on command-line, using '1'
08/23/10 06:11:50 Job's proc ID not specified in ClassAd or on command-line, using '0'
08/23/10 06:11:50 Initialized user_priv as "nobody"
08/23/10 06:11:50 Done moving to directory "/var/lib/condor/execute/dir_1129"
08/23/10 06:11:50 No StarterUserLog found in job ClassAd
08/23/10 06:11:50 Starter will not write a local UserLog
08/23/10 06:11:50 Create_Process: using fast clone() to create child process.
08/23/10 06:11:50 HOOK_PREPARE_JOB (/usr/libexec/condor/hooks/hook_prepare_job.py) invoked.
08/23/10 06:11:50 DaemonCore: in SendAliveToParent()
08/23/10 06:11:50 Initialized the following authorization table:
08/23/10 06:11:50 Authorizations yet to be resolved:
08/23/10 06:11:50 allow NEGOTIATOR: *
08/23/10 06:11:50 allow ADMINISTRATOR: *
08/23/10 06:11:50 allow OWNER:  *
08/23/10 06:11:50 DaemonCore: Leaving SendAliveToParent() - success
08/23/10 06:11:50 DaemonCore: No more children processes to reap.
08/23/10 06:11:50 HookClient /usr/libexec/condor/hooks/hook_prepare_job.py (pid 1130) exited with status 0
08/23/10 06:11:50 Prepare hook output classad
MyType = ""
TargetType = ""
08/23/10 06:11:50 After Prepare hook: merged job classad:
MyType = ""
TargetType = ""
MyType = ""
TargetType = ""
AMQPID = "63313339-3033-3761-2d33-3032322d3435"
WF_REQ_SLOT = "1" 
IsFeatched = TRUE 
Iwd = "/tmp"
Cmd = "/bin/sleep"
Arguments = "60"
JobUniverse = 5
Owner = "nobody"
HookKeyword = "LOW_LATENCY_JOB"
OrigCmd = "/bin/sleep"
OrigIwd = "/tmp"
08/23/10 06:11:50 Job 1.0 set to execute immediately
08/23/10 06:11:50 Starting a VANILLA universe job with ID: 1.0
08/23/10 06:11:50 In OsProc::OsProc()
08/23/10 06:11:50 Main job KillSignal: 15 (SIGTERM)
08/23/10 06:11:50 Main job RmKillSignal: 15 (SIGTERM)
08/23/10 06:11:50 Main job HoldKillSignal: 15 (SIGTERM)
08/23/10 06:11:50 in VanillaProc::StartJob()
08/23/10 06:11:50 in OsProc::StartJob()
08/23/10 06:11:50 IWD: /tmp
08/23/10 06:11:50 Input file: /dev/null
08/23/10 06:11:50 Output file: /dev/null
08/23/10 06:11:50 Error file: /dev/null
08/23/10 06:11:50 About to exec /bin/sleep 60
08/23/10 06:11:50 Env = TMP=/var/lib/condor/execute/dir_1129 _CONDOR_JOB_IWD=/tmp _CONDOR_SLOT=1 _CONDOR_MACHINE_AD=/var/lib/condor/execute/dir_1129/.machine.ad TEMP=/var/lib/condor/execute/dir_1129 TMPDIR=/var/lib/condor/execute/dir_1129 _CONDOR_SCRATCH_DIR=/var/lib/condor/execute/dir_1129 _CONDOR_JOB_AD=/var/lib/condor/execute/dir_1129/.job.ad _CONDOR_JOB_PIDS=
08/23/10 06:11:50 ENFORCE_CPU_AFFINITY not true, not setting affinity
08/23/10 06:11:50 Create_Process: using fast clone() to create child process.
08/23/10 06:11:50 Create_Process succeeded, pid=1139
08/23/10 06:11:59 In VanillaProc::PublishUpdateAd()
08/23/10 06:11:59 Inside OsProc::PublishUpdateAd()
08/23/10 06:11:59 Create_Process: using fast clone() to create child process.
08/23/10 06:11:59 HOOK_UPDATE_JOB_INFO (/usr/libexec/condor/hooks/hook_update_job_status.py) invoked.
08/23/10 06:11:59 DaemonCore: No more children processes to reap.
08/23/10 06:12:00 Hook (pid 1261) exited with status 1
08/23/10 06:12:30 In VanillaProc::PublishUpdateAd()
08/23/10 06:12:30 Inside OsProc::PublishUpdateAd()
08/23/10 06:12:30 Create_Process: using fast clone() to create child process.
08/23/10 06:12:30 HOOK_UPDATE_JOB_INFO (/usr/libexec/condor/hooks/hook_update_job_status.py) invoked.
08/23/10 06:12:30 DaemonCore: No more children processes to reap.
08/23/10 06:12:30 Hook (pid 1650) exited with status 1
08/23/10 06:12:50 DaemonCore: No more children processes to reap.
08/23/10 06:12:50 Process exited, pid=1139, status=0
08/23/10 06:12:50 in VanillaProc::JobReaper()
08/23/10 06:12:50 Inside OsProc::JobReaper()
08/23/10 06:12:50 Reaper: all=1 handled=1 ShuttingDown=0
08/23/10 06:12:50 In VanillaProc::PublishUpdateAd()
08/23/10 06:12:50 Inside OsProc::PublishUpdateAd()
08/23/10 06:12:50 Create_Process: using fast clone() to create child process.
08/23/10 06:12:51 HOOK_JOB_EXIT (/usr/libexec/condor/hooks/hook_job_exit.py) invoked with reason: "exit"
08/23/10 06:12:51 DaemonCore: No more children processes to reap.
08/23/10 06:12:51 HookClient /usr/libexec/condor/hooks/hook_job_exit.py (pid 1870) exited with status 0
08/23/10 06:12:51 Inside OsProc::JobExit()
08/23/10 06:12:51 In VanillaProc::PublishUpdateAd()
08/23/10 06:12:51 Inside OsProc::PublishUpdateAd()
08/23/10 06:12:51 All jobs have exited... starter exiting
08/23/10 06:12:51 Removing /var/lib/condor/execute/dir_1129
08/23/10 06:12:51 Attempting to remove /var/lib/condor/execute/dir_1129 as SuperUser (root)
08/23/10 06:12:51 **** condor_starter (condor_STARTER) pid 1129 EXITING WITH STATUS 0

Comment 1 Robert Rati 2010-08-23 13:38:44 UTC
There was a typo in the update status hook.

Fixed in:
condor-job-hooks-1.4-5

Comment 2 Martin Kudlej 2010-10-12 12:01:26 UTC
Tested on RHEL 5.5/4.8 x i386/x86_64 with 
condor-low-latency-1.1-0.2.el5
python-condorutils-1.4-5.el5
condor-job-hooks-1.4-5.el5
condor-7.4.4-0.16.el5
and it works. -->VERIFIED