Bug 852537

Summary: RHHAv2 won't run jobs
Product: Red Hat Enterprise MRG
Reporter: Robert Rati <rrati>
Component: condor-cluster-resource-agent
Assignee: Robert Rati <rrati>
Status: CLOSED ERRATA
QA Contact: Tomas Rusnak <trusnak>
Severity: unspecified
Priority: unspecified
Version: 2.2
CC: iboverma, jneedle, ltrilety, matt, mkudlej, trusnak, tstclair
Target Milestone: 2.2   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: condor-7.6.5-0.22
Doc Type: Bug Fix
Last Closed: 2012-09-19 18:26:12 UTC
Type: Bug
Bug Depends On: 853945    
Bug Blocks: 785145    

Description Robert Rati 2012-08-28 20:34:51 UTC
Description of problem:
When a job in a RHHA controlled scheduler attempts to run, the shadow errors out with:

ERROR "According to /var/lib/condor/spool/spool_version, the SPOOL directory is written in spool version 0, but I only support versions back to 1.
" at line 67 in file /builddir/build/BUILD/condor-7.6.4/src/condor_utils/spool_version.cpp

The issue is that the shadow isn't using the same spool directory as the schedd.

Comment 1 Robert Rati 2012-08-28 20:40:58 UTC
The error is caused by a missing spool_version file in $SPOOL.  This isn't usually seen because the default installation is a Personal Condor with a schedd, and the first time condor is started in that configuration the schedd creates the spool_version file in $SPOOL.  When RHHA controls the schedd, the schedd's spool location is somewhere other than SPOOL.  The shadow still looks at SPOOL and finds /var/lib/condor/spool (the default) instead of the SCHEDD.<name>.SPOOL value.  As long as the spool_version file in the default directory exists, the shadow does not produce the above error; the error appears only once that file is deleted.

As a result, in the best case the shadow works from a spool directory that is not the one the schedd uses, and in the worst case (the spool_version file has been deleted) the shadow refuses to run jobs at all.
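
The lookup mismatch described in this comment can be modelled with a short sketch. This is not Condor's actual configuration code: `resolve_param`, the dictionary-based config, the schedd name `ha_schedd1` and the path `/mnt/ha/spool` are all assumptions made for illustration. It only shows why a daemon that asks for plain SPOOL gets the default directory, while a lookup qualified with the schedd's subsystem and name gets the RHHA-managed one.

```python
# Simplified model of a config lookup with daemon-specific overrides.
# NOT Condor's real implementation; names and paths are hypothetical.

def resolve_param(config, param, subsys=None, local_name=None):
    """Return the most specific value of `param` found in `config`.

    Tries SUBSYS.LOCALNAME.PARAM first, then falls back to plain PARAM.
    """
    if subsys and local_name:
        specific = f"{subsys}.{local_name}.{param}"
        if specific in config:
            return config[specific]
    return config.get(param)


config = {
    "SPOOL": "/var/lib/condor/spool",            # default location
    "SCHEDD.ha_schedd1.SPOOL": "/mnt/ha/spool",  # hypothetical RHHA spool
}

# The schedd resolves SPOOL with its subsystem and name ...
schedd_spool = resolve_param(config, "SPOOL", "SCHEDD", "ha_schedd1")
# ... while an unqualified lookup only ever sees the default.
shadow_spool = resolve_param(config, "SPOOL")

print(schedd_spool)
print(shadow_spool)
```

In this model the two daemons end up in different directories, which matches the best-case and worst-case outcomes described above.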

Comment 2 Robert Rati 2012-08-28 20:44:12 UTC
1) Configure RHHA to manage a schedd
2) Remove /var/lib/condor/spool/spool_version
3) Submit a job like:
echo -e "cmd = /bin/sleep\nargs=300\nRequirements=(FileSystemDomain =!= UNDEFINED && Arch =!= UNDEFINED)\nshould_transfer_files=if_needed\nwhen_to_transfer_output=on_exit\nqueue 5" | su condor -s /bin/bash -c "condor_submit -name ha-schedd-schedd1@"
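
Before submitting, it can help to confirm which spool directories actually contain a spool_version file. The following is a minimal sketch: `has_spool_version` is a hypothetical helper, and `/path/to/ha/spool` is a placeholder for wherever the RHHA-managed schedd keeps its spool. It only checks that the file exists and does not interpret its contents.

```python
import os

def has_spool_version(spool_dir):
    """Return True if `spool_dir` contains a regular file named spool_version."""
    return os.path.isfile(os.path.join(spool_dir, "spool_version"))

if __name__ == "__main__":
    # The second path is a placeholder, not a real default.
    for spool in ("/var/lib/condor/spool", "/path/to/ha/spool"):
        state = "present" if has_spool_version(spool) else "missing"
        print(f"{spool}/spool_version: {state}")
```

After step 2 the default directory should report the file as missing, which is the condition that triggers the shadow error.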

Comment 4 Jeff Needle 2012-08-29 11:47:46 UTC
*** Bug 852321 has been marked as a duplicate of this bug. ***

Comment 5 Robert Rati 2012-08-30 14:39:14 UTC
The configuration of the shadow isn't very flexible, and dealing with multiple schedds with multiple SPOOL directories is more cumbersome to configure than it should be.  The Gridmanager, which is also spawned by the schedd, could run into this issue as well, since it looks in SPOOL for an executable in some situations.  The tools now point the shadow and gridmanager at the correct spool for their respective schedds by setting _CONDOR_SPOOL in the schedd's environment.  The shadows also use per-schedd log and lock files.

Fixed on:
boysenberry-BZ852537-shadow-spool
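
The approach in this comment relies on child processes inheriting the schedd's environment, and on Condor treating `_CONDOR_`-prefixed environment variables as configuration overrides. The sketch below demonstrates only the inheritance half, using a stand-in child process rather than a real shadow. `env_with_spool`, `child_sees_spool` and the path `/mnt/ha/spool` are hypothetical and are not taken from the resource agent.

```python
import os
import subprocess
import sys

def env_with_spool(spool_dir, base_env=None):
    """Return a copy of the environment with _CONDOR_SPOOL set to `spool_dir`."""
    env = dict(os.environ if base_env is None else base_env)
    env["_CONDOR_SPOOL"] = spool_dir
    return env

def child_sees_spool(env):
    """Start a stand-in child process and return the _CONDOR_SPOOL it inherited."""
    code = "import os; print(os.environ.get('_CONDOR_SPOOL', ''))"
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

parent_env = env_with_spool("/mnt/ha/spool")  # hypothetical HA spool path
inherited = child_sees_spool(parent_env)
print(inherited)
```

Setting the variable once on the parent means every process it spawns sees the same spool without needing per-daemon configuration entries.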

Comment 6 Tomas Rusnak 2012-09-03 10:35:18 UTC
Reproduced on:
$CondorVersion: 7.6.5 Jul 12 2012 BuildID: RH-7.6.5-0.18.el6 $
$CondorPlatform: X86_64-RedHat_6.2 $

# condor_q
-- Submitter: ha-schedd-ha_schedd1@ : <IP:54999> : rhel-ha-1.nerd.usersys.redhat.com
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               

0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended

# rm /var/lib/condor/spool/spool_version
rm: remove regular file `/var/lib/condor/spool/spool_version'? y

# sudo -u test  condor_submit -name ha-schedd-ha_schedd1@ bz852537.job 
Submitting job(s).....
5 job(s) submitted to cluster 8.

# tail -f /var/log/condor/ShadowLog
09/03/12 10:32:06 ERROR "According to /var/lib/condor/spool/spool_version, the SPOOL directory is written in spool version 0, but I only support versions back to 1.
" at line 67 in file /builddir/build/BUILD/condor-7.6.4/src/condor_utils/spool_version.cpp

# condor_q -globa
-- Schedd: ha-schedd-ha_schedd1@ : <IP:54999>
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   8.0   test            9/3  10:17   0+00:00:00 I  0   0.0  sleep 300         
   8.1   test            9/3  10:17   0+00:00:00 I  0   0.0  sleep 300         
   8.2   test            9/3  10:17   0+00:00:00 I  0   0.0  sleep 300         
   8.3   test            9/3  10:17   0+00:00:00 I  0   0.0  sleep 300         
   8.4   test            9/3  10:17   0+00:00:00 I  0   0.0  sleep 300         

5 jobs; 0 completed, 0 removed, 5 idle, 0 running, 0 held, 0 suspended

Comment 7 Tomas Rusnak 2012-09-05 14:51:36 UTC
Retested with:
$CondorVersion: 7.6.5 Aug 30 2012 BuildID: RH-7.6.5-0.22.el6 $
$CondorPlatform: X86_64-RedHat_6.3 $

# cat bz852537.job 
cmd=/bin/sleep
args=300
Requirements=(FileSystemDomain =!= UNDEFINED && Arch =!= UNDEFINED)
queue 5

# condor_q
-- Submitter: ha-schedd-ha_schedd1@ : <IP:56273> : rhel-ha-1
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
  17.0   test            9/5  14:48   0+00:01:02 S  0   4.2  sleep 300         
  17.1   test            9/5  14:48   0+00:01:02 R  0   0.0  sleep 300         
  17.2   test            9/5  14:48   0+00:00:00 H  0   0.0  sleep 300         
  17.3   test            9/5  14:48   0+00:00:00 H  0   0.0  sleep 300         
  17.4   test            9/5  14:48   0+00:00:00 H  0   0.0  sleep 300         

5 jobs; 0 completed, 0 removed, 0 idle, 1 running, 3 held, 1 suspended


ShadowLog:
09/05/12 14:42:38 Using config source: /etc/condor/condor_config
09/05/12 14:42:38 Using local config sources: 
09/05/12 14:42:38    /etc/condor/config.d/00personal_condor.config
09/05/12 14:42:38    /etc/condor/config.d/50ha.config
09/05/12 14:42:38    /etc/condor/config.d/60condor-qmf.config
09/05/12 14:42:38    /etc/condor/config.d/61aviary.config
09/05/12 14:42:38    /etc/condor/config.d/99configd.config
09/05/12 14:42:38    /var/lib/condor/wallaby_node.config

No errors or complaints about spool_version were found in the ShadowLog.

>>> VERIFIED