Bug 852537 - RHHAv2 won't run jobs
Status: CLOSED ERRATA
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor-cluster-resource-agent
Version: 2.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: 2.2
Assigned To: Robert Rati
QA Contact: Tomas Rusnak
Depends On: 853945
Blocks: 785145
Reported: 2012-08-28 16:34 EDT by Robert Rati
Modified: 2012-09-25 04:59 EDT (History)
Fixed In Version: condor-7.6.5-0.22
Doc Type: Bug Fix
Last Closed: 2012-09-19 14:26:12 EDT
Type: Bug


Description Robert Rati 2012-08-28 16:34:51 EDT
Description of problem:
When a job in an RHHA-controlled scheduler attempts to run, the shadow errors out with:

ERROR "According to /var/lib/condor/spool/spool_version, the SPOOL directory is written in spool version 0, but I only support versions back to 1.
" at line 67 in file /builddir/build/BUILD/condor-7.6.4/src/condor_utils/spool_version.cpp

The issue is that the shadow isn't using the same spool directory as the schedd.

Comment 1 Robert Rati 2012-08-28 16:40:58 EDT
The error is caused by the lack of a spool_version file in $SPOOL.  This isn't usually seen because the default installation is a Personal Condor with a schedd.  The first time condor is started in that configuration, the schedd creates the spool_version file in $SPOOL.  When RHHA controls the schedd, the schedd's spool location is someplace other than SPOOL.  The shadow looks at SPOOL and finds /var/lib/condor/spool (the default) instead of the SCHEDD.<name>.SPOOL value.  As long as the spool_version file created by the default installation is still there, the shadow does not produce the above error; the error appears only once that file is deleted.

The result is that in the best case the shadow works off a spool directory that is not the one the schedd uses, and in the worst case (the spool_version file is deleted) the shadow refuses to run jobs at all.
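
The mismatch described in this comment can be illustrated with a small model. This is only a sketch, not Condor's actual lookup code: the config dictionary, the function names, the schedd name "ha_schedd1" and the path /mnt/ha/schedd1/spool are all hypothetical, chosen for illustration.

```python
DEFAULT_SPOOL = "/var/lib/condor/spool"  # default location named in the error message


def schedd_spool(config, schedd_name):
    """Spool of an RHHA-managed schedd: SCHEDD.<name>.SPOOL if set, else SPOOL."""
    key = "SCHEDD.%s.SPOOL" % schedd_name
    return config.get(key, config.get("SPOOL", DEFAULT_SPOOL))


def shadow_spool_before_fix(config):
    """Spool the shadow looked at before the fix: plain SPOOL only."""
    return config.get("SPOOL", DEFAULT_SPOOL)


def spools_differ(config, schedd_name):
    """True when the shadow and its schedd disagree about the spool directory."""
    return schedd_spool(config, schedd_name) != shadow_spool_before_fix(config)


# Hypothetical configuration: only the per-schedd spool is set (assumption).
example_config = {"SCHEDD.ha_schedd1.SPOOL": "/mnt/ha/schedd1/spool"}
```

With this example configuration, spools_differ(example_config, "ha_schedd1") is true: the schedd uses its own spool while the shadow falls back to the default directory, which is the situation reported here.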
Comment 2 Robert Rati 2012-08-28 16:44:12 EDT
1) Reproduce by configuring RHHA to manage a schedd
2) Remove /var/lib/condor/spool/spool_version
3) Submit a job like:
echo -e "cmd = /bin/sleep\nargs=300\nRequirements=(FileSystemDomain =!= UNDEFINED && Arch =!= UNDEFINED)\nshould_transfer_files=if_needed\nwhen_to_transfer_output=on_exit\nqueue 5" | su condor -s /bin/bash -c "condor_submit -name ha-schedd-schedd1@"
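
For readability, the submit description that the one-liner in step 3 pipes into condor_submit can be written out line by line. The sketch below only builds the text and the command line from that step; it does not run anything, and the helper names are hypothetical.

```python
def build_submit_description(queue_count=5):
    """Return the submit description from step 3 as multi-line text."""
    lines = [
        "cmd = /bin/sleep",
        "args=300",
        "Requirements=(FileSystemDomain =!= UNDEFINED && Arch =!= UNDEFINED)",
        "should_transfer_files=if_needed",
        "when_to_transfer_output=on_exit",
        "queue %d" % queue_count,
    ]
    return "\n".join(lines) + "\n"


def build_submit_command(schedd_name="ha-schedd-schedd1@"):
    """Return the argv from step 3: run condor_submit as the condor user."""
    return ["su", "condor", "-s", "/bin/bash", "-c",
            "condor_submit -name %s" % schedd_name]
```

The text returned by build_submit_description() is what would be fed to the command on standard input, for example via subprocess.run(..., input=...).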
Comment 4 Jeff Needle 2012-08-29 07:47:46 EDT
*** Bug 852321 has been marked as a duplicate of this bug. ***
Comment 5 Robert Rati 2012-08-30 10:39:14 EDT
The configuration of the shadow isn't very flexible, and dealing with multiple schedds with multiple SPOOL directories is more cumbersome to configure than it should be.  The Gridmanager, which is also spawned by the schedd, could run into the same issue since it looks in SPOOL for an executable in some situations.  The tools now point the shadow and gridmanager at the correct spool for their respective schedds by setting _CONDOR_SPOOL in the schedd's environment.  The shadows now also use per-schedd log and lock files.

Fixed on:
boysenberry-BZ852537-shadow-spool
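
The approach in this comment relies on Condor treating _CONDOR_<PARAM> environment variables as configuration overrides, so a value set in the schedd's environment is inherited by the shadow and gridmanager it spawns. A minimal sketch of that precedence follows. It models the idea only and is not the code from the fix branch; the helper names and the spool path are hypothetical.

```python
DEFAULT_SPOOL = "/var/lib/condor/spool"


def env_for_schedd(spool_dir, base_env):
    """Copy base_env and add _CONDOR_SPOOL, as the tools do for each schedd."""
    env = dict(base_env)
    env["_CONDOR_SPOOL"] = spool_dir
    return env


def resolve_spool(config, env):
    """Model of the lookup in a child daemon: environment override first,
    then the SPOOL value from the config files, then the default."""
    if "_CONDOR_SPOOL" in env:
        return env["_CONDOR_SPOOL"]
    return config.get("SPOOL", DEFAULT_SPOOL)


# Hypothetical per-schedd spool directory (assumption, not from the report).
schedd_env = env_for_schedd("/mnt/ha/schedd1/spool", {"PATH": "/usr/bin"})
```

In this model a shadow started with schedd_env resolves the per-schedd spool, while one started without the variable still falls back to SPOOL or the default.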
Comment 6 Tomas Rusnak 2012-09-03 06:35:18 EDT
Reproduced on:
$CondorVersion: 7.6.5 Jul 12 2012 BuildID: RH-7.6.5-0.18.el6 $
$CondorPlatform: X86_64-RedHat_6.2 $

# condor_q
-- Submitter: ha-schedd-ha_schedd1@ : <IP:54999> : rhel-ha-1.nerd.usersys.redhat.com
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               

0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended

# rm /var/lib/condor/spool/spool_version
rm: remove regular file `/var/lib/condor/spool/spool_version'? y

# sudo -u test  condor_submit -name ha-schedd-ha_schedd1@ bz852537.job 
Submitting job(s).....
5 job(s) submitted to cluster 8.

# tail -f /var/log/condor/ShadowLog
09/03/12 10:32:06 ERROR "According to /var/lib/condor/spool/spool_version, the SPOOL directory is written in spool version 0, but I only support versions back to 1.
" at line 67 in file /builddir/build/BUILD/condor-7.6.4/src/condor_utils/spool_version.cpp

# condor_q -globa
-- Schedd: ha-schedd-ha_schedd1@ : <IP:54999>
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   8.0   test            9/3  10:17   0+00:00:00 I  0   0.0  sleep 300         
   8.1   test            9/3  10:17   0+00:00:00 I  0   0.0  sleep 300         
   8.2   test            9/3  10:17   0+00:00:00 I  0   0.0  sleep 300         
   8.3   test            9/3  10:17   0+00:00:00 I  0   0.0  sleep 300         
   8.4   test            9/3  10:17   0+00:00:00 I  0   0.0  sleep 300         

5 jobs; 0 completed, 0 removed, 5 idle, 0 running, 0 held, 0 suspended
Comment 7 Tomas Rusnak 2012-09-05 10:51:36 EDT
Retested with:
$CondorVersion: 7.6.5 Aug 30 2012 BuildID: RH-7.6.5-0.22.el6 $
$CondorPlatform: X86_64-RedHat_6.3 $

# cat bz852537.job 
cmd=/bin/sleep
args=300
Requirements=(FileSystemDomain =!= UNDEFINED && Arch =!= UNDEFINED)
queue 5

# condor_q
-- Submitter: ha-schedd-ha_schedd1@ : <IP:56273> : rhel-ha-1
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
  17.0   test            9/5  14:48   0+00:01:02 S  0   4.2  sleep 300         
  17.1   test            9/5  14:48   0+00:01:02 R  0   0.0  sleep 300         
  17.2   test            9/5  14:48   0+00:00:00 H  0   0.0  sleep 300         
  17.3   test            9/5  14:48   0+00:00:00 H  0   0.0  sleep 300         
  17.4   test            9/5  14:48   0+00:00:00 H  0   0.0  sleep 300         

5 jobs; 0 completed, 0 removed, 0 idle, 1 running, 3 held, 1 suspended


ShadowLog:
09/05/12 14:42:38 Using config source: /etc/condor/condor_config
09/05/12 14:42:38 Using local config sources: 
09/05/12 14:42:38    /etc/condor/config.d/00personal_condor.config
09/05/12 14:42:38    /etc/condor/config.d/50ha.config
09/05/12 14:42:38    /etc/condor/config.d/60condor-qmf.config
09/05/12 14:42:38    /etc/condor/config.d/61aviary.config
09/05/12 14:42:38    /etc/condor/config.d/99configd.config
09/05/12 14:42:38    /var/lib/condor/wallaby_node.config

No errors or complaints about spool_version in ShadowLog found.

>>> VERIFIED
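
The manual check of ShadowLog used for verification could also be scripted. The following is a rough sketch, assuming the error keeps the wording quoted in this report; the function name and the regular expression are illustrative, not part of any Condor tool.

```python
import re

# Matches the start of the error quoted in this report.
SPOOL_VERSION_ERROR = re.compile(r'ERROR "According to \S*spool_version,')


def find_spool_version_errors(log_lines):
    """Return the log lines that contain the spool_version error."""
    return [line.rstrip("\n") for line in log_lines
            if SPOOL_VERSION_ERROR.search(line)]
```

It can be applied to an open log file, for example find_spool_version_errors(open("/var/log/condor/ShadowLog")); an empty list corresponds to the "no errors" result above.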
