Bug 852537
Summary: | RHHAv2 won't run jobs | | |
---|---|---|---|
Product: | Red Hat Enterprise MRG | Reporter: | Robert Rati <rrati> |
Component: | condor-cluster-resource-agent | Assignee: | Robert Rati <rrati> |
Status: | CLOSED ERRATA | QA Contact: | Tomas Rusnak <trusnak> |
Severity: | unspecified | Docs Contact: | |
Priority: | unspecified | | |
Version: | 2.2 | CC: | iboverma, jneedle, ltrilety, matt, mkudlej, trusnak, tstclair |
Target Milestone: | 2.2 | | |
Target Release: | --- | | |
Hardware: | Unspecified | | |
OS: | Unspecified | | |
Whiteboard: | | | |
Fixed In Version: | condor-7.6.5-0.22 | Doc Type: | Bug Fix |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2012-09-19 18:26:12 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
Bug Depends On: | 853945 | | |
Bug Blocks: | 785145 | | |
Description
Robert Rati
2012-08-28 20:34:51 UTC
The error is caused by a missing spool_version file in $SPOOL. This isn't usually seen because the default installation is a Personal Condor with a schedd. The first time condor is started in that configuration, the schedd creates the spool_version file in $SPOOL.

When RHHA controls the schedd, the spool location is somewhere other than SPOOL. The shadow looks at SPOOL and finds /var/lib/condor/spool (the default) instead of the SCHEDD.<name>.SPOOL value. Unless the spool_version file in that default location is deleted, the shadow does not produce this error. As a result, in the best case the shadow works off a spool directory that is not the same as the schedd's, and in the worst case (the spool_version file has been deleted) the shadow refuses to run jobs at all.

Steps to reproduce:
1) Configure RHHA to manage a schedd
2) Remove /var/lib/condor/spool/spool_version
3) Submit a job like:
echo -e "cmd = /bin/sleep\nargs=300\nRequirements=(FileSystemDomain =!= UNDEFINED && Arch =!= UNDEFINED)\nshould_transfer_files=if_needed\nwhen_to_transfer_output=on_exit\nqueue 5" | su condor -s /bin/bash -c "condor_submit -name ha-schedd-schedd1@"

*** Bug 852321 has been marked as a duplicate of this bug. ***

The configuration of the shadow isn't very flexible, and dealing with multiple schedds with multiple SPOOL directories is more cumbersome to configure than it should be. The Gridmanager, which is also spawned by the schedd, could run into the same issue, since it looks in SPOOL for an executable in some situations.

The tools now point the shadow and gridmanager at the correct spool for their respective schedds by setting _CONDOR_SPOOL in the schedd's environment. The shadows also use per-schedd log and lock files.
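The submit step from the reproduction above can also be sketched as a small script, which is easier to read than the one-line echo. This is only an illustration: the SUBMIT_FILE and SCHEDD_NAME defaults and the write_submit_file helper are assumptions made for the sketch, and unlike the original command it does not switch to the condor user.

```shell
#!/bin/sh
# Sketch: write the submit description from reproduction step 3 to a file.
# SUBMIT_FILE, SCHEDD_NAME and write_submit_file are illustrative names,
# not part of the fix or of the HTCondor tools.
SUBMIT_FILE="${SUBMIT_FILE:-./bz852537.job}"
SCHEDD_NAME="${SCHEDD_NAME:-ha-schedd-schedd1@}"

write_submit_file() {
    # The quoted heredoc delimiter keeps "&&" and "=!=" literal.
    cat > "$1" <<'EOF'
cmd = /bin/sleep
args = 300
Requirements = (FileSystemDomain =!= UNDEFINED && Arch =!= UNDEFINED)
should_transfer_files = if_needed
when_to_transfer_output = on_exit
queue 5
EOF
}

write_submit_file "$SUBMIT_FILE"

# Submit only when the HTCondor tools are present on this host.
if command -v condor_submit >/dev/null 2>&1; then
    condor_submit -name "$SCHEDD_NAME" "$SUBMIT_FILE"
else
    echo "condor_submit not found; wrote $SUBMIT_FILE only"
fi
```

To follow the original reproduction more closely, run the script as the condor user after removing /var/lib/condor/spool/spool_version.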
Fixed on: boysenberry-BZ852537-shadow-spool

Reproduced on:
$CondorVersion: 7.6.5 Jul 12 2012 BuildID: RH-7.6.5-0.18.el6 $
$CondorPlatform: X86_64-RedHat_6.2 $

# condor_q
-- Submitter: ha-schedd-ha_schedd1@ : <IP:54999> : rhel-ha-1.nerd.usersys.redhat.com
 ID      OWNER   SUBMITTED     RUN_TIME ST PRI SIZE CMD
0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended

# rm /var/lib/condor/spool/spool_version
rm: remove regular file `/var/lib/condor/spool/spool_version'? y

# sudo -u test condor_submit -name ha-schedd-ha_schedd1@ bz852537.job
Submitting job(s).....
5 job(s) submitted to cluster 8.

# tail -f /var/log/condor/ShadowLog
09/03/12 10:32:06 ERROR "According to /var/lib/condor/spool/spool_version, the SPOOL directory is written in spool version 0, but I only support versions back to 1. " at line 67 in file /builddir/build/BUILD/condor-7.6.4/src/condor_utils/spool_version.cpp

# condor_q -globa
-- Schedd: ha-schedd-ha_schedd1@ : <IP:54999>
 ID      OWNER   SUBMITTED     RUN_TIME ST PRI SIZE CMD
 8.0     test    9/3  10:17  0+00:00:00 I  0   0.0  sleep 300
 8.1     test    9/3  10:17  0+00:00:00 I  0   0.0  sleep 300
 8.2     test    9/3  10:17  0+00:00:00 I  0   0.0  sleep 300
 8.3     test    9/3  10:17  0+00:00:00 I  0   0.0  sleep 300
 8.4     test    9/3  10:17  0+00:00:00 I  0   0.0  sleep 300
5 jobs; 0 completed, 0 removed, 5 idle, 0 running, 0 held, 0 suspended

Retested with:
$CondorVersion: 7.6.5 Aug 30 2012 BuildID: RH-7.6.5-0.22.el6 $
$CondorPlatform: X86_64-RedHat_6.3 $

# cat bz852537.job
cmd=/bin/sleep
args=300
Requirements=(FileSystemDomain =!= UNDEFINED && Arch =!= UNDEFINED)
queue 5

# condor_q
-- Submitter: ha-schedd-ha_schedd1@ : <IP:56273> : rhel-ha-1
 ID      OWNER   SUBMITTED     RUN_TIME ST PRI SIZE CMD
 17.0    test    9/5  14:48  0+00:01:02 S  0   4.2  sleep 300
 17.1    test    9/5  14:48  0+00:01:02 R  0   0.0  sleep 300
 17.2    test    9/5  14:48  0+00:00:00 H  0   0.0  sleep 300
 17.3    test    9/5  14:48  0+00:00:00 H  0   0.0  sleep 300
 17.4    test    9/5  14:48  0+00:00:00 H  0   0.0  sleep 300
5 jobs; 0 completed, 0 removed, 0 idle, 1 running, 3 held, 1 suspended

ShadowLog:
09/05/12 14:42:38 Using config source: /etc/condor/condor_config
09/05/12 14:42:38 Using local config sources:
09/05/12 14:42:38 /etc/condor/config.d/00personal_condor.config
09/05/12 14:42:38 /etc/condor/config.d/50ha.config
09/05/12 14:42:38 /etc/condor/config.d/60condor-qmf.config
09/05/12 14:42:38 /etc/condor/config.d/61aviary.config
09/05/12 14:42:38 /etc/condor/config.d/99configd.config
09/05/12 14:42:38 /var/lib/condor/wallaby_node.config

No errors or complaints about spool_version in ShadowLog found.

>>> VERIFIED
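The final check in the verification above (no spool_version complaints in ShadowLog) could be scripted roughly as follows. This is a minimal sketch, not the procedure QA actually used: the spool_version_errors helper and the SHADOW_LOG variable are hypothetical names, and the default log path is simply the one shown in the transcript.

```shell
#!/bin/sh
# Sketch: report whether a ShadowLog contains spool_version complaints.
# SHADOW_LOG and spool_version_errors are illustrative names; the default
# path is the one used in the verification transcript.
SHADOW_LOG="${SHADOW_LOG:-/var/log/condor/ShadowLog}"

# Prints matching lines and returns 0 when at least one line mentions
# spool_version; returns non-zero when none do or the file is unreadable.
spool_version_errors() {
    grep 'spool_version' "$1" 2>/dev/null
}

if [ -r "$SHADOW_LOG" ]; then
    if spool_version_errors "$SHADOW_LOG"; then
        echo "FAIL: spool_version complaints found in $SHADOW_LOG"
    else
        echo "OK: no spool_version complaints in $SHADOW_LOG"
    fi
else
    echo "SKIP: $SHADOW_LOG is not readable on this host"
fi
```

A plain substring match is enough here because the error message quoted earlier names the spool_version file directly; a stricter check could match on the ERROR prefix as well.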