Bug 848823 - Multiple schedds with same job_queue.log
Summary: Multiple schedds with same job_queue.log
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: Development
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: 2.3
Assignee: Timothy St. Clair
QA Contact: Lubos Trilety
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2012-08-16 13:53 UTC by Timothy St. Clair
Modified: 2013-03-19 16:37 UTC
CC: 4 users

Fixed In Version: condor-7.8.3-0.1
Doc Type: Bug Fix
Doc Text:
Cause: Submitting a single job to a machine running multiple scheduler daemons caused the job to appear in all queues (a collision).
Consequence: The job could be matched and run multiple times.
Fix: The JOB_QUEUE_LOG default was removed so the schedd code evaluates the proper value of SPOOL for each subsystem.
Result: No job collision.
Clone Of:
Environment:
Last Closed: 2013-03-19 16:37:40 UTC
Target Upstream Version:
Embargoed:




Links:
Condor ticket 3196 (last updated 2012-08-22 20:21:10 UTC)

Description Timothy St. Clair 2012-08-16 13:53:50 UTC
Email Chain describing the issue: 

There appears to be a change in behavior in Condor when multiple schedds
are defined. I have tested this with 7.7.5 and 7.8. It does not occur
in 7.6.6 and prior.

Test condition:
1. 3 schedds are defined
2. I submit 1 job.
3. condor_q -g shows 1 schedd queue with the job
4. I restart condor
5. condor_q -g shows the same job in all 3 schedd queues, and the copies are treated as independent jobs.

I use the same configuration for all 3 versions of Condor for the
secondary schedds:

SCHEDDJOBS2 = $(SCHEDD)
SCHEDDJOBS2_ARGS = -local-name scheddjobs2
SCHEDD.SCHEDDJOBS2.SCHEDD_NAME = schedd_jobs2
SCHEDD.SCHEDDJOBS2.SCHEDD_LOG = $(LOG)/SchedLog.$(SCHEDD.SCHEDDJOBS2.SCHEDD_NAME)
SCHEDD.SCHEDDJOBS2.LOCAL_DIR = $(LOCAL_DIR)/$(SCHEDD.SCHEDDJOBS2.SCHEDD_NAME)
SCHEDD.SCHEDDJOBS2.EXECUTE = $(SCHEDD.SCHEDDJOBS2.LOCAL_DIR)/execute
SCHEDD.SCHEDDJOBS2.LOCK = $(SCHEDD.SCHEDDJOBS2.LOCAL_DIR)/lock
SCHEDD.SCHEDDJOBS2.PROCD_ADDRESS = $(SCHEDD.SCHEDDJOBS2.LOCAL_DIR)/procd_pipe
SCHEDD.SCHEDDJOBS2.SPOOL = $(SCHEDD.SCHEDDJOBS2.LOCAL_DIR)/spool
SCHEDD.SCHEDDJOBS2.SCHEDD_ADDRESS_FILE = $(SCHEDD.SCHEDDJOBS2.SPOOL)/.schedd_address
SCHEDD.SCHEDDJOBS2.SCHEDD_DAEMON_AD_FILE = $(SCHEDD.SCHEDDJOBS2.SPOOL)/.schedd_classad
SCHEDDJOBS2_LOCAL_DIR_STRING = "$(SCHEDD.SCHEDDJOBS2.LOCAL_DIR)"
SCHEDD.SCHEDDJOBS2.SCHEDD_EXPRS = LOCAL_DIR_STRING
DAEMON_LIST = $(DAEMON_LIST), SCHEDDJOBS2

(An analogous block is defined for schedd3.) Both secondary schedds are then added to the DC daemon list:
DC_DAEMON_LIST = + SCHEDDJOBS2 SCHEDDJOBS3
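
For completeness, a minimal sketch of the schedd3 block, assuming it simply mirrors the SCHEDDJOBS2 naming pattern above (these lines are inferred, not copied from the original configuration):

SCHEDDJOBS3 = $(SCHEDD)
SCHEDDJOBS3_ARGS = -local-name scheddjobs3
SCHEDD.SCHEDDJOBS3.SCHEDD_NAME = schedd_jobs3
SCHEDD.SCHEDDJOBS3.LOCAL_DIR = $(LOCAL_DIR)/$(SCHEDD.SCHEDDJOBS3.SCHEDD_NAME)
SCHEDD.SCHEDDJOBS3.SPOOL = $(SCHEDD.SCHEDDJOBS3.LOCAL_DIR)/spool
DAEMON_LIST = $(DAEMON_LIST), SCHEDDJOBS3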


This works in 7.6.6 and prior, just not in 7.7.5 and 7.8.
Any ideas?

------------------------------------------------------------------------
re 1: ------------------------------------------------------------------
------------------------------------------------------------------------

First thought: somehow all the schedds are using the same spool.

When you restart them they should log something like "About to rotate ClassAd log /var/lib/condor/spool/job_queue.log". Make sure they're all processing a different job_queue.log.
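
One quick check, assuming the log locations from the configuration above (a sketch; condor_config_val resolves the configured LOG directory):

$ grep 'About to rotate ClassAd log' "$(condor_config_val LOG)"/SchedLog*

Each SchedLog should name a different job_queue.log; if they all report the same path, the schedds are sharing one queue file.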

------------------------------------------------------------------------
re 2: ------------------------------------------------------------------
------------------------------------------------------------------------
You were correct: the problem is the job_queue.log.

A JOB_QUEUE_LOG attribute was introduced in Condor 7.7.5 (ticket 2598: https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2598) and is documented in the 7.7 configuration manual:

http://research.cs.wisc.edu/condor/manual/v7.7/3_3Configuration.html#16343

Prior to the introduction of this feature, a job_queue.log was always maintained in the spool directory of each schedd. With this change it appears (either a bug or by design) that the job queue log of each additional schedd must be defined explicitly:

SCHEDD.SCHEDDJOBS2.JOB_QUEUE_LOG = $(SCHEDD.SCHEDDJOBS2.SPOOL)/job_queue.log

If it is not explicitly stated, only one job_queue.log is used. Hence, all jobs are assigned to all schedd queues on a restart.

------------------------------------------------------------------------
re 3: ------------------------------------------------------------------
------------------------------------------------------------------------

This looks like a bug (and a regression) IMHO.

src/condor_utils/param_info.in:
[JOB_QUEUE_LOG]
default=$(SPOOL)/job_queue.log

src/condor_schedd.V6/schedd_main.cpp:
// Initialize the job queue
char *job_queue_param_name = param("JOB_QUEUE_LOG");
if (job_queue_param_name == NULL) {
    // the default place for the job_queue.log is in spool
    job_queue_name.sprintf("%s/job_queue.log", Spool);
} else {
    job_queue_name = job_queue_param_name; // convert char * to MyString
    free(job_queue_param_name);
}

Because of the default, the fallback that builds the path from Spool is never hit. In the trace below, note that /tmp/spool_version honors the per-schedd SPOOL override while job_queue.log is still opened in the shared spool directory:

$ env _CONDOR_MATT.SPOOL=/tmp strace -e open condor_schedd -t -f -local-name matt 2>&1 | grep -e spool -e tmp
open("/home/matt/Documents/CondorInstallation/spool/.schedd_address.new", O_WRONLY) = -1 ENOENT (No such file or directory)
open("/home/matt/Documents/CondorInstallation/spool/.schedd_address.new", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0644) = 7
open("/home/matt/Documents/CondorInstallation/spool/.schedd_address.new", O_WRONLY) = -1 ENOENT (No such file or directory)
open("/home/matt/Documents/CondorInstallation/spool/.schedd_address.new", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0644) = 7
08/16/12 07:27:27 (pid:14896) initLocalStarterDir: /home/matt/Documents/CondorInstallation/spool/local_univ_execute already exists, deleting old contents
open("/tmp/spool_version", O_RDONLY)    = 11
open("/home/matt/Documents/CondorInstallation/spool/job_queue.log", O_RDWR) = 11

A workaround is to set JOB_QUEUE_LOG= (an empty value):

$ env _CONDOR_MATT.SPOOL=/tmp _CONDOR_JOB_QUEUE_LOG= strace -e open condor_schedd -t -f -local-name matt 2>&1 | grep -e spool -e tmp
open("/home/matt/Documents/CondorInstallation/spool/.schedd_address.new", O_WRONLY) = -1 ENOENT (No such file or directory)
open("/home/matt/Documents/CondorInstallation/spool/.schedd_address.new", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0644) = 7
open("/home/matt/Documents/CondorInstallation/spool/.schedd_address.new", O_WRONLY) = -1 ENOENT (No such file or directory)
open("/home/matt/Documents/CondorInstallation/spool/.schedd_address.new", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0644) = 7
08/16/12 07:28:31 (pid:14908) initLocalStarterDir: /home/matt/Documents/CondorInstallation/spool/local_univ_execute already exists, deleting old contents
open("/tmp/spool_version", O_RDONLY)    = 11
open("/tmp/job_queue.log", O_RDWR)      = -1 ENOENT (No such file or directory)
open("/tmp/job_queue.log", O_RDWR|O_CREAT|O_EXCL, 0600) = 11

Note: SCHEDD_ADDRESS_FILE also has a default (defined in condor_config) of $(SPOOL)/.schedd_address.
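
The eventual fix (condor-7.8.3-0.1, per the Fixed In Version and Doc Text fields above) removes the default so that param("JOB_QUEUE_LOG") returns NULL when unset and the per-schedd Spool fallback runs again. A sketch of the param_info.in side of the change, assuming the layout quoted above:

src/condor_utils/param_info.in:
[JOB_QUEUE_LOG]
# default removed; schedd_main.cpp now builds "%s/job_queue.log" from
# each schedd's own, subsystem-evaluated SPOOL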

Comment 1 Timothy St. Clair 2012-08-16 14:18:50 UTC
The verification will be very easy:

1. Set up a multi-schedd config as listed above.
2. Submit jobs on the older dev version; note the collision.
3. Submit jobs on the newer dev version; no collision (see the sketch below).
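
A sketch of the manual check (sleep.sub stands in for any one-job submit file; condor_q -g queries all schedd queues):

$ condor_submit sleep.sub   # submit a single job to the default schedd
$ condor_q -g               # the job appears in exactly one schedd queue
$ condor_restart            # restart all Condor daemons
$ condor_q -g               # broken: the same job shows up in all 3 queues
                            # fixed: still exactly one job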

Comment 5 Lubos Trilety 2013-02-07 13:47:52 UTC
Successfully reproduced with:
condor-7.8.2-0.3

Tested with:
condor-7.8.8-0.4.1

Tested on:
RHEL5 x86_64,i386
RHEL6 x86_64,i386

Tested using an automated script.

>>> verified

