Email chain describing the issue:

There appears to be a change in behavior in Condor when multiple schedds are defined. I have tested this with 7.7.5 and 7.8; it does not occur in 7.6.6 and prior.

Test condition:
1. 3 schedds are defined.
2. I submit 1 job.
3. condor_q -g shows 1 schedd queue with the job.
4. I restart Condor.
5. condor_q -g shows the same job in all 3 schedd queues and treats them as independent jobs.

I use the same configuration for all 3 versions of Condor for the secondary schedds:

SCHEDDJOBS2 = $(SCHEDD)
SCHEDDJOBS2_ARGS = -local-name scheddjobs2
SCHEDD.SCHEDDJOBS2.SCHEDD_NAME = schedd_jobs2
SCHEDD.SCHEDDJOBS2.SCHEDD_LOG = $(LOG)/SchedLog.$(SCHEDD.SCHEDDJOBS2.SCHEDD_NAME)
SCHEDD.SCHEDDJOBS2.LOCAL_DIR = $(LOCAL_DIR)/$(SCHEDD.SCHEDDJOBS2.SCHEDD_NAME)
SCHEDD.SCHEDDJOBS2.EXECUTE = $(SCHEDD.SCHEDDJOBS2.LOCAL_DIR)/execute
SCHEDD.SCHEDDJOBS2.LOCK = $(SCHEDD.SCHEDDJOBS2.LOCAL_DIR)/lock
SCHEDD.SCHEDDJOBS2.PROCD_ADDRESS = $(SCHEDD.SCHEDDJOBS2.LOCAL_DIR)/procd_pipe
SCHEDD.SCHEDDJOBS2.SPOOL = $(SCHEDD.SCHEDDJOBS2.LOCAL_DIR)/spool
SCHEDD.SCHEDDJOBS2.SCHEDD_ADDRESS_FILE = $(SCHEDD.SCHEDDJOBS2.SPOOL)/.schedd_address
SCHEDD.SCHEDDJOBS2.SCHEDD_DAEMON_AD_FILE = $(SCHEDD.SCHEDDJOBS2.SPOOL)/.schedd_classad
SCHEDDJOBS2_LOCAL_DIR_STRING = "$(SCHEDD.SCHEDDJOBS2.LOCAL_DIR)"
SCHEDD.SCHEDDJOBS2.SCHEDD_EXPRS = LOCAL_DIR_STRING
DAEMON_LIST = $(DAEMON_LIST), SCHEDDJOBS2

(same for schedd3)

DC_DAEMON_LIST = + SCHEDDJOBS2 SCHEDDJOBS3

This works in 7.6.6 and prior, just not in 7.7.5 and 7.8. Any ideas?

------------------------------------------------------------------------
re 1:
------------------------------------------------------------------------

First thought: somehow all the schedds are using the same spool. When you restart them they should log something like "About to rotate ClassAd log /var/lib/condor/spool/job_queue.log". Make sure they're all processing a different job_queue.log.
------------------------------------------------------------------------
re 2:
------------------------------------------------------------------------

You were correct that the problem is the job_queue.log. A JOB_QUEUE_LOG attribute was introduced in Condor 7.7.5 (ticket 2598):

https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2598
http://research.cs.wisc.edu/condor/manual/v7.7/3_3Configuration.html#16343

Prior to the introduction of this feature, a job_queue.log was always maintained in the spool directory of each schedd. With this change it appears (either a bug or by design) that the job queue log of each additional schedd must be defined explicitly:

SCHEDD.SCHEDDJOBS2.JOB_QUEUE_LOG = $(SCHEDD.SCHEDDJOBS2.SPOOL)/job_queue.log

If not explicitly stated, only one job_queue.log is used. Hence, all jobs are assigned to all schedd queues on a restart.

------------------------------------------------------------------------
re 3:
------------------------------------------------------------------------

This looks like a bug (and a regression) IMHO.

src/condor_utils/param_info.in:

[JOB_QUEUE_LOG]
default=$(SPOOL)/job_queue.log

src/condor_schedd.V6/schedd_main.cpp:

	// Initialize the job queue
	char *job_queue_param_name = param("JOB_QUEUE_LOG");
	if (job_queue_param_name == NULL) {
		// the default place for the job_queue.log is in spool
		job_queue_name.sprintf( "%s/job_queue.log", Spool);
	} else {
		job_queue_name = job_queue_param_name; // convert char * to MyString
		free(job_queue_param_name);
	}

Because of the default, the Spool/job_queue.log code won't be hit.
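For a setup like the one in the original report, the workaround generalizes to every secondary schedd: point each one's JOB_QUEUE_LOG into its own per-schedd SPOOL. A sketch for both secondary schedds (the SCHEDDJOBS3 settings are assumed to mirror the SCHEDDJOBS2 block shown above):

```
SCHEDD.SCHEDDJOBS2.JOB_QUEUE_LOG = $(SCHEDD.SCHEDDJOBS2.SPOOL)/job_queue.log
SCHEDD.SCHEDDJOBS3.JOB_QUEUE_LOG = $(SCHEDD.SCHEDDJOBS3.SPOOL)/job_queue.log
```

With these set, a restart should show each schedd rotating its own ClassAd log rather than all three reading one shared file.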
$ env _CONDOR_MATT.SPOOL=/tmp strace -e open condor_schedd -t -f -local-name matt 2>&1 | grep -e spool -e tmp
open("/home/matt/Documents/CondorInstallation/spool/.schedd_address.new", O_WRONLY) = -1 ENOENT (No such file or directory)
open("/home/matt/Documents/CondorInstallation/spool/.schedd_address.new", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0644) = 7
open("/home/matt/Documents/CondorInstallation/spool/.schedd_address.new", O_WRONLY) = -1 ENOENT (No such file or directory)
open("/home/matt/Documents/CondorInstallation/spool/.schedd_address.new", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0644) = 7
08/16/12 07:27:27 (pid:14896) initLocalStarterDir: /home/matt/Documents/CondorInstallation/spool/local_univ_execute already exists, deleting old contents
open("/tmp/spool_version", O_RDONLY) = 11
open("/home/matt/Documents/CondorInstallation/spool/job_queue.log", O_RDWR) = 11

A workaround is to set JOB_QUEUE_LOG=

$ env _CONDOR_MATT.SPOOL=/tmp _CONDOR_JOB_QUEUE_LOG= strace -e open condor_schedd -t -f -local-name matt 2>&1 | grep -e spool -e tmp
open("/home/matt/Documents/CondorInstallation/spool/.schedd_address.new", O_WRONLY) = -1 ENOENT (No such file or directory)
open("/home/matt/Documents/CondorInstallation/spool/.schedd_address.new", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0644) = 7
open("/home/matt/Documents/CondorInstallation/spool/.schedd_address.new", O_WRONLY) = -1 ENOENT (No such file or directory)
open("/home/matt/Documents/CondorInstallation/spool/.schedd_address.new", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0644) = 7
08/16/12 07:28:31 (pid:14908) initLocalStarterDir: /home/matt/Documents/CondorInstallation/spool/local_univ_execute already exists, deleting old contents
open("/tmp/spool_version", O_RDONLY) = 11
open("/tmp/job_queue.log", O_RDWR) = -1 ENOENT (No such file or directory)
open("/tmp/job_queue.log", O_RDWR|O_CREAT|O_EXCL, 0600) = 11

Note, SCHEDD_ADDRESS_FILE also has a default (defined in condor_config) of $(SPOOL)/.schedd_address.
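The dead code path can be demonstrated in isolation. The sketch below mirrors the schedd_main.cpp logic quoted above, with Condor's param() replaced by a toy stand-in and MyString by std::string; the paths are illustrative, not taken from a real install. It shows why, once param_info.in supplies a default, every schedd resolves to the same global $(SPOOL)/job_queue.log and the per-schedd fallback is never taken:

```cpp
#include <cstdlib>
#include <cstring>
#include <string>

// Toy stand-in for Condor's param(): once param_info.in registers
// "default=$(SPOOL)/job_queue.log" for JOB_QUEUE_LOG, param() never
// returns NULL -- it hands back the expanded default instead.
static char* toy_param(const char* name) {
    if (std::strcmp(name, "JOB_QUEUE_LOG") == 0) {
        // $(SPOOL) expands to the *global* spool, not the per-schedd one.
        const char* def = "/var/lib/condor/spool/job_queue.log";
        char* copy = static_cast<char*>(std::malloc(std::strlen(def) + 1));
        std::strcpy(copy, def);
        return copy;
    }
    return nullptr;
}

// Mirrors the quoted schedd_main.cpp logic (std::string instead of MyString).
std::string resolve_job_queue_log(const std::string& per_schedd_spool) {
    std::string job_queue_name;
    char* p = toy_param("JOB_QUEUE_LOG");
    if (p == nullptr) {
        // Intended per-schedd fallback -- dead code once the default exists.
        job_queue_name = per_schedd_spool + "/job_queue.log";
    } else {
        job_queue_name = p;  // every schedd gets the same global path
        std::free(p);
    }
    return job_queue_name;
}
```

Under these assumptions, two schedds with distinct per-schedd SPOOL directories still resolve to the identical global path, which is exactly the sharing observed after restart. Removing the default (so param() returns NULL) restores the pre-7.7.5 behavior of building the path from each schedd's own Spool.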
The verification will be very easy:
1. Set up a multi-schedd config as listed above.
2. Submit jobs on the older dev version; note the collision.
3. Submit jobs on the newer dev version; no collision.
Successfully reproduced with: condor-7.8.2-0.3
Tested with: condor-7.8.8-0.4.1
Tested on: RHEL5 x86_64, i386; RHEL6 x86_64, i386
Tested using automated script.

>>> verified