Bug 588502 - Documentation of new SCHEDD_CLUSTER_MAXIMUM_VALUE is needed
Summary: Documentation of new SCHEDD_CLUSTER_MAXIMUM_VALUE is needed
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: Grid_User_Guide
Version: beta
Hardware: All
OS: All
Priority: low
Severity: medium
Target Milestone: 1.3
Assignee: Lana Brindley
QA Contact: Lubos Trilety
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2010-05-03 20:12 UTC by Erik Erlandson
Modified: 2013-10-23 23:16 UTC
CC List: 3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-10-14 20:02:24 UTC
Target Upstream Version:
Embargoed:




Links
Red Hat Bugzilla 575150 (high, CLOSED): Need to be able to configure maximum cluster id (last updated 2021-02-22 00:41:40 UTC)

Description Erik Erlandson 2010-05-03 20:12:08 UTC
Description of problem: New parameter SCHEDD_CLUSTER_MAXIMUM_VALUE is not yet documented.


Version-Release number of selected component (if applicable): 1.3


Expected results: proper documentation of this parameter, including the need for the condor administrator to manage cluster ids to avoid collisions if the parameter is enabled


Additional info:
developer contact: eje

Comment 1 Lana Brindley 2010-05-04 20:59:27 UTC
Hi Matt and Rob,

Can you please provide some more detail around this? I'd like to get it in for 1.3 if at all possible.

LKB

Comment 2 Lana Brindley 2010-05-04 21:51:53 UTC
======================================
SCHEDD_CLUSTER_MAXIMUM_VALUE is an upper bound on the job cluster id.
If this parameter is set to a value (M), the maximum job cluster id
assigned to any job will be (M-1).  When the maximum id is reached, job
ids will wrap around back to SCHEDD_CLUSTER_INITIAL_VALUE. The default
value is zero, in which case no maximum cluster id is enforced
(backward compatible).

Caveat: If you set this parameter to a value (M), you are responsible
for ensuring that you have fewer than (M) jobs in your queue at any time.
Otherwise, in the event that many new jobs are queued up simultaneously,
it is possible for jobs to be erroneously assigned duplicate cluster
ids, which will result in a corrupted job queue.

See also:  SCHEDD_CLUSTER_INITIAL_VALUE, SCHEDD_CLUSTER_INCREMENT_VALUE
=====================================

Actually, the true caveat is a bit stronger than that: the user is
responsible for making sure that old jobs complete before new job ids
wrap around.  In other words, if you have a really long-running job with
id (x.0), and the cluster id wraps back around to (x), you'll have a
problem, even if clusters (x+1, x+2, ...) are free.

From Erik, on email.

LKB

Comment 3 Lana Brindley 2010-05-20 04:08:53 UTC
Hi Erik,

I don't seem to have SCHEDD_CLUSTER_INITIAL_VALUE or SCHEDD_CLUSTER_INCREMENT_VALUE documented either. Can you please provide information about these parameters, and an indication of which section of the Appendix (http://documentation-stage.bne.redhat.com/docs/en-US/Red_Hat_Enterprise_MRG/1.2/html/Grid_User_Guide/appe-Grid_User_Guide-Configuration_options.html) they should go in? I'm assuming the condor_schedd section, but would like your approval before going ahead.

Thanks,
LKB

Comment 4 Erik Erlandson 2010-05-21 15:38:54 UTC
Here is some content for initial-val and increment-val:

================================
SCHEDD_CLUSTER_INITIAL_VALUE specifies the first cluster id number that will be assigned.  This parameter defaults to 1.  If the job cluster id reaches the value set by SCHEDD_CLUSTER_MAXIMUM_VALUE and wraps, it will be re-set to SCHEDD_CLUSTER_INITIAL_VALUE.  

Note that the most recently used cluster id is saved, and persists across shutdown and restart.  If the "job_queue.log" file is removed, then after system restart cluster ids will be assigned starting from SCHEDD_CLUSTER_INITIAL_VALUE.

See also: SCHEDD_CLUSTER_INCREMENT_VALUE, SCHEDD_CLUSTER_MAXIMUM_VALUE
================================
================================
SCHEDD_CLUSTER_INCREMENT_VALUE specifies the increment used to assign new cluster id numbers.  Its default value is 1.

For example, if SCHEDD_CLUSTER_INITIAL_VALUE is set to 2, and SCHEDD_CLUSTER_INCREMENT_VALUE is set to 2, then cluster id numbers will be {2, 4, 6, ...}

See also: SCHEDD_CLUSTER_INITIAL_VALUE, SCHEDD_CLUSTER_MAXIMUM_VALUE
================================
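
Putting the three parameters together, here is a minimal sketch of how they might be set in a condor configuration file (the values below are purely illustrative, not recommendations):

================================
# Illustrative values only -- choose limits appropriate for your pool.
SCHEDD_CLUSTER_INITIAL_VALUE = 1
SCHEDD_CLUSTER_INCREMENT_VALUE = 1
SCHEDD_CLUSTER_MAXIMUM_VALUE = 1000

# With these settings, cluster ids are assigned 1, 2, 3, ..., 999
# (that is, up to SCHEDD_CLUSTER_MAXIMUM_VALUE - 1) and then wrap
# back around to SCHEDD_CLUSTER_INITIAL_VALUE.
================================

The configured value of any of these parameters can be checked with condor_config_val, e.g. "condor_config_val SCHEDD_CLUSTER_MAXIMUM_VALUE".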

Comment 5 Lana Brindley 2010-06-16 03:16:20 UTC
(In reply to comment #2)
> ======================================
> SCHEDD_CLUSTER_MAXIMUM_VALUE is an upper bound on the job cluster id.
> If this parameter is set to a value (M), the maximum job cluster id
> assigned to any job will be (M-1).  When the maximum id is reached, job
> ids will wrap around back to SCHEDD_CLUSTER_INITIAL_VALUE. The default
> value is zero, in which case no maximum cluster id is enforced
> (backward compatible).
> 
> Caveat: If you set this parameter to a value (M), you are responsible
> for ensuring that you have fewer than (M) jobs in your queue at any time.
> Otherwise, in the event that many new jobs are queued up simultaneously,
> it is possible for jobs to be erroneously assigned duplicate cluster
> ids, which will result in a corrupted job queue.
> 
> See also:  SCHEDD_CLUSTER_INITIAL_VALUE, SCHEDD_CLUSTER_INCREMENT_VALUE
> =====================================
> 
> Actually, the true caveat is a bit stronger than that: the user is
> responsible for making sure that old jobs complete before new job ids
> wrap around.  In other words, if you have a really long-running job with
> id (x.0), and the cluster id wraps back around to (x), you'll have a
> problem, even if clusters (x+1, x+2, ...) are free.
> 
> From Erik, on email.
> 
> LKB    

<varlistentry>
				<term><command>SCHEDD_CLUSTER_MAXIMUM_VALUE</command></term>
				 <listitem>
					<para>
						An upper bound on the job cluster ID. If this parameter is set to a value (<parameter>M</parameter>), the maximum job cluster ID assigned to any job will be (<parameter>M-1</parameter>).  When the maximum ID is reached, job IDs will wrap around back to <command>SCHEDD_CLUSTER_INITIAL_VALUE</command>. The default value is <parameter>0</parameter>, which will not set a maximum cluster ID.
					</para>
					<warning>
						<para>
							When setting this parameter, it is important to ensure that the number of jobs in the queue at any one time is less than the value. If too many jobs are queued at once, duplicate cluster IDs could be assigned, which will result in a corrupted job queue.
						</para>
					</warning>
				</listitem>

			</varlistentry>

Comment 6 Lana Brindley 2010-06-16 03:23:56 UTC
(In reply to comment #4)
> Here is some content for initial-val and increment-val:
> 
> ================================
> SCHEDD_CLUSTER_INITIAL_VALUE specifies the first cluster id number that will be
> assigned.  This parameter defaults to 1.  If the job cluster id reaches the
> value set by SCHEDD_CLUSTER_MAXIMUM_VALUE and wraps, it will be re-set to
> SCHEDD_CLUSTER_INITIAL_VALUE.  
> 
> Note that the most recently used cluster id is saved, and persists across
> shutdown and restart.  If the "job_queue.log" file is removed, then after
> system restart cluster ids will be assigned starting from
> SCHEDD_CLUSTER_INITIAL_VALUE.
> 
> See also: SCHEDD_CLUSTER_INCREMENT_VALUE, SCHEDD_CLUSTER_MAXIMUM_VALUE

<varlistentry>
				<term><command>SCHEDD_CLUSTER_INITIAL_VALUE</command></term>
				 <listitem>
					<para>
						Specifies the first cluster ID number to be assigned. Defaults to <parameter>1</parameter>.  If the job cluster ID reaches the value set by <command>SCHEDD_CLUSTER_MAXIMUM_VALUE</command> and wraps around, the job cluster ID will be reset to the value of <command>SCHEDD_CLUSTER_INITIAL_VALUE</command>.
					</para>
					<para>
						If the <filename>job_queue.log</filename> file is removed, cluster IDs will be assigned starting from <command>SCHEDD_CLUSTER_INITIAL_VALUE</command> after system restart.
					</para>
				</listitem>
			</varlistentry>

> ================================
> ================================
> SCHEDD_CLUSTER_INCREMENT_VALUE specifies the increment used to assign new
> cluster id numbers.  Its default value is 1.
> 
> For example, if SCHEDD_CLUSTER_INITIAL_VALUE is set to 2, and
> SCHEDD_CLUSTER_INCREMENT_VALUE is set to 2, then cluster id numbers will be {2,
> 4, 6, ...}
> 
> See also: SCHEDD_CLUSTER_INITIAL_VALUE, SCHEDD_CLUSTER_MAXIMUM_VALUE
> ================================    

<varlistentry>
				<term><command>SCHEDD_CLUSTER_INCREMENT_VALUE</command></term>
				 <listitem>
					<para>
						Specifies the increment to use when assigning new cluster ID numbers. Defaults to <parameter>1</parameter>.
					</para>
					<para>
						For example, if <command>SCHEDD_CLUSTER_INITIAL_VALUE</command> is set to <parameter>2</parameter>, and <command>SCHEDD_CLUSTER_INCREMENT_VALUE</command> is set to <parameter>2</parameter>, the cluster ID numbers will be <parameter>{2, 4, 6, ...}</parameter>.
					</para>
				</listitem>
			</varlistentry>

LKB

Comment 7 Lubos Trilety 2010-09-03 10:12:01 UTC
(In reply to comment #5)
> (In reply to comment #2)

> > Caveat: If you set this parameter to a value (M), you are responsible
> > for ensuring that you have fewer than (M) jobs in your queue at any time.
> > Otherwise, in the event that many new jobs are queued up simultaneously,
> > it is possible for jobs to be erroneously assigned duplicate cluster
> > ids, which will result in a corrupted job queue.
> > 
> > See also:  SCHEDD_CLUSTER_INITIAL_VALUE, SCHEDD_CLUSTER_INCREMENT_VALUE
> > =====================================
> > 
> > Actually, the true caveat is a bit stronger than that: the user is
> > responsible for making sure that old jobs complete before new job ids
> > wrap around.  In other words, if you have a really long-running job with
> > id (x.0), and the cluster id wraps back around to (x), you'll have a
> > problem, even if clusters (x+1, x+2, ...) are free.
> > 

>      </para>
>      <warning>
>       <para>
>        When setting this parameter, it is important to ensure that the number
> of jobs in the queue at any one time is less than the value. If too many jobs
> are queued at once, duplicate cluster IDs could be assigned, which will result
> in a corrupted job queue.
>       </para>
>      </warning>
>     </listitem>
> 
>    </varlistentry>

As I understand it, the user has to ensure that a job is never submitted with a cluster_id that is currently in use by another job. Perhaps the warning should be rewritten to reflect that.

Comment 8 Erik Erlandson 2010-09-03 14:35:11 UTC
> As I understand it, the user has to ensure that a job is never submitted with a
> cluster_id that is currently in use by another job. Perhaps the warning should
> be rewritten to reflect that.

I thought the sentence regarding duplicate cluster IDs corrupting the job queue captured this idea; however, it could be rewritten as:

"When setting this parameter, it is important to set it large enough so that the cluster id wrapping never results in assigning a cluster id that is currently in use by a running job, which will result in a corrupted job queue."

Comment 9 Lana Brindley 2010-09-06 02:51:45 UTC
(In reply to comment #8)
> > As I understand it, the user has to ensure that a job is never submitted with a
> > cluster_id that is currently in use by another job. Perhaps the warning should
> > be rewritten to reflect that.
> 
> I thought the sentence regarding duplicate cluster IDs corrupting the job queue
> captured this idea; however, it could be rewritten as:
> 
> "When setting this parameter, it is important to set it large enough so that
> the cluster id wrapping never results in assigning a cluster id that is
> currently in use by a running job, which will result in a corrupted job queue."

I see there are two distinct (but related) issues here:

*IF* either:
1: Too many jobs are queued at once, so duplicate cluster IDs might be assigned to jobs
or 2: A job is submitted with the same cluster ID as a job already in the queue

*THEN*
The job queue will become corrupted.

With that in mind, here's the updated admonition:

<warning>
	<para>
		It is important to ensure that the number of jobs in the queue at any one time is less than the value of this parameter. If too many jobs are queued at once, duplicate cluster IDs could be assigned. Additionally, it is important that a job is never submitted with the same cluster ID as an already running job. Duplicate cluster IDs will result in a corrupted job queue.
	</para>
</warning>

Better?

LKB

Comment 10 Lubos Trilety 2010-09-08 08:14:21 UTC
Check the definitions of SCHEDD_CLUSTER_INITIAL_VALUE, SCHEDD_CLUSTER_MAXIMUM_VALUE and SCHEDD_CLUSTER_INCREMENT_VALUE in the Grid User Guide.

>>> VERIFIED

