Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 700143

Summary:	Clarify Negotiator statistics cycle period definition
Product:	Red Hat Enterprise MRG	Reporter:	Lubos Trilety <ltrilety>
Component:	condor	Assignee:	Erik Erlandson <eerlands>
Status:	CLOSED DUPLICATE	QA Contact:	MRG Quality Engineering <mrgqe-bugs>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	Development	CC:	iboverma, matt, mkudlej
Target Milestone:	2.1	Keywords:	Reopened
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2011-09-20 16:23:44 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Lubos Trilety 2011-04-27 15:11:17 UTC

Description of problem:
There are several issues in performance-related statistics in negotiator ad:
1) LastNegotiationCyclePeriod2 is always equal to 0
# condor_status -subsystem negotiator -l | grep -i NegotiationCyclePeriod
LastNegotiationCyclePeriod0 = 22
LastNegotiationCyclePeriod1 = 23
LastNegotiationCyclePeriod2 = 0

2) LastNegotiationCycleNumIdleJobs is not working correctly.
E.g.
condor settings:
CLAIM_WORKLIFE = 0
NEGOTIATOR_UPDATE_AFTER_CYCLE = True

run 1000 jobs
# echo -e "cmd=/bin/sleep\nargs=10\nqueue 1000" | runuser condor -s /bin/bash -c condor_submit
Submitting job(s)..........
1000 job(s) submitted to cluster 837.

Observe "condor_q | grep job" and "condor_status -subsystem negotiator -l | grep -i NumIdleJobs"

Statistics obtained after about 2 minutes from submit
# condor_q | grep job
890 jobs; 883 idle, 7 running, 0 held
# condor_status -subsystem negotiator -l | grep -i NumIdleJobs
LastNegotiationCycleNumIdleJobs0 = 1000
LastNegotiationCycleNumIdleJobs1 = 1000
LastNegotiationCycleNumIdleJobs2 = 1000

Notice that those numbers of idle jobs are not corresponding with each other


Version-Release number of selected component (if applicable):
condor-7.6.1-0.1

How reproducible:
100%


Additional info:
The LastNegotiationCycleNumIdleJobs was changed after some time, but not every negotiator cycle.

Comment 1 Erik Erlandson 2011-05-02 18:00:35 UTC

(In reply to comment #0)


> 1) LastNegotiationCyclePeriod2 is always equal to 0
> # condor_status -subsystem negotiator -l | grep -i NegotiationCyclePeriod
> LastNegotiationCyclePeriod0 = 22
> LastNegotiationCyclePeriod1 = 23
> LastNegotiationCyclePeriod2 = 0

This looks like an artifact of NEGOTIATOR_UPDATE_INTERVAL:
https://bugzilla.redhat.com/show_bug.cgi?id=673538

Comment 2 Erik Erlandson 2011-05-02 18:05:06 UTC

> 2) LastNegotiationCycleNumIdleJobs is not working correctly.
> 
> Additional info:
> The LastNegotiationCycleNumIdleJobs was changed after some time, but not every
> negotiator cycle.


This is almost certainly due to latency between schedd and propagation to submitter ads on collector, but I will take a look at it.

Comment 3 Erik Erlandson 2011-05-02 22:55:17 UTC

Looking at the negotiator log, it seems that both (1) and (2) are happening because jobs are being scheduled without the negotiator loop.  

Some kind of autoclustering/claim re-use trickery?

Comment 4 Erik Erlandson 2011-05-05 20:40:25 UTC

There are two things that could cause the described behavior.  The first is that NEGOTIATOR_UPDATE_INTERVAL could be too high (Bug 673538).  This can be alleviated by setting it to a shorter interval.

The second is that the schedd is reusing claims, and the negotiator loop is being by-passed for these jobs, after the first initial scheduling cycle (which creates the claims).  Those statistics are intended to measure activity inside the negotiation loop, so in this case their being zero is acceptable.  This behavior can be turned off by setting CLAIM_WORKLIFE=0

Comment 6 Lubos Trilety 2011-05-10 12:41:01 UTC

(In reply to comment #4)
> There are two things that could cause the described behavior.  The first is
> that NEGOTIATOR_UPDATE_INTERVAL could be too high (Bug 673538).  This can be
> alleviated by setting it to a shorter interval.
> 

It was found with NEGOTIATOR_UPDATE_AFTER_CYCLE set to True, so it should not be a problem. Anyway retested again with NEGOTIATOR_UPDATE_INTERVAL = 5. The results are still the same.

# condor_status -subsystem negotiator -l | grep -i NegotiationCyclePeriod
LastNegotiationCyclePeriod0 = 64
LastNegotiationCyclePeriod1 = 25
LastNegotiationCyclePeriod2 = 0

change after one minute:
# condor_status -subsystem negotiator -l | grep -i NegotiationCyclePeriod
LastNegotiationCyclePeriod0 = 63
LastNegotiationCyclePeriod1 = 64
LastNegotiationCyclePeriod2 = 0


> The second is that the schedd is reusing claims, and the negotiator loop is
> being by-passed for these jobs, after the first initial scheduling cycle (which
> creates the claims).  Those statistics are intended to measure activity inside
> the negotiation loop, so in this case their being zero is acceptable.  This
> behavior can be turned off by setting CLAIM_WORKLIFE=0

If you look at description of this bug, there is already set CLAIM_WORKLIFE to zero in example.

>>> ASSIGNED

Comment 7 Erik Erlandson 2011-05-10 19:27:50 UTC

(In reply to comment #6)

> It was found with NEGOTIATOR_UPDATE_AFTER_CYCLE set to True, so it should not
> be a problem. Anyway retested again with NEGOTIATOR_UPDATE_INTERVAL = 5. The
> results are still the same.
> 
> # condor_status -subsystem negotiator -l | grep -i NegotiationCyclePeriod
> LastNegotiationCyclePeriod0 = 64
> LastNegotiationCyclePeriod1 = 25
> LastNegotiationCyclePeriod2 = 0

The above behavior from LastNegotiationCyclePeriod2 is because "period" is calculated using difference between end of current cycle and the previous cycle.  For the last cycle-stat, there is no previous cycle, so it is defaulted to zero.   For example, if you set "NEGOTIATION_CYCLE_STATS_LENGTH = 10" then you would see "LastNegotiationCyclePeriod9 = 0"

relevant code is this:
int period = 0;
if (((1+i) < num_negotiation_cycle_stats) && (negotiation_cycle_stats[1+i] != NULL))
    period = s->end_time - negotiation_cycle_stats[1+i]->end_time;



Regarding LastNegotiationCycleNumIdleJobs<N>, I'm seeing correct behavior when I attempt the repro.  The negotiator's measurement of idle jobs lags the schedd:

$ condor_q | tail -1; condor_status -l -neg | grep LastNegotiationCycleNumIdleJobs
406 jobs; 406 idle, 0 running, 0 held
LastNegotiationCycleNumIdleJobs0 = 408
LastNegotiationCycleNumIdleJobs1 = 410
LastNegotiationCycleNumIdleJobs2 = 412

$ condor_q | tail -1; condor_status -l -neg | grep LastNegotiationCycleNumIdleJobs
404 jobs; 404 idle, 0 running, 0 held
LastNegotiationCycleNumIdleJobs0 = 406
LastNegotiationCycleNumIdleJobs1 = 408
LastNegotiationCycleNumIdleJobs2 = 410

Comment 9 Lubos Trilety 2011-05-11 08:05:12 UTC

(In reply to comment #7)
> The above behavior from LastNegotiationCyclePeriod2 is because "period" is
> calculated using difference between end of current cycle and the previous
> cycle.  For the last cycle-stat, there is no previous cycle, so it is defaulted
> to zero.   For example, if you set "NEGOTIATION_CYCLE_STATS_LENGTH = 10" then
> you would see "LastNegotiationCyclePeriod9 = 0"
> 
> relevant code is this:
> int period = 0;
> if (((1+i) < num_negotiation_cycle_stats) && (negotiation_cycle_stats[1+i] !=
> NULL))
>     period = s->end_time - negotiation_cycle_stats[1+i]->end_time;

I don't think it's correct behaviour, I believe that from customer point of view all of these values should indicate time between negotiator cycles. For example if customer choose "NEGOTIATION_CYCLE_STATS_LENGTH = 1", he does not expect that "LastNegotiationCyclePeriod0 = 0".

> 
> Regarding LastNegotiationCycleNumIdleJobs<N>, I'm seeing correct behavior when
> I attempt the repro.  The negotiator's measurement of idle jobs lags the
> schedd:
> 
> $ condor_q | tail -1; condor_status -l -neg | grep
> LastNegotiationCycleNumIdleJobs
> 406 jobs; 406 idle, 0 running, 0 held
> LastNegotiationCycleNumIdleJobs0 = 408
> LastNegotiationCycleNumIdleJobs1 = 410
> LastNegotiationCycleNumIdleJobs2 = 412
> 
> $ condor_q | tail -1; condor_status -l -neg | grep
> LastNegotiationCycleNumIdleJobs
> 404 jobs; 404 idle, 0 running, 0 held
> LastNegotiationCycleNumIdleJobs0 = 406
> LastNegotiationCycleNumIdleJobs1 = 408
> LastNegotiationCycleNumIdleJobs2 = 410

My reproduction:
# condor_q | tail -1; condor_status -l -neg | grep LastNegotiationCycleNumIdleJobs
941 jobs; 940 idle, 1 running, 0 held
LastNegotiationCycleNumIdleJobs0 = 1000
LastNegotiationCycleNumIdleJobs1 = 1000
LastNegotiationCycleNumIdleJobs2 = 1000

Comment 11 Erik Erlandson 2011-05-11 16:44:54 UTC

The slower update of LastNegotiationCycleNumIdleJobs was due to SCHEDD_INTERVAL taking its default value of 5 minutes.   Setting it to something smaller, such as 15 sec, keeps it in sync.   Generally, SCHEDD_INTERVAL should be set to same value of NEGOTIATOR_INTERVAL for optimal behavior.

Comment 12 Erik Erlandson 2011-05-11 16:45:14 UTC

QE recommends moving the remaining "cycle period" fix to 2.1, and I agree.

Comment 14 Erik Erlandson 2011-09-20 16:23:44 UTC

Closing this and moving to doc ticket 
https://bugzilla.redhat.com/show_bug.cgi?id=739999

*** This bug has been marked as a duplicate of bug 739999 ***