Bug 546770

Summary: condor_schedd performance, job removal fsync for each job
Product: Red Hat Enterprise MRG
Reporter: Matthew Farrellee <matt>
Component: condor
Assignee: Matthew Farrellee <matt>
Status: CLOSED ERRATA
QA Contact: Tomas Rusnak <trusnak>
Severity: medium
Docs Contact:
Priority: high
Version: 1.2
CC: dan, fnadge, iboverma, trusnak
Target Milestone: 1.3
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, the scheduler daemon slowed down considerably when removing large numbers of jobs. With this update, a single condor_rm command results in a single transaction and a low number of fsync calls.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2010-10-14 16:08:48 UTC
Type: ---

Description Matthew Farrellee 2009-12-11 21:52:41 UTC
See: http://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=1046

Issue in all known versions, including 7.4.1-0.7

Job removal is exceedingly slow, especially when removing tens of thousands of jobs.

A quick review of the Schedd code suggests that a single condor_rm should result in a single transaction and fsync, but that is not the case.

Generate some Schedd traffic:

$ echo -e "cmd=/bin/sleep\nargs=1h\nnotification=never\nqueue 29" | condor_submit
Submitting job(s).............................
$ condor_rm -a

While watching the Schedd:

$ strace -c -efsync -p $(pidof condor_schedd)
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.002000          65        31           fsync
------ ----------- ----------- --------- --------- ----------------
100.00    0.002000                    31           total
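
The 31 fsyncs line up with one transaction per operation: one for the submit, one for the removal-status update, and one per-job destroy for each of the 29 jobs. A quick arithmetic check of that decomposition (an interpretation of the strace output, not instrumented data):

```shell
# One fsync for the submit transaction, one for the removal-status
# transaction, and one destroy transaction per job for the 29 jobs:
jobs=29
echo $((1 + 1 + jobs))    # matches the fsync count strace reported
```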

The job_queue.log:

105
103 0.0 NextClusterNum 8
101 07.-1 Job Machine
101 7.0 Job Machine
103 7.0 GlobalJobId "robin.local#7.0#1260542935"
103 07.-1 ClusterId 7
...
103 7.1 GlobalJobId "robin.local#7.1#1260542935"
103 7.1 ProcId 1
103 7.1 LastJobStatus 0
103 7.1 JobStatus 1
101 7.2 Job Machine
103 7.2 GlobalJobId "robin.local#7.2#1260542935"
103 7.2 ProcId 2
103 7.2 LastJobStatus 0
103 7.2 JobStatus 1
101 7.3 Job Machine
...
101 7.28 Job Machine
103 7.28 GlobalJobId "robin.local#7.28#1260542936"
103 7.28 ProcId 28
103 7.28 LastJobStatus 0
103 7.28 JobStatus 1
103 7.28 EnteredCurrentStatus 1260542936
106
105
103 7.10 LastJobStatus 1
103 7.10 JobStatus 3
103 7.10 RemoveReason "via condor_rm (by user matt)"
103 7.10 EnteredCurrentStatus 1260542937
104 7.10 JobStatusOnRelease
103 7.11 LastJobStatus 1
103 7.11 JobStatus 3
103 7.11 RemoveReason "via condor_rm (by user matt)"
103 7.11 EnteredCurrentStatus 1260542937
104 7.11 JobStatusOnRelease
...
104 7.27 JobStatusOnRelease
103 7.28 LastJobStatus 1
103 7.28 JobStatus 3
103 7.28 RemoveReason "via condor_rm (by user matt)"
103 7.28 EnteredCurrentStatus 1260542937
104 7.28 JobStatusOnRelease
106
103 7.10 JobFinishedHookDone 1260542937
105
102 7.10
106
103 7.11 JobFinishedHookDone 1260542937
105
102 7.11
106
...
105
102 7.28
102 07.-1
106

In the log above, 101 is NewClassAd, 102 is DestroyClassAd, 103 is SetAttribute, 104 is DeleteAttribute, 105 is BeginTransaction, and 106 is EndTransaction.

We are getting one transaction, and thus one fsync, for the submit (good), one for the removal (good), and one for each job to destroy it (bad).

Suggestion: Batch the job ad deletion.
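
As a sanity check, the per-job transactions can be counted straight from the log. A minimal sketch using the record types listed above; the sample file here is a hand-abbreviated slice of the dump, not a real spool file, and its path is arbitrary:

```shell
# Build a tiny sample log from the dump above
# (105 = BeginTransaction, 106 = EndTransaction, 102 = DestroyClassAd).
cat > /tmp/job_queue.log.sample <<'EOF'
105
103 7.10 JobFinishedHookDone 1260542937
102 7.10
106
105
102 7.11
106
EOF

# Count transaction boundaries and job-ad destroys; with per-job
# transactions the destroy count tracks the transaction count 1:1.
awk '$1==105 {begin++}
     $1==106 {end++}
     $1==102 {destroy++}
     END {printf "begin=%d end=%d destroy=%d\n", begin, end, destroy}' \
    /tmp/job_queue.log.sample
```

Batching would collapse the destroys into a single 105...106 pair, so begin/end would stay constant as the destroy count grows.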

Comment 3 Matthew Farrellee 2010-02-18 21:39:31 UTC
fixed upstream, built in condor 7.4.3-0.3

Comment 4 Matthew Farrellee 2010-02-19 17:43:56 UTC
While running:

$ strace -c -efsync -p $(pidof condor_schedd)

Submit:

$ time echo -e "cmd=/bin/true\nlog=/tmp/log\nnotification=never\nqueue 1000" | strace -c -efsync condor_submit

Before:

Process 14310 attached - interrupt to quit
^CProcess 14310 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.069582          35      2002           fsync
------ ----------- ----------- --------- --------- ----------------
100.00    0.069582                  2002           total

After:

Process 24049 attached - interrupt to quit
^CProcess 24049 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.027550          27      1002           fsync
------ ----------- ----------- --------- --------- ----------------
100.00    0.027550                  1002           total
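
One plausible decomposition of the before/after counts for 1000 jobs: the schedd pays one fsync per user-log event either way, plus a couple of fsyncs for the submit's job-queue transactions; before the fix it additionally fsynced the job queue once per job. A rough accounting (a hedged interpretation, not instrumented data):

```shell
jobs=1000
# After the fix: one fsync per user-log event, plus two for the
# batched job-queue transactions (the user log dominates).
after=$((jobs + 2))
# Before the fix: additionally one job-queue fsync per job.
before=$((after + jobs))
echo "$before $after"    # matches the 2002 and 1002 totals above
```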

Comment 5 Dan Bradley 2010-02-22 14:45:08 UTC
The 1000 fsyncs still happening in the schedd are probably fsyncs of the job user log.

Comment 6 Matthew Farrellee 2010-02-22 15:57:11 UTC
I believe so too: http://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=1237

Comment 7 Tomas Rusnak 2010-04-15 11:40:17 UTC
Reproduced on RHEL5 x86_64:

$ time echo -e "cmd=/bin/true\nlog=/tmp/log\nnotification=never\nqueue 1000" | strace -c -efsync condor_submit
Submitting job(s).........
Logging submit event(s)..................
1000 job(s) submitted to cluster 88.
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.073913          74      1000           fsync
------ ----------- ----------- --------- --------- ----------------
100.00    0.073913                  1000           total

real	0m47.919s
user	0m1.089s
sys	0m2.399s

Comment 8 Tomas Rusnak 2010-04-19 12:36:19 UTC
As Dan noted in the previous comment, there is still some issue with fsyncs of the job user log. Is it related to this bug? Do we need to wait with verification until the logging fsync issue is solved?

Comment 9 Matthew Farrellee 2010-04-21 05:34:54 UTC
Unrelated.

You should be able to remove the log line and see the number of fsync calls go to nearly 0.

Comment 10 Tomas Rusnak 2010-05-28 16:29:04 UTC
Tested on all combinations of RHEL4/RHEL5 and i386/x86_64 with the 7.4.3-0.16 version of condor.

RHEL 4.8 x86_64 
---------------
jobs   | 500 | 1000 | 1500 | 2000 | 2500 |
fsyncs | 2   | 6    | 6    | 6    | 7    |

RHEL 4.8 i386
---------------
jobs   | 500 | 1000 | 1500 | 2000 | 2500 |
fsyncs | 4   | 4    | 4    | 5    | 4    |

RHEL 5.5 x86_64
---------------
jobs   | 500 | 1000 | 1500 | 2000 | 2500 |
fsyncs | 2   | 6    | 6    | 6    | 7    |

RHEL 5.5 i386
---------------
jobs   | 500 | 1000 | 1500 | 2000 | 2500 |
fsyncs | 4   | 4    | 5    | 5    | 5    |

>>> VERIFIED

Comment 11 Florian Nadge 2010-10-07 14:44:20 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Previously, the scheduler daemon slowed down considerably when removing large numbers of jobs. With this update, a single condor_rm command results in a single transaction and a low number of fsync calls.

Comment 13 errata-xmlrpc 2010-10-14 16:08:48 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0773.html