See: http://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=1046

Issue in all known versions, including 7.4.1-0.7.

Job removal is exceedingly slow, especially when removing tens of thousands of jobs. A quick review of the Schedd code suggests that a single condor_rm should result in a single transaction and fsync, but that is not the case.

Generate some Schedd traffic:

$ echo "cmd=/bin/sleep\nargs=1h\nnotification=never\nqueue 29" | condor_submit
Submitting job(s).............................
$ condor_rm -a

While watching the Schedd:

$ strace -c -efsync -p $(pidof condor_schedd)
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.002000          65        31           fsync
------ ----------- ----------- --------- --------- ----------------
100.00    0.002000                    31           total

The job_queue.log:

105
103 0.0 NextClusterNum 8
101 07.-1 Job Machine
101 7.0 Job Machine
103 7.0 GlobalJobId "robin.local#7.0#1260542935"
103 07.-1 ClusterId 7
...
103 7.1 GlobalJobId "robin.local#7.1#1260542935"
103 7.1 ProcId 1
103 7.1 LastJobStatus 0
103 7.1 JobStatus 1
101 7.2 Job Machine
103 7.2 GlobalJobId "robin.local#7.2#1260542935"
103 7.2 ProcId 2
103 7.2 LastJobStatus 0
103 7.2 JobStatus 1
101 7.3 Job Machine
...
101 7.28 Job Machine
103 7.28 GlobalJobId "robin.local#7.28#1260542936"
103 7.28 ProcId 28
103 7.28 LastJobStatus 0
103 7.28 JobStatus 1
103 7.28 EnteredCurrentStatus 1260542936
106
105
103 7.10 LastJobStatus 1
103 7.10 JobStatus 3
103 7.10 RemoveReason "via condor_rm (by user matt)"
103 7.10 EnteredCurrentStatus 1260542937
104 7.10 JobStatusOnRelease
103 7.11 LastJobStatus 1
103 7.11 JobStatus 3
103 7.11 RemoveReason "via condor_rm (by user matt)"
103 7.11 EnteredCurrentStatus 1260542937
104 7.11 JobStatusOnRelease
...
104 7.27 JobStatusOnRelease
103 7.28 LastJobStatus 1
103 7.28 JobStatus 3
103 7.28 RemoveReason "via condor_rm (by user matt)"
103 7.28 EnteredCurrentStatus 1260542937
104 7.28 JobStatusOnRelease
106
103 7.10 JobFinishedHookDone 1260542937
105
102 7.10
106
103 7.11 JobFinishedHookDone 1260542937
105
102 7.11
106
...
105
102 7.28
102 07.-1
106

105 is BeginTransaction, 106 is EndTransaction, 102 is DestroyClassAd.

We are getting one transaction, and thus one fsync, for the submit (good), one for the removal (good), and one for each job to destroy it (bad).

Suggestion: Batch the job ad deletion.
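As a quick cross-check from the log alone, the record codes above can be counted directly. A minimal sketch, assuming the job_queue.log path below (spool locations vary):

$ # 105 = BeginTransaction, 102 = DestroyClassAd
$ awk '$1 == 105 { tx++ } $1 == 102 { del++ } END { print tx, "transactions,", del, "deletions" }' \
      /var/lib/condor/spool/job_queue.log

With the current behavior, removing N jobs adds roughly N+1 transactions (one for the status updates, then one per destroyed ad) instead of the single transaction one would expect. If the deletion were batched as suggested, the tail of the log would instead collapse to a single transaction wrapping all the DestroyClassAd records, something like:

105
102 7.10
102 7.11
...
102 7.28
102 07.-1
106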
Fixed upstream; built in condor 7.4.3-0.3.
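To confirm the installed build before retesting:

$ condor_version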
While running:

$ strace -c -efsync -p $(pidof condor_schedd)

Submit:

$ time echo "cmd=/bin/true\nlog=/tmp/log\nnotification=never\nqueue 1000" | strace -c -efsync condor_submit

Before:

Process 14310 attached - interrupt to quit
^CProcess 14310 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.069582          35      2002           fsync
------ ----------- ----------- --------- --------- ----------------
100.00    0.069582                  2002           total

After:

Process 24049 attached - interrupt to quit
^CProcess 24049 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.027550          27      1002           fsync
------ ----------- ----------- --------- --------- ----------------
100.00    0.027550                  1002           total
The 1000 fsyncs still happening in the schedd are probably fsyncs of the job user log.
I believe so too: http://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=1237
Reproduced on RHEL5 x86_64:

$ time echo -e "cmd=/bin/true\nlog=/tmp/log\nnotification=never\nqueue 1000" | strace -c -efsync condor_submit
Submitting job(s).........
Logging submit event(s)..................
1000 job(s) submitted to cluster 88.
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.073913          74      1000           fsync
------ ----------- ----------- --------- --------- ----------------
100.00    0.073913                  1000           total

real    0m47.919s
user    0m1.089s
sys     0m2.399s
As Dan said in the previous comment, there is still some issue with fsyncs of the job user log. Is it related to this bug? Do we need to hold off on verification until the user log fsync issue is resolved?
Unrelated. You should be able to remove the log line and see the number of fsync calls go to nearly 0.
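For example, repeating the submission from the earlier comment without the log line should show condor_submit's fsync count drop to almost nothing:

$ echo -e "cmd=/bin/true\nnotification=never\nqueue 1000" | strace -c -efsync condor_submit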
Tested on all combinations of RHEL4/RHEL5 and i386/x86_64 with the 7.4.3-0.16 version of condor.

RHEL 4.8 x86_64
---------------
jobs   |  500 | 1000 | 1500 | 2000 | 2500 |
fsyncs |    2 |    6 |    6 |    6 |    7 |

RHEL 4.8 i386
-------------
jobs   |  500 | 1000 | 1500 | 2000 | 2500 |
fsyncs |    4 |    4 |    4 |    5 |    4 |

RHEL 5.5 x86_64
---------------
jobs   |  500 | 1000 | 1500 | 2000 | 2500 |
fsyncs |    2 |    6 |    6 |    6 |    7 |

RHEL 5.5 i386
-------------
jobs   |  500 | 1000 | 1500 | 2000 | 2500 |
fsyncs |    4 |    4 |    5 |    5 |    5 |

>>> VERIFIED
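For reference, numbers like these can be gathered by watching the schedd as in the original report. A rough sketch of such a loop (the submit description, sleep interval, and signal handling here are assumptions, not necessarily the exact verification steps):

$ for n in 500 1000 1500 2000 2500; do
>     echo -e "cmd=/bin/true\nnotification=never\nqueue $n" | condor_submit
>     strace -c -efsync -p $(pidof condor_schedd) & STRACE=$!
>     condor_rm -a
>     sleep 60          # give the schedd time to destroy all the ads
>     kill $STRACE      # detaching strace prints the fsync summary
>     wait
> done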
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Previously, the scheduler daemon slowed down considerably when removing large numbers of jobs. With this update, a single condor_rm command results in a single transaction and a low number of fsync calls.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0773.html