Bug 546770
| Summary: | condor_schedd performance, job removal fsync for each job | ||
|---|---|---|---|
| Product: | Red Hat Enterprise MRG | Reporter: | Matthew Farrellee <matt> |
| Component: | condor | Assignee: | Matthew Farrellee <matt> |
| Status: | CLOSED ERRATA | QA Contact: | Tomas Rusnak <trusnak> |
| Severity: | medium | Docs Contact: | |
| Priority: | high | ||
| Version: | 1.2 | CC: | dan, fnadge, iboverma, trusnak |
| Target Milestone: | 1.3 | ||
| Target Release: | --- | ||
| Hardware: | All | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Bug Fix | |
| Doc Text: |
Previously, the scheduler daemon slowed down considerably when removing large numbers of jobs. With this update, a single condor_rm command results in a single transaction and a low number of fsync calls.
|
Story Points: | --- |
| Clone Of: | Environment: | ||
| Last Closed: | 2010-10-14 16:08:48 UTC | Type: | --- |
Description
Matthew Farrellee
2009-12-11 21:52:41 UTC
Fixed upstream, built in condor 7.4.3-0.3

While running:

    $ strace -c -efsync -p $(pidof condor_schedd)

Submit:

    $ time echo "cmd=/bin/true\nlog=/tmp/log\nnotification=never\nqueue 1000" | strace -c -efsync condor_submit

Before:

    Process 14310 attached - interrupt to quit
    ^CProcess 14310 detached
    % time     seconds  usecs/call     calls    errors syscall
    ------ ----------- ----------- --------- --------- ----------------
    100.00    0.069582          35      2002           fsync
    ------ ----------- ----------- --------- --------- ----------------
    100.00    0.069582                  2002           total

After:

    Process 24049 attached - interrupt to quit
    ^CProcess 24049 detached
    % time     seconds  usecs/call     calls    errors syscall
    ------ ----------- ----------- --------- --------- ----------------
    100.00    0.027550          27      1002           fsync
    ------ ----------- ----------- --------- --------- ----------------
    100.00    0.027550                  1002           total

The 1000 fsyncs still happening in the schedd are probably fsyncs of the job user log.

I believe so too: http://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=1237

Reproduced on RHEL5 x86_64:

    $ time echo -e "cmd=/bin/true\nlog=/tmp/log\nnotification=never\nqueue 1000" | strace -c -efsync condor_submit
    Submitting job(s).........
    Logging submit event(s)..................
    1000 job(s) submitted to cluster 88.
    % time     seconds  usecs/call     calls    errors syscall
    ------ ----------- ----------- --------- --------- ----------------
    100.00    0.073913          74      1000           fsync
    ------ ----------- ----------- --------- --------- ----------------
    100.00    0.073913                  1000           total

    real 0m47.919s
    user 0m1.089s
    sys 0m2.399s

As Dan said in the previous comment, there is still an issue with fsyncs of the job user log. Is it related to this bug? Does verification need to wait until the logging fsync issue is resolved?

Unrelated. You should be able to remove the log line and see the number of fsync calls go to nearly 0.

Tested on all combinations of RHEL4/RHEL5 and i386/x86_64 with condor version 7.4.3-0.16.
RHEL 4.8 x86_64
---------------
jobs | 500 | 1000 | 1500 | 2000 | 2500 |
fsyncs | 2 | 6 | 6 | 6 | 7 |
RHEL 4.8 i386
---------------
jobs | 500 | 1000 | 1500 | 2000 | 2500 |
fsyncs | 4 | 4 | 4 | 5 | 4 |
RHEL 5.5 x86_64
---------------
jobs | 500 | 1000 | 1500 | 2000 | 2500 |
fsyncs | 2 | 6 | 6 | 6 | 7 |
RHEL 5.5 i386
---------------
jobs | 500 | 1000 | 1500 | 2000 | 2500 |
fsyncs | 4 | 4 | 5 | 5 | 5 |
>>> VERIFIED
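The measurement used in this report can be sketched as two small shell helpers. This is only an illustration: the function names `submit_description` and `fsync_calls` are hypothetical, and the sketch assumes the `strace -c` summary keeps the column layout seen in the comments here (the call count in the fourth column, the syscall name in the last).

```shell
#!/bin/sh
# Sketch only: submit_description and fsync_calls are hypothetical helper
# names, not part of condor or strace.

# Print a submit description that queues $1 jobs.
# Pass a path as $2 to include the user log line; omit it to leave "log=" out.
submit_description() {
    count=$1
    log=${2:-}
    echo "cmd=/bin/true"
    if [ -n "$log" ]; then
        echo "log=$log"
    fi
    echo "notification=never"
    echo "queue $count"
}

# Read "strace -c" summary text on stdin and print the "calls" value of the
# fsync row. Prints 0 when no fsync row is present.
fsync_calls() {
    awk '$NF == "fsync" { calls = $4 } END { print calls + 0 }'
}

# Possible usage on a machine with condor and strace installed (not run here).
# strace writes its summary to stderr, hence the 2>&1:
#   submit_description 1000 /tmp/log | strace -c -efsync condor_submit 2>&1 | fsync_calls
#   submit_description 1000          | strace -c -efsync condor_submit 2>&1 | fsync_calls
```

Running the pipeline once with and once without the log path is one way to check the remark that removing the log line should bring the fsync count close to 0.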
Technical note added. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.
New Contents:
Previously, the scheduler daemon slowed down considerably when removing large numbers of jobs. With this update, a single condor_rm command results in a single transaction and a low number of fsync calls.
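The difference between flushing once per job and once per request can be illustrated with a toy shell sketch. This is not the condor_schedd implementation: `flush`, `remove_per_job`, `remove_batched`, and `FLUSHES` are hypothetical names, and the `sync` command is used only as a coarse stand-in for an fsync of the log file.

```shell
#!/bin/sh
# Toy illustration only; all names below are hypothetical.

FLUSHES=0

# Stand-in for fsync(2) on the log. "sync" flushes far more than one file,
# so treat this as a counter with a side effect, not as a benchmark.
flush() {
    sync
    FLUSHES=$((FLUSHES + 1))
}

# One transaction per job: each removal record is flushed on its own.
remove_per_job() {
    log=$1
    count=$2
    i=0
    while [ "$i" -lt "$count" ]; do
        echo "remove job $i" >> "$log"
        flush
        i=$((i + 1))
    done
}

# One transaction per request: write every record, then flush once.
remove_batched() {
    log=$1
    count=$2
    i=0
    while [ "$i" -lt "$count" ]; do
        echo "remove job $i" >> "$log"
        i=$((i + 1))
    done
    flush
}
```

Calling `remove_per_job "$log" 1000` leaves `FLUSHES` at 1000, while `remove_batched "$log" 1000` leaves it at 1, which mirrors the drop in fsync calls that the note describes.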
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2010-0773.html