Bug 523806
| Summary: | No-op syscalls create schedd bottlenecks on shared file systems | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise MRG | Reporter: | Will Benton <willb> | ||||
| Component: | condor | Assignee: | Will Benton <willb> | ||||
| Status: | CLOSED ERRATA | QA Contact: | Luigi Toscano <ltoscano> | ||||
| Severity: | high | Docs Contact: | |||||
| Priority: | high | ||||||
| Version: | 1.0 | CC: | iboverma, jthomas, lbrindle, ltoscano, matt, tao, tross | ||||
| Target Milestone: | 1.2 | ||||||
| Target Release: | --- | ||||||
| Hardware: | All | ||||||
| OS: | Linux | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | Doc Type: | Bug Fix | |||||
| Doc Text: |
Grid bug fix
C: In cases where no file transfer is required, the condor schedd will attempt to stat, chown, and delete the spool directory when the job completes or is removed.
C: In environments using NFS spool directories, these operations run extremely slowly, and may also overload the NFS server.
F: The schedd was changed, and no longer creates or deletes job spool directories.
R: The bottleneck no longer occurs.
In cases where no file transfer is required, the condor schedd will attempt to stat, chown, and delete the spool directory when the job completes or is removed. In environments using NFS spool directories, these operations run extremely slowly, and may also overload the NFS server. The schedd was changed, and no longer creates or deletes job spool directories, and the bottleneck no longer occurs.
|
Story Points: | --- | ||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2009-12-03 09:18:32 UTC | Type: | --- | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Bug Depends On: | |||||||
| Bug Blocks: | 527551 | ||||||
| Attachments: |
|
||||||
|
Description
Will Benton
2009-09-16 17:54:12 UTC
Created attachment 361766 [details]
Patch for spurious spool directory creation and deletion
This patch prevents the schedd from creating a job sandbox when running jobs without file transfer (as determined by "should_transfer_files = NO" in the job ad).
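The check described above can be pictured with a small sketch. This is not the schedd's actual code: the function names, the dictionary-based job ad, and the spool directory naming are hypothetical and chosen only for illustration. The one detail taken from this report is that a job ad with `should_transfer_files = NO` should not get a sandbox. Treating a missing attribute as "transfer needed" is an assumption.

```python
import os
import tempfile


def wants_file_transfer(job_ad):
    """Return False only when the job ad says should_transfer_files = NO.

    Assumption: a missing attribute is treated as "transfer needed", so the
    sketch errs on the side of creating a sandbox.
    """
    raw = job_ad.get("should_transfer_files")
    if raw is None:
        return True
    return str(raw).strip().strip('"').upper() != "NO"


def prepare_spool(spool_root, cluster, proc, job_ad):
    """Create a per-job spool directory only if the job transfers files.

    Returns the directory path, or None when no directory was created.
    The "job-<cluster>.<proc>" name is hypothetical.
    """
    if not wants_file_transfer(job_ad):
        # No mkdir now means no stat/chown/rmdir later for this job.
        return None
    path = os.path.join(spool_root, f"job-{cluster}.{proc}")
    os.makedirs(path, exist_ok=True)
    return path


if __name__ == "__main__":
    root = tempfile.mkdtemp()
    print(prepare_spool(root, 1, 0, {"should_transfer_files": "NO"}))
    print(prepare_spool(root, 1, 1, {"should_transfer_files": "YES"}))
```

Skipping the directory at creation time is what removes the later stat, chown, and rmdir calls, since there is nothing left to clean up when the job completes or is removed.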
This patch is built in 7.4.0-0.5. Tested by submitting and then removing 100 jobs. The -0.4 run may have let the jobs run to completion. Data gathered with strace -c -p `pidof condor_schedd`.

$ grep -e mkdir -e fsync -e stat -e open -e chown -e rmdir nfs-* local-*

| File | % time | seconds | usecs/call | calls | errors | syscall |
|---|---|---|---|---|---|---|
| nfs-0.4 | 21.92 | 0.012985 | 65 | 200 | | mkdir |
| nfs-0.4 | 20.23 | 0.011987 | 40 | 301 | | fsync |
| nfs-0.4 | 10.01 | 0.005930 | 1 | 4122 | 300 | stat |
| nfs-0.4 | 8.83 | 0.005230 | 1 | 4576 | 1884 | open |
| nfs-0.4 | 8.43 | 0.004993 | 17 | 300 | | chown |
| nfs-0.4 | 6.74 | 0.003994 | 20 | 200 | | rmdir |
| nfs-0.4 | 0.09 | 0.000056 | 0 | 2695 | | fstat |
| nfs-0.4 | 0.07 | 0.000040 | 0 | 800 | | lstat |
| nfs-0.5 | 39.20 | 0.001998 | 6 | 308 | | fstat |
| nfs-0.5 | 39.18 | 0.001997 | 18 | 114 | | fsync |
| nfs-0.5 | 12.14 | 0.000619 | 1 | 457 | 300 | stat |
| nfs-0.5 | 0.00 | 0.000000 | 0 | 30 | 1 | open |
| local-0.4 | 12.21 | 0.001997 | 7 | 301 | | fsync |
| local-0.4 | 5.66 | 0.000926 | 5 | 200 | | mkdir |
| local-0.4 | 1.78 | 0.000291 | 0 | 3365 | 1351 | open |
| local-0.4 | 1.74 | 0.000285 | 0 | 3525 | 300 | stat |
| local-0.4 | 1.52 | 0.000249 | 1 | 200 | | rmdir |
| local-0.4 | 0.42 | 0.000068 | 0 | 800 | | lstat |
| local-0.4 | 0.31 | 0.000050 | 0 | 300 | | chown |
| local-0.4 | 0.30 | 0.000049 | 0 | 2158 | | fstat |
| local-0.5 | 84.42 | 0.002997 | 29 | 102 | | fsync |
| local-0.5 | 2.08 | 0.000074 | 0 | 328 | 300 | stat |
| local-0.5 | 0.39 | 0.000014 | 1 | 20 | 1 | open |
| local-0.5 | 0.00 | 0.000000 | 0 | 318 | | fstat |

Is there an upgrade issue with this patch? Will existing jobs that have spool directories and are not using file transfer have those directories leaked?

In that case, yes, those directories will leak. For now, keep this in mind when upgrading running pools; I'll find a workaround.

*** Bug 523801 has been marked as a duplicate of this bug. ***

*** Bug 524644 has been marked as a duplicate of this bug. ***

The fix is in 7.4.0-0.5, and should be going upstream as well pending code review.

Release note added. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.
New Contents:
In cases that require no file transfer (e.g. in which the spool directory is on a shared filesystem or the local filesystem), the condor schedd will no longer delete the spool directory when the job completes or is removed (523806)

Release note updated.
Diffed Contents:
@@ -1 +1 @@
-In cases that require no file transfer (e.g. in which the spool directory is on a shared filesystem or the local filesystem), the condor schedd will no longer delete the spool directory when the job completes or is removed (523806)
+In cases that require no file transfer (e.g. in which the spool directory is on a shared filesystem or the local filesystem), the condor schedd will no longer maintain a job spool directory (523806)

Spool directories are no longer created when they are not needed. Tested with strace, both on a local filesystem and with a remote (NFS) filesystem, on RHEL 4.8/5.4, i386/x86_64, condor-7.4.1-0.5. Changing the status to VERIFIED.

Release note updated.
Diffed Contents:
@@ -1 +1,8 @@
-In cases that require no file transfer (e.g. in which the spool directory is on a shared filesystem or the local filesystem), the condor schedd will no longer maintain a job spool directory (523806)
+Grid bug fix
+
+C: In cases where no file transfer is required, the condor schedd will attempt to stat, chown, and delete the spool directory when the job completes or is removed.
+C: In environments using NFS spool directories, these operations run extremely slowly, and may also overload the NFS server.
+F: The schedd was changed, and no longer creates or deletes job spool directories.
+R: The bottleneck no longer occurs.
+
+In cases where no file transfer is required, the condor schedd will attempt to stat, chown, and delete the spool directory when the job completes or is removed. In environments using NFS spool directories, these operations run extremely slowly, and may also overload the NFS server. The schedd was changed, and no longer creates or deletes job spool directories, and the bottleneck no longer occurs.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHEA-2009-1633.html |