Bug 523806 - No-op syscalls create schedd bottlenecks on shared file systems
Summary: No-op syscalls create schedd bottlenecks on shared file systems
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: 1.0
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: 1.2
Assignee: Will Benton
QA Contact: Luigi Toscano
URL:
Whiteboard:
Duplicates: 523801 524644
Depends On:
Blocks: 527551
 
Reported: 2009-09-16 17:54 UTC by Will Benton
Modified: 2018-10-27 16:03 UTC
CC: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Grid bug fix
Cause: In cases where no file transfer is required, the condor schedd will attempt to stat, chown, and delete the spool directory when the job completes or is removed.
Consequence: In environments using NFS spool directories, these operations run extremely slowly, and may also overload the NFS server.
Fix: The schedd no longer creates or deletes job spool directories in this case.
Result: The bottleneck no longer occurs.
Clone Of:
Environment:
Last Closed: 2009-12-03 09:18:32 UTC
Target Upstream Version:
Embargoed:


Attachments
Patch for spurious spool directory creation and deletion (1.04 KB, patch)
2009-09-19 14:17 UTC, Will Benton


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2009:1633 0 normal SHIPPED_LIVE Red Hat Enterprise MRG Messaging and Grid Version 1.2 2009-12-03 09:15:33 UTC

Description Will Benton 2009-09-16 17:54:12 UTC
Description of problem:

In cases that require no file transfer (e.g. in which the spool directory is on a shared filesystem or the local filesystem), the condor schedd will attempt to stat, chown, and delete the spool directory when the job completes or is removed.  This poses a problem in environments using NFS spool directories, since these operations are between one and two orders of magnitude slower on NFS (and may also unacceptably load the NFS server).

Version-Release number of selected component (if applicable):

7.x

How reproducible:

Submit a job with a spool directory on a shared filesystem.
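
A minimal submit description exercising this path might look like the following (file name, executable, and job count are illustrative; should_transfer_files is the knob named in comment 1 below):

# job.sub -- illustrative reproduction; any trivial executable will do
universe = vanilla
executable = /bin/true
should_transfer_files = NO
queue 100

$ condor_submit job.sub
$ condor_rm -all

With the schedd's SPOOL on NFS, watch the schedd under strace while the jobs are removed (as in comment 2 below).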
  
Actual results:

The condor schedd will (correctly) not create job spool directories, but will (incorrectly) attempt to delete them in this case.

Expected results:

The schedd should neither create nor delete job spool directories in this case.

Comment 1 Will Benton 2009-09-19 14:17:14 UTC
Created attachment 361766 [details]
Patch for spurious spool directory creation and deletion

This patch prevents the schedd from creating a job sandbox when running jobs without file transfer (as determined by "should_transfer_files = NO" in the job ad).
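
For reference, whether a queued job takes this path can be checked from its job ad; a quick look (the job id and output line here are illustrative) might be:

$ condor_q -long <cluster>.<proc> | grep -i ShouldTransferFiles
ShouldTransferFiles = "NO"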

Comment 2 Matthew Farrellee 2009-09-20 13:16:20 UTC
This patch is built in 7.4.0-0.5.

Tested with submission and then removal of 100 jobs (the -0.4 data may have let the jobs run to completion). Data gathered with strace -c -p `pidof condor_schedd`; the columns below are % time, seconds, usecs/call, calls, errors, and syscall.

$ grep -e mkdir -e fsync -e stat -e open -e chown -e rmdir nfs-* local-*

nfs-0.4: 21.92    0.012985          65       200           mkdir
nfs-0.4: 20.23    0.011987          40       301           fsync
nfs-0.4: 10.01    0.005930           1      4122       300 stat
nfs-0.4:  8.83    0.005230           1      4576      1884 open
nfs-0.4:  8.43    0.004993          17       300           chown
nfs-0.4:  6.74    0.003994          20       200           rmdir
nfs-0.4:  0.09    0.000056           0      2695           fstat
nfs-0.4:  0.07    0.000040           0       800           lstat

nfs-0.5: 39.20    0.001998           6       308           fstat
nfs-0.5: 39.18    0.001997          18       114           fsync
nfs-0.5: 12.14    0.000619           1       457       300 stat
nfs-0.5:  0.00    0.000000           0        30         1 open

local-0.4: 12.21    0.001997           7       301           fsync
local-0.4:  5.66    0.000926           5       200           mkdir
local-0.4:  1.78    0.000291           0      3365      1351 open
local-0.4:  1.74    0.000285           0      3525       300 stat
local-0.4:  1.52    0.000249           1       200           rmdir
local-0.4:  0.42    0.000068           0       800           lstat
local-0.4:  0.31    0.000050           0       300           chown
local-0.4:  0.30    0.000049           0      2158           fstat

local-0.5: 84.42    0.002997          29       102           fsync
local-0.5:  2.08    0.000074           0       328       300 stat
local-0.5:  0.39    0.000014           1        20         1 open
local-0.5:  0.00    0.000000           0       318           fstat
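
For anyone re-running this comparison, a sketch of one way to capture a per-run summary file (assuming the illustrative job.sub from the description and a freshly started schedd):

$ strace -c -o nfs-0.5 -p `pidof condor_schedd` &
$ condor_submit job.sub && condor_rm -all
$ kill %1   # strace should detach and write its -c summary to nfs-0.5

Note the pattern in the tables above: the mkdir, rmdir, and chown calls disappear entirely in the -0.5 runs, on both local and NFS spool.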

Comment 3 Matthew Farrellee 2009-09-20 13:17:31 UTC
Is there an upgrade issue with this patch? Will existing jobs that have spool directories and are not using file transfer have those directories leaked?

Comment 4 Will Benton 2009-09-21 19:54:31 UTC
In that case, yes, those directories will leak.  For now, keep this in mind when upgrading running pools; I'll find a workaround.
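
In the meantime, a manual cleanup sketch, assuming the 7.x spool layout of one cluster<C>.proc<P>.subproc<S> directory per job under SPOOL (an inferred naming convention; verify it against your pool, and cross-check condor_q, before deleting anything):

$ SPOOL=$(condor_config_val SPOOL)
$ ls -d "$SPOOL"/cluster*.proc*.subproc*   # inspect first; remove only dirs for jobs no longer queued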

Comment 5 Matthew Farrellee 2009-09-23 03:05:55 UTC
*** Bug 523801 has been marked as a duplicate of this bug. ***

Comment 6 Matthew Farrellee 2009-09-23 03:06:17 UTC
*** Bug 524644 has been marked as a duplicate of this bug. ***

Comment 7 Will Benton 2009-09-25 15:53:37 UTC
The fix is in 7.4.0-0.5, and should be going upstream as well pending code review.

Comment 10 Irina Boverman 2009-10-22 19:55:36 UTC
Release note added. If any revisions are required, please set the 
"requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

New Contents:
In cases that require no file transfer (e.g. in which the spool directory is on a shared filesystem or the local filesystem), the condor schedd will no longer delete the spool directory when the job completes or is removed (523806)

Comment 11 Matthew Farrellee 2009-10-22 20:02:09 UTC
Release note updated. If any revisions are required, please set the 
"requires_release_notes"  flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1 @@
-In cases that require no file transfer (e.g. in which the spool directory is on a shared filesystem or the local filesystem), the condor schedd will no longer delete the spool directory when the job completes or is removed (523806)
+In cases that require no file transfer (e.g. in which the spool directory is on a shared filesystem or the local filesystem), the condor schedd will no longer maintain a job spool directory (523806)

Comment 13 Luigi Toscano 2009-11-18 17:09:22 UTC
Spool directories are no longer created when they are not needed. Tested with strace, both on a local filesystem and with a remote (NFS) filesystem, on RHEL 4.8/5.4, i386/x86_64, condor-7.4.1-0.5.

Changing the status to VERIFIED.

Comment 14 Lana Brindley 2009-11-26 21:07:26 UTC
Release note updated. If any revisions are required, please set the 
"requires_release_notes"  flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1,8 @@
-In cases that require no file transfer (e.g. in which the spool directory is on a shared filesystem or the local filesystem), the condor schedd will no longer maintain a job spool directory (523806)
+Grid bug fix
+
+C: In cases where no file transfer is required, the condor schedd will attempt to stat, chown, and delete the spool directory when the job completes or is removed.
+C: In environments using NFS spool directories, these operations run extremely slowly, and may also overload the NFS server.
+F: The schedd was changed, and no longer creates or deletes job spool directories.  
+R: The bottleneck no longer occurs.
+
+In cases where no file transfer is required, the condor schedd will attempt to stat, chown, and delete the spool directory when the job completes or is removed. In environments using NFS spool directories, these operations run extremely slowly, and may also overload the NFS server. The schedd was changed, and no longer creates or deletes job spool directories, and the bottleneck no longer occurs.

Comment 16 errata-xmlrpc 2009-12-03 09:18:32 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-1633.html

