Bug 523806 - No-op syscalls create schedd bottlenecks on shared file systems
Summary: No-op syscalls create schedd bottlenecks on shared file systems
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: 1.0
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: 1.2
Assignee: Will Benton
QA Contact: Luigi Toscano
URL:
Whiteboard:
Duplicates: 523801 524644
Depends On:
Blocks: 527551
 
Reported: 2009-09-16 17:54 UTC by Will Benton
Modified: 2018-10-27 16:03 UTC
CC: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Grid bug fix
Cause: In cases where no file transfer is required, the condor schedd will attempt to stat, chown, and delete the spool directory when the job completes or is removed.
Consequence: In environments using NFS spool directories, these operations run extremely slowly, and may also overload the NFS server.
Fix: The schedd no longer creates or deletes job spool directories in this case.
Result: The bottleneck no longer occurs.
Clone Of:
Environment:
Last Closed: 2009-12-03 09:18:32 UTC
Target Upstream Version:
Embargoed:


Attachments
Patch for spurious spool directory creation and deletion (1.04 KB, patch)
2009-09-19 14:17 UTC, Will Benton


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2009:1633 0 normal SHIPPED_LIVE Red Hat Enterprise MRG Messaging and Grid Version 1.2 2009-12-03 09:15:33 UTC

Description Will Benton 2009-09-16 17:54:12 UTC
Description of problem:

In cases that require no file transfer (e.g. in which the spool directory is on a shared filesystem or the local filesystem), the condor schedd will attempt to stat, chown, and delete the spool directory when the job completes or is removed.  This poses a problem in environments using NFS spool directories, since these operations are between one and two orders of magnitude slower on NFS (and may also unacceptably load the NFS server).

Version-Release number of selected component (if applicable):

7.x

How reproducible:

Submit a job with a spool directory on a shared filesystem.
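
A minimal submit description exercising this path might look like the following (file name, executable, and job count are illustrative; should_transfer_files is the knob named in comment 1 below):

# job.sub -- illustrative reproduction; any trivial executable will do
universe = vanilla
executable = /bin/true
should_transfer_files = NO
queue 100

$ condor_submit job.sub
$ condor_rm -all

With the schedd's SPOOL on NFS, watch the schedd under strace while the jobs are removed (as in comment 2 below).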
  
Actual results:

The condor schedd will (correctly) not create job spool directories, but will (incorrectly) attempt to delete them in this case.

Expected results:

The schedd should neither create nor delete job spool directories in this case.

Comment 1 Will Benton 2009-09-19 14:17:14 UTC
Created attachment 361766 [details]
Patch for spurious spool directory creation and deletion

This patch prevents the schedd from creating a job sandbox when running jobs without file transfer (as determined by "should_transfer_files = NO" in the job ad).
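
For reference, whether a queued job takes this path can be checked from its job ad; a quick look (the job id and output line here are illustrative) might be:

$ condor_q -long <cluster>.<proc> | grep -i ShouldTransferFiles
ShouldTransferFiles = "NO"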

Comment 2 Matthew Farrellee 2009-09-20 13:16:20 UTC
This patch is built in 7.4.0-0.5.

Tested with submission and then removal of 100 jobs (the -0.4 data may have let the jobs run to completion). Data gathered with strace -c -p `pidof condor_schedd`; the columns below are % time, seconds, usecs/call, calls, errors, and syscall.

$ grep -e mkdir -e fsync -e stat -e open -e chown -e rmdir nfs-* local-*

nfs-0.4: 21.92    0.012985          65       200           mkdir
nfs-0.4: 20.23    0.011987          40       301           fsync
nfs-0.4: 10.01    0.005930           1      4122       300 stat
nfs-0.4:  8.83    0.005230           1      4576      1884 open
nfs-0.4:  8.43    0.004993          17       300           chown
nfs-0.4:  6.74    0.003994          20       200           rmdir
nfs-0.4:  0.09    0.000056           0      2695           fstat
nfs-0.4:  0.07    0.000040           0       800           lstat

nfs-0.5: 39.20    0.001998           6       308           fstat
nfs-0.5: 39.18    0.001997          18       114           fsync
nfs-0.5: 12.14    0.000619           1       457       300 stat
nfs-0.5:  0.00    0.000000           0        30         1 open

local-0.4: 12.21    0.001997           7       301           fsync
local-0.4:  5.66    0.000926           5       200           mkdir
local-0.4:  1.78    0.000291           0      3365      1351 open
local-0.4:  1.74    0.000285           0      3525       300 stat
local-0.4:  1.52    0.000249           1       200           rmdir
local-0.4:  0.42    0.000068           0       800           lstat
local-0.4:  0.31    0.000050           0       300           chown
local-0.4:  0.30    0.000049           0      2158           fstat

local-0.5: 84.42    0.002997          29       102           fsync
local-0.5:  2.08    0.000074           0       328       300 stat
local-0.5:  0.39    0.000014           1        20         1 open
local-0.5:  0.00    0.000000           0       318           fstat
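
For anyone re-running this comparison, a sketch of one way to capture a per-run summary file (assuming the illustrative job.sub from the description and a freshly started schedd):

$ strace -c -o nfs-0.5 -p `pidof condor_schedd` &
$ condor_submit job.sub && condor_rm -all
$ kill %1   # strace should detach and write its -c summary to nfs-0.5

Note the pattern in the tables above: the mkdir, rmdir, and chown calls disappear entirely in the -0.5 runs, on both local and NFS spool.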

Comment 3 Matthew Farrellee 2009-09-20 13:17:31 UTC
Is there an upgrade issue with this patch? Will existing jobs that have spool directories and are not using file transfer have those directories leaked?

Comment 4 Will Benton 2009-09-21 19:54:31 UTC
In that case, yes, those directories will leak.  For now, keep this in mind when upgrading running pools; I'll find a workaround.
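
In the meantime, a manual cleanup sketch, assuming the 7.x spool layout of one cluster<C>.proc<P>.subproc<S> directory per job under SPOOL (an inferred naming convention; verify it against your pool, and cross-check condor_q, before deleting anything):

$ SPOOL=$(condor_config_val SPOOL)
$ ls -d "$SPOOL"/cluster*.proc*.subproc*   # inspect first; remove only dirs for jobs no longer queued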

Comment 5 Matthew Farrellee 2009-09-23 03:05:55 UTC
*** Bug 523801 has been marked as a duplicate of this bug. ***

Comment 6 Matthew Farrellee 2009-09-23 03:06:17 UTC
*** Bug 524644 has been marked as a duplicate of this bug. ***

Comment 7 Will Benton 2009-09-25 15:53:37 UTC
The fix is in 7.4.0-0.5, and should be going upstream as well pending code review.

Comment 10 Irina Boverman 2009-10-22 19:55:36 UTC
Release note added. If any revisions are required, please set the 
"requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

New Contents:
In cases that require no file transfer (e.g. in which the spool directory is on a shared filesystem or the local filesystem), the condor schedd will no longer delete the spool directory when the job completes or is removed (523806)

Comment 11 Matthew Farrellee 2009-10-22 20:02:09 UTC
Release note updated. If any revisions are required, please set the 
"requires_release_notes"  flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1 @@
-In cases that require no file transfer (e.g. in which the spool directory is on a shared filesystem or the local filesystem), the condor schedd will no longer delete the spool directory when the job completes or is removed (523806)
+In cases that require no file transfer (e.g. in which the spool directory is on a shared filesystem or the local filesystem), the condor schedd will no longer maintain a job spool directory (523806)

Comment 13 Luigi Toscano 2009-11-18 17:09:22 UTC
Spool directories are no longer created when they are not needed. Tested with strace, both on a local filesystem and with a remote (NFS) filesystem, on RHEL 4.8/5.4, i386/x86_64, condor-7.4.1-0.5.

Changing the status to VERIFIED.

Comment 14 Lana Brindley 2009-11-26 21:07:26 UTC
Release note updated. If any revisions are required, please set the 
"requires_release_notes"  flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1,8 @@
-In cases that require no file transfer (e.g. in which the spool directory is on a shared filesystem or the local filesystem), the condor schedd will no longer maintain a job spool directory (523806)
+Grid bug fix
+
+C: In cases where no file transfer is required, the condor schedd will attempt to stat, chown, and delete the spool directory when the job completes or is removed.
+C: In environments using NFS spool directories, these operations run extremely slowly, and may also overload the NFS server.
+F: The schedd was changed, and no longer creates or deletes job spool directories.  
+R: The bottleneck no longer occurs.
+
+In cases where no file transfer is required, the condor schedd will attempt to stat, chown, and delete the spool directory when the job completes or is removed. In environments using NFS spool directories, these operations run extremely slowly, and may also overload the NFS server. The schedd was changed, and no longer creates or deletes job spool directories, and the bottleneck no longer occurs.

Comment 16 errata-xmlrpc 2009-12-03 09:18:32 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-1633.html

