Bug 603663 - Avoid - Negotiator crash because of "OUT OF FILE DESCRIPTORS"
Status: CLOSED WONTFIX
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: 1.0
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: 2.0
Target Release: ---
Assigned To: Erik Erlandson
QA Contact: Tomas Rusnak
Depends On:
Blocks: 696824
 
Reported: 2010-06-14 05:43 EDT by Martin Kudlej
Modified: 2011-06-05 23:30 EDT (History)

See Also:
Fixed In Version: condor-7.5.6-0.1
Doc Type: Bug Fix
Doc Text:
Release Note Entry: Previously, the negotiator ran out of file descriptors and crashed when assigned a large number of jobs. Workaround: The user can edit the NEGOTIATOR.MAX_FILE_DESCRIPTORS value to a number that is larger than the expected number of jobs for the negotiation cycle. The recommended value for NEGOTIATOR.MAX_FILE_DESCRIPTORS is double the number of jobs per negotiation cycle.
Last Closed: 2011-05-31 12:27:38 EDT


Attachments
log files and condor_config.local (646.88 KB, application/x-gzip)
2010-06-14 05:43 EDT, Martin Kudlej

Description Martin Kudlej 2010-06-14 05:43:32 EDT
Created attachment 423774
log files and condor_config.local

Description of problem:
I tried to submit 100,000 jobs, and both the Scheduler and the Negotiator crashed.

Version-Release number of selected component (if applicable):
condor-7.4.3-0.16.el5

How reproducible:
100%

Steps to Reproduce:
1. Set NUM_CPUS=1024.
2. Run: service condor restart
3. Submit jobs:
for i in $(seq 200); do condor_submit job.sub >/dev/null 2>&1; echo "$i"; sleep 60; done

$ cat job.sub
Universe=vanilla
Executable=/bin/sleep
Arguments=10
Queue 100000


Actual results:
Scheduler and Negotiator crash.

Expected results:
Scheduler and Negotiator don't crash.

Additional info:
$ cat dprintf_failure.NEGOTIATOR
dprintf() had a fatal error in pid 27041
**** PANIC -- OUT OF FILE DESCRIPTORS at line 846 in dprintf.c
euid: 64, ruid: 0

SchedLog:
06/12 06:30:33 (pid:27042) Started shadow for job 1.235 on slot106@ <:55293> for xxx@, (shadow pid = 2349)
Stack dump for process 27042 at timestamp 1276338728 (0 frames)
Comment 1 Matthew Farrellee 2010-06-15 05:29:47 EDT
The OUT OF FILE DESCRIPTORS error from the Negotiator is not unexpected if it was attempting to notify 1024 slots of new claims. Avoiding this requires raising the OS limits and changing the Condor configuration; see the manual.
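
As a rough illustration of the limit involved, the sketch below reads the per-process open-file limit and checks whether 1024 simultaneous notifications would fit under it. It is not Condor code. It assumes each in-flight notification holds exactly one descriptor, and the `reserved` allowance for log files and other sockets is a hypothetical figure chosen only for the example.

```python
import resource


def can_notify_in_parallel(slots: int, soft_limit: int, reserved: int = 32) -> bool:
    """Return True if one descriptor per slot, plus `reserved`, fits under the limit.

    Assumption: one open descriptor per in-flight notification.
    `reserved` is a hypothetical allowance for logs and other sockets.
    """
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return slots + reserved <= soft_limit


# Soft and hard limits on open file descriptors for this process (Unix only).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft={soft} hard={hard}")
print("1024 slots fit:", can_notify_in_parallel(1024, soft))
```

With a soft limit of 1024, the check fails for 1024 slots because the process already needs descriptors for its own files.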

As for the Schedd crash, there's no stack. Please check the memory usage since you were submitting 20 million jobs.
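
The 20 million figure follows from the reproduction steps rather than from the 100,000 quoted in the description: the loop runs 200 submissions, and each submit file queues 100,000 jobs. A quick check of the arithmetic, using only numbers from the report:

```python
SUBMISSIONS = 200              # iterations of the submit loop
JOBS_PER_SUBMISSION = 100_000  # "Queue 100000" in job.sub
SLEEP_SECONDS = 60             # pause between submissions

total_jobs = SUBMISSIONS * JOBS_PER_SUBMISSION
min_minutes = SUBMISSIONS * SLEEP_SECONDS / 60  # lower bound, ignoring submit time

print(total_jobs)   # 20000000
print(min_minutes)  # 200.0
```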
Comment 2 Matthew Farrellee 2010-06-17 13:20:38 EDT
The Negotiator could avoid using so many FDs by limiting the number of MATCH notifications it does in parallel.
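
The idea of capping parallel notifications can be sketched generically as follows. This is an illustration of bounded concurrency, not the Negotiator's implementation: `notify_all`, `fake_notify` and the cap of 64 are hypothetical names and values, and the stand-in notifier only sleeps where a real one would hold a socket open.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def notify_all(slots, notify, max_in_flight):
    """Call notify(slot) for every slot with at most max_in_flight calls running at once."""
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        return list(pool.map(notify, slots))


# Stand-in notifier that records the peak number of concurrent calls.
lock = threading.Lock()
in_flight = 0
peak = 0


def fake_notify(slot):
    global in_flight, peak
    with lock:
        in_flight += 1
        peak = max(peak, in_flight)
    time.sleep(0.001)  # a real notifier would hold one descriptor open here
    with lock:
        in_flight -= 1
    return slot


results = notify_all(range(1024), fake_notify, max_in_flight=64)
print(len(results), "notifications sent, peak concurrency", peak)
```

Because the pool never runs more than `max_in_flight` workers, the number of descriptors held for notifications is bounded by that cap regardless of how many slots are matched.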
Comment 6 Matthew Farrellee 2011-03-30 16:14:00 EDT
FYI https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=1253
Comment 7 Erik Erlandson 2011-04-14 19:47:11 EDT
Addressing this as a documentation bz for 2.0:
bug 696824
Comment 9 Erik Erlandson 2011-04-20 11:03:35 EDT
Neither Matt nor I have been able to reproduce this. Can you please attempt another reproduction on your side?
Comment 10 Tomas Rusnak 2011-05-27 05:32:24 EDT
Retested on current condor:

$CondorVersion: 7.6.1 May 23 2011 BuildID: RH-7.6.1-0.6.el6 $
$CondorPlatform: X86_64-RedHat_6.1 $

# ulimit -n
1024

05/27/11 11:23:03 (pid:16561) Submission[hostname#247]::update(247.438, LastJobStatus, IDLE)
05/27/11 11:23:03 (pid:16561) proc count for hostname#247 is 99965
05/27/11 11:23:03 (pid:16561) Submission[hostname#247]::update(247.438, JobStatus, RUNNING)
05/27/11 11:23:03 (pid:16561) proc count for hostname#247 is 99965
05/27/11 11:23:16 (pid:16561) Transaction::Commit(): fsync_with_status() took 6 seconds to run
05/27/11 11:23:17 (pid:16561) condor_write(): Socket closed when trying to write 319 bytes to <IP:45002>, fd is 22
05/27/11 11:23:17 (pid:16561) Buf::write(): condor_write() failed
05/27/11 11:23:17 (pid:16561) SECMAN: Error sending response classad to <10.34.37.121:45002>!
Enact = "NO"
Subsystem = "STARTD"
CryptoMethods = "3DES,BLOWFISH"
NewSession = "YES"
ServerPid = 16562
AuthMethods = "FS,KERBEROS"
Encryption = "OPTIONAL"
ServerCommandSock = "<IP:58380>"
OutgoingNegotiation = "PREFERRED"
Integrity = "OPTIONAL"
ParentUniqueID = "host:16553:1306487211"
Command = 441
SessionDuration = "86400"
CurrentTime = time()
SessionLease = 3600
RemoteVersion = "$CondorVersion: 7.6.1 May 23 2011 BuildID: RH-7.6.1-0.6.el6 $"
Authentication = "OPTIONAL"
05/27/11 11:23:18 (pid:16561)
05/27/11 11:23:18 (pid:16561) Entered negotiate
05/27/11 11:23:18 (pid:16561) *** SwapSpace = 2147483647
05/27/11 11:23:18 (pid:16561) *** ReservedSwap = 0
05/27/11 11:23:18 (pid:16561) *** Shadow Size Estimate = 125
05/27/11 11:23:18 (pid:16561) *** Start Limit For Swap = 17179892
05/27/11 11:23:18 (pid:16561) Negotiating for owner: test@hostname
05/27/11 11:23:18 (pid:16561) AutoCluster:config(JobUniverse,LastCheckpointPlatform,NumCkpts) invoked
05/27/11 11:23:35 (pid:17152) Reading from /proc/cpuinfo
05/27/11 11:23:35 (pid:17152) Found: Physical-IDs:False; Core-IDs:False
05/27/11 11:23:35 (pid:17152) Using processor count: 4 processors, 4 CPUs, 0 HTs
05/27/11 11:23:35 (pid:17152) Reading condor configuration from '/etc/condor/condor_config'
05/27/11 11:23:35 (pid:17152) Setting maximum accepts per cycle 4.
05/27/11 11:23:35 (pid:17152) ******************************************************
05/27/11 11:23:35 (pid:17152) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
05/27/11 11:23:35 (pid:17152) ** /usr/sbin/condor_schedd
05/27/11 11:23:35 (pid:17152) ** SubsystemInfo: name=SCHEDD type=SCHEDD(5) class=DAEMON(1)
05/27/11 11:23:35 (pid:17152) ** Configuration: subsystem:SCHEDD local:<NONE> class:DAEMON
05/27/11 11:23:35 (pid:17152) ** $CondorVersion: 7.6.1 May 23 2011 BuildID: RH-7.6.1-0.6.el6 $
05/27/11 11:23:35 (pid:17152) ** $CondorPlatform: X86_64-RedHat_6.1 $
05/27/11 11:23:35 (pid:17152) ** PID = 17152
05/27/11 11:23:35 (pid:17152) ** Log last touched 5/27 11:23:18
05/27/11 11:23:35 (pid:17152) ******************************************************

Reproduced on RHEL6/x86_64
Comment 12 Erik Erlandson 2011-05-31 12:27:38 EDT
I'm closing this bug as WONTFIX -- it will be addressed by scale documentation (bug 696824).  (A possible future negotiator enhancement here: bug 691440)
Comment 13 Misha H. Ali 2011-05-31 19:43:58 EDT
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Release Note Entry:

Previously, the negotiator ran out of file descriptors and crashed when assigned a large number of jobs. 

Workaround:

The user can edit the NEGOTIATOR.MAX_FILE_DESCRIPTORS value to a number that is larger than the expected number of jobs for the negotiation cycle. The recommended value for NEGOTIATOR.MAX_FILE_DESCRIPTORS is double the number of jobs per negotiation cycle.
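
A minimal sketch of the sizing rule in this note, assuming the usual `NAME = value` form of Condor configuration lines. The helper names and the figure of 1024 jobs per cycle are hypothetical, and the OS limits mentioned in Comment 1 may also need to be raised for the setting to take effect.

```python
def recommended_max_fds(jobs_per_cycle: int) -> int:
    """Apply the release-note rule: double the number of jobs per negotiation cycle."""
    if jobs_per_cycle < 0:
        raise ValueError("jobs_per_cycle must be non-negative")
    return 2 * jobs_per_cycle


def config_line(jobs_per_cycle: int) -> str:
    """Format the setting as a configuration line."""
    return f"NEGOTIATOR.MAX_FILE_DESCRIPTORS = {recommended_max_fds(jobs_per_cycle)}"


# 1024 is an example value, not a measurement from this bug.
print(config_line(1024))  # NEGOTIATOR.MAX_FILE_DESCRIPTORS = 2048
```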
Comment 14 Misha H. Ali 2011-06-05 23:30:07 EDT
Technical note can be viewed in the release notes for 2.0 at the documentation stage here:

http://documentation-stage.bne.redhat.com/docs/en-US/Red_Hat_Enterprise_MRG/2.0/html-single/MRG_Release_Notes/index.html#tabl-MRG_Release_Notes-GRID_Update_Notes-RHM_Known_Issues
