Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 603663

Summary:

Avoid - Negotiator crash because of "OUT OF FILE DESCRIPTORS"

Product:

Red Hat Enterprise MRG

Reporter:

Martin Kudlej <mkudlej>

Component:

condor

Assignee:

Erik Erlandson <eerlands>

Status:

CLOSED WONTFIX

QA Contact:

Tomas Rusnak <trusnak>

Severity:

high

Priority:

high

Version:

1.0

CC:

iboverma, matt, mhusnain, trusnak, tstclair

Target Milestone:

2.0

Hardware:

All

OS:

Linux

Fixed In Version:

condor-7.5.6-0.1

Doc Type:

Bug Fix

Doc Text:

Release Note Entry: Previously, the negotiator ran out of file descriptors and crashed when assigned a large number of jobs. Workaround: The user can edit the NEGOTIATOR.MAX_FILE_DESCRIPTORS value to a number that is larger than the expected number of jobs for the negotiation cycle. The recommended value for NEGOTIATOR.MAX_FILE_DESCRIPTORS is double the number of jobs per negotiation cycle.

Last Closed:

2011-05-31 16:27:38 UTC

Bug Blocks:

696824

Attachments:

Description	Flags
log files and condor_config.local	none

Description Martin Kudlej 2010-06-14 09:43:32 UTC

Created attachment 423774
log files and condor_config.local

Description of problem:
I tried to submit 100,000 jobs, and both the Scheduler and the Negotiator crashed.

Version-Release number of selected component (if applicable):
condor-7.4.3-0.16.el5

How reproducible:
100%

Steps to Reproduce:
1. Set NUM_CPUS=1024
2. service condor restart
3. Submit jobs:
for i in $(seq 200); do condor_submit job.sub >/dev/null 2>&1; echo "$i"; sleep 60; done

$ cat job.sub
Universe=vanilla
Executable=/bin/sleep
Arguments=10
Queue 100000


Actual results:
Scheduler and Negotiator crash.

Expected results:
Scheduler and Negotiator don't crash.

Additional info:
$ cat dprintf_failure.NEGOTIATOR
dprintf() had a fatal error in pid 27041
**** PANIC -- OUT OF FILE DESCRIPTORS at line 846 in dprintf.ceuid: 64, ruid: 0

SchedLog:
06/12 06:30:33 (pid:27042) Started shadow for job 1.235 on slot106@ <:55293> for xxx@, (shadow pid = 2349)
Stack dump for process 27042 at timestamp 1276338728 (0 frames)

Comment 1 Matthew Farrellee 2010-06-15 09:29:47 UTC

The OUT OF FILE DESCRIPTORS error from the Negotiator is not unexpected if it was attempting to notify 1024 slots of new claims. Avoiding this requires raising the OS limits and changing the Condor configuration; see the manual.

As for the Schedd crash, there's no stack. Please check the memory usage since you were submitting 20 million jobs.
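
For anyone retesting, one way to see how close a daemon is to its descriptor limit is to compare the per-process limit with the number of descriptors it currently holds. The following is a minimal sketch, assuming a Linux host with /proc mounted. The helper names open_fd_count and fd_soft_limit are made up for illustration and are not part of Condor. The last line just spells out the 20 million figure from the reproduction steps (200 submissions of 100,000 jobs each).

```shell
#!/bin/sh
# Hypothetical helpers, not part of Condor. Assumes Linux with /proc mounted.

# Print the number of file descriptors currently open in process $1.
open_fd_count() {
    ls "/proc/$1/fd" | wc -l
}

# Print the soft "Max open files" limit of process $1.
fd_soft_limit() {
    awk '/^Max open files/ { print $4 }' "/proc/$1/limits"
}

# Example: inspect the current shell. To inspect the negotiator, pass its
# PID instead (for instance the output of `pidof condor_negotiator`).
pid=$$
echo "pid $pid: $(open_fd_count "$pid") of $(fd_soft_limit "$pid") descriptors in use"

# Total jobs queued by the reproduction steps: 200 submissions of 100,000 jobs.
echo "total jobs submitted: $((200 * 100000))"
```

Reading /proc/PID/fd for a daemon owned by another user normally requires root.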

Comment 2 Matthew Farrellee 2010-06-17 17:20:38 UTC

The Negotiator could avoid using so many FDs by limiting the number of MATCH notifications it does in parallel.
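
The idea of capping parallel notifications can be illustrated outside Condor. The following is only a sketch of the general technique (batch-limited concurrency), not the Negotiator's code: notify_slot is a hypothetical stand-in for one MATCH notification, and MAX_PARALLEL is an assumed cap, not a Condor setting.

```shell
#!/bin/sh
# Illustrative only: notify_slot is a hypothetical stand-in for one MATCH
# notification, and MAX_PARALLEL is an assumed cap, not a Condor setting.

MAX_PARALLEL=4

notify_slot() {
    # A real notification would hold a socket (one FD) open while it runs.
    echo "notified $1"
}

# Notify every slot named in "$@", never running more than MAX_PARALLEL
# notifications at once: start a batch, wait for it, then start the next.
notify_all() {
    running=0
    for slot in "$@"; do
        notify_slot "$slot" &
        running=$((running + 1))
        if [ "$running" -ge "$MAX_PARALLEL" ]; then
            wait
            running=0
        fi
    done
    # Wait for the final, possibly partial, batch.
    wait
}

notify_all slot1 slot2 slot3 slot4 slot5 slot6 slot7 slot8 slot9 slot10
```

With a cap like this, the number of descriptors in use at any moment is bounded by the cap rather than by the number of slots being notified, at the cost of some added latency per cycle.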

Comment 6 Matthew Farrellee 2011-03-30 20:14:00 UTC

FYI https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=1253

Comment 7 Erik Erlandson 2011-04-14 23:47:11 UTC

Addressing this as a documentation bz for 2.0:
bug 696824

Comment 9 Erik Erlandson 2011-04-20 15:03:35 UTC

Neither Matt nor I have been able to repro this. Can you please attempt another repro on your side?

Comment 10 Tomas Rusnak 2011-05-27 09:32:24 UTC

Retested on current condor:

$CondorVersion: 7.6.1 May 23 2011 BuildID: RH-7.6.1-0.6.el6 $
$CondorPlatform: X86_64-RedHat_6.1 $

# ulimit -n
1024

05/27/11 11:23:03 (pid:16561) Submission[hostname#247]::update(247.438, LastJobStatus, IDLE)
05/27/11 11:23:03 (pid:16561) proc count for hostname#247 is 99965
05/27/11 11:23:03 (pid:16561) Submission[hostname#247]::update(247.438, JobStatus, RUNNING)
05/27/11 11:23:03 (pid:16561) proc count for hostname#247 is 99965
05/27/11 11:23:16 (pid:16561) Transaction::Commit(): fsync_with_status() took 6 seconds to run
05/27/11 11:23:17 (pid:16561) condor_write(): Socket closed when trying to write 319 bytes to <IP:45002>, fd is 22
05/27/11 11:23:17 (pid:16561) Buf::write(): condor_write() failed
05/27/11 11:23:17 (pid:16561) SECMAN: Error sending response classad to <10.34.37.121:45002>!
Enact = "NO"
Subsystem = "STARTD"
CryptoMethods = "3DES,BLOWFISH"
NewSession = "YES"
ServerPid = 16562
AuthMethods = "FS,KERBEROS"
Encryption = "OPTIONAL"
ServerCommandSock = "<IP:58380>"
OutgoingNegotiation = "PREFERRED"
Integrity = "OPTIONAL"
ParentUniqueID = "host:16553:1306487211"
Command = 441
SessionDuration = "86400"
CurrentTime = time()
SessionLease = 3600
RemoteVersion = "$CondorVersion: 7.6.1 May 23 2011 BuildID: RH-7.6.1-0.6.el6 $"
Authentication = "OPTIONAL"
05/27/11 11:23:18 (pid:16561)
05/27/11 11:23:18 (pid:16561) Entered negotiate
05/27/11 11:23:18 (pid:16561) *** SwapSpace = 2147483647
05/27/11 11:23:18 (pid:16561) *** ReservedSwap = 0
05/27/11 11:23:18 (pid:16561) *** Shadow Size Estimate = 125
05/27/11 11:23:18 (pid:16561) *** Start Limit For Swap = 17179892
05/27/11 11:23:18 (pid:16561) Negotiating for owner: test@hostname
05/27/11 11:23:18 (pid:16561) AutoCluster:config(JobUniverse,LastCheckpointPlatform,NumCkpts) invoked
05/27/11 11:23:35 (pid:17152) Reading from /proc/cpuinfo
05/27/11 11:23:35 (pid:17152) Found: Physical-IDs:False; Core-IDs:False
05/27/11 11:23:35 (pid:17152) Using processor count: 4 processors, 4 CPUs, 0 HTs
05/27/11 11:23:35 (pid:17152) Reading condor configuration from '/etc/condor/condor_config'
05/27/11 11:23:35 (pid:17152) Setting maximum accepts per cycle 4.
05/27/11 11:23:35 (pid:17152) ******************************************************
05/27/11 11:23:35 (pid:17152) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
05/27/11 11:23:35 (pid:17152) ** /usr/sbin/condor_schedd
05/27/11 11:23:35 (pid:17152) ** SubsystemInfo: name=SCHEDD type=SCHEDD(5) class=DAEMON(1)
05/27/11 11:23:35 (pid:17152) ** Configuration: subsystem:SCHEDD local:<NONE> class:DAEMON
05/27/11 11:23:35 (pid:17152) ** $CondorVersion: 7.6.1 May 23 2011 BuildID: RH-7.6.1-0.6.el6 $
05/27/11 11:23:35 (pid:17152) ** $CondorPlatform: X86_64-RedHat_6.1 $
05/27/11 11:23:35 (pid:17152) ** PID = 17152
05/27/11 11:23:35 (pid:17152) ** Log last touched 5/27 11:23:18
05/27/11 11:23:35 (pid:17152) ******************************************************

Reproduced on RHEL6/x86_64

Comment 12 Erik Erlandson 2011-05-31 16:27:38 UTC

I'm closing this bug as WONTFIX -- it will be addressed by scale documentation (bug 696824).  (A possible future negotiator enhancement here: bug 691440)

Comment 13 Misha H. Ali 2011-05-31 23:43:58 UTC

Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:

Release Note Entry:

Previously, the negotiator ran out of file descriptors and crashed when assigned a large number of jobs.

Workaround:

The user can edit the NEGOTIATOR.MAX_FILE_DESCRIPTORS value to a number that is larger than the expected number of jobs for the negotiation cycle. The recommended value for NEGOTIATOR.MAX_FILE_DESCRIPTORS is double the number of jobs per negotiation cycle.
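
As a rough illustration of the sizing rule in the workaround, the sketch below computes the recommended value and prints the matching configuration line. The figure of 1024 jobs per negotiation cycle is an assumption chosen to match NUM_CPUS=1024 from the reproduction steps, and the helper name recommended_max_fds is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: applies the "double the number of jobs per
# negotiation cycle" rule from the technical note.
recommended_max_fds() {
    echo $(( $1 * 2 ))
}

# Assumed figure for illustration: 1024 jobs per negotiation cycle.
jobs_per_cycle=1024

# Print the line one would add to condor_config.local.
echo "NEGOTIATOR.MAX_FILE_DESCRIPTORS = $(recommended_max_fds "$jobs_per_cycle")"
```

For the assumed 1024 jobs this prints NEGOTIATOR.MAX_FILE_DESCRIPTORS = 2048. As noted in comment 1, the OS limits also have to be raised for a larger value to take effect.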

Comment 14 Misha H. Ali 2011-06-06 03:30:07 UTC

Technical note can be viewed in the release notes for 2.0 at the documentation stage here:

http://documentation-stage.bne.redhat.com/docs/en-US/Red_Hat_Enterprise_MRG/2.0/html-single/MRG_Release_Notes/index.html#tabl-MRG_Release_Notes-GRID_Update_Notes-RHM_Known_Issues