Bug 603663
| Summary: | Avoid - Negotiator crash because of "OUT OF FILE DESCRIPTORS" | | |
|---|---|---|---|
| Product: | Red Hat Enterprise MRG | Reporter: | Martin Kudlej <mkudlej> |
| Component: | condor | Assignee: | Erik Erlandson <eerlands> |
| Status: | CLOSED WONTFIX | QA Contact: | Tomas Rusnak <trusnak> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 1.0 | CC: | iboverma, matt, mhusnain, trusnak, tstclair |
| Target Milestone: | 2.0 | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | condor-7.5.6-0.1 | Doc Type: | Bug Fix |
| Doc Text: |
Release Note Entry:
Previously, the negotiator ran out of file descriptors and crashed when assigned a large number of jobs.
Workaround:
The user can edit the NEGOTIATOR.MAX_FILE_DESCRIPTORS value to a number that is larger than the expected number of jobs for the negotiation cycle. The recommended value for NEGOTIATOR.MAX_FILE_DESCRIPTORS is double the number of jobs per negotiation cycle.
|
| Last Closed: | 2011-05-31 16:27:38 UTC | Bug Blocks: | 696824 |
The OUT OF FILE DESCRIPTORS error from the Negotiator is not unexpected if it was attempting to notify 1024 slots of new claims. Avoiding this requires OS limit increases and Condor configuration changes, see manual.

As for the Schedd crash, there's no stack. Please check the memory usage since you were submitting 20 million jobs.

The Negotiator could avoid using so many FDs by limiting the number of MATCH notifications it does in parallel.

Addressing this as a documentation bz for 2.0: bug 696824

Neither Matt nor I have been able to repro this. Can you please attempt another repro on your side?

Retested on current condor:
$CondorVersion: 7.6.1 May 23 2011 BuildID: RH-7.6.1-0.6.el6 $
$CondorPlatform: X86_64-RedHat_6.1 $

# ulimit -n 1024

05/27/11 11:23:03 (pid:16561) Submission[hostname#247]::update(247.438, LastJobStatus, IDLE)
05/27/11 11:23:03 (pid:16561) proc count for hostname#247 is 99965
05/27/11 11:23:03 (pid:16561) Submission[hostname#247]::update(247.438, JobStatus, RUNNING)
05/27/11 11:23:03 (pid:16561) proc count for hostname#247 is 99965
05/27/11 11:23:16 (pid:16561) Transaction::Commit(): fsync_with_status() took 6 seconds to run
05/27/11 11:23:17 (pid:16561) condor_write(): Socket closed when trying to write 319 bytes to <IP:45002>, fd is 22
05/27/11 11:23:17 (pid:16561) Buf::write(): condor_write() failed
05/27/11 11:23:17 (pid:16561) SECMAN: Error sending response classad to <10.34.37.121:45002>!
Enact = "NO"
Subsystem = "STARTD"
CryptoMethods = "3DES,BLOWFISH"
NewSession = "YES"
ServerPid = 16562
AuthMethods = "FS,KERBEROS"
Encryption = "OPTIONAL"
ServerCommandSock = "<IP:58380>"
OutgoingNegotiation = "PREFERRED"
Integrity = "OPTIONAL"
ParentUniqueID = "host:16553:1306487211"
Command = 441
SessionDuration = "86400"
CurrentTime = time()
SessionLease = 3600
RemoteVersion = "$CondorVersion: 7.6.1 May 23 2011 BuildID: RH-7.6.1-0.6.el6 $"
Authentication = "OPTIONAL"
05/27/11 11:23:18 (pid:16561)
05/27/11 11:23:18 (pid:16561) Entered negotiate
05/27/11 11:23:18 (pid:16561) *** SwapSpace = 2147483647
05/27/11 11:23:18 (pid:16561) *** ReservedSwap = 0
05/27/11 11:23:18 (pid:16561) *** Shadow Size Estimate = 125
05/27/11 11:23:18 (pid:16561) *** Start Limit For Swap = 17179892
05/27/11 11:23:18 (pid:16561) Negotiating for owner: test@hostname
05/27/11 11:23:18 (pid:16561) AutoCluster:config(JobUniverse,LastCheckpointPlatform,NumCkpts) invoked
05/27/11 11:23:35 (pid:17152) Reading from /proc/cpuinfo
05/27/11 11:23:35 (pid:17152) Found: Physical-IDs:False; Core-IDs:False
05/27/11 11:23:35 (pid:17152) Using processor count: 4 processors, 4 CPUs, 0 HTs
05/27/11 11:23:35 (pid:17152) Reading condor configuration from '/etc/condor/condor_config'
05/27/11 11:23:35 (pid:17152) Setting maximum accepts per cycle 4.
05/27/11 11:23:35 (pid:17152) ******************************************************
05/27/11 11:23:35 (pid:17152) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
05/27/11 11:23:35 (pid:17152) ** /usr/sbin/condor_schedd
05/27/11 11:23:35 (pid:17152) ** SubsystemInfo: name=SCHEDD type=SCHEDD(5) class=DAEMON(1)
05/27/11 11:23:35 (pid:17152) ** Configuration: subsystem:SCHEDD local:<NONE> class:DAEMON
05/27/11 11:23:35 (pid:17152) ** $CondorVersion: 7.6.1 May 23 2011 BuildID: RH-7.6.1-0.6.el6 $
05/27/11 11:23:35 (pid:17152) ** $CondorPlatform: X86_64-RedHat_6.1 $
05/27/11 11:23:35 (pid:17152) ** PID = 17152
05/27/11 11:23:35 (pid:17152) ** Log last touched 5/27 11:23:18
05/27/11 11:23:35 (pid:17152) ******************************************************

Reproduced on RHEL6/x86_64

I'm closing this bug as WONTFIX -- it will be addressed by scale documentation (bug 696824).
(A possible future negotiator enhancement here: bug 691440)
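
The comments above tie the PANIC to descriptor limits: the retest ran with ulimit -n 1024, and the negotiator may have been notifying 1024 slots. A minimal sketch of a pre-flight check follows. The helper name fd_limit_sufficient and the figure 2048 are assumptions for illustration, not part of Condor. The limit seen by a daemon can also differ from the one in an interactive shell, so this only approximates what the negotiator gets.

```shell
#!/bin/sh
# Sketch only: fd_limit_sufficient is a hypothetical helper.

# Succeed if the descriptor limit covers NEEDED descriptors.
# LIMIT defaults to the current shell's soft limit from `ulimit -n`.
fd_limit_sufficient() {
    needed=$1
    limit=${2:-$(ulimit -n)}
    # `ulimit -n` may report "unlimited" instead of a number.
    [ "$limit" = "unlimited" ] && return 0
    [ "$limit" -ge "$needed" ]
}

# 2048 is an assumed requirement; 1024 is the limit from the retest above.
if fd_limit_sufficient 2048 1024; then
    echo "limit is sufficient"
else
    echo "limit too low: raise the OS limit before starting the daemons"
fi
```

Calling the helper with one argument compares against the live limit of the calling shell instead of a fixed number.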
Technical note added. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.
New Contents:
Release Note Entry:
Previously, the negotiator ran out of file descriptors and crashed when assigned a large number of jobs.
Workaround:
The user can edit the NEGOTIATOR.MAX_FILE_DESCRIPTORS value to a number that is larger than the expected number of jobs for the negotiation cycle. The recommended value for NEGOTIATOR.MAX_FILE_DESCRIPTORS is double the number of jobs per negotiation cycle.
Technical note can be viewed in the release notes for 2.0 at the documentation stage here: http://documentation-stage.bne.redhat.com/docs/en-US/Red_Hat_Enterprise_MRG/2.0/html-single/MRG_Release_Notes/index.html#tabl-MRG_Release_Notes-GRID_Update_Notes-RHM_Known_Issues
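
The workaround in the technical note recommends setting NEGOTIATOR.MAX_FILE_DESCRIPTORS to double the number of jobs per negotiation cycle. The arithmetic can be sketched as below. The helper names and the example figure of 1024 jobs are assumptions for illustration; only the setting name and the doubling rule come from the note. Which configuration file the resulting line belongs in depends on the local setup.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of Condor.

# Print twice the number of jobs expected in one negotiation cycle,
# following the "double the number of jobs" recommendation.
recommended_max_fds() {
    jobs_per_cycle=$1
    case $jobs_per_cycle in
        ''|*[!0-9]*) echo "usage: recommended_max_fds JOBS" >&2; return 1 ;;
    esac
    echo $((jobs_per_cycle * 2))
}

# Print the corresponding configuration line.
negotiator_fd_config_line() {
    fds=$(recommended_max_fds "$1") || return 1
    echo "NEGOTIATOR.MAX_FILE_DESCRIPTORS = $fds"
}

# Example: 1024 jobs per cycle is an assumed figure.
negotiator_fd_config_line 1024
```

The OS limit still has to allow the value chosen here, since the first comment notes that both OS limit increases and Condor configuration changes are needed.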
Created attachment 423774: log files and condor_config.local

Description of problem:
I tried to submit 100,000 jobs, and the Scheduler and Negotiator crashed.

Version-Release number of selected component (if applicable):
condor-7.4.3-0.16.el5

How reproducible:
100%

Steps to Reproduce:
1. set up NUM_CPUS=1024
2. service condor restart
3. submit jobs
for i in `seq 200`; do condor_submit job.sub >/dev/null 2>/dev/null; echo "$i";sleep 60;done

$ cat job.sub:
Universe=vanilla
Executable=/bin/sleep
Arguments=10
Queue 100000

Actual results:
Scheduler and Negotiator crash.

Expected results:
Scheduler and Negotiator don't crash.

Additional info:
$ cat dprintf_failure.NEGOTIATOR:
dprintf() had a fatal error in pid 27041
**** PANIC -- OUT OF FILE DESCRIPTORS at line 846 in dprintf.c
euid: 64, ruid: 0

SchedLog:
06/12 06:30:33 (pid:27042) Started shadow for job 1.235 on slot106@ <:55293> for xxx@, (shadow pid = 2349)
Stack dump for process 27042 at timestamp 1276338728 (0 frames)
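
When repeating the steps in this report, it may help to watch how many descriptors a daemon holds as the submit loop runs. A minimal sketch for Linux counts the entries under /proc/PID/fd. The helper name count_open_fds is hypothetical, and inspecting a process owned by another user normally requires sufficient privileges.

```shell
#!/bin/sh
# Sketch only: count_open_fds is a hypothetical helper.

# Print the number of open descriptors for PID by counting the entries
# in PROC_ROOT/PID/fd (PROC_ROOT defaults to /proc on Linux).
count_open_fds() {
    pid=$1
    proc_root=${2:-/proc}
    fd_dir="$proc_root/$pid/fd"
    [ -d "$fd_dir" ] || { echo "no such directory: $fd_dir" >&2; return 1; }
    ls "$fd_dir" | wc -l | tr -d ' '
}

# Example: descriptors held by the current shell.
count_open_fds "$$" || echo "could not inspect this shell"
```

Passing the negotiator's pid instead of the shell's, and comparing the result with the configured limit, would show how close the daemon is to the PANIC condition.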