Bug 627015

Summary: [RFE] NFS userlog locking hangs condor_schedd
Product: Red Hat Enterprise MRG
Reporter: Jon Thomas <jthomas>
Component: condor
Assignee: Matthew Farrellee <matt>
Status: CLOSED ERRATA
QA Contact: Tomas Rusnak <trusnak>
Severity: high
Priority: high
Version: 1.2
CC: matt, spurrier, trusnak
Target Milestone: 1.3.2
Keywords: FutureFeature
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version: condor-7.4.5-0.4
Doc Type: Enhancement
Doc Text:
C: Presumed NFS locking issues. C: Schedd hangs. F: Backported a feature from the 7.5 series so that files on NFS are not locked; a local file is locked instead. R: Hopefully hangs will cease.
Last Closed: 2011-02-15 12:12:31 UTC
Attachments:
upstream patch against condor-7_4_4-0_4_el5 (flags: none)

Comment 3 Matthew Farrellee 2010-08-25 12:20:57 UTC
From Jon Thomas -

condor_schedd processes hang in a state where they appear to continue to
respond to DC_DAEMON_ALIVE's to condor_master (and so aren't restarted in a
timely manner), but they have changed their ownership to a user and are stuck
on an fcntl() call against a UserLog.  We had three of these occur during the
night last night on host "bur".

08/10 17:11:57 abort_job_myself: 138850.0 action:Hold log_hold:true notify:true
08/10 17:11:57 Writing record to user
logfile=/nfs/log/xyz/jobHistory.log owner=gridmonkey
08/10 17:11:57 FileLock object is updating timestamp on:
/nfs/log/xyz/jobHistory.log
Stack dump for process 18089 at timestamp 1281486087 (18 frames)
condor_schedd(dprintf_dump_stack+0x4e)[0x54846e]
condor_schedd[0x54a242]
/lib64/libpthread.so.0[0x3dd7a0eb10]
/lib64/libpthread.so.0(__fcntl+0x4f)[0x3dd7a0da6f]
condor_schedd(lock_file_plain+0x5b)[0x54b06b]
condor_schedd(lock_file+0x1f)[0x54b0ef]
condor_schedd(_ZN8FileLock6obtainE9LOCK_TYPE+0x6c)[0x554e7c]
condor_schedd(_ZN12WriteUserLog12doWriteEventEP9ULogEventbbP7ClassAd+0x6a)[0x56aaaa]
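
For context, the call at the top of that stack is a plain blocking fcntl()
file lock against the NFS-hosted UserLog. Below is a minimal standalone
sketch of that kind of call - not Condor's actual lock_file_plain(), and the
path is just the hypothetical one from the log excerpt above. With F_SETLKW
the call simply never returns if the NFS server or its lock manager stops
responding, which matches the observed hang.

// Minimal sketch of a blocking whole-file write lock, as used on the UserLog.
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

// With F_SETLKW the call waits until the lock is granted; against a dead NFS
// lock manager, "waits" can mean forever, wedging a single-threaded daemon.
static bool lock_whole_file_blocking(int fd)
{
    struct flock fl = {};
    fl.l_type   = F_WRLCK;   // exclusive (write) lock
    fl.l_whence = SEEK_SET;
    fl.l_start  = 0;
    fl.l_len    = 0;         // length 0 means lock the entire file
    return fcntl(fd, F_SETLKW, &fl) == 0;
}

int main()
{
    // Hypothetical NFS path, mirroring the UserLog in the log excerpt above.
    const char *path = "/nfs/log/xyz/jobHistory.log";
    int fd = open(path, O_WRONLY | O_APPEND);
    if (fd < 0) { perror("open"); return 1; }
    if (!lock_whole_file_blocking(fd))   // may never return on a hung NFS mount
        perror("fcntl(F_SETLKW)");
    close(fd);
    return 0;
}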

https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=1310

Comment 4 Matthew Farrellee 2010-08-25 12:25:01 UTC
Something seems strange here. The Schedd is single-threaded; if it is hung on a lock, it should not be able to continue to send DC_CHILDALIVE's to the Master.

Was it hung for longer than an hour, which is the default NOT_RESPONDING_TIMEOUT?

Comment 6 Matthew Farrellee 2010-08-30 00:01:24 UTC
We can possibly improve the situation with:

https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=1310

Comment 11 Matthew Farrellee 2010-12-16 21:34:32 UTC
Candidate code is currently at V7_4-BZ627015-GT1310-local-locking-backport

Comment 13 Matthew Farrellee 2010-12-20 21:27:07 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
C: Presumed NFS locking issues.
C: Schedd hangs.
F: Backported a feature from the 7.5 series so that files on NFS are not locked; a local file is locked instead.
R: Hopefully hangs will cease.
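
To illustrate the idea only (this is a rough sketch, not the actual
backported code): the lock is taken on a file under a local directory whose
path is derived from the user log, instead of on the NFS-hosted log itself.
The bucketed numeric subdirectories below mirror the paths seen later in the
verification logs (e.g. /var/lock/condor/local//23/20/518989.lockc); the
function name local_lock_path and the hashing scheme are hypothetical.

// Sketch: map a user log path to a lock file on local disk.
#include <cstdio>
#include <string>
#include <functional>

static std::string local_lock_path(const std::string &userlog_path)
{
    const std::string base = "/var/lock/condor/local";
    size_t h = std::hash<std::string>{}(userlog_path);  // illustrative hash
    unsigned d1 = h % 100;          // first bucket directory
    unsigned d2 = (h / 100) % 100;  // second bucket directory
    char buf[256];
    snprintf(buf, sizeof(buf), "%s/%u/%u/%zu.lockc",
             base.c_str(), d1, d2, h % 1000000);
    return buf;
}

int main()
{
    printf("%s\n", local_lock_path("/mnt/qa/nfstest/nfstest.log").c_str());
    return 0;
}

Because the lock file lives on local disk, an unresponsive NFS lock manager
can no longer wedge the fcntl() call while the user log itself stays on NFS.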

Comment 14 Matthew Farrellee 2010-12-20 21:28:23 UTC
Packaging this requires the introduction of /var/lock/condor/local, a /tmp-like directory, in the default configuration and in the RPM.
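
A sketch of what "/tmp-like" is assumed to mean here: world-writable with the
sticky bit set (mode 1777), so shadows running as different job owners can
each create lock files but cannot remove one another's. The actual ownership
and permissions are set by the RPM and the default configuration, not by code
like this.

// Sketch: create the local lock directory with /tmp-like permissions.
#include <sys/stat.h>
#include <cstdio>

int main()
{
    const char *dir = "/var/lock/condor/local";
    if (mkdir(dir, 01777) != 0)   // sticky bit + world-writable (assumed mode)
        perror("mkdir");
    // mkdir() honors the umask, so force the intended mode explicitly.
    if (chmod(dir, 01777) != 0)
        perror("chmod");
    return 0;
}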

Comment 18 Tomas Rusnak 2011-01-20 14:45:26 UTC
Retested with the current condor on all supported platforms (x86/x86_64) on RHEL4/RHEL5:

condor-7.4.5-0.6

# mount | grep /mnt
nfs.example.com:/mnt/qa on /mnt/qa type nfs (rw,addr=IP)

echo -e "cmd=/bin/sleep\nargs=10s\nlog=/mnt/qa/nfstest/nfstest.log\nqueue" | sudo -u test condor_submit

Relevant part of ShadowLog:
(18049): directory_util::rec_touch_file: Creating directory /var
(18049): directory_util::rec_touch_file: Creating directory /var/lock
(18049): directory_util::rec_touch_file: Creating directory /var/lock/condor
(18049): directory_util::rec_touch_file: Creating directory /var/lock/condor/local
(18049): directory_util::rec_touch_file: Creating directory /var/lock/condor/local//23
(18049): directory_util::rec_touch_file: Creating directory /var/lock/condor/local//23/20
(18049): FileLock object is updating timestamp on: /var/lock/condor/local//23/20/518989.lockc
(18049): UserLog = /mnt/qa/nfstest/nfstest.log
(18049): in RemoteResource::initStartdInfo()
(18049): Entering DCStartd::activateClaim()
(18049): Initialized the following authorization table:
...
(18049): Request to run on slot1@localhost <IP:53810> was ACCEPTED
(18049): Resource slot1@localhost changing state from PRE to STARTUP
(18049): FileLock object is updating timestamp on: /var/lock/condor/local//23/20/518989.lockc
...
(18049): Resource slot1@localhost changing state from STARTUP to EXECUTING
(18049): FileLock::obtain(1) - @1295533887.941179 lock on /var/lock/condor/local//23/20/518989.lockc now WRITE
(18049): Resource slot1@localhost changing state from EXECUTING to FINISHED
...
(18049): Entering DCStartd::deactivateClaim(forceful)
(18049): DCStartd::deactivateClaim: successfully sent command
(18049): Killed starter (fast) at <IP:53810>
(18049): Updating Job Queue: SetAttribute(NumJobStarts = 1)
(18049): Updating Job Queue: SetAttribute(LastJobLeaseRenewal = 1295533898)
(18049): Updating Job Queue: SetAttribute(RemoteSysCpu = 0.000000)
(18049): Updating Job Queue: SetAttribute(RemoteUserCpu = 0.000000)
(18049): Updating Job Queue: SetAttribute(ResidentSetSize = 0)
(18049): Updating Job Queue: SetAttribute(ExitBySignal = FALSE)
...
(18049): Job 44.0 terminated: exited with status 0
(18049): FileLock::obtain(1) - @1295533898.194989 lock on /var/lock/condor/local//23/20/518989.lockc now WRITE
(18049): FileLock::obtain(2) - @1295533898.460015 lock on /var/lock/condor/local//23/20/518989.lockc now UNLOCKED
(18049): Forking Mailer process...
(18049): Reporting job exit reason 100 and attempting to fetch new job.
(18049): No new job found to run under this shadow.
(18049): **** condor_shadow (condor_SHADOW) pid 18049 EXITING WITH STATUS 100

No regression found. Locks created in /var/lock/condor as expected.

>>> VERIFIED

Comment 22 Tomas Rusnak 2011-01-27 12:42:05 UTC
Retested on a system with 8 slots, running 50 jobs as 5 random users. The UserLog was placed on NFS and shared by all jobs:

SchedLog:
1/27/11 07:14:45 (46.0) (24778): FileLock::obtain(1) - @1296130485.727582 lock on /var/lock/condor/local//23/20/518989.lockc now WRITE
01/27/11 07:14:45 (46.0) (24778): FileLock::obtain(2) - @1296130485.757217 lock on /var/lock/condor/local//23/20/518989.lockc now UNLOCKED
01/27/11 07:14:45 (46.0) (24778): Forking Mailer process...
01/27/11 07:14:45 (46.0) (24778): Reporting job exit reason 100 and attempting to fetch new job.
01/27/11 07:14:45 (46.0) (24778): No new job found to run under this shadow.
01/27/11 07:14:45 (46.0) (24778): **** condor_shadow (condor_SHADOW) pid 24778 EXITING WITH STATUS 100
01/27/11 07:14:45 (46.0) (24778): FileLock::obtain(1) - @1296130485.762139 lock on /var/lock/condor/local//23/20/518989.lockc now WRITE
01/27/11 07:14:45 (46.0) (24778): directory_util::rec_clean_up: file /var/lock/condor/local//23/20/518989.lockc has been deleted. 
01/27/11 07:14:45 (46.0) (24778): Lock file /var/lock/condor/local//23/20/518989.lockc has been deleted. 
01/27/11 07:14:45 (46.0) (24778): FileLock::obtain(2) - @1296130485.762405 lock on /var/lock/condor/local//23/20/518989.lockc now UNLOCKED

UserLog (repeated for each job):

005 (046.000.000) 01/27 07:14:45 Job terminated.
	(1) Normal termination (return value 0)
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
	0  -  Run Bytes Sent By Job
	0  -  Run Bytes Received By Job
	0  -  Total Bytes Sent By Job
	0  -  Total Bytes Received By Job

No regression or problem found. It also works as expected in the current condor-7.4.5-0.7.

Comment 23 errata-xmlrpc 2011-02-15 12:12:31 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0217.html