Bug 627015

Summary: [RFE] NFS userlog locking hangs condor_schedd
Product: Red Hat Enterprise MRG
Reporter: Jon Thomas <jthomas>
Component: condor
Assignee: Matthew Farrellee <matt>
Status: CLOSED ERRATA
QA Contact: Tomas Rusnak <trusnak>
Severity: high
Priority: high
Version: 1.2
CC: matt, spurrier, trusnak
Target Milestone: 1.3.2
Keywords: FutureFeature
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version: condor-7.4.5-0.4
Doc Type: Enhancement
Doc Text:
C: Presumed NFS locking issues. C: Schedd hangs. F: Backported a feature from the 7.5 series so that files on NFS are not locked; a local file is locked instead. R: Hopefully hangs will cease.
Last Closed: 2011-02-15 12:12:31 UTC
Attachments:
upstream patch against condor-7_4_4-0_4_el5 (flags: none)

Comment 3 Matthew Farrellee 2010-08-25 12:20:57 UTC
From Jon Thomas -

condor_schedd processes hang in a state where they appear to continue to
respond to DC_DAEMON_ALIVE's to condor_master (and so aren't restarted in a
timely manner), but they have changed their ownership to a user and are stuck
on an fcntl() call against a UserLog.  We had three of these occur during the
night last night on host "bur".

08/10 17:11:57 abort_job_myself: 138850.0 action:Hold log_hold:true notify:true
08/10 17:11:57 Writing record to user
logfile=/nfs/log/xyz/jobHistory.log owner=gridmonkey
08/10 17:11:57 FileLock object is updating timestamp on:
/nfs/log/xyz/jobHistory.log
Stack dump for process 18089 at timestamp 1281486087 (18 frames)
condor_schedd(dprintf_dump_stack+0x4e)[0x54846e]
condor_schedd[0x54a242]
/lib64/libpthread.so.0[0x3dd7a0eb10]
/lib64/libpthread.so.0(__fcntl+0x4f)[0x3dd7a0da6f]
condor_schedd(lock_file_plain+0x5b)[0x54b06b]
condor_schedd(lock_file+0x1f)[0x54b0ef]
condor_schedd(_ZN8FileLock6obtainE9LOCK_TYPE+0x6c)[0x554e7c]
condor_schedd(_ZN12WriteUserLog12doWriteEventEP9ULogEventbbP7ClassAd+0x6a)[0x56aaaa]
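
For context, the call at the top of that stack is a plain blocking fcntl()
file lock against the NFS-hosted UserLog. Below is a minimal standalone
sketch of that kind of call - not Condor's actual lock_file_plain(), and the
path is just the hypothetical one from the log excerpt above. With F_SETLKW
the call simply never returns if the NFS server or its lock manager stops
responding, which matches the observed hang.

// Minimal sketch of a blocking whole-file write lock, as used on the UserLog.
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

// With F_SETLKW the call waits until the lock is granted; against a dead NFS
// lock manager, "waits" can mean forever, wedging a single-threaded daemon.
static bool lock_whole_file_blocking(int fd)
{
    struct flock fl = {};
    fl.l_type   = F_WRLCK;   // exclusive (write) lock
    fl.l_whence = SEEK_SET;
    fl.l_start  = 0;
    fl.l_len    = 0;         // length 0 means lock the entire file
    return fcntl(fd, F_SETLKW, &fl) == 0;
}

int main()
{
    // Hypothetical NFS path, mirroring the UserLog in the log excerpt above.
    const char *path = "/nfs/log/xyz/jobHistory.log";
    int fd = open(path, O_WRONLY | O_APPEND);
    if (fd < 0) { perror("open"); return 1; }
    if (!lock_whole_file_blocking(fd))   // may never return on a hung NFS mount
        perror("fcntl(F_SETLKW)");
    close(fd);
    return 0;
}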

https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=1310

Comment 4 Matthew Farrellee 2010-08-25 12:25:01 UTC
Something seems strange here. The Schedd is single-threaded; if it is hung on a lock, it should not be able to continue to send DC_CHILDALIVE's to the Master.

Was it hung for longer than an hour, which is the default NOT_RESPONDING_TIMEOUT?

Comment 6 Matthew Farrellee 2010-08-30 00:01:24 UTC
We can possibly improve the situation with:

https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=1310

Comment 11 Matthew Farrellee 2010-12-16 21:34:32 UTC
Candidate code is currently at V7_4-BZ627015-GT1310-local-locking-backport

Comment 13 Matthew Farrellee 2010-12-20 21:27:07 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
C: Presumed NFS locking issues.
C: Schedd hangs.
F: Backported a feature from the 7.5 series so that files on NFS are not locked; a local file is locked instead.
R: Hopefully hangs will cease.
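
To illustrate the idea only (this is a rough sketch, not the actual
backported code): the lock is taken on a file under a local directory whose
path is derived from the user log, instead of on the NFS-hosted log itself.
The bucketed numeric subdirectories below mirror the paths seen later in the
verification logs (e.g. /var/lock/condor/local//23/20/518989.lockc); the
function name local_lock_path and the hashing scheme are hypothetical.

// Sketch: map a user log path to a lock file on local disk.
#include <cstdio>
#include <string>
#include <functional>

static std::string local_lock_path(const std::string &userlog_path)
{
    const std::string base = "/var/lock/condor/local";
    size_t h = std::hash<std::string>{}(userlog_path);  // illustrative hash
    unsigned d1 = h % 100;          // first bucket directory
    unsigned d2 = (h / 100) % 100;  // second bucket directory
    char buf[256];
    snprintf(buf, sizeof(buf), "%s/%u/%u/%zu.lockc",
             base.c_str(), d1, d2, h % 1000000);
    return buf;
}

int main()
{
    printf("%s\n", local_lock_path("/mnt/qa/nfstest/nfstest.log").c_str());
    return 0;
}

Because the lock file lives on local disk, an unresponsive NFS lock manager
can no longer wedge the fcntl() call while the user log itself stays on NFS.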

Comment 14 Matthew Farrellee 2010-12-20 21:28:23 UTC
Packaging this requires the introduction of /var/lock/condor/local, a /tmp-like directory, in the default configuration and in the RPM.
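
A sketch of what "/tmp-like" is assumed to mean here: world-writable with the
sticky bit set (mode 1777), so shadows running as different job owners can
each create lock files but cannot remove one another's. The actual ownership
and permissions are set by the RPM and the default configuration, not by code
like this.

// Sketch: create the local lock directory with /tmp-like permissions.
#include <sys/stat.h>
#include <cstdio>

int main()
{
    const char *dir = "/var/lock/condor/local";
    if (mkdir(dir, 01777) != 0)   // sticky bit + world-writable (assumed mode)
        perror("mkdir");
    // mkdir() honors the umask, so force the intended mode explicitly.
    if (chmod(dir, 01777) != 0)
        perror("chmod");
    return 0;
}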

Comment 18 Tomas Rusnak 2011-01-20 14:45:26 UTC
Retested with the current condor on all supported platforms (x86/x86_64) on RHEL4/RHEL5:

condor-7.4.5-0.6

# mount | grep /mnt
nfs.example.com:/mnt/qa on /mnt/qa type nfs (rw,addr=IP)

echo -e "cmd=/bin/sleep\nargs=10s\nlog=/mnt/qa/nfstest/nfstest.log\nqueue" | sudo -u test condor_submit

Relevant part of ShadowLog:
(18049): directory_util::rec_touch_file: Creating directory /var
(18049): directory_util::rec_touch_file: Creating directory /var/lock
(18049): directory_util::rec_touch_file: Creating directory /var/lock/condor
(18049): directory_util::rec_touch_file: Creating directory /var/lock/condor/local
(18049): directory_util::rec_touch_file: Creating directory /var/lock/condor/local//23
(18049): directory_util::rec_touch_file: Creating directory /var/lock/condor/local//23/20
(18049): FileLock object is updating timestamp on: /var/lock/condor/local//23/20/518989.lockc
(18049): UserLog = /mnt/qa/nfstest/nfstest.log
(18049): in RemoteResource::initStartdInfo()
(18049): Entering DCStartd::activateClaim()
(18049): Initialized the following authorization table:
...
(18049): Request to run on slot1@localhost <IP:53810> was ACCEPTED
(18049): Resource slot1@localhost changing state from PRE to STARTUP
(18049): FileLock object is updating timestamp on: /var/lock/condor/local//23/20/518989.lockc
...
(18049): Resource slot1@localhost changing state from STARTUP to EXECUTING
(18049): FileLock::obtain(1) - @1295533887.941179 lock on /var/lock/condor/local//23/20/518989.lockc now WRITE
(18049): Resource slot1@localhost changing state from EXECUTING to FINISHED
...
(18049): Entering DCStartd::deactivateClaim(forceful)
(18049): DCStartd::deactivateClaim: successfully sent command
(18049): Killed starter (fast) at <IP:53810>
(18049): Updating Job Queue: SetAttribute(NumJobStarts = 1)
(18049): Updating Job Queue: SetAttribute(LastJobLeaseRenewal = 1295533898)
(18049): Updating Job Queue: SetAttribute(RemoteSysCpu = 0.000000)
(18049): Updating Job Queue: SetAttribute(RemoteUserCpu = 0.000000)
(18049): Updating Job Queue: SetAttribute(ResidentSetSize = 0)
(18049): Updating Job Queue: SetAttribute(ExitBySignal = FALSE)
...
(18049): Job 44.0 terminated: exited with status 0
(18049): FileLock::obtain(1) - @1295533898.194989 lock on /var/lock/condor/local//23/20/518989.lockc now WRITE
(18049): FileLock::obtain(2) - @1295533898.460015 lock on /var/lock/condor/local//23/20/518989.lockc now UNLOCKED
(18049): Forking Mailer process...
(18049): Reporting job exit reason 100 and attempting to fetch new job.
(18049): No new job found to run under this shadow.
(18049): **** condor_shadow (condor_SHADOW) pid 18049 EXITING WITH STATUS 100

No regression found. Locks created in /var/lock/condor as expected.

>>> VERIFIED

Comment 22 Tomas Rusnak 2011-01-27 12:42:05 UTC
Retested on a system with 8 slots, running 50 jobs as 5 random users. The UserLog was placed on NFS and shared by all jobs:

SchedLog:
1/27/11 07:14:45 (46.0) (24778): FileLock::obtain(1) - @1296130485.727582 lock on /var/lock/condor/local//23/20/518989.lockc now WRITE
01/27/11 07:14:45 (46.0) (24778): FileLock::obtain(2) - @1296130485.757217 lock on /var/lock/condor/local//23/20/518989.lockc now UNLOCKED
01/27/11 07:14:45 (46.0) (24778): Forking Mailer process...
01/27/11 07:14:45 (46.0) (24778): Reporting job exit reason 100 and attempting to fetch new job.
01/27/11 07:14:45 (46.0) (24778): No new job found to run under this shadow.
01/27/11 07:14:45 (46.0) (24778): **** condor_shadow (condor_SHADOW) pid 24778 EXITING WITH STATUS 100
01/27/11 07:14:45 (46.0) (24778): FileLock::obtain(1) - @1296130485.762139 lock on /var/lock/condor/local//23/20/518989.lockc now WRITE
01/27/11 07:14:45 (46.0) (24778): directory_util::rec_clean_up: file /var/lock/condor/local//23/20/518989.lockc has been deleted. 
01/27/11 07:14:45 (46.0) (24778): Lock file /var/lock/condor/local//23/20/518989.lockc has been deleted. 
01/27/11 07:14:45 (46.0) (24778): FileLock::obtain(2) - @1296130485.762405 lock on /var/lock/condor/local//23/20/518989.lockc now UNLOCKED

UserLog (repeated for each job):

005 (046.000.000) 01/27 07:14:45 Job terminated.
	(1) Normal termination (return value 0)
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
	0  -  Run Bytes Sent By Job
	0  -  Run Bytes Received By Job
	0  -  Total Bytes Sent By Job
	0  -  Total Bytes Received By Job

No regression or problem found. It also works as expected in the current condor-7.4.5-0.7.

Comment 23 errata-xmlrpc 2011-02-15 12:12:31 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0217.html