Bug 627015
| Summary | [RFE] NFS userlog locking hangs condor_schedd |
|---|---|
| Product | Red Hat Enterprise MRG |
| Component | condor |
| Version | 1.2 |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | high |
| Reporter | Jon Thomas <jthomas> |
| Assignee | Matthew Farrellee <matt> |
| QA Contact | Tomas Rusnak <trusnak> |
| CC | matt, spurrier, trusnak |
| Target Milestone | 1.3.2 |
| Keywords | FutureFeature |
| Hardware | All |
| OS | Linux |
| Fixed In Version | condor-7.4.5-0.4 |
| Doc Type | Enhancement |
| Last Closed | 2011-02-15 12:12:31 UTC |

Doc Text:
- Cause: Presumed NFS locking issues.
- Consequence: The schedd hangs.
- Fix: Feature backported from the 7.5 series: instead of locking files on NFS, a local file is locked.
- Result: The hangs should cease.
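The backported behavior is configuration-driven. As a sketch only — the parameter names below are as I recall them from the 7.5-series manual and are not confirmed anywhere in this report, so verify them against your condor_config reference before relying on them:

```
# Take userlog locks on local disk instead of on the (possibly NFS) log itself.
CREATE_LOCKS_ON_LOCAL_DISK = TRUE
# Where the local lock files live; matches the /var/lock/condor/local path
# that appears in the ShadowLog excerpts below.
LOCAL_DISK_LOCK_DIR = /var/lock/condor/local
```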
Comment 3
Matthew Farrellee
2010-08-25 12:20:57 UTC

Something seems strange here. The schedd is single-threaded: if it were hung on a lock, it should not be able to keep sending DC_CHILDALIVE messages to the master. Was it hung for longer than an hour, the default NOT_RESPONDING_TIMEOUT?

The situation can possibly be improved with https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=1310. Candidate code is currently at V7_4-BZ627015-GT1310-local-locking-backport.

Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New contents: see the Doc Text above (cause, consequence, fix, result).

Packaging this requires introducing /var/lock/condor/local, a /tmp-like directory, in the default configuration and in the RPM.

Retested with current condor on all supported platforms (x86/x86_64 on RHEL 4/RHEL 5):
condor-7.4.5-0.6
# mount | grep /mnt
nfs.example.com:/mnt/qa on /mnt/qa type nfs (rw,addr=IP)
echo -e "cmd=/bin/sleep\nargs=10s\nlog=/mnt/qa/nfstest/nfstest.log\nqueue" | sudo -u test condor_submit
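For reference, the echo one-liner above expands to the following submit description written to a file (filename `nfstest.sub` is illustrative):

```shell
# The same submit description as the echo one-liner, as a file.
cat > nfstest.sub <<'EOF'
cmd  = /bin/sleep
args = 10s
log  = /mnt/qa/nfstest/nfstest.log
queue
EOF
# then: sudo -u test condor_submit nfstest.sub
```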
Relevant part of ShadowLog:
(18049): directory_util::rec_touch_file: Creating directory /var
(18049): directory_util::rec_touch_file: Creating directory /var/lock
(18049): directory_util::rec_touch_file: Creating directory /var/lock/condor
(18049): directory_util::rec_touch_file: Creating directory /var/lock/condor/local
(18049): directory_util::rec_touch_file: Creating directory /var/lock/condor/local//23
(18049): directory_util::rec_touch_file: Creating directory /var/lock/condor/local//23/20
(18049): FileLock object is updating timestamp on: /var/lock/condor/local//23/20/518989.lockc
(18049): UserLog = /mnt/qa/nfstest/nfstest.log
(18049): in RemoteResource::initStartdInfo()
(18049): Entering DCStartd::activateClaim()
(18049): Initialized the following authorization table:
...
(18049): Request to run on slot1@localhost <IP:53810> was ACCEPTED
(18049): Resource slot1@localhost changing state from PRE to STARTUP
(18049): FileLock object is updating timestamp on: /var/lock/condor/local//23/20/518989.lockc
...
(18049): Resource slot1@localhost changing state from STARTUP to EXECUTING
(18049): FileLock::obtain(1) - @1295533887.941179 lock on /var/lock/condor/local//23/20/518989.lockc now WRITE
(18049): Resource slot1@localhost changing state from EXECUTING to FINISHED
...
(18049): Entering DCStartd::deactivateClaim(forceful)
(18049): DCStartd::deactivateClaim: successfully sent command
(18049): Killed starter (fast) at <IP:53810>
(18049): Updating Job Queue: SetAttribute(NumJobStarts = 1)
(18049): Updating Job Queue: SetAttribute(LastJobLeaseRenewal = 1295533898)
(18049): Updating Job Queue: SetAttribute(RemoteSysCpu = 0.000000)
(18049): Updating Job Queue: SetAttribute(RemoteUserCpu = 0.000000)
(18049): Updating Job Queue: SetAttribute(ResidentSetSize = 0)
(18049): Updating Job Queue: SetAttribute(ExitBySignal = FALSE)
...
(18049): Job 44.0 terminated: exited with status 0
(18049): FileLock::obtain(1) - @1295533898.194989 lock on /var/lock/condor/local//23/20/518989.lockc now WRITE
(18049): FileLock::obtain(2) - @1295533898.460015 lock on /var/lock/condor/local//23/20/518989.lockc now UNLOCKED
(18049): Forking Mailer process...
(18049): Reporting job exit reason 100 and attempting to fetch new job.
(18049): No new job found to run under this shadow.
(18049): **** condor_shadow (condor_SHADOW) pid 18049 EXITING WITH STATUS 100
No regression found. Locks created in /var/lock/condor as expected.
>>> VERIFIED
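The local-lock behavior verified above can be sketched in shell terms: derive a lock file on local disk from the user-log path, take the lock there, and only then append to the log on NFS. Everything below is illustrative — mktemp paths stand in for the real NFS log and /var/lock/condor/local, and the md5-based sharding is an assumption, not Condor's actual algorithm:

```shell
# Illustrative sketch of local-disk locking for an NFS-hosted user log.
USERLOG="$(mktemp -d)/nfstest.log"        # stands in for the log on NFS
LOCKROOT="$(mktemp -d)"                   # stands in for /var/lock/condor/local
HASH=$(printf '%s' "$USERLOG" | md5sum | cut -c1-32)
SUB1=$(printf '%s' "$HASH" | cut -c1-2)   # two-level subdirectories, like the
SUB2=$(printf '%s' "$HASH" | cut -c3-4)   # /23/20/ seen in the ShadowLog
LOCKDIR="$LOCKROOT/$SUB1/$SUB2"
mkdir -p "$LOCKDIR"
LOCKFILE="$LOCKDIR/$HASH.lockc"
# flock(1) takes the exclusive lock on LOCAL disk (creating the lock file
# if needed), then the command appends to the log on NFS.
flock "$LOCKFILE" -c "echo '005 Job terminated.' >> '$USERLOG'"
```

The point of the scheme is that the fcntl/flock traffic never touches the NFS server, so a misbehaving lockd cannot hang the process holding the lock.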
Retested on a system with 8 slots and 50 jobs from 5 random users. The UserLog was placed on NFS, shared by all jobs.

SchedLog:

01/27/11 07:14:45 (46.0) (24778): FileLock::obtain(1) - @1296130485.727582 lock on /var/lock/condor/local//23/20/518989.lockc now WRITE
01/27/11 07:14:45 (46.0) (24778): FileLock::obtain(2) - @1296130485.757217 lock on /var/lock/condor/local//23/20/518989.lockc now UNLOCKED
01/27/11 07:14:45 (46.0) (24778): Forking Mailer process...
01/27/11 07:14:45 (46.0) (24778): Reporting job exit reason 100 and attempting to fetch new job.
01/27/11 07:14:45 (46.0) (24778): No new job found to run under this shadow.
01/27/11 07:14:45 (46.0) (24778): **** condor_shadow (condor_SHADOW) pid 24778 EXITING WITH STATUS 100
01/27/11 07:14:45 (46.0) (24778): FileLock::obtain(1) - @1296130485.762139 lock on /var/lock/condor/local//23/20/518989.lockc now WRITE
01/27/11 07:14:45 (46.0) (24778): directory_util::rec_clean_up: file /var/lock/condor/local//23/20/518989.lockc has been deleted.
01/27/11 07:14:45 (46.0) (24778): Lock file /var/lock/condor/local//23/20/518989.lockc has been deleted.
01/27/11 07:14:45 (46.0) (24778): FileLock::obtain(2) - @1296130485.762405 lock on /var/lock/condor/local//23/20/518989.lockc now UNLOCKED

UserLog (repeated for each job):

005 (046.000.000) 01/27 07:14:45 Job terminated.
	(1) Normal termination (return value 0)
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
	0  -  Run Bytes Sent By Job
	0  -  Run Bytes Received By Job
	0  -  Total Bytes Sent By Job
	0  -  Total Bytes Received By Job

No regression or problem found. It also works as expected in the current condor-7.4.5-0.7.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0217.html