Bug 676260
Summary: | condor creates shadow log file with bad permissions | |
---|---|---|---
Product: | Red Hat Enterprise MRG | Reporter: | Martin Kudlej <mkudlej>
Component: | condor | Assignee: | Matthew Farrellee <matt>
Status: | CLOSED ERRATA | QA Contact: | Tomas Rusnak <trusnak>
Severity: | medium | Docs Contact: |
Priority: | medium | |
Version: | 1.3 | CC: | bbockelm, iboverma, jneedle, matt, trusnak
Target Milestone: | 2.0 | |
Target Release: | --- | |
Hardware: | Unspecified | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | condor-7.5.6-0.1 | Doc Type: | Bug Fix
Doc Text: | N/A | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2011-06-23 15:39:21 UTC | Type: | ---
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 693778 | |
Attachments: | | |
Description
Martin Kudlej
2011-02-09 09:13:36 UTC
Created attachment 477772: condor configuration
Created attachment 477794: condor log files
ShadowLog contains only...

Stack dump for process 8400 at timestamp 1297224136 (5 frames)

This indicates the Shadow crashed, likely during log rotation. Theoretically such a crash could explain the inconsistent state of the log ownership.

ShadowLog.old ends with...

02/08/11 23:02:16 (fd:6) (pid:8400) (2836.12) (8400): SHADOW_JOB_CLEANUP_RETRY_DELAY is undefined, using default value of 30
02/08/11 23:02:16 (fd:10) (pid:6369) (4803.81) (6369): Entering thread safe stop [send] in condor_rw.cpp:359 unknown()
02/08/11 23:02:16 (fd:11) (pid:6377) (4804.8) (6377): Config 'SEC_DEFAULT_CRYPTO_METHODS': no prefix ==> '3DES, 3DES'
02/08/11 23:02:16 (fd:6) (pid:8400) (2836.12) (8400): SHADOW_LAZY_QUEUE_UPDATE is undefined, using default value of True
02/08/11 23:02:16 (fd:10) (pid:6369) (4803.81) (6369): Leaving thread safe stop [send] in condor_rw.cpp:359 unknown()
02/08/11 23:02:16 (fd:6) (pid:8400) (2836.12) (8400): PRIV_CONDOR --> PRIV_USER at write_user_log.cpp:164
02/08/11 23:02:16 (fd:10) (pid:6369) (4803.81) (6369): selector 0xbfe047b8 resetting
02/08/11 23:02:16 (fd:11) (pid:6377) (4804.8) (6377): SEC_SHADOW_CLIENT_SESSION_DURATION is undefined, using default value of 0
02/08/11 23:02:16 (fd:10) (pid:6369) (4803.81) (6369): condor_read(fd=9 schedd at <:46007>,,size=5,timeout=300,flags=0)
02/08/11 23:02:16 (fd:7) (pid:8400) (2836.12) (8400): CREATE_LOCKS_ON_LOCAL_DISK is undefined, using default value of True
02/08/11 23:02:16 (fd:11) (pid:6377) (4804.8) (6377): SEC_SHADOW_DEFAULT_SESSION_DURATION is undefined, using default value of 0
02/08/11 23:02:16 (fd:10) (pid:6369) (4803.81) (6369): selector 0xbfe047b8 adding fd 9 ()
02/08/11 23:02:16 (fd:7) (pid:8400) (2836.12) (8400): Config 'LOCAL_DISK_LOCK_DIR': no prefix ==> '$(LOCK)/local'
MaxLog = 1000000, length = 0
Saving log file to "/var/log/condor/ShadowLog.old"

This also points to rotation, probably from pid 8400, which is the one that dumped into ShadowLog.
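One way to cross-check the stack dump against the ShadowLog.old entries is to decode the Unix timestamp in the stack dump header. The sketch below is only an illustration: the regular expression is inferred from the single header line quoted above, and the helper name `parse_stack_dump` is hypothetical, not part of condor.

```python
import re
from datetime import datetime, timezone

# Pattern inferred from the one header line quoted in this report (assumption).
STACK_DUMP_RE = re.compile(
    r"Stack dump for process (\d+) at timestamp (\d+) \((\d+) frames\)"
)


def parse_stack_dump(line):
    """Return (pid, UTC datetime, frame count) for a stack dump header, or None."""
    match = STACK_DUMP_RE.search(line)
    if match is None:
        return None
    pid, timestamp, frames = (int(group) for group in match.groups())
    return pid, datetime.fromtimestamp(timestamp, tz=timezone.utc), frames


header = "Stack dump for process 8400 at timestamp 1297224136 (5 frames)"
pid, when, frames = parse_stack_dump(header)
print(pid, when.isoformat(), frames)
# prints: 8400 2011-02-09T04:02:16+00:00 5
```

04:02:16 UTC would line up with the 23:02:16 entries at the end of ShadowLog.old if the host clock was five hours behind UTC. The timezone is an assumption; the logs do not state it.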
The SchedLog* has rotated since the Shadow crash and won't hold useful information.

Another theory: the stack dump signal handler was triggered before the new log file was created during rotation. The signal handler runs with root privs and tried to write to the log, creating it.

Reducing MAX_SHADOW_LOG may help reproduction by forcing more frequent rotation.

Please verify this is still a problem with condor 7.5.6-0.1

Will retest during validation cycle.

Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.
New Contents:
N/A

Retested with 10000 jobs for a whole night - unable to reproduce.

Last ShadowLog entry:
05/05/11 18:32:54 (14022.0) (18922): Job 14022.0 terminated: exited with status 0
05/05/11 18:32:54 (14022.0) (18922): WriteUserLog: not initialized @ writeEvent()
05/05/11 18:32:54 (14022.0) (18922): Forking Mailer process...
05/05/11 18:32:54 (14022.0) (18922): Reporting job exit reason 100 and attempting to fetch new job.
05/05/11 18:32:54 (14022.0) (18922): No new job found to run under this shadow.
05/05/11 18:32:54 (14022.0) (18922): **** condor_shadow (condor_SHADOW) pid 18922 EXITING WITH STATUS 100

No broken permissions:
# ls -la /var/log/condor/Shadow*
-rw-r--r-- 1 condor condor 1106 May 5 18:32 /var/log/condor/ShadowLog
-rw-r--r-- 1 condor condor 1083 May 5 18:32 /var/log/condor/ShadowLog.old

# condor -v
$CondorVersion: 7.4.5 Feb 4 2011 BuildID: RH-7.4.5-0.8.el5 PRE-RELEASE $
$CondorPlatform: X86_64-LINUX_RHEL5 $

Do you have any idea how to reproduce this?

Retested with the current package condor-7.6.1-0.4 on x86,x86_64/RHEL5,RHEL6 with the same setup as in the previous comment. No problems with bad permissions were found.
Should be reopened if the problem arises again.
>>> VERIFIED
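For long retests, the manual `ls -la /var/log/condor/Shadow*` check could be automated. The following is a minimal sketch, assuming the logs should be owned by the `condor` user as in the listing above; the function name `misowned_logs` is hypothetical.

```python
import glob
import os
import pwd
import stat


def misowned_logs(pattern, expected_user):
    """Return (path, owner, mode string) for each match not owned by expected_user."""
    bad = []
    for path in sorted(glob.glob(pattern)):
        info = os.stat(path)
        try:
            owner = pwd.getpwuid(info.st_uid).pw_name
        except KeyError:
            # No passwd entry for this uid; fall back to the numeric id.
            owner = str(info.st_uid)
        if owner != expected_user:
            bad.append((path, owner, stat.filemode(info.st_mode)))
    return bad


if __name__ == "__main__":
    # Path and user taken from the ls output in this report.
    for path, owner, mode in misowned_logs("/var/log/condor/Shadow*", "condor"):
        print(f"{mode} {owner} {path}")
```

Run periodically during a test (for example from cron), this prints nothing while ownership is intact and one line per log file that has ended up owned by another user.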
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHEA-2011-0889.html

Hi - We ran into this locally. I think Matt's conjecture is right: we ran into this twice (once in the ShadowLog, once in the ScheddLog). Each time, the end of the file contained a stack trace. So, it's likely a race condition that just happens when other things are crashing.
Brian
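The conjecture discussed in this report (a privileged handler creating the log file in the window between rotation and re-creation) can be illustrated with a small simulation. This is a sketch only, not condor's code. Changing file ownership requires root, so the permission mode stands in for ownership here: whoever creates the file first fixes its attributes, and a later open with O_CREAT does not change them. All function names are hypothetical.

```python
import os
import stat
import tempfile


def rotate(log_path):
    """Rename the active log to <log>.old, leaving no file at log_path."""
    os.rename(log_path, log_path + ".old")


def open_log(log_path, mode):
    """Open for append; `mode` only applies if the file has to be created."""
    return os.open(log_path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, mode)


def simulate(handler_fires_in_window):
    """Return the final mode of the log after a rotation."""
    old_umask = os.umask(0)  # make the requested modes apply exactly
    try:
        with tempfile.TemporaryDirectory() as tmp_dir:
            log = os.path.join(tmp_dir, "ShadowLog")
            os.close(open_log(log, 0o644))  # normal log created by the daemon
            rotate(log)
            if handler_fires_in_window:
                # Stand-in for the privileged stack dump handler: it creates
                # the missing file with its own attributes.
                fd = open_log(log, 0o600)
                os.write(fd, b"placeholder stack dump\n")
                os.close(fd)
            # The logger re-opens the log; O_CREAT is a no-op if it exists.
            os.close(open_log(log, 0o644))
            return stat.S_IMODE(os.stat(log).st_mode)
    finally:
        os.umask(old_umask)


print(oct(simulate(False)), oct(simulate(True)))
# prints: 0o644 0o600
```

In the real case the analogous outcome would be a ShadowLog owned by root rather than condor, with a stack trace at its end, which matches what was observed in both reports.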