Bug 676260 - condor creates shadow log file with bad permissions
Summary: condor creates shadow log file with bad permissions
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: 1.3
Hardware: Unspecified
OS: Linux
medium
medium
Target Milestone: 2.0
: ---
Assignee: Matthew Farrellee
QA Contact: Tomas Rusnak
URL:
Whiteboard:
Depends On:
Blocks: 693778
TreeView+ depends on / blocked
 
Reported: 2011-02-09 09:13 UTC by Martin Kudlej
Modified: 2012-03-08 22:41 UTC (History)
5 users (show)

Fixed In Version: condor-7.5.6-0.1
Doc Type: Bug Fix
Doc Text:
N/A
Clone Of:
Environment:
Last Closed: 2011-06-23 15:39:21 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
condor configuration (4.50 KB, application/x-gzip)
2011-02-09 09:17 UTC, Martin Kudlej
no flags Details
condor log files (336.67 KB, application/x-gzip)
2011-02-09 12:23 UTC, Martin Kudlej
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2011:0889 0 normal SHIPPED_LIVE Red Hat Enterprise MRG Grid 2.0 Release 2011-06-23 15:35:53 UTC

Description Martin Kudlej 2011-02-09 09:13:36 UTC
Description of problem:

I submit 100 "sleep 1"-type jobs every 60 seconds. Jobs have ran ok, but during rotating log of shadow, condor creates new log file with bad permissions and shadow cannot write into log.

$ ls -l /var/log/condor
total 9368
-rw-r--r-- 1 condor condor  169347 Feb  9 04:00 CollectorLog
-rw-r--r-- 1 condor condor 1000091 Feb  9 04:00 CollectorLog.old
-rw-r--r-- 1 condor condor     669 Feb  4 08:19 ConfigLog
-rw-r--r-- 1 condor condor     127 Feb  9 04:00 dprintf_failure.SHADOW
-rw-r--r-- 1 condor condor  759806 Feb  9 03:59 MasterLog
-rw-r--r-- 1 condor condor 1000114 Feb  9 03:24 MasterLog.old
-rw-r--r-- 1 condor condor  576229 Feb  9 04:00 MatchLog
-rw-r--r-- 1 condor condor 1000190 Feb  9 03:17 MatchLog.old
-rw-r--r-- 1 condor condor  329317 Feb  9 04:00 NegotiatorLog
-rw-r--r-- 1 condor condor 1000189 Feb  9 03:59 NegotiatorLog.old
-rw-r--r-- 1 condor condor  190694 Feb  9 04:00 SchedLog
-rw-r--r-- 1 condor condor 1000112 Feb  9 04:00 SchedLog.old
-rw-r--r-- 1 root   root        63 Feb  8 23:02 ShadowLog
-rw-r--r-- 1 condor condor 1000152 Feb  8 23:02 ShadowLog.old
-rw-r--r-- 1 condor condor   14479 Feb  4 08:19 StarterLog
-rw-r--r-- 1 condor condor  378880 Feb  9 04:00 StartLog
-rw-r--r-- 1 condor condor 1000200 Feb  9 03:54 StartLog.old

$ cat dprintf_failure.SHADOW 
dprintf() had a fatal error in pid 10935
Can't open "/var/log/condor/ShadowLog"
errno: 13 (Permission denied)
euid: 64, ruid: 0

$ ps aux | grep condor
condor   27027  0.4  0.0  10028  3824 ?        Ss   Feb04  31:08 condor_master -pidfile /var/run/condor/condor_master.pid
condor   27029  7.5  0.0  10844  5412 ?        Ss   Feb04 522:23 condor_collector -f
condor   27030  0.1  0.0   9780  3604 ?        Ss   Feb04  11:39 condor_startd -f
condor   27031 27.8  1.3 117824 111096 ?       Ss   Feb04 1932:09 condor_schedd -f
condor   27032  1.7  0.0  10512  4960 ?        Ds   Feb04 119:13 condor_negotiator -f
condor   27033  0.0  0.2 108796 20096 ?        Ssl  Feb04   2:13 /usr/bin/python /usr/sbin/condor_configd
root     27034  0.0  0.0   8468  3636 ?        S    Feb04   6:00 condor_procd -A /var/run/condor/procd_pipe.SCHEDD -S 60 -C 64


Version-Release number of selected component (if applicable):
python-wallabyclient-3.9-3.el5
python-condorutils-1.4-6.el5
wallaby-utils-0.10.4-2.el5
python-qpid-0.7.946106-15.el5
qpid-cpp-client-0.7.946106-27.el5
condor-7.4.5-0.7.el5
wallaby-0.10.4-2.el5
qpid-cpp-server-0.7.946106-27.el5
condor-wallaby-client-3.9-3.el5
condor-wallaby-tools-3.9-3.el5
condor-kbdd-7.4.5-0.7.el5
condor-wallaby-base-db-1.5-2.el5
ruby-wallaby-0.10.4-2.el5

How reproducible:
after ~10000 jobs

Steps to Reproduce:
1. set up condor pool
2. start long term test which submit 100 jobs every minute
3. watch directory with log files
  
Actual results:
condor creates shadow log file with bad permissions

Expected results:
condor will create log files with right permissions so it will be able to write into them

Comment 1 Martin Kudlej 2011-02-09 09:17:27 UTC
Created attachment 477772 [details]
condor configuration

Comment 2 Martin Kudlej 2011-02-09 12:23:04 UTC
Created attachment 477794 [details]
condor log files

Comment 3 Matthew Farrellee 2011-02-09 12:28:48 UTC
ShadowLog contains only...

Stack dump for process 8400 at timestamp 1297224136 (5 frames)

This indicates the Shadow crashed, likely during log rotation. Theoretically such a crash could explain the inconsistent state of the log ownership.

Comment 4 Matthew Farrellee 2011-02-09 12:30:47 UTC
ShadowLog.old ends with...

02/08/11 23:02:16 (fd:6) (pid:8400) (2836.12) (8400): SHADOW_JOB_CLEANUP_RETRY_DELAY is undefined, using default value of 30
02/08/11 23:02:16 (fd:10) (pid:6369) (4803.81) (6369): Entering thread safe stop [send] in condor_rw.cpp:359 unknown()
02/08/11 23:02:16 (fd:11) (pid:6377) (4804.8) (6377): Config 'SEC_DEFAULT_CRYPTO_METHODS': no prefix ==> '3DES, 3DES'
02/08/11 23:02:16 (fd:6) (pid:8400) (2836.12) (8400): SHADOW_LAZY_QUEUE_UPDATE is undefined, using default value of True
02/08/11 23:02:16 (fd:10) (pid:6369) (4803.81) (6369): Leaving thread safe stop [send] in condor_rw.cpp:359 unknown()
02/08/11 23:02:16 (fd:6) (pid:8400) (2836.12) (8400): PRIV_CONDOR --> PRIV_USER at write_user_log.cpp:164
02/08/11 23:02:16 (fd:10) (pid:6369) (4803.81) (6369): selector 0xbfe047b8 resetting
02/08/11 23:02:16 (fd:11) (pid:6377) (4804.8) (6377): SEC_SHADOW_CLIENT_SESSION_DURATION is undefined, using default value of 0
02/08/11 23:02:16 (fd:10) (pid:6369) (4803.81) (6369): condor_read(fd=9 schedd at <:46007>,,size=5,timeout=300,flags=0)
02/08/11 23:02:16 (fd:7) (pid:8400) (2836.12) (8400): CREATE_LOCKS_ON_LOCAL_DISK is undefined, using default value of True
02/08/11 23:02:16 (fd:11) (pid:6377) (4804.8) (6377): SEC_SHADOW_DEFAULT_SESSION_DURATION is undefined, using default value of 0
02/08/11 23:02:16 (fd:10) (pid:6369) (4803.81) (6369): selector 0xbfe047b8 adding fd 9 ()
02/08/11 23:02:16 (fd:7) (pid:8400) (2836.12) (8400): Config 'LOCAL_DISK_LOCK_DIR': no prefix ==> '$(LOCK)/local'
MaxLog = 1000000, length = 0
Saving log file to "/var/log/condor/ShadowLog.old"

This also points to rotation, probably from pid 8400, which is the one that dumped into ShadowLog.

Comment 5 Matthew Farrellee 2011-02-09 12:31:41 UTC
The SchedLog* has rotated since the Shadow crash and won't hold useful information.

Comment 6 Matthew Farrellee 2011-02-09 12:33:27 UTC
Another theory, the stack dump signal handler was triggered before the new log file was created in rotation, the signal handler runs with root privs and tried to write to the log, creating it.

Comment 7 Matthew Farrellee 2011-02-09 12:36:15 UTC
Reducing MAX_SHADOW_LOG may help reproduction by forcing more frequent rotation.

Comment 8 Matthew Farrellee 2011-02-28 19:43:48 UTC
Please verify this is still a problem with condor 7.5.6-0.1

Comment 9 Martin Kudlej 2011-03-04 14:43:02 UTC
Will retest during validation cycle.

Comment 10 Matthew Farrellee 2011-04-27 20:22:22 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
N/A

Comment 13 Tomas Rusnak 2011-05-06 08:57:08 UTC
Retested with 10000 jobs for a whole night - unable to reproduce. 

Last ShadowLog entry:
05/05/11 18:32:54 (14022.0) (18922): Job 14022.0 terminated: exited with status 0
05/05/11 18:32:54 (14022.0) (18922): WriteUserLog: not initialized @ writeEvent()
05/05/11 18:32:54 (14022.0) (18922): Forking Mailer process...
05/05/11 18:32:54 (14022.0) (18922): Reporting job exit reason 100 and attempting to fetch new job.
05/05/11 18:32:54 (14022.0) (18922): No new job found to run under this shadow.
05/05/11 18:32:54 (14022.0) (18922): **** condor_shadow (condor_SHADOW) pid 18922 EXITING WITH STATUS 100

No broken permissions:

# ls -la  /var/log/condor/Shadow*
-rw-r--r-- 1 condor condor 1106 May  5 18:32 /var/log/condor/ShadowLog
-rw-r--r-- 1 condor condor 1083 May  5 18:32 /var/log/condor/ShadowLog.old

# condor -v
$CondorVersion: 7.4.5 Feb  4 2011 BuildID: RH-7.4.5-0.8.el5 PRE-RELEASE $
$CondorPlatform: X86_64-LINUX_RHEL5 $

Do you have any idea how to reproduce this?

Comment 14 Tomas Rusnak 2011-05-06 10:11:03 UTC
Retested with current package condor-7.6.1-0.4 on x86,x86_64/RHEL5,RHEL6 with same setup like in previous comment. No such problems found with bad permissions.

Should be reopened when problem raises again.

>>> VERIFIED

Comment 15 errata-xmlrpc 2011-06-23 15:39:21 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2011-0889.html

Comment 16 Brian Bockelman 2012-03-08 22:41:59 UTC
Hi - 

We run into this locally.  I think Matt's conjecture is right: we ran into this twice (once in the ShadowLog, once in the ScheddLog).  Each time, the end of the file contained a stack trace.

So, it's likely a race condition that just happens when other things are crashing.

Brian


Note You need to log in before you can comment on or make changes to this bug.