Description of problem:
I submit 100 "sleep 1"-type jobs every 60 seconds. The jobs have run OK, but when the shadow log rotates, condor creates the new log file with the wrong ownership and the shadow can no longer write to it.

$ ls -l /var/log/condor
total 9368
-rw-r--r-- 1 condor condor 169347 Feb 9 04:00 CollectorLog
-rw-r--r-- 1 condor condor 1000091 Feb 9 04:00 CollectorLog.old
-rw-r--r-- 1 condor condor 669 Feb 4 08:19 ConfigLog
-rw-r--r-- 1 condor condor 127 Feb 9 04:00 dprintf_failure.SHADOW
-rw-r--r-- 1 condor condor 759806 Feb 9 03:59 MasterLog
-rw-r--r-- 1 condor condor 1000114 Feb 9 03:24 MasterLog.old
-rw-r--r-- 1 condor condor 576229 Feb 9 04:00 MatchLog
-rw-r--r-- 1 condor condor 1000190 Feb 9 03:17 MatchLog.old
-rw-r--r-- 1 condor condor 329317 Feb 9 04:00 NegotiatorLog
-rw-r--r-- 1 condor condor 1000189 Feb 9 03:59 NegotiatorLog.old
-rw-r--r-- 1 condor condor 190694 Feb 9 04:00 SchedLog
-rw-r--r-- 1 condor condor 1000112 Feb 9 04:00 SchedLog.old
-rw-r--r-- 1 root root 63 Feb 8 23:02 ShadowLog
-rw-r--r-- 1 condor condor 1000152 Feb 8 23:02 ShadowLog.old
-rw-r--r-- 1 condor condor 14479 Feb 4 08:19 StarterLog
-rw-r--r-- 1 condor condor 378880 Feb 9 04:00 StartLog
-rw-r--r-- 1 condor condor 1000200 Feb 9 03:54 StartLog.old

$ cat dprintf_failure.SHADOW
dprintf() had a fatal error in pid 10935
Can't open "/var/log/condor/ShadowLog"
errno: 13 (Permission denied)
euid: 64, ruid: 0

$ ps aux | grep condor
condor 27027 0.4 0.0 10028 3824 ? Ss Feb04 31:08 condor_master -pidfile /var/run/condor/condor_master.pid
condor 27029 7.5 0.0 10844 5412 ? Ss Feb04 522:23 condor_collector -f
condor 27030 0.1 0.0 9780 3604 ? Ss Feb04 11:39 condor_startd -f
condor 27031 27.8 1.3 117824 111096 ? Ss Feb04 1932:09 condor_schedd -f
condor 27032 1.7 0.0 10512 4960 ? Ds Feb04 119:13 condor_negotiator -f
condor 27033 0.0 0.2 108796 20096 ? Ssl Feb04 2:13 /usr/bin/python /usr/sbin/condor_configd
root 27034 0.0 0.0 8468 3636 ? S Feb04 6:00 condor_procd -A /var/run/condor/procd_pipe.SCHEDD -S 60 -C 64

Version-Release number of selected component (if applicable):
python-wallabyclient-3.9-3.el5
python-condorutils-1.4-6.el5
wallaby-utils-0.10.4-2.el5
python-qpid-0.7.946106-15.el5
qpid-cpp-client-0.7.946106-27.el5
condor-7.4.5-0.7.el5
wallaby-0.10.4-2.el5
qpid-cpp-server-0.7.946106-27.el5
condor-wallaby-client-3.9-3.el5
condor-wallaby-tools-3.9-3.el5
condor-kbdd-7.4.5-0.7.el5
condor-wallaby-base-db-1.5-2.el5
ruby-wallaby-0.10.4-2.el5

How reproducible:
After ~10000 jobs.

Steps to Reproduce:
1. Set up a condor pool.
2. Start a long-term test which submits 100 jobs every minute (see the sketch below).
3. Watch the directory with the log files.

Actual results:
condor creates the shadow log file with bad permissions.

Expected results:
condor creates log files with the right permissions so that it is able to write into them.
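For reference, a minimal sketch of the submission loop from the steps above. It assumes a vanilla-universe sleep job and that condor_submit is on the PATH; the file name sleep.sub is illustrative, not taken from the original setup.

# sleep.sub - minimal submit description for the test jobs (illustrative)
cat > sleep.sub <<'EOF'
universe   = vanilla
executable = /bin/sleep
arguments  = 1
queue 100
EOF

# Submit 100 one-second jobs every 60 seconds until interrupted.
while true; do
    condor_submit sleep.sub
    sleep 60
done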
Created attachment 477772 [details] condor configuration
Created attachment 477794 [details] condor log files
ShadowLog contains only...

Stack dump for process 8400 at timestamp 1297224136 (5 frames)

This indicates the Shadow crashed, likely during log rotation. In theory such a crash could explain the inconsistent state of the log ownership.
ShadowLog.old ends with...

02/08/11 23:02:16 (fd:6) (pid:8400) (2836.12) (8400): SHADOW_JOB_CLEANUP_RETRY_DELAY is undefined, using default value of 30
02/08/11 23:02:16 (fd:10) (pid:6369) (4803.81) (6369): Entering thread safe stop [send] in condor_rw.cpp:359 unknown()
02/08/11 23:02:16 (fd:11) (pid:6377) (4804.8) (6377): Config 'SEC_DEFAULT_CRYPTO_METHODS': no prefix ==> '3DES, 3DES'
02/08/11 23:02:16 (fd:6) (pid:8400) (2836.12) (8400): SHADOW_LAZY_QUEUE_UPDATE is undefined, using default value of True
02/08/11 23:02:16 (fd:10) (pid:6369) (4803.81) (6369): Leaving thread safe stop [send] in condor_rw.cpp:359 unknown()
02/08/11 23:02:16 (fd:6) (pid:8400) (2836.12) (8400): PRIV_CONDOR --> PRIV_USER at write_user_log.cpp:164
02/08/11 23:02:16 (fd:10) (pid:6369) (4803.81) (6369): selector 0xbfe047b8 resetting
02/08/11 23:02:16 (fd:11) (pid:6377) (4804.8) (6377): SEC_SHADOW_CLIENT_SESSION_DURATION is undefined, using default value of 0
02/08/11 23:02:16 (fd:10) (pid:6369) (4803.81) (6369): condor_read(fd=9 schedd at <:46007>,,size=5,timeout=300,flags=0)
02/08/11 23:02:16 (fd:7) (pid:8400) (2836.12) (8400): CREATE_LOCKS_ON_LOCAL_DISK is undefined, using default value of True
02/08/11 23:02:16 (fd:11) (pid:6377) (4804.8) (6377): SEC_SHADOW_DEFAULT_SESSION_DURATION is undefined, using default value of 0
02/08/11 23:02:16 (fd:10) (pid:6369) (4803.81) (6369): selector 0xbfe047b8 adding fd 9 ()
02/08/11 23:02:16 (fd:7) (pid:8400) (2836.12) (8400): Config 'LOCAL_DISK_LOCK_DIR': no prefix ==> '$(LOCK)/local'
MaxLog = 1000000, length = 0
Saving log file to "/var/log/condor/ShadowLog.old"

This also points to rotation, probably from pid 8400, which is the one that dumped into ShadowLog.
The SchedLog* files have rotated since the Shadow crash and won't hold useful information.
Another theory: the stack dump signal handler was triggered during rotation, before the new log file was created. The signal handler runs with root privileges and tried to write to the log, creating it owned by root.
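A toy shell illustration of that theory's end result (not Condor code; it assumes a "condor" account and root privileges, and uses /tmp instead of the real log directory): once the file exists root-owned with mode 0644, appends from the condor euid fail with errno 13, exactly as dprintf_failure.SHADOW reports.

# Simulate a log file created by a root-privileged handler:
touch /tmp/ShadowLog && chmod 644 /tmp/ShadowLog

# A writer running as the condor user now gets EACCES (errno 13):
sudo -u condor sh -c 'echo test >> /tmp/ShadowLog' \
    || echo "append failed as condor, as in dprintf_failure.SHADOW"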
Reducing MAX_SHADOW_LOG may help reproduction by forcing more frequent rotation.
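For example (illustrative value; assumes the packaged /etc/condor/condor_config.local layout), rotation can be forced roughly ten times more often by dropping the limit from the 1 MB seen in the logs above to about 100 kB:

# Rotate ShadowLog at ~100 kB instead of 1 MB, then reconfigure the daemons:
echo 'MAX_SHADOW_LOG = 100000' >> /etc/condor/condor_config.local
condor_reconfig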
Please verify this is still a problem with condor 7.5.6-0.1
Will retest during validation cycle.
Retested with 10000 jobs for a whole night - unable to reproduce.

Last ShadowLog entries:

05/05/11 18:32:54 (14022.0) (18922): Job 14022.0 terminated: exited with status 0
05/05/11 18:32:54 (14022.0) (18922): WriteUserLog: not initialized @ writeEvent()
05/05/11 18:32:54 (14022.0) (18922): Forking Mailer process...
05/05/11 18:32:54 (14022.0) (18922): Reporting job exit reason 100 and attempting to fetch new job.
05/05/11 18:32:54 (14022.0) (18922): No new job found to run under this shadow.
05/05/11 18:32:54 (14022.0) (18922): **** condor_shadow (condor_SHADOW) pid 18922 EXITING WITH STATUS 100

No broken permissions:

# ls -la /var/log/condor/Shadow*
-rw-r--r-- 1 condor condor 1106 May 5 18:32 /var/log/condor/ShadowLog
-rw-r--r-- 1 condor condor 1083 May 5 18:32 /var/log/condor/ShadowLog.old

# condor -v
$CondorVersion: 7.4.5 Feb 4 2011 BuildID: RH-7.4.5-0.8.el5 PRE-RELEASE $
$CondorPlatform: X86_64-LINUX_RHEL5 $

Do you have any idea how to reproduce this?
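For future soak tests, a simple watcher along these lines (path and interval are illustrative) run alongside the submission loop would flag the bad ownership the moment it appears, which may help pin down the window:

# Report any file in the Condor log directory that is not owned by condor:
while true; do
    find /var/log/condor -maxdepth 1 -type f ! -user condor -ls
    sleep 30
done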
Retested with the current package condor-7.6.1-0.4 on x86 and x86_64, RHEL5 and RHEL6, with the same setup as in the previous comment. No problems with bad permissions were found. Should be reopened if the problem arises again. >>> VERIFIED
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHEA-2011-0889.html
Hi - We ran into this locally. I think Matt's conjecture is right: we hit it twice (once in the ShadowLog, once in the ScheddLog), and each time the end of the file contained a stack trace. So it's likely a race condition that only happens when other things are crashing. Brian