Description of problem:
I submit 100 "sleep 1"-type jobs every 60 seconds. The jobs have run OK, but when the shadow log rotates, condor creates the new log file with the wrong ownership and the shadow can no longer write to it.

$ ls -l /var/log/condor
total 9368
-rw-r--r-- 1 condor condor 169347 Feb 9 04:00 CollectorLog
-rw-r--r-- 1 condor condor 1000091 Feb 9 04:00 CollectorLog.old
-rw-r--r-- 1 condor condor 669 Feb 4 08:19 ConfigLog
-rw-r--r-- 1 condor condor 127 Feb 9 04:00 dprintf_failure.SHADOW
-rw-r--r-- 1 condor condor 759806 Feb 9 03:59 MasterLog
-rw-r--r-- 1 condor condor 1000114 Feb 9 03:24 MasterLog.old
-rw-r--r-- 1 condor condor 576229 Feb 9 04:00 MatchLog
-rw-r--r-- 1 condor condor 1000190 Feb 9 03:17 MatchLog.old
-rw-r--r-- 1 condor condor 329317 Feb 9 04:00 NegotiatorLog
-rw-r--r-- 1 condor condor 1000189 Feb 9 03:59 NegotiatorLog.old
-rw-r--r-- 1 condor condor 190694 Feb 9 04:00 SchedLog
-rw-r--r-- 1 condor condor 1000112 Feb 9 04:00 SchedLog.old
-rw-r--r-- 1 root root 63 Feb 8 23:02 ShadowLog
-rw-r--r-- 1 condor condor 1000152 Feb 8 23:02 ShadowLog.old
-rw-r--r-- 1 condor condor 14479 Feb 4 08:19 StarterLog
-rw-r--r-- 1 condor condor 378880 Feb 9 04:00 StartLog
-rw-r--r-- 1 condor condor 1000200 Feb 9 03:54 StartLog.old

$ cat dprintf_failure.SHADOW
dprintf() had a fatal error in pid 10935
Can't open "/var/log/condor/ShadowLog"
errno: 13 (Permission denied)
euid: 64, ruid: 0

$ ps aux | grep condor
condor 27027 0.4 0.0 10028 3824 ? Ss Feb04 31:08 condor_master -pidfile /var/run/condor/condor_master.pid
condor 27029 7.5 0.0 10844 5412 ? Ss Feb04 522:23 condor_collector -f
condor 27030 0.1 0.0 9780 3604 ? Ss Feb04 11:39 condor_startd -f
condor 27031 27.8 1.3 117824 111096 ? Ss Feb04 1932:09 condor_schedd -f
condor 27032 1.7 0.0 10512 4960 ? Ds Feb04 119:13 condor_negotiator -f
condor 27033 0.0 0.2 108796 20096 ? Ssl Feb04 2:13 /usr/bin/python /usr/sbin/condor_configd
root 27034 0.0 0.0 8468 3636 ? S Feb04 6:00 condor_procd -A /var/run/condor/procd_pipe.SCHEDD -S 60 -C 64

Version-Release number of selected component (if applicable):
python-wallabyclient-3.9-3.el5
python-condorutils-1.4-6.el5
wallaby-utils-0.10.4-2.el5
python-qpid-0.7.946106-15.el5
qpid-cpp-client-0.7.946106-27.el5
condor-7.4.5-0.7.el5
wallaby-0.10.4-2.el5
qpid-cpp-server-0.7.946106-27.el5
condor-wallaby-client-3.9-3.el5
condor-wallaby-tools-3.9-3.el5
condor-kbdd-7.4.5-0.7.el5
condor-wallaby-base-db-1.5-2.el5
ruby-wallaby-0.10.4-2.el5

How reproducible:
After ~10000 jobs.

Steps to Reproduce:
1. Set up a condor pool.
2. Start a long-term test which submits 100 jobs every minute (see the sketch below).
3. Watch the directory with the log files.

Actual results:
condor creates the shadow log file with bad permissions.

Expected results:
condor creates log files with the right permissions so that it is able to write into them.
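For reference, a minimal sketch of the submission loop from the steps above. It assumes a vanilla-universe sleep job and that condor_submit is on the PATH; the file name sleep.sub is illustrative, not taken from the original setup.

# sleep.sub - minimal submit description for the test jobs (illustrative)
cat > sleep.sub <<'EOF'
universe   = vanilla
executable = /bin/sleep
arguments  = 1
queue 100
EOF

# Submit 100 one-second jobs every 60 seconds until interrupted.
while true; do
    condor_submit sleep.sub
    sleep 60
done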
Created attachment 477772 [details] condor configuration
Created attachment 477794 [details] condor log files
ShadowLog contains only...

Stack dump for process 8400 at timestamp 1297224136 (5 frames)

This indicates the Shadow crashed, likely during log rotation. In theory such a crash could explain the inconsistent state of the log ownership.
ShadowLog.old ends with...

02/08/11 23:02:16 (fd:6) (pid:8400) (2836.12) (8400): SHADOW_JOB_CLEANUP_RETRY_DELAY is undefined, using default value of 30
02/08/11 23:02:16 (fd:10) (pid:6369) (4803.81) (6369): Entering thread safe stop [send] in condor_rw.cpp:359 unknown()
02/08/11 23:02:16 (fd:11) (pid:6377) (4804.8) (6377): Config 'SEC_DEFAULT_CRYPTO_METHODS': no prefix ==> '3DES, 3DES'
02/08/11 23:02:16 (fd:6) (pid:8400) (2836.12) (8400): SHADOW_LAZY_QUEUE_UPDATE is undefined, using default value of True
02/08/11 23:02:16 (fd:10) (pid:6369) (4803.81) (6369): Leaving thread safe stop [send] in condor_rw.cpp:359 unknown()
02/08/11 23:02:16 (fd:6) (pid:8400) (2836.12) (8400): PRIV_CONDOR --> PRIV_USER at write_user_log.cpp:164
02/08/11 23:02:16 (fd:10) (pid:6369) (4803.81) (6369): selector 0xbfe047b8 resetting
02/08/11 23:02:16 (fd:11) (pid:6377) (4804.8) (6377): SEC_SHADOW_CLIENT_SESSION_DURATION is undefined, using default value of 0
02/08/11 23:02:16 (fd:10) (pid:6369) (4803.81) (6369): condor_read(fd=9 schedd at <:46007>,,size=5,timeout=300,flags=0)
02/08/11 23:02:16 (fd:7) (pid:8400) (2836.12) (8400): CREATE_LOCKS_ON_LOCAL_DISK is undefined, using default value of True
02/08/11 23:02:16 (fd:11) (pid:6377) (4804.8) (6377): SEC_SHADOW_DEFAULT_SESSION_DURATION is undefined, using default value of 0
02/08/11 23:02:16 (fd:10) (pid:6369) (4803.81) (6369): selector 0xbfe047b8 adding fd 9 ()
02/08/11 23:02:16 (fd:7) (pid:8400) (2836.12) (8400): Config 'LOCAL_DISK_LOCK_DIR': no prefix ==> '$(LOCK)/local'
MaxLog = 1000000, length = 0
Saving log file to "/var/log/condor/ShadowLog.old"

This also points to rotation, probably from pid 8400, which is the one that dumped into ShadowLog.
The SchedLog* files have rotated since the Shadow crash and won't hold useful information.
Another theory: the stack dump signal handler was triggered during rotation, before the new log file was created. The signal handler runs with root privileges and tried to write to the log, creating it owned by root.
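A toy shell illustration of that theory's end result (not Condor code; it assumes a "condor" account and root privileges, and uses /tmp instead of the real log directory): once the file exists root-owned with mode 0644, appends from the condor euid fail with errno 13, exactly as dprintf_failure.SHADOW reports.

# Simulate a log file created by a root-privileged handler:
touch /tmp/ShadowLog && chmod 644 /tmp/ShadowLog

# A writer running as the condor user now gets EACCES (errno 13):
sudo -u condor sh -c 'echo test >> /tmp/ShadowLog' \
    || echo "append failed as condor, as in dprintf_failure.SHADOW"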
Reducing MAX_SHADOW_LOG may help reproduction by forcing more frequent rotation.
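For example (illustrative value; assumes the packaged /etc/condor/condor_config.local layout), rotation can be forced roughly ten times more often by dropping the limit from the 1 MB seen in the logs above to about 100 kB:

# Rotate ShadowLog at ~100 kB instead of 1 MB, then reconfigure the daemons:
echo 'MAX_SHADOW_LOG = 100000' >> /etc/condor/condor_config.local
condor_reconfig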
Please verify this is still a problem with condor 7.5.6-0.1
Will retest during validation cycle.
Retested with 10000 jobs for a whole night - unable to reproduce.

Last ShadowLog entries:

05/05/11 18:32:54 (14022.0) (18922): Job 14022.0 terminated: exited with status 0
05/05/11 18:32:54 (14022.0) (18922): WriteUserLog: not initialized @ writeEvent()
05/05/11 18:32:54 (14022.0) (18922): Forking Mailer process...
05/05/11 18:32:54 (14022.0) (18922): Reporting job exit reason 100 and attempting to fetch new job.
05/05/11 18:32:54 (14022.0) (18922): No new job found to run under this shadow.
05/05/11 18:32:54 (14022.0) (18922): **** condor_shadow (condor_SHADOW) pid 18922 EXITING WITH STATUS 100

No broken permissions:

# ls -la /var/log/condor/Shadow*
-rw-r--r-- 1 condor condor 1106 May 5 18:32 /var/log/condor/ShadowLog
-rw-r--r-- 1 condor condor 1083 May 5 18:32 /var/log/condor/ShadowLog.old

# condor -v
$CondorVersion: 7.4.5 Feb 4 2011 BuildID: RH-7.4.5-0.8.el5 PRE-RELEASE $
$CondorPlatform: X86_64-LINUX_RHEL5 $

Do you have any idea how to reproduce this?
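For future soak tests, a simple watcher along these lines (path and interval are illustrative) run alongside the submission loop would flag the bad ownership the moment it appears, which may help pin down the window:

# Report any file in the Condor log directory that is not owned by condor:
while true; do
    find /var/log/condor -maxdepth 1 -type f ! -user condor -ls
    sleep 30
done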
Retested with the current package condor-7.6.1-0.4 on x86 and x86_64, RHEL5 and RHEL6, with the same setup as in the previous comment. No problems with bad permissions were found. Should be reopened if the problem arises again. >>> VERIFIED
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHEA-2011-0889.html
Hi - We ran into this locally. I think Matt's conjecture is right: we hit it twice (once in the ShadowLog, once in the ScheddLog), and each time the end of the file contained a stack trace. So it's likely a race condition that only happens when other things are crashing. Brian