Description of problem:
Condor does not function on Fedora 19.

Version-Release number of selected component (if applicable):
Affects all Fedora 19 x86_64 condor rpms on koji.

How reproducible:
For me, completely.

Steps to Reproduce:
1. Install Fedora 19 from the live ISO.
2. yum update (may need to be done in stages to avoid crashes mid-update).
3. Change the SELinux mode to permissive to avoid SELinux problems, which are already the subject of other bug reports.
4. yum install condor
5. systemctl start condor.service
6. As an ordinary user, in a terminal window, type: condor_run ls

Actual results:
The condor daemons start fine in step 5.
condor_run never delivers output.
After step 6, condor_schedd dies, leaving the following at the end of /var/log/condor/SchedLog:

08/22/13 18:00:15 (pid:1692) Starting add_shadow_birthdate(1.0)
Stack dump for process 1692 at timestamp 1377190815 (4 frames)
/lib64/libcondor_utils_8_1_0.so(dprintf_dump_stack+0x72)[0x7f9887b31972]
/lib64/libcondor_utils_8_1_0.so(+0x17b5f7)[0x7f9887bcc5f7]
/lib64/libpthread.so.0(+0xefa0)[0x7f988361afa0]
[0x7fff6edd2730]

Expected results:
After a gap of seconds to minutes, condor_run ls should list the contents of the user's directory.

Additional info:
Condor works when installed as above on Fedora 17.

I worked backwards through the versions of condor for Fedora 19 on the koji website. They all have this problem, including 7.9.1-0.1, which is the version that runs on Fedora 17.

Then I tried to install the Fedora 17 rpms (condor-7.9.1-0.1.fc17.2.x86_64, condor-classads, condor-procd), but that fails due to a missing dependency: libpcre.so.0

So I used rpm with --nodeps to install the rpms and copied libpcre.so.0 from a Fedora 17 machine into /usr/lib64. Then condor WORKS.

Moreover, the same method works on our condor cluster in the Department of Mathematical Sciences at Durham University. Installing the Fedora 17 rpms with libpcre.so.0 from a Fedora 17 machine lets a Fedora 19 machine function, apparently without any problem, as part of the cluster.

All this is with SELinux in permissive mode. It fails due to SELinux issues in enforcing mode, but those issues may have been addressed in fixes for other reported bugs.

My instinct is that the problem is that the f19 version is built against libpcre.so.1, whereas the f17 version is built against libpcre.so.0. The changelog in the pcre rpm indicates that there were quite big changes between pcre 8.21 (f17) and 8.32 (f19). I don't have the time right now to pull the f19 condor source rpms and modify them to build against libpcre.so.0 to see what happens, but that would be my next step.
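
One quick way to check the libpcre hypothesis above, before rebuilding anything, is to look at which libpcre soname each condor_schedd binary actually resolves. The sketch below is only an illustration: `pcre_sonames` is a hypothetical helper name, and it simply filters the output of `ldd` for libpcre entries.

```shell
#!/bin/sh
# Hypothetical helper (not part of condor): read `ldd` output on stdin
# and print each distinct libpcre soname that the binary links against.
pcre_sonames() {
    # ldd lines look like: "libpcre.so.1 => /lib64/libpcre.so.1 (0x...)"
    awk '$1 ~ /^libpcre\.so\./ { print $1 }' | sort -u
}

# Example, to be run on both the f17 and the f19 install:
#   ldd /usr/sbin/condor_schedd | pcre_sonames
```

If the f19 binary reports libpcre.so.1 and the f17 binary reports libpcre.so.0, that confirms the build difference, although it does not by itself show that pcre is the cause of the crash.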
Installed Fedora 19 from DVD.

# yum update
# rpm -q selinux-policy
selinux-policy-3.12.1-71.fc19.noarch
selinux-policy-targeted-3.12.1-71.fc19.noarch
# yum install condor
(install many dependent packages)
# rpm -qa | fgrep condor
condor-8.1.0-0.2.fc19.x86_64
condor-classads-8.1.0-0.2.fc19.x86_64
condor-procd-8.1.0-0.2.fc19.x86_64
# systemctl enable condor
# systemctl start condor
# ps aux | fgrep condor
condor 2868 0.3 0.0 96872 4416 ? Ss 09:28 0:00 /usr/sbin/condor_master -f
root 2869 0.3 0.0 23964 3072 ? S 09:28 0:00 condor_procd -A /var/run/condor/procd_pipe -R 10000000 -S 60 -C 990
condor 2872 0.0 0.0 92780 4468 ? Ss 09:28 0:00 condor_collector -f
root 3120 0.0 0.0 107964 660 pts/1 S+ 09:28 0:00 fgrep --color=auto condor

NO condor_negotiator, NO condor_schedd, NO condor_startd

# tail /var/log/messages
Aug 23 09:16:08 hopf setroubleshoot: SELinux is preventing /usr/sbin/condor_master from read access on the file hosts. For complete SELinux messages. run sealert -l 17eff763-7c56-49d3-bbb3-d21af42f5861
...
Aug 23 09:16:06 hopf setroubleshoot: SELinux is preventing /usr/sbin/condor_master from read access on the file meminfo. For complete SELinux messages. run sealert -l f820579e-eafd-4a47-b64d-f4f41e048e11
...
Aug 23 09:16:06 hopf setroubleshoot: SELinux is preventing /usr/sbin/condor_master from read access on the file cpuinfo. For complete SELinux messages. run sealert -l f820579e-eafd-4a47-b64d-f4f41e048e11
Aug 23 09:16:06 hopf setroubleshoot: SELinux is preventing /usr/sbin/condor_master from read access on the file resolv.conf. For complete SELinux messages. run sealert -l 17eff763-7c56-49d3-bbb3-d21af42f5861
...
Aug 23 09:16:07 hopf setroubleshoot: SELinux is preventing /usr/sbin/condor_master from write access on the file .master_address.new. For complete SELinux messages. run sealert -l 274a39f3-92d2-47bf-95b5-0cefb5d7ff6a
...
Aug 23 09:16:07 hopf setroubleshoot: SELinux is preventing /usr/sbin/condor_master from setattr access on the file MasterLog. For complete SELinux messages. run sealert -l a47020cd-71f5-4972-9529-8550ca6b36ce
Aug 23 09:16:07 hopf setroubleshoot: SELinux is preventing /usr/sbin/condor_master from read access on the file stat. For complete SELinux messages. run sealert -l f820579e-eafd-4a47-b64d-f4f41e048e11
...
Aug 23 09:17:07 hopf setroubleshoot: SELinux is preventing /usr/sbin/condor_collector from setattr access on the file CollectorLog. For complete SELinux messages. run sealert -l e6342827-fc60-4b19-a9dd-d2160b1c4774

# yum install policycoreutils-devel
# fgrep condor /var/log/audit/audit.log | audit2allow

#============= condor_collector_t ==============
allow condor_collector_t condor_log_t:file { write setattr };

#============= condor_master_t ==============
allow condor_master_t condor_log_t:file { write setattr };
allow condor_master_t net_conf_t:file read;
allow condor_master_t proc_t:file read;

SO THERE ARE STILL SELINUX ISSUES, WHICH CAN PROBABLY BE FIXED BY CREATING SEMODULES.
FIRST WE MUST CHECK IF CONDOR WORKS WITH SELINUX TURNED OFF.

# setenforce 0
# systemctl restart condor
# ps aux | fgrep condor
condor 2544 0.1 0.0 92808 4620 ? Ss 09:27 0:00 /usr/sbin/condor_master -f
root 2545 0.3 0.0 23964 3096 ? S 09:27 0:00 condor_procd -A /var/run/condor/procd_pipe -R 10000000 -S 60 -C 990
condor 2546 0.1 0.0 92804 4640 ? Ss 09:27 0:00 condor_collector -f
condor 2551 0.1 0.0 92904 4652 ? Ss 09:27 0:00 condor_negotiator -f
condor 2552 0.1 0.0 94052 5204 ? Ss 09:27 0:00 condor_schedd -f
condor 2553 0.2 0.0 93212 4944 ? Ss 09:27 0:00 condor_startd -f
condor 2765 109 0.0 16848 464 ? R 09:27 0:01 mips
root 2767 0.0 0.0 107964 660 pts/1 S+ 09:27 0:00 fgrep --color=auto condor

# systemctl status condor
condor.service - Condor Distributed High-Throughput-Computing
   Loaded: loaded (/usr/lib/systemd/system/condor.service; enabled)
   Active: active (running) since Fri 2013-08-23 09:41:50 BST; 37s ago
  Process: 5012 ExecStop=/usr/sbin/condor_off -master (code=exited, status=0/SUCCESS)
 Main PID: 5089 (condor_master)
   CGroup: name=systemd:/system/condor.service
           ├─5089 /usr/sbin/condor_master -f
           ├─5092 condor_procd -A /var/run/condor/procd_pipe -R 10000000 -S 6...
           ├─5093 condor_collector -f
           ├─5098 condor_negotiator -f
           ├─5099 condor_schedd -f
           └─5100 condor_startd -f

Aug 23 09:41:50 hopf.dur.ac.uk systemd[1]: Started Condor Distributed High-T....

bernard% condor_run date
(THE COMMAND HANGS WITHOUT OUTPUT)

# cat /var/log/condor/SchedLog
...
08/23/13 09:45:08 (pid:5847) ******************************************************
08/23/13 09:45:08 (pid:5847) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
08/23/13 09:45:08 (pid:5847) ** /usr/sbin/condor_schedd
08/23/13 09:45:08 (pid:5847) ** SubsystemInfo: name=SCHEDD type=SCHEDD(5) class=DAEMON(1)
08/23/13 09:45:08 (pid:5847) ** Configuration: subsystem:SCHEDD local:<NONE> class:DAEMON
08/23/13 09:45:08 (pid:5847) ** $CondorVersion: 8.1.0 Jul 15 2013 BuildID: RH-8.1.0-0.2.fc19 PRE-RELEASE-UWCS $
08/23/13 09:45:08 (pid:5847) ** $CondorPlatform: X86_64-Fedora_19 $
08/23/13 09:45:08 (pid:5847) ** PID = 5847
08/23/13 09:45:08 (pid:5847) ** Log last touched 8/23 09:44:43
08/23/13 09:45:08 (pid:5847) ******************************************************
08/23/13 09:45:08 (pid:5847) Using config source: /etc/condor/condor_config
08/23/13 09:45:08 (pid:5847) Using local config sources:
08/23/13 09:45:08 (pid:5847) /etc/condor/config.d/00personal_condor.config
08/23/13 09:45:08 (pid:5847) DaemonCore: command socket at <129.234.21.14:36324>
08/23/13 09:45:08 (pid:5847) DaemonCore: private command socket at <129.234.21.14:36324>
08/23/13 09:45:08 (pid:5847) History file rotation is enabled.
08/23/13 09:45:08 (pid:5847) Maximum history file size is: 20971520 bytes
08/23/13 09:45:08 (pid:5847) Number of rotated history files is: 2
08/23/13 09:45:08 (pid:5847) Failed to execute /usr/sbin/condor_shadow.std, ignoring
08/23/13 09:45:08 (pid:5847) About to rotate ClassAd log /var/lib/condor/spool/job_queue.log
08/23/13 09:45:08 (pid:5847) 1.0: JobLeaseDuration remaining: 1123
08/23/13 09:45:08 (pid:5847) directory_util::rec_touch_file: Directory /var/lock/condor/local cannot be created (Permission denied)
08/23/13 09:45:08 (pid:5847) Starting add_shadow_birthdate(1.0)
Stack dump for process 5847 at timestamp 1377247508 (4 frames)
/lib64/libcondor_utils_8_1_0.so(dprintf_dump_stack+0x72)[0x3ee52e0972]
/lib64/libcondor_utils_8_1_0.so[0x3ee537b5f7]
/lib64/libc.so.6[0x3edea35a90]
[0x7fff7c604960]

SO /var/lock/condor/local (which is /run/lock/condor/local) CAN'T BE CREATED BY condor

# ls -l /var/run/ | fgrep condor
drwxrwxr-x. 2 condor condor 80 Aug 23 09:47 condor

IS IT NOT condor TRYING TO CREATE /var/run/condor/local ?

# chmod a+w /var/run/condor
# systemctl stop condor
# rm /var/lib/condor/spool/job*
# systemctl restart condor

bernard% condor_run date

# cat /var/log/condor/SchedLog
...
08/23/13 10:17:48 (pid:9852) ******************************************************
08/23/13 10:17:48 (pid:9852) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
08/23/13 10:17:48 (pid:9852) ** /usr/sbin/condor_schedd
08/23/13 10:17:48 (pid:9852) ** SubsystemInfo: name=SCHEDD type=SCHEDD(5) class=DAEMON(1)
08/23/13 10:17:48 (pid:9852) ** Configuration: subsystem:SCHEDD local:<NONE> class:DAEMON
08/23/13 10:17:48 (pid:9852) ** $CondorVersion: 8.1.0 Jul 15 2013 BuildID: RH-8.1.0-0.2.fc19 PRE-RELEASE-UWCS $
08/23/13 10:17:48 (pid:9852) ** $CondorPlatform: X86_64-Fedora_19 $
08/23/13 10:17:48 (pid:9852) ** PID = 9852
08/23/13 10:17:48 (pid:9852) ** Log last touched 8/23 10:16:15
08/23/13 10:17:48 (pid:9852) ******************************************************
08/23/13 10:17:48 (pid:9852) Using config source: /etc/condor/condor_config
08/23/13 10:17:48 (pid:9852) Using local config sources:
08/23/13 10:17:48 (pid:9852) /etc/condor/config.d/00personal_condor.config
08/23/13 10:17:48 (pid:9852) DaemonCore: command socket at <129.234.21.14:48753>
08/23/13 10:17:48 (pid:9852) DaemonCore: private command socket at <129.234.21.14:48753>
08/23/13 10:17:48 (pid:9852) History file rotation is enabled.
08/23/13 10:17:48 (pid:9852) Maximum history file size is: 20971520 bytes
08/23/13 10:17:48 (pid:9852) Number of rotated history files is: 2
08/23/13 10:17:48 (pid:9852) Failed to execute /usr/sbin/condor_shadow.std, ignoring
08/23/13 10:17:53 (pid:9852) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
08/23/13 10:17:53 (pid:9852) TransferQueueManager upload 1m I/O load: 0 bytes/s 0.000 disk load 0.000 net load
08/23/13 10:17:53 (pid:9852) TransferQueueManager download 1m I/O load: 0 bytes/s 0.000 disk load 0.000 net load
08/23/13 10:19:58 (pid:9852) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
08/23/13 10:19:58 (pid:9852) TransferQueueManager upload 1m I/O load: 0 bytes/s 0.000 disk load 0.000 net load
08/23/13 10:19:58 (pid:9852) TransferQueueManager download 1m I/O load: 0 bytes/s 0.000 disk load 0.000 net load
08/23/13 10:19:58 (pid:9852) Sent ad to central manager for bernard.ac.uk
08/23/13 10:19:58 (pid:9852) Sent ad to 1 collectors for bernard.ac.uk
08/23/13 10:20:08 (pid:9852) Using negotiation protocol: NEGOTIATE
08/23/13 10:20:08 (pid:9852) Negotiating for owner: bernard.ac.uk
08/23/13 10:20:08 (pid:9852) AutoCluster:config() significant attributes changed to JobUniverse,LastCheckpointPlatform,NumCkpts,MachineLastMatchTime,RemoteGroup,SubmitterGroup,SubmitterUserPrio
08/23/13 10:20:08 (pid:9852) Checking consistency running and runnable jobs
08/23/13 10:20:08 (pid:9852) Tables are consistent
08/23/13 10:20:08 (pid:9852) Rebuilt prioritized runnable job list in 0.000s.
08/23/13 10:20:08 (pid:9852) Finished negotiating for bernard in local pool: 1 matched, 0 rejected
08/23/13 10:20:08 (pid:9852) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
08/23/13 10:20:08 (pid:9852) TransferQueueManager upload 1m I/O load: 0 bytes/s 0.000 disk load 0.000 net load
08/23/13 10:20:08 (pid:9852) TransferQueueManager download 1m I/O load: 0 bytes/s 0.000 disk load 0.000 net load
08/23/13 10:20:08 (pid:9852) Sent ad to central manager for bernard.ac.uk
08/23/13 10:20:08 (pid:9852) Sent ad to 1 collectors for bernard.ac.uk
08/23/13 10:20:08 (pid:9852) Starting add_shadow_birthdate(1.0)
Stack dump for process 9852 at timestamp 1377249609 (4 frames)
/lib64/libcondor_utils_8_1_0.so(dprintf_dump_stack+0x72)[0x3ee52e0972]
/lib64/libcondor_utils_8_1_0.so[0x3ee537b5f7]
/lib64/libc.so.6[0x3edea35a90]
[0x7fffcb058e60]

SO NO MORE PROBLEMS WITH CREATING /var/lock/condor/local (/run/lock/condor/local). BUT

# ls -l /var/lock/condor
total 0
-rw-------. 1 condor condor 0 Aug 23 10:17 InstanceLock
drwxrwxrwx. 2 bernard bernard 40 Aug 23 10:20 local

WHY IS IT THE USER CREATING THAT DIRECTORY? THIS IS OBVIOUSLY WHY IT FAILED BEFORE.
THIS IS OBVIOUSLY WRONG, AS SEVERAL USERS CAN USE CONDOR ON THE SAME COMPUTER.

ANYWAY, WE ARE NOT OUT OF TROUBLE YET: THE schedd IS THEN RESTARTED EVERY 30 SECONDS. IT LOOKS AS IF IT CRASHES.

# systemctl status condor
condor.service - Condor Distributed High-Throughput-Computing
   Loaded: loaded (/usr/lib/systemd/system/condor.service; enabled)
   Active: active (running) since Fri 2013-08-23 10:10:24 BST; 1min 16s ago
  Process: 8871 ExecStop=/usr/sbin/condor_off -master (code=exited, status=0/SUCCESS)
 Main PID: 8931 (condor_master)
   CGroup: name=systemd:/system/condor.service
           ├─8931 /usr/sbin/condor_master -f
           ├─8932 condor_procd -A /var/run/condor/procd_pipe -R 10000000 -S 6...
           ├─8935 condor_collector -f
           ├─9017 condor_negotiator -f
           └─9020 condor_startd -f

Aug 23 10:10:24 hopf.dur.ac.uk systemd[1]: Started Condor Distributed High-T....
Aug 23 10:10:47 hopf.dur.ac.uk sendmail[9123]: r7N9Alkx009123: from=condor, ...t
Aug 23 10:10:47 hopf.dur.ac.uk sendmail[9123]: r7N9Alkx009123: to=root@hopf....)
Aug 23 10:10:58 hopf.dur.ac.uk sendmail[9189]: r7N9AvYw009189: from=condor, ...t
Aug 23 10:10:58 hopf.dur.ac.uk sendmail[9189]: r7N9AvYw009189: to=root@hopf....)
Aug 23 10:11:09 hopf.dur.ac.uk sendmail[9254]: r7N9B9Ha009254: from=condor, ...t
Aug 23 10:11:09 hopf.dur.ac.uk sendmail[9254]: r7N9B9Ha009254: to=root@hopf....)
Aug 23 10:11:22 hopf.dur.ac.uk sendmail[9263]: r7N9BMQR009263: from=condor, ...t
Aug 23 10:11:22 hopf.dur.ac.uk sendmail[9263]: r7N9BMQR009263: to=root@hopf....)

SO THE NEXT THING TO DO IS TO SOLVE THE PROBLEM OF CREATING /var/lock/condor/local CORRECTLY
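
The restart cadence described in this comment can be read directly from SchedLog, since every schedd start writes a "STARTING UP" banner line. The sketch below is an illustration only: `schedd_startups` is a hypothetical helper name, and it assumes the log line format shown in this report.

```shell
#!/bin/sh
# Hypothetical helper: print the date, time and pid of every
# condor_schedd start-up banner found in the given log files
# (or on stdin when no file is given).
schedd_startups() {
    awk 'index($0, "condor_schedd (CONDOR_SCHEDD) STARTING UP") { print $1, $2, $3 }' "$@"
}

# Example:
#   schedd_startups /var/log/condor/SchedLog
```

A run of entries with a new pid every few tens of seconds indicates that the master keeps restarting a crashing schedd.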
Oh! This rings a bell! Try adding the following to /etc/tmpfiles.d/condor.conf:

d /var/lock/condor/local 0775 condor condor -

(there should already be 2 other lines there). Then, run:

systemd-tmpfiles --create

(this would also get done at boot).

Thank you for your hard work at this!
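
After running systemd-tmpfiles, it may be worth confirming that the directory really has the mode and ownership the tmpfiles.d line asks for. The following is a minimal sketch, not anything shipped with condor: `check_dir` is a hypothetical helper, and it relies on GNU `stat -c` as found on Fedora.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if directory $1 exists with octal
# mode $2 and owner:group $3. Uses GNU stat format sequences.
check_dir() {
    dir=$1 want_mode=$2 want_owner=$3
    if [ ! -d "$dir" ]; then
        echo "missing: $dir"
        return 1
    fi
    have_mode=$(stat -c '%a' "$dir")       # permission bits, e.g. 775
    have_owner=$(stat -c '%U:%G' "$dir")   # e.g. condor:condor
    if [ "$have_mode" = "$want_mode" ] && [ "$have_owner" = "$want_owner" ]; then
        echo "ok: $dir $have_mode $have_owner"
    else
        echo "mismatch: $dir is $have_mode $have_owner, expected $want_mode $want_owner"
        return 1
    fi
}

# Example, matching the tmpfiles.d line above:
#   check_dir /var/lock/condor/local 775 condor:condor
```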
After doing the above:

# getenforce
Permissive
# systemctl start condor
# tail -f /var/log/condor/SchedLog
08/30/13 16:22:24 (pid:11324) ******************************************************
08/30/13 16:22:24 (pid:11324) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
08/30/13 16:22:24 (pid:11324) ** /usr/sbin/condor_schedd
08/30/13 16:22:24 (pid:11324) ** SubsystemInfo: name=SCHEDD type=SCHEDD(5) class=DAEMON(1)
08/30/13 16:22:24 (pid:11324) ** Configuration: subsystem:SCHEDD local:<NONE> class:DAEMON
08/30/13 16:22:24 (pid:11324) ** $CondorVersion: 8.1.0 Jul 15 2013 BuildID: RH-8.1.0-0.2.fc19 PRE-RELEASE-UWCS $
08/30/13 16:22:24 (pid:11324) ** $CondorPlatform: X86_64-Fedora_19 $
08/30/13 16:22:24 (pid:11324) ** PID = 11324
08/30/13 16:22:24 (pid:11324) ** Log last touched 8/30 16:21:17
08/30/13 16:22:24 (pid:11324) ******************************************************
08/30/13 16:22:24 (pid:11324) Using config source: /etc/condor/condor_config
08/30/13 16:22:24 (pid:11324) Using local config sources:
08/30/13 16:22:24 (pid:11324) /etc/condor/config.d/00personal_condor.config
08/30/13 16:22:24 (pid:11324) DaemonCore: command socket at <129.234.21.14:35109>
08/30/13 16:22:24 (pid:11324) DaemonCore: private command socket at <129.234.21.14:35109>
08/30/13 16:22:24 (pid:11324) History file rotation is enabled.
08/30/13 16:22:24 (pid:11324) Maximum history file size is: 20971520 bytes
08/30/13 16:22:24 (pid:11324) Number of rotated history files is: 2
08/30/13 16:22:24 (pid:11324) Failed to execute /usr/sbin/condor_shadow.std, ignoring
08/30/13 16:22:30 (pid:11324) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
08/30/13 16:22:30 (pid:11324) TransferQueueManager upload 1m I/O load: 0 bytes/s 0.000 disk load 0.000 net load
08/30/13 16:22:30 (pid:11324) TransferQueueManager download 1m I/O load: 0 bytes/s 0.000 disk load 0.000 net load

and the schedd is running.

Then

bernard% condor_submit condor_job_test.txt
Submitting job(s).
1 job(s) submitted to cluster 1.
bernard% condor_q

-- Failed to fetch ads from: <129.234.21.14:35109> : hopf.dur.ac.uk
CEDAR:6001:Failed to connect to <129.234.21.14:35109>

The schedd has died and

# tail -f /var/log/condor/SchedLog
08/30/13 16:24:01 (pid:11324) Number of Active Workers 1
08/30/13 16:24:01 (pid:11572) Number of Active Workers 0
08/30/13 16:24:10 (pid:11324) directory_util::rec_touch_file: File /var/lock/condor/local//63/70/91326650395759.lockc cannot be created (Permission denied)
08/30/13 16:24:10 (pid:11324) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
08/30/13 16:24:10 (pid:11324) TransferQueueManager upload 1m I/O load: 0 bytes/s 0.000 disk load 0.000 net load
08/30/13 16:24:10 (pid:11324) TransferQueueManager download 1m I/O load: 0 bytes/s 0.000 disk load 0.000 net load
08/30/13 16:24:10 (pid:11324) Sent ad to central manager for bernard.ac.uk
08/30/13 16:24:10 (pid:11324) Sent ad to 1 collectors for bernard.ac.uk
08/30/13 16:24:10 (pid:11324) Using negotiation protocol: NEGOTIATE
08/30/13 16:24:10 (pid:11324) Negotiating for owner: bernard.ac.uk
08/30/13 16:24:10 (pid:11324) AutoCluster:config() significant attributes changed to JobUniverse,LastCheckpointPlatform,NumCkpts,MachineLastMatchTime,RemoteGroup,SubmitterGroup,SubmitterUserPrio
08/30/13 16:24:10 (pid:11324) Checking consistency running and runnable jobs
08/30/13 16:24:10 (pid:11324) Tables are consistent
08/30/13 16:24:10 (pid:11324) Rebuilt prioritized runnable job list in 0.000s.
08/30/13 16:24:10 (pid:11324) Finished negotiating for bernard in local pool: 1 matched, 0 rejected
08/30/13 16:24:10 (pid:11324) Starting add_shadow_birthdate(1.0)
Stack dump for process 11324 at timestamp 1377876250 (4 frames)
/lib64/libcondor_utils_8_1_0.so(dprintf_dump_stack+0x72)[0x3ee52e0972]
/lib64/libcondor_utils_8_1_0.so[0x3ee537b5f7]
/lib64/libc.so.6[0x3edea35a90]
[0x7fff2660ac20]

The schedd dies and the master daemon tries to restart it every 30 seconds.

/var/lock/condor/local is empty.

# chmod 777 /var/lock/condor/local/
# systemctl stop condor
# rm /var/lib/condor/spool/job*
# systemctl restart condor

bernard% condor_submit condor_job_test.txt
Submitting job(s).
1 job(s) submitted to cluster 1.

# tail -f /var/log/condor/SchedLog
08/30/13 16:29:22 (pid:11836) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
08/30/13 16:29:22 (pid:11836) TransferQueueManager upload 1m I/O load: 0 bytes/s 0.000 disk load 0.000 net load
08/30/13 16:29:22 (pid:11836) TransferQueueManager download 1m I/O load: 0 bytes/s 0.000 disk load 0.000 net load
08/30/13 16:29:27 (pid:11836) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
08/30/13 16:29:27 (pid:11836) TransferQueueManager upload 1m I/O load: 0 bytes/s 0.000 disk load 0.000 net load
08/30/13 16:29:27 (pid:11836) TransferQueueManager download 1m I/O load: 0 bytes/s 0.000 disk load 0.000 net load
08/30/13 16:29:27 (pid:11836) Sent ad to central manager for bernard.ac.uk
08/30/13 16:29:27 (pid:11836) Sent ad to 1 collectors for bernard.ac.uk
08/30/13 16:29:37 (pid:11836) Using negotiation protocol: NEGOTIATE
08/30/13 16:29:37 (pid:11836) Negotiating for owner: bernard.ac.uk
08/30/13 16:29:37 (pid:11836) AutoCluster:config() significant attributes changed to JobUniverse,LastCheckpointPlatform,NumCkpts,MachineLastMatchTime,RemoteGroup,SubmitterGroup,SubmitterUserPrio
08/30/13 16:29:37 (pid:11836) Checking consistency running and runnable jobs
08/30/13 16:29:37 (pid:11836) Tables are consistent
08/30/13 16:29:37 (pid:11836) Rebuilt prioritized runnable job list in 0.000s.
08/30/13 16:29:37 (pid:11836) Finished negotiating for bernard in local pool: 1 matched, 0 rejected
08/30/13 16:29:37 (pid:11836) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
08/30/13 16:29:37 (pid:11836) TransferQueueManager upload 1m I/O load: 0 bytes/s 0.000 disk load 0.000 net load
08/30/13 16:29:37 (pid:11836) TransferQueueManager download 1m I/O load: 0 bytes/s 0.000 disk load 0.000 net load
08/30/13 16:29:37 (pid:11836) Sent ad to central manager for bernard.ac.uk
08/30/13 16:29:37 (pid:11836) Sent ad to 1 collectors for bernard.ac.uk
08/30/13 16:29:37 (pid:11836) Starting add_shadow_birthdate(1.0)
Stack dump for process 11836 at timestamp 1377876578 (4 frames)
/lib64/libcondor_utils_8_1_0.so(dprintf_dump_stack+0x72)[0x3ee52e0972]
/lib64/libcondor_utils_8_1_0.so[0x3ee537b5f7]
/lib64/libc.so.6[0x3edea35a90]
[0x7fff91c9f3c0]
08/30/13 16:29:48 (pid:12107) ******************************************************
08/30/13 16:29:48 (pid:12107) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
08/30/13 16:29:48 (pid:12107) ** /usr/sbin/condor_schedd
08/30/13 16:29:48 (pid:12107) ** SubsystemInfo: name=SCHEDD type=SCHEDD(5) class=DAEMON(1)
08/30/13 16:29:48 (pid:12107) ** Configuration: subsystem:SCHEDD local:<NONE> class:DAEMON
08/30/13 16:29:48 (pid:12107) ** $CondorVersion: 8.1.0 Jul 15 2013 BuildID: RH-8.1.0-0.2.fc19 PRE-RELEASE-UWCS $
08/30/13 16:29:48 (pid:12107) ** $CondorPlatform: X86_64-Fedora_19 $
08/30/13 16:29:48 (pid:12107) ** PID = 12107
08/30/13 16:29:48 (pid:12107) ** Log last touched 8/30 16:29:38
08/30/13 16:29:48 (pid:12107) ******************************************************
08/30/13 16:29:48 (pid:12107) Using config source: /etc/condor/condor_config
08/30/13 16:29:48 (pid:12107) Using local config sources:
08/30/13 16:29:48 (pid:12107) /etc/condor/config.d/00personal_condor.config
08/30/13 16:29:48 (pid:12107) DaemonCore: command socket at <129.234.21.14:58513>
08/30/13 16:29:48 (pid:12107) DaemonCore: private command socket at <129.234.21.14:58513>
08/30/13 16:29:48 (pid:12107) History file rotation is enabled.
08/30/13 16:29:48 (pid:12107) Maximum history file size is: 20971520 bytes
08/30/13 16:29:48 (pid:12107) Number of rotated history files is: 2
08/30/13 16:29:48 (pid:12107) Failed to execute /usr/sbin/condor_shadow.std, ignoring
08/30/13 16:29:48 (pid:12107) About to rotate ClassAd log /var/lib/condor/spool/job_queue.log
08/30/13 16:29:48 (pid:12107) 1.0: JobLeaseDuration remaining: 1189
08/30/13 16:29:48 (pid:12107) Starting add_shadow_birthdate(1.0)
Stack dump for process 12107 at timestamp 1377876588 (4 frames)
/lib64/libcondor_utils_8_1_0.so(dprintf_dump_stack+0x72)[0x3ee52e0972]
/lib64/libcondor_utils_8_1_0.so[0x3ee537b5f7]
/lib64/libc.so.6[0x3edea35a90]
[0x7fff27074ce0]

So the schedd has died again, and it keeps crashing each time the daemon is restarted. The errors above do not give much of a clue.

# ls -l /var/lock/condor/local
total 0

So nothing is actually created in that directory.

Other logs:

# tail -f /var/log/condor/StartLog
08/30/13 16:40:46 ******************************************************
08/30/13 16:40:46 ** condor_startd (CONDOR_STARTD) STARTING UP
08/30/13 16:40:46 ** /usr/sbin/condor_startd
08/30/13 16:40:46 ** SubsystemInfo: name=STARTD type=STARTD(7) class=DAEMON(1)
08/30/13 16:40:46 ** Configuration: subsystem:STARTD local:<NONE> class:DAEMON
08/30/13 16:40:46 ** $CondorVersion: 8.1.0 Jul 15 2013 BuildID: RH-8.1.0-0.2.fc19 PRE-RELEASE-UWCS $
08/30/13 16:40:46 ** $CondorPlatform: X86_64-Fedora_19 $
08/30/13 16:40:46 ** PID = 13386
08/30/13 16:40:46 ** Log last touched 8/30 16:40:18
08/30/13 16:40:46 ******************************************************
08/30/13 16:40:46 Using config source: /etc/condor/condor_config
08/30/13 16:40:46 Using local config sources:
08/30/13 16:40:46 /etc/condor/config.d/00personal_condor.config
08/30/13 16:40:46 DaemonCore: command socket at <129.234.21.14:49432>
08/30/13 16:40:46 DaemonCore: private command socket at <129.234.21.14:49432>
08/30/13 16:40:46 my_popenv failed
08/30/13 16:40:52 Failed to execute /usr/sbin/condor_starter.std, ignoring
08/30/13 16:40:52 VM-gahp server reported an internal error
08/30/13 16:40:52 VM universe will be tested to check if it is available
08/30/13 16:40:52 History file rotation is enabled.
08/30/13 16:40:52 Maximum history file size is: 20971520 bytes
08/30/13 16:40:52 Number of rotated history files is: 2
slot type 0: Cpus: 1, Memory: auto, Swap: auto, Disk: auto
slot type 0: Cpus: 1, Memory: 3980, Swap: 25.00%, Disk: 25.00%
slot type 0: Cpus: 1, Memory: auto, Swap: auto, Disk: auto
slot type 0: Cpus: 1, Memory: 3980, Swap: 25.00%, Disk: 25.00%
slot type 0: Cpus: 1, Memory: auto, Swap: auto, Disk: auto
slot type 0: Cpus: 1, Memory: 3980, Swap: 25.00%, Disk: 25.00%
slot type 0: Cpus: 1, Memory: auto, Swap: auto, Disk: auto
slot type 0: Cpus: 1, Memory: 3980, Swap: 25.00%, Disk: 25.00%
08/30/13 16:40:52 slot1: New machine resource allocated
08/30/13 16:40:52 slot2: New machine resource allocated
08/30/13 16:40:52 slot3: New machine resource allocated
08/30/13 16:40:52 slot4: New machine resource allocated
08/30/13 16:40:52 my_popenv failed
08/30/13 16:40:52 CronJobList: Adding job 'mips'
08/30/13 16:40:52 CronJobList: Adding job 'kflops'
08/30/13 16:40:52 CronJob: Initializing job 'mips' (/usr/libexec/condor/condor_mips)
08/30/13 16:40:52 CronJob: Initializing job 'kflops' (/usr/libexec/condor/condor_kflops)
08/30/13 16:40:52 slot1: State change: IS_OWNER is false
08/30/13 16:40:52 slot1: Changing state: Owner -> Unclaimed
08/30/13 16:40:52 State change: RunBenchmarks is TRUE
08/30/13 16:40:52 slot1: Changing activity: Idle -> Benchmarking
08/30/13 16:40:52 BenchMgr:StartBenchmarks()
08/30/13 16:40:52 slot2: State change: IS_OWNER is false
08/30/13 16:40:52 slot2: Changing state: Owner -> Unclaimed
08/30/13 16:40:52 State change: RunBenchmarks is TRUE
08/30/13 16:40:52 slot2: Changing activity: Idle -> Benchmarking
08/30/13 16:40:52 slot2: Changing activity: Benchmarking -> Idle
08/30/13 16:40:52 slot3: State change: IS_OWNER is false
08/30/13 16:40:52 slot3: Changing state: Owner -> Unclaimed
08/30/13 16:40:52 State change: RunBenchmarks is TRUE
08/30/13 16:40:52 slot3: Changing activity: Idle -> Benchmarking
08/30/13 16:40:52 slot3: Changing activity: Benchmarking -> Idle
08/30/13 16:40:52 slot4: State change: IS_OWNER is false
08/30/13 16:40:52 slot4: Changing state: Owner -> Unclaimed
08/30/13 16:40:52 State change: RunBenchmarks is TRUE
08/30/13 16:40:52 slot4: Changing activity: Idle -> Benchmarking
08/30/13 16:40:52 slot4: Changing activity: Benchmarking -> Idle
08/30/13 16:40:52 Starter pid 13389 exited with status 2
08/30/13 16:40:52 Warning: Starter pid 13389 is not associated with an claim. A slot may fail to transition to Idle.
08/30/13 16:40:52 Starter pid 13570 exited with status 2
08/30/13 16:40:52 Warning: Starter pid 13570 is not associated with an claim. A slot may fail to transition to Idle.
08/30/13 16:40:52 Starter pid 13571 exited with status 2
08/30/13 16:40:52 Warning: Starter pid 13571 is not associated with an claim. A slot may fail to transition to Idle.
08/30/13 16:41:17 State change: benchmarks completed
08/30/13 16:41:17 slot1: Changing activity: Benchmarking -> Idle

# tail -f /var/log/condor/CollectorLog
08/30/13 16:43:52 SubmittorAd : Inserting ** "< bernard.ac.ukhopf.dur.ac.uk , 129.234.21.14 >"
08/30/13 16:43:52 stats: Inserting new hashent for 'Submittor':'bernard.ac.uk':'129.234.21.14'
08/30/13 16:43:52 Number of Active Workers 1
08/30/13 16:43:52 Number of Active Workers 0
08/30/13 16:43:52 (Sending 1 ads in response to query)
08/30/13 16:43:52 Query info: matched=1; skipped=0; query_time=0.000567; send_time=0.001892; type=Negotiator; requirements={true}; peer=<129.234.21.14:44446>; projection={}
08/30/13 16:43:52 Number of Active Workers 1
08/30/13 16:43:52 Number of Active Workers 0
08/30/13 16:43:52 (Sending 6 ads in response to query)
08/30/13 16:43:52 Query info: matched=6; skipped=3; query_time=0.000624; send_time=0.002858; type=Any; requirements={( ( ( MyType == "Scheduler" ) || ( MyType == "Submitter" ) ) || ( ( MyType == "Machine" ) ) )}; peer=<129.234.21.14:56642>; projection={}
08/30/13 16:43:52 Got QUERY_STARTD_PVT_ADS
08/30/13 16:43:52 Number of Active Workers 2
08/30/13 16:43:52 Number of Active Workers 1
08/30/13 16:43:52 (Sending 4 ads in response to query)
08/30/13 16:43:52 Query info: matched=4; skipped=0; query_time=0.000617; send_time=0.000270; type=MachinePrivate; requirements={true}; peer=<129.234.21.14:34435>; projection={}

So nothing works yet, and the logs do not give us much of a clue as to what goes wrong. Where do we take it from here?
I have the same problem (not SELinux-related, and not the missing entries in tmpfiles.d).

The only workaround I found was replacing "condor_schedd" with the version from 7.9.1; you also need to add libclassad.so.4 and libcondor_utils_7_9_1.so in /lib64. So you end up with a hybrid of 8.1.0 and 7.9.1; ugly, but it works perfectly.

As 8.1.1 is out, I hope we get beyond that so we can do some real testing.
condor-8.1.1-0.2.fc19 has been submitted as an update for Fedora 19. https://admin.fedoraproject.org/updates/condor-8.1.1-0.2.fc19
I upgraded to 8.1.1-0.2.fc19, and condor_schedd is still crashing.

09/25/13 17:10:35 (pid:9830) Setting maximum file descriptors to 4096.
09/25/13 17:10:35 (pid:9830) ******************************************************
09/25/13 17:10:35 (pid:9830) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
09/25/13 17:10:35 (pid:9830) ** /usr/sbin/condor_schedd
09/25/13 17:10:35 (pid:9830) ** SubsystemInfo: name=SCHEDD type=SCHEDD(5) class=DAEMON(1)
09/25/13 17:10:35 (pid:9830) ** Configuration: subsystem:SCHEDD local:<NONE> class:DAEMON
09/25/13 17:10:35 (pid:9830) ** $CondorVersion: 8.1.1 Sep 25 2013 BuildID: RH-8.1.1-0.2.fc19 $
09/25/13 17:10:35 (pid:9830) ** $CondorPlatform: X86_64-Fedora_19 $
09/25/13 17:10:35 (pid:9830) ** PID = 9830
09/25/13 17:10:35 (pid:9830) ** Log last touched 9/25 17:10:24
09/25/13 17:10:35 (pid:9830) ******************************************************
09/25/13 17:10:35 (pid:9830) Using config source: /etc/condor/condor_config
09/25/13 17:10:35 (pid:9830) Using local config sources:
09/25/13 17:10:35 (pid:9830) /etc/condor/config.d/00personal_condor.config
09/25/13 17:10:35 (pid:9830) CLASSAD_CACHING is ENABLED
09/25/13 17:10:35 (pid:9830) DaemonCore: command socket at <34.53.49.27:49027>
09/25/13 17:10:35 (pid:9830) DaemonCore: private command socket at <34.53.49.27:49027>
09/25/13 17:10:35 (pid:9830) History file rotation is enabled.
09/25/13 17:10:35 (pid:9830) Maximum history file size is: 20971520 bytes
09/25/13 17:10:35 (pid:9830) Number of rotated history files is: 2
09/25/13 17:10:35 (pid:9830) Failed to execute /usr/sbin/condor_shadow.std, ignoring
09/25/13 17:10:35 (pid:9830) About to rotate ClassAd log /var/lib/condor/spool/job_queue.log
09/25/13 17:10:35 (pid:9830) 1.0: JobLeaseDuration remaining: 1178
09/25/13 17:10:35 (pid:9830) directory_util::rec_touch_file: Directory /var/lock/condor/local//76 cannot be created (Permission denied)
09/25/13 17:10:35 (pid:9830) Starting add_shadow_birthdate(1.0)
Stack dump for process 9830 at timestamp 1380147035 (4 frames)
/lib64/libcondor_utils_8_1_1.so(dprintf_dump_stack+0x72)[0x7fcc8ccf9362]
/lib64/libcondor_utils_8_1_1.so(+0x11ae47)[0x7fcc8ccb6e47]
/lib64/libpthread.so.0[0x353800efa0]
[0x7fff4fd11d80]
Could you please run `cat /etc/tmpfiles.d/condor.conf` and post the output? It's possible that a restart is needed.
As another data point, I tested this build on f18 in permissive mode and it worked fine. You might want to try setting SELinux to permissive mode first.
I have the same as in Comment #6:

> tail SchedLog
09/26/13 09:40:05 (pid:16928) Setting maximum file descriptors to 4096.
09/26/13 09:40:05 (pid:16928) ******************************************************
09/26/13 09:40:05 (pid:16928) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
09/26/13 09:40:05 (pid:16928) ** /usr/sbin/condor_schedd
09/26/13 09:40:05 (pid:16928) ** SubsystemInfo: name=SCHEDD type=SCHEDD(5) class=DAEMON(1)
09/26/13 09:40:05 (pid:16928) ** Configuration: subsystem:SCHEDD local:<NONE> class:DAEMON
09/26/13 09:40:05 (pid:16928) ** $CondorVersion: 8.1.1 Sep 25 2013 BuildID: RH-8.1.1-0.2.fc19 $
09/26/13 09:40:05 (pid:16928) ** $CondorPlatform: X86_64-Fedora_19 $
09/26/13 09:40:05 (pid:16928) ** PID = 16928
09/26/13 09:40:05 (pid:16928) ** Log last touched 9/26 09:38:52
09/26/13 09:40:05 (pid:16928) ******************************************************
09/26/13 09:40:05 (pid:16928) Using config source: /etc/condor/condor_config
09/26/13 09:40:05 (pid:16928) Using local config sources:
09/26/13 09:40:05 (pid:16928) /etc/condor/config.d/00personal_condor.config
09/26/13 09:40:05 (pid:16928) /etc/condor/config.d/01standard_condor.config
09/26/13 09:40:05 (pid:16928) /etc/condor/config.d/80optout_users.config
09/26/13 09:40:05 (pid:16928) CLASSAD_CACHING is ENABLED
09/26/13 09:40:05 (pid:16928) SharedPortEndpoint: waiting for connections to named socket 16697_ec03_11
09/26/13 09:40:05 (pid:16928) DaemonCore: command socket at <10.33.133.176:9618?sock=16697_ec03_11>
09/26/13 09:40:05 (pid:16928) DaemonCore: private command socket at <10.33.133.176:9618?sock=16697_ec03_11>
09/26/13 09:40:05 (pid:16928) History file rotation is enabled.
09/26/13 09:40:05 (pid:16928) Maximum history file size is: 20971520 bytes
09/26/13 09:40:05 (pid:16928) Number of rotated history files is: 2
09/26/13 09:40:05 (pid:16928) Failed to execute /usr/sbin/condor_shadow.std, ignoring
09/26/13 09:40:05 (pid:16928) About to rotate ClassAd log /var/lib/condor/spool/job_queue.log
09/26/13 09:40:05 (pid:16928) 1.0: JobLeaseDuration remaining: 920
09/26/13 09:40:05 (pid:16928) Starting add_shadow_birthdate(1.0)
Stack dump for process 16928 at timestamp 1380181205 (4 frames)
/lib64/libcondor_utils_8_1_1.so(dprintf_dump_stack+0x72)[0x7f51af08d362]
/lib64/libcondor_utils_8_1_1.so(+0x11ae47)[0x7f51af04ae47]
/lib64/libpthread.so.0(+0xefa0)[0x7f51aaaf9fa0]
[0x7fff684fa400]

* Version: condor-8.1.1-0.2.fc19.x86_64
* Condor was stopped/restarted;
* No SELinux around;
* tmpfiles stuff:
d /var/run/condor 0775 condor condor -
d /var/lock/condor 0775 condor condor -
d /var/lock/condor/local 0775 condor condor -

So: no solution yet...
# cat /etc/tmpfiles.d/condor.conf
d /var/run/condor 0775 condor condor -
d /var/lock/condor 0775 condor condor -
d /var/lock/condor/local 0775 condor condor -

# getenforce
Disabled

# condor_version
$CondorVersion: 8.1.1 Sep 25 2013 BuildID: RH-8.1.1-0.2.fc19 $
$CondorPlatform: X86_64-Fedora_19 $
I provisioned an f19 box and reproduced the crash. Oddly enough, f18 works fine.
Package condor-8.1.1-0.2.fc19:
* should fix your issue,
* was pushed to the Fedora 19 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing condor-8.1.1-0.2.fc19'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2013-17697/condor-8.1.1-0.2.fc19
then log in and leave karma (feedback).
The following is a workaround:

USE_CLONE_TO_CREATE_PROCESSES = False
(In reply to Erik Erlandson from comment #13)
> The following is a workaround:
>
> USE_CLONE_TO_CREATE_PROCESSES = False

I can verify that the workaround does, in fact, work.
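For anyone who wants to try the workaround from comment #13 before an updated package lands, here is a minimal sketch of one way to apply it. It assumes the `/etc/condor/config.d` directory listed under "Using local config sources" in the SchedLog output above is read on this install. The function name `write_clone_workaround` and the file name `99-no-clone.config` are made up for illustration and are not part of condor.

```shell
# Write the workaround setting into a drop-in config file.
# $1: config directory (defaults to the config.d directory seen in the logs).
# The file name 99-no-clone.config is hypothetical; any name should do.
write_clone_workaround() {
    conf_dir="${1:-/etc/condor/config.d}"
    printf '%s\n' 'USE_CLONE_TO_CREATE_PROCESSES = False' \
        > "$conf_dir/99-no-clone.config"
}
```

Run `write_clone_workaround` as root, then restart with `systemctl restart condor.service`. Afterwards, `condor_config_val USE_CLONE_TO_CREATE_PROCESSES` should report the new value if the drop-in was picked up.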
A new build with the workaround is in flight, and an upstream ticket has been created outlining the issue.
condor-8.1.1-0.3.fc19 has been submitted as an update for Fedora 19. https://admin.fedoraproject.org/updates/condor-8.1.1-0.3.fc19
condor-8.1.1-0.3.fc20 has been submitted as an update for Fedora 20. https://admin.fedoraproject.org/updates/condor-8.1.1-0.3.fc20
Package condor-8.1.1-0.3.fc19:
* should fix your issue,
* was pushed to the Fedora 19 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing condor-8.1.1-0.3.fc19'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2013-19960/condor-8.1.1-0.3.fc19
then log in and leave karma (feedback).
condor-8.1.1-0.3.fc19 has been pushed to the Fedora 19 stable repository. If problems still persist, please make note of it in this bug report.
condor-8.1.1-0.3.fc20 has been pushed to the Fedora 20 stable repository. If problems still persist, please make note of it in this bug report.