| Summary: | Condor schedd dies on job submission in fedora 19 | ||
|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | bugzilla |
| Component: | condor | Assignee: | Erik Erlandson <eerlands> |
| Status: | CLOSED ERRATA | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | 19 | CC: | bbockelm, Bert.Deknuydt, b.m.a.g.piette, eerlands, g2boojum, ltoscano, matt, tomspur, tstclair |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | condor-8.1.1-0.3.fc20 | Doc Type: | Bug Fix |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2013-11-05 02:49:33 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
|
Description
bugzilla
2013-08-22 17:19:21 UTC
Installed Fedora 19 from DVD.
# yum update
# rpm -q selinux-policy
selinux-policy-3.12.1-71.fc19.noarch
selinux-policy-targeted-3.12.1-71.fc19.noarch
# yum install condor
(install many dependent packages)
# rpm -qa | fgrep condor
condor-8.1.0-0.2.fc19.x86_64
condor-classads-8.1.0-0.2.fc19.x86_64
condor-procd-8.1.0-0.2.fc19.x86_64
# systemctl enable condor
# systemctl start condor
# ps aux | fgrep condor
condor 2868 0.3 0.0 96872 4416 ? Ss 09:28 0:00 /usr/sbin/condor_master -f
root 2869 0.3 0.0 23964 3072 ? S 09:28 0:00 condor_procd -A /var/run/condor/procd_pipe -R 10000000 -S 60 -C 990
condor 2872 0.0 0.0 92780 4468 ? Ss 09:28 0:00 condor_collector -f
root 3120 0.0 0.0 107964 660 pts/1 S+ 09:28 0:00 fgrep --color=auto condor
NO condor_negotiator, NO condor_schedd, NO condor_startd
# tail /var/log/messages
Aug 23 09:16:08 hopf setroubleshoot: SELinux is preventing /usr/sbin/condor_master from read access on the file hosts. For complete SELinux messages. run sealert -l 17eff763-7c56-49d3-bbb3-d21af42f5861
...
Aug 23 09:16:06 hopf setroubleshoot: SELinux is preventing /usr/sbin/condor_master from read access on the file meminfo. For complete SELinux messages. run sealert -l f820579e-eafd-4a47-b64d-f4f41e048e11
...
Aug 23 09:16:06 hopf setroubleshoot: SELinux is preventing /usr/sbin/condor_master from read access on the file cpuinfo. For complete SELinux messages. run sealert -l f820579e-eafd-4a47-b64d-f4f41e048e11
Aug 23 09:16:06 hopf setroubleshoot: SELinux is preventing /usr/sbin/condor_master from read access on the file resolv.conf. For complete SELinux messages. run sealert -l 17eff763-7c56-49d3-bbb3-d21af42f5861
...
Aug 23 09:16:07 hopf setroubleshoot: SELinux is preventing /usr/sbin/condor_master from write access on the file .master_address.new. For complete SELinux messages. run sealert -l 274a39f3-92d2-47bf-95b5-0cefb5d7ff6a
...
Aug 23 09:16:07 hopf setroubleshoot: SELinux is preventing /usr/sbin/condor_master from setattr access on the file MasterLog. For complete SELinux messages. run sealert -l a47020cd-71f5-4972-9529-8550ca6b36ce
Aug 23 09:16:07 hopf setroubleshoot: SELinux is preventing /usr/sbin/condor_master from read access on the file stat. For complete SELinux messages. run sealert -l f820579e-eafd-4a47-b64d-f4f41e048e11
...
Aug 23 09:17:07 hopf setroubleshoot: SELinux is preventing /usr/sbin/condor_collector from setattr access on the file CollectorLog. For complete SELinux messages. run sealert -l e6342827-fc60-4b19-a9dd-d2160b1c4774
# yum install policycoreutils-devel
# fgrep condor /var/log/audit/audit.log | audit2allow
#============= condor_collector_t ==============
allow condor_collector_t condor_log_t:file { write setattr };
#============= condor_master_t ==============
allow condor_master_t condor_log_t:file { write setattr };
allow condor_master_t net_conf_t:file read;
allow condor_master_t proc_t:file read;
SO THERE ARE STILL SELINUX ISSUES WHICH CAN PROBABLY BE FIXED BY CREATING
SEMODULES
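A local policy module of the kind mentioned above could be generated roughly as follows. This is only a sketch: the module name `condorlocal` is made up for illustration, and the generated `.te` file should be reviewed before loading it, since audit2allow simply permits whatever was denied.

```shell
# Sketch only: build and load a local SELinux module from condor denials.
# MODULE is a hypothetical name; AUDIT_LOG is the path used in this report.
MODULE=condorlocal
AUDIT_LOG=${AUDIT_LOG:-/var/log/audit/audit.log}

build_condor_module() {
    # Writes $MODULE.te (readable rules) and $MODULE.pp (loadable package)
    # into the current directory.
    grep condor "$AUDIT_LOG" | audit2allow -M "$MODULE"
}

install_condor_module() {
    # Loads the package built above; requires root.
    semodule -i "$MODULE.pp"
}
```

Run `build_condor_module`, inspect `condorlocal.te`, and only then run `install_condor_module` as root.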
FIRST WE MUST CHECK IF CONDOR WORKS WITH SELINUX TURNED OFF
# setenforce 0
# systemctl restart condor
# ps aux | fgrep condor
condor 2544 0.1 0.0 92808 4620 ? Ss 09:27 0:00 /usr/sbin/condor_master -f
root 2545 0.3 0.0 23964 3096 ? S 09:27 0:00 condor_procd -A /var/run/condor/procd_pipe -R 10000000 -S 60 -C 990
condor 2546 0.1 0.0 92804 4640 ? Ss 09:27 0:00 condor_collector -f
condor 2551 0.1 0.0 92904 4652 ? Ss 09:27 0:00 condor_negotiator -f
condor 2552 0.1 0.0 94052 5204 ? Ss 09:27 0:00 condor_schedd -f
condor 2553 0.2 0.0 93212 4944 ? Ss 09:27 0:00 condor_startd -f
condor 2765 109 0.0 16848 464 ? R 09:27 0:01 mips
root 2767 0.0 0.0 107964 660 pts/1 S+ 09:27 0:00 fgrep --color=auto condor
# systemctl status condor
condor.service - Condor Distributed High-Throughput-Computing
Loaded: loaded (/usr/lib/systemd/system/condor.service; enabled)
Active: active (running) since Fri 2013-08-23 09:41:50 BST; 37s ago
Process: 5012 ExecStop=/usr/sbin/condor_off -master (code=exited, status=0/SUCCESS)
Main PID: 5089 (condor_master)
CGroup: name=systemd:/system/condor.service
├─5089 /usr/sbin/condor_master -f
├─5092 condor_procd -A /var/run/condor/procd_pipe -R 10000000 -S 6...
├─5093 condor_collector -f
├─5098 condor_negotiator -f
├─5099 condor_schedd -f
└─5100 condor_startd -f
Aug 23 09:41:50 hopf.dur.ac.uk systemd[1]: Started Condor Distributed High-T....
bernard% condor_run date
(THE COMMAND HANGS WITHOUT OUTPUT)
# cat /var/log/condor/SchedLog
...
08/23/13 09:45:08 (pid:5847) ******************************************************
08/23/13 09:45:08 (pid:5847) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
08/23/13 09:45:08 (pid:5847) ** /usr/sbin/condor_schedd
08/23/13 09:45:08 (pid:5847) ** SubsystemInfo: name=SCHEDD type=SCHEDD(5) class=DAEMON(1)
08/23/13 09:45:08 (pid:5847) ** Configuration: subsystem:SCHEDD local:<NONE> class:DAEMON
08/23/13 09:45:08 (pid:5847) ** $CondorVersion: 8.1.0 Jul 15 2013 BuildID: RH-8.1.0-0.2.fc19 PRE-RELEASE-UWCS $
08/23/13 09:45:08 (pid:5847) ** $CondorPlatform: X86_64-Fedora_19 $
08/23/13 09:45:08 (pid:5847) ** PID = 5847
08/23/13 09:45:08 (pid:5847) ** Log last touched 8/23 09:44:43
08/23/13 09:45:08 (pid:5847) ******************************************************
08/23/13 09:45:08 (pid:5847) Using config source: /etc/condor/condor_config
08/23/13 09:45:08 (pid:5847) Using local config sources:
08/23/13 09:45:08 (pid:5847) /etc/condor/config.d/00personal_condor.config
08/23/13 09:45:08 (pid:5847) DaemonCore: command socket at <129.234.21.14:36324>
08/23/13 09:45:08 (pid:5847) DaemonCore: private command socket at <129.234.21.14:36324>
08/23/13 09:45:08 (pid:5847) History file rotation is enabled.
08/23/13 09:45:08 (pid:5847) Maximum history file size is: 20971520 bytes
08/23/13 09:45:08 (pid:5847) Number of rotated history files is: 2
08/23/13 09:45:08 (pid:5847) Failed to execute /usr/sbin/condor_shadow.std, ignoring
08/23/13 09:45:08 (pid:5847) About to rotate ClassAd log /var/lib/condor/spool/job_queue.log
08/23/13 09:45:08 (pid:5847) 1.0: JobLeaseDuration remaining: 1123
08/23/13 09:45:08 (pid:5847) directory_util::rec_touch_file: Directory /var/lock/condor/local cannot be created (Permission denied)
08/23/13 09:45:08 (pid:5847) Starting add_shadow_birthdate(1.0)
Stack dump for process 5847 at timestamp 1377247508 (4 frames)
/lib64/libcondor_utils_8_1_0.so(dprintf_dump_stack+0x72)[0x3ee52e0972]
/lib64/libcondor_utils_8_1_0.so[0x3ee537b5f7]
/lib64/libc.so.6[0x3edea35a90]
[0x7fff7c604960]
SO /var/lock/condor/local (which is /run/lock/condor/local) CAN'T BE CREATED
BY condor
# ls -l /var/run/ | fgrep condor
drwxrwxr-x. 2 condor condor 80 Aug 23 09:47 condor
IS IT NOT condor TRYING TO CREATE /var/run/condor/local ?
# chmod a+w /var/run/condor
# systemctl stop condor
# rm /var/lib/condor/spool/job*
# systemctl restart condor
bernard% condor_run date
# cat /var/log/condor/SchedLog
...
08/23/13 10:17:48 (pid:9852) ******************************************************
08/23/13 10:17:48 (pid:9852) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
08/23/13 10:17:48 (pid:9852) ** /usr/sbin/condor_schedd
08/23/13 10:17:48 (pid:9852) ** SubsystemInfo: name=SCHEDD type=SCHEDD(5) class=DAEMON(1)
08/23/13 10:17:48 (pid:9852) ** Configuration: subsystem:SCHEDD local:<NONE> class:DAEMON
08/23/13 10:17:48 (pid:9852) ** $CondorVersion: 8.1.0 Jul 15 2013 BuildID: RH-8.1.0-0.2.fc19 PRE-RELEASE-UWCS $
08/23/13 10:17:48 (pid:9852) ** $CondorPlatform: X86_64-Fedora_19 $
08/23/13 10:17:48 (pid:9852) ** PID = 9852
08/23/13 10:17:48 (pid:9852) ** Log last touched 8/23 10:16:15
08/23/13 10:17:48 (pid:9852) ******************************************************
08/23/13 10:17:48 (pid:9852) Using config source: /etc/condor/condor_config
08/23/13 10:17:48 (pid:9852) Using local config sources:
08/23/13 10:17:48 (pid:9852) /etc/condor/config.d/00personal_condor.config
08/23/13 10:17:48 (pid:9852) DaemonCore: command socket at <129.234.21.14:48753>
08/23/13 10:17:48 (pid:9852) DaemonCore: private command socket at <129.234.21.14:48753>
08/23/13 10:17:48 (pid:9852) History file rotation is enabled.
08/23/13 10:17:48 (pid:9852) Maximum history file size is: 20971520 bytes
08/23/13 10:17:48 (pid:9852) Number of rotated history files is: 2
08/23/13 10:17:48 (pid:9852) Failed to execute /usr/sbin/condor_shadow.std, ignoring
08/23/13 10:17:53 (pid:9852) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
08/23/13 10:17:53 (pid:9852) TransferQueueManager upload 1m I/O load: 0 bytes/s 0.000 disk load 0.000 net load
08/23/13 10:17:53 (pid:9852) TransferQueueManager download 1m I/O load: 0 bytes/s 0.000 disk load 0.000 net load
08/23/13 10:19:58 (pid:9852) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
08/23/13 10:19:58 (pid:9852) TransferQueueManager upload 1m I/O load: 0 bytes/s 0.000 disk load 0.000 net load
08/23/13 10:19:58 (pid:9852) TransferQueueManager download 1m I/O load: 0 bytes/s 0.000 disk load 0.000 net load
08/23/13 10:19:58 (pid:9852) Sent ad to central manager for bernard.ac.uk
08/23/13 10:19:58 (pid:9852) Sent ad to 1 collectors for bernard.ac.uk
08/23/13 10:20:08 (pid:9852) Using negotiation protocol: NEGOTIATE
08/23/13 10:20:08 (pid:9852) Negotiating for owner: bernard.ac.uk
08/23/13 10:20:08 (pid:9852) AutoCluster:config() significant attributes changed to JobUniverse,LastCheckpointPlatform,NumCkpts,MachineLastMatchTime,RemoteGroup,SubmitterGroup,SubmitterUserPrio
08/23/13 10:20:08 (pid:9852) Checking consistency running and runnable jobs
08/23/13 10:20:08 (pid:9852) Tables are consistent
08/23/13 10:20:08 (pid:9852) Rebuilt prioritized runnable job list in 0.000s.
08/23/13 10:20:08 (pid:9852) Finished negotiating for bernard in local pool: 1 matched, 0 rejected
08/23/13 10:20:08 (pid:9852) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
08/23/13 10:20:08 (pid:9852) TransferQueueManager upload 1m I/O load: 0 bytes/s 0.000 disk load 0.000 net load
08/23/13 10:20:08 (pid:9852) TransferQueueManager download 1m I/O load: 0 bytes/s 0.000 disk load 0.000 net load
08/23/13 10:20:08 (pid:9852) Sent ad to central manager for bernard.ac.uk
08/23/13 10:20:08 (pid:9852) Sent ad to 1 collectors for bernard.ac.uk
08/23/13 10:20:08 (pid:9852) Starting add_shadow_birthdate(1.0)
Stack dump for process 9852 at timestamp 1377249609 (4 frames)
/lib64/libcondor_utils_8_1_0.so(dprintf_dump_stack+0x72)[0x3ee52e0972]
/lib64/libcondor_utils_8_1_0.so[0x3ee537b5f7]
/lib64/libc.so.6[0x3edea35a90]
[0x7fffcb058e60]
SO NO MORE PROBLEMS WITH CREATING /var/lock/condor/local
(/run/lock/condor/local). BUT
# ls -l /var/lock/condor
total 0
-rw-------. 1 condor condor 0 Aug 23 10:17 InstanceLock
drwxrwxrwx. 2 bernard bernard 40 Aug 23 10:20 local
WHY IS IT THE USER CREATING THAT DIRECTORY? THIS IS OBVIOUSLY WHY IT FAILED
BEFORE. THIS IS OBVIOUSLY WRONG AS SEVERAL USERS CAN USE CONDOR ON THE
SAME COMPUTER.
ANYWAY WE ARE NOT OUT OF TROUBLE YET:
schedd IS THEN RESTARTED EVERY 30 SECONDS. IT LOOKS AS IF IT CRASHES.
# systemctl status condor
condor.service - Condor Distributed High-Throughput-Computing
Loaded: loaded (/usr/lib/systemd/system/condor.service; enabled)
Active: active (running) since Fri 2013-08-23 10:10:24 BST; 1min 16s ago
Process: 8871 ExecStop=/usr/sbin/condor_off -master (code=exited, status=0/SUCCESS)
Main PID: 8931 (condor_master)
CGroup: name=systemd:/system/condor.service
├─8931 /usr/sbin/condor_master -f
├─8932 condor_procd -A /var/run/condor/procd_pipe -R 10000000 -S 6...
├─8935 condor_collector -f
├─9017 condor_negotiator -f
└─9020 condor_startd -f
Aug 23 10:10:24 hopf.dur.ac.uk systemd[1]: Started Condor Distributed High-T....
Aug 23 10:10:47 hopf.dur.ac.uk sendmail[9123]: r7N9Alkx009123: from=condor, ...t
Aug 23 10:10:47 hopf.dur.ac.uk sendmail[9123]: r7N9Alkx009123: to=root@hopf....)
Aug 23 10:10:58 hopf.dur.ac.uk sendmail[9189]: r7N9AvYw009189: from=condor, ...t
Aug 23 10:10:58 hopf.dur.ac.uk sendmail[9189]: r7N9AvYw009189: to=root@hopf....)
Aug 23 10:11:09 hopf.dur.ac.uk sendmail[9254]: r7N9B9Ha009254: from=condor, ...t
Aug 23 10:11:09 hopf.dur.ac.uk sendmail[9254]: r7N9B9Ha009254: to=root@hopf....)
Aug 23 10:11:22 hopf.dur.ac.uk sendmail[9263]: r7N9BMQR009263: from=condor, ...t
Aug 23 10:11:22 hopf.dur.ac.uk sendmail[9263]: r7N9BMQR009263: to=root@hopf....)
SO THE NEXT THING TO DO IS TO SOLVE THE PROBLEM OF CREATING
/var/lock/condor/local CORRECTLY
Oh! This rings a bell! Try adding the following to /etc/tmpfiles.d/condor.conf:
d /var/lock/condor/local 0775 condor condor -
(there should already be 2 other lines there). Then, run:
systemd-tmpfiles --create
(this would also get done at boot).
Thank you for your hard work at this!

After doing the above:
# getenforce
Permissive
# systemctl start condor
# tail -f /var/log/condor/SchedLog
08/30/13 16:22:24 (pid:11324) ******************************************************
08/30/13 16:22:24 (pid:11324) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
08/30/13 16:22:24 (pid:11324) ** /usr/sbin/condor_schedd
08/30/13 16:22:24 (pid:11324) ** SubsystemInfo: name=SCHEDD type=SCHEDD(5) class=DAEMON(1)
08/30/13 16:22:24 (pid:11324) ** Configuration: subsystem:SCHEDD local:<NONE> class:DAEMON
08/30/13 16:22:24 (pid:11324) ** $CondorVersion: 8.1.0 Jul 15 2013 BuildID: RH-8.1.0-0.2.fc19 PRE-RELEASE-UWCS $
08/30/13 16:22:24 (pid:11324) ** $CondorPlatform: X86_64-Fedora_19 $
08/30/13 16:22:24 (pid:11324) ** PID = 11324
08/30/13 16:22:24 (pid:11324) ** Log last touched 8/30 16:21:17
08/30/13 16:22:24 (pid:11324) ******************************************************
08/30/13 16:22:24 (pid:11324) Using config source: /etc/condor/condor_config
08/30/13 16:22:24 (pid:11324) Using local config sources:
08/30/13 16:22:24 (pid:11324) /etc/condor/config.d/00personal_condor.config
08/30/13 16:22:24 (pid:11324) DaemonCore: command socket at <129.234.21.14:35109>
08/30/13 16:22:24 (pid:11324) DaemonCore: private command socket at <129.234.21.14:35109>
08/30/13 16:22:24 (pid:11324) History file rotation is enabled.
08/30/13 16:22:24 (pid:11324) Maximum history file size is: 20971520 bytes
08/30/13 16:22:24 (pid:11324) Number of rotated history files is: 2
08/30/13 16:22:24 (pid:11324) Failed to execute /usr/sbin/condor_shadow.std, ignoring
08/30/13 16:22:30 (pid:11324) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
08/30/13 16:22:30 (pid:11324) TransferQueueManager upload 1m I/O load: 0 bytes/s 0.000 disk load 0.000 net load
08/30/13 16:22:30 (pid:11324) TransferQueueManager download 1m I/O load: 0 bytes/s 0.000 disk load 0.000 net load
and the schedd is running
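The tmpfiles.d change suggested earlier in this thread can be scripted so that it is safe to run more than once. A minimal sketch, assuming the entry text and default path quoted in that suggestion; the helper name `ensure_tmpfiles_entry` is hypothetical:

```shell
# Sketch only: add the lock-directory entry to a tmpfiles.d file if missing.
ENTRY='d /var/lock/condor/local 0775 condor condor -'

ensure_tmpfiles_entry() {
    # $1: tmpfiles.d file to edit (defaults to the path named in this report).
    conf=${1:-/etc/tmpfiles.d/condor.conf}
    # -x matches the whole line, -F treats ENTRY as a fixed string.
    grep -qxF -- "$ENTRY" "$conf" 2>/dev/null || printf '%s\n' "$ENTRY" >> "$conf"
}
```

As root, `ensure_tmpfiles_entry && systemd-tmpfiles --create` would then add the entry and create the directory without waiting for a reboot.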
Then
bernard% condor_submit condor_job_test.txt
Submitting job(s).
1 job(s) submitted to cluster 1.
bernard% condor_q
-- Failed to fetch ads from: <129.234.21.14:35109> : hopf.dur.ac.uk
CEDAR:6001:Failed to connect to <129.234.21.14:35109>
The schedd has died and
# tail -f /var/log/condor/SchedLog
08/30/13 16:24:01 (pid:11324) Number of Active Workers 1
08/30/13 16:24:01 (pid:11572) Number of Active Workers 0
08/30/13 16:24:10 (pid:11324) directory_util::rec_touch_file: File /var/lock/condor/local//63/70/91326650395759.lockc cannot be created (Permission denied)
08/30/13 16:24:10 (pid:11324) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
08/30/13 16:24:10 (pid:11324) TransferQueueManager upload 1m I/O load: 0 bytes/s 0.000 disk load 0.000 net load
08/30/13 16:24:10 (pid:11324) TransferQueueManager download 1m I/O load: 0 bytes/s 0.000 disk load 0.000 net load
08/30/13 16:24:10 (pid:11324) Sent ad to central manager for bernard.ac.uk
08/30/13 16:24:10 (pid:11324) Sent ad to 1 collectors for bernard.ac.uk
08/30/13 16:24:10 (pid:11324) Using negotiation protocol: NEGOTIATE
08/30/13 16:24:10 (pid:11324) Negotiating for owner: bernard.ac.uk
08/30/13 16:24:10 (pid:11324) AutoCluster:config() significant attributes changed to JobUniverse,LastCheckpointPlatform,NumCkpts,MachineLastMatchTime,RemoteGroup,SubmitterGroup,SubmitterUserPrio
08/30/13 16:24:10 (pid:11324) Checking consistency running and runnable jobs
08/30/13 16:24:10 (pid:11324) Tables are consistent
08/30/13 16:24:10 (pid:11324) Rebuilt prioritized runnable job list in 0.000s.
08/30/13 16:24:10 (pid:11324) Finished negotiating for bernard in local pool: 1 matched, 0 rejected
08/30/13 16:24:10 (pid:11324) Starting add_shadow_birthdate(1.0)
Stack dump for process 11324 at timestamp 1377876250 (4 frames)
/lib64/libcondor_utils_8_1_0.so(dprintf_dump_stack+0x72)[0x3ee52e0972]
/lib64/libcondor_utils_8_1_0.so[0x3ee537b5f7]
/lib64/libc.so.6[0x3edea35a90]
[0x7fff2660ac20]
The schedd dies and the master daemon tries to restart it every 30 seconds.
/var/lock/condor/local is empty
# chmod 777 /var/lock/condor/local/
# systemctl stop condor
# rm /var/lib/condor/spool/job*
# systemctl restart condor
bernard% condor_submit condor_job_test.txt
Submitting job(s).
1 job(s) submitted to cluster 1.
# tail -f /var/log/condor/SchedLog
08/30/13 16:29:22 (pid:11836) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
08/30/13 16:29:22 (pid:11836) TransferQueueManager upload 1m I/O load: 0 bytes/s 0.000 disk load 0.000 net load
08/30/13 16:29:22 (pid:11836) TransferQueueManager download 1m I/O load: 0 bytes/s 0.000 disk load 0.000 net load
08/30/13 16:29:27 (pid:11836) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
08/30/13 16:29:27 (pid:11836) TransferQueueManager upload 1m I/O load: 0 bytes/s 0.000 disk load 0.000 net load
08/30/13 16:29:27 (pid:11836) TransferQueueManager download 1m I/O load: 0 bytes/s 0.000 disk load 0.000 net load
08/30/13 16:29:27 (pid:11836) Sent ad to central manager for bernard.ac.uk
08/30/13 16:29:27 (pid:11836) Sent ad to 1 collectors for bernard.ac.uk
08/30/13 16:29:37 (pid:11836) Using negotiation protocol: NEGOTIATE
08/30/13 16:29:37 (pid:11836) Negotiating for owner: bernard.ac.uk
08/30/13 16:29:37 (pid:11836) AutoCluster:config() significant attributes changed to JobUniverse,LastCheckpointPlatform,NumCkpts,MachineLastMatchTime,RemoteGroup,SubmitterGroup,SubmitterUserPrio
08/30/13 16:29:37 (pid:11836) Checking consistency running and runnable jobs
08/30/13 16:29:37 (pid:11836) Tables are consistent
08/30/13 16:29:37 (pid:11836) Rebuilt prioritized runnable job list in 0.000s.
08/30/13 16:29:37 (pid:11836) Finished negotiating for bernard in local pool: 1 matched, 0 rejected
08/30/13 16:29:37 (pid:11836) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
08/30/13 16:29:37 (pid:11836) TransferQueueManager upload 1m I/O load: 0 bytes/s 0.000 disk load 0.000 net load
08/30/13 16:29:37 (pid:11836) TransferQueueManager download 1m I/O load: 0 bytes/s 0.000 disk load 0.000 net load
08/30/13 16:29:37 (pid:11836) Sent ad to central manager for bernard.ac.uk
08/30/13 16:29:37 (pid:11836) Sent ad to 1 collectors for bernard.ac.uk
08/30/13 16:29:37 (pid:11836) Starting add_shadow_birthdate(1.0)
Stack dump for process 11836 at timestamp 1377876578 (4 frames)
/lib64/libcondor_utils_8_1_0.so(dprintf_dump_stack+0x72)[0x3ee52e0972]
/lib64/libcondor_utils_8_1_0.so[0x3ee537b5f7]
/lib64/libc.so.6[0x3edea35a90]
[0x7fff91c9f3c0]
08/30/13 16:29:48 (pid:12107) ******************************************************
08/30/13 16:29:48 (pid:12107) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
08/30/13 16:29:48 (pid:12107) ** /usr/sbin/condor_schedd
08/30/13 16:29:48 (pid:12107) ** SubsystemInfo: name=SCHEDD type=SCHEDD(5) class=DAEMON(1)
08/30/13 16:29:48 (pid:12107) ** Configuration: subsystem:SCHEDD local:<NONE> class:DAEMON
08/30/13 16:29:48 (pid:12107) ** $CondorVersion: 8.1.0 Jul 15 2013 BuildID: RH-8.1.0-0.2.fc19 PRE-RELEASE-UWCS $
08/30/13 16:29:48 (pid:12107) ** $CondorPlatform: X86_64-Fedora_19 $
08/30/13 16:29:48 (pid:12107) ** PID = 12107
08/30/13 16:29:48 (pid:12107) ** Log last touched 8/30 16:29:38
08/30/13 16:29:48 (pid:12107) ******************************************************
08/30/13 16:29:48 (pid:12107) Using config source: /etc/condor/condor_config
08/30/13 16:29:48 (pid:12107) Using local config sources:
08/30/13 16:29:48 (pid:12107) /etc/condor/config.d/00personal_condor.config
08/30/13 16:29:48 (pid:12107) DaemonCore: command socket at <129.234.21.14:58513>
08/30/13 16:29:48 (pid:12107) DaemonCore: private command socket at <129.234.21.14:58513>
08/30/13 16:29:48 (pid:12107) History file rotation is enabled.
08/30/13 16:29:48 (pid:12107) Maximum history file size is: 20971520 bytes
08/30/13 16:29:48 (pid:12107) Number of rotated history files is: 2
08/30/13 16:29:48 (pid:12107) Failed to execute /usr/sbin/condor_shadow.std, ignoring
08/30/13 16:29:48 (pid:12107) About to rotate ClassAd log /var/lib/condor/spool/job_queue.log
08/30/13 16:29:48 (pid:12107) 1.0: JobLeaseDuration remaining: 1189
08/30/13 16:29:48 (pid:12107) Starting add_shadow_birthdate(1.0)
Stack dump for process 12107 at timestamp 1377876588 (4 frames)
/lib64/libcondor_utils_8_1_0.so(dprintf_dump_stack+0x72)[0x3ee52e0972]
/lib64/libcondor_utils_8_1_0.so[0x3ee537b5f7]
/lib64/libc.so.6[0x3edea35a90]
[0x7fff27074ce0]
So schedd has died again, and it keeps crashing as the daemon is restarted.
The errors above do not give much of a clue.
# ls -l /var/lock/condor/local
total 0
So nothing is actually created in that directory.
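To confirm the restart loop described above without reading the whole log, the startup banners and stack dumps in SchedLog can be counted. A rough sketch; the helper name is hypothetical, and the two patterns are taken from the log excerpts in this report:

```shell
# Sketch only: summarise schedd startups and crashes in a SchedLog.
schedd_restart_summary() {
    # $1: path to a SchedLog, e.g. /var/log/condor/SchedLog
    log=$1
    # -F: match the banner literally, parentheses included.
    starts=$(grep -cF 'condor_schedd (CONDOR_SCHEDD) STARTING UP' "$log")
    dumps=$(grep -c '^Stack dump for process' "$log")
    echo "startups=$starts stack_dumps=$dumps"
}
```

A stack dump count that keeps pace with the startup count suggests that each restarted schedd crashes again shortly after it starts.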
Other logs:
# tail -f /var/log/condor/StartLog
08/30/13 16:40:46 ******************************************************
08/30/13 16:40:46 ** condor_startd (CONDOR_STARTD) STARTING UP
08/30/13 16:40:46 ** /usr/sbin/condor_startd
08/30/13 16:40:46 ** SubsystemInfo: name=STARTD type=STARTD(7) class=DAEMON(1)
08/30/13 16:40:46 ** Configuration: subsystem:STARTD local:<NONE> class:DAEMON
08/30/13 16:40:46 ** $CondorVersion: 8.1.0 Jul 15 2013 BuildID: RH-8.1.0-0.2.fc19 PRE-RELEASE-UWCS $
08/30/13 16:40:46 ** $CondorPlatform: X86_64-Fedora_19 $
08/30/13 16:40:46 ** PID = 13386
08/30/13 16:40:46 ** Log last touched 8/30 16:40:18
08/30/13 16:40:46 ******************************************************
08/30/13 16:40:46 Using config source: /etc/condor/condor_config
08/30/13 16:40:46 Using local config sources:
08/30/13 16:40:46 /etc/condor/config.d/00personal_condor.config
08/30/13 16:40:46 DaemonCore: command socket at <129.234.21.14:49432>
08/30/13 16:40:46 DaemonCore: private command socket at <129.234.21.14:49432>
08/30/13 16:40:46 my_popenv failed
08/30/13 16:40:52 Failed to execute /usr/sbin/condor_starter.std, ignoring
08/30/13 16:40:52 VM-gahp server reported an internal error
08/30/13 16:40:52 VM universe will be tested to check if it is available
08/30/13 16:40:52 History file rotation is enabled.
08/30/13 16:40:52 Maximum history file size is: 20971520 bytes
08/30/13 16:40:52 Number of rotated history files is: 2
slot type 0: Cpus: 1, Memory: auto, Swap: auto, Disk: auto
slot type 0: Cpus: 1, Memory: 3980, Swap: 25.00%, Disk: 25.00%
slot type 0: Cpus: 1, Memory: auto, Swap: auto, Disk: auto
slot type 0: Cpus: 1, Memory: 3980, Swap: 25.00%, Disk: 25.00%
slot type 0: Cpus: 1, Memory: auto, Swap: auto, Disk: auto
slot type 0: Cpus: 1, Memory: 3980, Swap: 25.00%, Disk: 25.00%
slot type 0: Cpus: 1, Memory: auto, Swap: auto, Disk: auto
slot type 0: Cpus: 1, Memory: 3980, Swap: 25.00%, Disk: 25.00%
08/30/13 16:40:52 slot1: New machine resource allocated
08/30/13 16:40:52 slot2: New machine resource allocated
08/30/13 16:40:52 slot3: New machine resource allocated
08/30/13 16:40:52 slot4: New machine resource allocated
08/30/13 16:40:52 my_popenv failed
08/30/13 16:40:52 CronJobList: Adding job 'mips'
08/30/13 16:40:52 CronJobList: Adding job 'kflops'
08/30/13 16:40:52 CronJob: Initializing job 'mips' (/usr/libexec/condor/condor_mips)
08/30/13 16:40:52 CronJob: Initializing job 'kflops' (/usr/libexec/condor/condor_kflops)
08/30/13 16:40:52 slot1: State change: IS_OWNER is false
08/30/13 16:40:52 slot1: Changing state: Owner -> Unclaimed
08/30/13 16:40:52 State change: RunBenchmarks is TRUE
08/30/13 16:40:52 slot1: Changing activity: Idle -> Benchmarking
08/30/13 16:40:52 BenchMgr:StartBenchmarks()
08/30/13 16:40:52 slot2: State change: IS_OWNER is false
08/30/13 16:40:52 slot2: Changing state: Owner -> Unclaimed
08/30/13 16:40:52 State change: RunBenchmarks is TRUE
08/30/13 16:40:52 slot2: Changing activity: Idle -> Benchmarking
08/30/13 16:40:52 slot2: Changing activity: Benchmarking -> Idle
08/30/13 16:40:52 slot3: State change: IS_OWNER is false
08/30/13 16:40:52 slot3: Changing state: Owner -> Unclaimed
08/30/13 16:40:52 State change: RunBenchmarks is TRUE
08/30/13 16:40:52 slot3: Changing activity: Idle -> Benchmarking
08/30/13 16:40:52 slot3: Changing activity: Benchmarking -> Idle
08/30/13 16:40:52 slot4: State change: IS_OWNER is false
08/30/13 16:40:52 slot4: Changing state: Owner -> Unclaimed
08/30/13 16:40:52 State change: RunBenchmarks is TRUE
08/30/13 16:40:52 slot4: Changing activity: Idle -> Benchmarking
08/30/13 16:40:52 slot4: Changing activity: Benchmarking -> Idle
08/30/13 16:40:52 Starter pid 13389 exited with status 2
08/30/13 16:40:52 Warning: Starter pid 13389 is not associated with an claim. A slot may fail to transition to Idle.
08/30/13 16:40:52 Starter pid 13570 exited with status 2
08/30/13 16:40:52 Warning: Starter pid 13570 is not associated with an claim. A slot may fail to transition to Idle.
08/30/13 16:40:52 Starter pid 13571 exited with status 2
08/30/13 16:40:52 Warning: Starter pid 13571 is not associated with an claim. A slot may fail to transition to Idle.
08/30/13 16:41:17 State change: benchmarks completed
08/30/13 16:41:17 slot1: Changing activity: Benchmarking -> Idle
# tail -f /var/log/condor/CollectorLog
08/30/13 16:43:52 SubmittorAd : Inserting ** "< bernard.ac.ukhopf.dur.ac.uk , 129.234.21.14 >"
08/30/13 16:43:52 stats: Inserting new hashent for 'Submittor':'bernard.ac.uk':'129.234.21.14'
08/30/13 16:43:52 Number of Active Workers 1
08/30/13 16:43:52 Number of Active Workers 0
08/30/13 16:43:52 (Sending 1 ads in response to query)
08/30/13 16:43:52 Query info: matched=1; skipped=0; query_time=0.000567; send_time=0.001892; type=Negotiator; requirements={true}; peer=<129.234.21.14:44446>; projection={}
08/30/13 16:43:52 Number of Active Workers 1
08/30/13 16:43:52 Number of Active Workers 0
08/30/13 16:43:52 (Sending 6 ads in response to query)
08/30/13 16:43:52 Query info: matched=6; skipped=3; query_time=0.000624; send_time=0.002858; type=Any; requirements={( ( ( MyType == "Scheduler" ) || ( MyType == "Submitter" ) ) || ( ( MyType == "Machine" ) ) )}; peer=<129.234.21.14:56642>; projection={}
08/30/13 16:43:52 Got QUERY_STARTD_PVT_ADS
08/30/13 16:43:52 Number of Active Workers 2
08/30/13 16:43:52 Number of Active Workers 1
08/30/13 16:43:52 (Sending 4 ads in response to query)
08/30/13 16:43:52 Query info: matched=4; skipped=0; query_time=0.000617; send_time=0.000270; type=MachinePrivate; requirements={true}; peer=<129.234.21.14:34435>; projection={}
So nothing works yet, and the logs do not give us much of a clue as to what goes wrong.
Where do we take it from here?
I have the same problem (not SElinux related, not the lacking files in tmpfiles.d)
The only workaround I found was replacing "condor_sched" with the version from 7.9.1; you also need to add libclassad.so.4 and libcondor_utils_7_9_1.so in /lib64.
So you end up with a hybrid 8.1.0 and 7.9.1; ugly, but it works perfectly.
As 8.1.1 is out, I hope we get beyond that so we can do some real testing.

condor-8.1.1-0.2.fc19 has been submitted as an update for Fedora 19.
https://admin.fedoraproject.org/updates/condor-8.1.1-0.2.fc19

I upgraded to 8.1.1-0.2.fc19, and condor_schedd is still crashing.
09/25/13 17:10:35 (pid:9830) Setting maximum file descriptors to 4096.
09/25/13 17:10:35 (pid:9830) ******************************************************
09/25/13 17:10:35 (pid:9830) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
09/25/13 17:10:35 (pid:9830) ** /usr/sbin/condor_schedd
09/25/13 17:10:35 (pid:9830) ** SubsystemInfo: name=SCHEDD type=SCHEDD(5) class=DAEMON(1)
09/25/13 17:10:35 (pid:9830) ** Configuration: subsystem:SCHEDD local:<NONE> class:DAEMON
09/25/13 17:10:35 (pid:9830) ** $CondorVersion: 8.1.1 Sep 25 2013 BuildID: RH-8.1.1-0.2.fc19 $
09/25/13 17:10:35 (pid:9830) ** $CondorPlatform: X86_64-Fedora_19 $
09/25/13 17:10:35 (pid:9830) ** PID = 9830
09/25/13 17:10:35 (pid:9830) ** Log last touched 9/25 17:10:24
09/25/13 17:10:35 (pid:9830) ******************************************************
09/25/13 17:10:35 (pid:9830) Using config source: /etc/condor/condor_config
09/25/13 17:10:35 (pid:9830) Using local config sources:
09/25/13 17:10:35 (pid:9830) /etc/condor/config.d/00personal_condor.config
09/25/13 17:10:35 (pid:9830) CLASSAD_CACHING is ENABLED
09/25/13 17:10:35 (pid:9830) DaemonCore: command socket at <34.53.49.27:49027>
09/25/13 17:10:35 (pid:9830) DaemonCore: private command socket at <34.53.49.27:49027>
09/25/13 17:10:35 (pid:9830) History file rotation is enabled.
09/25/13 17:10:35 (pid:9830) Maximum history file size is: 20971520 bytes
09/25/13 17:10:35 (pid:9830) Number of rotated history files is: 2
09/25/13 17:10:35 (pid:9830) Failed to execute /usr/sbin/condor_shadow.std, ignoring
09/25/13 17:10:35 (pid:9830) About to rotate ClassAd log /var/lib/condor/spool/job_queue.log
09/25/13 17:10:35 (pid:9830) 1.0: JobLeaseDuration remaining: 1178
09/25/13 17:10:35 (pid:9830) directory_util::rec_touch_file: Directory /var/lock/condor/local//76 cannot be created (Permission denied)
09/25/13 17:10:35 (pid:9830) Starting add_shadow_birthdate(1.0)
Stack dump for process 9830 at timestamp 1380147035 (4 frames)
/lib64/libcondor_utils_8_1_1.so(dprintf_dump_stack+0x72)[0x7fcc8ccf9362]
/lib64/libcondor_utils_8_1_1.so(+0x11ae47)[0x7fcc8ccb6e47]
/lib64/libpthread.so.0[0x353800efa0]
[0x7fff4fd11d80]

could you please `cat /etc/tmpfiles.d/condor.conf` and post your results.
It may be possible it could need a restart.
For another data point, I had tested this build on f18 in permissive mode and it worked fine. Might want to try setting to permissive mode first.

I have the same as in Comment #6:
> tail SchedLog
09/26/13 09:40:05 (pid:16928) Setting maximum file descriptors to 4096.
09/26/13 09:40:05 (pid:16928) ******************************************************
09/26/13 09:40:05 (pid:16928) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
09/26/13 09:40:05 (pid:16928) ** /usr/sbin/condor_schedd
09/26/13 09:40:05 (pid:16928) ** SubsystemInfo: name=SCHEDD type=SCHEDD(5) class=DAEMON(1)
09/26/13 09:40:05 (pid:16928) ** Configuration: subsystem:SCHEDD local:<NONE> class:DAEMON
09/26/13 09:40:05 (pid:16928) ** $CondorVersion: 8.1.1 Sep 25 2013 BuildID: RH-8.1.1-0.2.fc19 $
09/26/13 09:40:05 (pid:16928) ** $CondorPlatform: X86_64-Fedora_19 $
09/26/13 09:40:05 (pid:16928) ** PID = 16928
09/26/13 09:40:05 (pid:16928) ** Log last touched 9/26 09:38:52
09/26/13 09:40:05 (pid:16928) ******************************************************
09/26/13 09:40:05 (pid:16928) Using config source: /etc/condor/condor_config
09/26/13 09:40:05 (pid:16928) Using local config sources:
09/26/13 09:40:05 (pid:16928) /etc/condor/config.d/00personal_condor.config
09/26/13 09:40:05 (pid:16928) /etc/condor/config.d/01standard_condor.config
09/26/13 09:40:05 (pid:16928) /etc/condor/config.d/80optout_users.config
09/26/13 09:40:05 (pid:16928) CLASSAD_CACHING is ENABLED
09/26/13 09:40:05 (pid:16928) SharedPortEndpoint: waiting for connections to named socket 16697_ec03_11
09/26/13 09:40:05 (pid:16928) DaemonCore: command socket at <10.33.133.176:9618?sock=16697_ec03_11>
09/26/13 09:40:05 (pid:16928) DaemonCore: private command socket at <10.33.133.176:9618?sock=16697_ec03_11>
09/26/13 09:40:05 (pid:16928) History file rotation is enabled.
09/26/13 09:40:05 (pid:16928) Maximum history file size is: 20971520 bytes
09/26/13 09:40:05 (pid:16928) Number of rotated history files is: 2
09/26/13 09:40:05 (pid:16928) Failed to execute /usr/sbin/condor_shadow.std, ignoring
09/26/13 09:40:05 (pid:16928) About to rotate ClassAd log /var/lib/condor/spool/job_queue.log
09/26/13 09:40:05 (pid:16928) 1.0: JobLeaseDuration remaining: 920
09/26/13 09:40:05 (pid:16928) Starting add_shadow_birthdate(1.0)
Stack dump for process 16928 at timestamp 1380181205 (4 frames)
/lib64/libcondor_utils_8_1_1.so(dprintf_dump_stack+0x72)[0x7f51af08d362]
/lib64/libcondor_utils_8_1_1.so(+0x11ae47)[0x7f51af04ae47]
/lib64/libpthread.so.0(+0xefa0)[0x7f51aaaf9fa0]
[0x7fff684fa400]
* Version: condor-8.1.1-0.2.fc19.x86_64
* Condor was stopped/restarted;
* No SELinux around;
* tmpfiles stuff:
d /var/run/condor 0775 condor condor -
d /var/lock/condor 0775 condor condor -
d /var/lock/condor/local 0775 condor condor -
So: no solution yet...

# cat /etc/tmpfiles.d/condor.conf
d /var/run/condor 0775 condor condor -
d /var/lock/condor 0775 condor condor -
d /var/lock/condor/local 0775 condor condor -
# getenforce
Disabled
# condor_version
$CondorVersion: 8.1.1 Sep 25 2013 BuildID: RH-8.1.1-0.2.fc19 $
$CondorPlatform: X86_64-Fedora_19 $

I provisioned a f19 box and repro'd, oddly enough f18 works fine.

Package condor-8.1.1-0.2.fc19:
* should fix your issue,
* was pushed to the Fedora 19 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing condor-8.1.1-0.2.fc19'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2013-17697/condor-8.1.1-0.2.fc19
then log in and leave karma (feedback).
The following is a workaround:
USE_CLONE_TO_CREATE_PROCESSES = False

(In reply to Erik Erlandson from comment #13)
> The following is a workaround:
>
> USE_CLONE_TO_CREATE_PROCESSES = False
I can verify that the workaround does, in fact, work.

Spin with the workaround is in flight, and upstream ticket created outlining issue.

condor-8.1.1-0.3.fc19 has been submitted as an update for Fedora 19.
https://admin.fedoraproject.org/updates/condor-8.1.1-0.3.fc19

condor-8.1.1-0.3.fc20 has been submitted as an update for Fedora 20.
https://admin.fedoraproject.org/updates/condor-8.1.1-0.3.fc20

Package condor-8.1.1-0.3.fc19:
* should fix your issue,
* was pushed to the Fedora 19 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing condor-8.1.1-0.3.fc19'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2013-19960/condor-8.1.1-0.3.fc19
then log in and leave karma (feedback).

condor-8.1.1-0.3.fc19 has been pushed to the Fedora 19 stable repository. If problems still persist, please make note of it in this bug report.

condor-8.1.1-0.3.fc20 has been pushed to the Fedora 20 stable repository. If problems still persist, please make note of it in this bug report.
|
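For anyone still on a package older than 8.1.1-0.3, the USE_CLONE_TO_CREATE_PROCESSES workaround quoted in this thread could be applied as a configuration drop-in. A minimal sketch, assuming the /etc/condor/config.d directory seen in the logs is read by the daemons; the file name `99-no-clone.config` and the helper name are hypothetical:

```shell
# Sketch only: write the workaround setting into a config.d drop-in file.
write_clone_workaround() {
    # $1: config directory (defaults to the config.d path seen in the logs).
    config_dir=${1:-/etc/condor/config.d}
    printf '%s\n' 'USE_CLONE_TO_CREATE_PROCESSES = False' \
        > "$config_dir/99-no-clone.config"
}
```

As root, `write_clone_workaround && systemctl restart condor` would apply it. The 8.1.1-0.3 builds mentioned in this report appear to carry the workaround already, so this should only matter on earlier builds.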