Hide Forgot
Per mattf... Running 7.6.3-0.2 I can submit a job and condor_restart -schedd, the result is the JS dropping core. Also the submissions appear to have no jobs, something to do with the UNKNOWN status? $ _CONDOR_JOB_SERVER_DEBUG=D_FULLDEBUG condor_job_server -t -f 08/19/11 15:34:32 Reading from /proc/cpuinfo 08/19/11 15:34:32 Found: Physical-IDs:True; Core-IDs:True 08/19/11 15:34:32 Analyzing 4 processors using IDs... 08/19/11 15:34:32 Looking at processor #0 (PID:0, CID:0): 08/19/11 15:34:32 Comparing P#0 and P#1 : pid:0==0 and cid:0==0 (match=2) 08/19/11 15:34:32 Comparing P#0 and P#2 : pid:0!=0 or cid:0!=2 (match=No) 08/19/11 15:34:32 Comparing P#0 and P#3 : pid:0!=0 or cid:0!=2 (match=No) 08/19/11 15:34:32 ncpus = 1 08/19/11 15:34:32 P0: match->2 08/19/11 15:34:32 P1: match->2 08/19/11 15:34:32 Looking at processor #1 (PID:0, CID:0): 08/19/11 15:34:32 Looking at processor #2 (PID:0, CID:2): 08/19/11 15:34:32 Comparing P#2 and P#3 : pid:0==0 and cid:2==2 (match=2) 08/19/11 15:34:32 ncpus = 2 08/19/11 15:34:32 P2: match->2 08/19/11 15:34:32 P3: match->2 08/19/11 15:34:32 Looking at processor #3 (PID:0, CID:2): 08/19/11 15:34:32 Using IDs: 4 processors, 2 CPUs, 2 HTs 08/19/11 15:34:32 Setting maximum accepts per cycle 4. 08/19/11 15:34:32 ****************************************************** 08/19/11 15:34:32 ** condor_job_server (CONDOR_JOB_SERVER) STARTING UP 08/19/11 15:34:32 ** /home/matt/Documents/Repositories/Condor/release_dir/sbin/condor_job_server 08/19/11 15:34:32 ** SubsystemInfo: name=JOB_SERVER type=DAEMON(12) class=DAEMON(1) 08/19/11 15:34:32 ** Configuration: subsystem:JOB_SERVER local:<NONE> class:DAEMON 08/19/11 15:34:32 ** $CondorVersion: 7.6.3 Aug 19 2011 $ 08/19/11 15:34:32 ** $CondorPlatform: X86_64-Fedora_13 $ 08/19/11 15:34:32 ** PID = 20036 08/19/11 15:34:32 ** Log last touched time unavailable (Success) 08/19/11 15:34:32 ****************************************************** 08/19/11 15:34:32 Using config source: /home/matt/Documents/CondorInstallation/condor_config 08/19/11 15:34:32 Using local config sources: 08/19/11 15:34:32 /home/matt/Documents/CondorInstallation/config.d/00personal_condor.config 08/19/11 15:34:32 /home/matt/Documents/CondorInstallation/config.d/50mef-base.config 08/19/11 15:34:32 /home/matt/Documents/CondorInstallation/config.d/50mef.config 08/19/11 15:34:32 Not using shared port because USE_SHARED_PORT=false 08/19/11 15:34:32 DaemonCore: command socket at <192.168.0.187:33250> 08/19/11 15:34:32 DaemonCore: private command socket at <192.168.0.187:33250> 08/19/11 15:34:32 Setting maximum accepts per cycle 4. 08/19/11 15:34:32 Will use UDP to update collector eeyore.local <127.0.0.1:9618> 08/19/11 15:34:32 Not using shared port because USE_SHARED_PORT=false 08/19/11 15:34:32 main_init() called 08/19/11 15:34:32 Warning: Could not find DaemonStartTime 08/19/11 15:34:32 Warning: Could not find MonitorSelfAge 08/19/11 15:34:32 Warning: Could not find MonitorSelfCPUUsage 08/19/11 15:34:32 Warning: Could not find MonitorSelfImageSize 08/19/11 15:34:32 Warning: Could not find MonitorSelfRegisteredSocketCount 08/19/11 15:34:32 Warning: Could not find MonitorSelfResidentSetSize 08/19/11 15:34:32 Warning: Could not find MonitorSelfTime 08/19/11 15:34:32 Warning: Could not find PublicNetworkIpAddr 08/19/11 15:34:32 HISTORY path is /home/matt/Documents/CondorInstallation/busted-job_server-spool/ 08/19/11 15:34:32 TimerHandler_JobLogPolling() called 08/19/11 15:34:32 === Current Probing Information === 08/19/11 15:34:32 fsize: 162 mtime: 1313793220 08/19/11 15:34:32 first log entry: 39 CreationTimestamp 1311374495 08/19/11 15:34:32 JobServerJobLogConsumer::Reset() - deleting jobs and submissions 08/19/11 15:34:32 HistoryFile::init:1:Failed to stat /home/matt/Documents/CondorInstallation/busted-job_server-spool//history: 2 (No such file or directory) 08/19/11 15:34:32 JobServerJobLogConsumer::NewClassAd processing _key='0.0' 08/19/11 15:34:32 ProcessHistoryTimer() called 08/19/11 15:34:32 HistoryFile::init:1:Failed to stat /home/matt/Documents/CondorInstallation/busted-job_server-spool//history: 2 (No such file or directory) 08/19/11 15:34:42 TimerHandler_JobLogPolling() called 08/19/11 15:34:42 === Current Probing Information === 08/19/11 15:34:42 fsize: 162 mtime: 1313793220 08/19/11 15:34:42 first log entry: 39 CreationTimestamp 1311374495 08/19/11 15:34:52 TimerHandler_JobLogPolling() called 08/19/11 15:34:52 === Current Probing Information === 08/19/11 15:34:52 fsize: 3207 mtime: 1313793285 08/19/11 15:34:52 first log entry: 39 CreationTimestamp 1311374495 08/19/11 15:34:52 JobServerJobLogConsumer::NewClassAd processing _key='047821.-1' 08/19/11 15:34:52 Job::Job of '047821.-1' 08/19/11 15:34:52 LiveJobImpl created for '047821.-1' 08/19/11 15:34:52 warning: failed to lookup attribute JobStatus in job '047821.-1' 08/19/11 15:34:52 JobServerJobLogConsumer::NewClassAd processing _key='47821.0' 08/19/11 15:34:52 Job::Job of '47821.0' 08/19/11 15:34:52 LiveJobImpl created for '47821.0' 08/19/11 15:34:52 warning: failed to lookup attribute JobStatus in job '47821.0' 08/19/11 15:34:52 Created new SubmissionObject 'A' for '(null)' 08/19/11 15:34:52 SubmissionObject::Increment 'IDLE' on '47821.0' 08/19/11 15:34:52 warning: failed to lookup attribute JobStatus in job '47821.0' 08/19/11 15:34:52 SubmissionObject::Decrement 'IDLE' on '47821.0' 08/19/11 15:34:52 SubmissionObject::Increment 'UNKNOWN' on '47821.0' 08/19/11 15:34:52 error: Unknown JobStatus of 0 on 47821.0 08/19/11 15:35:02 TimerHandler_JobLogPolling() called 08/19/11 15:35:02 === Current Probing Information === 08/19/11 15:35:02 fsize: 3257 mtime: 1313793297 08/19/11 15:35:02 first log entry: 40 CreationTimestamp 1311374495 08/19/11 15:35:02 JobServerJobLogConsumer::Reset() - deleting jobs and submissions 08/19/11 15:35:02 Job::~Job of '47821.0' 08/19/11 15:35:02 SubmissionObject::Decrement 'UNKNOWN' on '47821.0' 08/19/11 15:35:02 error: Unknown JobStatus of 0 on 47821.0 08/19/11 15:35:02 LiveJobImpl destroyed: key '47821.0' Stack dump for process 20036 at timestamp 1313793302 (13 frames) condor_job_server(dprintf_dump_stack+0x63)[0x4ff653] condor_job_server[0x509d02] /lib64/libpthread.so.0[0x376300f4a0] /usr/lib64/libstdc++.so.6(_ZSt18_Rb_tree_incrementPSt18_Rb_tree_node_base+0x13)[0x376d069e43] condor_job_server(_ZN23JobServerJobLogConsumer5ResetEv+0x7c)[0x465c4c] condor_job_server(_ZN16ClassAdLogReader8BulkLoadEv+0x1a)[0x4e4d2a] condor_job_server(_ZN16ClassAdLogReader4PollEv+0x108)[0x4e4e48] condor_job_server(_ZN12JobLogMirror26TimerHandler_JobLogPollingEv+0x1e)[0x4ece3e] condor_job_server(_ZN12TimerManager7TimeoutEv+0x129)[0x490309] condor_job_server(_ZN10DaemonCore6DriverEv+0x277)[0x481577] condor_job_server(main+0x10f3)[0x48e7c3] /lib64/libc.so.6(__libc_start_main+0xfd)[0x3762c1ec5d] condor_job_server[0x459af9] zsh: segmentation fault (core dumped) _CONDOR_JOB_SERVER_DEBUG=D_FULLDEBUG condor_job_server -t -f
Fixed upstream in contrib. A utility global map for submissions was not being properly cleaned up in Reset.
I tried to reproduce it on condor-qmf-7.6.3-0.3, but without success. I don't know how to achieve UNKNOWN state. Please could you specify more the reproduction scenario? My scenario: First terminal: # _CONDOR_JOB_SERVER_DEBUG=D_FULLDEBUG condor_job_server -t -f ... 10/13/11 15:42:00 first log entry: 6 CreationTimestamp 1318510973 10/13/11 15:42:00 JobServerJobLogConsumer::Reset() - deleting jobs and submissions 10/13/11 15:42:00 JobServerJobLogConsumer::NewClassAd processing _key='5.0' 10/13/11 15:42:00 Job::Job of '05.-1' 10/13/11 15:42:00 LiveJobImpl created for '05.-1' 10/13/11 15:42:00 warning: failed to lookup attribute JobStatus in job '05.-1' 10/13/11 15:42:00 Job::Job of '5.0' 10/13/11 15:42:00 LiveJobImpl created for '5.0' 10/13/11 15:42:00 warning: failed to lookup attribute JobStatus in job '5.0' 10/13/11 15:42:00 Created new SubmissionObject 'host#5' for '(null)' 10/13/11 15:42:00 SubmissionObject::Increment 'IDLE' on '5.0' 10/13/11 15:42:00 warning: failed to lookup attribute JobStatus in job '5.0' 10/13/11 15:42:00 SubmissionObject::Decrement 'IDLE' on '5.0' 10/13/11 15:42:00 SubmissionObject::Increment 'IDLE' on '5.0' 10/13/11 15:42:00 JobServerJobLogConsumer::NewClassAd processing _key='0.0' 10/13/11 15:42:00 JobServerJobLogConsumer::NewClassAd processing _key='05.-1' 10/13/11 15:42:10 TimerHandler_JobLogPolling() called 10/13/11 15:42:10 === Current Probing Information === 10/13/11 15:42:10 fsize: 3053 mtime: 1318513318 10/13/11 15:42:10 first log entry: 6 CreationTimestamp 1318510973 10/13/11 15:42:20 TimerHandler_JobLogPolling() called 10/13/11 15:42:20 === Current Probing Information === 10/13/11 15:42:20 fsize: 3053 mtime: 1318513318 10/13/11 15:42:20 first log entry: 6 CreationTimestamp 1318510973 10/13/11 15:42:30 TimerHandler_JobLogPolling() called 10/13/11 15:42:30 === Current Probing Information === 10/13/11 15:42:30 fsize: 3893 mtime: 1318513347 10/13/11 15:42:30 first log entry: 6 CreationTimestamp 1318510973 10/13/11 15:42:30 SubmissionObject::Decrement 'IDLE' on '5.0' 10/13/11 15:42:30 SubmissionObject::Increment 'RUNNING' on '5.0' 10/13/11 15:42:40 TimerHandler_JobLogPolling() called 10/13/11 15:42:40 === Current Probing Information === 10/13/11 15:42:40 fsize: 3893 mtime: 1318513347 10/13/11 15:42:40 first log entry: 6 CreationTimestamp 1318510973 ... Second terminal: # echo -e "cmd=/bin/sleep\nargs=1d\nqueue" | runuser condor -s /bin/bash -c condor_submit; condor_restart -schedd Submitting job(s). 1 job(s) submitted to cluster 5. Sent "Restart" command to local schedd
Suggestions: 1) restart schedd in a loop 2) queue multiple jobs per submit
Reproduced on: $CondorVersion: 7.6.3 Jul 27 2011 BuildID: RH-7.6.3-0.3.el5 $ $CondorPlatform: X86_64-RedHat_5.6 $ # for I in `seq 40`; do echo -e "cmd=/bin/sleep\nargs=20\nqueue 100" | runuser condor -s /bin/bash -c condor_submit; done Submitting job(s).... 100 job(s) submitted to cluster 1. Submitting job(s).... 100 job(s) submitted to cluster 2. Submitting job(s).... # while true; condor_restart -schedd; sleep 20; done 10/14/11 16:12:22 JobServerJobLogConsumer::NewClassAd processing _key='36.89' 10/14/11 16:12:22 Job::Job of '36.89' 10/14/11 16:12:22 LiveJobImpl created for '36.89' 10/14/11 16:12:22 warning: failed to lookup attribute JobStatus in job '36.89' 10/14/11 16:12:22 SubmissionObject::Increment 'IDLE' on '36.89' 10/14/11 16:12:22 warning: failed to lookup attribute JobStatus in job '36.89' 10/14/11 16:12:22 SubmissionObject::Decrement 'IDLE' on '36.89' 10/14/11 16:12:22 SubmissionObject::Increment 'IDLE' on '36.89' 10/14/11 16:12:22 JobServerJobLogConsumer::NewClassAd processing _key='010.-1' 10/14/11 16:12:22 Job::Job of '010.-1' 10/14/11 16:12:22 LiveJobImpl created for '010.-1' 10/14/11 16:12:22 warning: failed to lookup attribute JobStatus in job '010.-1' Stack dump for process 20821 at timestamp 1318601542 (18 frames) condor_job_server(dprintf_dump_stack+0x56)[0x520576] condor_job_server[0x5191a2] /lib64/libpthread.so.0[0x354c60eb10] /usr/lib64/libqpidcommon.so.5(_ZN4qpid10management5Mutex4lockEv+0x1b)[0x3933bf326b] condor_job_server(_ZN16SubmissionObject8SetOwnerEPKc+0x63)[0x460fd3] condor_job_server(_ZN3Job16UpdateSubmissionEiPKc+0x3b)[0x465c6b] condor_job_server(_ZN11LiveJobImpl3SetEPKcS1_+0x187)[0x467d37] condor_job_server(_ZN23JobServerJobLogConsumer12SetAttributeEPKcS1_S1_+0x87)[0x45fc27] condor_job_server(_ZN16ClassAdLogReader15ProcessLogEntryEP15ClassAdLogEntryP16ClassAdLogParser+0x8d)[0x52611d] condor_job_server(_ZN16ClassAdLogReader15IncrementalLoadEv+0x36)[0x526176] condor_job_server(_ZN16ClassAdLogReader8BulkLoadEv+0x22)[0x526242] condor_job_server(_ZN16ClassAdLogReader4PollEv+0xcb)[0x52631b] condor_job_server(_ZN12JobLogMirror26TimerHandler_JobLogPollingEv+0x21)[0x524531] condor_job_server(_ZN12TimerManager7TimeoutEv+0x155)[0x49a245] condor_job_server(_ZN10DaemonCore6DriverEv+0x248)[0x483d58] condor_job_server(main+0xe60)[0x498a30] /lib64/libc.so.6(__libc_start_main+0xf4)[0x354ba1d994] condor_job_server[0x45e209] Segmentation fault
Tested on: $CondorVersion: 7.6.4 Oct 07 2011 BuildID: RH-7.6.4-0.7.el5 $ $CondorPlatform: I686-RedHat_5.7 $ $CondorVersion: 7.6.4 Oct 07 2011 BuildID: RH-7.6.4-0.7.el5 $ $CondorPlatform: X86_64-RedHat_5.7 $ $CondorVersion: 7.6.4 Oct 07 2011 BuildID: RH-7.6.4-0.7.el6 $ $CondorPlatform: I686-RedHat_6.1 $ $CondorVersion: 7.6.4 Oct 07 2011 BuildID: RH-7.6.4-0.7.el6 $ $CondorPlatform: X86_64-RedHat_6.1 $ No crash of job server. >>> VERIFIED
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: Cause: Numerous submissions in advance of a schedd restart. Consequence: QMF job server or Aviary query server can crash. Fix: An internal submission list was not being properly cleared in the reset code. Result: QMF job server or Aviary query server do not crash.