Bug 732452

Summary: Job server crashes on submit plus schedd restart scenario
Product: Red Hat Enterprise MRG
Component: condor-qmf
Reporter: Pete MacKinnon <pmackinn>
Assignee: Pete MacKinnon <pmackinn>
QA Contact: MRG Quality Engineering <mrgqe-bugs>
Status: CLOSED ERRATA
Severity: high
Priority: high
Version: Development
CC: jneedle, ltoscano, ltrilety, matt, mkudlej, tstclair
Target Milestone: 2.1
Hardware: Unspecified
OS: Linux
Fixed In Version: condor-7.6.4-0.7
Doc Type: Bug Fix
Doc Text:
Cause: Numerous submissions in advance of a schedd restart. Consequence: The QMF job server or Aviary query server can crash. Fix: An internal submission list was not being properly cleared in the reset code. Result: The QMF job server and Aviary query server no longer crash.
Last Closed: 2012-01-27 19:17:34 UTC

Description Pete MacKinnon 2011-08-22 14:08:40 UTC
Per mattf...

Running 7.6.3-0.2, I can submit a job and condor_restart -schedd; the
result is the job server dropping core. Also, the submissions appear to
have no jobs, possibly something to do with the UNKNOWN status?

$ _CONDOR_JOB_SERVER_DEBUG=D_FULLDEBUG condor_job_server -t -f
08/19/11 15:34:32 Reading from /proc/cpuinfo
08/19/11 15:34:32 Found: Physical-IDs:True; Core-IDs:True
08/19/11 15:34:32 Analyzing 4 processors using IDs...
08/19/11 15:34:32 Looking at processor #0 (PID:0, CID:0):
08/19/11 15:34:32 Comparing P#0   and P#1  : pid:0==0 and cid:0==0 (match=2)
08/19/11 15:34:32 Comparing P#0   and P#2  : pid:0!=0 or  cid:0!=2 (match=No)
08/19/11 15:34:32 Comparing P#0   and P#3  : pid:0!=0 or  cid:0!=2 (match=No)
08/19/11 15:34:32 ncpus = 1
08/19/11 15:34:32 P0: match->2
08/19/11 15:34:32 P1: match->2
08/19/11 15:34:32 Looking at processor #1 (PID:0, CID:0):
08/19/11 15:34:32 Looking at processor #2 (PID:0, CID:2):
08/19/11 15:34:32 Comparing P#2   and P#3  : pid:0==0 and cid:2==2 (match=2)
08/19/11 15:34:32 ncpus = 2
08/19/11 15:34:32 P2: match->2
08/19/11 15:34:32 P3: match->2
08/19/11 15:34:32 Looking at processor #3 (PID:0, CID:2):
08/19/11 15:34:32 Using IDs: 4 processors, 2 CPUs, 2 HTs
08/19/11 15:34:32 Setting maximum accepts per cycle 4.
08/19/11 15:34:32 ******************************************************
08/19/11 15:34:32 ** condor_job_server (CONDOR_JOB_SERVER) STARTING UP
08/19/11 15:34:32 ** /home/matt/Documents/Repositories/Condor/release_dir/sbin/condor_job_server
08/19/11 15:34:32 ** SubsystemInfo: name=JOB_SERVER type=DAEMON(12) class=DAEMON(1)
08/19/11 15:34:32 ** Configuration: subsystem:JOB_SERVER local:<NONE> class:DAEMON
08/19/11 15:34:32 ** $CondorVersion: 7.6.3 Aug 19 2011 $
08/19/11 15:34:32 ** $CondorPlatform: X86_64-Fedora_13 $
08/19/11 15:34:32 ** PID = 20036
08/19/11 15:34:32 ** Log last touched time unavailable (Success)
08/19/11 15:34:32 ******************************************************
08/19/11 15:34:32 Using config source: /home/matt/Documents/CondorInstallation/condor_config
08/19/11 15:34:32 Using local config sources:
08/19/11 15:34:32 /home/matt/Documents/CondorInstallation/config.d/00personal_condor.config
08/19/11 15:34:32 /home/matt/Documents/CondorInstallation/config.d/50mef-base.config
08/19/11 15:34:32 /home/matt/Documents/CondorInstallation/config.d/50mef.config
08/19/11 15:34:32 Not using shared port because USE_SHARED_PORT=false
08/19/11 15:34:32 DaemonCore: command socket at <192.168.0.187:33250>
08/19/11 15:34:32 DaemonCore: private command socket at <192.168.0.187:33250>
08/19/11 15:34:32 Setting maximum accepts per cycle 4.
08/19/11 15:34:32 Will use UDP to update collector eeyore.local <127.0.0.1:9618>
08/19/11 15:34:32 Not using shared port because USE_SHARED_PORT=false
08/19/11 15:34:32 main_init() called
08/19/11 15:34:32 Warning: Could not find DaemonStartTime
08/19/11 15:34:32 Warning: Could not find MonitorSelfAge
08/19/11 15:34:32 Warning: Could not find MonitorSelfCPUUsage
08/19/11 15:34:32 Warning: Could not find MonitorSelfImageSize
08/19/11 15:34:32 Warning: Could not find MonitorSelfRegisteredSocketCount
08/19/11 15:34:32 Warning: Could not find MonitorSelfResidentSetSize
08/19/11 15:34:32 Warning: Could not find MonitorSelfTime
08/19/11 15:34:32 Warning: Could not find PublicNetworkIpAddr
08/19/11 15:34:32 HISTORY path is /home/matt/Documents/CondorInstallation/busted-job_server-spool/
08/19/11 15:34:32 TimerHandler_JobLogPolling() called
08/19/11 15:34:32 === Current Probing Information ===
08/19/11 15:34:32 fsize: 162            mtime: 1313793220
08/19/11 15:34:32 first log entry: 39 CreationTimestamp 1311374495
08/19/11 15:34:32 JobServerJobLogConsumer::Reset() - deleting jobs and submissions
08/19/11 15:34:32 HistoryFile::init:1:Failed to stat /home/matt/Documents/CondorInstallation/busted-job_server-spool//history: 2 (No such file or directory)

08/19/11 15:34:32 JobServerJobLogConsumer::NewClassAd processing _key='0.0'
08/19/11 15:34:32 ProcessHistoryTimer() called
08/19/11 15:34:32 HistoryFile::init:1:Failed to stat /home/matt/Documents/CondorInstallation/busted-job_server-spool//history: 2 (No such file or directory)

08/19/11 15:34:42 TimerHandler_JobLogPolling() called
08/19/11 15:34:42 === Current Probing Information ===
08/19/11 15:34:42 fsize: 162            mtime: 1313793220
08/19/11 15:34:42 first log entry: 39 CreationTimestamp 1311374495
08/19/11 15:34:52 TimerHandler_JobLogPolling() called
08/19/11 15:34:52 === Current Probing Information ===
08/19/11 15:34:52 fsize: 3207           mtime: 1313793285
08/19/11 15:34:52 first log entry: 39 CreationTimestamp 1311374495
08/19/11 15:34:52 JobServerJobLogConsumer::NewClassAd processing _key='047821.-1'
08/19/11 15:34:52 Job::Job of '047821.-1'
08/19/11 15:34:52 LiveJobImpl created for '047821.-1'
08/19/11 15:34:52 warning: failed to lookup attribute JobStatus in job '047821.-1'
08/19/11 15:34:52 JobServerJobLogConsumer::NewClassAd processing _key='47821.0'
08/19/11 15:34:52 Job::Job of '47821.0'
08/19/11 15:34:52 LiveJobImpl created for '47821.0'
08/19/11 15:34:52 warning: failed to lookup attribute JobStatus in job '47821.0'
08/19/11 15:34:52 Created new SubmissionObject 'A' for '(null)'
08/19/11 15:34:52 SubmissionObject::Increment 'IDLE' on '47821.0'
08/19/11 15:34:52 warning: failed to lookup attribute JobStatus in job '47821.0'
08/19/11 15:34:52 SubmissionObject::Decrement 'IDLE' on '47821.0'
08/19/11 15:34:52 SubmissionObject::Increment 'UNKNOWN' on '47821.0'
08/19/11 15:34:52 error: Unknown JobStatus of 0 on 47821.0
08/19/11 15:35:02 TimerHandler_JobLogPolling() called
08/19/11 15:35:02 === Current Probing Information ===
08/19/11 15:35:02 fsize: 3257           mtime: 1313793297
08/19/11 15:35:02 first log entry: 40 CreationTimestamp 1311374495
08/19/11 15:35:02 JobServerJobLogConsumer::Reset() - deleting jobs and submissions
08/19/11 15:35:02 Job::~Job of '47821.0'
08/19/11 15:35:02 SubmissionObject::Decrement 'UNKNOWN' on '47821.0'
08/19/11 15:35:02 error: Unknown JobStatus of 0 on 47821.0
08/19/11 15:35:02 LiveJobImpl destroyed: key '47821.0'
Stack dump for process 20036 at timestamp 1313793302 (13 frames)
condor_job_server(dprintf_dump_stack+0x63)[0x4ff653]
condor_job_server[0x509d02]
/lib64/libpthread.so.0[0x376300f4a0]
/usr/lib64/libstdc++.so.6(_ZSt18_Rb_tree_incrementPSt18_Rb_tree_node_base+0x13)[0x376d069e43]
condor_job_server(_ZN23JobServerJobLogConsumer5ResetEv+0x7c)[0x465c4c]
condor_job_server(_ZN16ClassAdLogReader8BulkLoadEv+0x1a)[0x4e4d2a]
condor_job_server(_ZN16ClassAdLogReader4PollEv+0x108)[0x4e4e48]
condor_job_server(_ZN12JobLogMirror26TimerHandler_JobLogPollingEv+0x1e)[0x4ece3e]
condor_job_server(_ZN12TimerManager7TimeoutEv+0x129)[0x490309]
condor_job_server(_ZN10DaemonCore6DriverEv+0x277)[0x481577]
condor_job_server(main+0x10f3)[0x48e7c3]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x3762c1ec5d]
condor_job_server[0x459af9]
zsh: segmentation fault (core dumped)  _CONDOR_JOB_SERVER_DEBUG=D_FULLDEBUG condor_job_server -t -f
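
The trace above is consistent with std::map iterator invalidation: Reset() deletes jobs and submissions while it walks the container holding them, and _Rb_tree_increment then advances through a freed tree node. A minimal C++ sketch of that hazard and the usual remedy, using purely hypothetical names rather than the actual Condor identifiers:

#include <map>
#include <string>

// Hypothetical stand-in for the consumer's submission container.
std::map<std::string, int> submissions;

void Reset()
{
    // BUG: erase() invalidates 'it', so the loop's ++it walks a freed
    // node; this is the pattern that dies inside _Rb_tree_increment:
    //
    //   for (std::map<std::string, int>::iterator it = submissions.begin();
    //        it != submissions.end(); ++it)
    //       submissions.erase(it);

    // FIX: step past the element before erasing it, or just clear().
    std::map<std::string, int>::iterator it = submissions.begin();
    while (it != submissions.end()) {
        submissions.erase(it++);
    }
}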

Comment 1 Pete MacKinnon 2011-10-06 14:44:56 UTC
Fixed upstream in contrib. A utility global map for submissions was not being properly cleaned up in Reset.
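
For illustration, the shape of that cleanup might look like the following sketch; the map and type names here are hypothetical, not the actual Condor identifiers:

#include <map>
#include <string>

class SubmissionObject;  // hypothetical

// Hypothetical global utility map from submission name to object. If
// Reset() deletes the SubmissionObjects but leaves these entries in
// place, the next poll hands out dangling pointers, and a later call
// through one of them (e.g., SubmissionObject::SetOwner, as in the
// stack trace in comment 4) crashes.
std::map<std::string, SubmissionObject*> g_submissions;

void Reset()
{
    // ... delete the Job and SubmissionObject instances ...

    // The missing step: clear the stale entries so the rebuild after a
    // schedd restart starts from an empty map.
    g_submissions.clear();
}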

Comment 2 Lubos Trilety 2011-10-13 13:45:05 UTC
I tried to reproduce it on condor-qmf-7.6.3-0.3, but without success; I don't know how to reach the UNKNOWN state. Could you please describe the reproduction scenario in more detail?

My scenario:
First terminal:
# _CONDOR_JOB_SERVER_DEBUG=D_FULLDEBUG condor_job_server -t -f
...
10/13/11 15:42:00 first log entry: 6 CreationTimestamp 1318510973
10/13/11 15:42:00 JobServerJobLogConsumer::Reset() - deleting jobs and submissions
10/13/11 15:42:00 JobServerJobLogConsumer::NewClassAd processing _key='5.0'
10/13/11 15:42:00 Job::Job of '05.-1'
10/13/11 15:42:00 LiveJobImpl created for '05.-1'
10/13/11 15:42:00 warning: failed to lookup attribute JobStatus in job '05.-1'
10/13/11 15:42:00 Job::Job of '5.0'
10/13/11 15:42:00 LiveJobImpl created for '5.0'
10/13/11 15:42:00 warning: failed to lookup attribute JobStatus in job '5.0'
10/13/11 15:42:00 Created new SubmissionObject 'host#5' for '(null)'
10/13/11 15:42:00 SubmissionObject::Increment 'IDLE' on '5.0'
10/13/11 15:42:00 warning: failed to lookup attribute JobStatus in job '5.0'
10/13/11 15:42:00 SubmissionObject::Decrement 'IDLE' on '5.0'
10/13/11 15:42:00 SubmissionObject::Increment 'IDLE' on '5.0'
10/13/11 15:42:00 JobServerJobLogConsumer::NewClassAd processing _key='0.0'
10/13/11 15:42:00 JobServerJobLogConsumer::NewClassAd processing _key='05.-1'
10/13/11 15:42:10 TimerHandler_JobLogPolling() called
10/13/11 15:42:10 === Current Probing Information ===
10/13/11 15:42:10 fsize: 3053		mtime: 1318513318
10/13/11 15:42:10 first log entry: 6 CreationTimestamp 1318510973
10/13/11 15:42:20 TimerHandler_JobLogPolling() called
10/13/11 15:42:20 === Current Probing Information ===
10/13/11 15:42:20 fsize: 3053		mtime: 1318513318
10/13/11 15:42:20 first log entry: 6 CreationTimestamp 1318510973
10/13/11 15:42:30 TimerHandler_JobLogPolling() called
10/13/11 15:42:30 === Current Probing Information ===
10/13/11 15:42:30 fsize: 3893		mtime: 1318513347
10/13/11 15:42:30 first log entry: 6 CreationTimestamp 1318510973
10/13/11 15:42:30 SubmissionObject::Decrement 'IDLE' on '5.0'
10/13/11 15:42:30 SubmissionObject::Increment 'RUNNING' on '5.0'
10/13/11 15:42:40 TimerHandler_JobLogPolling() called
10/13/11 15:42:40 === Current Probing Information ===
10/13/11 15:42:40 fsize: 3893		mtime: 1318513347
10/13/11 15:42:40 first log entry: 6 CreationTimestamp 1318510973
...

Second terminal:
# echo -e "cmd=/bin/sleep\nargs=1d\nqueue" | runuser condor -s /bin/bash -c condor_submit; condor_restart -schedd
Submitting job(s).
1 job(s) submitted to cluster 5.
Sent "Restart" command to local schedd

Comment 3 Pete MacKinnon 2011-10-13 14:06:05 UTC
Suggestions:

1) restart schedd in a loop
2) queue multiple jobs per submit

Comment 4 Lubos Trilety 2011-10-14 14:53:41 UTC
Reproduced on:
$CondorVersion: 7.6.3 Jul 27 2011 BuildID: RH-7.6.3-0.3.el5 $
$CondorPlatform: X86_64-RedHat_5.6 $

# for I in `seq 40`; do echo -e "cmd=/bin/sleep\nargs=20\nqueue 100" | runuser condor -s /bin/bash -c condor_submit; done
Submitting job(s)....
100 job(s) submitted to cluster 1.
Submitting job(s)....
100 job(s) submitted to cluster 2.
Submitting job(s)....

# while true; do condor_restart -schedd; sleep 20; done

10/14/11 16:12:22 JobServerJobLogConsumer::NewClassAd processing _key='36.89'
10/14/11 16:12:22 Job::Job of '36.89'
10/14/11 16:12:22 LiveJobImpl created for '36.89'
10/14/11 16:12:22 warning: failed to lookup attribute JobStatus in job '36.89'
10/14/11 16:12:22 SubmissionObject::Increment 'IDLE' on '36.89'
10/14/11 16:12:22 warning: failed to lookup attribute JobStatus in job '36.89'
10/14/11 16:12:22 SubmissionObject::Decrement 'IDLE' on '36.89'
10/14/11 16:12:22 SubmissionObject::Increment 'IDLE' on '36.89'
10/14/11 16:12:22 JobServerJobLogConsumer::NewClassAd processing _key='010.-1'
10/14/11 16:12:22 Job::Job of '010.-1'
10/14/11 16:12:22 LiveJobImpl created for '010.-1'
10/14/11 16:12:22 warning: failed to lookup attribute JobStatus in job '010.-1'
Stack dump for process 20821 at timestamp 1318601542 (18 frames)
condor_job_server(dprintf_dump_stack+0x56)[0x520576]
condor_job_server[0x5191a2]
/lib64/libpthread.so.0[0x354c60eb10]
/usr/lib64/libqpidcommon.so.5(_ZN4qpid10management5Mutex4lockEv+0x1b)[0x3933bf326b]
condor_job_server(_ZN16SubmissionObject8SetOwnerEPKc+0x63)[0x460fd3]
condor_job_server(_ZN3Job16UpdateSubmissionEiPKc+0x3b)[0x465c6b]
condor_job_server(_ZN11LiveJobImpl3SetEPKcS1_+0x187)[0x467d37]
condor_job_server(_ZN23JobServerJobLogConsumer12SetAttributeEPKcS1_S1_+0x87)[0x45fc27]
condor_job_server(_ZN16ClassAdLogReader15ProcessLogEntryEP15ClassAdLogEntryP16ClassAdLogParser+0x8d)[0x52611d]
condor_job_server(_ZN16ClassAdLogReader15IncrementalLoadEv+0x36)[0x526176]
condor_job_server(_ZN16ClassAdLogReader8BulkLoadEv+0x22)[0x526242]
condor_job_server(_ZN16ClassAdLogReader4PollEv+0xcb)[0x52631b]
condor_job_server(_ZN12JobLogMirror26TimerHandler_JobLogPollingEv+0x21)[0x524531]
condor_job_server(_ZN12TimerManager7TimeoutEv+0x155)[0x49a245]
condor_job_server(_ZN10DaemonCore6DriverEv+0x248)[0x483d58]
condor_job_server(main+0xe60)[0x498a30]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x354ba1d994]
condor_job_server[0x45e209]
Segmentation fault

Comment 5 Lubos Trilety 2011-10-14 14:55:26 UTC
Tested on:
$CondorVersion: 7.6.4 Oct 07 2011 BuildID: RH-7.6.4-0.7.el5 $
$CondorPlatform: I686-RedHat_5.7 $

$CondorVersion: 7.6.4 Oct 07 2011 BuildID: RH-7.6.4-0.7.el5 $
$CondorPlatform: X86_64-RedHat_5.7 $

$CondorVersion: 7.6.4 Oct 07 2011 BuildID: RH-7.6.4-0.7.el6 $
$CondorPlatform: I686-RedHat_6.1 $

$CondorVersion: 7.6.4 Oct 07 2011 BuildID: RH-7.6.4-0.7.el6 $
$CondorPlatform: X86_64-RedHat_6.1 $

No crash of the job server.

>>> VERIFIED

Comment 6 Pete MacKinnon 2011-10-24 13:51:04 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause: Numerous submissions in advance of a schedd restart.
Consequence: The QMF job server or Aviary query server can crash.
Fix: An internal submission list was not being properly cleared in the reset code.
Result: The QMF job server and Aviary query server no longer crash.