Bug 732452
| Summary: | Job server crashes on submit plus schedd restart scenario | | |
|---|---|---|---|
| Product: | Red Hat Enterprise MRG | Reporter: | Pete MacKinnon <pmackinn> |
| Component: | condor-qmf | Assignee: | Pete MacKinnon <pmackinn> |
| Status: | CLOSED ERRATA | QA Contact: | MRG Quality Engineering <mrgqe-bugs> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | Development | CC: | jneedle, ltoscano, ltrilety, matt, mkudlej, tstclair |
| Target Milestone: | 2.1 | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | condor-7.6.4-0.7 | Doc Type: | Bug Fix |
| Doc Text: | Cause: Numerous submissions in advance of a schedd restart. Consequence: The QMF job server or Aviary query server can crash. Fix: An internal submission list was not being properly cleared in the reset code. Result: The QMF job server and Aviary query server no longer crash. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2012-01-27 19:17:34 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Pete MacKinnon, 2011-08-22 14:08:40 UTC
Fixed upstream in contrib. A utility global map for submissions was not being properly cleaned up in Reset.

I tried to reproduce it on condor-qmf-7.6.3-0.3, but without success. I don't know how to reach the UNKNOWN state. Could you describe the reproduction scenario in more detail? My scenario:

First terminal:

```
# _CONDOR_JOB_SERVER_DEBUG=D_FULLDEBUG condor_job_server -t -f
...
10/13/11 15:42:00 first log entry: 6 CreationTimestamp 1318510973
10/13/11 15:42:00 JobServerJobLogConsumer::Reset() - deleting jobs and submissions
10/13/11 15:42:00 JobServerJobLogConsumer::NewClassAd processing _key='5.0'
10/13/11 15:42:00 Job::Job of '05.-1'
10/13/11 15:42:00 LiveJobImpl created for '05.-1'
10/13/11 15:42:00 warning: failed to lookup attribute JobStatus in job '05.-1'
10/13/11 15:42:00 Job::Job of '5.0'
10/13/11 15:42:00 LiveJobImpl created for '5.0'
10/13/11 15:42:00 warning: failed to lookup attribute JobStatus in job '5.0'
10/13/11 15:42:00 Created new SubmissionObject 'host#5' for '(null)'
10/13/11 15:42:00 SubmissionObject::Increment 'IDLE' on '5.0'
10/13/11 15:42:00 warning: failed to lookup attribute JobStatus in job '5.0'
10/13/11 15:42:00 SubmissionObject::Decrement 'IDLE' on '5.0'
10/13/11 15:42:00 SubmissionObject::Increment 'IDLE' on '5.0'
10/13/11 15:42:00 JobServerJobLogConsumer::NewClassAd processing _key='0.0'
10/13/11 15:42:00 JobServerJobLogConsumer::NewClassAd processing _key='05.-1'
10/13/11 15:42:10 TimerHandler_JobLogPolling() called
10/13/11 15:42:10 === Current Probing Information ===
10/13/11 15:42:10 fsize: 3053 mtime: 1318513318
10/13/11 15:42:10 first log entry: 6 CreationTimestamp 1318510973
10/13/11 15:42:20 TimerHandler_JobLogPolling() called
10/13/11 15:42:20 === Current Probing Information ===
10/13/11 15:42:20 fsize: 3053 mtime: 1318513318
10/13/11 15:42:20 first log entry: 6 CreationTimestamp 1318510973
10/13/11 15:42:30 TimerHandler_JobLogPolling() called
10/13/11 15:42:30 === Current Probing Information ===
10/13/11 15:42:30 fsize: 3893 mtime: 1318513347
10/13/11 15:42:30 first log entry: 6 CreationTimestamp 1318510973
10/13/11 15:42:30 SubmissionObject::Decrement 'IDLE' on '5.0'
10/13/11 15:42:30 SubmissionObject::Increment 'RUNNING' on '5.0'
10/13/11 15:42:40 TimerHandler_JobLogPolling() called
10/13/11 15:42:40 === Current Probing Information ===
10/13/11 15:42:40 fsize: 3893 mtime: 1318513347
10/13/11 15:42:40 first log entry: 6 CreationTimestamp 1318510973
...
```

Second terminal:

```
# echo -e "cmd=/bin/sleep\nargs=1d\nqueue" | runuser condor -s /bin/bash -c condor_submit; condor_restart -schedd
Submitting job(s).
1 job(s) submitted to cluster 5.
Sent "Restart" command to local schedd
```

Suggestions:
1) restart the schedd in a loop
2) queue multiple jobs per submit

Reproduced on:

```
$CondorVersion: 7.6.3 Jul 27 2011 BuildID: RH-7.6.3-0.3.el5 $
$CondorPlatform: X86_64-RedHat_5.6 $
```

```
# for I in `seq 40`; do echo -e "cmd=/bin/sleep\nargs=20\nqueue 100" | runuser condor -s /bin/bash -c condor_submit; done
Submitting job(s)....
100 job(s) submitted to cluster 1.
Submitting job(s)....
100 job(s) submitted to cluster 2.
Submitting job(s)....
# while true; do condor_restart -schedd; sleep 20; done
10/14/11 16:12:22 JobServerJobLogConsumer::NewClassAd processing _key='36.89'
10/14/11 16:12:22 Job::Job of '36.89'
10/14/11 16:12:22 LiveJobImpl created for '36.89'
10/14/11 16:12:22 warning: failed to lookup attribute JobStatus in job '36.89'
10/14/11 16:12:22 SubmissionObject::Increment 'IDLE' on '36.89'
10/14/11 16:12:22 warning: failed to lookup attribute JobStatus in job '36.89'
10/14/11 16:12:22 SubmissionObject::Decrement 'IDLE' on '36.89'
10/14/11 16:12:22 SubmissionObject::Increment 'IDLE' on '36.89'
10/14/11 16:12:22 JobServerJobLogConsumer::NewClassAd processing _key='010.-1'
10/14/11 16:12:22 Job::Job of '010.-1'
10/14/11 16:12:22 LiveJobImpl created for '010.-1'
10/14/11 16:12:22 warning: failed to lookup attribute JobStatus in job '010.-1'
Stack dump for process 20821 at timestamp 1318601542 (18 frames)
condor_job_server(dprintf_dump_stack+0x56)[0x520576]
condor_job_server[0x5191a2]
/lib64/libpthread.so.0[0x354c60eb10]
/usr/lib64/libqpidcommon.so.5(_ZN4qpid10management5Mutex4lockEv+0x1b)[0x3933bf326b]
condor_job_server(_ZN16SubmissionObject8SetOwnerEPKc+0x63)[0x460fd3]
condor_job_server(_ZN3Job16UpdateSubmissionEiPKc+0x3b)[0x465c6b]
condor_job_server(_ZN11LiveJobImpl3SetEPKcS1_+0x187)[0x467d37]
condor_job_server(_ZN23JobServerJobLogConsumer12SetAttributeEPKcS1_S1_+0x87)[0x45fc27]
condor_job_server(_ZN16ClassAdLogReader15ProcessLogEntryEP15ClassAdLogEntryP16ClassAdLogParser+0x8d)[0x52611d]
condor_job_server(_ZN16ClassAdLogReader15IncrementalLoadEv+0x36)[0x526176]
condor_job_server(_ZN16ClassAdLogReader8BulkLoadEv+0x22)[0x526242]
condor_job_server(_ZN16ClassAdLogReader4PollEv+0xcb)[0x52631b]
condor_job_server(_ZN12JobLogMirror26TimerHandler_JobLogPollingEv+0x21)[0x524531]
condor_job_server(_ZN12TimerManager7TimeoutEv+0x155)[0x49a245]
condor_job_server(_ZN10DaemonCore6DriverEv+0x248)[0x483d58]
condor_job_server(main+0xe60)[0x498a30]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x354ba1d994]
condor_job_server[0x45e209]
Segmentation fault
```

Tested on:
```
$CondorVersion: 7.6.4 Oct 07 2011 BuildID: RH-7.6.4-0.7.el5 $
$CondorPlatform: I686-RedHat_5.7 $
$CondorVersion: 7.6.4 Oct 07 2011 BuildID: RH-7.6.4-0.7.el5 $
$CondorPlatform: X86_64-RedHat_5.7 $
$CondorVersion: 7.6.4 Oct 07 2011 BuildID: RH-7.6.4-0.7.el6 $
$CondorPlatform: I686-RedHat_6.1 $
$CondorVersion: 7.6.4 Oct 07 2011 BuildID: RH-7.6.4-0.7.el6 $
$CondorPlatform: X86_64-RedHat_6.1 $
```
No crash of the job server.
>>> VERIFIED