Hide Forgot
Description of problem: Previously, when condor was restarted while executing a VM job, the shadow used to wait for the lease expiration, then the job (==the VM) was restarted. With condor 7.6.0-0.4, after the restart, when the lease expire the schedd dies, the shadow tries again and again and the job is never properly restarted (even if the status is still running. Tested on KVM/x86_64, both RHEL5.6 and RHEL6.1beta, still waiting for the results on RHEL5/Xen. Version-Release number of selected component (if applicable): condor-7.6.0-0.4 condor-vm-gahp-7.6.0-0.4 Steps to Reproduce: 1. submit a vm job, wait for it to run 2. restart condor (service condor restart) Actual results: After the lease expiration, schedd crashes and the shadow tries again, over and over Expected results: After the lease expiration, the job should be properly restarted. Additional information: This is the backtrace found in the ScheddLog: ------------------------------------------------ 04/01/11 09:05:00 (pid:17887) Failed to send ALIVE to startd at <x.y.z.t:48974>: SECMAN:2004:Failed to create security session to <x.y.z.t:48974> with TCP. |SECMAN:2003:TCP connection to <x.y.z.t:48974> failed. 04/01/11 09:05:45 (pid:17887) Shadow pid 17891 for job 2.0 exited with status 107 Stack dump for process 17887 at timestamp 1301663145 (18 frames) condor_schedd(dprintf_dump_stack+0x56)[0x5ba506] condor_schedd[0x5cb252] /lib64/libpthread.so.0[0x3414a0eb10] /lib64/libc.so.6(abort+0x28f)[0x3413e31e8f] /usr/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x114)[0x341d2bed14] /usr/lib64/libstdc++.so.6[0x341d2bce16] /usr/lib64/libstdc++.so.6[0x341d2bce43] /usr/lib64/libstdc++.so.6[0x341d2bcf2a] condor_schedd[0x4c91db] condor_schedd(_ZN9Scheduler11jobExitCodeE7PROC_IDi+0x231)[0x4aea31] condor_schedd(_ZN9Scheduler10child_exitEii+0x340)[0x4b65d0] condor_schedd(_ZN10DaemonCore10CallReaperEiPKcii+0x11c)[0x4f65cc] condor_schedd(_ZN10DaemonCore17HandleProcessExitEii+0x1a7)[0x509177] condor_schedd(_ZN10DaemonCore24HandleDC_SERVICEWAITPIDSEi+0x2e)[0x5092de] condor_schedd(_ZN10DaemonCore6DriverEv+0x1e7)[0x4fd257] condor_schedd(main+0xe57)[0x510d57] /lib64/libc.so.6(__libc_start_main+0xf4)[0x3413e1d994] condor_schedd(__gxx_personality_v0+0x459)[0x490879] ------------------------------------------------ The working SheddLog (condor 7.4.5-0.8) contains instead: 03/31/11 13:51:32 (pid:11587) Shadow pid 11615 for job 48.0 exited with status 107 03/31/11 13:51:32 (pid:11587) Match record (<x.y.z.t:43735> for condor, 48.0) deleted 03/31/11 13:51:32 (pid:11587) attempt to connect to <x.y.z.t:43735> failed: Connection refused (connect errno = 111). 03/31/11 13:51:32 (pid:11587) Failed to send RELEASE_CLAIM to startd at <x.y.z.t:43735>: SECMAN:2004:Failed to create security session to <x.y.z.t:43735> with TCP. |SECMAN:2003:TCP connection to <x.y.z.t:43735> failed. ....
Note - Luigi reports that the Schedd & Startd are on the same machine, thus the service condor restart impacts them both. In such a case, it would be expected that the VM is shutdown when the Startd goes down. The SEGV should not exist.
This is actually a schedd segfault on new code additions: core file backtrace: #1 0x0000003ddb8bea7d in __gnu_cxx::__verbose_terminate_handler () at ../../../../libstdc++-v3/libsupc++/vterminate.cc:93 #2 0x0000003ddb8bcc06 in __cxxabiv1::__terminate (handler=<value optimized out>) at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:38 #3 0x0000003ddb8bcc33 in std::terminate () at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:48 #4 0x0000003ddb8bcd2e in __cxxabiv1::__cxa_throw (obj=0x2a706b0, tinfo=<value optimized out>, dest=<value optimized out>) at ../../../../libstdc++-v3/libsupc++/eh_throw.cc:83 #5 0x00000000004b10a4 in push_front (tq=..., n=1, t=1302184509) at /usr/src/debug/condor-7.5.6/src/condor_utils/timed_queue.h:62 #6 update<int, int> (tq=..., n=1, t=1302184509) at /usr/src/debug/condor-7.5.6/src/condor_utils/timed_queue.h:108 #7 0x0000000000498889 in Scheduler::jobExitCode (this=0x8f9600, job_id=..., exit_code=107) at /usr/src/debug/condor-7.5.6/src/condor_schedd.V6/schedd.cpp:9225 #8 0x000000000049a95f in Scheduler::child_exit (this=0x8f9600, pid=26008, status=<value optimized out>) at /usr/src/debug/condor-7.5.6/src/condor_schedd.V6/schedd.cpp:9102 #9 0x00000000004ea24d in DaemonCore::CallReaper (this=0x29cfd10, reaper_id=<value optimized out>, whatexited=<value optimized out>, pid=26008, exit_status=27392) at /usr/src/debug/condor-7.5.6/src/condor_daemon_core.V6/daemon_core.cpp:9521 #10 0x00000000004f0047 in DaemonCore::HandleProcessExit (this=0x29cfd10, pid=26008, exit_status=27392) at /usr/src/debug/condor-7.5.6/src/condor_daemon_core.V6/daemon_core.cpp:9621 #11 0x00000000004f01fb in DaemonCore::HandleDC_SERVICEWAITPIDS (this=<value optimized out>) at /usr/src/debug/condor-7.5.6/src/condor_daemon_core.V6/daemon_core.cpp:9183 #12 0x00000000004f3fb5 in DaemonCore::Driver (this=0x29cfd10) at /usr/src/debug/condor-7.5.6/src/condor_daemon_core.V6/daemon_core.cpp:3055 #13 0x00000000004e597b in main (argc=1, argv=0x7fffda848560) at /usr/src/debug/condor-7.5.6/src/condor_daemon_core.V6/daemon_core_main.cpp:2377 git blame schedd.cpp: 8af13d30 src/condor_schedd.V6/schedd.cpp (Erik Erlandson 2011-03-24 12:46:28 -0700 9212) 8af13d30 src/condor_schedd.V6/schedd.cpp (Erik Erlandson 2011-03-24 12:46:28 -0700 9213) // update exit code statistics 8af13d30 src/condor_schedd.V6/schedd.cpp (Erik Erlandson 2011-03-24 12:46:28 -0700 9214) int start_date = 0; 8af13d30 src/condor_schedd.V6/schedd.cpp (Erik Erlandson 2011-03-24 12:46:28 -0700 9215) GetAttributeInt(job_id.cluster, j 8af13d30 src/condor_schedd.V6/schedd.cpp (Erik Erlandson 2011-03-24 12:46:28 -0700 9216) time_t updateTime = time(NULL); 8af13d30 src/condor_schedd.V6/schedd.cpp (Erik Erlandson 2011-03-24 12:46:28 -0700 9217) JobsExitedCum += 1; 8af13d30 src/condor_schedd.V6/schedd.cpp (Erik Erlandson 2011-03-24 12:46:28 -0700 9218) update(JobsExitedTQ, 1, updateTim 8af13d30 src/condor_schedd.V6/schedd.cpp (Erik Erlandson 2011-03-24 12:46:28 -0700 9219) map<int,int>::iterator f(ExitCode 8af13d30 src/condor_schedd.V6/schedd.cpp (Erik Erlandson 2011-03-24 12:46:28 -0700 9220) if (f != ExitCodesCum.end()) { 8af13d30 src/condor_schedd.V6/schedd.cpp (Erik Erlandson 2011-03-24 12:46:28 -0700 9221) f->second += 1; 8af13d30 src/condor_schedd.V6/schedd.cpp (Erik Erlandson 2011-03-24 12:46:28 -0700 9222) } 8af13d30 src/condor_schedd.V6/schedd.cpp (Erik Erlandson 2011-03-24 12:46:28 -0700 9223) map<int, timed_queue<int> >::iter 8af13d30 src/condor_schedd.V6/schedd.cpp (Erik Erlandson 2011-03-24 12:46:28 -0700 9224) if (ff != ExitCodesTQ.end()) { 8af13d30 src/condor_schedd.V6/schedd.cpp (Erik Erlandson 2011-03-24 12:46:28 -0700 9225) update(ff->second, 1, updateT 8af13d30 src/condor_schedd.V6/schedd.cpp (Erik Erlandson 2011-03-24 12:46:28 -0700 9226) } 8af13d30 src/condor_schedd.V6/schedd.cpp (Erik Erlandson 2011-03-24 12:46:28 -0700 9227) // check up on submissions as lon 8af13d30 src/condor_schedd.V6/schedd.cpp (Erik Erlandson 2011-03-24 12:46:28 -0700 9228) int jobsQueued = GetJobQueuedCoun 8af13d30 src/condor_schedd.V6/schedd.cpp (Erik Erlandson 2011-03-24 12:46:28 -0700 9229) update(JobsSubmittedTQ, jobsQueue 8af13d30 src/condor_schedd.V6/schedd.cpp (Erik Erlandson 2011-03-24 12:46:28 -0700 9230) LastJobsQueued = jobsQueued; 8af13d30 src/condor_schedd.V6/schedd.cpp (Erik Erlandson 2011-03-24 12:46:28 -0700 9231)
Possibly related, seeing crashes in - #0 0x0000003762c328f5 in raise () from /lib64/libc.so.6 #1 0x0000003762c340d5 in abort () from /lib64/libc.so.6 #2 0x000000376d0bea2d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib64/libstdc++.so.6 #3 0x000000376d0bcbb6 in ?? () from /usr/lib64/libstdc++.so.6 #4 0x000000376d0bcbe3 in std::terminate() () from /usr/lib64/libstdc++.so.6 #5 0x000000376d0bccde in __cxa_throw () from /usr/lib64/libstdc++.so.6 #6 0x00000000004b2664 in push_front (tq=..., n=1, t=1302175180) at /home/matt/Documents/Repositories/Condor/src/condor_utils/timed_queue.h:62 #7 update<int, int> (tq=..., n=1, t=1302175180) at /home/matt/Documents/Repositories/Condor/src/condor_utils/timed_queue.h:108 #8 0x00000000004a920c in Scheduler::count_jobs (this=0x8f6620) at /home/matt/Documents/Repositories/Condor/src/condor_schedd.V6/schedd.cpp:944 #9 0x00000000004acdb2 in Scheduler::timeout (this=0x8f6620) at /home/matt/Documents/Repositories/Condor/src/condor_schedd.V6/schedd.cpp:677 ... And potential fix - diff --git a/src/condor_utils/timed_queue.h b/src/condor_utils/timed_queue.h index 350b10f..af61d53 100644 --- a/src/condor_utils/timed_queue.h +++ b/src/condor_utils/timed_queue.h @@ -59,7 +59,7 @@ struct timed_queue : public std::deque<std::pair<time_t, Data> > { } void push_front(const Data& d, time_t t) { - if (t < base_type::front().first) throw "timed_queue::push_front, timestamp out of order"; + if (!base_type::empty() && t < base_type::front().first) throw "timed_queue::push_front, timestamp out of order"; base_type::push_front(value_type(t, d)); if (max_len() > 0) trim_len(max_len()); if (max_time() > 0) trim_time(base_type::front().first - max_time());
Another way to trigger this bug is to submit a VM job whose VM_DISK parameter specifies a non-existent file.
pushed fix to V7_6-BZ692870-timed-queue-segv This bug should have crashed on my previous testing. I assume I got "lucky" on stale memory values. Just in case, I retested using: $ export MALLOC_PERTURB_=$(($RANDOM % 255 + 1)) $ git diff HEAD~1 diff --git a/src/condor_utils/timed_queue.h b/src/condor_utils/timed_queue.h index 350b10f..da2794d 100644 --- a/src/condor_utils/timed_queue.h +++ b/src/condor_utils/timed_queue.h @@ -22,6 +22,7 @@ #include "time.h" #include <deque> +#include "condor_debug.h" // A deque<> subclass that makes it convenient to time-stamp queue entries // and maintain the queue with a configurable time-window and maximum length. @@ -59,7 +60,9 @@ struct timed_queue : public std::deque<std::pair<time_t, Data> > { } void push_front(const Data& d, time_t t) { - if (t < base_type::front().first) throw "timed_queue::push_front, timestamp out of order"; + if (!base_type::empty() && (t < base_type::front().first)) { + EXCEPT("timed_queue::push_front, timestamp %lu out of order", (unsigned long)(t)); + } base_type::push_front(value_type(t, d)); if (max_len() > 0) trim_len(max_len()); if (max_time() > 0) trim_time(base_type::front().first - max_time());
Also cherry-picked to V7_6-BZ678025-publish-schedd-stats
The error is gone, the restart works as before. Verified on RHEL5.6 Xen i386/x86_64, RHEL5.6 KVM x86_64, RHEL6.1 Beta KVM x86_64. condor-classads-7.6.1-0.1 condor-7.6.1-0.1 condor-vm-gahp-7.6.1-0.1