| Summary: | schedd SEGV - VM jobs don't restart after condor restart | ||
|---|---|---|---|
| Product: | Red Hat Enterprise MRG | Reporter: | Luigi Toscano <ltoscano> |
| Component: | condor-vm-gahp | Assignee: | Erik Erlandson <eerlands> |
| Status: | CLOSED ERRATA | QA Contact: | Luigi Toscano <ltoscano> |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | Development | CC: | iboverma, jneedle, matt, tstclair |
| Target Milestone: | 2.0 | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | condor-7.6.0-0.6 | Doc Type: | Bug Fix |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2011-06-27 14:10:32 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Description
Luigi Toscano
2011-04-01 13:14:28 UTC
Note: Luigi reports that the Schedd and Startd are on the same machine, so a service condor restart affects them both. In that case, the VM is expected to be shut down when the Startd goes down, but the SEGV should not occur. This is actually a schedd segfault in newly added code:
core file backtrace:
#1 0x0000003ddb8bea7d in __gnu_cxx::__verbose_terminate_handler () at ../../../../libstdc++-v3/libsupc++/vterminate.cc:93
#2 0x0000003ddb8bcc06 in __cxxabiv1::__terminate (handler=<value optimized out>) at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:38
#3 0x0000003ddb8bcc33 in std::terminate () at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:48
#4 0x0000003ddb8bcd2e in __cxxabiv1::__cxa_throw (obj=0x2a706b0, tinfo=<value optimized out>, dest=<value optimized out>)
at ../../../../libstdc++-v3/libsupc++/eh_throw.cc:83
#5 0x00000000004b10a4 in push_front (tq=..., n=1, t=1302184509) at /usr/src/debug/condor-7.5.6/src/condor_utils/timed_queue.h:62
#6 update<int, int> (tq=..., n=1, t=1302184509) at /usr/src/debug/condor-7.5.6/src/condor_utils/timed_queue.h:108
#7 0x0000000000498889 in Scheduler::jobExitCode (this=0x8f9600, job_id=..., exit_code=107)
at /usr/src/debug/condor-7.5.6/src/condor_schedd.V6/schedd.cpp:9225
#8 0x000000000049a95f in Scheduler::child_exit (this=0x8f9600, pid=26008, status=<value optimized out>)
at /usr/src/debug/condor-7.5.6/src/condor_schedd.V6/schedd.cpp:9102
#9 0x00000000004ea24d in DaemonCore::CallReaper (this=0x29cfd10, reaper_id=<value optimized out>, whatexited=<value optimized out>,
pid=26008, exit_status=27392) at /usr/src/debug/condor-7.5.6/src/condor_daemon_core.V6/daemon_core.cpp:9521
#10 0x00000000004f0047 in DaemonCore::HandleProcessExit (this=0x29cfd10, pid=26008, exit_status=27392)
at /usr/src/debug/condor-7.5.6/src/condor_daemon_core.V6/daemon_core.cpp:9621
#11 0x00000000004f01fb in DaemonCore::HandleDC_SERVICEWAITPIDS (this=<value optimized out>)
at /usr/src/debug/condor-7.5.6/src/condor_daemon_core.V6/daemon_core.cpp:9183
#12 0x00000000004f3fb5 in DaemonCore::Driver (this=0x29cfd10) at /usr/src/debug/condor-7.5.6/src/condor_daemon_core.V6/daemon_core.cpp:3055
#13 0x00000000004e597b in main (argc=1, argv=0x7fffda848560)
at /usr/src/debug/condor-7.5.6/src/condor_daemon_core.V6/daemon_core_main.cpp:2377
git blame schedd.cpp:
8af13d30 src/condor_schedd.V6/schedd.cpp (Erik Erlandson 2011-03-24 12:46:28 -0700 9212)
8af13d30 src/condor_schedd.V6/schedd.cpp (Erik Erlandson 2011-03-24 12:46:28 -0700 9213) // update exit code statistics
8af13d30 src/condor_schedd.V6/schedd.cpp (Erik Erlandson 2011-03-24 12:46:28 -0700 9214) int start_date = 0;
8af13d30 src/condor_schedd.V6/schedd.cpp (Erik Erlandson 2011-03-24 12:46:28 -0700 9215) GetAttributeInt(job_id.cluster, j
8af13d30 src/condor_schedd.V6/schedd.cpp (Erik Erlandson 2011-03-24 12:46:28 -0700 9216) time_t updateTime = time(NULL);
8af13d30 src/condor_schedd.V6/schedd.cpp (Erik Erlandson 2011-03-24 12:46:28 -0700 9217) JobsExitedCum += 1;
8af13d30 src/condor_schedd.V6/schedd.cpp (Erik Erlandson 2011-03-24 12:46:28 -0700 9218) update(JobsExitedTQ, 1, updateTim
8af13d30 src/condor_schedd.V6/schedd.cpp (Erik Erlandson 2011-03-24 12:46:28 -0700 9219) map<int,int>::iterator f(ExitCode
8af13d30 src/condor_schedd.V6/schedd.cpp (Erik Erlandson 2011-03-24 12:46:28 -0700 9220) if (f != ExitCodesCum.end()) {
8af13d30 src/condor_schedd.V6/schedd.cpp (Erik Erlandson 2011-03-24 12:46:28 -0700 9221) f->second += 1;
8af13d30 src/condor_schedd.V6/schedd.cpp (Erik Erlandson 2011-03-24 12:46:28 -0700 9222) }
8af13d30 src/condor_schedd.V6/schedd.cpp (Erik Erlandson 2011-03-24 12:46:28 -0700 9223) map<int, timed_queue<int> >::iter
8af13d30 src/condor_schedd.V6/schedd.cpp (Erik Erlandson 2011-03-24 12:46:28 -0700 9224) if (ff != ExitCodesTQ.end()) {
8af13d30 src/condor_schedd.V6/schedd.cpp (Erik Erlandson 2011-03-24 12:46:28 -0700 9225) update(ff->second, 1, updateT
8af13d30 src/condor_schedd.V6/schedd.cpp (Erik Erlandson 2011-03-24 12:46:28 -0700 9226) }
8af13d30 src/condor_schedd.V6/schedd.cpp (Erik Erlandson 2011-03-24 12:46:28 -0700 9227) // check up on submissions as lon
8af13d30 src/condor_schedd.V6/schedd.cpp (Erik Erlandson 2011-03-24 12:46:28 -0700 9228) int jobsQueued = GetJobQueuedCoun
8af13d30 src/condor_schedd.V6/schedd.cpp (Erik Erlandson 2011-03-24 12:46:28 -0700 9229) update(JobsSubmittedTQ, jobsQueue
8af13d30 src/condor_schedd.V6/schedd.cpp (Erik Erlandson 2011-03-24 12:46:28 -0700 9230) LastJobsQueued = jobsQueued;
8af13d30 src/condor_schedd.V6/schedd.cpp (Erik Erlandson 2011-03-24 12:46:28 -0700 9231)
Possibly related, also seeing crashes in:
#0 0x0000003762c328f5 in raise () from /lib64/libc.so.6
#1 0x0000003762c340d5 in abort () from /lib64/libc.so.6
#2 0x000000376d0bea2d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib64/libstdc++.so.6
#3 0x000000376d0bcbb6 in ?? () from /usr/lib64/libstdc++.so.6
#4 0x000000376d0bcbe3 in std::terminate() () from /usr/lib64/libstdc++.so.6
#5 0x000000376d0bccde in __cxa_throw () from /usr/lib64/libstdc++.so.6
#6 0x00000000004b2664 in push_front (tq=..., n=1, t=1302175180)
at /home/matt/Documents/Repositories/Condor/src/condor_utils/timed_queue.h:62
#7 update<int, int> (tq=..., n=1, t=1302175180)
at /home/matt/Documents/Repositories/Condor/src/condor_utils/timed_queue.h:108
#8 0x00000000004a920c in Scheduler::count_jobs (this=0x8f6620)
at /home/matt/Documents/Repositories/Condor/src/condor_schedd.V6/schedd.cpp:944
#9 0x00000000004acdb2 in Scheduler::timeout (this=0x8f6620)
at /home/matt/Documents/Repositories/Condor/src/condor_schedd.V6/schedd.cpp:677
...
And a potential fix:
diff --git a/src/condor_utils/timed_queue.h b/src/condor_utils/timed_queue.h
index 350b10f..af61d53 100644
--- a/src/condor_utils/timed_queue.h
+++ b/src/condor_utils/timed_queue.h
@@ -59,7 +59,7 @@ struct timed_queue : public std::deque<std::pair<time_t, Data> > {
}
void push_front(const Data& d, time_t t) {
- if (t < base_type::front().first) throw "timed_queue::push_front, timestamp out of order";
+ if (!base_type::empty() && t < base_type::front().first) throw "timed_queue::push_front, timestamp out of order";
base_type::push_front(value_type(t, d));
if (max_len() > 0) trim_len(max_len());
if (max_time() > 0) trim_time(base_type::front().first - max_time());
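A minimal sketch of why the guard matters, assuming a simplified stand-in for the real class: timed_queue_sketch and push_rejected are hypothetical names, trimming is omitted, and std::runtime_error replaces the thrown string literal. Calling front() on an empty std::deque is undefined behavior, so the unguarded comparison can read stale memory and throw. Both backtraces above show that throw going uncaught and ending in std::terminate.

```cpp
#include <cassert>
#include <ctime>
#include <deque>
#include <stdexcept>
#include <utility>

// Hypothetical, simplified stand-in for condor's timed_queue (no trimming).
template <typename Data>
struct timed_queue_sketch : public std::deque<std::pair<time_t, Data> > {
    typedef std::deque<std::pair<time_t, Data> > base_type;
    typedef typename base_type::value_type value_type;

    void push_front(const Data& d, time_t t) {
        // Check empty() first: front() on an empty deque is undefined behavior.
        if (!base_type::empty() && t < base_type::front().first) {
            throw std::runtime_error("push_front, timestamp out of order");
        }
        base_type::push_front(value_type(t, d));
    }
};

// Hypothetical helper: returns true if the push was rejected as out of order.
template <typename Data>
bool push_rejected(timed_queue_sketch<Data>& tq, const Data& d, time_t t) {
    try {
        tq.push_front(d, t);
    } catch (const std::runtime_error&) {
        return true;
    }
    return false;
}

// Usage: the first push into an empty queue must always be accepted.
void demo() {
    timed_queue_sketch<int> tq;
    assert(!push_rejected(tq, 1, 100));  // empty queue: accepted
    assert(!push_rejected(tq, 1, 200));  // newer timestamp: accepted
    assert(push_rejected(tq, 1, 50));    // older than front: rejected
    assert(tq.size() == 2);
}
```

With the guard in place, the ordering check only runs once the queue holds at least one entry, so the first update after a schedd restart cannot trip it.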
Another way to trigger this bug is to submit a VM job whose VM_DISK parameter specifies a non-existent file.
pushed fix to V7_6-BZ692870-timed-queue-segv
This bug should have crashed on my previous testing. I assume I got "lucky" on stale memory values. Just in case, I retested using:
$ export MALLOC_PERTURB_=$(($RANDOM % 255 + 1))
$ git diff HEAD~1
diff --git a/src/condor_utils/timed_queue.h b/src/condor_utils/timed_queue.h
index 350b10f..da2794d 100644
--- a/src/condor_utils/timed_queue.h
+++ b/src/condor_utils/timed_queue.h
@@ -22,6 +22,7 @@
#include "time.h"
#include <deque>
+#include "condor_debug.h"
// A deque<> subclass that makes it convenient to time-stamp queue entries
// and maintain the queue with a configurable time-window and maximum length.
@@ -59,7 +60,9 @@ struct timed_queue : public std::deque<std::pair<time_t, Data> > {
}
void push_front(const Data& d, time_t t) {
- if (t < base_type::front().first) throw "timed_queue::push_front, timestamp out of order";
+ if (!base_type::empty() && (t < base_type::front().first)) {
+ EXCEPT("timed_queue::push_front, timestamp %lu out of order", (unsigned long)(t));
+ }
base_type::push_front(value_type(t, d));
if (max_len() > 0) trim_len(max_len());
if (max_time() > 0) trim_time(base_type::front().first - max_time());
Also cherry-picked to V7_6-BZ678025-publish-schedd-stats
The error is gone, the restart works as before.
Verified on RHEL5.6 Xen i386/x86_64, RHEL5.6 KVM x86_64, RHEL6.1 Beta KVM x86_64.
condor-classads-7.6.1-0.1
condor-7.6.1-0.1
condor-vm-gahp-7.6.1-0.1