Bug 692870

Summary: schedd SEGV - VM jobs don't restart after condor restart
Product: Red Hat Enterprise MRG Reporter: Luigi Toscano <ltoscano>
Component: condor-vm-gahpAssignee: Erik Erlandson <eerlands>
Status: CLOSED ERRATA QA Contact: Luigi Toscano <ltoscano>
Severity: high Docs Contact:
Priority: high    
Version: DevelopmentCC: iboverma, jneedle, matt, tstclair
Target Milestone: 2.0   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: condor-7.6.0-0.6 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-06-27 14:10:32 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description Luigi Toscano 2011-04-01 13:14:28 UTC
Description of problem:
Previously, when condor was restarted while executing a VM job, the shadow used to wait for the lease expiration, then the job (==the VM) was restarted.

With condor 7.6.0-0.4, after the restart, when the lease expire the schedd dies, the shadow tries again and again and the job is never properly restarted (even if the status is still running.

Tested on KVM/x86_64, both RHEL5.6 and RHEL6.1beta, still waiting for the results on RHEL5/Xen.

Version-Release number of selected component (if applicable):
condor-7.6.0-0.4
condor-vm-gahp-7.6.0-0.4

Steps to Reproduce:
1. submit a vm job, wait for it to run
2. restart condor (service condor restart)
  
Actual results:
After the lease expiration, schedd crashes and the shadow tries again, over and over

Expected results:
After the lease expiration, the job should be properly restarted.

Additional information:
This is the backtrace found in the ScheddLog:

------------------------------------------------
04/01/11 09:05:00 (pid:17887) Failed to send ALIVE to startd at <x.y.z.t:48974>: SECMAN:2004:Failed to create security session to <x.y.z.t:48974> with TCP.
|SECMAN:2003:TCP connection to <x.y.z.t:48974> failed.
04/01/11 09:05:45 (pid:17887) Shadow pid 17891 for job 2.0 exited with status 107
Stack dump for process 17887 at timestamp 1301663145 (18 frames)
condor_schedd(dprintf_dump_stack+0x56)[0x5ba506]
condor_schedd[0x5cb252]
/lib64/libpthread.so.0[0x3414a0eb10]
/lib64/libc.so.6(abort+0x28f)[0x3413e31e8f]
/usr/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x114)[0x341d2bed14]
/usr/lib64/libstdc++.so.6[0x341d2bce16]
/usr/lib64/libstdc++.so.6[0x341d2bce43]
/usr/lib64/libstdc++.so.6[0x341d2bcf2a]
condor_schedd[0x4c91db]
condor_schedd(_ZN9Scheduler11jobExitCodeE7PROC_IDi+0x231)[0x4aea31]
condor_schedd(_ZN9Scheduler10child_exitEii+0x340)[0x4b65d0]
condor_schedd(_ZN10DaemonCore10CallReaperEiPKcii+0x11c)[0x4f65cc]
condor_schedd(_ZN10DaemonCore17HandleProcessExitEii+0x1a7)[0x509177]
condor_schedd(_ZN10DaemonCore24HandleDC_SERVICEWAITPIDSEi+0x2e)[0x5092de]
condor_schedd(_ZN10DaemonCore6DriverEv+0x1e7)[0x4fd257]
condor_schedd(main+0xe57)[0x510d57]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x3413e1d994]
condor_schedd(__gxx_personality_v0+0x459)[0x490879]
------------------------------------------------
The working SheddLog (condor 7.4.5-0.8) contains instead:
03/31/11 13:51:32 (pid:11587) Shadow pid 11615 for job 48.0 exited with status 107
03/31/11 13:51:32 (pid:11587) Match record (<x.y.z.t:43735> for condor, 48.0) deleted
03/31/11 13:51:32 (pid:11587) attempt to connect to <x.y.z.t:43735> failed: Connection refused (connect errno = 111).
03/31/11 13:51:32 (pid:11587) Failed to send RELEASE_CLAIM to startd at <x.y.z.t:43735>: SECMAN:2004:Failed to create security session to <x.y.z.t:43735> with TCP.
|SECMAN:2003:TCP connection to <x.y.z.t:43735> failed.
....

Comment 1 Matthew Farrellee 2011-04-01 14:06:15 UTC
Note - Luigi reports that the Schedd & Startd are on the same machine, thus the service condor restart impacts them both. In such a case, it would be expected that the VM is shutdown when the Startd goes down.

The SEGV should not exist.

Comment 4 Timothy St. Clair 2011-04-07 15:59:02 UTC
This is actually a schedd segfault on new code additions: 

core file backtrace: 
#1  0x0000003ddb8bea7d in __gnu_cxx::__verbose_terminate_handler () at ../../../../libstdc++-v3/libsupc++/vterminate.cc:93
#2  0x0000003ddb8bcc06 in __cxxabiv1::__terminate (handler=<value optimized out>) at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:38
#3  0x0000003ddb8bcc33 in std::terminate () at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:48
#4  0x0000003ddb8bcd2e in __cxxabiv1::__cxa_throw (obj=0x2a706b0, tinfo=<value optimized out>, dest=<value optimized out>)
    at ../../../../libstdc++-v3/libsupc++/eh_throw.cc:83
#5  0x00000000004b10a4 in push_front (tq=..., n=1, t=1302184509) at /usr/src/debug/condor-7.5.6/src/condor_utils/timed_queue.h:62
#6  update<int, int> (tq=..., n=1, t=1302184509) at /usr/src/debug/condor-7.5.6/src/condor_utils/timed_queue.h:108
#7  0x0000000000498889 in Scheduler::jobExitCode (this=0x8f9600, job_id=..., exit_code=107)
    at /usr/src/debug/condor-7.5.6/src/condor_schedd.V6/schedd.cpp:9225
#8  0x000000000049a95f in Scheduler::child_exit (this=0x8f9600, pid=26008, status=<value optimized out>)
    at /usr/src/debug/condor-7.5.6/src/condor_schedd.V6/schedd.cpp:9102
#9  0x00000000004ea24d in DaemonCore::CallReaper (this=0x29cfd10, reaper_id=<value optimized out>, whatexited=<value optimized out>, 
    pid=26008, exit_status=27392) at /usr/src/debug/condor-7.5.6/src/condor_daemon_core.V6/daemon_core.cpp:9521
#10 0x00000000004f0047 in DaemonCore::HandleProcessExit (this=0x29cfd10, pid=26008, exit_status=27392)
    at /usr/src/debug/condor-7.5.6/src/condor_daemon_core.V6/daemon_core.cpp:9621
#11 0x00000000004f01fb in DaemonCore::HandleDC_SERVICEWAITPIDS (this=<value optimized out>)
    at /usr/src/debug/condor-7.5.6/src/condor_daemon_core.V6/daemon_core.cpp:9183
#12 0x00000000004f3fb5 in DaemonCore::Driver (this=0x29cfd10) at /usr/src/debug/condor-7.5.6/src/condor_daemon_core.V6/daemon_core.cpp:3055
#13 0x00000000004e597b in main (argc=1, argv=0x7fffda848560)
    at /usr/src/debug/condor-7.5.6/src/condor_daemon_core.V6/daemon_core_main.cpp:2377


git blame schedd.cpp: 

8af13d30 src/condor_schedd.V6/schedd.cpp (Erik Erlandson    2011-03-24 12:46:28 -0700  9212) 
8af13d30 src/condor_schedd.V6/schedd.cpp (Erik Erlandson    2011-03-24 12:46:28 -0700  9213)     // update exit code statistics
8af13d30 src/condor_schedd.V6/schedd.cpp (Erik Erlandson    2011-03-24 12:46:28 -0700  9214)     int start_date = 0;
8af13d30 src/condor_schedd.V6/schedd.cpp (Erik Erlandson    2011-03-24 12:46:28 -0700  9215)     GetAttributeInt(job_id.cluster, j
8af13d30 src/condor_schedd.V6/schedd.cpp (Erik Erlandson    2011-03-24 12:46:28 -0700  9216)     time_t updateTime = time(NULL);
8af13d30 src/condor_schedd.V6/schedd.cpp (Erik Erlandson    2011-03-24 12:46:28 -0700  9217)     JobsExitedCum += 1;
8af13d30 src/condor_schedd.V6/schedd.cpp (Erik Erlandson    2011-03-24 12:46:28 -0700  9218)     update(JobsExitedTQ, 1, updateTim
8af13d30 src/condor_schedd.V6/schedd.cpp (Erik Erlandson    2011-03-24 12:46:28 -0700  9219)     map<int,int>::iterator f(ExitCode
8af13d30 src/condor_schedd.V6/schedd.cpp (Erik Erlandson    2011-03-24 12:46:28 -0700  9220)     if (f != ExitCodesCum.end()) {
8af13d30 src/condor_schedd.V6/schedd.cpp (Erik Erlandson    2011-03-24 12:46:28 -0700  9221)         f->second += 1;
8af13d30 src/condor_schedd.V6/schedd.cpp (Erik Erlandson    2011-03-24 12:46:28 -0700  9222)     }
8af13d30 src/condor_schedd.V6/schedd.cpp (Erik Erlandson    2011-03-24 12:46:28 -0700  9223)     map<int, timed_queue<int> >::iter
8af13d30 src/condor_schedd.V6/schedd.cpp (Erik Erlandson    2011-03-24 12:46:28 -0700  9224)     if (ff != ExitCodesTQ.end()) {
8af13d30 src/condor_schedd.V6/schedd.cpp (Erik Erlandson    2011-03-24 12:46:28 -0700  9225)         update(ff->second, 1, updateT
8af13d30 src/condor_schedd.V6/schedd.cpp (Erik Erlandson    2011-03-24 12:46:28 -0700  9226)     }
8af13d30 src/condor_schedd.V6/schedd.cpp (Erik Erlandson    2011-03-24 12:46:28 -0700  9227)     // check up on submissions as lon
8af13d30 src/condor_schedd.V6/schedd.cpp (Erik Erlandson    2011-03-24 12:46:28 -0700  9228)     int jobsQueued = GetJobQueuedCoun
8af13d30 src/condor_schedd.V6/schedd.cpp (Erik Erlandson    2011-03-24 12:46:28 -0700  9229)     update(JobsSubmittedTQ, jobsQueue
8af13d30 src/condor_schedd.V6/schedd.cpp (Erik Erlandson    2011-03-24 12:46:28 -0700  9230)     LastJobsQueued = jobsQueued;    
8af13d30 src/condor_schedd.V6/schedd.cpp (Erik Erlandson    2011-03-24 12:46:28 -0700  9231)

Comment 6 Matthew Farrellee 2011-04-07 16:04:03 UTC
Possibly related, seeing crashes in -

#0  0x0000003762c328f5 in raise () from /lib64/libc.so.6
#1  0x0000003762c340d5 in abort () from /lib64/libc.so.6
#2  0x000000376d0bea2d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib64/libstdc++.so.6
#3  0x000000376d0bcbb6 in ?? () from /usr/lib64/libstdc++.so.6
#4  0x000000376d0bcbe3 in std::terminate() () from /usr/lib64/libstdc++.so.6
#5  0x000000376d0bccde in __cxa_throw () from /usr/lib64/libstdc++.so.6
#6  0x00000000004b2664 in push_front (tq=..., n=1, t=1302175180)
    at /home/matt/Documents/Repositories/Condor/src/condor_utils/timed_queue.h:62
#7  update<int, int> (tq=..., n=1, t=1302175180)
    at /home/matt/Documents/Repositories/Condor/src/condor_utils/timed_queue.h:108
#8  0x00000000004a920c in Scheduler::count_jobs (this=0x8f6620)
    at /home/matt/Documents/Repositories/Condor/src/condor_schedd.V6/schedd.cpp:944
#9  0x00000000004acdb2 in Scheduler::timeout (this=0x8f6620)
    at /home/matt/Documents/Repositories/Condor/src/condor_schedd.V6/schedd.cpp:677
...

And potential fix -

diff --git a/src/condor_utils/timed_queue.h b/src/condor_utils/timed_queue.h
index 350b10f..af61d53 100644
--- a/src/condor_utils/timed_queue.h
+++ b/src/condor_utils/timed_queue.h
@@ -59,7 +59,7 @@ struct timed_queue : public std::deque<std::pair<time_t, Data> > {
     }
 
     void push_front(const Data& d, time_t t) {
-        if (t < base_type::front().first) throw "timed_queue::push_front, timestamp out of order";
+        if (!base_type::empty() && t < base_type::front().first) throw "timed_queue::push_front, timestamp out of order";
         base_type::push_front(value_type(t, d));
         if (max_len() > 0) trim_len(max_len());
         if (max_time() > 0) trim_time(base_type::front().first - max_time());

Comment 7 Luigi Toscano 2011-04-07 16:32:08 UTC
Another way to trigger this bug is to submit a VM job whose VM_DISK parameter specifies a non-existent file.

Comment 8 Erik Erlandson 2011-04-07 18:16:26 UTC
pushed fix to V7_6-BZ692870-timed-queue-segv

This bug should have crashed on my previous testing.  I assume I got "lucky" on stale memory values.  Just in case, I retested using:
$ export MALLOC_PERTURB_=$(($RANDOM % 255 + 1))


$ git diff HEAD~1
diff --git a/src/condor_utils/timed_queue.h b/src/condor_utils/timed_queue.h
index 350b10f..da2794d 100644
--- a/src/condor_utils/timed_queue.h
+++ b/src/condor_utils/timed_queue.h
@@ -22,6 +22,7 @@
 
 #include "time.h"
 #include <deque>
+#include "condor_debug.h"
 
 // A deque<> subclass that makes it convenient to time-stamp queue entries
 // and maintain the queue with a configurable time-window and maximum length.
@@ -59,7 +60,9 @@ struct timed_queue : public std::deque<std::pair<time_t, Data> > {
     }
 
     void push_front(const Data& d, time_t t) {
-        if (t < base_type::front().first) throw "timed_queue::push_front, timestamp out of order";
+        if (!base_type::empty() && (t < base_type::front().first)) {
+            EXCEPT("timed_queue::push_front, timestamp %lu out of order", (unsigned long)(t));
+        }
         base_type::push_front(value_type(t, d));
         if (max_len() > 0) trim_len(max_len());
         if (max_time() > 0) trim_time(base_type::front().first - max_time());

Comment 9 Erik Erlandson 2011-04-07 18:26:19 UTC
Also cherry-picked to V7_6-BZ678025-publish-schedd-stats

Comment 10 Luigi Toscano 2011-04-22 14:27:05 UTC
The error is gone, the restart works as before.

Verified on RHEL5.6 Xen i386/x86_64, RHEL5.6 KVM x86_64, RHEL6.1 Beta KVM x86_64.
condor-classads-7.6.1-0.1
condor-7.6.1-0.1
condor-vm-gahp-7.6.1-0.1