Bug 540545
Summary: | WANT_SUSPEND evaluating to UNDEFINED causes condor_startd exception | ||
---|---|---|---|
Product: | Red Hat Enterprise MRG | Reporter: | Martin Kudlej <mkudlej> |
Component: | condor | Assignee: | Matthew Farrellee <matt> |
Status: | CLOSED ERRATA | QA Contact: | Martin Kudlej <mkudlej> |
Severity: | high | Docs Contact: | |
Priority: | medium | ||
Version: | 1.0 | CC: | fnadge, matt |
Target Milestone: | 1.3 | ||
Target Release: | --- | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: |
The default 'WANT_SUSPEND' policy included the JobStart attribute, which in some cases was 'UNDEFINED'. A 'WANT_SUSPEND' with the value 'UNDEFINED' was considered an error and startd would exit. With this update, the JobStart attribute was removed from 'WANT_SUSPEND'. Startd now treats a 'WANT_SUSPEND' that evaluates to 'UNDEFINED' as if it evaluated to 'FALSE'.
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2010-10-14 16:12:48 UTC | Type: | --- |
Description
Martin Kudlej
2009-11-23 16:16:45 UTC
What's the frequency of this bug? For instance,

$ grep -e "STARTING UP" -e "signal 11" MasterLog; tail -n1 MasterLog
11/25 13:49:56 ** condor_master (CONDOR_MASTER) STARTING UP
11/25 14:06:09 The STARTD (pid 15559) died due to signal 11 (Segmentation fault)
11/25 14:21:20 exit Daemons::UpdateCollector

That shows one crash over a period of about 30 minutes. However, we need more than 1 data point to have any confidence in the rate.

The condor_shadow failures are a response to a condor_starter failure. The condor_starter failure, only indirectly shown in the logs above, is because "ProcD has failed". The condor_procd is a process shared between the condor_startd and condor_starter processes to track process families.

$ pstree -p 12992
condor_startd(12992)─┬─condor_procd(13254)
                     ├─condor_starter(25486)───condor_exec.exe(25567)
                     ├─condor_starter(25519)───condor_exec.exe(25568)
                     ├─condor_starter(25542)───condor_exec.exe(25569)
                     ├─condor_starter(25588)
                     ├─condor_starter(25620)
                     ├─condor_starter(25647)
                     └─condor_starter(31030)───condor_exec.exe(31054)

The procd, which is terminated as part of ~DaemonCore and Proc_Family_Cleanup, may be in a race with the condor_starter. This should be a separate BZ.

What is the occurrence of this bug? Basically, how many jobs are processed by a single startd before it crashes. Very roughly,

$ echo "Jobs per crash: $(($(grep "Idle -> Busy" StartLog | wc -l) / $(grep -e "STARTING UP" StartLog | wc -l)))"
Jobs per crash: 1129

The stack is easier to read after being piped through c++filt,

$ grep -A24 Stack StartLog | c++filt
Stack dump for process 15559 at timestamp 1259175969 (24 frames)
condor_startd(dprintf_dump_stack+0xb7)[0x5020d1]
condor_startd[0x50233e]
/lib64/libpthread.so.0[0x34c380e7c0]
/lib64/libc.so.6(abort+0x28f)[0x34c3031e8f]
condor_startd(_EXCEPT_+0x1a5)[0x500663]
condor_startd(Resource::wants_suspend()+0x113)[0x4a5065]
condor_startd(ResState::eval()+0xe5)[0x4a40d9]
condor_startd(ResState::change(State, Activity)+0x2af)[0x4a335b]
condor_startd(ResState::change(State)+0x7e)[0x4a33e4]
condor_startd(Resource::change_state(State)+0x1f)[0x49819d]
condor_startd(accept_request_claim(Resource*)+0x66c)[0x4b1eae]
condor_startd(request_claim(Resource*, Claim*, char*, Stream*)+0x144a)[0x4b343c]
condor_startd(command_request_claim(Service*, int, Stream*)+0x2a6)[0x4b494e]
condor_startd(DaemonCore::CallCommandHandler(int, Stream*, bool)+0x29f)[0x4d7675]
condor_startd(DaemonCore::HandleReq(Stream*, Stream*)+0x3393)[0x4eb671]
condor_startd(DaemonCore::HandleReqSocketHandler(Stream*)+0x99)[0x4ec287]
condor_startd(DaemonCore::CallSocketHandler_worker(int, bool, Stream*)+0x279)[0x4ebffb]
condor_startd(DaemonCore::CallSocketHandler_worker_demarshall(void*)+0x39)[0x4ec1e3]
condor_startd(CondorThreads::pool_add(void (*)(void*), void*, int*, char const*)+0x3f)[0x579acd]
condor_startd(DaemonCore::CallSocketHandler(int&, bool)+0x1cd)[0x4de445]
condor_startd(DaemonCore::Driver()+0x1876)[0x4dfd3c]
condor_startd(main+0x1917)[0x4f7785]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x34c301d994]
condor_startd[0x490169]

NOTE (thanks GregT for the reminder): The check for WANT_SUSPEND is an EvalBool, which means WANT_SUSPEND evaluating to UNDEFINED results in this assertion.

Debug with:

WANT_SUSPEND = debug($(WANT_SUSPEND))

The assertion may be misguided. It attempts to assert that WANT_SUSPEND is set, and cannot distinguish between a WANT_SUSPEND that evaluates to UNDEFINED vs was never set.

$ condor_config_val -v WANT_SUSPEND
WANT_SUSPEND: debug(( (TARGET.ImageSize < (15 * 1024)) || ((KeyboardIdle < 60) == False) || (TARGET.JobUniverse == 5) ) && ( ( (KeyboardIdle < 60) || ( (CpuBusyTime > 2 * 60) && (CurrentTime - JobStart) > 90 ) ) ))

I would suspect JobStart, aka ATTR_JOB_START

Indeed JobStart is evaluating to UNDEFINED

StartLog:
11/25 15:49:27 Classad debug: ImageSize --> ERROR
11/25 15:49:27 Classad debug: TARGET.ImageSize --> 22
11/25 15:49:27 Classad debug: KeyboardIdle --> ERROR
11/25 15:49:27 Classad debug: KeyboardIdle --> 66
11/25 15:49:27 Classad debug: CpuBusyTime --> ERROR
11/25 15:49:27 Classad debug: CpuBusyTime --> 318
11/25 15:49:27 Classad debug: CurrentTime --> 1259182167
11/25 15:49:27 Classad debug: JobStart --> UNDEFINED
11/25 15:49:27 Classad debug: debug(((TARGET.ImageSize < (15 * 1024)) || ((KeyboardIdle < 60) == FALSE) || (TARGET.JobUniverse == 5)) && (((KeyboardIdle < 60) || ((CpuBusyTime > 2 * 60) && (CurrentTime - JobStart) > 90)))) --> UNDEFINED
11/25 15:49:27 ERROR "Can't find WANT_SUSPEND in internal ClassAd" at line 1261 in file Resource.cpp

With debugging,

$ grep -e "STARTING UP" -e "signal 11" MasterLog; tail -n1 MasterLog
11/25 13:49:56 ** condor_master (CONDOR_MASTER) STARTING UP
11/25 14:06:09 The STARTD (pid 15559) died due to signal 11 (Segmentation fault)
11/25 14:27:46 The STARTD (pid 12992) died due to signal 11 (Segmentation fault)
11/25 14:49:01 The STARTD (pid 28825) died due to signal 11 (Segmentation fault)
11/25 15:10:41 The STARTD (pid 11562) died due to signal 11 (Segmentation fault)
11/25 15:38:20 The STARTD (pid 27427) died due to signal 11 (Segmentation fault)
11/25 15:49:27 The STARTD (pid 29229) died due to signal 11 (Segmentation fault)
11/25 16:06:40 ProcAPI::getProcInfo() pid 16745 does not exist.
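
The debug trace above can be reproduced on paper with three-valued logic. What follows is a toy Python model, not HTCondor's ClassAd evaluator: the names UNDEF, strict, and3, or3 and want_suspend are made up for illustration, and it assumes the usual ClassAd rules that arithmetic and comparisons yield UNDEFINED when an operand is UNDEFINED, while && and || can still be decided by a FALSE or TRUE operand respectively. TARGET.JobUniverse does not appear in the trace, so that clause is left out; the first half of the policy is already TRUE from ImageSize.

```python
import operator


class _Undefined:
    """Sentinel standing in for the ClassAd UNDEFINED value (illustrative)."""

    def __repr__(self):
        return "UNDEFINED"


UNDEF = _Undefined()


def strict(op, a, b):
    """Arithmetic or comparison: UNDEFINED if either operand is UNDEFINED."""
    if a is UNDEF or b is UNDEF:
        return UNDEF
    return op(a, b)


def and3(a, b):
    """Three-valued &&: a FALSE operand decides the result."""
    if a is False or b is False:
        return False
    if a is UNDEF or b is UNDEF:
        return UNDEF
    return True


def or3(a, b):
    """Three-valued ||: a TRUE operand decides the result."""
    if a is True or b is True:
        return True
    if a is UNDEF or b is UNDEF:
        return UNDEF
    return False


def want_suspend(image_size, keyboard_idle, cpu_busy_time, current_time, job_start):
    """Simplified copy of the logged policy (JobUniverse clause omitted)."""
    lt, gt, eq, sub = operator.lt, operator.gt, operator.eq, operator.sub
    first = or3(strict(lt, image_size, 15 * 1024),
                strict(eq, strict(lt, keyboard_idle, 60), False))
    second = or3(strict(lt, keyboard_idle, 60),
                 and3(strict(gt, cpu_busy_time, 2 * 60),
                      strict(gt, strict(sub, current_time, job_start), 90)))
    return and3(first, second)


# Values taken from the StartLog trace: JobStart is UNDEFINED.
print(want_suspend(22, 66, 318, 1259182167, UNDEF))
```

With the logged values the KeyboardIdle and CpuBusyTime guards both let evaluation reach JobStart, so the UNDEFINED propagates to the top. Changing keyboard_idle to a value below 60, or cpu_busy_time to a value of 120 or less, decides the result before JobStart matters.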
$ grep "JobStart --> UNDEFINED" StartLog
11/25 15:49:27 Classad debug: JobStart --> UNDEFINED

Proposed workaround:

$ grep ActivationTimer ~condor/condor_config.overrides
ActivationTimer = ifThenElse(JobStart =!= UNDEFINED, (CurrentTime - JobStart), 0)

This crash was only happening when the evaluation of WANT_SUSPEND made it to evaluation of JobStart, which was protected by KeyboardIdle and CpuBusyTime.

The crash can be easily reproduced with:

WANT_SUSPEND = (CurrentTime - JobStart) < 90

JobStart is defined by a claim with an active job in the startd. It is not always present, making it a dangerous choice for default policy.

Two actions on this:
1) Update the assertion to differentiate between not present and evaluating to UNDEFINED
2) Update the default configuration to protect access to JobStart

This bug has likely been around for a very long time: Activation timer showed up in 1998 (git show 3cef52db). The conditional setting of ATTR_JOB_START has been around since at least 2003 (see V6_5-branch).

Upstream ticket: http://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=1001

commit 745a047243d94d23ac286fd604c57542f3f3cd15
Author: Matthew Farrellee <matt@>
Date: Fri Dec 18 15:50:16 2009 -0500

    Protect ActivationTimer from going to UNDEFINED (#1001)

diff --git a/src/condor_examples/condor_config.generic b/src/condor_examples/condor_config.generic
index 4a49eae..3d93b5a 100644
--- a/src/condor_examples/condor_config.generic
+++ b/src/condor_examples/condor_config.generic
@@ -622,7 +622,7 @@ MINUTE = 60
 HOUR = (60 * $(MINUTE))
 StateTimer = (CurrentTime - EnteredCurrentState)
 ActivityTimer = (CurrentTime - EnteredCurrentActivity)
-ActivationTimer = (CurrentTime - JobStart)
+ActivationTimer = ifThenElse(JobStart =!= UNDEFINED, (CurrentTime - JobStart), 0)
 LastCkpt = (CurrentTime - LastPeriodicCheckpoint)
 ## The JobUniverse attribute is just an int. These macros can be

commit 6d4b6e5119261851e17996e5121c42b7c3a045fe
Author: Matthew Farrellee <matt@>
Date: Fri Dec 18 15:55:12 2009 -0500

    WANT_SUSPEND that is UNDEFINED, either by being absent or evaluating to UNDEFINED, now means FALSE. It used to mean EXCEPT! (#1001)

diff --git a/src/condor_startd.V6/Resource.cpp b/src/condor_startd.V6/Resource.cpp
index 6a23889..76da635 100644
--- a/src/condor_startd.V6/Resource.cpp
+++ b/src/condor_startd.V6/Resource.cpp
@@ -1255,10 +1255,8 @@ Resource::wants_suspend( void )
 if( r_classad->EvalBool( "WANT_SUSPEND", r_cur->ad(), want_suspend ) == 0) {
- // This should never happen, since we already check
- // when we're constructing the internal config classad
- // if we've got this defined. -Derek Wright 4/12/00
- EXCEPT( "Can't find WANT_SUSPEND in internal ClassAd" );
+ // UNDEFINED means FALSE for WANT_SUSPEND
+ want_suspend = false;
 }
 }
 return want_suspend;

Fixed in 7.4.2-0.1

Tested on RHEL 5.5/4.8 on x86_64/i386 with condor-7.4.3-0.5. I submitted 100000 simple jobs from comment #1 and another 100000 simple jobs to condor with "WANT_SUSPEND = (CurrentTime - JobStart) < 90" in configuration. I didn't see any error or coredump from comment #1 --> VERIFIED

Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Previously, if the user submitted around 10000 simple jobs "sleep 3", condor_startd and condor_shadow failed. With this update, condor_startd and condor_shadow work as expected, even with large numbers of jobs.

Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1 @@
-Previously, if the user submitted around 10000 simple jobs "sleep 3", condor_startd and condor_shadow failed. With this update, condor_startd and condor_shadow work as expected, even with large numbers of jobs.
+Previously, condor_startd and condor_shadow failed because default the value of WANT_SUSPEND policy becomes to UNDEFINED. With this update, condor_startd and condor_shadow work as expected, even if WANT_SUSPEND evaluates to UNDEFINED.

Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1,4 @@
-Previously, condor_startd and condor_shadow failed because default the value of WANT_SUSPEND policy becomes to UNDEFINED. With this update, condor_startd and condor_shadow work as expected, even if WANT_SUSPEND evaluates to UNDEFINED.
+C: The default WANT_SUSPEND policy included JobStart, which in some cases may be UNDEFINED.
+C: The WANT_SUSPEND policy expression would evaluated to UNDEFINED when JobStart was also not defined. A WANT_SUSPEND of UNDEFINED was considered an error, and the Startd would exit.
+F: Workaround by removing JobStart from WANT_SUSPEND. Fixed by making WANT_SUSPEND evaluating to UNDEFINED equivalent to evaluating to FALSE.
+R: Startd now treats a WANT_SUSPEND that evaluates to UNDEFINED as if it evaluated to FALSE.

Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1,4 +1 @@
-C: The default WANT_SUSPEND policy included JobStart, which in some cases may be UNDEFINED.
+The default 'WANT_SUSPEND' policy included the JobStart attribute, which in some cases was 'UNDEFINED'. A 'WANT_SUSPEND' with the value 'UNDEFINED' was considered an error and startd would exit. With this update, the JobStart attribute was removed from 'WANT_SUSPEND'. Startd now treats a 'WANT_SUSPEND' that evaluates to 'UNDEFINED' as if it evaluated to 'FALSE'.
-C: The WANT_SUSPEND policy expression would evaluated to UNDEFINED when JobStart was also not defined. A WANT_SUSPEND of UNDEFINED was considered an error, and the Startd would exit.
-F: Workaround by removing JobStart from WANT_SUSPEND. Fixed by making WANT_SUSPEND evaluating to UNDEFINED equivalent to evaluating to FALSE.
-R: Startd now treats a WANT_SUSPEND that evaluates to UNDEFINED as if it evaluated to FALSE.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0773.html |
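
For readers who want to see how the two changes recorded in this report fit together (the guarded ActivationTimer macro and the UNDEFINED-means-FALSE fallback in Resource::wants_suspend), here is a minimal Python sketch. It is only a model under stated assumptions: Python's None stands in for ClassAd UNDEFINED, and every function name below is hypothetical rather than part of Condor.

```python
UNDEFINED = None  # stand-in for ClassAd UNDEFINED in this toy model


def unguarded_timer(current_time, job_start):
    """Models the old macro (CurrentTime - JobStart): UNDEFINED propagates."""
    if job_start is UNDEFINED:
        return UNDEFINED
    return current_time - job_start


def activation_timer(current_time, job_start):
    """Models ifThenElse(JobStart =!= UNDEFINED, CurrentTime - JobStart, 0)."""
    if job_start is UNDEFINED:
        return 0
    return current_time - job_start


def older_than(timer, seconds):
    """Models `timer > seconds`, which is UNDEFINED for an UNDEFINED timer."""
    if timer is UNDEFINED:
        return UNDEFINED
    return timer > seconds


def wants_suspend(evaluated):
    """Models the patched check: UNDEFINED is treated as FALSE, no EXCEPT."""
    if evaluated is UNDEFINED:
        return False
    return bool(evaluated)


now = 1259182167
no_job = UNDEFINED  # no active job on the claim, so JobStart is absent

old_policy = older_than(unguarded_timer(now, no_job), 90)   # stays UNDEFINED
new_policy = older_than(activation_timer(now, no_job), 90)  # 0 > 90 is False

print(old_policy, wants_suspend(old_policy))
print(new_policy, wants_suspend(new_policy))
```

Either change alone keeps the startd running in this model: the guarded macro stops the policy from becoming UNDEFINED in the first place, and the fallback covers any other expression that still evaluates to UNDEFINED.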