Bug 709873

Summary: Plugins report JobStatus of held job as "Running" in result of GetJobSummaries method
Product: Red Hat Enterprise MRG
Component: condor-qmf
Version: Development
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: medium
Priority: unspecified
Reporter: Trevor McKay <tmckay>
Assignee: Pete MacKinnon <pmackinn>
QA Contact: Martin Kudlej <mkudlej>
CC: jneedle, matt, mkudlej
Target Milestone: 2.0
Fixed In Version: condor-7.6.1-0.10
Doc Type: Bug Fix
Last Closed: 2011-06-27 14:13:26 UTC
Attachments:
Description     Flags
qmf patch       none
aviary patch    none

Description Trevor McKay 2011-06-01 20:16:44 UTC
Description of problem:

A job in the held state was reported as held by condor_q and in the job ClassAd, but the data returned by the GetJobSummaries method run on its enclosing Submission showed it with JobStatus 2 (running).

The Held/Running/Idle counts for the Submission itself were correct.

How reproducible:

unknown


Comment 1 Pete MacKinnon 2011-06-02 15:07:09 UTC
Fixed upstream on V7_6-branch:

~/repos/uw/condor/CONDOR_SRC  (V7_6-branch)$ git show d69c77708a7e97b5c7a47b21fd05fe66bf034ea1 1f7623883552c6acf9b87967419327545cc42aed
commit d69c77708a7e97b5c7a47b21fd05fe66bf034ea1
Author: Peter MacKinnon <pmackinn>
Date:   Wed Jun 1 16:43:46 2011 -0400

    Fix to ensure job status is up-to-date even if summary has been cached
    in qmf contrib job server

diff --git a/src/condor_contrib/mgmt/qmf/daemons/Job.cpp b/src/condor_contrib/mgmt/qmf/daemons/Job.cpp
index 6983e0e..eb219f7 100644
--- a/src/condor_contrib/mgmt/qmf/daemons/Job.cpp
+++ b/src/condor_contrib/mgmt/qmf/daemons/Job.cpp
@@ -337,6 +337,16 @@ const ClassAd* LiveJobImpl::GetSummary ()
                }
        }
 
+    // make sure we're up-to-date with status even if we've cached the summary
+       m_summary_ad->Assign(ATTR_JOB_STATUS,this->GetStatus());
+    int i;
+    if ( m_full_ad->LookupInteger ( ATTR_ENTERED_CURRENT_STATUS, i ) ) {
+        m_summary_ad->Assign(ATTR_ENTERED_CURRENT_STATUS,i);
+    }
+    else {
+        dprintf(D_ALWAYS,"Unable to get ATTR_ENTERED_CURRENT_STATUS\n");
+    }
+
        return m_summary_ad;
 }
 

commit 1f7623883552c6acf9b87967419327545cc42aed
Author: Peter MacKinnon <pmackinn>
Date:   Wed Jun 1 17:14:52 2011 -0400

    Ensure live job status is accurate in job summaries
    from aviary contrib query server

diff --git a/src/condor_contrib/aviary/src/Job.cpp b/src/condor_contrib/aviary/src/Job.cpp
index bb4112e..1b2c294 100644
--- a/src/condor_contrib/aviary/src/Job.cpp
+++ b/src/condor_contrib/aviary/src/Job.cpp
@@ -307,11 +307,21 @@ const ClassAd* LiveJobImpl::getSummary ()
                                                m_summary_ad->Assign(ATTRS[i], attr->getValue());
                                }
                        }
-                       delete attr;
+               delete attr;
                i++;
         }
        }
 
+    // make sure we're up-to-date with status even if we've cached the summary
+    m_summary_ad->Assign(ATTR_JOB_STATUS,this->getStatus());
+    int i;
+    if ( m_full_ad->LookupInteger ( ATTR_ENTERED_CURRENT_STATUS, i ) ) {
+        m_summary_ad->Assign(ATTR_ENTERED_CURRENT_STATUS,i);
+    }
+    else {
+        dprintf(D_ALWAYS,"Unable to get ATTR_ENTERED_CURRENT_STATUS\n");
+    }
+
        return m_summary_ad;
 }
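
Both commits apply the same pattern: the summary ad is still built once and cached, but the status attributes are re-copied from the full ad on every call. The following is a minimal standalone sketch of that pattern, not the Condor code itself. The `Ad` map stands in for a ClassAd, and `LiveJobSketch`, `setAttr`, and the string attribute names are hypothetical names chosen for illustration. The status codes 2 (running) and 5 (held) follow Condor's JobStatus numbering.

```cpp
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for a ClassAd: a flat map of integer attributes.
using Ad = std::map<std::string, int>;

// Condor JobStatus codes used in this sketch.
constexpr int kRunning = 2;
constexpr int kHeld = 5;

class LiveJobSketch {
public:
    explicit LiveJobSketch(Ad full) : full_ad_(std::move(full)) {}

    // Simulates the schedd updating an attribute of the live job.
    void setAttr(const std::string& name, int value) { full_ad_[name] = value; }

    // Returns the cached summary, refreshing only the volatile status fields.
    const Ad& getSummary() {
        if (!summary_built_) {
            // Expensive part: done once, then cached.
            summary_ad_ = full_ad_;
            summary_built_ = true;
        }
        // Cheap part: always re-copy status so a cached summary never
        // reports a stale JobStatus.
        copyIfPresent("JobStatus");
        copyIfPresent("EnteredCurrentStatus");
        return summary_ad_;
    }

private:
    void copyIfPresent(const std::string& name) {
        auto it = full_ad_.find(name);
        if (it != full_ad_.end()) {
            summary_ad_[name] = it->second;
        }
    }

    Ad full_ad_;
    Ad summary_ad_;
    bool summary_built_ = false;
};
```

Without the two `copyIfPresent` calls, a job whose summary was first requested while it was running would keep reporting status 2 after being held, which matches the behavior described in this bug.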

Comment 3 Pete MacKinnon 2011-06-02 18:15:14 UTC
Created attachment 502595
qmf patch

Comment 4 Pete MacKinnon 2011-06-02 18:15:52 UTC
Created attachment 502596
aviary patch

Comment 5 Pete MacKinnon 2011-06-03 13:34:39 UTC
QMF condor_job_server test procedure:

1) submit new job (either via qmf or cmd line)
2) use qpid-tool to get the submission summary while the job is still active
(i.e., not COMPLETED or REMOVED) -> "call XXX GetJobSummaries"
3) make note of the JobStatus (IDLE or RUNNING)
4) put job on hold (qmf or cmd line)
5) get summary again and note the new JobStatus (HELD)
6) release job (qmf or cmd line)
7) get summary again and note the new JobStatus (IDLE or RUNNING)

This test needs to allow for the combined latency of condor and QMF updates (10-30 seconds?).
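
One way to allow for that latency in an automated version of this procedure is to poll for the expected JobStatus until a deadline instead of checking once. The helper below is a sketch under assumptions: `waitForStatus` and the `fetchStatus` callback are hypothetical names, and how the callback actually obtains the status (for example by invoking GetJobSummaries through a QMF client) is left to the caller.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Polls fetchStatus until it returns `expected` or `timeout` elapses.
// fetchStatus is a caller-supplied callback (hypothetical) that returns
// the JobStatus currently reported in the job summary.
// Returns true if the expected status was seen before the deadline.
bool waitForStatus(const std::function<int()>& fetchStatus,
                   int expected,
                   std::chrono::milliseconds timeout,
                   std::chrono::milliseconds interval) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (true) {
        if (fetchStatus() == expected) {
            return true;
        }
        if (std::chrono::steady_clock::now() >= deadline) {
            return false;
        }
        std::this_thread::sleep_for(interval);
    }
}
```

For steps 5 and 7, a timeout somewhat above the 30 second upper estimate with a poll interval of a few seconds would be a reasonable starting point; the exact values are a judgment call.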

Comment 6 Martin Kudlej 2011-06-09 14:15:22 UTC
Tested on RHEL 5.6/6.1 x i386/x86_64: it does not work with condor-7.6.1-0.9 and it works with condor-7.6.1-0.10. --> VERIFIED