Bug 698269

Summary: RFE: allow "remove failed" feature to kill condor jobs in any state, not just "error"
Product: [Retired] CloudForms Cloud Engine
Reporter: wes hayutin <whayutin>
Component: aeolus-conductor
Assignee: Mike Orazi <morazi>
Status: CLOSED CURRENTRELEASE
QA Contact: wes hayutin <whayutin>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 1.0.0
CC: dajohnso, deltacloud-maint, msolberg, ssachdev, tsedovic
Target Milestone: rc
Keywords: FutureFeature
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
URL: http://dell-pe1955-01.rhts.eng.bos.redhat.com/conductor/resources/instances
Whiteboard:
Fixed In Version:
Doc Type: Enhancement
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2011-09-29 16:04:22 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 709348    

Description wes hayutin 2011-04-20 14:36:26 UTC
Description of problem:

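When starting and stopping instances, it is possible to get condor into a state where jobs no longer run; the queue listing below shows jobs submitted the previous day that are still idle (ST = I) with zero run time.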


[root@dell-pe1955-01 ~]# date
Wed Apr 20 10:34:08 EDT 2011
[root@dell-pe1955-01 ~]# 

-- Submitter: dell-pe1955-01.rhts.eng.bos.redhat.com : <10.16.65.121:52830> : dell-pe1955-01.rhts.eng.bos.redhat.com
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
  20.0   aeolus          4/19 17:25   0+00:00:00 I  0   0.0  job_instance95g25_
  21.0   aeolus          4/19 22:55   0+00:00:00 I  0   0.0  job_instancepy97c_
  22.0   aeolus          4/19 23:10   0+00:00:00 I  0   0.0  job_asdf_34      


These instances show up as "new" in the Cloud Engine web UI.

Please allow the "remove failed" function to run condor_rm on the jobs associated with these instances, whatever state they are in.
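
To illustrate the request, here is a minimal Python sketch that collects job IDs from a queue listing like the one above regardless of the ST column, and builds a condor_rm command line for them. The function names are hypothetical and not part of Conductor, the parsing assumes the column layout shown in this report, and the command is only built, not executed.

```python
import re

# Matches a queue row: job ID (cluster.proc), then the owner column.
JOB_ROW = re.compile(r"^\s*(\d+\.\d+)\s+(\S+)\s")


def job_ids(condor_q_output, owner="aeolus"):
    """Return the IDs of every job owned by `owner`, whatever its state."""
    ids = []
    for line in condor_q_output.splitlines():
        match = JOB_ROW.match(line)
        if match and match.group(2) == owner:
            ids.append(match.group(1))
    return ids


def removal_command(ids):
    """Build (but do not run) a condor_rm command line for the given job IDs."""
    return ["condor_rm", *ids]


# Listing copied from this report; the header row is skipped by the regex.
listing = """\
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
  20.0   aeolus          4/19 17:25   0+00:00:00 I  0   0.0  job_instance95g25_
  21.0   aeolus          4/19 22:55   0+00:00:00 I  0   0.0  job_instancepy97c_
  22.0   aeolus          4/19 23:10   0+00:00:00 I  0   0.0  job_asdf_34
"""

print(removal_command(job_ids(listing)))
```

The state column is deliberately ignored, which is the point of the request: idle jobs are selected for removal just like jobs in an error state would be.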


[root@dell-pe1955-01 ~]# rpm -qa | grep aeolus
aeolus-conductor-0.0.3-6.el6.x86_64
aeolus-configure-2.0.0-8.el6.noarch
aeolus-conductor-doc-0.0.3-6.el6.x86_64
aeolus-conductor-daemons-0.0.3-6.el6.x86_64
[root@dell-pe1955-01 ~]# rpm -qa | grep condor
condor-7.6.0-2dcloud.el6.x86_64
[root@dell-pe1955-01 ~]#

Comment 1 Shveta 2011-05-02 10:10:36 UTC
    * Name west_insta
    * Status new
    * Public Addresses
    * Private Addresses
    * Operating system Fedora 13
    * Provider
    * Base Template new_temp_1
    * Architecture x86_64
    * Memory 1
    * Storage 1
    * Instantiation Time 02-May-2011 09:41:23
    * Uptime Error, could not calculate state time: state is not monitored
    * Current Alerts 0
    * Console Connection via SSH
    * Owner aeolus user
    * Shared to N/A
===========================================================
Such instances continue to show the "NEW" state even after they have failed.
Once an error has occurred, the state should change to "Error" or "Failed"
so that the instance can then be removed by "Remove Failed".

Comment 2 Chris Lalancette 2011-07-05 18:48:54 UTC
This may actually work now; there is a method called condormatic_instance_reset_error that should do the right thing.  If it still doesn't work, then it is probably a case of making sure we hook up that method with the right UI action.

Chris Lalancette

Comment 3 wes hayutin 2011-07-07 20:57:02 UTC
Funny thing about this bug: from what I can tell, the "remove failed" button has been removed.

I need it back :)  I can't remove instances in an "error" state now :)

I think athomas should bring it back as just a "remove" button, if the function works as Chris mentioned.

Comment 4 Tomas Sedovic 2011-07-08 16:31:06 UTC
How about this: we modify the Stop button to stop the instance no matter what.

If it's running, we stop it gracefully (just like we do now) and if it's in a different state, we stop it using the `condormatic_instance_reset_error` method that Chris mentioned.

That way you will always be able to delete the deployment that owns the instance.

Here's a quick patch:

https://fedorahosted.org/pipermail/aeolus-devel/2011-July/002952.html

Possibly, we can use the same mechanism for deleting deployables: i.e. if a deployable has instances that aren't stopped or cannot be stopped cleanly, we stop them and delete the deployable.
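
The stop logic proposed in this comment could be sketched roughly as follows. This is not the linked patch: it is a Python illustration in which `graceful_stop` and `reset_error` are hypothetical callables standing in for Conductor's normal stop path and `condormatic_instance_reset_error`, and the state strings are assumptions.

```python
def stop_instance(instance, graceful_stop, reset_error):
    """Stop an instance no matter what state it is in.

    graceful_stop and reset_error are hypothetical stand-ins for the
    normal stop path and condormatic_instance_reset_error.
    """
    if instance["state"] == "running":
        # Running instances are stopped the usual, graceful way.
        graceful_stop(instance)
        return "graceful"
    # Any other state (new, error, ...) goes through the reset path.
    reset_error(instance)
    return "forced"


def stop_all(instances, graceful_stop, reset_error):
    """Stop every instance of a deployable so that it can then be deleted."""
    return [stop_instance(i, graceful_stop, reset_error) for i in instances]


# Record which path each instance takes instead of touching a real backend.
calls = []
instances = [
    {"name": "a", "state": "running"},
    {"name": "b", "state": "new"},
]
result = stop_all(
    instances,
    graceful_stop=lambda i: calls.append(("stop", i["name"])),
    reset_error=lambda i: calls.append(("reset", i["name"])),
)
print(result)
```

Passing the two stop paths in as callables keeps the sketch independent of Conductor's real classes; only the branching on state is being illustrated.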

Comment 5 wes hayutin 2011-07-08 18:28:48 UTC
*** Bug 719962 has been marked as a duplicate of this bug. ***

Comment 6 wes hayutin 2011-08-01 19:33:20 UTC
Not sure if this got into the build. Assigning to morazi to double-check; I didn't see a git hash.

Comment 7 wes hayutin 2011-09-28 16:40:09 UTC
Making sure all the bugs are at the right version for future queries.

Comment 9 wes hayutin 2011-09-29 16:04:22 UTC
Condor is gone (the rpm query for it below returns nothing), so moving to closed.


[root@unused bin]# rpm -qa | grep aeolus
aeolus-conductor-doc-0.4.0-0.20110929145941git7594098.fc15.noarch
aeolus-conductor-0.4.0-0.20110929145941git7594098.fc15.noarch
aeolus-conductor-daemons-0.4.0-0.20110929145941git7594098.fc15.noarch
aeolus-conductor-devel-0.4.0-0.20110929145941git7594098.fc15.noarch
aeolus-all-0.4.0-0.20110929145941git7594098.fc15.noarch
rubygem-aeolus-image-0.1.0-3.20110919115936gitd1d24b4.fc15.noarch
aeolus-configure-2.0.2-4.20110926142838git5044e56.fc15.noarch
[root@unused bin]# rpm -qa | grep condor
[root@unused bin]#