Bug 698269 - RFE: allow "remove failed" feature to kill condor jobs in any state, not just "error"
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: CloudForms Cloud Engine
Classification: Retired
Component: aeolus-conductor
Version: 1.0.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Assignee: Mike Orazi
QA Contact: wes hayutin
URL: http://dell-pe1955-01.rhts.eng.bos.re...
Whiteboard:
Depends On:
Blocks: ce-p2-beta
Reported: 2011-04-20 14:36 UTC by wes hayutin
Modified: 2012-01-26 12:23 UTC (History)

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2011-09-29 16:04:22 UTC
Embargoed:



Description wes hayutin 2011-04-20 14:36:26 UTC
Description of problem:

When starting/stopping instances, it is possible to get condor into a state where jobs sit idle (ST "I", zero run time) and never start, as in the queue listing below.


[root@dell-pe1955-01 ~]# date
Wed Apr 20 10:34:08 EDT 2011
[root@dell-pe1955-01 ~]# 

-- Submitter: dell-pe1955-01.rhts.eng.bos.redhat.com : <10.16.65.121:52830> : dell-pe1955-01.rhts.eng.bos.redhat.com
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
  20.0   aeolus          4/19 17:25   0+00:00:00 I  0   0.0  job_instance95g25_
  21.0   aeolus          4/19 22:55   0+00:00:00 I  0   0.0  job_instancepy97c_
  22.0   aeolus          4/19 23:10   0+00:00:00 I  0   0.0  job_asdf_34      


These instances show up as "new" in the cloud engine webui.

Please allow the "remove failed" function to condor_rm the jobs associated with these instances, in any state.
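
For illustration, the request above could be sketched as follows: read the queue listing, ignore the ST column entirely, and build one condor_rm call for every job owned by aeolus. This is only a sketch. The helper names `parse_condor_q` and `condor_rm_command` are made up for this example and are not part of aeolus-conductor; the sample text is the listing from this report. The only condor behaviour assumed is that condor_rm accepts job IDs in cluster.proc form.

```ruby
# Sketch only: parse a condor queue listing and build a condor_rm command
# for every job of one owner, whatever the ST (state) column says.
# parse_condor_q and condor_rm_command are hypothetical helper names.

CONDOR_Q_SAMPLE = <<~OUT
   ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
    20.0   aeolus          4/19 17:25   0+00:00:00 I  0   0.0  job_instance95g25_
    21.0   aeolus          4/19 22:55   0+00:00:00 I  0   0.0  job_instancepy97c_
    22.0   aeolus          4/19 23:10   0+00:00:00 I  0   0.0  job_asdf_34
OUT

# Returns an array of hashes, one per job row in the listing.
def parse_condor_q(text)
  text.each_line.map do |line|
    fields = line.split
    # Job rows start with a "cluster.proc" id such as "20.0";
    # header and blank lines are skipped.
    next unless fields.first.to_s.match?(/\A\d+\.\d+\z/)
    { id: fields[0], owner: fields[1], state: fields[5], cmd: fields[8] }
  end.compact
end

# Builds the argument list for condor_rm, deliberately not filtering on state.
# Returns nil when there is nothing to remove.
def condor_rm_command(jobs, owner: "aeolus")
  ids = jobs.select { |job| job[:owner] == owner }.map { |job| job[:id] }
  ids.empty? ? nil : ["condor_rm", *ids]
end

jobs = parse_condor_q(CONDOR_Q_SAMPLE)
p condor_rm_command(jobs)
```

The command is returned as an array rather than executed, so it can be inspected first and then passed to something like `system(*cmd)` on a host where condor is installed.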


[root@dell-pe1955-01 ~]# rpm -qa | grep aeolus
aeolus-conductor-0.0.3-6.el6.x86_64
aeolus-configure-2.0.0-8.el6.noarch
aeolus-conductor-doc-0.0.3-6.el6.x86_64
aeolus-conductor-daemons-0.0.3-6.el6.x86_64
[root@dell-pe1955-01 ~]# rpm -qa | grep condor
condor-7.6.0-2dcloud.el6.x86_64
[root@dell-pe1955-01 ~]#

Comment 1 Shveta 2011-05-02 10:10:36 UTC
    * Name west_insta
    * Status new
    * Public Addresses
    * Private Addresses
    * Operating system Fedora 13
    * Provider
    * Base Template new_temp_1
    * Architecture x86_64
    * Memory 1
    * Storage 1
    * Instantiation Time 02-May-2011 09:41:23
    * Uptime Error, could not calculate state time: state is not monitored
    * Current Alerts 0
    * Console Connection via SSH
    * Owner aeolus user
    * Shared to N/A
===========================================================
Such instances continue to show the "NEW" state even after they have failed.
Once an error has occurred, the state should change to "Error" or "Failed",
so that the instance can then be removed by "Remove Failed".

Comment 2 Chris Lalancette 2011-07-05 18:48:54 UTC
This may actually work now; there is a method called condormatic_instance_reset_error that should do the right thing.  If it still doesn't work, then it is probably a case of making sure we hook up that method with the right UI action.

Chris Lalancette

Comment 3 wes hayutin 2011-07-07 20:57:02 UTC
Funny thing about this bug: from what I can tell, the "remove failed" button has been removed.

I need it back :)  I can't remove instances in an "error" state now :)

I think athomas should bring it back with just a "remove" if the function works as Chris mentioned.

Comment 4 Tomas Sedovic 2011-07-08 16:31:06 UTC
How about this: we modify the Stop button to stop the instance no matter what.

If it's running, we stop it gracefully (just like we do now) and if it's in a different state, we stop it using the `condormatic_instance_reset_error` method that Chris mentioned.

That way you will always be able to delete the deployment that owns the instance.

Here's a quick patch:

https://fedorahosted.org/pipermail/aeolus-devel/2011-July/002952.html

Possibly, we can use the same mechanism for deleting deployables: i.e. if the deployable has instances that aren't stopped or cannot be stopped cleanly, we stop them and then delete the deployable.
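
A rough sketch of the Stop behaviour proposed in this comment, for illustration only; it is not the linked patch. The `Instance` class, `graceful_stop` and `force_stop` are hypothetical stand-ins. `condormatic_instance_reset_error` is stubbed here under the name mentioned in comment 2; what the real method does internally is not shown in this report, so the stub only records that it was called and marks the instance stopped, which is an assumption.

```ruby
# Sketch only. Instance, graceful_stop and force_stop are hypothetical;
# only the name condormatic_instance_reset_error comes from this bug.

class Instance
  attr_reader :name, :state, :log

  def initialize(name, state)
    @name = name
    @state = state
    @log = [] # records which stop path was taken
  end

  def mark_stopped!
    @state = "stopped"
  end
end

# Stand-in for the existing graceful stop path used for running instances.
def graceful_stop(instance)
  instance.log << :graceful_stop
  instance.mark_stopped!
end

# Stub with the same name as the method mentioned in comment 2.
# Assumption: after the reset the instance can be treated as stopped.
def condormatic_instance_reset_error(instance)
  instance.log << :reset_error
  instance.mark_stopped!
end

# Stop the instance no matter what state it is in.
def force_stop(instance)
  if instance.state == "running"
    graceful_stop(instance)
  else
    condormatic_instance_reset_error(instance)
  end
  instance
end

running = force_stop(Instance.new("web01", "running"))
stuck   = force_stop(Instance.new("west_insta", "new"))
p [running.log, stuck.log]
```

With a single entry point like this, deleting a deployable could call `force_stop` on each instance first, so that instances stuck in "new" or "error" no longer block the delete.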

Comment 5 wes hayutin 2011-07-08 18:28:48 UTC
*** Bug 719962 has been marked as a duplicate of this bug. ***

Comment 6 wes hayutin 2011-08-01 19:33:20 UTC
Not sure if this got into the build. Assigning to morazi to double check; I didn't see a git hash.

Comment 7 wes hayutin 2011-09-28 16:40:09 UTC
making sure all the bugs are at the right version for future queries

Comment 9 wes hayutin 2011-09-29 16:04:22 UTC
condor is gone (no condor package is installed any more, see below), so moving to closed.


[root@unused bin]# rpm -qa | grep aeolus
aeolus-conductor-doc-0.4.0-0.20110929145941git7594098.fc15.noarch
aeolus-conductor-0.4.0-0.20110929145941git7594098.fc15.noarch
aeolus-conductor-daemons-0.4.0-0.20110929145941git7594098.fc15.noarch
aeolus-conductor-devel-0.4.0-0.20110929145941git7594098.fc15.noarch
aeolus-all-0.4.0-0.20110929145941git7594098.fc15.noarch
rubygem-aeolus-image-0.1.0-3.20110919115936gitd1d24b4.fc15.noarch
aeolus-configure-2.0.2-4.20110926142838git5044e56.fc15.noarch
[root@unused bin]# rpm -qa | grep condor
[root@unused bin]#

