Description of problem:
It is possible, when starting/stopping instances, to get condor into a state where things no longer work.

[root@dell-pe1955-01 ~]# date
Wed Apr 20 10:34:08 EDT 2011
[root@dell-pe1955-01 ~]# condor_q
-- Submitter: dell-pe1955-01.rhts.eng.bos.redhat.com : <10.16.65.121:52830> : dell-pe1955-01.rhts.eng.bos.redhat.com
 ID      OWNER       SUBMITTED     RUN_TIME ST PRI SIZE CMD
 20.0    aeolus     4/19 17:25   0+00:00:00 I  0   0.0  job_instance95g25_
 21.0    aeolus     4/19 22:55   0+00:00:00 I  0   0.0  job_instancepy97c_
 22.0    aeolus     4/19 23:10   0+00:00:00 I  0   0.0  job_asdf_34

These instances show up as "new" in the cloud engine web UI. Please allow the "remove failed" function to condor_rm the jobs associated with these instances, in any state.

[root@dell-pe1955-01 ~]# rpm -qa | grep aeolus
aeolus-conductor-0.0.3-6.el6.x86_64
aeolus-configure-2.0.0-8.el6.noarch
aeolus-conductor-doc-0.0.3-6.el6.x86_64
aeolus-conductor-daemons-0.0.3-6.el6.x86_64
[root@dell-pe1955-01 ~]# rpm -qa | grep condor
condor-7.6.0-2dcloud.el6.x86_64
[root@dell-pe1955-01 ~]#
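In the meantime the stuck jobs can be cleared by hand with condor_rm (e.g. `condor_rm 20.0` for the first job above). What "remove failed" should do is essentially the same thing plus the state bookkeeping; a minimal sketch in conductor terms, assuming the instance records its condor job id in a `condor_job_id` attribute (that attribute and the helper names below are illustrative, not conductor's actual code):

    # Illustrative sketch only -- attribute and method names are assumptions.
    # condor_rm removes a job from the queue whatever state it is in
    # (idle, held, running, ...), which is what we need here.
    def kill_condor_job(instance)
      system("condor_rm", instance.condor_job_id.to_s)
    end

    def remove_failed(instance)
      kill_condor_job(instance)
      # Once the job is gone, mark the instance so the UI stops showing "new".
      instance.update_attribute(:state, Instance::STATE_STOPPED)
    end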
* Name: west_insta
* Status: new
* Public Addresses:
* Private Addresses:
* Operating system: Fedora 13
* Provider:
* Base Template: new_temp_1
* Architecture: x86_64
* Memory: 1
* Storage: 1
* Instantiation Time: 02-May-2011 09:41:23
* Uptime: Error, could not calculate state time: state is not monitored
* Current Alerts: 0
* Console Connection: via SSH
* Owner: aeolus user
* Shared to: N/A

===========================================================

Such instances continue to show the "new" state even after they have failed. Once an error occurs, the state should change to "Error" or "Failed", at which point the instance can be removed with "Remove Failed".
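Put differently, wherever conductor learns that a job has failed, it should flip the instance's state; a one-line sketch (the callback name, hook point, and state constant are assumptions for illustration):

    # Sketch: on notification that the condor job failed, leave "new"
    # so that "Remove Failed" becomes applicable. Names are illustrative.
    def condor_job_failed(instance)
      instance.update_attribute(:state, Instance::STATE_ERROR)
    end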
This may actually work now; there is a method called condormatic_instance_reset_error that should do the right thing. If it still doesn't work, then it is probably a case of making sure we hook up that method with the right UI action. Chris Lalancette
Funny thing about this bug: from what I can tell, the "remove failed" button has been removed. I need it back :) I can't remove instances in an "error" state now :) I think athomas should bring it back as just a "Remove", if the function works as Chris mentioned.
How about this: we modify the Stop button to stop the instance no matter what. If it's running, we stop it gracefully (just like we do now), and if it's in a different state, we stop it using the `condormatic_instance_reset_error` method that Chris mentioned. That way you will always be able to delete the deployment that owns the instance.

Here's a quick patch: https://fedorahosted.org/pipermail/aeolus-devel/2011-July/002952.html

Possibly we can use the same mechanism for deleting deployables: i.e., if a deployable has instances that aren't stopped or cannot be stopped cleanly, we stop them first and then delete the deployable.
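Roughly the shape of the controller change (a sketch only, not the patch above; the state constants and the graceful-stop call are illustrative):

    # Sketch of the proposed Stop behavior; see the linked patch for the
    # real implementation. Names here are illustrative.
    def stop
      @instance = Instance.find(params[:id])
      if @instance.state == Instance::STATE_RUNNING
        # Running instances keep the current graceful shutdown path.
        @instance.stop
      else
        # Any other state (new, pending, error, ...) gets force-reset so
        # the owning deployment can always be deleted.
        condormatic_instance_reset_error(@instance)
      end
    end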
*** Bug 719962 has been marked as a duplicate of this bug. ***
Not sure if this got into the build... assigning to morazi to double-check. I didn't see a git hash.
making sure all the bugs are at the right version for future queries
condor is gone... moving to CLOSED.

[root@unused bin]# rpm -qa | grep aeolus
aeolus-conductor-doc-0.4.0-0.20110929145941git7594098.fc15.noarch
aeolus-conductor-0.4.0-0.20110929145941git7594098.fc15.noarch
aeolus-conductor-daemons-0.4.0-0.20110929145941git7594098.fc15.noarch
aeolus-conductor-devel-0.4.0-0.20110929145941git7594098.fc15.noarch
aeolus-all-0.4.0-0.20110929145941git7594098.fc15.noarch
rubygem-aeolus-image-0.1.0-3.20110919115936gitd1d24b4.fc15.noarch
aeolus-configure-2.0.2-4.20110926142838git5044e56.fc15.noarch
[root@unused bin]# rpm -qa | grep condor
[root@unused bin]#