Bug 796528

Summary: [RFE] Pending (forever) Instances should be allowed to delete
Product: [Retired] CloudForms Cloud Engine
Reporter: Shveta <ssachdev>
Component: aeolus-conductor
Assignee: Jan Provaznik <jprovazn>
Status: CLOSED CURRENTRELEASE
QA Contact: Dave Johnson <dajohnso>
Severity: high
Priority: high
Version: 1.0.0
CC: akarol, athomas, cpelland, deltacloud-maint, dgao, gfidente, hbrock, matt.wagner, morazi, psharma, ssachdev, whayutin
Target Milestone: rc
Keywords: FutureFeature, Triaged, ZStream
Hardware: Unspecified
OS: Unspecified
Doc Type: Enhancement
Last Closed: 2012-12-13 19:48:26 UTC
Bug Blocks: 826130
Attachments (description, flags):
delete_pending_instance, none
problem snapshot, none
configserver_down, none

Description Shveta 2012-02-23 06:08:19 UTC
Created attachment 565194
delete_pending_instance

Description of problem:


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Launched a few instances; some vSphere and RHEV instances failed to launch and showed pending forever.
2. When I tried to delete them, I wasn't allowed to.
3. We should be allowed to delete such instances.
  
Actual results:


Expected results:


Additional info:

rpm -qa|grep aeolus
aeolus-conductor-daemons-0.8.0-33.el6.noarch
aeolus-configure-2.5.0-14.el6.noarch
rubygem-aeolus-image-0.3.0-8.el6.noarch
aeolus-conductor-0.8.0-33.el6.noarch
rubygem-aeolus-cli-0.3.0-9.el6.noarch
aeolus-conductor-doc-0.8.0-33.el6.noarch
aeolus-all-0.8.0-33.el6.noarch

Comment 1 Hugh Brock 2012-02-27 16:34:15 UTC
Says Wes: This also happens to instances that get caught in NEW.

Comment 2 wes hayutin 2012-02-27 19:10:50 UTC
*** Bug 797172 has been marked as a duplicate of this bug. ***

Comment 3 Jan Provaznik 2012-03-05 14:00:49 UTC
A patch was sent: https://fedorahosted.org/pipermail/aeolus-devel/2012-March/009383.html

Comment 4 Jan Provaznik 2012-03-08 10:00:56 UTC
With the current code, it should not be possible to have an instance stuck in the 'pending' state forever (provided dbomatic is running).
But it is possible to have an instance stuck in the 'new' state:
1. if no suitable provider account (where a deployment could be launched) was found
2. if an instance of a deployment uses launch-time params + config server, and uploading to the config server fails for some reason (e.g. the config server is down).

There was a bug in the previous version of my patch for this, so a fix is not pushed yet. I have 2 options now:
1. create a 'few lines' hotfix for this bug (if there is a chance this BZ will be changed to a blocker for 1.0)
2. create a bigger patch which does at least a major refactoring of the launch process, making the code more readable and sane, and which also fixes this bug.

Comment 5 Jan Provaznik 2012-03-09 07:40:04 UTC
sent updated patchset: https://fedorahosted.org/pipermail/aeolus-devel/2012-March/009488.html

Comment 6 Jan Provaznik 2012-03-12 13:55:06 UTC
pushed to master in 4 commits:
86987cd9194c344272c0cfff313edcdf66df80c0
3dd5f304b8458528d15ddf462e0db7622b56dd09
56016671e651cf17bb0bc5c29b49c5aa55e94536
7a8502b846a819c27fa141621220ca0bbaeac23c

Comment 7 Jan Provaznik 2012-03-14 10:09:37 UTC
This patchset should be pushed to the branch for the 1.0.1 release (once this branch is created).

Comment 8 Jan Provaznik 2012-03-27 17:59:04 UTC
Note for 1.0.0:
It's much easier to get an instance stuck in the 'new' state because it's possible to select a realm on the launch overview page. If a user selects a realm where a deployment can't be launched, instances are left in the 'new' state.

And there is no way to get rid of 'new' instances. Probably the only option is to stop dc-api; then instances are marked as vanished and can be stopped :/.

This is fixed by the patch in comment 6 which is already pushed to master, but it's not part of 1.0.0.

Comment 9 pushpesh sharma 2012-04-05 06:34:01 UTC
*** Bug 804909 has been marked as a duplicate of this bug. ***

Comment 11 pushpesh sharma 2012-04-20 07:33:41 UTC
Marking as assigned; not able to delete a pending instance.

[root@dhcp201-169 ~]# rpm -qa|grep aeolus
aeolus-conductor-daemons-0.8.7-1.el6.noarch
rubygem-aeolus-image-0.3.0-12.el6.noarch
aeolus-conductor-0.8.7-1.el6.noarch
rubygem-aeolus-cli-0.3.1-1.el6.noarch
aeolus-conductor-doc-0.8.7-1.el6.noarch
aeolus-all-0.8.7-1.el6.noarch
aeolus-configure-2.5.2-1.el6.noarch

Comment 12 pushpesh sharma 2012-04-20 07:34:14 UTC
Created attachment 578905
problem snapshot

Comment 13 Jan Provaznik 2012-04-20 07:51:28 UTC
I probably didn't explain this clearly in my comments, so let me explain again. As Shveta wrote in the first step of the description, the problem is that in some situations instances can be in the new/pending state forever:

1. Launched few instances , some vsphere and rhev instances failed to launch and showed pending forever

But the solution is _not_ to allow deleting instances in the pending state. Instead, we make sure that instances never get stuck in the new/pending state if something goes wrong when launching a deployment: the 'create_failed' state is set for instances when an error occurs.
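A minimal sketch of the approach described above, assuming a simplified model. The state names ('new', 'pending', 'create_failed') come from this thread; the `Instance` class, the `launch_instance` function and the `start` callback are hypothetical names used only for illustration and are not taken from the Conductor code base.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# State names as used in this thread.
STATE_NEW = "new"
STATE_PENDING = "pending"
STATE_CREATE_FAILED = "create_failed"


@dataclass
class Instance:
    """Hypothetical stand-in for an instance record."""
    name: str
    state: str = STATE_NEW
    last_error: Optional[str] = None


def launch_instance(instance: Instance, start: Callable[[Instance], None]) -> Instance:
    """Try to launch an instance; never leave it in 'new' when launching fails.

    `start` is a hypothetical callback covering everything that can go wrong
    during launch (no suitable provider account, config server upload, ...).
    """
    try:
        start(instance)
        # The launch request was accepted; the state is expected to change later.
        instance.state = STATE_PENDING
    except Exception as exc:
        # Any launch error ends in a terminal state instead of a stuck one.
        instance.state = STATE_CREATE_FAILED
        instance.last_error = str(exc)
    return instance
```

With this shape, an instance whose launch raises an error ends up in 'create_failed' with the error text recorded, rather than staying in 'new' indefinitely.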

Comment 14 pushpesh sharma 2012-04-20 09:05:34 UTC
From comment #13, can I conclude that any situation in which a deployment shows the new/pending state is a bug?

Comment 15 Jan Provaznik 2012-04-20 09:26:39 UTC
Mostly yes.
- you should never see an instance in 'new' state
- you should never see an instance in 'pending' state forever (it's ok if it's in pending state but its state is changed to something else later)
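The two rules above could be checked along these lines. This is only a sketch: the thread does not define how long "forever" is, so the 30-minute `pending_timeout` is an arbitrary assumption, and the function name and the `(name, state, since)` tuple layout are hypothetical.

```python
from datetime import datetime, timedelta


def find_suspect_instances(instances, now, pending_timeout=timedelta(minutes=30)):
    """Return (name, reason) pairs for instances that violate the two rules.

    `instances` is an iterable of (name, state, since) tuples, where `since`
    is the datetime at which the instance entered its current state.
    The timeout value is an assumption, not something stated in this bug.
    """
    suspects = []
    for name, state, since in instances:
        if state == "new":
            # Rule 1: an instance should never be seen in 'new'.
            suspects.append((name, "visible in 'new' state"))
        elif state == "pending" and now - since > pending_timeout:
            # Rule 2: 'pending' is fine only if it changes later.
            suspects.append((name, "in 'pending' longer than the timeout"))
    return suspects
```

An instance that has only recently entered 'pending' is not reported, which matches the note that a temporary pending state is acceptable.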

Comment 16 wes hayutin 2012-04-26 20:03:16 UTC
If app blueprints are misconfigured or the config server is unreachable then instances still can get into a state where they remain in new/pending.  To be clear aeolus will show instances in "new" and apps in "pending" 

This bug fails QA.

To reproduce, use the WordPress app blueprint XML file found at:
http://cf-srv.lab.bos.redhat.com/pub/cffiles/services/jlabocki-wordpress.xml
I did not have a config server w/ the ssh hack, so I left it blank, just to see what would happen.

IMHO fixing the bug at this point is too risky, I believe the fix was already reverted once.

Comment 17 wes hayutin 2012-04-26 20:03:37 UTC
tested w/ 

[root@qeblade30 ~]# rpm -qa | grep aeolus
aeolus-configure-2.5.3-1.el6.noarch
aeolus-conductor-0.8.13-1.el6_2.noarch
aeolus-all-0.8.13-1.el6_2.noarch
rubygem-aeolus-cli-0.3.1-1.el6.noarch
rubygem-aeolus-image-0.3.0-12.el6.noarch
aeolus-conductor-daemons-0.8.13-1.el6_2.noarch
aeolus-conductor-doc-0.8.13-1.el6_2.noarch

Comment 18 Hugh Brock 2012-04-26 21:12:21 UTC
We've already evaluated this one and determined it's not a blocker. Clearing the blocker flag. We'll revisit for 1.1.

Comment 19 Jan Provaznik 2012-05-02 13:27:06 UTC
(In reply to comment #16)
> If app blueprints are misconfigured or the config server is unreachable then
> instances still can get into a state where they remain in new/pending.  To be
> clear aeolus will show instances in "new" and apps in "pending" 
> 
> This bug fails qa..
> 

I wasn't able to reproduce this BZ on the master branch, where the patchset for this BZ is pushed -> switching this BZ to 'modified'.

Comment 20 Matt Wagner 2012-05-23 19:13:45 UTC
This patch also fixes #809621 -- when this patch is fixed and this is moved to ON_QA, please update https://bugzilla.redhat.com/show_bug.cgi?id=809621 as well.

Comment 21 Dave Johnson 2012-05-25 15:52:09 UTC
hmm, having a hard time reproducing this, we are working on the config server scenario now.  This is what I tried...

launch deployable...stop dbomatic...instance stuck in pending...cannot delete as expected

launch deployable with long name...when dc appends blueprint name, longer than 50 chars...launch fails with create_failed... deployment in stopped status though so it deletes on both v1.0 and patched (comment 6) v1.0

any other suggestions besides config server scenario?

Comment 22 dgao 2012-05-25 16:02:01 UTC
Created attachment 586897
configserver_down

With configserver down, the instance would have a create_failed state instead of a new/pending state. Subsequently, you can delete the application. The attached patch worked, but it would be better if the error were clearer about why the instance failed to launch.

Comment 23 Jan Provaznik 2012-05-29 07:51:28 UTC
So according to QA, the patch seems to work fine, except that the error message could be more explicit when a launch fails because the config server is down.
I need to know whether this patch will be pushed to 1.0.1, because it fixes 809621, which should be part of 1.0.1, and the format of the patch for 817114 depends on whether this patchset is pushed.

Comment 24 Jan Provaznik 2012-05-29 12:57:17 UTC
re: the error message "503 service unavailable" - this is not related to this BZ - it happens w/o this patch too. I'll create a separate bug for this.

After discussing this with Angus, this patchset can be pushed to 1.0.1 - it fixes 809621 too.

Comment 26 dgao 2012-06-12 17:50:14 UTC
The patch introduced in this state creates an error for instances launched with userdata. The userdata that is supposed to be mounted and subsequently consumed by the audrey agent becomes un-mountable.

The following error has been observed on the instance itself:

[root@10-16-120-104 ~]# mount /dev/fd0 /media
mount: you must specify the filesystem type

which leads to:

[root@10-16-120-104 ~]# cat /var/log/audrey.log 
2012-04-19 14:38:47,312 - ERROR   : audrey:93 Failed accessing RHEVm user data.

Comment 29 Mike Orazi 2012-09-11 16:00:54 UTC
This is getting a little crufty.  Can we try to reproduce?

Comment 30 dgao 2012-09-20 14:35:43 UTC
according to aeolus-audrey-agent-0.4.10-1.el6cf.noarch

- if configserver is stopped and the instance is launched, the agent on the instance simply fails with a 503 error code. The instance itself is still in "running" state. 

However

- if dbomatic is stopped when the instance is launching, the instance is stuck in pending and cannot be deleted.

aeolus-conductor-0.13.7-1.el6cf.noarch

Comment 31 Jan Provaznik 2012-09-21 06:39:57 UTC
This bug was originally about instances getting stuck in the pending state forever even with dbomatic running. This should be fixed; if so, I suggest closing this bug.

The fact that dbomatic is not running and states are not being updated is *not* a reason to allow deletion of pending instances. To be clear: deleting pending instances should never be allowed.

If the RFE is now about handling the situation when dbomatic is not running, I would suggest creating a separate BZ for this, because this thread is becoming unclear.
The new BZ might be something like "Instance states are not updated if dbomatic is not running" (which is understandable), but the only thing we can do in such a situation is display a warning in the UI that some essential part of Aeolus is not running. This is a more general problem, though: a similar situation will happen if, for example, the delayed_job service is not running.
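As a rough illustration of the two points above (pending instances are never deletable, and the UI can only warn when an essential service is down), here is a sketch under stated assumptions. The service names dbomatic and delayed_job come from this comment; the function names, the warning wording, and the idea of passing in a set of running service names are hypothetical and do not describe how Conductor actually detects services.

```python
# Services named in this thread as essential; the tuple itself is an assumption.
REQUIRED_SERVICES = ("dbomatic", "delayed_job")


def delete_forbidden(state: str) -> bool:
    """Pending instances must never be deletable, whatever the cause.

    Other states may have their own rules; they are not modelled here.
    """
    return state == "pending"


def service_warnings(running_services) -> list:
    """Build UI warnings for required services that are not running.

    `running_services` is a set of service names assumed to be obtained
    elsewhere (how to detect them is outside the scope of this sketch).
    """
    return [
        f"Warning: {name} is not running; instance states may be out of date."
        for name in REQUIRED_SERVICES
        if name not in running_services
    ]
```

The deletion rule deliberately does not look at service status: a stopped dbomatic produces a warning, not a relaxed delete check.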

Comment 32 Giulio Fidente 2012-09-28 16:02:52 UTC
I'm facing an instance in state 'pending', which can't be removed, using:
aeolus-conductor-0.13.14-1.el6cf

The deployable in use is:
https://github.com/aeolusproject/audrey/blob/master/examples/deployables/deployable-sample-single-instance.xml

Instance-1 fails to start as I did not provide any value for the parameter 'service_1_param_2'. Instance-2 remains in 'pending' state.

Is this a valid use case? If so, what is the procedure needed to cancel the deployment job?

Comment 33 Giulio Fidente 2012-09-28 16:06:20 UTC
The deployable in use is this:
https://github.com/aeolusproject/audrey/blob/master/examples/deployables/deployable-sample-multi-instance.xml

_NOT_ the single instance I linked in comment 32

Comment 34 Giulio Fidente 2012-10-09 14:05:10 UTC
I have an instance stuck in state "NEW"; the template image was deleted from the provider before launching. This also doesn't seem to be migrating into any state which allows for deletion.

Comment 35 Jan Provaznik 2012-10-09 14:26:21 UTC
(In reply to comment #34)
> I've an instance stuck in state "NEW" , the template image was deleted from
> the provider before launching. This also doesn't seem to be migrating into
> any state which allows for deletion.

This case is a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=863383

Comment 36 Jan Provaznik 2012-10-10 10:41:08 UTC
(In reply to comment #32)
> I'm facing an instance in state 'pending', which can't be removed, using:
> aeolus-conductor-0.13.14-1.el6cf
> 
> The deployable in use is:
> https://github.com/aeolusproject/audrey/blob/master/examples/deployables/
> deployable-sample-single-instance.xml
> 
> Instance-1 fails to start as I did not provide any value for the parameter
> 'service_1_param_2'. Instance-2 remains in 'pending' state.
> 
> Is this a valid use case? If so, what is the procedure needed to cancel the
> deployment job?

Even if launch-time params are blank, the instance goes to the running state (IOW, missing params are not a reason for a stuck pending state). The only reason for a stuck pending state I can think of is that dbomatic was not running.

Can you still reproduce this? (For me it worked as expected)

Comment 37 Jan Provaznik 2012-10-10 15:15:55 UTC
Giulio Fidente was able to reproduce the stuck pending state. It turned out to be the same issue as https://bugzilla.redhat.com/show_bug.cgi?id=863383 and should be fixed by the same patch.

Comment 38 Giulio Fidente 2012-10-10 16:06:46 UTC
I'm closing the RFE as this has been implemented, just not working for a particular use case (which also should be fixed by the bugzilla in comment #37)

Comment 39 Angus Thomas 2012-12-10 15:46:48 UTC
Hi Giulio,

Could you please re-test this against 1.1?

According to Comment 37, and the bug which is linked from there, this ought to be resolved in CE 1.1.


Thanks,

Angus