Created attachment 565194 [details] delete_pending_instance

Description of problem:

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Launched a few instances; some vSphere and RHEV instances failed to launch and showed pending forever.
2. When I tried to delete them, I wasn't allowed to.
3. We should be allowed to delete such instances.

Actual results:

Expected results:

Additional info:
rpm -qa|grep aeolus
aeolus-conductor-daemons-0.8.0-33.el6.noarch
aeolus-configure-2.5.0-14.el6.noarch
rubygem-aeolus-image-0.3.0-8.el6.noarch
aeolus-conductor-0.8.0-33.el6.noarch
rubygem-aeolus-cli-0.3.0-9.el6.noarch
aeolus-conductor-doc-0.8.0-33.el6.noarch
aeolus-all-0.8.0-33.el6.noarch
Says Wes: This also happens to instances that get caught in NEW.
*** Bug 797172 has been marked as a duplicate of this bug. ***
a patch sent: https://fedorahosted.org/pipermail/aeolus-devel/2012-March/009383.html
With the current code, it should not be possible for an instance to stay stuck in the 'pending' state forever (provided dbomatic is running). But it is possible for an instance to get stuck in the 'new' state:
1. if no suitable provider account (where a deployment could be launched) was found
2. if an instance of a deployment uses launch-time params + config server, and uploading to the config server fails for some reason (e.g. the config server is down).

There was a bug in the previous version of my patch for this, so a fix is not pushed yet. I have 2 options now:
1. create a 'few lines' hotfix for this bug (if there is a chance this BZ will be changed to a blocker for 1.0)
2. create a bigger patch which does at least a major refactoring of the launch process, making the code more readable and sane, and which also fixes this bug.
sent updated patchset: https://fedorahosted.org/pipermail/aeolus-devel/2012-March/009488.html
pushed to master in 4 commits: 86987cd9194c344272c0cfff313edcdf66df80c0 3dd5f304b8458528d15ddf462e0db7622b56dd09 56016671e651cf17bb0bc5c29b49c5aa55e94536 7a8502b846a819c27fa141621220ca0bbaeac23c
This patchset should be pushed to branch for 1.0.1 release (once this branch is created).
Note for 1.0.0: It's much easier to get an instance stuck in the 'new' state because it's possible to select a realm on the launch overview page. If a user selects a realm where a deployment can't be launched, instances are left in the 'new' state, and there is no way to get rid of 'new' instances. Probably the only option is to stop dc-api; the instances are then marked as vanished and can be stopped :/. This is fixed by the patch in comment 6, which is already pushed to master, but it's not part of 1.0.0.
*** Bug 804909 has been marked as a duplicate of this bug. ***
Marking as assigned, not able to delete pending instance.
[root@dhcp201-169 ~]# rpm -qa|grep aeolus
aeolus-conductor-daemons-0.8.7-1.el6.noarch
rubygem-aeolus-image-0.3.0-12.el6.noarch
aeolus-conductor-0.8.7-1.el6.noarch
rubygem-aeolus-cli-0.3.1-1.el6.noarch
aeolus-conductor-doc-0.8.7-1.el6.noarch
aeolus-all-0.8.7-1.el6.noarch
aeolus-configure-2.5.2-1.el6.noarch
Created attachment 578905 [details] problem snapshot
My comments probably didn't make this clear, so let me explain again. As Shveta wrote in the first step of the description, the problem is that in some situations instances can stay in new/pending state forever:
1. Launched few instances , some vsphere and rhev instances failed to launch and showed pending forever
But the solution is _not_ to allow deleting instances in pending state. Instead, we make sure that instances never get stuck in new/pending state if something goes wrong when launching a deployment -> the 'create_failed' state is set for instances when an error occurs.
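The approach described above can be sketched roughly as follows. This is only an illustration in Python, not the Conductor code: the `Instance` class, `launch_deployment`, `launch_one`, and `failing_if_rhev` are hypothetical names, and only the state names 'new', 'pending', and 'create_failed' come from this thread.

```python
from dataclasses import dataclass

# State names taken from the thread; everything else here is hypothetical.
STATE_NEW = "new"
STATE_PENDING = "pending"
STATE_CREATE_FAILED = "create_failed"


@dataclass
class Instance:
    name: str
    state: str = STATE_NEW
    last_error: str = ""


class LaunchError(Exception):
    """Stand-in for any failure during launch (no provider match, config server down, ...)."""


def launch_deployment(instances, launch_one):
    """Try to launch each instance; on any error mark it 'create_failed'
    instead of leaving it in 'new'."""
    for inst in instances:
        try:
            launch_one(inst)
            inst.state = STATE_PENDING
        except Exception as exc:
            inst.state = STATE_CREATE_FAILED
            inst.last_error = str(exc)
    return instances


def failing_if_rhev(inst):
    # Hypothetical launcher that fails for one kind of instance.
    if inst.name.startswith("rhev"):
        raise LaunchError("config server unreachable")


result = launch_deployment([Instance("ec2-1"), Instance("rhev-1")], failing_if_rhev)
for inst in result:
    print(inst.name, inst.state, inst.last_error)
```

The point of the sketch is that every code path out of the launch step assigns a state other than 'new', so a failed launch ends in a state from which the instance can be cleaned up.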
From comment #13, can I conclude that whenever I am able to create a situation in which a deployment shows new/pending state, it is a bug?
Mostly yes.
- you should never see an instance in 'new' state
- you should never see an instance in 'pending' state forever (it's ok if it's in pending state but its state is changed to something else later)
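For QA purposes, the two rules above could be turned into a small check like the one below. This is a hypothetical Python sketch, not part of Aeolus: the `find_stuck` function and the tuple layout are made up for illustration, and the 30-minute `PENDING_TIMEOUT` is an arbitrary assumption standing in for "forever".

```python
from datetime import datetime, timedelta

# Assumption: how long 'pending' may last before we call it stuck.
PENDING_TIMEOUT = timedelta(minutes=30)


def find_stuck(instances, now):
    """Return names of instances that violate the two rules.

    `instances` is a list of (name, state, state_since) tuples.
    """
    stuck = []
    for name, state, since in instances:
        if state == "new":
            # Rule 1: an instance should never be seen in 'new'.
            stuck.append(name)
        elif state == "pending" and now - since > PENDING_TIMEOUT:
            # Rule 2: 'pending' is fine only if it changes later.
            stuck.append(name)
    return stuck


now = datetime(2012, 4, 1, 12, 0)
sample = [
    ("a", "new", now - timedelta(minutes=1)),
    ("b", "pending", now - timedelta(minutes=5)),
    ("c", "pending", now - timedelta(hours=2)),
    ("d", "running", now - timedelta(hours=5)),
]
stuck_names = find_stuck(sample, now)
print(stuck_names)
```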
If app blueprints are misconfigured or the config server is unreachable, then instances can still get into a state where they remain in new/pending. To be clear, aeolus will show instances in "new" and apps in "pending". This bug fails QA. To recreate, use the wordpress app blueprint XML file found at http://cf-srv.lab.bos.redhat.com/pub/cffiles/services/jlabocki-wordpress.xml I did not have a config server w/ the ssh hack, so I left it blank, just to see what would happen. IMHO fixing the bug at this point is too risky; I believe the fix was already reverted once.
tested w/
[root@qeblade30 ~]# rpm -qa | grep aeolus
aeolus-configure-2.5.3-1.el6.noarch
aeolus-conductor-0.8.13-1.el6_2.noarch
aeolus-all-0.8.13-1.el6_2.noarch
rubygem-aeolus-cli-0.3.1-1.el6.noarch
rubygem-aeolus-image-0.3.0-12.el6.noarch
aeolus-conductor-daemons-0.8.13-1.el6_2.noarch
aeolus-conductor-doc-0.8.13-1.el6_2.noarch
We've already evaluated this one and determined it's not a blocker. Clearing the blocker flag. We'll revisit for 1.1.
(In reply to comment #16)
> If app blueprints are misconfigured or the config server is unreachable then
> instances still can get into a state where they remain in new/pending. To be
> clear aeolus will show instances in "new" and apps in "pending"
>
> This bug fails qa..
>
I wasn't able to reproduce this BZ on master branch where the patchset for this BZ is pushed -> switching this BZ to 'modified'.
This patch also fixes #809621 -- when this patch is fixed and this is moved to ON_QA, please update https://bugzilla.redhat.com/show_bug.cgi?id=809621 as well.
Hmm, having a hard time reproducing this; we are working on the config server scenario now. This is what I tried:
- launch deployable... stop dbomatic... instance stuck in pending... cannot delete, as expected
- launch deployable with long name... when dc appends blueprint name, longer than 50 chars... launch fails with create_failed... deployment is in stopped status though, so it deletes on both v1.0 and patched (comment 6) v1.0
Any other suggestions besides the config server scenario?
Created attachment 586897 [details] configserver_down
With the configserver down, the instance gets a create_failed state instead of a new/pending state. Subsequently, you can delete the application. The attached patch worked, but it would be better if the error were clearer about why the instance failed to launch.
So according to QA the patch seems to work fine, except that the error message could be more explicit when launch fails because the config server is down. I need to know whether this patch will be pushed to 1.0.1 or not, because it fixes 809621, which should be part of 1.0.1, and the format of the patch for 817114 depends on the decision whether this patchset is pushed or not.
re: the error message "503 service unavailable" - this is not related to this BZ - it happens w/o this patch too. I'll create a separate bug for this. After discussing this with Angus, this patchset can be pushed to 1.0.1 - it fixes 809621 too.
The patch introduced in this state creates an error for instances launched with userdata. The userdata that is supposed to be mounted and subsequently consumed by the audrey agent becomes un-mountable. The following error has been observed on the instance itself:
[root@10-16-120-104 ~]# mount /dev/fd0 /media
mount: you must specify the filesystem type
which leads to:
[root@10-16-120-104 ~]# cat /var/log/audrey.log
2012-04-19 14:38:47,312 - ERROR : audrey:93 Failed accessing RHEVm user data.
This is getting a little crufty. Can we try to reproduce?
According to aeolus-audrey-agent-0.4.10-1.el6cf.noarch - if the configserver is stopped and the instance is launched, the agent on the instance simply fails with a 503 error code. The instance itself is still in "running" state. However - if dbomatic is stopped while the instance is launching, the instance gets stuck in pending and cannot be deleted.
aeolus-conductor-0.13.7-1.el6cf.noarch
This bug was originally about having instances stuck in pending state forever even with dbomatic running. This should be fixed; if so, I suggest closing this bug. The fact that dbomatic is not running and states are not being updated is *not* a reason to allow deletion of pending instances. To be more clear - deleting pending instances should never be allowed. If the RFE is now about handling the situation when dbomatic is not running, I would suggest creating a separate BZ for this because this thread is becoming unclear. The new BZ might be something like "Instance states are not updated if dbomatic is not running" (which is understandable), but the only thing we can do in such a situation is display some warning in the UI that some essential part of Aeolus is not running. Though this is a more general problem - a similar situation will happen if, for example, the delayed_job service is not running.
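To make the two points above concrete, here is a rough Python sketch. It is not how Conductor implements this: `can_delete`, `service_warnings`, and the exact contents of `DELETABLE_STATES` are assumptions for illustration (the thread only shows that create_failed and stopped deployments could be deleted, and names dbomatic and delayed_job as essential services).

```python
# Services named in this thread as essential for state updates.
ESSENTIAL_SERVICES = ("dbomatic", "delayed_job")

# Assumption: states from which deletion is allowed. 'pending' and 'new'
# are deliberately absent.
DELETABLE_STATES = {"create_failed", "stopped"}


def can_delete(state):
    """Deletion is decided by state alone, regardless of which services run."""
    return state in DELETABLE_STATES


def service_warnings(running):
    """Build UI warnings for essential services missing from `running`."""
    return [
        f"Warning: {name} is not running; instance states may be out of date."
        for name in ESSENTIAL_SERVICES
        if name not in running
    ]


print(can_delete("pending"))
print(service_warnings({"delayed_job"}))
```

In this sketch a stopped dbomatic changes only what the UI reports; it does not widen the set of states from which an instance can be deleted.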
I'm facing an instance in state 'pending', which can't be removed, using: aeolus-conductor-0.13.14-1.el6cf The deployable in use is: https://github.com/aeolusproject/audrey/blob/master/examples/deployables/deployable-sample-single-instance.xml Instance-1 fails to start as I did not provide any value for the parameter 'service_1_param_2'. Instance-2 remains in 'pending' state. Is this a valid use case? If so, what is the procedure needed to cancel the deployment job?
The deployable in use is this: https://github.com/aeolusproject/audrey/blob/master/examples/deployables/deployable-sample-multi-instance.xml _NOT_ the single instance I linked in comment 32
I have an instance stuck in state "NEW"; the template image was deleted from the provider before launching. It also doesn't seem to be moving into any state which allows for deletion.
(In reply to comment #34)
> I've an instance stuck in state "NEW" , the template image was deleted from
> the provider before launching. This also doesn't seem to be migrating into
> any state which allows for deletion.
This case is a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=863383
(In reply to comment #32)
> I'm facing an instance in state 'pending', which can't be removed, using:
> aeolus-conductor-0.13.14-1.el6cf
>
> The deployable in use is:
> https://github.com/aeolusproject/audrey/blob/master/examples/deployables/
> deployable-sample-single-instance.xml
>
> Instance-1 fails to start as I did not provide any value for the parameter
> 'service_1_param_2'. Instance-2 remains in 'pending' state.
>
> Is this a valid use case? If so, what is the procedure needed to cancel the
> deployment job?
Even if launch time params are blank, the instance goes to running state (IOW missing params are not a reason for a stuck pending state). The only reason for the pending state I can think of is that dbomatic was not running. Can you still reproduce this? (For me it worked as expected.)
Giulio Fidente was able to reproduce the stuck pending state; it turned out it's actually the same issue as https://bugzilla.redhat.com/show_bug.cgi?id=863383 and should be fixed by the same patch.
I'm closing the RFE as this has been implemented, just not working for a particular use case (which also should be fixed by the bugzilla in comment #37)
Hi Giulio, Could you please re-test this against 1.1? According to Comment 37, and the bug which is linked from there, this ought to be resolved in CE 1.1. Thanks, Angus