Bug 1230163

Summary: DELETE_FAILED when trying to delete a stack that has some nodes in error
Product: Red Hat OpenStack
Reporter: Udi Kalifon <ukalifon>
Component: openstack-ironic
Assignee: Lucas Alvares Gomes <lmartins>
Status: CLOSED ERRATA
QA Contact: Amit Ugol <augol>
Severity: high
Priority: high
Version: 7.0 (Kilo)
CC: adan, calfonso, ddomingo, jdonohue, jschluet, jslagle, lmartins, mbooth, mburns, oblaut, rhel-osp-director-maint, rlandy, sbaker, shardy, ukalifon, yeylon
Target Milestone: ga
Keywords: Triaged
Target Release: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
Fixed In Version: openstack-ironic-2015.1.0-9.el7ost openstack-nova-2015.1.0-15.el7ost
Doc Type: Bug Fix
Doc Text:
The Compute service expects to be able to delete an instance at any time; however, a Bare Metal instance can only be stopped at a specific stage, namely when it is in the 'DEPLOYWAIT' state. As a result, whenever the Compute service attempts to delete a Bare Metal instance that is not in the DEPLOYWAIT state, Compute's attempt will fail. When that happens, the instance may get stuck in a particular state and require a database change to resolve. With this release, Bare Metal instances no longer get stuck mid-deployment when Compute attempts to delete them. The Bare Metal service still won't abort an instance unless it is in the DEPLOYWAIT state.
Clone Of:
: 1256564 (view as bug list)
Last Closed: 2015-08-05 13:25:31 UTC
Type: Bug
Bug Blocks: 1191185, 1243520, 1256564
Attachments:
Description Flags
Log showing error from Nova none

Description Udi Kalifon 2015-06-10 11:34:18 UTC
Description of problem:
When I get CREATE_FAILED in deployments and try to delete the stack, it always ends up in the DELETE_FAILED state and I have no workaround. I need to reprovision the machine.


2015-06-10 13:49:58.350 18683 INFO heat.engine.resource [-] DELETE: ResourceGroup "Compute" [3bc08ff8-3d07-43ee-92f7-d9619342f219] Stack "overcloud" [a1269759-9696-4d58-8af4-47f564d89178]
2015-06-10 13:49:58.350 18683 TRACE heat.engine.resource Traceback (most recent call last):
2015-06-10 13:49:58.350 18683 TRACE heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 500, in _action_recorder
2015-06-10 13:49:58.350 18683 TRACE heat.engine.resource     yield
2015-06-10 13:49:58.350 18683 TRACE heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 930, in delete
2015-06-10 13:49:58.350 18683 TRACE heat.engine.resource     yield self.action_handler_task(action, *action_args)
2015-06-10 13:49:58.350 18683 TRACE heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/scheduler.py", line 313, in wrapper
2015-06-10 13:49:58.350 18683 TRACE heat.engine.resource     step = next(subtask)
2015-06-10 13:49:58.350 18683 TRACE heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 544, in action_handler_task
2015-06-10 13:49:58.350 18683 TRACE heat.engine.resource     while not check(handler_data):
2015-06-10 13:49:58.350 18683 TRACE heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resources/stack_resource.py", line 419, in check_delete_complete
2015-06-10 13:49:58.350 18683 TRACE heat.engine.resource     show_deleted=True)
2015-06-10 13:49:58.350 18683 TRACE heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resources/stack_resource.py", line 332, in _check_status_complete
2015-06-10 13:49:58.350 18683 TRACE heat.engine.resource     status_reason=nested.status_reason)
2015-06-10 13:49:58.350 18683 TRACE heat.engine.resource ResourceUnknownStatus: Resource failed - Unknown status FAILED due to "Resource DELETE failed: ResourceUnknownStatus: Resource failed - Unknown status FAILED due to "Resource DELETE failed: Error: Server ov-nqpz5krd6w-1-ddg2t6bxs6we-NovaCompute-6wzsdsvrkewg delete failed: (500) Error destroying the instance on node 953a6fdb-4a53-4476-a796-e9d0bfcff54d. Provision state still 'deleting'.""


Version-Release number of selected component (if applicable):
openstack-heat-api-2015.1.1-dev11.el7.centos.noarch
openstack-heat-engine-2015.1.1-dev11.el7.centos.noarch


Steps to Reproduce:
1. After a failure in deployment, try to delete the stack: "heat stack-delete overcloud"


Additional info:
Unfortunately I don't have a reproducible scenario to get a creation failure. This only happens to me when deploying bare metal nodes (using tripleo and tuskar).

Comment 3 Udi Kalifon 2015-06-10 11:35:55 UTC
Created attachment 1037220 [details]
Log showing error from Nova

It appears that Nova is returning an id of a non-existing resource. See attached log segment.

Comment 4 Amit Ugol 2015-06-10 11:53:37 UTC
I'm not sure how heat 'gave' the wrong ID in the first place. This is the latest director puddle.

Comment 6 Steven Hardy 2015-06-10 13:02:23 UTC
So I think I've seen this too, but with the latest poodle build - I created a stack, then the stack create failed, and I got this same error when trying to delete the stack.  I've not yet dug into the root cause, but I think in this case Heat is the messenger, and the problem is Nova or Ironic can't delete the servers.

Comment 7 Mike Burns 2015-06-10 13:16:29 UTC
(In reply to Amit Ugol from comment #4)
> I'm not sure how heat 'gave' the wrong ID in the first place. This is the
> latest director puddle.

Was this a puddle or upstream?  You say latest puddle in comment 4, but the description says el7.centos builds.  There was a tuskar patch that was backported for a similar issue.  

https://github.com/rdo-management/tuskar/commit/77868ba9da62b03df3c99c98bad3ef7d5dae0847

This patch exists in current poodles but not a puddle yet and should exist upstream as well.

Comment 8 chris alfonso 2015-06-12 16:43:10 UTC
Can you retest this again off the poodle to make sure, because if you were using the puddle it's probably fixed with the latest build.

Comment 9 Ronelle Landy 2015-06-12 19:10:31 UTC
Deleting the overcloud errors out even though nova list returns empty:

[stack@instack ~]$ nova list
+----+------+--------+------------+-------------+----------+
| ID | Name | Status | Task State | Power State | Networks |
+----+------+--------+------------+-------------+----------+
+----+------+--------+------------+-------------+----------+

[stack@instack ~]$ heat stack-list
+--------------------------------------+------------+---------------+----------------------+
| id                                   | stack_name | stack_status  | creation_time        |
+--------------------------------------+------------+---------------+----------------------+
| 6401cd95-919c-437a-b035-9c71a76be172 | overcloud  | DELETE_FAILED | 2015-06-12T17:23:53Z |
+--------------------------------------+------------+---------------+----------------------+

The overcloud was deployed using the CLI and was not in error.

The following rpms are installed on the undercloud:

[stack@instack log]$ rpm -qa | grep openstack
openstack-nova-console-2015.1.0-10.el7ost.noarch
openstack-neutron-2015.1.0-7.el7ost.noarch
openstack-ironic-conductor-2015.1.0-4.el7ost.noarch
openstack-ceilometer-alarm-2015.1.0-2.el7ost.noarch
openstack-swift-account-2.3.0-1.el7ost.noarch
python-django-openstack-auth-1.2.0-2.el7ost.noarch
openstack-tuskar-ui-0.3.0-2.el7ost.noarch
openstack-heat-api-cloudwatch-2015.1.0-3.el7ost.noarch
openstack-ceilometer-notification-2015.1.0-2.el7ost.noarch
openstack-neutron-openvswitch-2015.1.0-7.el7ost.noarch
openstack-nova-api-2015.1.0-10.el7ost.noarch
openstack-tripleo-image-elements-0.9.6-1.el7ost.noarch
python-openstackclient-1.0.3-2.el7ost.noarch
openstack-ironic-discoverd-1.1.0-3.el7ost.noarch
openstack-tripleo-heat-templates-0.8.6-6.el7ost.noarch
openstack-swift-object-2.3.0-1.el7ost.noarch
openstack-tripleo-0.0.6-0.1.git812abe0.el7ost.noarch
openstack-utils-2014.2-1.el7ost.noarch
openstack-nova-common-2015.1.0-10.el7ost.noarch
openstack-heat-common-2015.1.0-3.el7ost.noarch
openstack-tuskar-0.4.18-2.el7ost.noarch
openstack-tripleo-puppet-elements-0.0.1-2.el7ost.noarch
openstack-dashboard-theme-2015.1.0-10.el7ost.noarch
openstack-tuskar-ui-extras-0.0.3-3.el7ost.noarch
openstack-tempest-kilo-20150507.2.el7ost.noarch
openstack-swift-2.3.0-1.el7ost.noarch
openstack-neutron-ml2-2015.1.0-7.el7ost.noarch
openstack-nova-novncproxy-2015.1.0-10.el7ost.noarch
openstack-keystone-2015.1.0-1.el7ost.noarch
openstack-swift-plugin-swift3-1.7-3.el7ost.noarch
openstack-tripleo-common-0.0.1.dev6-0.git49b57eb.el7ost.noarch
openstack-neutron-common-2015.1.0-7.el7ost.noarch
openstack-heat-engine-2015.1.0-3.el7ost.noarch
openstack-ceilometer-common-2015.1.0-2.el7ost.noarch
openstack-heat-api-cfn-2015.1.0-3.el7ost.noarch
openstack-ceilometer-api-2015.1.0-2.el7ost.noarch
openstack-ironic-api-2015.1.0-4.el7ost.noarch
openstack-swift-proxy-2.3.0-1.el7ost.noarch
openstack-heat-templates-0-0.6.20150605git.el7ost.noarch
openstack-ceilometer-collector-2015.1.0-2.el7ost.noarch
openstack-ironic-common-2015.1.0-4.el7ost.noarch
openstack-selinux-0.6.31-2.el7ost.noarch
openstack-nova-compute-2015.1.0-10.el7ost.noarch
openstack-nova-conductor-2015.1.0-10.el7ost.noarch
openstack-swift-container-2.3.0-1.el7ost.noarch
redhat-access-plugin-openstack-7.0.0-0.el7ost.noarch
openstack-glance-2015.1.0-6.el7ost.noarch
openstack-heat-api-2015.1.0-3.el7ost.noarch
openstack-ceilometer-central-2015.1.0-2.el7ost.noarch
openstack-puppet-modules-2015.1.4-1.el7ost.noarch
openstack-nova-scheduler-2015.1.0-10.el7ost.noarch
openstack-nova-cert-2015.1.0-10.el7ost.noarch
openstack-dashboard-2015.1.0-10.el7ost.noarch


heat event-show reveals errors in:

| RedisVirtualIP | cfe0db10-c774-4d5f-a90d-aa9b63ad772f | state changed | DELETE_IN_PROGRESS | 2015-06-12T17:47:31Z |
| RedisVirtualIP | 81566a2f-b64d-4284-8a43-70b1325f7ca3 | Unauthorized: {"error": {"message": "Expecting to find username or userId in passwordCredentials - the server could not comply with the request since it is either malformed or otherwise incorrect. The client is assumed to be in error.", "code": 400, "titl | DELETE_FAILED | 2015-06-12T17:47:32Z |
| overcloud | 938af622-8312-45b6-b0b6-84d374293cd1 |

| overcloud | 003d5977-a269-4e04-85ac-502c056ec3f6 | Resource DELETE failed: ConnectionFailed: Connection to neutron failed: ('Connection aborted.', error(113, 'EHOSTUNREACH')) | DELETE_FAILED | 2015-06-12T18:14:07Z |


Possible issues with: patch to the RedisVirtualIP

Awaiting patch - thanks dsneddon

Comment 11 Udi Kalifon 2015-06-22 08:13:13 UTC
This is still a major problem, even with the latest puddle 2015-06-17.2. It reproduces every time and makes it very difficult to delete stacks and redeploy after an error:

$ nova list
+--------------+------------------------+--------+-----+-------------+---------------------+
| ID           | Name                   | Status | ... | Power State | Networks            |
+--------------+------------------------+--------+-----+-------------+---------------------+
| 4f1d6f76-... | overcloud-compute-0    | ERROR  | -   | NOSTATE     | ctlplane=192.0.2.11 |
| 53efb90d-... | overcloud-controller-0 | ERROR  | -   | NOSTATE     | ctlplane=192.0.2.12 |
+--------------+------------------------+--------+-----+-------------+---------------------+
$ heat stack-delete overcloud
+--------------+------------+--------------------+----------------------+
| id           | stack_name | stack_status       | creation_time        |
+--------------+------------+--------------------+----------------------+
| acf7f9af-... | overcloud  | DELETE_IN_PROGRESS | 2015-06-21T11:05:48Z |
+--------------+------------+--------------------+----------------------+
$ heat stack-list
+--------------+------------+---------------+----------------------+
| id           | stack_name | stack_status  | creation_time        |
+--------------+------------+---------------+----------------------+
| acf7f9af-... | overcloud  | DELETE_FAILED | 2015-06-21T11:05:48Z |
+--------------+------------+---------------+----------------------+

Comment 12 Lucas Alvares Gomes 2015-06-23 14:57:18 UTC
Do we still have some information about the states in Ironic?

If so could someone please attach the output of:

1) ironic node-list

2) ironic node-show (for each node)

Yes, Nova and Ironic still have problems with locks. Depending on the state in which the deployment failed, a node can get stuck. I have put up some patches that might help mitigate this problem:

* For Nova: https://review.openstack.org/#/c/182992/ (already merged upstream in Nova). This allows Nova to delete the instance if the deployment is in the DEPLOYWAIT state in Ironic, aborting it.

* For Ironic: https://review.openstack.org/#/c/194132/ (not merged upstream in Ironic yet). This mitigates the problem of having a node stuck in the DEPLOYING state. There's more to do, but at least with this we can unstick a node if the conductor died mid-deployment and had to be restarted.

Comment 13 chris alfonso 2015-06-24 18:11:07 UTC
*** Bug 1235390 has been marked as a duplicate of this bug. ***

Comment 14 Lucas Alvares Gomes 2015-06-30 17:07:41 UTC
I've added another patch to Ironic that periodically checks the status of a node being deployed and of the conductor that is deploying it, so that the node does not get stuck if a conductor dies mid-deployment because of the OOM killer or a power outage: https://review.openstack.org/#/c/197141/
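The periodic check described in this comment can be pictured roughly as follows. This is only a sketch under assumptions: the node dictionaries, the set of live conductors, the "DEPLOYFAIL" target state and both function names are hypothetical and are not taken from the Ironic patch.

```python
# Hypothetical sketch: data shapes and names are illustrative, not Ironic code.

def find_orphaned_deployments(nodes, alive_conductors):
    """Return IDs of nodes in DEPLOYING whose conductor is no longer alive.

    nodes: list of dicts with 'id', 'provision_state' and 'conductor' keys.
    alive_conductors: set of conductor hostnames with a recent heartbeat.
    """
    return [
        node["id"]
        for node in nodes
        if node["provision_state"] == "DEPLOYING"
        and node["conductor"] not in alive_conductors
    ]


def fail_orphaned_deployments(nodes, alive_conductors):
    """Move orphaned nodes to a failed state so they can be torn down later."""
    orphaned = set(find_orphaned_deployments(nodes, alive_conductors))
    for node in nodes:
        if node["id"] in orphaned:
            # Assumed target state; the real patch may choose differently.
            node["provision_state"] = "DEPLOYFAIL"
    return sorted(orphaned)
```

Run on a timer, a check like this would turn a node that is stuck forever into one that has visibly failed, which a later delete can then handle.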

Comment 15 Lucas Alvares Gomes 2015-07-01 13:31:17 UTC
Here's another patch that might help mitigate some problems with nodes in ERROR state: https://review.openstack.org/#/c/197504/

Comment 17 Lucas Alvares Gomes 2015-07-02 13:42:00 UTC
Hi @Mike,

So all these patches help to mitigate this problem by avoiding the nodes getting stuck in some states which would cause the heat stack-delete to fail.

The problem with this bug is that there's an interface incompatibility between Ironic and Nova. Basically, Nova allows an instance that is being spawned (in the "spawning" state) to be deleted, but Ironic doesn't support aborting the deployment. So if some instance is being provisioned by Ironic and the user issues a heat stack-delete, the delete has to wait until all instances are deployed or error out in Ironic. And by having Nova call destroy() mid-operation we can get into some odd states. That is what those patches try to mitigate, by making Ironic smart enough to clean up the nodes automatically.

So the real fix for this problem would be a mechanism that allows interrupting the deployment. We are discussing how to do this upstream, but it won't be a small change.
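A toy model of the mismatch described above, assuming (per the Doc Text) that a deployment can only be stopped in DEPLOYWAIT. The state sets, the exception and both functions are hypothetical illustrations, not the Nova or Ironic APIs.

```python
# Toy model only: state names come from this bug report; everything else
# here is hypothetical and is not the Nova or Ironic API.

# Per the Doc Text, a running deployment can only be stopped in DEPLOYWAIT.
ABORTABLE_STATES = {"DEPLOYWAIT"}
# Assumed states in which no deployment is running, so teardown can proceed.
IDLE_STATES = {"ACTIVE", "ERROR", "DEPLOYFAIL"}


class TeardownRefused(Exception):
    """Raised when the bare metal side cannot tear the node down right now."""


def ironic_teardown(provision_state):
    """Return the next state, or refuse if a deployment is in flight."""
    if provision_state in ABORTABLE_STATES or provision_state in IDLE_STATES:
        return "DELETING"
    raise TeardownRefused(
        "cannot tear down node in state %s" % provision_state)


def nova_destroy(provision_state):
    """Nova assumes destroy() always works; report what actually happens."""
    try:
        return ironic_teardown(provision_state)
    except TeardownRefused:
        # This is the case that surfaces as DELETE_FAILED in Heat.
        return "STUCK"
```

In this model, nova_destroy("DEPLOYING") ends up "STUCK" while nova_destroy("DEPLOYWAIT") proceeds to "DELETING", which is the gap an abort mechanism would close.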

Comment 19 Lucas Alvares Gomes 2015-07-15 13:37:44 UTC
I have started introducing the idea of aborting a deployment in Ironic upstream. This is not something that may be backported for this bug because it requires API changes, but it's the proper fix for this type of problem for future releases.

The patches with the initial work are:

* https://review.openstack.org/#/c/200152/
* https://review.openstack.org/#/c/201552/

Comment 20 Matthew Booth 2015-07-15 14:53:00 UTC
The Nova part of this is built in openstack-nova-2015.1.0-15.el7ost

Comment 21 Jon Schlueter 2015-07-15 14:56:37 UTC
Now that all the parts are in place this should be good to go

Comment 23 Udi Kalifon 2015-07-16 10:27:58 UTC
I see no improvement. Every stack deletion, without exception, is always a fight with heat, ironic and nova. The last time it took about 15 re-calls to "nova delete" in order to delete the very last server (bare metal) that didn't want to go.

I have these packages:
openstack-ironic-api-2015.1.0-9.el7ost.noarch
python-ironicclient-0.5.1-9.el7ost.noarch
openstack-ironic-common-2015.1.0-9.el7ost.noarch
openstack-ironic-conductor-2015.1.0-9.el7ost.noarch
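
The repeated "nova delete" calls described in this comment amount to a manual retry loop. A minimal sketch of that loop, assuming the caller supplies two callables (one that issues the delete and one that checks whether the server is gone); the function name and parameters are hypothetical.

```python
import time


def retry_until(action, is_done, attempts=15, delay=0.0, sleep=time.sleep):
    """Call action() until is_done() is true or the attempts run out.

    Returns the number of calls made; raises RuntimeError on exhaustion.
    """
    for attempt in range(1, attempts + 1):
        action()
        if is_done():
            return attempt
        if attempt < attempts:
            # Wait before the next try; no wait after the final attempt.
            sleep(delay)
    raise RuntimeError("still not done after %d attempts" % attempts)
```

Here action could shell out to "nova delete" for the remaining server and is_done could check whether "nova list" still shows it. A loop like this is a workaround only; it does not address why the delete fails.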

Comment 24 Jon Schlueter 2015-07-16 12:09:31 UTC
This required the following build which just made it into poodles yesterday:

openstack-nova-2015.1.0-15.el7ost

Please retest once the new puddle is released.

Comment 25 Lucas Alvares Gomes 2015-07-20 09:08:58 UTC
(In reply to Udi from comment #23)
> I see no improvement. Every stack deletion, without exception, is always a
> fight with heat, ironic and nova. The last time it took about 15 re-calls to
> "nova delete" in order to delete the very last server (bare metal) that
> didn't want to go.
> 

Yes, these patches will just mitigate the problem of having the node stuck in states like DEPLOYING or DEPLOYWAIT, so the stack will eventually get deleted.

As I pointed out in earlier comments, the right fix for this problem is for Ironic to introduce a way to abort the deployment (Nova supports this for an instance: a call to destroy() should stop a VM that is spawning), but in Ironic we currently cannot destroy() an instance mid-deployment. I brought this discussion upstream and started working on some patches [1][2]. Some refactoring is needed first: as you can see, [1] makes cleaning behave like deploying. [2] introduces abort for DEPLOYWAIT and CLEANWAIT, which are the states in which the clean or deploy operation is running in-band (the deploy agent is working on the disk). The next patches I'm working on will make it possible to abort in DEPLOYING and CLEANING, which is when the conductor is doing the work.

But anyway, I'm afraid that this work will not be backported because it needs API changes, so I believe that for the current osp-d release this problem won't be 100% fixed.

[1] https://review.openstack.org/#/c/200152/
[2] https://review.openstack.org/#/c/201552/
[3] https://review.openstack.org/#/c/203157/

Comment 26 Steve Baker 2015-07-22 04:33:03 UTC
I realise there is an Ironic component to this; however, I believe this Heat fix will resolve many of the failures reported in this bug:

https://review.openstack.org/#/c/204301/

Comment 27 Amit Ugol 2015-07-22 04:40:17 UTC
(In reply to Steve Baker from comment #26)
> I realise there is a ironic component to this, however I believe this heat
> fix will fix the issue for many of the reported failures in this bug:
> 
> https://review.openstack.org/#/c/204301/

I'd like to wait for that fix as well. Delete still fails in some cases.

Comment 28 Mike Burns 2015-07-22 11:16:56 UTC
This isn't a valid reason to fail this bug.  There is a fix for the specific issue mentioned in this bug which should be tested.  If we want to get the heat fix in, let's either file a generic heat bug or clone this bug and attach the fix to that bug and follow the process for that.

Comment 29 Lucas Alvares Gomes 2015-07-22 14:49:15 UTC
@Steve thanks for that

As an update, I'm working on a version of the fix that we could potentially backport. The proposed change would reuse the "deleted" API verb, which Nova already calls in Ironic to delete an instance, so that it can also abort a deployment mid-way. For more specifics, take a look at the spec: https://review.openstack.org/#/c/204162/

Comment 30 Steve Baker 2015-07-22 21:01:08 UTC
I don't think a new bz is needed for the heat fix, we already have bug 1242796

Comment 31 Amit Ugol 2015-07-28 05:32:21 UTC
There have been many patches to rectify this issue and from testing it for a while on the latest puddle there has been much improvement in this area.
I will mark this _general_ issue as verified yet I believe that we _WILL SEE_ cases in which delete will fail but those cases will be specific issues.
For those specific cases a new bug should be created.

Comment 33 errata-xmlrpc 2015-08-05 13:25:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2015:1548

Comment 34 Alexandru Dan 2016-05-04 11:40:50 UTC
I am on OSP8 stable using RHEL7 and I have the same issue. I have a basic overcloud with a failed deployment (troubleshooting that also) and I am trying to redeploy everything but heat stack-delete does not work. Nova is empty and ironic node-list shows the two hosts in power off state and available. 

I feel this is still broken.

Comment 35 Steve Baker 2016-05-04 22:07:43 UTC
Can you please provide more details than "heat stack-delete does not work"? Does it go to DELETE_FAILED?

Does it work when you attempt a delete the second time?

Can you attach the output for the following?

  heat resource-list --show-nested 3 overcloud |grep -iv complete
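
For anyone post-processing saved output instead of piping it through grep, the same filter can be sketched in Python. The function name is hypothetical; it mirrors "grep -iv complete" and so also keeps header and border lines.

```python
def unfinished_rows(table_text):
    """Keep lines that do not contain 'complete', ignoring case (grep -iv)."""
    return [
        line for line in table_text.splitlines()
        if "complete" not in line.lower()
    ]
```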

Comment 36 Alexandru Dan 2016-05-05 08:07:51 UTC
Hello, 

Yes, it went to status UPDATE_FAILED. While troubleshooting I realized there were far too many connections to rabbitmq (192.0.2.1:5672), and I wanted to clear some of them, so I restarted rabbitmq via systemctl. Running heat stack-delete overcloud again resulted in UPDATE_FAILED status, so I did it repeatedly and it worked at some point. The stack was deleted.

I am now experiencing much the same issue when deploying a stack. The deployment fails at different times, and by issuing openstack overcloud deploy .... again and again it manages to move a step further in the deployment.

Something, or some component, is failing to communicate properly. I will investigate further, as I am trying to deploy my first OSP8 stack.

Anyway, the idempotency of it is not really showing right now.

Here is a connection count during deployment:
[stack@director ~]$ sudo netstat -atpn | grep 5672 | grep ESTA  | wc -l
176


It was the same when it wasn't doing anything.
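
A stricter version of the connection count above can be sketched in Python, assuming the usual "netstat -atpn" column order (protocol, queues, local address, foreign address, state, PID/program). The function name is hypothetical. Unlike the grep pipeline, it matches the port only at the end of an address, so a port such as 15672 or a PID containing 5672 is not counted.

```python
def count_established(netstat_output, port=5672):
    """Count ESTABLISHED TCP connections whose local or remote port matches."""
    suffix = ":%d" % port
    count = 0
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumed columns: proto recv-q send-q local foreign state pid/program
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, foreign, state = fields[3], fields[4], fields[5]
        if state == "ESTABLISHED" and (
                local.endswith(suffix) or foreign.endswith(suffix)):
            count += 1
    return count
```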

Comment 37 Alexandru Dan 2016-05-05 08:49:02 UTC
OK! Turns out after several runs of the same command 

("openstack overcloud deploy --templates -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /home/stack/templates/network-environment.yaml -e /home/stack/templates/storage-environment.yaml -e ~/templates/cloudname.yaml --control-flavor control --compute-flavor compute --ntp-server pool.ntp.org --neutron-network-type vxlan --neutron-tunnel-types vxlan")

it managed to deploy.
I kept retrying because I knew the configuration was sound and had noticed that it was failing at different steps in the process. Some of the failures weren't even failures: heat stack-list --show-nested said that the controller failed to deploy from a nested task. That task had no error, but the stack still had UPDATE_FAILED as its status, so I reran the command above and voila, the stack deployed.