Bug 1230163
Summary: DELETE_FAILED when trying to delete a stack that has some nodes in error

| Field | Value |
|---|---|
| Product | Red Hat OpenStack |
| Component | openstack-ironic |
| Version | 7.0 (Kilo) |
| Target Release | 7.0 (Kilo) |
| Target Milestone | ga |
| Hardware | Unspecified |
| OS | Unspecified |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | high |
| Keywords | Triaged |
| Reporter | Udi Kalifon <ukalifon> |
| Assignee | Lucas Alvares Gomes <lmartins> |
| QA Contact | Amit Ugol <augol> |
| CC | adan, calfonso, ddomingo, jdonohue, jschluet, jslagle, lmartins, mbooth, mburns, oblaut, rhel-osp-director-maint, rlandy, sbaker, shardy, ukalifon, yeylon |
| Fixed In Version | openstack-ironic-2015.1.0-9.el7ost, openstack-nova-2015.1.0-15.el7ost |
| Doc Type | Bug Fix |
Doc Text:

The Compute service expects to be able to delete an instance at any time; however, a Bare Metal instance can only be aborted at one specific stage of deployment, namely the 'DEPLOYWAIT' state. As a result, whenever the Compute service attempted to delete a Bare Metal instance that was not in the DEPLOYWAIT state, the attempt failed, and the instance could get stuck in its current state, requiring a manual database change to resolve.

With this release, Bare Metal instances no longer get stuck mid-deployment when Compute attempts to delete them. The Bare Metal service still will not abort a deployment unless the instance is in the DEPLOYWAIT state.
| Field | Value |
|---|---|
| Cloned To | 1256564 |
| Last Closed | 2015-08-05 13:25:31 UTC |
| Type | Bug |
| Bug Blocks | 1191185, 1243520, 1256564 |
Description
Udi Kalifon
2015-06-10 11:34:18 UTC
Created attachment 1037220
Log showing error from Nova
It appears that Nova is returning the ID of a non-existent resource. See the attached log segment.
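When Heat reports a stale or non-existent resource ID like this, a first diagnostic step is pulling only the failed events out of the stack's event list. Here is a minimal sketch using the Kilo-era `heat` CLI shown elsewhere in this report; the `failed_events` helper name and the injectable-command argument are illustrative, not part of any OpenStack client:

```shell
# List only the failed events of a stack. The second argument lets a
# test substitute a stub for the real "heat" CLI; by default the real
# client is used. Hypothetical helper, for illustration only.
failed_events() {
    stack="$1"
    heat_cmd="${2:-heat}"
    "$heat_cmd" event-list "$stack" | grep -E '(CREATE|UPDATE|DELETE)_FAILED'
}

# Usage: failed_events overcloud
```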
I'm not sure how heat 'gave' the wrong ID in the first place. This is the latest director puddle.

So I think I've seen this too, but with the latest poodle build - I created a stack, the stack create failed, and I got this same error when trying to delete the stack. I've not yet dug into the root cause, but I think in this case Heat is just the messenger, and the real problem is that Nova or Ironic can't delete the servers.

(In reply to Amit Ugol from comment #4)
> I'm not sure how heat 'gave' the wrong ID in the first place. This is the
> latest director puddle.

Was this a puddle or upstream? You say latest puddle in comment 4, but the description says el7.centos builds. There was a tuskar patch that was backported for a similar issue:

https://github.com/rdo-management/tuskar/commit/77868ba9da62b03df3c99c98bad3ef7d5dae0847

This patch exists in current poodles but not in a puddle yet, and should exist upstream as well. Please retest off the poodle to make sure; if you were using the puddle, it's probably fixed with the latest build.

Deleting the overcloud errors out even though nova list returns empty:

[stack@instack ~]$ nova list
+----+------+--------+------------+-------------+----------+
| ID | Name | Status | Task State | Power State | Networks |
+----+------+--------+------------+-------------+----------+
+----+------+--------+------------+-------------+----------+
[stack@instack ~]$ heat stack-list
+--------------------------------------+------------+---------------+----------------------+
| id                                   | stack_name | stack_status  | creation_time        |
+--------------------------------------+------------+---------------+----------------------+
| 6401cd95-919c-437a-b035-9c71a76be172 | overcloud  | DELETE_FAILED | 2015-06-12T17:23:53Z |
+--------------------------------------+------------+---------------+----------------------+

The overcloud was deployed using the CLI and was not in error.
The following rpms are installed on the undercloud:

[stack@instack log]$ rpm -qa | grep openstack
openstack-nova-console-2015.1.0-10.el7ost.noarch
openstack-neutron-2015.1.0-7.el7ost.noarch
openstack-ironic-conductor-2015.1.0-4.el7ost.noarch
openstack-ceilometer-alarm-2015.1.0-2.el7ost.noarch
openstack-swift-account-2.3.0-1.el7ost.noarch
python-django-openstack-auth-1.2.0-2.el7ost.noarch
openstack-tuskar-ui-0.3.0-2.el7ost.noarch
openstack-heat-api-cloudwatch-2015.1.0-3.el7ost.noarch
openstack-ceilometer-notification-2015.1.0-2.el7ost.noarch
openstack-neutron-openvswitch-2015.1.0-7.el7ost.noarch
openstack-nova-api-2015.1.0-10.el7ost.noarch
openstack-tripleo-image-elements-0.9.6-1.el7ost.noarch
python-openstackclient-1.0.3-2.el7ost.noarch
openstack-ironic-discoverd-1.1.0-3.el7ost.noarch
openstack-tripleo-heat-templates-0.8.6-6.el7ost.noarch
openstack-swift-object-2.3.0-1.el7ost.noarch
openstack-tripleo-0.0.6-0.1.git812abe0.el7ost.noarch
openstack-utils-2014.2-1.el7ost.noarch
openstack-nova-common-2015.1.0-10.el7ost.noarch
openstack-heat-common-2015.1.0-3.el7ost.noarch
openstack-tuskar-0.4.18-2.el7ost.noarch
openstack-tripleo-puppet-elements-0.0.1-2.el7ost.noarch
openstack-dashboard-theme-2015.1.0-10.el7ost.noarch
openstack-tuskar-ui-extras-0.0.3-3.el7ost.noarch
openstack-tempest-kilo-20150507.2.el7ost.noarch
openstack-swift-2.3.0-1.el7ost.noarch
openstack-neutron-ml2-2015.1.0-7.el7ost.noarch
openstack-nova-novncproxy-2015.1.0-10.el7ost.noarch
openstack-keystone-2015.1.0-1.el7ost.noarch
openstack-swift-plugin-swift3-1.7-3.el7ost.noarch
openstack-tripleo-common-0.0.1.dev6-0.git49b57eb.el7ost.noarch
openstack-neutron-common-2015.1.0-7.el7ost.noarch
openstack-heat-engine-2015.1.0-3.el7ost.noarch
openstack-ceilometer-common-2015.1.0-2.el7ost.noarch
openstack-heat-api-cfn-2015.1.0-3.el7ost.noarch
openstack-ceilometer-api-2015.1.0-2.el7ost.noarch
openstack-ironic-api-2015.1.0-4.el7ost.noarch
openstack-swift-proxy-2.3.0-1.el7ost.noarch
openstack-heat-templates-0-0.6.20150605git.el7ost.noarch
openstack-ceilometer-collector-2015.1.0-2.el7ost.noarch
openstack-ironic-common-2015.1.0-4.el7ost.noarch
openstack-selinux-0.6.31-2.el7ost.noarch
openstack-nova-compute-2015.1.0-10.el7ost.noarch
openstack-nova-conductor-2015.1.0-10.el7ost.noarch
openstack-swift-container-2.3.0-1.el7ost.noarch
redhat-access-plugin-openstack-7.0.0-0.el7ost.noarch
openstack-glance-2015.1.0-6.el7ost.noarch
openstack-heat-api-2015.1.0-3.el7ost.noarch
openstack-ceilometer-central-2015.1.0-2.el7ost.noarch
openstack-puppet-modules-2015.1.4-1.el7ost.noarch
openstack-nova-scheduler-2015.1.0-10.el7ost.noarch
openstack-nova-cert-2015.1.0-10.el7ost.noarch
openstack-dashboard-2015.1.0-10.el7ost.noarch

heat event-show reveals errors in:

| RedisVirtualIP | cfe0db10-c774-4d5f-a90d-aa9b63ad772f | state changed | DELETE_IN_PROGRESS | 2015-06-12T17:47:31Z |
| RedisVirtualIP | 81566a2f-b64d-4284-8a43-70b1325f7ca3 | Unauthorized: {"error": {"message": "Expecting to find username or userId in passwordCredentials - the server could not comply with the request since it is either malformed or otherwise incorrect. The client is assumed to be in error.", "code": 400, "titl | DELETE_FAILED | 2015-06-12T17:47:32Z |
| overcloud | 938af622-8312-45b6-b0b6-84d374293cd1 |
| overcloud | 003d5977-a269-4e04-85ac-502c056ec3f6 | Resource DELETE failed: ConnectionFailed: Connection to neutron failed: ('Connection aborted.', error(113, 'EHOSTUNREACH')) | DELETE_FAILED | 2015-06-12T18:14:07Z |

Possible issues with: patch to the RedisVirtualIP

Awaiting patch - thanks dsneddon

This is still a major problem, even with the latest puddle (2015-06-17.2). It recurs all the time and makes it very difficult to delete stacks and redeploy after an error:

$ nova list
+--------------+------------------------+--------+-----+-------------+---------------------+
| ID           | Name                   | Status | ... | Power State | Networks            |
+--------------+------------------------+--------+-----+-------------+---------------------+
| 4f1d6f76-... | overcloud-compute-0    | ERROR  | -   | NOSTATE     | ctlplane=192.0.2.11 |
| 53efb90d-... | overcloud-controller-0 | ERROR  | -   | NOSTATE     | ctlplane=192.0.2.12 |
+--------------+------------------------+--------+-----+-------------+---------------------+
$ heat stack-delete overcloud
+--------------+------------+--------------------+----------------------+
| id           | stack_name | stack_status       | creation_time        |
+--------------+------------+--------------------+----------------------+
| acf7f9af-... | overcloud  | DELETE_IN_PROGRESS | 2015-06-21T11:05:48Z |
+--------------+------------+--------------------+----------------------+
$ heat stack-list
+--------------+------------+---------------+----------------------+
| id           | stack_name | stack_status  | creation_time        |
+--------------+------------+---------------+----------------------+
| acf7f9af-... | overcloud  | DELETE_FAILED | 2015-06-21T11:05:48Z |
+--------------+------------+---------------+----------------------+

Do we still have some information about the states in Ironic? If so, could someone please attach the output of:

1) ironic node-list
2) ironic node-show (for each node)

Yes, Nova and Ironic still have problems with locks. Depending on which state the deployment failed in, a node can get stuck somewhere. Some patches that I have put up might help mitigate this problem:

* For Nova: https://review.openstack.org/#/c/182992/ (already merged upstream in Nova). This allows Nova to delete the instance, aborting the deployment, if it is in the DEPLOYWAIT state in Ironic.
* For Ironic: https://review.openstack.org/#/c/194132/ (not merged upstream in Ironic yet). This mitigates the problem of a node getting stuck in the DEPLOYING state. There's more to do, but at least with this we can unstick a node if the conductor died mid-deployment and had to be restarted.
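The two outputs requested above (node-list plus a node-show per node) can be gathered in one pass. A rough sketch: the `dump_node_states` name and the injectable-CLI argument are assumptions for illustration, and the awk expression just parses the ASCII table that `ironic node-list` prints:

```shell
# Print "ironic node-show" for every node in "ironic node-list".
# The first argument lets a test substitute a stub for the real CLI.
# Hypothetical helper, not part of python-ironicclient.
dump_node_states() {
    ironic_cmd="${1:-ironic}"
    "$ironic_cmd" node-list |
        awk -F'|' 'NR > 3 && NF > 2 { gsub(/ /, "", $2); if ($2) print $2 }' |
        while read -r uuid; do
            echo "=== node $uuid ==="
            "$ironic_cmd" node-show "$uuid"
        done
}
```

The awk filter skips the three header lines and the border rows (which contain no `|` separators), leaving only the UUID column of the data rows.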
*** Bug 1235390 has been marked as a duplicate of this bug. ***

I've added another patch to Ironic that periodically checks the status of a node being deployed, and of the conductor deploying it, to avoid the node getting stuck if a conductor dies mid-deployment due to an OOM kill or a power outage:

https://review.openstack.org/#/c/197141/

Here's another patch that might help mitigate some problems with nodes in the ERROR state:

https://review.openstack.org/#/c/197504/

Hi @Mike,

So all these patches help mitigate this problem by preventing nodes from getting stuck in states that would cause the heat stack-delete to fail.

The root problem in this bug is an interface incompatibility between Ironic and Nova. Nova allows an instance that is being spawned (in the "spawning" state) to be deleted, but Ironic doesn't support aborting a deployment. So if an instance is being provisioned by Ironic and the user issues a heat stack-delete, it has to wait until all instances are deployed or error out in Ironic. And by having Nova call destroy() mid-operation, we can get into some odd states, which is what those patches try to mitigate by making Ironic smart enough to automatically clean up the nodes.

The real fix for this problem would be a mechanism that allows interrupting the deployment. We are discussing how to do this upstream, but it won't be a small change.

I have started introducing the idea of aborting a deployment in Ironic upstream. This is not something that can be backported for this bug, because it requires API changes, but it's the proper fix for this type of problem for future releases. The patches with the initial work are:

* https://review.openstack.org/#/c/200152/
* https://review.openstack.org/#/c/201552/

The Nova part of this is built in openstack-nova-2015.1.0-15.el7ost

Now that all the parts are in place, this should be good to go.

I see no improvement.
Every stack deletion, without exception, is a fight with heat, ironic and nova. The last time, it took about 15 re-calls to "nova delete" to remove the very last (bare metal) server, which didn't want to go. I have these packages:

openstack-ironic-api-2015.1.0-9.el7ost.noarch
python-ironicclient-0.5.1-9.el7ost.noarch
openstack-ironic-common-2015.1.0-9.el7ost.noarch
openstack-ironic-conductor-2015.1.0-9.el7ost.noarch

This required the following build, which just made it into poodles yesterday:

openstack-nova-2015.1.0-15.el7ost

Please retest once the new puddle is released.

(In reply to Udi from comment #23)
> I see no improvement. Every stack deletion, without exception, is always a
> fight with heat, ironic and nova. The last time it took about 15 re-calls to
> "nova delete" in order to delete the very last server (bare metal) that
> didn't want to go.

Yes, these patches just mitigate the problem of the node getting stuck in states like DEPLOYING or DEPLOYWAIT, so the stack will eventually get deleted. As I have pointed out in earlier comments, the right fix for this problem is for Ironic to introduce a way to abort the deployment: Nova supports it for the instance (a call to destroy() should stop a VM that is spawning), but in Ironic we currently cannot destroy() an instance mid-deployment.

I brought this discussion upstream and started working on some patches [1][2]. There's some refactoring needed first; as you can see, [1] makes cleaning behave like deploying. [2] introduces abort for DEPLOYWAIT and CLEANWAIT, which are the states when the clean or deploy operation is running in-band (the deploy agent is working on the disk). The next patches I'm working on will make it possible to abort in DEPLOYING and CLEANING, which is when the conductor is doing the work.

Anyway, I'm afraid this work will not be backported because it needs API changes, so I believe that for the current osp-d release this problem won't be 100% fixed.
[1] https://review.openstack.org/#/c/200152/
[2] https://review.openstack.org/#/c/201552/
[3] https://review.openstack.org/#/c/203157/

I realise there is an ironic component to this; however, I believe this heat fix will resolve many of the failures reported in this bug:

https://review.openstack.org/#/c/204301/

(In reply to Steve Baker from comment #26)
> I realise there is an ironic component to this; however, I believe this heat
> fix will resolve many of the failures reported in this bug:
>
> https://review.openstack.org/#/c/204301/

I'd like to wait for that fix as well. Delete still fails in some cases.

This isn't a valid reason to fail this bug. There is a fix for the specific issue mentioned in this bug, and that fix should be tested. If we want to get the heat fix in, let's either file a generic heat bug, or clone this bug, attach the fix to the clone, and follow the process for that.

@Steve thanks for that

As an update, I'm trying to work on a version of this fix that we could potentially backport. The proposed change reuses the "deleted" API verb (which is what Nova already calls in Ironic to delete the instance) to also allow aborting it mid-deployment. For the specifics, take a look at the spec: https://review.openstack.org/#/c/204162/

I don't think a new bz is needed for the heat fix; we already have bug 1242796.

There have been many patches to rectify this issue, and from testing it for a while on the latest puddle there has been much improvement in this area. I will mark this _general_ issue as verified, yet I believe that we _WILL SEE_ cases in which delete fails; those will be specific issues, and for those a new bug should be created.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2015:1548

I am on OSP8 stable using RHEL7 and I have the same issue. I have a basic overcloud with a failed deployment (troubleshooting that as well) and I am trying to redeploy everything, but heat stack-delete does not work. Nova is empty, and ironic node-list shows the two hosts powered off and available. I feel this is still broken.

Can you please provide more details than "heat stack-delete does not work"? Does it go to DELETE_FAILED? Does it work when you attempt the delete a second time? Can you attach the output of the following?

heat resource-list --show-nested 3 overcloud | grep -iv complete

Hello,

Yes, it went to UPDATE_FAILED status. While troubleshooting I realized there were far too many connections to rabbitmq (192.0.2.1:5672), and I wanted to clear some of them, so I restarted rabbitmq via systemctl. Doing a heat stack-delete overcloud again resulted in UPDATE_FAILED status, so I did it repeatedly, and at some point it worked: the stack was deleted.

I am now experiencing much the same issue when deploying a stack. The deployment fails at different points, and by issuing openstack overcloud deploy .... again and again, it manages to move a step further each time. Something, or some component, is failing to communicate properly; I will investigate further as I am trying to deploy my first OSP8 stack. In any case, the idempotency of it is not really showing off right now.

Here is a connection count while in deployment:

[stack@director ~]$ sudo netstat -atpn | grep 5672 | grep ESTA | wc -l
176

It was the same when it wasn't doing anything.

OK!
Turns out that, after several runs of the same command ("openstack overcloud deploy --templates -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /home/stack/templates/network-environment.yaml -e /home/stack/templates/storage-environment.yaml -e ~/templates/cloudname.yaml --control-flavor control --compute-flavor compute --ntp-server pool.ntp.org --neutron-network-type vxlan --neutron-tunnel-types vxlan"), it managed to deploy. I say this because I know the configuration was sound, and because I realized it was failing at different steps in the process. Some of the failures weren't even failures: heat stack-list --show-nested said that the controller failed to deploy from a nested task, but that task had no error. Still, the stack had UPDATE_FAILED status, so I reran the command above and, voila, the stack deployed.
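Several comments in this report describe manually re-running "nova delete", "heat stack-delete", or "openstack overcloud deploy" until it finally succeeds. That manual loop can be sketched as a bounded retry helper; `retry_cmd` is a hypothetical name, not part of any OpenStack client, and the attempt counts and delay are arbitrary:

```shell
# Re-run a command until it exits 0, giving up after $1 attempts.
# RETRY_DELAY (seconds between attempts) defaults to 5.
# Hypothetical helper for scripting the manual retries described above.
retry_cmd() {
    max="$1"; shift
    i=1
    while ! "$@"; do
        if [ "$i" -ge "$max" ]; then
            return 1    # exhausted all attempts
        fi
        i=$((i + 1))
        sleep "${RETRY_DELAY:-5}"
    done
}

# e.g. retry_cmd 15 nova delete overcloud-controller-0
#      retry_cmd 5 heat stack-delete overcloud
```

This only papers over the underlying race between Nova and Ironic that the patches in this bug address; it is the scripted equivalent of the "about 15 re-calls" workaround, not a fix.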