Bug 1308517
Summary: | heat stack-delete fails with already has an action (CREATE) in progress | |
---|---|---|---
Product: | Red Hat OpenStack | Reporter: | Amit Ugol <augol>
Component: | openstack-heat | Assignee: | Zane Bitter <zbitter>
Status: | CLOSED WONTFIX | QA Contact: | Amit Ugol <augol>
Severity: | urgent | Docs Contact: |
Priority: | unspecified | |
Version: | 7.0 (Kilo) | CC: | mburns, rhel-osp-director-maint, sbaker, shardy, yeylon
Target Milestone: | --- | Keywords: | ZStream
Target Release: | 7.0 (Kilo) | |
Hardware: | x86_64 | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2016-03-10 22:55:57 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Amit Ugol
2016-02-15 12:52:40 UTC
The stack was stuck at CREATE_IN_PROGRESS even after I had deleted all the nodes with nova delete and ironic node-delete; only a reboot of the undercloud machine freed it. Here's what the log shows:

    2016-02-15 13:21:54.843 3538 INFO heat.engine.service [req-64025314-8e6d-4f53-bb06-f7adb9a967ab 793c5d1bf48e4a20bb9172c5c8ebc765 f49fdcf77d814ed89d2c75b5f5c20bf0] Deleting stack overcloud
    ...
    2016-02-15 13:21:54.889 3538 DEBUG heat.engine.service [req-64025314-8e6d-4f53-bb06-f7adb9a967ab 793c5d1bf48e4a20bb9172c5c8ebc765 f49fdcf77d814ed89d2c75b5f5c20bf0] Successfully stopped remote task on engine e388c109-d2d7-4850-be82-ad54606134f0 delete_stack /usr/lib/python2.7/site-packages/heat/engine/service.py:952
    ...
    2016-02-15 13:21:54.937 3538 DEBUG heat.engine.stack_lock [req-64025314-8e6d-4f53-bb06-f7adb9a967ab 793c5d1bf48e4a20bb9172c5c8ebc765 f49fdcf77d814ed89d2c75b5f5c20bf0] Lock on stack d9e1b7bb-a749-4085-a547-6a36556e9d29 is owned by engine e388c109-d2d7-4850-be82-ad54606134f0 acquire /usr/lib/python2.7/site-packages/heat/engine/stack_lock.py:87
    2016-02-15 13:21:54.938 3538 DEBUG oslo_messaging.rpc.dispatcher [req-64025314-8e6d-4f53-bb06-f7adb9a967ab 793c5d1bf48e4a20bb9172c5c8ebc765 f49fdcf77d814ed89d2c75b5f5c20bf0] Expected exception during message handling (Stack overcloud already has an action (CREATE) in progress.) _dispatch_and_reply /usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py:145

So it claims to have stopped the remote task, which should result in the lock being removed, but the lock still exists when we go to acquire it again in order to do the delete.

I have seen this error message upstream on at least one gate test (http://logs.openstack.org/31/278831/1/check/gate-heat-dsvm-functional-orig-mysql/908bbfd/console.html.gz). The test was for a stable/liberty backport of code that only touched the update path, not create or delete, so it wasn't caused by that particular patch (https://review.openstack.org/#/c/278831). That one looks like it may have been caused by a DB transaction rollback, though, and there's nothing similar here.

OK, after reading the description and the log more carefully, I think I see what happened here. At some point during the stack creation, Heat lost access to the database, causing a bunch of errors, some of the unexpected variety. Heat is now actually pretty good at cleaning up after an unexpected error - where cleaning up means releasing the lock (in the database) and setting the stack state back to FAILED (in the database). There's pretty much nothing it can do about not being able to access the database, though; the first sketch at the end of this report illustrates that failure mode.

Anyway, once that has bombed out (and the database has started up again), we're left in a state where the stack is stuck IN_PROGRESS but there's no operation running on it, the lock is still present, and the engine that owns the lock is still alive, so the lock is never considered stealable (see the second sketch below). It's fairly easy to recover from this situation by just restarting heat-engine: the engine owning the lock going down will make other engines willing to steal the lock again, and a new engine starting up will locate any IN_PROGRESS stacks with stealable locks and set them back to FAILED (see the third sketch below).

I filed https://bugs.launchpad.net/heat/+bug/1555840 upstream, but I think it's safe to say that this is unlikely to get fixed in RHOS 7. The HA architecture should largely prevent this from happening on the overcloud, and it's also fairly unlikely to begin with.

Development Management has reviewed and declined this request. You may appeal this decision by reopening this request.
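For illustration, here is a minimal sketch of the cleanup path described above, under the assumption that both cleanup steps are themselves database writes. All names here (run_stack_action, DatabaseUnavailable, the db and lock objects) are hypothetical stand-ins, not Heat's actual API:

```python
# A minimal sketch, assuming hypothetical names; not Heat's actual code.

class DatabaseUnavailable(Exception):
    """Placeholder for 'the database went away' errors."""


def run_stack_action(db, stack, lock, action):
    """Run a stack operation, cleaning up after unexpected errors.

    The cleanup itself consists of two database writes, which is why a
    database outage can defeat it and leave the stack stuck.
    """
    try:
        action(stack)
    except Exception as exc:
        try:
            # Normal cleanup: record the failure and free the lock.
            db.set_stack_state(stack.id, "FAILED", reason=str(exc))
            lock.release()
        except DatabaseUnavailable:
            # The database is gone, so neither write can happen: the
            # stack stays IN_PROGRESS and the lock row survives, which
            # is exactly the state reported in this bug.
            pass
        raise
```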
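A second sketch, of the lock-acquisition rule described above: a lock may only be stolen from an engine that no longer responds. Again, the helpers (try_create_lock, steal_lock, engine_alive) are hypothetical stand-ins for Heat's real database and RPC primitives:

```python
# A minimal sketch, assuming hypothetical primitives; not Heat's actual code.

class ActionInProgress(Exception):
    """Another live engine still holds the stack lock."""


def acquire_stack_lock(db, stack_id, my_engine_id, engine_alive):
    """Take the lock on a stack, stealing it only from a dead engine.

    db           -- provides atomic try_create_lock()/steal_lock() calls
    engine_alive -- pings an engine over RPC, returns True if it responds
    """
    # Atomic insert of a lock row: returns None on success, or the
    # engine_id of the current holder if a lock row already exists.
    holder = db.try_create_lock(stack_id, my_engine_id)
    if holder is None:
        return  # lock acquired

    if engine_alive(holder):
        # The situation in this bug: the owning engine is still running
        # (it merely lost its task when the database went away), so the
        # lock is never treated as stealable and the delete is refused.
        raise ActionInProgress(stack_id, holder)

    # The holder is dead: atomically swap our engine id into the row.
    if not db.steal_lock(stack_id, holder, my_engine_id):
        # Someone else stole it first; report the conflict.
        raise ActionInProgress(stack_id, holder)
```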
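And a third sketch, of the start-up recovery path that makes restarting heat-engine work: a freshly started engine looks for IN_PROGRESS stacks whose lock holder is gone, steals the lock, and marks them FAILED. Names are again hypothetical:

```python
# A minimal sketch, assuming hypothetical names; not Heat's actual code.

def reset_stuck_stacks(db, my_engine_id, engine_alive):
    """On engine start-up, fail IN_PROGRESS stacks whose lock is stealable."""
    for stack in db.stacks_in_progress():
        holder = db.lock_holder(stack.id)
        if holder is None or engine_alive(holder):
            # No lock to steal, or a live engine really is working on
            # this stack; leave it alone.
            continue
        # The previous holder is gone (e.g. heat-engine was restarted):
        # steal the lock, mark the stack FAILED, then release the lock
        # so that a new delete or update can proceed.
        if db.steal_lock(stack.id, holder, my_engine_id):
            db.set_stack_state(stack.id, "FAILED",
                               reason="Engine went down during operation")
            db.release_lock(stack.id, my_engine_id)
```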