Bug 1263816
| Summary: | Stack-delete of overcloud does not remove instance UUID from nodes |
|---|---|
| Product: | Red Hat OpenStack |
| Component: | openstack-nova |
| Status: | CLOSED ERRATA |
| Severity: | medium |
| Priority: | high |
| Version: | 7.0 (Kilo) |
| Target Milestone: | rc |
| Target Release: | 10.0 (Newton) |
| Hardware: | All |
| OS: | Linux |
| Fixed In Version: | openstack-nova-14.0.1-1.el7ost |
| Doc Type: | Bug Fix |
| Reporter: | Joe Talerico <jtaleric> |
| Assignee: | Lucas Alvares Gomes <lmartins> |
| QA Contact: | Raviv Bar-Tal <rbartal> |
| Keywords: | Triaged |
| CC: | akrzos, berrange, dasmith, dcain, dnavale, eglynn, gdrapeau, jcoufal, jjoyce, jslagle, jtrowbri, kchamart, mburns, mcornea, nlevinki, racedoro, rcernin, rhel-osp-director-maint, sbauza, sferdjao, sgordon, srevivo, vromanso |
| Type: | Bug |
| Last Closed: | 2016-12-14 15:15:53 UTC |

Doc Text:

> Previously, the nova ironic virt driver wrote an instance UUID to the Bare Metal Provisioning (ironic) node before starting a deployment. If something failed between writing the UUID and starting the deployment, the Compute service did not remove the UUID after the failed spawn. As a result, the ironic node kept an instance UUID set and was not picked for another deployment.
>
> With this update, if spawning an instance fails at any stage of the deployment, the ironic virt driver ensures that the instance UUID is cleaned up. As a result, nodes no longer keep a stale instance UUID and are picked up for new deployments.
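The Doc Text above describes the fix as a cleanup-on-failure step in the driver's spawn path. Below is a minimal sketch of that pattern; `FakeIronicClient`, `spawn`, and `deploy` are illustrative stand-ins, not the actual nova.virt.ironic code.

```python
class FakeIronicClient:
    """Stand-in for the ironic client; tracks one node's instance_uuid."""

    def __init__(self):
        self.instance_uuid = None  # None means the node is schedulable

    def set_instance_uuid(self, uuid):
        self.instance_uuid = uuid

    def remove_instance_uuid(self):
        self.instance_uuid = None


def spawn(client, instance_uuid, deploy):
    """Claim the node by writing the instance UUID, then deploy.

    The fix described above: if the deployment fails at any stage,
    clear instance_uuid so the node returns to the available pool
    instead of staying claimed by a dead instance.
    """
    client.set_instance_uuid(instance_uuid)
    try:
        deploy()
    except Exception:
        # Clean up the claim before re-raising the spawn failure.
        client.remove_instance_uuid()
        raise
```

Before the fix, this cleanup was missing for failures that happened between writing the UUID and starting the deployment, which is exactly the stale-UUID state shown in the comments below.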
Description
Joe Talerico 2015-09-16 19:12:11 UTC

From discussing this on IRC, a simple reproducer for this would be:

1. Launch a bare-metal instance
2. Kill the ironic-conductor service
3. Delete the instance from nova
4. Restart the ironic-conductor service

This will leave the instance with an instance_uuid that cannot be removed without editing the database directly. These steps just show the issue in a simple way; the actual issue is that `yum update` causes the conductor service to crash, so `yum update; heat stack-delete overcloud` leads to the same behavior.

Just hit this in OSP8.

(In reply to John Trowbridge from comment #5)

Hi John, if you do that, the instances will continue to be marked as active in Ironic, right? That would require people to delete them from Ironic manually by mimicking what the nova driver does:

    $ ironic node-set-provision-state <node uuid> deleted

And to remove the instance_uuid:

    $ ironic node-update <node uuid> remove instance_uuid

Does that work for you?

This bug did not make the OSP 8.0 release. It is being deferred to OSP 10.
I also encountered this with OSPd9 after deleting an overcloud via `openstack stack delete overcloud`:

    [stack@gprfc007 ~]$ ironic node-list
    +--------------------------------------+------+--------------------------------------+-------------+--------------------+-------------+
    | UUID                                 | Name | Instance UUID                        | Power State | Provisioning State | Maintenance |
    +--------------------------------------+------+--------------------------------------+-------------+--------------------+-------------+
    | 3c9d0f77-3a8a-4621-be8a-59662c58396f | None | 35920bdc-254b-4bc0-a31c-c7863441613e | power off   | available          | False       |
    | 700f7ebd-29f4-419e-80df-68da58f13d3b | None | None                                 | power off   | available          | False       |
    | 6a95b0af-4320-4b6e-9924-da8c343a5174 | None | None                                 | power off   | available          | False       |
    | e618b0d5-ba09-46ef-a074-1d543fb9a892 | None | None                                 | power off   | available          | False       |
    | c97ce129-ee54-4108-8481-4ede8ead7f70 | None | None                                 | power off   | available          | False       |
    +--------------------------------------+------+--------------------------------------+-------------+--------------------+-------------+

I got around this by running a workaround provided by Joe:

    [stack@gprfc007 ~]$ ironic node-update 3c9d0f77-3a8a-4621-be8a-59662c58396f remove instance_uuid

I would definitely chalk this up as inconsistent to reproduce, as I had deleted and redeployed several times this past week without any issue until today.

Created attachment 1173547 [details]
OSPD9 Ironic logs
The 'ironic node-update remove instance_uuid' workaround doesn't work for me (on OSPd9):
    [stack@undercloud ~]$ ironic node-list
    +--------------------------------------+------+--------------------------------------+-------------+--------------------+-------------+
    | UUID                                 | Name | Instance UUID                        | Power State | Provisioning State | Maintenance |
    +--------------------------------------+------+--------------------------------------+-------------+--------------------+-------------+
    | 2dd880f3-63e2-419a-a604-0b667625bb0e | None | None                                 | power off   | available          | False       |
    | 682dd8c0-5710-4d5e-b95d-e158ee051ab2 | None | None                                 | power off   | available          | False       |
    | 6a00b395-15e8-4621-b994-25c1af4ec8ee | None | None                                 | power off   | available          | False       |
    | 13c2862c-83f9-47a8-b4f9-9be78df7fae1 | None | 2e6bc741-e711-4c3d-a067-7857bdb7beee | power off   | available          | False       |
    | 0a8c9295-89fb-49f4-9ff5-7cc14c44a542 | None | None                                 | power off   | available          | False       |
    | 6d1a6dbb-5748-4311-845d-86ddb6fc26f0 | None | 0cbbeb4d-16f8-41f3-8f7c-c6dbed59c954 | power off   | available          | False       |
    | bece1995-2edd-4fa6-bb79-f2730df4a461 | None | None                                 | power off   | available          | False       |
    | 774da9a9-cbff-4fe9-a1e2-2e3287d125f7 | None | None                                 | power off   | available          | True        |
    +--------------------------------------+------+--------------------------------------+-------------+--------------------+-------------+
    [stack@undercloud ~]$ ironic node-update 13c2862c-83f9-47a8-b4f9-9be78df7fae1 remove 2e6bc741-e711-4c3d-a067-7857bdb7beee
    Couldn't apply patch '[{'path': '/2e6bc741-e711-4c3d-a067-7857bdb7beee', 'op': 'remove'}]'. Reason: u'2e6bc741-e711-4c3d-a067-7857bdb7beee' (HTTP 400)
    [stack@undercloud ~]$ ironic node-update 6d1a6dbb-5748-4311-845d-86ddb6fc26f0 remove 0cbbeb4d-16f8-41f3-8f7c-c6dbed59c954
    Couldn't apply patch '[{'path': '/0cbbeb4d-16f8-41f3-8f7c-c6dbed59c954', 'op': 'remove'}]'. Reason: u'0cbbeb4d-16f8-41f3-8f7c-c6dbed59c954' (HTTP 400)
    [stack@undercloud ~]$
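The HTTP 400 errors above come from passing the instance's UUID where the field name belongs: `node-update <node> remove <field>` is translated into a JSON-PATCH "remove" operation whose path is the field name. A small sketch of that construction follows; `build_remove_patch` is a hypothetical helper for illustration, not ironicclient's actual internals.

```python
def build_remove_patch(field):
    """Build the JSON-PATCH document sent by `ironic node-update <node> remove <field>`."""
    return [{"op": "remove", "path": "/" + field}]


# Correct: the field name clears the instance_uuid field.
print(build_remove_patch("instance_uuid"))
# Incorrect: passing the instance's UUID targets a nonexistent field,
# which the API rejects with HTTP 400, matching the errors above.
print(build_remove_patch("2e6bc741-e711-4c3d-a067-7857bdb7beee"))
```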
(In reply to Karthik Prabhakar from comment #12)

> The 'ironic node-update remove instance_uuid' workaround doesn't work for me (on OSPd9):
> [...]
> Couldn't apply patch '[{'path': '/0cbbeb4d-16f8-41f3-8f7c-c6dbed59c954', 'op': 'remove'}]'. Reason: u'0cbbeb4d-16f8-41f3-8f7c-c6dbed59c954' (HTTP 400)

The command is incorrect. The correct way to clean out the instance_uuid field is:

    $ ironic node-update <node uuid> remove instance_uuid

instance_uuid is the name of the field; it should not be replaced with the actual UUID of the instance.

There's currently a patch for review upstream in Nova that seems to address this problem: https://review.openstack.org/#/c/341253/7

The patch is in Nova rather than Ironic because the ironic driver in nova is the one responsible for setting (and now cleaning up) the instance_uuid in case the deployment fails before it hits Ironic.

Ran into this problem in OSP9; Joe's suggestion worked for me. Quite an irritating problem for a customer to have. Will this be fixed by OSP10?

    [stack@refarch-ospd ~]$ nova list
    +----+------+--------+------------+-------------+----------+
    | ID | Name | Status | Task State | Power State | Networks |
    +----+------+--------+------------+-------------+----------+
    +----+------+--------+------------+-------------+----------+
    [stack@refarch-ospd ~]$ ironic node-list
    +--------------------------------------+---------+--------------------------------------+-------------+--------------------+-------------+
    | UUID                                 | Name    | Instance UUID                        | Power State | Provisioning State | Maintenance |
    +--------------------------------------+---------+--------------------------------------+-------------+--------------------+-------------+
    | 57e44040-5feb-42fd-8cbd-5b927802af46 | r630-02 | 4f2c4b38-71f9-4c89-98b1-95410efa2cbd | power off   | available          | False       |
    +--------------------------------------+---------+--------------------------------------+-------------+--------------------+-------------+
    [stack@refarch-ospd ~]$ ironic node-delete r630-02
    Failed to delete node r630-02: Node 57e44040-5feb-42fd-8cbd-5b927802af46 is associated with instance 4f2c4b38-71f9-4c89-98b1-95410efa2cbd. (HTTP 409)
    [stack@refarch-ospd ~]$ ironic node-update r630-02 remove instance_uuid
    [stack@refarch-ospd ~]$ ironic node-list
    +--------------------------------------+---------+---------------+-------------+--------------------+-------------+
    | UUID                                 | Name    | Instance UUID | Power State | Provisioning State | Maintenance |
    +--------------------------------------+---------+---------------+-------------+--------------------+-------------+
    | 57e44040-5feb-42fd-8cbd-5b927802af46 | r630-02 | None          | power off   | available          | False       |
    +--------------------------------------+---------+---------------+-------------+--------------------+-------------+
    [stack@refarch-ospd ~]$ ironic node-delete r630-02
    Deleted node r630-02

Lucas, the patch seems to be merged. Can you please update the bz status?

(In reply to Jaromir Coufal from comment #16)

> Lucas, the patch seems to be merged. Can you please update the bz status?

Hi Jaromir, cool! I've checked and the patch is already present in the "rhos-10.0-patches" branch for nova.

When trying the reproduce steps, I found that the behaviour has changed: deleting the stack or the nova instance while ironic-conductor is down now moves the stack/instance to DELETE_FAILED in the stack list and ERROR in the nova list. Once ironic-conductor is started and the delete command is run again, the stack/nodes are deleted and the instance UUID is removed from the ironic node.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-2948.html