Bug 1235098 - Isolated networks not deleted during heat stack-delete overcloud

| Field | Value | Field | Value |
|---|---|---|---|
| Product | Red Hat OpenStack | Reporter | Dan Sneddon <dsneddon> |
| Component | rhosp-director | Assignee | Jay Dobies <jason.dobies> |
| Status | CLOSED INSUFFICIENT_DATA | QA Contact | Shai Revivo <srevivo> |
| Severity | unspecified | Docs Contact | |
| Priority | medium | | |
| Version | 7.0 (Kilo) | CC | dmacpher, dsneddon, hbrock, jcoufal, jdonohue, mandreou, mburns, rhel-osp-director-maint, sasha, sbaker, shardy, zbitter |
| Target Milestone | --- | Keywords | Reopened, Triaged |
| Target Release | --- | | |
| Hardware | Unspecified | | |
| OS | Unspecified | | |
| Whiteboard | | | |
| Fixed In Version | | Doc Type | Known Issue |
| Story Points | --- | | |
| Clone Of | | Environment | |
| Last Closed | 2016-10-09 23:17:40 UTC | Type | Bug |
| Regression | --- | Mount Type | --- |
| Documentation | --- | CRM | |
| Verified Versions | | Category | --- |
| oVirt Team | --- | RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- | Target Upstream Version | |
| Embargoed | | | |
| Bug Depends On | | | |
| Bug Blocks | 1191185, 1243520 | | |

Doc Text (Known Issue):

Sometimes the stack-delete operation fails on the first attempt. When the user runs the command again, some of the isolated networks might not be deleted. This causes the next deployment to fail, but after deleting that failed stack, the following deployment succeeds. As a workaround, check that only the "ctlplane" network exists on the Undercloud before the first deployment: run "neutron net-list", and if any networks other than "ctlplane" are listed, run "neutron net-delete <UUID>" for each of them. Ensuring that only "ctlplane" exists before the initial deployment allows the deployment to succeed. (A minimal sketch of this cleanup follows below.)
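The following is a minimal, untested sketch of the documented workaround. It assumes the undercloud credentials live in ~/stackrc and relies on the default tabular output of the Kilo-era neutron CLI; the grep/awk filtering is illustrative and not part of the director tooling.

```
# Hedged sketch: remove every network except "ctlplane" from the undercloud
# before the first overcloud deployment, as described in the Doc Text above.
# Assumption: undercloud credentials are in ~/stackrc.
source ~/stackrc

# Show what is currently registered; only "ctlplane" should remain.
neutron net-list

# Delete any leftover isolated networks by UUID. The awk pattern picks the
# UUID column from data rows of the table output (skipping header/border
# lines), and the grep excludes the ctlplane row so the provisioning network
# is never touched.
for uuid in $(neutron net-list | grep -v ' ctlplane ' | awk '/^\| [0-9a-f]/ {print $2}'); do
    neutron net-delete "$uuid"
done

# Verify that only "ctlplane" is left before running the overcloud deploy.
neutron net-list
```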
Description
Dan Sneddon
2015-06-24 01:48:09 UTC
This error seems to be somewhat erratic. The first time I deleted the stack I got DELETE_FAILED, and then had to run heat stack-delete again. I was left with extra networks. When I tried to deploy again, it failed immediately. But after running heat stack-delete overcloud again, all the extra networks were deleted except for "internal_api".

    [stack@host01 ~]$ heat stack-list
    +--------------------------------------+------------+---------------+----------------------+
    | id                                   | stack_name | stack_status  | creation_time        |
    +--------------------------------------+------------+---------------+----------------------+
    | 8af0f35f-aaa3-4322-808f-f2590434a100 | overcloud  | CREATE_FAILED | 2015-06-24T01:41:57Z |
    +--------------------------------------+------------+---------------+----------------------+

    [stack@host01 ~]$ heat stack-delete overcloud
    +--------------------------------------+------------+--------------------+----------------------+
    | id                                   | stack_name | stack_status       | creation_time        |
    +--------------------------------------+------------+--------------------+----------------------+
    | 8af0f35f-aaa3-4322-808f-f2590434a100 | overcloud  | DELETE_IN_PROGRESS | 2015-06-24T01:41:57Z |
    +--------------------------------------+------------+--------------------+----------------------+

    [stack@host01 ~]$ heat stack-list
    +----+------------+--------------+---------------+
    | id | stack_name | stack_status | creation_time |
    +----+------------+--------------+---------------+
    +----+------------+--------------+---------------+

    [stack@host01 ~]$ neutron net-list
    +--------------------------------------+--------------+----------------------------------------------------+
    | id                                   | name         | subnets                                            |
    +--------------------------------------+--------------+----------------------------------------------------+
    | ac67e69d-b94f-421f-a99f-94b18f6298b1 | internal_api | 49267c29-c5ef-454d-822a-b958068170de 172.17.0.0/24 |
    | 92b45bea-acdc-497a-b8af-82d4c3791547 | ctlplane     | ad89853f-6428-4853-9bd7-e63be4f14815 10.8.146.0/24 |
    +--------------------------------------+--------------+----------------------------------------------------+

Could you please attach the heat-engine log, plus the output of the following command after the first DELETE_FAILED:

    heat event-list -n3 overcloud

If the stack is already deleted, you may still be able to get the event list using the UUID:

    heat event-list -n3 8af0f35f-aaa3-4322-808f-f2590434a100

(In reply to Steve Baker from comment #4) Sorry, I've blown away the environment. After I got the first CREATE_FAILED, I deleted the stack and the next deploy worked. In fact, I've been deleting and redeploying the stack and haven't run into this issue again. I think the networks only stick around if the stack delete fails in a particular way, but they do get cleaned up when the stack delete works.

OK, that makes more sense. The networks won't be deleted until the servers are, due to the dependencies.

This bug is pretty hard to trigger. You have to have the stack-delete fail due to one of the systems still being powered on. The behavior then is that the networks don't get created and the next stack create fails. But if you delete that stack, it successfully deletes all networks and the next deployment succeeds. I don't think we can solve for every particular edge case that might hang up our tooling. The important thing is that we get the cleanup right. In all of my testing, I created several dozen overclouds, many of which failed to deploy, but I only ran into this bug once and it was easy to get past.
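For reference, a minimal sketch of collecting the diagnostics requested earlier in this thread after a DELETE_FAILED. The heat-engine log path is the usual undercloud default and is an assumption, not something confirmed in this report.

```
# Hedged sketch: gather the diagnostics requested above after a DELETE_FAILED.
# Assumption: the heat-engine log lives in the default undercloud location.
STACK=overcloud                                   # or the stack UUID if the stack is already gone
heat event-list -n3 "$STACK" > overcloud-events.txt
sudo cp /var/log/heat/heat-engine.log ~/heat-engine.log
```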
I'm going to close this as not a bug for now, since heat stack-delete did properly clean up everything on the second try.

I'm reopening this bug. Others have run into it. I hit this while working with a customer on a beta 2 deployment, and several testers have run into it in the last few days. The behavior is that sometimes, when deleting the stack, the stack goes into DELETE_FAILED. When you run stack delete again, it doesn't necessarily clean up all the networks. Then, when you redeploy, you run into an error like this:

    Resource CREATE failed: Conflict: Unable to create the flat network. Physical network external is in use

We need to make sure that we are deleting all the networks except for ctlplane when we delete the stack, even when completing a previously failed delete operation.

So I've seen similar failures, and it seems to be a dependency issue, or maybe a race where neutron says something is deleted before it actually is. For example, after a stack-delete ends up DELETE_FAILED, I see:

    Conflict: Unable to complete operation on subnet 6e54bc78-846a-400d-b7c5-619b15e97cd2. One or more ports have an IP allocation from this subnet.

    $ neutron subnet-list
    +--------------------------------------+----------------+---------------+------------------------------------------------+
    | id                                   | name           | cidr          | allocation_pools                               |
    +--------------------------------------+----------------+---------------+------------------------------------------------+
    | c986939c-dae9-4d68-98ef-0c1116d03e96 |                | 192.0.2.0/24  | {"start": "192.0.2.5", "end": "192.0.2.24"}    |
    | 6e54bc78-846a-400d-b7c5-619b15e97cd2 | storage_subnet | 172.16.1.0/24 | {"start": "172.16.1.4", "end": "172.16.1.250"} |
    +--------------------------------------+----------------+---------------+------------------------------------------------+

So we can see that in this case, either we didn't attempt to delete the ports before the subnet, or we claimed they were fully deleted when really they were not. Looking at the ports seems to indicate the latter, e.g. a race of some sort, because I don't see any port assigned to storage_subnet:

    $ neutron port-list
    +--------------------------------------+------+-------------------+-----------------------------------------------------------------------------------+
    | id                                   | name | mac_address       | fixed_ips                                                                         |
    +--------------------------------------+------+-------------------+-----------------------------------------------------------------------------------+
    | a529f0bb-60c6-4d36-bdd0-4a71c34c27b8 |      | fa:16:3e:e3:b2:f8 | {"subnet_id": "c986939c-dae9-4d68-98ef-0c1116d03e96", "ip_address": "192.0.2.5"} |
    +--------------------------------------+------+-------------------+-----------------------------------------------------------------------------------+

This is still happening with the latest poodle as of 2015-07-16. We need to either address this bug with a fix or make sure there is a doc fix, because it is happening often.

It sounds like applying the upstream patch that Steve mentioned for bug 1242796 would likely avoid getting the user into the scenario that triggers the bug (i.e. needing to issue multiple deletes). Also, I wonder if this is related to bug 1228324. Some heat-engine logs would be helpful if you have them.

Got affected by it several times.
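A minimal sketch of the manual cleanup implied by the discussion above, for when a repeated stack-delete still leaves isolated networks behind. It assumes undercloud credentials in ~/stackrc; the subnet UUID and the "storage" network name are taken from the example output above and should be replaced with whatever neutron net-list / subnet-list actually report. This is a workaround sketch, not part of the director tooling.

```
# Hedged sketch: clear a leftover isolated network whose subnet delete is
# blocked by "One or more ports have an IP allocation from this subnet".
source ~/stackrc

SUBNET=6e54bc78-846a-400d-b7c5-619b15e97cd2   # leftover subnet UUID from the error above

# Find ports that still reference the subnet (the suspected race can leave
# these behind even though the stack resources are reported deleted).
neutron port-list | grep "$SUBNET"

# Delete those ports, then the network itself. Do NOT touch ctlplane or the
# port holding the undercloud's own ctlplane address.
for port in $(neutron port-list | grep "$SUBNET" | awk '{print $2}'); do
    neutron port-delete "$port"
done
neutron net-delete storage   # example network name; deleting it also removes storage_subnet
```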
The issues sometimes manifest themselves in unexpected (for the admin/user) places. Running:

    neutron net-create public --provider:network_type flat --provider:physical_network physnet-external --router:external

resulted in:

    Invalid input for operation: physical_network 'physnet-external' unknown for flat provider network

Update, as I came across this today while chasing https://bugzilla.redhat.com/show_bug.cgi?id=1250546 - it seems (in a VM environment) the 1 compute/1 controller deploy case makes this bug more likely to occur:

    openstack overcloud deploy --plan overcloud -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/net-single-nic-with-vlans.yaml

In 4 out of 5 runs of this, the heat stack-delete failed on the Networks resource (and networks were left behind). Will try to revisit later.

Anybody still hitting the issue? It seems this is not reproduced anymore. Please re-open if it is still happening.