Bug 1235098 - Isolated networks not deleted during heat stack-delete overcloud

| Field | Value | Field | Value |
|---|---|---|---|
| Product | Red Hat OpenStack | Reporter | Dan Sneddon <dsneddon> |
| Component | rhosp-director | Assignee | Jay Dobies <jason.dobies> |
| Status | CLOSED INSUFFICIENT_DATA | QA Contact | Shai Revivo <srevivo> |
| Severity | unspecified | Docs Contact | |
| Priority | medium | | |
| Version | 7.0 (Kilo) | CC | dmacpher, dsneddon, hbrock, jcoufal, jdonohue, mandreou, mburns, rhel-osp-director-maint, sasha, sbaker, shardy, zbitter |
| Target Milestone | --- | Keywords | Reopened, Triaged |
| Target Release | --- | | |
| Hardware | Unspecified | | |
| OS | Unspecified | | |
| Whiteboard | | | |
| Fixed In Version | | Doc Type | Known Issue |
| Story Points | --- | | |
| Clone Of | | Environment | |
| Last Closed | 2016-10-09 23:17:40 UTC | Type | Bug |
| Regression | --- | Mount Type | --- |
| Documentation | --- | CRM | |
| Verified Versions | | Category | --- |
| oVirt Team | --- | RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- | Target Upstream Version | |
| Embargoed | | | |
| Bug Depends On | | | |
| Bug Blocks | 1191185, 1243520 | | |

Doc Text (Known Issue):

Sometimes the stack-delete operation fails on the first attempt. When the user runs the command again, some of the isolated networks might not be deleted. This causes the next deployment to fail, but after deleting that failed stack, the following deployment succeeds. As a workaround, check that only the "ctlplane" network exists on the Undercloud before the first deployment: run "neutron net-list", and if any networks other than "ctlplane" are listed, run "neutron net-delete <UUID>" for each of them. Ensuring that only "ctlplane" exists before the initial deployment allows the deployment to succeed. (A minimal sketch of this cleanup follows below.)
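The following is a minimal, untested sketch of the documented workaround. It assumes the undercloud credentials live in ~/stackrc and relies on the default tabular output of the Kilo-era neutron CLI; the grep/awk filtering is illustrative and not part of the director tooling.

```
# Hedged sketch: remove every network except "ctlplane" from the undercloud
# before the first overcloud deployment, as described in the Doc Text above.
# Assumption: undercloud credentials are in ~/stackrc.
source ~/stackrc

# Show what is currently registered; only "ctlplane" should remain.
neutron net-list

# Delete any leftover isolated networks by UUID. The awk pattern picks the
# UUID column from data rows of the table output (skipping header/border
# lines), and the grep excludes the ctlplane row so the provisioning network
# is never touched.
for uuid in $(neutron net-list | grep -v ' ctlplane ' | awk '/^\| [0-9a-f]/ {print $2}'); do
    neutron net-delete "$uuid"
done

# Verify that only "ctlplane" is left before running the overcloud deploy.
neutron net-list
```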
Description
Dan Sneddon
2015-06-24 01:48:09 UTC
This error seems to be somewhat erratic. The first time I deleted the stack I got DELETE_FAILED, and then had to run heat stack-delete again. I was left with extra networks. When I tried to deploy again, it failed immediately. But after running heat stack-delete overcloud again, all the extra networks were deleted except for "internal_api".

    [stack@host01 ~]$ heat stack-list
    +--------------------------------------+------------+---------------+----------------------+
    | id                                   | stack_name | stack_status  | creation_time        |
    +--------------------------------------+------------+---------------+----------------------+
    | 8af0f35f-aaa3-4322-808f-f2590434a100 | overcloud  | CREATE_FAILED | 2015-06-24T01:41:57Z |
    +--------------------------------------+------------+---------------+----------------------+

    [stack@host01 ~]$ heat stack-delete overcloud
    +--------------------------------------+------------+--------------------+----------------------+
    | id                                   | stack_name | stack_status       | creation_time        |
    +--------------------------------------+------------+--------------------+----------------------+
    | 8af0f35f-aaa3-4322-808f-f2590434a100 | overcloud  | DELETE_IN_PROGRESS | 2015-06-24T01:41:57Z |
    +--------------------------------------+------------+--------------------+----------------------+

    [stack@host01 ~]$ heat stack-list
    +----+------------+--------------+---------------+
    | id | stack_name | stack_status | creation_time |
    +----+------------+--------------+---------------+
    +----+------------+--------------+---------------+

    [stack@host01 ~]$ neutron net-list
    +--------------------------------------+--------------+----------------------------------------------------+
    | id                                   | name         | subnets                                            |
    +--------------------------------------+--------------+----------------------------------------------------+
    | ac67e69d-b94f-421f-a99f-94b18f6298b1 | internal_api | 49267c29-c5ef-454d-822a-b958068170de 172.17.0.0/24 |
    | 92b45bea-acdc-497a-b8af-82d4c3791547 | ctlplane     | ad89853f-6428-4853-9bd7-e63be4f14815 10.8.146.0/24 |
    +--------------------------------------+--------------+----------------------------------------------------+

Could you please attach the heat-engine log, plus the output of the following command after the first DELETE_FAILED:

    heat event-list -n3 overcloud

If the stack is already deleted, you may still be able to get the event list using the UUID:

    heat event-list -n3 8af0f35f-aaa3-4322-808f-f2590434a100

(In reply to Steve Baker from comment #4) Sorry, I've blown away the environment. After I got the first CREATE_FAILED, I deleted the stack and the next deploy worked. In fact, I've been deleting and redeploying the stack and haven't run into this issue again. I think the networks only stick around if the stack delete fails in a particular way, but they do get cleaned up when the stack delete works.

OK, that makes more sense. The networks won't be deleted until the servers are, due to the dependencies.

This bug is pretty hard to trigger. You have to have the stack-delete fail due to one of the systems still being powered on. The behavior then is that the networks don't get created and the next stack create fails. But if you delete that stack, it successfully deletes all networks and the next deployment succeeds. I don't think we can solve for every particular edge case that might hang up our tooling. The important thing is that we get the cleanup right. In all of my testing, I created several dozen overclouds, many of which failed to deploy, but I only ran into this bug once and it was easy to get past.
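For reference, a minimal sketch of collecting the diagnostics requested earlier in this thread after a DELETE_FAILED. The heat-engine log path is the usual undercloud default and is an assumption, not something confirmed in this report.

```
# Hedged sketch: gather the diagnostics requested above after a DELETE_FAILED.
# Assumption: the heat-engine log lives in the default undercloud location.
STACK=overcloud                                   # or the stack UUID if the stack is already gone
heat event-list -n3 "$STACK" > overcloud-events.txt
sudo cp /var/log/heat/heat-engine.log ~/heat-engine.log
```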
I'm going to close this as not a bug for now, since heat stack-delete did properly clean up everything on the second try.

I'm reopening this bug. Others have run into it. I hit this while working with a customer on a beta 2 deployment, and several testers have run into it in the last few days. The behavior is that sometimes, when deleting the stack, the stack goes into DELETE_FAILED. When you run stack delete again, it doesn't necessarily clean up all the networks. Then, when you redeploy, you run into an error like this:

    Resource CREATE failed: Conflict: Unable to create the flat network. Physical network external is in use

We need to make sure that we are deleting all the networks except for ctlplane when we delete the stack, even when completing a previously failed delete operation.

So I've seen similar failures, and it seems to be a dependency issue, or maybe a race where neutron says something is deleted before it actually is. For example, after a stack-delete ends up DELETE_FAILED, I see:

    Conflict: Unable to complete operation on subnet 6e54bc78-846a-400d-b7c5-619b15e97cd2. One or more ports have an IP allocation from this subnet.

    $ neutron subnet-list
    +--------------------------------------+----------------+---------------+------------------------------------------------+
    | id                                   | name           | cidr          | allocation_pools                               |
    +--------------------------------------+----------------+---------------+------------------------------------------------+
    | c986939c-dae9-4d68-98ef-0c1116d03e96 |                | 192.0.2.0/24  | {"start": "192.0.2.5", "end": "192.0.2.24"}    |
    | 6e54bc78-846a-400d-b7c5-619b15e97cd2 | storage_subnet | 172.16.1.0/24 | {"start": "172.16.1.4", "end": "172.16.1.250"} |
    +--------------------------------------+----------------+---------------+------------------------------------------------+

So we can see that in this case, either we didn't attempt to delete the ports before the subnet, or we claimed they were fully deleted when really they were not. Looking at the ports seems to indicate the latter, e.g. a race of some sort, because I don't see any port assigned to storage_subnet:

    $ neutron port-list
    +--------------------------------------+------+-------------------+-----------------------------------------------------------------------------------+
    | id                                   | name | mac_address       | fixed_ips                                                                         |
    +--------------------------------------+------+-------------------+-----------------------------------------------------------------------------------+
    | a529f0bb-60c6-4d36-bdd0-4a71c34c27b8 |      | fa:16:3e:e3:b2:f8 | {"subnet_id": "c986939c-dae9-4d68-98ef-0c1116d03e96", "ip_address": "192.0.2.5"} |
    +--------------------------------------+------+-------------------+-----------------------------------------------------------------------------------+

This is still happening with the latest poodle as of 2015-07-16. We need to either address this bug with a fix or make sure there is a doc fix, because it is happening often.

It sounds like applying the upstream patch that Steve mentioned for bug 1242796 would likely avoid getting the user into the scenario that triggers the bug (i.e. needing to issue multiple deletes). Also, I wonder if this is related to bug 1228324. Some heat-engine logs would be helpful if you have them.

Got affected by it several times.
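A minimal sketch of the manual cleanup implied by the discussion above, for when a repeated stack-delete still leaves isolated networks behind. It assumes undercloud credentials in ~/stackrc; the subnet UUID and the "storage" network name are taken from the example output above and should be replaced with whatever neutron net-list / subnet-list actually report. This is a workaround sketch, not part of the director tooling.

```
# Hedged sketch: clear a leftover isolated network whose subnet delete is
# blocked by "One or more ports have an IP allocation from this subnet".
source ~/stackrc

SUBNET=6e54bc78-846a-400d-b7c5-619b15e97cd2   # leftover subnet UUID from the error above

# Find ports that still reference the subnet (the suspected race can leave
# these behind even though the stack resources are reported deleted).
neutron port-list | grep "$SUBNET"

# Delete those ports, then the network itself. Do NOT touch ctlplane or the
# port holding the undercloud's own ctlplane address.
for port in $(neutron port-list | grep "$SUBNET" | awk '{print $2}'); do
    neutron port-delete "$port"
done
neutron net-delete storage   # example network name; deleting it also removes storage_subnet
```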
The issues sometimes manifest themselves in unexpected (for the admin/user) places. Running:

    neutron net-create public --provider:network_type flat --provider:physical_network physnet-external --router:external

resulted in:

    Invalid input for operation: physical_network 'physnet-external' unknown for flat provider network

Update, as I came across this today while chasing https://bugzilla.redhat.com/show_bug.cgi?id=1250546 - it seems (in a VM environment) the 1 compute/1 controller deploy case makes this bug more likely to occur:

    openstack overcloud deploy --plan overcloud -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/net-single-nic-with-vlans.yaml

In 4 out of 5 runs of this, the heat stack-delete failed on the Networks resource (and networks were left behind). Will try to revisit later.

Anybody still hitting the issue? It seems this is not reproduced anymore. Please re-open if it is still happening.