Bug 1372105 - Heat stack delete fails when Distributed Virtual Routing (DVR) is used, as it fails to delete the centralized_snat neutron port
Summary: Heat stack delete fails when Distributed Virtual Routing (DVR) is used, as it fails to delete the centralized_snat neutron port
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-heat
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: 10.0 (Newton)
Assignee: Zane Bitter
QA Contact: Ronnie Rasouli
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-08-31 23:40 UTC by Sai Sindhur Malleni
Modified: 2017-12-11 16:42 UTC
CC List: 14 users

Fixed In Version: openstack-heat-7.0.0-6.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-12-11 16:41:25 UTC
Target Upstream Version:
Embargoed:




Links
System ID | Private | Priority | Status | Summary | Last Updated
Launchpad 1626619 | 0 | None | None | None | 2016-09-22 15:42:27 UTC
OpenStack gerrit 394814 | 0 | None | MERGED | Make FloatingIP depend on all RouterInterfaces again | 2019-11-27 23:12:10 UTC
Red Hat Product Errata RHEA-2016:2948 | 0 | normal | SHIPPED_LIVE | Red Hat OpenStack Platform 10 enhancement update | 2016-12-14 19:55:27 UTC

Description Sai Sindhur Malleni 2016-08-31 23:40:03 UTC
Description of problem:
When a heat stack is created with two networks plugged into a router and DVR is being used in Neutron (L3 agent on each compute node), the stack is created successfully but delete fails with a message that the subnet couldn't be deleted because one or more ports couldn't be deleted. On inspecting the ports on the subnet, ports with device_owner network:dhcp and network:router_centralized_snat were left behind. Only after manually deleting the port with device_owner network:router_centralized_snat could the heat stack be deleted.

Version-Release number of selected component (if applicable):
RHOP 10
Puddle: 2016-08-24.1

How reproducible:
100%

Steps to Reproduce:
1. Deploy heat stack with two networks plugged into a router
2. Delete heat stack

Actual results:
heat stack-delete fails

Expected results:
stack should be deleted

Additional info:
traceback in heat-engine logs: https://gist.github.com/smalleni/57f3127ae103a667c6809c510f3175e9
heat stack-show of a similar stack (not the stack in the logs, but exactly the same): https://gist.github.com/smalleni/0f18da5432ec9257276bdbffbff76f9b
neutron port-show of the port that needed to be deleted: https://gist.github.com/smalleni/2472fe3c345afc84d894af7e396c5194

Comment 2 Steve Baker 2016-09-01 01:47:07 UTC
Can you please attach the template to reproduce this?

Can you also attempt to delete the stack more than once and report if later delete attempts are successful?

Comment 3 Sai Sindhur Malleni 2016-09-01 17:16:56 UTC
The template is: https://github.com/openstack/shaker/blob/master/shaker/scenarios/openstack/l3_east_west.hot

The parameters in the template are configured by the tool.

heat stack-delete doesn't work even if I loop over the command and repeatedly try to delete the stack.

Also, please note that this happens only in an environment where the router is distributed (DVR) enabled, and the port that is failing to be deleted is the centralized_snat port, which you find only in a DVR setup.

Comment 4 Zane Bitter 2016-09-01 18:11:12 UTC
First off, the quality of the template is impeccable: it explicitly references subnets everywhere, which rules out all of the usual missing-dependency candidates for a problem like this.

From the logs, the resource that's failing to be deleted is west_private_subnet.

It's not clear which port is the one it's blocking on, but it appears to have the IP address 10.1.0.12, which according to the template places it in east_private_subnet. TBH this is crazy enough even by Neutron standards that it can legitimately be considered a Neutron bug IMHO.

My best guess is that the port causing the problem is router_interface. That sounds consistent with the device owner being network:router_centralized_snat. I would try adding:

  depends_on: west_private_subnet

to router_interface and vice-versa for router_interface_2.
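
Roughly, that would look like the following. I'm guessing at the surrounding property values since I don't have the filled-in template in front of me, so treat the resource names marked below as illustrative:

  router_interface:
    type: OS::Neutron::RouterInterface
    depends_on: west_private_subnet   # created after, and deleted before, that subnet
    properties:
      router: { get_resource: east_router }           # illustrative name
      subnet: { get_resource: east_private_subnet }   # illustrative name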

We'd need to see more detailed output from the stack, showing physical resource IDs for all the ports and matching them up to the offending one in Neutron, to be able to say more definitively.

Comment 5 Zane Bitter 2016-09-07 21:31:19 UTC
Any luck with the suggestion in Comment #4?

Comment 6 Sai Sindhur Malleni 2016-09-08 16:14:55 UTC
Zane, I haven't tried the suggestion yet, but I will. However, I also wanted to bring to your notice that even with the same template, the stack delete failure isn't consistent. It does clean up in some cases, so I'm not entirely sure what's going on here.

Comment 7 Zane Bitter 2016-09-09 12:52:38 UTC
That's not entirely surprising. Resources without a dependency relationship tend to be worked on in a random order, so if a dependency is missing it generally results in random failures to delete.

Comment 8 Sai Sindhur Malleni 2016-09-14 15:42:38 UTC
After looking at this for some time, this is not a bug IMHO. I guess resource creation times differ between legacy and distributed routers, and that might cause resources to have missing dependencies when being created. Throwing depends_on at a few resources solves the issue. That being said, this makes a good case for looking at Neutron control-plane performance with DVR routers vs legacy ones.

Comment 9 Zane Bitter 2016-09-14 19:24:50 UTC
I'd be interested in knowing what depends_on relations you had to add to solve the issue. We have added implicit dependencies in the past to work around Neutron's unfortunate object model, so there is at least the potential we could do something similar here. Failing that, I'm sure you won't be the last person to run into the problem so it'd be great to have a canned answer ready :)

Comment 10 Sai Sindhur Malleni 2016-09-14 19:56:29 UTC
Zane,
All the depends_on in the following gist were added by me.
https://gist.github.com/smalleni/32b638147b334dd4ad26abdd14190254

One interesting observation here: when only one (either) of the routers was set to distributed: True, the stack was created successfully, but when both routers were set to distributed, the stack create was failing 100% of the time. Also, to reiterate, all these failures occurred and workarounds were needed only when distributed: True was set.

Comment 11 Zane Bitter 2016-09-14 20:34:02 UTC
Thanks!

So working through that list, these can't have made a difference:

- router_interface depends_on north_private_subnet: redundant because it already does { get_resource: north_private_subnet } (which adds a dependency)
- router_interface_2 depends_on south_router: redundant because it already does { get_resource: south_router } (which adds a dependency)
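
To spell out why: a { get_resource: ... } reference already creates an implicit dependency, so something like the following gains nothing from an explicit depends_on (property values here are illustrative):

  router_interface:
    type: OS::Neutron::RouterInterface
    properties:
      # this reference alone makes router_interface depend on
      # north_private_subnet; adding depends_on: north_private_subnet
      # on top of it is a no-op
      subnet: { get_resource: north_private_subnet }
      router: { get_resource: north_router }   # illustrative name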

This could have:

- south_router depends_on south_private_subnet: interesting, because if this is the one that solves it, it means that the router has to be deleted before the subnet to which it was routing, which would be bizarre given that the router *interface* connecting them has already been deleted. If this were the case I'd consider it a bug in Neutron.

But it's most likely this:

- {{ agent.id }}_floating_ip depends_on router_interface_2: the thing is, we're supposed to add this one for you:

http://git.openstack.org/cgit/openstack/heat/tree/heat/engine/resources/openstack/neutron/floatingip.py#n194

There are some known issues with that, though. The comments on https://review.openstack.org/#/c/289371/ are the best source of details; unfortunately, work on the fix appears to have stalled. In particular (quoting myself):

> The same patch also attempts to fix the issue with adding too many dependencies during create, unfortunately by causing the opposite problem: not adding _any_ dependencies during create. This will *certainly* be fatal under convergence for the reasons mentioned above - deletes will use the dependencies stored prior to the create and therefore will start failing because stuff is out of order.

The offending patch (https://review.openstack.org/#/c/167782/3/heat/engine/resources/openstack/neutron/floatingip.py) means that this dependency will be missing when the dependencies are calculated at create time, but not at delete time. However, starting with Newton/OSP10, the convergence_engine option is enabled by default. With this option on, the dependencies calculated at create time are the ones that we use to delete.

So I think this could actually be a longstanding bug that has now been uncovered in Newton.

Comment 12 Zane Bitter 2016-09-22 15:42:27 UTC
I raised a Launchpad bug to keep track of this.

Comment 13 Sai Sindhur Malleni 2016-09-22 16:00:24 UTC
Thanks Zane. Looks like adding only {{ agent.id }}_floating_ip depends_on router_interface_2 solves the issue on its own. So yes, the other changes might be unrelated.
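
For anyone else hitting this before a fix lands, the workaround boils down to the following sketch (the {{ agent.id }} prefix is shaker's Jinja templating, resolved before heat sees the template; the floating_network value is illustrative):

  '{{ agent.id }}_floating_ip':
    type: OS::Neutron::FloatingIP
    depends_on: router_interface_2   # explicit, until heat adds the implicit dependency again
    properties:
      floating_network: { get_param: external_net }   # illustrative param name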

Comment 16 Zane Bitter 2016-11-08 13:17:06 UTC
Patch merged to upstream stable/newton.

Comment 19 Amit Ugol 2016-11-22 07:37:58 UTC
I have yet to encounter this issue while testing the most recent version (from the 19th).

Comment 21 errata-xmlrpc 2016-12-14 15:55:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-2948.html

