Description of problem:
When a heat stack is created with two networks plugged into a router, and DVR is being used in Neutron (L3 agent on each compute), the stack is created successfully but delete fails with a message that the subnet couldn't be deleted because one or more ports couldn't be deleted. On inspecting the ports on the subnet, ports with device_owner network:dhcp and network:router_centralized_snat were left behind. Only after manually deleting the port with device_owner network:router_centralized_snat could the heat stack be deleted.

Version-Release number of selected component (if applicable):
RHOP 10 Puddle: 2016-08-24.1

How reproducible:
100%

Steps to Reproduce:
1. Deploy a heat stack with two networks plugged into a router
2. Delete the heat stack

Actual results:
heat stack-delete fails

Expected results:
The stack should be deleted

Additional info:
Traceback in heat-engine logs: https://gist.github.com/smalleni/57f3127ae103a667c6809c510f3175e9
heat stack-show of a similar stack (not the stack in the logs, but exactly the same): https://gist.github.com/smalleni/0f18da5432ec9257276bdbffbff76f9b
neutron port-show of the port that needed to be deleted: https://gist.github.com/smalleni/2472fe3c345afc84d894af7e396c5194
Can you please attach the template to reproduce this? Can you also attempt to delete the stack more than once and report if later delete attempts are successful?
The template is: https://github.com/openstack/shaker/blob/master/shaker/scenarios/openstack/l3_east_west.hot

The parameters in the template are configured by the tool. heat stack-delete doesn't work even if I loop over the command and repeatedly try to delete the stack. Also, please note that this happens only in an environment where the router has distributed (DVR) enabled, and the port that is failing to be deleted is the centralized_snat port, which you find only in a DVR setup.
First off, the quality of the template is impeccable, and it explicitly references subnets everywhere, so that rules out all of the usual missing-dependency candidates for a problem like this.

From the logs, the resource that's failing to be deleted is west_private_subnet. It's not clear which port it's blocking on, but it appears to have the IP address 10.1.0.12, which according to the template places it in east_private_subnet. TBH this is crazy enough even by Neutron standards that it legitimately can be considered a Neutron bug IMHO.

My best guess is that the port causing the problem is router_interface. That sounds consistent with the device owner being network:router_centralized_snat. I would try adding:

    depends_on: west_private_subnet

to router_interface, and vice-versa for router_interface_2. We'd need to see more detailed output from the stack, showing physical resource IDs for all the ports and matching them up to the offending one in Neutron, to be able to say anything more definitive.
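For readers less familiar with HOT, the suggestion above would look roughly like the following in the template. This is a sketch only: the router and subnet property values here are illustrative assumptions, not the verbatim contents of l3_east_west.hot.

```yaml
# Sketch of the suggested workaround (assumed property values, not the
# actual shaker template). The explicit depends_on forces Heat to delete
# the router interface before the subnet it connects to.
resources:
  router_interface:
    type: OS::Neutron::RouterInterface
    depends_on: west_private_subnet      # suggested extra ordering hint
    properties:
      router_id: { get_resource: west_router }          # assumed name
      subnet_id: { get_resource: west_private_subnet }
```

Note that `{ get_resource: ... }` already creates an implicit dependency on the referenced resource, so this explicit `depends_on` only matters if some other ordering problem is in play.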
Any luck with the suggestion in Comment #4?
Zane, I haven't tried the suggestion yet, but I will. However, I also wanted to bring to your notice that even with the same template, the stack delete failure isn't consistent. It does clean up in some cases, so I'm not entirely sure what's going on here.
That's not entirely surprising. Resources without a dependency relationship tend to be worked on in a random order, so if a dependency is missing it generally results in random failures to delete.
After looking at this for some time, I don't think this is a bug, IMHO. I suspect the resource creation time differs when a router is legacy vs. distributed, and that might cause resources to be missing dependencies when being created. Throwing depends_on at a few resources solves the issue. That said, this makes a good case for looking at control plane performance in Neutron with DVR routers vs. legacy routers.
I'd be interested in knowing what depends_on relations you had to add to solve the issue. We have added implicit dependencies in the past to work around Neutron's unfortunate object model, so there is at least the potential we could do something similar here. Failing that, I'm sure you won't be the last person to run into the problem so it'd be great to have a canned answer ready :)
Zane, all the depends_on entries in the following gist were added by me: https://gist.github.com/smalleni/32b638147b334dd4ad26abdd14190254

One interesting observation: when only one (either) of the routers was set to distributed: True, the stack was created successfully, but when both routers were set to distributed, the stack create was failing 100% of the time. Also, to reiterate, all these failures occurred and workarounds were needed only when distributed: True was set.
Thanks! So, working through that list, these can't have made a difference:

- router_interface depends_on north_private_subnet: redundant, because it already does { get_resource: north_private_subnet } (which adds a dependency)
- router_interface_2 depends_on south_router: redundant, because it already does { get_resource: south_router } (which adds a dependency)

This could have:

- south_router depends_on south_private_subnet: interesting, because if this is the one that solves it, it means that the router has to be deleted before the subnet to which it was routing, which would be bizarre given that the router *interface* connecting them has already been deleted. If this were the case I'd consider it a bug in Neutron.

But it's most likely this:

- {{ agent.id }}_floating_ip depends_on router_interface_2: the thing is, we're supposed to add this one for you: http://git.openstack.org/cgit/openstack/heat/tree/heat/engine/resources/openstack/neutron/floatingip.py#n194

There are some known issues with that, though. The comments on https://review.openstack.org/#/c/289371/ are the best source of details; unfortunately, work on the fix appears to have stalled. In particular (quoting myself):

> The same patch also attempts to fix the issue with adding too many dependencies during create, unfortunately by causing the opposite problem: not adding _any_ dependencies during create. This will *certainly* be fatal under convergence for the reasons mentioned above - deletes will use the dependencies stored prior to the create and therefore will start failing because stuff is out of order.

The offending patch (https://review.openstack.org/#/c/167782/3/heat/engine/resources/openstack/neutron/floatingip.py) means that this dependency will be missing when the dependencies are calculated at create time, but not at delete time. However, starting with Newton/OSP10, the convergence_engine option is enabled by default.
With this option on, the dependencies calculated at create time are the ones that we use to delete. So I think this could actually be a longstanding bug that has now been uncovered in Newton.
I raised a Launchpad bug to keep track of this.
Thanks Zane. It looks like adding only {{ agent.id }}_floating_ip depends_on router_interface_2 also solves the issue. So yes, the other changes are probably unrelated.
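For anyone else hitting this, the workaround that proved sufficient is a single explicit dependency on the floating IP resource. A minimal sketch follows; the {{ agent.id }} placeholder is expanded by shaker, and the property values shown are illustrative assumptions rather than the verbatim template:

```yaml
# Workaround sketch (assumed property names/values). Making the floating
# IP explicitly depend on the router interface ensures that, on delete,
# the FIP and its port are removed before the interface and subnet.
resources:
  "{{ agent.id }}_floating_ip":
    type: OS::Neutron::FloatingIP
    depends_on: router_interface_2       # the one depends_on that matters
    properties:
      floating_network: { get_param: external_net }      # assumed param
      port_id: { get_resource: "{{ agent.id }}_port" }   # assumed resource
```

As discussed above, Heat is supposed to infer this dependency itself; the explicit depends_on is only needed on releases affected by the bug tracked here.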
Patch merged to upstream stable/newton.
I have yet to encounter this issue while testing the most recent version (from the 19th).
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2016-2948.html