Bug 1161524 - Duplicated qrouter namespaces on neutron nodes
Summary: Duplicated qrouter namespaces on neutron nodes
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-foreman-installer
Version: 5.0 (RHEL 6)
Hardware: x86_64
OS: Linux
Priority: high
Severity: urgent
Target Milestone: ---
Sub Component: Installer
Assignee: Jason Guiditta
QA Contact: Ofer Blaut
URL:
Whiteboard:
Depends On: 1162108
Blocks:
 
Reported: 2014-11-07 09:51 UTC by Pablo Caruana
Modified: 2019-04-16 14:23 UTC
CC List: 19 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-04-04 15:36:50 UTC
Target Upstream Version:
Embargoed:


Attachments:

Description Pablo Caruana 2014-11-07 09:51:50 UTC
Description of problem:

Customer is running neutron HA with Pacemaker, and we suspect the current Pacemaker resource configuration does not include the neutron-netns-cleanup scripts.

Customer knows how to manually remove the duplicated namespaces, but we are looking for a better solution for their production environment.

Some ideas coming from them:
- reconfigure Pacemaker to invoke neutron-netns-cleanup when failing over, or enable router_delete_namespaces = True in the L3 agents, or both (see the sketch below).
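
A minimal sketch of the second idea, assuming the stock config path /etc/neutron/l3_agent.ini and that the openstack-utils package (openstack-config) is available; on a Pacemaker-managed node the agent restart would go through the cluster rather than the init script:

# enable automatic namespace removal when a router is deleted ([DEFAULT] in l3_agent.ini)
openstack-config --set /etc/neutron/l3_agent.ini DEFAULT router_delete_namespaces True
# the l3-agent must be restarted to pick this up (via Pacemaker in this setup)

Note this only covers namespaces of routers that are actually deleted; whether it helps after a failover is exactly the doubt raised below.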

Usually we expect this kind of cleanup to happen in three conditions:
  1) When the node is taken out of the cluster.
  2) When the neutron-agent resources are taken off the node.
  3) If the neutron-netns-cleanup script is installed as a service, it will clean up all netns namespaces during reboot/poweroff/halt or when leaving the configured runlevels.
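
For reference, the manual cleanup mentioned above is roughly the following (a sketch; --force removes namespaces even if they still hold devices, so it should only be run on the node that is not currently hosting the routers):

# list the qrouter namespaces present on this node
ip netns list | grep qrouter
# remove all neutron-owned namespaces on this node
neutron-netns-cleanup --config-file /etc/neutron/neutron.conf \
    --config-file /etc/neutron/l3_agent.ini --force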


We have the following doubts:

- Will it work if the l3-agent is failed over to a different node, or only when the router is deleted? When the l3-agent is failed over, there is no l3-agent process left on the failed node, so I am not sure whether it will clean up the node or not.
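
One way to check for leftovers after a failover (a sketch; <router-id> is a placeholder) is to compare where the router is scheduled with where its namespace physically exists:

# on a controller: which l3-agent is hosting the router according to neutron?
neutron l3-agent-list-hosting-router <router-id>
# on each network node: does the namespace exist locally?
ip netns list | grep qrouter-<router-id>

A qrouter namespace on a node whose l3-agent no longer hosts the router is a leftover.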

Packages involved:

openstack-neutron-2013.2.3-16.el6ost.noarch                
openstack-neutron-openvswitch-2013.2.3-16.el6ost.noarch
iproute-2.6.32-130.el6ost.netns.3.x86_64     
kernel-2.6.32-504.el6.x86_64



How reproducible:

After a Pacemaker neutron node failover, the l3-agents on the active server shut down and the l3-agents on the passive server started; the resources then switched back about 20 minutes later.

We are triggering a failover after deleting the namespaces on the passive node in order to test whether this is the root cause.
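
A sketch of one way to drive that failover for the test, assuming Pacemaker-managed neutron agents and a placeholder node name:

# count qrouter namespaces on both nodes before the test
ip netns list | grep -c qrouter
# push the resources off the currently active node
pcs cluster standby <active-node>
# wait for the agents to come up on the passive node, then bring the node back
pcs cluster unstandby <active-node>
# recount on both nodes; qrouter namespaces present on both indicate the leak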


Additional info: we need to provide feedback to the customer to test in the preproduction environment before requesting a change maintenance window to apply the fix in production.

Comment 3 Leonid Natapov 2014-11-09 11:11:10 UTC
I was able to reproduce this issue in my environment. A2.

Comment 7 Leonid Natapov 2014-12-01 14:35:48 UTC
[root@mac848f69fbc4c3 bin(openstack_admin)]# rpm -qa | grep neutron
python-neutron-2014.1.3-11.el7ost.noarch
python-neutronclient-2.3.4-3.el7ost.noarch
openstack-neutron-openvswitch-2014.1.3-11.el7ost.noarch
openstack-neutron-ml2-2014.1.3-11.el7ost.noarch
openstack-neutron-2014.1.3-11.el7ost.noarch



Currently the problem is in the puppet modules.

1. Reboot will clean the namespaces.
2. Failover or moving neutron resources from one cluster node to another will leave namespaces undeleted, so we'll end up with several nodes with duplicated namespaces.

It's a problem with the deployment: it clones the neutron-*-cleanup resources across all nodes to make things go faster, but then you need this kind of manual intervention. We need to ask the deployers not to clone the neutron-*-cleanup resources across all nodes.
--------------
Clone: neutron-ovs-cleanup-clone
  Resource: neutron-ovs-cleanup (class=ocf provider=neutron type=OVSCleanup)
   Operations: start interval=0s timeout=40 (neutron-ovs-cleanup-start-timeout-40)
               stop interval=0s timeout=300 (neutron-ovs-cleanup-stop-timeout-300)
               monitor interval=30s (neutron-ovs-cleanup-monitor-interval-30s)
Clone: neutron-netns-cleanup-clone
  Resource: neutron-netns-cleanup (class=ocf provider=neutron type=Net)
--------------
This is a wrong configuration.
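
A sketch of the suggested change on an existing cluster, assuming the resource names from the pcs output above (to be verified against the actual CIB before touching production):

# stop cloning the cleanup resources across all nodes
pcs resource unclone neutron-ovs-cleanup
pcs resource unclone neutron-netns-cleanup

For new deployments the same change belongs in the installer/puppet modules so the clones are not created in the first place.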

Comment 11 Jason Guiditta 2014-12-01 18:51:10 UTC
Moving this to ofi, so I can fix it.

Comment 17 Ihar Hrachyshka 2014-12-03 10:45:15 UTC
Miguel is not on PTO today, so removing myself from needinfo list.

Comment 21 Jason Guiditta 2016-04-04 15:36:50 UTC
I don't think this is going to be fixed in quickstack; please reopen if needed.

