+++ This bug was initially created as a clone of Bug #1051028 +++

Description of problem:

neutron-l3-agent can start several child processes per router:
 * neutron-ns-metadata-proxy

and it creates a qrouter-xxxx namespace for every tenant router it handles.
Inside this namespace, it has two interfaces: one connected to the external
network and one connected to the integration bridge (tagged for the tenant
network). It also sets up a few iptables rules to handle traffic routing,
forwarding, NAT, etc.

When shutting down the service:

# service neutron-l3-agent stop

The namespaces are left behind:

[root@fab3 ~]# ip netns | grep qrouter
qrouter-12cba127-3f1f-4cef-a41b-6de04c452008

The processes are left behind:

[root@fab3 ~]# ps fax | grep ns-metadata
14862 pts/0    S+     0:00  \_ grep ns-metadata
 3453 ?        S      0:00 /usr/bin/python /usr/bin/neutron-ns-metadata-proxy --pid_file=/var/lib/neutron/external/pids/12cba127-3f1f-4cef-a41b-6de04c452008.pid --metadata_proxy_socket=/var/lib/neutron/metadata_proxy --router_id=12cba127-3f1f-4cef-a41b-6de04c452008 --state_path=/var/lib/neutron --metadata_port=9697 --debug --verbose --log-file=neutron-ns-metadata-proxy-12cba127-3f1f-4cef-a41b-6de04c452008.log --log-dir=/var/log/neutron

Also the ovs ports are left:

[root@fab3 ~]# ovs-vsctl show
c337cef7-7b66-4d0b-90d1-1452e5b99687
    Bridge br-ex
        Port br-ex
            Interface br-ex
                type: internal
        Port "eth0"
            Interface "eth0"
        Port "qg-b4eaf6de-d8"          <<<<<<<<<<<<<<<<<<<
            Interface "qg-b4eaf6de-d8"
                type: internal
    Bridge br-tun
        Port "gre-4"
            Interface "gre-4"
                type: gre
                options: {in_key=flow, local_ip="10.16.139.53", out_key=flow, remote_ip="10.16.139.54"}
        Port br-tun
            Interface br-tun
                type: internal
        Port patch-int
            Interface patch-int
                type: patch
                options: {peer=patch-tun}
        Port "gre-1"
            Interface "gre-1"
                type: gre
                options: {in_key=flow, local_ip="10.16.139.53", out_key=flow, remote_ip="10.16.139.52"}
        Port "gre-2"
            Interface "gre-2"
                type: gre
                options: {in_key=flow, local_ip="10.16.139.53", out_key=flow, remote_ip="10.16.139.55"}
    Bridge br-int
        Port br-int
            Interface br-int
                type: internal
        Port "tap3af9ca10-d3"
            tag: 1
            Interface "tap3af9ca10-d3"
                type: internal
        Port patch-tun
            Interface patch-tun
                type: patch
                options: {peer=patch-int}
        Port "qr-8888c558-25"
            tag: 1
            Interface "qr-8888c558-25"          <<<<<<<<<<<<<<<
                type: internal
    ovs_version: "1.11.0"

And also the network config inside the namespace:

[root@fab3 ~]# ip netns exec qrouter-12cba127-3f1f-4cef-a41b-6de04c452008 ip addr
20: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
26: qr-8888c558-25: <BROADCAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN
    link/ether fa:16:3e:ff:0b:cb brd ff:ff:ff:ff:ff:ff
    inet 192.168.100.1/24 brd 192.168.100.255 scope global qr-8888c558-25
    inet6 fe80::f816:3eff:feff:bcb/64 scope link
       valid_lft forever preferred_lft forever
27: qg-b4eaf6de-d8: <BROADCAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN
    link/ether fa:16:3e:e7:6b:4c brd ff:ff:ff:ff:ff:ff
    inet 10.16.139.235/21 brd 10.16.143.255 scope global qg-b4eaf6de-d8
    inet6 fe80::f816:3eff:fee7:6b4c/64 scope link tentative dadfailed
       valid_lft forever preferred_lft forever

And the iptables rules:

[root@fab3 ~]# ip netns exec qrouter-12cba127-3f1f-4cef-a41b-6de04c452008 iptables -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination
neutron-l3-agent-INPUT  all  --  anywhere             anywhere

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination
neutron-filter-top  all  --  anywhere             anywhere
neutron-l3-agent-FORWARD  all  --  anywhere             anywhere

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
neutron-filter-top  all  --  anywhere             anywhere
neutron-l3-agent-OUTPUT  all  --  anywhere             anywhere

Chain neutron-filter-top (2 references)
target     prot opt source               destination
neutron-l3-agent-local  all  --  anywhere             anywhere

Chain neutron-l3-agent-FORWARD (1 references)
target     prot opt source               destination

Chain neutron-l3-agent-INPUT (1 references)
target     prot opt source               destination
ACCEPT     tcp  --  anywhere             localhost            tcp dpt:9697

Chain neutron-l3-agent-OUTPUT (1 references)
target     prot opt source               destination

Chain neutron-l3-agent-local (1 references)
target     prot opt source               destination

Version-Release number of selected component (if applicable):

openstack-neutron-2013.2-16.el6ost.noarch
or
openstack-neutron-2014.1-0.1.b1.el6.noarch

How reproducible:

Always.

Steps to Reproduce:
1. Start the service.
2. Set up some tenant networks with an external network and at least one router, with a VM connected to them, so the services and namespaces are actually created.
3. Stop the service.

Actual results:

neutron-ns-metadata-proxy child processes are left.
qrouter-* network namespaces are left.
ovs ports (internal + external) for the router are left.
IP addresses are left configured on the ovs ports.

Expected results:

All child processes are terminated.
qrouter-* network namespaces are cleaned up.
The IP addresses are removed.
The ovs ports are cleaned up.

Additional info:

In a simple installation this wouldn't be a problem, but for HA setups we need to be able to stop the services on one node and start them on a different one without the two interfering with each other. As it stands, an unmanaged neutron-ns-metadata-proxy is left connected to the network, and the router IPs (internal + external) will be duplicated on several nodes, with the same MAC address.
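As a rough illustration of what a proper shutdown needs to do, here is a minimal manual cleanup sketch using the router ID and port names from the output above; the netns_cleanup-based scripts attached later in this bug automate and generalize this:

    # Kill the orphaned metadata proxy via its pid file.
    kill $(cat /var/lib/neutron/external/pids/12cba127-3f1f-4cef-a41b-6de04c452008.pid)

    # Remove the router's leftover ovs ports (names from the ovs-vsctl output).
    ovs-vsctl del-port br-ex qg-b4eaf6de-d8
    ovs-vsctl del-port br-int qr-8888c558-25

    # Delete the namespace; its interfaces, addresses and iptables rules go with it.
    ip netns delete qrouter-12cba127-3f1f-4cef-a41b-6de04c452008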
Launchpad bug #1273095 prevents properly selecting which kind of namespace we want to clean up (dhcp or l3-agent).
Launchpad bug #1115999 prevents properly cleaning up the metadata proxies in namespaces (qdhcp or qrouter); it needs to be fixed before a workaround is possible here.
Running /etc/init.d/neutron-netns-forced-cleanup start cleans up the network namespaces and all internal iptables rules + interfaces. The fix is provided in this repo: http://file.rdu.redhat.com/~majopela/neutron-ha-fixes-bz-1051028-and-36-cleanup/ — neutron needs to be patched (netns_cleanup script).
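For the record, a rough sketch of what such a forced cleanup amounts to. This assumes the qrouter-/qdhcp- naming convention and an iproute2 with "ip netns pids" support; the attached scripts are the authoritative implementation, and they additionally unplug the ovs ports:

    #!/bin/sh
    # Sketch only: force-clean all neutron namespaces on this node.
    for ns in $(ip netns | awk '{print $1}' | grep -E '^(qrouter|qdhcp)-'); do
        # Kill anything still running inside the namespace
        # (e.g. neutron-ns-metadata-proxy or dnsmasq).
        for pid in $(ip netns pids "$ns"); do
            kill -9 "$pid"
        done
        # Deleting the namespace drops its interfaces and iptables rules too.
        ip netns delete "$ns"
    done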
Created attachment 865582: Patch for netns_cleanup
Created attachment 865583: neutron-netns-cleanup script
Created attachment 865584: neutron-netns-forced-cleanup script
Upstream review of the fixes: https://review.openstack.org/#/c/80261/
Created attachment 879527: netns cleanup script

Intended to be used with pacemaker:
 * It does a normal cleanup at start.
 * It does a forced cleanup at stop.
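For illustration, a hypothetical way to wire this into a pacemaker cluster with pcs. The resource names and the LSB script name below are assumptions; the attached script is the actual piece under test:

    # Hypothetical: register the cleanup script as a cloned LSB resource so
    # every controller runs it; pacemaker then calls "start" (normal cleanup)
    # when bringing a node in and "stop" (forced cleanup) when taking it out.
    pcs resource create neutron-netns-cleanup lsb:neutron-netns-cleanup --clone

    # Assumed agent resource name: start the l3 agent only after cleanup ran.
    pcs constraint order start neutron-netns-cleanup-clone then neutron-l3-agent-clone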
Please check https://bugzilla.redhat.com/show_bug.cgi?id=1051028#c11 for details on how the neutron-netns-cleanup script will behave.
I have tested that when "service neutron-netns-cleanup stop" is used, the namespaces are cleaned up. The stop conditions are HA-related, not the script's own.

openstack-neutron-2013.2.3-4.el6ost.noarch

[root@puma05 ~]# ip netns
qdhcp-a76e98a5-7ae3-4f91-b721-4f81cebcfa6f
qdhcp-6dcaa203-e61a-4003-a1fe-95d60853516f
qrouter-15ef1247-b52a-43fc-bfa2-27478dbfe1f3
[root@puma05 ~]# service neutron-netns-cleanup stop
[root@puma05 ~]# ip netns
[root@puma05 ~]#
[root@puma05 ~]#
[root@puma05 ~]# service neutron-netns-cleanup start
[root@puma05 ~]# ip netns
[root@puma05 ~]# openstack-status
== neutron services ==
neutron-server:               inactive  (disabled on boot)
neutron-dhcp-agent:           active
neutron-l3-agent:             active
neutron-metadata-agent:       active
neutron-lbaas-agent:          inactive  (disabled on boot)
neutron-openvswitch-agent:    active
== Support services ==
openvswitch:                  active
messagebus:                   active
[root@puma05 ~]# service neutron-dhcp-agent restart
Stopping neutron-dhcp-agent:                               [  OK  ]
Starting neutron-dhcp-agent:                               [  OK  ]
[root@puma05 ~]# service neutron-l3-agent restart
Stopping neutron-l3-agent:                                 [  OK  ]
Starting neutron-l3-agent:                                 [  OK  ]
[root@puma05 ~]# ip netns
qdhcp-a76e98a5-7ae3-4f91-b721-4f81cebcfa6f
qdhcp-6dcaa203-e61a-4003-a1fe-95d60853516f
[root@puma05 ~]# ip netns
qdhcp-a76e98a5-7ae3-4f91-b721-4f81cebcfa6f
qdhcp-6dcaa203-e61a-4003-a1fe-95d60853516f
qrouter-15ef1247-b52a-43fc-bfa2-27478dbfe1f3
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHSA-2014-0516.html