Bug 1051036 - neutron-l3-agent doesn't clean after itself when service is shut down
Summary: neutron-l3-agent doesn't clean after itself when service is shut down
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 4.0
Hardware: Unspecified
OS: Unspecified
Severity: high
Priority: high
Target Milestone: z4
Target Release: 4.0
Assignee: Miguel Angel Ajo
QA Contact: Ofer Blaut
URL:
Whiteboard:
Depends On: 1051028 1062685 1173435
Blocks: 1066642 RHEL-OSP_Neutron_HA
 
Reported: 2014-01-09 15:30 UTC by Miguel Angel Ajo
Modified: 2022-07-09 06:16 UTC (History)
8 users

Fixed In Version: openstack-neutron-2013.2.2-9.el6ost
Doc Type: Bug Fix
Doc Text:
Cause: The neutron l3 agent intentionally does not clean up its resources (network namespaces, iptables rules, child processes, etc.) when the service is stopped; this allows the agent to be upgraded without service disruption. Consequence: When removing a node from a cluster and stopping its services, the l3 services/resources remain active and continue to be updated whenever the served tenant networks change. Fix: Added the neutron-netns-cleanup init script to allow cleanup of the l3 service resources as needed. Result: The resources can now be cleaned up by running the script.
Clone Of: 1051028
Environment:
Last Closed: 2014-05-29 20:18:22 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
Patch for netns_cleanup (4.39 KB, patch)
2014-02-20 15:38 UTC, Miguel Angel Ajo
neutron-netns-cleanup script (1.01 KB, application/x-shellscript)
2014-02-20 15:39 UTC, Miguel Angel Ajo
neutron-netns-forced-cleanup script (524 bytes, application/x-shellscript)
2014-02-20 15:40 UTC, Miguel Angel Ajo
netns cleanup script (1.39 KB, application/x-shellscript)
2014-03-27 15:31 UTC, Miguel Angel Ajo


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1115999 0 None None None Never
Launchpad 1273095 0 None None None Never
Launchpad 1291915 0 None None None Never
OpenStack gerrit 80261 0 None MERGED fixes broken neutron-netns-cleanup 2020-04-27 09:59:14 UTC
Red Hat Product Errata RHSA-2014:0516 0 normal SHIPPED_LIVE Moderate: openstack-neutron security, bug fix, and enhancement update 2014-05-30 00:15:59 UTC

Description Miguel Angel Ajo 2014-01-09 15:30:16 UTC
+++ This bug was initially created as a clone of Bug #1051028 +++

Description of problem:

  neutron-l3-agent can start several child processes per router: 
   * neutron-ns-metadata-proxy 

  and it does create a qrouter-xxxx namespace for every tenant network
  it handles.

  Inside this namespace, it has two interfaces: one connected to the
external network, and one connected to the integration bridge (tagged for the tenant network).

  It also sets up a few iptables rules to handle traffic routing, forwarding, NAT, etc.

  When shutting down the service:

# service neutron-l3-agent stop

  The namespaces are left behind:

[root@fab3 ~]# ip netns | grep qrouter
qrouter-12cba127-3f1f-4cef-a41b-6de04c452008

  The processes are left behind:

[root@fab3 ~]# ps fax | grep ns-metadata
14862 pts/0    S+     0:00          \_ grep ns-metadata
 3453 ?        S      0:00 /usr/bin/python /usr/bin/neutron-ns-metadata-proxy --pid_file=/var/lib/neutron/external/pids/12cba127-3f1f-4cef-a41b-6de04c452008.pid --metadata_proxy_socket=/var/lib/neutron/metadata_proxy --router_id=12cba127-3f1f-4cef-a41b-6de04c452008 --state_path=/var/lib/neutron --metadata_port=9697 --debug --verbose --log-file=neutron-ns-metadata-proxy-12cba127-3f1f-4cef-a41b-6de04c452008.log --log-dir=/var/log/neutron

Also the ovs ports are left:
[root@fab3 ~]# ovs-vsctl show
c337cef7-7b66-4d0b-90d1-1452e5b99687
    Bridge br-ex
        Port br-ex
            Interface br-ex
                type: internal
        Port "eth0"
            Interface "eth0" 
        Port "qg-b4eaf6de-d8"          <<<<<<<<<<<<<<<<<<<
            Interface "qg-b4eaf6de-d8"
                type: internal
    Bridge br-tun
        Port "gre-4"
            Interface "gre-4"
                type: gre
                options: {in_key=flow, local_ip="10.16.139.53", out_key=flow, remote_ip="10.16.139.54"}
        Port br-tun
            Interface br-tun
                type: internal
        Port patch-int
            Interface patch-int
                type: patch
                options: {peer=patch-tun}
        Port "gre-1"
            Interface "gre-1"
                type: gre
                options: {in_key=flow, local_ip="10.16.139.53", out_key=flow, remote_ip="10.16.139.52"}
        Port "gre-2"
            Interface "gre-2"
                type: gre
                options: {in_key=flow, local_ip="10.16.139.53", out_key=flow, remote_ip="10.16.139.55"}
    Bridge br-int
        Port br-int
            Interface br-int
                type: internal
        Port "tap3af9ca10-d3"
            tag: 1
            Interface "tap3af9ca10-d3"
                type: internal
        Port patch-tun
            Interface patch-tun
                type: patch
                options: {peer=patch-int}
        Port "qr-8888c558-25"
            tag: 1
            Interface "qr-8888c558-25"  <<<<<<<<<<<<<<<
                type: internal
    ovs_version: "1.11.0"

And also the network config inside the namespace:

[root@fab3 ~]# ip netns exec qrouter-12cba127-3f1f-4cef-a41b-6de04c452008 ip addr
20: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
26: qr-8888c558-25: <BROADCAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN 
    link/ether fa:16:3e:ff:0b:cb brd ff:ff:ff:ff:ff:ff
    inet 192.168.100.1/24 brd 192.168.100.255 scope global qr-8888c558-25
    inet6 fe80::f816:3eff:feff:bcb/64 scope link 
       valid_lft forever preferred_lft forever
27: qg-b4eaf6de-d8: <BROADCAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN 
    link/ether fa:16:3e:e7:6b:4c brd ff:ff:ff:ff:ff:ff
    inet 10.16.139.235/21 brd 10.16.143.255 scope global qg-b4eaf6de-d8
    inet6 fe80::f816:3eff:fee7:6b4c/64 scope link tentative dadfailed 
       valid_lft forever preferred_lft forever


And the iptables rules:
[root@fab3 ~]# ip netns exec qrouter-12cba127-3f1f-4cef-a41b-6de04c452008 iptables -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination         
neutron-l3-agent-INPUT  all  --  anywhere             anywhere            

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination         
neutron-filter-top  all  --  anywhere             anywhere            
neutron-l3-agent-FORWARD  all  --  anywhere             anywhere            

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination         
neutron-filter-top  all  --  anywhere             anywhere            
neutron-l3-agent-OUTPUT  all  --  anywhere             anywhere            

Chain neutron-filter-top (2 references)
target     prot opt source               destination         
neutron-l3-agent-local  all  --  anywhere             anywhere            

Chain neutron-l3-agent-FORWARD (1 references)
target     prot opt source               destination         

Chain neutron-l3-agent-INPUT (1 references)
target     prot opt source               destination         
ACCEPT     tcp  --  anywhere             localhost           tcp dpt:9697 

Chain neutron-l3-agent-OUTPUT (1 references)
target     prot opt source               destination         

Chain neutron-l3-agent-local (1 references)
target     prot opt source               destination       



Version-Release number of selected component (if applicable):
 openstack-neutron-2013.2-16.el6ost.noarch
or
 openstack-neutron-2014.1-0.1.b1.el6.noarch

How reproducible:
  
  Always.

Steps to Reproduce:
1. start the service
2. set up some tenant networks with an external network and at least one router, with a VM connected to them, so the services and namespaces are actually created.

3. stop the service.

Actual results:

 neutron-ns-metadata-proxy child processes are left behind.
 qrouter-* network namespaces are left behind.
 ovs ports (internal + external) for the router are left behind.
 ip addresses remain configured on the ovs ports.

Expected results:

 all child processes are terminated.
 qrouter-* network namespaces are cleaned up.
 the ip addresses are removed.
 the ovs ports are cleaned up.

  
Additional info:
  
  In a simple installation this wouldn't be a problem, but in HA setups we need to stop services on one node and start them on a different one, without the two interfering with each other.

  In this situation, an unmanaged neutron-ns-metadata-proxy is left connected to the network. Also, the router IPs (internal + external) will be duplicated across several nodes, all with the same MAC address.
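For reference, the leftover resources above could be removed by hand roughly as follows. This is only a sketch: the namespace and port names are the example ones from this report, and DRY_RUN defaults to printing each command instead of running it (the real commands need root, OVS, and the namespace to exist).

```shell
#!/bin/sh
# Illustrative manual cleanup for one leftover router namespace.
# DRY_RUN=1 (the default) just prints the commands.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = 1 ]; then echo "$@"; else "$@"; fi; }

NS="qrouter-12cba127-3f1f-4cef-a41b-6de04c452008"

# kill the leftover neutron-ns-metadata-proxy for this router
run pkill -f "neutron-ns-metadata-proxy.*${NS#qrouter-}"
# drop the router's OVS ports (internal + external)
run ovs-vsctl --if-exists del-port br-int qr-8888c558-25
run ovs-vsctl --if-exists del-port br-ex qg-b4eaf6de-d8
# delete the namespace itself (its iptables rules go away with it)
run ip netns delete "$NS"
```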

Comment 2 Miguel Angel Ajo 2014-01-27 08:13:41 UTC
Launchpad bug #1273095 prevents us from properly selecting which kind of namespace we want to clean up (dhcp or l3-agent).

Comment 3 Miguel Angel Ajo 2014-01-27 08:14:08 UTC
Launchpad bug #1115999 prevents us from properly cleaning up the metadata proxies in namespaces (qdhcp or qrouter); it needs to be fixed to have a workaround here.

Comment 4 Miguel Angel Ajo 2014-02-20 15:34:35 UTC
Using /etc/init.d/neutron-netns-forced-cleanup start
cleans up the network namespaces and all internal iptables rules + interfaces.

The fix is provided in this repo:
http://file.rdu.redhat.com/~majopela/neutron-ha-fixes-bz-1051028-and-36-cleanup/

neutron needs to be patched (the netns_cleanup script).

Comment 5 Miguel Angel Ajo 2014-02-20 15:38:47 UTC
Created attachment 865582 [details]
Patch for netns_cleanup

Comment 6 Miguel Angel Ajo 2014-02-20 15:39:26 UTC
Created attachment 865583 [details]
neutron-netns-cleanup script

Comment 7 Miguel Angel Ajo 2014-02-20 15:40:14 UTC
Created attachment 865584 [details]
neutron-netns-forced-cleanup script

Comment 8 Miguel Angel Ajo 2014-03-13 12:22:54 UTC
Upstream review on the fixes:
https://review.openstack.org/#/c/80261/

Comment 9 Miguel Angel Ajo 2014-03-27 15:31:38 UTC
Created attachment 879527 [details]
netns cleanup script

Intended to be used with pacemaker.

It does a normal cleanup at start.

It does a forced cleanup at stop.
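A minimal sketch of such a start/stop wrapper, assuming the stock neutron-netns-cleanup utility and its --force option; the binary path and config file below are assumptions based on typical packaging, and the function only builds and prints the command line so the start/stop dispatch can be shown without neutron installed.

```shell
#!/bin/sh
# Sketch of a pacemaker-oriented wrapper: normal cleanup on "start",
# forced cleanup on "stop". Paths and flags are assumptions; adjust
# to your packaging.

CLEANUP_BIN="/usr/bin/neutron-netns-cleanup"
CLEANUP_CONF="--config-file /etc/neutron/neutron.conf"

# Build (and print) the command line for a given action.
cleanup_cmd() {
    case "$1" in
        start) echo "$CLEANUP_BIN $CLEANUP_CONF" ;;
        stop)  echo "$CLEANUP_BIN $CLEANUP_CONF --force" ;;
        *)     echo "usage: $0 {start|stop}" >&2; return 1 ;;
    esac
}

cleanup_cmd "${1:-start}"
```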

Comment 12 Miguel Angel Ajo 2014-04-10 09:21:34 UTC
Please check https://bugzilla.redhat.com/show_bug.cgi?id=1051028#c11 for details on how the neutron-netns-cleanup script will behave.

Comment 13 Ofer Blaut 2014-04-22 09:03:24 UTC
I have tested that the netns are cleaned up when service neutron-netns-cleanup stop is used.

The stop conditions are HA-related, not specific to the script.

openstack-neutron-2013.2.3-4.el6ost.noarch


[root@puma05 ~]# ip netns
qdhcp-a76e98a5-7ae3-4f91-b721-4f81cebcfa6f
qdhcp-6dcaa203-e61a-4003-a1fe-95d60853516f
qrouter-15ef1247-b52a-43fc-bfa2-27478dbfe1f3

[root@puma05 ~]# service neutron-netns-cleanup stop
[root@puma05 ~]# ip netns
[root@puma05 ~]# 
[root@puma05 ~]# 
[root@puma05 ~]# service neutron-netns-cleanup start
[root@puma05 ~]# ip netns 
[root@puma05 ~]# openstack-status 
== neutron services ==
neutron-server:                         inactive  (disabled on boot)
neutron-dhcp-agent:                     active
neutron-l3-agent:                       active
neutron-metadata-agent:                 active
neutron-lbaas-agent:                    inactive  (disabled on boot)
neutron-openvswitch-agent:              active
== Support services ==
openvswitch:                            active
messagebus:                             active
[root@puma05 ~]# service neutron-dhcp-agent restart
Stopping neutron-dhcp-agent:                               [  OK  ]
Starting neutron-dhcp-agent:                               [  OK  ]
[root@puma05 ~]# service neutron-l3-agent restart
Stopping neutron-l3-agent:                                 [  OK  ]
Starting neutron-l3-agent:                                 [  OK  ]
[root@puma05 ~]# ip netns 
qdhcp-a76e98a5-7ae3-4f91-b721-4f81cebcfa6f
qdhcp-6dcaa203-e61a-4003-a1fe-95d60853516f
[root@puma05 ~]# ip netns 
qdhcp-a76e98a5-7ae3-4f91-b721-4f81cebcfa6f
qdhcp-6dcaa203-e61a-4003-a1fe-95d60853516f
qrouter-15ef1247-b52a-43fc-bfa2-27478dbfe1f3

Comment 15 errata-xmlrpc 2014-05-29 20:18:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2014-0516.html

