Bug 1136969

Summary: [l2pop] Parallel create/delete requests to fdb entries may mix and delete a tunnel that is still needed
Product: Red Hat OpenStack
Reporter: Joe Talerico <jtaleric>
Component: openstack-neutron
Assignee: Ihar Hrachyshka <ihrachys>
Status: CLOSED ERRATA
QA Contact: Toni Freger <tfreger>
Severity: high
Priority: high
Version: 5.0 (RHEL 7)
CC: chrisw, ihrachys, jeder, jtaleric, kambiz, lpeer, mlopes, mwagner, myllynen, nyechiel, oblaut, perfbz, tfreger, yeylon
Target Milestone: z4
Keywords: ZStream
Target Release: 5.0 (RHEL 7)
Hardware: All
OS: Linux
Fixed In Version: openstack-neutron-2014.1.4-1.el7ost
Doc Type: Bug Fix
Doc Text:
Previously, the ML2 L2 population (l2pop) mechanism driver had a race condition that could request removal of a tunnel while it was still in use by new flows added in parallel with the last flow removal. Consequently, connections between instances located on different Compute nodes and attached to the same network could be lost. This update changes the check for whether a tunnel is still needed or can be dropped so that it considers all flows currently in action. As a result, the l2pop mechanism driver no longer drops tunnels while active flows are still present.
Last Closed: 2015-04-16 14:36:36 UTC
Type: Bug
Attachments: ovs-agent log-file

Description Joe Talerico 2014-09-03 17:18:06 UTC
Description of problem:
Using Rally, I am starting 10 guests in parallel; if this test passes, the automation moves on to starting 20 guests in parallel, and we keep increasing by 10 guests until we determine the maximum parallel guest launch an OSP environment can sustain.

Test cases from 10 to 80 guests seem to pass successfully (unless we run into a harness issue); however, once we reach 90-100 guests, we lose connectivity to the Rally server, which is running on our cloud (to avoid the need for a ton of floating IPs).

This is a non-HA run. On the Neutron networker we see the following in ovs-agent.log with DEBUG enabled:
2014-09-02 17:49:14.546 16886 DEBUG neutron.plugins.openvswitch.agent.ovs_neutron_agent [-] Agent rpc_loop - iteration:717 completed. Processed ports statistics: {'ancillary': {'removed': 0, 'added': 0}, 'regular': {'updated': 0, 'added': 0, 'removed': 0}}. Elapsed:0.034 rpc_loop /usr/lib/python2.7/site-packages/neutron/plugins/openvswitch/agent/ovs_neutron_agent.py:1386
2014-09-02 17:49:15.454 16886 DEBUG neutron.plugins.openvswitch.agent.ovs_neutron_agent [-] Agent caught SIGTERM, quitting daemon loop. _handle_sigterm /usr/lib/python2.7/site-packages/neutron/plugins/openvswitch/agent/ovs_neutron_agent.py:1405
2014-09-02 17:49:16.513 16886 DEBUG neutron.agent.linux.async_process [-] Halting async process [['ovsdb-client', 'monitor', 'Interface', 'name,ofport', '--format=json']]. stop /usr/lib/python2.7/site-packages/neutron/agent/linux/async_process.py:90

More :
2014-09-03 11:52:34.528 20796 DEBUG neutron.plugins.openvswitch.agent.ovs_neutron_agent [-] Agent rpc_loop - iteration:32471 completed. Processed ports statistics: {'ancillary': {'removed': 0, 'added': 0}, 'regular': {'updated': 0, 'added': 0, 'removed': 0}}. Elapsed:0.034 rpc_loop /usr/lib/python2.7/site-packages/neutron/plugins/openvswitch/agent/ovs_neutron_agent.py:1386
2014-09-03 11:52:35.734 20796 DEBUG neutron.plugins.openvswitch.agent.ovs_neutron_agent [-] Agent caught SIGTERM, quitting daemon loop. _handle_sigterm /usr/lib/python2.7/site-packages/neutron/plugins/openvswitch/agent/ovs_neutron_agent.py:1405
2014-09-03 11:52:36.496 20796 DEBUG neutron.agent.linux.async_process [-] Halting async process [['ovsdb-client', 'monitor', 'Interface', 'name,ofport', '--format=json']]. stop /usr/lib/python2.7/site-packages/neutron/agent/linux/async_process.py:90
2014-09-03 11:52:36.496 20796 DEBUG neutron.agent.linux.utils [-] Running command: ['ps', '--ppid', '21126', '-o', 'pid='] create_process /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:48
2014-09-03 11:52:36.511 20796 DEBUG neutron.agent.linux.utils [-]
Command: ['ps', '--ppid', '21126', '-o', 'pid=']
Exit code: 0
Stdout: '21128\n'
Stderr: '' execute /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:74
2014-09-03 11:52:36.511 20796 DEBUG neutron.agent.linux.utils [-] Running command: ['ps', '--ppid', '21128', '-o', 'pid='] create_process /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:48
2014-09-03 11:52:36.519 20796 DEBUG neutron.agent.linux.utils [-]
Command: ['ps', '--ppid', '21128', '-o', 'pid=']
Exit code: 0
Stdout: '21133\n'
Stderr: '' execute /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:74
2014-09-03 11:52:36.520 20796 DEBUG neutron.agent.linux.utils [-] Running command: ['ps', '--ppid', '21133', '-o', 'pid='] create_process /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:48
2014-09-03 11:52:36.528 20796 DEBUG neutron.agent.linux.utils [-]
Command: ['ps', '--ppid', '21133', '-o', 'pid=']
Exit code: 1
Stdout: ''
Stderr: '' execute /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:74
2014-09-03 11:52:36.528 20796 DEBUG neutron.agent.linux.utils [-] Running command: ['sudo', 'neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'kill', '-9', '21133'] create_process /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:48
2014-09-03 11:52:36.561 20796 DEBUG neutron.agent.linux.utils [-]
Command: ['sudo', 'neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'kill', '-9', '21133']
Exit code: 0
Stdout: ''
Stderr: '' execute /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:74
2014-09-03 11:52:38.468 9081 INFO neutron.common.config [-] Logging enabled!

Once this happens, the Rally guest loses connectivity, and I have to manually restart the OSP services on the Neutron networker -- restarting the Rally guest does not restore connectivity.

Interesting observation: guests launched after the restart of the services seem to have connectivity to the qdhcp namespace; however, the Rally guest does not.
[root@macbc305bf5f451 neutron]# ip netns exec qdhcp-02fac325-2b01-4418-b305-6ef69a877b63 ping 10.0.0.165
PING 10.0.0.165 (10.0.0.165) 56(84) bytes of data.
64 bytes from 10.0.0.165: icmp_seq=1 ttl=64 time=2.78 ms
64 bytes from 10.0.0.165: icmp_seq=2 ttl=64 time=0.554 ms
^C
--- 10.0.0.165 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.554/1.670/2.787/1.117 ms
[root@macbc305bf5f451 neutron]# ip netns exec qdhcp-02fac325-2b01-4418-b305-6ef69a877b63 ping 10.0.0.8
PING 10.0.0.8 (10.0.0.8) 56(84) bytes of data.
From 10.0.0.3 icmp_seq=1 Destination Host Unreachable
From 10.0.0.3 icmp_seq=2 Destination Host Unreachable
From 10.0.0.3 icmp_seq=3 Destination Host Unreachable
From 10.0.0.3 icmp_seq=4 Destination Host Unreachable
^C
--- 10.0.0.8 ping statistics ---
4 packets

Comment 2 Joe Talerico 2014-09-03 17:48:05 UTC
This is a non-HA RHEL OSP 5 on RHEL 7 deployment with 5 nodes, deployed with Staypuft.

The deployment is:
1x Controller
1x Neutron Networker
3x Compute nodes.

Rally is running within the OpenStack cloud deployed above, with a floating IP used to make API calls to OpenStack.

To be clear, both the floating IP and the internal address are unreachable.

Comment 3 Joe Talerico 2014-09-03 18:27:03 UTC
Created attachment 934192 [details]
ovs-agent log-file

Comment 4 Joe Talerico 2014-09-03 20:50:23 UTC
Hit this issue again. Here is ovs-vsctl show after I hit the issue and cleaned up all the old guests except for the Rally guest, which is still there:
[root@macbc305bf5f451 neutron]# ovs-vsctl show
8e923870-f081-4585-ba03-7ec16e55b6cf
    Bridge br-ex
        Port br-ex
            Interface br-ex
                type: internal
        Port "qg-8a7abac9-d6"
            Interface "qg-8a7abac9-d6"
                type: internal
        Port phy-br-ex
            Interface phy-br-ex
        Port "eno2"
            Interface "eno2"
    Bridge br-int
        fail_mode: secure
        Port "tap7418e859-a2"
            tag: 1
            Interface "tap7418e859-a2"
                type: internal
        Port "tapbfc4341f-c8"
            tag: 2
            Interface "tapbfc4341f-c8"
                type: internal
        Port int-br-ex
            Interface int-br-ex
        Port patch-tun
            Interface patch-tun
                type: patch
                options: {peer=patch-int}
        Port "qr-f23ab158-ff"
            tag: 1
            Interface "qr-f23ab158-ff"
                type: internal
        Port br-int
            Interface br-int
                type: internal
    Bridge br-tun
        Port br-tun
            Interface br-tun
                type: internal
        Port patch-int
            Interface patch-int
                type: patch
                options: {peer=patch-tun}
    ovs_version: "2.1.3"

Note: the tunnel to the compute node where the Rally guest resides is missing, so the guest will have zero connectivity to anything.

After restarting services:
8e923870-f081-4585-ba03-7ec16e55b6cf
    Bridge br-ex
        Port br-ex
            Interface br-ex
                type: internal
        Port phy-br-ex
            Interface phy-br-ex
        Port "qg-8a7abac9-d6"
            Interface "qg-8a7abac9-d6"
                type: internal
        Port "eno2"
            Interface "eno2"
    Bridge br-int
        fail_mode: secure
        Port "qr-f23ab158-ff"
            tag: 1
            Interface "qr-f23ab158-ff"
                type: internal
        Port "tapbfc4341f-c8"
            tag: 2
            Interface "tapbfc4341f-c8"
                type: internal
        Port "tap7418e859-a2"
            tag: 1
            Interface "tap7418e859-a2"
                type: internal
        Port br-int
            Interface br-int
                type: internal
        Port patch-tun
            Interface patch-tun
                type: patch
                options: {peer=patch-int}
        Port int-br-ex
            Interface int-br-ex
    Bridge br-tun
        Port "vxlan-ac1264f1"
            Interface "vxlan-ac1264f1"
                type: vxlan
                options: {in_key=flow, local_ip="172.18.100.240", out_key=flow, remote_ip="172.18.100.241"}
        Port br-tun
            Interface br-tun
                type: internal
        Port patch-int
            Interface patch-int
                type: patch
                options: {peer=patch-tun}
    ovs_version: "2.1.3"
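
For anyone reproducing this, the symptom can be checked programmatically: after the failure, br-tun is missing the vxlan port towards the compute node hosting the guest, and it reappears (as "vxlan-ac1264f1" above) after the restart. Below is a minimal Python sketch of such a check; it is not part of Neutron, and it only assumes the "vxlan-" plus hex-encoded-remote-IP port naming visible in the output above (e.g. 172.18.100.241 -> vxlan-ac1264f1):

import subprocess

def missing_vxlan_peers(expected_remote_ips):
    """Return the remote IPs that no longer have a vxlan port on br-tun."""
    out = subprocess.check_output(
        ["ovs-vsctl", "list-ports", "br-tun"], text=True)
    ports = set(out.split())
    missing = []
    for ip in expected_remote_ips:
        # The OVS agent names VXLAN ports vxlan-<ip as 8 hex digits>,
        # e.g. 172.18.100.241 -> vxlan-ac1264f1.
        hex_ip = "".join("%02x" % int(octet) for octet in ip.split("."))
        if "vxlan-" + hex_ip not in ports:
            missing.append(ip)
    return missing

print(missing_vxlan_peers(["172.18.100.241"]))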

Comment 5 Joe Talerico 2014-09-08 20:34:31 UTC
Disabling l2population seemed to fix the problem of not being able to get above 100 guests launched concurrently.

With l2population disabled in the latest testing, across 3 iterations I averaged 110 concurrent guests (90, 130, 120).

I am noting this because I am not seeing the br-tun tunnel removed during this test, as I would when l2population was enabled.

Comment 6 Joe Talerico 2014-09-17 19:49:22 UTC
I can recreate this bug by running:
rally task start rally-launch

I was using a RHEL 7 cloud image.

I would run the above command a couple of times until failure - failure being when I lose connectivity to the Rally guest.

The rally-launch scenario:
{
    "VMTasks.boot_runcommand_delete": [
        {
            "runner": {
                "type": "constant",
                "times": 100,
                "concurrency": 100
            },
            "args": {
                "username": "root",
                "floating_network": "Public",
                "use_floatingip": false,
                "script": "/opt/rally/true.sh",
                "auto_assign_nic" : True,
                "fixed_network": "private",
                "interpreter": "/bin/sh",
                "flavor": {
                    "name": "m1.small"
                },
                "image": {
                    "name": "rhel7"
                },
                "detailed" : True
            },
            "context": {
                "users": {
                    "users_per_tenant": 1,
                    "tenants": 1
                },
                "quotas": {
                    "neutron": {
                        "network": -1,
                        "port": -1
                    },
                    "nova": {
                        "instances": -1,
                        "cores": -1,
                        "ram": -1
                    }
                }
            }
        }
    ]
}
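
For repeated runs, here is a rough Python sketch of the "run until failure" loop described above; it is my own convenience wrapper, not part of Rally, and it assumes the Rally guest's fixed IP (10.0.0.8 from the ping output earlier) and treats a failed ping as having hit the bug:

import subprocess

RALLY_GUEST_IP = "10.0.0.8"  # assumption: the Rally guest's fixed IP

def guest_reachable(ip):
    # Single ICMP probe with a 5 second deadline; exit code 0 means a reply.
    return subprocess.call(["ping", "-c", "1", "-w", "5", ip]) == 0

run = 0
while True:
    run += 1
    subprocess.check_call(["rally", "task", "start", "rally-launch"])
    if not guest_reachable(RALLY_GUEST_IP):
        print("lost connectivity to the Rally guest after run %d" % run)
        break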

Comment 7 Ihar Hrachyshka 2014-09-18 08:41:52 UTC
This looks like a race condition when multiple port delete and port create/update requests arrive at the OVS agent in parallel. Neutron-server may incorrectly decide that the whole tunnel is unused and request FLOODING_ENTRY removal even though new ports are being added at the same time.
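
To make the race concrete, here is a toy Python sketch (not the actual l2pop code; names such as agent_ports and delete_flooding_entry are made up) of how a "was this the last port?" decision computed ahead of time can drop a tunnel that a parallel port create still needs, and how deciding from the current set of active ports avoids it:

import threading

class L2PopSketch:
    """Toy model of the l2pop tunnel-lifetime decision (hypothetical names)."""

    def __init__(self):
        self.lock = threading.Lock()
        self.agent_ports = set()  # ports on the remote agent using the tunnel

    def port_created(self, port_id):
        with self.lock:
            self.agent_ports.add(port_id)

    def port_deleted_racy(self, port_id, was_last_port):
        # BUGGY: trusts a stale 'was_last_port' flag computed earlier,
        # outside the lock. A parallel port_created() may have landed
        # since, yet the FLOODING_ENTRY (and thus the tunnel) is removed.
        with self.lock:
            self.agent_ports.discard(port_id)
            if was_last_port:
                self.delete_flooding_entry()

    def port_deleted_fixed(self, port_id):
        # FIXED: decide from the *current* set of active ports, under the
        # same lock that serializes creates and deletes.
        with self.lock:
            self.agent_ports.discard(port_id)
            if not self.agent_ports:
                self.delete_flooding_entry()

    def delete_flooding_entry(self):
        print("FLOODING_ENTRY removed -> tunnel port dropped on the agent")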

Comment 8 Ihar Hrachyshka 2014-09-18 14:10:13 UTC
I've added a patch to the external links list that may fix the issue we are experiencing, though more testing is needed to make sure it helps.

Comment 9 Ihar Hrachyshka 2014-09-22 12:44:14 UTC
Raised the bug upstream (see the external trackers list). Vivek from HP, who worked on DVR, privately told me that he is willing to provide the fix, because the DVR team re-introduced this regression in Juno after it had been fixed there.

Comment 10 Ihar Hrachyshka 2015-01-14 14:00:39 UTC
*** Bug 1141497 has been marked as a duplicate of this bug. ***

Comment 11 Ihar Hrachyshka 2015-03-19 10:34:15 UTC
The fix arrived with the 2014.1.4 rebase.

Comment 14 Toni Freger 2015-04-05 14:09:42 UTC
This bug was verified on RHEL 7.1 with a smaller setup (AIO + Compute node) and the Rally "boot_runcommand_delete" scenario, with l2pop enabled.

Times: 500
Concurrency: 50

openstack-neutron-ml2-2014.1.4-1.el7ost.noarch
openstack-neutron-openvswitch-2014.1.4-1.el7ost.noarch
openstack-neutron-2014.1.4-1.el7ost.noarch

Comment 16 errata-xmlrpc 2015-04-16 14:36:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0829.html