Bug 1136969
| Field | Value | Field | Value |
|---|---|---|---|
| Summary: | [l2pop] Parallel create/delete requests to fdb entries may mix and delete a tunnel that is still needed | | |
| Product: | Red Hat OpenStack | Reporter: | Joe Talerico <jtaleric> |
| Component: | openstack-neutron | Assignee: | Ihar Hrachyshka <ihrachys> |
| Status: | CLOSED ERRATA | QA Contact: | Toni Freger <tfreger> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 5.0 (RHEL 7) | CC: | chrisw, ihrachys, jeder, jtaleric, kambiz, lpeer, mlopes, mwagner, myllynen, nyechiel, oblaut, perfbz, tfreger, yeylon |
| Target Milestone: | z4 | Keywords: | ZStream |
| Target Release: | 5.0 (RHEL 7) | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | openstack-neutron-2014.1.4-1.el7ost | Doc Type: | Bug Fix |
| Doc Text: | Previously, the ML2 L2 population (l2pop) mechanism driver had a race condition: it could request removal of a tunnel that was still in use by new flows added in parallel with the removal of the last flow. Consequently, connections between instances located on different Compute nodes and attached to the same network could be lost. This update changes the check that decides whether a tunnel is still needed or can be dropped so that it considers all flows currently in action. As a result, the l2pop mechanism driver no longer drops a tunnel while active flows are still present. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2015-04-16 14:36:36 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | ovs-agent log-file (attachment 934192) | | |
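
To make the Doc Text above concrete, here is a minimal, hypothetical Python sketch of the kind of per-agent port accounting the fix describes. This is not the actual neutron l2pop code; the class and method names are illustrative. The point is that the "can the tunnel be dropped?" decision has to be made against the set of ports that is active at the moment of the delete, not against a count captured earlier:

    # Illustrative sketch only, not neutron code.
    from collections import defaultdict
    from threading import Lock

    class L2PopState(object):
        """Bookkeeping of active ports per (network, agent host)."""

        def __init__(self):
            self._lock = Lock()
            # (network_id, host) -> set of active port ids on that agent
            self._agent_ports = defaultdict(set)

        def port_up(self, network_id, host, port_id):
            """Return True if this is the first port on (network, host),
            i.e. the tunnel/flooding entry has to be created."""
            with self._lock:
                ports = self._agent_ports[(network_id, host)]
                first = not ports
                ports.add(port_id)
                return first

        def port_down(self, network_id, host, port_id):
            """Return True only if no active ports remain right now,
            i.e. the flooding entry (and tunnel) may be torn down."""
            with self._lock:
                ports = self._agent_ports[(network_id, host)]
                ports.discard(port_id)
                return not ports
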
This is a RHEL OSP 5 on RHEL 7 deployment, non-HA, with 5 nodes deployed with Staypuft: 1x Controller, 1x Neutron Networker, and 3x Compute nodes. Rally runs inside the OpenStack cloud deployed above, using a floating IP to make API calls to OpenStack. To be clear, both the floating IP and the internal address are unreachable.

Created attachment 934192 [details]
ovs-agent log-file

Hit this issue again. Here is the ovs-vsctl show output after I hit the issue and cleaned up all the old guests except for the Rally guest, which is still there:
[root@macbc305bf5f451 neutron]# ovs-vsctl show
8e923870-f081-4585-ba03-7ec16e55b6cf
    Bridge br-ex
        Port br-ex
            Interface br-ex
                type: internal
        Port "qg-8a7abac9-d6"
            Interface "qg-8a7abac9-d6"
                type: internal
        Port phy-br-ex
            Interface phy-br-ex
        Port "eno2"
            Interface "eno2"
    Bridge br-int
        fail_mode: secure
        Port "tap7418e859-a2"
            tag: 1
            Interface "tap7418e859-a2"
                type: internal
        Port "tapbfc4341f-c8"
            tag: 2
            Interface "tapbfc4341f-c8"
                type: internal
        Port int-br-ex
            Interface int-br-ex
        Port patch-tun
            Interface patch-tun
                type: patch
                options: {peer=patch-int}
        Port "qr-f23ab158-ff"
            tag: 1
            Interface "qr-f23ab158-ff"
                type: internal
        Port br-int
            Interface br-int
                type: internal
    Bridge br-tun
        Port br-tun
            Interface br-tun
                type: internal
        Port patch-int
            Interface patch-int
                type: patch
                options: {peer=patch-tun}
    ovs_version: "2.1.3"
Note: the tunnel to the compute node hosting the Rally guest is missing, so the guest has no connectivity to anything.

After restarting the services:
8e923870-f081-4585-ba03-7ec16e55b6cf
    Bridge br-ex
        Port br-ex
            Interface br-ex
                type: internal
        Port phy-br-ex
            Interface phy-br-ex
        Port "qg-8a7abac9-d6"
            Interface "qg-8a7abac9-d6"
                type: internal
        Port "eno2"
            Interface "eno2"
    Bridge br-int
        fail_mode: secure
        Port "qr-f23ab158-ff"
            tag: 1
            Interface "qr-f23ab158-ff"
                type: internal
        Port "tapbfc4341f-c8"
            tag: 2
            Interface "tapbfc4341f-c8"
                type: internal
        Port "tap7418e859-a2"
            tag: 1
            Interface "tap7418e859-a2"
                type: internal
        Port br-int
            Interface br-int
                type: internal
        Port patch-tun
            Interface patch-tun
                type: patch
                options: {peer=patch-int}
        Port int-br-ex
            Interface int-br-ex
    Bridge br-tun
        Port "vxlan-ac1264f1"
            Interface "vxlan-ac1264f1"
                type: vxlan
                options: {in_key=flow, local_ip="172.18.100.240", out_key=flow, remote_ip="172.18.100.241"}
        Port br-tun
            Interface br-tun
                type: internal
        Port patch-int
            Interface patch-int
                type: patch
                options: {peer=patch-tun}
    ovs_version: "2.1.3"
Disabling L2Population seemed to fix the problem of not being able to get above 100 guests launched concurrently. In the latest testing with L2Population disabled, over 3 iterations, I averaged 110 guests (90, 130, 120). I am noting this because I do not see the tunnel under br-tun removed during this test, as I would when L2Population was enabled.

I can recreate this bug by running:
rally task start rally-launch
I was using a RHEL 7 cloud image. I would run the above command a couple of times until failure; failure is when I lose connectivity to the Rally guest.

The rally-launch scenario:
{
    "VMTasks.boot_runcommand_delete": [
        {
            "runner": {
                "type": "constant",
                "times": 100,
                "concurrency": 100
            },
            "args": {
                "username": "root",
                "floating_network": "Public",
                "use_floatingip": false,
                "script": "/opt/rally/true.sh",
                "auto_assign_nic": true,
                "fixed_network": "private",
                "interpreter": "/bin/sh",
                "flavor": {
                    "name": "m1.small"
                },
                "image": {
                    "name": "rhel7"
                },
                "detailed": true
            },
            "context": {
                "users": {
                    "users_per_tenant": 1,
                    "tenants": 1
                },
                "quotas": {
                    "neutron": {
                        "network": -1,
                        "port": -1
                    },
                    "nova": {
                        "instances": -1,
                        "cores": -1,
                        "ram": -1
                    }
                }
            }
        }
    ]
}
This looks like a race condition that occurs when multiple port delete and port create/update requests arrive for the OVS agent. It may turn out that neutron-server incorrectly detects that the whole tunnel is unused and requests FLOODING_ENTRY removal while new ports are still arriving. I have added a patch to the external links list that may fix the issue we experience, though more testing is needed to make sure it helps.

Raised the bug upstream (see the external trackers list). Vivek from HP, who worked on DVR, privately told me that he is willing to provide the fix, because the DVR team re-introduced this regression in Juno after it had been fixed there.

*** Bug 1141497 has been marked as a duplicate of this bug. ***

The fix arrived with the 2014.1.4 rebase.

This bug was verified on RHEL 7.1 with a smaller setup (AIO + Compute node) and the Rally "boot_runcommand_delete" scenario, with l2pop enabled. Times: 500, Concurrency: 50.
openstack-neutron-ml2-2014.1.4-1.el7ost.noarch
openstack-neutron-openvswitch-2014.1.4-1.el7ost.noarch
openstack-neutron-2014.1.4-1.el7ost.noarch

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://rhn.redhat.com/errata/RHBA-2015-0829.html
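
To illustrate the race described in the analysis above, here is a small, purely illustrative Python snippet; the names and the in-line interleaving are hypothetical, not neutron code. The buggy variant decides based on a count taken before the parallel create is processed, while the fixed variant re-checks the live set of ports, in the spirit of the sketch after the metadata table above:

    # Toy reproduction of the interleaving, illustration only (not neutron code).
    active_ports = {("net1", "compute-1"): {"port-1"}}

    def buggy_delete(key, port_id):
        # Decision is taken on a snapshot of the state...
        remaining = len(active_ports[key]) - 1
        # ...meanwhile a parallel create adds a new port on the same agent...
        active_ports[key].add("port-2")
        active_ports[key].discard(port_id)
        # ...so the stale count says "tunnel unused" although port-2 needs it.
        return remaining == 0

    def fixed_delete(key, port_id):
        active_ports[key].discard(port_id)
        # Decide against what is left now: any remaining port keeps the tunnel.
        return not active_ports[key]

    print(buggy_delete(("net1", "compute-1"), "port-1"))   # True  -> tunnel wrongly dropped
    active_ports[("net1", "compute-1")] = {"port-1", "port-2"}
    print(fixed_delete(("net1", "compute-1"), "port-1"))   # False -> tunnel kept for port-2
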
Description of problem:

Using Rally, I start 10 guests in parallel; if this test passes, the automation moves to starting 20 guests in parallel, and we keep increasing by 10 guests until we determine the maximum guest launch an OSP environment can sustain. Test cases 10-80 seem to pass successfully (unless we run into a harness issue); however, once we reach the 90-100 guest test cases, we lose connectivity to the Rally server, which is running on our cloud (to avoid the need for a ton of floating IPs). This is a non-HA run. On the Neutron Networker we see the following in ovs-agent.log with DEBUG enabled:

2014-09-02 17:49:14.546 16886 DEBUG neutron.plugins.openvswitch.agent.ovs_neutron_agent [-] Agent rpc_loop - iteration:717 completed. Processed ports statistics: {'ancillary': {'removed': 0, 'added': 0}, 'regular': {'updated': 0, 'added': 0, 'removed': 0}}. Elapsed:0.034 rpc_loop /usr/lib/python2.7/site-packages/neutron/plugins/openvswitch/agent/ovs_neutron_agent.py:1386
2014-09-02 17:49:15.454 16886 DEBUG neutron.plugins.openvswitch.agent.ovs_neutron_agent [-] Agent caught SIGTERM, quitting daemon loop. _handle_sigterm /usr/lib/python2.7/site-packages/neutron/plugins/openvswitch/agent/ovs_neutron_agent.py:1405
2014-09-02 17:49:16.513 16886 DEBUG neutron.agent.linux.async_process [-] Halting async process [['ovsdb-client', 'monitor', 'Interface', 'name,ofport', '--format=json']]. stop /usr/lib/python2.7/site-packages/neutron/agent/linux/async_process.py:90

More:

2014-09-03 11:52:34.528 20796 DEBUG neutron.plugins.openvswitch.agent.ovs_neutron_agent [-] Agent rpc_loop - iteration:32471 completed. Processed ports statistics: {'ancillary': {'removed': 0, 'added': 0}, 'regular': {'updated': 0, 'added': 0, 'removed': 0}}. Elapsed:0.034 rpc_loop /usr/lib/python2.7/site-packages/neutron/plugins/openvswitch/agent/ovs_neutron_agent.py:1386
2014-09-03 11:52:35.734 20796 DEBUG neutron.plugins.openvswitch.agent.ovs_neutron_agent [-] Agent caught SIGTERM, quitting daemon loop. _handle_sigterm /usr/lib/python2.7/site-packages/neutron/plugins/openvswitch/agent/ovs_neutron_agent.py:1405
2014-09-03 11:52:36.496 20796 DEBUG neutron.agent.linux.async_process [-] Halting async process [['ovsdb-client', 'monitor', 'Interface', 'name,ofport', '--format=json']]. stop /usr/lib/python2.7/site-packages/neutron/agent/linux/async_process.py:90
2014-09-03 11:52:36.496 20796 DEBUG neutron.agent.linux.utils [-] Running command: ['ps', '--ppid', '21126', '-o', 'pid='] create_process /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:48
2014-09-03 11:52:36.511 20796 DEBUG neutron.agent.linux.utils [-] Command: ['ps', '--ppid', '21126', '-o', 'pid='] Exit code: 0 Stdout: '21128\n' Stderr: '' execute /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:74
2014-09-03 11:52:36.511 20796 DEBUG neutron.agent.linux.utils [-] Running command: ['ps', '--ppid', '21128', '-o', 'pid='] create_process /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:48
2014-09-03 11:52:36.519 20796 DEBUG neutron.agent.linux.utils [-] Command: ['ps', '--ppid', '21128', '-o', 'pid='] Exit code: 0 Stdout: '21133\n' Stderr: '' execute /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:74
2014-09-03 11:52:36.520 20796 DEBUG neutron.agent.linux.utils [-] Running command: ['ps', '--ppid', '21133', '-o', 'pid='] create_process /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:48
2014-09-03 11:52:36.528 20796 DEBUG neutron.agent.linux.utils [-] Command: ['ps', '--ppid', '21133', '-o', 'pid='] Exit code: 1 Stdout: '' Stderr: '' execute /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:74
2014-09-03 11:52:36.528 20796 DEBUG neutron.agent.linux.utils [-] Running command: ['sudo', 'neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'kill', '-9', '21133'] create_process /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:48
2014-09-03 11:52:36.561 20796 DEBUG neutron.agent.linux.utils [-] Command: ['sudo', 'neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'kill', '-9', '21133'] Exit code: 0 Stdout: '' Stderr: '' execute /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:74
2014-09-03 11:52:38.468 9081 INFO neutron.common.config [-] Logging enabled!

Once this happens, the Rally guest loses connectivity, and I have to manually restart the Neutron Networker OSP services; restarting the Rally guest does not restore connectivity. Interesting observation: guests launched after the restart of the services seem to have connectivity to the qdhcp namespace, but the Rally guest does not.

[root@macbc305bf5f451 neutron]# ip netns exec qdhcp-02fac325-2b01-4418-b305-6ef69a877b63 ping 10.0.0.165
PING 10.0.0.165 (10.0.0.165) 56(84) bytes of data.
64 bytes from 10.0.0.165: icmp_seq=1 ttl=64 time=2.78 ms
64 bytes from 10.0.0.165: icmp_seq=2 ttl=64 time=0.554 ms
^C
--- 10.0.0.165 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.554/1.670/2.787/1.117 ms
[root@macbc305bf5f451 neutron]# ip netns exec qdhcp-02fac325-2b01-4418-b305-6ef69a877b63 ping 10.0.0.8
PING 10.0.0.8 (10.0.0.8) 56(84) bytes of data.
From 10.0.0.3 icmp_seq=1 Destination Host Unreachable
From 10.0.0.3 icmp_seq=2 Destination Host Unreachable
From 10.0.0.3 icmp_seq=3 Destination Host Unreachable
From 10.0.0.3 icmp_seq=4 Destination Host Unreachable
^C
--- 10.0.0.8 ping statistics ---
4 packets