Bug 1136969
| Summary: | [l2pop] Parallel create/delete requests to fdb entries may mix and delete a tunnel that is still needed | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Joe Talerico <jtaleric> |
| Component: | openstack-neutron | Assignee: | Ihar Hrachyshka <ihrachys> |
| Status: | CLOSED ERRATA | QA Contact: | Toni Freger <tfreger> |
| Severity: | high | Priority: | high |
| Version: | 5.0 (RHEL 7) | Target Release: | 5.0 (RHEL 7) |
| Target Milestone: | z4 | Keywords: | ZStream |
| Hardware: | All | OS: | Linux |
| Fixed In Version: | openstack-neutron-2014.1.4-1.el7ost | Doc Type: | Bug Fix |
| Last Closed: | 2015-04-16 14:36:36 UTC | Type: | Bug |
| CC: | chrisw, ihrachys, jeder, jtaleric, kambiz, lpeer, mlopes, mwagner, myllynen, nyechiel, oblaut, perfbz, tfreger, yeylon | | |
Doc Text:

Previously, the ML2 L2 population (l2pop) mechanism driver had a race condition: it could request removal of a tunnel while the tunnel was still in use, because new flows could be added in parallel with the removal of what appeared to be the last flow. Consequently, connections between instances located on different Compute nodes and attached to the same network could be lost. This update fixes the check that decides whether a tunnel is still needed or can be dropped, so that it considers all flows currently in action. As a result, the l2pop mechanism driver no longer drops a tunnel while any active flows are present.
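The shape of that corrected check can be sketched in a few lines of Python. This is only an illustration of the behavior described above, not the actual neutron code; the class and method names are invented for the example. The key point is that the port bookkeeping and the "can the tunnel be dropped?" decision happen atomically, so a port added in parallel cannot be missed:

```python
# A minimal sketch of the corrected decision, NOT the actual neutron code:
# the class and method names here are invented for illustration.
import threading

class L2PopSketch:
    def __init__(self):
        self._lock = threading.Lock()   # stands in for a DB transaction
        self._active_ports = {}         # (host, network_id) -> active port count

    def port_up(self, host, network_id):
        with self._lock:
            key = (host, network_id)
            self._active_ports[key] = self._active_ports.get(key, 0) + 1
            if self._active_ports[key] == 1:
                # First port on this host/network: ask the agent to build
                # the tunnel (FLOODING_ENTRY) along with the fdb entry.
                return "add fdb entry + FLOODING_ENTRY"
            return "add fdb entry"

    def port_down(self, host, network_id):
        with self._lock:
            key = (host, network_id)
            self._active_ports[key] -= 1
            if self._active_ports[key] == 0:
                # Only when no active port remains is the tunnel torn down.
                return "remove fdb entry + FLOODING_ENTRY"
            return "remove fdb entry"
```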
Description
Joe Talerico
2014-09-03 17:18:06 UTC
This is a RHEL OSP 5 on RHEL 7 deployment, non-HA, with 5 nodes deployed with Staypuft: 1x Controller, 1x Neutron Networker, and 3x Compute nodes. Rally is running within the OpenStack cloud deployed above, with a floating IP used to make API calls to OpenStack. To be clear, both the floating IP and the internal address are unreachable.

Created attachment 934192 [details]
ovs-agent log-file
Hit this issue again. Here is `ovs-vsctl show` after I hit the issue and cleaned up all the old guests except for the Rally guest, which is still there:

```
[root@macbc305bf5f451 neutron]# ovs-vsctl show
8e923870-f081-4585-ba03-7ec16e55b6cf
    Bridge br-ex
        Port br-ex
            Interface br-ex
                type: internal
        Port "qg-8a7abac9-d6"
            Interface "qg-8a7abac9-d6"
                type: internal
        Port phy-br-ex
            Interface phy-br-ex
        Port "eno2"
            Interface "eno2"
    Bridge br-int
        fail_mode: secure
        Port "tap7418e859-a2"
            tag: 1
            Interface "tap7418e859-a2"
                type: internal
        Port "tapbfc4341f-c8"
            tag: 2
            Interface "tapbfc4341f-c8"
                type: internal
        Port int-br-ex
            Interface int-br-ex
        Port patch-tun
            Interface patch-tun
                type: patch
                options: {peer=patch-int}
        Port "qr-f23ab158-ff"
            tag: 1
            Interface "qr-f23ab158-ff"
                type: internal
        Port br-int
            Interface br-int
                type: internal
    Bridge br-tun
        Port br-tun
            Interface br-tun
                type: internal
        Port patch-int
            Interface patch-int
                type: patch
                options: {peer=patch-tun}
    ovs_version: "2.1.3"
```

Note: the tunnel port for the compute node where the Rally guest runs is missing, so the guest has zero connectivity to anything.

After restarting services:

```
8e923870-f081-4585-ba03-7ec16e55b6cf
    Bridge br-ex
        Port br-ex
            Interface br-ex
                type: internal
        Port phy-br-ex
            Interface phy-br-ex
        Port "qg-8a7abac9-d6"
            Interface "qg-8a7abac9-d6"
                type: internal
        Port "eno2"
            Interface "eno2"
    Bridge br-int
        fail_mode: secure
        Port "qr-f23ab158-ff"
            tag: 1
            Interface "qr-f23ab158-ff"
                type: internal
        Port "tapbfc4341f-c8"
            tag: 2
            Interface "tapbfc4341f-c8"
                type: internal
        Port "tap7418e859-a2"
            tag: 1
            Interface "tap7418e859-a2"
                type: internal
        Port br-int
            Interface br-int
                type: internal
        Port patch-tun
            Interface patch-tun
                type: patch
                options: {peer=patch-int}
        Port int-br-ex
            Interface int-br-ex
    Bridge br-tun
        Port "vxlan-ac1264f1"
            Interface "vxlan-ac1264f1"
                type: vxlan
                options: {in_key=flow, local_ip="172.18.100.240", out_key=flow, remote_ip="172.18.100.241"}
        Port br-tun
            Interface br-tun
                type: internal
        Port patch-int
            Interface patch-int
                type: patch
                options: {peer=patch-tun}
    ovs_version: "2.1.3"
```

Disabling L2Population seemed to fix the problem of not being able to get above 100 guests launched concurrently. In the latest testing with L2Population disabled, over 3 iterations, I averaged 110 (90, 130, 120). I note this because I am not seeing the br-tun tunnel port removed during this test, like I would when L2Population was enabled.

I can recreate this bug by running:

```
rally task start rally-launch
```

I was using a RHEL 7 cloud image. I would run the above command a couple of times until failure; failure is when I lose connectivity to the Rally guest. The rally-launch scenario:

```json
{
    "VMTasks.boot_runcommand_delete": [
        {
            "runner": {
                "type": "constant",
                "times": 100,
                "concurrency": 100
            },
            "args": {
                "username": "root",
                "floating_network": "Public",
                "use_floatingip": false,
                "script": "/opt/rally/true.sh",
                "auto_assign_nic": true,
                "fixed_network": "private",
                "interpreter": "/bin/sh",
                "flavor": {
                    "name": "m1.small"
                },
                "image": {
                    "name": "rhel7"
                },
                "detailed": true
            },
            "context": {
                "users": {
                    "users_per_tenant": 1,
                    "tenants": 1
                },
                "quotas": {
                    "neutron": {
                        "network": -1,
                        "port": -1
                    },
                    "nova": {
                        "instances": -1,
                        "cores": -1,
                        "ram": -1
                    }
                }
            }
        }
    ]
}
```

This looks like a race condition when multiple port delete and port create/update requests are incoming for the OVS agent. It may turn out that neutron-server incorrectly detects that the whole tunnel is unused and requests FLOODING_ENTRY removal while new ports are coming in later.
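To make the suspected interleaving concrete, here is a minimal, self-contained Python sketch of the race. It is hypothetical, not neutron code: the delete handler snapshots the active-port count, a concurrent create lands, and the delete then acts on its stale snapshot and tears down the tunnel that the freshly created port still needs, much like the missing "vxlan-ac1264f1" port seen above.

```python
# Hypothetical reproduction of the race described above (plain Python,
# not neutron code). "net-1" and the handler names are invented.
import threading
import time

active_ports = {"net-1": 1}            # one port left on the compute host
tunnel_up = True
create_committed = threading.Event()

def handle_port_delete():
    global tunnel_up
    remaining = active_ports["net-1"] - 1   # stale snapshot: 0
    create_committed.wait()                 # a concurrent create slips in here
    active_ports["net-1"] -= 1
    if remaining == 0:                      # decides on the stale value
        tunnel_up = False                   # i.e., FLOODING_ENTRY removal sent

def handle_port_create():
    active_ports["net-1"] += 1
    create_committed.set()

t1 = threading.Thread(target=handle_port_delete)
t2 = threading.Thread(target=handle_port_create)
t1.start()
time.sleep(0.1)
t2.start()
t1.join()
t2.join()

# One active port remains, yet the tunnel was torn down.
print(active_ports["net-1"], tunnel_up)     # -> 1 False
```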
I've added a patch to the external links list that may fix the issues that we experience, though more testing is needed to make sure it helps.

Raised the bug upstream (see the external trackers' list).

Vivek from HP, who worked on DVR, privately told me that he is willing to provide the fix for this, because the DVR team re-introduced that regression in Juno after it was fixed there.

*** Bug 1141497 has been marked as a duplicate of this bug. ***

Fix arrived with the 2014.1.4 rebase.

This bug was verified on RHEL 7.1 with a smaller setup (all-in-one + Compute node) and the Rally "boot_runcommand_delete" script, with L2pop enabled. Times=500, Concurrency=50.

openstack-neutron-ml2-2014.1.4-1.el7ost.noarch
openstack-neutron-openvswitch-2014.1.4-1.el7ost.noarch
openstack-neutron-2014.1.4-1.el7ost.noarch

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0829.html