Bug 1377305 - Deleting an 'used' ovs port leads to ofport assigned duplication
Summary: Deleting an 'used' ovs port leads to ofport assigned duplication
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: openvswitch
Version: 7.3
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Eelco Chaudron
QA Contact: ovs-qe
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2016-09-19 12:10 UTC by Miguel Angel Ajo
Modified: 2017-05-10 11:29 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-05-10 08:53:52 UTC
Target Upstream Version:
Embargoed:
majopela: needinfo+


Attachments
ovs-vswitchd log when reproduced. (5.30 MB, text/plain)
2016-09-19 12:10 UTC, Miguel Angel Ajo


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1624701 0 None None None 2016-09-19 12:10:40 UTC

Description Miguel Angel Ajo 2016-09-19 12:10:41 UTC
Created attachment 1202461
ovs-vswitchd log when reproduced.

Description of problem:


Within the neutron context, we found that sometimes openvswitch would assign a duplicated ofport. We initially tried to reproduce such behaviour with no success.

Kevin Benton discovered that this happens when dnsmasq is in place and using the port. If you delete the port while dnsmasq is using it, ovs-vswitchd eventually goes haywire: 'added interface tap%% on port ##' is logged over and over, always pointing to the same ofport.


Version-Release number of selected component (if applicable):

2.0.x and 2.5.x series.


How reproducible:

sometimes only.

Steps to Reproduce:
1. create an internal port [we do it in a namespace, but I'm not sure that's critical to reproduce it]
2. attach dnsmasq to it
3. kill the port
4. repeat 1-3 several times
5. attach new ports


Actual results:

2016-09-16T02:26:55.037Z|00583|dpif|WARN|system@ovs-system: port_del failed (No such device)
2016-09-16T02:26:55.083Z|00584|bridge|INFO|bridge br-int: added interface tap60c6b7ea-54 on port 163
2016-09-16T02:26:57.347Z|00585|dpif|WARN|system@ovs-system: port_del failed (No such device)
2016-09-16T02:26:57.413Z|00586|bridge|INFO|bridge br-int: added interface tap8b78341c-10 on port 166
2016-09-16T02:26:57.469Z|00587|bridge|INFO|bridge br-int: added interface tap10cc4d3a-5d on port 166
2016-09-16T02:26:57.469Z|00588|netdev_linux|WARN|Dropped 1 log messages in last 20 seconds (most recently, 20 seconds ago) due to excessive rate
2016-09-16T02:26:57.469Z|00589|netdev_linux|WARN|query tap8b78341c-10 qdisc failed (No such device)
2016-09-16T02:26:57.517Z|00590|bridge|INFO|bridge br-int: added interface tap8b78341c-10 on port 166
2016-09-16T02:26:57.565Z|00591|bridge|INFO|bridge br-int: added interface tap10cc4d3a-5d on port 166
2016-09-16T02:26:57.637Z|00592|bridge|INFO|bridge br-int: added interface tap8b78341c-10 on port 166
2016-09-16T02:26:57.693Z|00593|bridge|INFO|bridge br-int: added interface tap10cc4d3a-5d on port 166
2016-09-16T02:26:57.737Z|00594|bridge|INFO|bridge br-int: added interface tap8b78341c-10 on port 166
2016-09-16T02:26:57.773Z|00595|bridge|INFO|bridge br-int: added interface tap10cc4d3a-5d on port 166
2016-09-16T02:26:57.825Z|00596|bridge|INFO|bridge br-int: added interface tap8b78341c-10 on port 166
2016-09-16T02:26:57.869Z|00597|bridge|INFO|bridge br-int: added interface tap10cc4d3a-5d on port 166
2016-09-16T02:26:57.925Z|00598|bridge|INFO|bridge br-int: added interface tap8b78341c-10 on port 166
2016-09-16T02:26:57.965Z|00599|bridge|INFO|bridge br-int: added interface tap10cc4d3a-5d on port 166
2016-09-16T02:26:58.017Z|00600|bridge|INFO|bridge br-int: added interface tap8b78341c-10 on port 166
2016-09-16T02:26:58.077Z|00601|bridge|INFO|bridge br-int: added interface tap10cc4d3a-5d on port 166
2016-09-16T02:26:58.121Z|00602|bridge|INFO|bridge br-int: added interface tap8b78341c-10 on port 166
2016-09-16T02:26:58.165Z|00603|bridge|INFO|bridge br-int: added interface tap10cc4d3a-5d on port 166
2016-09-16T02:26:58.217Z|00604|bridge|INFO|bridge br-int: added interface tap8b78341c-10 on port 166
2016-09-16T02:26:58.249Z|00605|bridge|INFO|bridge br-int: added interface tap10cc4d3a-5d on port 166
2016-09-16T02:26:58.301Z|00606|bridge|INFO|bridge br-int: added interface tap8b78341c-10 on port 166
2016-09-16T02:26:58.373Z|00607|bridge|INFO|bridge br-int: added interface tap10cc4d3a-5d on port 166
2016-09-16T02:26:58.421Z|00608|bridge|INFO|bridge br-int: added interface tap8b78341c-10 on port 166
2016-09-16T02:26:58.484Z|00609|bridge|INFO|bridge br-int: added interface tap10cc4d3a-5d on port 166
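
A log like the one above can be scanned for ofport numbers that were reported for more than one interface name. The following is a minimal sketch, assuming the "added interface <name> on port <n>" message format seen in this excerpt; the function name find_duplicate_ofports is made up for illustration and is not part of Open vSwitch.

```shell
#!/bin/sh
# Hypothetical helper (not an OVS tool): report ofport numbers that
# ovs-vswitchd logged for more than one interface name. Assumes the
# "... added interface <name> on port <n>" line format shown above.
# Reads the files given as arguments, or stdin if there are none.
find_duplicate_ofports() {
    awk '
        /added interface .* on port [0-9]+$/ {
            port = $NF          # last field: the ofport number
            name = $(NF - 3)    # interface name precedes "on port <n>"
            key = port SUBSEP name
            if (!(key in seen)) {
                seen[key] = 1
                count[port]++
                if (port in names)
                    names[port] = (names[port] " " name)
                else
                    names[port] = name
            }
        }
        END {
            for (port in count)
                if (count[port] > 1)
                    print port ": " names[port]
        }
    ' "$@" | sort -n
}
```

Run against the excerpt above, this prints "166: tap8b78341c-10 tap10cc4d3a-5d". It is only a heuristic: a port number that is legitimately reused after its previous interface was deleted is reported as well, so each hit still needs to be checked against the surrounding log lines.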

Expected results:

No duplicated port assignments

Additional info:

Comment 2 Thadeu Lima de Souza Cascardo 2016-09-20 19:39:41 UTC
Hi, Ajo.

What do you mean by "kill the port"? Do you mean remove it from the switch, by using ovs-vsctl del-port?

Thanks.
Cascardo.

Comment 3 Miguel Angel Ajo 2016-10-17 13:05:50 UTC
It's an internal port, I'd guess it's both when we do a del-port  ?

Comment 4 Thadeu Lima de Souza Cascardo 2016-12-02 18:30:33 UTC
Hey, team.

Moving back to you. I have come across this bug before, as it was reported on the mailing list, and my first investigation suggested that a race was possible when assigning an ofport number. Take a look at alloc_ofp_port in ofproto/ofproto.c. It may not be possible for multiple threads to run it; that is something I still needed to check. But maybe there is some error path in there that allows a given ofp_port to be reused.

Cascardo.

Comment 5 Eelco Chaudron 2017-01-13 15:40:39 UTC
Hi Miguel,

I tried to replicate this with your minimal info, but I cannot see the problem on either 2.5 or the latest 2.6.1 test image (http://download-node-02.eng.bos.redhat.com/brewroot/packages/openvswitch/2.6.1/3.git20161206.el7fdb/x86_64/openvswitch-2.6.1-3.git20161206.el7fdb.x86_64.rpm).

Can you give me more exact steps on how to replicate this, i.e. the command lines executed? Also, if you are retrying, please try with 2.6.1 as well.

FYI, I tried something like this:

ovs-vsctl add-br br0

# Repeat N times (set N first); vlanX is the internal port name.
for i in $(seq 1 "$N"); do
  ovs-vsctl add-port br0 vlanX -- set interface vlanX type=internal

  killall dnsmasq
  /sbin/dnsmasq --interface vlanX \
                --dhcp-range 192.168.122.2,192.168.122.254

  ovs-vsctl del-port br0 vlanX
done
ovs-vsctl add-port br0 enp129s0f0

Comment 6 Miguel Angel Ajo 2017-01-31 12:32:04 UTC
Could you try leaving a few seconds between dnsmasq and del-port?

Could you also move the port inside a namespace, and run dnsmasq inside that namespace too?


I wonder if we need to exercise any DHCP request at all, but I guess it doesn't change the picture.


Thanks for trying. If those changes don't have an effect, I will revisit the reproducer details with the upstream OpenStack logs where we saw this.
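
The two suggestions above (a pause before del-port, and keeping both the port and dnsmasq inside a namespace) could be combined into one loop roughly as follows. This is an untested sketch, not a confirmed reproducer: the bridge, port and namespace names, the 192.168.122.1/24 address, the pid-file path, the 5-second default delay, and the DRY_RUN switch are all assumptions chosen for illustration.

```shell
#!/bin/sh
# Sketch only: names, address and delay are assumptions, and nothing here
# is known to trigger the bug. Set DRY_RUN=1 to print the commands instead
# of running them (a real run needs root, Open vSwitch and dnsmasq).
BRIDGE=${BRIDGE:-br0}
DELAY=${DELAY:-5}

# Run a command, or just print it when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# One create / attach dnsmasq / delete cycle for a single port.
one_round() {
    port=$1
    ns=$2
    pidfile=/tmp/dnsmasq-$port.pid    # hypothetical path
    run ip netns add "$ns"
    run ovs-vsctl add-port "$BRIDGE" "$port" -- set interface "$port" type=internal
    # Move the internal port into the namespace right after creating it.
    run ip link set "$port" netns "$ns"
    run ip netns exec "$ns" ip addr add 192.168.122.1/24 dev "$port"
    run ip netns exec "$ns" ip link set "$port" up
    run ip netns exec "$ns" dnsmasq --interface="$port" \
        --dhcp-range=192.168.122.2,192.168.122.254 --pid-file="$pidfile"
    run sleep "$DELAY"                # pause between dnsmasq and del-port
    # Delete the port while dnsmasq is still attached to it.
    run ovs-vsctl del-port "$BRIDGE" "$port"
    [ -n "${DRY_RUN:-}" ] || kill "$(cat "$pidfile")"
    run ip netns del "$ns"
}

# Repeat the cycle the given number of times on one bridge.
reproduce() {
    rounds=$1
    run ovs-vsctl --may-exist add-br "$BRIDGE"
    i=1
    while [ "$i" -le "$rounds" ]; do
        one_round "tap-test$i" "ns-test$i"
        i=$((i + 1))
    done
}
```

With the functions loaded, `DRY_RUN=1 reproduce 3` prints the command sequence for three rounds, and `reproduce 3` without DRY_RUN runs it for real. Adding fresh ports afterwards and checking the ovs-vswitchd log for repeated "added interface ... on port N" lines would show whether the problem was triggered.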

Comment 7 Eelco Chaudron 2017-02-16 09:43:59 UTC
I tried your suggestions but had no luck replicating this. Please provide a simple reproducer so I can continue my investigation. Going over the code, I see no obvious way this could have occurred.

Comment 8 Miguel Angel Ajo 2017-02-16 10:48:26 UTC
@eelco, do you mind if I "Un-Private" our comments? I would like to open this thread up, to have help from upstream on this matter.

Comment 9 Eelco Chaudron 2017-02-16 11:30:02 UTC
Went over the mailing list and could not find the previous report Cascardo was talking about. Also walked a bit over the code and I see no obvious way a duplicate ofport could be assigned.

Tried various ways to reproduce this, even talking to Miguel, but was not successful.

To continue my investigation I need a "simple" reproducer.

Comment 10 Miguel Angel Ajo 2017-02-21 17:17:02 UTC
I'm removing the private flags of our messages to ask for help upstream.

Comment 11 Miguel Angel Ajo 2017-02-21 20:07:05 UTC
Sorry, re-adding my needinfo. I'm asking the neutron PTL for the details; I remember he added a workaround in the neutron agents to avoid triggering this, but I can't remember the exact details.

Comment 13 Kevin Benton 2017-02-26 12:10:52 UTC
Hi,

I don't have an easy script to reproduce it. Back when I worked on the bug, I was only able to trigger it using the actual neutron DHCP agent while running tempest tests, so it may also involve the flow rules that the L2 agent sets up for the port.

One option would be to do a devstack stable/newton setup and then revert commit 2f44402777a662fb68a069443b41c75b68b05287 and restart the agent to put it in the state before my bug fix.

Then a cycle of creating and deleting networks with subnets while regular tempest tests are being executed might reproduce it.

This was heavily impacting our gate jobs running xenial at the time.


Sorry I don't have something more concrete.

Comment 14 Kevin Benton 2017-02-26 12:19:31 UTC
Two more things that may help: 

1. This was with the version of OVS that shipped with Ubuntu xenial back in September.
2. The dhcp agent does immediately move the tap device into a namespace after creating it, and runs dnsmasq in that namespace. This may be an important component, since we had problems with ports disappearing from vswitchd for a short period of time after moving and then re-appearing (https://bugs.launchpad.net/neutron/+bug/1618987).

Comment 19 Eelco Chaudron 2017-05-10 08:53:52 UTC
As we were not able to get a reproducer for this, I'm closing this BZ for now with insufficient data. We can re-open it if we get a reproducer.

