Bug 1889419 - ovn-controller doesn't reclaim localport after reconnection
Summary: ovn-controller doesn't reclaim localport after reconnection
Keywords:
Status: NEW
Alias: None
Product: Red Hat Enterprise Linux Fast Datapath
Classification: Red Hat
Component: ovn2.13
Version: FDP 20.G
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Assignee: OVN Team
QA Contact: Jianlin Shi
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-10-19 15:54 UTC by Jakub Libosvar
Modified: 2020-10-20 12:10 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:



Description Jakub Libosvar 2020-10-19 15:54:50 UTC
Description of problem:
This was discovered in OSP. The ovn-metadata localport doesn't get its OpenFlow flows refreshed after being removed and re-added following a network outage.

Version-Release number of selected component (if applicable):
ovn2.13-20.06.2-11.el8fdp.x86_64

How reproducible:
Always

Steps to Reproduce in OSP:
1. Create a VM V on node C
2. Drop outgoing connections from the compute node to the OVN DBs with iptables
3. Wait until ovn-controller notices the problem
4. Remove the iptables rule blocking the connection
5. Delete the VM; it must be the last VM on the given logical switch with a port bound to this chassis
6. Create a new VM on the same node C
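Steps 2-4 can be sketched as follows. The SB DB address and port (172.17.1.27:6642, taken from the logs below) and the use of the OUTPUT chain are assumptions about this particular environment:

```shell
# Block outgoing traffic from the compute node to the OVN southbound DB
# (address/port taken from the ovn-controller logs below; adjust to match
# the actual SB DB endpoint).
iptables -I OUTPUT -p tcp -d 172.17.1.27 --dport 6642 -j DROP

# Wait until ovn-controller reports the broken connection, e.g.:
#   reconnect|ERR|tcp:172.17.1.27:6642: no response to inactivity probe ...

# Restore connectivity by deleting the same rule.
iptables -D OUTPUT -p tcp -d 172.17.1.27 --dport 6642 -j DROP
```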

Actual results:
When the last VM (logical port) is removed from the node, the ovn-metadata agent deletes the OVN localport for metadata. Once a new VM is spawned on the node, the ovn-metadata agent creates a new localport, but ovn-controller doesn't notice and keeps the old flow, which outputs to the old ofport. Communication between the VM logical port and the metadata localport doesn't work because of the wrong ofport number.
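One way to confirm the stale flow (a sketch; the interface name tap-metadata and the bridge name br-int are environment-specific assumptions) is to compare the current ofport of the metadata interface with the ofport referenced by the installed flows:

```shell
# Current ofport assigned to the metadata localport's interface
# (replace tap-metadata with the actual interface name).
ovs-vsctl get Interface tap-metadata ofport

# Flows on the integration bridge that output to a numeric ofport; after
# hitting the bug, the flow for the metadata localport still references
# the old, now-invalid ofport number.
ovs-ofctl dump-flows br-int | grep -E 'output:[0-9]+'
```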

Additional info:
If steps 2-4 are omitted, everything works fine.

It's easily reproducible; here are snippets from the ovn-controller logs.

Non-working case:
2020-10-19T15:32:20.391Z|00014|binding|INFO|Claiming lport aa3a10f8-a853-4beb-928e-bde1a6202b48 for this chassis.
2020-10-19T15:32:20.391Z|00015|binding|INFO|aa3a10f8-a853-4beb-928e-bde1a6202b48: Claiming fa:16:3e:11:23:16 192.168.30.152
                                            ^^ VM logical port

2020-10-19T15:32:21.847Z|00019|binding|INFO|Claiming lport 30f6fbea-3e5d-463b-9adc-a7ad8d2ed39f for this chassis.
2020-10-19T15:32:21.847Z|00020|binding|INFO|30f6fbea-3e5d-463b-9adc-a7ad8d2ed39f: Claiming fa:16:3e:d6:0d:00 192.168.30.2
                                            ^^ metadata localport

2020-10-19T15:38:16.456Z|00021|reconnect|ERR|tcp:172.17.1.27:6642: no response to inactivity probe after 60 seconds, disconnecting
2020-10-19T15:38:16.456Z|00022|reconnect|INFO|tcp:172.17.1.27:6642: connection dropped
2020-10-19T15:38:17.457Z|00023|reconnect|INFO|tcp:172.17.1.27:6642: connecting...
2020-10-19T15:38:18.458Z|00024|reconnect|INFO|tcp:172.17.1.27:6642: connection attempt timed out
2020-10-19T15:38:18.458Z|00025|reconnect|INFO|tcp:172.17.1.27:6642: waiting 2 seconds before reconnect
2020-10-19T15:38:20.461Z|00026|reconnect|INFO|tcp:172.17.1.27:6642: connecting...
2020-10-19T15:38:22.461Z|00027|reconnect|INFO|tcp:172.17.1.27:6642: connection attempt timed out
2020-10-19T15:38:22.461Z|00028|reconnect|INFO|tcp:172.17.1.27:6642: waiting 4 seconds before reconnect
2020-10-19T15:38:26.466Z|00029|reconnect|INFO|tcp:172.17.1.27:6642: connecting...
2020-10-19T15:38:30.471Z|00030|reconnect|INFO|tcp:172.17.1.27:6642: connection attempt timed out
2020-10-19T15:38:30.471Z|00031|reconnect|INFO|tcp:172.17.1.27:6642: continuing to reconnect in the background but suppressing further logging
2020-10-19T15:38:54.483Z|00032|reconnect|INFO|tcp:172.17.1.27:6642: connected
2020-10-19T15:38:54.495Z|00033|main|INFO|OVNSB IDL reconnected, force recompute.

2020-10-19T15:39:17.580Z|00034|binding|INFO|Releasing lport aa3a10f8-a853-4beb-928e-bde1a6202b48 from this chassis.
                                            ^^ VM logical port removed

2020-10-19T15:39:46.492Z|00035|binding|INFO|Claiming lport bc62b1fc-d658-4d0b-8555-20fe11dbecea for this chassis.
2020-10-19T15:39:46.492Z|00036|binding|INFO|bc62b1fc-d658-4d0b-8555-20fe11dbecea: Claiming fa:16:3e:db:64:9e 192.168.30.95
                                            ^^ new VM logical port

The metadata localport is not released. In the working case (i.e., when there was no connection disruption), there is an additional log message about the localport:
2020-10-19T15:12:01.023Z|00044|binding|INFO|Changing chassis for lport c3e85ffb-62a9-4643-99ca-ddca450a4de3 from fa29f9dc-c9a6-41a7-9160-2b31a6c703d7 to 5405eec9-fa6b-432f-b35a-91654a5ca634.
2020-10-19T15:12:01.023Z|00045|binding|INFO|c3e85ffb-62a9-4643-99ca-ddca450a4de3: Claiming fa:16:3e:09:e6:b8 10.100.0.2

Note: The working case is from a different environment, hence the different IDs.

Calling ovs-appctl recompute fixes the problem. This is a regression from the previous OVN 2.13 FDP.
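As a workaround until a fix lands, the recompute mentioned above can be triggered through ovn-controller's unixctl interface (a sketch; the exact target/socket path depends on the installation):

```shell
# Force ovn-controller to recompute all flows; this reclaims the
# localport binding and reinstalls flows with the correct ofport.
ovs-appctl -t ovn-controller recompute
```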

