Bug 1889419

Summary: ovn-controller doesn't reclaim localport after reconnection
Product: Red Hat Enterprise Linux Fast Datapath
Reporter: Jakub Libosvar <jlibosva>
Component: ovn2.13
Assignee: lorenzo bianconi <lorenzo.bianconi>
Status: CLOSED DUPLICATE
QA Contact: Jianlin Shi <jishi>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: FDP 20.G
CC: bdobreli, ctrautma, jishi, lorenzo.bianconi, mkrcmari, ralongi
Target Milestone: ---
Keywords: Regression
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-03-17 15:08:01 UTC
Type: Bug

Description Jakub Libosvar 2020-10-19 15:54:50 UTC
Description of problem:
This was discovered in OSP. The ovn-metadata localport doesn't get its OpenFlow flows refreshed after it is removed and re-added following a network outage.
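
For context, the metadata port is a logical switch port of type "localport": OVN expects it to exist on every chassis, and the ovn-metadata agent creates one per network. A minimal sketch of how such a port is defined follows; the switch and port names are made up for illustration, while the MAC/IP mirror the metadata localport seen in the logs below.

  # Hypothetical localport definition; names are illustrative, not from this deployment.
  ovn-nbctl lsp-add neutron-net-1 metadata-port-1
  ovn-nbctl lsp-set-type metadata-port-1 localport
  ovn-nbctl lsp-set-addresses metadata-port-1 "fa:16:3e:d6:0d:00 192.168.30.2"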

Version-Release number of selected component (if applicable):
ovn2.13-20.06.2-11.el8fdp.x86_64

How reproducible:
Always

Steps to Reproduce in OSP:
1. Create a VM V on node C
2. Drop outgoing connections from the compute node to the OVN DBs with iptables (see the sketch after this list)
3. Wait until ovn-controller notices the problem
4. Remove the iptables rule blocking the connection
5. Delete the VM; it must be the last VM on the given logical switch with a port bound to this chassis
6. Create a new VM on the same node C
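
A minimal sketch of steps 2-4, assuming the southbound DB endpoint seen in the logs below (172.17.1.27:6642); the exact rule depends on how the OVN DBs are reachable in the deployment:

  # Step 2: drop outgoing traffic from the compute node to the OVN SB DB (assumed endpoint).
  iptables -I OUTPUT -p tcp -d 172.17.1.27 --dport 6642 -j DROP
  # Step 3: wait until ovn-controller logs the inactivity-probe timeout and starts reconnecting.
  # Step 4: remove the blocking rule again.
  iptables -D OUTPUT -p tcp -d 172.17.1.27 --dport 6642 -j DROP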

Actual results:
When the last VM (logical port) is removed from the node, the ovn-metadata agent deletes the OVN localport for metadata. Once a new VM is spawned on the node, the ovn-metadata agent creates a new localport, but ovn-controller doesn't notice and keeps the old flow, which outputs to the old ofport. Communication between the VM logical port and the metadata localport therefore doesn't work because of the wrong ofport number.
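
One way to confirm the mismatch on the node (a hedged sketch: br-int is the usual integration bridge name, and the metadata interface name is a placeholder, not taken from this report):

  # ofport currently assigned to the recreated metadata interface (name is a placeholder).
  ovs-vsctl get Interface <metadata-interface> ofport
  # Flows installed by ovn-controller still reference the old ofport in their output action.
  ovs-ofctl dump-flows br-int | grep "output:"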

Additional info:
If steps 2-4 are omitted, everything works fine.

It's easily reproducible; here are snippets from the ovn-controller logs.

Non-working case:
2020-10-19T15:32:20.391Z|00014|binding|INFO|Claiming lport aa3a10f8-a853-4beb-928e-bde1a6202b48 for this chassis.
2020-10-19T15:32:20.391Z|00015|binding|INFO|aa3a10f8-a853-4beb-928e-bde1a6202b48: Claiming fa:16:3e:11:23:16 192.168.30.152
                                            ^^ VM logical port

2020-10-19T15:32:21.847Z|00019|binding|INFO|Claiming lport 30f6fbea-3e5d-463b-9adc-a7ad8d2ed39f for this chassis.
2020-10-19T15:32:21.847Z|00020|binding|INFO|30f6fbea-3e5d-463b-9adc-a7ad8d2ed39f: Claiming fa:16:3e:d6:0d:00 192.168.30.2
                                            ^^ metadata localport

2020-10-19T15:38:16.456Z|00021|reconnect|ERR|tcp:172.17.1.27:6642: no response to inactivity probe after 60 seconds, disconnecting
2020-10-19T15:38:16.456Z|00022|reconnect|INFO|tcp:172.17.1.27:6642: connection dropped
2020-10-19T15:38:17.457Z|00023|reconnect|INFO|tcp:172.17.1.27:6642: connecting...
2020-10-19T15:38:18.458Z|00024|reconnect|INFO|tcp:172.17.1.27:6642: connection attempt timed out
2020-10-19T15:38:18.458Z|00025|reconnect|INFO|tcp:172.17.1.27:6642: waiting 2 seconds before reconnect
2020-10-19T15:38:20.461Z|00026|reconnect|INFO|tcp:172.17.1.27:6642: connecting...
2020-10-19T15:38:22.461Z|00027|reconnect|INFO|tcp:172.17.1.27:6642: connection attempt timed out
2020-10-19T15:38:22.461Z|00028|reconnect|INFO|tcp:172.17.1.27:6642: waiting 4 seconds before reconnect
2020-10-19T15:38:26.466Z|00029|reconnect|INFO|tcp:172.17.1.27:6642: connecting...
2020-10-19T15:38:30.471Z|00030|reconnect|INFO|tcp:172.17.1.27:6642: connection attempt timed out
2020-10-19T15:38:30.471Z|00031|reconnect|INFO|tcp:172.17.1.27:6642: continuing to reconnect in the background but suppressing further logging
2020-10-19T15:38:54.483Z|00032|reconnect|INFO|tcp:172.17.1.27:6642: connected
2020-10-19T15:38:54.495Z|00033|main|INFO|OVNSB IDL reconnected, force recompute.

2020-10-19T15:39:17.580Z|00034|binding|INFO|Releasing lport aa3a10f8-a853-4beb-928e-bde1a6202b48 from this chassis.
                                            ^^ VM logical port removed

2020-10-19T15:39:46.492Z|00035|binding|INFO|Claiming lport bc62b1fc-d658-4d0b-8555-20fe11dbecea for this chassis.
2020-10-19T15:39:46.492Z|00036|binding|INFO|bc62b1fc-d658-4d0b-8555-20fe11dbecea: Claiming fa:16:3e:db:64:9e 192.168.30.95
                                            ^^ new VM logical port

The metadata localport is not released. In the working case (i.e. when there was no connection disruption), there is an additional log message about the localport:
2020-10-19T15:12:01.023Z|00044|binding|INFO|Changing chassis for lport c3e85ffb-62a9-4643-99ca-ddca450a4de3 from fa29f9dc-c9a6-41a7-9160-2b31a6c703d7 to 5405eec9-fa6b-432f-b35a-91654a5ca634.
2020-10-19T15:12:01.023Z|00045|binding|INFO|c3e85ffb-62a9-4643-99ca-ddca450a4de3: Claiming fa:16:3e:09:e6:b8 10.100.0.2

Note: the working case is from a different environment, hence the different IDs.

Calling ovs-appctl recompute fixes the problem. This is a regression from the previous OVN 2.13 FDP.
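
For reference, the workaround is the ovn-controller "recompute" unixctl command; the exact invocation below is an assumption based on the ovn2.13 packaging and may differ (e.g. ovn-appctl on newer builds):

  # Force a full flow recompute in the local ovn-controller.
  ovs-appctl -t ovn-controller recompute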

Comment 2 Jakub Libosvar 2021-03-17 15:08:01 UTC
I can't reproduce this anymore. It was likely fixed by bug 1908391; marking as such.

*** This bug has been marked as a duplicate of bug 1908391 ***