Description of problem:

This has been discovered in OSP. The ovn-metadata localport doesn't get its OpenFlow flows refreshed after it is removed and added back following a network outage.

Version-Release number of selected component (if applicable):
ovn2.13-20.06.2-11.el8fdp.x86_64

How reproducible:
Always

Steps to Reproduce in OSP:
1. Create a VM V on compute node C.
2. Drop the outgoing connection from the compute node to the OVN DBs with iptables (an example rule is sketched after the log snippets below).
3. Wait until ovn-controller notices the problem.
4. Remove the iptables rule blocking the connection.
5. Delete the VM; it must be the last VM on the given logical switch with a port bound to this chassis.
6. Create a new VM on the same node C.

Actual results:

When the last VM (logical port) is removed from the node, the ovn-metadata agent deletes the OVN localport for metadata. Once a new VM is spawned on the node, the ovn-metadata agent creates a new localport, but ovn-controller doesn't notice and keeps the old flow, which still uses the old ofport for output. Communication between the VM logical port and the metadata localport doesn't work because of the wrong ofport number.

Additional info:

If steps 2-4 are omitted, everything works fine. It's easily reproducible; here are snippets from the ovn-controller logs.

Non-working case:

2020-10-19T15:32:20.391Z|00014|binding|INFO|Claiming lport aa3a10f8-a853-4beb-928e-bde1a6202b48 for this chassis.
2020-10-19T15:32:20.391Z|00015|binding|INFO|aa3a10f8-a853-4beb-928e-bde1a6202b48: Claiming fa:16:3e:11:23:16 192.168.30.152
^^ VM logical port
2020-10-19T15:32:21.847Z|00019|binding|INFO|Claiming lport 30f6fbea-3e5d-463b-9adc-a7ad8d2ed39f for this chassis.
2020-10-19T15:32:21.847Z|00020|binding|INFO|30f6fbea-3e5d-463b-9adc-a7ad8d2ed39f: Claiming fa:16:3e:d6:0d:00 192.168.30.2
^^ metadata localport
2020-10-19T15:38:16.456Z|00021|reconnect|ERR|tcp:172.17.1.27:6642: no response to inactivity probe after 60 seconds, disconnecting
2020-10-19T15:38:16.456Z|00022|reconnect|INFO|tcp:172.17.1.27:6642: connection dropped
2020-10-19T15:38:17.457Z|00023|reconnect|INFO|tcp:172.17.1.27:6642: connecting...
2020-10-19T15:38:18.458Z|00024|reconnect|INFO|tcp:172.17.1.27:6642: connection attempt timed out
2020-10-19T15:38:18.458Z|00025|reconnect|INFO|tcp:172.17.1.27:6642: waiting 2 seconds before reconnect
2020-10-19T15:38:20.461Z|00026|reconnect|INFO|tcp:172.17.1.27:6642: connecting...
2020-10-19T15:38:22.461Z|00027|reconnect|INFO|tcp:172.17.1.27:6642: connection attempt timed out
2020-10-19T15:38:22.461Z|00028|reconnect|INFO|tcp:172.17.1.27:6642: waiting 4 seconds before reconnect
2020-10-19T15:38:26.466Z|00029|reconnect|INFO|tcp:172.17.1.27:6642: connecting...
2020-10-19T15:38:30.471Z|00030|reconnect|INFO|tcp:172.17.1.27:6642: connection attempt timed out
2020-10-19T15:38:30.471Z|00031|reconnect|INFO|tcp:172.17.1.27:6642: continuing to reconnect in the background but suppressing further logging
2020-10-19T15:38:54.483Z|00032|reconnect|INFO|tcp:172.17.1.27:6642: connected
2020-10-19T15:38:54.495Z|00033|main|INFO|OVNSB IDL reconnected, force recompute.
2020-10-19T15:39:17.580Z|00034|binding|INFO|Releasing lport aa3a10f8-a853-4beb-928e-bde1a6202b48 from this chassis.
^^ VM logical port removed
2020-10-19T15:39:46.492Z|00035|binding|INFO|Claiming lport bc62b1fc-d658-4d0b-8555-20fe11dbecea for this chassis.
2020-10-19T15:39:46.492Z|00036|binding|INFO|bc62b1fc-d658-4d0b-8555-20fe11dbecea: Claiming fa:16:3e:db:64:9e 192.168.30.95
^^ new VM logical port

The metadata localport is not released.
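For reference, this is roughly how the outage can be simulated and how the stale ofport can be spotted on the compute node. The iptables, ovs-vsctl and ovs-ofctl commands are standard, but 172.17.1.27:6642 is just the SB DB endpoint taken from the logs above, and METADATA_IFACE is a placeholder for the metadata localport's interface name, which is deployment-specific:

  # Step 2: block the connection from the compute node to the OVN SB DB
  iptables -I OUTPUT -d 172.17.1.27 -p tcp --dport 6642 -j DROP
  # Step 4: remove the rule once ovn-controller has logged the disconnect
  iptables -D OUTPUT -d 172.17.1.27 -p tcp --dport 6642 -j DROP

  # After the new metadata localport is created, check its current ofport...
  ovs-vsctl get Interface METADATA_IFACE ofport
  # ...and compare it with the output ports in the flows installed on br-int;
  # in the broken case the flows keep pointing at the old, no longer existing ofport
  ovs-ofctl dump-flows br-int | grep output: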
In the working case (i.e. there was no connection disruption), there is an additional log message about the localport:

2020-10-19T15:12:01.023Z|00044|binding|INFO|Changing chassis for lport c3e85ffb-62a9-4643-99ca-ddca450a4de3 from fa29f9dc-c9a6-41a7-9160-2b31a6c703d7 to 5405eec9-fa6b-432f-b35a-91654a5ca634.
2020-10-19T15:12:01.023Z|00045|binding|INFO|c3e85ffb-62a9-4643-99ca-ddca450a4de3: Claiming fa:16:3e:09:e6:b8 10.100.0.2

Note: The working case is from a different environment, hence the different IDs.

Calling ovs-appctl recompute fixes the problem.

This is a regression from the previous OVN 2.13 FDP.
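For completeness, the recompute workaround mentioned above is ovn-controller's unixctl command; a minimal form, assuming the control socket can be resolved by target name (otherwise pass the full path to the .ctl file with -t):

  ovs-appctl -t ovn-controller recompute

This forces a full flow recomputation, after which the installed flows pick up the new ofport of the metadata localport.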
I can't reproduce this anymore. It was likely fixed by bug 1908391, marking as such.

*** This bug has been marked as a duplicate of bug 1908391 ***