Seen in https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/1239/pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-local-gateway/1599628212420218880 ovnkube-node creates the port, sets iface-id/iface-id-ver, deletes it, then re-creates (because of how the "stateful set" test in Kubernetes works...) I1205 05:54:07.239975 3586 cni.go:227] [e2e-statefulset-4735/ss-0 2cc0e91bcfc5520aa97a93468460b8feda7b261592b8ea7101a9fece29aee316] ADD starting CNI request [e2e-statefulset-4735/ss-0 2cc0e91bcfc5520aa97a93468460b8feda7b261592b8ea7101a9fece29aee316] I1205 05:54:07.241815 3586 helper_linux.go:334] ConfigureOVS: namespace: e2e-statefulset-4735, podName: ss-0, SandboxID: "2cc0e91bcfc5520aa97a93468460b8feda7b261592b8ea7101a9fece29aee316", UID: "2d8210e0-5afb-44ac-a92e-dec06573d249", MAC: 0a:58:0a:81:02:6e, IPs: [10.129.2.110/23] I1205 05:54:07.334655 3586 cni.go:248] [e2e-statefulset-4735/ss-0 2cc0e91bcfc5520aa97a93468460b8feda7b261592b8ea7101a9fece29aee316] ADD finished CNI request [e2e-statefulset-4735/ss-0 2cc0e91bcfc5520aa97a93468460b8feda7b261592b8ea7101a9fece29aee316], result "{\"interfaces\":[{\"name\":\"2cc0e91bcfc5520\",\"mac\":\"c2:38:34:74:11:cd\"},{\"name\":\"eth0\",\"mac\":\"0a:58:0a:81:02:6e\",\"sandbox\":\"/var/run/netns/106216d2-388c-4503-8150-4129b7f98adf\"}],\"ips\":[{\"interface\":1,\"address\":\"10.129.2.110/23\",\"gateway\":\"10.129.2.1\"}],\"dns\":{}}", err <nil> I1205 05:54:07.576573 3586 cni.go:227] [e2e-statefulset-4735/ss-0 2cc0e91bcfc5520aa97a93468460b8feda7b261592b8ea7101a9fece29aee316] DEL starting CNI request [e2e-statefulset-4735/ss-0 2cc0e91bcfc5520aa97a93468460b8feda7b261592b8ea7101a9fece29aee316] I1205 05:54:07.748780 3586 cni.go:248] [e2e-statefulset-4735/ss-0 2cc0e91bcfc5520aa97a93468460b8feda7b261592b8ea7101a9fece29aee316] DEL finished CNI request [e2e-statefulset-4735/ss-0 2cc0e91bcfc5520aa97a93468460b8feda7b261592b8ea7101a9fece29aee316], result "{\"dns\":{}}", err <nil> I1205 05:54:08.521666 3586 cni.go:227] [e2e-statefulset-4735/ss-0 471d464709d4cfeabe5d9374b78dac5f01385572e4fe89112ff0fb514ebb6590] ADD starting CNI request [e2e-statefulset-4735/ss-0 471d464709d4cfeabe5d9374b78dac5f01385572e4fe89112ff0fb514ebb6590] I1205 05:54:08.523503 3586 helper_linux.go:334] ConfigureOVS: namespace: e2e-statefulset-4735, podName: ss-0, SandboxID: "471d464709d4cfeabe5d9374b78dac5f01385572e4fe89112ff0fb514ebb6590", UID: "2d8210e0-5afb-44ac-a92e-dec06573d249", MAC: 0a:58:0a:81:02:6e, IPs: [10.129.2.110/23] <2 minutes, then deletes port after timeout> W1205 05:56:08.622847 3586 helper_linux.go:390] [e2e-statefulset-4735/ss-0 471d464709d4cfeabe5d9374b78dac5f01385572e4fe89112ff0fb514ebb6590] pod uid 2d8210e0-5afb-44ac-a92e-dec06573d249: timed out waiting for OVS port binding (ovn-installed) for 0a:58:0a:81:02:6e [10.129.2.110/23] ovn-controller doesn't see the second port binding request at 05:54:08 because it seems to think the SB is read-only: 2022-12-05T05:54:07.276Z|00485|binding|INFO|Claiming lport e2e-statefulset-4735_ss-0 for this chassis. 2022-12-05T05:54:07.276Z|00486|binding|INFO|e2e-statefulset-4735_ss-0: Claiming 0a:58:0a:81:02:6e 10.129.2.110 2022-12-05T05:54:07.304Z|00487|binding|INFO|Setting lport e2e-statefulset-4735_ss-0 ovn-installed in OVS 2022-12-05T05:54:07.340Z|00488|binding|INFO|Setting lport e2e-statefulset-4735_ss-0 up in Southbound 2022-12-05T05:54:07.588Z|00489|binding|INFO|Releasing lport e2e-statefulset-4735_ss-0 from this chassis (sb_readonly=1) <two minute gap, then ovnkube-node deletes the port> 2022-12-05T05:56:08.633Z|00787|binding|INFO|Releasing lport e2e-statefulset-4735_ss-0 from this chassis (sb_readonly=0) 2022-12-05T05:56:08.633Z|00788|if_status|WARN|Trying to release unknown interface e2e-statefulset-4735_ss-0 2022-12-05T05:56:09.506Z|00790|binding|INFO|Claiming lport e2e-statefulset-4735_ss-0 for this chassis. 2022-12-05T05:56:09.506Z|00791|binding|INFO|e2e-statefulset-4735_ss-0: Claiming 0a:58:0a:81:02:6e 10.129.2.110 2022-12-05T05:56:09.524Z|00792|binding|INFO|Setting lport e2e-statefulset-4735_ss-0 ovn-installed in OVS 2022-12-05T05:56:09.527Z|00793|binding|INFO|Setting lport e2e-statefulset-4735_ss-0 up in Southbound 2022-12-05T05:56:12.676Z|00818|binding|INFO|Releasing lport e2e-statefulset-4735_ss-0 from this chassis (sb_readonly=0) The SBs aren't busy at all: 2022-12-05T05:54:04.197Z|00240|ovsdb_server|INFO|memory trimming after compaction enabled. 2022-12-05T05:54:07.282Z|00241|raft|INFO|Dropped 1 log messages in last 6 seconds (most recently, 6 seconds ago) due to excessive rate 2022-12-05T05:54:07.282Z|00242|raft|INFO|current entry eid 3bae01f3-de19-4c0c-b987-5bea4acbcf0e does not match prerequisite e1a5a3f8-2a7e-49d2-a145-cbaa47abe82b in execute_command_request 2022-12-05T05:54:14.268Z|00243|ovsdb_server|INFO|memory trimming after compaction enabled. Other RAFT member log messages are just: 2022-12-05T05:53:55.521Z|00165|ovsdb_server|INFO|memory trimming after compaction enabled. 2022-12-05T05:54:05.527Z|00166|ovsdb_server|INFO|memory trimming after compaction enabled. 2022-12-05T05:54:15.547Z|00167|ovsdb_server|INFO|memory trimming after compaction enabled.
Is ovn-controller maybe not claiming the port binding request from ovnkube-node at 05:54:08 because of the sb_readonly=1 at that time?
I think that the issue is with releasing the port when sb is read only. In that case, ovn-controller might delete its internal binding for this port before sb becomes writable. When sb becomes writable, it's too late: the binding for this port is gone in ovn-controller, and ovn-controller never updates sb (it never removes chassis from port_binding in sb). Then, next time this port is claimed, ovn-controller sees the port as already claimed, does not log the claim, and does not create the interface in if-status (causing next WARN when releasing) I'll work on a fix.
Seeing this in a few more CI tests in OCP 4.13: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/1449/pull-ci-openshift-ovn-kubernetes-master-e2e-gcp-ovn/1603172544557355008/artifacts/e2e-gcp-ovn/gather-extra/artifacts/pods/openshift-ovn-kubernetes_ovnkube-node-572nb_ovn-controller.log for port e2e-statefulset-7013_ss2-0 2022-12-15T00:43:24.001Z|01585|binding|INFO|Claiming lport e2e-statefulset-7013_ss2-0 for this chassis. 2022-12-15T00:43:24.001Z|01586|binding|INFO|e2e-statefulset-7013_ss2-0: Claiming 0a:58:0a:83:01:68 10.131.1.104 2022-12-15T00:43:24.140Z|01587|binding|INFO|Setting lport e2e-statefulset-7013_ss2-0 ovn-installed in OVS 2022-12-15T00:43:24.140Z|01588|binding|INFO|Setting lport e2e-statefulset-7013_ss2-0 up in Southbound 2022-12-15T00:43:40.534Z|01659|binding|INFO|Mismatch iface-id-ver for lport e2e-statefulset-7013_ss2-0, expected 6e14a1af-8dd4-4656-ad2c-9b145386ed0d, found ddf315b4-06c9-459b-a84f-6fe75194c0c5 2022-12-15T00:43:40.534Z|01660|binding|INFO|Releasing lport e2e-statefulset-7013_ss2-0 from this chassis (sb_readonly=1) 2022-12-15T00:43:40.537Z|01661|binding|INFO|Removing lport e2e-statefulset-7013_ss2-0 ovn-installed in OVS 2022-12-15T00:45:41.819Z|01793|binding|INFO|Releasing lport e2e-statefulset-7013_ss2-0 from this chassis (sb_readonly=0) 2022-12-15T00:45:41.819Z|01794|if_status|WARN|Trying to release unknown interface e2e-statefulset-7013_ss2-0 2022-12-15T00:45:43.574Z|01795|binding|INFO|Claiming lport e2e-statefulset-7013_ss2-0 for this chassis. 2022-12-15T00:45:43.574Z|01796|binding|INFO|e2e-statefulset-7013_ss2-0: Claiming 0a:58:0a:83:01:77 10.131.1.119 2022-12-15T00:45:43.638Z|01797|binding|INFO|Setting lport e2e-statefulset-7013_ss2-0 ovn-installed in OVS 2022-12-15T00:45:43.638Z|01798|binding|INFO|Setting lport e2e-statefulset-7013_ss2-0 up in Southbound 2022-12-15T00:45:50.898Z|01800|binding|INFO|Mismatch iface-id-ver for lport e2e-statefulset-7013_ss2-0, expected 520daa0d-4ad2-4490-8d5e-fd167bc26a7d, found 6e14a1af-8dd4-4656-ad2c-9b145386ed0d 2022-12-15T00:45:51.633Z|01801|binding|INFO|Claiming lport e2e-statefulset-7013_ss2-0 for this chassis. 2022-12-15T00:45:51.633Z|01802|binding|INFO|e2e-statefulset-7013_ss2-0: Claiming 0a:58:0a:83:01:8f 10.131.1.143 2022-12-15T00:45:51.678Z|01803|binding|INFO|Setting lport e2e-statefulset-7013_ss2-0 ovn-installed in OVS 2022-12-15T00:45:51.678Z|01804|binding|INFO|Setting lport e2e-statefulset-7013_ss2-0 up in Southbound 2022-12-15T00:46:04.863Z|01806|binding|INFO|Releasing lport e2e-statefulset-7013_ss2-0 from this chassis (sb_readonly=0)
Patch posted upstream for review: http://patchwork.ozlabs.org/project/ovn/patch/20230104083209.3880669-2-xsimonar@redhat.com/
ovn23.03 fast-datapath-rhel-8 clone created at https://bugzilla.redhat.com/show_bug.cgi?id=2196412 ovn23.03 fast-datapath-rhel-9 clone created at https://bugzilla.redhat.com/show_bug.cgi?id=2196413 ovn22.12 fast-datapath-rhel-8 clone created at https://bugzilla.redhat.com/show_bug.cgi?id=2196414 ovn22.12 fast-datapath-rhel-9 clone created at https://bugzilla.redhat.com/show_bug.cgi?id=2196415
Reproduced on: ovn22.12-22.12.0-51.el8fdp.x86_64 ovn22.12-central-22.12.0-51.el8fdp.x86_64 openvswitch-selinux-extra-policy-1.0-30.el8fdp.noarch ovn22.12-host-22.12.0-51.el8fdp.x86_64 openvswitch2.17-2.17.0-95.el8fdp.x86_64 Here is the reproducer: systemctl start ovn-northd ovn-nbctl set-connection ptcp:6641 ovn-sbctl set-connection ptcp:6642 systemctl start openvswitch ovs-vsctl set open . external_ids:system-id=hv1 ovs-vsctl set open . external_ids:ovn-remote=tcp:127.0.0.1:6642 ovs-vsctl set open . external_ids:ovn-encap-type=geneve ovs-vsctl set open . external_ids:ovn-encap-ip=127.0.0.1 systemctl restart ovn-controller # Add switch ovn-nbctl ls-add ls1 ovn-nbctl set Logical_Switch ls1 other_config:subnet=10.1.0.0/16 ovn-nbctl --wait=hv sync ovn-nbctl lsp-add ls1 lsp1 ovn-nbctl lsp-add ls1 lsp2 # add_ovs_interfaces ovs-vsctl --no-wait -- add-port br-int vif1 -- set Interface vif1 external_ids:iface-id=lsp1 -- set Interface vif1 type=internal ovs-vsctl --no-wait -- add-port br-int vif2 -- set Interface vif2 external_ids:iface-id=lsp2 -- set Interface vif2 type=internal # wait_for_local_bindings sleep 5 # check bindings, should be 2 ovn-appctl -t ovn-controller debug/dump-local-bindings | grep interface | wc -l |grep 2 # check how many ports are up, should be 2 ovn-nbctl --bare --columns up list Logical_switch_p | grep true |wc -l |grep 2 ovn-nbctl --wait=hv sync # sleep_sb kill -STOP $(ps -ef | grep ovnsb_db.pid |grep -v grep |awk '{ print $2 }') sleep 5 echo "Interface removal" # remove_ovs_interfaces ovs-vsctl --no-wait -- del-port vif1 ovs-vsctl --no-wait -- del-port vif2 sleep 5 # ensure_controller_run # We want to make sure controller could run at least one full loop. # We can't use wait=hv as sb might be sleeping. ovn-appctl -t ovn-controller debug/status |grep running # sleep_controller ovn-appctl debug/pause sleep 5 ovn-appctl -t ovn-controller debug/status |grep paused # wake_up_sb kill -CONT $(ps -ef | grep ovnsb_db.pid |grep -v grep |awk '{ print $2 }') # wake_up_controller ovn-appctl debug/resume sleep 5 # check_ovn_uninstalled # check_port_down lsp1 ovn-sbctl get Port_Binding lsp1 up |grep false # check_port_down lsp2 ovn-sbctl get Port_Binding lsp2 up |grep false # <<<<<< in non fixed version, these ports are not down >>>>>> # check_ports_unbound, should be 0 ovn-sbctl --bare --columns chassis list Port_Bind |grep '-' | wc -l |grep 0 # <<<<<< in non fixed version, these ports remained bound >>>>>> # add_ovs_interfaces again # add_ovs_interfaces ovs-vsctl --no-wait -- add-port br-int vif1 -- set Interface vif1 external_ids:iface-id=lsp1 -- set Interface vif1 type=internal ovs-vsctl --no-wait -- add-port br-int vif2 -- set Interface vif2 external_ids:iface-id=lsp2 -- set Interface vif2 type=internal ovn-nbctl --wait=hv sync echo "Interface removal done and added back" # sleep_sb kill -STOP $(ps -ef | grep ovnsb_db.pid |grep -v grep |awk '{ print $2 }') sleep 2 # remove_ovs_interfaces ovs-vsctl --no-wait -- del-port vif1 ovs-vsctl --no-wait -- del-port vif2 # ensure_controller_run ovn-appctl -t ovn-controller debug/status |grep running # sleep_controller ovn-appctl debug/pause sleep 5 # add_ovs_interfaces back again # add_ovs_interfaces ovs-vsctl --no-wait -- add-port br-int vif1 -- set Interface vif1 external_ids:iface-id=lsp1 -- set Interface vif1 type=internal ovs-vsctl --no-wait -- add-port br-int vif2 -- set Interface vif2 external_ids:iface-id=lsp2 -- set Interface vif2 type=internal # wake_up_sb kill -CONT $(ps -ef | grep ovnsb_db.pid |grep -v grep |awk '{ print $2 }') # wake_up_controller ovn-appctl debug/resume sleep 5 # check_ovn_installed vif1 # check_ovn_installed vif2 # check_ports_up ovn-sbctl get Port_Binding lsp1 up |grep true ovn-sbctl get Port_Binding lsp2 up |grep true # check_ports_bound should be 2 ovn-sbctl --bare --columns chassis list Port_Bind |grep '-' | wc -l |grep 2
As per the reproducer given in Comment 6, verified on: ovn22.09-22.09.0-91.el8fdp.x86_64 openvswitch-selinux-extra-policy-1.0-31.el8fdp.noarch openvswitch2.17-2.17.0-106.el8fdp.x86_64