Bug 2150905

Summary: ovn-controller thinks SB is read-only when it's not
Product: Red Hat Enterprise Linux Fast Datapath Reporter: Dan Williams <dcbw>
Component: ovn22.09Assignee: xsimonar
Status: VERIFIED --- QA Contact: Ehsan Elahi <eelahi>
Severity: unspecified Docs Contact:
Priority: high    
Version: FDP 22.LCC: ctrautma, eelahi, jiji, jishi, mmichels, xsimonar
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: ovn22.09-22.09.0-91.el8fdp Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Dan Williams 2022-12-05 15:38:27 UTC
Seen in https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/1239/pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-local-gateway/1599628212420218880

ovnkube-node creates the port, sets iface-id/iface-id-ver, deletes it, then re-creates (because of how the "stateful set" test in Kubernetes works...)

I1205 05:54:07.239975    3586 cni.go:227] [e2e-statefulset-4735/ss-0 2cc0e91bcfc5520aa97a93468460b8feda7b261592b8ea7101a9fece29aee316] ADD starting CNI request [e2e-statefulset-4735/ss-0 2cc0e91bcfc5520aa97a93468460b8feda7b261592b8ea7101a9fece29aee316]
I1205 05:54:07.241815    3586 helper_linux.go:334] ConfigureOVS: namespace: e2e-statefulset-4735, podName: ss-0, SandboxID: "2cc0e91bcfc5520aa97a93468460b8feda7b261592b8ea7101a9fece29aee316", UID: "2d8210e0-5afb-44ac-a92e-dec06573d249", MAC: 0a:58:0a:81:02:6e, IPs: [10.129.2.110/23]
I1205 05:54:07.334655    3586 cni.go:248] [e2e-statefulset-4735/ss-0 2cc0e91bcfc5520aa97a93468460b8feda7b261592b8ea7101a9fece29aee316] ADD finished CNI request [e2e-statefulset-4735/ss-0 2cc0e91bcfc5520aa97a93468460b8feda7b261592b8ea7101a9fece29aee316], result "{\"interfaces\":[{\"name\":\"2cc0e91bcfc5520\",\"mac\":\"c2:38:34:74:11:cd\"},{\"name\":\"eth0\",\"mac\":\"0a:58:0a:81:02:6e\",\"sandbox\":\"/var/run/netns/106216d2-388c-4503-8150-4129b7f98adf\"}],\"ips\":[{\"interface\":1,\"address\":\"10.129.2.110/23\",\"gateway\":\"10.129.2.1\"}],\"dns\":{}}", err <nil>
I1205 05:54:07.576573    3586 cni.go:227] [e2e-statefulset-4735/ss-0 2cc0e91bcfc5520aa97a93468460b8feda7b261592b8ea7101a9fece29aee316] DEL starting CNI request [e2e-statefulset-4735/ss-0 2cc0e91bcfc5520aa97a93468460b8feda7b261592b8ea7101a9fece29aee316]
I1205 05:54:07.748780    3586 cni.go:248] [e2e-statefulset-4735/ss-0 2cc0e91bcfc5520aa97a93468460b8feda7b261592b8ea7101a9fece29aee316] DEL finished CNI request [e2e-statefulset-4735/ss-0 2cc0e91bcfc5520aa97a93468460b8feda7b261592b8ea7101a9fece29aee316], result "{\"dns\":{}}", err <nil>
I1205 05:54:08.521666    3586 cni.go:227] [e2e-statefulset-4735/ss-0 471d464709d4cfeabe5d9374b78dac5f01385572e4fe89112ff0fb514ebb6590] ADD starting CNI request [e2e-statefulset-4735/ss-0 471d464709d4cfeabe5d9374b78dac5f01385572e4fe89112ff0fb514ebb6590]
I1205 05:54:08.523503    3586 helper_linux.go:334] ConfigureOVS: namespace: e2e-statefulset-4735, podName: ss-0, SandboxID: "471d464709d4cfeabe5d9374b78dac5f01385572e4fe89112ff0fb514ebb6590", UID: "2d8210e0-5afb-44ac-a92e-dec06573d249", MAC: 0a:58:0a:81:02:6e, IPs: [10.129.2.110/23]
<2 minutes, then deletes port after timeout>
W1205 05:56:08.622847    3586 helper_linux.go:390] [e2e-statefulset-4735/ss-0 471d464709d4cfeabe5d9374b78dac5f01385572e4fe89112ff0fb514ebb6590] pod uid 2d8210e0-5afb-44ac-a92e-dec06573d249: timed out waiting for OVS port binding (ovn-installed) for 0a:58:0a:81:02:6e [10.129.2.110/23]


ovn-controller doesn't see the second port binding request at 05:54:08 because it seems to think the SB is read-only:

2022-12-05T05:54:07.276Z|00485|binding|INFO|Claiming lport e2e-statefulset-4735_ss-0 for this chassis.
2022-12-05T05:54:07.276Z|00486|binding|INFO|e2e-statefulset-4735_ss-0: Claiming 0a:58:0a:81:02:6e 10.129.2.110
2022-12-05T05:54:07.304Z|00487|binding|INFO|Setting lport e2e-statefulset-4735_ss-0 ovn-installed in OVS
2022-12-05T05:54:07.340Z|00488|binding|INFO|Setting lport e2e-statefulset-4735_ss-0 up in Southbound
2022-12-05T05:54:07.588Z|00489|binding|INFO|Releasing lport e2e-statefulset-4735_ss-0 from this chassis (sb_readonly=1)

<two minute gap, then ovnkube-node deletes the port>

2022-12-05T05:56:08.633Z|00787|binding|INFO|Releasing lport e2e-statefulset-4735_ss-0 from this chassis (sb_readonly=0)
2022-12-05T05:56:08.633Z|00788|if_status|WARN|Trying to release unknown interface e2e-statefulset-4735_ss-0
2022-12-05T05:56:09.506Z|00790|binding|INFO|Claiming lport e2e-statefulset-4735_ss-0 for this chassis.
2022-12-05T05:56:09.506Z|00791|binding|INFO|e2e-statefulset-4735_ss-0: Claiming 0a:58:0a:81:02:6e 10.129.2.110
2022-12-05T05:56:09.524Z|00792|binding|INFO|Setting lport e2e-statefulset-4735_ss-0 ovn-installed in OVS
2022-12-05T05:56:09.527Z|00793|binding|INFO|Setting lport e2e-statefulset-4735_ss-0 up in Southbound
2022-12-05T05:56:12.676Z|00818|binding|INFO|Releasing lport e2e-statefulset-4735_ss-0 from this chassis (sb_readonly=0)

The SBs aren't busy at all:

2022-12-05T05:54:04.197Z|00240|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-12-05T05:54:07.282Z|00241|raft|INFO|Dropped 1 log messages in last 6 seconds (most recently, 6 seconds ago) due to excessive rate
2022-12-05T05:54:07.282Z|00242|raft|INFO|current entry eid 3bae01f3-de19-4c0c-b987-5bea4acbcf0e does not match prerequisite e1a5a3f8-2a7e-49d2-a145-cbaa47abe82b in execute_command_request
2022-12-05T05:54:14.268Z|00243|ovsdb_server|INFO|memory trimming after compaction enabled.

Other RAFT member log messages are just:

2022-12-05T05:53:55.521Z|00165|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-12-05T05:54:05.527Z|00166|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-12-05T05:54:15.547Z|00167|ovsdb_server|INFO|memory trimming after compaction enabled.

Comment 1 Dan Williams 2022-12-05 15:45:27 UTC
Is ovn-controller maybe not claiming the port binding request from ovnkube-node at 05:54:08 because of the sb_readonly=1 at that time?

Comment 2 xsimonar 2022-12-05 18:13:36 UTC
I think that the issue is with releasing the port when sb is read only.
In that case, ovn-controller might delete its internal binding for this port before sb becomes writable. 
When sb becomes writable, it's too late: the binding for this port is gone in ovn-controller, and ovn-controller never updates sb (it never removes chassis from port_binding in sb).
Then, next time this port is claimed, ovn-controller sees the port as already claimed, does not log the claim, and does not create the interface in if-status (causing next WARN when releasing)
I'll work on a fix.

Comment 3 Dan Williams 2022-12-15 02:23:25 UTC
Seeing this in a few more CI tests in OCP 4.13:

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/1449/pull-ci-openshift-ovn-kubernetes-master-e2e-gcp-ovn/1603172544557355008/artifacts/e2e-gcp-ovn/gather-extra/artifacts/pods/openshift-ovn-kubernetes_ovnkube-node-572nb_ovn-controller.log

for port e2e-statefulset-7013_ss2-0

2022-12-15T00:43:24.001Z|01585|binding|INFO|Claiming lport e2e-statefulset-7013_ss2-0 for this chassis.
2022-12-15T00:43:24.001Z|01586|binding|INFO|e2e-statefulset-7013_ss2-0: Claiming 0a:58:0a:83:01:68 10.131.1.104
2022-12-15T00:43:24.140Z|01587|binding|INFO|Setting lport e2e-statefulset-7013_ss2-0 ovn-installed in OVS
2022-12-15T00:43:24.140Z|01588|binding|INFO|Setting lport e2e-statefulset-7013_ss2-0 up in Southbound
2022-12-15T00:43:40.534Z|01659|binding|INFO|Mismatch iface-id-ver for lport e2e-statefulset-7013_ss2-0, expected 6e14a1af-8dd4-4656-ad2c-9b145386ed0d, found ddf315b4-06c9-459b-a84f-6fe75194c0c5
2022-12-15T00:43:40.534Z|01660|binding|INFO|Releasing lport e2e-statefulset-7013_ss2-0 from this chassis (sb_readonly=1)
2022-12-15T00:43:40.537Z|01661|binding|INFO|Removing lport e2e-statefulset-7013_ss2-0 ovn-installed in OVS
2022-12-15T00:45:41.819Z|01793|binding|INFO|Releasing lport e2e-statefulset-7013_ss2-0 from this chassis (sb_readonly=0)
2022-12-15T00:45:41.819Z|01794|if_status|WARN|Trying to release unknown interface e2e-statefulset-7013_ss2-0
2022-12-15T00:45:43.574Z|01795|binding|INFO|Claiming lport e2e-statefulset-7013_ss2-0 for this chassis.
2022-12-15T00:45:43.574Z|01796|binding|INFO|e2e-statefulset-7013_ss2-0: Claiming 0a:58:0a:83:01:77 10.131.1.119
2022-12-15T00:45:43.638Z|01797|binding|INFO|Setting lport e2e-statefulset-7013_ss2-0 ovn-installed in OVS
2022-12-15T00:45:43.638Z|01798|binding|INFO|Setting lport e2e-statefulset-7013_ss2-0 up in Southbound
2022-12-15T00:45:50.898Z|01800|binding|INFO|Mismatch iface-id-ver for lport e2e-statefulset-7013_ss2-0, expected 520daa0d-4ad2-4490-8d5e-fd167bc26a7d, found 6e14a1af-8dd4-4656-ad2c-9b145386ed0d
2022-12-15T00:45:51.633Z|01801|binding|INFO|Claiming lport e2e-statefulset-7013_ss2-0 for this chassis.
2022-12-15T00:45:51.633Z|01802|binding|INFO|e2e-statefulset-7013_ss2-0: Claiming 0a:58:0a:83:01:8f 10.131.1.143
2022-12-15T00:45:51.678Z|01803|binding|INFO|Setting lport e2e-statefulset-7013_ss2-0 ovn-installed in OVS
2022-12-15T00:45:51.678Z|01804|binding|INFO|Setting lport e2e-statefulset-7013_ss2-0 up in Southbound
2022-12-15T00:46:04.863Z|01806|binding|INFO|Releasing lport e2e-statefulset-7013_ss2-0 from this chassis (sb_readonly=0)

Comment 4 xsimonar 2023-01-04 08:34:09 UTC
Patch posted upstream for review: http://patchwork.ozlabs.org/project/ovn/patch/20230104083209.3880669-2-xsimonar@redhat.com/

Comment 5 OVN Bot 2023-05-09 04:08:06 UTC
ovn23.03 fast-datapath-rhel-8 clone created at https://bugzilla.redhat.com/show_bug.cgi?id=2196412
ovn23.03 fast-datapath-rhel-9 clone created at https://bugzilla.redhat.com/show_bug.cgi?id=2196413
ovn22.12 fast-datapath-rhel-8 clone created at https://bugzilla.redhat.com/show_bug.cgi?id=2196414
ovn22.12 fast-datapath-rhel-9 clone created at https://bugzilla.redhat.com/show_bug.cgi?id=2196415

Comment 6 Ehsan Elahi 2023-05-29 21:24:59 UTC
Reproduced on:
ovn22.12-22.12.0-51.el8fdp.x86_64
ovn22.12-central-22.12.0-51.el8fdp.x86_64
openvswitch-selinux-extra-policy-1.0-30.el8fdp.noarch
ovn22.12-host-22.12.0-51.el8fdp.x86_64
openvswitch2.17-2.17.0-95.el8fdp.x86_64

Here is the reproducer:

systemctl start ovn-northd
ovn-nbctl set-connection ptcp:6641
ovn-sbctl set-connection ptcp:6642
systemctl start openvswitch
ovs-vsctl set open . external_ids:system-id=hv1
ovs-vsctl set open . external_ids:ovn-remote=tcp:127.0.0.1:6642
ovs-vsctl set open . external_ids:ovn-encap-type=geneve
ovs-vsctl set open . external_ids:ovn-encap-ip=127.0.0.1
systemctl restart ovn-controller
 
# Add switch
ovn-nbctl ls-add ls1
ovn-nbctl set Logical_Switch ls1 other_config:subnet=10.1.0.0/16
ovn-nbctl --wait=hv sync
 
ovn-nbctl lsp-add ls1 lsp1
ovn-nbctl lsp-add ls1 lsp2

# add_ovs_interfaces
ovs-vsctl --no-wait -- add-port br-int vif1 -- set Interface vif1 external_ids:iface-id=lsp1 -- set Interface vif1 type=internal
ovs-vsctl --no-wait -- add-port br-int vif2 -- set Interface vif2 external_ids:iface-id=lsp2 -- set Interface vif2 type=internal

# wait_for_local_bindings
sleep 5
# check bindings, should be 2
ovn-appctl -t ovn-controller debug/dump-local-bindings | grep interface | wc -l |grep 2
# check how many ports are up, should be 2
ovn-nbctl  --bare --columns up list Logical_switch_p | grep true |wc -l |grep 2
ovn-nbctl --wait=hv sync
# sleep_sb
kill -STOP $(ps -ef | grep ovnsb_db.pid  |grep -v grep |awk '{ print $2 }')
sleep 5
echo "Interface removal"
# remove_ovs_interfaces
ovs-vsctl --no-wait -- del-port vif1
ovs-vsctl --no-wait -- del-port vif2
sleep 5
# ensure_controller_run
# We want to make sure controller could run at least one full loop.
# We can't use wait=hv as sb might be sleeping.
ovn-appctl -t ovn-controller debug/status |grep running

# sleep_controller
ovn-appctl debug/pause
sleep 5
ovn-appctl -t ovn-controller debug/status |grep paused

# wake_up_sb
kill -CONT $(ps -ef | grep ovnsb_db.pid  |grep -v grep |awk '{ print $2 }')

# wake_up_controller
ovn-appctl debug/resume
sleep 5
# check_ovn_uninstalled

# check_port_down lsp1
ovn-sbctl get Port_Binding lsp1 up |grep false
# check_port_down lsp2
ovn-sbctl get Port_Binding lsp2 up |grep false
# <<<<<< in non fixed version, these ports are not down >>>>>>
# check_ports_unbound, should be 0
ovn-sbctl --bare --columns chassis list Port_Bind |grep '-' | wc -l |grep 0
# <<<<<< in non fixed version, these ports remained bound >>>>>>

# add_ovs_interfaces again
# add_ovs_interfaces
ovs-vsctl --no-wait -- add-port br-int vif1 -- set Interface vif1 external_ids:iface-id=lsp1 -- set Interface vif1 type=internal
ovs-vsctl --no-wait -- add-port br-int vif2 -- set Interface vif2 external_ids:iface-id=lsp2 -- set Interface vif2 type=internal
ovn-nbctl --wait=hv sync
echo "Interface removal done and added back"
# sleep_sb
kill -STOP $(ps -ef | grep ovnsb_db.pid  |grep -v grep |awk '{ print $2 }')
sleep 2
# remove_ovs_interfaces
ovs-vsctl --no-wait -- del-port vif1
ovs-vsctl --no-wait -- del-port vif2

# ensure_controller_run
ovn-appctl -t ovn-controller debug/status |grep running

# sleep_controller
ovn-appctl debug/pause
sleep 5
# add_ovs_interfaces back again
# add_ovs_interfaces
ovs-vsctl --no-wait -- add-port br-int vif1 -- set Interface vif1 external_ids:iface-id=lsp1 -- set Interface vif1 type=internal
ovs-vsctl --no-wait -- add-port br-int vif2 -- set Interface vif2 external_ids:iface-id=lsp2 -- set Interface vif2 type=internal

# wake_up_sb
kill -CONT $(ps -ef | grep ovnsb_db.pid  |grep -v grep |awk '{ print $2 }')
# wake_up_controller
ovn-appctl debug/resume
sleep 5
# check_ovn_installed vif1
# check_ovn_installed vif2

# check_ports_up
ovn-sbctl get Port_Binding lsp1 up |grep true
ovn-sbctl get Port_Binding lsp2 up |grep true

# check_ports_bound should be 2
ovn-sbctl --bare --columns chassis list Port_Bind |grep '-' | wc -l |grep 2

Comment 11 Ehsan Elahi 2023-07-18 09:10:55 UTC
As per the reproducer given in Comment 6, verified on:

ovn22.09-22.09.0-91.el8fdp.x86_64
openvswitch-selinux-extra-policy-1.0-31.el8fdp.noarch
openvswitch2.17-2.17.0-106.el8fdp.x86_64