+++ This bug was initially created as a clone of Bug #2076604 +++

Description of problem:

Originally detected in the upstream OpenStack job https://bugs.launchpad.net/tripleo/+bug/1968732.

This is seen after the ovn-2021 update[1] in CentOS Stream NFV SIG (ovn-2021-21.06.0-17.el9s --> ovn-2021-21.12.0-11.el9s).

When the issue happens (SSH fails as the VM cannot connect to the metadata proxy at 169.254.169.254:80), flows for the metadata port (IP 10.100.0.2) are missing. After running a recompute (ovs-appctl -t /var/run/ovn/ovn-controller.2.ctl recompute) the flows appear and everything works fine.

[root@node-0002427663 /]# ovs-ofctl show br-int
OFPT_FEATURES_REPLY (xid=0x2): dpid:0000ea1fd3f2be98
n_tables:254, n_buffers:0
capabilities: FLOW_STATS TABLE_STATS PORT_STATS QUEUE_STATS ARP_MATCH_IP
actions: output enqueue set_vlan_vid set_vlan_pcp strip_vlan mod_dl_src mod_dl_dst mod_nw_src mod_nw_dst mod_nw_tos mod_tp_src mod_tp_dst
 1(patch-br-int-to): addr:32:44:73:93:c8:40
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 46(tapef1645c0-13): addr:fe:16:3e:a7:0e:10
     config:     0
     state:      0
     current:    10MB-FD COPPER
     speed: 10 Mbps now, 0 Mbps max
 47(tapa3803bbf-b0): addr:ea:fe:c4:d2:ab:b8
     config:     0
     state:      0
     current:    10GB-FD COPPER
     speed: 10000 Mbps now, 0 Mbps max
 48(tap4987daed-f3): addr:fe:16:3e:e2:c5:59
     config:     0
     state:      0
     current:    10MB-FD COPPER
     speed: 10 Mbps now, 0 Mbps max
 49(tap96ca70a5-6d): addr:fe:16:3e:29:c1:e9
     config:     0
     state:      0
     current:    10MB-FD COPPER
     speed: 10 Mbps now, 0 Mbps max
 50(tape6d9123e-6f): addr:fe:16:3e:9a:67:86
     config:     0
     state:      0
     current:    10MB-FD COPPER
     speed: 10 Mbps now, 0 Mbps max
 51(tapd59710fd-ed): addr:fe:16:3e:a0:2c:2e
     config:     0
     state:      0
     current:    10MB-FD COPPER
     speed: 10 Mbps now, 0 Mbps max
 52(tapfe28971b-5c): addr:fe:16:3e:fa:ee:21
     config:     0
     state:      0
     current:    10MB-FD COPPER
     speed: 10 Mbps now, 0 Mbps max
 71(tap67e2e6ae-0e): addr:fe:16:3e:8a:77:69
     config:     0
     state:      0
     current:    10MB-FD COPPER
     speed: 10 Mbps now, 0 Mbps max
 72(tapd00f18e1-90): addr:42:f4:52:46:51:da
     config:     0
     state:      0
     current:    10GB-FD COPPER
     speed: 10000 Mbps now, 0 Mbps max
 LOCAL(br-int): addr:ea:1f:d3:f2:be:98
     config:     PORT_DOWN
     state:      LINK_DOWN
     speed: 0 Mbps now, 0 Mbps max
OFPT_GET_CONFIG_REPLY (xid=0x4): frags=normal miss_send_len=0

Metadata port ==> 72(tapd00f18e1-90): addr:42:f4:52:46:51:da

[root@node-0002427663 /]# ovs-ofctl dump-flows br-int|grep output:
 cookie=0xf91d47f4, duration=12983.494s, table=65, n_packets=2043, n_bytes=207683, idle_age=155, priority=100,reg15=0x1,metadata=0x2 actions=output:1
 cookie=0xe2f863bd, duration=5577.208s, table=65, n_packets=68, n_bytes=7626, idle_age=5475, priority=100,reg15=0x3,metadata=0x4 actions=output:46
 cookie=0xa397bc61, duration=5576.884s, table=65, n_packets=583, n_bytes=50253, idle_age=5530, priority=100,reg15=0x1,metadata=0x4 actions=output:47
 cookie=0x2124f02e, duration=5575.984s, table=65, n_packets=67, n_bytes=7500, idle_age=5479, priority=100,reg15=0x4,metadata=0x4 actions=output:48
 cookie=0x42b7518f, duration=5575.761s, table=65, n_packets=64, n_bytes=7374, idle_age=5487, priority=100,reg15=0x5,metadata=0x4 actions=output:49
 cookie=0xb2cef043, duration=5575.489s, table=65, n_packets=63, n_bytes=7332, idle_age=5477, priority=100,reg15=0x6,metadata=0x4 actions=output:50
 cookie=0x72093ac3, duration=5575.131s, table=65, n_packets=61, n_bytes=7248, idle_age=5471, priority=100,reg15=0x7,metadata=0x4 actions=output:51
 cookie=0x1cddeec2, duration=5574.421s, table=65, n_packets=58, n_bytes=7098, idle_age=5469, priority=100,reg15=0x8,metadata=0x4 actions=output:52
 cookie=0xf589365e, duration=946.264s, table=65, n_packets=62, n_bytes=4544, idle_age=15, priority=100,reg15=0x3,metadata=0x8 actions=output:71

# Flow missing ^ for metadata port 72

[root@node-0002427663 /]# ovs-appctl -t /var/run/ovn/ovn-controller.2.ctl recompute

[root@node-0002427663 /]# ovs-ofctl dump-flows br-int|grep output:
 cookie=0xf91d47f4, duration=13472.388s, table=65, n_packets=2056, n_bytes=208745, idle_age=371, priority=100,reg15=0x1,metadata=0x2 actions=output:1
 cookie=0xe2f863bd, duration=6066.102s, table=65, n_packets=68, n_bytes=7626, idle_age=5964, priority=100,reg15=0x3,metadata=0x4 actions=output:46
 cookie=0xa397bc61, duration=6065.778s, table=65, n_packets=583, n_bytes=50253, idle_age=6019, priority=100,reg15=0x1,metadata=0x4 actions=output:47
 cookie=0x2124f02e, duration=6064.878s, table=65, n_packets=67, n_bytes=7500, idle_age=5968, priority=100,reg15=0x4,metadata=0x4 actions=output:48
 cookie=0x42b7518f, duration=6064.655s, table=65, n_packets=64, n_bytes=7374, idle_age=5976, priority=100,reg15=0x5,metadata=0x4 actions=output:49
 cookie=0xb2cef043, duration=6064.383s, table=65, n_packets=63, n_bytes=7332, idle_age=5966, priority=100,reg15=0x6,metadata=0x4 actions=output:50
 cookie=0x72093ac3, duration=6064.025s, table=65, n_packets=61, n_bytes=7248, idle_age=5959, priority=100,reg15=0x7,metadata=0x4 actions=output:51
 cookie=0x1cddeec2, duration=6063.315s, table=65, n_packets=58, n_bytes=7098, idle_age=5958, priority=100,reg15=0x8,metadata=0x4 actions=output:52
 cookie=0xf589365e, duration=1435.158s, table=65, n_packets=66, n_bytes=4768, idle_age=366, priority=100,reg15=0x3,metadata=0x8 actions=output:71
 cookie=0x6f326bd2, duration=52.307s, table=65, n_packets=0, n_bytes=0, idle_age=52, priority=100,reg15=0x1,metadata=0x8 actions=output:72

# After recompute the flow for port 72 appears ^.

The following additional flows appear after running recompute:

 cookie=0x6f326bd2, duration=30.050s, table=0, n_packets=0, n_bytes=0, idle_age=30, priority=100,in_port=72 actions=load:0x1b->NXM_NX_REG13[],load:0x18->NXM_NX_REG11[],load:0x19->NXM_NX_REG12[],load:0x8->OXM_OF_METADATA[],load:0x1->NXM_NX_REG14[],load:0x1->NXM_NX_REG10[10],resubmit(,8)
 cookie=0x6f326bd2, duration=30.053s, table=37, n_packets=0, n_bytes=0, idle_age=30, priority=150,reg14=0x1,metadata=0x8 actions=resubmit(,38)
 cookie=0x6f326bd2, duration=30.053s, table=38, n_packets=0, n_bytes=0, idle_age=30, priority=100,reg15=0x1,metadata=0x8 actions=load:0x1b->NXM_NX_REG13[],load:0x18->NXM_NX_REG11[],load:0x19->NXM_NX_REG12[],resubmit(,39)
 cookie=0x6f326bd2, duration=30.053s, table=39, n_packets=0, n_bytes=0, idle_age=30, priority=100,reg10=0/0x1,reg14=0x1,reg15=0x1,metadata=0x8 actions=drop
 cookie=0x6f326bd2, duration=30.054s, table=64, n_packets=0, n_bytes=0, idle_age=30, priority=100,reg10=0x1/0x1,reg15=0x1,metadata=0x8 actions=push:NXM_OF_IN_PORT[],load:0xffff->NXM_OF_IN_PORT[],resubmit(,65),pop:NXM_OF_IN_PORT[]
 cookie=0x6f326bd2, duration=30.054s, table=65, n_packets=0, n_bytes=0, idle_age=30, priority=100,reg15=0x1,metadata=0x8 actions=output:72

[1] https://review.rdoproject.org/r/c/nfvinfo/+/40817

Version-Release number of selected component (if applicable):
ovn-2021-21.12.0-11.el9s

How reproducible:
Reproduces randomly; one of the tempest tests fails.

Steps to Reproduce:
1. OpenStack Wallaby OVN-based multinode deployment
2. Run the following until it fails:
   sudo tempest run --regex tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_update_router_admin_state

Actual results:
The test fails while performing SSH.
Expected results:
The test should pass.

Additional info:
This is a regression caused by ovn-2021-21.12.0-11.el9s.

Attaching the following in a tar archive:
- ovnnb_db before recompute
- ovnsb_db before recompute
- ovs db before and after recompute
- Output of "ovs-ofctl dump-flows br-int" before and after running recompute

--- Additional comment from Yatin Karel on 2022-04-19 13:46:39 UTC ---

Similar failures are seen in the RHEL9 OSP 17.0 job:
https://sf.hosted.upshift.rdu2.redhat.com/logs/58/398758/7/check/periodic-tripleo-ci-rhel-9-containers-multinode-rhos-17/3a6700d/logs/undercloud/var/log/tempest/stestr_results.html
OVN version: 21.12.0-42.el9fdp
OVS version: 2.15.0-51.el9fdp

Also to note, in a similar job[1] running on RHEL 8.4 the issue is not seen; not sure if that's just luck, as the issue is random, or something specific to RHEL/CentOS Stream 9.
OVN version: 21.12.0-45.el8fdp
OVS version: 2.15.0-93.el8fdp

On some previous runs from 7th April, success was with ovn-2021-21.12.0-39.el8fdp and openvswitch2.15-2.15.0-91.el8fdp. For older runs, logs are not available.

[1] https://sf.hosted.upshift.rdu2.redhat.com/zuul/t/tripleo-ci-internal/builds?job_name=periodic-tripleo-ci-rhel-8-containers-multinode-rhos-17&skip=0

--- Additional comment from Mark Michelson on 2022-04-22 15:06:11 UTC ---

Marking ovn-triage-complete since the issue is well-defined and the relevant information has been attached.

--- Additional comment from Yatin Karel on 2022-04-28 10:43:28 UTC ---

I tried to debug it further; the only thing I found is that when the issue happens, en_lflow_output_run[1] is not getting called, and en_pflow_output_run[2], which calls physical_run[3] where the flows get added, is not called either. I didn't find what calls en_lflow_output_run and en_pflow_output_run; maybe some events call those and those events are somehow missed? And then at the time of the VM delete both of these are getting called. Also came across a patch[4], not sure if it's related to this regression.

[1] https://github.com/ovn-org/ovn/blob/v21.12.0/controller/ovn-controller.c#L2345
[2] https://github.com/ovn-org/ovn/blob/v21.12.0/controller/ovn-controller.c#L2802
[3] https://github.com/ovn-org/ovn/blob/v21.12.0/controller/physical.c#L1649
[4] https://github.com/ovn-org/ovn/commit/3ae8470edc648b7401433a22a9f15053cc7e666d

--- Additional comment from Ales Musil on 2022-05-04 06:03:39 UTC ---

(In reply to Yatin Karel from comment #3)
> I tried to debug it further only thing i got is when the issue happens
> en_lflow_output_run[1] is not getting called and also en_pflow_output_run[2]
> not called which calls physical_run[3] where the flows are getting added. I
> didn't find what calls en_lflow_output_run and en_pflow_output_run, may be
> some events call those and those events are somehow missed? And then at
> time of vm delete both of these are getting called. Also came across a
> patch[4], not sure if that's related to this regression.
>
> [1] https://github.com/ovn-org/ovn/blob/v21.12.0/controller/ovn-controller.c#L2345
> [2] https://github.com/ovn-org/ovn/blob/v21.12.0/controller/ovn-controller.c#L2802
> [3] https://github.com/ovn-org/ovn/blob/v21.12.0/controller/physical.c#L1649
> [4] https://github.com/ovn-org/ovn/commit/3ae8470edc648b7401433a22a9f15053cc7e666d

Hi,

en_lflow_output_run and en_pflow_output_run run only during a full recompute, so the fact that they are not called is correct.
The function responsible for those flows is consider_port_binding[0], which is called during I-P from two places: pflow_output_runtime_data_handler[1] and pflow_output_sb_port_binding_handler[2]. During the investigation I noticed that there is a missing record in table 37 that should be added during the I-P run; however, consider_port_binding was called on all occasions in my test setup.

[0] https://github.com/ovn-org/ovn/blob/v21.12.0/controller/physical.c#L889
[1] https://github.com/ovn-org/ovn/blob/v21.12.0/controller/ovn-controller.c#L2876
[2] https://github.com/ovn-org/ovn/blob/v21.12.0/controller/ovn-controller.c#L2826

--- Additional comment from Yatin Karel on 2022-05-18 12:05:00 UTC ---

Troubleshot it further and found that the issue happens at resource cleanup time, and once the issue occurs the next run of the test fails.

# In good cases the metadata port is deleted soon (within 1-2 seconds) after the VM port is deleted.
# In bad cases it is deleted after several seconds (more than 10 seconds as I noticed), and before it gets removed the network is deleted. From the neutron perspective this happens because ovn-metadata, when it is about to remove the metadata port from the bridge, receives a PortBindingChassisDeletedEvent, starts processing it, and only then cleans up the metadata port.

Was able to reproduce the issue with the following OVN-only commands:

# Configure OVS
ovs-vsctl set open . external_ids:system-id=hv1 external_ids:ovn-remote=tcp:$IP:6642 external_ids:ovn-encap-type=geneve external_ids:ovn-encap-ip=$IP

# ovn-nbctl --version
ovn-nbctl 21.12.1
Open vSwitch Library 2.16.90
DB Schema 5.34.1

# Create script
ovn-nbctl ls-add ls1
ovn-nbctl lsp-add ls1 ls1p1
ovn-nbctl lsp-set-addresses ls1p1 "00:00:00:01:01:11 192.168.1.11"
ovn-nbctl lsp-add ls1 ls1lp
ovn-nbctl lsp-set-type ls1lp localport
ovn-nbctl lsp-set-addresses ls1lp "00:00:00:01:01:02 192.168.1.2"
ovn-nbctl lr-add lr1
ovn-nbctl lrp-add lr1 lr1-ls1 00:00:00:00:00:01 192.168.1.154/24 2001::a/64
ovn-nbctl lsp-add ls1 ls1-lr1
ovn-nbctl lsp-set-addresses ls1-lr1 "00:00:00:00:00:01"
ovn-nbctl lsp-set-type ls1-lr1 router
ovn-nbctl lsp-set-options ls1-lr1 router-port=lr1-ls1
ovn-nbctl lrp-add lr1 lr1-pub 00:00:00:00:00:02 172.17.1.154/24 7011::a/64
ovn-nbctl lrp-set-gateway-chassis lr1-pub hv1
ovn-nbctl lr-route-add lr1 0.0.0.0/0 172.17.1.100 lr1-pub
ovn-nbctl lr-route-add lr1 ::/0 7011::100 lr1-pub
ovn-nbctl ls-add pub
ovn-nbctl lsp-add pub pub-lr1
ovn-nbctl lsp-set-type pub-lr1 router
ovn-nbctl lsp-set-addresses pub-lr1 router
ovn-nbctl lsp-set-options pub-lr1 router-port=lr1-pub
ovn-nbctl lsp-add pub ln0
ovn-nbctl lsp-set-type ln0 localnet
ovn-nbctl lsp-set-options ln0 network_name=phys
ovn-nbctl lsp-set-addresses ln0 unknown
ovs-vsctl add-port br-int ls1p1 -- set interface ls1p1 type=internal external_ids:iface-id=ls1p1
ovs-vsctl add-port br-int ls1lp -- set interface ls1lp type=internal external_ids:iface-id=ls1lp

# After running the Create script, the commands below can be used to check that 5/6 flows exist for the localport
COOKIE=$(ovn-sbctl find port_binding logical_port=ls1lp|grep uuid|cut -d: -f2| cut -c1-9)
ovs-ofctl dump-flows br-int|grep $COOKIE

# Cleanup script
ovs-vsctl del-port br-int ls1p1
ovn-nbctl lrp-del-gateway-chassis lr1-pub hv1
#ovs-vsctl del-port br-int ls1lp
ovn-nbctl lrp-del lr1-pub
ovn-nbctl lrp-del lr1-ls1
ovn-nbctl lsp-del pub-lr1
ovn-nbctl lsp-del ls1-lr1
ovn-nbctl lsp-del ln0
ovn-nbctl ls-del pub
ovn-nbctl lsp-del ls1lp
ovn-nbctl ls-del ls1
ovn-nbctl lr-del lr1
ovs-vsctl del-port br-int ls1lp

# After running the Cleanup script, check with the command below:
# 1 flow (table=37 priority=150) is still left when the issue happens; without the issue no flow should remain
ovs-ofctl dump-flows br-int|grep $COOKIE

NOTES:
- The issue does not reproduce with the previous version, i.e. ovn-2021-21.06.0-17, using the above scripts.
- With ovn-2021-21.06.0-17, flows are cleaned when ls1 is deleted (even if ls1lp is not deleted from the br-int bridge), as is happening with ovn-2021-21.12.
- If the localport is deleted earlier (uncomment "ovs-vsctl del-port br-int ls1lp" in the Cleanup script), the issue does not reproduce.
- With ovn-2021-21.12.0-11, when the issue happens at cleanup time, the following flow is not cleaned:
  cookie=0x9e4428b3, duration=193.727s, table=37, n_packets=0, n_bytes=0, idle_age=193, priority=150,reg14=0x2,metadata=0x1 actions=resubmit(,38)
- And if the Create script is run again, this extra flow is cleaned but no new flow is added for the localport, and thus the metadata issue is seen in the tests.
- Also noticed that if "ls-del pub" is done after "lsp-del ls1lp", the flows do get cleaned; don't know if that's related. But in our tempest tests we do not touch the public switch/network during the test run.
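Since the race only triggers sometimes, looping the Create/Cleanup steps makes it easier to catch. Below is a rough sketch, assuming the Create and Cleanup command blocks above are saved as create.sh and cleanup.sh (hypothetical file names) and the "Configure OVS" step has already been done; it is not part of the original report:

# Hypothetical reproduction loop around the scripts above.
for i in $(seq 1 50); do
    sh create.sh
    sleep 2
    # Same cookie extraction as above; the unquoted $COOKIE drops the leading space.
    COOKIE=$(ovn-sbctl find port_binding logical_port=ls1lp|grep uuid|cut -d: -f2| cut -c1-9)
    sh cleanup.sh
    sleep 2
    # A leftover table=37 priority=150 flow for this cookie means the bug was hit.
    if ovs-ofctl dump-flows br-int | grep $COOKIE | grep -q "table=37"; then
        echo "iteration $i: stale flow left behind"
        ovs-ofctl dump-flows br-int | grep $COOKIE
        break
    fi
done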
Tested with the patches from the OVN bug ([1][2]) applied in [3], and the issue no longer reproduced. Now we need to wait for these to be available in ovn-2021-21.12 for osp-17.0.

[1] https://patchwork.ozlabs.org/project/ovn/patch/20220525065155.293205-2-amusil@redhat.com/
[2] https://patchwork.ozlabs.org/project/ovn/patch/20220527073040.519706-1-amusil@redhat.com/
[3] https://review.rdoproject.org/r/c/testproject/+/41999
The test tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_update_router_admin_state was run 1000 times and all runs passed. The OVN command steps were also followed; after running the cleanup script, the flow was cleaned up.

Version: RHOS-17.0-RHEL-9-20220628.n.1
ovn22.03-22.03.0-52.el9fdp.x86_64

[heat-admin@controller-1 ~]$ sudo podman exec ovn_controller rpm -qa | grep -E 'ovn|openvswitch'
openvswitch-selinux-extra-policy-1.0-31.el9fdp.noarch
ovn22.03-22.03.0-52.el9fdp.x86_64
openvswitch2.17-2.17.0-18.el9fdp.x86_64
python3-openvswitch2.17-2.17.0-18.el9fdp.x86_64
openstack-network-scripts-openvswitch2.17-10.11.1-3.el9ost.x86_64
rhosp-network-scripts-openvswitch-2.17-5.el9ost.noarch
rhosp-openvswitch-2.17-5.el9ost.noarch
python3-rhosp-openvswitch-2.17-5.el9ost.noarch
rhosp-ovn-22.03-5.el9ost.noarch
ovn22.03-host-22.03.0-52.el9fdp.x86_64
rhosp-ovn-host-22.03-5.el9ost.noarch
[heat-admin@controller-1 ~]$

[heat-admin@controller-0 ~]$ ovn-nbctl --version
ovn-nbctl 22.03.2
Open vSwitch Library 2.17.90
DB Schema 6.1.0
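For reference, a repeated run like the one described above can be driven with a simple loop; this is only a sketch of how such a verification might be scripted, not the exact wrapper used for the 1000 runs:

# Hypothetical wrapper: repeat the scenario test and stop at the first failure.
for i in $(seq 1 1000); do
    echo "=== run $i ==="
    sudo tempest run --regex tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_update_router_admin_state \
        || { echo "run $i failed"; break; }
done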
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 17.0 (Wallaby)), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2022:6543