Bug 2088454 - Random tempest test failures due to missing flows for metadata port
Summary: Random tempest test failures due to missing flows for metadata port
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 17.0 (Wallaby)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ga
Target Release: 17.0
Assignee: Elvira
QA Contact: Fiorella Yanac
URL:
Whiteboard:
Depends On: 2076604
Blocks:
 
Reported: 2022-05-19 13:35 UTC by Yatin Karel
Modified: 2022-09-21 12:22 UTC
CC List: 9 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of: 2076604
Environment:
Last Closed: 2022-09-21 12:21:38 UTC
Target Upstream Version:
Embargoed:




Links:
Red Hat Issue Tracker OSP-15316 - 2022-05-19 13:54:12 UTC
Red Hat Product Errata RHEA-2022:6543 - 2022-09-21 12:22:06 UTC

Description Yatin Karel 2022-05-19 13:35:56 UTC
+++ This bug was initially created as a clone of Bug #2076604 +++

Description of problem:
Originally detected in the upstream OpenStack job https://bugs.launchpad.net/tripleo/+bug/1968732.
This is seen after the ovn-2021 update[1] in the CentOS Stream NFV SIG (ovn-2021-21.06.0-17.el9s --> ovn-2021-21.12.0-11.el9s).

When the issue happens (SSH fails because the VM cannot reach the metadata proxy at 169.254.169.254:80), the flows for the metadata port (IP 10.100.0.2) are missing. After running a recompute (ovs-appctl -t /var/run/ovn/ovn-controller.2.ctl recompute) the flows appear and everything works fine.
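A minimal sketch of those checks, assuming the ovn-controller socket name shown here (it varies per host) and the standard /openstack metadata path:

# From inside the affected VM: the metadata request times out while the flows are missing
curl --max-time 10 http://169.254.169.254/openstack

# On the compute node: force a full recompute (socket name varies per host)
ovs-appctl -t /var/run/ovn/ovn-controller.2.ctl recompute

# The output flow for the metadata port should now be present
ovs-ofctl dump-flows br-int | grep 'output:'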

[root@node-0002427663 /]# ovs-ofctl show br-int                           
OFPT_FEATURES_REPLY (xid=0x2): dpid:0000ea1fd3f2be98
n_tables:254, n_buffers:0
capabilities: FLOW_STATS TABLE_STATS PORT_STATS QUEUE_STATS ARP_MATCH_IP
actions: output enqueue set_vlan_vid set_vlan_pcp strip_vlan mod_dl_src mod_dl_dst mod_nw_src mod_nw_dst mod_nw_tos mod_tp_src mod_tp_dst
 1(patch-br-int-to): addr:32:44:73:93:c8:40
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 46(tapef1645c0-13): addr:fe:16:3e:a7:0e:10
     config:     0                                                       
     state:      0
     current:    10MB-FD COPPER
     speed: 10 Mbps now, 0 Mbps max
 47(tapa3803bbf-b0): addr:ea:fe:c4:d2:ab:b8
     config:     0                                                       
     state:      0
     current:    10GB-FD COPPER
     speed: 10000 Mbps now, 0 Mbps max
 48(tap4987daed-f3): addr:fe:16:3e:e2:c5:59
     config:     0                                                       
     state:      0        
     current:    10MB-FD COPPER
     speed: 10 Mbps now, 0 Mbps max
 49(tap96ca70a5-6d): addr:fe:16:3e:29:c1:e9
     config:     0
     state:      0
     current:    10MB-FD COPPER
     speed: 10 Mbps now, 0 Mbps max
 50(tape6d9123e-6f): addr:fe:16:3e:9a:67:86
     config:     0
     state:      0
     current:    10MB-FD COPPER
     speed: 10 Mbps now, 0 Mbps max
 51(tapd59710fd-ed): addr:fe:16:3e:a0:2c:2e
     config:     0
     state:      0
     current:    10MB-FD COPPER
     speed: 10 Mbps now, 0 Mbps max
 52(tapfe28971b-5c): addr:fe:16:3e:fa:ee:21
     config:     0
     state:      0
     current:    10MB-FD COPPER
     speed: 10 Mbps now, 0 Mbps max
 71(tap67e2e6ae-0e): addr:fe:16:3e:8a:77:69
     config:     0
     state:      0
     current:    10MB-FD COPPER
     speed: 10 Mbps now, 0 Mbps max
 72(tapd00f18e1-90): addr:42:f4:52:46:51:da
     config:     0
     state:      0
     current:    10GB-FD COPPER
     speed: 10000 Mbps now, 0 Mbps max
 LOCAL(br-int): addr:ea:1f:d3:f2:be:98
     config:     PORT_DOWN
     state:      LINK_DOWN
     speed: 0 Mbps now, 0 Mbps max
OFPT_GET_CONFIG_REPLY (xid=0x4): frags=normal miss_send_len=0

Metadata port ==> 72(tapd00f18e1-90): addr:42:f4:52:46:51:da

[root@node-0002427663 /]# ovs-ofctl dump-flows br-int|grep output:
 cookie=0xf91d47f4, duration=12983.494s, table=65, n_packets=2043, n_bytes=207683, idle_age=155, priority=100,reg15=0x1,metadata=0x2 actions=output:1
 cookie=0xe2f863bd, duration=5577.208s, table=65, n_packets=68, n_bytes=7626, idle_age=5475, priority=100,reg15=0x3,metadata=0x4 actions=output:46
 cookie=0xa397bc61, duration=5576.884s, table=65, n_packets=583, n_bytes=50253, idle_age=5530, priority=100,reg15=0x1,metadata=0x4 actions=output:47
 cookie=0x2124f02e, duration=5575.984s, table=65, n_packets=67, n_bytes=7500, idle_age=5479, priority=100,reg15=0x4,metadata=0x4 actions=output:48
 cookie=0x42b7518f, duration=5575.761s, table=65, n_packets=64, n_bytes=7374, idle_age=5487, priority=100,reg15=0x5,metadata=0x4 actions=output:49
 cookie=0xb2cef043, duration=5575.489s, table=65, n_packets=63, n_bytes=7332, idle_age=5477, priority=100,reg15=0x6,metadata=0x4 actions=output:50
 cookie=0x72093ac3, duration=5575.131s, table=65, n_packets=61, n_bytes=7248, idle_age=5471, priority=100,reg15=0x7,metadata=0x4 actions=output:51
 cookie=0x1cddeec2, duration=5574.421s, table=65, n_packets=58, n_bytes=7098, idle_age=5469, priority=100,reg15=0x8,metadata=0x4 actions=output:52
 cookie=0xf589365e, duration=946.264s, table=65, n_packets=62, n_bytes=4544, idle_age=15, priority=100,reg15=0x3,metadata=0x8 actions=output:71

# Flow for metadata port 72 is missing ^
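A quick way to confirm which OpenFlow port number the metadata tap maps to and whether its output flow exists (a sketch; tapd00f18e1-90 is the interface from the dump above):

# Resolve the OpenFlow port number of the metadata tap interface
OFPORT=$(ovs-vsctl get Interface tapd00f18e1-90 ofport)

# A healthy port has a table=65 flow ending in output:<ofport>; no output here means the flow is missing
ovs-ofctl dump-flows br-int table=65 | grep "output:${OFPORT}$"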

[root@node-0002427663 /]# ovs-appctl -t /var/run/ovn/ovn-controller.2.ctl recompute
 
 
 [root@node-0002427663 /]# ovs-ofctl dump-flows br-int|grep output:
 cookie=0xf91d47f4, duration=13472.388s, table=65, n_packets=2056, n_bytes=208745, idle_age=371, priority=100,reg15=0x1,metadata=0x2 actions=output:1
 cookie=0xe2f863bd, duration=6066.102s, table=65, n_packets=68, n_bytes=7626, idle_age=5964, priority=100,reg15=0x3,metadata=0x4 actions=output:46
 cookie=0xa397bc61, duration=6065.778s, table=65, n_packets=583, n_bytes=50253, idle_age=6019, priority=100,reg15=0x1,metadata=0x4 actions=output:47
 cookie=0x2124f02e, duration=6064.878s, table=65, n_packets=67, n_bytes=7500, idle_age=5968, priority=100,reg15=0x4,metadata=0x4 actions=output:48
 cookie=0x42b7518f, duration=6064.655s, table=65, n_packets=64, n_bytes=7374, idle_age=5976, priority=100,reg15=0x5,metadata=0x4 actions=output:49
 cookie=0xb2cef043, duration=6064.383s, table=65, n_packets=63, n_bytes=7332, idle_age=5966, priority=100,reg15=0x6,metadata=0x4 actions=output:50
 cookie=0x72093ac3, duration=6064.025s, table=65, n_packets=61, n_bytes=7248, idle_age=5959, priority=100,reg15=0x7,metadata=0x4 actions=output:51
 cookie=0x1cddeec2, duration=6063.315s, table=65, n_packets=58, n_bytes=7098, idle_age=5958, priority=100,reg15=0x8,metadata=0x4 actions=output:52
 cookie=0xf589365e, duration=1435.158s, table=65, n_packets=66, n_bytes=4768, idle_age=366, priority=100,reg15=0x3,metadata=0x8 actions=output:71
 cookie=0x6f326bd2, duration=52.307s, table=65, n_packets=0, n_bytes=0, idle_age=52, priority=100,reg15=0x1,metadata=0x8 actions=output:72

# After recompute the flow for port 72 appears ^. The following additional flows appear after running recompute:

cookie=0x6f326bd2, duration=30.050s, table=0, n_packets=0, n_bytes=0, idle_age=30, priority=100,in_port=72 actions=load:0x1b->NXM_NX_REG13[],load:0x18->NXM_NX_REG11[],load:0x19->NXM_NX_REG12[],load:0x8->OXM_OF_METADATA[],load:0x1->NXM_NX_REG14[],load:0x1->NXM_NX_REG10[10],resubmit(,8)
cookie=0x6f326bd2, duration=30.053s, table=37, n_packets=0, n_bytes=0, idle_age=30, priority=150,reg14=0x1,metadata=0x8 actions=resubmit(,38)
cookie=0x6f326bd2, duration=30.053s, table=38, n_packets=0, n_bytes=0, idle_age=30, priority=100,reg15=0x1,metadata=0x8 actions=load:0x1b->NXM_NX_REG13[],load:0x18->NXM_NX_REG11[],load:0x19->NXM_NX_REG12[],resubmit(,39)
cookie=0x6f326bd2, duration=30.053s, table=39, n_packets=0, n_bytes=0, idle_age=30, priority=100,reg10=0/0x1,reg14=0x1,reg15=0x1,metadata=0x8 actions=drop
cookie=0x6f326bd2, duration=30.054s, table=64, n_packets=0, n_bytes=0, idle_age=30, priority=100,reg10=0x1/0x1,reg15=0x1,metadata=0x8 actions=push:NXM_OF_IN_PORT[],load:0xffff->NXM_OF_IN_PORT[],resubmit(,65),pop:NXM_OF_IN_PORT[]
cookie=0x6f326bd2, duration=30.054s, table=65, n_packets=0, n_bytes=0, idle_age=30, priority=100,reg15=0x1,metadata=0x8 actions=output:72
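One way to isolate exactly which flows a recompute adds is to diff a normalized dump taken before and after (a sketch; the sed strips the volatile counters so only structural changes show up):

ovs-ofctl dump-flows br-int | sed 's/duration=[^,]*, //;s/idle_age=[^,]*, //;s/n_packets=[^,]*, //;s/n_bytes=[^,]*, //' | sort > /tmp/flows-before.txt
ovs-appctl -t /var/run/ovn/ovn-controller.2.ctl recompute
ovs-ofctl dump-flows br-int | sed 's/duration=[^,]*, //;s/idle_age=[^,]*, //;s/n_packets=[^,]*, //;s/n_bytes=[^,]*, //' | sort > /tmp/flows-after.txt
diff /tmp/flows-before.txt /tmp/flows-after.txt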


[1] https://review.rdoproject.org/r/c/nfvinfo/+/40817

Version-Release number of selected component (if applicable):
ovn-2021-21.12.0-11.el9s

How reproducible:
Reproduces randomly; when it does, one of the tempest tests fails.

Steps to Reproduce:
1. OpenStack Wallaby OVN-based multinode deployment
2. Run the following until it fails (see the loop sketched below): sudo tempest run --regex tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_update_router_admin_state
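Since the failure is intermittent, a loop of this shape can be used to re-run the test from step 2 until it fails (a sketch):

# Re-run the single tempest test until it returns a non-zero exit code
i=0
while sudo tempest run --regex tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_update_router_admin_state; do
    i=$((i + 1))
    echo "run $i passed"
done
echo "failed after $i passing runs"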


Actual results:
Test fails while performing SSH.

Expected results:
Test should pass

Additional info:
This is a regression introduced with ovn-2021-21.12.0-11.el9s.
Attaching the following in a tar archive:
- ovnnb_db before recompute
- ovnsb_db before recompute
- ovs db before and after recompute
- Output of "ovs-ofctl dump-flows br-int" before and after running recompute

--- Additional comment from Yatin Karel on 2022-04-19 13:46:39 UTC ---

A similar failure is seen in the RHEL9 OSP 17.0 job: https://sf.hosted.upshift.rdu2.redhat.com/logs/58/398758/7/check/periodic-tripleo-ci-rhel-9-containers-multinode-rhos-17/3a6700d/logs/undercloud/var/log/tempest/stestr_results.html

OVN version:- 21.12.0-42.el9fdp
OVS version:- 2.15.0-51.el9fdp

Also note that in a similar job[1] running on RHEL 8.4 the issue is not seen; not sure if that is just luck (the issue is random) or something specific to RHEL/CentOS Stream 9.
OVN version:- 21.12.0-45.el8fdp
OVS version:- 2.15.0-93.el8fdp
On some previous runs from 7 April, the passing combination was ovn-2021-21.12.0-39.el8fdp with openvswitch2.15-2.15.0-91.el8fdp. Logs are not available for older runs.


[1] https://sf.hosted.upshift.rdu2.redhat.com/zuul/t/tripleo-ci-internal/builds?job_name=periodic-tripleo-ci-rhel-8-containers-multinode-rhos-17&skip=0

--- Additional comment from Mark Michelson on 2022-04-22 15:06:11 UTC ---

Marking ovn-triage-complete since the issue is well-defined and the relevant information has been attached.

--- Additional comment from Yatin Karel on 2022-04-28 10:43:28 UTC ---

I tried to debug it further. The only thing I found is that when the issue happens, en_lflow_output_run[1] is not called, and en_pflow_output_run[2], which calls physical_run[3] where the flows get added, is not called either. I didn't find what calls en_lflow_output_run and en_pflow_output_run; maybe some events trigger them and those events are somehow missed? And then at VM delete time both of them do get called. Also came across a patch[4], not sure if that is related to this regression.


[1] https://github.com/ovn-org/ovn/blob/v21.12.0/controller/ovn-controller.c#L2345
[2] https://github.com/ovn-org/ovn/blob/v21.12.0/controller/ovn-controller.c#L2802
[3] https://github.com/ovn-org/ovn/blob/v21.12.0/controller/physical.c#L1649
[4] https://github.com/ovn-org/ovn/commit/3ae8470edc648b7401433a22a9f15053cc7e666d
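For anyone reproducing this, one way to watch what the incremental processing engine is doing is to bump ovn-controller's log level and grep its log (a sketch; the exact debug messages depend on the OVN version, and the socket/log paths are deployment-specific):

# Enable debug logging on the file target only, to avoid flooding the console
ovs-appctl -t /var/run/ovn/ovn-controller.2.ctl vlog/set file:dbg

# Reproduce the failure, then look for engine-node activity (log path varies by deployment)
grep -i 'flow_output' /var/log/ovn/ovn-controller.log | tail -n 50

# Restore the default level afterwards
ovs-appctl -t /var/run/ovn/ovn-controller.2.ctl vlog/set file:info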

--- Additional comment from Ales Musil on 2022-05-04 06:03:39 UTC ---

(In reply to Yatin Karel from comment #3)
> I tried to debug it further only thing i got is when the issue happens
> en_lflow_output_run[1] is not getting called and also en_pflow_output_run[2]
> not called which calls physical_run[3] where the flows are getting added. I
> didn't find what calls en_lflow_output_run and en_pflow_output_run, may be
> some events call those and those events are somehow missed?  And then at
> time of vm delete both of these are getting called. Also came across a
> patch[4], not sure if that's related to this regression.
> 
> 
> [1]
> https://github.com/ovn-org/ovn/blob/v21.12.0/controller/ovn-controller.
> c#L2345
> [2]
> https://github.com/ovn-org/ovn/blob/v21.12.0/controller/ovn-controller.
> c#L2802
> [3] https://github.com/ovn-org/ovn/blob/v21.12.0/controller/physical.c#L1649
> [4]
> https://github.com/ovn-org/ovn/commit/
> 3ae8470edc648b7401433a22a9f15053cc7e666d

Hi,
the en_lflow_output_run and en_pflow_output_run nodes run only during a full
recompute, so the fact that they are not called is expected. The function
responsible for those flows is consider_port_binding[0], which is called during I-P
(incremental processing) from two places: pflow_output_runtime_data_handler[1] and pflow_output_sb_port_binding_handler[2].

During the investigation I noticed that there is a missing record in table 37
that should be added during the I-P run; however, consider_port_binding was called on every
occasion in my test setup.

[0] https://github.com/ovn-org/ovn/blob/v21.12.0/controller/physical.c#L889
[1] https://github.com/ovn-org/ovn/blob/v21.12.0/controller/ovn-controller.c#L2876
[2] https://github.com/ovn-org/ovn/blob/v21.12.0/controller/ovn-controller.c#L2826
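A quick way to see which tables actually got flows for the affected port, using the same cookie trick as the reproducer in the next comment (a sketch; tapd00f18e1-90 is the metadata tap from the description, and the cookie is the first 32 bits of the Port_Binding UUID):

IFACE=tapd00f18e1-90
LPORT=$(ovs-vsctl get Interface "$IFACE" external_ids:iface-id | tr -d '"')
COOKIE=$(ovn-sbctl --bare --columns=_uuid find Port_Binding logical_port="$LPORT" | cut -c1-8)

# Count flows per table carrying this cookie; a missing table=37 entry matches the symptom above
# (if the UUID starts with 0, drop the leading zero: dump-flows prints the cookie without it)
ovs-ofctl dump-flows br-int | grep "$COOKIE" | grep -o 'table=[0-9]*' | sort -t= -k2 -n | uniq -c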

--- Additional comment from Yatin Karel on 2022-05-18 12:05:00 UTC ---

Troubleshot it further and found that the issue happens at resource cleanup time, and once it occurs the next run of the test fails.

# In good cases the metadata port is deleted soon (within 1-2 seconds) after the VM port is deleted.

# In bad cases it is deleted only after several seconds (more than 10 seconds from what I observed), and the network is deleted before it gets removed. From the Neutron perspective this happens because ovn-metadata, when it is about to remove the metadata port from the bridge, receives a PortBindingChassisDeletedEvent, starts processing it, and only then cleans up the metadata port.
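To put timestamps on that good/bad case difference, a simple poll of the bridge ports during the tempest cleanup is enough (a sketch; in the bad case the metadata tap lingers for more than 10 seconds after the VM tap disappears):

# Print a timestamped list of br-int ports once per second during resource cleanup
while true; do
    echo "$(date +%H:%M:%S) $(ovs-vsctl list-ports br-int | tr '\n' ' ')"
    sleep 1
done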


Was able to reproduce the issue with the following OVN-only commands:

# Configure OVS
ovs-vsctl set open . external_ids:system-id=hv1 external_ids:ovn-remote=tcp:$IP:6642 external_ids:ovn-encap-type=geneve external_ids:ovn-encap-ip=$IP

# ovn-nbctl --version
ovn-nbctl 21.12.1
Open vSwitch Library 2.16.90
DB Schema 5.34.1

# Create script
ovn-nbctl ls-add ls1
 
ovn-nbctl lsp-add ls1 ls1p1
ovn-nbctl lsp-set-addresses ls1p1 "00:00:00:01:01:11 192.168.1.11"
 
ovn-nbctl lsp-add ls1 ls1lp
ovn-nbctl lsp-set-type ls1lp localport
ovn-nbctl lsp-set-addresses ls1lp "00:00:00:01:01:02 192.168.1.2"
 
ovn-nbctl lr-add lr1                                                                                  
ovn-nbctl lrp-add lr1 lr1-ls1 00:00:00:00:00:01 192.168.1.154/24 2001::a/64
ovn-nbctl lsp-add ls1 ls1-lr1                                                                        
ovn-nbctl lsp-set-addresses ls1-lr1 "00:00:00:00:00:01"
ovn-nbctl lsp-set-type ls1-lr1 router                                                                
ovn-nbctl lsp-set-options ls1-lr1 router-port=lr1-ls1
                                                                                                     
ovn-nbctl lrp-add lr1 lr1-pub  00:00:00:00:00:02 172.17.1.154/24 7011::a/64
ovn-nbctl lrp-set-gateway-chassis lr1-pub hv1                                                        
ovn-nbctl lr-route-add lr1 0.0.0.0/0 172.17.1.100 lr1-pub
ovn-nbctl lr-route-add lr1 ::/0 7011::100 lr1-pub                                                    
                                                                                                     
ovn-nbctl ls-add pub                                                                                  
ovn-nbctl lsp-add pub pub-lr1                                                                        
ovn-nbctl lsp-set-type pub-lr1 router                                                                
ovn-nbctl lsp-set-addresses pub-lr1 router                                                            
ovn-nbctl lsp-set-options pub-lr1 router-port=lr1-pub
                                                                                                     
ovn-nbctl lsp-add pub ln0                                                                            
ovn-nbctl lsp-set-type ln0 localnet                                                                  
ovn-nbctl lsp-set-options ln0 network_name=phys                                                      
ovn-nbctl lsp-set-addresses ln0 unknown                
 
ovs-vsctl add-port br-int ls1p1 -- set interface ls1p1 type=internal external_ids:iface-id=ls1p1
ovs-vsctl add-port br-int ls1lp -- set interface ls1lp type=internal external_ids:iface-id=ls1lp

# After running the Create script, the commands below should show that 5 or 6 flows exist for the localport
COOKIE=$(ovn-sbctl find port_binding logical_port=ls1lp|grep uuid|cut -d: -f2| cut -c1-9)
ovs-ofctl dump-flows br-int|grep $COOKIE

# Cleanup script
ovs-vsctl del-port br-int ls1p1
ovn-nbctl lrp-del-gateway-chassis lr1-pub hv1
#ovs-vsctl del-port br-int ls1lp
ovn-nbctl lrp-del lr1-pub
ovn-nbctl lrp-del lr1-ls1
 
ovn-nbctl lsp-del pub-lr1
ovn-nbctl lsp-del ls1-lr1
 
ovn-nbctl lsp-del ln0
ovn-nbctl ls-del pub
 
ovn-nbctl lsp-del ls1lp
ovn-nbctl ls-del ls1
ovn-nbctl lr-del lr1
ovs-vsctl del-port br-int ls1lp

# After running the Cleanup script, check the flows again: when the issue happens, 1 flow (table=37 priority=150) is still left; without the issue no flow should remain
ovs-ofctl dump-flows br-int|grep $COOKIE

NOTES:
- The issue does not reproduce with the previous version, i.e. ovn-2021-21.06.0-17, using the above scripts.
- With ovn-2021-21.06.0-17, flows are cleaned when ls1 is deleted (even if ls1lp is not deleted from the br-int bridge), as happens with ovn-2021-21.12.
- If the localport is deleted earlier (uncomment ovs-vsctl del-port br-int ls1lp in the Cleanup script), the issue does not reproduce.
- With ovn-2021-21.12.0-11, when the issue happens the following flow is not cleaned at cleanup time (a scripted loop that detects this is sketched after these notes):
  cookie=0x9e4428b3, duration=193.727s, table=37, n_packets=0, n_bytes=0, idle_age=193, priority=150,reg14=0x2,metadata=0x1 actions=resubmit(,38)
- If the Create script is then run again, this stale flow is cleaned but no new flow is added for the localport, which is how the metadata issue shows up in the tests.
- Also noticed that if ls-del pub is done after lsp-del ls1lp, the flows are cleaned as well; not sure whether that is related, but in our tempest tests we do not touch the public switch/network during the test run.
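A scripted loop around the Create/Cleanup scripts above that detects the leftover flow automatically (a sketch; create.sh and cleanup.sh are assumed to hold the two scripts verbatim):

# Repeat create/cleanup and stop as soon as the stale table=37 localport flow is left behind
for i in $(seq 1 50); do
    sh create.sh
    # Capture the cookie while the Port_Binding for ls1lp still exists
    COOKIE=$(ovn-sbctl --bare --columns=_uuid find Port_Binding logical_port=ls1lp | cut -c1-8)
    sh cleanup.sh
    sleep 2
    if ovs-ofctl dump-flows br-int | grep 'table=37' | grep -q "$COOKIE"; then
        echo "iteration $i: stale localport flow left behind"
        ovs-ofctl dump-flows br-int | grep "$COOKIE"
        break
    fi
    echo "iteration $i: clean"
done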

Comment 1 Yatin Karel 2022-05-30 05:16:47 UTC
Tested with the patches from the OVN bug[1][2] in [3] and the issue no longer reproduced. Now we need to wait for these to be available in ovn-2021-21.12 for OSP 17.0.

[1] https://patchwork.ozlabs.org/project/ovn/patch/20220525065155.293205-2-amusil@redhat.com/
[2] https://patchwork.ozlabs.org/project/ovn/patch/20220527073040.519706-1-amusil@redhat.com/
[3] https://review.rdoproject.org/r/c/testproject/+/41999

Comment 5 Fiorella Yanac 2022-07-06 09:07:51 UTC
The test tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_update_router_admin_state was run 1000 times and all runs passed.
The OVN command steps were also followed; after running the Cleanup script the flow was cleaned up.


Version:
RHOS-17.0-RHEL-9-20220628.n.1
ovn22.03-22.03.0-52.el9fdp.x86_64


[heat-admin@controller-1 ~]$ sudo podman exec ovn_controller rpm -qa | grep -E 'ovn|openvswitch' 
openvswitch-selinux-extra-policy-1.0-31.el9fdp.noarch                                                   
ovn22.03-22.03.0-52.el9fdp.x86_64
openvswitch2.17-2.17.0-18.el9fdp.x86_64
python3-openvswitch2.17-2.17.0-18.el9fdp.x86_64
openstack-network-scripts-openvswitch2.17-10.11.1-3.el9ost.x86_64                                       
rhosp-network-scripts-openvswitch-2.17-5.el9ost.noarch                                                  
rhosp-openvswitch-2.17-5.el9ost.noarch
python3-rhosp-openvswitch-2.17-5.el9ost.noarch
rhosp-ovn-22.03-5.el9ost.noarch
ovn22.03-host-22.03.0-52.el9fdp.x86_64
rhosp-ovn-host-22.03-5.el9ost.noarch
[heat-admin@controller-1 ~]$ 

[heat-admin@controller-0 ~]$ ovn-nbctl --version
ovn-nbctl 22.03.2
Open vSwitch Library 2.17.90
DB Schema 6.1.0

Comment 13 errata-xmlrpc 2022-09-21 12:21:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 17.0 (Wallaby)), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2022:6543

