Created attachment 1812779 [details]
0630 flow dumps

Description of problem:

After upgrading the OVN-Kube nightly, performance has drastically dropped, both relative to SDN (OVN vs. SDN performance) and relative to previous OVN versions (OVN 6/30 vs. OVN 7/29).

Data collection gives the following versions for these components:

6/30 build ("fast"):
openvswitch2.15-2.15.0-15.el8fdp.x86_64
ovn2.13-20.12.0-140.el8fdp.x86_64
ovn2.13-host-20.12.0-140.el8fdp.x86_64
ovn2.13-central-20.12.0-140.el8fdp.x86_64
ovn2.13-vtep-20.12.0-140.el8fdp.x86_64
kernel-4.18.0-305.3.1.el8_4.x86_64

7/29 build ("slow"):
openvswitch2.15-2.15.0-28.el8fdp.x86_64
ovn21.09-21.09.0-9.el8fdp.x86_64
ovn21.09-host-21.09.0-9.el8fdp.x86_64
ovn21.09-central-21.09.0-9.el8fdp.x86_64
ovn21.09-vtep-21.09.0-9.el8fdp.x86_64
kernel-4.18.0-305.10.2.el8_4.x86_64

Attached are the ofctl and dpctl flow dumps for the OVN-Kube setup.
The following script was developed to try to create a reproducer that exists outside of the cloud environment:

--------- 8< ---------
#!/bin/sh
# SPDX-License-Identifier: GPL-2.0
#
# Basic OVS testing

on_exit () {
    echo "$1" > ${ovs_dir}/cleanup.tmp
    cat ${ovs_dir}/cleanup >> ${ovs_dir}/cleanup.tmp
    mv ${ovs_dir}/cleanup.tmp ${ovs_dir}/cleanup
}

ovs_setenv() {
    sandbox=$1
    ovs_dir=$ovs_base${1:+/$1}; export ovs_dir
    test -e ${ovs_dir}/cleanup || : > ${ovs_dir}/cleanup
    OVS_RUNDIR=$ovs_dir; export OVS_RUNDIR
    OVS_LOGDIR=$ovs_dir; export OVS_LOGDIR
    OVS_DBDIR=$ovs_dir; export OVS_DBDIR
    OVS_SYSCONFDIR=$ovs_dir; export OVS_SYSCONFDIR
    OVS_PKGDATADIR=$ovs_dir; export OVS_PKGDATADIR
}

ovs_exit_sig() {
    . "$ovs_dir/cleanup"
    ovs-dpctl del-dp ovs-system
}

ovs_cleanup() {
    ovs_exit_sig
    echo "Error detected. See $ovs_dir for more details."
}

ovs_normal_exit() {
    ovs_exit_sig
    rm -rf ${ovs_dir}
}

kill_ovs_vswitchd () {
    # Use provided PID or save the current PID if available.
    TMPPID=$1
    if test -z "$TMPPID"; then
        TMPPID=$(cat $OVS_RUNDIR/ovs-vswitchd.pid 2>/dev/null)
    fi

    # Tell the daemon to terminate gracefully
    ovs-appctl -t ovs-vswitchd exit --cleanup 2>/dev/null

    # Nothing else to be done if there is no PID
    test -z "$TMPPID" && return

    for i in 1 2 3 4 5 6 7 8 9; do
        # Check if the daemon is alive.
        kill -0 $TMPPID 2>/dev/null || return

        # Fall back to a whole number since POSIX doesn't require
        # fractional times to work.
        sleep 0.1 || sleep 1
    done

    # Make sure it is terminated.
    kill $TMPPID
}

start_daemon () {
    echo "exec: $@ -vconsole:off --detach --no-chdir --pidfile --log-file"
    "$@" -vconsole:off --detach --no-chdir --pidfile --log-file
    pidfile="$OVS_RUNDIR"/$1.pid
    echo "setting kill for $@..."
    on_exit "test -e \"$pidfile\" && kill \`cat \"$pidfile\"\`"
}

if test "X$vswitchd_schema" = "X"; then
    vswitchd_schema="/usr/share/openvswitch"
fi

ovs_sbx() {
    if test "X$2" != X; then
        (ovs_setenv $1; shift; "$@")
    else
        ovs_setenv $1
    fi
}

seq () {
    if test $# = 1; then
        set 1 $1
    fi
    while test $1 -le $2; do
        echo $1
        set `expr $1 + ${3-1}` $2 $3
    done
}

ovs_wait () {
    echo "$1: waiting $2..."

    # First try the condition without waiting.
    if ovs_wait_cond; then echo "$1: wait succeeded immediately"; return 0; fi

    # Try a quick sleep, so that the test completes very quickly
    # in the normal case.  POSIX doesn't require fractional times to
    # work, so this might not work.
    sleep 0.1
    if ovs_wait_cond; then echo "$1: wait succeeded quickly"; return 0; fi

    if [ "$OVS_CTL_TIMEOUT" = "" ]; then
        OVS_CTL_TIMEOUT=30
    fi

    # Then wait up to OVS_CTL_TIMEOUT seconds.
    local d
    for d in `seq 1 "$OVS_CTL_TIMEOUT"`; do
        sleep 1
        if ovs_wait_cond; then echo "$1: wait succeeded after $d seconds"; return 0; fi
    done

    echo "$1: wait failed after $d seconds"
    ovs_wait_failed
}

sbxs=
sbx_add () {
    echo "adding sandbox '$1'"
    sbxs="$sbxs $1"

    # Create sandbox.
    local d="$ovs_base"/$1
    mkdir "$d" || return 1
    ovs_setenv $1

    # Create database and start ovsdb-server.
    : > "$d"/.conf.db.~lock~
    ovs_sbx $1 ovsdb-tool create "$d"/conf.db "$vswitchd_schema"/vswitch.ovsschema || return 1
    ovs_sbx $1 start_daemon ovsdb-server --detach --pidfile --log-file --remote=punix:"$d"/db.sock || return 1

    # Initialize database.
    ovs_sbx $1 ovs-vsctl --no-wait -- init || return 1

    # Start ovs-vswitchd.
    ovs_sbx $1 start_daemon ovs-vswitchd --pidfile -vvconn -vofproto_dpif -vunixctl
    ovs_wait_cond () {
        if ip link show ovs-system; then return 0; else return 1; fi
    }
    ovs_wait_failed () {
        :
    }
    ovs_wait "sandbox_add" "while ip link show ovs-system" || return 1
}

ovs_base=`pwd`

test_udp_performance() {
    sbx_add "test_udp_performance" || return 1

    for ns in client server; do
        ip netns add $ns || return 1
        on_exit "ip netns del $ns"
    done

    # setup the base bridge
    ovs_sbx "test_udp_performance" ovs-vsctl add-br br0 || return 1

    # setup the client
    ip link add c0 type veth peer name c1 || return 1
    on_exit "ip link del c0 >/dev/null 2>&1"
    ip link set c0 mtu 1450
    ip link set c0 up
    ip link set c1 netns client || return 1
    ip netns exec client ip addr add 172.31.110.2/24 dev c1
    ip netns exec client ip link set c1 mtu 1450
    ip netns exec client ip link set c1 up
    ovs_sbx "test_udp_performance" ovs-vsctl add-port br0 c0 || return 1

    # setup the server
    ip link add s0 type veth peer name s1 || return 1
    on_exit "ip link del s0 >/dev/null 2>&1; ip netns exec server ip link del s0 >/dev/null 2>&1"
    ip link set s0 up
    ip link set s1 netns server || return 1
    ip netns exec server ip addr add 172.31.110.1/24 dev s1 || return 1
    ip netns exec server ip link set s1 up
    ovs_sbx "test_udp_performance" ovs-vsctl add-port br0 s0 || return 1

    ovs_sbx "test_udp_performance" ovs-ofctl del-flows br0
    cat >${ovs_dir}/flows.txt <<EOF
table=0,priority=1000,arp action=normal
table=0,priority=50,ip,in_port=c0 action=ct(table=1,zone=64000)
table=0,priority=40,ip,in_port=s0 action=ct(table=1,zone=64000,commit)
table=0,priority=0 action=normal
table=1,priority=0 action=normal
EOF
    ovs_sbx "test_udp_performance" ovs-ofctl --bundle add-flows br0 ${ovs_dir}/flows.txt || return 1
    ovs_sbx "test_udp_performance" ovs-ofctl dump-flows br0 >> ${ovs_dir}/debug.log
    ovs_sbx "test_udp_performance" ovs-vsctl show >> ${ovs_dir}/debug.log

    # setup iperf3 server
    ip netns exec server sh -c 'iperf3 -B 172.31.110.1 -s >${ovs_dir}/s1-iperf.log 2>&1 & echo $! > ${ovs_dir}/server.pid'
    on_exit "test -e \"${ovs_dir}/server.pid\" && kill \`cat \"${ovs_dir}/server.pid\"\`"

    # give the server 5 seconds to start
    sleep 5

    # start iperf3 client
    ip netns exec client ping -c 1 172.31.110.1 || echo "unable to ping"
    ip netns exec client iperf3 -u -c 172.31.110.1 -b 2000M >${ovs_dir}/c1-iperf.log 2>&1 || echo "failed to run iperf client"

    echo "dumping debug logs"
    echo "====== default stuff =====" >> ${ovs_dir}/debug.log
    rpm -qa | grep openvswitch >> ${ovs_dir}/debug.log
    ovs-vsctl -V >> ${ovs_dir}/debug.log
    ip link show >> ${ovs_dir}/debug.log
    ip addr show >> ${ovs_dir}/debug.log
    ovs_sbx "test_udp_performance" ovs-appctl dpctl/dump-flows >> ${ovs_dir}/debug.log
    ovs_sbx "test_udp_performance" ovs-ofctl dump-flows br0 >> ${ovs_dir}/debug.log
    echo "====== server ======" >> ${ovs_dir}/debug.log
    ip netns exec server ip link show >> ${ovs_dir}/debug.log
    ip netns exec server ip addr show >> ${ovs_dir}/debug.log
    echo "====== client ======" >> ${ovs_dir}/debug.log
    ip netns exec client ip link show >> ${ovs_dir}/debug.log
    ip netns exec client ip addr show >> ${ovs_dir}/debug.log
    echo "debug logs dumped"

    echo "manual cleanup"
    . test_udp_performance/cleanup
    # ovs_normal_exit
}

if [ "$1" = "" ]; then
    test_udp_performance
    DATE=$(date +"%d-%m-%y-%T")
    mv test_udp_performance test_udp_performance_${DATE}
fi
--------- 8< ---------

When running the above script with the 06/30 z-stream kernel, the results from iperf were stable at:

[ 5] 0.00-10.00 sec 2.33 GBytes 2.00 Gbits/sec 0.001 ms 0/1788241 (0%) receiver

But after upgrading to the 07/29 kernel:

[ 5] 0.00-10.00 sec 2.19 GBytes 1.88 Gbits/sec 0.002 ms 0/1683879 (0%) receiver

We attempted to downgrade the kernel on a running cluster (given the above data) to see if the drop was kernel related: we changed the kernel on two nodes in the cluster (the uperf server and uperf client) to do the host-to-host test, but didn't see much (if any) improvement. The next test would be to downgrade the OVN version from 21.09 to 2.13 for the cluster.
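For quick reference, the regression between the two receiver lines above can be computed directly; this is plain arithmetic on the reported Gbits/sec figures:

```shell
#!/bin/sh
# Percentage throughput drop between the 06/30 ("fast") and 07/29
# ("slow") kernels, taken from the iperf3 receiver lines above.
fast=2.00   # Gbits/sec with the 06/30 kernel
slow=1.88   # Gbits/sec with the 07/29 kernel
awk -v f="$fast" -v s="$slow" 'BEGIN { printf "drop: %.1f%%\n", (f - s) / f * 100 }'
# prints: drop: 6.0%
```

That puts the standalone reproducer regression at roughly 6%.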
This is the dpctl output from 06/30 and 07/29, filtering out flows matching on proto=6 (TCP), sorted by stats, and excluding unused flows:

dpctl.0630.txt
→recirc_id(0),in_port(2),eth(src=02:48:fd:1e:64:57,dst=02:2e:ad:a7:47:7b),eth_type(0x0800),ipv4(dst=0.0.0.0/128.0.0.0,frag=no), packets:54505185, bytes:448851922234, used:0.005s, flags:SFPR., actions:1
→recirc_id(0),in_port(2),eth(src=02:48:fd:1e:64:57,dst=02:2e:ad:a7:47:7b),eth_type(0x0800),ipv4(dst=169.254.169.128/255.255.255.128,frag=no), packets:166062, bytes:18701730, used:4.422s, flags:SFP., actions:1
→recirc_id(0),in_port(2),ct_state(-new+est+trk),eth(src=02:48:fd:1e:64:57,dst=02:4d:e0:28:c8:c3),eth_type(0x0800),ipv4(dst=0.0.0.0/128.0.0.0,frag=no), packets:116648, bytes:367292975, used:1.526s, flags:SP., actions:1
→recirc_id(0),in_port(1),eth(),eth_type(0x0800),ipv4(proto=17,frag=no),udp(dst=6081), packets:91472, bytes:33082055, used:1.581s, actions:2
→recirc_id(0),in_port(2),ct_state(-new-est-trk),eth(src=02:48:fd:1e:64:57,dst=02:4d:e0:28:c8:c3),eth_type(0x0800),ipv4(dst=0.0.0.0/128.0.0.0,frag=no), packets:25158, bytes:11363273, used:1.581s, actions:1
→recirc_id(0),in_port(2),eth(),eth_type(0x0800),ipv4(dst=172.30.0.0/255.255.0.0,frag=no), packets:75, bytes:12115, used:1.388s, flags:SFPR., actions:ct(commit,zone=64001,nat(src=169.254.169.2)),recirc(0x1890c)

dpctl.0729.txt
→recirc_id(0),in_port(2),eth(src=02:bf:28:e2:dc:89,dst=02:f4:f5:2d:bf:e9),eth_type(0x0800),ipv4(dst=0.0.0.0/128.0.0.0,frag=no), packets:81105258, bytes:458801744067, used:0.005s, flags:SFPR., actions:1
→recirc_id(0),in_port(1),eth(dst=02:bf:28:e2:dc:89),eth_type(0x0800),ipv4(proto=17,frag=no),udp(dst=6081), packets:33829437, bytes:9300103083, used:0.504s, actions:2
→recirc_id(0),in_port(2),ct_state(-new+est+trk),ct_label(0/0x2),eth(src=02:bf:28:e2:dc:89,dst=02:59:76:a7:0d:3d),eth_type(0x0800),ipv4(dst=0.0.0.0/128.0.0.0,frag=no), packets:1137218, bytes:3735974911, used:0.302s, flags:SFPR., actions:1
→recirc_id(0),in_port(2),ct_state(-new-est-trk),ct_label(0/0x2),eth(src=02:bf:28:e2:dc:89,dst=02:59:76:a7:0d:3d),eth_type(0x0800),ipv4(dst=0.0.0.0/128.0.0.0,frag=no), packets:366119, bytes:204930834, used:0.504s, actions:1
→recirc_id(0),in_port(2),eth(src=02:bf:28:e2:dc:89,dst=02:f4:f5:2d:bf:e9),eth_type(0x0800),ipv4(dst=169.254.169.128/255.255.255.128,frag=no), packets:125411, bytes:13986602, used:3.178s, flags:SFP., actions:1

If these were taken during the test, the datapath flows are not the issue.

fbl
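For anyone repeating this triage, the filtering and sorting above can be done with a small pipeline. This is a sketch; the lines written to dump.txt are a trimmed stand-in (two real flows from the 06/30 dump plus one synthetic unused flow) for a full `ovs-appctl dpctl/dump-flows` capture:

```shell
#!/bin/sh
# Sort a saved datapath flow dump by packet count, highest first, and
# drop flows that never matched any traffic.
cat > dump.txt <<'EOF'
recirc_id(0),in_port(1),eth(),eth_type(0x0800),ipv4(proto=17,frag=no),udp(dst=6081), packets:91472, bytes:33082055, used:1.581s, actions:2
recirc_id(0),in_port(2),eth(),eth_type(0x0800),ipv4(dst=172.30.0.0/255.255.0.0,frag=no), packets:75, bytes:12115, used:1.388s, flags:SFPR., actions:ct(commit,zone=64001,nat(src=169.254.169.2)),recirc(0x1890c)
recirc_id(0),in_port(3),eth(),eth_type(0x0806), packets:0, bytes:0, used:never, actions:drop
EOF
sort_flows() {
    sed -n 's/.*packets:\([0-9]*\),.*/\1 &/p' "$1" |   # prefix each flow with its packet count
        sort -rn |                                     # highest count first
        awk '$1 > 0' |                                 # exclude flows that never matched
        cut -d' ' -f2-                                 # strip the temporary sort key
}
sort_flows dump.txt
```

The 91472-packet flow prints first and the packets:0 flow is dropped, matching the "sorted by stats, excluding unused" presentation above.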
Hi,

See below the hot flows in the kernel, which I assume are from the uperf test, captured while the test was running.

This is the 4.9 OVN client dpctl:
→recirc_id(0),in_port(10),ct_state(-new-est-trk),ct_label(0/0x2),eth(src=0a:58:0a:8f:00:cb,dst=0a:58:0a:8f:00:01),eth_type(0x0800),ipv4(src=10.143.0.203,dst=10.150.0.186,proto=17,frag=no),udp(src=32768/0x8000,dst=40678), packets:7434814, bytes:7925511724, used:0.001s, actions:ct(zone=67,nat),recirc(0x4b8f0a)
↳recirc_id(0x4b8f0a),in_port(10),ct_state(+new-est-rel-rpl-inv+trk),ct_label(0/0x3),eth(src=0a:58:0a:8f:00:cb,dst=0a:58:0a:8f:00:01),eth_type(0x0800),ipv4(src=10.143.0.128/255.255.255.128,dst=10.150.0.186,proto=17,tos=0/0x3,ttl=64,frag=no), packets:7434846, bytes:7925545836, used:0.001s, actions:ct(commit,zone=67,label=0/0x1,nat(src)),ct_clear,ct_clear,set(tunnel(tun_id=0x45,dst=10.0.131.65,ttl=64,tp_dst=6081,geneve({class=0x102,type=0x80,len=4,0x10007}),flags(df|csum|key))),set(eth(src=0a:58:0a:96:00:01,dst=0a:58:0a:96:00:ba)),set(ipv4(ttl=63)),4

This is the 4.8 OVN client dpctl:
→recirc_id(0),in_port(10),ct_state(-new-est-trk),eth(src=0a:58:0a:8f:00:08,dst=0a:58:0a:8f:00:01),eth_type(0x0800),ipv4(src=10.143.0.8,dst=10.132.0.17,proto=17,frag=no),udp(src=32768/0x8000,dst=35150), packets:8782451, bytes:9362092766, used:7.315s, actions:ct(zone=98,nat),recirc(0x1b69)
↳recirc_id(0x1b69),in_port(10),ct_state(+new-est-rel-rpl-inv+trk),ct_label(0/0x1),eth(src=0a:58:0a:8f:00:08,dst=0a:58:0a:8f:00:01),eth_type(0x0800),ipv4(src=10.143.0.8/255.255.255.248,dst=10.132.0.17,proto=17,tos=0/0x3,ttl=64,frag=no), packets:8782451, bytes:9362092766, used:7.314s, actions:ct(commit,zone=98,label=0/0x1),ct_clear,ct_clear,set(tunnel(tun_id=0xd,dst=10.0.128.60,ttl=64,tp_dst=6081,geneve({class=0x102,type=0x80,len=4,0x10010}),flags(df|csum|key))),set(eth(src=0a:58:0a:84:00:01,dst=0a:58:0a:84:00:11)),set(ipv4(ttl=63)),4

Note that 4.9 has one extra action: ct(nat(src)).

Also, although not visible in the above output, the 4.8 OVN client sees UDP fragments while 4.9 does not.

This is the 4.9 OVN server dpctl:
→recirc_id(0),in_port(1),eth(dst=06:ae:e3:7e:b6:85),eth_type(0x0800),ipv4(proto=17,frag=no),udp(dst=6081), packets:25008537, bytes:49704438428, used:0.001s, actions:2
→recirc_id(0),tunnel(tun_id=0x45,src=10.0.190.13,dst=10.0.131.65,geneve({class=0x102,type=0x80,len=4,0x10007/0x7fffffff}),flags(-df+csum+key)),in_port(4),eth(src=0a:58:0a:96:00:01),eth_type(0x0800),ipv4(proto=17,frag=no),udp(src=32768/0x8000), packets:1084965, bytes:115006290, used:0.000s, actions:ct(zone=67,nat),recirc(0x4ceea9)
↳recirc_id(0x4ceea9),tunnel(tun_id=0x45,src=10.0.190.13,dst=10.0.131.65,geneve({}{}),flags(-df+csum+key)),in_port(4),ct_state(+new-est-rel-rpl-inv+trk),ct_label(0/0x1),eth(src=0a:58:0a:96:00:01,dst=0a:58:0a:96:00:ba),eth_type(0x0800),ipv4(src=10.143.0.128/255.255.255.128,dst=10.150.0.186,frag=no), packets:1084916, bytes:115001096, used:0.001s, actions:ct(commit,zone=67,label=0/0x1,nat(src)),10

This is the 4.8 OVN server dpctl:
→recirc_id(0),in_port(1),eth(),eth_type(0x0800),ipv4(proto=17,frag=no),udp(dst=6081), packets:66552266, bytes:138847224676, used:0.000s, actions:2
→recirc_id(0),tunnel(tun_id=0xd,src=10.0.181.14,dst=10.0.128.60,geneve({class=0x102,type=0x80,len=4,0x10011/0x7fffffff}),flags(-df+csum+key)),in_port(4),eth(src=0a:58:0a:84:00:01),eth_type(0x0800),ipv4(proto=17,frag=first),udp(src=53248/0xfe00), packets:2061962, bytes:18380329268, used:0.001s, actions:ct(zone=44,nat),recirc(0x46cc)
Error, can't find recirc_id 18124.
→recirc_id(0),tunnel(tun_id=0xd,src=10.0.181.14,dst=10.0.128.60,geneve({class=0x102,type=0x80,len=4,0x10011/0x7fffffff}),flags(-df+csum+key)),in_port(4),eth(src=0a:58:0a:84:00:01),eth_type(0x0800),ipv4(proto=17,frag=later), packets:2061727, bytes:15557791942, used:0.001s, actions:ct(zone=44,nat),recirc(0x46cc)
Error, can't find recirc_id 18124.

Unfortunately the dpctl output missed some important recirculation entries in the log file, but looking at 4.9 we can see ct(nat(src)) there, so maybe it is fair to assume that 4.8 did not have it. Again, 4.8 sees UDP fragments while 4.9 does not.

I am going to build a small reproducer environment with this extra ct(nat(src)) to see if it could explain all the difference we are seeing.

fbl
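One minimal way to build that reproducer is to reuse the flow table from the script in comment#1 and add the same null source NAT on the committing rule. This is a sketch: zone 64000 and the c0/s0 ports come from that script, and whether this exactly mimics OVN's pipeline is an assumption:

```shell
#!/bin/sh
# Variant of the reproducer's flows.txt that adds a null source NAT
# (nat(src) with no address range) to the committing ct() rule,
# approximating the extra ct(commit,...,nat(src)) seen in the 4.9 dumps.
cat > flows.txt <<'EOF'
table=0,priority=1000,arp action=normal
table=0,priority=50,ip,in_port=c0 action=ct(table=1,zone=64000,nat)
table=0,priority=40,ip,in_port=s0 action=ct(table=1,zone=64000,commit,nat(src))
table=0,priority=0 action=normal
table=1,priority=0 action=normal
EOF
```

Loading this with `ovs-ofctl --bundle add-flows br0 flows.txt` in place of the original table should make the extra nat(src) action show up in the datapath flow dump.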
Hi,

The OVS build with the patch from comment#9 reverted showed improvement in a single run: we went from about -20% to about -8~13%. The threshold is 10%.

https://docs.google.com/spreadsheets/d/12kVpSSYjlrbN6m3XBwT0X14mWkC968pZS03l9s828gA/edit#gid=1076561704&range=A104:D114

However, the dpctl output still shows the extra ct() call. The client has it in zone=67 while the server has it in zone=68.

Client:
→recirc_id(0),tunnel(tun_id=0x32,src=10.0.181.194,dst=10.0.140.195,geneve({class=0x102,type=0x80,len=4,0x10007/0x7fffffff}),flags(-df+csum+key)),in_port(1),eth(src=0a:58:0a:91:00:01),eth_type(0x0800),ipv4(proto=17,frag=no),udp(src=32768/0x8000), packets:5999985, bytes:635998410, used:0.001s, actions:ct(zone=67,nat),recirc(0x43dc)
↳recirc_id(0x43dc),tunnel(tun_id=0x32,src=10.0.181.194,dst=10.0.140.195,geneve({}{}),flags(-df+csum+key)),in_port(1),ct_state(+new-est-rel-rpl-inv+trk),ct_label(0/0x1),eth(src=0a:58:0a:91:00:01,dst=0a:58:0a:91:00:0e),eth_type(0x0800),ipv4(src=10.138.0.8/255.255.255.248,dst=10.145.0.14,frag=no), packets:5999964, bytes:635996184, used:0.001s, actions:ct(commit,zone=67,label=0/0x1,nat(src)),10

Server:
→recirc_id(0),in_port(11),ct_state(-new-est-trk),ct_label(0/0x2),eth(src=0a:58:0a:8a:00:0f,dst=0a:58:0a:8a:00:01),eth_type(0x0800),ipv4(src=10.138.0.15,dst=10.145.0.14,proto=17,frag=no),udp(src=32768/0x8000,dst=39425), packets:6206955, bytes:657937230, used:0.001s, actions:ct(zone=68,nat),recirc(0x58bd)
↳recirc_id(0x58bd),in_port(11),ct_state(+new-est-rel-rpl-inv+trk),ct_label(0/0x3),eth(src=0a:58:0a:8a:00:0f,dst=0a:58:0a:8a:00:01),eth_type(0x0800),ipv4(src=10.138.0.8/255.255.255.248,dst=10.145.0.14,proto=17,tos=0/0x3,ttl=64,frag=no),udp(src=32768/0x8000), packets:6206971, bytes:657938926, used:0.002s, actions:ct(commit,zone=68,label=0/0x1,nat(src)),ct_clear,ct_clear,set(tunnel(tun_id=0x32,dst=10.0.140.195,ttl=64,tp_dst=6081,geneve({class=0x102,type=0x80,len=4,0x10007}),flags(df|csum|key))),set(eth(src=0a:58:0a:91:00:01,dst=0a:58:0a:91:00:0e)),set(ipv4(ttl=63)),1

We are not sure if the procedure to replace the OVS RPM package in OCP was correct. We are going to run 3 times to get an average, and then move on to replacing OVN instead.

fbl
Hi,

A new round of tests using the OVN package from https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=39300696, which has the patch from comment#9, gave an 8~14% drop compared to SDN:

https://docs.google.com/spreadsheets/d/12kVpSSYjlrbN6m3XBwT0X14mWkC968pZS03l9s828gA/edit#gid=1076561704&range=A130:D140

The dpctl output now looks good because it does not contain the extra ct(nat(src)) call.

Server side:
→recirc_id(0),tunnel(tun_id=0x5,src=10.0.181.194,dst=10.0.140.195,geneve({class=0x102,type=0x80,len=4,0xc0031/0x7fffffff}),flags(-df+csum+key)),in_port(1),ct_state(-new-est-rel-rpl-inv-trk),ct_label(0/0x3),eth(src=0a:58:0a:91:00:01,dst=0a:58:0a:91:00:07),eth_type(0x0800),ipv4(proto=17,frag=no),udp(src=32768/0x8000), packets:34258740, bytes:7078459905, used:0.000s, actions:ct_clear,ct(zone=67,nat),recirc(0x9578)
↳recirc_id(0x9578),tunnel(tun_id=0x5,src=10.0.181.194,dst=10.0.140.195,geneve({}{}),flags(-df+csum+key)),in_port(1),ct_state(+new-est-rel-rpl-inv+trk),ct_label(0/0x1),eth(src=0a:58:0a:91:00:01,dst=0a:58:0a:91:00:07),eth_type(0x0800),ipv4(src=10.138.0.8/255.255.255.248,dst=10.145.0.7,frag=no), packets:34258773, bytes:7078495083, used:0.001s, actions:ct(commit,zone=67,label=0/0x1),10

Client side:
→recirc_id(0),in_port(11),ct_state(-new-est-trk),ct_label(0/0x2),eth(src=0a:58:0a:8a:00:08,dst=0a:58:0a:8a:00:01),eth_type(0x0800),ipv4(src=10.138.0.8,dst=10.145.0.7,proto=17,frag=no),udp(src=32768/0x8000,dst=39216), packets:3809514, bytes:4060941924, used:0.000s, actions:ct(zone=68,nat),recirc(0xc48b)
↳recirc_id(0xc48b),in_port(11),ct_state(+new-est-rel-rpl-inv+trk),ct_label(0/0x3),eth(src=0a:58:0a:8a:00:08,dst=0a:58:0a:8a:00:01),eth_type(0x0800),ipv4(src=10.138.0.8/255.255.255.248,dst=10.145.0.7,proto=17,tos=0/0x3,ttl=64,frag=no),udp(src=32768/0x8000), packets:3809552, bytes:4060982432, used:0.001s, actions:ct(commit,zone=68,label=0/0x1),ct_clear,set(tunnel(tun_id=0x5,dst=10.0.140.195,ttl=64,tp_dst=6081,geneve({class=0x102,type=0x80,len=4,0xc0031}),flags(df|csum|key))),set(eth(src=0a:58:0a:91:00:01,dst=0a:58:0a:91:00:07)),set(ipv4(ttl=63)),1

fbl
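The "looks good" check (no extra null source NAT in the dump) can be made mechanical. This is a sketch; `count_null_snat` is a hypothetical helper and the two one-line files stand in for real captured dumps:

```shell
#!/bin/sh
# Count ct() actions carrying the null source NAT in a saved dpctl
# capture.  good.txt/bad.txt are stand-ins for real dump files.
count_null_snat() {
    grep -c 'nat(src)' "$1"
}
printf 'actions:ct(commit,zone=67,label=0/0x1),10\n' > good.txt
printf 'actions:ct(commit,zone=67,label=0/0x1,nat(src)),10\n' > bad.txt
echo "good.txt: $(count_null_snat good.txt)"   # prints: good.txt: 0
echo "bad.txt: $(count_null_snat bad.txt)"     # prints: bad.txt: 1
```

Note that grep -c counts matching lines, which is sufficient here since each datapath flow is printed on a single line.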
Argh, the comment above was supposed to say that the patch from comment#9 is REVERTED in that build.
Mohit provided a new set of results for today's build (no modifications/no custom builds). It is a PASS because no test is above 10%.

https://docs.google.com/spreadsheets/d/12kVpSSYjlrbN6m3XBwT0X14mWkC968pZS03l9s828gA/edit#gid=1076561704&range=A143:D153

I am waiting for Mohit to provide the dpctl outputs to confirm whether the null source NAT is back, but it sounds like reverting the patch is not related after all.

fbl
Copy & paste from our gchat group from Sept 5th:

----8<----
Flavio Leitner, Sun 11:38 AM
Hello everyone. I finished up running tests with SDN and OVN on both 4.8 and 4.9. I noticed that the same image can result in two cases which I call "good" or "bad". When you get a "good" result, it is consistent (low std dev) and you can reliably reproduce the same numbers running the test again and again. The same happens with a "bad" result, which is slower, of course. That happens with the same image file. That happens with SDN or OVN images. You can see that in the '4.8' tab in the spreadsheet I am going to copy & paste next.

For 4.9 I could get "bad" and "good" results with OVN, but not yet with SDN. I didn't try much (most of the time was spent testing 4.8), so I believe if I continue testing I will eventually get them. In the 4.9 case I got OVN "bad" and "good" results with different images, so given the 4.8 results with the same image, I would say the image is not related. Alright, then I built tables calculating OVN #1 / SDN #1 and OVN #2 / SDN #2 with all 6 tests.

Flavio Leitner, Sun 11:43 AM (edited)
So, two re-installations each (SDN #1 and SDN #2) and 6 runs (each column), then the average (AVG).

Flavio Leitner, Sun 11:52 AM
In 4.9 I have a PASS (#1) and a FAIL (#2). Given that I haven't got an SDN "bad" run, doing combinations won't give anything new. In 4.8, however, I got "good" and "bad" runs for both, so we can see all possible results. I will copy the link to the specific table below:
https://docs.google.com/spreadsheets/d/1p7lfOO6hSh6alpnDCzhXUov5qHTgQLk-ecXIkOZoHBY/edit#gid=926360702&range=B86:G100
----8<----

In summary, it seems clear that the same image can return two distinct results once deployed, for both SDN and OVN. I think before we go deeper into any performance problem, we should first stabilize the results and get a reliable baseline.

fbl
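One way to make the "good"/"bad" classification objective is to look at each series' relative spread. A sketch, with illustrative values rather than numbers from the spreadsheet:

```shell
#!/bin/sh
# Report mean, standard deviation, and coefficient of variation for a
# series of throughput results, one value per line on stdin.  A low CV
# marks a "good" (stable) series; a high CV marks a noisy one.
stats() {
    awk '{ n++; s += $1; ss += $1 * $1 }
         END {
             mean = s / n
             sd = sqrt(ss / n - mean * mean)
             printf "mean=%.2f sd=%.2f cv=%.1f%%\n", mean, sd, sd / mean * 100
         }'
}
printf '%s\n' 2.00 1.99 2.01 2.00 1.98 2.00 | stats   # a stable ("good") series
```

Runs whose CV stays under some agreed bound could then serve as the reliable baseline before chasing the OVN-vs-SDN delta.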
Based on the previous comment, the OVN-K Scale sync meeting discussion, and that there was no relevant difference between OVN best numbers in 4.8 and 4.9 in the spreadsheet, I am closing this ticket. Feel free to re-open if necessary. Thanks, fbl
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days