Bug 1868392

Summary: [FDP 20.F] OVN 2.13 breaks pod-pod networking across the nodes on OCP
Product: Red Hat Enterprise Linux Fast Datapath
Component: ovn2.13
Version: FDP 20.E
Reporter: Anurag saxena <anusaxen>
Assignee: Numan Siddique <nusiddiq>
QA Contact: Jianlin Shi <jishi>
CC: ctrautma, dcbw, huirwang, jishi, kfida, nusiddiq, ralongi, rbrattai, zzhao
Severity: urgent
Priority: urgent
Status: CLOSED ERRATA
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2020-10-27 09:49:12 UTC

Attachments:
log bundle PR 755

Description Anurag saxena 2020-08-12 14:07:05 UTC
Description of problem: Brought up an OCP 4.5.5 cluster with OVNKubernetes using the FDP 20.F rpms:
https://errata.devel.redhat.com/advisory/57163
https://errata.devel.redhat.com/advisory/57030


A possible reason is missing table=21 ARP entries in the br-int flows on the nodes for pods in other projects, which might be what breaks pod-to-pod networking across nodes.

$ oc get pods -n test -o wide
NAME            READY   STATUS    RESTARTS   AGE   IP           NODE                                         NOMINATED NODE   READINESS GATES
test-rc-45876   1/1     Running   0          46m   10.129.2.7   ip-10-0-142-102.us-east-2.compute.internal   <none>           <none>
test-rc-pxnt2   1/1     Running   0          46m   10.128.2.7   ip-10-0-181-122.us-east-2.compute.internal   <none>           <none>
[anusaxen@anusaxen verification-tests]$ oc get pods -n test1 -o wide
NAME            READY   STATUS    RESTARTS   AGE   IP           NODE                                         NOMINATED NODE   READINESS GATES
test-rc-hfbdj   1/1     Running   0          36m   10.129.2.8   ip-10-0-142-102.us-east-2.compute.internal   <none>           <none>
test-rc-nf88l   1/1     Running   0          36m   10.128.2.8   ip-10-0-181-122.us-east-2.compute.internal   <none>           <none>
[anusaxen@anusaxen verification-tests]$ oc rsh -n test test-rc-45876
~ $ curl 10.129.2.8:8080
curl: (7) Failed to connect to 10.129.2.8 port 8080: Host is unreachable
~ $ exit

$ oc debug node/ip-10-0-142-102.us-east-2.compute.internal
Starting pod/ip-10-0-142-102us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.142.102
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host

sh-4.4# ovs-ofctl dump-flows br-int -O openflow13 | grep "10.129.2.8" | grep "table=21"

sh-4.4# ovs-ofctl dump-flows br-int -O openflow13 | grep "10.129.2.8"                  
 cookie=0x4b6bf54, duration=2312.751s, table=9, n_packets=0, n_bytes=0, priority=90,ip,reg14=0x8,metadata=0x13,dl_src=9a:41:fc:81:02:09,nw_src=10.129.2.8 actions=resubmit(,10)
 cookie=0x21ba1ead, duration=2312.751s, table=10, n_packets=0, n_bytes=0, priority=90,arp,reg14=0x8,metadata=0x13,dl_src=9a:41:fc:81:02:09,arp_spa=10.129.2.8,arp_sha=9a:41:fc:81:02:09 actions=resubmit(,11)
 cookie=0x1e32982a, duration=2309.325s, table=19, n_packets=0, n_bytes=0, priority=2,tcp,metadata=0x13,nw_src=10.129.2.8,nw_dst=10.129.2.8,tp_dst=8080 actions=load:0x1->NXM_NX_XXREG0[102],ct(commit,table=20,zone=NXM_NX_REG12[0..15],nat(src=172.30.15.78))
 cookie=0x8268653, duration=2309.325s, table=19, n_packets=0, n_bytes=0, priority=2,tcp,metadata=0x12,nw_src=10.129.2.8,nw_dst=10.129.2.8,tp_dst=8080 actions=load:0x1->NXM_NX_XXREG0[102],ct(commit,table=20,zone=NXM_NX_REG12[0..15],nat(src=172.30.15.78))
 cookie=0xa0f80b53, duration=2309.325s, table=19, n_packets=0, n_bytes=0, priority=2,tcp,metadata=0xe,nw_src=10.129.2.8,nw_dst=10.129.2.8,tp_dst=8080 actions=load:0x1->NXM_NX_XXREG0[102],ct(commit,table=20,zone=NXM_NX_REG12[0..15],nat(src=172.30.15.78))
 cookie=0x798a1ba2, duration=2309.325s, table=19, n_packets=0, n_bytes=0, priority=2,tcp,metadata=0x1,nw_src=10.129.2.8,nw_dst=10.129.2.8,tp_dst=8080 actions=load:0x1->NXM_NX_XXREG0[102],ct(commit,table=20,zone=NXM_NX_REG12[0..15],nat(src=172.30.15.78))
 cookie=0x78e28fb0, duration=2309.325s, table=19, n_packets=0, n_bytes=0, priority=2,tcp,metadata=0x2,nw_src=10.129.2.8,nw_dst=10.129.2.8,tp_dst=8080 actions=load:0x1->NXM_NX_XXREG0[102],ct(commit,table=20,zone=NXM_NX_REG12[0..15],nat(src=172.30.15.78))
 cookie=0xa339c137, duration=2309.325s, table=19, n_packets=0, n_bytes=0, priority=2,tcp,metadata=0x4,nw_src=10.129.2.8,nw_dst=10.129.2.8,tp_dst=8080 actions=load:0x1->NXM_NX_XXREG0[102],ct(commit,table=20,zone=NXM_NX_REG12[0..15],nat(src=172.30.15.78))
 cookie=0xa21e8056, duration=2309.328s, table=19, n_packets=0, n_bytes=0, priority=1,tcp,metadata=0x1,nw_src=10.129.2.8,nw_dst=172.30.15.78,tp_src=8080 actions=load:0x1->NXM_NX_XXREG0[102],ct(table=20,zone=NXM_NX_REG12[0..15],nat)
 cookie=0x3dbce638, duration=2309.328s, table=19, n_packets=0, n_bytes=0, priority=1,tcp,metadata=0x12,nw_src=10.129.2.8,nw_dst=172.30.15.78,tp_src=8080 actions=load:0x1->NXM_NX_XXREG0[102],ct(table=20,zone=NXM_NX_REG12[0..15],nat)
 cookie=0x383d1b02, duration=2309.328s, table=19, n_packets=0, n_bytes=0, priority=1,tcp,metadata=0x13,nw_src=10.129.2.8,nw_dst=172.30.15.78,tp_src=8080 actions=load:0x1->NXM_NX_XXREG0[102],ct(table=20,zone=NXM_NX_REG12[0..15],nat)
 cookie=0xbf8b5801, duration=2309.328s, table=19, n_packets=0, n_bytes=0, priority=1,tcp,metadata=0x2,nw_src=10.129.2.8,nw_dst=172.30.15.78,tp_src=8080 actions=load:0x1->NXM_NX_XXREG0[102],ct(table=20,zone=NXM_NX_REG12[0..15],nat)
 cookie=0x46d9b70e, duration=2309.328s, table=19, n_packets=0, n_bytes=0, priority=1,tcp,metadata=0xe,nw_src=10.129.2.8,nw_dst=172.30.15.78,tp_src=8080 actions=load:0x1->NXM_NX_XXREG0[102],ct(table=20,zone=NXM_NX_REG12[0..15],nat)
 cookie=0xecc33ac9, duration=2309.328s, table=19, n_packets=0, n_bytes=0, priority=1,tcp,metadata=0x4,nw_src=10.129.2.8,nw_dst=172.30.15.78,tp_src=8080 actions=load:0x1->NXM_NX_XXREG0[102],ct(table=20,zone=NXM_NX_REG12[0..15],nat)
 cookie=0xd5587ab3, duration=2312.766s, table=48, n_packets=0, n_bytes=0, priority=90,ip,reg15=0x8,metadata=0x13,dl_dst=9a:41:fc:81:02:09,nw_dst=10.129.2.8 actions=resubmit(,49)
sh-4.4# 



Version-Release number of selected component (if applicable): openvswitch2.13-2.13.0-39.el7fdp and ovn2.13-20.06.1-6.el7fdp


How reproducible: Always


Steps to Reproduce:
1. Bring up an OVNKubernetes OCP cluster on 4.5.5 with the above rpms
2. Run basic SDN tests

Actual results: Pod-to-pod communication across projects is failing


Expected results: The tests should pass


Additional info: I have a cluster available for debugging if anyone wants to take a look

Comment 1 Anurag saxena 2020-08-12 14:08:55 UTC
Bugzilla doesn't have FDP 20.F listed under Version.


Pasting the rpm versions from the cluster:

$ oc rsh -n openshift-ovn-kubernetes ovnkube-master-4ktht
Defaulting container name to northd.
Use 'oc describe pod/ovnkube-master-4ktht -n openshift-ovn-kubernetes' to see all of the containers in this pod.

sh-4.2# rpm -qa | grep -i openv
openvswitch-selinux-extra-policy-1.0-15.el7fdp.noarch
openvswitch2.13-devel-2.13.0-39.el7fdp.x86_64
openvswitch2.13-2.13.0-39.el7fdp.x86_64

sh-4.2# rpm -qa | grep -i ovn  
ovn2.13-20.06.1-6.el7fdp.x86_64
ovn2.13-host-20.06.1-6.el7fdp.x86_64
ovn2.13-central-20.06.1-6.el7fdp.x86_64
ovn2.13-vtep-20.06.1-6.el7fdp.x86_64

Comment 10 Anurag saxena 2020-08-12 18:43:49 UTC
So this issue seems to be related to the DBs not getting upgraded. I hope the steps to deploy the image in comment 8 are correct.

Comment 11 Numan Siddique 2020-08-13 12:48:59 UTC
The ovn-ctl script here [1] takes care of upgrading the clustered DBs.

****
 "$@" "$file"

    # Initialize the database if it's NOT joining a cluster.
    if test -z "$cluster_remote_addr"; then
        $(echo ovn-${db}ctl | tr _ -) --no-leader-only init
    fi

    if test $mode = cluster; then
        upgrade_cluster "$schema" "unix:$sock"
    fi

****

But the problem is that the cluster network operator starts the OVN DBs using the run_sb_ovsdb/run_nb_ovsdb commands,

and ovsdb-server, when started that way (via "$@" "$file"), doesn't daemonize and runs in the foreground. That's why the "upgrade_cluster"
function is never invoked.

For the raft setup, ovsdb-server has to be running in order to upgrade the DB to the new schema (which is not the case with a standalone DB).

I think the cluster network operator's ovnkube-master.yaml here [2] should take care of upgrading the cluster (a rough sketch of the upgrade step follows the links below).


The upgrade_cluster code is here [3]



[1] - https://github.com/ovn-org/ovn/blob/master/utilities/ovn-ctl#L299
[2] - https://github.com/openshift/cluster-network-operator/blob/master/bindata/network/ovn-kubernetes/ovnkube-master.yaml#L148
[3] - https://github.com/openvswitch/ovs/blob/master/utilities/ovs-lib.in#L461
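
For illustration only, a minimal sketch of what such an upgrade step could look like once the clustered ovsdb-server is up, loosely following what upgrade_cluster in ovs-lib does. The socket path, schema path, and the plain string comparison of versions are assumptions/simplifications, not the contents of the actual PR:

****
# Assumed paths; adjust to the actual image/deployment.
SB_SCHEMA=/usr/share/ovn/ovn-sb.ovsschema
SB_REMOTE=unix:/var/run/ovn/ovnsb_db.sock

# Compare the schema version stored in the running clustered DB with the
# schema version shipped by the newly installed ovn package.
db_ver=$(ovsdb-client get-schema-version "$SB_REMOTE" OVN_Southbound)
pkg_ver=$(ovsdb-tool schema-version "$SB_SCHEMA")

# If they differ, convert the online database to the new schema.
# For a clustered (raft) DB this has to be done while ovsdb-server is running.
if [ "$db_ver" != "$pkg_ver" ]; then
    ovsdb-client convert "$SB_REMOTE" "$SB_SCHEMA"
fi
****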

Comment 12 Numan Siddique 2020-08-13 14:25:58 UTC
I have submitted a PR to handle this in CNO - https://github.com/openshift/cluster-network-operator/pull/755

@Anurag - Is it possible to test this PR out?

Thanks
Numan

Comment 15 Anurag saxena 2020-08-13 20:06:48 UTC
Created attachment 1711386 [details]
log bundle PR 755

Comment 16 Numan Siddique 2020-08-19 16:55:46 UTC
I think we can move this BZ to CNO as it is not an OVN issue.

Comment 20 Jianlin Shi 2020-10-13 12:40:55 UTC
Reproduced on ovn2.13-20.06.1-6 with the following steps:

1. install ovn2.13.0-39
[root@wsfd-advnetlab16 bz1868392]# rpm -qa | grep -E "openvswitch|ovn"
kernel-kernel-networking-openvswitch-ovn-common-1.0-11.noarch
python3-openvswitch2.13-2.13.0-51.el7fdp.x86_64
ovn2.13-2.13.0-39.el7fdp.x86_64
kernel-kernel-networking-openvswitch-ovn-acl-1.0-19.noarch
openvswitch2.13-2.13.0-51.el7fdp.x86_64
ovn2.13-central-2.13.0-39.el7fdp.x86_64
openvswitch-selinux-extra-policy-1.0-15.el7fdp.noarch
ovn2.13-host-2.13.0-39.el7fdp.x86_64
2. rm db file
[root@wsfd-advnetlab16 bz1868392]# rm -f /etc/ovn/*
3. start run_sb_ovsdb
ctl_cmd="/usr/share/ovn/scripts/ovn-ctl"
ip_s=1.1.1.16
ip_c1=1.1.1.17
ip_c2=1.1.1.18

$ctl_cmd --db-sb-cluster-local-addr=$ip_s --db-sb-create-insecure-remote=yes --db-sb-cluster-local-port=6642 --db-sb-cluster-remote-proto=tcp --no-monitor run_sb_ovsdb
4. check chassis table in sb db file
[root@wsfd-advnetlab16 scripts]# ovsdb-client dump tcp:1.1.1.16:6642 Chassis
Chassis table
_uuid encaps external_ids hostname name nb_cfg transport_zones vtep_logical_switches
----- ------ ------------ -------- ---- ------ --------------- ---------------------

5. stop the script and upgrade ovn to 20.06.1-6
[root@wsfd-advnetlab16 bz1868392]# rpm -qa | grep -E "openvswitch|ovn"
kernel-kernel-networking-openvswitch-ovn-common-1.0-11.noarch
python3-openvswitch2.13-2.13.0-51.el7fdp.x86_64
ovn2.13-central-20.06.1-6.el7fdp.x86_64
kernel-kernel-networking-openvswitch-ovn-acl-1.0-19.noarch
openvswitch2.13-2.13.0-51.el7fdp.x86_64
ovn2.13-20.06.1-6.el7fdp.x86_64
ovn2.13-host-20.06.1-6.el7fdp.x86_64
openvswitch-selinux-extra-policy-1.0-15.el7fdp.noarch

6. start the script again
7. check chassis table:
[root@wsfd-advnetlab16 scripts]# ovsdb-client dump tcp:1.1.1.16:6642 Chassis
Chassis table
_uuid encaps external_ids hostname name nb_cfg transport_zones vtep_logical_switches
----- ------ ------------ -------- ---- ------ --------------- ---------------------

<==== The DB is not updated: the other_config column was added in 20.06.1-6, but it is not listed here.
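
(As an additional cross-check, not part of the original reproduction: the running schema version can also be compared directly against the schema file installed by the package; the schema path below is an assumption for the el7fdp builds.)

# Schema version reported by the running clustered SB DB
ovsdb-client get-schema-version tcp:1.1.1.16:6642 OVN_Southbound
# Schema version shipped with the installed ovn2.13 package (path assumed)
ovsdb-tool schema-version /usr/share/ovn/ovn-sb.ovsschema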

Verified on ovn20.09.0-2:

[root@wsfd-advnetlab16 bz1868392]# rpm -qa | grep -E "openvswitch|ovn"
kernel-kernel-networking-openvswitch-ovn-common-1.0-11.noarch
python3-openvswitch2.13-2.13.0-51.el7fdp.x86_64
kernel-kernel-networking-openvswitch-ovn-acl-1.0-19.noarch
openvswitch2.13-2.13.0-51.el7fdp.x86_64
ovn2.13-20.09.0-2.el7fdp.x86_64
ovn2.13-host-20.09.0-2.el7fdp.x86_64
openvswitch-selinux-extra-policy-1.0-15.el7fdp.noarch
ovn2.13-central-20.09.0-2.el7fdp.x86_64

[root@wsfd-advnetlab16 scripts]# ovsdb-client dump tcp:1.1.1.16:6642 Chassis
Chassis table
_uuid encaps external_ids hostname name nb_cfg other_config transport_zones vtep_logical_switches                                                                                            
----- ------ ------------ -------- ---- ------ ------------ --------------- ---------------------

<=== The DB is updated; other_config is listed here.

Comment 21 Jianlin Shi 2020-10-14 01:33:18 UTC
Verified on rhel8 version:

with ovn2.13.0-39 installed:

[root@wsfd-advnetlab18 ~]# ovsdb-client dump tcp:1.1.23.25:6642 Chassis
Chassis table
_uuid encaps external_ids hostname name nb_cfg transport_zones vtep_logical_switches
----- ------ ------------ -------- ---- ------ --------------- ---------------------

upgrade to ovn20.09.0-2:

[root@wsfd-advnetlab18 ~]# ovsdb-client dump tcp:1.1.23.25:6642 Chassis
Chassis table
_uuid encaps external_ids hostname name nb_cfg other_config transport_zones vtep_logical_switches
----- ------ ------------ -------- ---- ------ ------------ --------------- ---------------------

<== The DB is upgraded.

Comment 23 errata-xmlrpc 2020-10-27 09:49:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (ovn2.13 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4356