Description of problem:

Brought up an OCP 4.5.5 cluster with OVNKubernetes using the FDP 20.F rpms:
https://errata.devel.redhat.com/advisory/57163
https://errata.devel.redhat.com/advisory/57030

A possible reason is missing table=21 ARP entries in the br-int flows on the nodes for pods in other projects, which might be breaking networking across projects.

$ oc get pods -n test -o wide
NAME            READY   STATUS    RESTARTS   AGE   IP           NODE                                         NOMINATED NODE   READINESS GATES
test-rc-45876   1/1     Running   0          46m   10.129.2.7   ip-10-0-142-102.us-east-2.compute.internal   <none>           <none>
test-rc-pxnt2   1/1     Running   0          46m   10.128.2.7   ip-10-0-181-122.us-east-2.compute.internal   <none>           <none>

[anusaxen@anusaxen verification-tests]$ oc get pods -n test1 -o wide
NAME            READY   STATUS    RESTARTS   AGE   IP           NODE                                         NOMINATED NODE   READINESS GATES
test-rc-hfbdj   1/1     Running   0          36m   10.129.2.8   ip-10-0-142-102.us-east-2.compute.internal   <none>           <none>
test-rc-nf88l   1/1     Running   0          36m   10.128.2.8   ip-10-0-181-122.us-east-2.compute.internal   <none>           <none>

[anusaxen@anusaxen verification-tests]$ oc rsh -n test test-rc-45876
~ $ curl 10.129.2.8:8080
curl: (7) Failed to connect to 10.129.2.8 port 8080: Host is unreachable
~ $ exit

$ oc debug node/ip-10-0-142-102.us-east-2.compute.internal
Starting pod/ip-10-0-142-102us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.142.102
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# ovs-ofctl dump-flows br-int -O openflow13 | grep "10.129.2.8" | grep "table=21"
sh-4.4# ovs-ofctl dump-flows br-int -O openflow13 | grep "10.129.2.8"
cookie=0x4b6bf54, duration=2312.751s, table=9, n_packets=0, n_bytes=0, priority=90,ip,reg14=0x8,metadata=0x13,dl_src=9a:41:fc:81:02:09,nw_src=10.129.2.8 actions=resubmit(,10)
cookie=0x21ba1ead, duration=2312.751s, table=10, n_packets=0, n_bytes=0, priority=90,arp,reg14=0x8,metadata=0x13,dl_src=9a:41:fc:81:02:09,arp_spa=10.129.2.8,arp_sha=9a:41:fc:81:02:09 actions=resubmit(,11)
cookie=0x1e32982a, duration=2309.325s, table=19, n_packets=0, n_bytes=0, priority=2,tcp,metadata=0x13,nw_src=10.129.2.8,nw_dst=10.129.2.8,tp_dst=8080 actions=load:0x1->NXM_NX_XXREG0[102],ct(commit,table=20,zone=NXM_NX_REG12[0..15],nat(src=172.30.15.78))
cookie=0x8268653, duration=2309.325s, table=19, n_packets=0, n_bytes=0, priority=2,tcp,metadata=0x12,nw_src=10.129.2.8,nw_dst=10.129.2.8,tp_dst=8080 actions=load:0x1->NXM_NX_XXREG0[102],ct(commit,table=20,zone=NXM_NX_REG12[0..15],nat(src=172.30.15.78))
cookie=0xa0f80b53, duration=2309.325s, table=19, n_packets=0, n_bytes=0, priority=2,tcp,metadata=0xe,nw_src=10.129.2.8,nw_dst=10.129.2.8,tp_dst=8080 actions=load:0x1->NXM_NX_XXREG0[102],ct(commit,table=20,zone=NXM_NX_REG12[0..15],nat(src=172.30.15.78))
cookie=0x798a1ba2, duration=2309.325s, table=19, n_packets=0, n_bytes=0, priority=2,tcp,metadata=0x1,nw_src=10.129.2.8,nw_dst=10.129.2.8,tp_dst=8080 actions=load:0x1->NXM_NX_XXREG0[102],ct(commit,table=20,zone=NXM_NX_REG12[0..15],nat(src=172.30.15.78))
cookie=0x78e28fb0, duration=2309.325s, table=19, n_packets=0, n_bytes=0, priority=2,tcp,metadata=0x2,nw_src=10.129.2.8,nw_dst=10.129.2.8,tp_dst=8080 actions=load:0x1->NXM_NX_XXREG0[102],ct(commit,table=20,zone=NXM_NX_REG12[0..15],nat(src=172.30.15.78))
cookie=0xa339c137, duration=2309.325s, table=19, n_packets=0, n_bytes=0, priority=2,tcp,metadata=0x4,nw_src=10.129.2.8,nw_dst=10.129.2.8,tp_dst=8080 actions=load:0x1->NXM_NX_XXREG0[102],ct(commit,table=20,zone=NXM_NX_REG12[0..15],nat(src=172.30.15.78))
cookie=0xa21e8056, duration=2309.328s, table=19, n_packets=0, n_bytes=0, priority=1,tcp,metadata=0x1,nw_src=10.129.2.8,nw_dst=172.30.15.78,tp_src=8080 actions=load:0x1->NXM_NX_XXREG0[102],ct(table=20,zone=NXM_NX_REG12[0..15],nat)
cookie=0x3dbce638, duration=2309.328s, table=19, n_packets=0, n_bytes=0, priority=1,tcp,metadata=0x12,nw_src=10.129.2.8,nw_dst=172.30.15.78,tp_src=8080 actions=load:0x1->NXM_NX_XXREG0[102],ct(table=20,zone=NXM_NX_REG12[0..15],nat)
cookie=0x383d1b02, duration=2309.328s, table=19, n_packets=0, n_bytes=0, priority=1,tcp,metadata=0x13,nw_src=10.129.2.8,nw_dst=172.30.15.78,tp_src=8080 actions=load:0x1->NXM_NX_XXREG0[102],ct(table=20,zone=NXM_NX_REG12[0..15],nat)
cookie=0xbf8b5801, duration=2309.328s, table=19, n_packets=0, n_bytes=0, priority=1,tcp,metadata=0x2,nw_src=10.129.2.8,nw_dst=172.30.15.78,tp_src=8080 actions=load:0x1->NXM_NX_XXREG0[102],ct(table=20,zone=NXM_NX_REG12[0..15],nat)
cookie=0x46d9b70e, duration=2309.328s, table=19, n_packets=0, n_bytes=0, priority=1,tcp,metadata=0xe,nw_src=10.129.2.8,nw_dst=172.30.15.78,tp_src=8080 actions=load:0x1->NXM_NX_XXREG0[102],ct(table=20,zone=NXM_NX_REG12[0..15],nat)
cookie=0xecc33ac9, duration=2309.328s, table=19, n_packets=0, n_bytes=0, priority=1,tcp,metadata=0x4,nw_src=10.129.2.8,nw_dst=172.30.15.78,tp_src=8080 actions=load:0x1->NXM_NX_XXREG0[102],ct(table=20,zone=NXM_NX_REG12[0..15],nat)
cookie=0xd5587ab3, duration=2312.766s, table=48, n_packets=0, n_bytes=0, priority=90,ip,reg15=0x8,metadata=0x13,dl_dst=9a:41:fc:81:02:09,nw_dst=10.129.2.8 actions=resubmit(,49)
sh-4.4#

Version-Release number of selected component (if applicable):
openvswitch2.13-2.13.0-39.el7fdp and ovn2.13-20.06.1-6.el7fdp

How reproducible:
Always

Steps to Reproduce:
1. Bring up an OVNKubernetes OCP cluster on 4.5.5 with the above rpms
2. Run basic SDN tests

Actual results:
Communication across projects is failing

Expected results:
Tests should work fine

Additional info:
I have a cluster available to debug if anyone wants to take a look.
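For reference, a quick way to see which pod IPs are missing table=21 entries is to loop over the pod IPs from the listings above and grep the br-int flows for each one. This is only an illustrative helper (the IP list is copied from the two `oc get pods` outputs), run on the node after `chroot /host`:

    # Illustrative check: IPs copied from the pod listings above.
    for ip in 10.129.2.7 10.128.2.7 10.129.2.8 10.128.2.8; do
        if ovs-ofctl dump-flows br-int -O openflow13 | grep "table=21" | grep -q "$ip"; then
            echo "$ip: table=21 flow present"
        else
            echo "$ip: table=21 flow MISSING"
        fi
    done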
Bugzilla doesn't have FDP 20.F listed under Version. Pasting the rpm versions from the cluster:

$ oc rsh -n openshift-ovn-kubernetes ovnkube-master-4ktht
Defaulting container name to northd.
Use 'oc describe pod/ovnkube-master-4ktht -n openshift-ovn-kubernetes' to see all of the containers in this pod.
sh-4.2# rpm -qa | grep -i openv
openvswitch-selinux-extra-policy-1.0-15.el7fdp.noarch
openvswitch2.13-devel-2.13.0-39.el7fdp.x86_64
openvswitch2.13-2.13.0-39.el7fdp.x86_64
sh-4.2# rpm -qa | grep -i ovn
ovn2.13-20.06.1-6.el7fdp.x86_64
ovn2.13-host-20.06.1-6.el7fdp.x86_64
ovn2.13-central-20.06.1-6.el7fdp.x86_64
ovn2.13-vtep-20.06.1-6.el7fdp.x86_64
So this issue seems to be related to why the DBs are not getting upgraded, and I hope the steps to deploy the image in comment 8 are correct.
The ovn-ctl script here [1] takes care of upgrading the cluster DBs:

    "$@" "$file"

    # Initialize the database if it's NOT joining a cluster.
    if test -z "$cluster_remote_addr"; then
        $(echo ovn-${db}ctl | tr _ -) --no-leader-only init
    fi

    if test $mode = cluster; then
        upgrade_cluster "$schema" "unix:$sock"
    fi

The problem is that the cluster network operator starts the OVN DBs with the run_sb_ovsdb/run_nb_ovsdb commands, and ovsdb-server, when started that way (via "$@" "$file"), doesn't daemonize and runs in the foreground. That's why the upgrade_cluster function is never invoked. For a raft setup, ovsdb-server has to be running in order to upgrade the DB to the new schema (which is not the case with a standalone DB). I think the cluster network operator's ovnkube-master.yaml here [2] should take care of upgrading the cluster. The upgrade_cluster code is here [3].

[1] - https://github.com/ovn-org/ovn/blob/master/utilities/ovn-ctl#L299
[2] - https://github.com/openshift/cluster-network-operator/blob/master/bindata/network/ovn-kubernetes/ovnkube-master.yaml#L148
[3] - https://github.com/openvswitch/ovs/blob/master/utilities/ovs-lib.in#L461
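For illustration only, a minimal sketch of what a wrapper could do to get the clustered SB DB converted: start run_sb_ovsdb in the background and then run the same needs-conversion/convert steps that upgrade_cluster performs. The schema path, socket path and LOCAL_IP variable are assumptions for this sketch, not what CNO actually ships:

    #!/bin/bash
    # Sketch only. Assumed paths: SB schema and local unix socket; adjust for
    # the actual deployment. LOCAL_IP is a placeholder for the node's address.
    SCHEMA=/usr/share/ovn/ovn-sb.ovsschema
    SERVER=unix:/var/run/ovn/ovnsb_db.sock
    DB=$(ovsdb-tool schema-name "$SCHEMA")

    # run_sb_ovsdb keeps ovsdb-server in the foreground, so background it here
    # so that the conversion step below can run against the live server.
    /usr/share/ovn/scripts/ovn-ctl --db-sb-cluster-local-addr="$LOCAL_IP" \
        --no-monitor run_sb_ovsdb &

    # Wait for the DB to come up, then convert it if the schema changed
    # (this mirrors what upgrade_cluster in ovs-lib does).
    ovsdb-client -t 30 wait "$SERVER" "$DB" connected
    if [ "$(ovsdb-client -t 10 needs-conversion "$SERVER" "$SCHEMA")" = yes ]; then
        ovsdb-client -t 30 convert "$SERVER" "$SCHEMA"
    fi
    wait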
I have submitted a PR to handle this in CNO - https://github.com/openshift/cluster-network-operator/pull/755

@Anurag - Is it possible to test this PR out?

Thanks
Numan
Created attachment 1711386 [details]
log bundle PR 755
I think we can move this BZ to CNO as it is not an OVN issue.
Reproduced on ovn 20.06.1-6 with the following steps:

1. Install ovn 2.13.0-39:

[root@wsfd-advnetlab16 bz1868392]# rpm -qa | grep -E "openvswitch|ovn"
kernel-kernel-networking-openvswitch-ovn-common-1.0-11.noarch
python3-openvswitch2.13-2.13.0-51.el7fdp.x86_64
ovn2.13-2.13.0-39.el7fdp.x86_64
kernel-kernel-networking-openvswitch-ovn-acl-1.0-19.noarch
openvswitch2.13-2.13.0-51.el7fdp.x86_64
ovn2.13-central-2.13.0-39.el7fdp.x86_64
openvswitch-selinux-extra-policy-1.0-15.el7fdp.noarch
ovn2.13-host-2.13.0-39.el7fdp.x86_64

2. Remove the db files:

[root@wsfd-advnetlab16 bz1868392]# rm -f /etc/ovn/*

3. Start run_sb_ovsdb:

ctl_cmd="/usr/share/ovn/scripts/ovn-ctl"
ip_s=1.1.1.16
ip_c1=1.1.1.17
ip_c2=1.1.1.18
$ctl_cmd --db-sb-cluster-local-addr=$ip_s --db-sb-create-insecure-remote=yes --db-sb-cluster-local-port=6642 --db-sb-cluster-remote-proto=tcp --no-monitor run_sb_ovsdb

4. Check the Chassis table in the SB db:

[root@wsfd-advnetlab16 scripts]# ovsdb-client dump tcp:1.1.1.16:6642 Chassis
Chassis table
_uuid encaps external_ids hostname name nb_cfg transport_zones vtep_logical_switches
----- ------ ------------ -------- ---- ------ --------------- ---------------------

5. Stop the script and upgrade ovn to 20.06.1-6:

[root@wsfd-advnetlab16 bz1868392]# rpm -qa | grep -E "openvswitch|ovn"
kernel-kernel-networking-openvswitch-ovn-common-1.0-11.noarch
python3-openvswitch2.13-2.13.0-51.el7fdp.x86_64
ovn2.13-central-20.06.1-6.el7fdp.x86_64
kernel-kernel-networking-openvswitch-ovn-acl-1.0-19.noarch
openvswitch2.13-2.13.0-51.el7fdp.x86_64
ovn2.13-20.06.1-6.el7fdp.x86_64
ovn2.13-host-20.06.1-6.el7fdp.x86_64
openvswitch-selinux-extra-policy-1.0-15.el7fdp.noarch

6. Start the script again.

7. Check the Chassis table:

[root@wsfd-advnetlab16 scripts]# ovsdb-client dump tcp:1.1.1.16:6642 Chassis
Chassis table
_uuid encaps external_ids hostname name nb_cfg transport_zones vtep_logical_switches
----- ------ ------------ -------- ---- ------ --------------- ---------------------
<==== The db is not upgraded: other_config was added in 20.06.1-6, but it is not listed here.

Verified on ovn 20.09.0-2:

[root@wsfd-advnetlab16 bz1868392]# rpm -qa | grep -E "openvswitch|ovn"
kernel-kernel-networking-openvswitch-ovn-common-1.0-11.noarch
python3-openvswitch2.13-2.13.0-51.el7fdp.x86_64
kernel-kernel-networking-openvswitch-ovn-acl-1.0-19.noarch
openvswitch2.13-2.13.0-51.el7fdp.x86_64
ovn2.13-20.09.0-2.el7fdp.x86_64
ovn2.13-host-20.09.0-2.el7fdp.x86_64
openvswitch-selinux-extra-policy-1.0-15.el7fdp.noarch
ovn2.13-central-20.09.0-2.el7fdp.x86_64

[root@wsfd-advnetlab16 scripts]# ovsdb-client dump tcp:1.1.1.16:6642 Chassis
Chassis table
_uuid encaps external_ids hostname name nb_cfg other_config transport_zones vtep_logical_switches
----- ------ ------------ -------- ---- ------ ------------ --------------- ---------------------
<=== The db is upgraded: other_config is listed here.
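As a side note, the same check can be scripted against the remote used above; a small sketch that just looks for the other_config column on the Chassis table (the endpoint is copied from the reproduction above):

    # If other_config is missing from the Chassis columns, the SB DB was not
    # converted to the new schema.
    SB=tcp:1.1.1.16:6642
    if ovsdb-client list-columns "$SB" OVN_Southbound Chassis | grep -q other_config; then
        echo "SB DB upgraded: Chassis.other_config present"
    else
        echo "SB DB not upgraded: Chassis.other_config missing"
    fi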
Verified on the RHEL 8 version.

With ovn 2.13.0-39 installed:

[root@wsfd-advnetlab18 ~]# ovsdb-client dump tcp:1.1.23.25:6642 Chassis
Chassis table
_uuid encaps external_ids hostname name nb_cfg transport_zones vtep_logical_switches
----- ------ ------------ -------- ---- ------ --------------- ---------------------

After upgrading to ovn 20.09.0-2:

[root@wsfd-advnetlab18 ~]# ovsdb-client dump tcp:1.1.23.25:6642 Chassis
Chassis table
_uuid encaps external_ids hostname name nb_cfg other_config transport_zones vtep_logical_switches
----- ------ ------------ -------- ---- ------ ------------ --------------- ---------------------
<== The db is upgraded: other_config is present.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (ovn2.13 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4356