Bug 1952846

Summary: [ovn-controller] OVS.Interface.external-ids:ovn-installed is not set if original OVS TXN failed.
Product: Red Hat Enterprise Linux Fast Datapath
Component: ovn2.13
Version: FDP 20.H
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: Dumitru Ceara <dceara>
Assignee: Dumitru Ceara <dceara>
QA Contact: ying xu <yinxu>
CC: ctrautma, jishi, jtaleric, ralongi, trozet
Whiteboard: perfscale-ovn
Fixed In Version: ovn2.13-20.12.0-135
Type: Bug
Last Closed: 2021-06-21 14:44:39 UTC
Bug Blocks: 1959200

Description Dumitru Ceara 2021-04-23 11:20:35 UTC
Description of problem:

OVN uses the OVS.Interface.external-ids:ovn-installed attribute to notify the CMS that an OVS port has been bound to an OVN port and that all required OVS flows have been installed.

However, if the ovsdb transaction that sets this attribute in the local conf.db fails, ovn-controller doesn't retry it.

The transaction can fail, especially at scale, and ovn-controller should be resilient enough to handle it.
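
For context, a minimal sketch of how a CMS (ovn-kubernetes, for example) can wait on this attribute; the interface name and timeout below are illustrative assumptions, not taken from this report:

  # Wait for ovn-controller to mark the port as fully installed
  # ("veth-pod1" and the 30-second timeout are assumptions).
  ovs-vsctl --timeout=30 wait-until Interface veth-pod1 \
      external_ids:ovn-installed=true
  # If the transaction that sets ovn-installed fails and ovn-controller
  # never retries it, this wait times out even though the port binding and
  # OVS flows may otherwise be fine.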

Comment 3 Dumitru Ceara 2021-05-06 15:36:51 UTC
V2 patch:
http://patchwork.ozlabs.org/project/ovn/list/?series=242485&state=*

Comment 5 Tim Rozet 2021-05-12 16:03:23 UTC
It looks like this fix doesn't entirely solve the problem of ovn-installed being reported before the flows are installed. When I test with this fix, I run a script that checks every 0.5 seconds whether ovn-installed has been added, as well as the flows in table 8, during a pod create. I see this:


Wed May 12 14:39:56 UTC 2021 external_ids        : {attached_mac="0a:58:0a:97:0d:3d", iface-id=openshift-authentication_trozet1, ip_addresses="10.151.13.61/22", ovn-installed="true", sandbox="87a49511bcad42f70c952f6a67e386a58b270b60250b546d0cdd1e40e44ece75"}


Wed May 12 14:40:22 UTC 2021 cookie=0xfb844538, duration=0.135s, table=8, n_packets=0, n_bytes=0, idle_age=0, priority=50,reg14=0x13c,metadata=0x264,dl_src=0a:58:0a:97:0d:3d actions=resubmit(,9)

We can see that the flow was installed much later (26 seconds or so) than when ovn-installed was added for the pod.
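
For reference, a rough sketch of that kind of polling check (the interface name and the br-int bridge are assumptions for illustration; the actual script is not included here):

  while true; do
      date -u
      # Has ovn-controller set ovn-installed on the pod's interface yet?
      ovs-vsctl get Interface veth-pod1 external_ids
      # Is the expected table 8 (logical ingress) flow for the pod's MAC present?
      ovs-ofctl dump-flows br-int "table=8,dl_src=0a:58:0a:97:0d:3d"
      sleep 0.5
  done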

Comment 6 Dumitru Ceara 2021-05-17 07:56:25 UTC
(In reply to Tim Rozet from comment #5)
> It looks like this fix doesn't entirely fix the problem of ovn-installed
> being reported before the flows are installed. [...]
> we can see the flow was installed much later (26 seconds or so) than when
> ovn-installed was added to a pod.

Based on https://bugzilla.redhat.com/show_bug.cgi?id=1959200#c4, this is a
different issue, which I don't think we can fix in OVN itself.  AFAICT,
the only option is to ensure that the CMS doesn't reuse logical port names.

Comment 10 ying xu 2021-06-04 10:34:59 UTC
Dumitru Ceara said this bug is very hard to reproduce; he suggested doing a sanity test.

So I just did the regression test.

Setting verified as sanity-only.

Comment 12 errata-xmlrpc 2021-06-21 14:44:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (ovn2.13 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2507

Comment 13 Joe Talerico 2021-06-23 10:13:53 UTC
We are still seeing this with the latest 4.9 nightly compose.

kube-apiserver            4.9.0-0.nightly-2021-06-21-191858   True        True          True       13h     InstallerPodContainerWaitingDegraded: Pod "installer-9-ip-10-0-161-94.us-west-2.compute.internal" on node "ip-10-0-161-94.us-west-2.compute.internal" container "installer" is waiting since 2021-06-23 08:11:54 +0000 UTC because ContainerCreating
InstallerPodNetworkingDegraded: Pod "installer-9-ip-10-0-161-94.us-west-2.compute.internal" on node "ip-10-0-161-94.us-west-2.compute.internal" observed degraded networking: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_installer-9-ip-10-0-161-94.us-west-2.compute.internal_openshift-kube-apiserver_39a7beab-7f9b-4f21-b2a9-9d2e302f7998_0(77e08343ec87696849117f1313ae37f8902f86c8bcc9080945c78c9feed02172): [openshift-kube-apiserver/installer-9-ip-10-0-161-94.us-west-2.compute.internal:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[openshift-kube-apiserver/installer-9-ip-10-0-161-94.us-west-2.compute.internal 77e08343ec87696849117f1313ae37f8902f86c8bcc9080945c78c9feed02172] [openshift-kube-apiserver/installer-9-ip-10-0-161-94.us-west-2.compute.internal 77e08343ec87696849117f1313ae37f8902f86c8bcc9080945c78c9feed02172] failed to configure pod interface: error while waiting on OVS.Interface.external-ids:ovn-installed for pod: timed out while waiting for OVS port binding
InstallerPodNetworkingDegraded: '

OCP Version 4.9.0-0.nightly-2021-06-21-191858

OVS bits:
openvswitch2.15-2.15.0-9.el8fdp.x86_64
openvswitch2.15-devel-2.15.0-9.el8fdp.x86_64
ovn2.13-20.12.0-140.el8fdp.x86_64
ovn2.13-host-20.12.0-140.el8fdp.x86_64
openvswitch-selinux-extra-policy-1.0-28.el8fdp.noarch
python3-openvswitch2.15-2.15.0-9.el8fdp.x86_64
openvswitch2.15-ipsec-2.15.0-9.el8fdp.x86_64
ovn2.13-central-20.12.0-140.el8fdp.x86_64
ovn2.13-vtep-20.12.0-140.el8fdp.x86_64
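
When a pod hits this timeout, one way to check on the node whether ovn-installed was ever set for its interface is sketched below; the iface-id value is a guess at the ovn-kubernetes naming scheme (<namespace>_<pod-name>) and is not taken from the log above:

  # Locate the OVS interface for the affected pod by its iface-id
  # (replace the placeholder with the real namespace/pod name).
  ovs-vsctl --columns=name,external_ids find Interface \
      external_ids:iface-id=<namespace>_<pod-name>
  # If ovn-installed is missing from external_ids, ovn-controller never
  # confirmed the flows for this port on this node.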

Comment 14 Dumitru Ceara 2021-06-23 11:43:37 UTC
(In reply to Joe Talerico from comment #13)
> We are still seeing this with the latest 4.9 nightly compose.
> [...]
> OCP Version 4.9.0-0.nightly-2021-06-21-191858
> [...]

Per our discussion on Slack, we have bug 1959200 tracking the ovn-kubernetes issue.