Description of problem: OVN uses the OVS.Interface.external-ids:ovn-installed attribute to notify the CMS that an OVS port has been bound to an OVN port and that all required OVS flows have been installed. However, if the ovsdb transaction that sets this attribute in the local conf.db fails, ovn-controller doesn't retry it. The transaction can fail, especially at scale, and ovn-controller should be resilient enough to handle that.
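For context, the CMS side typically blocks on this marker before declaring the pod interface ready. A minimal sketch, assuming a hypothetical port name and a 30-second timeout (not necessarily what ovn-kubernetes actually uses):

```shell
# Minimal sketch: block until ovn-controller marks the port as bound, or
# fail after 30 seconds. The port name and timeout are illustrative.
port_bound() {
    ovs-vsctl --timeout=30 wait-until Interface "$1" \
        external_ids:ovn-installed='"true"'
}
```

If the transaction that sets ovn-installed fails and is never retried, a wait like this times out even though the flows may already be in place, which is exactly the failure mode described above.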
Fix sent for review: http://patchwork.ozlabs.org/project/ovn/patch/20210423141752.15080.58931.stgit@dceara.remote.csb/
V2 patch: http://patchwork.ozlabs.org/project/ovn/list/?series=242485&state=*
It looks like this fix doesn't entirely solve the problem of ovn-installed being reported before the flows are installed. When I test with this fix, I run a script that checks every 0.5 seconds whether ovn-installed has been added, as well as the flows in table 8, during a pod create. I see this:

Wed May 12 14:39:56 UTC 2021
external_ids : {attached_mac="0a:58:0a:97:0d:3d", iface-id=openshift-authentication_trozet1, ip_addresses="10.151.13.61/22", ovn-installed="true", sandbox="87a49511bcad42f70c952f6a67e386a58b270b60250b546d0cdd1e40e44ece75"}

Wed May 12 14:40:22 UTC 2021
cookie=0xfb844538, duration=0.135s, table=8, n_packets=0, n_bytes=0, idle_age=0, priority=50,reg14=0x13c,metadata=0x264,dl_src=0a:58:0a:97:0d:3d actions=resubmit(,9)

We can see the flow was installed much later (26 seconds or so) than when ovn-installed was added for the pod.
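A watcher like the one described above can be sketched as follows; the bridge name (br-int), the port name, and the pod MAC below are illustrative assumptions, not taken from the cluster:

```shell
# Hypothetical sketch of the 0.5s watcher described above. The bridge name
# (br-int), port name, and pod MAC are assumptions for illustration.

wait_for_ovn_installed() {
    # Poll until ovn-controller has set external-ids:ovn-installed on the port.
    local port=$1
    until ovs-vsctl get Interface "$port" external_ids:ovn-installed >/dev/null 2>&1; do
        sleep 0.5
    done
    echo "$(date -u) ovn-installed set on $port"
}

wait_for_table8_flow() {
    # Poll until the pod's table=8 source-MAC flow shows up in br-int.
    local mac=$1
    until ovs-ofctl dump-flows br-int "table=8,dl_src=$mac" | grep -q resubmit; do
        sleep 0.5
    done
    echo "$(date -u) table=8 flow present for $mac"
}
```

Comparing the two timestamps, e.g. `wait_for_ovn_installed <ovs-port>` followed by `wait_for_table8_flow <pod-mac>`, shows the gap between ovn-installed being set and the flow actually being installed.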
(In reply to Tim Rozet from comment #5)
> It looks like this fix doesn't entirely fix the problem of ovn-installed
> being reported before the flows are installed. [...]
> we can see the flow was installed much later (26 seconds or so) than when
> ovn-installed was added to a pod.

Based on https://bugzilla.redhat.com/show_bug.cgi?id=1959200#c4, this is a different issue, which I don't think we can fix in OVN itself. AFAICT, the only option is to ensure that the CMS doesn't reuse logical port names.
Dumitru Ceara said this bug is very hard to reproduce and suggested doing a sanity test, so I only ran the regression test. Setting verified as sanity-only.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (ovn2.13 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:2507
We are still seeing this with the latest 4.9 nightly compose.

kube-apiserver 4.9.0-0.nightly-2021-06-21-191858 True True True 13h

InstallerPodContainerWaitingDegraded: Pod "installer-9-ip-10-0-161-94.us-west-2.compute.internal" on node "ip-10-0-161-94.us-west-2.compute.internal" container "installer" is waiting since 2021-06-23 08:11:54 +0000 UTC because ContainerCreating

InstallerPodNetworkingDegraded: Pod "installer-9-ip-10-0-161-94.us-west-2.compute.internal" on node "ip-10-0-161-94.us-west-2.compute.internal" observed degraded networking: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_installer-9-ip-10-0-161-94.us-west-2.compute.internal_openshift-kube-apiserver_39a7beab-7f9b-4f21-b2a9-9d2e302f7998_0(77e08343ec87696849117f1313ae37f8902f86c8bcc9080945c78c9feed02172): [openshift-kube-apiserver/installer-9-ip-10-0-161-94.us-west-2.compute.internal:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[openshift-kube-apiserver/installer-9-ip-10-0-161-94.us-west-2.compute.internal 77e08343ec87696849117f1313ae37f8902f86c8bcc9080945c78c9feed02172] [openshift-kube-apiserver/installer-9-ip-10-0-161-94.us-west-2.compute.internal 77e08343ec87696849117f1313ae37f8902f86c8bcc9080945c78c9feed02172] failed to configure pod interface: error while waiting on OVS.Interface.external-ids:ovn-installed for pod: timed out while waiting for OVS port binding InstallerPodNetworkingDegraded: '

OCP Version: 4.9.0-0.nightly-2021-06-21-191858

OVS bits:
openvswitch2.15-2.15.0-9.el8fdp.x86_64
openvswitch2.15-devel-2.15.0-9.el8fdp.x86_64
ovn2.13-20.12.0-140.el8fdp.x86_64
ovn2.13-host-20.12.0-140.el8fdp.x86_64
openvswitch-selinux-extra-policy-1.0-28.el8fdp.noarch
python3-openvswitch2.15-2.15.0-9.el8fdp.x86_64
openvswitch2.15-ipsec-2.15.0-9.el8fdp.x86_64
ovn2.13-central-20.12.0-140.el8fdp.x86_64
ovn2.13-vtep-20.12.0-140.el8fdp.x86_64
(In reply to Joe Talerico from comment #13)
> We are still seeing this with the latest 4.9 nightly compose. [...]
> failed to configure pod interface: error while waiting on
> OVS.Interface.external-ids:ovn-installed for pod: timed out while waiting
> for OVS port binding

Per our discussion on Slack, we have bug 1959200 tracking the ovn-kubernetes issue.