Created attachment 1823587 [details]
Logs snippets with failures.

Description of problem:

Cluster deploy failed due to OVN timing out waiting on a condition. The failure manifests itself as failed PodSandbox creation:

InstallerPodContainerWaitingDegraded: Pod "installer-7-cluster-bootstrap-fxks4-master-0" on node "cluster-bootstrap-fxks4-master-0" container "installer" is waiting since 2021-09-16 09:48:10 +0000 UTC because ContainerCreating
InstallerPodNetworkingDegraded: Pod "installer-7-cluster-bootstrap-fxks4-master-0" on node "cluster-bootstrap-fxks4-master-0" observed degraded networking: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_installer-7-cluster-bootstrap-fxks4-master-0_openshift-kube-controller-manager_d4e9df4c-dfab-4bc5-9d89-f96f23b2d299_0(627b8dcbc820e99414ba2ac5ccb964f7bad7df038d3094a8f3f535a3021a468f): error adding pod openshift-kube-controller-manager_installer-7-cluster-bootstrap-fxks4-master-0 to CNI network "multus-cni-network": [openshift-kube-controller-manager/installer-7-cluster-bootstrap-fxks4-master-0:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[openshift-kube-controller-manager/installer-7-cluster-bootstrap-fxks4-master-0 627b8dcbc820e99414ba2ac5ccb964f7bad7df038d3094a8f3f535a3021a468f] [openshift-kube-controller-manager/installer-7-cluster-bootstrap-fxks4-master-0 627b8dcbc820e99414ba2ac5ccb964f7bad7df038d3094a8f3f535a3021a468f] failed to get pod annotation: timed out waiting for annotations

Further digging shows failures in ovn-kube:

2021-09-16T09:44:30.467004793+00:00 stderr F I0916 09:44:30.466963 1 pods.go:302] [openshift-multus/network-metrics-daemon-cr5x7] addLogicalPort took 6.913889ms
2021-09-16T09:44:30.499490783+00:00 stderr F I0916 09:44:30.499391 1 pods.go:302] [openshift-kube-scheduler/installer-5-cluster-bootstrap-fxks4-master-0] addLogicalPort took 30.000973158s
2021-09-16T09:44:30.499490783+00:00 stderr F E0916 09:44:30.499448 1 ovn.go:481] timed out waiting for logical switch "cluster-bootstrap-fxks4-master-0" subnet: timed out waiting for the condition
2021-09-16T09:44:30.499734090+00:00 stderr F I0916 09:44:30.499671 1 pods.go:338] LSP already exists for port: openshift-oauth-apiserver_apiserver-7cdd8b65df-w2mf8
2021-09-16T09:44:30.500019797+00:00 stderr F I0916 09:44:30.499930 1 event.go:282] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-kube-scheduler", Name:"installer-5-cluster-bootstrap-fxks4-master-0", UID:"f8788e92-fcf4-4fab-a50b-52e370e651d6", APIVersion:"v1", ResourceVersion:"11620", FieldPath:""}): type: 'Warning' reason: 'ErrorAddingLogicalPort' timed out waiting for logical switch "cluster-bootstrap-fxks4-master-0" subnet: timed out waiting for the condition

The issue manifests itself randomly during ARO cluster deploys. We tried prolonging the timeouts while waiting for the cluster to go ready, without a positive result. The issue also manifests itself within small time windows. I am attaching logs with multiple occurrences. The first lines come from the operator; the rest of the logs come from ovn-kube on master-0.
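The "timed out waiting for the condition" text in the logs is the generic message a controller's polling loop emits when it gives up (the real code is Go, built on the apimachinery wait helpers). A minimal Python sketch of that pattern, with illustrative names, just to show why the error carries no detail about *which* condition failed:

```python
import time

def wait_for_condition(check, timeout_s=30.0, interval_s=0.5):
    """Poll check() until it returns a truthy value or timeout_s elapses.

    Mirrors the shape of a wait.PollImmediate-style loop: the condition is
    evaluated immediately, then retried every interval_s, and on expiry a
    generic 'timed out waiting for the condition' error is raised.
    This is an illustrative sketch, not the actual ovn-kubernetes code.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        result = check()
        if result:
            return result
        time.sleep(interval_s)
    raise TimeoutError("timed out waiting for the condition")
```

In the logs above, the condition being polled is "the logical switch for this node has a subnet"; when the subnet is never assigned, the loop expires after its 30-second budget (note the `addLogicalPort took 30.000973158s` line) and every pod on the node fails the same way.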
@pkotas: The error in itself means that the pod subnet for that node is not getting assigned, so it's expected that the logical switch can't get added and pods can't get created, since there is no IP pool. The question is: what's different about ARO that keeps the cluster subnet from being added as the host subnet for the node? I'd need a full must-gather from when the error occurs, plus, if possible, a kubeconfig or access to the cluster to debug this. Cheers, Surya.
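For anyone hitting this: ovn-kubernetes records the assigned per-node pod subnet in the node's `k8s.ovn.org/node-subnets` annotation, so a quick way to confirm "the host subnet was never assigned" is to check that annotation on the affected node (e.g. from `oc get node <name> -o json`). A small sketch that extracts it from a node object; the helper name and sample data are made up for illustration:

```python
import json

def node_subnets(node):
    """Return the parsed k8s.ovn.org/node-subnets annotation of a node
    object (as a dict decoded from `oc get node <name> -o json`), or None
    if the subnet was never assigned -- the failure mode in this bug.

    The annotation key is the real ovn-kubernetes one; the exact value
    format can vary between versions (e.g. {"default": "10.128.0.0/23"}).
    """
    annotations = node.get("metadata", {}).get("annotations", {})
    raw = annotations.get("k8s.ovn.org/node-subnets")
    return json.loads(raw) if raw else None
```

On a healthy node this returns the subnet the logical switch is built from; on a node showing the error above it returns None, which lines up with the "no IP pool" explanation.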
@surya I'm trying to gather the information requested above. We were seeing pretty high failure rates on installation on 4.8.10. We recently bumped our ARO default installation to 4.8.11 and I haven't seen a failure yet in about 10 attempts. Wondering if it could be related to https://bugzilla.redhat.com/show_bug.cgi?id=1999895, as that's a similar error we saw, and I haven't hit it on install of 4.8.11 yet. I'm dropping our install version back to 4.8.10 and will get you the info requested once I have reproduced the issue.
Hey Benjamin, thanks for letting me know; yeah, that change could totally be it. You are right. If you haven't seen it on 4.8.11 and only see it on 4.8.10, let me quickly pull a diff of the two versions and see if we have that revert commit in 4.8.11 and not in 4.8.10.
@surya - were you able to pull a diff of the two versions and see if we have that revert commit in there for 4.8.11 and not for 4.8.10?
Hi dramseur, yes, sorry for not providing an update. I checked, and it's indeed that commit that's causing this issue on 4.8.10. The revert commit is in 4.8.11: https://github.com/openshift/ovn-kubernetes/commit/8eb74914e7425f544504baa8357144dc21deac3d (tip of 4.8.11) but not in 4.8.10: https://github.com/openshift/ovn-kubernetes/commit/5ee76f3d1635658339433f9937714b3e44546591 (tip of 4.8.10). Shall I close this bug? Cheers, Surya.
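For anyone wanting to repeat this kind of check locally: `git merge-base --is-ancestor` tells you whether a given commit is contained in a branch tip. A small hedged wrapper (the helper name is illustrative; run it against a local clone of openshift/ovn-kubernetes with the SHAs above):

```python
import subprocess

def branch_contains(repo_dir, commit, branch_tip):
    """Return True if `commit` is an ancestor of `branch_tip`, i.e. the
    branch containing `branch_tip` includes that commit.

    Illustrative helper: wraps `git merge-base --is-ancestor`, which exits
    0 when the first commit is an ancestor of the second and 1 otherwise.
    """
    result = subprocess.run(
        ["git", "merge-base", "--is-ancestor", commit, branch_tip],
        cwd=repo_dir,
    )
    return result.returncode == 0
```

So checking whether the revert commit is in each release reduces to calling this once per release-branch tip.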
@surya I think we're good to close this. Seems like it was fixed in 4.8.11 and I can't reproduce it now.
Great, thanks for confirming, Benjamin. Closing this.
*** This bug has been marked as a duplicate of bug 1999895 ***