Bug 2004998

Summary: OVN timed out waiting for logical switch on subnet: timed out waiting for the condition
Product: OpenShift Container Platform
Reporter: Petr Kotas <pkotas>
Component: Networking
Assignee: Surya Seetharaman <surya>
Networking sub component: ovn-kubernetes
QA Contact: Anurag saxena <anusaxen>
Status: CLOSED DUPLICATE
Docs Contact:
Severity: high
Priority: unspecified
CC: bvesel, cmarches, dramseur, ljakubow, mradchuk, pkotas, rogbas, surya, trwest, vpickard
Version: 4.8
Keywords: ServiceDeliveryImpact
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-10-04 12:38:16 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Attachments
Description: Logs snippets with failures.
Flags: none

Description Petr Kotas 2021-09-16 14:30:57 UTC
Created attachment 1823587 [details]
Logs snippets with failures.

Description of problem:

Cluster deployment failed because OVN timed out waiting for a condition.
The failure manifests itself as a failed PodSandbox creation:

InstallerPodContainerWaitingDegraded: Pod "installer-7-cluster-bootstrap-fxks4-master-0" on node "cluster-bootstrap-fxks4-master-0" container "installer" is waiting since 2021-09-16 09:48:10 +0000 UTC because ContainerCreating
InstallerPodNetworkingDegraded: Pod "installer-7-cluster-bootstrap-fxks4-master-0" on node "cluster-bootstrap-fxks4-master-0" observed degraded networking: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_installer-7-cluster-bootstrap-fxks4-master-0_openshift-kube-controller-manager_d4e9df4c-dfab-4bc5-9d89-f96f23b2d299_0(627b8dcbc820e99414ba2ac5ccb964f7bad7df038d3094a8f3f535a3021a468f): error adding pod openshift-kube-controller-manager_installer-7-cluster-bootstrap-fxks4-master-0 to CNI network "multus-cni-network": [openshift-kube-controller-manager/installer-7-cluster-bootstrap-fxks4-master-0:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[openshift-kube-controller-manager/installer-7-cluster-bootstrap-fxks4-master-0 627b8dcbc820e99414ba2ac5ccb964f7bad7df038d3094a8f3f535a3021a468f] [openshift-kube-controller-manager/installer-7-cluster-bootstrap-fxks4-master-0 627b8dcbc820e99414ba2ac5ccb964f7bad7df038d3094a8f3f535a3021a468f] failed to get pod annotation: timed out waiting for annotations

Further digging shows failures in ovn-kube:

2021-09-16T09:44:30.467004793+00:00 stderr F I0916 09:44:30.466963       1 pods.go:302] [openshift-multus/network-metrics-daemon-cr5x7] addLogicalPort took 6.913889ms
2021-09-16T09:44:30.499490783+00:00 stderr F I0916 09:44:30.499391       1 pods.go:302] [openshift-kube-scheduler/installer-5-cluster-bootstrap-fxks4-master-0] addLogicalPort took 30.000973158s
2021-09-16T09:44:30.499490783+00:00 stderr F E0916 09:44:30.499448       1 ovn.go:481] timed out waiting for logical switch "cluster-bootstrap-fxks4-master-0" subnet: timed out waiting for the condition
2021-09-16T09:44:30.499734090+00:00 stderr F I0916 09:44:30.499671       1 pods.go:338] LSP already exists for port: openshift-oauth-apiserver_apiserver-7cdd8b65df-w2mf8
2021-09-16T09:44:30.500019797+00:00 stderr F I0916 09:44:30.499930       1 event.go:282] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-kube-scheduler", Name:"installer-5-cluster-bootstrap-fxks4-master-", UID:"f8788e92-fcf4-4fab-a50b-52e370e651d6", APIVersion:"v1", ResourceVersion:"11620", FieldPath:""}): type: 'Warning' reason: 'ErrorAddingLogicalPort' timed out waiting for logical switch "cluster-bootstrap-fxk4-master-0" subnet: timed out waiting for the condition
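
In the excerpt above, the failing pod's addLogicalPort call took about 30 seconds, while a healthy call took a few milliseconds. A minimal sketch for pulling such slow calls out of a larger ovn-kube log follows. The function name `slow_add_logical_port` and the 1-second default threshold are hypothetical choices made for illustration, and the parser only handles single-unit durations (such as `6.913889ms` or `30.000973158s`), not compound ones.

```python
import re

# Matches the "addLogicalPort took <duration>" lines shown in the log excerpt.
# Group 1: namespace/pod, group 2: numeric value, group 3: unit.
LINE_RE = re.compile(r"\[([^\]]+)\] addLogicalPort took ([0-9.]+)(ns|µs|us|ms|s)\b")

# Conversion factors from each supported unit to seconds.
UNIT_SECONDS = {"ns": 1e-9, "µs": 1e-6, "us": 1e-6, "ms": 1e-3, "s": 1.0}


def slow_add_logical_port(lines, threshold_s=1.0):
    """Return (pod, seconds) for each addLogicalPort call at or above threshold_s.

    threshold_s is an illustrative default, not a value taken from ovn-kubernetes.
    """
    slow = []
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # not an addLogicalPort timing line
        seconds = float(match.group(2)) * UNIT_SECONDS[match.group(3)]
        if seconds >= threshold_s:
            slow.append((match.group(1), seconds))
    return slow


if __name__ == "__main__":
    import sys

    # Read a log from stdin and list the slow calls, one per line.
    for pod, seconds in slow_add_logical_port(sys.stdin):
        print(f"{pod}\t{seconds:.3f}s")
```

Feeding the attached ovn-kube log through this script on stdin would list each pod whose port addition exceeded the threshold, which makes it easier to see how many pods hit the timeout within a given window.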

The issue occurs randomly during ARO cluster deployment. We tried extending the timeouts for the cluster to become ready, without success. The issue also manifests itself within small time windows.

I am attaching logs with multiple occurrences. The first lines come from the operator; the rest of the logs come from ovn-kube on master-0.

Comment 1 Surya Seetharaman 2021-09-22 09:45:15 UTC
@pkotas: The error itself means that the pod subnet for that node is not getting assigned, so it's expected that the logical switch can't be added and pods can't be created, since there is no IP pool. The question is what's different about ARO that prevents the clustersubnet from being added as the host subnet for the node. I'd need a full must-gather from when the error occurs plus, if possible, a kubeconfig or access to the cluster to debug this.
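
One way to check whether a node has been assigned a host subnet is to inspect its annotations. The sketch below makes an assumption that should be verified against the deployed ovn-kubernetes version: that the per-node subnet is recorded as JSON in the `k8s.ovn.org/node-subnets` annotation. The function name `nodes_missing_subnet` is hypothetical.

```python
import json

# Assumption: ovn-kubernetes stores the node's host subnet(s) as a JSON object
# in this annotation. Verify the name against the deployed version.
NODE_SUBNETS_ANNOTATION = "k8s.ovn.org/node-subnets"


def nodes_missing_subnet(node_list):
    """Return names of nodes whose subnet annotation is absent, empty, or invalid.

    node_list is the parsed JSON of a Kubernetes NodeList object.
    """
    missing = []
    for node in node_list.get("items", []):
        metadata = node.get("metadata", {})
        annotations = metadata.get("annotations") or {}
        raw = annotations.get(NODE_SUBNETS_ANNOTATION, "")
        try:
            subnets = json.loads(raw) if raw else {}
        except json.JSONDecodeError:
            subnets = {}  # treat an unparsable value as no subnet assigned
        if not subnets:
            missing.append(metadata.get("name", "<unknown>"))
    return missing


if __name__ == "__main__":
    import sys

    # Expects a NodeList as JSON on stdin.
    for name in nodes_missing_subnet(json.load(sys.stdin)):
        print(name)
```

The script reads a NodeList on stdin, so it could be fed from `oc get nodes -o json` on a live cluster or from the node resources captured in a must-gather.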


Comment 2 Benjamin Vesel 2021-09-23 15:37:39 UTC
@surya I'm trying to gather the information requested above. We were seeing pretty high failure rates on installation with 4.8.10. We recently bumped our ARO default installation to 4.8.11 and I haven't seen a failure yet in about 10 attempts. I'm wondering if it could be related to https://bugzilla.redhat.com/show_bug.cgi?id=1999895, as that's a similar error to the one we saw and I haven't hit it on installs of 4.8.11 yet.

I'm dropping our install version to 4.8.10 again and will get you the info requested when I have a replication of the issue.

Comment 3 Surya Seetharaman 2021-09-23 15:53:32 UTC
Hey Benjamin,

Thanks for letting me know, yeah, that change could totally be it! You are right. If you haven't seen it on 4.8.11 and only see it on 4.8.10, let me quickly pull a diff of the two versions and see if we have that revert commit in there for 4.8.11 and not for 4.8.10.

Comment 4 DRamseur 2021-09-30 14:19:03 UTC
@surya - were you able to pull a diff of the two versions and see if we have that revert commit in there for 4.8.11 and not for 4.8.10?

Comment 5 Surya Seetharaman 2021-10-01 12:47:39 UTC
Hi dramseur,

Yes, sorry for not providing an update. I checked, and it's indeed that commit that's causing this issue on 4.8.10.

The revert commit is in 4.8.11: https://github.com/openshift/ovn-kubernetes/commit/8eb74914e7425f544504baa8357144dc21deac3d (tip of 4.8.11) but not in 4.8.10: https://github.com/openshift/ovn-kubernetes/commit/5ee76f3d1635658339433f9937714b3e44546591 (tip of 4.8.10).

Shall I close this bug?


Comment 6 Benjamin Vesel 2021-10-01 20:42:30 UTC

Comment 7 Benjamin Vesel 2021-10-01 20:43:39 UTC
@surya I think we're good to close this.  Seems like it was fixed in 4.8.11 and I can't reproduce it now.

Comment 8 Surya Seetharaman 2021-10-04 12:38:16 UTC
Great, thanks for confirming, Benjamin.

Closing this.

Comment 9 Surya Seetharaman 2021-10-04 12:39:57 UTC

*** This bug has been marked as a duplicate of bug 1999895 ***