Bug 1885713

Summary: failed to configure pod interface: timed out waiting for pod flows for pod
Product: OpenShift Container Platform Reporter: Sai Sindhur Malleni <smalleni>
Component: Networking Assignee: Anil Vishnoi <avishnoi>
Networking sub component: ovn-kubernetes QA Contact: Anurag saxena <anusaxen>
Status: CLOSED DUPLICATE Docs Contact:
Severity: high    
Priority: high CC: anbhat, avishnoi, bbennett, dblack, jtaleric, trozet
Version: 4.6   
Target Milestone: ---   
Target Release: 4.7.0   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-11-18 19:53:24 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Sai Sindhur Malleni 2020-10-06 19:15:15 UTC
Description of problem:
During an API stress test on a 4.6 baremetal cluster (3 masters + 110 worker nodes),
we create the following in each project:
10 Deployment Configs
10 Services
3 Routes
and other control plane resources.

We do this serially across 100 projects.

The flow of the test is:

All control plane objects in one namespace are created before the test moves on to the next namespace.
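
For reference, the serial creation order described above can be sketched as follows. This is only an illustration, not the actual test harness: the project and object names are hypothetical, and the "other control plane resources" are left out because the report does not enumerate them.

```python
from typing import Iterator, Tuple

# Per-project counts taken from the description above. The "other control
# plane resources" are not listed in the report, so they are omitted here.
PER_PROJECT = {"deploymentconfig": 10, "service": 10, "route": 3}
NUM_PROJECTS = 100


def creation_plan(num_projects: int = NUM_PROJECTS) -> Iterator[Tuple[str, str, str]]:
    """Yield (project, kind, name) in serial order: one project is fully
    populated before the next project is started."""
    for p in range(num_projects):
        project = f"project{p}"  # hypothetical project name
        for kind, count in PER_PROJECT.items():
            for i in range(count):
                yield project, kind, f"{kind}{i}"  # hypothetical object name


plan = list(creation_plan())
print(f"{len(plan)} objects across {NUM_PROJECTS} projects")
```

Counting only the three listed kinds, this comes to 23 objects per project.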

After a few projects, we see errors like
0s          Warning   FailedCreatePodSandBox   pod/deploymentconfig9-1-deploy         Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_deploymentconfig9-1-deploy_mastervert084_f81fe01c-bab7-460f-881a-7fd2b6b2055d_0(c8993ebc2d95ae3d4be80f7a3242ff4e63b10205c65c540f1196c87d2b93001b): [mastervert084/deploymentconfig9-1-deploy:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[mastervert084/deploymentconfig9-1-deploy] failed to configure pod interface: timed out waiting for pod flows for pod: deploymentconfig9-1-deploy, error: timed out waiting for the condition

in the project events.
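
When sifting through many such events, it can help to pull the namespace, pod, and failure reason out of the message. The snippet below is a rough sketch that assumes the message layout shown in the event above; the regular expression and the function name are illustrative and are not part of ovn-kubernetes.

```python
import re
from typing import Optional

# Matches the "[namespace/pod] failed to configure pod interface: ..." part of
# the event message. The pattern is an assumption based on the single sample
# in this report and may need adjusting for other message variants.
FLOW_TIMEOUT_RE = re.compile(
    r"\[(?P<namespace>[^/\]\s]+)/(?P<pod>[^\]\s:]+)\] "
    r"failed to configure pod interface: (?P<reason>.+?)"
    r"(?:, error: (?P<error>.+))?$"
)


def parse_sandbox_error(message: str) -> Optional[dict]:
    """Return namespace, pod, reason, and error from a sandbox failure
    message, or None if the message does not match the expected layout."""
    match = FLOW_TIMEOUT_RE.search(message)
    return match.groupdict() if match else None


# Tail of the event message quoted above.
sample = (
    "[mastervert084/deploymentconfig9-1-deploy:ovn-kubernetes]: "
    'error adding container to network "ovn-kubernetes": '
    "CNI request failed with status 400: "
    "'[mastervert084/deploymentconfig9-1-deploy] failed to configure pod interface: "
    "timed out waiting for pod flows for pod: deploymentconfig9-1-deploy, "
    "error: timed out waiting for the condition"
)

parsed = parse_sandbox_error(sample)
print(parsed)
```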


Version-Release number of selected component (if applicable):
4.6.0-0.nightly-2020-10-03-051134

How reproducible:
100%

Steps to Reproduce:
1. Deploy a large cluster
2. Create multiple deployments/services per project across 100 projects
3.

Actual results:
Some pods fail to launch due to the above-mentioned error.

Expected results:
We shouldn't be seeing such errors.

Additional info:

Comment 8 Tim Rozet 2020-11-18 19:53:24 UTC
Looking at the must-gather, we can see that the pod creation on the NB side comes in at 15:23:51:
ovnkube-master-gv6kt/ovnkube-master/ovnkube-master/logs/previous.log:2020-10-06T15:23:51.220803029Z I1006 15:23:51.220765       1 kube.go:63] Setting annotations map[k8s.ovn.org/pod-networks:{"default":{"ip_addresses":["10.131.22.48/23"],"mac_address":"0a:58:0a:83:16:30","gateway_ips":["10.131.22.1"],"ip_address":"10.131.22.48/23","gateway_ip":"10.131.22.1"}}] on pod mastervert058/deploymentconfig9-1-deploy


and the CNI request happens at roughly the same time:
ovnkube-node-4gnnj/ovnkube-node/ovnkube-node/logs/current.log:2020-10-06T15:23:51.566462152Z I1006 15:23:51.566413    6673 cniserver.go:147] Waiting for ADD result for pod mastervert058/deploymentconfig9-1-deploy

then in ovn-controller on the node, the port isn't bound until 15:24:05:
ovnkube-node-4gnnj/ovn-controller/ovn-controller/logs/current.log:2020-10-06T15:24:05.302862332Z 2020-10-06T15:24:05Z|01231|binding|INFO|Claiming lport mastervert058_deploymentconfig9-1-deploy for this chassis.
ovnkube-node-4gnnj/ovn-controller/ovn-controller/logs/current.log:2020-10-06T15:24:05.302862332Z 2020-10-06T15:24:05Z|01232|binding|INFO|mastervert058_deploymentconfig9-1-deploy: Claiming 0a:58:0a:83:16:30 10.131.22.48
ovnkube-node-4gnnj/ovn-controller/ovn-controller/logs/current.log:2020-10-06T15:24:30.871492290Z 2020-10-06T15:24:30Z|01248|binding|INFO|Releasing lport mastervert058_deploymentconfig9-1-deploy from this chassis.


and then CNI times out waiting for the flows at 15:24:13:
ovnkube-node-4gnnj/ovnkube-node/ovnkube-node/logs/current.log:2020-10-06T15:24:13.526463348Z I1006 15:24:13.526362    6673 cni.go:157] [mastervert058/deploymentconfig9-1-deploy] CNI request &{ADD mastervert058 deploymentconfig9-1-deploy 0ec8e8b4da2fbc6c482cd98b07c6b2fb92f094b65a707b4b096cc6881273b0e5 /var/run/netns/f595d15a-093a-4240-8539-ecc384c65666 eth0 0xc003244d00}, result "", err failed to configure pod interface: timed out waiting for pod flows for pod: deploymentconfig9-1-deploy, error: timed out waiting for the condition
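
To make the timeline above easier to read, here is a small sketch that computes each event's offset from the CNI ADD request. The timestamps are copied from the log lines quoted in this comment; the event labels and the helper function are just names chosen for this example.

```python
from datetime import datetime


def parse_ts(ts: str) -> datetime:
    """Parse a log timestamp with nanosecond precision.

    datetime only supports microseconds, so the fraction is truncated."""
    date_part, frac = ts.rstrip("Z").split(".")
    return datetime.strptime(f"{date_part}.{frac[:6]}", "%Y-%m-%dT%H:%M:%S.%f")


# Timestamps copied from the log excerpts above; the keys are illustrative labels.
EVENTS = {
    "annotation_set": "2020-10-06T15:23:51.220803029Z",  # ovnkube-master
    "cni_add_wait": "2020-10-06T15:23:51.566462152Z",    # ovnkube-node
    "port_claimed": "2020-10-06T15:24:05.302862332Z",    # ovn-controller
    "cni_timeout": "2020-10-06T15:24:13.526463348Z",     # ovnkube-node
    "port_released": "2020-10-06T15:24:30.871492290Z",   # ovn-controller
}

start = parse_ts(EVENTS["cni_add_wait"])
deltas = {
    name: (parse_ts(ts) - start).total_seconds() for name, ts in EVENTS.items()
}

# Print events in chronological order relative to the CNI ADD request.
for name, secs in sorted(deltas.items(), key=lambda kv: kv[1]):
    print(f"{name:16s} {secs:+8.3f}s")
```

With these timestamps, the port is claimed roughly 13.7 s after the CNI ADD request, and the CNI request gives up roughly 22 s after it started, about 8 s after the claim.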

tl;dr OVN is under too much stress here and is taking too long to wire the port and add the flows. The fixes for bugs 1855408, 1888829, and 1859924 should improve OVN's handling of requests. We can close this for now and reopen it if we see this issue again.

*** This bug has been marked as a duplicate of bug 1859924 ***