Description of problem:

Hit this on 4.6.0-0.nightly-2021-01-05-062422 while trying to verify bug 1883917.

While running the same workload described in bug 1883917 and its 4.7 parent bug 1855408 (which passed verification):

on a 100 worker node cluster, create 1000 namespaces with a 2 pod deployment in each

427/2000 pods started successfully. The remainder are stuck in ContainerCreating with this event in oc describe:

Warning FailedCreatePodSandBox 21s (x12 over 6m47s) kubelet, ip-10-0-200-163.us-west-2.compute.internal (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_deployment1v1-7b9bd87f99-dlw58_bz-a-999_d8103c15-b725-4579-aa22-3d15358bad2d_0(ff966e88f1870cb00e0c962a2c970ca380c6debdf21bbffae4a8edbc2ce09ee0): [bz-a-999/deployment1v1-7b9bd87f99-dlw58:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[bz-a-999/deployment1v1-7b9bd87f99-dlw58] failed to configure pod interface: timed out waiting for pod flows for pod: deployment1v1-7b9bd87f99-dlw58, error: timed out waiting for the condition

This workload was OK on 4.7 and works OK for openshiftSDN.

Version-Release number of selected component (if applicable):
4.6.0-0.nightly-2021-01-05-062422

How reproducible:
Always for this workload

Steps to Reproduce:
1. AWS cluster with 3 m5.2xlarge masters and 100 m5.large workers
2. Create 1000 namespaces, each with 1 deployment containing 2 replicas (20 pods/node on avg)

Actual results:
Only 472 pods start; the others are ContainerCreating with the event above and never seem to progress.

Expected results:
Successful execution of this workload, as in 4.7 and for openshiftSDN

Additional info:
Will include link to must-gather
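For anyone trying to reproduce this without the original tooling, the workload in Steps to Reproduce could be approximated with a small manifest generator like the sketch below. This is not the script used for this report. The namespace pattern `bz-a-<n>` and the deployment name `deployment1v1` are taken from the event above, but the zero-based numbering, the `app` label, the container name, and the image `registry.example.com/pause:latest` are all placeholder assumptions.

```python
import json


def build_manifests(index, replicas=2, image="registry.example.com/pause:latest"):
    """Return a Namespace and a Deployment manifest for one namespace.

    The image is a hypothetical placeholder; substitute any small image
    that is reachable from the cluster.
    """
    ns = f"bz-a-{index}"                  # naming pattern seen in the event
    labels = {"app": "deployment1v1"}     # assumed label, not from the report
    namespace = {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {"name": ns},
    }
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "deployment1v1", "namespace": ns},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{"name": "app", "image": image}]},
            },
        },
    }
    return [namespace, deployment]


def build_workload(namespaces=1000, replicas=2):
    """Bundle all manifests into a single v1 List object."""
    items = []
    for i in range(namespaces):
        # Each Namespace precedes its Deployment in the list.
        items.extend(build_manifests(i, replicas))
    return {"apiVersion": "v1", "kind": "List", "items": items}


if __name__ == "__main__":
    # 1000 namespaces x 2 replicas = 2000 pods, as in this report.
    print(json.dumps(build_workload(), indent=2))
```

Assuming the script is saved as gen_workload.py (a hypothetical name), the output can be piped to `oc apply -f -`. Applying everything in one request burst may not match the creation rate of the original test, so results may differ.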
Created attachment 1744664
Journal from one node in the cluster this bz is being reported on.

Unfortunately the cluster degraded to the point that the API became unavailable and I could not get a must-gather. The masters were inaccessible from an ssh bastion, but I was able to get the journal off of 1 worker. Let me know what else is needed for the next repro of this issue.
Reproduced on 4.6.0-0.nightly-2021-01-18-070340. Still blocks verification of bug 1883917.
Reassigning to Ben since I'm on leave. Please reassign to someone on the team.
Could not reproduce this on 4.8.0-0.nightly-2021-05-13-222446.

Created 2000 pods in 1000 namespaces.
Created 5000 pods in 2500 namespaces. CNI Request ADD latency increased significantly by the end of this run, to ~12s, but everything started successfully.

The error event and ContainerCreating issue described in this bug were not seen. Closing as fixed upstream.