Bug 1912975

Summary: Containers stuck in ContainerCreating when creating 1000 namespaces on 100 nodes with 1000 deployments
Product: OpenShift Container Platform Reporter: Mike Fiedler <mifiedle>
Component: Networking Assignee: Mohamed Mahmoud <mmahmoud>
Networking sub component: ovn-kubernetes QA Contact: Anurag saxena <anusaxen>
Status: CLOSED UPSTREAM Docs Contact:
Severity: high    
Priority: high CC: aconstan, anbhat, vpickard
Version: 4.6.z   
Target Milestone: ---   
Target Release: 4.8.0   
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2021-05-14 16:27:43 UTC Type: Bug
Bug Blocks: 1883917, 1908472    
Attachments:
  Description: journal from one node in the cluster this bz is being reported on. Flags: none

Description Mike Fiedler 2021-01-05 17:38:31 UTC
Description of problem:

Hit this on 4.6.0-0.nightly-2021-01-05-062422 trying to verify bug 1883917

While running the same workload described in bug 1883917 and its 4.7 parent bug 1855408 (which passed verification):

On a 100-worker-node cluster, create 1000 namespaces with a 2-pod deployment in each.

427/2000 pods started successfully
The remainder are stuck in ContainerCreating with this event in oc describe:

  Warning  FailedCreatePodSandBox  21s (x12 over 6m47s)  kubelet, ip-10-0-200-163.us-west-2.compute.internal  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_deployment1v1-7b9bd87f99-dlw58_bz-a-999_d8103c15-b725-4579-aa22-3d15358bad2d_0(ff966e88f1870cb00e0c962a2c970ca380c6debdf21bbffae4a8edbc2ce09ee0): [bz-a-999/deployment1v1-7b9bd87f99-dlw58:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[bz-a-999/deployment1v1-7b9bd87f99-dlw58] failed to configure pod interface: timed out waiting for pod flows for pod: deployment1v1-7b9bd87f99-dlw58, error: timed out waiting for the condition
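When triaging many of these events, it can help to pull the namespace, pod name, and failure reason out of the message. The following is a minimal sketch that assumes the message keeps the "[namespace/pod] failed to configure pod interface: ..." shape seen in the event above; the function name parse_cni_failure is hypothetical and not part of any OpenShift tooling.

```python
import re

# Matches the "[<namespace>/<pod>] failed to configure pod interface: <reason>"
# fragment of the event. The bracketed "[ns/pod:ovn-kubernetes]" prefix earlier
# in the message is skipped because the pod group does not allow ':'.
_CNI_FAILURE = re.compile(
    r"\[(?P<namespace>[^/\]\s]+)/(?P<pod>[^\]\s:]+)\] "
    r"failed to configure pod interface: (?P<reason>.*)"
)


def parse_cni_failure(message):
    """Return a dict with namespace, pod and reason, or None if no match."""
    match = _CNI_FAILURE.search(message)
    if match is None:
        return None
    return {
        "namespace": match.group("namespace"),
        "pod": match.group("pod"),
        "reason": match.group("reason").strip(),
    }
```

Grouping the parsed results by reason would show whether every stuck pod fails with the same "timed out waiting for pod flows" error or whether other failures are mixed in.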


This workload was OK on 4.7 and works OK for openshiftSDN


Version-Release number of selected component (if applicable): 4.6.0-0.nightly-2021-01-05-062422


How reproducible: Always for this workload


Steps to Reproduce:
1. AWS cluster with 3 m5.2xlarge masters and 100 m5.large workers
2. Create 1000 namespaces each with 1 deployment containing 2 replicas.  (20 pods/node on avg)
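For anyone who needs to recreate step 2 without the original test tooling, the workload can be approximated with a sketch like the one below. The namespace prefix bz-a- and the deployment name deployment1v1 are taken from the event in the description. The container image, the app label, the container name, and the function names are placeholders and assumptions, not what the reporter actually ran.

```python
import json


def build_workload(num_namespaces=1000, replicas=2,
                   image="registry.example.com/pause:latest"):
    """Build a Kubernetes List holding one Namespace and one Deployment per namespace.

    The image is a placeholder; substitute any small image the cluster can pull.
    """
    items = []
    for i in range(num_namespaces):
        ns = f"bz-a-{i}"
        items.append({
            "apiVersion": "v1",
            "kind": "Namespace",
            "metadata": {"name": ns},
        })
        items.append({
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "metadata": {"name": "deployment1v1", "namespace": ns},
            "spec": {
                "replicas": replicas,
                "selector": {"matchLabels": {"app": "deployment1v1"}},
                "template": {
                    "metadata": {"labels": {"app": "deployment1v1"}},
                    "spec": {"containers": [{"name": "c", "image": image}]},
                },
            },
        })
    return {"apiVersion": "v1", "kind": "List", "items": items}


def expected_pods(workload):
    """Total pods the workload should produce (sum of Deployment replicas)."""
    return sum(item["spec"]["replicas"]
               for item in workload["items"]
               if item["kind"] == "Deployment")


if __name__ == "__main__":
    # Emit JSON on stdout so it can be piped into the CLI.
    print(json.dumps(build_workload()))
```

Saved under a name such as gen_workload.py (hypothetical), the output can be piped to `oc apply -f -`. With the defaults, expected_pods returns 2000, which over 100 workers gives the 20 pods/node average mentioned in step 2.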


Actual results:

Only 472 pods start, others are ContainerCreating with event above and never seem to progress.
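To get exact started vs. stuck counts on the next repro, the pod list can be summarized with a short sketch like the following. It assumes the input is the parsed JSON from `oc get pods -A -o json` and relies on the standard pod status fields (status.phase and containerStatuses[].state.waiting.reason); the function names are hypothetical.

```python
from collections import Counter


def pod_state(pod):
    """Return the waiting reason of the first waiting container, else the pod phase."""
    status = pod.get("status", {})
    for cs in status.get("containerStatuses", []):
        waiting = cs.get("state", {}).get("waiting")
        if waiting:
            # Pods stuck at sandbox creation report reason "ContainerCreating".
            return waiting.get("reason", "Waiting")
    return status.get("phase", "Unknown")


def summarize(pod_list):
    """Count pods per state for a parsed `oc get pods -o json` document."""
    return Counter(pod_state(p) for p in pod_list.get("items", []))
```

Calling summarize on the result of json.load over the saved command output gives a Counter keyed by state, so the number of Running pods can be compared directly against the expected 2000.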


Expected results:

Successful execution of this workload as in 4.7 and for openshiftSDN


Additional info:

Will include link to must-gather

Comment 1 Mike Fiedler 2021-01-05 18:06:15 UTC
Created attachment 1744664 [details]
journal from one node in the cluster this bz is being reported on.

Unfortunately, the cluster degraded to the point that the API became unavailable and I could not get a must-gather. The masters were inaccessible from an ssh bastion, but I was able to get the journal off of 1 worker. Let me know what else is needed for the next repro of this issue.

Comment 2 Mike Fiedler 2021-01-18 22:28:03 UTC
Reproduced on 4.6.0-0.nightly-2021-01-18-070340.  Still blocks verification of bug 1883917

Comment 3 Ricardo Carrillo Cruz 2021-01-19 13:43:29 UTC
Reassigning to Ben since I'm on leave. Please reassign to someone in the team.

Comment 7 Mike Fiedler 2021-05-14 16:27:43 UTC
Could not reproduce this on 4.8.0-0.nightly-2021-05-13-222446

Created 2000 pods in 1000 namespaces.
Created 5000 pods in 2500 namespaces. CNI Request ADD latency increased significantly by the end of this run, to ~12s, but everything started successfully.

The error event and ContainerCreating issue described in this bug were not seen.

Closing as fixed upstream.