Bug 1912975 - Containers stuck in ContainerCreating creating 1000 namespaces on 100 nodes with 1000 deployments
Summary: Containers stuck in ContainerCreating creating 1000 namespaces on 100 nodes with 1000 deployments
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.6.z
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Mohamed Mahmoud
QA Contact: Anurag saxena
URL:
Whiteboard:
Depends On:
Blocks: 1883917 1908472
 
Reported: 2021-01-05 17:38 UTC by Mike Fiedler
Modified: 2021-05-14 16:27 UTC
CC: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-05-14 16:27:43 UTC
Target Upstream Version:
Embargoed:


Attachments
journal from one node in the cluster this bz is being reported on. (2.35 MB, application/gzip)
2021-01-05 18:06 UTC, Mike Fiedler

Description Mike Fiedler 2021-01-05 17:38:31 UTC
Description of problem:

Hit this on 4.6.0-0.nightly-2021-01-05-062422 trying to verify bug 1883917

While running the same workload described in bug 1883917 and its 4.7 parent bug 1855408 (which passed verification):

on a 100-worker-node cluster, create 1000 namespaces with a 2-pod deployment in each

427/2000 pods started successfully
The remainder are stuck in ContainerCreating with this event in oc describe:

  Warning  FailedCreatePodSandBox  21s (x12 over 6m47s)  kubelet, ip-10-0-200-163.us-west-2.compute.internal  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_deployment1v1-7b9bd87f99-dlw58_bz-a-999_d8103c15-b725-4579-aa22-3d15358bad2d_0(ff966e88f1870cb00e0c962a2c970ca380c6debdf21bbffae4a8edbc2ce09ee0): [bz-a-999/deployment1v1-7b9bd87f99-dlw58:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[bz-a-999/deployment1v1-7b9bd87f99-dlw58] failed to configure pod interface: timed out waiting for pod flows for pod: deployment1v1-7b9bd87f99-dlw58, error: timed out waiting for the condition
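
For anyone triaging a similar run, the stuck pods and their sandbox events can be pulled with standard commands. A sketch (the pod/namespace names are taken from the event above):

  # Pods that never left ContainerCreating show up as Pending:
  oc get pods -A --field-selector=status.phase=Pending

  # Collect the FailedCreatePodSandBox events cluster-wide:
  oc get events -A --field-selector=reason=FailedCreatePodSandBox

  # Full event stream for one affected pod:
  oc describe pod deployment1v1-7b9bd87f99-dlw58 -n bz-a-999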


This workload was OK on 4.7 and works OK with openshiftSDN


Version-Release number of selected component (if applicable): 4.6.0-0.nightly-2021-01-05-062422


How reproducible: Always for this workload


Steps to Reproduce:
1. AWS cluster with 3 m5.2xlarge masters and 100 m5.large workers
2. Create 1000 namespaces each with 1 deployment containing 2 replicas.  (20 pods/node on avg)
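
The manifests themselves are not attached; a minimal sketch of the workload, reusing the bz-a-NNN / deployment1v1 naming visible in the error event (the image is an assumption, any small always-running image works):

  # Hypothetical reproduction sketch: 1000 namespaces, one 2-replica
  # deployment each (~20 pods/node on 100 workers).
  for i in $(seq 0 999); do
    oc create namespace "bz-a-${i}"
    oc create deployment deployment1v1 \
      --image=quay.io/openshift/origin-hello-openshift \
      --replicas=2 -n "bz-a-${i}"
  done

  # Watch how many pods come up vs. remain in ContainerCreating:
  oc get pods -A --no-headers | awk '{print $4}' | sort | uniq -c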


Actual results:

Only 472 pods start; the others are stuck in ContainerCreating with the event above and never seem to progress.


Expected results:

Successful execution of this workload, as on 4.7 and with openshiftSDN


Additional info:

Will include link to must-gather

Comment 1 Mike Fiedler 2021-01-05 18:06:15 UTC
Created attachment 1744664 [details]
journal from one node in the cluster this bz is being reported on.

Unfortunately the cluster degraded to the point that the API became unavailable and I could not get a must-gather. The masters were inaccessible from the ssh bastion, but I was able to get the journal off of one worker. Let me know what else is needed for the next repro of this issue.
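
For the record, grabbing a node journal when the API is down amounts to going through the bastion. A sketch (the node name is the one from the event above):

  # From the ssh bastion, pull the journal off a reachable worker:
  ssh core@ip-10-0-200-163.us-west-2.compute.internal \
    'sudo journalctl --no-pager | gzip' > worker-journal.gz

  # With a healthy API the same journal is reachable without ssh, e.g.:
  oc adm node-logs ip-10-0-200-163.us-west-2.compute.internal -u crio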

Comment 2 Mike Fiedler 2021-01-18 22:28:03 UTC
Reproduced on 4.6.0-0.nightly-2021-01-18-070340.  Still blocks verification of bug 1883917

Comment 3 Ricardo Carrillo Cruz 2021-01-19 13:43:29 UTC
Reassigning to Ben since I'm on leave; please reassign to someone on the team.

Comment 7 Mike Fiedler 2021-05-14 16:27:43 UTC
Could not reproduce this on 4.8.0-0.nightly-2021-05-13-222446

Created 2000 pods in 1000 namespaces.
Created 5000 pods in 2500 namespaces. CNI request ADD latency increased significantly by the end of this run, to ~12s, but everything started successfully.
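
A hedged sketch of how that latency can be tracked during such a run, assuming the ovn-kubernetes histogram ovnkube_node_cni_request_duration_seconds is what backs the "CNI request ADD latency" number here (if the metric name differs in this build, substitute accordingly):

  # p99 CNI ADD latency over the last 5 minutes, via PromQL:
  histogram_quantile(0.99,
    sum(rate(ovnkube_node_cni_request_duration_seconds_bucket{command="ADD"}[5m])) by (le))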

The error event and ContainerCreating issue described in this bug were not seen.

Closing as fixed upstream.

