Summary: | Cannot achieve 250 pods/node with OVNKubernetes in a multi-worker-node cluster | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Alex Krzos <akrzos> |
Component: | Networking | Assignee: | Surya Seetharaman <surya> |
Networking sub component: | ovn-kubernetes | QA Contact: | Ross Brattain <rbrattai> |
Status: | CLOSED ERRATA | Docs Contact: | |
Severity: | urgent | ||
Priority: | urgent | CC: | anusaxen, dcbw, rbrattai, surya |
Version: | 4.10 | ||
Target Milestone: | --- | ||
Target Release: | 4.10.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2022-03-10 16:34:06 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: |
Description
Alex Krzos
2021-12-15 16:37:33 UTC
Looking at the OVN logs on a specific ovnkube-master pod shows that the addLogicalPort timings have become unreasonable:

# cat ovnkube-master-5kz84.log | grep "addlogicalport" -i
...
I1216 19:10:57.968847 1 pods.go:300] [boatload-1168/boatload-1168-1-boatload-59c794667-tsq5s] addLogicalPort took 119.267425ms
I1216 19:10:58.285261 1 pods.go:300] [boatload-1169/boatload-1169-1-boatload-85f5db9bf4-mbdtc] addLogicalPort took 119.572973ms
I1216 19:10:58.885310 1 pods.go:300] [boatload-1170/boatload-1170-1-boatload-848b99df44-7nspz] addLogicalPort took 119.824014ms
I1216 19:11:15.915533 1 pods.go:300] [openshift-marketplace/redhat-marketplace-6lzvf] addLogicalPort took 121.279971ms
I1216 19:14:14.439708 1 pods.go:300] [openshift-marketplace/redhat-operators-w5qzm] addLogicalPort took 118.814712ms
I1216 19:14:14.449908 1 pods.go:300] [openshift-marketplace/certified-operators-9dhg9] addLogicalPort took 116.228607ms
I1216 19:14:14.463422 1 pods.go:300] [openshift-marketplace/community-operators-qd59v] addLogicalPort took 117.597242ms
I1216 19:15:00.263116 1 pods.go:300] [openshift-multus/ip-reconciler-27328035--1-ghxwz] addLogicalPort took 112.461108ms
I1216 19:15:00.269914 1 pods.go:300] [openshift-operator-lifecycle-manager/collect-profiles-27328035--1-hgh5k] addLogicalPort took 115.551699ms
I1216 19:41:36.790291 1 pods.go:300] [boatload-6/boatload-6-1-boatload-788dc74479-gw68m] addLogicalPort took 50.001351653s
I1216 19:42:36.792850 1 pods.go:300] [boatload-6/boatload-6-1-boatload-788dc74479-gw68m] addLogicalPort took 1m0.002430439s
I1216 19:42:56.793158 1 pods.go:300] [boatload-1/boatload-1-1-boatload-6c96b649bd-48bvl] addLogicalPort took 1m20.001660463s
I1216 19:43:16.793452 1 pods.go:300] [boatload-4/boatload-4-1-boatload-5f5f864d8f-778z4] addLogicalPort took 1m40.002229834s
I1216 19:43:36.794630 1 pods.go:300] [boatload-2/boatload-2-1-boatload-797b75b5c9-klz67] addLogicalPort took 2m0.00304807s
I1216 19:43:56.796481 1 pods.go:300] [boatload-3/boatload-3-1-boatload-f4cf55cdb-475p8] addLogicalPort took 2m20.004503398s

In this log, the addLogicalPort timings from a previous test take ~119ms; in the following test, which leaves every workload pod stuck in ContainerCreating, they take 1m or longer.
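For anyone repeating this analysis, the addLogicalPort timings can also be pulled straight from the leader's logs with something like the following. This is a rough sketch: the pod name is the one from this report and the ovnkube-master container name is assumed, so adjust both for your cluster (or grep a must-gather log file directly as above).

# Pull every addLogicalPort timing from the ovnkube-master leader's log.
oc logs -n openshift-ovn-kubernetes ovnkube-master-5kz84 -c ovnkube-master \
    | grep -i "addLogicalPort took"

# Flag only the pathological entries, i.e. anything that took minutes rather than
# milliseconds. (This misses second-scale entries such as "50.0s", so treat it as a
# quick filter, not an exhaustive one.)
oc logs -n openshift-ovn-kubernetes ovnkube-master-5kz84 -c ovnkube-master \
    | grep -iE "addLogicalPort took [0-9]+m[0-9]"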
After some time passes, looking at one specific pod:

# oc get po -n boatload-34
NAME                                      READY   STATUS              RESTARTS   AGE
boatload-34-1-boatload-7d58cbdfbb-bpvlr   0/1     ContainerCreating   0          6h33m

I1217 01:44:17.761433 1 pods.go:300] [boatload-34/boatload-34-1-boatload-7d58cbdfbb-bpvlr] addLogicalPort took 9m30.017726293s

We can see that this pod took 9m30s for the addLogicalPort entry in the log.

Events section from the pod description:

Events:
  Type     Reason                   Age                      From          Message
  ----     ------                   ----                     ----          -------
  Warning  ErrorAddingLogicalPort   9m19s (x3 over 4h12m)    controlplane  failed to ensure namespace locked: failed to create address set for namespace: boatload-34, error: failed to create new address set boatload-34_v4 (error in transact with ops [{Op:insert Table:Address_Set Row:map[external_ids:{GoMap:map[name:boatload-34_v4]} name:a1817380831482281572] Rows:[] Columns:[] Mutations:[] Timeout:0 Where:[] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}]: context deadline exceeded: while awaiting reconnection)
  Warning  FailedCreatePodSandBox   4m10s (x309 over 6h22m)  kubelet       (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_boatload-34-1-boatload-7d58cbdfbb-bpvlr_boatload-34_8c1ba5ef-f0c8-43b0-a73a-3662eb9f8735_0(b101d034c4bb7ed13ad02c24e58d622dc8a67a729c790c251c19bfe344837266): error adding pod boatload-34_boatload-34-1-boatload-7d58cbdfbb-bpvlr to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): [boatload-34/boatload-34-1-boatload-7d58cbdfbb-bpvlr/8c1ba5ef-f0c8-43b0-a73a-3662eb9f8735:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[boatload-34/boatload-34-1-boatload-7d58cbdfbb-bpvlr b101d034c4bb7ed13ad02c24e58d622dc8a67a729c790c251c19bfe344837266] [boatload-34/boatload-34-1-boatload-7d58cbdfbb-bpvlr b101d034c4bb7ed13ad02c24e58d622dc8a67a729c790c251c19bfe344837266] failed to get pod annotation: timed out waiting for annotations: context deadline exceeded

It seems context deadlines were exceeded while awaiting reconnection. Later on, in order to clear the workload, I had to reboot each master in serial; after that I was eventually able to delete all workload pods, and other short-lived openshift pods (such as job pods) were able to come and go too. (Not just the workload pods had ended up stuck in ContainerCreating.)

Marking this as blocker+ since without the PRs there is a bad regression in pod creation latency. @dcbw: hope you agree.
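For reference, the stuck pods above and the per-node Running counts used in the verification below can be gathered with something like the following. This is a rough sketch using standard oc/kubectl output; the pod and namespace names are taken from this report.

# List every pod stuck in ContainerCreating across the cluster.
oc get pods -A --no-headers | grep ContainerCreating

# Pull the events for one of the stuck pods (names from this report).
oc describe pod -n boatload-34 boatload-34-1-boatload-7d58cbdfbb-bpvlr

# Count Running pods per node (jsonpath filter on status.phase, grouped with sort/uniq).
oc get pods -A -o jsonpath='{range .items[?(@.status.phase=="Running")]}{.spec.nodeName}{"\n"}{end}' \
    | sort | uniq -c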
Verified on 4.10.0-0.nightly-2022-01-31-012936 on IBM Cloud. 10 workers, 250 pods on 5 nodes.

Running pods/node:
236 rbrattai-o410i11-kvds9-worker-1-b66hx
231 rbrattai-o410i11-kvds9-worker-1-h6zdr
237 rbrattai-o410i11-kvds9-worker-1-j9js4
237 rbrattai-o410i11-kvds9-worker-2-6l5tt
236 rbrattai-o410i11-kvds9-worker-2-8ggpw

Largest addLogicalPort times:
node-density-287] addLogicalPort took 470.708987ms
node-density-978] addLogicalPort took 493.649061ms
node-density-785] addLogicalPort took 511.898523ms
node-density-586] addLogicalPort took 513.501564ms
node-density-284] addLogicalPort took 537.483542ms
node-density-38] addLogicalPort took 548.649217ms
node-density-976] addLogicalPort took 554.633343ms
node-density-283] addLogicalPort took 577.463732ms
node-density-148] addLogicalPort took 600.974141ms
node-density-145] addLogicalPort took 619.917334ms
node-density-37] addLogicalPort took 639.893066ms
node-density-422] addLogicalPort took 686.451966ms
node-density-424] addLogicalPort took 716.245933ms
node-density-146] addLogicalPort took 727.488826ms
node-density-36] addLogicalPort took 764.081905ms
node-density-977] addLogicalPort took 769.694995ms
node-density-423] addLogicalPort took 771.642659ms
node-density-585] addLogicalPort took 785.058304ms

On AWS, 4.10.0-0.nightly-2022-01-31-012936, 9 workers, 250 pods on 5 workers.

Running pods/node:
236 rbrattai-o410i11-kvds9-worker-1-b66hx
231 rbrattai-o410i11-kvds9-worker-1-h6zdr
237 rbrattai-o410i11-kvds9-worker-1-j9js4
237 rbrattai-o410i11-kvds9-worker-2-6l5tt
236 rbrattai-o410i11-kvds9-worker-2-8ggpw

Largest addLogicalPort times:
node-density-514] addLogicalPort took 164.437205ms
node-density-30] addLogicalPort took 166.760694ms
node-density-833] addLogicalPort took 185.298644ms
node-density-739] addLogicalPort took 189.04798ms
node-density-831] addLogicalPort took 189.115347ms
node-density-735] addLogicalPort took 218.693352ms
node-density-738] addLogicalPort took 224.195572ms
node-density-732] addLogicalPort took 276.709528ms
node-density-734] addLogicalPort took 278.150532ms
node-density-737] addLogicalPort took 316.868319ms
node-density-736] addLogicalPort took 339.96229ms
node-density-733] addLogicalPort took 352.281789ms

Correction: on AWS, 4.10.0-0.nightly-2022-01-31-012936, 9 workers, 250 pods on 5 workers.

Running pods/node:
237 ip-10-0-139-124.us-east-2.compute.internal
232 ip-10-0-142-79.us-east-2.compute.internal
236 ip-10-0-154-66.us-east-2.compute.internal
231 ip-10-0-166-221.us-east-2.compute.internal
237 ip-10-0-176-104.us-east-2.compute.internal

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056