Summary: | Cannot achieve 250 pods/node with OVNKubernetes in a multi-worker-node cluster | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Alex Krzos <akrzos> |
Component: | Networking | Assignee: | Surya Seetharaman <surya> |
Networking sub component: | ovn-kubernetes | QA Contact: | Ross Brattain <rbrattai> |
Status: | CLOSED ERRATA | Docs Contact: | |
Severity: | urgent | ||
Priority: | urgent | CC: | anusaxen, dcbw, rbrattai, surya |
Version: | 4.10 | ||
Target Milestone: | --- | ||
Target Release: | 4.10.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2022-03-10 16:34:06 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: |
Description
Alex Krzos
2021-12-15 16:37:33 UTC
Looking at the OVN logs on a specific ovnkube-master pod shows that the addLogicalPort timings have become unreasonable:

# cat ovnkube-master-5kz84.log | grep "addlogicalport" -i
...
I1216 19:10:57.968847 1 pods.go:300] [boatload-1168/boatload-1168-1-boatload-59c794667-tsq5s] addLogicalPort took 119.267425ms
I1216 19:10:58.285261 1 pods.go:300] [boatload-1169/boatload-1169-1-boatload-85f5db9bf4-mbdtc] addLogicalPort took 119.572973ms
I1216 19:10:58.885310 1 pods.go:300] [boatload-1170/boatload-1170-1-boatload-848b99df44-7nspz] addLogicalPort took 119.824014ms
I1216 19:11:15.915533 1 pods.go:300] [openshift-marketplace/redhat-marketplace-6lzvf] addLogicalPort took 121.279971ms
I1216 19:14:14.439708 1 pods.go:300] [openshift-marketplace/redhat-operators-w5qzm] addLogicalPort took 118.814712ms
I1216 19:14:14.449908 1 pods.go:300] [openshift-marketplace/certified-operators-9dhg9] addLogicalPort took 116.228607ms
I1216 19:14:14.463422 1 pods.go:300] [openshift-marketplace/community-operators-qd59v] addLogicalPort took 117.597242ms
I1216 19:15:00.263116 1 pods.go:300] [openshift-multus/ip-reconciler-27328035--1-ghxwz] addLogicalPort took 112.461108ms
I1216 19:15:00.269914 1 pods.go:300] [openshift-operator-lifecycle-manager/collect-profiles-27328035--1-hgh5k] addLogicalPort took 115.551699ms
I1216 19:41:36.790291 1 pods.go:300] [boatload-6/boatload-6-1-boatload-788dc74479-gw68m] addLogicalPort took 50.001351653s
I1216 19:42:36.792850 1 pods.go:300] [boatload-6/boatload-6-1-boatload-788dc74479-gw68m] addLogicalPort took 1m0.002430439s
I1216 19:42:56.793158 1 pods.go:300] [boatload-1/boatload-1-1-boatload-6c96b649bd-48bvl] addLogicalPort took 1m20.001660463s
I1216 19:43:16.793452 1 pods.go:300] [boatload-4/boatload-4-1-boatload-5f5f864d8f-778z4] addLogicalPort took 1m40.002229834s
I1216 19:43:36.794630 1 pods.go:300] [boatload-2/boatload-2-1-boatload-797b75b5c9-klz67] addLogicalPort took 2m0.00304807s
I1216 19:43:56.796481 1 pods.go:300] [boatload-3/boatload-3-1-boatload-f4cf55cdb-475p8] addLogicalPort took 2m20.004503398s

In this log, the addLogicalPort timings from a previous test take ~119ms; in the following test, which leaves every workload pod stuck in ContainerCreating, they take 1m or longer.
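For anyone repeating this analysis, the addLogicalPort timings can also be pulled straight from the leader's logs with something like the following. This is a rough sketch: the pod name is the one from this report and the ovnkube-master container name is assumed, so adjust both for your cluster (or grep a must-gather log file directly as above).

# Pull every addLogicalPort timing from the ovnkube-master leader's log.
oc logs -n openshift-ovn-kubernetes ovnkube-master-5kz84 -c ovnkube-master \
    | grep -i "addLogicalPort took"

# Flag only the pathological entries, i.e. anything that took minutes rather than
# milliseconds. (This misses second-scale entries such as "50.0s", so treat it as a
# quick filter, not an exhaustive one.)
oc logs -n openshift-ovn-kubernetes ovnkube-master-5kz84 -c ovnkube-master \
    | grep -iE "addLogicalPort took [0-9]+m[0-9]"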
After some time passes, looking at one specific pod:

# oc get po -n boatload-34
NAME                                      READY   STATUS              RESTARTS   AGE
boatload-34-1-boatload-7d58cbdfbb-bpvlr   0/1     ContainerCreating   0          6h33m

I1217 01:44:17.761433 1 pods.go:300] [boatload-34/boatload-34-1-boatload-7d58cbdfbb-bpvlr] addLogicalPort took 9m30.017726293s

We can see that this pod took 9m30s for the addLogicalPort entry in the log.

Events section from the pod description:

Events:
  Type     Reason                   Age                      From          Message
  ----     ------                   ----                     ----          -------
  Warning  ErrorAddingLogicalPort   9m19s (x3 over 4h12m)    controlplane  failed to ensure namespace locked: failed to create address set for namespace: boatload-34, error: failed to create new address set boatload-34_v4 (error in transact with ops [{Op:insert Table:Address_Set Row:map[external_ids:{GoMap:map[name:boatload-34_v4]} name:a1817380831482281572] Rows:[] Columns:[] Mutations:[] Timeout:0 Where:[] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}]: context deadline exceeded: while awaiting reconnection)
  Warning  FailedCreatePodSandBox   4m10s (x309 over 6h22m)  kubelet       (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_boatload-34-1-boatload-7d58cbdfbb-bpvlr_boatload-34_8c1ba5ef-f0c8-43b0-a73a-3662eb9f8735_0(b101d034c4bb7ed13ad02c24e58d622dc8a67a729c790c251c19bfe344837266): error adding pod boatload-34_boatload-34-1-boatload-7d58cbdfbb-bpvlr to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): [boatload-34/boatload-34-1-boatload-7d58cbdfbb-bpvlr/8c1ba5ef-f0c8-43b0-a73a-3662eb9f8735:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[boatload-34/boatload-34-1-boatload-7d58cbdfbb-bpvlr b101d034c4bb7ed13ad02c24e58d622dc8a67a729c790c251c19bfe344837266] [boatload-34/boatload-34-1-boatload-7d58cbdfbb-bpvlr b101d034c4bb7ed13ad02c24e58d622dc8a67a729c790c251c19bfe344837266] failed to get pod annotation: timed out waiting for annotations: context deadline exceeded

It seems context deadlines were exceeded while awaiting reconnection. Later on, in order to clear the workload, I had to reboot each master in serial; after that I was eventually able to delete all workload pods, and other short-lived openshift pods (such as job pods) were able to come and go too. (Not just the workload pods had ended up stuck in ContainerCreating.)

Marking this as blocker+ since without the PRs there is a bad regression in pod creation latency. @dcbw: hope you agree.
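For reference, the stuck pods above and the per-node Running counts used in the verification below can be gathered with something like the following. This is a rough sketch using standard oc/kubectl output; the pod and namespace names are taken from this report.

# List every pod stuck in ContainerCreating across the cluster.
oc get pods -A --no-headers | grep ContainerCreating

# Pull the events for one of the stuck pods (names from this report).
oc describe pod -n boatload-34 boatload-34-1-boatload-7d58cbdfbb-bpvlr

# Count Running pods per node (jsonpath filter on status.phase, grouped with sort/uniq).
oc get pods -A -o jsonpath='{range .items[?(@.status.phase=="Running")]}{.spec.nodeName}{"\n"}{end}' \
    | sort | uniq -c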
Verified on 4.10.0-0.nightly-2022-01-31-012936 on IBM Cloud. 10 workers, 250 pods on 5 nodes.

Running pods/node:
236 rbrattai-o410i11-kvds9-worker-1-b66hx
231 rbrattai-o410i11-kvds9-worker-1-h6zdr
237 rbrattai-o410i11-kvds9-worker-1-j9js4
237 rbrattai-o410i11-kvds9-worker-2-6l5tt
236 rbrattai-o410i11-kvds9-worker-2-8ggpw

Largest addLogicalPort times:
node-density-287] addLogicalPort took 470.708987ms
node-density-978] addLogicalPort took 493.649061ms
node-density-785] addLogicalPort took 511.898523ms
node-density-586] addLogicalPort took 513.501564ms
node-density-284] addLogicalPort took 537.483542ms
node-density-38] addLogicalPort took 548.649217ms
node-density-976] addLogicalPort took 554.633343ms
node-density-283] addLogicalPort took 577.463732ms
node-density-148] addLogicalPort took 600.974141ms
node-density-145] addLogicalPort took 619.917334ms
node-density-37] addLogicalPort took 639.893066ms
node-density-422] addLogicalPort took 686.451966ms
node-density-424] addLogicalPort took 716.245933ms
node-density-146] addLogicalPort took 727.488826ms
node-density-36] addLogicalPort took 764.081905ms
node-density-977] addLogicalPort took 769.694995ms
node-density-423] addLogicalPort took 771.642659ms
node-density-585] addLogicalPort took 785.058304ms

On AWS, 4.10.0-0.nightly-2022-01-31-012936, 9 workers, 250 pods on 5 workers.

Running pods/node:
236 rbrattai-o410i11-kvds9-worker-1-b66hx
231 rbrattai-o410i11-kvds9-worker-1-h6zdr
237 rbrattai-o410i11-kvds9-worker-1-j9js4
237 rbrattai-o410i11-kvds9-worker-2-6l5tt
236 rbrattai-o410i11-kvds9-worker-2-8ggpw

Largest addLogicalPort times:
node-density-514] addLogicalPort took 164.437205ms
node-density-30] addLogicalPort took 166.760694ms
node-density-833] addLogicalPort took 185.298644ms
node-density-739] addLogicalPort took 189.04798ms
node-density-831] addLogicalPort took 189.115347ms
node-density-735] addLogicalPort took 218.693352ms
node-density-738] addLogicalPort took 224.195572ms
node-density-732] addLogicalPort took 276.709528ms
node-density-734] addLogicalPort took 278.150532ms
node-density-737] addLogicalPort took 316.868319ms
node-density-736] addLogicalPort took 339.96229ms
node-density-733] addLogicalPort took 352.281789ms

Correction: on AWS, 4.10.0-0.nightly-2022-01-31-012936, 9 workers, 250 pods on 5 workers.

Running pods/node:
237 ip-10-0-139-124.us-east-2.compute.internal
232 ip-10-0-142-79.us-east-2.compute.internal
236 ip-10-0-154-66.us-east-2.compute.internal
231 ip-10-0-166-221.us-east-2.compute.internal
237 ip-10-0-176-104.us-east-2.compute.internal

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056