Description of problem:

Creating 25 application pods/worker with a QPS of 20 in a 120 node environment presents a P99 Container Ready latency of 13.71 mins. Without ICNI2 annotations, creating 125 pods/worker has a 39.3 sec P99 Container Ready latency.

Version-Release number of selected component (if applicable):
- OCP 4.8.0-fc.9
- local gateway
- ovn2.13-20.12.0-25.el8fdp.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create several serving (lb) pods with proper ICNI2 annotations (40 in our scenario)
2. Create 25 app pods/worker

Additional info:
- Lots of pods stuck in ContainerCreating with "Waiting for annotations" events
- Lots of nbctl commands run in the ovnkube-masters during the test:
  2021-07-28T11:15:07.403Z|921664|nbctl|INFO|Running command run --may-exist --bfd --policy=src-ip --ecmp-symmetric-reply -- lr-route-add GR_worker064-fc640 10.129.47.201/32 192.168.217.71 rtoe-GR_worker064-fc640
  2021-07-28T11:15:07.422Z|921665|nbctl|INFO|Running command run -- lr-policy-add ovn_cluster_router 501 "inport == \"rtos-worker064-fc640\" && ip4.src == 10.129.47.201 && ip4.dst != 10.128.0.0/14" reroute 100.64.0.64
- ovnkube-master, nbdb and northd show considerably higher CPU utilisation during the ICNI2 experiments
- nbdb and sbdb show considerably higher memory usage
Lots of N/S leader elections during the 120 BM node / 25 pods/worker test.
(In reply to Jose Castillo Lema from comment #0)
> - Lots of nbctl commands run in the ovnkube-masters during the test:
>   2021-07-28T11:15:07.403Z|921664|nbctl|INFO|Running command run --may-exist --bfd --policy=src-ip --ecmp-symmetric-reply -- lr-route-add GR_worker064-fc640 10.129.47.201/32 192.168.217.71 rtoe-GR_worker064-fc640
>   2021-07-28T11:15:07.422Z|921665|nbctl|INFO|Running command run -- lr-policy-add ovn_cluster_router 501 "inport == \"rtos-worker064-fc640\" && ip4.src == 10.129.47.201 && ip4.dst != 10.128.0.0/14" reroute 100.64.0.64

AFAICS, these are just the nbctl daemon logs. Since we pass "--may-exist", they don't actually do anything: the policies are created only once, when the pod gets created, and every subsequent run is a no-op, so this shouldn't affect performance. On the other hand, I am working on PRs to reduce the number of lflows and openflows created by the routes and policies. Hopefully this will have some effect at least.

> - nbdb and sbdb show a considerably higher memory usage

The 501 LRP with the != match can cause a lot of flow creations. For the ~2800 pods we were using, it would create 2800 lflows and 2800*21 = ~60,000 openflows during pod creation. With the fixes, we are hoping to bring this down to ~120 lflows (equal to the number of nodes in the cluster) and ~5000 openflows in total.
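The flow-count arithmetic above can be sketched as follows. The pod count (2800), node count (120), and the ~21 openflows expanded per per-pod lflow are taken from the comment; the function names and the per-node openflow figure (~5000/120 ≈ 42) are illustrative back-of-the-envelope values, not actual ovn-kubernetes code:

```go
package main

import "fmt"

// Before the fix: one 501 logical-router policy per pod, each
// expanding to ~21 openflows.
func flowsPerPodPolicy(pods, openflowsPerLflow int) (lflows, openflows int) {
	return pods, pods * openflowsPerLflow
}

// After the fix: policies aggregated per node instead of per pod.
func flowsPerNodePolicy(nodes, openflowsPerNode int) (lflows, openflows int) {
	return nodes, nodes * openflowsPerNode
}

func main() {
	l, o := flowsPerPodPolicy(2800, 21)
	fmt.Printf("before: %d lflows, ~%d openflows\n", l, o) // 2800 lflows, ~58800 openflows
	l, o = flowsPerNodePolicy(120, 42)
	fmt.Printf("after:  %d lflows, ~%d openflows\n", l, o) // 120 lflows, ~5040 openflows
}
```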
We are seeing multiple ecmp/policy adds happening for a single exgw pod creation. This is because we are calling addPodExternalGW from both addLogicalPort and ensurePod. This means we do multiple adds for each pod, and that can really affect the pod latency.

// ensurePod tries to set up a pod. It returns success or failure; failure
// indicates the pod should be retried later.
func (oc *Controller) ensurePod(oldPod, pod *kapi.Pod, addPort bool) bool {
	// Try unscheduled pods later
	if !util.PodScheduled(pod) {
		return false
	}

	if oldPod != nil && (exGatewayAnnotationsChanged(oldPod, pod) || networkStatusAnnotationsChanged(oldPod, pod)) {
		// No matter if a pod is ovn networked, or host networked, we still need to check for exgw
		// annotations. If the pod is ovn networked and is in update reschedule, addLogicalPort will take
		// care of updating the exgw updates
		oc.deletePodExternalGW(oldPod)
	}

	if util.PodWantsNetwork(pod) && addPort {
		if err := oc.addLogicalPort(pod); err != nil {
			klog.Errorf(err.Error())
			oc.recordPodEvent(err, pod)
			return false
		}
	} else { // *** this branch is wrong: we see it getting called for every update
		if err := oc.addPodExternalGW(pod); err != nil {
			klog.Errorf(err.Error())
			oc.recordPodEvent(err, pod)
			return false
		}
	}
	return true
}
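One way to avoid the duplicate adds is to make the external-gateway add itself idempotent in the controller, so that being reached from both ensurePod and addLogicalPort programs the route/policy only once per pod. A minimal sketch of that idea follows; the Controller type, the guard map, and the call counter here are illustrative and are not the actual ovn-kubernetes fix:

```go
package main

import (
	"fmt"
	"sync"
)

// Controller is an illustrative stand-in for the ovnkube-master controller.
type Controller struct {
	mu        sync.Mutex
	exGWAdded map[string]bool // podKey -> exgw routes already programmed
	addCalls  int             // counts real nbctl-level adds (for the sketch)
}

func NewController() *Controller {
	return &Controller{exGWAdded: make(map[string]bool)}
}

// addPodExternalGW returns early when the pod's exgw routes/policies
// have already been programmed, so repeat invocations are true no-ops
// in the controller rather than redundant nbctl runs.
func (c *Controller) addPodExternalGW(podKey string) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.exGWAdded[podKey] {
		return nil // no-op on repeat calls
	}
	// ... run lr-route-add / lr-policy-add here ...
	c.addCalls++
	c.exGWAdded[podKey] = true
	return nil
}

func main() {
	c := NewController()
	// Same pod handled three times, as seen in the logs below.
	for i := 0; i < 3; i++ {
		_ = c.addPodExternalGW("served-ns-9/pod-serving-9-2-serving-job")
	}
	fmt.Println("real adds:", c.addCalls) // real adds: 1
}
```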
We see this happening three times per pod creation:

I0823 21:15:24.588731       1 egressgw.go:38] External gateway pod: pod-serving-9-2-serving-job, detected for namespace(s) served-ns-9
2021-08-23T21:15:24.591Z|33431|nbctl|INFO|Running command run --may-exist --bfd --policy=src-ip --ecmp-symmetric-reply -- lr-route-add GR_worker080-r640 10.130.38.10/32 192.168.218.9 rtoe-GR_worker080-r640
2021-08-23T21:15:24.597Z|33433|nbctl|INFO|Running command run -- lr-policy-add ovn_cluster_router 501 "inport == \"rtos-worker080-r640\" && ip4.src == 10.130.38.10 && ip4.dst != 10.128.0.0/14" reroute 100.64.0.80
I0823 21:15:24.616615       1 egressgw.go:38] External gateway pod: pod-serving-9-2-serving-job, detected for namespace(s) served-ns-9
2021-08-23T21:15:24.619Z|33434|nbctl|INFO|Running command run --may-exist --bfd --policy=src-ip --ecmp-symmetric-reply -- lr-route-add GR_worker080-r640 10.130.38.10/32 192.168.218.9 rtoe-GR_worker080-r640
2021-08-23T21:15:24.624Z|33435|nbctl|INFO|Running command run -- lr-policy-add ovn_cluster_router 501 "inport == \"rtos-worker080-r640\" && ip4.src == 10.130.38.10 && ip4.dst != 10.128.0.0/14" reroute 100.64.0.80
I0823 21:15:24.624864       1 egressgw.go:38] External gateway pod: pod-serving-9-2-serving-job, detected for namespace(s) served-ns-9
2021-08-23T21:15:24.626Z|33436|nbctl|INFO|Running command run --may-exist --bfd --policy=src-ip --ecmp-symmetric-reply -- lr-route-add GR_worker080-r640 10.130.38.10/32 192.168.218.9 rtoe-GR_worker080-r640
2021-08-23T21:15:24.631Z|33437|nbctl|INFO|Running command run -- lr-policy-add ovn_cluster_router 501 "inport == \"rtos-worker080-r640\" && ip4.src == 10.130.38.10 && ip4.dst != 10.128.0.0/14" reroute 100.64.0.80
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759