Bug 1986946 - High ICNI2 application pod creation times
Summary: High ICNI2 application pod creation times
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.9.0
Assignee: Surya Seetharaman
QA Contact: Yurii Prokulevych
URL:
Whiteboard:
Depends On:
Blocks: 1996739
 
Reported: 2021-07-28 15:05 UTC by Jose Castillo Lema
Modified: 2021-10-18 17:43 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1996739 (view as bug list)
Environment:
Last Closed: 2021-10-18 17:42:58 UTC
Target Upstream Version:
Embargoed:
jlema: needinfo-


Attachments


Links
System                  ID / Reference                      Last Updated
Github                  openshift ovn-kubernetes pull 691   2021-08-26 08:26:37 UTC
Red Hat Product Errata  RHSA-2021:3759                      2021-10-18 17:43:02 UTC

Description Jose Castillo Lema 2021-07-28 15:05:31 UTC
Description of problem:
Creating 25 application pods/worker with a QPS of 20 in a 120-node environment results in a P99 Container Ready latency of 13.71 min.
Without ICNI2 annotations, creating 125 pods/worker results in a P99 Container Ready latency of 39.3 sec.

Version-Release number of selected component (if applicable):
 - OCP 4.8.0-fc.9
 - local gateway
 - ovn2.13-20.12.0-25.el8fdp.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create several serving (lb) pods with proper ICNI2 annotations (40 in our scenario; see the annotation sketch after these steps)
2. Create 25 app pods/worker
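
(For reference, a minimal sketch of the kind of external-gateway annotations such a serving pod carries. The annotation keys are assumed from ovn-kubernetes' exgw/ICNI2 support; the namespace and network values below are purely illustrative and not taken from this report.)

// Hypothetical example: annotations on a serving (lb) pod.
package main

import "fmt"

func main() {
	servingPodAnnotations := map[string]string{
		// Namespaces whose pod traffic should be rerouted via this serving pod.
		"k8s.ovn.org/routing-namespaces": "served-ns-9",
		// Secondary network whose IP becomes the ECMP next hop (name is made up).
		"k8s.ovn.org/routing-network": "sriov-net",
		// Enables BFD on the route, matching the --bfd seen in the nbctl logs.
		"k8s.ovn.org/bfd-enabled": "true",
	}
	for k, v := range servingPodAnnotations {
		fmt.Printf("%s=%s\n", k, v)
	}
}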

Additional info:
 - Lots of pods stuck in ContainerCreating with "Waiting for annotations" events
 - Lots of nbctl commands run in the ovnkube-masters during the test:
    2021-07-28T11:15:07.403Z|921664|nbctl|INFO|Running command run --may-exist --bfd --policy=src-ip --ecmp-symmetric-reply -- lr-route-add GR_worker064-fc640 10.129.47.201/32 192.168.217.71 rtoe-GR_worker064-fc640
    2021-07-28T11:15:07.422Z|921665|nbctl|INFO|Running command run -- lr-policy-add ovn_cluster_router 501 "inport == \"rtos-worker064-fc640\" && ip4.src == 10.129.47.201 && ip4.dst != 10.128.0.0/14" reroute 100.64.0.64
 - ovnkube-master, nbdb and northd show a considerably higher CPU utilisation during the ICNI2 experiments
 - nbdb and sbdb show a considerably higher memory usage

Comment 2 Jose Castillo Lema 2021-07-28 15:31:48 UTC
Lots of N/S leader elections during the 120 BM node / 25 pods/worker test.

Comment 4 Surya Seetharaman 2021-07-30 08:29:24 UTC
(In reply to Jose Castillo Lema from comment #0)
>  - Lots of nbctl commands run in the ovnkube-masters during the test:
>     2021-07-28T11:15:07.403Z|921664|nbctl|INFO|Running command run
> --may-exist --bfd --policy=src-ip --ecmp-symmetric-reply -- lr-route-add
> GR_worker064-fc640 10.129.47.201/32 192.168.217.71 rtoe-GR_worker064-fc640
>     2021-07-28T11:15:07.422Z|921665|nbctl|INFO|Running command run --
> lr-policy-add ovn_cluster_router 501 "inport == \"rtos-worker064-fc640\" &&
> ip4.src == 10.129.47.201 && ip4.dst != 10.128.0.0/14" reroute 100.64.0.64


AFAICS, these are just the nbctl daemon logs; since we have "--may-exist", they don't actually do anything. The policies are created only once, when the pod gets created, and the rest of the time it's a no-op, so it shouldn't affect the perf. On the other hand, I am working on PRs to reduce the number of lflows and openflows created by the routes and policies. Hopefully that will have at least some effect.

>  - nbdb and sbdb show a considerably higher memory usage

The 501 LRP with the "ip4.dst !=" match can cause a lot of flow creations. For the ~2800 pods we were using, it would create 2800 lflows and 2800*21 ≈ 60,000 openflows during pod creation. With the fixes, we are hoping to bring this down to ~120 lflows (equal to the number of nodes in the cluster) and ~5,000 openflows in total.
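
(A quick back-of-envelope sketch of the arithmetic above, using the pod/node counts and the ~21 openflows-per-lflow factor quoted in this comment; these are only estimates.)

package main

import "fmt"

func main() {
	const openflowsPerLflow = 21 // rough factor quoted above
	pods, nodes := 2800, 120

	// Before the fixes: one 501 reroute LRP per exgw-served pod.
	beforeLflows := pods
	beforeOpenflows := beforeLflows * openflowsPerLflow // 2800 * 21 = 58,800, i.e. ~60k

	// After the fixes: policies consolidated to roughly one per node
	// (the comment above estimates ~5,000 openflows in total for this case).
	afterLflows := nodes

	fmt.Println(beforeLflows, beforeOpenflows, afterLflows)
}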

Comment 7 Surya Seetharaman 2021-08-24 17:52:15 UTC
We are seeing multiple ecmp/policy adds happening for a single exgw pod creation. This is because we are calling addPodExternalGW from both addLogicalPort and ensurePod. This means we do multiple adds for each pod, and that can really affect the pod latency.

// ensurePod tries to set up a pod. It returns success or failure; failure
// indicates the pod should be retried later.
func (oc *Controller) ensurePod(oldPod, pod *kapi.Pod, addPort bool) bool {
	// Try unscheduled pods later
	if !util.PodScheduled(pod) {
		return false
	}

	if oldPod != nil && (exGatewayAnnotationsChanged(oldPod, pod) || networkStatusAnnotationsChanged(oldPod, pod)) {
		// No matter if a pod is ovn networked, or host networked, we still need to check for exgw
		// annotations. If the pod is ovn networked and is in update reschedule, addLogicalPort will take
		// care of updating the exgw updates
		oc.deletePodExternalGW(oldPod)
	}

	if util.PodWantsNetwork(pod) && addPort {
		if err := oc.addLogicalPort(pod); err != nil {
			klog.Errorf(err.Error())
			oc.recordPodEvent(err, pod)
			return false
		}
	} else {
		// *** this code is wrong. We see this branch getting called for every update.
		if err := oc.addPodExternalGW(pod); err != nil {
			klog.Errorf(err.Error())
			oc.recordPodEvent(err, pod)
			return false
		}
	}

	return true
}
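
(A minimal sketch of the direction implied above, covering only the branch in question: assuming, per this comment, that addLogicalPort already takes care of the exgw updates for OVN-networked pods, ensurePod would only need the explicit addPodExternalGW call for pods that do not want an OVN network. The actual change in the linked PR may differ.)

	if util.PodWantsNetwork(pod) && addPort {
		if err := oc.addLogicalPort(pod); err != nil {
			klog.Errorf(err.Error())
			oc.recordPodEvent(err, pod)
			return false
		}
	} else if !util.PodWantsNetwork(pod) {
		// Host-networked pods never go through addLogicalPort, so they still
		// need the exgw add here; OVN-networked pod updates no longer re-run it.
		if err := oc.addPodExternalGW(pod); err != nil {
			klog.Errorf(err.Error())
			oc.recordPodEvent(err, pod)
			return false
		}
	}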

Comment 8 Surya Seetharaman 2021-08-24 17:55:23 UTC
We see this happening three times per pod creation.

I0823 21:15:24.588731       1 egressgw.go:38] External gateway pod: pod-serving-9-2-serving-job, detected for namespace(s) served-ns-9
2021-08-23T21:15:24.591Z|33431|nbctl|INFO|Running command run --may-exist --bfd --policy=src-ip --ecmp-symmetric-reply -- lr-route-add GR_worker080-r640 10.130.38.10/32 192.168.218.9 rtoe-GR_worker080-r640
2021-08-23T21:15:24.597Z|33433|nbctl|INFO|Running command run -- lr-policy-add ovn_cluster_router 501 "inport == \"rtos-worker080-r640\" && ip4.src == 10.130.38.10 && ip4.dst != 10.128.0.0/14" reroute 100.64.0.80
I0823 21:15:24.616615       1 egressgw.go:38] External gateway pod: pod-serving-9-2-serving-job, detected for namespace(s) served-ns-9
2021-08-23T21:15:24.619Z|33434|nbctl|INFO|Running command run --may-exist --bfd --policy=src-ip --ecmp-symmetric-reply -- lr-route-add GR_worker080-r640 10.130.38.10/32 192.168.218.9 rtoe-GR_worker080-r640
2021-08-23T21:15:24.624Z|33435|nbctl|INFO|Running command run -- lr-policy-add ovn_cluster_router 501 "inport == \"rtos-worker080-r640\" && ip4.src == 10.130.38.10 && ip4.dst != 10.128.0.0/14" reroute 100.64.0.80
I0823 21:15:24.624864       1 egressgw.go:38] External gateway pod: pod-serving-9-2-serving-job, detected for namespace(s) served-ns-9
2021-08-23T21:15:24.626Z|33436|nbctl|INFO|Running command run --may-exist --bfd --policy=src-ip --ecmp-symmetric-reply -- lr-route-add GR_worker080-r640 10.130.38.10/32 192.168.218.9 rtoe-GR_worker080-r640
2021-08-23T21:15:24.631Z|33437|nbctl|INFO|Running command run -- lr-policy-add ovn_cluster_router 501 "inport == \"rtos-worker080-r640\" && ip4.src == 10.130.38.10 && ip4.dst != 10.128.0.0/14" reroute 100.64.0.80

Comment 46 errata-xmlrpc 2021-10-18 17:42:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

