Bug 1986946
| Summary: | High ICNI2 application pod creation times | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Jose Castillo Lema <jlema> |
| Component: | Networking | Assignee: | Surya Seetharaman <surya> |
| Networking sub component: | ovn-kubernetes | QA Contact: | Yurii Prokulevych <yprokule> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | high | | |
| Priority: | high | CC: | dblack, smalleni, surya, trozet, yprokule |
| Version: | 4.7 | Keywords: | FastFix |
| Target Milestone: | --- | Flags: | jlema: needinfo- |
| Target Release: | 4.9.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1996739 (view as bug list) | Environment: | |
| Last Closed: | 2021-10-18 17:42:58 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1996739 | | |
Description
Jose Castillo Lema
2021-07-28 15:05:31 UTC
Lots of N/S leader elections during the 120 BM node / 25 pods/worker test.

(In reply to Jose Castillo Lema from comment #0)

> - Lots of nbctl commands run in the ovnkube-masters during the test:
> 2021-07-28T11:15:07.403Z|921664|nbctl|INFO|Running command run --may-exist --bfd --policy=src-ip --ecmp-symmetric-reply -- lr-route-add GR_worker064-fc640 10.129.47.201/32 192.168.217.71 rtoe-GR_worker064-fc640
> 2021-07-28T11:15:07.422Z|921665|nbctl|INFO|Running command run -- lr-policy-add ovn_cluster_router 501 "inport == \"rtos-worker064-fc640\" && ip4.src == 10.129.47.201 && ip4.dst != 10.128.0.0/14" reroute 100.64.0.64

AFAICS, these are just the nbctl daemon logs; since we pass "--may-exist", they don't actually do anything. The policies are created only once, when the pod gets created, and the rest of the time these commands are no-ops, so they should not affect the perf. On the other hand, I am working on PRs to reduce the number of lflows and openflows created by the routes and policies. Hopefully this should have some effect at least.

> - nbdb and sbdb show a considerably higher memory usage

The priority-501 LRP with the "ip4.dst != ..." match can cause a lot of flow creations. For the ~2800 pods we were using, it would create 2800 lflows and 2800*21 = ~60,000 openflows during pod creation. With the fixes, we are hoping to bring this down to ~120 lflows (equal to the number of nodes in the cluster) and ~5,000 openflows in total.

We are seeing multiple ecmp/policy adds happening for a single exgw pod creation. This is because we are calling addPodExternalGW from both addLogicalPort and ensurePod. This means we do multiple adds for each pod, and that can really affect the pod latency.

```go
// ensurePod tries to set up a pod. It returns success or failure; failure
// indicates the pod should be retried later.
func (oc *Controller) ensurePod(oldPod, pod *kapi.Pod, addPort bool) bool {
	// Try unscheduled pods later
	if !util.PodScheduled(pod) {
		return false
	}

	if oldPod != nil && (exGatewayAnnotationsChanged(oldPod, pod) || networkStatusAnnotationsChanged(oldPod, pod)) {
		// No matter if a pod is ovn networked, or host networked, we still need to check for exgw
		// annotations. If the pod is ovn networked and is in update reschedule, addLogicalPort will take
		// care of updating the exgw updates
		oc.deletePodExternalGW(oldPod)
	}

	if util.PodWantsNetwork(pod) && addPort {
		if err := oc.addLogicalPort(pod); err != nil {
			klog.Errorf(err.Error())
			oc.recordPodEvent(err, pod)
			return false
		}
	} else {
		// *** this code is wrong. We see this code getting called for every update.
		if err := oc.addPodExternalGW(pod); err != nil {
			klog.Errorf(err.Error())
			oc.recordPodEvent(err, pod)
			return false
		}
	}
	return true
}
```

We see this happening three times per pod creation.
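A minimal sketch of one way to de-duplicate those repeated adds, assuming a hypothetical per-pod cache (the exgwCache type and addPodExternalGWOnce helper here are illustrative names, not the fix that was actually merged): the handler only touches OVN when the pod's gateway has not been programmed yet or has changed, so repeated ensurePod/addLogicalPort invocations for the same pod become no-ops.

```go
package main

import "fmt"

// exgwCache is a hypothetical de-duplication cache keyed by "namespace/name".
// It records the gateway IP already programmed for each exgw pod so repeated
// handler calls skip the lr-route-add / lr-policy-add work.
type exgwCache map[string]string

// addPodExternalGWOnce simulates the OVN programming, but only when the pod's
// gateway is new or has changed since the last call.
func (c exgwCache) addPodExternalGWOnce(key, gwIP string) {
	if prev, ok := c[key]; ok && prev == gwIP {
		// Already programmed with the same gateway: nothing to do.
		return
	}
	fmt.Printf("programming ecmp route + 501 policy for %s via %s\n", key, gwIP)
	c[key] = gwIP
}

func main() {
	cache := exgwCache{}
	// Simulate the three calls observed per pod creation; only the first
	// one reaches OVN.
	for i := 0; i < 3; i++ {
		cache.addPodExternalGWOnce("served-ns-9/pod-serving-9-2-serving-job", "192.168.218.9")
	}
}
```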
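Separately, for the lflow/openflow reduction mentioned above (going from one priority-501 policy per pod down to roughly one per node), here is a rough Go sketch of one shape such a change could take: a single per-node reroute policy whose source match references an OVN address set of pod IPs, so pod churn only updates the address set. The function names, the address-set naming, and the address-set approach itself are assumptions for illustration, not a description of the referenced PRs.

```go
package main

import (
	"fmt"
	"strings"
)

// perPodMatch is the current behaviour: one priority-501 logical router
// policy (and its lflows) per exgw-served pod.
func perPodMatch(node, podIP, clusterSubnet string) string {
	return fmt.Sprintf(`inport == "rtos-%s" && ip4.src == %s && ip4.dst != %s`, node, podIP, clusterSubnet)
}

// perNodeMatch is the sketched alternative: one priority-501 policy per node
// whose source match references an address set ($name) holding the pod IPs.
func perNodeMatch(node, addressSet, clusterSubnet string) string {
	return fmt.Sprintf(`inport == "rtos-%s" && ip4.src == $%s && ip4.dst != %s`, node, addressSet, clusterSubnet)
}

func main() {
	node := "worker080-r640"
	clusterSubnet := "10.128.0.0/14"
	podIPs := []string{"10.130.38.10", "10.130.38.11", "10.130.38.12"}

	// Today: one policy per pod -> O(pods) lflows.
	for _, ip := range podIPs {
		fmt.Println("per-pod:  ", perPodMatch(node, ip, clusterSubnet))
	}

	// Sketch: one policy per node -> O(nodes) lflows; adding a pod only
	// changes the address set contents.
	fmt.Println("per-node: ", perNodeMatch(node, "exgw_src_"+strings.ReplaceAll(node, "-", "_"), clusterSubnet))
}
```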
Example from the ovnkube-master logs, showing the route and policy being added three times for the same exgw pod:

```
I0823 21:15:24.588731 1 egressgw.go:38] External gateway pod: pod-serving-9-2-serving-job, detected for namespace(s) served-ns-9
2021-08-23T21:15:24.591Z|33431|nbctl|INFO|Running command run --may-exist --bfd --policy=src-ip --ecmp-symmetric-reply -- lr-route-add GR_worker080-r640 10.130.38.10/32 192.168.218.9 rtoe-GR_worker080-r640
2021-08-23T21:15:24.597Z|33433|nbctl|INFO|Running command run -- lr-policy-add ovn_cluster_router 501 "inport == \"rtos-worker080-r640\" && ip4.src == 10.130.38.10 && ip4.dst != 10.128.0.0/14" reroute 100.64.0.80
I0823 21:15:24.616615 1 egressgw.go:38] External gateway pod: pod-serving-9-2-serving-job, detected for namespace(s) served-ns-9
2021-08-23T21:15:24.619Z|33434|nbctl|INFO|Running command run --may-exist --bfd --policy=src-ip --ecmp-symmetric-reply -- lr-route-add GR_worker080-r640 10.130.38.10/32 192.168.218.9 rtoe-GR_worker080-r640
2021-08-23T21:15:24.624Z|33435|nbctl|INFO|Running command run -- lr-policy-add ovn_cluster_router 501 "inport == \"rtos-worker080-r640\" && ip4.src == 10.130.38.10 && ip4.dst != 10.128.0.0/14" reroute 100.64.0.80
I0823 21:15:24.624864 1 egressgw.go:38] External gateway pod: pod-serving-9-2-serving-job, detected for namespace(s) served-ns-9
2021-08-23T21:15:24.626Z|33436|nbctl|INFO|Running command run --may-exist --bfd --policy=src-ip --ecmp-symmetric-reply -- lr-route-add GR_worker080-r640 10.130.38.10/32 192.168.218.9 rtoe-GR_worker080-r640
2021-08-23T21:15:24.631Z|33437|nbctl|INFO|Running command run -- lr-policy-add ovn_cluster_router 501 "inport == \"rtos-worker080-r640\" && ip4.src == 10.130.38.10 && ip4.dst != 10.128.0.0/14" reroute 100.64.0.80
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759