Bug 2082236

Summary: [ARO] Performance issue (very low req/s) when running router-perf-tests on OVNK cluster
Product: OpenShift Container Platform
Reporter: Murali Krishnasamy <murali>
Component: Networking
Assignee: Surya Seetharaman <surya>
Networking sub component: ovn-kubernetes
QA Contact: Anurag saxena <anusaxen>
Status: CLOSED DUPLICATE
Severity: unspecified
Priority: unspecified
CC: smalleni, surya
Version: 4.9
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard: ovn-perfscale
Type: Bug
Regression: ---
Last Closed: 2022-05-13 08:37:32 UTC

Description Murali Krishnasamy 2022-05-05 15:51:55 UTC
Description of problem:
On an ARO 4.9.9 cluster, the router-perf-test results (http, edge, passthrough, reencrypt) are very low compared to a cluster running the OpenShiftSDN network plugin on the same release and platform.

Version-Release number of selected component (if applicable):
4.9.9 (GA)

How reproducible:
Always, on a 24- or 27-node cluster

Steps to Reproduce:
1. Deploy a healthy ARO 4.9.9 cluster with the OVNKubernetes network type and 24 workers
2. Start this workload (https://github.com/cloud-bulldozer/e2e-benchmarking/tree/master/workloads/router-perf-v2), which creates 2k pods, services, and routes, then drives client traffic across the route endpoints to find the maximum req/s and latency (a sketch of a typical invocation follows this list)
3. Compare the results with a cluster using the OpenShiftSDN network type
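
For reference, a minimal sketch of kicking off the workload in step 2. The entrypoint name and any tuning variables are illustrative, not authoritative; the repo's README has the actual invocation:

    # clone the benchmark harness and point it at the cluster under test
    git clone https://github.com/cloud-bulldozer/e2e-benchmarking
    cd e2e-benchmarking/workloads/router-perf-v2
    export KUBECONFIG=/path/to/aro-cluster/kubeconfig   # hypothetical path
    ./run.sh   # entrypoint name assumed from the repo layout; check the README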

Actual results:
Running HTTP traffic across 500 route endpoints from a single client pod (with 1 keepalive session), the OVNKubernetes cluster served only 17 req/s, whereas OpenShiftSDN reached up to 12.6k req/s for similar traffic.
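
(A rough sanity check on those numbers, my arithmetic rather than part of the original report: with a single keepalive connection, throughput is roughly the inverse of per-request latency, so 17 req/s implies about 1/17 ≈ 59 ms spent per request, versus roughly 0.08 ms at 12.6k req/s. That points at a large fixed per-request path cost, e.g. traffic hairpinning out of the cluster as diagnosed in comment 1, rather than a bandwidth limit.)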

Expected results:

Additional info:

Comment 1 Surya Seetharaman 2022-05-12 19:37:15 UTC
This is a dupe of https://bugzilla.redhat.com/show_bug.cgi?id=2040594

On ARO 4.9.9:
when I curl a route:
sh-4.4# tcpdump -i any -nneepv | grep 20.106.0.19
dropped privs to tcpdump
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked v1), capture size 262144 bytes
    168.63.129.16.53 > 10.0.128.5.49588: 25316 1/0/0 hello-openshift-hello-openshift.apps.ci-ln-yrzd8tb-1d09d.ci.azure.devcluster.openshift.com. A 20.106.0.19 (124)
    168.63.129.16.53 > 10.0.128.5.49588: 25316 1/0/0 hello-openshift-hello-openshift.apps.ci-ln-yrzd8tb-1d09d.ci.azure.devcluster.openshift.com. A 20.106.0.19 (124)
    10.0.128.5.51404 > 20.106.0.19.80: Flags [S], cksum 0x9eb0 (incorrect -> 0x47c5), seq 3690511438, win 29200, options [mss 1460,sackOK,TS val 1586573235 ecr 0,nop,wscale 7], length 0
    10.0.128.5.51404 > 20.106.0.19.80: Flags [S], cksum 0x47c5 (correct), seq 3690511438, win 29200, options [mss 1460,sackOK,TS val 1586573235 ecr 0,nop,wscale 7], length 0
    20.106.0.19.80 > 10.0.128.5.51404: Flags [S.], cksum 0x2f66 (correct), seq 3282724326, ack 3690511439, win 26960, options [mss 1340,sackOK,TS val 3745192121 ecr 1586573235,nop,wscale 7], length 0
    20.106.0.19.80 > 10.0.128.5.51404: Flags [S.], cksum 0x2f66 (correct), seq 3282724326, ack 3690511439, win 26960, options [mss 1340,sackOK,TS val 3745192121 ecr 1586573235,nop,wscale 7], length 0
    10.0.128.5.51404 > 20.106.0.19.80: Flags [.], cksum 0x9ea8 (incorrect -> 0xc622), ack 1, win 229, options [nop,nop,TS val 1586573238 ecr 3745192121], length 0
    10.0.128.5.51404 > 20.106.0.19.80: Flags [P.], cksum 0x9f42 (incorrect -> 0x57c3), seq 1:155, ack 1, win 229, options [nop,nop,TS val 1586573238 ecr 3745192121], length 154: HTTP, length: 154

Traffic leaves the cluster towards the public DNS server, which resolves the public route hostname to the ingress load balancer VIP provided by Azure, and the host then talks directly to that load balancer. In this dump, taken while a route was being curled, 20.106.0.19 is the load balancer VIP, 10.0.128.5 is the host IP, and 168.63.129.16 is the DNS server.

So essentially, on 4.9.9 the traffic is not under our control since it leaves the cluster (which it won't do on SDN, thanks to iptables rules).
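
One way to see the missing short-circuit from the node itself (a sketch; OVN-KUBE-EXTERNALIP is the chain ovn-kubernetes programs on the newer versions shown below, so on 4.9.9 this should come back empty or report the chain as missing):

    sh-4.4# iptables -t nat -S OVN-KUBE-EXTERNALIP | grep 20.106.0.19
    # no matching DNAT rule for the LB VIP on 4.9.9, hence the connection
    # genuinely leaves the host towards the Azure load balancer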


However, on ARO >4.9.23:
when I curl a route:
sh-4.4# tcpdump -i any -nneepv | grep 13.86.88.185
dropped privs to tcpdump
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked v1), capture size 262144 bytes
    10.0.2.34.53 > 10.0.2.34.40996: 12089* 1/0/0 hello-openshift-hello-openshift.apps.rhx9q0yj.centralus.aroapp.io. A 13.86.88.185 (99)

13.86.88.185 is the load balancer VIP provided by Azure for the ingress service, and 10.0.2.34 is the host IP, which is also serving DNS on port 53, so the lookup never leaves the node.
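
(A sketch of confirming that from the node, assuming dig is available there, e.g. via toolbox; the hostname is the one from the capture above:)

    sh-4.4# dig +short hello-openshift-hello-openshift.apps.rhx9q0yj.centralus.aroapp.io @10.0.2.34
    13.86.88.185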

However, we have iptables rules on the host:
[2:120] -A OVN-KUBE-EXTERNALIP -d 13.86.88.185/32 -p tcp -m tcp --dport 80 -j DNAT --to-destination 172.30.223.227:80

that ensure traffic doesn't have to leave the cluster at all. So this is on par, perf-wise, with SDN, which does the same thing (or almost the same). On the newer ARO versions:
OVNK is serving 16.8k req/s
SDN is serving 19.4k req/s
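
For completeness, a sketch of verifying that route traffic is actually taking the DNAT short-circuit on a newer node (my suggestion, not from the original capture): the packet/byte counters on that OVN-KUBE-EXTERNALIP rule should climb while the workload runs, with 172.30.223.227 presumably being the ClusterIP behind the ingress VIP.

    sh-4.4# iptables -t nat -L OVN-KUBE-EXTERNALIP -n -v | grep 13.86.88.185
    # non-zero pkts/bytes counters here mean route traffic is being rewritten
    # to the ClusterIP on the host instead of leaving towards the Azure LB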

Not sure if we need to investigate this small ~3k req/s gap: https://grafana.apps.observability.perfscale.devcluster.openshift.com/d/KbKAA4fnK/ingress-performance?orgId=1&from=now-7d&to=now&var-datasource=Observability%20-%20ingress%20performance&var-keepalive=All&var-termination=All&var-cluster_name=mukrishn-aro-ovn-hcxwt&var-cluster_name=mukrishn-aro-sdn-kpftx&var-uuid=All&var-sdn=All&var-ocp_version=All&var-platform=All&var-routes=500&var-conn_per_targetroute=1&var-conn_per_targetroute=20&var-group_by=cluster.name.keyword

Comment 2 Surya Seetharaman 2022-05-13 08:37:32 UTC
Closing this as a dupe of https://bugzilla.redhat.com/show_bug.cgi?id=2040594, based on https://coreos.slack.com/archives/CU9HKBZKJ/p1652383143251699?thread_ts=1652382186.170729&cid=CU9HKBZKJ

*** This bug has been marked as a duplicate of bug 2040594 ***