Description of problem:
On an ARO 4.9.9 cluster, router-perf-test (http, edge, passthrough, re-encrypt) results are very low compared to a cluster running the OpenShiftSDN network plugin on the same release and platform.

Version-Release number of selected component (if applicable): 4.9.9 (GA)

How reproducible: Always, on a 24- or 27-node cluster

Steps to Reproduce:
1. Deploy a healthy ARO 4.9.9 cluster with the OVNKubernetes network plugin and 24 workers.
2. Run the router-perf-v2 workload (https://github.com/cloud-bulldozer/e2e-benchmarking/tree/master/workloads/router-perf-v2), which creates 2k pods, services, and routes, then drives client traffic across the route endpoints to find the maximum req/s and latency.
3. Compare the results with a cluster using OpenShiftSDN.

Actual results:
Running HTTP traffic across 500 endpoints on OVNKubernetes from a single client pod (with 1 keepalive session) reaches only 17 req/s, whereas OpenShiftSDN reaches up to 12.6k req/s for similar traffic.

Expected results:

Additional info:
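To put the gap in Actual results in perspective, a quick shell calculation on the two throughput numbers reported above (17 req/s vs 12.6k req/s):

```shell
# Throughput numbers copied from the report: OVNKubernetes vs
# OpenShiftSDN, single client pod, 1 keepalive session, 500 endpoints.
ovn_rps=17
sdn_rps=12600
# Integer ratio: how many times faster SDN was in this test.
echo "$((sdn_rps / ovn_rps))x"   # prints 741x
```

So this is not a few-percent regression but a roughly three-orders-of-magnitude drop in the single-session case.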
This is a dupe of https://bugzilla.redhat.com/show_bug.cgi?id=2040594

On ARO 4.9.9, when I curl a route:

sh-4.4# tcpdump -i any -nneepv | grep 20.106.0.19
dropped privs to tcpdump
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked v1), capture size 262144 bytes
168.63.129.16.53 > 10.0.128.5.49588: 25316 1/0/0 hello-openshift-hello-openshift.apps.ci-ln-yrzd8tb-1d09d.ci.azure.devcluster.openshift.com. A 20.106.0.19 (124)
168.63.129.16.53 > 10.0.128.5.49588: 25316 1/0/0 hello-openshift-hello-openshift.apps.ci-ln-yrzd8tb-1d09d.ci.azure.devcluster.openshift.com. A 20.106.0.19 (124)
10.0.128.5.51404 > 20.106.0.19.80: Flags [S], cksum 0x9eb0 (incorrect -> 0x47c5), seq 3690511438, win 29200, options [mss 1460,sackOK,TS val 1586573235 ecr 0,nop,wscale 7], length 0
10.0.128.5.51404 > 20.106.0.19.80: Flags [S], cksum 0x47c5 (correct), seq 3690511438, win 29200, options [mss 1460,sackOK,TS val 1586573235 ecr 0,nop,wscale 7], length 0
20.106.0.19.80 > 10.0.128.5.51404: Flags [S.], cksum 0x2f66 (correct), seq 3282724326, ack 3690511439, win 26960, options [mss 1340,sackOK,TS val 3745192121 ecr 1586573235,nop,wscale 7], length 0
20.106.0.19.80 > 10.0.128.5.51404: Flags [S.], cksum 0x2f66 (correct), seq 3282724326, ack 3690511439, win 26960, options [mss 1340,sackOK,TS val 3745192121 ecr 1586573235,nop,wscale 7], length 0
10.0.128.5.51404 > 20.106.0.19.80: Flags [.], cksum 0x9ea8 (incorrect -> 0xc622), ack 1, win 229, options [nop,nop,TS val 1586573238 ecr 3745192121], length 0
10.0.128.5.51404 > 20.106.0.19.80: Flags [P.], cksum 0x9f42 (incorrect -> 0x57c3), seq 1:155, ack 1, win 229, options [nop,nop,TS val 1586573238 ecr 3745192121], length 154: HTTP, length: 154

Traffic is leaving the cluster towards the public DNS server, which resolves the public route hostname to the ingress load balancer VIP provided by Azure, and the host ends up communicating directly with the load balancer.
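To make the handshake in the capture above easier to see, here is a small sketch that filters the quoted tcpdump lines down to their TCP flags (the capture text is inlined and abridged here; on a node you would pipe tcpdump itself through the same grep):

```shell
# tcpdump lines quoted (abridged) from the capture above. The full
# SYN / SYN-ACK / ACK / PSH sequence shows the host completing a TCP
# session directly with the external LB VIP 20.106.0.19.
capture='10.0.128.5.51404 > 20.106.0.19.80: Flags [S], seq 3690511438
20.106.0.19.80 > 10.0.128.5.51404: Flags [S.], seq 3282724326, ack 3690511439
10.0.128.5.51404 > 20.106.0.19.80: Flags [.], ack 1
10.0.128.5.51404 > 20.106.0.19.80: Flags [P.], seq 1:155, ack 1'
echo "$capture" | grep -o 'Flags \[[^]]*\]'
# prints:
#   Flags [S]
#   Flags [S.]
#   Flags [.]
#   Flags [P.]
```

In other words, the node itself opens a TCP connection to the public VIP, which is why this traffic is outside the cluster's control.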
20.106.0.19 is the load balancer VIP, 10.0.128.5 is the host IP, and 168.63.129.16 is the DNS server; this dump was taken while a route was being curled. So on 4.9.9 the traffic is essentially not under our control, since it leaves the cluster (which it would not on SDN, thanks to its iptables rules).

However, on ARO > 4.9.23, when I curl a route:

sh-4.4# tcpdump -i any -nneepv | grep 13.86.88.185
dropped privs to tcpdump
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked v1), capture size 262144 bytes
10.0.2.34.53 > 10.0.2.34.40996: 12089* 1/0/0 hello-openshift-hello-openshift.apps.rhx9q0yj.centralus.aroapp.io. A 13.86.88.185 (99)

13.86.88.185 is the load balancer VIP Azure provides for the ingress service, and 10.0.2.34 is the host IP, which is also serving DNS on port 53. We now have an iptables rule on the host:

[2:120] -A OVN-KUBE-EXTERNALIP -d 13.86.88.185/32 -p tcp -m tcp --dport 80 -j DNAT --to-destination 172.30.223.227:80

that ensures the traffic doesn't have to leave the cluster at all. So performance is on par with SDN, which does the same thing (or almost the same). On the newer ARO versions:

OVNKubernetes is serving 16.8k req/s
SDN is serving 19.4k req/s

Not sure if we need to investigate this small ~3k gap: https://grafana.apps.observability.perfscale.devcluster.openshift.com/d/KbKAA4fnK/ingress-performance?orgId=1&from=now-7d&to=now&var-datasource=Observability%20-%20ingress%20performance&var-keepalive=All&var-termination=All&var-cluster_name=mukrishn-aro-ovn-hcxwt&var-cluster_name=mukrishn-aro-sdn-kpftx&var-uuid=All&var-sdn=All&var-ocp_version=All&var-platform=All&var-routes=500&var-conn_per_targetroute=1&var-conn_per_targetroute=20&var-group_by=cluster.name.keyword
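A minimal sketch of checking a node for the hairpin DNAT rule described above (on a live node the input would come from `iptables-save -t nat`; here the rule quoted in this comment is inlined, with the VIP and service IP from this cluster):

```shell
# NAT rule quoted from the comment above; on a node, replace the echo
# with: iptables-save -t nat
nat_rule='-A OVN-KUBE-EXTERNALIP -d 13.86.88.185/32 -p tcp -m tcp --dport 80 -j DNAT --to-destination 172.30.223.227:80'
# If this prints a --to-destination, traffic to the ingress VIP is
# DNATed to the in-cluster service IP and never leaves the cluster.
echo "$nat_rule" | grep 'OVN-KUBE-EXTERNALIP' | grep -o -- '--to-destination [0-9.:]*'
# prints: --to-destination 172.30.223.227:80
```

An empty result from the same check on a 4.9.9 node would match the behavior in the first capture, where the traffic goes out to the VIP directly.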
Closing this as a dupe of https://bugzilla.redhat.com/show_bug.cgi?id=2040594 based on https://coreos.slack.com/archives/CU9HKBZKJ/p1652383143251699?thread_ts=1652382186.170729&cid=CU9HKBZKJ

*** This bug has been marked as a duplicate of bug 2040594 ***