Bug 1869847 - 4.6 command line operators are slower by 25% from 4.5 [NEEDINFO]
Summary: 4.6 command line operators are slower by 25% from 4.5
Keywords:
Status: CLOSED DUPLICATE of bug 1899575
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: oc
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Maciej Szulik
QA Contact: Simon
URL:
Whiteboard: LifecycleStale
Depends On:
Blocks:
 
Reported: 2020-08-18 18:52 UTC by Paige Rubendall
Modified: 2021-01-25 11:28 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-01-21 11:54:10 UTC
Target Upstream Version:
Embargoed:
mfojtik: needinfo?


Attachments
Deployment config (4.65 KB, text/plain) - 2020-08-18 18:52 UTC, Paige Rubendall
configmap-post-apf-enabled-4.6 (40.67 KB, image/png) - 2020-08-28 18:05 UTC, Abu Kashem
configmap-post-apf-enabled-4.5 (45.83 KB, image/png) - 2020-08-28 18:06 UTC, Abu Kashem
configmap-post-apf-disabled-4.6 (44.49 KB, image/png) - 2020-08-28 18:07 UTC, Abu Kashem
configmap-post-apf-disabled-4.5 (43.42 KB, image/png) - 2020-08-28 18:07 UTC, Abu Kashem


Links:
Red Hat Bugzilla 1626291 (CLOSED): 4.0 command line operations are 2x - 4x slower than 3.11 (last updated 2023-09-14 04:34:23 UTC)

Description Paige Rubendall 2020-08-18 18:52:15 UTC
Created attachment 1711766 [details]
Deployment config

Description of problem:
While running performance tests on different oc commands, I am noticing a regression in the amount of time commands take to run. For small, individual commands the difference might not be noticeable, but at larger scale the slowdown becomes much greater.

From previous bugs we have found that, to best test this, all master and worker nodes need to be in the same region and AWS availability zone (the install config I used is attached).

1. Create 200 projects and create 1 dc in each project

time for i in {0..199}; do oc new-project --skip-config-write test$i; oc create -n test$i -f dc.json; done
4.5:  84.97s
4.5 different run: 65.11s
4.6:  109.11s -29%
4.6 different run: 118.84s

2. for each of 200 projects, get the dcs, sa and secrets

time for i in {0..199}; do oc get dc -n test$i; oc get sa -n test$i; oc get secrets -n test$i; done 
4.5: 96.93s 
4.5 different run: 85.597s
4.6: 108.64s -12%
4.6 different run: 111.42s


3. for each of 200 projects, scale the dc replicas to 0

time for i in {0..199}; do oc scale --replicas=0 dc/deploymentconfig1 -n test$i; done
4.5:  35.78s
4.5 different run: 33.005s
4.6:  41.16s -15%
4.6 different run: 42.16s

4. Delete 200 projects 

time for i in {0..199}; do oc delete project --wait=false test$i; done
4.5: 35.55s
4.5 different run: 25.964s
4.6: 43.39s -22%
4.6 different run: 42.88s


Version-Release number of selected component (if applicable): 4.6


How reproducible: 100%


Steps to Reproduce:
1. Install 4.5 and 4.6 clusters in AWS (I used OVN as well)
2. Run the commands from the description with dc.json (attached)


Actual results:
4.6 commands are about 15-25% slower

Expected results:
No regression in response time compared to 4.5 


Additional info:

Comment 4 Maciej Szulik 2020-08-21 14:11:15 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Comment 5 Stefan Schimanski 2020-08-26 16:02:43 UTC
We need an analysis ASAP of how much of the slowdown is server-side and how much is in oc. This has release-blocker potential.

Comment 6 Abu Kashem 2020-08-26 16:30:16 UTC
Hi prubenda,
Can you re-run the test and capture a dump of the Prometheus metrics for both 4.5 and 4.6? We would like to analyse the apiserver metrics side by side for 4.6 and 4.5. This will help answer whether the latency is from the server side.
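
For reference, a quick way to pull individual apiserver metrics for such a side-by-side comparison is to query the in-cluster monitoring stack directly. This is only a sketch; it assumes the default thanos-querier route in openshift-monitoring and a logged-in user with permission to query it:

# Query the apiserver request-duration histogram through the cluster monitoring
# stack; the metric name is the standard upstream one, the route is the default
# one in openshift-monitoring.
TOKEN=$(oc whoami -t)
HOST=$(oc get route thanos-querier -n openshift-monitoring -o jsonpath='{.spec.host}')
curl -sk -H "Authorization: Bearer $TOKEN" \
  "https://$HOST/api/v1/query?query=apiserver_request_duration_seconds_bucket"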

Comment 7 Paige Rubendall 2020-08-26 16:37:33 UTC
Yep will work on capturing that information for 4.5 and 4.6

Comment 8 Paige Rubendall 2020-08-26 19:03:42 UTC
Created 2 clusters (4.5 and 4.6) with all nodes in the same availability zone (us-east-1a) (3 workers and 3 masters)

The difference in times was not as large this time. I will give links to the Prometheus dumps in a private comment following this.


time -p for i in {0..199}; do oc new-project --skip-config-write test$i; oc create -n test$i -f dc.json; done
4.5:  89.09
4.6:  110.45 (-13.37%)


time -p for i in {0..199}; do oc get dc -n test$i; oc get sa -n test$i; oc get secrets -n test$i; done
4.5:  97.42
4.6:  110.45 (-13.37%)

time -p for i in {0..199}; do oc scale --replicas=0 dc/deploymentconfig1 -n test$i; done

4.5: 36.63
4.6: 41.36 (-12.91%)

time -p for i in {0..199}; do oc delete project --wait=false test$i; done
4.5: 36.32
4.6: 41.76 (-14.97%)

Comment 10 Paige Rubendall 2020-08-26 19:07:18 UTC
Correction for the time in the oc new-project line.


time -p for i in {0..199}; do oc new-project --skip-config-write test$i; oc create -n test$i -f dc.json; done
4.5:  89.09
4.6: 107.69 (-20.87%)

Comment 11 Stefan Schimanski 2020-08-26 19:47:27 UTC
From looking at the logs of the delete case, I see the following gap between the last log message of one command and the first of the next:
- 4.5: 130ms
- 4.6: 150ms, and sometimes even much more

200 x 20ms gives 4s. That is in the same order of magnitude as the observed extra runtime.

Comment 12 Stefan Schimanski 2020-08-26 19:49:07 UTC
@Paige, to see whether this is server- or client-side: can you run oc delete through xargs to avoid re-invocations of the same command? I bet the difference in runtime will go down.

Comment 13 Paige Rubendall 2020-08-26 20:30:01 UTC
I used the following command to delete using xargs. I'm not 100% sure this is the best way to construct the xargs command. It seems like this is doing the delete differently than the command I had run before: deleting like "oc delete project0 project1 ..." vs "oc delete project0", "oc delete project1", ... (a per-invocation variant is sketched after the timings below).


% time -p oc get projects -o name | grep test | xargs oc delete --wait=false

4.5
real 39.03
user 0.81
sys 0.19


4.6:
real 39.15
user 0.74
sys 0.22
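
For completeness, a hypothetical per-invocation variant that keeps xargs but still spawns one oc process per project would look like the line below; it isolates the per-invocation client startup cost that the bulk delete above avoids:

# xargs -n1 forces one "oc delete" invocation per project, so client startup and
# discovery cost is paid 200 times, just like in the original loop.
time -p oc get projects -o name | grep test | xargs -n1 oc delete --wait=false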

Comment 14 Abu Kashem 2020-08-27 09:13:31 UTC
Hi prubenda,
We are analyzing the prom dumps. In the meantime, I would like you to run the test again exactly as you did (with the same script) for the run with the prom dump, but this time with the Priority & Fairness feature of the kube-apiserver disabled for both 4.5 and 4.6.

Here is how you disable Priority & Fairness:
- "oc edit kubeapiserver/cluster -o yaml" and make the following change:

spec:
  unsupportedConfigOverrides:
    apiServerArguments:
      feature-gates:
      - APIPriorityAndFairness=false


Once you make the change, the kube-apiserver will go through a new revision. Once the new revision has rolled out to all 3 kube-apiserver pods, you can start the new run. Please collect a prom dump for the new run. I would like to know if the regression is similar.
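
If editing interactively is inconvenient, the same override can be applied non-interactively and the rollout watched roughly as follows. This is a sketch: the patch mirrors the YAML above, and the jsonpath assumes the usual static-pod operator status fields:

# Apply the same feature-gate override without the interactive editor.
oc patch kubeapiserver/cluster --type=merge -p \
  '{"spec":{"unsupportedConfigOverrides":{"apiServerArguments":{"feature-gates":["APIPriorityAndFairness=false"]}}}}'

# Check that every kube-apiserver node has reached the latest revision before
# starting the new run.
oc get kubeapiserver/cluster -o jsonpath='{.status.latestAvailableRevision}{" "}{.status.nodeStatuses[*].currentRevision}{"\n"}'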

Comment 15 Paige Rubendall 2020-08-27 14:06:05 UTC
The timings seem pretty similar, if not worse.

time -p for i in {0..199}; do oc new-project --skip-config-write test$i; oc create -n test$i -f dc.json; done
4.5 83.39
4.6 110.87 (-32.95%)


time -p for i in {0..199}; do oc get dc -n test$i; oc get sa -n test$i; oc get secrets -n test$i; done
4.5 95.67
4.6 109.68 (-14.64%)

time -p for i in {0..199}; do oc scale --replicas=0 dc/deploymentconfig1 -n test$i; done
4.5 35.81
4.6 41.10 (-14.77%)


time -p for i in {0..199}; do oc delete project --wait=false test$i; done
4.5 36.70
4.6 42.19 (-14.95%)

Comment 17 Abu Kashem 2020-08-27 15:17:00 UTC
Hi prubenda,
Thanks for the re-run. Can you share your test setup? I would like to run it against my dev cluster.

Comment 18 Paige Rubendall 2020-08-27 15:38:28 UTC
I am attaching the new install_config.yaml I used to create the cluster (minus the pull secret); please let me know if you can't see it. Note this is an OVN network type cluster.

The only thing you'll need for this test is the dc.json file, which should also be attached. I do no other setup for this test.

Comment 20 Abu Kashem 2020-08-28 18:05:37 UTC
Created attachment 1712995 [details]
configmap-post-apf-enabled-4.6

Comment 21 Abu Kashem 2020-08-28 18:06:48 UTC
Created attachment 1712996 [details]
configmap-post-apf-enabled-4.5

Comment 22 Abu Kashem 2020-08-28 18:07:19 UTC
Created attachment 1712998 [details]
configmap-post-apf-disabled-4.6

Comment 23 Abu Kashem 2020-08-28 18:07:45 UTC
Created attachment 1712999 [details]
configmap-post-apf-disabled-4.5

Comment 24 Abu Kashem 2020-08-28 18:15:30 UTC
We designed a test to see if there is a systemic latency regression between 4.5 and 4.6. We simplified the test to minimize noise. The test focuses on a single resource: we chose ConfigMap, hoping that it will attract minimal attention from the scheduler/controller manager.
The test creates/gets/deletes ConfigMaps in a dedicated namespace. This also ensures no etcd db growth.
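
A minimal sketch of that kind of single-resource loop (not the exact script we ran, just its shape):

# Create/get/delete ConfigMaps in one dedicated namespace so that nothing else
# (scheduler, controllers, etcd growth) gets involved.
oc create namespace cm-latency-test
for i in {0..199}; do
  oc create configmap cm-$i -n cm-latency-test --from-literal=key=value-$i
  oc get configmap cm-$i -n cm-latency-test
  oc delete configmap cm-$i -n cm-latency-test
done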

We saw no latency regression in either etcd or the kube-apiserver.

compare https://bugzilla.redhat.com/attachment.cgi?id=1712996 and https://bugzilla.redhat.com/attachment.cgi?id=1712995 side by side (latency for ConfigMap create, priority & fairness enabled)

compare https://bugzilla.redhat.com/attachment.cgi?id=1712998 and https://bugzilla.redhat.com/attachment.cgi?id=1712999 side by side (latency for ConfigMap create, priority & fairness disabled)

Comment 25 Maciej Szulik 2020-09-10 18:56:02 UTC
I just re-ran the aforementioned tests and I'm not seeing discrepancies as big as originally reported; here are my results:

$ ./oc4.5 version
Client Version: 4.5.9
Server Version: 4.6.0-0.ci-2020-09-10-100011
Kubernetes Version: v1.19.0-rc.2.489+fc4c489e03104b-dirty

$ ./oc4.6 version
Client Version: 4.6.0-0.ci-2020-09-10-100011
Server Version: 4.6.0-0.ci-2020-09-10-100011
Kubernetes Version: v1.19.0-rc.2.489+fc4c489e03104b-dirty


time for i in {0..199}; do oc new-project --skip-config-write test$i; oc create -n test$i -f dc.json; done

4.5:
real    2m8.315s
user    0m55.579s
sys     0m6.628s
user+sys    62.207

4.6:
real	2m7.975s
user	0m54.380s
sys     0m7.024s
user+sys    61.404


time for i in {0..199}; do oc delete project --wait=false test$i; done

4.5:
real    0m46.169s
user    0m30.151s
sys     0m3.591s
user+sys    33.742

4.6:
real    1m30.436s
user    0m30.554s
sys     0m3.946s
user+sys    34.500


Even though I only focused on the two operations that were pointed out, I think the above is sufficient information
to state that there's no oc regression, and definitely not one as significant as mentioned initially.

I suspect that OVN might be affecting the tests, since mine were executed from within the AWS platform against an AWS IPI cluster.
If so, I'd suggest directing this bug to the appropriate team. For now I'm moving this to QA to confirm.

Comment 26 Paige Rubendall 2020-09-10 20:45:46 UTC
I tried the same test using SDN, still with the same availability zone for all worker and master nodes.
It showed results similar to my earlier findings, but 4.5 had faster user and sys times even though the real times were around the same.
Is there a reason why you are looking at the user and sys times and not the real time?




$ oc version
Client Version: 4.6.0-0.nightly-2020-09-10-100526
Server Version: 4.6.0-0.nightly-2020-09-10-100526


$ oc version
Client Version: 4.5.9
Server Version: 4.6.0-0.nightly-2020-09-10-100526
Kubernetes Version: v1.19.0-rc.2+068702d


time -p for i in {0..199}; do oc new-project --skip-config-write test$i; oc create -n test$i -f dc.json; done
4.6:
real 108.80
user 71.58
sys 11.26
user+sys: 82.84


4.5:
real 111.22
user 66.53
sys 10.59
user+sys: 77.1

time -p for i in {0..199}; do oc get dc -n test$i; oc get sa -n test$i; oc get secrets -n test$i; done
4.6:
real 108.75
user 119.41
sys 18.83
user+sys: 137.83


4.5:
real 98.63
user 108.04
sys 17.37
user+sys: 125.41


time -p for i in {0..199}; do oc scale --replicas=0 dc/deploymentconfig1 -n test$i; done

4.6:
real 41.27
user 40.45
sys 6.57
user+sys: 47.02

4.5:
real 36.85
user 35.08
sys 5.71
user+sys: 40.79

time -p for i in {0..199}; do oc delete project --wait=false test$i; done


4.6:
real 44.37
user 41.62
sys 6.48
user+sys: 48.1

4.5:
real 37.98
user 35.86
sys 5.90
user+sys: 41.76

Comment 27 Maciej Szulik 2020-09-11 10:24:51 UTC
Why should I be looking at real time, when that includes the command run plus everything else that is running on the particular machine?
Even though I spin up a separate machine just for executing the tests, the above clearly shows that other activity will affect the overall timing
of the operation, whereas user+sys reflects the actual CPU time spent (in user space and in the kernel) on this particular operation.
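
For anyone reproducing the numbers: user+sys in the tables above is simply the sum of the two CPU-time figures that time prints. A small illustration with the POSIX output format (run-test.sh is a placeholder for any of the loops above):

# time -p prints real/user/sys; user+sys is the CPU time attributable to the
# command itself, independent of other load on the machine.
{ time -p ./run-test.sh ; } 2>&1 | awk '/^user|^sys/ { t += $2 } END { print "user+sys:", t }'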

Comment 29 Mike Fiedler 2020-09-18 15:52:59 UTC
The kernel timings are only relevant on the client side.   The additional remaining latency is in the server - do you want a new bz for that?

New experiment.   The same oc 4.6 latest client issuing commands to a 4.5 and 4.6 server from the same client system.  Only the kubeconfig is changed to access the different servers.   All instances are in the same AWS region.  
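
(For reference, switching the same client between the two clusters is just a matter of pointing KUBECONFIG at the respective files; the paths below are placeholders.)

# Same oc 4.6 binary, same client host; only the kubeconfig selects the cluster.
export KUBECONFIG=$HOME/clusters/4.5/auth/kubeconfig   # 4.5 measurements
export KUBECONFIG=$HOME/clusters/4.6/auth/kubeconfig   # 4.6 measurements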

As expected the kernel times are very similar on the client, but the wall clock times are off by as much as 37% for create workloads due to server-side differences. Unfortunately, for single commands, higher-loglevel oc traces don't really show where the differences are. It is when commands are done in bulk that the difference really shows up.

time for i in {0..199}; do oc new-project --skip-config-write test$i; oc create -n test$i -f dc.json; done
4.5:
real	1m30.894s
user	0m54.271s
sys	0m4.972s

4.6:
real	2m3.005s
user	0m53.871s
sys	0m5.125s

** 37% increase

time for i in {0..199}; do oc get dc -n test$i; oc get sa -n test$i; oc get secrets -n test$i; done
4.5:
real	1m46.258s
user	1m30.539s
sys	0m8.934s

4.6:
real	1m47.300s
user	1m31.102s
sys	0m9.119s

** no significant increase

time for i in {0..199}; do oc scale --replicas=0 dc/deploymentconfig1 -n test$i; done
4.5:
real	0m37.905s
user	0m29.315s
sys	0m2.950s

4.6:
real	0m47.227s
user	0m34.783s
sys	0m3.344s

** 19% increase

time for i in {0..199}; do oc delete project --wait=false test$i; done
4.5:
real	0m38.642s
user	0m30.115s
sys	0m2.981s

4.6:
real	0m41.712s
user	0m30.883s
sys	0m3.038s

** probably not significant

Moving to ASSIGNED for now, please move back to ON_QA if you want a separate bug for the differences when the client is held constant.

Comment 30 Maciej Szulik 2020-09-21 09:47:33 UTC
Note to self: check oc startup time until the first network call and compare 4.5 and 4.6.

This is not a blocker for 4.6, but we'll want to investigate this more; I'm moving this to 4.7 for now.
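
One cheap way to look at that, as a sketch rather than a committed plan: time a subcommand that makes no API call at all, and compare the request timestamps printed at higher verbosity.

# Pure client startup, no API round trip:
time ./oc4.5 version --client
time ./oc4.6 version --client

# At -v=6 every HTTP request is logged with a timestamp, so the gap between
# process start and the first GET line approximates "startup until first
# network call".
./oc4.6 get projects -v=6 2>&1 | head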

Comment 31 Abu Kashem 2020-09-21 11:17:48 UTC
Hi mifiedle,

> The kernel timings are only relevant on the client side.   The additional remaining latency is in the server - do you want a new bz for that?

The additional remaining latency = network roundtrip (oc <--> apiserver) + server latency.

If you are trying to see whether the apiserver latency has any regression I would recommend:
- A: looking at the apiserver metrics to eliminate the network roundtrip from the calculation, or
- B: checking the audit log (for the requests sent by the test) and calculating the latency regression from there.

B is more accurate than A, since the apiserver latency metrics are histograms with buckets, so using metrics will not give you an accurate picture of the regression.
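
For option B, a rough sketch of pulling the audit log and deriving per-request latency from its timestamps. The node-logs path is the usual 4.x default; the jq filter, the namespace pattern and the output file name are illustrative only:

# Pull the kube-apiserver audit log from the masters.
oc adm node-logs --role=master --path=kube-apiserver/audit.log > audit.log

# Per-request latency = stageTimestamp - requestReceivedTimestamp of the
# ResponseComplete event, for requests touching the test namespaces. The ts
# helper keeps sub-second precision.
jq -r '
  def ts: capture("(?<s>.+)\\.(?<f>[0-9]+)Z$")
          | (.s + "Z" | fromdateiso8601) + (("0." + .f) | tonumber);
  select(.stage == "ResponseComplete" and (.requestURI | test("test[0-9]+")))
  | [((.stageTimestamp | ts) - (.requestReceivedTimestamp | ts)), .verb, .requestURI]
  | @tsv' audit.log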

Comment 32 Maciej Szulik 2020-10-01 14:49:52 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Comment 33 Michal Fojtik 2020-10-21 12:12:13 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 34 Maciej Szulik 2021-01-21 11:54:10 UTC
It's most likely that we've identified the root cause of this slowdown.
With every version of OCP the number of CRDs registered by default gets higher.
oc has an internal throttling mechanism, whose limits we bumped in 4.7 (see https://bugzilla.redhat.com/show_bug.cgi?id=1899575);
the same fix will be backported to 4.6: https://bugzilla.redhat.com/show_bug.cgi?id=1906332
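
Two quick ways to see this effect on a cluster (illustrative commands, not part of the original analysis):

# The number of CRDs the client has to discover keeps growing from release to release.
oc get crd --no-headers | wc -l

# At higher verbosity client-go reports its own rate limiting; "Throttling
# request took ..." lines mean the client-side (discovery) limit is being hit
# rather than the server being slow.
oc get dc -n test0 -v=6 2>&1 | grep -i "throttling request"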

*** This bug has been marked as a duplicate of bug 1899575 ***

Comment 35 Paige Rubendall 2021-01-22 18:54:37 UTC
Retested this issue to further verify that the duplicate issue did have an effect on the regression I was seeing in 4.6. 

I ran the same commands for 4.5.28, 4.6.13 and 4.7.0-0.nightly-2021-01-22-104107

Overall the 4.6.13 commands were still a little slower than 4.5 (around 10% or less increase) but not as slow as I had seen.
The 4.7 timings were comparable to 4.5 (within 1%) and faster than 4.6 

Create 200 projects and create 1 dc in each project

4.5 
real	1m25.648s
user	1m6.725s
sys	0m10.764s

4.6 
real	1m34.850s
user	1m11.329s
sys	0m11.423s

4.7
real	1m25.100s
user	1m6.270s
sys	0m11.053s

for each of 200 projects, get the dcs, sa and secrets


4.5 
real	1m42.791s
user	1m50.105s
sys	0m17.672s

4.6
real	1m50.881s
user	1m58.896s
sys	0m19.376s

4.7
real	1m43.212s
user	1m50.243s
sys	0m18.594s


for each of 200 projects, scale the dc replicas to 0

4.5
real	0m35.977s
user	0m35.934s
sys	0m5.744s


4.6
real	0m39.496s
user	0m39.019s
sys	0m6.356s

4.7
real	0m37.362s
user	0m35.943s
sys	0m6.057s


Delete 200 projects

4.5
real	0m37.081s
user	0m36.600s
sys	0m6.096s

4.6
real	0m40.922s
user	0m39.778s
sys	0m6.279s

4.7
real	0m37.260s
user	0m36.762s
sys	0m6.146s

Comment 36 Maciej Szulik 2021-01-25 11:28:29 UTC
> Overall the 4.6.13 commands were still a little slower than 4.5 (around 10%
> or less increase) but not as slow as I had seen.
> The 4.7 timings were comparable to 4.5 (within 1%) and faster than 4.6 

4.7 has the discovery burst threshold bumped to 250, which is not in 4.6 yet.
I'm hoping that once that lands we'll see much better results in 4.6 as well.

