Bug 1626291 - 4.0 command line operations are 2x - 4x slower than 3.11 [NEEDINFO]
Summary: 4.0 command line operations are 2x - 4x slower than 3.11
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: oc
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.1.0
Assignee: Maciej Szulik
QA Contact: Mike Fiedler
URL:
Whiteboard: aos-scalability-40
Depends On:
Blocks: 1664187
 
Reported: 2018-09-07 00:54 UTC by Mike Fiedler
Modified: 2019-06-04 10:40 UTC
CC List: 13 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-04 10:40:34 UTC
Target Upstream Version:
maszulik: needinfo? (skordas)
xxia: needinfo? (skordas)


Attachments
3.10 loglevel=10 sample (65.31 KB, text/plain)
2018-09-07 01:01 UTC, Mike Fiedler
3.11 loglevel=10 sample (51.53 KB, text/plain)
2018-09-07 01:02 UTC, Mike Fiedler
deployment config (4.65 KB, text/plain)
2019-01-02 19:15 UTC, Mike Fiedler


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2019:0758 None None None 2019-06-04 10:40:43 UTC

Description Mike Fiedler 2018-09-07 00:54:28 UTC
Description of problem:

While running the performance master vertical test (which creates 1000-5000 projects loaded with dcs, builds, secrets, routes, etc. using oc commands), we noticed it ran considerably slower on 3.11: the test duration was 1.5 times that of 3.10.

Drilling down and timing specific commands individually shows that each is roughly 30% slower than in 3.10. For a single interactive command that might not be noticeable, but for scripted interactions like performance and scalability testing the slowdown is a big problem.

The comparison between 3.10 and 3.11 below is on identical clusters - 1 master, 3 infra, 10 compute nodes on AWS m5.xlarge instances.   For the following operations on 200 projects, the slowdowns are:

1. Create 200 projects and create 1 dc in each project

time for i in {0..199}; do oc new-project --skip-config-write test$i; oc create -n test$i -f dc.json; done
3.10:  63.8s
3.11:  82.2s -29%

2. for each of 200 projects, get the dcs, sa and secrets

time for i in {0..199}; do oc get dc -n test$i; oc get sa -n test$i; oc get secrets -n test$i; done 
3.10: 37.5s 
3.11: 49.7s -32%

3. for each of 200 projects, scale the dc replicas to 0

time for i in {0..199}; do oc scale --replicas=0 dc/deploymentconfig1 -n test$i; done
3.10:  27s
3.11:  35.8s -32%

4. Delete 200 projects (use --wait=false on 3.11)

time for i in {0..199}; do oc delete project --wait=false test$i; done
3.10: 28.5s
3.11: 35.9s -26%

Version-Release number of selected component (if applicable): 3.11.0-0.28.0


How reproducible: Always


Steps to Reproduce:
1. install 3.10.14 and 3.11.0-0.28.0 clusters in AWS
2. run the commands in the description above (dc.json attached)


Actual results:

3.11 command responses are 30% slower

Expected results:

No degradation in response time from 3.10

Additional info:

Let me know what additional info is required. I'll attach loglevel=10 output from 3.10 and 3.11 for a command like scaling replicas to 0.

Comment 1 Mike Fiedler 2018-09-07 01:01:27 UTC
Created attachment 1481440 [details]
3.10 loglevel=10 sample

oc scale --replicas=0 dc/deploymentconfig1 -n test1 --loglevel=10 > 3.10_scale_to_0_replicas.txt 2>&1

Comment 2 Mike Fiedler 2018-09-07 01:02:56 UTC
Created attachment 1481441 [details]
3.11 loglevel=10 sample

oc scale --replicas=0 dc/deploymentconfig1 -n test1 --loglevel=10 > 3.11_scale_to_0_replicas.txt 2>&1

Comment 4 Maciej Szulik 2018-09-12 09:22:30 UTC
Mike, before judging what is causing this 30% increase, I'd like to ask you to perform one more test, if possible. You mentioned that you tested 3.10 and 3.11, but each test always used the exact same version of the CLI and the API. I'd like to see a test where you use a 3.11 oc against a 3.10 server; based on that we'll be able to tell whether the slowness comes from the CLI or from the API. Currently, when both are 3.10 or both are 3.11, it's hard to judge, do you agree?
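For concreteness, a minimal sketch of that cross-version matrix (the binary paths and kubeconfig locations below are placeholders, not the actual test setup): run the same timed loop for every client/server combination so client-side and server-side costs can be separated.

for client in ./oc-3.10 ./oc-3.11; do
  for server in ~/kubeconfig-3.10 ~/kubeconfig-3.11; do
    echo "client=$client server=$server"
    time for i in {0..199}; do
      KUBECONFIG=$server $client get dc -n test$i > /dev/null
    done
  done
done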

Comment 5 Mike Fiedler 2018-09-12 14:02:09 UTC
Ack, I will run that test today and report the results.

Comment 6 Mike Fiedler 2018-09-12 22:00:13 UTC
Looks like the problem is mostly in the client.

A 3.10.44 oc against a 3.11.0-0.28.0 server is almost identical to 3.10 -> 3.10: a little slower for the create/update and actually a little faster for the gets.

A 3.11 oc against a 3.11 server from the same client system shows the problem in the description.

Let me know what other info would be useful.

Comment 7 Nick Stielau 2018-10-09 15:53:41 UTC
Interesting that it is a client slowdown; I was guessing the API.

Is there a way we can move this into OKD unit/functional testing, i.e. fail the build for anything over a 10% slowdown? I don't see any specific reason we can't catch this earlier in the lifecycle.
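A rough sketch of what such a gate could look like (the baseline value and the workload here are assumptions, not an existing OKD test): time a fixed oc workload and fail the build when it exceeds a recorded baseline by more than 10%.

baseline=30                             # seconds, recorded from a known-good build
start=$SECONDS
for i in {0..199}; do oc get dc -n test$i > /dev/null; done
elapsed=$((SECONDS - start))
limit=$(( baseline + baseline / 10 ))   # baseline + 10%
if [ "$elapsed" -gt "$limit" ]; then
  echo "FAIL: ${elapsed}s exceeds ${limit}s (baseline ${baseline}s + 10%)"
  exit 1
fi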

Comment 8 Maciej Szulik 2018-10-09 16:08:05 UTC
Nick, we rewrote significant parts of the client over the course of 3.11 and finished only recently. The slowdown is likely related to that, but we haven't had time to investigate further yet. I'm not sure we can put proper unit/functional testing in place to catch this; only the full performance suite, which runs these operations at high volume, will show the actual problems. A smaller scale might obscure the data.

Comment 9 Mike Fiedler 2019-01-02 19:15:43 UTC
Created attachment 1518021 [details]
deployment config

Comment 10 Mike Fiedler 2019-01-29 18:22:07 UTC
Re-tested in 4.0 (Kube 1.12.4 rebase level) and there is further regression.  Do you want a new bz to track 4.0 specifically?


1. Create 200 projects and create 1 dc in each project

version  time [s]  vs 3.10  vs 3.11
3.10     53.6      -        -
3.11     71.4      -33%     -
4.0      86.1      -61%     -21%

2. for each of 200 projects, get the dcs, sa and secrets

version  time [s]  vs 3.10  vs 3.11
3.10     82.6      -        -
3.11     106.0     -28%     -
4.0      135.9     -65%     -28%

3. for each of 200 projects, scale the dc replicas to 0

version  time [s]  vs 3.10  vs 3.11
3.10     25.4      -        -
3.11     35.1      -38%     -
4.0      37.6      -48%     -7%

4. Delete 200 projects

version  time [s]  vs 3.10  vs 3.11
3.10     37.3      -        -
3.11     47.6      -27%     -
4.0      53.5      -43%     -12%
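For reference, the percentages in these tables follow the convention (old - new) / old, so a negative value means the newer release is slower. A one-line check against the 4.0 row of table 1:

awk 'BEGIN { old=53.6; new=86.1; printf "%.0f%%\n", (old - new) / old * 100 }'   # prints -61%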

Comment 11 Maciej Szulik 2019-01-30 11:10:15 UTC
Mike, we have a pending fix here: https://github.com/kubernetes/kubernetes/pull/73345, which reduces discovery searches from ~10s to ~1s. It will land in 4.0; we just need to fix other things first. There's no need for a new BZ, this is very high on my list. Thanks for the input!
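The discovery cost is easy to observe on the client side: oc caches API discovery results under ~/.kube/cache/discovery (the standard kubectl location; adjust if your client is configured differently), so comparing a cold-cache run with a warm-cache run isolates the discovery round-trips from the rest of the command. A sketch:

rm -rf ~/.kube/cache/discovery        # force a cold discovery cache
time oc get dc -n test1 > /dev/null   # pays the full discovery cost
time oc get dc -n test1 > /dev/null   # warm cache; discovery mostly skipped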

Comment 12 Maciej Szulik 2019-01-30 11:19:03 UTC
Also, Stefan is working on server-side OpenAPI improvements that should significantly improve performance as well; see https://github.com/kubernetes/kube-openapi/pull/124. Let's re-test once both fixes land; I'll ping you then.
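To gauge how much of a single command's wall time goes to the OpenAPI document itself, the raw download can be timed directly (/openapi/v2 is the standard aggregated OpenAPI endpoint at this Kubernetes level; this is only a measurement sketch, not part of either fix):

time oc get --raw /openapi/v2 > /dev/null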

Comment 13 Maciej Szulik 2019-02-13 12:03:15 UTC
The PRs we're waiting for are:
1. https://github.com/openshift/origin/pull/22020 (tagged for merge)
2. https://github.com/openshift/origin/pull/21935 (merged)

Comment 14 Maciej Szulik 2019-02-14 09:45:53 UTC
Mike, now that both PRs from my previous comment are in, can you re-test with a current build?

Comment 15 Simon 2019-02-28 18:08:28 UTC
Retest: 2/28/2019
Results for build: 4.0.0-0.nightly-2019-02-27-213933

1. Create 200 projects and create 1 dc in each project
time for i in {0..199}; do oc new-project --skip-config-write test$i; oc create -n test$i -f dc.json; done
real    4m6.897s
user    1m11.245s
sys     0m11.567s

4.0 vs 3.10 -287%

2. for each of 200 projects, get the dcs, sa and secrets
time for i in {0..199}; do oc get dc -n test$i; oc get sa -n test$i; oc get secrets -n test$i; done
real    5m3.284s
user    1m54.370s
sys     0m18.666s

4.0 vs 3.10 -708%

3. for each of 200 projects, scale the dc replicas to 0
time for i in {0..199}; do oc scale --replicas=0 dc/deploymentconfig1 -n test$i; done
real    2m12.528s
user    0m44.631s
sys     0m7.204s

4.0 vs 3.10 -391%

4. Delete 200 projects (use --wait=false on 3.11)
time for i in {0..199}; do oc delete project --wait=false test$i; done
real    1m39.443s
user    0m38.657s
sys     0m6.300s

4.0 vs 3.10 -249%

Comment 17 Simon 2019-03-05 21:16:28 UTC
Retest 3/5/2019


Client Version: v4.0.6
Server Version: v1.12.4+b3e6ad4
Results for build: 4.0.0-0.nightly-2019-03-05-065158

1. Create 200 projects and create 1 dc in each project
time for i in {0..199}; do oc new-project --skip-config-write test$i; oc create -n test$i -f dc.json; done
real 234.65
user 71.23
sys 11.55

4.0 vs 3.11 -185%


2. for each of 200 projects, get the dcs, sa and secrets
time for i in {0..199}; do oc get dc -n test$i; oc get sa -n test$i; oc get secrets -n test$i; done
real 297.82
user 115.41
sys 19.41

4.0 vs 3.11 -499%


3. for each of 200 projects, scale the dc replicas to 0
time for i in {0..199}; do oc scale --replicas=0 dc/deploymentconfig1 -n test$i; done
real 131.26
user 45.75
sys 7.43

4.0 vs 3.11 -386%


4. Delete 200 projects (use --wait=false on 3.11)
time for i in {0..199}; do oc delete project --wait=false test$i; done
real 96.81
user 38.72
sys 6.40

4.0 vs 3.11 -73%

Comment 18 Maciej Szulik 2019-03-07 12:19:31 UTC
Were both executions (oc 4.0 and oc 3.11) run against the same 4.0 cluster?

Comment 19 Simon 2019-03-07 14:13:50 UTC
No.
The 3.11 cluster was provisioned using a flexy build, and the 4.0 cluster using the next-gen installer.

Comment 22 Maciej Szulik 2019-03-08 12:58:59 UTC
Simon, can you prepare a summary similar to the one in #comment 10, where I can clearly see which client version ran against which server version and how the numbers change? It's hard to say which elements impact which parts, since many components are involved in any single oc operation.

Comment 23 Simon 2019-03-11 13:43:14 UTC
Maciej,
here are my latest results, from last Friday.

                                                         1      2       3      4
cli: 4.0.6   OCP: 4.0.0-0.nightly-2019-03-06-074438  244.5  301.3   136.2  100.2
cli: 3.11.59 OCP: 4.0.0-0.nightly-2019-03-06-074438  251.1  308.3   127.3  152.9
cli: 3.11.93 OCP: 3.11.93                            338.2  223.9    81.3  178.2

3.11.93 as base:

                                                        1       2       3      4
cli: 4.0.6   OCP: 4.0.0-0.nightly-2019-03-06-074438   27%    -34%    -67%   -77%
cli: 3.11.59 OCP: 4.0.0-0.nightly-2019-03-06-074438   25%    -37%    -56%   -16%
cli: 3.11.93 OCP: 3.11.93                            100%    100%    100%   100%

Tests:
[1] time for i in {0..199}; do oc new-project --skip-config-write test$i; oc create -n test$i -f dc.json; done
[2] time for i in {0..199}; do oc get dc -n test$i; oc get sa -n test$i; oc get secrets -n test$i; done
[3] time for i in {0..199}; do oc scale --replicas=0 dc/deploymentconfig1 -n test$i; done
[4] time for i in {0..199}; do oc delete project --wait=false test$i; done

Comment 25 Maciej Szulik 2019-03-15 14:51:50 UTC
I ran the same set of tests against origin-release:4.0.0-0.alpha-2019-03-15-093013; the results are as follows:

Client Version: v4.0.0-alpha.0+846cde5-1641-dirty

vs

oc v3.11.88
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

vs

oc v3.10.119
kubernetes v1.10.0+b81c8f8
features: Basic-Auth GSSAPI Kerberos SPNEGO


[1] time for i in {0..199}; do oc new-project --skip-config-write test$i; oc create -n test$i -f dc.json; done
4.0
real    1m59.420s
user    1m17.530s
sys     0m14.524s

3.11
real    2m20.215s
user    1m26.847s
sys     0m14.718s

3.10
real    1m48.523s
user    1m3.508s
sys     0m11.827s


[2] time for i in {0..199}; do oc get dc -n test$i; oc get sa -n test$i; oc get secrets -n test$i; done
4.0
real    2m11.206s
user    2m4.650s
sys     0m24.590s

3.11
real    2m26.767s
user    2m17.446s
sys     0m24.213s

3.10
real    2m10.587s
user    1m48.082s
sys     0m20.112s


[3] time for i in {0..199}; do oc scale --replicas=0 dc/deploymentconfig1 -n test$i; done
4.0
real    0m55.052s
user    0m49.514s
sys     0m9.593s

3.11
real    1m9.007s
user    0m50.180s
sys     0m9.111s

3.10
real    0m53.117s
user    0m38.975s
sys     0m7.612s


[4] time for i in {0..199}; do oc delete project --wait=false test$i; done
4.0
real    0m47.396s
user    0m41.734s
sys     0m8.283s

3.11
real    0m51.819s
user    0m45.850s
sys     0m8.136s

3.10
real    0m43.741s
user    0m34.899s
sys     0m6.515s


So from the above, 4.0 is at least 10% faster (up to 23%) than 3.11.
Comparing 4.0 with 3.10 shows we are slower, usually by 1-3%; the only
outlier is new-project, which is ~9% slower.

Simon, mind double-checking your tests?

Comment 28 Xingxing Xia 2019-03-26 01:52:48 UTC
Mike, Simon, could you continue checking it? Thank you.

Comment 29 Mike Fiedler 2019-03-28 17:59:36 UTC
Latest QE results on 4.0.0-0.nightly-2019-03-28-030453, using a client in the same AWS region as the cluster. The 3.10 and 3.11 numbers below are from the original bug description.

1. Create 200 projects and create 1 dc in each project

time for i in {0..199}; do oc new-project --skip-config-write test$i; oc create -n test$i -f dc.json; done
3.10:  63.8s
3.11:  82.2s 
4.0:   75.2s

2. for each of 200 projects, get the dcs, sa and secrets

time for i in {0..199}; do oc get dc -n test$i; oc get sa -n test$i; oc get secrets -n test$i; done 
3.10: 37.5s 
3.11: 49.7s 
4.0:  79.3s

3. for each of 200 projects, scale the dc replicas to 0

time for i in {0..199}; do oc scale --replicas=0 dc/deploymentconfig1 -n test$i; done
3.10:  27s
3.11:  35.8s
4.0:   33.2s 

4. Delete 200 projects (use --wait=false on 3.11)

time for i in {0..199}; do oc delete project --wait=false test$i; done
3.10: 28.5s
3.11: 35.9s
4.0:  30.4s

The get performance is the only remaining outlier. Marking this general bz VERIFIED; I will do a deeper 3.11 comparison on varying get scenarios. The improved create performance matches what the performance team saw in their cluster load-up runs.
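A simple way to vary those get scenarios is to time the same fetch through different output paths, which separates API latency from client-side printing (standard oc/kubectl flags; a sketch):

time oc get dc -n test1 > /dev/null            # default table printing
time oc get dc -n test1 -o name > /dev/null    # minimal printing path
time oc get dc -n test1 -o json > /dev/null    # raw object, no table rendering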

Comment 30 Maciej Szulik 2019-04-01 11:31:14 UTC
To shed some more light, I did one more round of tests going all the way back to 3.9, and we're looking pretty good. All runs were against a 4.0.0-0.alpha-2019-04-01-080825 cluster:

time for i in {0..199}; do oc new-project --skip-config-write test$i; oc create -n test$i -f dc.json; done

4.1
real    1m57.810s
user    1m22.642s
sys     0m14.628s

3.11
real    2m2.103s
user    1m30.887s
sys     0m14.680s

3.10
real    1m39.206s
user    1m2.525s
sys     0m11.362s

3.9
real    2m13.753s
user    1m40.696s
sys     0m16.518s


time for i in {0..199}; do oc get dc -n test$i; oc get sa -n test$i; oc get secrets -n test$i; done

4.1
real    2m13.863s
user    2m8.778s
sys     0m24.360s

3.11
real    2m30.014s
user    2m23.904s
sys     0m23.964s

3.10
real    2m13.108s
user    1m52.827s
sys     0m20.193s

3.9
real    3m9.074s
user    2m48.212s
sys     0m27.287s


time for i in {0..199}; do oc scale --replicas=0 dc/deploymentconfig1 -n test$i; done

4.1
real    0m55.610s
user    0m50.061s
sys     0m9.342s

3.11
real    1m1.340s
user    0m52.529s
sys     0m9.098s

3.10
real    0m55.181s
user    0m41.092s
sys     0m7.713s

3.9
real    4m28.541s
user    0m57.179s
sys     0m9.277s


time for i in {0..199}; do oc delete project --wait=false test$i; done

4.1
real    0m47.125s
user    0m42.760s
sys     0m8.274s

3.11
real    0m53.575s
user    0m49.766s
sys     0m8.255s

3.10
real    0m44.581s
user    0m37.245s
sys     0m6.618s

3.9
real    0m59.102s
user    0m54.254s
sys     0m8.906s

Comment 32 errata-xmlrpc 2019-06-04 10:40:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758

