Bug 1626291
| Field | Value | Field | Value |
|---|---|---|---|
| Summary: | 4.0 command line operations are 2x - 4x slower than 3.11 | | |
| Product: | OpenShift Container Platform | Reporter: | Mike Fiedler <mifiedle> |
| Component: | oc | Assignee: | Maciej Szulik <maszulik> |
| Status: | CLOSED ERRATA | QA Contact: | Mike Fiedler <mifiedle> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.1.0 | CC: | aos-bugs, erich, hongkliu, jeder, jokerman, maszulik, mfojtik, mifiedle, mmccomas, nelluri, nstielau, skordas, xxia |
| Target Milestone: | --- | Keywords: | Regression, TestBlocker |
| Target Release: | 4.1.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | aos-scalability-40 | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-06-04 10:40:34 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1664187 | | |
| Attachments: | | | |
Created attachment 1481440 [details]
3.10 loglevel=10 sample
oc scale --replicas=0 dc/deploymentconfig1 -n test1 --loglevel=10 > 3.10_scale_to_0_replicas.txt 2>&1
Created attachment 1481441 [details]
3.11 loglevel=10 sample
oc scale --replicas=0 dc/deploymentconfig1 -n test1 --loglevel=10 > 3.11_scale_to_0_replicas.txt 2>&1
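Not part of the original attachments, but a rough way to split client time from server round trips in those loglevel=10 traces: client-go usually logs one line per request such as "Response Status: 200 OK in NN milliseconds", so summing those values and comparing the total against the command's wall-clock time gives an approximate client/server breakdown. A sketch only; the exact log format may differ between oc versions:

# Sum the per-request latencies reported in a loglevel=10 trace (format assumed
# to be client-go's "... in <N> milliseconds"); anything left over relative to
# the command's wall-clock time is client-side work.
grep -o 'in [0-9]\+ milliseconds' 3.11_scale_to_0_replicas.txt \
  | awk '{ total += $2 } END { printf "%d ms in HTTP round trips\n", total }'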
Mike, before judging what is causing this 30% increase, I'd like to ask you to perform one additional test, if possible. You've mentioned that you ran tests on 3.10 and 3.11, but each test always used the exact same version of the CLI and the API. I'd like to see a test where you use a 3.11 oc against a 3.10 server; based on that we'll be able to tell whether the slowness is coming from the CLI or from the API. Currently, when both are 3.10 or both are 3.11, it's hard to judge. Do you agree?

Ack, I will run that test today and report the results.

Looks like the problem is mostly in the client. 3.10.44 oc -> 3.11.0-0.28.0 server is almost identical to 3.10 -> 3.10: a little slower for create/update and actually a little faster for gets. 3.11 oc -> 3.11 server from the same client system shows the problem in the description. Let me know what other info would be useful.

Interesting that it is a client slowdown; I was guessing the API. Is there a way we can move this into OKD unit/functional testing, i.e. fail the build for anything over a 10% slowdown? I don't see any specific reason we can't catch this earlier in the lifecycle.

Nick, we've been rewriting significant parts of the client over the course of 3.11 and finished only recently. This is likely related to that, but we haven't had time to investigate further yet. I'm not sure we can put proper unit/functional testing in place to catch this; only the full performance suite, which invokes these operations at high volume, will show the actual problems. A smaller scale might obscure the data.

Created attachment 1518021 [details]
deployment config
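For the client-vs-server isolation test requested above, something along these lines keeps the cluster fixed and only swaps the oc binary (the binary paths are placeholders, not taken from this bug):

# Run the same scale loop with two different oc clients against one cluster;
# if the timings diverge, the slowdown is in the client rather than the API.
for CLI in ./oc-3.10.44 ./oc-3.11.0; do
  echo "== $CLI =="
  time for i in {0..199}; do
    $CLI scale --replicas=0 dc/deploymentconfig1 -n test$i >/dev/null
  done
done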
Re-tested in 4.0 (Kube 1.12.4 rebase level) and there is further regression. Do you want a new bz to track 4.0 specifically?

1. Create 200 projects and create 1 dc in each project

version   time [s]   vs 3.10   vs 3.11
3.10      53.6       -         -
3.11      71.4       -33%      -
4.0       86.1       -61%      -21%

2. For each of 200 projects, get the dcs, sa and secrets

version   time [s]   vs 3.10   vs 3.11
3.10      82.6       -         -
3.11      106.0      -28%      -
4.0       135.9      -65%      -28%

3. For each of 200 projects, scale the dc replicas to 0

version   time [s]   vs 3.10   vs 3.11
3.10      25.4       -         -
3.11      35.1       -38%      -
4.0       37.6       -48%      -7%

4. Delete 200 projects

version   time [s]   vs 3.10   vs 3.11
3.10      37.3       -         -
3.11      47.6       -27%      -
4.0       53.5       -43%      -12%

Mike, we have a pending fix here: https://github.com/kubernetes/kubernetes/pull/73345, which brings the discovery searches down from ~10s to 1s. It will land in 4.0; we just need to fix other things first. There's no need for a new BZ; this is very high on my list. Thanks for the input!

Also, Stefan is working on server-side OpenAPI improvements that should significantly improve performance as well, see https://github.com/kubernetes/kube-openapi/pull/124. Let's re-test once both fixes land; I'll ping you then.

The PRs we're waiting for are:
1. https://github.com/openshift/origin/pull/22020 (tagged for merge)
2. https://github.com/openshift/origin/pull/21935 (merged)

Mike, now that we have both PRs from my previous comment, can you re-test that build?

Retest: 2/28/2019
Results for build: 4.0.0-0.nightly-2019-02-27-213933
1. Create 200 projects and create 1 dc in each project
time for i in {0..199}; do oc new-project --skip-config-write test$i; oc create -n test$i -f dc.json; done
real 4m6.897s
user 1m11.245s
sys 0m11.567s
4.0 vs 3.10 -287%
2. for each of 200 projects, get the dcs, sa and secrets
time for i in {0..199}; do oc get dc -n test$i; oc get sa -n test$i; oc get secrets -n test$i; done
real 5m3.284s
user 1m54.370s
sys 0m18.666s
4.0 vs 3.10 -708%
3. for each of 200 projects, scale the dc replicas to 0
time for i in {0..199}; do oc scale --replicas=0 dc/deploymentconfig1 -n test$i; done
real 2m12.528s
user 0m44.631s
sys 0m7.204s
4.0 vs 3.10 -391%
4. Delete 200 projects (use --wait=false on 3.11)
time for i in {0..199}; do oc delete project --wait=false test$i; done
real 1m39.443s
user 0m38.657s
sys 0m6.300s
4.0 vs 3.10 -249%
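For what it's worth, the "4.0 vs 3.10" figures above appear to be computed as (baseline - new) / baseline, with the 3.10 timings from the original description as the baseline. A quick check of the create case (awk is used only for the arithmetic):

# Assumed formula: delta% = (baseline - new) / baseline * 100
# create: 3.10 baseline 63.8s, this run 4m6.897s = 246.9s
awk 'BEGIN { printf "%.0f%%\n", (63.8 - 246.9) / 63.8 * 100 }'   # prints -287%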
Retest 3/5/2019
Client Version: v4.0.6
Server Version: v1.12.4+b3e6ad4
Results for build: 4.0.0-0.nightly-2019-03-05-065158
1. Create 200 projects and create 1 dc in each project
time for i in {0..199}; do oc new-project --skip-config-write test$i; oc create -n test$i -f dc.json; done
real 234.65
user 71.23
sys 11.55
4.0 vs 3.11 -185%
2. for each of 200 projects, get the dcs, sa and secrets
time for i in {0..199}; do oc get dc -n test$i; oc get sa -n test$i; oc get secrets -n test$i; done
real 297.82
user 115.41
sys 19.41
4.0 vs 3.11 -499%
3. for each of 200 projects, scale the dc replicas to 0
time for i in {0..199}; do oc scale --replicas=0 dc/deploymentconfig1 -n test$i; done
real 131.26
user 45.75
sys 7.43
4.0 vs 3.11 -386%
4. Delete 200 projects (use --wait=false on 3.11)
time for i in {0..199}; do oc delete project --wait=false test$i; done
real 96.81
user 38.72
sys 6.40
4.0 vs 3.11 -73%
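Given the discovery fixes referenced earlier (kubernetes#73345 and kube-openapi#124), one rough way to see how much of a single oc call is spent on discovery is to time the same command with a cold and then a warm cache. The cache paths below are the usual client-go defaults and may differ per environment:

# Cold cache: the client has to re-fetch discovery/OpenAPI data.
rm -rf ~/.kube/cache/discovery ~/.kube/http-cache
time oc get dc -n test0
# Warm cache: mostly just the single GET against the API server.
time oc get dc -n test0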
Were both executions (oc 4.0 and 3.11) run against the same 4.0 cluster?

No. The 3.11 cluster was provisioned with a flexy build and the 4.0 cluster with the next-gen installer.

Simon, can you prepare a summary similar to the one in #comment 10 where I can clearly see the difference between client and server versions and how the results change? It's hard to say which elements impact which parts, since many components are involved in any single oc operation.

Maciej,
here are my latest results, from last Friday.
Times in seconds:

                                                       [1]     [2]     [3]     [4]
cli: 4.0.6    OCP: 4.0.0-0.nightly-2019-03-06-074438   244.5   301.3   136.2   100.2
cli: 3.11.59  OCP: 4.0.0-0.nightly-2019-03-06-074438   251.1   308.3   127.3   152.9
cli: 3.11.93  OCP: 3.11.93                             338.2   223.9   81.3    178.2

With 3.11.93 as the base:

                                                       [1]     [2]     [3]     [4]
cli: 4.0.6    OCP: 4.0.0-0.nightly-2019-03-06-074438   27%     -34%    -67%    -77%
cli: 3.11.59  OCP: 4.0.0-0.nightly-2019-03-06-074438   25%     -37%    -56%    -16%
cli: 3.11.93  OCP: 3.11.93                             100%    100%    100%    100%
Tests:
[1] time for i in {0..199}; do oc new-project --skip-config-write test$i; oc create -n test$i -f dc.json; done
[2] time for i in {0..199}; do oc get dc -n test$i; oc get sa -n test$i; oc get secrets -n test$i; done
[3] time for i in {0..199}; do oc scale --replicas=0 dc/deploymentconfig1 -n test$i; done
[4] time for i in {0..199}; do oc delete project --wait=false test$i; done
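A consolidated form of tests [1]-[4], purely as a sketch (the labels and output redirection are mine), makes it easier to line up the different client/server combinations in a summary like the one above:

#!/usr/bin/env bash
# Runs the four loops from this bug and prints one wall-clock figure per test.
set -e
N=199

t0=$SECONDS
for i in $(seq 0 $N); do
  oc new-project --skip-config-write test$i >/dev/null
  oc create -n test$i -f dc.json >/dev/null
done
echo "[1] create:  $((SECONDS - t0))s"

t0=$SECONDS
for i in $(seq 0 $N); do
  oc get dc -n test$i >/dev/null
  oc get sa -n test$i >/dev/null
  oc get secrets -n test$i >/dev/null
done
echo "[2] get:     $((SECONDS - t0))s"

t0=$SECONDS
for i in $(seq 0 $N); do
  oc scale --replicas=0 dc/deploymentconfig1 -n test$i >/dev/null
done
echo "[3] scale:   $((SECONDS - t0))s"

t0=$SECONDS
for i in $(seq 0 $N); do
  oc delete project --wait=false test$i >/dev/null
done
echo "[4] delete:  $((SECONDS - t0))s"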
I ran the same set of tests against origin-release:4.0.0-0.alpha-2019-03-15-093013; the results are as follows:
Client Version: v4.0.0-alpha.0+846cde5-1641-dirty
vs
oc v3.11.88
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO
vs
oc v3.10.119
kubernetes v1.10.0+b81c8f8
features: Basic-Auth GSSAPI Kerberos SPNEGO
[1] time for i in {0..199}; do oc new-project --skip-config-write test$i; oc create -n test$i -f dc.json; done
4.0
real 1m59.420s
user 1m17.530s
sys 0m14.524s
3.11
real 2m20.215s
user 1m26.847s
sys 0m14.718s
3.10
real 1m48.523s
user 1m3.508s
sys 0m11.827s
[2] time for i in {0..199}; do oc get dc -n test$i; oc get sa -n test$i; oc get secrets -n test$i; done
4.0
real 2m11.206s
user 2m4.650s
sys 0m24.590s
3.11
real 2m26.767s
user 2m17.446s
sys 0m24.213s
3.10
real 2m10.587s
user 1m48.082s
sys 0m20.112s
[3] time for i in {0..199}; do oc scale --replicas=0 dc/deploymentconfig1 -n test$i; done
4.0
real 0m55.052s
user 0m49.514s
sys 0m9.593s
3.11
real 1m9.007s
user 0m50.180s
sys 0m9.111s
3.10
real 0m53.117s
user 0m38.975s
sys 0m7.612s
[4] time for i in {0..199}; do oc delete project --wait=false test$i; done
4.0
real 0m47.396s
user 0m41.734s
sys 0m8.283s
3.11
real 0m51.819s
user 0m45.850s
sys 0m8.136s
3.10
real 0m43.741s
user 0m34.899s
sys 0m6.515s
So from the above, 4.0 is at least 10% faster (up to 23%) than 3.11.
Comparing 4.0 with 3.10 shows we are slower, usually by 1-3%, the only
outlier being new-project, which is ~9% slower.
Simon, would you mind double-checking your tests?

Mike, Simon, could you continue checking this? Thank you.

Latest QE results on 4.0.0-0.nightly-2019-03-28-030453, using a client in the same AWS region as the cluster, are below. The 3.10 and 3.11 numbers are from the original bug description.
1. Create 200 projects and create 1 dc in each project
time for i in {0..199}; do oc new-project --skip-config-write test$i; oc create -n test$i -f dc.json; done
3.10: 63.8s
3.11: 82.2s
4.0: 75.2s
2. for each of 200 projects, get the dcs, sa and secrets
time for i in {0..199}; do oc get dc -n test$i; oc get sa -n test$i; oc get secrets -n test$i; done
3.10: 37.5s
3.11: 49.7s
4.0: 79.3s
3. for each of 200 projects, scale the dc replicas to 0
time for i in {0..199}; do oc scale --replicas=0 dc/deploymentconfig1 -n test$i; done
3.10: 27s
3.11: 35.8s
4.0: 33.2s
4. Delete 200 projects (use --wait=false on 3.11)
time for i in {0..199}; do oc delete project --wait=false test$i; done
3.10: 28.5s
3.11: 35.9s
4.0: 30.4s
The get performance is the only outlier now. Marking this general bz VERIFIED; I will do a deeper 3.11 comparison on varying get scenarios. The improved create performance matches what the performance team saw in their cluster load-up runs.
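Since get is now the main outlier, one way to see where that time goes, without a full loglevel=10 trace, is to raise verbosity just enough to get per-request timings from the client. A sketch; the exact log lines vary between client versions:

# At --loglevel=6 client-go typically prints one line per HTTP request with its
# latency, which shows whether the extra time is in discovery or in the GET itself.
oc get dc -n test0 --loglevel=6 2>&1 | grep -i milliseconds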
To shed some more light, I did one more round going all the way back to 3.9, and we're looking pretty good. All runs were against a 4.0.0-0.alpha-2019-04-01-080825 cluster:
time for i in {0..199}; do oc new-project --skip-config-write test$i; oc create -n test$i -f dc.json; done
4.1
real 1m57.810s
user 1m22.642s
sys 0m14.628s
3.11
real 2m2.103s
user 1m30.887s
sys 0m14.680s
3.10
real 1m39.206s
user 1m2.525s
sys 0m11.362s
3.9
real 2m13.753s
user 1m40.696s
sys 0m16.518s
time for i in {0..199}; do oc get dc -n test$i; oc get sa -n test$i; oc get secrets -n test$i; done
4.1
real 2m13.863s
user 2m8.778s
sys 0m24.360s
3.11
real 2m30.014s
user 2m23.904s
sys 0m23.964s
3.10
real 2m13.108s
user 1m52.827s
sys 0m20.193s
3.9
real 3m9.074s
user 2m48.212s
sys 0m27.287s
time for i in {0..199}; do oc scale --replicas=0 dc/deploymentconfig1 -n test$i; done
4.1
real 0m55.610s
user 0m50.061s
sys 0m9.342s
3.11
real 1m1.340s
user 0m52.529s
sys 0m9.098s
3.10
real 0m55.181s
user 0m41.092s
sys 0m7.713s
3.9
real 4m28.541s
user 0m57.179s
sys 0m9.277s
time for i in {0..199}; do oc delete project --wait=false test$i; done
4.1
real 0m47.125s
user 0m42.760s
sys 0m8.274s
3.11
real 0m53.575s
user 0m49.766s
sys 0m8.255s
3.10
real 0m44.581s
user 0m37.245s
sys 0m6.618s
3.9
real 0m59.102s
user 0m54.254s
sys 0m8.906s
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.
Description of problem:

While running the performance master vertical test (which creates 1000-5000 projects loaded with dcs, builds, secrets, routes, etc. using oc commands), it was noticed that the test ran considerably slower on 3.11; the test duration was 1.5 times greater in 3.11. Drilling down and experimenting with specific commands individually shows that each has a ~30% slowdown from 3.10. For individual commands this might not be noticeable; for scripted interactions like performance and scalability testing, the slowdown is a big problem.

The comparison between 3.10 and 3.11 below is on identical clusters: 1 master, 3 infra, 10 compute nodes on AWS m5.xlarge instances. For the following operations on 200 projects, the slowdowns are:

1. Create 200 projects and create 1 dc in each project
time for i in {0..199}; do oc new-project --skip-config-write test$i; oc create -n test$i -f dc.json; done
3.10: 63.8s
3.11: 82.2s (-29%)

2. For each of 200 projects, get the dcs, sa and secrets
time for i in {0..199}; do oc get dc -n test$i; oc get sa -n test$i; oc get secrets -n test$i; done
3.10: 37.5s
3.11: 49.7s (-32%)

3. For each of 200 projects, scale the dc replicas to 0
time for i in {0..199}; do oc scale --replicas=0 dc/deploymentconfig1 -n test$i; done
3.10: 27s
3.11: 35.8s (-32%)

4. Delete 200 projects (use --wait=false on 3.11)
time for i in {0..199}; do oc delete project --wait=false test$i; done
3.10: 28.5s
3.11: 35.9s (-26%)

Version-Release number of selected component (if applicable):
3.11.0-0.28.0

How reproducible:
Always

Steps to Reproduce:
1. Install 3.10.14 and 3.11.0-0.28.0 clusters in AWS
2. Run the commands in the description above (dc.json attached)

Actual results:
3.11 command responses are 30% slower

Expected results:
No degradation in response time from 3.10

Additional info:
Let me know what additional info is required. I'll attach a loglevel=10 trace from 3.10 and 3.11 for a command like scaling replicas to 0.
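The attached dc.json is not reproduced in this bug. For anyone trying to rerun the loops above without the attachment, a minimal stand-in DeploymentConfig along these lines should exercise the same code paths (the name matches the scale commands; the image and labels are arbitrary placeholders, not taken from the attachment):

# Hypothetical stand-in for the attached dc.json, not the actual attachment.
cat > dc.json <<'EOF'
{
  "apiVersion": "apps.openshift.io/v1",
  "kind": "DeploymentConfig",
  "metadata": { "name": "deploymentconfig1" },
  "spec": {
    "replicas": 1,
    "selector": { "app": "deploymentconfig1" },
    "template": {
      "metadata": { "labels": { "app": "deploymentconfig1" } },
      "spec": {
        "containers": [
          { "name": "pause", "image": "k8s.gcr.io/pause:3.1" }
        ]
      }
    }
  }
}
EOF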