Description of problem:

While running the performance master vertical test (creates 1000-5000 projects loaded with dcs, builds, secrets, routes etc. using oc commands) it was noticed that the test runs considerably slower on 3.11: the test duration was 1.5 times greater than on 3.10. Drilling down and experimenting with specific commands individually shows that each one is roughly 30% slower than on 3.10. For individual commands that might not be noticeable, but for scripted interactions such as performance and scalability testing the slowdown is a big problem.

The comparison between 3.10 and 3.11 below is on identical clusters - 1 master, 3 infra, 10 compute nodes on AWS m5.xlarge instances. For the following operations on 200 projects, the slowdowns are:

1. Create 200 projects and create 1 dc in each project
time for i in {0..199}; do oc new-project --skip-config-write test$i; oc create -n test$i -f dc.json; done
3.10: 63.8s   3.11: 82.2s   (-29%)

2. For each of 200 projects, get the dcs, sa and secrets
time for i in {0..199}; do oc get dc -n test$i; oc get sa -n test$i; oc get secrets -n test$i; done
3.10: 37.5s   3.11: 49.7s   (-32%)

3. For each of 200 projects, scale the dc replicas to 0
time for i in {0..199}; do oc scale --replicas=0 dc/deploymentconfig1 -n test$i; done
3.10: 27s     3.11: 35.8s   (-32%)

4. Delete 200 projects (use --wait=false on 3.11)
time for i in {0..199}; do oc delete project --wait=false test$i; done
3.10: 28.5s   3.11: 35.9s   (-26%)

Version-Release number of selected component (if applicable):
3.11.0-0.28.0

How reproducible:
Always

Steps to Reproduce:
1. Install 3.10.14 and 3.11.0-0.28.0 clusters in AWS.
2. Run the commands in the description above (dc.json attached).

Actual results:
3.11 command responses are ~30% slower.

Expected results:
No degradation in response time from 3.10.

Additional info:
Let me know what additional info is required. I'll attach --loglevel=10 output from 3.10 and 3.11 for a command such as scaling replicas to 0.
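For anyone re-running these numbers, here is a minimal wrapper sketch around the four loops above. It assumes dc.json is in the current directory and oc is the client under test, and it only prints elapsed seconds per loop so that runs are easy to compare across versions:

#!/usr/bin/env bash
# Sketch: time the four test loops from this report with a single oc binary.
# Assumes dc.json is in the current directory and oc points at the cluster under test.
set -euo pipefail

N=${N:-200}   # number of projects, 200 as in the report

run() {
  local label=$1; shift
  local start=$SECONDS
  "$@" >/dev/null
  echo "$label: $((SECONDS - start))s"
}

test1() { for i in $(seq 0 $((N-1))); do oc new-project --skip-config-write "test$i"; oc create -n "test$i" -f dc.json; done; }
test2() { for i in $(seq 0 $((N-1))); do oc get dc -n "test$i"; oc get sa -n "test$i"; oc get secrets -n "test$i"; done; }
test3() { for i in $(seq 0 $((N-1))); do oc scale --replicas=0 dc/deploymentconfig1 -n "test$i"; done; }
test4() { for i in $(seq 0 $((N-1))); do oc delete project --wait=false "test$i"; done; }

run "1. create projects + dc" test1
run "2. get dc/sa/secrets"    test2
run "3. scale dc to 0"        test3
run "4. delete projects"      test4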
Created attachment 1481440 [details]
3.10 loglevel=10 sample

oc scale --replicas=0 dc/deploymentconfig1 -n test1 --loglevel=10 > 3.10_scale_to_0_replicas.txt 2>&1
Created attachment 1481441 [details]
3.11 loglevel=10 sample

oc scale --replicas=0 dc/deploymentconfig1 -n test1 --loglevel=10 > 3.11_scale_to_0_replicas.txt 2>&1
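A quick way to see how much of the wall-clock time goes into HTTP round trips versus client-side work is to sum the per-request timings in these verbose logs. This is only a sketch and assumes the client-go request log format ("... 200 OK in N milliseconds") that --loglevel=10 produces; the exact format may vary between releases:

# Sum the per-request round-trip times reported in the verbose log (format assumed).
grep -Eo 'in [0-9]+ milliseconds' 3.11_scale_to_0_replicas.txt \
  | awk '{total += $2} END {printf "requests: %d, total round-trip time: %d ms\n", NR, total}'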
Mike, before judging what is causing this 30% increase, I'd like to ask you to perform one more test, if possible. You've mentioned that you've performed tests on 3.10 and 3.11, but each test always used the exact same version of CLI and API. I'd like to see a test using a 3.11 oc against a 3.10 server; based on that we'll be able to verify whether the slowness is coming from the CLI or from the API. With both at 3.10 or both at 3.11 it's hard to judge, do you agree?
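A minimal sketch of that cross-version run, assuming both client binaries are available side by side (the paths below are illustrative) and share the same kubeconfig, so only the client changes between runs:

# Sketch: same loop, same kubeconfig (i.e. same server), two different client binaries.
# The binary paths are illustrative; point them at the real 3.10 and 3.11 oc builds.
export KUBECONFIG=~/kubeconfig-3.10-cluster   # assumption: kubeconfig for the 3.10 server
for OC in /usr/local/bin/oc-3.10 /usr/local/bin/oc-3.11; do
  echo "== $($OC version | head -1) =="
  time for i in {0..199}; do $OC get dc -n test$i >/dev/null; done
done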
Ack, I will run that test today and report the results.
Looks like the problem is mostly in the client. 3.10.44 oc -> 3.11.0-0.28.0 server is almost identical to 3.10 -> 3.10: a little slower for the create/update and actually a little faster for the gets. 3.11 oc -> 3.11 server from the same client system shows the problem in the description. Let me know what other info would be useful.
Interesting that it is a client slowdown; I was guessing the API. Is there a way we can move this into OKD unit/functional testing, i.e. fail the build for anything over a 10% slowdown? I don't see any specific reason we can't catch this earlier in the lifecycle.
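A rough sketch of what such a gate could look like, nothing more than a threshold check against a hand-maintained baseline (the baseline value and the chosen loop are illustrative):

#!/usr/bin/env bash
# Sketch of a perf gate: fail if the measured loop is >10% slower than a recorded baseline.
set -euo pipefail

BASELINE_SECONDS=${BASELINE_SECONDS:-38}   # illustrative baseline for the "get" loop
THRESHOLD_PCT=10

start=$SECONDS
for i in {0..199}; do
  oc get dc -n "test$i" >/dev/null
done
elapsed=$((SECONDS - start))

limit=$((BASELINE_SECONDS * (100 + THRESHOLD_PCT) / 100))
echo "elapsed=${elapsed}s baseline=${BASELINE_SECONDS}s limit=${limit}s"
if [ "$elapsed" -gt "$limit" ]; then
  echo "FAIL: more than ${THRESHOLD_PCT}% slower than baseline"
  exit 1
fi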
Nick, we've been rewriting significant parts of the client over the course of 3.11 and finished only recently. The slowdown will be related to that, but we haven't had time to investigate further yet. I'm not sure we can put proper unit/functional testing in place to discover this; only the full performance suite that invokes these operations at high volume will show the actual problems, and a smaller scale might obscure the data.
Created attachment 1518021 [details] deployment config
Re-tested in 4.0 (Kube 1.12.4 rebase level) and there is further regression. Do you want a new bz to track 4.0 specifically?

1. Create 200 projects and create 1 dc in each project
v      time [s]   vs 3.10   vs 3.11
3.10   53.6       -         -
3.11   71.4       -33%      -
4.0    86.1       -61%      -21%

2. for each of 200 projects, get the dcs, sa and secrets
v      time [s]   vs 3.10   vs 3.11
3.10   82.6       -         -
3.11   106.0      -28%      -
4.0    135.9      -65%      -28%

3. for each of 200 projects, scale the dc replicas to 0
v      time [s]   vs 3.10   vs 3.11
3.10   25.4       -         -
3.11   35.1       -38%      -
4.0    37.6       -48%      -7%

4. Delete 200 projects
v      time [s]   vs 3.10   vs 3.11
3.10   37.3       -         -
3.11   47.6       -27%      -
4.0    53.5       -43%      -12%
Mike, we have a pending fix here: https://github.com/kubernetes/kubernetes/pull/73345, which brings discovery searches down from ~10s to ~1s. This will land in 4.0; we just need to fix other things first. There's no need for a new BZ, this is very high on my list. Thanks for the input!
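When re-testing the discovery change, keep in mind that the client caches discovery results locally, so cold and warm runs can differ. A sketch for comparing the two, assuming the usual local cache locations (the paths may differ by release):

# Compare cold vs warm discovery (cache paths are an assumption; adjust as needed).
rm -rf ~/.kube/cache/discovery ~/.kube/http-cache    # force a cold discovery pass
time oc get dc -n test1 >/dev/null                   # cold: pays the discovery cost
time oc get dc -n test1 >/dev/null                   # warm: served from the local cache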
Also Stefan is working on OpenAPI improvements server-side so they should also significantly improve the performance, see https://github.com/kubernetes/kube-openapi/pull/124. Let's re-test this once both fixes land, I'll ping you then.
The PRs we're waiting for are:
1. https://github.com/openshift/origin/pull/22020 (tagged for merge)
2. https://github.com/openshift/origin/pull/21935 (merged)
Mike, now that both PRs from my previous comment have landed, can you re-test on a build that includes them?
Retest: 2/28/2019
Results for build: 4.0.0-0.nightly-2019-02-27-213933

1. Create 200 projects and create 1 dc in each project
time for i in {0..199}; do oc new-project --skip-config-write test$i; oc create -n test$i -f dc.json; done
real 4m6.897s
user 1m11.245s
sys 0m11.567s
4.0 vs 3.10: -287%

2. for each of 200 projects, get the dcs, sa and secrets
time for i in {0..199}; do oc get dc -n test$i; oc get sa -n test$i; oc get secrets -n test$i; done
real 5m3.284s
user 1m54.370s
sys 0m18.666s
4.0 vs 3.10: -708%

3. for each of 200 projects, scale the dc replicas to 0
time for i in {0..199}; do oc scale --replicas=0 dc/deploymentconfig1 -n test$i; done
real 2m12.528s
user 0m44.631s
sys 0m7.204s
4.0 vs 3.10: -391%

4. Delete 200 projects (use --wait=false on 3.11)
time for i in {0..199}; do oc delete project --wait=false test$i; done
real 1m39.443s
user 0m38.657s
sys 0m6.300s
4.0 vs 3.10: -249%
Retest: 3/5/2019
Client Version: v4.0.6
Server Version: v1.12.4+b3e6ad4
Results for build: 4.0.0-0.nightly-2019-03-05-065158

1. Create 200 projects and create 1 dc in each project
time for i in {0..199}; do oc new-project --skip-config-write test$i; oc create -n test$i -f dc.json; done
real 234.65
user 71.23
sys 11.55
4.0 vs 3.11: -185%

2. for each of 200 projects, get the dcs, sa and secrets
time for i in {0..199}; do oc get dc -n test$i; oc get sa -n test$i; oc get secrets -n test$i; done
real 297.82
user 115.41
sys 19.41
4.0 vs 3.11: -499%

3. for each of 200 projects, scale the dc replicas to 0
time for i in {0..199}; do oc scale --replicas=0 dc/deploymentconfig1 -n test$i; done
real 131.26
user 45.75
sys 7.43
4.0 vs 3.11: -386%

4. Delete 200 projects (use --wait=false on 3.11)
time for i in {0..199}; do oc delete project --wait=false test$i; done
real 96.81
user 38.72
sys 6.40
4.0 vs 3.11: -73%
Were both executions (oc 4.0 and oc 3.11) run against the same 4.0 cluster?
No. The 3.11 cluster was provisioned with a flexy build and the 4.0 cluster with the next-gen installer.
Simon, can you prepare a summary similar to the one in comment #10, where I can clearly see the difference between client and server versions and how that changes the results? It's hard to say which elements impact which parts, since there are many components involved in any single oc operation.
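To make that matrix easy to regenerate, here is a sketch that runs the four loops with each client binary against one fixed cluster and writes a CSV, so only the client version changes between rows (binary paths and the output file name are illustrative):

# Sketch: client-version matrix against a single fixed cluster, results as CSV.
# Note: test 4 deletes the projects again; with --wait=false the namespaces may still
# be terminating, so allow time before starting the next client's run.
echo "client,test,seconds" > matrix.csv
for OC in ./oc-3.11.93 ./oc-4.0.6; do
  for t in 1 2 3 4; do
    start=$SECONDS
    case $t in
      1) for i in {0..199}; do $OC new-project --skip-config-write test$i >/dev/null; $OC create -n test$i -f dc.json >/dev/null; done ;;
      2) for i in {0..199}; do $OC get dc -n test$i >/dev/null; $OC get sa -n test$i >/dev/null; $OC get secrets -n test$i >/dev/null; done ;;
      3) for i in {0..199}; do $OC scale --replicas=0 dc/deploymentconfig1 -n test$i >/dev/null; done ;;
      4) for i in {0..199}; do $OC delete project --wait=false test$i >/dev/null; done ;;
    esac
    echo "$OC,$t,$((SECONDS - start))" >> matrix.csv
  done
done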
Maciej, here are my results from last Friday.

                                                       [1]     [2]     [3]     [4]
cli: 4.0.6    OCP: 4.0.0-0.nightly-2019-03-06-074438   244.5   301.3   136.2   100.2
cli: 3.11.59  OCP: 4.0.0-0.nightly-2019-03-06-074438   251.1   308.3   127.3   152.9
cli: 3.11.93  OCP: 3.11.93                             338.2   223.9   81.3    178.2

With 3.11.93 as the base:
                                                       [1]     [2]     [3]     [4]
cli: 4.0.6    OCP: 4.0.0-0.nightly-2019-03-06-074438   27%     -34%    -67%    -77%
cli: 3.11.59  OCP: 4.0.0-0.nightly-2019-03-06-074438   25%     -37%    -56%    -16%
cli: 3.11.93  OCP: 3.11.93                             100%    100%    100%    100%

Tests:
[1] time for i in {0..199}; do oc new-project --skip-config-write test$i; oc create -n test$i -f dc.json; done
[2] time for i in {0..199}; do oc get dc -n test$i; oc get sa -n test$i; oc get secrets -n test$i; done
[3] time for i in {0..199}; do oc scale --replicas=0 dc/deploymentconfig1 -n test$i; done
[4] time for i in {0..199}; do oc delete project --wait=false test$i; done
I ran the same set of tests against origin-release:4.0.0-0.alpha-2019-03-15-093013; the results are as follows:

Client Version: v4.0.0-alpha.0+846cde5-1641-dirty
vs
oc v3.11.88, kubernetes v1.11.0+d4cacc0, features: Basic-Auth GSSAPI Kerberos SPNEGO
vs
oc v3.10.119, kubernetes v1.10.0+b81c8f8, features: Basic-Auth GSSAPI Kerberos SPNEGO

[1] time for i in {0..199}; do oc new-project --skip-config-write test$i; oc create -n test$i -f dc.json; done
4.0   real 1m59.420s  user 1m17.530s  sys 0m14.524s
3.11  real 2m20.215s  user 1m26.847s  sys 0m14.718s
3.10  real 1m48.523s  user 1m3.508s   sys 0m11.827s

[2] time for i in {0..199}; do oc get dc -n test$i; oc get sa -n test$i; oc get secrets -n test$i; done
4.0   real 2m11.206s  user 2m4.650s   sys 0m24.590s
3.11  real 2m26.767s  user 2m17.446s  sys 0m24.213s
3.10  real 2m10.587s  user 1m48.082s  sys 0m20.112s

[3] time for i in {0..199}; do oc scale --replicas=0 dc/deploymentconfig1 -n test$i; done
4.0   real 0m55.052s  user 0m49.514s  sys 0m9.593s
3.11  real 1m9.007s   user 0m50.180s  sys 0m9.111s
3.10  real 0m53.117s  user 0m38.975s  sys 0m7.612s

[4] time for i in {0..199}; do oc delete project --wait=false test$i; done
4.0   real 0m47.396s  user 0m41.734s  sys 0m8.283s
3.11  real 0m51.819s  user 0m45.850s  sys 0m8.136s
3.10  real 0m43.741s  user 0m34.899s  sys 0m6.515s

So from the above, 4.0 is at least 10% faster (up to 23%) than 3.11. Comparing 4.0 with 3.10 shows we are slower, usually by 1-3%, the only outlier being new-project, which is ~9% slower. Simon, mind double-checking your tests?
Mike, Simon, could you continue checking it? Thank you.
Latest QE results on 4.0.0-0.nightly-2019-03-28-030453, using a client in the same AWS region as the cluster. The 3.10 and 3.11 numbers below are from the original bug description.

1. Create 200 projects and create 1 dc in each project
time for i in {0..199}; do oc new-project --skip-config-write test$i; oc create -n test$i -f dc.json; done
3.10: 63.8s   3.11: 82.2s   4.0: 75.2s

2. for each of 200 projects, get the dcs, sa and secrets
time for i in {0..199}; do oc get dc -n test$i; oc get sa -n test$i; oc get secrets -n test$i; done
3.10: 37.5s   3.11: 49.7s   4.0: 79.3s

3. for each of 200 projects, scale the dc replicas to 0
time for i in {0..199}; do oc scale --replicas=0 dc/deploymentconfig1 -n test$i; done
3.10: 27s     3.11: 35.8s   4.0: 33.2s

4. Delete 200 projects (use --wait=false on 3.11)
time for i in {0..199}; do oc delete project --wait=false test$i; done
3.10: 28.5s   3.11: 35.9s   4.0: 30.4s

The get performance is the only outlier now. Marking this general bz VERIFIED; I will do a deeper 3.11 comparison on varying get scenarios. The improved create performance matches what the performance team saw in their cluster load-up runs.
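For the deeper get comparison, one way to slice the regression is to vary the output path, since different output formats exercise different client code; a sketch using standard oc get output flags:

# Sketch: time the same get loop with different output formats to see where the extra time goes.
for fmt in "" "-o name" "-o json" "-o wide"; do
  echo "== oc get dc $fmt =="
  time for i in {0..199}; do oc get dc -n test$i $fmt >/dev/null; done
done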
To shed some more light, I did one more round going all the way back to 3.9, and we're looking pretty good. All runs were against a 4.0.0-0.alpha-2019-04-01-080825 cluster:

time for i in {0..199}; do oc new-project --skip-config-write test$i; oc create -n test$i -f dc.json; done
4.1   real 1m57.810s  user 1m22.642s  sys 0m14.628s
3.11  real 2m2.103s   user 1m30.887s  sys 0m14.680s
3.10  real 1m39.206s  user 1m2.525s   sys 0m11.362s
3.9   real 2m13.753s  user 1m40.696s  sys 0m16.518s

time for i in {0..199}; do oc get dc -n test$i; oc get sa -n test$i; oc get secrets -n test$i; done
4.1   real 2m13.863s  user 2m8.778s   sys 0m24.360s
3.11  real 2m30.014s  user 2m23.904s  sys 0m23.964s
3.10  real 2m13.108s  user 1m52.827s  sys 0m20.193s
3.9   real 3m9.074s   user 2m48.212s  sys 0m27.287s

time for i in {0..199}; do oc scale --replicas=0 dc/deploymentconfig1 -n test$i; done
4.1   real 0m55.610s  user 0m50.061s  sys 0m9.342s
3.11  real 1m1.340s   user 0m52.529s  sys 0m9.098s
3.10  real 0m55.181s  user 0m41.092s  sys 0m7.713s
3.9   real 4m28.541s  user 0m57.179s  sys 0m9.277s

time for i in {0..199}; do oc delete project --wait=false test$i; done
4.1   real 0m47.125s  user 0m42.760s  sys 0m8.274s
3.11  real 0m53.575s  user 0m49.766s  sys 0m8.255s
3.10  real 0m44.581s  user 0m37.245s  sys 0m6.618s
3.9   real 0m59.102s  user 0m54.254s  sys 0m8.906s
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0758