Description of problem:
On a 250 worker node cluster, running 2000 concurrent builds causes SDN and monitoring pods to fail periodically. This seems to cause the builds themselves to build more slowly and to have slower push times. This is a regression from previous releases, as this test completed successfully (in a reasonable amount of time and with all builds passing) in the 4.5, 4.6, and 4.7 releases.

Version-Release number of selected component (if applicable): 4.8.0-0.nightly-2021-05-10-092939

How reproducible: 100%

Steps to Reproduce:
1. Clone the https://github.com/openshift/svt repo
2. cd to svt/openshift_performance/ci/scripts
3. Make sure "python --version" returns Python 2 (see https://github.com/openshift/svt/blob/master/openshift_performance/ci/scripts/README.md for more info)
4. Edit conc_builds.sh to have the following:
   build_array=(2000)          # line 10
   app_array=("cakephp")       # line 12
   readonly PROJECT_NUM=2000   # line 14
5. Edit ../content/conc_builds_cakephp.yaml to have 2000 projects as well (second line)
6. Run: ./conc_builds.sh

Actual results:
Huge increase in both build and push times.
Average build time, all good builds: 710 (normally in the middle to low 100s)
Average push time, all good builds: 411.67507075 (normally about 3 - 7 seconds)

Expected results:
Build and push times are comparable to previous releases at scale.

Additional info:
There were no failed components during this run, monitored using Cerberus (https://github.com/cloud-bulldozer/cerberus).
Running tests with lower build counts, the build and push times were comparable with previous releases; around 1500 builds the timings go way up.
I completed 2 iterations of this build test and got the average build, push, fetch, and pull times for each iteration. All of the numbers below are in seconds.

Iteration 1
Avg build time (from duration): 1269.706
Max build time: 1705.0
Min build time: 298.0
Avg build time: 425.539
Max build time: 673.461
Min build time: 189.545
Avg push time: 802.193
Max push time: 1157.179
Min push time: 50.820
Avg fetch time: 0.706
Avg pull time: 27.497

Iteration 2
Avg build time (from duration): 150.6245
Max build time: 457.0
Min build time: 43.0
Avg build time: 50.532
Max build time: 292.164
Min build time: 12.343
Avg push time: 21.155
Max push time: 172.118
Min push time: 2.933
Avg fetch time: 0.583
Avg pull time: 16.020

Previous releases' times for comparison (all have the same setup as the current test):

4.5
Average build time, all good builds: 117
Average push time, all good builds: 3.4631505
Good builds included in stats: 4000

4.6
Average build time, all good builds: 137
Average push time, all good builds: 3.18754688672
Good builds included in stats: 4000

4.7
Average build time, all good builds: 117
Average push time, all good builds: 6.48
Good builds included in stats: 4000

4.8
Average build time, all good builds: 710
Average push time, all good builds: 411.67507075
Good builds included in stats: 4000
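For convenience, here is the reproduction condensed into a rough shell transcript. This is only a sketch of the steps above; the comments about exact line numbers assume the file layout described in the repo's README at the time of this report, and the edits still have to be made by hand.

```
# Clone the SVT repo and move to the concurrent-build scripts
git clone https://github.com/openshift/svt
cd svt/openshift_performance/ci/scripts

# The harness expects Python 2 (see the README linked above)
python --version   # should report Python 2.x

# Edit conc_builds.sh so that:
#   build_array=(2000)          (around line 10)
#   app_array=("cakephp")       (around line 12)
#   readonly PROJECT_NUM=2000   (around line 14)
# and set the project count on the second line of
# ../content/conc_builds_cakephp.yaml to 2000 as well.

# Kick off the 2000 concurrent builds
./conc_builds.sh
```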
Another note: I had hit something like this in 4.7 (not nearly as bad), and it got better after setting the worker role as the node selector in the cluster build config. For this run I had already edited the build configuration to only schedule builds onto worker nodes, and I am still seeing this regression.

oc get -o yaml build.config.openshift.io cluster

Add the following to the bottom of the YAML:

```
spec:
  buildOverrides:
    nodeSelector:
      node-role.kubernetes.io/worker: ""
```
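For reference, one way to apply that same override without hand-editing the resource is a merge patch; this is just a sketch of the equivalent command, not part of the original test procedure:

```
oc patch build.config.openshift.io cluster --type=merge \
  -p '{"spec":{"buildOverrides":{"nodeSelector":{"node-role.kubernetes.io/worker":""}}}}'
```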
I think this is caused by another bug, https://bugzilla.redhat.com/show_bug.cgi?id=1953102, which will be fixed by https://github.com/openshift/kubernetes/pull/761. All performance tests are going to be affected one way or another by that bug, so we should repeat this test once the fix is in, to be totally sure.
*** This bug has been marked as a duplicate of bug 1953102 ***