Bug 1962186 - Increased build and push times for large number of builds
Summary: Increased build and push times for large number of builds
Keywords:
Status: CLOSED DUPLICATE of bug 1953102
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Antonio Ojea
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-05-19 13:24 UTC by Paige Rubendall
Modified: 2021-06-03 07:04 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-05-26 14:55:45 UTC
Target Upstream Version:
Embargoed:



Description Paige Rubendall 2021-05-19 13:24:12 UTC
Description of problem:
On a 250-worker-node cluster, running 2000 concurrent builds causes SDN and monitoring pods to fail periodically. This appears to slow down both the builds themselves and their image pushes.

This is a regression from previous releases: the same test completed successfully (in a reasonable amount of time, with all builds passing) in the 4.5, 4.6, and 4.7 releases.

Version-Release number of selected component (if applicable): 4.8.0-0.nightly-2021-05-10-092939


How reproducible: 100%


Steps to Reproduce:
1. Clone https://github.com/openshift/svt repo
2. cd to svt/openshift_performance/ci/scripts
3. Make sure "python --version" returns python 2 (see more info https://github.com/openshift/svt/blob/master/openshift_performance/ci/scripts/README.md)
4. Edit conc_builds.sh to have the following:

build_array=(2000) #line 10
app_array=("cakephp") #line 12
readonly PROJECT_NUM=2000 #line 14

5. Edit ../content/conc_builds_cakephp.yaml to have 2000 projects as well (second line)
6. Run command: ./conc_builds.sh


Actual results:
Huge increase in both build and push times:
Average build time, all good builds: 710 seconds (normally in the low to mid 100s)
Average push time, all good builds: 411.67507075 seconds (normally about 3 to 7 seconds)


Expected results:
Build and push times are comparable to previous releases at scale.

Additional info:
There were no failed components during this run, monitored using cerberus (https://github.com/cloud-bulldozer/cerberus) 

With fewer concurrent builds, the build and push times were comparable with previous releases. The timings climb sharply at around 1500 builds.

I completed 2 iterations of this test and recorded the average build, push, fetch, and pull times for each iteration.

All of the numbers below are in seconds.
iteration 1
  Avg build time (from duration): 1269.706 Max build time: 1705.0 Min build time: 298.0
  Avg build time: 425.539 Max build time: 673.461 Min build time: 189.545
  Avg push time: 802.193 Max push time: 1157.179 Min push time: 50.820
  Avg fetch time: 0.706
  Avg pull time: 27.497
iteration 2
  Avg build time (from duration): 150.6245 Max build time: 457.0 Min build time: 43.0
  Avg build time: 50.532 Max build time: 292.164 Min build time: 12.343
  Avg push time: 21.155 Max push time: 172.118 Min push time: 2.933
  Avg fetch time: 0.583
  Avg pull time: 16.020
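
As a sanity check, the per-iteration averages can be combined to see how they relate to the overall 4.8 figures under "Actual results" (710 and 411.675). The sketch below takes an unweighted mean of the two iterations. It assumes each iteration ran the same number of builds (2000 each, 4000 in total), which the report implies but does not state outright. The dictionary keys and the function name are illustrative only.

```python
# Per-iteration averages in seconds, copied from the report.
iterations = [
    {"build_duration": 1269.706, "push": 802.193},  # iteration 1
    {"build_duration": 150.6245, "push": 21.155},   # iteration 2
]


def mean_across_iterations(key):
    """Unweighted mean; only valid if every iteration ran the same number of builds."""
    return sum(it[key] for it in iterations) / len(iterations)


avg_build = mean_across_iterations("build_duration")
avg_push = mean_across_iterations("push")

print(f"avg build time (from duration): {avg_build:.3f} s")
print(f"avg push time:                  {avg_push:.3f} s")
```

Under that assumption the combined values land very close to the reported overall averages. That suggests the 4.8 totals are dominated by the much slower first iteration.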


Times from previous releases for comparison (all used the same setup as the current test):
4.5 
Average build time, all good builds: 117
Average push time, all good builds: 3.4631505
Good builds included in stats: 4000


4.6
Average build time, all good builds: 137
Average push time, all good builds: 3.18754688672
Good builds included in stats: 4000


4.7
Average build time, all good builds: 117
Average push time, all good builds: 6.48
Good builds included in stats: 4000


4.8 
Average build time, all good builds: 710
Average push time, all good builds: 411.67507075
Good builds included in stats: 4000
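
To put the regression in relative terms, the following minimal sketch divides the 4.8 averages by each earlier release's averages. The numbers are copied from the comparison above. The `results` table and the `slowdown` helper are hypothetical names used for illustration and are not part of the svt scripts.

```python
# (average build time, average push time) in seconds, per release.
results = {
    "4.5": (117, 3.4631505),
    "4.6": (137, 3.18754688672),
    "4.7": (117, 6.48),
    "4.8": (710, 411.67507075),
}


def slowdown(baseline, current="4.8"):
    """Return (build ratio, push ratio) of `current` relative to `baseline`."""
    base_build, base_push = results[baseline]
    cur_build, cur_push = results[current]
    return cur_build / base_build, cur_push / base_push


for release in ("4.5", "4.6", "4.7"):
    build_x, push_x = slowdown(release)
    print(f"4.8 vs {release}: build {build_x:.1f}x slower, push {push_x:.1f}x slower")
```

By this measure builds are roughly 5 to 6 times slower, while pushes are slower by well over an order of magnitude against every earlier release.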

Comment 4 Paige Rubendall 2021-05-19 15:25:29 UTC
Another note: I hit something similar in 4.7 (not nearly as bad), and it improved after setting worker nodes as the node selector in the cluster build config.

I have already edited the build configuration to schedule builds only onto worker nodes, and I am still seeing this regression.

oc get -o yaml build.config.openshift.io cluster
Add the following to the bottom of the YAML:
```
spec:
  buildOverrides:
    nodeSelector:
      node-role.kubernetes.io/worker: ""
```
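
The same override could also be applied non-interactively. The sketch below is not what was run for this report. It builds a JSON merge patch equivalent to the YAML above and prints an `oc patch` command for review. It assumes `oc` is on the PATH and logged in with permission to modify the cluster build configuration.

```python
import json
import shlex

# Same override as the YAML snippet, expressed as a JSON merge patch.
patch = {
    "spec": {
        "buildOverrides": {
            "nodeSelector": {"node-role.kubernetes.io/worker": ""}
        }
    }
}

# Assumption: `oc` is installed and authenticated against the target cluster.
cmd = [
    "oc", "patch", "build.config.openshift.io", "cluster",
    "--type=merge", "-p", json.dumps(patch),
]

# Print the command for review instead of executing it here.
print(shlex.join(cmd))
```

Printing rather than executing keeps the sketch safe to run anywhere. The printed command can then be run by hand once it has been checked.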

Comment 10 Antonio Ojea 2021-05-26 09:59:01 UTC
I think this is caused by another bug: https://bugzilla.redhat.com/show_bug.cgi?id=1953102

That bug will be fixed by https://github.com/openshift/kubernetes/pull/761

All performance tests will be affected by this bug one way or another, so we should repeat the test once the fix is in to be sure.

Comment 11 Ben Bennett 2021-05-26 14:55:45 UTC

*** This bug has been marked as a duplicate of bug 1953102 ***

