Bug 1685352

Summary: S2I build times 1.25x - 5x longer on 4.1 compared to 3.11
Product: OpenShift Container Platform Reporter: Mike Fiedler <mifiedle>
Component: BuildAssignee: Adam Kaplan <adam.kaplan>
Status: CLOSED DEFERRED QA Contact: wewang <wewang>
Severity: high Docs Contact:
Priority: urgent    
Version: 4.1.0CC: aos-bugs, bparees, dornelas, dwalsh, erich, hongkliu, nstielau, ochaloup, pkremens, pthomas, rheinzma, ricferna, skordas, vcojot, wzheng
Target Milestone: ---   
Target Release: 4.4.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: aos-scalability-40
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-12-03 15:18:28 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Attachments:
Description Flags
3.11 and 4.1 logs at build-loglevel=5
none
logs comparing binary builds in 3.11 and 4.1 none

Description Mike Fiedler 2019-03-05 02:26:35 UTC
Description of problem:

Building the same application on 4.0 vs 3.11 takes significantly longer.   Using the quickstarts as examples, the difference is:

Building the same app on m5.xlarge AWS instances running in us-west-2 on 4.0 and 3.11 yields the following build times.   The times below are a) first/last timestamp in build pod log, b) time reported in oc get builds and c) time reported by "time oc start-build --follow --timestamps".    A sample pod lifecycle comparison is in Additional Information.
                                  3.11                       4.0
                          a        b         c       a       b        c
nodejs-mongodb-example   18s      22s       26s     32s     54s      58s
django-psql-example      15s      26s       31s     38s     63s      68s
rails-postgresql-ex     101s     112s      117s    113s    142s     145s
dancer-mysql-example    114s     124s      129s    120s    149s     155s
cakephp-mysql-example    37s      49s       53s     51s     81s      89s


The time from start of Init to the build pod Running seems to consistently be 10 seconds longer on 4.0 regardless of build type.

Version-Release number of selected component (if applicable):  4.0.0-0.nightly-2019-03-04-180718


How reproducible: Always


Additional info:

Sample timeline for cakephp.  

3.11:

nodejs-mongodb-example-1-build   0/1       Completed   0          11m
nodejs-mongodb-example-2-build   0/1       Completed   0          7m
nodejs-mongodb-example-3-build   0/1       Pending   0         0s
nodejs-mongodb-example-3-build   0/1       Pending   0         0s
nodejs-mongodb-example-3-build   0/1       Init:0/2   0         0s
nodejs-mongodb-example-3-build   0/1       Init:0/2   0         3s
nodejs-mongodb-example-3-build   0/1       Init:1/2   0         8s
nodejs-mongodb-example-3-build   0/1       PodInitializing   0         10s
nodejs-mongodb-example-3-build   1/1       Running   0         11s
nodejs-mongodb-example-3-build   0/1       Completed   0         22s


4.0:

nodejs-mongodb-example-3-build   0/1   Pending   0     0s
nodejs-mongodb-example-3-build   0/1   Pending   0     0s
nodejs-mongodb-example-3-build   0/1   Init:0/2   0     0s
nodejs-mongodb-example-3-build   0/1   Init:0/2   0     8s
nodejs-mongodb-example-3-build   0/1   Init:0/2   0     8s
nodejs-mongodb-example-3-build   0/1   Init:1/2   0     19s
nodejs-mongodb-example-3-build   0/1   PodInitializing   0     20s
nodejs-mongodb-example-3-build   1/1   Running   0     21s
nodejs-mongodb-example-3-build   0/1   Completed   0     53s
nodejs-mongodb-example-3-build   0/1   Completed   0     53s

django-psql-example:

3.11:

django    django-psql-example-2-build   0/1       Pending   0         0s
django    django-psql-example-2-build   0/1       Pending   0         0s
django    django-psql-example-2-build   0/1       Init:0/2   0         0s
django    django-psql-example-2-build   0/1       Init:0/2   0         3s
django    django-psql-example-2-build   0/1       Init:1/2   0         9s
django    django-psql-example-2-build   0/1 PodInitializing   0         10s
django    django-psql-example-2-build   1/1       Running   0         11s
django    django-psql-example-2-build   0/1       Completed   0         26s


4.0:

django   django-psql-example-2-build   0/1   Pending   0     1s
django   django-psql-example-2-build   0/1   Pending   0     1s
django   django-psql-example-2-build   0/1   Init:0/2   0     1s
django   django-psql-example-2-build   0/1   Init:0/2   0     9s
django   django-psql-example-2-build   0/1   Init:0/2   0     9s
django   django-psql-example-2-build   0/1   Init:1/2   0     20s
django   django-psql-example-2-build   0/1   PodInitializing   0     21s
django   django-psql-example-2-build   1/1   Running   0     23s
django   django-psql-example-2-build   0/1   Completed   0     64s
django   django-psql-example-2-build   0/1   Completed   0     64s

Comment 14 Mike Fiedler 2019-05-29 16:48:51 UTC
I did a test between 3.11 and 4.1 running the eap72-basic-s2i build.   After warming up all of the nodes, I ran 5 builds of each with build-loglevel=5 and compared representative build output.   I did not see some of the huge differences mentioned in comment 12, but did breakdown the differences.   Most of the descrepancy can be explained by the upfront pull of the jboss-eap72-openshift image but there is a non-trivial difference in additional time in 4.1 after the BUILD SUCCESS message and before the push starts.

Details:

oc get builds -o wide

3.11: eap-app-9   Source   Git@99fa61a   Complete   3 minutes ago       36s
4.1:  eap-app-9   Source   Git@99fa61a   Complete   2 minutes ago       1m30s 

Seconds to reach milestone in build:
                                           3.11             4.1
Start -> Performing Maven build             19s             58s
BUILD SUCCESS                                7s              7s
Pushing image...                             1s             10s
Push complete                                6s              6s

I'll attach the logs.   Experiments continue.

Comment 15 Mike Fiedler 2019-05-29 16:51:51 UTC
Created attachment 1574916 [details]
3.11 and 4.1 logs at build-loglevel=5

Comment 16 Mike Fiedler 2019-05-30 00:52:46 UTC
The other phenomenon (which can be seen in https://bugzilla.redhat.com/show_bug.cgi?id=1704722#c17) is that the greater the number of concurrent builds started simultaneously on a node, the faster average build times degrade on 4.1.   The initial pulls of the build images get longer the more concurrent builds are running.   I'll try to profile the registry and network tomorrow, but my guess is general competition for pulling and writing to storage.

Example:  building cakephp avg build time - concurrent builds/node:
                   3.11               4.1         4.1 - time from build log start to "Starting S2I build" message
1 build/node        15s               46s              14s      
3 build/node        13s               82s              50s
5 build/node        17s              108s              76s

Comment 17 Mike Fiedler 2019-05-30 17:58:16 UTC
Created attachment 1575325 [details]
logs comparing binary builds in 3.11 and 4.1

Another example using binary builds.   On 3.11 the build runs in 8-10 seconds and in 4.1 it is 55-65 seconds.   The logs show the bulk of the difference in the initial pull of the jboss-eap71-openshift image and the delay before the start of the push.

Comment 21 Adam Kaplan 2019-12-03 15:18:28 UTC
Closing this as DEFERRED because there are multiple initiatives on this front aiming to improve build speeds.
We have already seen a 50% improvement in build speed/scale between 4.1 and 4.2 due to fixes in buildah and crio.

We still need to address the issue with providing a means of caching images on nodes and/or another persistent storage.
This is being tracked in JIRA - https://jira.coreos.com/browse/DEVEXP-467