Created attachment 1747985 [details]
Description of all the failed builds

Description of problem:
On a 250 worker node cluster, running 2000 concurrent builds causes SDN and monitoring pods to fail periodically. This appears to cause the builds themselves to run slower or fail completely. This is a regression from previous releases, as this test completed successfully (in a reasonable amount of time, with all builds passing) in both the 4.5 and 4.6 releases.

Version-Release number of selected component (if applicable):
# oc version
Client Version: 4.7.0-fc.2
Server Version: 4.7.0-fc.2
Kubernetes Version: v1.20.0+394a5a3

How reproducible:
100%

Steps to Reproduce:
1. Clone the https://github.com/openshift/svt repo
2. cd to svt/openshift_performance/ci/scripts
3. Make sure "python --version" returns Python 2 (see https://github.com/openshift/svt/blob/master/openshift_performance/ci/scripts/README.md for more info)
4. Edit conc_builds.sh to have the following:
   build_array=(2000)      # line 10
   app_array=("cakephp")   # line 12
5. Edit ../content/conc_builds_cakephp.yaml to use 2000 projects as well (second line)
6. Run the command: ./conc_builds.sh
(A sketch of this setup as a shell session is included below.)

Actual results:
Multiple failed builds (~23)
Increased build and push times:
  build time increase of about 108.8%
  push time increase of about 39.4%

Expected results:
All builds complete successfully with no errors, and build and push times are comparable to (not regressed from) previous releases.

Additional info:
Previous release run statistics (average build time, average push time, number of successful builds; should be 4000):

4.5
Average build time, all good builds: 117
Average push time, all good builds: 3.4631505
Good builds included in stats: 4000

4.6
Average build time, all good builds: 137
Average push time, all good builds: 3.18754688672
Good builds included in stats: 4000

4.7
Average build time, all good builds: 286
Average push time, all good builds: 4.44306864471
Good builds included in stats: 3977
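For reference, a minimal sketch of the reproduction setup as a shell session, assuming the conc_builds.sh variables shown in steps 4-5 above. The sed expressions are only one way to make those edits (they can equally be made by hand), and line positions may differ between svt revisions:

  # Clone the svt repo and move into the concurrent-build scripts directory
  git clone https://github.com/openshift/svt
  cd svt/openshift_performance/ci/scripts

  # The scripts require Python 2 as the default interpreter
  python --version        # expect: Python 2.x

  # Configure a single 2000-build cakephp run (edits from steps 4-5; illustrative only)
  sed -i 's/^build_array=.*/build_array=(2000)/' conc_builds.sh
  sed -i 's/^app_array=.*/app_array=("cakephp")/' conc_builds.sh

  # Bump the project count in the cakephp content file to match (second line)
  vi ../content/conc_builds_cakephp.yaml

  # Kick off the run
  ./conc_builds.sh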
*** Bug 1916930 has been marked as a duplicate of this bug. ***
Created attachment 1749481 [details] Build logs
Created attachment 1749482 [details] Another build log
From pod logs:

  Warning  Failed         174m                  kubelet  Error: context deadline exceeded
  Warning  Failed         173m                  kubelet  Error: Kubelet may be retrying requests that are timing out in CRI-O due to system load: the requested container k8s_manage-dockerfile_cakephp-mysql-example-3-build_svt-cakephp-129_563485e7-6086-4adf-8413-45e464503d26_0 is now ready and will be provided to the kubelet on next retry: error reserving ctr name k8s_manage-dockerfile_cakephp-mysql-example-3-build_svt-cakephp-129_563485e7-6086-4adf-8413-45e464503d26_0 for id 6a40107100ae3535417df89ad53e68b9b8f15c604b09869e737b9de8f9f3ebb1: name is reserved
  Normal   Pulled         173m (x3 over 176m)   kubelet  Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8260d423f922c1673098679fc5ba68e069feababf8174c131400664429bea2eb" already present on machine
  Normal   Created        173m                  kubelet  Created container manage-dockerfile
  Normal   Started        173m                  kubelet  Started container manage-dockerfile
  Normal   Pulled         171m (x2 over 173m)   kubelet  Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8260d423f922c1673098679fc5ba68e069feababf8174c131400664429bea2eb" already present on machine
  Warning  Failed         157m (x6 over 167m)   kubelet  Error: ImageInspectError
  Warning  InspectFailed  147m (x11 over 167m)  kubelet  Failed to inspect image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8260d423f922c1673098679fc5ba68e069feababf8174c131400664429bea2eb": rpc error: code = DeadlineExceeded desc = context deadline exceeded

https://bugzilla.redhat.com/show_bug.cgi?id=1785399 could be contributing to this huge regression in build times/failures.
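In case anyone wants to re-check for the same CRI-O/kubelet symptoms on a future run, something along these lines should surface them (the namespace, pod, and node names are placeholders for whatever the failing build landed on):

  # Events for a failing build pod (source of the output above)
  oc describe pod cakephp-mysql-example-3-build -n svt-cakephp-129

  # All warning events in the namespace, newest last
  oc get events -n svt-cakephp-129 --field-selector type=Warning --sort-by=.lastTimestamp

  # CRI-O journal on the node that ran the pod, to see the name-reservation /
  # deadline-exceeded errors from the runtime side
  oc adm node-logs <worker-node-name> -u crio | grep -iE 'deadline|name is reserved'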
Hi,

The bug mentioned in comment 5 has been fixed and should be included in release 4.7.0-0.nightly-2021-01-19-033533. Could you please try re-verifying this bug with that version?

Thanks,
Alex
@prubenda could you help verify this bug on the 4.7.0-0.nightly-2021-01-19-033533 version?
Any update on this?
prubenda, could you help check whether this bug can be reproduced on the above version?
After many reruns and different setups, I was finally able to run this test successfully. It turned out the buildconfig was not set up correctly for my OpenShift cluster: I had to change the buildconfig to assign builds only to worker nodes, not to worker and infra nodes. After doing this I get almost all successful builds, and build/push times similar to previous releases. I also do not see any pods in the openshift-sdn namespace crashing during the test run.

The push times for 4.7 were a little higher than in 4.6; I'm still doing some more investigation, but I do not think it is an issue with SDN.

4.6
Average build time, all good builds: 137
Average push time, all good builds: 3.18754688672
Good builds included in stats: 4000

4.7
Average build time, all good builds: 117
Average push time, all good builds: 6.4799659915
Good builds included in stats: 3999

# oc get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-fc.4   True        False         2d      Cluster version is 4.7.0-fc.4
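For anyone hitting the same scheduling issue, a minimal sketch of the kind of buildconfig change that was needed; the buildconfig name and namespace here are placeholders (the svt templates create many projects), so treat this as illustrative only:

  # Pin build pods of one BuildConfig to worker nodes only, so they do not
  # land on infra nodes
  oc patch bc/cakephp-mysql-example -n svt-cakephp-1 --type=merge \
    -p '{"spec":{"nodeSelector":{"node-role.kubernetes.io/worker":""}}}'

  # Verify the selector took effect
  oc get bc/cakephp-mysql-example -n svt-cakephp-1 -o jsonpath='{.spec.nodeSelector}'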
Just wanted to give an update: I took a look at a smaller OpenShift cluster set up to analyze why the push times had doubled for the cakephp application. I also ran the other applications I had data on, just to see whether it was specific to that one application.

New test case: 3, 15, 30, 75 concurrent builds on 3 nodes (m5.2xlarge), i.e. 1, 5, 10, 25 concurrent builds per node. (The push times are an average of 15 random builds, each configuration executed 3 times.)

4.7 push times per app:
Cakephp: [3.27362222222, 3.17795555556, 3.30935555556, 3.22613333333]
Eap:     [2.5452, 2.61517777778, 2.65466666667, 2.46175555556]
Nodejs:  [2.34306666667, 2.47508888889, 2.5218, 2.78428888889]
Rails:   [5.44922222222, 4.90788888889, 4.844, 5.00088888889]

4.6 push times per app:
Cakephp: [3.1896, 3.22486666667, 3.15468888889, 3.2118]
Eap:     [2.46637777778, 2.44886666667, 2.57286666667, 2.5348]
Nodejs:  [2.35851111111, 2.41413333333, 2.33922222222, 2.38813333333]
Rails:   [4.779, 4.58784444444, 4.47655555556, 4.70086666667]

Comparing the push times of 4.7 to 4.6 showed less than a 5% increase; no noticeable regression. I'm wondering if the registry we were pushing to was simply throttled by the large number of builds (the whole run pushes about 6000 builds).

Note: the number of builds per node in the larger scaled scenario is only about 8, but here we were testing up to 25.
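As a quick way to re-run the comparison from the numbers in this comment, here is a small awk sketch that averages each app's four data points for 4.7 and 4.6 and prints the relative change. This is a crude average-of-averages, so it will not necessarily match exactly how the per-concurrency comparison was done:

  # Values copied from the two lists above; prints 4.6 avg, 4.7 avg, and % change per app
  awk 'BEGIN {
    new["cakephp"] = (3.27362222222+3.17795555556+3.30935555556+3.22613333333)/4
    old["cakephp"] = (3.1896+3.22486666667+3.15468888889+3.2118)/4
    new["eap"]     = (2.5452+2.61517777778+2.65466666667+2.46175555556)/4
    old["eap"]     = (2.46637777778+2.44886666667+2.57286666667+2.5348)/4
    new["nodejs"]  = (2.34306666667+2.47508888889+2.5218+2.78428888889)/4
    old["nodejs"]  = (2.35851111111+2.41413333333+2.33922222222+2.38813333333)/4
    new["rails"]   = (5.44922222222+4.90788888889+4.844+5.00088888889)/4
    old["rails"]   = (4.779+4.58784444444+4.47655555556+4.70086666667)/4
    for (app in new)
      printf "%-8s 4.6=%.3f  4.7=%.3f  change=%+.1f%%\n",
             app, old[app], new[app], 100*(new[app]-old[app])/old[app]
  }'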
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days