Bug 1916931 - SDN failures causing builds to fail and increased build/push times
Summary: SDN failures causing builds to fail and increased build/push times
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Dan Winship
QA Contact: Paige Rubendall
URL:
Whiteboard:
Duplicates: 1916930 (view as bug list)
Depends On:
Blocks:
 
Reported: 2021-01-15 20:43 UTC by Paige Rubendall
Modified: 2023-09-15 00:58 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-03 21:54:31 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
Describe output of all the failed builds (91.34 KB, text/plain)
2021-01-15 20:43 UTC, Paige Rubendall
no flags Details
Build logs (21.34 KB, text/plain)
2021-01-21 17:58 UTC, Paige Rubendall
no flags Details
Another build log (21.69 KB, text/plain)
2021-01-21 17:59 UTC, Paige Rubendall
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1785399 0 urgent CLOSED Under condition of heavy pod creation, creation fails with 'error reserving pod name ...: name is reserved' 2024-06-13 22:21:03 UTC

Internal Links: 1979999

Description Paige Rubendall 2021-01-15 20:43:32 UTC
Created attachment 1747985 [details]
Describe output of all the failed builds

Description of problem:
On a 250-worker-node cluster, running 2000 concurrent builds causes SDN and monitoring pods to fail periodically. This appears to cause the builds themselves to run more slowly or fail completely.

This is a regression from previous releases, as this test completed successfully (in a reasonable amount of time, with all builds passing) in both the 4.5 and 4.6 releases.


Version-Release number of selected component (if applicable):
# oc version
Client Version: 4.7.0-fc.2
Server Version: 4.7.0-fc.2
Kubernetes Version: v1.20.0+394a5a3

How reproducible:
100%


Steps to Reproduce:
1. Clone https://github.com/openshift/svt repo
2. cd to svt/openshift_performance/ci/scripts
3. Make sure "python --version" reports Python 2 (see https://github.com/openshift/svt/blob/master/openshift_performance/ci/scripts/README.md for more info)
4. Edit conc_builds.sh to have the following:

build_array=(2000) #line 10
app_array=("cakephp") #line 12

5. Edit ../content/conc_builds_cakephp.yaml to have 2000 projects as well (second line)
6. Run command: ./conc_builds.sh (a sketch of these edits and the run is below)
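
For reference, a minimal sketch of steps 1-6 as shell commands (the sed line numbers come from the step descriptions above, but the exact yaml edit is an assumption and has not been verified against the current state of the repo):

git clone https://github.com/openshift/svt
cd svt/openshift_performance/ci/scripts

# step 4: request 2000 concurrent cakephp builds (lines 10 and 12 of conc_builds.sh)
sed -i '10s/.*/build_array=(2000)/' conc_builds.sh
sed -i '12s/.*/app_array=("cakephp")/' conc_builds.sh

# step 5: bump the project count on the second line of the content file
# (this pattern just replaces the first number on that line; adjust to match the file)
sed -i '2s/[0-9]\+/2000/' ../content/conc_builds_cakephp.yaml

# step 6: kick off the run
./conc_builds.sh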

Actual results:

Multiple failed builds (~23 of 4000)
Increased build and push times relative to the 4.6 averages under Additional info:
build time increase of about 108.76%
push time increase of about 39.39%




Expected results:
All builds complete successfully with no errors, and build and push times are comparable to (not regressed from) previous releases.


Additional info:

Per-release run statistics: average build time, average push time, and number of successful builds (should be 4000)

4.5 
Average build time, all good builds: 117
Average push time, all good builds: 3.4631505
Good builds included in stats: 4000


4.6
Average build time, all good builds: 137
Average push time, all good builds: 3.18754688672
Good builds included in stats: 4000


4.7
Average build time, all good builds: 286
Average push time, all good builds: 4.44306864471
Good builds included in stats: 3977
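
For reference, the regression percentages quoted under Actual results follow directly from the 4.6 and 4.7 averages above (just the numbers from this report plugged into awk):

# build time regression, 4.7 vs 4.6
awk 'BEGIN { printf "%.2f%%\n", (286 - 137) / 137 * 100 }'
108.76%
# push time regression, 4.7 vs 4.6
awk 'BEGIN { printf "%.2f%%\n", (4.44306864471 - 3.18754688672) / 3.18754688672 * 100 }'
39.39%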

Comment 2 Ben Bennett 2021-01-18 15:58:55 UTC
*** Bug 1916930 has been marked as a duplicate of this bug. ***

Comment 3 Paige Rubendall 2021-01-21 17:58:25 UTC
Created attachment 1749481 [details]
Build logs

Comment 4 Paige Rubendall 2021-01-21 17:59:39 UTC
Created attachment 1749482 [details]
Another build log

Comment 5 Paige Rubendall 2021-01-21 18:04:41 UTC
From the pod events:

Warning  Failed          174m                  kubelet            Error: context deadline exceeded
  Warning  Failed          173m                  kubelet            Error: Kubelet may be retrying requests that are timing out in CRI-O due to system load: the requested container k8s_manage-dockerfile_cakephp-mysql-example-3-build_svt-cakephp-129_563485e7-6086-4adf-8413-45e464503d26_0 is now ready and will be provided to the kubelet on next retry: error reserving ctr name k8s_manage-dockerfile_cakephp-mysql-example-3-build_svt-cakephp-129_563485e7-6086-4adf-8413-45e464503d26_0 for id 6a40107100ae3535417df89ad53e68b9b8f15c604b09869e737b9de8f9f3ebb1: name is reserved
  Normal   Pulled          173m (x3 over 176m)   kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8260d423f922c1673098679fc5ba68e069feababf8174c131400664429bea2eb" already present on machine
  Normal   Created         173m                  kubelet            Created container manage-dockerfile
  Normal   Started         173m                  kubelet            Started container manage-dockerfile
  Normal   Pulled          171m (x2 over 173m)   kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8260d423f922c1673098679fc5ba68e069feababf8174c131400664429bea2eb" already present on machine
  Warning  Failed          157m (x6 over 167m)   kubelet            Error: ImageInspectError
  Warning  InspectFailed   147m (x11 over 167m)  kubelet            Failed to inspect image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8260d423f922c1673098679fc5ba68e069feababf8174c131400664429bea2eb": rpc error: code = DeadlineExceeded desc = context deadline exceeded


https://bugzilla.redhat.com/show_bug.cgi?id=1785399 could be contributing to this large regression in build times and to the build failures.

Comment 6 Alexander Constantinescu 2021-01-25 16:04:00 UTC
Hi

The bug mentioned in comment 5 has been fixed and should be included in release 4.7.0-0.nightly-2021-01-19-033533. Could you please try re-verifying this bug with that version?

Thanks,
Alex

Comment 7 zhaozhanqi 2021-01-26 07:36:39 UTC
@prubenda could you help verify this bug on the 4.7.0-0.nightly-2021-01-19-033533 version?

Comment 8 Ben Bennett 2021-02-01 15:39:31 UTC
Any update on this?

Comment 9 zhaozhanqi 2021-02-02 10:38:09 UTC
prubenda, could you help check whether this bug can be reproduced on the above version?

Comment 10 Paige Rubendall 2021-02-03 21:54:31 UTC
After many reruns and different setups, I was finally able to run this test successfully. It seemed that the buildconfig was not set up correctly for my OpenShift cluster: I had to change the buildconfig to assign builds only to worker nodes, and not to both worker and infra nodes (a sketch of that kind of change follows).
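
For reference, a minimal sketch of that kind of change, assuming a BuildConfig like the cakephp-mysql-example one seen in the attached logs (the object name and namespace here are illustrative; the svt tooling may apply this to its build configs differently):

# pin build pods to worker nodes via the BuildConfig nodeSelector
oc -n svt-cakephp-129 patch bc/cakephp-mysql-example --type=merge \
  -p '{"spec":{"nodeSelector":{"node-role.kubernetes.io/worker":""}}}'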

After doing this I get almost all successful builds and build/push times similar to previous releases. I also do not see any pods in the openshift-sdn namespace crashing during the test run. 

The push times for 4.7 were a little bit higher than in 4.6; I am still doing some more investigation, but I do not think it is an issue with SDN.

4.6
Average build time, all good builds: 137
Average push time, all good builds: 3.18754688672
Good builds included in stats: 4000

4.7
Average build time, all good builds: 117
Average push time, all good builds: 6.4799659915
Good builds included in stats: 3999

# oc get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-fc.4   True        False         2d      Cluster version is 4.7.0-fc.4

Comment 11 Paige Rubendall 2021-02-10 19:56:10 UTC
Just wanted to give an update: I took a look at a smaller OpenShift cluster setup to analyze why the push times had doubled for the cakephp application. I also ran other applications I had data on, to see whether the regression was limited to that one application.


New test case I ran:
3, 15, 30, and 75 concurrent builds on 3 nodes (m5.2xlarge), i.e. 1, 5, 10, and 25 concurrent builds per node. The push times below are an average over 15 random builds, each executed 3 times.

4.7 Push times per app: 
Cakephp: [3.27362222222, 3.17795555556, 3.30935555556, 3.22613333333]

Eap: [2.5452,2.61517777778,2.65466666667,2.46175555556]

Nodejs: [2.34306666667, 2.47508888889, 2.5218, 2.78428888889]

Rails: [ 5.44922222222, 4.90788888889, 4.844, 5.00088888889]

4.6 Push times per app:
Cakephp: [ 3.1896, 3.22486666667, 3.15468888889, 3.2118] 

Eap: [ 2.46637777778, 2.44886666667, 2.57286666667, 2.5348]

Nodejs: [ 2.35851111111, 2.41413333333, 2.33922222222, 2.38813333333]

Rails: [ 4.779, 4.58784444444, 4.47655555556, 4.70086666667]


Comparing the push times of 4.7 to 4.6, I saw less than a 5% increase in push times; no noticeable increase in timing. I am wondering if the registry we were pushing to just got throttled because of the large number of builds (the whole run pushes about 6000 builds).
Note: the number of builds per node in the larger-scale scenario is only about 8, but here we were testing up to 25.
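
For reference, a quick check of the cakephp samples above (just the numbers from this comment fed into awk; the other apps can be compared the same way):

# per-concurrency-level push time change, 4.7 vs 4.6, cakephp
awk 'BEGIN {
  split("3.27362222222 3.17795555556 3.30935555556 3.22613333333", v47, " ");
  split("3.1896 3.22486666667 3.15468888889 3.2118", v46, " ");
  for (i = 1; i <= 4; i++) printf "%+.2f%%\n", (v47[i] - v46[i]) / v46[i] * 100;
}'
# prints roughly +2.63%, -1.45%, +4.90%, +0.45%, consistent with the less-than-5% observation above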

Comment 12 Red Hat Bugzilla 2023-09-15 00:58:24 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

