Bug 1979999

Summary: Sdn containers failing during builds causing increase in build and push times
Product: OpenShift Container Platform
Reporter: Paige Rubendall <prubenda>
Component: Networking
Assignee: Alexander Constantinescu <aconstan>
Networking sub component: openshift-sdn
QA Contact: zhaozhanqi <zzhao>
Status: CLOSED DEFERRED
Docs Contact:
Severity: high
Priority: unspecified
CC: aconstan, astoycos, prubenda, sdodson
Version: 4.8
Keywords: Regression
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-01-19 21:18:09 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Paige Rubendall 2021-07-07 15:02:10 UTC
Description of problem:
During build creation, some sdn pods and thanos pods fail, which increases the build time (build and push).
A must-gather reports: clusteroperator/network is progressing: DaemonSet "openshift-sdn/sdn" is not available (awaiting 1 nodes)
The same test was run on 4.8.0-0.nightly-2021-06-08-034312 and 4.8.0-0.nightly-2021-05-05-030749, and no issues were seen in these pods.


Version-Release number of selected component (if applicable): 4.8.0-rc.3


How reproducible: 100%


Steps to Reproduce:
1. Clone https://github.com/openshift/svt repo
2. cd to svt/openshift_performance/ci/scripts
3. Make sure "python --version" returns python 2 (for more information, see https://github.com/openshift/svt/blob/master/openshift_performance/ci/scripts/README.md)
4. Edit conc_builds.sh so that line 12 reads as follows:

app_array=("cakephp") #line 12

5. Run command: ./conc_builds.sh

Actual results:
Sdn containers in multiple openshift-sdn pods are failing, as well as thanos-query pods in openshift-monitoring  

Expected results:
All pods in all components continue to run with no failures and no increase in build or push time 

Additional info:
Test Case in polarion: https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-9191
Similar test case with results from two 4.8 nightly runs: https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-30069

4.8.0-rc.3
2021-07-06 21:42:50,267 - build_test - MainThread - INFO - Average build time, all good builds: 1110
2021-07-06 21:42:50,268 - build_test - MainThread - INFO - Average push time, all good builds: 8.59205333333
2021-07-06 21:42:50,267 - build_test - MainThread - INFO - Good builds included in stats: 150


During the two nightly runs below, the same components were monitored and there were no failed pods in either namespace.
4.8.0-0.nightly-2021-06-08-034312
2021-06-10 16:03:46,892 - build_test - MainThread - INFO - Average build time, all good builds: 449
2021-06-10 16:03:46,893 - build_test - MainThread - INFO - Average push time, all good builds: 3.79988
2021-06-10 16:03:46,892 - build_test - MainThread - INFO - Good builds included in stats: 150

4.8.0-0.nightly-2021-05-05-030749
2021-05-03 16:02:02,873 - build_test - MainThread - INFO - Average build time, all good builds: 457
2021-05-03 16:02:02,873 - build_test - MainThread - INFO - Average push time, all good builds: 3.98708666667
2021-05-03 16:02:02,873 - build_test - MainThread - INFO - Good builds included in stats: 150
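
The size of the regression can be read off the three summaries above. A minimal Python sketch, using only the averages quoted in this report; the dictionary layout and the helper name `slowdown` are illustrative and not part of the svt tooling, and the time unit is whatever build_test logs (this report does not state it):

```python
# Averages copied from the build_test summaries quoted in this report.
RESULTS = {
    "4.8.0-rc.3": {"build": 1110.0, "push": 8.59205333333},
    "4.8.0-0.nightly-2021-06-08-034312": {"build": 449.0, "push": 3.79988},
    "4.8.0-0.nightly-2021-05-05-030749": {"build": 457.0, "push": 3.98708666667},
}


def slowdown(candidate, baseline, metric):
    """Return how many times slower `candidate` is than `baseline` for `metric`."""
    return RESULTS[candidate][metric] / RESULTS[baseline][metric]


if __name__ == "__main__":
    rc = "4.8.0-rc.3"
    for base in RESULTS:
        if base == rc:
            continue
        for metric in ("build", "push"):
            ratio = slowdown(rc, base, metric)
            print("%s vs %s, %s time: %.2fx" % (rc, base, metric, ratio))
```

With these numbers the rc.3 average build time is roughly 2.4 to 2.5 times the nightly averages, and the average push time is a little over 2 times.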

Comment 2 Paige Rubendall 2021-07-09 13:13:17 UTC
Any update on this bug?

I had seen a similar issue in 4.7, and it was fixed by a newer 4.7 nightly. Linking

Comment 3 Paige Rubendall 2021-07-16 19:14:51 UTC
Reran this test on 4.9 and I am still seeing the same problem. During the run I watched all namespaces starting with "openshift-" and saw that pods in the marketplace namespace were also failing and restarting.


# oc version
Client Version: 4.9.0-0.nightly-2021-07-14-083200
Server Version: 4.9.0-0.nightly-2021-07-14-083200


Average build time, all good builds: 1102
Average push time, all good builds: 12.9827466667
Good builds included in stats: 150


Will add a must-gather, a Prometheus dump, and a list of failed pods in a following comment.
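
This comment does not say how the "openshift-" namespaces were watched. One possible approach, sketched in Python under that caveat, is to poll `oc get pods --all-namespaces -o json` and list containers whose `restartCount` is above zero. The function names `restarted_pods` and `fetch_pods` are hypothetical, and the script assumes a logged-in `oc` client:

```python
import json
import subprocess


def restarted_pods(pod_list, namespace_prefix="openshift-"):
    """Return (namespace, pod, container, restarts) for restarted containers."""
    found = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        namespace = meta.get("namespace", "")
        if not namespace.startswith(namespace_prefix):
            continue
        # containerStatuses can be absent while a pod is still being scheduled.
        for status in pod.get("status", {}).get("containerStatuses", []):
            restarts = status.get("restartCount", 0)
            if restarts > 0:
                found.append((namespace, meta.get("name"), status.get("name"), restarts))
    return found


def fetch_pods():
    """Fetch every pod in the cluster as parsed JSON (needs a logged-in `oc`)."""
    out = subprocess.check_output(
        ["oc", "get", "pods", "--all-namespaces", "-o", "json"]
    )
    return json.loads(out)


if __name__ == "__main__":
    for row in restarted_pods(fetch_pods()):
        print("%s/%s container=%s restarts=%d" % row)
```

Running it before and after a build run and comparing the two listings would show which pods restarted during the run.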

Comment 5 Scott Dodson 2021-08-23 18:18:56 UTC
Paige,

Does this issue persist into 4.8.5+ or in 4.9 nightlies?

Comment 6 Paige Rubendall 2021-08-23 18:20:34 UTC
Will rerun with the newest 4.8 and 4.9 nightly

Comment 7 Paige Rubendall 2021-09-01 03:00:35 UTC
Reran on 4.8 versions and 4.9.0-0.nightly-2021-08-31-081832. The sdn issue happened less often on 4.8.9; attaching the must-gather and Prometheus dump for 4.8.

Both versions still show a regression in build times compared with the previous release. More investigation is needed across other applications and into where the slowdown occurs.


Client Version: 4.9.0-0.nightly-2021-08-30-100210
Server Version: 4.9.0-0.nightly-2021-08-31-081832
39012
================ Average times for cakephp app =================
2021-08-31 19:13:46,761 - build_test - MainThread - INFO - Average build time, all good builds: 54
2021-08-31 19:17:22,850 - build_test - MainThread - INFO - Average build time, all good builds: 70
2021-08-31 19:22:26,150 - build_test - MainThread - INFO - Average build time, all good builds: 116
2021-08-31 19:31:49,524 - build_test - MainThread - INFO - Average build time, all good builds: 236
2021-08-31 19:48:11,882 - build_test - MainThread - INFO - Average build time, all good builds: 434
2021-08-31 20:16:36,113 - build_test - MainThread - INFO - Average build time, all good builds: 624
2021-08-31 20:52:37,505 - build_test - MainThread - INFO - Average build time, all good builds: 802
2021-08-31 19:13:46,761 - build_test - MainThread - INFO - Average push time, all good builds: 2.6435
2021-08-31 19:17:22,850 - build_test - MainThread - INFO - Average push time, all good builds: 3.23725
2021-08-31 19:22:26,150 - build_test - MainThread - INFO - Average push time, all good builds: 4.00146666667
2021-08-31 19:31:49,525 - build_test - MainThread - INFO - Average push time, all good builds: 4.88183333333
2021-08-31 19:48:11,882 - build_test - MainThread - INFO - Average push time, all good builds: 6.20964444444
2021-08-31 20:16:36,114 - build_test - MainThread - INFO - Average push time, all good builds: 10.1405172414
2021-08-31 20:52:37,505 - build_test - MainThread - INFO - Average push time, all good builds: 10.087877551
2021-08-31 19:13:46,760 - build_test - MainThread - INFO - Good builds included in stats: 2
2021-08-31 19:17:22,850 - build_test - MainThread - INFO - Good builds included in stats: 16
2021-08-31 19:22:26,150 - build_test - MainThread - INFO - Good builds included in stats: 30
2021-08-31 19:31:49,524 - build_test - MainThread - INFO - Good builds included in stats: 60
2021-08-31 19:48:11,882 - build_test - MainThread - INFO - Good builds included in stats: 90
2021-08-31 20:16:36,113 - build_test - MainThread - INFO - Good builds included in stats: 116
2021-08-31 20:52:37,505 - build_test - MainThread - INFO - Good builds included in stats: 147
==============================================================





Client Version: 4.8.9
Server Version: 4.8.9
================ Average times for cakephp app =================
2021-08-31 19:46:36,754 - build_test - MainThread - INFO - Average build time, all good builds: 52
2021-08-31 19:50:17,343 - build_test - MainThread - INFO - Average build time, all good builds: 73
2021-08-31 19:55:46,540 - build_test - MainThread - INFO - Average build time, all good builds: 125
2021-08-31 20:06:42,895 - build_test - MainThread - INFO - Average build time, all good builds: 267
2021-08-31 20:25:18,951 - build_test - MainThread - INFO - Average build time, all good builds: 488
2021-08-31 20:53:14,810 - build_test - MainThread - INFO - Average build time, all good builds: 748
2021-08-31 21:31:03,306 - build_test - MainThread - INFO - Average build time, all good builds: 1049
2021-08-31 19:46:36,754 - build_test - MainThread - INFO - Average push time, all good builds: 2.5655
2021-08-31 19:50:17,343 - build_test - MainThread - INFO - Average push time, all good builds: 3.1709375
2021-08-31 19:55:46,541 - build_test - MainThread - INFO - Average push time, all good builds: 3.8516
2021-08-31 20:06:42,896 - build_test - MainThread - INFO - Average push time, all good builds: 5.24065
2021-08-31 20:25:18,952 - build_test - MainThread - INFO - Average push time, all good builds: 6.22831111111
2021-08-31 20:53:14,810 - build_test - MainThread - INFO - Average push time, all good builds: 10.3179333333
2021-08-31 21:31:03,307 - build_test - MainThread - INFO - Average push time, all good builds: 9.20786666667
2021-08-31 19:46:36,754 - build_test - MainThread - INFO - Good builds included in stats: 2
2021-08-31 19:50:17,343 - build_test - MainThread - INFO - Good builds included in stats: 16
2021-08-31 19:55:46,539 - build_test - MainThread - INFO - Good builds included in stats: 30
2021-08-31 20:06:42,895 - build_test - MainThread - INFO - Good builds included in stats: 60
2021-08-31 20:25:18,951 - build_test - MainThread - INFO - Good builds included in stats: 90
2021-08-31 20:53:14,809 - build_test - MainThread - INFO - Good builds included in stats: 120
2021-08-31 21:31:03,306 - build_test - MainThread - INFO - Good builds included in stats: 150
==============================================================
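
The two blocks above list each metric separately, which makes them hard to compare row by row. A small Python sketch that regroups such output into one row per summary follows. It assumes the format shown in this comment, where every summary logs exactly one line for each of the three metrics and the lines of each metric appear in the same order; the name `parse_summaries` is hypothetical:

```python
import re
import sys

# Matches the three build_test summary lines quoted in this report.
LINE = re.compile(
    r"(Average build time|Average push time|Good builds included in stats)"
    r"[^:]*:\s*([\d.]+)"
)
KEYS = {
    "Average build time": "build",
    "Average push time": "push",
    "Good builds included in stats": "good",
}


def parse_summaries(text):
    """Return a list of (good_builds, avg_build_time, avg_push_time) rows."""
    series = {"build": [], "push": [], "good": []}
    for line in text.splitlines():
        match = LINE.search(line)
        if match:
            series[KEYS[match.group(1)]].append(float(match.group(2)))
    # The n-th value of each metric belongs to the n-th summary.
    return [
        (int(good), build, push)
        for good, build, push in zip(series["good"], series["build"], series["push"])
    ]


if __name__ == "__main__":
    # Usage: python parse_summaries.py < build_test_output.txt
    for good, build, push in parse_summaries(sys.stdin.read()):
        print("%4d good builds  build=%7.1f  push=%8.3f" % (good, build, push))
```

Applied to the excerpts above, the last rows would be 147 good builds at an average build time of 802 for the 4.9 nightly and 150 good builds at 1049 for 4.8.9.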

https://drive.google.com/drive/folders/1y6FC1Cc_XWNF2BdYbrGzZhb-ouqKHWOC?usp=sharing

List of failing pods during run
[
  {
    "timestamp": "2021-08-31 19:42:22",
    "count": 1,
    "issue": "pod crash",
    "name": "sdn-7nqtf",
    "component": "sdn"
  },
  {
    "timestamp": "2021-08-31 19:42:22",
    "count": 1,
    "issue": "pod crash",
    "name": "sdn-phdmd",
    "component": "sdn"
  },
  {
    "timestamp": "2021-08-31 21:10:15",
    "count": 1,
    "issue": "pod crash",
    "name": "sdn-phdmd",
    "component": "sdn"
  },
  {
    "timestamp": "2021-09-01 00:00:15",
    "count": 1,
    "issue": "pod crash",
    "name": "image-pruner-27174240-k8vff",
    "component": "image-registry"
  },
  {
    "timestamp": "2021-09-01 00:00:41",
    "count": 1,
    "issue": "pod crash",
    "name": "image-pruner-27174240-k8vff",
    "component": "image-registry"
  }
]
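
The crash list above can be tallied per pod with a few lines of Python. This is only an illustrative sketch: the entries are copied from the list in this comment (the "issue" field is omitted because it is "pod crash" throughout), and the helper name `crashes_per_pod` is hypothetical:

```python
from collections import Counter

# Entries copied from the failing-pod list in this comment.
CRASHES = [
    {"timestamp": "2021-08-31 19:42:22", "count": 1, "name": "sdn-7nqtf", "component": "sdn"},
    {"timestamp": "2021-08-31 19:42:22", "count": 1, "name": "sdn-phdmd", "component": "sdn"},
    {"timestamp": "2021-08-31 21:10:15", "count": 1, "name": "sdn-phdmd", "component": "sdn"},
    {"timestamp": "2021-09-01 00:00:15", "count": 1, "name": "image-pruner-27174240-k8vff", "component": "image-registry"},
    {"timestamp": "2021-09-01 00:00:41", "count": 1, "name": "image-pruner-27174240-k8vff", "component": "image-registry"},
]


def crashes_per_pod(events):
    """Sum the "count" field per (component, pod name)."""
    totals = Counter()
    for event in events:
        totals[(event["component"], event["name"])] += event["count"]
    return totals


if __name__ == "__main__":
    for (component, name), total in sorted(crashes_per_pod(CRASHES).items()):
        print("%-15s %-30s %d" % (component, name, total))
```

For this list the totals are one crash for sdn-7nqtf, two for sdn-phdmd, and two for image-pruner-27174240-k8vff.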

Comment 8 Alexander Constantinescu 2022-01-13 14:37:50 UTC
Could you please upload SDN logs from one of these runs? We can't really tell why the SDN pods are crashing without that.
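
One way the requested logs might be collected is sketched below in Python. It wraps `oc logs` with `--previous`, which returns the log of the last terminated instance of a container. The namespace comes from this report; the container name "sdn" is an assumption based on the bug summary, the pod names are the ones listed in comment 7 and would differ on a new run, and the helper names are hypothetical:

```python
import subprocess


def sdn_log_command(pod, previous=True, namespace="openshift-sdn", container="sdn"):
    """Build the `oc logs` argument list for one sdn pod."""
    cmd = ["oc", "logs", "-n", namespace, pod, "-c", container]
    if previous:
        # Log of the crashed (previous) container instance, if one exists.
        cmd.append("--previous")
    return cmd


def save_logs(pods, previous=True):
    """Write each pod's log to <pod>-previous.log or <pod>-current.log."""
    suffix = "previous" if previous else "current"
    for pod in pods:
        with open("%s-%s.log" % (pod, suffix), "wb") as handle:
            subprocess.call(sdn_log_command(pod, previous), stdout=handle)


if __name__ == "__main__":
    # Pod names from comment 7; replace with the pods of the current run.
    save_logs(["sdn-7nqtf", "sdn-phdmd"])
```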

Comment 9 Red Hat Bugzilla 2023-09-15 01:11:05 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days