Bug 2103283

Summary:	In CI 4.10 HAProxy must-gather takes longer than 10 minutes
Product:	OpenShift Container Platform	Reporter:	Candace Holman <cholman>
Component:	Networking	Assignee:	Grant Spence <gspence>
Networking sub component:	router	QA Contact:	Shudi Li <shudili>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	medium
Priority:	medium	CC:	hongli, jaldinge, wking
Version:	4.10
Target Milestone:	---
Target Release:	4.12.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:	Previously, router pods left in the terminating state delayed the `oc cp` command, which in turn delayed the must-gather until those pods finished terminating. With this update, a timeout is set for each `oc cp` command, so must-gather runs are no longer delayed by terminating pods. (link:https://bugzilla.redhat.com/show_bug.cgi?id=2103283[*BZ#2103283*])	Story Points:	---
Clone Of:		Environment:
Last Closed:	2023-01-17 19:51:19 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
Bug Depends On:
Bug Blocks:	2104701
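The Doc Text above states the fix only in words: each `oc cp` now runs with a timeout. The following is a minimal sketch of that idea, assuming a Bash script and GNU coreutils `timeout`. The `run_with_timeout` helper, the 60-second budget, and the HAProxy config path in the example are illustrative assumptions, not the actual must-gather code.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the must-gather scripts): run one command
# under a deadline so a pod stuck in Terminating cannot stall the whole gather.
# Assumes GNU coreutils `timeout`, which exits with status 124 on expiry.
run_with_timeout() {
  local seconds=$1 rc=0
  shift
  timeout "$seconds" "$@" || rc=$?
  if [ "$rc" -eq 124 ]; then
    echo "WARN: '$*' timed out after ${seconds}s; skipping" >&2
  fi
  return "$rc"
}

# Example (pod name taken from this bug; the config path is an assumption):
#   run_with_timeout 60 oc cp -n openshift-ingress \
#     router-default-86696fc96c-94cfp:/var/lib/haproxy/conf/haproxy.config \
#     ./haproxy.config
```

Returning the status instead of aborting lets the caller note the miss and move on to the next pod, which trades completeness for a bounded runtime.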

Comment 5 Shudi Li 2022-07-08 10:00:57 UTC

Verified with 4.12.0-0.nightly-2022-07-07-144231 on an AWS cluster. The whole must-gather took about 20 minutes, which is less than 40 minutes.

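1. Check the cluster version, then patch the router deployment's liveness and readiness probe timeouts to trigger a new rollout.
% oc get clusterversion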
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-0.nightly-2022-07-07-144231   True        False         132m    Cluster version is 4.12.0-0.nightly-2022-07-07-144231
% oc -n openshift-ingress patch deploy/router-default --type=strategic --patch='{"spec":{"template":{"spec":{"containers":[{"name":"router","livenessProbe":{"timeoutSeconds":15},"readinessProbe":{"timeoutSeconds":15}}]}}}}'
deployment.apps/router-default patched
% 

2. The old router pods are stuck in Terminating while the new ones come up.
% oc -n openshift-ingress get pods
NAME                              READY   STATUS        RESTARTS   AGE
router-default-6f67d6db6f-h4mqf   1/1     Terminating   0          2m40s
router-default-6f67d6db6f-t8p4j   0/1     Terminating   0          2m40s
router-default-86696fc96c-94cfp   1/1     Running       0          40s
router-default-86696fc96c-dc4pn   0/1     Pending       0          40s
% 

3. Run oc adm must-gather. It took about 20 minutes.
% oc adm must-gather  
[must-gather      ] OUT Using must-gather plug-in image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f2a076583e9b0646014c1135a52bb7af45c141c96f162e2ae8c0ad5bdedbefec
When opening a support case, bugzilla, or issue please include the following summary data along with any other requested information:
ClusterID: ae2bfb96-bb30-475a-99a3-a61f297ccaf7
ClusterVersion: Stable at "4.12.0-0.nightly-2022-07-07-144231"
ClusterOperators:
	All healthy and stable


4. While must-gather was running, deploy/router-default in openshift-ingress was patched repeatedly with different readinessProbe values, so the router pods kept being terminated and recreated.

5. After must-gather finished, check the router pods.
% oc -n openshift-ingress get pods
NAME                              READY   STATUS    RESTARTS   AGE
router-default-667876b84d-n48t5   1/1     Running   0          13m
router-default-667876b84d-xgdct   1/1     Running   0          13m
%

6. Check the must-gather timestamp file and the collected ingress controller data.
% pwd
/Users/shudi/pppp/must-gather.local.1946140740011803351
shudi@Shudis-MacBook-Pro must-gather.local.1946140740011803351 % cat timestamp
2022-07-08 17:13:06.722968 +0800 CST m=+20.127377290
2022-07-08 17:35:31.548306 +0800 CST m=+1365.052625387
%

% ls -lht
total 0
drwxr-xr-x  3 shudi  staff    96B Jul  8 17:47 router-default-86696fc96c-dc4pn
drwxr-xr-x  3 shudi  staff    96B Jul  8 17:17 router-default-86696fc96c-94cfp
% pwd
/Users/shudi/pppp/must-gather.local.1946140740011803351/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-f2a076583e9b0646014c1135a52bb7af45c141c96f162e2ae8c0ad5bdedbefec/ingress_controllers/default
%
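Step 6 reads the run time off the two lines of the `timestamp` file by eye. Assuming the trailing `m=+<seconds>` field is the monotonic-clock reading that Go's `time.Time.String()` appends (which is what this output looks like), the elapsed time can be computed directly. `gather_duration` is a hypothetical helper, not part of must-gather.

```shell
#!/usr/bin/env bash
# Hypothetical helper: elapsed seconds between the first and last line of a
# must-gather "timestamp" file, using the Go monotonic-clock field m=+<secs>.
gather_duration() {
  awk '{
    for (i = 1; i <= NF; i++) {
      if ($i ~ /^m=[+]/) {
        v = substr($i, 4)   # drop the leading m=+
        if (NR == 1) t0 = v
        t1 = v
      }
    }
  } END { printf "%.0f\n", t1 - t0 }' "$1"
}

# Example:
#   gather_duration ./must-gather.local.1946140740011803351/timestamp
```

For the timestamp file shown in step 6 this prints 1345, about 22.4 minutes, which agrees with the wall-clock difference between 17:13:06 and 17:35:31.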

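Step 4 only says that the deployment was patched repeatedly with different readinessProbe values. One way to script that churn is sketched below, assuming Bash and an `oc` session already logged in to the cluster. The `probe_patch` and `churn_router` names, the alternating 10 and 15 second timeouts, and the 60-second pause are illustrative choices, not values from the original test.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of step 4: keep re-rolling the router deployment while
# must-gather runs, by alternating the readinessProbe timeout.

# Print a strategic merge patch that sets the router container's
# readinessProbe timeoutSeconds to the given value.
probe_patch() {
  printf '{"spec":{"template":{"spec":{"containers":[{"name":"router","readinessProbe":{"timeoutSeconds":%d}}]}}}}' "$1"
}

churn_router() {
  local rounds=$1 i t
  for ((i = 0; i < rounds; i++)); do
    # Alternate between two values so every patch changes the pod template
    # and therefore starts a new rollout.
    if ((i % 2 == 0)); then t=10; else t=15; fi
    oc -n openshift-ingress patch deploy/router-default \
      --type=strategic --patch="$(probe_patch "$t")"
    sleep 60
  done
}

# Example: run in a second terminal while oc adm must-gather is running.
#   churn_router 10
```

Alternating matters: a patch that leaves the pod template unchanged does not start a rollout, so no new Terminating pods would appear.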
Comment 8 errata-xmlrpc 2023-01-17 19:51:19 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399