2103283 – In CI 4.10 HAProxy must-gather takes longer than 10 minutes

Summary: In CI 4.10 HAProxy must-gather takes longer than 10 minutes

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	4.10
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	4.12.0
Assignee:	Grant Spence
QA Contact:	Shudi Li
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	2104701

Reported:	2022-07-01 23:09 UTC by Candace Holman
Modified:	2023-01-17 19:51 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Previously, router pods left in the terminating state delayed the `oc cp` command, which delayed the must-gather until those pods finished terminating. With this update, a timeout is set for each `oc cp` command, so terminating pods no longer delay the must-gather. (link:https://bugzilla.redhat.com/show_bug.cgi?id=2103283[*BZ#2103283*])
Clone Of:
Environment:
Last Closed:	2023-01-17 19:51:19 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift must-gather pull 317	0	None	open	Bug 2103283: Add timeout to oc cp command to fix must-gather delays when routers are terminating	2022-07-06 21:28:37 UTC
Red Hat Product Errata	RHSA-2022:7399	0	None	None	None	2023-01-17 19:51:37 UTC
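
The Doc Text and PR 317 describe the fix only as "a timeout for each `oc cp` command". The following is a minimal sketch of that idea, not the code from the PR. It assumes GNU coreutils `timeout` is available; the helper name `copy_from_pod` and the 120-second default `COPY_TIMEOUT` are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, not the code from must-gather PR 317: bound each
# `oc cp` so a router pod stuck in Terminating cannot stall the gather.
# COPY_TIMEOUT (default 120s) is an assumed value.
COPY_TIMEOUT="${COPY_TIMEOUT:-120s}"

# copy_from_pod NAMESPACE POD SRC_PATH DEST_PATH
copy_from_pod() {
  local ns="$1" pod="$2" src="$3" dest="$4" rc
  # GNU coreutils `timeout` kills oc after COPY_TIMEOUT and exits with 124.
  timeout "$COPY_TIMEOUT" oc cp "${ns}/${pod}:${src}" "$dest"
  rc=$?
  if [ "$rc" -eq 124 ]; then
    echo "WARNING: oc cp from ${ns}/${pod} timed out after ${COPY_TIMEOUT}" >&2
  elif [ "$rc" -ne 0 ]; then
    echo "WARNING: oc cp from ${ns}/${pod} failed (exit ${rc})" >&2
  fi
  # Propagate the status so callers can log the pod and move on.
  return "$rc"
}
```

With this pattern, a copy from a terminating pod costs at most `COPY_TIMEOUT` before the gather continues with the next pod, instead of blocking until the pod is gone.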

Comment 5 Shudi Li 2022-07-08 10:00:57 UTC

Verified with 4.12.0-0.nightly-2022-07-07-144231 on an AWS cluster. The total must-gather time was about 20 minutes, which was less than 40 minutes.

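1. Check the cluster version, then patch the router deployment's probe timeouts so that a new rollout starts.
% oc get clusterversion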
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-0.nightly-2022-07-07-144231   True        False         132m    Cluster version is 4.12.0-0.nightly-2022-07-07-144231
% oc -n openshift-ingress patch deploy/router-default --type=strategic --patch='{"spec":{"template":{"spec":{"containers":[{"name":"router","livenessProbe":{"timeoutSeconds":15},"readinessProbe":{"timeoutSeconds":15}}]}}}}'
deployment.apps/router-default patched
% 

2. Check the router pods: the old pods are Terminating while the new ones start.
% oc -n openshift-ingress get pods
NAME                              READY   STATUS        RESTARTS   AGE
router-default-6f67d6db6f-h4mqf   1/1     Terminating   0          2m40s
router-default-6f67d6db6f-t8p4j   0/1     Terminating   0          2m40s
router-default-86696fc96c-94cfp   1/1     Running       0          40s
router-default-86696fc96c-dc4pn   0/1     Pending       0          40s
% 

3. Run oc adm must-gather; it took about 20 minutes.
% oc adm must-gather
[must-gather      ] OUT Using must-gather plug-in image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f2a076583e9b0646014c1135a52bb7af45c141c96f162e2ae8c0ad5bdedbefec
When opening a support case, bugzilla, or issue please include the following summary data along with any other requested information:
ClusterID: ae2bfb96-bb30-475a-99a3-a61f297ccaf7
ClusterVersion: Stable at "4.12.0-0.nightly-2022-07-07-144231"
ClusterOperators:
	All healthy and stable


4. During the must-gather, repeatedly patch deploy/router-default in openshift-ingress with different readinessProbe values, so that the router pods kept being terminated and recreated.

5. After the must-gather finished, check the router pods.
% oc -n openshift-ingress get pods
NAME                              READY   STATUS    RESTARTS   AGE
router-default-667876b84d-n48t5   1/1     Running   0          13m
router-default-667876b84d-xgdct   1/1     Running   0          13m
%

6. Check the timestamp file and the gathered router directories.
% pwd
/Users/shudi/pppp/must-gather.local.1946140740011803351
shudi@Shudis-MacBook-Pro must-gather.local.1946140740011803351 % cat timestamp
2022-07-08 17:13:06.722968 +0800 CST m=+20.127377290
2022-07-08 17:35:31.548306 +0800 CST m=+1365.052625387
%

% ls -lht
total 0
drwxr-xr-x  3 shudi  staff    96B Jul  8 17:47 router-default-86696fc96c-dc4pn
drwxr-xr-x  3 shudi  staff    96B Jul  8 17:17 router-default-86696fc96c-94cfp
% pwd
/Users/shudi/pppp/must-gather.local.1946140740011803351/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-f2a076583e9b0646014c1135a52bb7af45c141c96f162e2ae8c0ad5bdedbefec/ingress_controllers/default
%
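
Step 3 reports "about 20 minutes". The `timestamp` lines appear to be Go `time.Time` strings, whose `m=+<seconds>` suffix is a monotonic clock reading, so the gather duration can be taken from the difference between the first and last suffix. A minimal sketch, assuming the file holds a start line and an end line as in step 6; the function name `elapsed_from_timestamp` is hypothetical.

```shell
# elapsed_from_timestamp [FILE]  (hypothetical helper; FILE defaults to ./timestamp)
# Prints the difference between the first and last m=+<seconds> readings.
elapsed_from_timestamp() {
  awk '
    {
      for (i = 1; i <= NF; i++)
        if ($i ~ /^m=[+]/) {
          v = substr($i, 4) + 0      # strip the "m=+" prefix
          if (n++ == 0) first = v
          last = v
        }
    }
    END {
      if (n < 2) exit 1              # need both a start and an end reading
      d = last - first
      printf "%.1f s (%.1f min)\n", d, d / 60
    }' "${1:-timestamp}"
}
```

For the two lines in step 6 this prints `1344.9 s (22.4 min)`, which agrees with the wall-clock times 17:13:06 and 17:35:31 and with the "about 20 minutes" reported above.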

Comment 8 errata-xmlrpc 2023-01-17 19:51:19 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399