Bug 2103283

Summary: In CI 4.10 HAProxy must-gather takes longer than 10 minutes
Product: OpenShift Container Platform Reporter: Candace Holman <cholman>
Component: Networking Assignee: Grant Spence <gspence>
Networking sub component: router QA Contact: Shudi Li <shudili>
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: medium CC: hongli, jaldinge, wking
Version: 4.10   
Target Milestone: ---   
Target Release: 4.12.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
* Previously, router pods left in the terminating state delayed the `oc cp` command, which delayed must-gather until the terminating pod was terminated. With this update, a timeout is set for each `oc cp` command, so must-gather is no longer delayed by terminating pods. (link:https://bugzilla.redhat.com/show_bug.cgi?id=2103283[*BZ#2103283*])
Story Points: ---
Clone Of: Environment:
Last Closed: 2023-01-17 19:51:19 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2104701    
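
The fix described in the Doc Text bounds each `oc cp` call with a timeout so that a pod stuck in Terminating cannot stall the whole gather. The following is a minimal sketch of that idea, not the actual must-gather script. It assumes GNU coreutils `timeout` is available; the helper name `run_bounded`, the 60-second limit, and the placeholders in the usage comment are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: run one command under a time limit.
# run_bounded is a hypothetical helper, not part of oc or must-gather.
# Requires GNU coreutils `timeout`.
run_bounded() {
  local limit="$1" rc
  shift
  timeout "$limit" "$@"
  rc=$?
  if [ "$rc" -eq 124 ]; then
    # coreutils timeout exits with status 124 when the limit is reached
    echo "timed out after ${limit}: $*" >&2
  fi
  return "$rc"
}

# Hypothetical usage (pod name and paths are placeholders):
#   run_bounded 60s oc cp openshift-ingress/<router-pod>:<remote-path> <local-dir> || true
```

Ending the call with `|| true` lets a gather script continue to the next pod when one copy times out, which matches the behavior the Doc Text describes.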

Comment 5 Shudi Li 2022-07-08 10:00:57 UTC
Verified with 4.12.0-0.nightly-2022-07-07-144231 on an AWS cluster. The total must-gather time was about 20 minutes, which is less than 40 minutes.

1.
% oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-0.nightly-2022-07-07-144231   True        False         132m    Cluster version is 4.12.0-0.nightly-2022-07-07-144231
% oc -n openshift-ingress patch deploy/router-default --type=strategic --patch='{"spec":{"template":{"spec":{"containers":[{"name":"router","livenessProbe":{"timeoutSeconds":15},"readinessProbe":{"timeoutSeconds":15}}]}}}}'
deployment.apps/router-default patched
% 

2.
% oc -n openshift-ingress get pods
NAME                              READY   STATUS        RESTARTS   AGE
router-default-6f67d6db6f-h4mqf   1/1     Terminating   0          2m40s
router-default-6f67d6db6f-t8p4j   0/1     Terminating   0          2m40s
router-default-86696fc96c-94cfp   1/1     Running       0          40s
router-default-86696fc96c-dc4pn   0/1     Pending       0          40s
% 

3. Ran oc adm must-gather; it took about 20 minutes.
% oc adm must-gather  
[must-gather      ] OUT Using must-gather plug-in image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f2a076583e9b0646014c1135a52bb7af45c141c96f162e2ae8c0ad5bdedbefec
When opening a support case, bugzilla, or issue please include the following summary data along with any other requested information:
ClusterID: ae2bfb96-bb30-475a-99a3-a61f297ccaf7
ClusterVersion: Stable at "4.12.0-0.nightly-2022-07-07-144231"
ClusterOperators:
	All healthy and stable


4. While must-gather was running, repeatedly ran oc -n openshift-ingress patch deploy/router-default with different readinessProbe values, so the router pods kept being terminated and recreated.

5. After the must-gather was done, checked the router pods
% oc -n openshift-ingress get pods
NAME                              READY   STATUS    RESTARTS   AGE
router-default-667876b84d-n48t5   1/1     Running   0          13m
router-default-667876b84d-xgdct   1/1     Running   0          13m
%

6. Check the log
% pwd
/Users/shudi/pppp/must-gather.local.1946140740011803351
shudi@Shudis-MacBook-Pro must-gather.local.1946140740011803351 % cat timestamp
2022-07-08 17:13:06.722968 +0800 CST m=+20.127377290
2022-07-08 17:35:31.548306 +0800 CST m=+1365.052625387
%

% ls -lht
total 0
drwxr-xr-x  3 shudi  staff    96B Jul  8 17:47 router-default-86696fc96c-dc4pn
drwxr-xr-x  3 shudi  staff    96B Jul  8 17:17 router-default-86696fc96c-94cfp
% pwd
/Users/shudi/pppp/must-gather.local.1946140740011803351/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-f2a076583e9b0646014c1135a52bb7af45c141c96f162e2ae8c0ad5bdedbefec/ingress_controllers/default
%
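
The timestamp file in step 6 records the start and end of the gather as printed by Go, including the monotonic clock offset in the `m=+<seconds>` field. The following is a small sketch for computing the elapsed time from those offsets, assuming the file keeps the two-line layout shown above; the helper name `elapsed_seconds` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: print the whole seconds between the first and last
# "m=+<seconds>" offsets found in a must-gather timestamp file.
# elapsed_seconds is a hypothetical helper, not part of oc.
elapsed_seconds() {
  awk '
    {
      for (i = 1; i <= NF; i++) {
        if (index($i, "m=+") == 1) {
          v = substr($i, 4) + 0   # drop the leading "m=+"
          if (!seen) { first = v; seen = 1 }
          last = v
        }
      }
    }
    END {
      if (!seen) { exit 1 }       # no offsets found
      printf "%.0f\n", last - first
    }
  ' "$1"
}

# Usage: elapsed_seconds timestamp
```

For the two lines shown in step 6 this prints 1345, that is, roughly 22 minutes, which agrees with the "about 20 minutes" reported above.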

Comment 8 errata-xmlrpc 2023-01-17 19:51:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399