Description of problem:
The must-gather pod is currently handled (in code) this way:
- The pod's init interval (i.e. the interval for checking whether the must-gather init container is in Running state) is 10 minutes
- The pod's timeout (the timeout for the must-gather pod to be running) is 1 minute

This setup makes no sense: the init interval should be smaller and more reasonable, e.g. 5 seconds. That way the init container is sampled every 5 seconds, and within 1 minute both the init container is expected to finish and the entire pod (the pod's copy container, to be exact) is expected to be running.

Version-Release number of selected component (if applicable):
$ oc version
Client Version: 4.5.0-rc.1
Server Version: 4.5.0-rc.5
Kubernetes Version: v1.18.3+f291db1

How reproducible:
Not always; when it does reproduce, must-gather ends with an EOF exception and no output directory is created.

Steps to Reproduce:
1. Run a simple must-gather action, e.g.:
$ oc adm must-gather --image=registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-cnv-must-gather-rhel8:v2.4.0 --dest-dir=/home/cnv-qe-jenkins/yossi/mg

Actual results:
Environment dependent, but when the pod init time is too long, must-gather fails with an EOF exception and no data is collected.

Expected results:
The pod init interval should have a reasonable, much shorter value, to decrease the chance of must-gather failing on timeout.

Additional info:
This was found when using a CNV (OpenShift Virtualization) must-gather image, but was debugged (in bz#1859972) and found to stem from the must-gather pod init retry interval, which is set and implemented in OCP.
The interval there defaults to 10 minutes, but a user can change it at will through the --timeout flag.
It's true that this parameter is configurable, but I think it's configurable by mistake. The help text of the timeout parameter says:

--timeout=600: The length of time to gather data, in seconds. Defaults to 10 minutes.

But what actually happens is that configuring this parameter doesn't affect the timeout to gather data, only the must-gather pod's *init interval*. This is clearly a bug: the user doesn't care what the interval of the pod's init retries is, but they do care how long they are willing to wait for data to be gathered.

So there are actually 3 issues here that should be fixed, in order of severity (the first being the most severe):
1. The user cannot set the total timeout for gathering data.
2. The default interval of the pod's initialization is way too long (10 minutes) and should be set to something more reasonable.
3. The pod init interval shouldn't be configurable.

What do you think?
> 1. The user cannot set the total timeout for gathering data.
> 2. The default interval of the pod's initialization is way too long (10 minutes), and should be set to something more reasonable.
> 3. The pod init interval shouldn't be configurable.

The init pod is where the actual gathering is happening. I can agree about the overall timeout, though.
The timeout is configurable, so leaving that alone. I notice that the retry intervals in must-gather, of 1 minute each, add ~2 minutes to the must-gather run, though. I've added a PR to shorten the retry intervals.
Tried to verify the bug; I see that the run now takes 20 to 25 seconds, with retries happening every 3 seconds for about 8-9 iterations. Waiting for a query to be resolved by dev; once done, I will move the bug to the appropriate state.
QA, please verify that now that I've bumped the retry interval to 10s, the must-gather command does not pause for 60s in these 2 places, thanks:
1) after this output line: "OUT pod for plug-in image"
2) after this output line: "OUT waiting for gather to complete"

The retry interval is shortened from the original 60s to 10s in 2 places, so the overall command running time is decreased by ~80-120s. The overall timeout is configurable with the --timeout flag (defaults to 600s).
Still seeing retries every 3 seconds; will verify this again tomorrow.

[ramakasturinarra@dhcp35-60 openshift-client-linux-4.6.0-0.nightly-2020-09-01-070508]$ ./oc version
Client Version: 4.6.0-0.nightly-2020-09-01-070508
Server Version: 4.6.0-0.nightly-2020-09-01-070508
Kubernetes Version: v1.19.0-rc.2+b5dc585-dirty
Verified with the payload below; observations follow.

[ramakasturinarra@dhcp35-60 openshift-client-linux-4.6.0-0.nightly-2020-09-03-191144]$ ./oc version -o yaml
clientVersion:
  buildDate: "2020-09-03T06:56:57Z"
  compiler: gc
  gitCommit: 162e03f9f11dd645b0cc331b115a1afda8e8a2eb
  gitTreeState: clean
  gitVersion: openshift-clients-4.6.0-202006250705.p0-106-g162e03f9f
  goVersion: go1.14.4
  major: ""
  minor: ""
  platform: linux/amd64
openshiftVersion: 4.6.0-0.nightly-2020-09-03-191144
releaseClientVersion: 4.6.0-0.nightly-2020-09-03-191144
serverVersion:
  buildDate: "2020-06-26T23:02:45Z"
  compiler: gc
  gitCommit: 6e15712811e6e7955ec70854fc72153c8b417cba
  gitTreeState: dirty
  gitVersion: v4.6.0-202009031349.p0-dirty
  goVersion: go1.14.4
  major: "4"
  minor: 6+
  platform: linux/amd64

Steps followed to verify the bug:
===================================
1) Install the latest oc 4.6 server & client
2) Run the oc adm must-gather command
3) See that the retry interval was 10 seconds before proceeding at "OUT pod for plug-in image", and less than 10 seconds at "OUT waiting for gather to complete"

[must-gather ] OUT pod for plug-in image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:868b6c6113db3c0750d17dfc21193687bb8689137263578f4bc1ce0b71506af5 created
I0904 15:55:18.411934 22800 round_trippers.go:420] GET https://api.knarra0904.qe.devcluster.openshift.com:6443/api/v1/namespaces/openshift-must-gather-vnzl7/pods/must-gather-ts6ds
I0904 15:55:29.014294 22800 round_trippers.go:420] GET https://api.knarra0904.qe.devcluster.openshift.com:6443/api/v1/namespaces/openshift-must-gather-vnzl7/pods/must-gather-ts6ds
[must-gather-ts6ds] OUT waiting for gather to complete

[must-gather-gg9v6] OUT waiting for gather to complete
I0904 17:47:57.117725 25520 round_trippers.go:420] GET https://api.knarra0904.qe.devcluster.openshift.com:6443/api/v1/namespaces/openshift-must-gather-c2d9t/pods/must-gather-gg9v6
I0904 17:47:57.117774 25520 round_trippers.go:427] Request Headers:
I0904 17:48:08.758233 25520 round_trippers.go:420] GET https://api.knarra0904.qe.devcluster.openshift.com:6443/api/v1/namespaces/openshift-must-gather-c2d9t/pods/must-gather-gg9v6

4) Repeated the same steps with the oc 4.5 client and saw that the retry interval was 60 seconds at "OUT pod for plug-in image" and at "OUT waiting for gather to complete"

[must-gather ] OUT pod for plug-in image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:868b6c6113db3c0750d17dfc21193687bb8689137263578f4bc1ce0b71506af5 created
I0904 15:42:41.827131 22303 round_trippers.go:420] GET https://api.knarra0904.qe.devcluster.openshift.com:6443/api/v1/namespaces/openshift-must-gather-79q2g/pods/must-gather-kflsm
I0904 15:43:42.377047 22303 round_trippers.go:420] GET https://api.knarra0904.qe.devcluster.openshift.com:6443/api/v1/namespaces/openshift-must-gather-79q2g/pods/must-gather-kflsm

[must-gather-xnqx8] OUT waiting for gather to complete
I0904 17:34:46.826636 25131 round_trippers.go:420] GET https://api.knarra0904.qe.devcluster.openshift.com:6443/api/v1/namespaces/openshift-must-gather-7n9xn/pods/must-gather-xnqx8
I0904 17:35:47.159863 25131 round_trippers.go:420] GET https://api.knarra0904.qe.devcluster.openshift.com:6443/api/v1/namespaces/openshift-must-gather-7n9xn/pods/must-gather-xnqx8

Based on the above, moving the bug to the verified state.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196