Bug 1866399 - must-gather pod retry interval (10 minutes) doesn't make sense
Summary: must-gather pod retry interval (10 minutes) doesn't make sense
Keywords:
Status: VERIFIED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: oc
Version: 4.5
Hardware: Unspecified
OS: Unspecified
low
low
Target Milestone: ---
: 4.6.0
Assignee: Sally
QA Contact: RamaKasturi
URL:
Whiteboard:
Depends On:
Blocks: 1859972
TreeView+ depends on / blocked
 
Reported: 2020-08-05 13:49 UTC by Yossi Segev
Modified: 2020-09-04 12:24 UTC (History)
8 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-08-11 10:44:28 UTC
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Priority Status Summary Last Updated
Github openshift oc pull 527 None closed Bug 1866399: Shorten retry interval in must-gather 2020-09-22 18:08:39 UTC
Github openshift oc pull 545 None closed Bug 1866399: lessen retries in must-gather 2020-09-22 18:08:39 UTC

Description Yossi Segev 2020-08-05 13:49:47 UTC
Description of problem:
The must-gather and its pod are currently handled (in code) this way:
- The pod's init interval (i.e. interval for checking if the must-gather init container is in Running state) is 10 minutes
- The pod's timeout (timeout for the must-gather pod to be running) is 1 minute
And this setup makes no sense - the init interval should be smaller and more reasonable - something like 5 seconds.
This way every 5 seconds interval the init container will be sampled, and within 1 minute - both the init container is expected to finish, and the entire pod (the pod's copy container, to be exact) is expected to run.


Version-Release number of selected component (if applicable):
$ oc version
Client Version: 4.5.0-rc.1
Server Version: 4.5.0-rc.5
Kubernetes Version: v1.18.3+f291db1


How reproducible:
This bug is not always seen, but when it is - the must-gather ends with EOF exception, and there's no output directory created.


Steps to Reproduce:
1. Run a simple must-gather action, e.g.:
$ oc adm must-gather --image=registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-cnv-must-gather-rhel8:v2.4.0 --dest-dir=/home/cnv-qe-jenkins/yossi/mg


Actual results:
It's environment dependent, but when the pod init time is too long - must-gather fails with EOF exception, and no data is collected.


Expected results:
Pod init interval should have a reasonable, much shorter value, to decrease the chance of must-gather to fail on timeout.


Additional info:
This was found when using a CNV (Openshift Virtualization) must-gather image, but was debugged (in bz#1859972), and found to be an issue that stems from the must-gather pod init retry interval, which is set and implemented in OCP.

Comment 1 Maciej Szulik 2020-08-11 10:44:28 UTC
The interval there is defaulted to 10mins, but a user can change at will through --timeout flag.

Comment 2 Yossi Segev 2020-08-11 12:38:40 UTC
It's true that this parameter is configurable, but I think it's configurable by mistake:
The help of the timeout parameter says:
      --timeout=600: The length of time to gather data, in seconds. Defaults to 10 minutes.

But what actually happens is that configuring this parameter doesn't affect the timeout to gather data - just the must-gather pod's *init interval*.
Clearly this is a bug: The user doesn't care what is the interval of the pod's init retries, but he does care on how much time he is willing to wait for data to be gathered.
So there are actually 3 issues here that should be fixed, in the order of their severity (the first is the most severe):
1. The user cannot set the total timeout for gathering data.
2. The default interval of the pod's initialization is way too long (10 minutes), and should be set to something more reasonable.
3. The pod init interval shouldn't be configurable.

What do you think?

Comment 3 Maciej Szulik 2020-08-11 13:06:02 UTC
> 1. The user cannot set the total timeout for gathering data.
> 2. The default interval of the pod's initialization is way too long (10
> minutes), and should be set to something more reasonable.
> 3. The pod init interval shouldn't be configurable.

Init pod is where the actual gathering is happening, I can agree about the overall timeout, though.

Comment 4 Sally 2020-08-20 18:03:42 UTC
The timeout is configurable, so leaving that alone.  I notice the retry intervals in must-gather of 1 min each add ~2 min to the must-gather, though.  I've added a PR to shorten the retry intervals.

Comment 7 RamaKasturi 2020-08-31 16:59:18 UTC
Tried to verify the bug and i see that time it takes now is 20 to 25 seconds and retries are happening for every 3 seconds for about 8-9 times, waiting for a query to be resolved from dev, once done will move bug to appropriate state.

Comment 9 Sally 2020-09-01 23:28:52 UTC
QA, Please verify now that I've bumped the retry interval to 10s that the must-gather command does not pause for 60s in these 2 places, thanks:
1) after this output line: "OUT pod for plug-in image"
2) after this output line: "OUT waiting for gather to complete"

The retry interval is shortened from the original 60s to 10s in 2 places, so the overall command running time is decreased by ~80-120s.  The overall timeout is configurable with the --timeout flag (defaults to 600s).

Comment 10 RamaKasturi 2020-09-02 12:09:40 UTC
still see retries for 3 seconds, will verify this tomorrow again

[ramakasturinarra@dhcp35-60 openshift-client-linux-4.6.0-0.nightly-2020-09-01-070508]$ ./oc version
Client Version: 4.6.0-0.nightly-2020-09-01-070508
Server Version: 4.6.0-0.nightly-2020-09-01-070508
Kubernetes Version: v1.19.0-rc.2+b5dc585-dirty

Comment 14 RamaKasturi 2020-09-04 12:24:50 UTC
Verified with the payload below and observation follows.

[ramakasturinarra@dhcp35-60 openshift-client-linux-4.6.0-0.nightly-2020-09-03-191144]$ ./oc version -o yaml
clientVersion:
  buildDate: "2020-09-03T06:56:57Z"
  compiler: gc
  gitCommit: 162e03f9f11dd645b0cc331b115a1afda8e8a2eb
  gitTreeState: clean
  gitVersion: openshift-clients-4.6.0-202006250705.p0-106-g162e03f9f
  goVersion: go1.14.4
  major: ""
  minor: ""
  platform: linux/amd64
openshiftVersion: 4.6.0-0.nightly-2020-09-03-191144
releaseClientVersion: 4.6.0-0.nightly-2020-09-03-191144
serverVersion:
  buildDate: "2020-06-26T23:02:45Z"
  compiler: gc
  gitCommit: 6e15712811e6e7955ec70854fc72153c8b417cba
  gitTreeState: dirty
  gitVersion: v4.6.0-202009031349.p0-dirty
  goVersion: go1.14.4
  major: "4"
  minor: 6+
  platform: linux/amd64

Steps followed to verify the bug:
===================================
1) Install latest oc 4.6 server & client
2) Run oc adm must-gather command
3) see that retry interval was 10 seconds before proceeding at "OUT pod for plug-in image" and less than 10 seconds at "OUT waiting for gather to complete"

[must-gather      ] OUT pod for plug-in image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:868b6c6113db3c0750d17dfc21193687bb8689137263578f4bc1ce0b71506af5 created
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
I0904 15:55:18.411934   22800 round_trippers.go:420] GET https://api.knarra0904.qe.devcluster.openshift.com:6443/api/v1/namespaces/openshift-must-gather-vnzl7/pods/must-gather-ts6ds

I0904 15:55:29.014294   22800 round_trippers.go:420] GET https://api.knarra0904.qe.devcluster.openshift.com:6443/api/v1/namespaces/openshift-must-gather-vnzl7/pods/must-gather-ts6ds

[must-gather-ts6ds] OUT waiting for gather to complete
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
[must-gather-gg9v6] OUT waiting for gather to complete
I0904 17:47:57.117725   25520 round_trippers.go:420] GET https://api.knarra0904.qe.devcluster.openshift.com:6443/api/v1/namespaces/openshift-must-gather-c2d9t/pods/must-gather-gg9v6
I0904 17:47:57.117774   25520 round_trippers.go:427] Request Headers:

I0904 17:48:08.758233   25520 round_trippers.go:420] GET https://api.knarra0904.qe.devcluster.openshift.com:6443/api/v1/namespaces/openshift-must-gather-c2d9t/pods/must-gather-gg9v6



4) Repeated the same steps with oc 4.5 client and i see that retry interval was 60  at "OUT pod for plug-in image and at "OUT waiting for gather to complete"
[must-gather      ] OUT pod for plug-in image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:868b6c6113db3c0750d17dfc21193687bb8689137263578f4bc1ce0b71506af5 created
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
I0904 15:42:41.827131   22303 round_trippers.go:420] GET https://api.knarra0904.qe.devcluster.openshift.com:6443/api/v1/namespaces/openshift-must-gather-79q2g/pods/must-gather-kflsm

I0904 15:43:42.377047   22303 round_trippers.go:420] GET https://api.knarra0904.qe.devcluster.openshift.com:6443/api/v1/namespaces/openshift-must-gather-79q2g/pods/must-gather-kflsm

[must-gather-xnqx8] OUT waiting for gather to complete
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
I0904 17:34:46.826636   25131 round_trippers.go:420] GET https://api.knarra0904.qe.devcluster.openshift.com:6443/api/v1/namespaces/openshift-must-gather-7n9xn/pods/must-gather-xnqx8

I0904 17:35:47.159863   25131 round_trippers.go:420] GET https://api.knarra0904.qe.devcluster.openshift.com:6443/api/v1/namespaces/openshift-must-gather-7n9xn/pods/must-gather-xnqx8



Based on the above moving the bug to verified state.


Note You need to log in before you can comment on or make changes to this bug.