Description of problem:
The must-gather pod is currently handled (in code) this way:
- The pod's init interval (i.e. the interval for checking whether the must-gather init container is in Running state) is 10 minutes
- The pod's timeout (the timeout for the must-gather pod to be running) is 1 minute

This setup makes no sense: the init interval should be smaller and more reasonable, e.g. 5 seconds. That way the init container is sampled every 5 seconds, and within 1 minute both the init container is expected to finish and the entire pod (the pod's copy container, to be exact) is expected to be running.

Version-Release number of selected component (if applicable):
$ oc version
Client Version: 4.5.0-rc.1
Server Version: 4.5.0-rc.5
Kubernetes Version: v1.18.3+f291db1

How reproducible:
Not always; when it does reproduce, must-gather ends with an EOF exception and no output directory is created.

Steps to Reproduce:
1. Run a simple must-gather action, e.g.:
$ oc adm must-gather --image=registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-cnv-must-gather-rhel8:v2.4.0 --dest-dir=/home/cnv-qe-jenkins/yossi/mg

Actual results:
Environment dependent, but when the pod init time is too long, must-gather fails with an EOF exception and no data is collected.

Expected results:
The pod init interval should have a reasonable, much shorter value, to decrease the chance of must-gather failing on timeout.

Additional info:
This was found when using a CNV (OpenShift Virtualization) must-gather image, but was debugged (in bz#1859972) and found to stem from the must-gather pod init retry interval, which is set and implemented in OCP.
The interval there defaults to 10 minutes, but a user can change it at will through the --timeout flag.
It's true that this parameter is configurable, but I think it's configurable by mistake. The help text of the timeout parameter says:

--timeout=600: The length of time to gather data, in seconds. Defaults to 10 minutes.

But what actually happens is that configuring this parameter doesn't affect the timeout to gather data, only the must-gather pod's *init interval*. This is clearly a bug: the user doesn't care what the interval of the pod's init retries is, but they do care how long they are willing to wait for data to be gathered.

So there are actually 3 issues here that should be fixed, in order of severity (the first being the most severe):
1. The user cannot set the total timeout for gathering data.
2. The default interval of the pod's initialization is way too long (10 minutes) and should be set to something more reasonable.
3. The pod init interval shouldn't be configurable.

What do you think?
> 1. The user cannot set the total timeout for gathering data.
> 2. The default interval of the pod's initialization is way too long (10 minutes), and should be set to something more reasonable.
> 3. The pod init interval shouldn't be configurable.

The init pod is where the actual gathering is happening. I can agree about the overall timeout, though.
The timeout is configurable, so leaving that alone. I notice that the retry intervals in must-gather, of 1 minute each, add ~2 minutes to the must-gather run, though. I've added a PR to shorten the retry intervals.
Tried to verify the bug; I see that the run now takes 20 to 25 seconds, with retries happening every 3 seconds for about 8-9 iterations. Waiting for a query to be resolved by dev; once done, I will move the bug to the appropriate state.
QA, please verify that now that I've bumped the retry interval to 10s, the must-gather command does not pause for 60s in these 2 places, thanks:
1) after this output line: "OUT pod for plug-in image"
2) after this output line: "OUT waiting for gather to complete"

The retry interval is shortened from the original 60s to 10s in 2 places, so the overall command running time is decreased by ~80-120s. The overall timeout is configurable with the --timeout flag (defaults to 600s).
Still seeing retries every 3 seconds; will verify this again tomorrow.

[ramakasturinarra@dhcp35-60 openshift-client-linux-4.6.0-0.nightly-2020-09-01-070508]$ ./oc version
Client Version: 4.6.0-0.nightly-2020-09-01-070508
Server Version: 4.6.0-0.nightly-2020-09-01-070508
Kubernetes Version: v1.19.0-rc.2+b5dc585-dirty
Verified with the payload below; observations follow.

[ramakasturinarra@dhcp35-60 openshift-client-linux-4.6.0-0.nightly-2020-09-03-191144]$ ./oc version -o yaml
clientVersion:
  buildDate: "2020-09-03T06:56:57Z"
  compiler: gc
  gitCommit: 162e03f9f11dd645b0cc331b115a1afda8e8a2eb
  gitTreeState: clean
  gitVersion: openshift-clients-4.6.0-202006250705.p0-106-g162e03f9f
  goVersion: go1.14.4
  major: ""
  minor: ""
  platform: linux/amd64
openshiftVersion: 4.6.0-0.nightly-2020-09-03-191144
releaseClientVersion: 4.6.0-0.nightly-2020-09-03-191144
serverVersion:
  buildDate: "2020-06-26T23:02:45Z"
  compiler: gc
  gitCommit: 6e15712811e6e7955ec70854fc72153c8b417cba
  gitTreeState: dirty
  gitVersion: v4.6.0-202009031349.p0-dirty
  goVersion: go1.14.4
  major: "4"
  minor: 6+
  platform: linux/amd64

Steps followed to verify the bug:
===================================
1) Install the latest oc 4.6 server & client
2) Run the oc adm must-gather command
3) See that the retry interval was 10 seconds before proceeding at "OUT pod for plug-in image", and less than 10 seconds at "OUT waiting for gather to complete"

[must-gather ] OUT pod for plug-in image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:868b6c6113db3c0750d17dfc21193687bb8689137263578f4bc1ce0b71506af5 created
I0904 15:55:18.411934 22800 round_trippers.go:420] GET https://api.knarra0904.qe.devcluster.openshift.com:6443/api/v1/namespaces/openshift-must-gather-vnzl7/pods/must-gather-ts6ds
I0904 15:55:29.014294 22800 round_trippers.go:420] GET https://api.knarra0904.qe.devcluster.openshift.com:6443/api/v1/namespaces/openshift-must-gather-vnzl7/pods/must-gather-ts6ds
[must-gather-ts6ds] OUT waiting for gather to complete

[must-gather-gg9v6] OUT waiting for gather to complete
I0904 17:47:57.117725 25520 round_trippers.go:420] GET https://api.knarra0904.qe.devcluster.openshift.com:6443/api/v1/namespaces/openshift-must-gather-c2d9t/pods/must-gather-gg9v6
I0904 17:47:57.117774 25520 round_trippers.go:427] Request Headers:
I0904 17:48:08.758233 25520 round_trippers.go:420] GET https://api.knarra0904.qe.devcluster.openshift.com:6443/api/v1/namespaces/openshift-must-gather-c2d9t/pods/must-gather-gg9v6

4) Repeated the same steps with the oc 4.5 client and saw that the retry interval was 60 seconds at "OUT pod for plug-in image" and at "OUT waiting for gather to complete"

[must-gather ] OUT pod for plug-in image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:868b6c6113db3c0750d17dfc21193687bb8689137263578f4bc1ce0b71506af5 created
I0904 15:42:41.827131 22303 round_trippers.go:420] GET https://api.knarra0904.qe.devcluster.openshift.com:6443/api/v1/namespaces/openshift-must-gather-79q2g/pods/must-gather-kflsm
I0904 15:43:42.377047 22303 round_trippers.go:420] GET https://api.knarra0904.qe.devcluster.openshift.com:6443/api/v1/namespaces/openshift-must-gather-79q2g/pods/must-gather-kflsm

[must-gather-xnqx8] OUT waiting for gather to complete
I0904 17:34:46.826636 25131 round_trippers.go:420] GET https://api.knarra0904.qe.devcluster.openshift.com:6443/api/v1/namespaces/openshift-must-gather-7n9xn/pods/must-gather-xnqx8
I0904 17:35:47.159863 25131 round_trippers.go:420] GET https://api.knarra0904.qe.devcluster.openshift.com:6443/api/v1/namespaces/openshift-must-gather-7n9xn/pods/must-gather-xnqx8

Based on the above, moving the bug to the verified state.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196