Bug 1781044

Summary: [must gather] oc adm must-gather failed to generate the directory, gather never finished: timed out waiting for the condition.
Product: Container Native Virtualization (CNV) Reporter: Ying Cui <ycui>
Component: Providers Assignee: Avram Levitter <alevitte>
Status: CLOSED ERRATA QA Contact: Ying Cui <ycui>
Severity: high Docs Contact:
Priority: high    
Version: 2.2.0 CC: alevitte, cnv-qe-bugs, danken, fdeutsch, maszulik, ncredi, pkliczew
Target Milestone: --- Keywords: Regression
Target Release: 2.2.0 Flags: maszulik: needinfo-
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: cnv-must-gather-container-v2.2.0-7 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-01-30 16:27:33 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description                           Flags
screen_messages_output_mustgather     none
ocgetpods                             none
ocdescribepod                         none
mustgather_withoutimage_successful    none

Description Ying Cui 2019-12-09 07:30:35 UTC
Description of problem:
Running oc adm must-gather to gather all CNV info produces no output directory, even after waiting for a while. The command fails with: gather never finished: timed out waiting for the condition

Version-Release number of selected component (if applicable):
oc version: Client Version: openshift-clients-4.3.0-201910250623-70-g0ed83003
Server Version: 4.3.0-0.nightly-2019-11-28-103851
Kubernetes Version: v1.16.2
CNV 2.2

How reproducible:
100% in PSI


Steps to Reproduce:
1. Deployed OCP 4.3 and CNV 2.2 successfully.

2. $ oc adm must-gather --image=registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-cnv-must-gather-rhel8:v2.2.0-6 --dest-dir=/tmp/pytest-of-cnv-qe-jenkins/pytest-1/must_gather0
# see attachment: output_mustgather.txt

3. 
$ oc get pods -A  -w  # see attachment: ocgetpods.txt
$ oc describe pod must-gather-fbwz2 -n openshift-must-gather-jbzp9 # see attachment: ocdescribepod.txt

Actual results:
Step 2: after checking /tmp/pytest-of-cnv-qe-jenkins/pytest-1/must_gather0, no output directory has been generated. The command reports: gather never finished: timed out waiting for the condition


Expected results:
The output directory is generated.

Additional info:
1. $ oc adm must-gather --image=quay.io/kubevirt/must-gather does NOT work.
2. The specific issue is followed up in Bug 1781038 - [must gather] openshift-must-gather has been DEPRECATED. Use `oc adm inspect` instead.

Comment 1 Ying Cui 2019-12-09 07:32:07 UTC
Created attachment 1643190 [details]
screen_messages_output_mustgather

Comment 2 Ying Cui 2019-12-09 07:33:40 UTC
Created attachment 1643191 [details]
ocgetpods

Comment 3 Ying Cui 2019-12-09 07:34:53 UTC
Created attachment 1643204 [details]
ocdescribepod

Comment 4 Piotr Kliczewski 2019-12-09 08:25:08 UTC
Maciej, I remember you wanted to investigate this one. We agreed that, no matter what happens, whatever logs were gathered should still be collected.

Comment 5 Piotr Kliczewski 2019-12-09 09:20:58 UTC
It doesn't look like a regression. In my opinion it never worked. Let's wait for Maciej to reply, but I think he or someone else from the platform should fix it.

Comment 6 Dan Kenigsberg 2019-12-09 17:19:33 UTC
Piotr, what is "it" that never worked? I thought that Ying was attempting a very basic use case which was tested before. What am I missing?

Comment 7 Piotr Kliczewski 2019-12-09 17:54:08 UTC
Dan, this issue was reported before as BZ #1755714. Maciej closed it as "works on my machine" and promised to investigate, which it seems never happened.

Comment 12 Ying Cui 2019-12-10 14:37:51 UTC
Created attachment 1643655 [details]
mustgather_withoutimage_successful

Comment 16 Avram Levitter 2019-12-11 06:55:07 UTC
It seems that it's failing specifically because of the 10 minute timeout built into `oc adm must-gather`. When I used the `--keep` flag (which will not delete the pod and namespace after execution), the pod finished after 13 minutes.

Comment 17 Avram Levitter 2019-12-11 10:26:39 UTC
The problem seems to be specifically in the gathering of the packagemanifests. That section has been taking close to 10 minutes. It takes around 3 seconds to execute `oc get packagemanifest $name -n $NS -o yaml >> ${NAMESPACE_PATH}/${NS}/packagemanifests` and on a test cluster there were 185 packagemanifests.
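
As a rough check of those numbers, the arithmetic can be sketched in shell. This is only a back-of-the-envelope estimate: the count and per-call time are the approximate figures quoted above, the 600 second limit is the 10 minute timeout from comment 16, and the variable names are made up for illustration.

```shell
#!/usr/bin/env bash
# Back-of-the-envelope estimate of the packagemanifest gathering time.
# All three inputs are the approximate figures quoted in this bug,
# not measured constants.
manifest_count=185      # packagemanifests seen on the test cluster
seconds_per_call=3      # rough cost of one `oc get packagemanifest` call
timeout_seconds=600     # the 10 minute must-gather timeout

# One call per manifest, run sequentially.
estimate_seconds=$(( manifest_count * seconds_per_call ))

# Time left for every other gathering step before the timeout hits.
headroom_seconds=$(( timeout_seconds - estimate_seconds ))

echo "packagemanifests alone: ${estimate_seconds}s"
echo "left for everything else: ${headroom_seconds}s"
```

With these inputs the per-manifest loop alone uses 555 of the 600 seconds, leaving about 45 seconds for the rest of the gathering, which is consistent with the pod finishing only after the timeout.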

Comment 18 Avram Levitter 2019-12-12 13:57:50 UTC
There's a pending pull request that should fix this in upstream: https://github.com/kubevirt/must-gather/pull/60

Comment 19 Dan Kenigsberg 2019-12-12 20:07:03 UTC
(In reply to Avram Levitter from comment #18)
> There's a pending pull request that should fix this in upstream:
> https://github.com/kubevirt/must-gather/pull/60

That's exactly the reason to move a bz to the POST state.

Comment 21 Ying Cui 2019-12-24 07:29:28 UTC
VERIFIED this bug on cnv-must-gather-container-v2.2.0-7

Test Steps:

$ oc adm must-gather --image=registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-cnv-must-gather-rhel8:v2.2.0-7 --dest-dir=/tmp

The output directory is generated; the issue is fixed.

Comment 23 errata-xmlrpc 2020-01-30 16:27:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:0307