Bug 1853028

Summary: CNV must-gather failure on CNV-QE BM-RHCOS environment
Product: Container Native Virtualization (CNV) Reporter: Yossi Segev <ysegev>
Component: Providers Assignee: Yuval Turgeman <yturgema>
Status: CLOSED ERRATA QA Contact: Yossi Segev <ysegev>
Severity: high Docs Contact:
Priority: medium    
Version: 2.4.0 CC: cnv-qe-bugs, ncredi, stirabos
Target Milestone: ---   
Target Release: 2.4.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: cnv-must-gather-container-v2.4.0-64 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-07-28 19:10:39 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Yossi Segev 2020-07-01 18:49:10 UTC
Description of problem:
The CNV must-gather image fails when run on our (CNV QE) BM-RHCOS environment. The default must-gather image runs successfully on this environment, and the same CNV image runs successfully on PSI clusters, which are built from VM nodes.


Version-Release number of selected component (if applicable):
CNV v2.4.0

$ oc version
Client Version: 4.4.3
Server Version: 4.5.0-rc.2
Kubernetes Version: v1.18.3+91d0edd

CNV must-gather: registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-cnv-must-gather-rhel8:v2.4.0


How reproducible:
Always


Steps to Reproduce:
1. On CNV-QE's BM-RHCOS machine (10.0.98.16), run CNV must-gather:
[cnv-qe-jenkins@cnv-executor-bm-rhcos ~]$ oc adm must-gather --image=registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-cnv-must-gather-rhel8:v2.4.0 --dest-dir=/home/cnv-qe-jenkins/yossi/mg


Actual results:
The command prints the following output and eventually times out. No data is collected, and dest-dir is not even created:
[cnv-qe-jenkins@cnv-executor-bm-rhcos cnv-tests]$ oc adm must-gather --image=registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-cnv-must-gather-rhel8:v2.4.0 --dest-dir=/home/cnv-qe-jenkins/yossi/mg
[must-gather      ] OUT Using must-gather plugin-in image: registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-cnv-must-gather-rhel8:v2.4.0
[must-gather      ] OUT namespace/openshift-must-gather-4zlql created
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-24zmg created
[must-gather      ] OUT pod for plug-in image registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-cnv-must-gather-rhel8:v2.4.0 created
[must-gather-8bmw5] POD Gathering data for ns/openshift-cnv...
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Error from server (NotFound): namespaces "kubevirt-hyperconverged" not found
[must-gather-8bmw5] POD Gathering data for ns/openshift-operator-lifecycle-manager...
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Gathering data for ns/openshift-marketplace...
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Error from server (NotFound): namespaces "cluster-network-addons" not found
[must-gather-8bmw5] POD Gathering data for ns/openshift-sdn...
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Error from server (NotFound): namespaces "sriov-network-operator" not found
[must-gather-8bmw5] POD Error from server (NotFound): namespaces "kubevirt-web-ui" not found
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Error from server (NotFound): namespaces "cdi" not found
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD No resources found
[must-gather-8bmw5] POD No resources found
[must-gather-8bmw5] POD Error from server (AlreadyExists): error when creating "/etc/node-gather-crd.yaml": namespaces "node-gather" already exists
[must-gather-8bmw5] POD Error from server (AlreadyExists): error when creating "/etc/node-gather-crd.yaml": serviceaccounts "node-gather" already exists
[must-gather-8bmw5] POD securitycontextconstraints.security.openshift.io/privileged added to: ["system:serviceaccount:node-gather:node-gather"]
[must-gather-8bmw5] POD Error from server (AlreadyExists): error when creating "/etc/node-gather-ds.yaml": daemonsets.apps "node-gather-daemonset" already exists
   
[must-gather-8bmw5] OUT gather logs unavailable: unexpected EOF
[must-gather-8bmw5] OUT waiting for gather to complete
[must-gather-8bmw5] OUT gather never finished: timed out waiting for the condition
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-24zmg deleted
[must-gather      ] OUT namespace/openshift-must-gather-4zlql deleted
error: gather never finished for pod must-gather-8bmw5: timed out waiting for the condition


Expected results:
1. must-gather finishes successfully
2. dest-dir is created.
3. dest-dir is filled with collected log data
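
Expected results 2 and 3 can be checked with a small shell helper. This is only a sketch: the function name `dest_dir_has_data` is made up for illustration, and it only checks that the directory exists and contains at least one regular file, not that the collected data is complete.

```shell
# Sketch only: dest_dir_has_data is a hypothetical helper name.
# Succeeds only if the directory exists and holds at least one regular file.
dest_dir_has_data() {
    dir="$1"
    [ -d "$dir" ] || return 1
    [ -n "$(find "$dir" -type f | head -n 1)" ]
}
```

For example, `dest_dir_has_data /home/cnv-qe-jenkins/yossi/mg && echo "data collected"` could be run after the must-gather command returns.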


Additional info:
1. No error appears in the must-gather pod description.
2. This was already debugged by Yuval. I quote his resolution from our email correspondence:
> Basically, gathering nodes data took too long due to many calls to `oc exec` in order to fetch the sriov information.
> The PR fixes this by using a single exec call for the sriov, and it also parallelizes the rest of node gathering.
>
> It looks like the timeouts happened actually in `oc must-gather` and not in our code simply because it didn't have any data to read for quite a long time (they have a timeout on read), so I added some verbosity in places that I couldn't see an obvious optimization.

3. Yuval has already submitted a PR with the fix:
https://github.com/kubevirt/must-gather/pull/68
He built a test image with this fix (quay.io/yuvalturg/must-gather:latest).
I tested this image (oc adm must-gather --image=quay.io/yuvalturg/must-gather:latest), and it seems to solve the issue.

4. Thank you very much Yuval!
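
To illustrate the approach quoted in item 2, here is a minimal shell sketch of the three ideas: replacing many `oc exec` calls with one, running per-node gathering in parallel, and printing progress so the log stream does not go silent. This is not the code from the PR. The `OC` override variable, the function names, the device names, and the commands run inside the pod (including the sysfs path) are assumptions made for illustration.

```shell
# Sketch only; this is not the code from the PR.
# OC is a hypothetical override so the functions can be tried without a cluster.
OC="${OC:-oc}"

# Slow pattern: one `oc exec` round trip per device.
gather_sriov_per_device() {
    ns="$1"; pod="$2"; shift 2
    for dev in "$@"; do
        "$OC" exec -n "$ns" "$pod" -- cat "/sys/class/net/$dev/device/sriov_numvfs"
    done
}

# Batched pattern: a single `oc exec`; the loop runs inside the pod.
gather_sriov_batched() {
    ns="$1"; pod="$2"; shift 2
    "$OC" exec -n "$ns" "$pod" -- sh -c \
        'for d in "$@"; do cat "/sys/class/net/$d/device/sriov_numvfs"; done' sh "$@"
}

# Parallel pattern: start one background job per pod, then wait for all of them.
# The echo gives the reader of the log stream something to read while jobs run.
gather_nodes_parallel() {
    ns="$1"; shift
    for pod in "$@"; do
        echo "Gathering node data from $pod..."
        "$OC" exec -n "$ns" "$pod" -- sh -c 'ip address; uname -a' &
    done
    wait
}
```

With three devices, `gather_sriov_per_device` issues three exec calls while `gather_sriov_batched` issues one, which is where the time saving would come from when each exec has a noticeable setup cost.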

Comment 1 Yossi Segev 2020-07-22 11:36:57 UTC
Verified using the same setup as in the original reproduction scenario, on CNV-QE's BM-RHCOS machine, by running the same command:
[cnv-qe-jenkins@cnv-executor-bm-rhcos ~]$ oc adm must-gather --image=registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-cnv-must-gather-rhel8:v2.4.0 --dest-dir=/home/cnv-qe-jenkins/yossi/mg

Comment 4 errata-xmlrpc 2020-07-28 19:10:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:3194