Bug 1853028 - CNV must-gather failure on CNV-QE BM-RHCOS environment
Summary: CNV must-gather failure on CNV-QE BM-RHCOS environment
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Providers
Version: 2.4.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 2.4.0
Assignee: Yuval Turgeman
QA Contact: Yossi Segev
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-07-01 18:49 UTC by Yossi Segev
Modified: 2020-07-28 19:10 UTC (History)

Fixed In Version: cnv-must-gather-container-v2.4.0-64
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-07-28 19:10:39 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2020:3194 0 None None None 2020-07-28 19:10:50 UTC

Description Yossi Segev 2020-07-01 18:49:10 UTC
Description of problem:
The CNV must-gather image fails when run on our (CNV QE) BM-RHCOS environment. The default must-gather image runs successfully on the same environment, and the same CNV image runs successfully on PSI clusters, which are built from VM nodes.


Version-Release number of selected component (if applicable):
CNV v2.4.0

$ oc version
Client Version: 4.4.3
Server Version: 4.5.0-rc.2
Kubernetes Version: v1.18.3+91d0edd

CNV must-gather: registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-cnv-must-gather-rhel8:v2.4.0


How reproducible:
Always


Steps to Reproduce:
1. On CNV-QE's BM-RHCOS machine (10.0.98.16), run the CNV must-gather:
[cnv-qe-jenkins@cnv-executor-bm-rhcos ~]$ oc adm must-gather --image=registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-cnv-must-gather-rhel8:v2.4.0 --dest-dir=/home/cnv-qe-jenkins/yossi/mg


Actual results:
The command prints the following output and eventually times out. No data is collected, and the dest-dir is not even created:
[cnv-qe-jenkins@cnv-executor-bm-rhcos cnv-tests]$ oc adm must-gather --image=registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-cnv-must-gather-rhel8:v2.4.0 --dest-dir=/home/cnv-qe-jenkins/yossi/mg
[must-gather      ] OUT Using must-gather plugin-in image: registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-cnv-must-gather-rhel8:v2.4.0
[must-gather      ] OUT namespace/openshift-must-gather-4zlql created
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-24zmg created
[must-gather      ] OUT pod for plug-in image registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-cnv-must-gather-rhel8:v2.4.0 created
[must-gather-8bmw5] POD Gathering data for ns/openshift-cnv...
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Error from server (NotFound): namespaces "kubevirt-hyperconverged" not found
[must-gather-8bmw5] POD Gathering data for ns/openshift-operator-lifecycle-manager...
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Gathering data for ns/openshift-marketplace...
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Error from server (NotFound): namespaces "cluster-network-addons" not found
[must-gather-8bmw5] POD Gathering data for ns/openshift-sdn...
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Error from server (NotFound): namespaces "sriov-network-operator" not found
[must-gather-8bmw5] POD Error from server (NotFound): namespaces "kubevirt-web-ui" not found
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Error from server (NotFound): namespaces "cdi" not found
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD Wrote inspect data to must-gather.
[must-gather-8bmw5] POD No resources found
[must-gather-8bmw5] POD No resources found
[must-gather-8bmw5] POD Error from server (AlreadyExists): error when creating "/etc/node-gather-crd.yaml": namespaces "node-gather" already exists
[must-gather-8bmw5] POD Error from server (AlreadyExists): error when creating "/etc/node-gather-crd.yaml": serviceaccounts "node-gather" already exists
[must-gather-8bmw5] POD securitycontextconstraints.security.openshift.io/privileged added to: ["system:serviceaccount:node-gather:node-gather"]
[must-gather-8bmw5] POD Error from server (AlreadyExists): error when creating "/etc/node-gather-ds.yaml": daemonsets.apps "node-gather-daemonset" already exists
   
[must-gather-8bmw5] OUT gather logs unavailable: unexpected EOF
[must-gather-8bmw5] OUT waiting for gather to complete
[must-gather-8bmw5] OUT gather never finished: timed out waiting for the condition
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-24zmg deleted
[must-gather      ] OUT namespace/openshift-must-gather-4zlql deleted
error: gather never finished for pod must-gather-8bmw5: timed out waiting for the condition


Expected results:
1. must-gather finishes successfully
2. dest-dir is created.
3. dest-dir is filled with collected log data
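
Expected results 2 and 3 can be checked mechanically after a run. The helper below is a minimal sketch, not part of must-gather or `oc`: `check_dest_dir` is a hypothetical name, and "filled" is assumed to mean that at least one regular file exists somewhere under the directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of oc or must-gather).
# Returns 0 if the directory exists and holds at least one regular file,
# 1 if the directory is missing, 2 if it exists but contains no files.
check_dest_dir() {
    local dir="$1"
    if [ ! -d "$dir" ]; then
        echo "dest-dir missing: ${dir}" >&2
        return 1
    fi
    # Look for any regular file at any depth; stop reading after the first.
    if [ -z "$(find "$dir" -type f | head -n 1)" ]; then
        echo "dest-dir is empty: ${dir}" >&2
        return 2
    fi
    echo "dest-dir OK: ${dir}"
}
```

For the run in this report, it would be called as `check_dest_dir /home/cnv-qe-jenkins/yossi/mg` once `oc adm must-gather` returns.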


Additional info:
1. No error appears in the must-gather pod description.
2. Yuval has already debugged this. I quote his resolution from our email correspondence:
> Basically, gathering nodes data took too long due to many calls to `oc exec` in order to fetch the sriov information.
> The PR fixes this by using a single exec call for the sriov, and it also parallelizes the rest of node gathering.
>
> It looks like the timeouts happened actually in `oc must-gather` and not in our code simply because it didn't have any data to read for quite a long time (they have a timeout on read), so I added some verbosity in places that I couldn't see an obvious optimization.
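
The approach described in the quote (one batched call per node, nodes gathered in parallel, and progress output so the log stream is never silent for long) can be sketched in shell roughly as follows. This is an illustration only, not the code from the PR: `collect_node`, `gather_all_nodes`, and `OUT_DIR` are hypothetical names, and the body of `collect_node` is a stand-in for whatever single `oc exec` call would collect a node's data.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; all names here are hypothetical, not from the PR.

# Directory for per-node output (assumed; defaults to a temp directory).
OUT_DIR="${OUT_DIR:-$(mktemp -d)}"

# Stand-in for the real per-node collection. In a real script this would
# be ONE exec into the node's gather pod that runs every command in a
# single remote shell, rather than one exec per device or file.
collect_node() {
    local node="$1"
    printf 'data for %s\n' "$node"
}

# Gather all nodes concurrently. Each job prints a line when it starts
# and when it ends, so a reader with a read timeout keeps receiving data.
gather_all_nodes() {
    local node
    mkdir -p "$OUT_DIR"
    for node in "$@"; do
        (
            echo "Gathering data for node ${node}..."
            collect_node "$node" > "${OUT_DIR}/${node}.log" 2>&1
            echo "Finished node ${node}"
        ) &
    done
    wait   # block until every background job has exited
}
```

Running the per-node work as background jobs followed by a single `wait` makes the total time roughly that of the slowest node rather than the sum over all nodes.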

3. Yuval has already submitted a PR with the fix:
https://github.com/kubevirt/must-gather/pull/68
He created a local image with this fix (quay.io/yuvalturg/must-gather:latest).
I have tested this image (oc adm must-gather --image=quay.io/yuvalturg/must-gather:latest), and it seems to solve the issue.

4. Thank you very much Yuval!

Comment 1 Yossi Segev 2020-07-22 11:36:57 UTC
Verified using the same setup as in the original reproduction scenario, on CNV-QE's BM-RHCOS machine, by running the same command:
[cnv-qe-jenkins@cnv-executor-bm-rhcos ~]$ oc adm must-gather --image=registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-cnv-must-gather-rhel8:v2.4.0 --dest-dir=/home/cnv-qe-jenkins/yossi/mg

Comment 4 errata-xmlrpc 2020-07-28 19:10:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:3194

