Bug 1859972
| Summary: | CNV must-gather failure on CNV-QE BM environment (BM-RHEL, BM-RHCOS) | | |
|---|---|---|---|
| Product: | Container Native Virtualization (CNV) | Reporter: | Yossi Segev <ysegev> |
| Component: | Logging | Assignee: | Yuval Turgeman <yturgema> |
| Status: | CLOSED ERRATA | QA Contact: | Satyajit Bulage <sbulage> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 2.4.0 | CC: | cnv-qe-bugs, fdeutsch, irose, ncredi, sanchezl, stirabos, yadu, ycui, yturgema |
| Target Milestone: | --- | | |
| Target Release: | 2.4.3 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | cnv-must-gather-container-v2.4.0-73 | Doc Type: | Known Issue |
| Doc Text: | Cause: The default polling interval for the gather pod's init container (600 seconds) does not fit the pod's timeout for reaching the Running state (60 seconds).<br>Consequence: The action times out even though the init container was ready, because its status is not inspected in time.<br>Workaround (if any): Run the command with "--timeout=5"; "timeout" is the init container's polling interval (although the must-gather help documents it as the overall timeout). An example command is shown below the table.<br>Result: Data gathering completes successfully. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-11-04 19:07:13 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1866380, 1866399 | | |
| Bug Blocks: | | | |
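For convenience, the Doc Text workaround looks like this in practice. This is only a sketch: the image reference is the v2.4.0-73 build discussed in the comments below; substitute whichever build applies to your environment.

```shell
# Workaround sketch: a small --timeout makes the gather init container get polled
# early enough; as noted in the Doc Text, this flag effectively acts as the polling interval.
oc adm must-gather \
  --image=registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-cnv-must-gather-rhel8:v2.4.0-73 \
  --timeout=5
```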
Description
Yossi Segev
2020-07-23 12:20:28 UTC
I don't understand why this bug got targeted to 2.5. If customers cannot collect logs, this is a major issue; targeting to 2.4 until we get a proper analysis.

Adding the description of the must-gather pod (before the action fails and the pod is removed):

    [cnv-qe-jenkins@cnv-executor-bm-rhel ~]$ oc describe pod -n openshift-must-gather-b4w7z
    Name:         must-gather-lm6nn
    Namespace:    openshift-must-gather-b4w7z
    Priority:     0
    Node:         cnv-qe-04.cnvqe.lab.eng.rdu2.redhat.com/10.9.96.43
    Start Time:   Sun, 26 Jul 2020 12:31:48 +0000
    Labels:       app=must-gather
    Annotations:  k8s.v1.cni.cncf.io/network-status:
                    [{ "name": "openshift-sdn", "interface": "eth0", "ips": [ "10.128.0.190" ], "default": true, "dns": {} }]
                  k8s.v1.cni.cncf.io/networks-status:
                    [{ "name": "openshift-sdn", "interface": "eth0", "ips": [ "10.128.0.190" ], "default": true, "dns": {} }]
    Status:       Pending
    IP:           10.128.0.190
    IPs:
      IP:  10.128.0.190
    Init Containers:
      gather:
        Container ID:  cri-o://071fa2eac56ec2cd12fd86990be28ee2523bf2cf842cefce454a9b7447cfed1d
        Image:         registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-cnv-must-gather-rhel8:v2.4.0
        Image ID:      registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-cnv-must-gather-rhel8@sha256:031e21c6597e597f11d443b06425a3b2059d1af61f345c26c8ed8d3227645315
        Port:          <none>
        Host Port:     <none>
        Command:
          /usr/bin/gather
        State:          Running
          Started:      Sun, 26 Jul 2020 12:31:51 +0000
        Ready:          False
        Restart Count:  0
        Environment:    <none>
        Mounts:
          /must-gather from must-gather-output (rw)
          /var/run/secrets/kubernetes.io/serviceaccount from default-token-gfgjm (ro)
    Containers:
      copy:
        Container ID:
        Image:         registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-cnv-must-gather-rhel8:v2.4.0
        Image ID:
        Port:          <none>
        Host Port:     <none>
        Command:
          /bin/bash
          -c
          trap : TERM INT; sleep infinity & wait
        State:          Waiting
          Reason:       PodInitializing
        Ready:          False
        Restart Count:  0
        Environment:    <none>
        Mounts:
          /must-gather from must-gather-output (rw)
          /var/run/secrets/kubernetes.io/serviceaccount from default-token-gfgjm (ro)
    Conditions:
      Type              Status
      Initialized       False
      Ready             False
      ContainersReady   False
      PodScheduled      True
    Volumes:
      must-gather-output:
        Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
        Medium:
        SizeLimit:  <unset>
      default-token-gfgjm:
        Type:        Secret (a volume populated by a Secret)
        SecretName:  default-token-gfgjm
        Optional:    false
    QoS Class:       BestEffort
    Node-Selectors:  <none>
    Tolerations:
    Events:
      Type    Reason          Age  From                                              Message
      ----    ------          ---  ----                                              -------
      Normal  Scheduled       58s  default-scheduler                                 Successfully assigned openshift-must-gather-b4w7z/must-gather-lm6nn to cnv-qe-04.cnvqe.lab.eng.rdu2.redhat.com
      Normal  AddedInterface  56s  multus                                            Add eth0 [10.128.0.190/23]
      Normal  Pulled          56s  kubelet, cnv-qe-04.cnvqe.lab.eng.rdu2.redhat.com  Container image "registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-cnv-must-gather-rhel8:v2.4.0" already present on machine
      Normal  Created         55s  kubelet, cnv-qe-04.cnvqe.lab.eng.rdu2.redhat.com  Created container gather
      Normal  Started         55s  kubelet, cnv-qe-04.cnvqe.lab.eng.rdu2.redhat.com  Started container gather

Nelly, I'm not sure how we plan to release this - v2.4.0 is floating, so pulling it (docker pull) should get you the latest. However, let's say that we release image X to 2.4.0, and a day later we release Y to 2.4.0. If the user runs must-gather (2.4.0) between the two releases, it will continue to use X even after we released Y.
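One way to sidestep the floating-tag ambiguity (a hedged illustration; it assumes podman is available on the host) is to pull the tag and check which digest it currently resolves to, or simply to pass the exact build tag from the errata (e.g. v2.4.0-73, as done in the comments below) instead of plain v2.4.0:

```shell
# Pull the floating tag and list digests to see which concrete build it resolves to.
podman pull registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-cnv-must-gather-rhel8:v2.4.0
podman images --digests | grep cnv-must-gather-rhel8
```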
For QE I think it's best to go to the errata, grab the version that is attached, and use it directly. There's another issue with must-gather where the pod sometimes fails to initialize, but that's probably a different bug (even though the symptom, unexpected EOF, is identical).

Yuval managed to find the issue, and the problem is how the gather action and its pod are handled in the OpenShift code:
- The pod's init interval (i.e. the interval for checking whether the must-gather init container is in Running state) is 10 minutes.
- The pod's timeout (the time allowed for the must-gather pod to be running) is 1 minute.
This setup makes no sense - the init interval should be smaller. So Yuval is suggesting a patch where the init interval is 5 seconds. This way the init container is sampled every 5 seconds, and within 1 minute both the init container is expected to finish and the entire pod (the pod's copy container, to be exact) is expected to be running. He has taken this up with the OpenShift must-gather maintainer. @Yuval - please correct me if I missed or distorted anything. And thanks for your work on this!

Well, I was only half correct - we don't need a patch; the timeout value is actually the retry value I was looking for. So, bottom line, we need 2 things:
1. Make sure we're running the latest image (v2.4.0-73).
2. Add a smaller timeout (--timeout=5 seconds should be enough).
oc adm must-gather --image=url-to-latest-v2.4.0-73 --timeout=5

I have run the CNV d/s must-gather tests using the cnv-must-gather image v2.4.0 and timeout=5 (seconds), i.e.
oc adm must-gather --image=registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-cnv-must-gather-rhel8:v2.4.0-73 --dest-dir=/tmp/pytest-of-cnv-qe-jenkins/pytest-7/must_gather0 --timeout=5
I ran on both of the discussed BMs, and on both must-gather passed successfully (some specific tests failed, but I checked the failures and they were not related to the data gathering - and in any case, the data gathering completed successfully).

Thank you @Yossi, that is good news. Should we make this a known issue in our docs with this workaround of setting the timeout, and in addition open a (linked) OCP bug to handle this?

IMO, adding "timeout=5" is a legitimate workaround to be indicated in the doc. I am not sure how legitimate it is to instruct the user about the other workaround - calling an explicit version build (e.g. v2.4.0-73) - but that's not for me to decide. I'll open 2 bugs for OCP, as the interval and the pull-policy are 2 different issues.

Thanks, can you please fill in the doc text box?

Clearing the needinfo flag, as it was raised for me to be notified about the need to fill in the doc text, which I have already done.

Small update: we issued a PR to use exponential backoff [1]; it was abandoned in favor of [2], which simply lowers the interval to 3 seconds and should solve this as well. The fix was merged, will address both this bug and bz#1866399, and will go into OCP 4.6. For 4.5 we should continue using the workaround mentioned in comment 10.
[1] https://github.com/openshift/oc/pull/523
[2] https://github.com/openshift/oc/pull/527

The pull policy will not be changed to always [1], but the timeout issue was updated [2] to fit our needs. Can we please check this one again with the latest oc binary from 4.6?
[1] https://bugzilla.redhat.com/show_bug.cgi?id=1866380#c7
[2] https://github.com/openshift/oc/pull/545

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
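To make the interval-versus-timeout mismatch concrete, here is a rough shell sketch of the behavior described above. It is an illustration of the logic only, not the actual openshift/oc implementation, and the variable names are invented: with a 600-second polling interval and a 60-second budget, the deadline is exceeded before the init container's status is ever re-inspected, so the run fails even though gathering finished long before.

```shell
# Illustration only (not the oc source): poll the gather init container every
# INTERVAL seconds while enforcing a TIMEOUT-second budget for the pod.
INTERVAL=600   # default polling interval (what --timeout actually controls)
TIMEOUT=60     # budget for the pod to reach the Running state
start=$(date +%s)
while true; do
  sleep "$INTERVAL"                      # with the defaults, the first re-check is at t=600s
  if (( $(date +%s) - start > TIMEOUT )); then
    echo "timed out waiting for the gather init container" >&2
    exit 1                               # fails here even if gathering was done at t=3s
  fi
  # The init container status would be inspected here, e.g. (hypothetical check):
  # oc get pod "$POD" -n "$NS" -o jsonpath='{.status.initContainerStatuses[0].ready}'
done
```

Lowering the interval - whether via the --timeout=5 workaround or via the merged fix that drops it to 3 seconds - means the status check lands well inside the budget.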
For information on the advisory (OpenShift Virtualization 2.4.3 Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4928