Description of problem:
The CNV must-gather image fails to run on our (CNV QE) BM-RHEL environment. The default must-gather image runs successfully on the same environment.

Version-Release number of selected component (if applicable):
$ oc version
Client Version: 4.5.0-rc.1
Server Version: 4.5.0-rc.5
Kubernetes Version: v1.18.3+f291db1

CNV must-gather: registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-cnv-must-gather-rhel8:v2.4.0

How reproducible:
Always

Steps to Reproduce:
1. On CNV-QE's BM-RHEL machine, run CNV must-gather:
[cnv-qe-jenkins@cnv-executor-bm-rhel ~]$ oc adm must-gather --image=registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-cnv-must-gather-rhel8:v2.4.0 --dest-dir=/home/cnv-qe-jenkins/yossi/mg

Actual results:
The following output, and eventually nothing happens (no data is collected, not even the dest-dir is created):

[must-gather ] OUT Using must-gather plugin-in image: registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-cnv-must-gather-rhel8:v2.4.0
[must-gather ] OUT namespace/openshift-must-gather-h2w25 created
[must-gather ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-fm9bw created
[must-gather ] OUT pod for plug-in image registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-cnv-must-gather-rhel8:v2.4.0 created
[must-gather-6cn6j] POD Gathering data for ns/openshift-cnv...
[must-gather-6cn6j] POD Wrote inspect data to must-gather.
[must-gather-6cn6j] POD Error from server (NotFound): namespaces "kubevirt-hyperconverged" not found
[must-gather-6cn6j] POD Gathering data for ns/openshift-operator-lifecycle-manager...
[must-gather-6cn6j] POD Wrote inspect data to must-gather.
[must-gather-6cn6j] POD Gathering data for ns/openshift-marketplace...
[must-gather-6cn6j] POD Wrote inspect data to must-gather.
[must-gather-6cn6j] POD Error from server (NotFound): namespaces "cluster-network-addons" not found
[must-gather-6cn6j] POD Gathering data for ns/openshift-sdn...
[must-gather-6cn6j] POD Wrote inspect data to must-gather.
[must-gather-6cn6j] POD Error from server (NotFound): namespaces "sriov-network-operator" not found
[must-gather-6cn6j] POD Error from server (NotFound): namespaces "kubevirt-web-ui" not found
[must-gather-6cn6j] POD Wrote inspect data to must-gather.
[must-gather-6cn6j] POD Wrote inspect data to must-gather.
[must-gather-6cn6j] POD Wrote inspect data to must-gather.
[must-gather-6cn6j] POD Wrote inspect data to must-gather.
[must-gather-6cn6j] POD Wrote inspect data to must-gather.
[must-gather-6cn6j] POD Wrote inspect data to must-gather.
[must-gather-6cn6j] POD Wrote inspect data to must-gather.
[must-gather-6cn6j] POD Wrote inspect data to must-gather.
[must-gather-6cn6j] POD Wrote inspect data to must-gather.
[must-gather-6cn6j] POD Wrote inspect data to must-gather.
[must-gather-6cn6j] POD Error from server (NotFound): namespaces "cdi" not found
[must-gather-6cn6j] POD Wrote inspect data to must-gather.
[must-gather-6cn6j] POD Wrote inspect data to must-gather.
[must-gather-6cn6j] POD Wrote inspect data to must-gather.
[must-gather-6cn6j] POD Wrote inspect data to must-gather.
[must-gather-6cn6j] POD Wrote inspect data to must-gather.
[must-gather-6cn6j] POD Wrote inspect data to must-gather.
[must-gather-6cn6j] POD Wrote inspect data to must-gather.
[must-gather-6cn6j] POD Wrote inspect data to must-gather.
[must-gather-6cn6j] POD Wrote inspect data to must-gather.
[must-gather-6cn6j] POD Wrote inspect data to must-gather.
[must-gather-6cn6j] POD Wrote inspect data to must-gather.
[must-gather-6cn6j] POD Wrote inspect data to must-gather.
[must-gather-6cn6j] POD Wrote inspect data to must-gather.
[must-gather-6cn6j] POD Wrote inspect data to must-gather.
[must-gather-6cn6j] POD Wrote inspect data to must-gather.
[must-gather-6cn6j] POD Wrote inspect data to must-gather.
[must-gather-6cn6j] POD Wrote inspect data to must-gather.
[must-gather-6cn6j] POD Wrote inspect data to must-gather.
[must-gather-6cn6j] POD Wrote inspect data to must-gather.
[must-gather-6cn6j] POD Wrote inspect data to must-gather.
[must-gather-6cn6j] POD Wrote inspect data to must-gather.
[must-gather-6cn6j] POD Wrote inspect data to must-gather.
[must-gather-6cn6j] POD Wrote inspect data to must-gather.
[must-gather-6cn6j] POD Wrote inspect data to must-gather.
[must-gather-6cn6j] POD Wrote inspect data to must-gather.
[must-gather-6cn6j] POD No resources found
[must-gather-6cn6j] POD No resources found
[must-gather-6cn6j] POD Error from server (AlreadyExists): error when creating "/etc/node-gather-crd.yaml": namespaces "node-gather" already exists
[must-gather-6cn6j] POD Error from server (AlreadyExists): error when creating "/etc/node-gather-crd.yaml": serviceaccounts "node-gather" already exists
[must-gather-6cn6j] POD securitycontextconstraints.security.openshift.io/privileged added to: ["system:serviceaccount:node-gather:node-gather"]
[must-gather-6cn6j] POD Error from server (AlreadyExists): error when creating "/etc/node-gather-ds.yaml": daemonsets.apps "node-gather-daemonset" already exists
[must-gather-6cn6j] OUT gather logs unavailable: unexpected EOF
[must-gather-6cn6j] OUT waiting for gather to complete
[must-gather-6cn6j] OUT gather never finished: timed out waiting for the condition
[must-gather ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-fm9bw deleted
[must-gather ] OUT namespace/openshift-must-gather-h2w25 deleted
error: gather never finished for pod must-gather-6cn6j: timed out waiting for the condition

Expected results:
1. must-gather finishes successfully.
2. dest-dir is created.
3. dest-dir is filled with collected log data.

Additional info:
1. No error is seen in the must-gather pod description.
2. I saw error messages in the output about node-gather, so I collected the following:

[cnv-qe-jenkins@cnv-executor-bm-rhel ~]$ oc get all -n node-gather
NAME                              READY   STATUS    RESTARTS   AGE
pod/node-gather-daemonset-cfhqt   1/1     Running   0          8d
pod/node-gather-daemonset-ljlrg   1/1     Running   0          21h

NAME                                   DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/node-gather-daemonset   2         2         2       2            2           <none>          34d

NAME                                              AGE   PHASE
cdi.cdi.kubevirt.io/cdi-kubevirt-hyperconverged   3d    Deployed

NAME                               AGE
cdiconfig.cdi.kubevirt.io/config   3d

NAME                                                               AGE
vmimportconfig.v2v.kubevirt.io/vmimport-kubevirt-hyperconverged   3d

I don't know if it's supposed to be this way, but the node-gather namespace and its resources remain after must-gather finishes (maybe these are leftovers from a previous run?)
I don't understand why this bug got targeted to 2.5. If customers cannot collect logs, this is a major issue. Targeting to 2.4 until we get a proper analysis.
Adding the description of the must-gather pod (before the action fails and the pod is removed):

[cnv-qe-jenkins@cnv-executor-bm-rhel ~]$ oc describe pod -n openshift-must-gather-b4w7z
Name:         must-gather-lm6nn
Namespace:    openshift-must-gather-b4w7z
Priority:     0
Node:         cnv-qe-04.cnvqe.lab.eng.rdu2.redhat.com/10.9.96.43
Start Time:   Sun, 26 Jul 2020 12:31:48 +0000
Labels:       app=must-gather
Annotations:  k8s.v1.cni.cncf.io/network-status:
                [{
                    "name": "openshift-sdn",
                    "interface": "eth0",
                    "ips": [
                        "10.128.0.190"
                    ],
                    "default": true,
                    "dns": {}
                }]
              k8s.v1.cni.cncf.io/networks-status:
                [{
                    "name": "openshift-sdn",
                    "interface": "eth0",
                    "ips": [
                        "10.128.0.190"
                    ],
                    "default": true,
                    "dns": {}
                }]
Status:       Pending
IP:           10.128.0.190
IPs:
  IP:  10.128.0.190
Init Containers:
  gather:
    Container ID:  cri-o://071fa2eac56ec2cd12fd86990be28ee2523bf2cf842cefce454a9b7447cfed1d
    Image:         registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-cnv-must-gather-rhel8:v2.4.0
    Image ID:      registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-cnv-must-gather-rhel8@sha256:031e21c6597e597f11d443b06425a3b2059d1af61f345c26c8ed8d3227645315
    Port:          <none>
    Host Port:     <none>
    Command:
      /usr/bin/gather
    State:          Running
      Started:      Sun, 26 Jul 2020 12:31:51 +0000
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /must-gather from must-gather-output (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-gfgjm (ro)
Containers:
  copy:
    Container ID:
    Image:         registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-cnv-must-gather-rhel8:v2.4.0
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
      -c
      trap : TERM INT; sleep infinity & wait
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /must-gather from must-gather-output (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-gfgjm (ro)
Conditions:
  Type              Status
  Initialized       False
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  must-gather-output:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  default-token-gfgjm:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-gfgjm
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:
Events:
  Type    Reason          Age  From                                              Message
  ----    ------          ---- ----                                              -------
  Normal  Scheduled       58s  default-scheduler                                 Successfully assigned openshift-must-gather-b4w7z/must-gather-lm6nn to cnv-qe-04.cnvqe.lab.eng.rdu2.redhat.com
  Normal  AddedInterface  56s  multus                                            Add eth0 [10.128.0.190/23]
  Normal  Pulled          56s  kubelet, cnv-qe-04.cnvqe.lab.eng.rdu2.redhat.com  Container image "registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-cnv-must-gather-rhel8:v2.4.0" already present on machine
  Normal  Created         55s  kubelet, cnv-qe-04.cnvqe.lab.eng.rdu2.redhat.com  Created container gather
  Normal  Started         55s  kubelet, cnv-qe-04.cnvqe.lab.eng.rdu2.redhat.com  Started container gather
Nelly, I'm not sure how we plan to release this - v2.4.0 is a floating tag, so pulling it (docker pull) should get you the latest. However, let's say we release image X as 2.4.0, and a day later we release Y as 2.4.0. If the user runs must-gather (2.4.0) between the two releases, it will continue to use X even after we released Y. For QE I think it's best to go to the errata, grab the version that is attached there, and use it directly.

There's another issue with MG where the pod sometimes fails to initialize, but that's probably a different bug (even though the symptom, unexpected EOF, is identical).
Yuval managed to find the issue, and the problem is how the gather and its pod are handled in the OpenShift code:
- The pod's init interval (i.e. the interval for checking whether the must-gather init container is in the Running state) is 10 minutes.
- The pod's timeout (the timeout for the must-gather pod to be running) is 1 minute.

This setup makes no sense - the init interval should be smaller than the timeout. So Yuval is suggesting a patch where the init interval is 5 seconds; see the sketch after this comment. This way the init container is sampled every 5 seconds, and within 1 minute both the init container is expected to finish and the entire pod (the pod's copy container, to be exact) is expected to be running. He has taken it up with the OpenShift must-gather maintainer.

@Yuval - please correct me if I missed or distorted anything. And thanks for your work on this!
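For reference, a minimal sketch (not the actual oc source) of the kind of polling logic involved, using the upstream Kubernetes wait helper. The isGatherInitContainerRunning helper and the way the values are wired up are illustrative assumptions; only the 10-minute/1-minute figures come from the analysis above.

package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// isGatherInitContainerRunning is a hypothetical condition standing in for
// whatever oc uses to check that the must-gather init container is Running.
func isGatherInitContainerRunning() (bool, error) {
	// ... query the pod status via the API server here ...
	return false, nil
}

func main() {
	// If the poll interval (10 minutes, per the analysis above) is larger
	// than the timeout (1 minute), the condition is effectively checked only
	// once before the wait gives up - hence "timed out waiting for the condition".
	interval := 10 * time.Minute
	timeout := 1 * time.Minute

	if err := wait.PollImmediate(interval, timeout, isGatherInitContainerRunning); err != nil {
		fmt.Println("gather never finished:", err)
	}

	// Lowering the interval (e.g. to a few seconds) lets the condition be
	// re-checked many times within the same timeout window.
}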
Well, I was only half correct - we don't need a patch; the timeout value is actually the retry value I was looking for. So, bottom line, we need 2 things:
1. Make sure we're running the latest image (v2.4.0-73).
2. Add a smaller timeout (--timeout=5 seconds should be enough):

oc adm must-gather --image=url-to-latest-v2.4.0-73 --timeout=5
I have run the CNV d/s must-gather tests using the cnv-must-gather image v2.4.0 and timeout=5 (seconds), i.e.:

oc adm must-gather --image=registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-cnv-must-gather-rhel8:v2.4.0-73 --dest-dir=/tmp/pytest-of-cnv-qe-jenkins/pytest-7/must_gather0 --timeout=5

I ran this on both of the discussed BMs, and on both must-gather passed successfully (some specific tests failed, but I checked the failures and they were not related to the data gathering; in any case, the data gathering completed successfully).
Thank you @Yossi, that is good news. Should we make this a known issue in our docs with the workaround of setting the timeout, and in addition open a (linked) OCP bug to handle this?
IMO, adding "timeout=5" is a legitimate workaround to be indicated in the doc. I am not sure how legitimate it is to instruct the user about the other workaround - calling an explicit version build (e.g. v2.4.0-73) - but that's not for me to decide. I'll open 2 bugs for OCP, as the interval and the pull policy are 2 different issues.
Thanks, can you please fill in the doc text box?
Clearing the needinfo flag, as it was raised to notify me about the need to fill in the doc text, which I have already done.
Small update: we issued a PR to use exponential backoff [1], but it was abandoned in favor of [2], which simply lowers the interval to 3 seconds and should solve this as well. The fix was merged, addresses both this bug and bz#1866399, and will go into OCP 4.6. For 4.5 we should continue using the workaround mentioned in comment 10. A sketch of the two approaches follows below.

[1] https://github.com/openshift/oc/pull/523
[2] https://github.com/openshift/oc/pull/527
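For context, a minimal sketch of the two approaches discussed above, using the upstream wait helpers. The podIsRunning condition and the concrete backoff values are illustrative assumptions, not the actual oc code; only the 3-second interval is taken from the merged PR [2].

package main

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// podIsRunning is a hypothetical condition; in oc it would inspect the
// must-gather pod's status.
func podIsRunning() (bool, error) {
	return false, nil
}

func main() {
	// Abandoned approach [1]: exponential backoff, starting small and
	// growing the delay between retries.
	backoff := wait.Backoff{
		Duration: 2 * time.Second, // initial delay (illustrative value)
		Factor:   2.0,             // double the delay on each retry
		Steps:    10,              // maximum number of retries
	}
	_ = wait.ExponentialBackoff(backoff, podIsRunning)

	// Merged approach [2]: a fixed, short interval (3 seconds) so the pod
	// state is re-checked many times within the overall timeout.
	_ = wait.PollImmediate(3*time.Second, 1*time.Minute, podIsRunning)
}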
The pull policy will not be changed to Always [1], but the timeout issue was updated [2] to fit our needs. Can we please check this one again with the latest oc binary from 4.6?

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1866380#c7
[2] https://github.com/openshift/oc/pull/545
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Virtualization 2.4.3 Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4928