Bug 1859972
| Summary: | CNV must-gather failure on CNV-QE BM environment (BM-RHEL, BM-RHCOS) | | |
|---|---|---|---|
| Product: | Container Native Virtualization (CNV) | Reporter: | Yossi Segev <ysegev> |
| Component: | Logging | Assignee: | Yuval Turgeman <yturgema> |
| Status: | CLOSED ERRATA | QA Contact: | Satyajit Bulage <sbulage> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 2.4.0 | CC: | cnv-qe-bugs, fdeutsch, irose, ncredi, sanchezl, stirabos, yadu, ycui, yturgema |
| Target Milestone: | --- | | |
| Target Release: | 2.4.3 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | cnv-must-gather-container-v2.4.0-73 | Doc Type: | Known Issue |
| Doc Text: | Cause: The default polling interval for the gather pod's init container (600 seconds) does not fit the pod's timeout for reaching the Running state (60 seconds).<br>Consequence: The action times out even though the init container was ready, because its status is not inspected in time.<br>Workaround (if any): Run the command with "--timeout=5"; "timeout" is the init container's polling interval (although the must-gather help documents it as the overall timeout). An example command is shown below the table.<br>Result: Data gathering completes successfully. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-11-04 19:07:13 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1866380, 1866399 | | |
| Bug Blocks: | | | |
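For convenience, the Doc Text workaround looks like this in practice. This is only a sketch: the image reference is the v2.4.0-73 build discussed in the comments below; substitute whichever build applies to your environment.

```shell
# Workaround sketch: a small --timeout makes the gather init container get polled
# early enough; as noted in the Doc Text, this flag effectively acts as the polling interval.
oc adm must-gather \
  --image=registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-cnv-must-gather-rhel8:v2.4.0-73 \
  --timeout=5
```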
Description
Yossi Segev
2020-07-23 12:20:28 UTC
I don't understand why this bug got targeted to 2.5. If customers cannot collect logs, this is a major issue; targeting to 2.4 until we get a proper analysis.

Adding the description of the must-gather pod (before the action fails and the pod is removed):

    [cnv-qe-jenkins@cnv-executor-bm-rhel ~]$ oc describe pod -n openshift-must-gather-b4w7z
    Name:         must-gather-lm6nn
    Namespace:    openshift-must-gather-b4w7z
    Priority:     0
    Node:         cnv-qe-04.cnvqe.lab.eng.rdu2.redhat.com/10.9.96.43
    Start Time:   Sun, 26 Jul 2020 12:31:48 +0000
    Labels:       app=must-gather
    Annotations:  k8s.v1.cni.cncf.io/network-status:
                    [{ "name": "openshift-sdn", "interface": "eth0", "ips": [ "10.128.0.190" ], "default": true, "dns": {} }]
                  k8s.v1.cni.cncf.io/networks-status:
                    [{ "name": "openshift-sdn", "interface": "eth0", "ips": [ "10.128.0.190" ], "default": true, "dns": {} }]
    Status:       Pending
    IP:           10.128.0.190
    IPs:
      IP:  10.128.0.190
    Init Containers:
      gather:
        Container ID:  cri-o://071fa2eac56ec2cd12fd86990be28ee2523bf2cf842cefce454a9b7447cfed1d
        Image:         registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-cnv-must-gather-rhel8:v2.4.0
        Image ID:      registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-cnv-must-gather-rhel8@sha256:031e21c6597e597f11d443b06425a3b2059d1af61f345c26c8ed8d3227645315
        Port:          <none>
        Host Port:     <none>
        Command:
          /usr/bin/gather
        State:          Running
          Started:      Sun, 26 Jul 2020 12:31:51 +0000
        Ready:          False
        Restart Count:  0
        Environment:    <none>
        Mounts:
          /must-gather from must-gather-output (rw)
          /var/run/secrets/kubernetes.io/serviceaccount from default-token-gfgjm (ro)
    Containers:
      copy:
        Container ID:
        Image:         registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-cnv-must-gather-rhel8:v2.4.0
        Image ID:
        Port:          <none>
        Host Port:     <none>
        Command:
          /bin/bash
          -c
          trap : TERM INT; sleep infinity & wait
        State:          Waiting
          Reason:       PodInitializing
        Ready:          False
        Restart Count:  0
        Environment:    <none>
        Mounts:
          /must-gather from must-gather-output (rw)
          /var/run/secrets/kubernetes.io/serviceaccount from default-token-gfgjm (ro)
    Conditions:
      Type              Status
      Initialized       False
      Ready             False
      ContainersReady   False
      PodScheduled      True
    Volumes:
      must-gather-output:
        Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
        Medium:
        SizeLimit:  <unset>
      default-token-gfgjm:
        Type:        Secret (a volume populated by a Secret)
        SecretName:  default-token-gfgjm
        Optional:    false
    QoS Class:       BestEffort
    Node-Selectors:  <none>
    Tolerations:
    Events:
      Type    Reason          Age  From                                              Message
      ----    ------          ---  ----                                              -------
      Normal  Scheduled       58s  default-scheduler                                 Successfully assigned openshift-must-gather-b4w7z/must-gather-lm6nn to cnv-qe-04.cnvqe.lab.eng.rdu2.redhat.com
      Normal  AddedInterface  56s  multus                                            Add eth0 [10.128.0.190/23]
      Normal  Pulled          56s  kubelet, cnv-qe-04.cnvqe.lab.eng.rdu2.redhat.com  Container image "registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-cnv-must-gather-rhel8:v2.4.0" already present on machine
      Normal  Created         55s  kubelet, cnv-qe-04.cnvqe.lab.eng.rdu2.redhat.com  Created container gather
      Normal  Started         55s  kubelet, cnv-qe-04.cnvqe.lab.eng.rdu2.redhat.com  Started container gather

Nelly, I'm not sure how we plan to release this - v2.4.0 is floating, so pulling it (docker pull) should get you the latest. However, let's say that we release image X to 2.4.0, and a day later we release Y to 2.4.0. If the user runs must-gather (2.4.0) between the two releases, it will continue to use X even after we released Y.
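One way to sidestep the floating-tag ambiguity (a hedged illustration; it assumes podman is available on the host) is to pull the tag and check which digest it currently resolves to, or simply to pass the exact build tag from the errata (e.g. v2.4.0-73, as done in the comments below) instead of plain v2.4.0:

```shell
# Pull the floating tag and list digests to see which concrete build it resolves to.
podman pull registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-cnv-must-gather-rhel8:v2.4.0
podman images --digests | grep cnv-must-gather-rhel8
```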
For QE I think it's best to go to the errata, grab the version that is attached, and use it directly. There's another issue with must-gather where the pod sometimes fails to initialize, but that's probably a different bug (even though the symptom, unexpected EOF, is identical).

Yuval managed to find the issue, and the problem is how the gather action and its pod are handled in the OpenShift code:
- The pod's init interval (i.e. the interval for checking whether the must-gather init container is in Running state) is 10 minutes.
- The pod's timeout (the time allowed for the must-gather pod to be running) is 1 minute.
This setup makes no sense - the init interval should be smaller. So Yuval is suggesting a patch where the init interval is 5 seconds. This way the init container is sampled every 5 seconds, and within 1 minute both the init container is expected to finish and the entire pod (the pod's copy container, to be exact) is expected to be running. He has taken this up with the OpenShift must-gather maintainer. @Yuval - please correct me if I missed or distorted anything. And thanks for your work on this!

Well, I was only half correct - we don't need a patch; the timeout value is actually the retry value I was looking for. So, bottom line, we need 2 things:
1. Make sure we're running the latest image (v2.4.0-73).
2. Add a smaller timeout (--timeout=5 seconds should be enough).
oc adm must-gather --image=url-to-latest-v2.4.0-73 --timeout=5

I have run the CNV d/s must-gather tests using the cnv-must-gather image v2.4.0 and timeout=5 (seconds), i.e.
oc adm must-gather --image=registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-cnv-must-gather-rhel8:v2.4.0-73 --dest-dir=/tmp/pytest-of-cnv-qe-jenkins/pytest-7/must_gather0 --timeout=5
I ran on both of the discussed BMs, and on both must-gather passed successfully (some specific tests failed, but I checked the failures and they were not related to the data gathering - and in any case, the data gathering completed successfully).

Thank you @Yossi, that is good news. Should we make this a known issue in our docs with this workaround of setting the timeout, and in addition open a (linked) OCP bug to handle this?

IMO, adding "timeout=5" is a legitimate workaround to be indicated in the doc. I am not sure how legitimate it is to instruct the user about the other workaround - calling an explicit version build (e.g. v2.4.0-73) - but that's not for me to decide. I'll open 2 bugs for OCP, as the interval and the pull-policy are 2 different issues.

Thanks, can you please fill in the doc text box?

Clearing the needinfo flag, as it was raised for me to be notified about the need to fill in the doc text, which I have already done.

Small update: we issued a PR to use exponential backoff [1]; it was abandoned in favor of [2], which simply lowers the interval to 3 seconds and should solve this as well. The fix was merged, will address both this bug and bz#1866399, and will go into OCP 4.6. For 4.5 we should continue using the workaround mentioned in comment 10.
[1] https://github.com/openshift/oc/pull/523
[2] https://github.com/openshift/oc/pull/527

The pull policy will not be changed to always [1], but the timeout issue was updated [2] to fit our needs. Can we please check this one again with the latest oc binary from 4.6?
[1] https://bugzilla.redhat.com/show_bug.cgi?id=1866380#c7
[2] https://github.com/openshift/oc/pull/545

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
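To make the interval-versus-timeout mismatch concrete, here is a rough shell sketch of the behavior described above. It is an illustration of the logic only, not the actual openshift/oc implementation, and the variable names are invented: with a 600-second polling interval and a 60-second budget, the deadline is exceeded before the init container's status is ever re-inspected, so the run fails even though gathering finished long before.

```shell
# Illustration only (not the oc source): poll the gather init container every
# INTERVAL seconds while enforcing a TIMEOUT-second budget for the pod.
INTERVAL=600   # default polling interval (what --timeout actually controls)
TIMEOUT=60     # budget for the pod to reach the Running state
start=$(date +%s)
while true; do
  sleep "$INTERVAL"                      # with the defaults, the first re-check is at t=600s
  if (( $(date +%s) - start > TIMEOUT )); then
    echo "timed out waiting for the gather init container" >&2
    exit 1                               # fails here even if gathering was done at t=3s
  fi
  # The init container status would be inspected here, e.g. (hypothetical check):
  # oc get pod "$POD" -n "$NS" -o jsonpath='{.status.initContainerStatuses[0].ready}'
done
```

Lowering the interval - whether via the --timeout=5 workaround or via the merged fix that drops it to 3 seconds - means the status check lands well inside the budget.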
For information on the advisory (OpenShift Virtualization 2.4.3 Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4928