Description of problem:
Running "oc adm must-gather" in an OCPv4.2 bare metal environment ends with the following error, without collecting any data:

error: gather never finished for pod must-gather-xxxxxx: timed out waiting for the condition

Version-Release number of selected component (if applicable):
This behavior was tested against these versions:

[root@upi-0 ~]# oc version
Client Version: openshift-clients-4.2.2-201910250432
Server Version: 4.2.10
Kubernetes Version: v1.14.6+17b1cc6

[root@upi-0 ~]# oc version
Client Version: openshift-clients-4.2.2-201910250432
Server Version: 4.2.9
Kubernetes Version: v1.14.6+20e2756

How reproducible:
Always on an OCPv4.2 bare metal installation. On an OCPv4.2 public cloud installation (AWS, for example), "oc adm must-gather" always finishes successfully.

Steps to Reproduce:
1. Install an OCPv4.2 bare metal cluster.
2. Run "oc adm must-gather".

Actual results:
"oc adm must-gather" never finishes successfully. The error is:

error: gather never finished for pod must-gather-xxxxxx: timed out waiting for the condition

Expected results:
"oc adm must-gather" completes successfully.

Additional info:
Setting target release to 4.4 to perform investigation on the active development branch (will be re-set/cloned where fixes & backports, if any, are required).
The problem seems to be with running the gather script inside the pod. From what I see in the logs, we start streaming these entries from the gather container:

[must-gather-ns7t5] POD 2019/12/16 16:34:46 Finished successfully with no errors.
[must-gather-ns7t5] POD 2019/12/16 16:34:46 Gathering data for ns/openshift-cluster-version...
[must-gather-ns7t5] POD 2019/12/16 16:34:46 Collecting resources for namespace "openshift-cluster-version"...
[must-gather-ns7t5] POD 2019/12/16 16:34:46 Gathering pod data for namespace "openshift-cluster-version"...
[must-gather-ns7t5] POD 2019/12/16 16:34:46 Gathering data for pod "cluster-version-operator-7487688fbb-x7jhk"
[must-gather-ns7t5] POD 2019/12/16 16:34:46 Skipping container endpoint collection for pod "cluster-version-operator-7487688fbb-x7jhk" container "cluster-version-operator": No ports
[must-gather-ns7t5] POD 2019/12/16 16:34:52 Finished successfully with no errors.
...

But in the middle of streaming we get an EOF:

[must-gather-ns7t5] OUT gather logs unavailable: unexpected EOF

Next, we wait for the main container to be running, but since the gather never finished or was interrupted, the main container never executes and we time out.

It would be good to run must-gather with the --keep flag and see why the pod is stuck in the init container trying to run the gather script. There are a few possible explanations:

1. The gather script runs that long because the amount of data is too big.
2. A network glitch prevents further analysis from happening.

For the second, we've added handling for a few additional failure modes in newer versions. Can you verify whether this succeeds with a newer oc?
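To make the failure mode described above concrete, here is a minimal sketch of waiting for a condition with a deadline. It is not the actual oc implementation; `wait_for`, `PollTimeout`, and `make_pod` are hypothetical names used only for illustration. If the stand-in "main container" never reports running, the wait ends with the same message seen in this bug.

```python
import time


class PollTimeout(Exception):
    """Raised when the condition does not become true before the deadline."""


def wait_for(condition, timeout_s, interval_s=0.01):
    """Call condition() repeatedly until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if condition():
            return
        if time.monotonic() >= deadline:
            # Same wording as the error reported in this bug.
            raise PollTimeout("timed out waiting for the condition")
        time.sleep(interval_s)


def make_pod(running_after_polls):
    """Hypothetical stand-in for a pod whose main container starts only
    after the init (gather) container has finished.

    running_after_polls=None models a gather step that never finishes.
    """
    state = {"polls": 0}

    def main_container_running():
        state["polls"] += 1
        if running_after_polls is None:
            return False
        return state["polls"] >= running_after_polls

    return main_container_running
```

With `make_pod(3)` the wait returns after three polls; with `make_pod(None)` it raises `PollTimeout` once the deadline passes, which corresponds to the case where the init container is stuck.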
I've repeated the "oc adm must-gather" command with a newer "oc" version, but it always fails with the same error:

error: gather never finished for pod must-gather-XXXXX: timed out waiting for the condition

The logs are attached.

# oc version
Client Version: openshift-clients-4.2.2-201910250432-4-g4ac90784
Server Version: 4.2.9
Kubernetes Version: v1.14.6+20e2756
#
# oc adm must-gather --keep --loglevel=10
#
After you've executed oc adm must-gather with the --keep flag, can I get a dump of all resources (oc get po,events -n <the must gather ns>)? Additionally, try manually invoking oc logs against the must-gather pod.
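The two requests above can be sketched as a small helper that builds the command lines. This is only an illustration: `diagnostic_commands` is a hypothetical helper, the namespace shown is a placeholder, and the container name `gather` is an assumption based on the "gather container" mentioned earlier in this bug.

```python
import shlex


def diagnostic_commands(namespace, pod, container="gather"):
    """Return the argv lists for the two diagnostics requested above.

    namespace and pod must be taken from the cluster after running
    must-gather with --keep; container="gather" is an assumption.
    """
    return [
        # Dump pods and events in the must-gather namespace.
        ["oc", "get", "po,events", "-n", namespace],
        # Manually fetch logs from the must-gather pod.
        ["oc", "logs", pod, "-n", namespace, "-c", container],
    ]


# Placeholder namespace; the pod name is the one from the logs quoted earlier.
for argv in diagnostic_commands("<the must gather ns>", "must-gather-ns7t5"):
    print(" ".join(shlex.quote(arg) for arg in argv))
```

Keeping the commands as argv lists avoids shell-quoting mistakes if they are later passed to `subprocess.run`.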
Hello team, the issue no longer seems to be present on a freshly installed OCPv4.2.9 or OCPv4.2.10 with the same "oc client" versions:

[root@upi-0 must-gather.local.8900911122555827191]# oc version
Client Version: openshift-clients-4.2.2-201910250432
Server Version: 4.2.9
Kubernetes Version: v1.14.6+20e2756

[root@upi-0 must-gather.local.6903024444836690570]# oc version
Client Version: openshift-clients-4.2.2-201910250432-4-g4ac90784
Server Version: 4.2.10
Kubernetes Version: v1.14.6+17b1cc6

The must-gather pod now always generates a dump and finishes successfully. However, the must-gather pod stays in the "Running" state even after must-gather has finished collecting data. I've collected all the "must-gather" logs along with a dump of all resources for your review. Please find them attached.
Created attachment 1649213
"oc adm must-gather --loglevel=10 --keep" against an OCPv4.2.10 bare metal env - second successful try
*** Bug 1755714 has been marked as a duplicate of this bug. ***
Sally, see how far you can get with scraping data out of whatever we've managed to run within the given timeout. Additionally, try exposing the timeout as a flag with a default of 10 minutes, as it is today.
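As a rough sketch of the second suggestion, the snippet below exposes a timeout as a command-line flag that defaults to 10 minutes. It does not show the real oc code: the `--timeout` flag name, the `parse_duration` helper, and the `must-gather-sketch` program name are assumptions made for illustration.

```python
import argparse
import re

# Seconds per supported unit suffix.
_UNITS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text):
    """Parse strings such as '90s', '10m' or '1h30m' into seconds."""
    parts = re.findall(r"(\d+)([smh])", text)
    # Reject input with anything left over besides <number><unit> pairs.
    if not parts or "".join(n + u for n, u in parts) != text:
        raise argparse.ArgumentTypeError(f"invalid duration: {text!r}")
    return sum(int(n) * _UNITS[u] for n, u in parts)


def build_parser():
    """Build a parser with a hypothetical --timeout flag (default 10m)."""
    parser = argparse.ArgumentParser(prog="must-gather-sketch")
    parser.add_argument(
        "--timeout",
        type=parse_duration,
        default=parse_duration("10m"),  # 10 minutes, matching today's value
        help="how long to wait for the gather to finish (e.g. 10m, 1h30m)",
    )
    return parser
```

Running with no arguments keeps the current 10-minute behaviour, while something like `--timeout 30m` would give a large cluster more time to finish gathering.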
Same issue found on 4.2.16. oc adm must-gather had to be re-run twice.
Angelo, do we have confirmation that this is fixed in a newer version?
Based on the previous comment I'm moving the bug to qa.
Can't reproduce the issue now with the latest oc client:

[root@dhcp-140-138 ~]# oc version -o yaml
clientVersion:
  buildDate: "2020-02-28T23:32:38Z"
  compiler: gc
  gitCommit: bc08a48555986f64165555efd2705eff7ef2de81
  gitTreeState: clean
  gitVersion: 4.4.0-202002282323-bc08a48
  goVersion: go1.13.4
  major: ""
  minor: ""
  platform: linux/amd64
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0581