Bug 1784348 - [OCPv4.2] "oc adm must-gather" "timed out waiting for the condition" in a bare metal installation
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: oc
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.4.0
Assignee: Sally
QA Contact: zhou ying
URL:
Whiteboard:
Duplicates: 1755714
Depends On:
Blocks:
Reported: 2019-12-17 09:54 UTC by Angelo Gabrieli
Modified: 2023-09-07 21:17 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-04 11:20:24 UTC
Target Upstream Version:
Embargoed:
agabriel: needinfo-


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHBA-2020:0581 | 0 | None | None | None | 2020-05-04 11:20:58 UTC

Description Angelo Gabrieli 2019-12-17 09:54:05 UTC
Description of problem:
Running the "oc adm must-gather" command in an OCPv4.2 bare metal environment ends with the following error, and no data is collected:
"error: gather never finished for pod must-gather-xxxxxx: timed out waiting for the condition"


Version-Release number of selected component (if applicable):
This behavior was tested against these versions:

[root@upi-0 ~]# oc version
Client Version: openshift-clients-4.2.2-201910250432
Server Version: 4.2.10
Kubernetes Version: v1.14.6+17b1cc6
[root@upi-0 ~]#

[root@upi-0 ~]#  oc version
Client Version: openshift-clients-4.2.2-201910250432
Server Version: 4.2.9
Kubernetes Version: v1.14.6+20e2756
[root@upi-0 ~]#


How reproducible:
In an OCPv4.2 bare metal installation the command always fails; in an OCPv4.2 public cloud installation (AWS, for example), the "oc adm must-gather" command always finishes successfully.


Steps to Reproduce:
1. Install an OCPv4.2 bare metal cluster.
2. Run the "oc adm must-gather" command.


Actual results:
The "oc adm must-gather" command never finishes successfully. The error is:

"error: gather never finished for pod must-gather-xxxxxx: timed out waiting for the condition"


Expected results:
The "oc adm must-gather" command completes successfully.


Additional info:

Comment 3 Stephen Cuppett 2019-12-17 12:07:04 UTC
Setting target release to 4.4 to perform investigation on the active development branch (will be re-set/cloned where fixes & backports, if any, are required).

Comment 4 Maciej Szulik 2019-12-19 12:31:04 UTC
The problem seems to be with running the gather script inside the pod. From what I see in the logs, we start streaming
the logs from the gather container and get these entries:

[must-gather-ns7t5] POD 2019/12/16 16:34:46 Finished successfully with no errors.
[must-gather-ns7t5] POD 2019/12/16 16:34:46 Gathering data for ns/openshift-cluster-version...
[must-gather-ns7t5] POD 2019/12/16 16:34:46     Collecting resources for namespace "openshift-cluster-version"...
[must-gather-ns7t5] POD 2019/12/16 16:34:46     Gathering pod data for namespace "openshift-cluster-version"...
[must-gather-ns7t5] POD 2019/12/16 16:34:46         Gathering data for pod "cluster-version-operator-7487688fbb-x7jhk"
[must-gather-ns7t5] POD 2019/12/16 16:34:46         Skipping container endpoint collection for pod "cluster-version-operator-7487688fbb-x7jhk" container "cluster-version-operator": No ports
[must-gather-ns7t5] POD 2019/12/16 16:34:52 Finished successfully with no errors.
...

But in the middle of streaming we get an EOF:

[must-gather-ns7t5] OUT gather logs unavailable: unexpected EOF

Next, we wait for the main container to be running, but since the gather never finished or
was interrupted, the main container never executes and we time out. It would be good to run must-gather
with the --keep flag and see why the pod is stuck in the init container trying to run the gather script.
There are a few possible explanations:
1. the gather script takes too long because the amount of data is too big
2. there's a network glitch that prevents further analysis from happening.

For the 2nd, we've added handling for a few additional failure modes in newer versions. Can you verify whether this succeeds with a newer oc?
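
To check why the pod is stuck in the init container after a run with --keep, something along these lines could be used. This is only a sketch: the function name init_container_states is hypothetical, and the namespace and pod name have to be taken from the kept must-gather run.

```shell
# Hypothetical helper: print the state of each init container in a pod,
# to see whether the gather step is waiting, running, or terminated.
# Usage: init_container_states <namespace> <pod-name>
init_container_states() {
  local ns="$1" pod="$2"
  # One line per init container: name, a tab, then its state object.
  oc get pod "$pod" -n "$ns" \
    -o jsonpath='{range .status.initContainerStatuses[*]}{.name}{"\t"}{.state}{"\n"}{end}'
}
```

For example: init_container_states <the must gather ns> must-gather-ns7t5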

Comment 5 Angelo Gabrieli 2019-12-19 17:09:00 UTC
I've repeated the "oc adm must-gather" command with a newer "oc" version, but it still fails with the same error:

"error: gather never finished for pod must-gather-XXXXX: timed out waiting for the condition"

You can find the logs in the attachment.

# oc version
Client Version: openshift-clients-4.2.2-201910250432-4-g4ac90784
Server Version: 4.2.9
Kubernetes Version: v1.14.6+20e2756
#
# oc adm must-gather --keep --loglevel=10
#

Comment 7 Maciej Szulik 2019-12-23 14:17:36 UTC
After you've executed oc adm must-gather with the --keep flag, can I get a dump of all resources (oc get po,events -n <the must gather ns>)?
Additionally, try manually invoking oc logs against the must-gather pod.
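
The two requests above could be scripted roughly as follows. This is a minimal sketch rather than an official procedure: the function name dump_must_gather_ns and the output file names are hypothetical, and the namespace and pod name must be taken from the kept must-gather run.

```shell
# Hypothetical helper: collect pods, events, and pod logs from a
# must-gather namespace that was preserved with --keep.
# Usage: dump_must_gather_ns <namespace> <pod-name> [output-dir]
dump_must_gather_ns() {
  local ns="$1" pod="$2" out="${3:-.}"
  if [ -z "$ns" ] || [ -z "$pod" ]; then
    echo "usage: dump_must_gather_ns <namespace> <pod-name> [output-dir]" >&2
    return 2
  fi
  mkdir -p "$out" || return 1
  # Pods and events in the must-gather namespace, as full YAML.
  oc get po,events -n "$ns" -o yaml > "$out/resources.yaml" || return 1
  # Logs from all containers of the must-gather pod.
  oc logs "$pod" -n "$ns" --all-containers > "$out/pod.log"
}
```

For example: dump_must_gather_ns <the must gather ns> must-gather-ns7t5 ./must-gather-debug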

Comment 8 Angelo Gabrieli 2020-01-02 15:16:45 UTC
Hello team,

it seems that the issue is no longer present in freshly installed OCPv4.2.9 or OCPv4.2.10 clusters with the same "oc" client versions:


[root@upi-0 must-gather.local.8900911122555827191]# oc version
Client Version: openshift-clients-4.2.2-201910250432
Server Version: 4.2.9
Kubernetes Version: v1.14.6+20e2756
[root@upi-0 must-gather.local.8900911122555827191]#

[root@upi-0 must-gather.local.6903024444836690570]# oc version
Client Version: openshift-clients-4.2.2-201910250432-4-g4ac90784
Server Version: 4.2.10
Kubernetes Version: v1.14.6+17b1cc6
[root@upi-0 must-gather.local.6903024444836690570]#


The must-gather pod now always generates a dump and finishes successfully.
However, the must-gather pod remains in the "Running" state even after must-gather has finished collecting data.
I've collected all the "must-gather" logs along with a dump of all resources for your review.
Please find them in the attachments.

Comment 9 Angelo Gabrieli 2020-01-02 15:23:27 UTC
Created attachment 1649213
"oc adm must-gather --loglevel=10 --keep " against an OCPv4.2.10 bare metal env - second successful try

Comment 11 Maciej Szulik 2020-01-07 12:19:10 UTC
*** Bug 1755714 has been marked as a duplicate of this bug. ***

Comment 12 Maciej Szulik 2020-01-07 21:01:57 UTC
Sally, see how far you can go with scraping data out of whatever we've managed to run within the given timeout.
Additionally, try exposing the timeout as a flag with a default of 10 minutes, as today.

Comment 13 hgomes 2020-02-04 18:57:00 UTC
Same issue found on 4.2.16. It was necessary to re-run oc adm must-gather twice.

Comment 16 Maciej Szulik 2020-02-25 15:28:40 UTC
Angelo, do we have confirmation that this is fixed in a newer version?

Comment 18 Maciej Szulik 2020-02-28 16:04:05 UTC
Based on the previous comment I'm moving the bug to qa.

Comment 21 zhou ying 2020-03-02 07:49:03 UTC
Can't reproduce the issue now with the latest oc client:
[root@dhcp-140-138 ~]# oc version -o yaml 
clientVersion:
  buildDate: "2020-02-28T23:32:38Z"
  compiler: gc
  gitCommit: bc08a48555986f64165555efd2705eff7ef2de81
  gitTreeState: clean
  gitVersion: 4.4.0-202002282323-bc08a48
  goVersion: go1.13.4
  major: ""
  minor: ""
  platform: linux/amd64

Comment 23 errata-xmlrpc 2020-05-04 11:20:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

