Bug 1755714 - CNV must-gather fails on timeout - must-gather PodInitializing
Summary: CNV must-gather fails on timeout - must-gather PodInitializing
Keywords:
Status: CLOSED DUPLICATE of bug 1784348
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: oc
Version: unspecified
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: 4.4.0
Assignee: Maciej Szulik
QA Contact: zhou ying
URL:
Whiteboard:
Duplicates: 1755713
Depends On:
Blocks:
 
Reported: 2019-09-26 06:35 UTC by Ruth Netser
Modified: 2020-01-07 12:19 UTC
CC: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-01-07 12:19:10 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
must-gather output (36.42 KB, text/plain)
2019-09-26 06:37 UTC, Ruth Netser

Description Ruth Netser 2019-09-26 06:35:06 UTC
Description of problem:


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Ruth Netser 2019-09-26 06:37:04 UTC
CNV must-gather fails on timeout.
See attached log.

This behavior is inconsistent.
See attached must-gather output.

Comment 2 Ruth Netser 2019-09-26 06:37:45 UTC
Created attachment 1619364 [details]
must-gather output

Comment 3 Ruth Netser 2019-09-26 07:00:39 UTC
*** Bug 1755713 has been marked as a duplicate of this bug. ***

Comment 4 Piotr Kliczewski 2019-09-26 07:35:47 UTC
When there are issues with the machines running nodes, it seems that the scheduler is not able to schedule the pod on a node. When the must-gather pod stays in PodInitializing, we are unable to gather the information that would help us understand what is wrong with the cluster.

Comment 5 Ruth Netser 2019-09-26 08:03:08 UTC
In terms of environment:
3 masters + 3 workers
One worker (host-172-16-0-14) was in NotReady state (no ssh, ping worked).

Comment 6 Maciej Szulik 2019-09-26 10:20:31 UTC
must-gather relies on basic cluster functionality being available; it helps identify and resolve problems, but when the cluster is unable to run any pod it won't work.
The problem is not with the scheduler but with the cluster, which is apparently having a lot of problems. I'm closing this issue, since it does not provide any valuable
information needed to further debug the problem.

Comment 7 Piotr Kliczewski 2019-09-26 10:39:38 UTC
In most cases must-gather will be run when there are problems with the cluster. It is intended to be used to collect diagnostic information. We can't say that we are unable to collect because the cluster is broken. We know the cluster is broken in the first place, so we want to collect the information.

It is possible that the daemonset we use causes the timeout, since it is not started on all of the nodes. Would this provide sufficient data to continue working on this BZ?

Comment 8 Maciej Szulik 2019-09-26 10:49:57 UTC
When we can't run a pod on the cluster, must-gather is of no use. You could fall back to running the binary directly,
which would try to scrape at least metadata from the cluster without requiring the image to run on the cluster.

Comment 9 Piotr Kliczewski 2019-09-26 10:57:00 UTC
Do we have it documented how to fall back? Do support and customers know how to do it? We can't tell people that because the cluster is broken, diagnosing the broken cluster is not possible :)

Comment 10 Piotr Kliczewski 2019-09-26 11:10:39 UTC
What do you think about implementing a best-effort approach where we do not fail the collection, but if a timeout occurs we continue and provide whatever we were able to collect? Collecting some of the data is much better than no data at all.

Comment 11 Maciej Szulik 2019-09-26 11:41:31 UTC
(In reply to Piotr Kliczewski from comment #9)
> Do we have it documented how to fall back? Is support and customers know how
> to do it? We can't say that cluster is broken so diagnostic of broken
> cluster is not possible :)

I don't think so. You can extract that binary from the image, but the reason we are shipping an image is because the image entry point is
a script (wrapper) which invokes the actual binary under the hood.

(In reply to Piotr Kliczewski from comment #10)
> What do you think to implement best effort approach where we do not not fail
> collection but if timeout occurs we continue and provide anything we were
> able to collect. We could collect some of the data which is much better than
> no data at all.

That's how the current must-gather works.

Comment 12 Piotr Kliczewski 2019-09-26 12:14:01 UTC
(In reply to Maciej Szulik from comment #11)
> 
> (In reply to Piotr Kliczewski from comment #10)
> > What do you think to implement best effort approach where we do not not fail
> > collection but if timeout occurs we continue and provide anything we were
> > able to collect. We could collect some of the data which is much better than
> > no data at all.
> 
> That's how the current must-gather works.

As far as I understand, the logs were not copied after the timeout. Ruth, do I understand it correctly?

Comment 13 Ruth Netser 2019-09-26 12:23:39 UTC
(In reply to Piotr Kliczewski from comment #12)
Correct; nothing was gathered.

Comment 14 Piotr Kliczewski 2019-09-26 12:42:58 UTC
(In reply to Ruth Netser from comment #13)
> (In reply to Piotr Kliczewski from comment #12)
> Correct; nothing was gathered.

Maciej, in this case it seems to be a bug. We should ignore the timeout and attempt to rsync the collected files.
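
The best-effort behavior proposed above can be sketched as a pattern (a hypothetical illustration in Python, not the actual oc implementation): run the gather step in the background, and when the timeout expires, copy whatever has been collected instead of aborting.

```python
import threading
import time

def best_effort_gather(gather_fn, copy_fn, timeout):
    """Run gather_fn in the background for at most `timeout` seconds,
    then call copy_fn regardless, so partial output is never discarded."""
    worker = threading.Thread(target=gather_fn, daemon=True)
    worker.start()
    worker.join(timeout)
    timed_out = worker.is_alive()  # gather is still hung (e.g. on a NotReady node)
    # Copy whatever was collected so far, even if the gather timed out.
    return copy_fn(), timed_out

# Simulated gather that collects one file, then hangs on a bad node.
collected = []
def slow_gather():
    collected.append("nodes.yaml")
    time.sleep(10)  # simulates waiting forever on the NotReady worker
    collected.append("pods.yaml")

results, timed_out = best_effort_gather(slow_gather, lambda: list(collected), timeout=0.5)
```

Here `results` would hold the partial data (`["nodes.yaml"]`) and `timed_out` would be `True`; the rsync step mentioned in this comment would play the role of `copy_fn`.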

Comment 15 Maciej Szulik 2019-09-26 13:49:46 UTC
I don't see any errors in the log showing that must-gather failed, but I'll do some tests locally.
I'll also update the docs about how to react when you can't run the must-gather image.

Comment 16 Tareq Alayan 2019-12-12 08:12:28 UTC
must-gather is the tool for collecting data when the cluster is in bad condition. In case the cluster can't spawn a pod and run must-gather, must-gather should have some fallback and the ability to collect something, to give some hints about what is going wrong in the cluster.
As a customer, I would run must-gather only when the cluster is in a bad state, and if it can't run, I don't think it will give the customer a good experience of the product.

Comment 17 Avram Levitter 2019-12-31 06:36:33 UTC
I have an issue open in the openshift/oc repository (https://github.com/openshift/oc/issues/203) to address timeouts in the execution of must-gather.
I think that if there's an issue that means a pod can't execute at all, then must-gather is the wrong tool to use in that situation. There should be something that can run purely from the oc tool (with a valid configuration) that can be used to diagnose the cluster in that situation.

Comment 18 Dan Kenigsberg 2020-01-06 07:10:07 UTC
> I have an issue open in the openshift/oc repository: https://github.com/openshift/oc/issues/203 to address timeouts in the execution of must-gather.

Avram, please attempt to write a PR to solve this issue; I don't think that anyone would do this for us.

Comment 19 Maciej Szulik 2020-01-07 12:19:10 UTC

*** This bug has been marked as a duplicate of bug 1784348 ***

