Description of problem:
CNV must-gather fails on timeout. This behavior is inconsistent. See the attached log and must-gather output.

Version-Release number of selected component (if applicable):

How reproducible:
Inconsistent.

Steps to Reproduce:
1.
2.
3.

Actual results:
must-gather times out and no data is collected.

Expected results:
must-gather completes and collects diagnostic data.

Additional info:
Created attachment 1619364 [details] must-gather output
*** Bug 1755713 has been marked as a duplicate of this bug. ***
There are issues with the machines running the nodes, and it seems the scheduler is not able to schedule the pod on a node. When the must-gather pod stays in PodInitializing, we are unable to gather the information that would help us understand what is wrong with the cluster.
Environment: 3 masters + 3 workers. One worker (host-172-16-0-14) was in NotReady state (no SSH; ping worked).
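For reference, the node and pod state can be checked from the client with standard oc commands along these lines (the commands are illustrative; only the node name comes from this environment):

  # List nodes and their readiness
  $ oc get nodes -o wide

  # Show conditions, taints, and recent events for the NotReady worker
  $ oc describe node host-172-16-0-14

  # See whether any pods (including the must-gather pod) are stuck on that node
  $ oc get pods --all-namespaces --field-selector spec.nodeName=host-172-16-0-14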
must-gather relies on the basic cluster functionality working; it helps identify and resolve problems, but when the cluster is unable to run any pod it won't work. The problem is not with the scheduler but with the cluster, which is apparently having a lot of problems. I'm closing this issue, since it does not provide any valuable information needed to further debug the problem.
In most cases must-gather will be run when there are problems with the cluster; it is intended to be used to collect diagnostic information. We can't say that we are unable to collect data because the cluster is broken. We know the cluster is broken in the first place, and that is exactly why we want to collect the information. It is possible that the daemonset we use causes the timeout, since it is not started on all of the nodes. Would this provide sufficient data to continue working on this BZ?
When we can't run a pod on the cluster, must-gather is of no use. You could fall back to running the binary directly, which would try to scrape at least metadata from the cluster without requiring the image to run on the cluster.
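As a rough sketch of that client-side fallback: oc adm inspect runs entirely from the client and does not need to schedule a pod on the cluster, so it can collect at least cluster-level metadata. The resources and namespace below are examples, not a prescribed list:

  # Gather cluster-scoped metadata without running any pod on the cluster
  $ oc adm inspect clusteroperators --dest-dir=./inspect-output

  # Inspect a specific namespace, e.g. the one hosting the failing component
  $ oc adm inspect ns/openshift-cnv --dest-dir=./inspect-output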
Do we have it documented how to fall back? Do support and customers know how to do it? We can't say that because the cluster is broken, diagnosing a broken cluster is not possible :)
What do you think about implementing a best-effort approach where we do not fail the collection, but if a timeout occurs we continue and provide whatever we were able to collect? We could collect some of the data, which is much better than no data at all.
(In reply to Piotr Kliczewski from comment #9)
> Do we have it documented how to fall back? Do support and customers know
> how to do it? We can't say that because the cluster is broken, diagnosing a
> broken cluster is not possible :)

I don't think so. You can extract that binary from the image, but the reason we are shipping an image is because the image entry point is a script (wrapper) which invokes the actual binary under the hood.

(In reply to Piotr Kliczewski from comment #10)
> What do you think about implementing a best-effort approach where we do not
> fail the collection, but if a timeout occurs we continue and provide
> whatever we were able to collect? We could collect some of the data, which
> is much better than no data at all.

That's how the current must-gather works.
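For illustration, extracting that wrapper from a must-gather image without running it on the cluster could look roughly like this; the image reference and the path inside the image are assumptions and may differ between images:

  # Copy the gather wrapper out of the image to the local filesystem
  # (image name and in-image path are examples; check the image you actually use)
  $ oc image extract registry.redhat.io/container-native-virtualization/cnv-must-gather-rhel8:latest \
      --path /usr/bin/gather:./extracted/

  # Inspect the extracted script to see which calls it wraps
  $ cat ./extracted/gather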
(In reply to Maciej Szulik from comment #11)
> (In reply to Piotr Kliczewski from comment #10)
> > What do you think about implementing a best-effort approach where we do
> > not fail the collection, but if a timeout occurs we continue and provide
> > whatever we were able to collect? We could collect some of the data,
> > which is much better than no data at all.
>
> That's how the current must-gather works.

As far as I understand, the logs were not copied after the timeout. Ruth, do I understand it correctly?
(In reply to Piotr Kliczewski from comment #12) Correct; nothing was gathered.
(In reply to Ruth Netser from comment #13)
> (In reply to Piotr Kliczewski from comment #12)
> Correct; nothing was gathered.

Maciej, in this case it seems to be a bug. We should ignore the timeout and attempt to rsync the collected files.
I don't see any errors in the log showing that must-gather failed, but I'll do some tests locally. I'll also update the docs on how to react when you can't run the must-gather image.
must-gather is the tool to collect data when the cluster is in a bad condition. In case the cluster can't spawn a pod and run must-gather, must-gather should have some fallback and the ability to collect something, to give some hints about what is going wrong in the cluster. As a customer, I would run must-gather only when the cluster is in a bad state, and if it can't run, I think it won't give the customer a good experience of the product.
I have an issue open in the openshift/oc repository to address timeouts in the execution of must-gather: https://github.com/openshift/oc/issues/203. I think that if a pod can't execute at all, then must-gather is the wrong tool to use in that situation. There should be something that can run purely from the oc tool (with a valid configuration) that can be used to diagnose the cluster in that situation.
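As a sketch of how a bounded run could look once timeouts are handled in oc: newer oc releases add a --timeout flag to oc adm must-gather, but its availability depends on the oc version, so treat this as an assumption for the version used in this report:

  # Bound the overall must-gather run (--timeout is only present in newer oc releases)
  $ oc adm must-gather --timeout=10m --dest-dir=./must-gather-output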
> I have an issue open in the openshift/oc repository to address timeouts in
> the execution of must-gather: https://github.com/openshift/oc/issues/203.

Avram, please attempt to write a PR to solve this issue; I don't think that anyone would do this for us.
*** This bug has been marked as a duplicate of bug 1784348 ***