1755428 – must-gather fails to run and returns unhelpful error

Summary: must-gather fails to run and returns unhelpful error

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	oc
Sub Component:
Version:	4.1.z
Hardware:	Unspecified
OS:	Linux
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	4.4.0
Assignee:	Sally
QA Contact:	zhou ying
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2019-09-25 13:42 UTC by Timothy Rees
Modified:	2020-05-04 11:14 UTC
CC List:	4 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-05-04 11:13:57 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift oc pull 295	0	None	closed	Bug 1755428: improve error msgs in must-gather in case of must-gather pod not found	2021-02-01 18:50:56 UTC
Red Hat Product Errata	RHBA-2020:0581	0	None	None	None	2020-05-04 11:14:26 UTC

Description Timothy Rees 2019-09-25 13:42:38 UTC

Description of problem:

Running 'oc adm must-gather' fails on a cluster that is in a regressed state.  The error message returned is unhelpful:

[root@int-lb ~]# oc adm must-gather
namespace/openshift-must-gather-kkdw4 created
clusterrolebinding.rbac.authorization.k8s.io/must-gather-c7vt8 created
WARNING: cannot use rsync: rsync not available in container
WARNING: cannot use tar: tar not available in container
clusterrolebinding.rbac.authorization.k8s.io/must-gather-c7vt8 deleted
namespace/openshift-must-gather-kkdw4 deleted
error: No available strategies to copy.


Version-Release number of selected component (if applicable):

Client Version: version.Info{Major:"4", Minor:"1+", GitVersion:"v4.1.0", GitCommit:"0af5e0f8b", GitTreeState:"clean", BuildDate:"2019-05-07T02:42:52Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"13+", GitVersion:"v1.13.4+1c98dff", GitCommit:"1c98dff", GitTreeState:"clean", BuildDate:"2019-09-18T17:06:54Z", GoVersion:"go1.11.13", Compiler:"gc", Platform:"linux/amd64"}


How reproducible:

100%

Steps to Reproduce:
1.  Authenticate to the cluster that needs troubleshooting
2.  Run 'oc adm must-gather' per the troubleshooting docs

Actual results:

namespace/openshift-must-gather-kkdw4 created
clusterrolebinding.rbac.authorization.k8s.io/must-gather-c7vt8 created
WARNING: cannot use rsync: rsync not available in container
WARNING: cannot use tar: tar not available in container
clusterrolebinding.rbac.authorization.k8s.io/must-gather-c7vt8 deleted
namespace/openshift-must-gather-kkdw4 deleted
error: No available strategies to copy.


Expected results:

A more helpful error message to pinpoint why must-gather failed to run.  I suspect the api server is not able to schedule the pods needed.

Additional info:

Running against a newly deployed UPI cluster on VMware.  The cluster was deployed and running fine.  Since upgrading to 4.1.16, the Watchdog, TargetDown and KubeletDown alerts have been firing constantly.  I tried upgrading again when I saw 4.1.17 was available; the cluster upgraded but the issue remains.  I wanted to use must-gather to troubleshoot the cluster issue.

oc get nodes
NAME       STATUS   ROLES    AGE   VERSION
etcd-0     Ready    master   23d   v1.13.4+244797462
etcd-1     Ready    master   23d   v1.13.4+244797462
etcd-2     Ready    master   23d   v1.13.4+244797462
worker-0   Ready    worker   23d   v1.13.4+244797462
worker-1   Ready    worker   23d   v1.13.4+244797462
worker-2   Ready    worker   23d   v1.13.4+244797462
worker-3   Ready    worker   23d   v1.13.4+244797462


oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.17    True        False         5h22m   Cluster version is 4.1.17


    channel: stable-4.1
    clusterID: 8da9f03b-0a2d-4104-be06-e7bd726ba6cb

Comment 1 Timothy Rees 2019-09-26 03:36:44 UTC

Additional information: the must-gather pod runs and immediately terminates:

# date && oc adm must-gather; date    
Thu Sep 26 05:28:12 CEST 2019   
namespace/openshift-must-gather-8lvfz created 
clusterrolebinding.rbac.authorization.k8s.io/must-gather-pphzr created 
WARNING: cannot use rsync: rsync not available in container 
WARNING: cannot use tar: tar not available in container 
clusterrolebinding.rbac.authorization.k8s.io/must-gather-pphzr deleted 
namespace/openshift-must-gather-8lvfz deleted 
error: No available strategies to copy.


# date; oc get pods -n openshift-must-gather-8lvfz -w ; date
Thu Sep 26 05:29:18 CEST 2019
NAME                READY   STATUS            RESTARTS   AGE
must-gather-kck76   0/1     PodInitializing   0          66s
must-gather-kck76   1/1   Running   0     67s
must-gather-kck76   1/1   Terminating   0     73s
must-gather-kck76   1/1   Terminating   0     73s
Thu Sep 26 05:30:26 CEST 2019
[root@int-lb ~]# oc logs must-gather-kck76 -p

Comment 2 Timothy Rees 2019-09-26 14:16:07 UTC

Somehow there were CSR approvals pending.  I don't understand how this happened, since the cluster was working fine before it was upgraded.  Approving the CSRs per [1] resolved the issue and must-gather now runs.

Is there any way to get a more useful error message from must-gather in this scenario?

[1] https://access.redhat.com/solutions/4307511
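
For anyone hitting the same symptom, the workaround above (approving pending CSRs) might look roughly like the sketch below. The helper name pending_csr_names is hypothetical, and it assumes the default table output of 'oc get csr', where the last column (CONDITION) reads exactly "Pending" for unapproved requests. The linked solution [1] remains the authoritative procedure.

```shell
# Hypothetical helper: read `oc get csr` table output on stdin and print
# the NAME column of every request whose last column is exactly "Pending".
# Assumes the default table layout with one header row.
pending_csr_names() {
  awk 'NR > 1 && $NF == "Pending" { print $1 }'
}

# Usage against a live cluster (review the list before approving anything):
#   oc get csr | pending_csr_names
#   oc get csr | pending_csr_names | xargs -r oc adm certificate approve
```

Listing the names first, rather than approving in one step, gives a chance to check that each pending request comes from a node you expect.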

Comment 3 Maciej Szulik 2019-12-02 17:56:46 UTC

This is not going to make 4.3, moving to 4.4

Comment 4 Sally 2020-02-13 01:15:57 UTC

I've looked through the rsync and must-gather code for this.  The must-gather code has been restructured since 4.1 (moved from openshift/origin to openshift/oc as of 4.2).  It's difficult to reproduce this issue.  I do notice that with -v=4 you get more information about errors.

While trying to reproduce this, I found that if I delete the must-gather pod, the command hangs until the timeout (10 min).  I've opened a PR to fix that specifically; overall it will also help in getting more information from failed must-gather runs.
For this bz, however, I suggest running must-gather with a higher log level.  I assume the command also hung for you, since the must-gather pod was terminated?  In that sense, this PR will serve as a fix.

https://github.com/openshift/oc/pull/295
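
A minimal sketch of the higher log level suggestion, assuming you also want to keep the output for later inspection. The wrapper name must_gather_verbose and the default log file name are hypothetical choices for this example; only 'oc adm must-gather' and '-v=4' come from the discussion above.

```shell
# Hypothetical wrapper: run must-gather at log level 4 and keep a copy of
# stdout and stderr in the given file (the default name is arbitrary).
must_gather_verbose() {
  log_file="${1:-must-gather-v4.log}"
  oc adm must-gather -v=4 2>&1 | tee "$log_file"
}

# Usage:
#   must_gather_verbose /tmp/must-gather-debug.log
```

Capturing stderr as well matters here because the warnings and the final error in the original report are the lines most likely to carry the extra detail.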

Comment 6 zhou ying 2020-02-14 02:31:06 UTC

Confirmed with the latest oc client; can't reproduce the issue now:
[root@dhcp-140-138 ~]# oc version -o yaml 
clientVersion:
  buildDate: "2020-02-13T22:50:14Z"
  compiler: gc
  gitCommit: 5d7a12f03389b03b651f963cb5ee8ddfa9cff559
  gitTreeState: clean
  gitVersion: v4.4.0
  goVersion: go1.13.4
  major: ""
  minor: ""
  platform: linux/amd64

Comment 8 errata-xmlrpc 2020-05-04 11:13:57 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581
