Description of problem:

Running 'oc adm must-gather' fails on a cluster which is in a regressed state. The error message returned is unhelpful:

[root@int-lb ~]# oc adm must-gather
namespace/openshift-must-gather-kkdw4 created
clusterrolebinding.rbac.authorization.k8s.io/must-gather-c7vt8 created
WARNING: cannot use rsync: rsync not available in container
WARNING: cannot use tar: tar not available in container
clusterrolebinding.rbac.authorization.k8s.io/must-gather-c7vt8 deleted
namespace/openshift-must-gather-kkdw4 deleted
error: No available strategies to copy.

Version-Release number of selected component (if applicable):

Client Version: version.Info{Major:"4", Minor:"1+", GitVersion:"v4.1.0", GitCommit:"0af5e0f8b", GitTreeState:"clean", BuildDate:"2019-05-07T02:42:52Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"13+", GitVersion:"v1.13.4+1c98dff", GitCommit:"1c98dff", GitTreeState:"clean", BuildDate:"2019-09-18T17:06:54Z", GoVersion:"go1.11.13", Compiler:"gc", Platform:"linux/amd64"}

How reproducible:
100%

Steps to Reproduce:
1. Authenticate to the cluster which needs troubleshooting
2. Run 'oc adm must-gather' per the troubleshooting docs

Actual results:
namespace/openshift-must-gather-kkdw4 created
clusterrolebinding.rbac.authorization.k8s.io/must-gather-c7vt8 created
WARNING: cannot use rsync: rsync not available in container
WARNING: cannot use tar: tar not available in container
clusterrolebinding.rbac.authorization.k8s.io/must-gather-c7vt8 deleted
namespace/openshift-must-gather-kkdw4 deleted
error: No available strategies to copy.

Expected results:
A more helpful error message to pinpoint why must-gather failed to run. I suspect the API server is not able to schedule the pods needed.

Additional info:
Running against a newly deployed UPI cluster on VMware. The cluster was deployed and running fine. Since upgrading to 4.1.16, the Watchdog, TargetDown and KubeletDown alerts are constantly firing.
Tried upgrading again when I saw 4.1.17 was available; the cluster upgraded, but the issue remains the same. I wanted to use must-gather to troubleshoot it.

oc get nodes
NAME       STATUS   ROLES    AGE   VERSION
etcd-0     Ready    master   23d   v1.13.4+244797462
etcd-1     Ready    master   23d   v1.13.4+244797462
etcd-2     Ready    master   23d   v1.13.4+244797462
worker-0   Ready    worker   23d   v1.13.4+244797462
worker-1   Ready    worker   23d   v1.13.4+244797462
worker-2   Ready    worker   23d   v1.13.4+244797462
worker-3   Ready    worker   23d   v1.13.4+244797462

oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.17    True        False        5h22m   Cluster version is 4.1.17

channel: stable-4.1
clusterID: 8da9f03b-0a2d-4104-be06-e7bd726ba6cb
Additional information: the pod runs and immediately terminates:

# date && oc adm must-gather; date
Thu Sep 26 05:28:12 CEST 2019
namespace/openshift-must-gather-8lvfz created
clusterrolebinding.rbac.authorization.k8s.io/must-gather-pphzr created
WARNING: cannot use rsync: rsync not available in container
WARNING: cannot use tar: tar not available in container
clusterrolebinding.rbac.authorization.k8s.io/must-gather-pphzr deleted
namespace/openshift-must-gather-8lvfz deleted
error: No available strategies to copy.

# date; oc get pods -n openshift-must-gather-8lvfz -w; date
Thu Sep 26 05:29:18 CEST 2019
NAME                READY   STATUS            RESTARTS   AGE
must-gather-kck76   0/1     PodInitializing   0          66s
must-gather-kck76   1/1     Running           0          67s
must-gather-kck76   1/1     Terminating       0          73s
must-gather-kck76   1/1     Terminating       0          73s
Thu Sep 26 05:30:26 CEST 2019

[root@int-lb ~]# oc logs must-gather-kck76 -p
Somehow there were pending CSR approvals. I don't understand how this is the case, since the cluster was working fine before it was upgraded. Approving the CSRs per [1] resolved the issue, and must-gather now runs. Is there any way to get a more useful error message from must-gather in this scenario?

[1] https://access.redhat.com/solutions/4307511
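For reference, pending CSRs show up in the CONDITION column of 'oc get csr'. A minimal sketch of picking them out by name (the CSR names and requestors below are made-up fixture data, not from this cluster; against a live cluster you would pipe real 'oc get csr --no-headers' output instead of the fixture, then feed the names to 'oc adm certificate approve'):

```shell
# Fixture mimicking 'oc get csr --no-headers' output (NAME AGE REQUESTOR CONDITION).
# The names/requestors here are hypothetical examples.
cat > /tmp/csr-list.txt <<'EOF'
csr-2s6xp   10m   system:node:worker-0                  Pending
csr-9abcd   20m   system:serviceaccount:node-bootstrapper   Approved,Issued
EOF

# Print the names of CSRs whose last column is exactly "Pending"; piping the
# result to 'xargs oc adm certificate approve' would approve them in one shot.
awk '$NF == "Pending" {print $1}' /tmp/csr-list.txt
# → csr-2s6xp
```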
This is not going to make 4.3, moving to 4.4
I've looked through the rsync and must-gather code for this. The must-gather code has been restructured since 4.1 (it moved from openshift/origin to openshift/oc as of 4.2). It's difficult to reproduce this issue, but I do notice that with -v=4 you get more information about the underlying errors. While trying to reproduce it, I found that if I delete the must-gather pod, the command hangs for the full timeout (10 min). I've opened a PR to fix that specifically, and overall it will aid in getting more information from failed must-gather runs. For this BZ, however, I suggest running must-gather with a higher log level. I assume the command hung for you as well, since the must-gather pod was terminated? In that sense, this PR will serve as a fix. https://github.com/openshift/oc/pull/295
Confirmed with the latest oc client; can't reproduce the issue now:

[root@dhcp-140-138 ~]# oc version -o yaml
clientVersion:
  buildDate: "2020-02-13T22:50:14Z"
  compiler: gc
  gitCommit: 5d7a12f03389b03b651f963cb5ee8ddfa9cff559
  gitTreeState: clean
  gitVersion: v4.4.0
  goVersion: go1.13.4
  major: ""
  minor: ""
  platform: linux/amd64
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0581