Description of problem:
Some 4.5.7 clusters I have tried to collect a must-gather on have failed.

Short summary of client output of warnings and errors:

WARNING: cannot use rsync: rsync not available in container
WARNING: cannot use tar: tar not available in container
[must-gather-87vsw] OUT gather output not downloaded: No available strategies to copy.
[must-gather-87vsw] OUT error: unable to download output from pod must-gather-87vsw: No available strategies to copy.

Version-Release number of selected component (if applicable):
4.5.7

How reproducible:
100% every time when trying to collect output for https://bugzilla.redhat.com/show_bug.cgi?id=1876919

Steps to Reproduce:
1. SD App SRE team provision OSD cluster on 4.5.7
2. Wait for failure seen in https://bugzilla.redhat.com/show_bug.cgi?id=1876919 (seems to be maybe 30%?)
3. Break glass to login as system:admin
4. oc adm must-gather

Actual results:
Fails to download any output. Get an empty must-gather.

Expected results:
Output is downloaded.

Additional info:
If I try --keep and use rsh or exec I see the following:

$ oc rsh must-gather-87vsw
error: error sending request: Post https://api.domain.whatever:6443/api/v1/namespaces/openshift-must-gather-fwtzv/pods/must-gather-87vsw/exec?command=%2Fbin%2Fsh&command=-c&command=TERM%3D%22screen.xterm-256color%22+%2Fbin%2Fsh&container=copy&stdin=true&stdout=true&tty=true: EOF

$ oc exec must-gather-87vsw -- pwd
error: error sending request: Post https://api.domain.whatever:6443/api/v1/namespaces/openshift-must-gather-fwtzv/pods/must-gather-87vsw/exec?command=pwd&container=copy&stderr=true&stdout=true: EOF
Note that I am able to run `oc cluster-info dump --all-namespaces --output-directory=somewhere`, so I can get something to file on the related BZ, but it is possibly missing things that would be useful.
From comment 2's private --v=8 must gather:

I0909 16:25:59.761590 66867 util.go:26] error: error sending request: Post https://api.fastt02.i8v0.p1.openshiftapps.com:6443/api/v1/namespaces/openshift-must-gather-8d5bx/pods/must-gather-pblf7/exec?command=rsync&command=--version&container=copy&stderr=true&stdout=true: EOF
I0909 16:25:59.761599 66867 copy_multi.go:30] Error output: WARNING: cannot use rsync: rsync not available in container

So I expect this is:

1. oc tried to talk to the API server, got an EOF.
2. oc is not distinguishing between "failed in the asking" and "successful ask confirmed no $COMMAND".

I expect the fix is either or both of:

a. Teach 'oc' to retry some reasonable number of times if it gets an EOF or other retry-able error.
b. Teach 'oc' to report "failed in the asking" errors with something that is distinct from "successful ask confirmed no $COMMAND".

And probably also fixing, via a separate bug, whatever is causing the API to EOF these execs.
Based on the previous comment, where we're talking about improving must-gather, I'm moving this to 4.7: although it is important, it's not a 4.6 blocker.
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.
@mfojtik any update on this? Don't want it to disappear, I see you're tagged for needsinfo.
The LifecycleStale keyword was removed because the bug got commented on recently. The bug assignee was notified.
@mfojtik still pending info and tagged upcoming sprint. ETA on this work? Still hitting this periodically.
Naveen, there's an open PR improving this situation in https://github.com/openshift/oc/pull/631. You can also track https://bugzilla.redhat.com/show_bug.cgi?id=1888192
*** This bug has been marked as a duplicate of bug 1888192 ***