Bug 1730815 - oc adm must-gather fails times out after an hour
Summary: oc adm must-gather fails times out after an hour
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: openshift-apiserver
Version: 4.1.z
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.3.0
Assignee: Luis Sanchez
QA Contact: Xingxing Xia
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-07-17 18:44 UTC by Brian 'redbeard' Harrington
Modified: 2019-09-10 15:14 UTC (History)

Fixed In Version:
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-09-10 15:14:00 UTC
Target Upstream Version:
Embargoed:


Attachments
must-gather failure log (number 1) (122.47 KB, text/plain)
2019-07-17 18:44 UTC, Brian 'redbeard' Harrington
second must gather failure log (61.04 KB, application/octet-stream)
2019-07-17 18:45 UTC, Brian 'redbeard' Harrington

Description Brian 'redbeard' Harrington 2019-07-17 18:44:46 UTC
Created attachment 1591523
must-gather failure log (number 1)

Description of problem:

The cascading processes started by the command `oc adm must-gather` time out after an hour. This becomes a problem for clusters that have been up and running for more than 30 days, as the volume of logs across the API server, audit-logs, notes, etc. takes too long to process.


Version-Release number of selected component (if applicable):

Client Version: version.Info{Major:"4", Minor:"1+", GitVersion:"v4.1.4-201906271212+6b97d85-dirty", GitCommit:"6b97d85", GitTreeState:"dirty", BuildDate:"2019-06-27T18:11:21Z", GoVersion:"go1.11.6", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"13+", GitVersion:"v1.13.4+c62ce01", GitCommit:"c62ce01", GitTreeState:"clean", BuildDate:"2019-06-27T18:14:14Z", GoVersion:"go1.11.6", Compiler:"gc", Platform:"linux/amd64"}



This is for cluster id 0d3a3386-76c4-4ee2-ad97-b9cf45d10d64


Attached are two failure logs.

Comment 1 Brian 'redbeard' Harrington 2019-07-17 18:45:47 UTC
Created attachment 1591524
second must gather failure log

Comment 3 Luis Sanchez 2019-08-01 13:39:36 UTC
Some notes from browsing the logs:

• Elapsed time: 60 minutes 41 seconds.
• Never got to collecting audit or service logs.
• This message appears 27 times: 

    round_tripper.go:58] CancelRequest not implemented by *rest.tokenSourceTransport
  
  and seems to be responsible for 37 out of the 60 minutes that elapsed in the log.

Comment 4 Luis Sanchez 2019-08-02 04:25:35 UTC
There is an open upstream PR that would help by making the tokenSourceTransport cancelable:

https://github.com/kubernetes/kubernetes/pull/71757

Comment 5 Xingxing Xia 2019-09-06 03:36:47 UTC
An OCP 4.2 Deferred Bugs Review was conducted by PE and OK. Comment for this bug: without a working must-gather, once a customer cluster hits serious issues, it would be hard for Engineering to debug.

Comment 8 Maciej Szulik 2019-09-10 15:14:00 UTC
Created a Jira RFE; closing.

