Created attachment 1591523 [details]
must-gather failure log (number 1)

Description of problem:
The cascading processes started by the command `oc adm must-gather` time out after an hour. This becomes a problem for clusters that have been up and running for more than 30 days, as the volume of logs from the API server, audit logs, nodes, etc. takes too much time to collect.

Version-Release number of selected component (if applicable):
Client Version: version.Info{Major:"4", Minor:"1+", GitVersion:"v4.1.4-201906271212+6b97d85-dirty", GitCommit:"6b97d85", GitTreeState:"dirty", BuildDate:"2019-06-27T18:11:21Z", GoVersion:"go1.11.6", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"13+", GitVersion:"v1.13.4+c62ce01", GitCommit:"c62ce01", GitTreeState:"clean", BuildDate:"2019-06-27T18:14:14Z", GoVersion:"go1.11.6", Compiler:"gc", Platform:"linux/amd64"}

This is for cluster ID 0d3a3386-76c4-4ee2-ad97-b9cf45d10d64. Attached are two failure logs.
Created attachment 1591524 [details]
Second must-gather failure log
Some notes from browsing the logs:
• Elapsed time: 60 minutes 41 seconds.
• Never got to collecting audit or service logs.
• The message "round_tripper.go:58] CancelRequest not implemented by *rest.tokenSourceTransport" appears 27 times and seems to account for 37 of the 60 minutes that elapsed in the log.
There is an open upstream PR that would help by making the tokenSourceTransport cancelable: https://github.com/kubernetes/kubernetes/pull/71757
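For context, here is a minimal Go sketch (illustration only, not the code from that PR) of the optional CancelRequest interface that wrapping round trippers are expected to forward. The warning quoted above is logged when a wrapper such as tokenSourceTransport does not implement it, so an in-flight request cannot be interrupted early and the caller ends up waiting out the full timeout. The type names below are hypothetical.

    package main

    import (
        "fmt"
        "net/http"
    )

    // requestCanceler is the optional interface a wrapping transport can
    // forward to; without it, the client cannot cancel an in-flight request
    // and logs a warning like the one quoted above.
    type requestCanceler interface {
        CancelRequest(*http.Request)
    }

    // cancelableTransport is a hypothetical wrapper showing the usual
    // delegation pattern: pass RoundTrip through to a base transport and
    // forward CancelRequest when the base supports it.
    type cancelableTransport struct {
        base http.RoundTripper
    }

    func (t *cancelableTransport) RoundTrip(req *http.Request) (*http.Response, error) {
        return t.base.RoundTrip(req)
    }

    // CancelRequest forwards cancellation to the wrapped transport when possible.
    func (t *cancelableTransport) CancelRequest(req *http.Request) {
        if rc, ok := t.base.(requestCanceler); ok {
            rc.CancelRequest(req)
            return
        }
        fmt.Printf("CancelRequest not implemented by %T\n", t.base)
    }

    func main() {
        // Wrap the default transport; client-go layers its token-source and
        // debugging wrappers over a base transport in a similar way.
        client := &http.Client{Transport: &cancelableTransport{base: http.DefaultTransport}}
        _ = client
    }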
An OCP 4.2 Deferred Bugs Review was conducted by PE and OK. Comment for this bug: without a working must-gather, it would be hard for Engineering to debug a customer cluster once it hits serious issues.
Created a Jira RFE; closing.