Created attachment 1591523 [details]
must-gather failure log (number 1)

Description of problem:
The cascading processes started by the command `oc adm must-gather` time out after an hour. This becomes a problem for clusters that have been up and running for more than 30 days, as the volume of logs from the API server, audit logs, nodes, etc. takes too much time to collect.

Version-Release number of selected component (if applicable):
Client Version: version.Info{Major:"4", Minor:"1+", GitVersion:"v4.1.4-201906271212+6b97d85-dirty", GitCommit:"6b97d85", GitTreeState:"dirty", BuildDate:"2019-06-27T18:11:21Z", GoVersion:"go1.11.6", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"13+", GitVersion:"v1.13.4+c62ce01", GitCommit:"c62ce01", GitTreeState:"clean", BuildDate:"2019-06-27T18:14:14Z", GoVersion:"go1.11.6", Compiler:"gc", Platform:"linux/amd64"}

This is for cluster ID 0d3a3386-76c4-4ee2-ad97-b9cf45d10d64. Attached are two failure logs.
Created attachment 1591524 [details]
Second must-gather failure log
Some notes from browsing the logs:
• Elapsed time: 60 minutes 41 seconds.
• Never got to collecting audit or service logs.
• The message "round_tripper.go:58] CancelRequest not implemented by *rest.tokenSourceTransport" appears 27 times and seems to account for 37 of the 60 minutes that elapsed in the log.
There is an open upstream PR that would help by making the tokenSourceTransport cancelable: https://github.com/kubernetes/kubernetes/pull/71757
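For context, here is a minimal Go sketch (illustration only, not the code from that PR) of the optional CancelRequest interface that wrapping round trippers are expected to forward. The warning quoted above is logged when a wrapper such as tokenSourceTransport does not implement it, so an in-flight request cannot be interrupted early and the caller ends up waiting out the full timeout. The type names below are hypothetical.

    package main

    import (
        "fmt"
        "net/http"
    )

    // requestCanceler is the optional interface a wrapping transport can
    // forward to; without it, the client cannot cancel an in-flight request
    // and logs a warning like the one quoted above.
    type requestCanceler interface {
        CancelRequest(*http.Request)
    }

    // cancelableTransport is a hypothetical wrapper showing the usual
    // delegation pattern: pass RoundTrip through to a base transport and
    // forward CancelRequest when the base supports it.
    type cancelableTransport struct {
        base http.RoundTripper
    }

    func (t *cancelableTransport) RoundTrip(req *http.Request) (*http.Response, error) {
        return t.base.RoundTrip(req)
    }

    // CancelRequest forwards cancellation to the wrapped transport when possible.
    func (t *cancelableTransport) CancelRequest(req *http.Request) {
        if rc, ok := t.base.(requestCanceler); ok {
            rc.CancelRequest(req)
            return
        }
        fmt.Printf("CancelRequest not implemented by %T\n", t.base)
    }

    func main() {
        // Wrap the default transport; client-go layers its token-source and
        // debugging wrappers over a base transport in a similar way.
        client := &http.Client{Transport: &cancelableTransport{base: http.DefaultTransport}}
        _ = client
    }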
An OCP 4.2 Deferred Bugs Review was conducted by PE and OK. Comment for this bug: without a working must-gather, it would be hard for Engineering to debug a customer cluster once it hits serious issues.
Created a Jira RFE; closing.