Bug 1456380

Summary: Provide automated tools for collecting troubleshooting data
Product: OpenShift Container Platform
Reporter: Marko Myllynen <myllynen>
Component: RFE
Assignee: Brenton Leanhardt <bleanhar>
Status: CLOSED NEXTRELEASE
QA Contact: Xiaoli Tian <xtian>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 3.5.0
CC: aos-bugs, bbreard, bleanhar, erich, jokerman, kdube, mmccomas, pdwyer, pep
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-04-18 18:19:00 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Marko Myllynen 2017-05-29 09:17:22 UTC
Description of problem:
There is a need for automated and straightforward tools that allow non-experts to collect the information that domain experts and/or Red Hat Support need to troubleshoot an OpenShift installation. Some of the OpenShift components are very loosely coupled, so collecting "everything" for every issue would be overkill.

For example, an issue could roughly be identified as something like "no new logs are written" or "a node is not part of SDN" or "all metrics are flat" or "internal registry unreachable" or "no new pods are started anywhere", and non-experts would then need to be able to collect the related data so that others can analyze it further.

There are some documents (existing or being written) and some random scripts floating around here and there to collect data, but no consistent and supported tools exist for this. Also, reading a document which has at least a dozen steps in the middle of a crisis is not ideal; a simple tool would be needed, perhaps something akin to:

# ocp-collect-info -n logging

or

# ocp-collect-info -n sdn

Such a command should be available to do The Right Thing (tm) for initial data collection; later steps could be more involved. A sunny-day vision might also include some self-healing tools, but in this RFE let us concentrate on the automated tools for collecting troubleshooting data.
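To make the idea a bit more concrete, here is a minimal sketch of how such a front end could map a problem area to a fixed set of read-only commands. Everything in it is an assumption for illustration only: ocp-collect-info does not exist, the COLLECTORS table and the plan_collection helper are hypothetical names, and the oc commands listed per area (as well as the "logging" namespace) are just plausible starting points, not a vetted or supported list. The sketch only prints the commands it would run; it does not execute anything.

```python
import argparse
import shlex

# Hypothetical mapping from a problem area to read-only `oc` commands.
# The areas, the namespace name and the command selection are assumptions
# made for illustration, not an agreed or supported list.
COLLECTORS = {
    "logging": [
        ["oc", "get", "pods", "-n", "logging", "-o", "wide"],
        ["oc", "get", "events", "-n", "logging"],
    ],
    "sdn": [
        ["oc", "get", "nodes", "-o", "wide"],
        ["oc", "get", "hostsubnets"],
        ["oc", "get", "netnamespaces"],
    ],
}


def plan_collection(area):
    """Return a copy of the command list for one problem area."""
    if area not in COLLECTORS:
        raise ValueError("unknown area: %s" % area)
    return [list(cmd) for cmd in COLLECTORS[area]]


def main(argv=None):
    """Parse '-n AREA' and print the commands that would be collected."""
    parser = argparse.ArgumentParser(prog="ocp-collect-info")
    parser.add_argument("-n", dest="area", required=True,
                        choices=sorted(COLLECTORS))
    args = parser.parse_args(argv)
    for cmd in plan_collection(args.area):
        # Dry run only: show each command, safely quoted for a shell.
        print(" ".join(shlex.quote(part) for part in cmd))
    return 0
```

Calling main(["-n", "sdn"]) would list the SDN-related commands. Keeping the per-area command lists in one table is only one possible design; it would let each component team maintain its own entry without touching the rest of the tool.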

Some references to prior art (thanks to Eric Rich for pointing most of these out):

* https://access.redhat.com/search/#/?q=openshift
* https://bugzilla.redhat.com/show_bug.cgi?id=1259118
* https://access.redhat.com/articles/2913431
* https://github.com/openshift/origin-aggregated-logging/pull/365
* https://github.com/openshift/openshift-sdn/blob/master/hack/debug.sh