Bug 1456380 - Provide automated tools for collecting troubleshooting data
Summary: Provide automated tools for collecting troubleshooting data
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RFE
Version: 3.5.0
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Assignee: Brenton Leanhardt
QA Contact: Xiaoli Tian
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2017-05-29 09:17 UTC by Marko Myllynen
Modified: 2019-10-23 11:54 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-04-18 18:19:00 UTC
Target Upstream Version:
Embargoed:



Description Marko Myllynen 2017-05-29 09:17:22 UTC
Description of problem:
There is a need for automated, straightforward tools that allow laymen to collect the information that domain experts and/or Red Hat Support need to troubleshoot an OpenShift installation. Some of the OpenShift components are very loosely coupled, so collecting "everything" for every issue would be overkill.

For example, an issue could be roughly identified as "no new logs are written", "a node is not part of SDN", "all metrics are flat", "internal registry unreachable", or "no new pods are started anywhere", and non-experts would then need to be able to collect the related data for others to analyze further.

There are some documents (existing or being written) and some scripts floating around for collecting such data, but no consistent and supported tools exist for this. Also, reading a document with at least a dozen steps in the middle of a crisis is not ideal. A simple tool is needed, perhaps something akin to:

# ocp-collect-info -n logging

or

# ocp-collect-info -n sdn

Such a command should be available to do The Right Thing (tm) for initial data collection; later steps could be more involved. A sunny day vision might also include some self-healing tools, but in this RFE let us concentrate on the automated tools for collecting troubleshooting data.
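
To make the idea a bit more concrete, here is a minimal sketch of how such a command might map a problem area to a short list of read-only commands. Everything in it is an assumption for illustration: `ocp-collect-info` does not exist, `-n` is taken to name the problem area, and the `COLLECTORS` table, the `plan` and `main` functions, and the specific `oc` invocations chosen for each area are hypothetical, not a supported collection profile. The sketch only prints the plan and runs nothing against a cluster.

```python
import argparse
import shlex

# Hypothetical problem areas and the read-only commands to run for each.
# The area names come from the examples in this report; the command
# lists are illustrative assumptions, not a supported collection profile.
COLLECTORS = {
    "logging": [
        ["oc", "get", "pods", "-n", "logging", "-o", "wide"],
        ["oc", "get", "events", "-n", "logging"],
    ],
    "sdn": [
        ["oc", "get", "nodes", "-o", "wide"],
        ["oc", "get", "hostsubnets"],
    ],
}


def plan(area):
    """Return the shell-quoted commands for one problem area."""
    if area not in COLLECTORS:
        known = ", ".join(sorted(COLLECTORS))
        raise ValueError(f"unknown area {area!r}; known areas: {known}")
    return [shlex.join(cmd) for cmd in COLLECTORS[area]]


def main(argv=None):
    """Print the collection plan; nothing is executed against the cluster."""
    parser = argparse.ArgumentParser(prog="ocp-collect-info")
    parser.add_argument("-n", dest="area", required=True,
                        choices=sorted(COLLECTORS),
                        help="problem area to collect data for")
    args = parser.parse_args(argv)
    for line in plan(args.area):
        print(line)
    return 0
```

Calling `main(["-n", "sdn"])` prints the planned commands for the SDN area. Keeping the per-area command lists in one table is a design choice that would let support staff extend the tool without touching the argument handling; a real tool would also need to execute the commands and bundle their output.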

Some references to prior art (thanks to Eric Rich for pointing most of these out):

* https://access.redhat.com/search/#/?q=openshift
* https://bugzilla.redhat.com/show_bug.cgi?id=1259118
* https://access.redhat.com/articles/2913431
* https://github.com/openshift/origin-aggregated-logging/pull/365
* https://github.com/openshift/openshift-sdn/blob/master/hack/debug.sh

