Bug 1741901

Summary: OLM data failed to be gathered by the `openshift-must-gather` tool
Product: OpenShift Container Platform
Reporter: Jian Zhang <jiazha>
Component: oc
Assignee: Luis Sanchez <sanchezl>
Status: CLOSED WONTFIX
QA Contact: Jian Zhang <jiazha>
Severity: medium
Priority: medium
Version: 4.2.0
CC: aos-bugs, bandrade, chezhang, chuo, deads, jfan, jokerman, mfojtik, nagrawal, nhale, scolange, xjiang
Target Milestone: ---   
Target Release: 4.4.0   
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2019-11-06 20:27:19 UTC
Type: Bug

Description Jian Zhang 2019-08-16 11:54:40 UTC
Description of problem:
Got the errors below when gathering OLM data via the openshift-must-gather tool:
Get https://localhost:37587/metrics: http: server gave HTTP response to HTTPS client
Also, there is no relevant ClusterOperator 'operator-lifecycle-manager-packageserver' object definition in https://github.com/operator-framework/operator-lifecycle-manager/blob/master/pkg/lib/operatorstatus/status.go#L247

Version-Release number of selected component (if applicable):
Cluster version is 4.2.0-0.nightly-2019-08-15-232721
mac:~ jianzhang$ oc exec catalog-operator-7686cdfb56-m7rwr -- olm --version
OLM version: 0.11.0
git commit: 586e941bd1f42ea1f331453ed431fb43699fef70

How reproducible:
always

Steps to Reproduce:
1. Install the OCP 4.2 cluster.
2. Gather the ClusterOperator OLM data via the `oc adm must-gather co/operator-lifecycle-manager` command.
This failed with timeout errors; see bug https://bugzilla.redhat.com/show_bug.cgi?id=1724321
Then build and use the `openshift-must-gather` binary to gather the OLM data instead:
1) $ git clone git:openshift/must-gather.git
2) $ make
mac:must-gather jianzhang$ pwd
/Users/jianzhang/goproject/src/github.com/openshift/must-gather

Actual results:
Failed to gather OLM data. Details:
mac:~ jianzhang$ openshift-must-gather inspect clusteroperator/operator-lifecycle-manager 
2019/08/16 17:26:25 Gathering config.openshift.io resource data...
2019/08/16 17:26:30 Gathering kubeapiserver.operator.openshift.io resource data...
2019/08/16 17:26:30 Gathering cluster operator resource data...
2019/08/16 17:26:30     Gathering related object reference information for ClusterOperator "operator-lifecycle-manager"...
2019/08/16 17:26:30     Found related object "OperatorGroup.operators.coreos.com" for ClusterOperator "operator-lifecycle-manager"...
2019/08/16 17:26:30     Found related object "ClusterServiceVersion.operators.coreos.com" for ClusterOperator "operator-lifecycle-manager"...
2019/08/16 17:26:30     Found related object "namespaces/openshift-operator-lifecycle-manager" for ClusterOperator "operator-lifecycle-manager"...
2019/08/16 17:27:13 Gathering data for ns/openshift-operator-lifecycle-manager...
2019/08/16 17:27:13     Collecting resources for namespace "openshift-operator-lifecycle-manager"...
2019/08/16 17:27:13     Gathering pod data for namespace "openshift-operator-lifecycle-manager"...
2019/08/16 17:27:13         Gathering data for pod "catalog-operator-7686cdfb56-m7rwr"
2019/08/16 17:27:14         Unable to gather previous container logs: previous terminated container "catalog-operator" in pod "catalog-operator-7686cdfb56-m7rwr" not found
2019/08/16 17:27:20         Gathering data for pod "olm-operator-5c9bc6657f-gk2vd"
2019/08/16 17:27:28         Unable to gather previous container logs: previous terminated container "olm-operator" in pod "olm-operator-5c9bc6657f-gk2vd" not found
2019/08/16 17:27:33         Gathering data for pod "packageserver-55ddc7bb8f-g7s94"
2019/08/16 17:27:37         Unable to gather previous container logs: previous terminated container "packageserver" in pod "packageserver-55ddc7bb8f-g7s94" not found
2019/08/16 17:27:57         Gathering data for pod "packageserver-55ddc7bb8f-vcjqd"
2019/08/16 17:28:00         Unable to gather previous container logs: previous terminated container "packageserver" in pod "packageserver-55ddc7bb8f-vcjqd" not found
Error: one or more errors ocurred while gathering pod-specific data for namespace: openshift-operator-lifecycle-manager

    [one or more errors ocurred while gathering container data for pod catalog-operator-7686cdfb56-m7rwr:

    [unable to gather container /healthz: Get https://localhost:37587/: http: server gave HTTP response to HTTPS client, unable to gather container /version: Get https://localhost:37587/: http: server gave HTTP response to HTTPS client, unable to gather container /metrics: Get https://localhost:37587/metrics: http: server gave HTTP response to HTTPS client], one or more errors ocurred while gathering container data for pod olm-operator-5c9bc6657f-gk2vd:

    [unable to gather container /healthz: Get https://localhost:37587/: http: server gave HTTP response to HTTPS client, unable to gather container /version: Get https://localhost:37587/: http: server gave HTTP response to HTTPS client, unable to gather container /metrics: Get https://localhost:37587/metrics: http: server gave HTTP response to HTTPS client]]
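Editor's note: the error text in the log above is what Go's net/http client reports when it starts a TLS handshake against a port that answers in plain HTTP. The following is a minimal sketch that reproduces the message, assuming a throwaway local test server in place of the forwarded pod port. The names `probeHTTPS` and `reproduceMismatch` are hypothetical and are not part of must-gather.

```go
package main

import (
	"crypto/tls"
	"net/http"
	"net/http/httptest"
	"strings"
	"time"
)

// probeHTTPS issues a GET over TLS to hostPort+path and returns the client
// error, if any. Hypothetical helper for illustration; not must-gather code.
func probeHTTPS(hostPort, path string) error {
	client := &http.Client{
		Timeout: 5 * time.Second,
		Transport: &http.Transport{
			// Certificate checks are skipped: only the protocol mismatch matters here.
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		},
	}
	resp, err := client.Get("https://" + hostPort + path)
	if err != nil {
		return err
	}
	resp.Body.Close()
	return nil
}

// reproduceMismatch starts a plain-HTTP test server (an assumption standing in
// for the pod's first port), probes it over HTTPS, and reports whether the
// resulting error matches the one shown in the log.
func reproduceMismatch() bool {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	}))
	defer srv.Close()

	err := probeHTTPS(strings.TrimPrefix(srv.URL, "http://"), "/metrics")
	return err != nil &&
		strings.Contains(err.Error(), "server gave HTTP response to HTTPS client")
}
```

Calling `reproduceMismatch()` from a `main` function or a test is enough to exercise the sketch; the request itself never reaches the handler because the TLS handshake fails first.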

Expected results:
OLM data can be gathered successfully by this 'openshift-must-gather' tool. 


Additional info:

Comment 1 Nick Hale 2019-08-19 14:14:26 UTC
OLM serves its metrics from the second port listed, and it looks like must-gather only attempts to gather metrics (or any endpoint data, for that matter) from the first port found on a given pod (https://github.com/openshift/must-gather/blob/9646f4bf643ad8c4201ac8dd7073e7294c90df20/pkg/cmd/inspect/pod.go#L75). Moreover, I can't find this expectation documented in the must-gather repo. Having more than one exposed port is common, so it seems that either documenting the expectation or making must-gather perform a "search" of the available ports would be a reasonable follow-up from that side.

For now, I can see if it makes sense for us to consolidate our health/metrics under one port.
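Editor's note: the "search" of available ports suggested above could look roughly like the sketch below. It is not the must-gather implementation. It assumes the candidate ports are already reachable as `host:port` strings (for example through port-forwarding), and because the comment does not say which scheme each port speaks, it tries HTTPS first and then plain HTTP on each one. `fetchFromAnyPort` and `demoPortSearch` are hypothetical names, and the two local test servers merely stand in for a health port and a metrics port.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
	"time"
)

// fetchFromAnyPort tries each candidate address in order, first over HTTPS and
// then over plain HTTP, and returns the URL and body of the first 200 response
// for path. Hypothetical helper; not part of must-gather.
func fetchFromAnyPort(hostPorts []string, path string) (string, []byte, error) {
	client := &http.Client{
		Timeout: 5 * time.Second,
		Transport: &http.Transport{
			// Assumption: pod-local endpoints use certificates we cannot verify.
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		},
	}
	lastErr := fmt.Errorf("no candidate ports given")
	for _, hp := range hostPorts {
		for _, scheme := range []string{"https", "http"} {
			u := scheme + "://" + hp + path
			resp, err := client.Get(u)
			if err != nil {
				lastErr = err // e.g. "server gave HTTP response to HTTPS client"
				continue
			}
			data, readErr := io.ReadAll(resp.Body)
			resp.Body.Close()
			if readErr != nil {
				lastErr = readErr
				continue
			}
			if resp.StatusCode != http.StatusOK {
				lastErr = fmt.Errorf("%s: unexpected status %s", u, resp.Status)
				continue
			}
			return u, data, nil
		}
	}
	return "", nil, fmt.Errorf("no port served %s: %w", path, lastErr)
}

// demoPortSearch models a pod whose first port has no /metrics handler and
// whose second port serves it, then runs the search across both.
func demoPortSearch() (string, error) {
	first := httptest.NewServer(http.NotFoundHandler())
	defer first.Close()

	mux := http.NewServeMux()
	mux.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "up 1\n")
	})
	second := httptest.NewServer(mux)
	defer second.Close()

	ports := []string{
		strings.TrimPrefix(first.URL, "http://"),
		strings.TrimPrefix(second.URL, "http://"),
	}
	_, body, err := fetchFromAnyPort(ports, "/metrics")
	return string(body), err
}
```

Trying every port costs a few extra requests per container, which is the trade-off against the existing first-port heuristic; the alternative mentioned above, consolidating health and metrics on one port, needs no change on the must-gather side.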

Comment 4 David Eads 2019-08-26 12:41:59 UTC
It's a heuristic. Does it actually fail or just give you a list of "couldn't collect"?  It's a heuristic that we aren't likely to twiddle much since it mostly works.  You can conform so that your stuff gets collected if you want it to be.  For certain messages, I'm happy to drop them to a V(1), so you don't get concerned.

Not a 4.2 blocker.

Comment 5 Maciej Szulik 2019-09-25 08:37:49 UTC
*** Bug 1755111 has been marked as a duplicate of this bug. ***

Comment 6 Michal Fojtik 2019-11-06 20:27:19 UTC
This was rejected twice, closing.

Comment 7 Jian Zhang 2019-11-18 03:12:22 UTC
Hi, David

Sorry for the late reply, and thank you for your explanation. However, since OLM is a default component, its metrics should be gathered.
Do you mean that the `openshift-must-gather` tool will not be shipped to customers?

Comment 8 Red Hat Bugzilla 2023-09-14 05:41:43 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.