Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1720308

Summary: Unable to join cluster version upgrade info in promql for monitoring dashboards of upgrades
Product: OpenShift Container Platform Reporter: Clayton Coleman <ccoleman>
Component: Cluster Version Operator Assignee: Abhinav Dahiya <adahiya>
Status: CLOSED ERRATA QA Contact: Junqi Zhao <juzhao>
Severity: high Docs Contact:
Priority: unspecified    
Version: 4.1.0 CC: aos-bugs, gblomqui, jokerman, juzhao, mmccomas, tnozicka, xtian, xxia
Target Milestone: ---   
Target Release: 4.1.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: 4.1.4
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1713207 Environment:
Last Closed: 2019-07-04 09:01:40 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Clayton Coleman 2019-06-13 15:59:29 UTC
A key goal of the Prometheus metrics returned through telemetry is to determine which upgrades are actually broken.  We currently return cluster_version with multiple types that all use the same label "version" (so "updating" is the TO, "current" is the FROM).  However, significant limitations in PromQL prevent Prometheus or Grafana from easily querying that data and joining it to other series: you can't take a query for a list of all degraded operators and determine whether they are failing during an upgrade.

We need to work within the bounds of what PromQL allows and ensure the cluster_version{type="updating"} metric is useful for attaching to other queries, since this is the encouraged way of joining in Prometheus.

To do that, we need to have a metric that contains both from and to versions and the _id of the cluster.

The simplest option is to add a "from_version" label to all "cluster_version" series and set it appropriately (or leave it empty) for each type, for instance:

updating - from_version is the last completed
current - (which is the current completed) from_version is the previous completed, if any
cluster - from_version is empty (cluster should mean "initial")
failure - from_version is the last completed
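
The rules above can be sketched as a small function. This is only an illustration of the proposed label values, not the operator's actual code: the function name from_version_label and the newest-first list of completed versions are assumptions made for the example.

```python
from typing import List


def from_version_label(metric_type: str, completed: List[str]) -> str:
    """Return the proposed from_version label for a cluster_version series.

    `completed` is a hypothetical newest-first list of versions the cluster
    has finished installing or upgrading to.
    """
    if metric_type in ("updating", "failure"):
        # Both types report against the last completed version.
        return completed[0] if completed else ""
    if metric_type == "current":
        # The series itself carries the current completed version, so
        # from_version is the one completed before it, if any.
        return completed[1] if len(completed) > 1 else ""
    if metric_type == "cluster":
        # "cluster" means the initial version, so there is nothing to come from.
        return ""
    raise ValueError(f"unknown cluster_version type: {metric_type!r}")


history = ["4.1.2", "4.1.0"]  # example data, newest first
labels = {t: from_version_label(t, history)
          for t in ("updating", "current", "cluster", "failure")}
```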

This would allow us to join queries that have "_id" in telemetry (all of them) with group_left(version,from_version) and get a result.
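
To show what that join produces, here is a minimal Python simulation of many-to-one matching with on(_id) group_left(version, from_version). It only mimics the label handling and ignores sample values; the helper name join_group_left, the "name" label, and the sample data are made up for the example.

```python
from typing import Dict, List

Series = Dict[str, str]  # one series, represented by its label set only


def join_group_left(left: List[Series], right: List[Series],
                    on: str, copy_labels: List[str]) -> List[Series]:
    """Mimic `left * on(<on>) group_left(<copy_labels>) right` for labels."""
    index: Dict[str, Series] = {}
    for series in right:
        key = series[on]
        if key in index:
            # The "one" side must be unique per join key, as in PromQL.
            raise ValueError(f"duplicate series on the right side for {on}={key}")
        index[key] = series

    result = []
    for series in left:
        match = index.get(series[on])
        if match is None:
            continue  # left series without a match are dropped
        merged = dict(series)
        for label in copy_labels:
            merged[label] = match.get(label, "")
        result.append(merged)
    return result


# Hypothetical inputs: degraded operators, and cluster_version{type="updating"}.
degraded = [
    {"_id": "cluster-a", "name": "ingress"},
    {"_id": "cluster-b", "name": "monitoring"},
]
updating = [
    {"_id": "cluster-a", "type": "updating",
     "version": "4.1.4", "from_version": "4.1.2"},
]

joined = join_group_left(degraded, updating, "_id", ["version", "from_version"])
```

Only cluster-a survives the join, because cluster-b has no "updating" series. Each result row carries the operator name together with the versions the cluster is upgrading from and to, which is what the dashboard needs.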

Without a fix here, we can't effectively build a dashboard of "cluster operators that are broken and the version you are upgrading from and to".

Comment 1 Clayton Coleman 2019-06-18 16:11:33 UTC
Merged to origin in https://github.com/openshift/cluster-version-operator/pull/204; will cherry-pick once the prerequisite is in.  Verified in telemetry.

Comment 2 Abhinav Dahiya 2019-06-24 20:36:29 UTC
https://github.com/openshift/cluster-version-operator/pull/208 was merged to release-4.1

Comment 6 errata-xmlrpc 2019-07-04 09:01:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1635

Comment 7 Red Hat Bugzilla 2023-09-14 05:30:19 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.