Bug 1720308 - Unable to join cluster version upgrade info in promql for monitoring dashboards of upgrades [NEEDINFO]
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.1.z
Assignee: Abhinav Dahiya
QA Contact: Junqi Zhao
URL:
Whiteboard: 4.1.4
Depends On:
Blocks:
Reported: 2019-06-13 15:59 UTC by Clayton Coleman
Modified: 2019-07-04 09:01 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1713207
Environment:
Last Closed: 2019-07-04 09:01:40 UTC
Target Upstream Version:
juzhao: needinfo? (ccoleman)




Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2019:1635 None None None 2019-07-04 09:01:47 UTC

Description Clayton Coleman 2019-06-13 15:59:29 UTC
A key goal of the Prometheus metrics returned through telemetry is to determine which upgrades are actually broken. We currently return cluster_version with multiple types that all share the same "version" label (for "updating" it is the TO version, for "current" it is the FROM version). However, significant limitations in PromQL prevent Prometheus or Grafana from easily querying that data and joining it to other series: you can't take a query for a list of all degraded operators and determine whether they are failing during an upgrade.

We need to work within the bounds of what PromQL allows and ensure the cluster_version{type="updating"} metric is useful for attaching to other queries, since this is the encouraged way of joining in Prometheus.

To do that, we need to have a metric that contains both from and to versions and the _id of the cluster.

The simplest option is to add a "from_version" label to all "cluster_version" series and either set it or leave it empty, as appropriate for each type. For instance:

updating - from_version is the last completed version
current - (the currently completed version) from_version is the previous completed version, if any
cluster - from_version is empty (cluster should mean "initial")
failure - from_version is the last completed version
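
The per-type rules above can be written down as a small lookup. This is only an illustrative sketch, not the operator's actual code: the function name from_version_for and the idea of passing an oldest-to-newest list of completed versions are assumptions made for the example, and the version strings are sample data.

```python
def from_version_for(series_type, completed):
    """Return the from_version label value for one cluster_version series.

    series_type: one of "updating", "current", "cluster", "failure".
    completed:   hypothetical oldest-to-newest list of versions that
                 finished installing on this cluster.
    """
    last = completed[-1] if completed else ""
    previous = completed[-2] if len(completed) >= 2 else ""

    if series_type in ("updating", "failure"):
        # Both are measured against the last version that completed.
        return last
    if series_type == "current":
        # "current" is itself the last completed version, so its
        # from_version is the one before it, if there is one.
        return previous
    if series_type == "cluster":
        # "cluster" means the initial version: nothing came before it.
        return ""
    raise ValueError("unknown cluster_version type: %r" % (series_type,))


# Sample data: the cluster installed 4.1.2, then completed an update to 4.1.3.
history = ["4.1.2", "4.1.3"]
labels_by_type = {
    t: from_version_for(t, history)
    for t in ("updating", "current", "cluster", "failure")
}
```

With that sample history, "updating" and "failure" would carry from_version="4.1.3", "current" would carry "4.1.2", and "cluster" would carry an empty value.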

This would allow us to join any query that carries "_id" in telemetry (all of them do) using group_left(version,from_version) and get a result.
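
To make the intended join concrete, here is a simplified Python imitation of a many-to-one PromQL match on "_id" with group_left(version, from_version). It is a sketch under assumptions: the left-hand metric name in the comment, the operator names, and the version values are hypothetical sample data, and the helper only models label matching and value multiplication, not every PromQL detail (for example, metric-name handling is ignored).

```python
# PromQL shape this sketch imitates (left-hand metric name is an assumption):
#   cluster_operator_conditions{condition="Degraded"}
#     * on(_id) group_left(version, from_version)
#   cluster_version{type="updating"}

def group_left_join(many, one, on, include):
    """Many-to-one join: each series is a (labels_dict, value) pair.

    Series on the "many" side keep their own labels and gain the
    `include` labels copied from the single matching "one"-side series.
    """
    index = {}
    for labels, value in one:
        key = tuple(labels.get(name, "") for name in on)
        if key in index:
            # The "one" side must be unique per match key.
            raise ValueError("duplicate series on the 'one' side: %r" % (key,))
        index[key] = (labels, value)

    result = []
    for labels, value in many:
        key = tuple(labels.get(name, "") for name in on)
        if key not in index:
            continue  # no partner series, so the sample is dropped
        one_labels, one_value = index[key]
        merged = dict(labels)
        for name in include:
            merged[name] = one_labels.get(name, "")
        result.append((merged, value * one_value))
    return result


# Sample data: cluster c1 is mid-upgrade, cluster c2 is not.
degraded = [
    ({"_id": "c1", "name": "ingress"}, 1.0),
    ({"_id": "c1", "name": "monitoring"}, 1.0),
    ({"_id": "c2", "name": "dns"}, 1.0),
]
updating = [
    ({"_id": "c1", "type": "updating",
      "version": "4.1.4", "from_version": "4.1.3"}, 1.0),
]

joined = group_left_join(degraded, updating,
                         on=["_id"], include=["version", "from_version"])
```

Here the two degraded operators on c1 come back annotated with both the from and to versions, while the c2 operator is dropped because that cluster has no "updating" series. Because both versions live on a single series per cluster, one group_left is enough.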

Without a fix here we can't effectively build a dashboard of "cluster operators that are broken and the version you are upgrading from and to".

Comment 1 Clayton Coleman 2019-06-18 16:11:33 UTC
Merged to origin in https://github.com/openshift/cluster-version-operator/pull/204, cherry-picking once the prerequisite is in. Verified in telemetry.

Comment 2 Abhinav Dahiya 2019-06-24 20:36:29 UTC
https://github.com/openshift/cluster-version-operator/pull/208 was merged to release-4.1

Comment 6 errata-xmlrpc 2019-07-04 09:01:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1635

