1720308 – Unable to join cluster version upgrade info in promql for monitoring dashboards of upgrades

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cluster Version Operator
Sub Component:
Version:	4.1.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	4.1.z
Assignee:	Abhinav Dahiya
QA Contact:	Junqi Zhao
Docs Contact:
URL:
Whiteboard:	4.1.4
Depends On:
Blocks:

Reported:	2019-06-13 15:59 UTC by Clayton Coleman
Modified:	2023-09-14 05:30 UTC
CC List:	8 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:	1713207
Environment:
Last Closed:	2019-07-04 09:01:40 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2019:1635	0	None	None	None	2019-07-04 09:01:47 UTC

Description Clayton Coleman 2019-06-13 15:59:29 UTC

A key goal of the Prometheus metrics returned through telemetry is to determine which upgrades are actually broken.  We currently return cluster_version with multiple types and the same label "version" (so for "updating" it is the TO version, and for "current" it is the FROM version).  However, significant limitations in PromQL prevent Prometheus or Grafana from easily querying and joining that data to other fields: you can't take a query for a list of all degraded operators and determine whether they are failing during an upgrade.

We need to work within the bounds of what PromQL allows and ensure the cluster_version{type="updating"} metric is useful for attaching to other queries, since this is the encouraged way of joining in Prometheus.

To do that, we need to have a metric that contains both from and to versions and the _id of the cluster.

The simplest option is to add a "from_version" label to all "cluster_version" series and set it appropriately (or leave it empty) for each type. For instance:

updating - from_version is the last completed version
current - (the current completed version) from_version is the previous completed version, if any
cluster - from_version is empty (cluster should mean "initial")
failure - from_version is the last completed version

This would allow us to join queries that have "_id" in telemetry (all of them) with group_left(version,from_version) and get a result.
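
A minimal sketch of such a join, assuming the proposed "from_version" label exists on cluster_version. The metric name cluster_operator_conditions and its condition="Degraded" label are assumptions used for illustration and are not taken from this report; substitute whatever series carries operator health and "_id" in your telemetry. The right-hand side is multiplied by 0 so that only its labels, not its sample value, reach the result:

```promql
# Left side: one series per degraded operator per cluster (metric name assumed).
# Right side: at most one series per cluster, carrying the TO ("version")
# and FROM ("from_version") labels of the in-progress upgrade.
(cluster_operator_conditions{condition="Degraded"} == 1)
  + on (_id) group_left (version, from_version)
(0 * cluster_version{type="updating"})
```

Each result series keeps the labels of the degraded operator and gains version and from_version from the matching cluster. Clusters with no cluster_version{type="updating"} series drop out of the result, which limits the output to operators degraded during an upgrade.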

Without a fix here we can't effectively build a dashboard of "cluster operators that are broken and the version you are upgrading from and to".

Comment 1 Clayton Coleman 2019-06-18 16:11:33 UTC

Merged to origin in https://github.com/openshift/cluster-version-operator/pull/204; cherry-picking once the prerequisite is in.  Verified in telemetry.

Comment 2 Abhinav Dahiya 2019-06-24 20:36:29 UTC

https://github.com/openshift/cluster-version-operator/pull/208 was merged to release-4.1

Comment 6 errata-xmlrpc 2019-07-04 09:01:40 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1635

Comment 7 Red Hat Bugzilla 2023-09-14 05:30:19 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days
