A key goal of the Prometheus metrics returned through telemetry is to determine which upgrades are actually broken. We currently return cluster_version with multiple types and the same label "version" (so "updating" is the TO, "current" is the FROM). However, there are significant limitations in PromQL that prevent Prometheus or Grafana from easily querying and joining that data to other fields: you can't take a query for a list of all degraded operators and determine whether they are failing during an upgrade.

We need to work within the bounds of what PromQL allows and ensure the cluster_version{type="updating"} metric is useful for attaching to other queries, since this is the encouraged way of joining in Prometheus. To do that, we need a metric that contains both the from and to versions and the _id of the cluster.

The simplest option is to add a "from_version" label to all "cluster_version" fields and set it appropriately (or leave it empty) for the different types, for instance:

  updating - from_version is the last completed
  current - (which is the current completed) from_version is the previous completed, if any
  cluster - from_version is empty (cluster should mean "initial")
  failure - from_version is the last completed

This would allow us to join queries that have "_id" in telemetry (all of them) with group_left(version,from_version) and get a result. Without a fix here we can't effectively build a dashboard of "cluster operators that are broken and the version you are upgrading from and to".
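
For illustration, the join described above could look roughly like the following PromQL sketch. It is not taken from the fix itself and rests on several assumptions: that a per-operator metric named cluster_operator_up (value 0 when an operator is not healthy) is available in telemetry with an "_id" label, that each cluster reports at most one cluster_version{type="updating"} series, and that the proposed "from_version" label exists. The "* 0 + 1" step is only there to turn the right-hand side into a constant 1, so the multiplication copies labels without changing the left-hand value.

```
# Hypothetical query: unhealthy operators, annotated with the upgrade
# they are part of. cluster_operator_up is an assumed metric name.
(cluster_operator_up == 0)
  * on (_id) group_left (version, from_version)
  (cluster_version{type="updating"} * 0 + 1)
```

Each resulting series would keep the operator's own labels and gain "version" (the TO) and "from_version" (the FROM) from the matching cluster_version series. If the left-hand metric already carries a "version" label, group_left replaces it with the value from the right-hand side. Clusters that are not currently updating have no match on the right and drop out of the result.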
Merged to origin in https://github.com/openshift/cluster-version-operator/pull/204; will cherry-pick once the prerequisite is in. Verified in telemetry.
https://github.com/openshift/cluster-version-operator/pull/208 was merged to release-4.1
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:1635
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days